AWS EC2 Cluster Running in Parallel Issues with v1612+ |
January 22, 2020, 11:05 |
AWS EC2 Cluster Running in Parallel Issues with v1612+
|
#1 |
New Member
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 12 |
Hi Community,
Reaching out because I'm having trouble getting snappyHexMesh or simpleFoam, for example, to run in the AWS cloud on EC2 c5 instances that have been clustered together. I've compiled the ESI OpenFOAM v1612+ source code on the master node, an Ubuntu 18.04 (bionic) machine. When I run this command in the terminal:
Code:
>> mpirun --hostfile machines -np 12 snappyHexMesh -parallel -overwrite | tee log/snappyHexMesh.log
snappyHexMesh appears to hang and I get the error below:
Code:
--------------------------------------------------------------------------
[[12977,1],9]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: ip-10-0-1-37

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  v1612+                                |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : v1612+
Exec   : snappyHexMesh -parallel -overwrite
Date   : Jan 22 2020
Time   : 01:24:17
Host   : "ip-10-0-1-120"
PID    : 14358
[ip-10-0-1-120:14343] 11 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[ip-10-0-1-120:14343] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I think the issue here has to do with mpirun and how AWS provides it versus how another machine might: it looks like Amazon ships its own Open MPI build. When I run the following commands I get the output below:
Code:
>> printenv | grep /opt/amazon/openmpi
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib
Code:
>> whereis mpirun
mpirun: /usr/bin/mpirun.openmpi /usr/bin/mpirun /opt/amazon/openmpi/bin/mpirun /usr/share/man/man1/mpirun.1.gz
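Incidentally, the openib warning at the top of the log is separate from any hang: it only says the InfiniBand transport is unavailable and another transport (typically TCP on EC2) is used instead. If desired, it can be silenced exactly as the message suggests (sketch only; same command as above with one extra MCA flag):
Code:
mpirun --mca btl_base_warn_component_unused 0 --hostfile machines -np 12 snappyHexMesh -parallel -overwrite | tee log/snappyHexMesh.log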
|
January 22, 2020, 18:24 |
|
#2 |
Senior Member
Joachim Herb
Join Date: Sep 2010
Posts: 650
Rep Power: 22 |
What happens if you call:
Code:
mpirun --hostfile machines -np 12 hostname
Some additional ideas: is OpenFOAM set up correctly when MPI connects to the nodes via ssh? You can test this with:
Code:
ssh slave_ip env | grep PATH
ssh slave_ip env | grep LD_LIBRARY_PATH
If not, you have to set up $HOME/.bashrc accordingly: make sure that something like source /.../OpenFOAM../etc/bashrc is at the top of the file, so it really gets executed in a non-login shell.
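A minimal sketch of what the top of $HOME/.bashrc on each node might look like (the install path is the one used later in this thread; adjust to your setup):
Code:
# ~/.bashrc -- sketch only
# Source the OpenFOAM environment BEFORE the interactive-shell guard below,
# so that non-login shells started by ssh/mpirun also get PATH and LD_LIBRARY_PATH.
source /home/ubuntu/OpenFOAM/OpenFOAM-v1612+/etc/bashrc

# The default Ubuntu ~/.bashrc already contains this guard further down
# (shown here only to illustrate why the source line must come first);
# anything placed after it is ignored for non-interactive shells:
case $- in
    *i*) ;;
      *) return;;
esac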
|
January 22, 2020, 23:40 |
|
#3 |
New Member
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 12 |
Hi jherb,
When I ran the first command I got the output below; based on it, the machines look properly defined (IP .120 is the master, .37 is slave1 and .65 is slave2).
Code:
ip-10-0-1-120:000N >> mpirun --hostfile machines -np 12 hostname
ip-10-0-1-37
ip-10-0-1-120
ip-10-0-1-120
ip-10-0-1-120
ip-10-0-1-37
ip-10-0-1-37
ip-10-0-1-65
ip-10-0-1-65
ip-10-0-1-65
ip-10-0-1-37
ip-10-0-1-120
ip-10-0-1-65
My $HOME/.bashrc sources the OpenFOAM environment near the top:
Code:
source /home/ubuntu/OpenFOAM/OpenFOAM-v1612+/etc/bashrc
Here is the output of the ssh environment checks you suggested:
Code:
ip-10-0-1-120:000N >> ssh 10.0.1.37 env | grep PATH
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/fftw-3.3.5/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/CGAL-4.9/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/boost_1_62_0/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/gperftools-2.5/lib64:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib/openmpi-system:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64GccDPInt32/lib/openmpi-system:/usr/lib/x86_64-linux-gnu/openmpi/lib:/home/ubuntu/OpenFOAM/ubuntu-v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/site/v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64GccDPInt32/lib:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib/dummy
MPI_ARCH_PATH=/usr/lib/x86_64-linux-gnu/openmpi
FFTW_ARCH_PATH=/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/fftw-3.3.5
SCOTCH_ARCH_PATH=/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64GccDPInt32/scotch_6.0.3
CGAL_ARCH_PATH=/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/CGAL-4.9
PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/gperftools-2.5/bin:/home/ubuntu/OpenFOAM/ubuntu-v1612+/platforms/linux64GccDPInt32Opt/bin:/home/ubuntu/OpenFOAM/site/v1612+/platforms/linux64GccDPInt32Opt/bin:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/bin:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/bin:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/wmake:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
BOOST_ARCH_PATH=/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/boost_1_62_0
Code:
ip-10-0-1-120:000N >> ssh 10.0.1.37 env | grep LD_LIBRARY_PATH
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/fftw-3.3.5/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/CGAL-4.9/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/boost_1_62_0/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/gperftools-2.5/lib64:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib/openmpi-system:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64GccDPInt32/lib/openmpi-system:/usr/lib/x86_64-linux-gnu/openmpi/lib:/home/ubuntu/OpenFOAM/ubuntu-v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/site/v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64GccDPInt32/lib:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib/dummy
Code:
ip-10-0-1-120:000N >> mpirun --hostfile machines -np 12 snappyHexMesh -parallel -overwrite | tee log/snappyHexMesh.log
--------------------------------------------------------------------------
[[32349,1],6]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: ip-10-0-1-37

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  v1612+                                |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : v1612+
Exec   : snappyHexMesh -parallel -overwrite
Date   : Jan 23 2020
Time   : 03:35:42
Host   : "ip-10-0-1-120"
PID    : 29939
Case   : /home/ubuntu/Projects/01_TestRuns/53StateSt/000N
nProcs : 12
Slaves : 11
(
"ip-10-0-1-120.29940"
"ip-10-0-1-120.29941"
"ip-10-0-1-120.29942"
"ip-10-0-1-37.17118"
"ip-10-0-1-37.17119"
"ip-10-0-1-37.17120"
"ip-10-0-1-37.17121"
"ip-10-0-1-65.25946"
"ip-10-0-1-65.25947"
"ip-10-0-1-65.25948"
"ip-10-0-1-65.25949"
)

Pstream initialized with:
    floatTransfer      : 0
    nProcsSimpleSum    : 0
    commsType          : nonBlocking
    polling iterations : 0
sigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).
fileModificationChecking : Monitoring run-time modified files using timeStampMaster (fileModificationSkew 10)
allowSystemOperations : Allowing user-supplied system call operations

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

Create mesh for time = 0

[4]
[4] --> FOAM FATAL ERROR:
[4] Cannot find file "points" in directory "polyMesh" in times 0 down to constant
[4]
[4]     From function Foam::word Foam::Time::findInstance(const Foam::fileName&, const Foam::word&, Foam::IOobject::readOption, const Foam::word&) const
[4]     in file db/Time/findInstance.C at line 202.
[4] FOAM parallel run exiting
[4]
[5]
[5] --> FOAM FATAL ERROR:
[5] Cannot find file "points" in directory "polyMesh" in times 0 down to constant
[5]
[5]     From function Foam::word Foam::Time::findInstance(const Foam::fileName&, const Foam::word&, Foam::IOobject::readOption, const Foam::word&) const
[5]     in file db/Time/findInstance.C at line 202.
[5] FOAM parallel run exiting
[5]
[... ranks 6, 7, 8, 9, 10 and 11 report the identical "Cannot find file "points"" error and exit ...]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[ip-10-0-1-120:29931] 11 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[ip-10-0-1-120:29931] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-10-0-1-120:29931] 7 more processes have sent help message help-mpi-api.txt / mpi-abort
When I run the same case on one node with 4 physical CPUs I don't have any of these issues with snappyHexMesh.
|
January 23, 2020, 05:16 |
|
#4 |
Senior Member
Joachim Herb
Join Date: Sep 2010
Posts: 650
Rep Power: 22 |
Do you have the processorXXX folders available on all machines? You can follow the instructions on https://cfd.direct/cloud/aws/cluster/ to set up NFS (steps 4 and 5).
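For reference, a minimal sketch of what those NFS steps amount to on this thread's setup, so that every node sees the same case and processor* folders (IP and paths are placeholders; the linked instructions remain the authoritative reference):
Code:
# on the master: export the OpenFOAM run directory
sudo apt-get install -y nfs-kernel-server
echo '/home/ubuntu/OpenFOAM *(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra
sudo service nfs-kernel-server start

# on each slave: mount the master's export at the same path
sudo apt-get install -y nfs-common
sudo mount <master-private-ip>:/home/ubuntu/OpenFOAM /home/ubuntu/OpenFOAM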
|
|
February 4, 2020, 15:53 |
|
#5 |
New Member
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 12 |
After checking that the processor folders are available on all machines and following the steps from the link provided, the model worked! Thank you so much, jherb.
|
|
March 10, 2020, 12:34 |
|
#6 |
New Member
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 12 |
Hi jherb,
I've tried to do the same thing with a new AWS EC2 c5.18xlarge cluster of 1 master and 1 slave, but now the snappyHexMesh -parallel -overwrite run hangs (no meshing is happening). When I run top on machine 1 (the master) I see 100% CPU usage on 36 CPUs, and when I run top on machine 2 (the slave) I also see 100% CPU usage on 36 CPUs. Attached is a screenshot of the message I'm getting. Please help; I've deleted my old machines that worked, and the new ones don't work and I don't know why. I'm establishing the cluster using these steps:
Code:
# Add the .ppk file in Pageant.
# Connect to the master EC2 instance using PuTTY with "Allow agent forwarding" enabled.
ssh-add -l                        # check the added private key
ssh-keygen                        # generate the id_rsa private and id_rsa.pub keys
ssh-add ~/.ssh/id_rsa             # add the id_rsa key
ssh-add -l                        # check that the private key was added
# Check Pageant to verify it shows the two keys.
ssh-copy-id ubuntu@<Private IP>   # copy id_rsa.pub to the slaves
ssh ubuntu@<Private IP>           # verify that ssh authentication works from master to slave

# Sharing the master instance volume is required just once:
sudo sh -c "echo '/home/ubuntu/OpenFOAM *(rw,sync,no_subtree_check)' >> /etc/exports"
sudo exportfs -ra
sudo service nfs-kernel-server start

# Mounting the master volume is required on each slave (run non-interactively over SSH,
# e.g. from a shell script executed in the terminal):
SPIPS="XX.X.X.XX YY.Y.Y.YY ZZ.Z.Z.ZZ"
for IP in $SPIPS ; do ssh $IP 'rm -rf ${HOME}/OpenFOAM/*' ; done
for IP in $SPIPS ; do ssh $IP 'sudo mount MM.M.M.MM:${HOME}/OpenFOAM ${HOME}/OpenFOAM' ; done
for IP in $SPIPS ; do ssh $IP 'ls ${HOME}/OpenFOAM' ; done
-BA
Last edited by bassaad17; March 12, 2020 at 10:11. Reason: added ssh master slave steps
|
March 12, 2020, 18:07 |
|
#7 |
Senior Member
Joachim Herb
Join Date: Sep 2010
Posts: 650
Rep Power: 22 |
Have you set up the hostfile correctly? Does a simple test like this work?
Code:
mpirun -np 72 hostname
|
March 16, 2020, 17:32 |
|
#8 |
New Member
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 12 |
Hi jherb,
I downsized my 2 instances to c5.4xlarge (8 CPUs each) and ran the line you provided in the PuTTY terminal. This is the message I received (see image attached). I also tried running the normal command to mesh:
Code:
mpirun --hostfile machines -np 16 snappyHexMesh -parallel -overwrite | tee log/snappyHexMesh.log
My machines hostfile contains:
Code:
MM.M.N.167 cpu=8
SS.S.S.138 cpu=8
-BA
|
March 16, 2020, 17:47 |
|
#9 |
Senior Member
Joachim Herb
Join Date: Sep 2010
Posts: 650
Rep Power: 22 |
Do you have a hostfile? It should look like this:
Code:
first.private.ip.address slots=8
second.private.ip.address slots=8
Then run:
Code:
mpirun --hostfile my_hostfile -np 16 hostname
(Your first private IP seems to be 10.0.2.167.) You can check the EC2 management console for the private IPs (perhaps you have to change its properties to make them visible).
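If the AWS CLI happens to be installed on the master, the private IPs can also be listed from the command line (a sketch only; assumes credentials or an instance role are already configured):
Code:
aws ec2 describe-instances --query 'Reservations[].Instances[].[InstanceId,PrivateIpAddress]' --output table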
|
March 16, 2020, 18:02 |
|
#10 |
New Member
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 12 |
I followed your suggestion, and when I type the command you provided it does list the 8 CPUs for the master and the 8 CPUs for the slave.
|
|
March 17, 2020, 08:20 |
|
#11 |
Senior Member
Joachim Herb
Join Date: Sep 2010
Posts: 650
Rep Power: 22 |
Okay. Now you could also start snappyHexMesh or an OpenFOAM solver this way. E.g.:
Code:
mpirun -np 72 --hostfile my_hostfile snappyHexMesh -parallel
If the output should be redirected into a file, then:
Code:
mpirun -np 72 --hostfile my_hostfile snappyHexMesh -parallel >log.snappyHexMesh 2>&1
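For context, the usual end-to-end sequence from the case directory on the shared volume would look roughly like this (a sketch only; the hostfile name and process count are taken from earlier posts and are not prescriptive):
Code:
decomposePar -force                         # split the case into processor* directories
mpirun -np 16 --hostfile my_hostfile snappyHexMesh -parallel -overwrite > log.snappyHexMesh 2>&1
mpirun -np 16 --hostfile my_hostfile simpleFoam -parallel > log.simpleFoam 2>&1
reconstructParMesh -constant                # rebuild the mesh (reconstructPar for the solver fields)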
|
March 17, 2020, 09:30 |
|
#12 |
New Member
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 12 |
Hi jherb,
The same issue occurs: the snappyHexMesh command you gave 'hangs' during the parallel run across master and slave in my example v1612+ case.
Code:
mpirun -np 16 --hostfile machines snappyHexMesh -parallel -overwrite > log.snappyHexMesh 2>&1
I already tried running SHM on a single processor on the master machine and it works just fine. Not sure why the parallel run between the master and slave nodes keeps hanging.
|
March 17, 2020, 14:45 |
|
#13 |
Senior Member
Joachim Herb
Join Date: Sep 2010
Posts: 650
Rep Power: 22 |
What is the output of snappyHexMesh, i.e. the content of log.snappyHexMesh?
|
|
March 17, 2020, 18:13 |
|
#14 |
New Member
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 12 |
The output log of SHM is attached as a screenshot.
The master (private IP .167) and the slave (private IP .138) showed 100% CPU usage on all 8 CPUs of each instance during the SHM command, but no actual meshing iterations occurred.
|
March 17, 2020, 18:46 |
|
#15 |
Senior Member
Joachim Herb
Join Date: Sep 2010
Posts: 650
Rep Power: 22 |
Is this the whole output? Then something is going wrong with the communication.
Again, have you checked any of the solvers? E.g. just try:
Code:
mpirun -np 16 simpleFoam -parallel
Does it complain that there are no case files? Are you using the correct MPI version, i.e. the one your OpenFOAM installation was built against? Can you check your system setup with some of the "normal" Open MPI examples/tutorials by compiling them yourself? I am using the Ubuntu packages of the OpenFOAM Foundation on the Ubuntu AMI provided by Amazon. To use Amazon's special Open MPI version that supports the EFA interface, I had to recompile the Pstream part of OpenFOAM.
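One way to look into the MPI-version question (a sketch only, assuming the OpenFOAM v1612+ environment from this thread is sourced):
Code:
echo "WM_MPLIB = $WM_MPLIB"   # MPI flavour OpenFOAM was configured for, e.g. SYSTEMOPENMPI
echo "FOAM_MPI = $FOAM_MPI"   # MPI library name used for the Pstream build, e.g. openmpi-system
which mpirun                  # /usr/bin/mpirun vs /opt/amazon/openmpi/bin/mpirun
mpirun --version              # Open MPI version that will actually launch the job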
|
March 31, 2020, 12:13 |
|
#16 |
New Member
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 12 |
Hi jherb,
Yes, this is the whole output; SHM hangs when running it across the master & slave instances. When I run SHM in parallel on the master only with 4 CPUs it works fine. How can I check that I'm using the correct MPI version for the OF v1612+ I'm using? Do you think recompiling Pstream will help? What I find weird is that it worked a few months ago, and now it doesn't, whatever I do.
|
April 15, 2020, 18:13 |
|
#17 |
Senior Member
Joachim Herb
Join Date: Sep 2010
Posts: 650
Rep Power: 22 |
If you have not yet solved the problem, another idea: Add the following to the $HOME/.ssh/config file:
Code:
Host *
    StrictHostKeyChecking no
And add the prefix option to the mpirun command. First check which mpirun is used:
Code:
which mpirun
Then call mpirun with this additional option:
Code:
mpirun -prefix /opt/amazon/openmpi
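Putting it together, the call might then look like this (a sketch; the hostfile name and process count are taken from earlier posts):
Code:
/opt/amazon/openmpi/bin/mpirun --prefix /opt/amazon/openmpi --hostfile machines -np 16 snappyHexMesh -parallel -overwrite > log.snappyHexMesh 2>&1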
|
Tags |
aws, cluster, ec2, openfoam, v1612+ |