Almost have my cluster running openfoam, but not quite... |
|
March 23, 2010, 17:14 |
Almost have my cluster running openfoam, but not quite...
|
#1 |
Member
|
I present two "successful cases" to describe where I am. Given these, does anybody spot the error in Case 3?
Case 1: "Hello world-ish" on a 6-node by 2-CPU cluster. ~/.bashrc does not yet source /root/OpenFOAM/OpenFOAM-1.6/etc/bashrc.

Code:
host1:~ # /usr/lib64/mpi/gcc/openmpi/bin/mpirun --mca btl openib,self -machinefile list.txt -np 12 test/comm_size_with_id.out
Process 1 on host2 out of 12
Process 7 on host2 out of 12
Process 2 on host3 out of 12
Process 4 on host5 out of 12
Process 8 on host3 out of 12
Process 10 on host5 out of 12
Process 3 on host4 out of 12
Process 9 on host4 out of 12
Process 5 on host6 out of 12
Process 11 on host6 out of 12
Process 6 on host1 out of 12
Process 0 on host1 out of 12

Case 2: As close as I can get to OpenFOAM working in parallel. ~/.bashrc includes the OpenFOAM-specific environment variables set by /root/OpenFOAM/OpenFOAM-1.6/etc/bashrc.

Code:
host1:~ # which mpirun
/root/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/bin/mpirun
host1:~ # mpirun -np 12 simpleFoam -parallel -case inletProfile/
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.6                                   |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 1.6-f802ff2d6c5a
Exec   : simpleFoam -parallel -case inletProfile/
Date   : Mar 23 2010
Time   : 12:43:31
Host   : host1
PID    : 25231
Case   : ./inletProfile
nProcs : 12
Slaves : 11
(
host1.25232
host1.25233
host1.25234
host1.25235
host1.25236
host1.25237
host1.25238
host1.25239
host1.25240
host1.25241
host1.25242
)

It goes on to work correctly, but only on one machine with 12 processes.

Case 3: Case 2, but including "-machinefile list.txt":

Code:
host1:~ # mpirun -np 12 -machinefile list.txt simpleFoam -parallel -case inletProfile/
zsh:1: command not found: orted
A daemon (pid 25297) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.

mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
zsh:1: command not found: orted
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to the
"orte-clean" tool for assistance.
        192.168.1.102 - daemon did not report back when launched
        192.168.1.103 - daemon did not report back when launched
        192.168.1.104 - daemon did not report back when launched
        192.168.1.105 - daemon did not report back when launched
        192.168.1.106 - daemon did not report back when launched
zsh:1: command not found: orted
host1:~ #
zsh:1: command not found: orted
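For context, the list.txt passed to -machinefile is just a plain-text host list. A minimal sketch of what it might contain for this 6-node, 2-CPU-per-node cluster, using the hostnames from the Case 1 output (the slots counts are an assumption based on the 2 CPUs per node; the file path here is illustrative):

```shell
# Hypothetical machinefile; "slots" caps the processes launched per host.
cat > /tmp/list.txt <<'EOF'
host1 slots=2
host2 slots=2
host3 slots=2
host4 slots=2
host5 slots=2
host6 slots=2
EOF
# Sanity check: total slots should match -np 12.
awk -F'slots=' '{n += $2} END {print n}' /tmp/list.txt
# prints 12
```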
|
March 23, 2010, 17:33 |
|
#2 |
Member
|
Getting closer, per...
http://www.open-mpi.org/faq/?categor...mpilers-static

So I can run interactively, but not in a non-interactive login:

Code:
host1:~ # ssh host5 $HOME/sum_serial.out
/root/sum_serial.out: error while loading shared libraries: libmpi_cxx.so.0: cannot open shared object file: No such file or directory
host1:~ #
|
March 23, 2010, 17:41 |
|
#3 |
Member
|
The solution is to change the shell back to bash and copy .bashrc to every machine so that it is sourced when you log in. You can change the shell by running:

Code:
chsh -s /bin/bash root
chsh -s /bin/bash admin

and then copy .bashrc to /root on every node (assuming you run mpirun as root).

Last edited by bjr; March 23, 2010 at 19:01.
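That copy step can be scripted. A dry-run sketch (echo prints each command instead of running it; drop the echo to actually copy), assuming root SSH access and the host2..host6 names from earlier in the thread:

```shell
# Dry run: push the login .bashrc out to every compute node.
for h in host2 host3 host4 host5 host6; do
  echo scp ~/.bashrc "root@$h:/root/.bashrc"
done
```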
|
March 23, 2010, 19:05 |
|
#4 |
Member
|
This works...
Code:
mpirun -np 12 -machinefile list.txt simpleFoam -parallel -case inletProfile/ > log.simpleFoam.Parallelopenib &

But this (with the '--mca btl openib,self') doesn't...

Code:
mpirun --mca btl openib,self -np 12 -machinefile list.txt simpleFoam -parallel -case inletProfileMonday/ > log.simpleFoam.Parallelopenib &

Do you know of a way to tell for sure whether my simulation is using InfiniBand? I was thinking of pulling some Ethernet cables and seeing what happens, as a brute-force approach. It's configured in such a way that I can picture it being able to use either one. The error message is this, a bunch of times:

Code:
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host:      host3
Framework: btl
Component: openib
--------------------------------------------------------------------------
[host3:08394] mca: base: components_open: component pml / csum open function failed
--------------------------------------------------------------------------
|
March 23, 2010, 19:06 |
|
#5 |
Member
|
Turns out that it was using the ethernet side without these flags.
|
|
March 24, 2010, 11:35 |
Looks like we have the same problem...
|
#6 |
Member
|
March 24, 2010, 11:37 |
|
#7 |
New Member
Jeff Squyres
Join Date: Mar 2009
Posts: 6
Rep Power: 17 |
Quote:

There is a difference between

Code:
/path/to/mpirun ...

and

Code:
mpirun ...

In the former case, Open MPI will add itself to PATH and LD_LIBRARY_PATH on all the remote nodes (regardless of what you do in your .bashrc). In the latter case, it will not (meaning: you probably should have added Open MPI to your PATH / LD_LIBRARY_PATH in your .bashrc).

Note that the "/path/to/mpirun ..." form is a shortcut for the --prefix command line option to mpirun. See Open MPI's mpirun(1) man page for details.

Quote:

If you had used the /path/to/mpirun... form, this case probably would have worked. Or you could propagate your .bashrc out to all nodes so that all nodes can find Open MPI's executables and libraries, and that should work too (which, by a later post, I think you did -- but I wanted to explain so that you knew *why* it worked). Make sense?
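Concretely, the .bashrc addition described above might look like the following (the install path comes from earlier in the thread; the MPI_HOME variable name is just for illustration):

```shell
# Add Open MPI's bin/ and lib/ dirs to the search paths (path from this thread).
MPI_HOME=/root/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt
export PATH="$MPI_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# Quick self-check that the bin dir landed on PATH:
echo "$PATH" | grep -q "$MPI_HOME/bin" && echo "PATH ok"
```

This only helps non-interactive logins if the remote shell actually sources .bashrc, which is why the thread's switch from zsh to bash mattered.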
March 24, 2010, 11:48 |
|
#8 |
New Member
Jeff Squyres
Join Date: Mar 2009
Posts: 6
Rep Power: 17 |
Quote:
Hence, Open MPI doesn't (yet) give a positive ACK that you're using IB -- it gives a negative ACK if it can't. Enough people have asked for a positive ACK that we're likely to add it in the v1.5 series sometime.

BTW, you probably actually want to use "--mca btl openib,sm,self". This allows Open MPI to use shared memory for on-node communication (which can be faster than forcing it to loop back through your IB adapters). I don't know enough about OpenFOAM to know whether this will provide an overall performance boost or not.

Quote:

Open MPI uses TCP for setup and teardown, even if you're using OpenFabrics transports (IB or iWARP) for MPI communications.

Quote:

Did you happen to install one version of Open MPI and then install a different version over it? You *may* have Open MPI plugins from different versions in the same installation tree that don't play nicely with each other. If this is the case, try fully uninstalling Open MPI, manually inspect the $prefix/lib/openmpi dir to ensure that no plugins are left over from a prior Open MPI installation, and then install Open MPI again. Let me know if that works.

You can also try:

Code:
rm /open/mpi/installation/tree/lib/openmpi/mca_pml_csum.*

(you probably won't use the CSUM PML, so it's safe to either remove it or move it to a different location where Open MPI won't find it)

I'm not optimistic that that will fix it, but it could (if csum is from a prior Open MPI install).
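The inspect-and-move step can be sketched as follows. Here /tmp/ompi_demo stands in for the real installation prefix, and the two plugin files are faked with touch so the commands can be shown safely; moving rather than rm'ing keeps the change reversible:

```shell
prefix=/tmp/ompi_demo                       # stand-in for the real install prefix
mkdir -p "$prefix/lib/openmpi"
touch "$prefix/lib/openmpi/mca_pml_ob1.so" \
      "$prefix/lib/openmpi/mca_pml_csum.so" # fake plugins, for illustration only
# Park the csum PML somewhere Open MPI won't scan for plugins:
mkdir -p "$prefix/lib/openmpi.disabled"
mv "$prefix/lib/openmpi/mca_pml_csum.so" "$prefix/lib/openmpi.disabled/"
ls "$prefix/lib/openmpi"
# prints mca_pml_ob1.so
```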
March 24, 2010, 11:50 |
|
#9 |
Member
|
Thanks for the great clarifications on my untidy explanations.
March 24, 2010, 11:56 |
|
#10 |
Member
|
Thanks again.
March 24, 2010, 13:12 |
|
#11 |
New Member
Jeff Squyres
Join Date: Mar 2009
Posts: 6
Rep Power: 17 |
Quote:
Code:
./configure --prefix=/opt/openmpi ...
make -j 4 install

The $prefix/lib/openmpi/* files (or, more specifically, $libdir/openmpi/*) that I was referring to were Open MPI's plugins. Each plugin has at least 1 file (sometimes 2, depending on your installation) in $libdir/openmpi. For example, $libdir/openmpi/mca_btl_openib.so is the openib BTL plugin.

The CSUM PML plugin is the Point-to-point Messaging Layer plugin named CSUM. The PML is the layer right behind MPI_SEND and friends. Specifically, MPI_SEND calls the back-end PML send function to actually effect the send (and so on). Think of the PML as the engine that drives all the MPI messaging semantics (communicator and tag matching, etc.).

CSUM is a PML that does all the normal sending and receiving, but also checksums the data to ensure data integrity. This is a Good Thing, but it definitely imposes a performance overhead. Most transports provide their own data reliability checking (e.g., TCP), so CSUM typically isn't worth it. But some transports can have problems with end-to-end reliability -- that's why we developed CSUM. Normally, you should probably use the "ob1" PML. OB1 and CSUM are identical except that CSUM does the checksumming. Specifically, both OB1 and CSUM use BTL plugins underneath the covers to effect point-to-point transmission and reception. Hence, both CSUM and OB1 can use the openib BTL.

That being said, to be totally clear: while you can use multiple BTL plugins in a single run (e.g., openib, sm, and self), you can only use ONE PML at a time. So you'll use CSUM *or* OB1 -- not both. So my point before was that if CSUM was somehow mucked up on your system, you could remove the CSUM .so plugin file and then it wouldn't ever be used. But I wasn't confident that that would fix your problem.
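Applied to the runs earlier in this thread, forcing that layering explicitly might look like the following dry run (echoed rather than executed, since it needs the cluster; the --mca pml/btl flag syntax is standard Open MPI, but this exact invocation is illustrative):

```shell
# Dry run: one PML (ob1); the BTL list is the only place multiple plugins go.
echo mpirun --mca pml ob1 --mca btl openib,sm,self \
     -np 12 -machinefile list.txt simpleFoam -parallel -case inletProfile/
```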
March 24, 2010, 17:01 |
|
#12 |
Member
|
Think I just got it figured out...
Code:
host2:~ # which ompi_info
/root/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/bin/ompi_info
host2:~ # ompi_info | grep openib
MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.3)

Turns out it was in the Allwmake file. The relevant lines, changed to (for my configuration):

Code:
./configure \
    --prefix=$MPI_ARCH_PATH \
    --disable-mpirun-prefix-by-default \
    --disable-orterun-prefix-by-default \
    --enable-shared --disable-static \
    --disable-mpi-f77 --disable-mpi-f90 --disable-mpi-cxx \
    --disable-mpi-profile
    # These lines enable Infiniband support
    #--with-openib=/usr/local/ofed \
    #--with-openib-libdir=/usr/local/ofed/lib64
    --with-openib=/usr/include/infiniband
|
March 24, 2010, 17:10 |
|
#13 |
New Member
Jeff Squyres
Join Date: Mar 2009
Posts: 6
Rep Power: 17 |
I'm unfamiliar with Allwmake.
Quote:
Additionally, --enable-shared and --disable-static are also the defaults. The --with-openib line doesn't look quite right, but it probably squeaks by the tests we have in configure. Meaning: if you have OFED installed with the default /usr prefix, then Open MPI should be able to find OFED's headers and libraries with no extra help (because they're in the compiler's and linker's default search paths). So you should be able to use just --with-openib (i.e., not list any dir). But hey, if it works... :-)
March 24, 2010, 17:13 |
|
#14 |
Member
|
Thanks Jeff... this Allwmake is the script that sets up all of our OpenFOAM "3rd party" tools. So the points you make are relevant community-wide, and I wonder if I shouldn't try to make those changes and get them checked into the OpenFOAM SVN repository, as the best way to communicate this.
|
|
March 24, 2010, 17:23 |
|
#15 |
New Member
Jeff Squyres
Join Date: Mar 2009
Posts: 6
Rep Power: 17 |
Quote:
There's no \ after the --disable-mpi-profile line, so it probably ignored your --with-openib line. But then again, if OFED is installed in compiler/linker default locations, the --with-openib option is not strictly necessary because OMPI will find that stuff by default (and therefore build support for it). Specifically, we treat --with-<foo> options in OMPI's configure thusly: http://www.open-mpi.org/faq/?categor...#default-build Hope that helps... |
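For comparison, a continuation-corrected version of the fragment from post #12 might read as follows (same flags as that post, with the trailing \ after --disable-mpi-profile restored and the bare --with-openib suggested above; treat this as a sketch, not a tested build recipe):

```shell
./configure \
    --prefix=$MPI_ARCH_PATH \
    --disable-mpirun-prefix-by-default \
    --disable-orterun-prefix-by-default \
    --enable-shared --disable-static \
    --disable-mpi-f77 --disable-mpi-f90 --disable-mpi-cxx \
    --disable-mpi-profile \
    --with-openib
```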
March 24, 2010, 17:25 |
|
#16 |
Member
|
I've actually fixed that, but you make a good point... I think simply recompiling is what solved it, not the path reference... because it did work as shown above.
|
|
July 19, 2010, 21:01 |
OpenFOAM working on our cluster!
|
#17 |
Member
|
I did get this working, and would be happy to try to address similar problems with anybody in the future.
|
|
March 6, 2020, 10:52 |
|
#18 |
Senior Member
chandra shekhar pant
Join Date: Oct 2010
Posts: 220
Rep Power: 17 |
Hello Ben Racine,
I am also facing the same issue, which says:

Code:
FIPS integrity verification test failed.
orted: Command not found.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on one or more
  nodes. Please check your PATH and LD_LIBRARY_PATH settings, or
  configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
  (--tmpdir/orte_tmpdir_base). Please check with your sys admin to
  determine the correct location to use.

* compilation of the orted with dynamic libraries when static are
  required (e.g., on Cray). Please check your configure cmd line and
  consider using one of the contrib/platform definitions for your
  system type.

* an inability to create a connection back to mpirun due to a lack of
  common network interfaces and/or no route found between them. Please
  check network connectivity (including firewalls and network routing
  requirements).
--------------------------------------------------------------------------

when running on a cluster of 2 nodes using

Code:
mpirun --host n217:16,n219:16 -np 32 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh
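The first bullet in that error text points at the same fix discussed earlier in this thread: either bake the prefix into the build, or invoke mpirun by its full path so the prefix is forwarded to the remote orted daemons. As a sketch (the /opt/openmpi prefix here is purely illustrative):

```shell
# Rebuild Open MPI so remote orteds inherit the install prefix automatically:
./configure --prefix=/opt/openmpi --enable-orterun-prefix-by-default
# ...or launch via the full path, which implies --prefix on the remote nodes:
/opt/openmpi/bin/mpirun --host n217:16,n219:16 -np 32 --use-hwthread-cpus snappyHexMesh -parallel -overwrite
```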
|
Tags |
-machinefile, cluster, daemon, mpirun, parallel |