May 14, 2009, 14:27 |
running without rsh between nodes
|
#1 |
New Member
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 17 |
I've built OpenFOAM v1.5 on our 400-odd node cluster running Scientific Linux 5. It runs fine using all 4 cores on a node, but fails when running across 2 nodes, 4 cores per node, under Torque with

    bash: orted: command not found

from the command

    mpirun --hostfile $PBS_NODEFILE -np 8 oodles -parallel

We don't allow rsh between nodes, but do allow ssh. From experience, this orted error from mpirun is often due to the application expecting rsh between nodes. The usual way of running parallel jobs on our cluster is mpiexec, not mpirun, but if I try this the OpenFOAM application (oodles in this case) thinks it's running on one core.

Does anyone know:
- can OpenFOAM use mpiexec rather than mpirun?
- does multi-node OpenFOAM expect rsh access between nodes?
- if so, can it be told to use an alternative such as ssh or a script that mimics rsh (we use pbsdsh for this), if there is a hook to hang an rsh replacement on?

Thanks
-- Paul Hatton University of Birmingham |
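A side note on the error message: the "bash:" prefix suggests that a shell did start on the remote node and simply could not find orted, so one thing worth ruling out is that the OpenMPI bin directory is missing from PATH in non-interactive shells there. The following is only a minimal sketch of that check; on_path is a hypothetical helper name and <other-node> is a placeholder, neither comes from the thread.

```shell
#!/bin/sh
# on_path CMD: succeed if CMD resolves in the current shell's PATH.
# (Hypothetical helper, named here for illustration only.)
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Local check: is orted visible to this shell?
if on_path orted; then
    echo "orted found locally"
else
    echo "orted NOT found locally"
fi

# The same test on another node, run by hand over ssh (placeholder host name):
#   ssh <other-node> 'command -v orted || echo "orted not on PATH"'
```

If orted is found locally but not over ssh, the environment set up by the OpenFOAM bashrc is probably not being loaded in the remote, non-interactive shell.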
|
May 15, 2009, 04:02 |
|
#2 | |
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,714
Rep Power: 40 |
Quote:
http://www.open-mpi.org/faq/?category=rsh

We use OpenMPI/SGE without any problems just by specifying

    mpirun SomeFoamApplication -parallel

From the FAQ, it seems that Torque is similar:

http://www.open-mpi.org/faq/?category=tm

See point 5 on the FAQ about problems with the -host parameter. Maybe you are hitting that. |
||
May 15, 2009, 04:56 |
|
#3 |
New Member
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 17 |
Thank you *very* much, Olesen. I was experimenting with the standard MPI 'hello world' program last night and also found that specifying a hostfile causes problems with shared libraries not being found under Torque/OpenMPI.

I should probably start a new thread for this, but if I may pose one more question: it seems that the OpenFOAM binaries think that they are on just the master node if I specify only -np 8, for example, and no hostfile or host list to mpirun in a Torque job using 4 cores on each of 2 nodes. So it seems that OpenFOAM needs the hostfile to get the host list, but OpenMPI won't accept this under Torque. Is this correct and, if so, is there any known way out of it?

Can I build it against mvapich2, to also give us InfiniBand support? I guess not, if OpenFOAM uses OpenMPI constructs. If there is a section of the OpenFOAM documentation that I should be looking at, a pointer would be much appreciated.

I look after the overall HPC service here and support many applications, so I must apologise for being new to OpenFOAM; one of our research groups has asked for it. As an aside, the same problem arises with the computational chemistry program Molpro, which has also caused me much grief recently ....

Thanks again for the info and the URL to the OpenMPI FAQ.
-- Paul Hatton University of Birmingham P.S.Hatton@bham.ac.uk |
|
May 15, 2009, 06:06 |
|
#4 |
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,714
Rep Power: 40 |
The problem is not OpenFOAM (it doesn't know anything about cores, cpus, hosts), but a general openmpi/torque problem. Your queuing system decides how many process 'slots' should be used on which hosts and passes this information to the orte. The only extra information that OpenFOAM needs is the -parallel option. Specifying '-np ...' is probably messing things up.
|
|
May 15, 2009, 06:20 |
|
#5 |
New Member
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 17 |
Thanks again. If I qsub the following script:

    #!/bin/bash
    #PBS -j oe
    #PBS -l "walltime=1:00,nodes=2:ppn=4"
    #PBS -N FOAM-n2ppn4
    #PBS -q bbadmin
    cd "$PBS_O_WORKDIR"
    module load apps/openfoam
    . /apps/OpenFOAM/OpenFOAM-1.5/etc/bashrc
    export WM_PROJECT_USER_DIR=$PWD
    module load intel/fce/10.1.008
    mpirun oodles -parallel

I get, in the stdout job output:

    +--------------------------------------------------------------------------+
    | Job starting at 2009-05-15 10:12:24 for hattonps on the BlueBEAR Cluster
    | Job identity jobid 1195882 jobname FOAM-n2ppn4
    | Job requests nodes=2:ppn=4,pmem=1996mb,walltime=00:01:00
    | Job assigned to nodes u1n002 u1n001
    +--------------------------------------------------------------------------+
    bool Pstream::init(int& argc, char**& argv) : attempt to run parallel on 1 processor
    #0 Foam::error::printStack(Foam::Ostream&) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
    #1 Foam::error::abort() in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
    #2 Foam::Pstream::init(int&, char**&) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/openmpi-1.2.6/libPstream.so"
    #3 Foam::argList::argList(int&, char**&, bool, bool) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
    #4 __gxx_personality_v0 in "/apps/OpenFOAM/OpenFOAM-1.5/applications/bin/linux64GccDPOpt/oodles"
    #5 __libc_start_main in "/lib64/libc.so.6"
    #6 Foam::regIOobject::readIfModified() in "/apps/OpenFOAM/OpenFOAM-1.5/applications/bin/linux64GccDPOpt/oodles"

and so on. The line

    bool Pstream::init(int& argc, char**& argv) : attempt to run parallel on 1 processor

suggests that oodles doesn't think it's running on multiple cores in this case? I can run across cores on a node by specifying -np to mpirun, and I need to do this to get oodles to run multi-core, but then fall down when trying to run across nodes. I'm missing something obvious here ....

-- Paul Hatton University of Birmingham P.S.Hatton@bham.ac.uk |
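The "attempt to run parallel on 1 processor" message indicates that mpirun started a single process. As a rough sanity check, the number of processes expected by the case can be read off the processorN directories that decomposePar writes, one per subdomain. The sketch below assumes the case has already been decomposed; count_processor_dirs is a hypothetical helper name, not an OpenFOAM tool.

```shell
#!/bin/sh
# count_processor_dirs [CASE_DIR]: print how many processorN directories
# exist in CASE_DIR (default: current directory).
# Hypothetical helper for illustration; not part of OpenFOAM.
count_processor_dirs() {
    case_dir=${1:-.}
    n=0
    for d in "$case_dir"/processor[0-9]*; do
        # An unmatched glob stays literal and fails the -d test, giving 0.
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Inside a job script this reports what the case expects:
echo "processor directories: $(count_processor_dirs "${PBS_O_WORKDIR:-.}")"
```

For the 2 nodes x 4 cores job above, one would expect this to report 8; if mpirun then starts fewer processes than that, the launcher rather than OpenFOAM is the place to look.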
|
May 15, 2009, 06:49 |
|
#6 | |
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,714
Rep Power: 40 |
Unfortunately, I don't have any experience with Torque.
Quote:
For example,

    -pe threaded 2
    -pe mpich 16
    -pe openmpi 16

The 'threaded' (e.g. used by abaqus) has a particular allocation_rule. The 'mpich' (e.g. used by STAR-CD) has some extra start/stop rules. The 'openmpi' (e.g. used by OpenFOAM) also uses a 'fill_up' allocation rule (like mpich), but without special start/stop procedures.

I can't see anything similar in your example. Is it really running in parallel at all? There must be a job env variable, something like $NSLOTS, that you can echo out from your job script to check that the job script is indeed running as a parallel job. If it is, then you should check that a HelloWorld MPI job works too.

/mark |
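Under Torque, the closest equivalent to echoing $NSLOTS is probably to inspect $PBS_NODEFILE, which lists each allocated host once per core requested. A minimal sketch of that check follows; slot_summary is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# slot_summary NODEFILE: print the total number of slots, then one
# "host: slots" line per host, from a Torque-style node file.
# Hypothetical helper for illustration only.
slot_summary() {
    nodefile=$1
    # tr strips the padding that some wc implementations add.
    total=$(wc -l < "$nodefile" | tr -d ' ')
    echo "total slots: $total"
    sort "$nodefile" | uniq -c | awk '{ print $2 ": " $1 }'
}

# Only meaningful inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    slot_summary "$PBS_NODEFILE"
fi
```

For a nodes=2:ppn=4 request one would expect a total of 8, with 4 slots on each of the two hosts.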
||
May 18, 2009, 08:37 |
|
#7 |
New Member
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 17 |
Thanks. I've been looking more closely at our OpenMPI setup and threads. At present the library is built in single-threaded mode, which may be a problem: I can't even get a standard 'hello world' program running correctly, although running with mpich2 is fine. I've raised this with our suppliers, Clustervision, for advice.

Torque knows that a parallel run is asked for by the nodes=1:ppn=4 argument to the -l option, and there's no option to specify a particular parallel environment. I'll await advice from Clustervision and update this when I know more.

-- Paul Hatton University of Birmingham P.S.Hatton@bham.ac.uk |
|
May 18, 2009, 08:49 |
|
#8 |
New Member
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 17 |
With smilies turned off this time (I wish I could disable their use in my account for all posts but I can't see how to do this ...)
~~~~
Thanks. I've been looking more closely at our OpenMPI setup and threads. At present the library is built in single-threaded mode, which may be a problem: I can't even get a standard 'hello world' program running correctly, although running with mpich2 is fine. I've raised this with our suppliers, Clustervision, for advice.

Torque knows that a parallel run is asked for by the nodes=1:ppn=4 argument to the -l option, and there's no option to specify a particular parallel environment. I'll await advice from Clustervision and update this when I know more.
__________________
-- Paul Hatton The University of Birmingham |
|
May 18, 2009, 08:58 |
|
#9 | |
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,714
Rep Power: 40 |
Quote:
On my system (openSUSE 11.1), the system-installed OpenMPI (1.2.8) is found under

    /usr/lib64/mpi/gcc/openmpi/bin/ompi_info

and shows that very little has been configured. The version used by my OpenFOAM (1.3.2) is found under

    $MPI_ARCH_PATH/bin/ompi_info

and shows lots of things have been configured, including 'gridengine'.

Maybe you are getting the wrong version, or maybe it wasn't configured to handle Torque. For new openmpi versions, GridEngine support must be configured as well (--with-sge) when configuring/compiling openmpi. |
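The same ompi_info check can be scripted for Torque. The sketch below assumes that a Torque-enabled OpenMPI build lists a 'tm' component on lines of the form "MCA ras: tm (...)" (with 'pls' or 'plm' as the launcher framework name, depending on the OpenMPI version); has_tm_support is a hypothetical helper name, and the exact output format should be confirmed against your own ompi_info.

```shell
#!/bin/sh
# has_tm_support: read `ompi_info` output on stdin and succeed if a
# Torque (tm) component appears. The line format matched here is an
# assumption about ompi_info output, not something shown in the thread.
has_tm_support() {
    grep -Eq 'MCA (ras|pls|plm): tm( |$)'
}

# Check whichever ompi_info comes first on PATH, if there is one.
if command -v ompi_info >/dev/null 2>&1; then
    echo "using: $(command -v ompi_info)"
    if ompi_info | has_tm_support; then
        echo "tm (Torque) support: yes"
    else
        echo "tm (Torque) support: no"
    fi
fi
```

Printing the resolved path first also helps with the "wrong version" possibility: it shows whether the system OpenMPI or the one under $MPI_ARCH_PATH is being picked up.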
||
March 19, 2010, 01:39 |
|
#10 |
Senior Member
J. Cai
Join Date: Apr 2009
Posts: 180
Rep Power: 17 |
Hi Paul, how did you solve this problem? I have just run into the same problem.
Best regards, Chiven |
|
March 22, 2010, 16:02 |
|
#11 |
New Member
Paul Hatton
Join Date: May 2009
Posts: 6
Rep Power: 17 |
I ended up rebuilding the MPI that came with OpenFOAM; I couldn't get it to link to an already-existing one. To do this:

    tar xzf OpenFOAM-1.5.General.gtgz
    tar xzf ThirdParty.General.gtgz
    tar xzf ThirdParty.linux64Gcc.gtgz
    rm -r ThirdParty/openmpi-1.2.6/platforms    # to force OpenMPI build

Edit ThirdParty/Allwmake:

    ./configure \
        --prefix=$MPI_ARCH_PATH \
        --disable-mpirun-prefix-by-default \
        --disable-orterun-prefix-by-default \
        --enable-shared --disable-static \
        --disable-mpi-f77 --disable-mpi-f90 --disable-mpi-cxx \
        --disable-mpi-profile \
        --with-openib=/cvos/shared/apps/ofed/1.3 \
        --with-openib-libdir=/cvos/shared/apps/ofed/1.3/lib64 \
        --with-tm=/cvos/shared/apps/torque/current

        # These lines enable Infiniband support
        # --with-openib=/usr/local/ofed \
        # --with-openib-libdir=/usr/local/ofed/lib64

and then the usual build.

Deleting ThirdParty/openmpi-1.2.6/platforms tells OpenFOAM to build its own OpenMPI. The added lines

    --with-openib=/cvos/shared/apps/ofed/1.3 \
    --with-openib-libdir=/cvos/shared/apps/ofed/1.3/lib64 \
    --with-tm=/cvos/shared/apps/torque/current

are the usual arguments to the OpenMPI build to tell it to pick up the OpenIB InfiniBand drivers and link with Torque. It ran OK after this.

HTH
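After a rebuild like this, one way to confirm that mpirun now takes its host list from Torque is to launch hostname instead of a solver and compare the result with $PBS_NODEFILE. This is only a sketch under the assumption that both sources report host names in the same form (short vs. fully qualified); same_placement is a hypothetical helper name.

```shell
#!/bin/sh
# same_placement NODEFILE: read the output of `mpirun hostname` on stdin
# and succeed only if every host appears exactly as often as in NODEFILE.
# Hypothetical helper; assumes both sides use the same host name form.
same_placement() {
    expected=$(sort "$1" | uniq -c)
    actual=$(sort | uniq -c)
    [ "$expected" = "$actual" ]
}

# Intended use inside a Torque job script, with no -np and no hostfile:
#   if mpirun hostname | same_placement "$PBS_NODEFILE"; then
#       echo "processes placed as allocated by Torque"
#   else
#       echo "placement differs from PBS_NODEFILE"
#   fi
```

If the placement matches, the solver can be started the same way, e.g. mpirun oodles -parallel as in the job script earlier in the thread.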
__________________
-- Paul Hatton The University of Birmingham |
|