running without rsh between nodes

hattonps · May 14, 2009, 14:27

I've built OpenFOAM v1.5 on our 400-odd node cluster running Scientific Linux 5. It runs fine using all 4 cores on a node but fails when running across 2 nodes, 4 cores per node under torque with

bash: orted: command not found

from the command

mpirun --hostfile $PBS_NODEFILE -np 8 oodles -parallel

We don't allow rsh between nodes, but do allow ssh. From experience this orted error from mpirun is often due to the application expecting rsh between nodes.

The usual way of running parallel jobs on our cluster is mpiexec, not mpirun, but if I try this the OpenFOAM application (oodles in this case) thinks it's running on one core. Does anyone know:

- can OpenFOAM use mpiexec rather than mpirun?

- does multi-node OpenFOAM expect rsh access between nodes?

- if so, can it be told to use an alternative such as ssh or a script that mimics rsh (we use pbsdsh for this) if there is a hook to hang an rsh replacement on?

Thanks

--
Paul Hatton
University of Birmingham

olesen · May 15, 2009, 04:02

Quote:

Originally Posted by hattonps

bash: orted: command not found
...
We don't allow rsh between nodes, but do allow ssh. From experience this orted error from mpirun is often due to the application expecting rsh between nodes.

According to the openmpi FAQ, ssh appears to be used by default, not rsh

http://www.open-mpi.org/faq/?category=rsh

We use OpenMPI/SGE without any problems just by specifying
mpirun SomeFoamApplication -parallel

From the FAQ, it seems that Torque is similar.
http://www.open-mpi.org/faq/?category=tm
See point 5 on the FAQ about problems with the -host parameter. Maybe you are hitting that.
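As a rough illustration of how one might investigate the "orted: command not found" message: the shell that mpirun starts on the remote node is non-interactive, so its PATH may not include the Open MPI bin directory. The sketch below checks whether an executable called orted exists in any directory of a PATH-style string. The function name find_orted is hypothetical and is not part of Open MPI or OpenFOAM.

```shell
#!/bin/bash
# Sketch: look for an executable named "orted" in a colon-separated,
# PATH-style string. find_orted is a hypothetical helper name.
find_orted() {
    local dir
    local -a dirs
    # Split the argument on ':' into an array of directories.
    IFS=':' read -r -a dirs <<< "$1"
    for dir in "${dirs[@]}"; do
        # Require a non-empty directory and an executable regular file.
        if [ -n "$dir" ] && [ -f "$dir/orted" ] && [ -x "$dir/orted" ]; then
            printf '%s\n' "$dir/orted"
            return 0
        fi
    done
    return 1
}
```

One possible use, assuming ssh to the compute node is allowed, is to capture the remote PATH with ssh somenode 'echo $PATH' (somenode being a placeholder) and pass the result to find_orted. If nothing is printed, the remote shell would not find orted either.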

hattonps · May 15, 2009, 04:56

Thank you *very* much, Olesen. I was experimenting with the standard MPI 'hello world' program last night and also found that specifying a hostfile causes problems with shared libraries not being found under torque/OpenMPI.

I should probably start a new thread for this, but if I may pose one more question: it seems that the OpenFOAM binaries think they are on just the master node if I specify only -np 8, for example, and give mpirun no hostfile or host list, in a torque job using 4 cores on each of 2 nodes. So it seems that OpenFOAM needs the hostfile to get the host list, but OpenMPI won't accept this under torque. Is this correct and, if so, is there any known way out of it? Can I build it against mvapich2, to also give us InfiniBand support? I guess not if OpenFOAM uses OpenMPI constructs. If there is a section of the OpenFOAM documentation that I should be looking at, a pointer would be much appreciated. I look after the overall HPC service here and support many applications, so I must apologise for being new to OpenFOAM; one of our research groups has asked for it.

As an aside, the same problem arises with the Computational Chemistry program Molpro which has also caused me much grief recently ....

Thanks again for the info and the URL to the OpenMPI FAQ.

--
Paul Hatton
University of Birmingham
P.S.Hatton@bham.ac.uk

olesen · May 15, 2009, 06:06

Quote:

Originally Posted by hattonps

... it seems that the OpenFOAM binaries think that they are on just the master node if I just specify -np 8, for example, and no hostfile or host list to mpirun in a torque job using 4 cores on each of 2 nodes.

The problem is not OpenFOAM (it doesn't know anything about cores, cpus, hosts), but a general openmpi/torque problem. Your queuing system decides how many process 'slots' should be used on which hosts and passes this information to the orte. The only extra information that OpenFOAM needs is the -parallel option. Specifying '-np ...' is probably messing things up.

hattonps · May 15, 2009, 06:20

Thanks again. If I qsub the following script:

#!/bin/bash
#PBS -j oe
#PBS -l "walltime=1:00,nodes=2:ppn=4"
#PBS -N FOAM-n2ppn4
#PBS -q bbadmin

cd "$PBS_O_WORKDIR"

module load apps/openfoam
. /apps/OpenFOAM/OpenFOAM-1.5/etc/bashrc
export WM_PROJECT_USER_DIR=$PWD

module load intel/fce/10.1.008

mpirun oodles -parallel

I get, in the stdout job output:

+--------------------------------------------------------------------------+
| Job starting at 2009-05-15 10:12:24 for hattonps on the BlueBEAR Cluster
| Job identity jobid 1195882 jobname FOAM-n2ppn4
| Job requests nodes=2:ppn=4,pmem=1996mb,walltime=00:01:00
| Job assigned to nodes u1n002 u1n001
+--------------------------------------------------------------------------+
bool Pstream::init(int& argc, char**& argv) : attempt to run parallel on 1 processor
#0 Foam::error::printStack(Foam::Ostream&) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
#1 Foam::error::abort() in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
#2 Foam::Pstream::init(int&, char**&) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/openmpi-1.2.6/libPstream.so"
#3 Foam::argList::argList(int&, char**&, bool, bool) in "/apps/OpenFOAM/OpenFOAM-1.5/lib/linux64GccDPOpt/libOpenFOAM.so"
#4 __gxx_personality_v0 in "/apps/OpenFOAM/OpenFOAM-1.5/applications/bin/linux64GccDPOpt/oodles"
#5 __libc_start_main in "/lib64/libc.so.6"
#6 Foam::regIOobject::readIfModified() in "/apps/OpenFOAM/OpenFOAM-1.5/applications/bin/linux64GccDPOpt/oodles"

and so on.

The line

bool Pstream::init(int& argc, char**& argv) : attempt to run parallel on 1 processor

suggests that oodles doesn't think it's running on multiple cores in this case?

I can run across cores on a node by specifying -np to mpirun, and I need to do this to get oodles to run multi-core, but then fall down when trying to run across nodes.

I'm missing something obvious here ....

--
Paul Hatton
University of Birmingham
P.S.Hatton@bham.ac.uk

olesen · May 15, 2009, 06:49

Unfortunately, I don't have any experience with Torque.

Quote:

#PBS -j oe
#PBS -l "walltime=1:00,nodes=2:ppn=4"
#PBS -N FOAM-n2ppn4
#PBS -q bbadmin

Is the '-l' request sufficient for Torque to know it is a particular type of parallel job? Speaking from a GridEngine perspective, I need to specify a parallel environment.

For example,
-pe threaded 2
-pe mpich 16
-pe openmpi 16

The 'threaded' (eg, used by abaqus) has a particular allocation_rule.
The 'mpich' (eg, used by STAR-CD) has some extra start/stop rules.
The 'openmpi' (eg, used by OpenFOAM) also uses a 'fill_up' allocation rule (like mpich), but without special start/stop procedures.

I can't see anything similar in your example. Is it really running in parallel at all?
There must be a job env variable something like $NSLOTS that you can echo out from your job script to check that the job script is indeed running as a parallel job. If it is, then you should check that a HelloWorld mpi job works too.
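On the Torque side, a check along these lines can be sketched with the node file that Torque writes for each job: $PBS_NODEFILE normally lists each host once per allocated core, so counting its lines gives the number of slots granted. The helper names slots_per_host and total_slots below are hypothetical.

```shell
#!/bin/bash
# Sketch: summarise a Torque node file (one line per allocated slot).
# slots_per_host and total_slots are hypothetical helper names.
slots_per_host() {
    # Prints "<host> <count>" for each distinct host, sorted by host name.
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

total_slots() {
    # Counts non-empty lines, i.e. the total number of slots granted.
    grep -c . "$1"
}
```

Inside a job script, calling total_slots "$PBS_NODEFILE" for a nodes=2:ppn=4 request would be expected to report 8, and slots_per_host "$PBS_NODEFILE" should show 4 slots on each of the two hosts. A smaller number would suggest the job was not given the parallel allocation that was asked for.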

/mark

hattonps · May 18, 2009, 08:37

Thanks. I've been looking more closely at our OpenMP setup and threads. At present the library is built in single-threaded mode which may be a problem - I can't even get a standard 'hello world' program running correctly, although running with mpich2 is fine. I've raised this with our suppliers - Clustervision - for advice.

Torque knows that a parallel run is asked for by the

nodes=1:ppn=4

argument to the -l option, and there's no option to specify a particular parallel environment. I'll await advice from Clustervision and update this when I know more.

--
Paul Hatton
University of Birmingham
P.S.Hatton@bham.ac.uk

hattonps · May 18, 2009, 08:49

With smilies turned off this time (I wish I could disable their use in my account for all posts but I can't see how to do this ...)

~~~~

Thanks. I've been looking more closely at our OpenMP setup and threads. At present the library is built in single-threaded mode which may be a problem - I can't even get a standard 'hello world' program running correctly, although running with mpich2 is fine. I've raised this with our suppliers - Clustervision - for advice.

Torque knows that a parallel run is asked for by the

nodes=1:ppn=4

argument to the -l option, and there's no option to specify a particular parallel environment. I'll await advice from Clustervision and update this when I know more.

olesen · May 18, 2009, 08:58

Quote:

Originally Posted by hattonps

I've been looking more closely at our OpenMP setup and threads. At present the library is built in single-threaded mode which may be a problem - I can't even get a standard 'hello world' program running correctly

Good that you've localized the problem a bit. While waiting for ClusterVision to answer your call, you might check the openmpi config yourself. The 'ompi_info' command should provide some information.
On my system (openSUSE 11.1), the system installed version (1.2.8) is found under /usr/lib64/mpi/gcc/openmpi/bin/ompi_info and shows that very little has been configured. My OpenFOAM version (1.3.2) is found under $MPI_ARCH_PATH/bin/ompi_info and shows lots of things have been configured - including 'gridengine'.

Maybe you are getting the wrong version, or maybe it wasn't configured to handle torque. For new openmpi versions, the GridEngine must be configured as well (--with-sge) when configuring/compiling openmpi.
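A minimal sketch of such a check, assuming ompi_info lists its components in the usual "MCA <framework>: <component> (...)" layout: the function below reads ompi_info output on stdin and succeeds if a Torque ("tm") component appears. The name has_tm_component is hypothetical, and the exact output layout may differ between Open MPI versions.

```shell
#!/bin/bash
# Sketch: read `ompi_info` output on stdin and succeed if a Torque ("tm")
# component is listed. has_tm_component is a hypothetical helper name.
# Assumes lines of the form "MCA <framework>: <component> (...)".
has_tm_component() {
    grep -Eq 'MCA[[:space:]]+[a-z]+:[[:space:]]+tm([[:space:]]|$)'
}
```

It could be run as ompi_info | has_tm_component && echo "tm support found", once for the system-installed ompi_info and once for the one under $MPI_ARCH_PATH/bin, to see whether either build knows about Torque.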

chiven · March 19, 2010, 01:39

Hi Paul, how did you solve this problem? I have just run into the same one.

Best regards,
Chiven

hattonps · March 22, 2010, 16:02

I ended up rebuilding the MPI that came with OpenFOAM; I couldn't get it to link to an already-existing one. To do this:

tar xzf OpenFOAM-1.5.General.gtgz
tar xzf ThirdParty.General.gtgz
tar xzf ThirdParty.linux64Gcc.gtgz
rm -r ThirdParty/openmpi-1.2.6/platforms    # to force the OpenMPI build

Edit ThirdParty/Allwmake:

./configure \
--prefix=$MPI_ARCH_PATH \
--disable-mpirun-prefix-by-default \
--disable-orterun-prefix-by-default \
--enable-shared --disable-static \
--disable-mpi-f77 --disable-mpi-f90 --disable-mpi-cxx \
--disable-mpi-profile \
--with-openib=/cvos/shared/apps/ofed/1.3 \
--with-openib-libdir=/cvos/shared/apps/ofed/1.3/lib64 \
--with-tm=/cvos/shared/apps/torque/current
# These lines enable Infiniband support
# --with-openib=/usr/local/ofed \
# --with-openib-libdir=/usr/local/ofed/lib64

and then the usual build.

Deleting ThirdParty/openmpi-1.2.6/platforms tells OpenFOAM to build its own OpenMPI. The added options

--with-openib=/cvos/shared/apps/ofed/1.3 \
--with-openib-libdir=/cvos/shared/apps/ofed/1.3/lib64 \
--with-tm=/cvos/shared/apps/torque/current

are the usual arguments to the OpenMPI build that tell it to pick up the OpenIB InfiniBand drivers and link with torque. It ran OK after this.

HTH
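After a rebuild like this, it may be worth confirming that the job actually picks up the rebuilt mpirun rather than a system-wide one. The sketch below checks that a command found via PATH lives in an expected directory. The helper name resolves_under is hypothetical; $MPI_ARCH_PATH is the variable set by the OpenFOAM environment.

```shell
#!/bin/bash
# Sketch: check that a command found via PATH lives in an expected directory.
# resolves_under is a hypothetical helper name.
resolves_under() {
    local cmd="$1" expected_dir="$2" found
    # Fail if the command cannot be found at all.
    found=$(command -v "$cmd") || return 1
    # Compare the command's directory with the expected one
    # (a trailing slash on the expected directory is ignored).
    [ "$(dirname "$found")" = "${expected_dir%/}" ]
}
```

For example, after sourcing the OpenFOAM bashrc in the job script, resolves_under mpirun "$MPI_ARCH_PATH/bin" || echo "unexpected mpirun" would flag the case where a different MPI installation comes first in PATH.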
