Job Scheduler for parallel processing

#1 | March 6, 2006, 17:22
kumar (kumar2), Senior Member
Hello friends

I saw in the manual (User Guide, page U-63, the damBreak case) that OpenFOAM parallel runs can only be executed from the command line. Does this mean that we cannot use a job scheduler?

Thanks in advance

kumar

#2 | March 7, 2006, 03:04
kumar (kumar2), Senior Member
Can anyone please give some input?

thanks

kumar

#3 | March 7, 2006, 04:37
Francesco Del Citto (fra76), Senior Member, Zürich Area, Switzerland
You can easily run OpenFOAM with a job scheduler, even for a parallel job. Here is a simple script I have used with Torque (PBS).
I hope it can be helpful!

#!/bin/bash

#PBS -N fiume
#PBS -j oe

#lamboot -v $PBS_O_MACHINES

# Move to the submission directory and boot LAM on the nodes
# assigned by PBS, using ssh instead of rsh
cd $PBS_O_WORKDIR
export LAMRSH=ssh
lamboot $PBS_NODEFILE

# Run the solver in parallel; <rootcase> and <casedir> are the usual
# OpenFOAM root and case directory arguments. stdin is redirected from
# /dev/null; stdout and stderr both go to output.out.
mpiexec interFoam <rootcase> <casedir> -parallel < /dev/null > output.out 2>&1

# Shut down the LAM daemons when the solver has finished
lamhalt -d
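
For example, if you save this script as run_interFoam.pbs (any name will do), you can submit it to Torque like this; the node, processor and walltime requests are only placeholders to adapt to your cluster:

# Request 2 nodes with 2 processors each and a 12-hour walltime (adjust as needed)
qsub -l nodes=2:ppn=2,walltime=12:00:00 run_interFoam.pbs

# Check the job status
qstat -u $USER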


#6 | March 7, 2006, 05:12
Pierre Maruzewski (pierrot), New Member
Hi Francesco,

I tried your script, but I get:

LAM attempted to execute a process on the remote node "node191",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "ssh"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell

Try invoking the following command at the unix command line:

ssh node191 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

Lamnodes Failed!
Check if you had booted lam before calling mpiexec else use -machinefile to pass host file to mpiexec



So, can you help me?

Pierre

#7 | March 7, 2006, 05:38
kumar (kumar2), Senior Member
Hi Francesco
Thanks a lot for the script. Let me give it a try.

Hi Pierre, I will get back to you after trying out Francesco's script.

Regards

Kumar

#8 | March 7, 2006, 05:53
Francesco Del Citto (fra76), Senior Member, Zürich Area, Switzerland
This is what I get when executing the commands contained in the script from an interactive job:
-------------------------------------------------
[carlo@epsilon runFiume]$ qsub -I -l nodes=2
qsub: waiting for job 2270.epsilon to start
qsub: job 2270.epsilon ready

Executing: /server/carlo/OpenFOAM/OpenFOAM-1.2/.bashrc
Executing: /server/carlo/OpenFOAM/OpenFOAM-1.2/.OpenFOAM-1.2/apps/ensightFoam/bashrc
Executing: /server/carlo/OpenFOAM/OpenFOAM-1.2/.OpenFOAM-1.2/apps/paraview/bashrc
[carlo@node2 ~]$ cd $PBS_O_WORKDIR
[carlo@node2 runFiume]$ export LAMRSH=ssh
[carlo@node2 runFiume]$ lamboot $PBS_NODEFILE

LAM 7.1.1 - Indiana University

[carlo@node2 runFiume]$
-------------------------------------------------

I guess the problem is the configuration of ssh/rsh on your cluster. I've configured my cluster so that a group of users can go from node to node with ssh without being asked for a password. This is easy to do because the nodes share an NFS file system from the server, and authentication is provided by a NIS server. So the only thing I've done is add the public key contained in ~/.ssh/id_dsa.pub to the ~/.ssh/authorized_keys file.
This allows LAM to use ssh without being asked for a password. Another issue can be an error while ssh tries to forward the X11 connection. This can be avoided with
export LAMRSH="ssh -x"
The default behaviour of the LAM distributed with OpenFOAM is to connect from one node to another using rsh. If you can access the nodes with rsh, you can try deleting the export LAMRSH line from the script.
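
For reference, the key setup is roughly this (assuming DSA keys and a home directory shared over NFS, as above; with RSA keys use id_rsa instead):

# Generate a key pair with an empty passphrase (only if you don't already have one)
ssh-keygen -t dsa -N "" -f ~/.ssh/id_dsa

# Authorize the public key; with a shared home directory this covers all nodes
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Test: this must not prompt for a password and must print only the shell name
ssh -x node224 -n 'echo $SHELL'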
For debugging purposes, you can try this:
lamboot -v $PBS_NODEFILE
and post the result.
Francesco

#9 | March 8, 2006, 06:15
Pierre Maruzewski (pierrot), New Member
Dear Francesco,

I tried lamboot -v $PBS_NODEFILE, and this is the result:

lamboot -v $PBS_NODEFILE

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<5503> ssi:boot:base:linear: booting n0 (node223)
n-1<5503> ssi:boot:base:linear: booting n1 (node224)
ERROR: LAM/MPI unexpectedly received the following on stderr:

------------------------- /usr/local/Modules/versions --------------------------
3.1.6

--------------------- /usr/local/Modules/3.1.6/modulefiles ---------------------
dot module-cvs module-info modules null use.own

------------------------ /usr/local/Modules/modulefiles ------------------------
NAGWare_f95-amd64_glibc22/22 goto/0.96-2(default)
NAGWare_f95-amd64_glibc23/23 gsl/1.6(default)
acml/2.5.1(default) intel-cc/8.1.024
acml_generic_pgi/2.5.0 intel-fc/8.1.021
acml_pathscale/2.5.1 mpich-gm-1.2.6..14b/pathscale-2.3
acml_scalapack_generic_pgi/2.5.0 mpich-gm-1.2.6..14b/pgi-6.0(default)
ansys/10.0 mpich_pathscale/1.2.6..13b
ansys/10.0_SP1(default) mpich_pgi/1.2.6..13b
cernlib/2005(default) nag/21
fftw-2.1.5/pgi-6.0 pathscale/2.0(default)
fftw-3.0.1/pgi-6.0 pathscale/2.3
fluent/6.0.20 pgi/5.2(default)
fluent/6.2.16(default) pgi/6.0
gcc/4.0.2
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "node224",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "ssh"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell

Try invoking the following command at the unix command line:

ssh -x node224 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<5503> ssi:boot:base:linear: Failed to boot n1 (node224)
n-1<5503> ssi:boot:base:linear: aborted!
n-1<5508> ssi:boot:base:linear: booting n0 (node223)
n-1<5508> ssi:boot:base:linear: booting n1 (node224)
ERROR: LAM/MPI unexpectedly received the following on stderr:

------------------------- /usr/local/Modules/versions --------------------------
3.1.6

--------------------- /usr/local/Modules/3.1.6/modulefiles ---------------------
dot module-cvs module-info modules null use.own

------------------------ /usr/local/Modules/modulefiles ------------------------
NAGWare_f95-amd64_glibc22/22 goto/0.96-2(default)
NAGWare_f95-amd64_glibc23/23 gsl/1.6(default)
acml/2.5.1(default) intel-cc/8.1.024
acml_generic_pgi/2.5.0 intel-fc/8.1.021
acml_pathscale/2.5.1 mpich-gm-1.2.6..14b/pathscale-2.3
acml_scalapack_generic_pgi/2.5.0 mpich-gm-1.2.6..14b/pgi-6.0(default)
ansys/10.0 mpich_pathscale/1.2.6..13b
ansys/10.0_SP1(default) mpich_pgi/1.2.6..13b
cernlib/2005(default) nag/21
fftw-2.1.5/pgi-6.0 pathscale/2.0(default)
fftw-3.0.1/pgi-6.0 pathscale/2.3
fluent/6.0.20 pgi/5.2(default)
fluent/6.2.16(default) pgi/6.0
gcc/4.0.2
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "node224",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "ssh"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell

Try invoking the following command at the unix command line:

ssh -x node224 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<5508> ssi:boot:base:linear: Failed to boot n1 (node224)
n-1<5508> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully

#10 | March 8, 2006, 06:42
Francesco Del Citto (fra76), Senior Member, Zürich Area, Switzerland
Look at the message:

[...]
Try invoking the following command at the unix command line:

ssh -x node224 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
[...]

Have you tried that command?

Another issue is the "module" command (it seems you have almost the same configuration as my cluster...). Look at the result of these commands:

---------------------------------------------------------------
[francesco@epsilon ~]$ module av

---------------------------- /usr/local/modulefiles ----------------------------
dot module-cvs module-info modules null use.own

--------------------------- /opt/Modules/modulefiles ---------------------------
gnu gnu41 intel8 intel9 lam mpich openmpi
[francesco@epsilon ~]$ module av 2>/dev/null
[francesco@epsilon ~]$
---------------------------------------------------------------

This means that the command "module available" prints its output on the standard error stream. LAM returns an error if there is any output on standard error while executing the remote shell command (ssh or rsh). That's also why they suggest using "ssh -x": it doesn't even try to open an X connection.
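
As the error message itself suggests, another workaround should be to make LAM ignore output on stderr by setting the SSI parameter boot_rsh_ignore_stderr to 1. Something like this (I'm assuming lamboot takes SSI parameters via -ssi like the other LAM commands; check the lamboot man page):

# Make the rsh/ssh boot module tolerate output on stderr
lamboot -ssi boot_rsh_ignore_stderr 1 -v $PBS_NODEFILE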

I guess you have the command "module available" in your ~/.bashrc or in the global /etc/bashrc (I'm assuming you're using bash). You have to remove that command, or replace it with "module available 2>&1", redirecting stderr to stdout, so that it no longer confuses LAM.
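
In practice that means editing ~/.bashrc along these lines (the surrounding contents are just an illustration):

# ~/.bashrc (sketch): keep shell startup quiet on stderr for non-interactive ssh
# old line, prints the module list to stderr and trips LAM's heuristic:
#module available
# fixed line, stderr redirected to stdout:
module available 2>&1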

Furthermore, I think you can always remove the LAM distributed with OpenFOAM and install it from source, enabling support for Torque, so that neither rsh nor ssh would be required. I've done it, and it works fine with other MPI applications, but I've never tried it with OpenFOAM.

I hope this can help you.
Francesco

#11 | March 8, 2006, 07:09
Pierre Maruzewski (pierrot), New Member
Hi Francesco,

It's running perfectly now, thanks to "module available 2>&1".

Pierre
