|
March 6, 2006, 17:22 |
|
#1 |
Senior Member
kumar
Join Date: Mar 2009
Posts: 112
Rep Power: 17 |
Hello friends,
I saw in the manual (User Guide, U-63, damBreak case) that OpenFOAM parallel runs can only be executed from the command line. Does this mean that we cannot use a job scheduler? Thanks in advance, kumar |
|
March 7, 2006, 03:04 |
|
#2 |
Senior Member
kumar
Join Date: Mar 2009
Posts: 112
Rep Power: 17 |
Can anyone please give some input?
Thanks, kumar |
|
March 7, 2006, 04:37 |
|
#3 |
Senior Member
Francesco Del Citto
Join Date: Mar 2009
Location: Zürich Area, Switzerland
Posts: 237
Rep Power: 18 |
You can easily run OpenFOAM using a job scheduler, even for a parallel job. Here follows a simple script I used with Torque (PBS). I hope it can be helpful!

#!/bin/bash
#PBS -N fiume
#PBS -j oe

#lamboot -v $PBS_O_MACHINES
cd $PBS_O_WORKDIR
export LAMRSH=ssh
lamboot $PBS_NODEFILE
mpiexec interFoam <rootcase> <casedir> -parallel < /dev/null > output.out 2>&1
lamhalt -d
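A minimal sketch of how a script like this is typically driven, assuming it has been saved as run.pbs (a hypothetical name), that decomposeParDict already requests the desired number of subdomains, and using the same <rootcase> <casedir> placeholder style as the script above:

decomposePar <rootcase> <casedir>     # split the case into processor* directories, one per subdomain
qsub run.pbs                          # hand the script above to Torque/PBS
qstat                                 # check the state of the queued job
reconstructPar <rootcase> <casedir>   # reassemble the decomposed results once the job has finished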
|
March 7, 2006, 05:12 |
|
#6 |
New Member
Pierre Maruzewski
Join Date: Mar 2009
Posts: 6
Rep Power: 17 |
Hi Francesco,
I tried your script, but:

LAM attempted to execute a process on the remote node "node191", but received some output on the standard error. This heuristic assumes that any output on the standard error indicates a fatal error, and therefore aborts. You can disable this behavior (i.e., have LAM ignore output on standard error) in the rsh boot module by setting the SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "ssh" to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or can indicate an error in your $HOME/.cshrc, $HOME/.login, or $HOME/.profile files. The following is a (non-inclusive) list of items that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you are using rsh) for the machine and username that you are running from
- Your .cshrc/.profile must not print anything out to the standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment variable to your default shell

Try invoking the following command at the unix command line:

ssh node191 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not* be prompted for a password to invoke this command on the remote node. No output should be printed from the remote node before the output of the command is displayed. When you can get this command to execute successfully by hand, LAM will probably be able to function properly.
-----------------------------------------------------------------------------
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

Lamnodes Failed! Check if you had booted lam before calling mpiexec, else use -machinefile to pass a host file to mpiexec.

So can you help me?
Pierre |
|
March 7, 2006, 05:38 |
|
#7 |
Senior Member
kumar
Join Date: Mar 2009
Posts: 112
Rep Power: 17 |
Hi Francesco,
Thanks a lot for the script, let me give it a try. Hi Pierre, I will get back after trying out Francesco's script. Regards, Kumar |
|
March 7, 2006, 05:53 |
|
#8 |
Senior Member
Francesco Del Citto
Join Date: Mar 2009
Location: Zürich Area, Switzerland
Posts: 237
Rep Power: 18 |
This is what I get while executing the commands contained in the script from an interactive job:

-------------------------------------------------
[carlo@epsilon runFiume]$ qsub -I -l nodes=2
qsub: waiting for job 2270.epsilon to start
qsub: job 2270.epsilon ready

Executing: /server/carlo/OpenFOAM/OpenFOAM-1.2/.bashrc
Executing: /server/carlo/OpenFOAM/OpenFOAM-1.2/.OpenFOAM-1.2/apps/ensightFoam/bashrc
Executing: /server/carlo/OpenFOAM/OpenFOAM-1.2/.OpenFOAM-1.2/apps/paraview/bashrc
[carlo@node2 ~]$ cd $PBS_O_WORKDIR
[carlo@node2 runFiume]$ export LAMRSH=ssh
[carlo@node2 runFiume]$ lamboot $PBS_NODEFILE

LAM 7.1.1 - Indiana University

[carlo@node2 runFiume]$
-------------------------------------------------

I guess the problem is the configuration of ssh/rsh on your cluster. I have configured my cluster so that a group of users can ssh from node to node without being asked for a password. This is easy to do because the nodes share an NFS file system from the server and authentication is provided by a NIS server, so the only thing I had to do was add the public key contained in ~/.ssh/id_dsa.pub to the ~/.ssh/authorized_keys file. This allows LAM to use ssh without being asked for a password.

Another issue can be an error while ssh tries to forward the X11 connection. This can be avoided with

export LAMRSH="ssh -x"

The standard behaviour of the LAM distributed with OpenFOAM is to try to connect from one node to another using rsh. If you can access a node with rsh, you can try deleting the export LAMRSH line from the script.

For debugging purposes, you can try this:

lamboot -v $PBS_NODEFILE

and post the result.

Francesco |
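A minimal sketch of the passwordless-ssh setup described above, assuming a home directory shared over NFS; node191 is simply the node from Pierre's log, and the DSA key type follows the id_dsa.pub file mentioned in the post:

ssh-keygen -t dsa                                      # create ~/.ssh/id_dsa and ~/.ssh/id_dsa.pub; leave the passphrase empty
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys        # authorise your own key; a shared NFS home makes it valid on every node
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys   # sshd ignores keys when permissions are too loose
ssh -x node191 -n 'echo $SHELL'                        # test: should print your shell with no password prompt and nothing on stderr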
|
March 8, 2006, 06:15 |
|
#9 |
New Member
Pierre Maruzewski
Join Date: Mar 2009
Posts: 6
Rep Power: 17 |
Dear Francesco,
I tried lamboot -v $PBS_NODEFILE and that's the result:

lamboot -v $PBS_NODEFILE

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<5503> ssi:boot:base:linear: booting n0 (node223)
n-1<5503> ssi:boot:base:linear: booting n1 (node224)
ERROR: LAM/MPI unexpectedly received the following on stderr:
------------------------- /usr/local/Modules/versions --------------------------
3.1.6
--------------------- /usr/local/Modules/3.1.6/modulefiles ---------------------
dot          module-cvs   module-info  modules      null         use.own
------------------------ /usr/local/Modules/modulefiles ------------------------
NAGWare_f95-amd64_glibc22/22      goto/0.96-2(default)
NAGWare_f95-amd64_glibc23/23      gsl/1.6(default)
acml/2.5.1(default)               intel-cc/8.1.024
acml_generic_pgi/2.5.0            intel-fc/8.1.021
acml_pathscale/2.5.1              mpich-gm-1.2.6..14b/pathscale-2.3
acml_scalapack_generic_pgi/2.5.0  mpich-gm-1.2.6..14b/pgi-6.0(default)
ansys/10.0                        mpich_pathscale/1.2.6..13b
ansys/10.0_SP1(default)           mpich_pgi/1.2.6..13b
cernlib/2005(default)             nag/21
fftw-2.1.5/pgi-6.0                pathscale/2.0(default)
fftw-3.0.1/pgi-6.0                pathscale/2.3
fluent/6.0.20                     pgi/5.2(default)
fluent/6.2.16(default)            pgi/6.0
gcc/4.0.2
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "node224", but received some output on the standard error. This heuristic assumes that any output on the standard error indicates a fatal error, and therefore aborts. You can disable this behavior (i.e., have LAM ignore output on standard error) in the rsh boot module by setting the SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "ssh" to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or can indicate an error in your $HOME/.cshrc, $HOME/.login, or $HOME/.profile files. The following is a (non-inclusive) list of items that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you are using rsh) for the machine and username that you are running from
- Your .cshrc/.profile must not print anything out to the standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment variable to your default shell

Try invoking the following command at the unix command line:

ssh -x node224 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not* be prompted for a password to invoke this command on the remote node. No output should be printed from the remote node before the output of the command is displayed. When you can get this command to execute successfully by hand, LAM will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<5503> ssi:boot:base:linear: Failed to boot n1 (node224)
n-1<5503> ssi:boot:base:linear: aborted!
n-1<5508> ssi:boot:base:linear: booting n0 (node223)
n-1<5508> ssi:boot:base:linear: booting n1 (node224)
ERROR: LAM/MPI unexpectedly received the following on stderr:
[...]
n-1<5508> ssi:boot:base:linear: Failed to boot n1 (node224)
n-1<5508> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully |
|
March 8, 2006, 06:42 |
|
#10 |
Senior Member
Francesco Del Citto
Join Date: Mar 2009
Location: Zürich Area, Switzerland
Posts: 237
Rep Power: 18 |
Look at the message:

[...]
Try invoking the following command at the unix command line:

ssh -x node224 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not* be prompted for a password to invoke this command on the remote node. No output should be printed from the remote node before the output of the command is displayed. When you can get this command to execute successfully by hand, LAM will probably be able to function properly.
[...]

Have you tried that command?

Another issue is the "module" command (it seems you have almost the same configuration as my cluster). Look at the result of these commands:

---------------------------------------------------------------
[francesco@epsilon ~]$ module av

---------------------------- /usr/local/modulefiles ----------------------------
dot          module-cvs   module-info  modules      null         use.own
--------------------------- /opt/Modules/modulefiles ---------------------------
gnu      gnu41    intel8   intel9   lam      mpich    openmpi
[francesco@epsilon ~]$ module av 2>/dev/null
[francesco@epsilon ~]$
---------------------------------------------------------------

This means that the command "module available" prints its output on the standard error stream. LAM returns an error if there is any output on standard error while executing the remote shell command (ssh or rsh). That is also why they suggest using "ssh -x": it does not even try to open an X connection.

I guess you have the command "module available" in your ~/.bashrc or in the global /etc/bashrc (I'm assuming you are using bash). You have to remove that command, or replace it with "module available 2>&1", redirecting stderr to stdout so that it no longer confuses LAM.

Furthermore, I think you can always remove the LAM distributed with OpenFOAM and install it from source, activating the support for Torque, so that neither rsh nor ssh would be required. I have done that, and it works fine with other MPI applications, but I have never tried it with OpenFOAM.

I hope this can help you.
Francesco |
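A minimal sketch of the ~/.bashrc change being suggested, assuming the module listing comes from a plain "module available" line in that file (the exact line on Pierre's cluster may differ):

# in ~/.bashrc or /etc/bashrc -- redirect the listing so nothing reaches stderr over a non-interactive ssh:
module available 2>&1
# or, quieter still, run it only for interactive shells:
#   [ -n "$PS1" ] && module available 2>&1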
|
March 8, 2006, 07:09 |
|
#11 |
New Member
Pierre Maruzewski
Join Date: Mar 2009
Posts: 6
Rep Power: 17 |
Hi Francesco,
It's running perfectly now, thanks to "module available 2>&1". Pierre |