March 5, 2020, 13:54 |
Running OpenFoam in cluster
|
#1 |
Senior Member
chandra shekhar pant
Join Date: Oct 2010
Posts: 220
Rep Power: 17 |
Hello Foamers,
I am trying to run OpenFOAM on a cluster with many nodes; in particular, I want to run it on 200 processors. I have a PBS script for this, and in it I tried the following: 1. Code:
mpirun -np 200 -use-hwthread-cpus pimpleFoam/snappyHexMesh -parallel
There are not enough slots available in the system to satisfy the 100 slots that were requested by the application:
snappyHexMesh/pimpleFoam
Either request fewer slots for your application, or make more slots available for use.
2. Code:
mpirun --host n217:16, n219:16, n221:16, n222:16, n224:16, n227:16, n228:16, n229:16, n230:16, n232:16, n225:16, n223:16, n231:16 -np 200 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh
mpirun was unable to find the specified executable file, and therefore did not launch the job. This error was first reported for process rank 0; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command line parameter option (remember that mpirun interprets the first unrecognized command line token as the executable).
Node: n217
Executable: n219:16,
3. Code:
mpirun --hostfile machines -np 200 snappyHexMesh -parallel -overwrite > log.snappyHexMesh
orted: Command not found.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon located on the indicated node:
my node: n217
target node: n221
This is usually an internal programming error that should be reported to the developers. In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment. We apologize for the problem.
4. Code:
mpirun --host n217:200 -np 200 snappyHexMesh -parallel -overwrite > log.snappyHexMesh
Any help, clue, or suggestion is very welcome. Thanks a lot! |
|
March 5, 2020, 19:39 |
|
#2 |
Senior Member
Svetlana Tkachenko
Join Date: Oct 2013
Location: Australia, Sydney
Posts: 416
Rep Power: 15 |
2. Maybe add single quotes around the host list? Like this:
mpirun --host 'n217:16, n219:16, n221:16, n222:16, n224:16, n227:16, n228:16, n229:16, n230:16, n232:16, n225:16, n223:16, n231:16' -np 200 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh |
|
March 6, 2020, 03:13 |
|
#3 |
Senior Member
chandra shekhar pant
Join Date: Oct 2010
Posts: 220
Rep Power: 17 |
Thanks a lot, Svetlana Tkachenko.
I tried the command with the single quotes added: Code:
mpirun --host 'n217:16, n219:16, n221:16, n222:16, n224:16, n227:16, n228:16, n229:16, n230:16, n232:16, n225:16, n223:16, n231:16' -np 200 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh
--------------------------------------------------------------------------
While trying to create a regular expression of the node names used in this application, the regex parser has detected the presence of an illegal character in the following node name:
node: n219
Node names must be composed of a combination of ascii letters, digits, dots, and the hyphen ('-') character. See the following for an explanation:
https://en.wikipedia.org/wiki/Hostname
Please correct the error and try again.
--------------------------------------------------------------------------
FIPS integrity verification test failed.
ssh: Could not resolve hostname n219: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon located on the indicated node:
my node: n217
target node: n221
This is usually an internal programming error that should be reported to the developers. In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[n217:52224] 10 more processes have sent help message help-errmgr-base.txt / no-path
[n217:52224] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages |
|
March 6, 2020, 04:15 |
|
#4 |
Senior Member
Yann
Join Date: Apr 2012
Location: France
Posts: 1,198
Rep Power: 27 |
Hi folks,
Regarding 2.: the error says that mpirun tried to run the executable 'n219:16,' on node n217. This comes from a syntax problem in your command: the nodes must be separated by commas only, without any whitespace. Try this: Code:
mpirun --host n217:16,n219:16,n221:16,n222:16,n224:16,n227:16,n228:16,n229:16,n230:16,n232:16,n225:16,n223:16,n231:16 -np 200 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh
I hope it solves your problem.
Yann |
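For anyone who would rather not type the host list by hand, here is a minimal sketch that builds the comma-only list from a node file. It assumes the file names each host once per slot, which is how `$PBS_NODEFILE` often looks but is not guaranteed on every PBS setup. The function name `build_host_list` and the two-node demo file are made up for illustration.

```shell
#!/bin/sh
# Build a comma-separated "host:slots" list (no spaces) from a node file.
# Assumption: the file lists each host once per allocated slot.
build_host_list() {
    # sort | uniq -c counts how often each host appears,
    # awk prints "host:count", and paste joins the lines with commas.
    sort "$1" | uniq -c | awk '{ print $2 ":" $1 }' | paste -sd, -
}

# Demo with a hypothetical two-node file; in a real job pass "$PBS_NODEFILE".
demo_nodefile=$(mktemp)
printf 'n217\nn217\nn219\nn219\n' > "$demo_nodefile"
hosts=$(build_host_list "$demo_nodefile")
echo "mpirun --host $hosts -np 4 snappyHexMesh -parallel -overwrite"
rm -f "$demo_nodefile"
```

The demo only prints the resulting command so the list can be checked for stray spaces before anything is launched.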
|
March 6, 2020, 04:36 |
|
#5 |
Senior Member
chandra shekhar pant
Join Date: Oct 2010
Posts: 220
Rep Power: 17 |
Thanks a lot, Yann. Your suggestion looks correct, but there are still some issues. The command now gives this error message:
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
Warning: the RSA host key for 'n224' differs from the key for the IP address '10.0.1.224'
Offending key for IP in /vkm/chpant/.ssh/known_hosts:11
Matching host key in /vkm/chpant/.ssh/known_hosts:32
orted: Command not found.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon located on the indicated node:
my node: n217
target node: n221
This is usually an internal programming error that should be reported to the developers. In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[n217:56106] 10 more processes have sent help message help-errmgr-base.txt / no-path
[n217:56106] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages |
|
March 6, 2020, 05:02 |
|
#6 |
Senior Member
Yann
Join Date: Apr 2012
Location: France
Posts: 1,198
Rep Power: 27 |
I'm glad I could help.
The last error you've posted isn't related to OpenFOAM. It seems there is a problem with the ssh configuration on your cluster, or at least on some of its nodes. It looks like there is a mismatch for the RSA key of node n224, probably because the key was renewed and has not been updated on the other nodes. I'm not familiar with FIPS, so I can't really help with that.
Yann |
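To look at the two entries the warning points to (lines 11 and 32 of known_hosts), a small sketch like the one below can list every entry that names a given host or IP. It assumes the known_hosts file stores host names in plain text (not hashed), and `find_host_keys` is a made-up helper name. Once the stale entry is identified, `ssh-keygen -R <host-or-ip>` is the standard OpenSSH way to drop it; whether that is appropriate on this cluster is for the admins to say.

```shell
#!/bin/sh
# Print "line-number: names" for every known_hosts entry that names
# the given host or IP. Assumption: entries are not hashed.
find_host_keys() {
    # $1 = known_hosts file, $2 = host name or IP address
    awk -v h="$2" '{
        # The first field is a comma-separated list of names for the key.
        n = split($1, names, ",")
        for (i = 1; i <= n; i++)
            if (names[i] == h) { print NR ": " $1; break }
    }' "$1"
}

# Example (not run here): find_host_keys "$HOME/.ssh/known_hosts" 10.0.1.224
```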
|
March 6, 2020, 09:05 |
|
#7 |
Senior Member
chandra shekhar pant
Join Date: Oct 2010
Posts: 220
Rep Power: 17 |
Yes, your comments are a great help. Based on them and on the error message, I guessed (of course I could be completely wrong) that node n217 has a problem communicating with node n221. So I tried to run snappyHexMesh on only 2 nodes with 16 CPUs each, using the command
Code:
mpirun --host n217:16,n219:16 -np 32 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh
FIPS integrity verification test failed.
orted: Command not found.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
-------------------------------------------------------------------------- |
|
March 11, 2020, 06:32 |
|
#8 |
Senior Member
chandra shekhar pant
Join Date: Oct 2010
Posts: 220
Rep Power: 17 |
Hello All,
Thanks for your input. Luckily it now works for me, and this may be useful for others: I sourced the OpenFOAM environment in my cshrc file by adding the following line: Code:
source /usr/local/OpenFOAM/setup1906.csh |
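This fix is consistent with the "orted: Command not found." message: the shell that ssh starts on the other nodes apparently did not have the OpenFOAM/Open MPI environment until the cshrc line was added. A rough way to check this before submitting a job is sketched below. The `check_orted` function and the `REMOTE_SH` variable are illustrative names, not part of Open MPI or OpenFOAM, and the sketch assumes password-less ssh to the nodes.

```shell
#!/bin/sh
# REMOTE_SH is a hypothetical knob so the check can be tried without a
# cluster; by default it is plain ssh.
REMOTE_SH=${REMOTE_SH:-ssh}

# Print "<host> ok <path>" or "<host> MISSING" for each host given.
check_orted() {
    for host in "$@"; do
        # `which` is used because it is available from csh as well as sh.
        # A found binary is reported as an absolute path starting with "/".
        path=$($REMOTE_SH "$host" 'which orted' 2>/dev/null </dev/null)
        case "$path" in
            /*) echo "$host ok $path" ;;
            *)  echo "$host MISSING" ;;
        esac
    done
}

# Example (not run here): check_orted n217 n219
```

Any node reported as MISSING would be expected to fail in the same way as in the posts above when mpirun tries to start its daemon there.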