Running OpenFoam in cluster

#1  chandra shekhar pant  (March 5, 2020, 13:54)
Hello Foamers,

I am trying to run OpenFOAM on a cluster with many nodes; specifically, I want to run it on 200 processors. I have a PBS script for this, and in it I tried the following:
1.
Code:
mpirun  -np 200 -use-hwthread-cpus pimpleFoam/snappyHexMesh -parallel
This gives the error:

There are not enough slots available in the system to satisfy the 100 slots
that were requested by the application:
snappyHexMesh/pimpleFoam
Either request fewer slots for your application, or make more slots available
for use.


2.
Code:
mpirun --host n217:16, n219:16, n221:16, n222:16, n224:16, n227:16, n228:16, n229:16, n230:16, n232:16, n225:16, n223:16, n231:16 -np 200 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh
This gives the error:
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).

Node: n217
Executable: n219:16,


3.
Code:
mpirun --hostfile machines -np 200 snappyHexMesh -parallel -overwrite > log.snappyHexMesh
For this attempt I created a machines file listing the node names and the number of CPUs on each. It fails with:



orted: Command not found.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

my node: n217
target node: n221

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.


4.
Code:
mpirun --host n217:200 -np 200 snappyHexMesh -parallel -overwrite > log.snappyHexMesh
This runs, but it is extremely slow. I think I am overloading node n217, which has only 16 CPUs, with 200 processes.



Any help, clue, or suggestion is very welcome. Thanks a lot!
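
The slot arithmetic behind attempts 1 and 4 can be sketched as follows. This is only a minimal sketch: it assumes the job runs under PBS and that $PBS_NODEFILE lists one hostname per allocated core, which is common but depends on how the site configures PBS. The helper names count_slots and nodes_needed are hypothetical and not part of OpenFOAM or Open MPI.

```shell
#!/usr/bin/env bash
# Sketch only: count_slots and nodes_needed are hypothetical helper names.

# Count the slots in a node file.
# Assumption: the file holds one hostname per allocated core.
count_slots() {
    local nodefile="$1"
    # grep -c . counts the non-empty lines
    grep -c . "$nodefile"
}

# Smallest number of nodes that can hold nprocs processes
# when each node offers cores_per_node cores (ceiling division).
nodes_needed() {
    local nprocs="$1" cores_per_node="$2"
    echo $(( (nprocs + cores_per_node - 1) / cores_per_node ))
}

# Usage inside a PBS job script:
#   count_slots "$PBS_NODEFILE"    # slots the scheduler actually granted
#   nodes_needed 200 16            # prints 13
```

With 16 CPUs per node, 200 processes need at least 13 nodes (13 x 16 = 208 slots), which matches the 13 nodes listed in attempt 2. If count_slots reports fewer than 200, the PBS resource request itself is probably too small, and mpirun is likely to refuse with the "not enough slots" message from attempt 1.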

#2  Svetlana Tkachenko  (March 5, 2020, 19:39)
2. Add single quotes around the host list?

mpirun --host 'n217:16, n219:16, n221:16, n222:16, n224:16, n227:16, n228:16, n229:16, n230:16, n232:16, n225:16, n223:16, n231:16' -np 200 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh

#3  chandra shekhar pant  (March 6, 2020, 03:13)
Thanks a lot, Svetlana Tkachenko.
I tried the command with the host list enclosed in single quotes:
Code:
mpirun --host 'n217:16, n219:16, n221:16, n222:16, n224:16, n227:16, n228:16, n229:16, n230:16, n232:16, n225:16, n223:16, n231:16' -np 200 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh
But it gives me this error:
--------------------------------------------------------------------------
While trying to create a regular expression of the node names
used in this application, the regex parser has detected the
presence of an illegal character in the following node name:

node: n219

Node names must be composed of a combination of ascii letters,
digits, dots, and the hyphen ('-') character. See the following
for an explanation:

https://en.wikipedia.org/wiki/Hostname

Please correct the error and try again.
--------------------------------------------------------------------------
FIPS integrity verification test failed.
ssh: Could not resolve hostname n219: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

my node: n217
target node: n221

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[n217:52224] 10 more processes have sent help message help-errmgr-base.txt / no-path
[n217:52224] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

#4  Yann  (March 6, 2020, 04:15)
Hi folks,

2. The error says that mpirun tried to run the executable 'n219:16,' on node n217. This comes from a syntax problem in your command: the shell splits the host list at every space, so mpirun treats the second piece as the executable. Separate the nodes with commas only, without any whitespace.
Try this:

Code:
mpirun --host n217:16,n219:16,n221:16,n222:16,n224:16,n227:16,n228:16,n229:16,n230:16,n232:16,n225:16,n223:16,n231:16 -np 200 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh

I hope it solves your problem.
Yann
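
Rather than typing the list by hand, the comma-only host list could be generated from the scheduler's node file. The following is a minimal sketch under assumptions: the job runs under PBS, $PBS_NODEFILE repeats each hostname once per allocated core, and the installed Open MPI accepts the host:slots form used above. make_host_list and make_hostfile are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: make_host_list and make_hostfile are hypothetical helper names.

# Build "host:slots,host:slots,..." with no whitespace from a node file
# that repeats each hostname once per core, keeping first-seen order.
make_host_list() {
    local nodefile="$1"
    awk 'NF { if (!($1 in n)) order[++k] = $1; n[$1]++ }
         END { for (i = 1; i <= k; i++)
                   printf "%s%s:%d", (i > 1 ? "," : ""), order[i], n[order[i]]
               print "" }' "$nodefile"
}

# Same information as an Open MPI hostfile: one "host slots=N" line per node.
make_hostfile() {
    local nodefile="$1"
    awk 'NF { if (!($1 in n)) order[++k] = $1; n[$1]++ }
         END { for (i = 1; i <= k; i++)
                   printf "%s slots=%d\n", order[i], n[order[i]] }' "$nodefile"
}

# Usage inside a PBS job script:
#   mpirun --host "$(make_host_list "$PBS_NODEFILE")" -np 200 \
#       snappyHexMesh -parallel -overwrite > log.snappyHexMesh
#   make_hostfile "$PBS_NODEFILE" > machines
```

Generating the list avoids stray spaces and keeps the slot counts consistent with what the scheduler actually allocated.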

#5  chandra shekhar pant  (March 6, 2020, 04:36)
Thanks a lot, Yann. Your suggestion appears to be correct, but there are still some issues, and the command now fails with this error message:


FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
FIPS integrity verification test failed.
Warning: the RSA host key for 'n224' differs from the key for the IP address '10.0.1.224'
Offending key for IP in /vkm/chpant/.ssh/known_hosts:11
Matching host key in /vkm/chpant/.ssh/known_hosts:32
orted: Command not found.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

my node: n217
target node: n221

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[n217:56106] 10 more processes have sent help message help-errmgr-base.txt / no-path
[n217:56106] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

#6  Yann  (March 6, 2020, 05:02)
I'm glad I could help.

The last error you've posted isn't related to OpenFOAM. It seems there is a problem with the ssh configuration on your cluster, or at least on some of its nodes.

It looks like there is a mismatch for the RSA host key of node n224: the key stored for the hostname differs from the one stored for its IP address. Probably the key was renewed at some point and the old entry was never updated.

I'm not familiar with FIPS so I can't really help with that.

Yann
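
To see which known_hosts entries the warning refers to, something like the sketch below might help. It is only an illustration: known_host_lines is a hypothetical helper, it assumes plain (unhashed) entries whose first field is a comma-separated list of names and addresses, and on a managed cluster the right fix may be for the administrator to update the keys.

```shell
#!/usr/bin/env bash
# Sketch only: known_host_lines is a hypothetical helper name.

# Print the line numbers of known_hosts entries whose host field
# (a comma-separated list of names and addresses) contains the given name.
# Hashed entries (starting with "|1|") cannot be matched this way.
known_host_lines() {
    local name="$1" file="${2:-$HOME/.ssh/known_hosts}"
    awk -v name="$name" '{
        m = split($1, hosts, ",")
        for (i = 1; i <= m; i++)
            if (hosts[i] == name) { print NR; break }
    }' "$file"
}

# Usage:
#   known_host_lines n224
#   known_host_lines 10.0.1.224
# A stale entry can then be dropped with the standard OpenSSH command
#   ssh-keygen -R 10.0.1.224
# which removes all keys stored for that host from known_hosts.
```

The line numbers printed this way should correspond to the "known_hosts:11" and "known_hosts:32" positions quoted in the warning.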

#7  chandra shekhar pant  (March 6, 2020, 09:05)
Yes, your comments are a great help. From them and the error message, I guessed (of course I could be completely wrong) that node n217 has a problem communicating with node n221, so I tried to run snappyHexMesh on only 2 nodes with 16 CPUs each, using the command
Code:
mpirun --host n217:16,n219:16 -np 32 --use-hwthread-cpus snappyHexMesh -parallel -overwrite > log.snappyHexMesh
It produces a shorter error than the previous one, but the first part is the same:
FIPS integrity verification test failed.
orted: Command not found.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------

#8  chandra shekhar pant  (March 11, 2020, 06:32)
Hello All,

Thanks for your input. It finally works for me, and the fix may be useful for others:

I sourced the OpenFOAM environment in my cshrc file by adding the following line:
Code:
source /usr/local/OpenFOAM/setup1906.csh
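
A likely reason this helps: when mpirun starts its orted daemons on the other nodes over ssh, the remote shell is non-interactive, and csh reads the cshrc file for such shells too, so the OpenFOAM and MPI paths are set there as well. The sketch below is one way to check this before launching a large job. It assumes passwordless ssh between nodes and that the remote `which` returns a non-zero status when the command is missing; check_remote_orted and REMOTE_SHELL are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: check_remote_orted and REMOTE_SHELL are hypothetical names.

# Ask each host, through a non-interactive remote shell, whether it can
# find orted. REMOTE_SHELL defaults to ssh and can be replaced by a stub
# for a dry run.
# Assumption: the remote "which" returns non-zero when orted is missing.
check_remote_orted() {
    local runner="${REMOTE_SHELL:-ssh}" host status=0
    for host in "$@"; do
        if "$runner" "$host" 'which orted' > /dev/null 2>&1; then
            echo "$host: orted found"
        else
            echo "$host: orted NOT found"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_remote_orted n217 n219 n221
```

Any host reported as "NOT found" would be expected to trigger the "orted: Command not found." error when included in the mpirun host list.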

All times are GMT -4.