Problem with OpenFoam 6 -parallel on remote HPC nodes

May 15, 2019, 15:54 | #1
Rishikesh (mrishi), Member
Hi,
I am trying to build OpenFOAM-6 in my home directory on an HPC cluster.
When I submit a domain-decomposed job to the compute nodes with:
Code:
mpirun -np 2 simpleFoam -parallel
I get the following error message.
Code:
[cn364:03855] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 367
[cn364:03854] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 367


[cn364:03855] [[INVALID],INVALID]-[[36818,0],0] mca_oob_tcp_peer_try_connect: connect to 255.255.255.255:45715 failed: Network is unreachable (101)
[cn364:03854] [[INVALID],INVALID]-[[36818,0],0] mca_oob_tcp_peer_try_connect: connect to 255.255.255.255:45715 failed: Network is unreachable (101)
The above problem does not appear if I omit the -parallel argument, i.e. if I launch 2 independent serial copies:

Code:
mpirun -np 2 simpleFoam
Moreover, if I run the parallel case on the login node itself, it runs fine.

My login node has OpenMPI 1.8.1 whereas the compute nodes have 1.10.4. Because of this I initially had trouble finding mpirun, mpicc etc. when submitting jobs, but I got past that by adding both openmpi-x.x.x/bin directories to PATH, after which my serial jobs ran successfully on the compute nodes. The decomposed case, however, still fails with the error above, and I cannot figure out why. This makes me think there is an MPI-related conflict somewhere.

The closest account of my problem that I could find is the thread below, where they advise reconfiguring the MPI setup, but I do not understand how to do that.

https://users.open-mpi.narkive.com/k...-c-at-line-367

Should I (or can I) create a local installation of OpenMPI as well? I should add that I have set up a local GCC v8.0.1 in $HOME to be compatible with OF-6; the system-wide GCC is 4.4.7, which I found to be incompatible.
At the time of building OF, the relevant environment variable is set as:
$WM_MPLIB = SYSTEMOPENMPI
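
If a clean rebuild is needed, I believe the way to pin everything to one MPI is roughly the following (a sketch only, not verified; the 1.10.4 paths are the ones on my compute nodes):
Code:
# $WM_PROJECT_DIR/etc/prefs.sh -- sourced by etc/bashrc if it exists (my sketch)
export WM_MPLIB=SYSTEMOPENMPI
# keep exactly ONE OpenMPI on PATH and LD_LIBRARY_PATH, the compute-node one:
export PATH=/usr/mpi/gcc/openmpi-1.10.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.10.4/lib64:$LD_LIBRARY_PATH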


I would really appreciate help on how to diagnose the source of the problem.


PS:
This is the PBS script I use to submit my job to the compute node:
Code:
#PBS -l nodes=1:ppn=2
#PBS -q workq
#PBS -V 
#EXECUTION SEQUENCE

#echo $HOME
#cd $HOME

export PATH=$HOME/local/bin:/usr/mpi/gcc/openmpi-1.10.4/bin:/usr/mpi/gcc/openmpi-1.8.1/bin:$PATH
export PKG_CONFIG_DISABLE_UNINSTALLED=true
export PKG_CONFIG_PATH=$HOME/local/lib/pkgconfig:$PKG_CONFIG_PATH
export HDF5_ROOT=$HOME/local
export CPATH=$HOME/local/include/:$CPATH

export LD_LIBRARY_PATH=/usr/lib64/compat-openmpi16/lib:/usr/mpi/gcc/openmpi-1.8.1/lib:/usr/mpi/gcc/openmpi-1.8.1/lib64:/usr/mpi/gcc/openmpi-1.10.4/lib:/usr/mpi/gcc/openmpi-1.10.4/lib64:$HOME/local/lib64:$HOME/local/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$HOME/local/lib64:$HOME/local/lib:$LIBRARY_PATH



export MANPATH=$HOME/local/share/man/:$MANPATH


#echo $MPI_ARCH_PATH
which gcc
which mpicc
cd $PBS_O_WORKDIR

mpirun -np 2 hello_c
mpirun -np 2 simpleFoam
In the above example, the output is:



Code:
/home/mrishi
/usr/mpi/gcc/openmpi-1.10.4    <---version on remote node
/home/mrishi/local/bin/gcc        <----locally installed gcc
/usr/mpi/gcc/openmpi-1.10.4/bin/mpicc
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       cn364
Executable: hello_c       <----this happens even on the login node.
--------------------------------------------------------------------------
2 total processes failed to start
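
(Side note, my own guess: the hello_c failure looks unrelated to OpenFOAM. mpirun only searches PATH, so the Open MPI example has to be compiled and called with an explicit path, something like:)
Code:
# hello_c.c ships in the Open MPI examples/ directory
mpicc hello_c.c -o hello_c
mpirun -np 2 ./hello_c   # note the ./ -- mpirun does not look in the current directory unless it is on PATH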

Edit:

I am currently rerouting the build process (./Allwmake) by logging onto a compute node and running it there. That at least avoids the MPI version clash, and I hope it lets decomposePar and the other utilities build properly, although I do not have very high hopes.

Last edited by mrishi; May 15, 2019 at 19:54.

May 16, 2019, 09:35 | #2
Rishikesh (mrishi), Member
Fixed by compiling it on a compute node
As mentioned at the end of the previous post, compiling OpenFOAM on a compute node resolved the problem, and the parallel case now runs.


I am still curious about how to parallelize efficiently, though. I decomposed the domain into 32 parts with scotch, and the speed-up is definitely not impressive compared to what I was getting on my 6-core PC.


How does one go about optimizing the balance between calculation and communication time?

May 17, 2019, 10:10 | #3
Kmeti Rao (Krao), Senior Member
Hi Rishikesh,

Running on a higher number of processors does not always mean higher speed. If your grid has only a small number of cells and you use many processors, the speed actually drops because of inter-processor communication. The speed also depends on the decomposition method chosen. Try to measure how the simulation time depends on the number of processors; go through this link: MPIRun How many processors. I hope it helps.
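
If it helps, a rough scaling test can be scripted along these lines (just a sketch; it assumes the foamDictionary utility is available in your OpenFOAM version and that the case has a short, fixed endTime):
Code:
# run the same short case with different decompositions and compare ExecutionTime
for n in 2 4 8 16 32
do
    foamDictionary -entry numberOfSubdomains -set $n system/decomposeParDict
    decomposePar -force > log.decomposePar.$n 2>&1
    mpirun -np $n simpleFoam -parallel > log.simpleFoam.$n 2>&1
    grep "ExecutionTime" log.simpleFoam.$n | tail -1
done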

May 17, 2019, 13:06 | #4
Rishikesh (mrishi), Member
Hi Krao,
Thanks. Indeed, I am measuring the computing time with different decompositions: 6, 8 and 32 parts (using scotch).

Thanks for the link you shared; it helps to have a rule of thumb (>10k cells per process) while decomposing.
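
For reference, my decomposition setup is essentially the following (a minimal sketch, FoamFile header omitted; the subdomain count is what I vary between runs):
Code:
// system/decomposeParDict
numberOfSubdomains  8;       // varied between 6, 8 and 32; aiming for >10k cells per subdomain
method              scotch;  // scotch needs no additional coefficients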


Something strange happened to my 8-core parallel run when it crashed.
This is the end of the log:

Code:
PIMPLE: Iteration 49
MULES: Solving for alpha.air
air volume fraction, min, max = 0.141103 -11.7584 122.479
MULES: Solving for alpha.water
water volume fraction, min, max = 0.764804 -0.879164 7.36259
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42396e-14 1
Phase-sum volume fraction, min, max = 1 -12.6375 129.841
MULES: Solving for alpha.air
air volume fraction, min, max = 0.141067 -9037.89 70289.6
MULES: Solving for alpha.water
water volume fraction, min, max = 0.764802 -544.375 4225.82
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42395e-14 1
Phase-sum volume fraction, min, max = 0.999962 -9582.26 74515.4
MULES: Solving for alpha.air
air volume fraction, min, max = -12.3348 -3.03557e+09 2.35101e+10
MULES: Solving for alpha.water
water volume fraction, min, max = 0.0145568 -1.82503e+08 1.41343e+09
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42395e-14 1
Phase-sum volume fraction, min, max = -12.2262 -3.21807e+09 2.49235e+10
MULES: Solving for alpha.air
air volume fraction, min, max = -1.39731e+12 -3.39642e+20 2.63023e+21
MULES: Solving for alpha.water
water volume fraction, min, max = -8.40066e+10 -2.04193e+19 1.5813e+20
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42394e-14 1
Phase-sum volume fraction, min, max = -1.48132e+12 -3.60061e+20 2.78836e+21
smoothSolver:  Solving for Ux, Initial residual = 1, Final residual = 0.000659045, No Iterations 3
smoothSolver:  Solving for Uy, Initial residual = 1, Final residual = 0.00083677, No Iterations 3
smoothSolver:  Solving for Uz, Initial residual = 1, Final residual = 0.00106711, No Iterations 3
GAMG:  Solving for p_rgh, Initial residual = 1, Final residual = 0.0051549, No Iterations 6
GAMG:  Solving for p_rgh, Initial residual = 8.06574e-13, Final residual = 8.06574e-13, No Iterations 0
time step continuity errors : sum local = 2.60083e+53, global = 1.5896e+37, cumulative = 1.5896e+37
GAMG:  Solving for p_rgh, Initial residual = 1.02356e-12, Final residual = 1.02356e-12, No Iterations 0
GAMGPCG:  Solving for p_rgh, Initial residual = 1.02356e-12, Final residual = 1.02356e-12, No Iterations 0
time step continuity errors : sum local = 3.30051e+53, global = -1.54293e+37, cumulative = 4.66766e+35
DILUPBiCG:  Solving for epsilon, Initial residual = 1, Final residual = 9.82834e-08, No Iterations 2
bounding epsilon, min: -3.14592e+69 max: 1.43365e+70 average: 2.87178e+64
DILUPBiCG:  Solving for k, Initial residual = 1, Final residual = 1.79736e-06, No Iterations 1
bounding k, min: -2.89671e+58 max: 6.34189e+59 average: 3.21589e+54
PIMPLE: Iteration 50
MULES: Solving for alpha.air
air volume fraction, min, max = 4.34815e+48 -5.33765e+72 2.40282e+73
MULES: Solving for alpha.water
water volume fraction, min, max = -2.23802e+47 -3.21497e+71 1.44458e+72
MULES: Solving for alpha.oil
oil volume fraction, min, max = 3.08295e+07 -1.63942e+26 1.11028e+26
Phase-sum volume fraction, min, max = -2.04619e+49 -5.65915e+72 2.54728e+73
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          cn324 (PID 10640)
  MPI_COMM_WORLD rank: 3

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 10640 on node cn324 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
I suppose this error was caused by divergence. However, prior to this time step the run was converging within 12-15 outer iterations. Moreover, both my 6-core and 32-core runs have moved past this stage of the simulation successfully, which makes me question the reliability of these solutions. The image below illustrates the effect (velocity sampled at a point for different numbers of cores; the plot is not my own).

http://ww3.cad.de/foren/ubb/upl/F/Fr...eak_difcpu.pdf
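
What I plan to try next (my own guess at a remedy, not something suggested in this thread) is tightening the time-step control in system/controlDict so the interface Courant number stays small:
Code:
// excerpt from system/controlDict (sketch only; values are guesses)
adjustTimeStep  yes;
maxCo           0.5;
maxAlphaCo      0.5;    // read by interFoam-family solvers
maxDeltaT       0.001;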

Last edited by mrishi; May 17, 2019 at 14:52.

May 22, 2019, 09:46 | #5
Kmeti Rao (Krao), Senior Member
Hi Rishikesh,

Most of these problems have come up before and you can find most of the answers on CFD Online; often it is enough to copy and paste the error message into Google. Open MPI-fork() error. I hope this link helps.

Krao

May 22, 2019, 13:33 | #6
Rishikesh (mrishi), Member
Hi Krao,


Thanks for the link you shared. However, it does not describe the same issue: mine was caused by divergence inside OpenFOAM rather than by OpenMPI itself, due to how the information was partitioned during decomposition, as I mentioned in the post above.

My question was about how parallelization can affect the physical realism of the solution, and how one can minimize that effect. I apologize if that was not clear the way I originally put it.





Regards

September 6, 2019, 05:05 | #7
Jianrui Zeng (calf.Z), Senior Member
When I run my case on the HPC, a similar error appears:

mpirun noticed that process rank 0 with PID 25382 on node gs1016 exited on signal 9 (Killed).

I have no idea what the reason is. Any hint is appreciated.

September 6, 2019, 05:55 | #8
Kmeti Rao (Krao), Senior Member
Quote:
Originally Posted by calf.Z
When I run my case on the HPC, a similar error appears:

mpirun noticed that process rank 0 with PID 25382 on node gs1016 exited on signal 9 (Killed).

I have no idea what the reason is. Any hint is appreciated.
This error is probably related to memory. It would help if you could provide more information: how many cells your simulation has, how many processors you are using, and the total RAM assigned. It is also easier to understand these errors if you first run some simple, less complex test cases.

Regards,

Krao

September 6, 2019, 09:05 | #9
Jianrui Zeng (calf.Z), Senior Member
Quote:
Originally Posted by Krao
This error is probably related to memory. It would help if you could provide more information: how many cells your simulation has, how many processors you are using, and the total RAM assigned. It is also easier to understand these errors if you first run some simple, less complex test cases.

Regards,

Krao
Thank you for your reply. The mesh has 20 million cells and I have tried different numbers of processors, e.g. 64 and 128. The maximum available RAM is 600+ GB, which I think is enough, but maybe something is wrong with how the RAM is being used.
What's more, when I run a case with fewer cells, the error disappears.
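
For reference, a quick sanity check (rough numbers only, based on the figures above):
Code:
# cells per MPI rank:
echo $((20000000 / 128))    # = 156250 cells per rank, well above the ~10k rule of thumb
# resident memory of the solver processes on a compute node while the job runs
# (assumes ssh access to the node; adapt the pattern to your solver's name):
top -b -n 1 | grep -i foam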

September 6, 2019, 09:30 | #10
Kmeti Rao (Krao), Senior Member
Quote:
Originally Posted by calf.Z
Thank you for your reply. The mesh has 20 million cells and I have tried different numbers of processors, e.g. 64 and 128. The maximum available RAM is 600+ GB, which I think is enough, but maybe something is wrong with how the RAM is being used.
What's more, when I run a case with fewer cells, the error disappears.
Good that the error disappeared. With a large number of cells you can also try different decomposition strategies; I had a similar problem once using pimpleDyMFoam, and my supervisor recommended this to me.
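
For example, instead of scotch you could try something like this in system/decomposeParDict (just a sketch; the n values must multiply to numberOfSubdomains and should follow the shape of your geometry):
Code:
numberOfSubdomains  128;
method              hierarchical;
hierarchicalCoeffs
{
    n       (8 4 4);    // subdivisions in x, y and z; 8*4*4 = 128
    delta   0.001;
    order   xyz;
}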

September 6, 2019, 09:46 | #11
Jianrui Zeng (calf.Z), Senior Member
Quote:
Originally Posted by Krao
Good that the error disappeared. With a large number of cells you can also try different decomposition strategies; I had a similar problem once using pimpleDyMFoam, and my supervisor recommended this to me.
Thank you. I just use the scotch or simple method to decompose. Have you done any research on the efficiency of the different decomposition methods?
