Problem with OpenFoam 6 -parallel on remote HPC nodes

May 15, 2019, 15:54 | #1
Rishikesh (mrishi), Member
Hi,
I am trying to build OpenFOAM-6 in my home directory on an HPC cluster.
When I submit a domain-decomposed job to the compute nodes with:
Code:
mpirun -np 2 simpleFoam -parallel
I get the following error message.
Code:
[cn364:03855] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 367
[cn364:03854] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 367


[cn364:03855] [[INVALID],INVALID]-[[36818,0],0] mca_oob_tcp_peer_try_connect: connect to 255.255.255.255:45715 failed: Network is unreachable (101)
[cn364:03854] [[INVALID],INVALID]-[[36818,0],0] mca_oob_tcp_peer_try_connect: connect to 255.255.255.255:45715 failed: Network is unreachable (101)
The above problem does not appear if I omit the -parallel argument, i.e. if I launch 2 independent serial copies:

Code:
mpirun -np 2 simpleFoam
Moreover, if I run the parallel case on the login node itself, it runs fine.

My login node has OpenMPI 1.8.1 whereas the compute nodes have 1.10.4. Because of this I initially had trouble finding mpirun, mpicc etc. when submitting jobs, but I got past that by adding both openmpi-x.x.x/bin directories to PATH, after which my serial jobs ran successfully on the compute nodes. The decomposed case, however, still fails with the error above, and I cannot figure out why. This makes me think there is an MPI-related conflict somewhere.

The closest account of my problem that I could find is the thread below, where they advise reconfiguring the MPI setup, but I do not understand how to do that.

https://users.open-mpi.narkive.com/k...-c-at-line-367

Should I (or can I) create a local installation of OpenMPI as well? I should add that I have set up a local GCC v8.0.1 in $HOME to be compatible with OF-6; the system-wide GCC is 4.4.7, which I found to be incompatible.
At the time of building OF, the relevant environment variable is set as:
$WM_MPLIB = SYSTEMOPENMPI
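
If a clean rebuild is needed, I believe the way to pin everything to one MPI is roughly the following (a sketch only, not verified; the 1.10.4 paths are the ones on my compute nodes):
Code:
# $WM_PROJECT_DIR/etc/prefs.sh -- sourced by etc/bashrc if it exists (my sketch)
export WM_MPLIB=SYSTEMOPENMPI
# keep exactly ONE OpenMPI on PATH and LD_LIBRARY_PATH, the compute-node one:
export PATH=/usr/mpi/gcc/openmpi-1.10.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.10.4/lib64:$LD_LIBRARY_PATH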


I would really appreciate help on how to diagnose the source of the problem.


PS:
This is the PBS script I use to submit my job to the compute node:
Code:
#PBS -l nodes=1:ppn=2
#PBS -q workq
#PBS -V 
#EXECUTION SEQUENCE

#echo $HOME
#cd $HOME

export PATH=$HOME/local/bin:/usr/mpi/gcc/openmpi-1.10.4/bin:/usr/mpi/gcc/openmpi-1.8.1/bin:$PATH
export PKG_CONFIG_DISABLE_UNINSTALLED=true
export PKG_CONFIG_PATH=$HOME/local/lib/pkgconfig:$PKG_CONFIG_PATH
export HDF5_ROOT=$HOME/local
export CPATH=$HOME/local/include/:$CPATH

export LD_LIBRARY_PATH=/usr/lib64/compat-openmpi16/lib:/usr/mpi/gcc/openmpi-1.8.1/lib:/usr/mpi/gcc/openmpi-1.8.1/lib64:/usr/mpi/gcc/openmpi-1.10.4/lib:/usr/mpi/gcc/openmpi-1.10.4/lib64:$HOME/local/lib64:$HOME/local/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$HOME/local/lib64:$HOME/local/lib:$LIBRARY_PATH



export MANPATH=$HOME/local/share/man/:$MANPATH


#echo $MPI_ARCH_PATH
which gcc
which mpicc
cd $PBS_O_WORKDIR

mpirun -np 2 hello_c
mpirun -np 2 simpleFoam
In the above example, the output is:



Code:
/home/mrishi
/usr/mpi/gcc/openmpi-1.10.4    <---version on remote node
/home/mrishi/local/bin/gcc        <----locally installed gcc
/usr/mpi/gcc/openmpi-1.10.4/bin/mpicc
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       cn364
Executable: hello_c       <----this happens even on the login node.
--------------------------------------------------------------------------
2 total processes failed to start
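
(Side note, my own guess: the hello_c failure looks unrelated to OpenFOAM. mpirun only searches PATH, so the Open MPI example has to be compiled and called with an explicit path, something like:)
Code:
# hello_c.c ships in the Open MPI examples/ directory
mpicc hello_c.c -o hello_c
mpirun -np 2 ./hello_c   # note the ./ -- mpirun does not look in the current directory unless it is on PATH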

Edit:

I am currently rerouting the build process (./Allwmake) by logging onto a compute node and running it there. That at least avoids the MPI version clash, and I hope it lets decomposePar and the other utilities build properly, although I do not have very high hopes.

Last edited by mrishi; May 15, 2019 at 19:54.

May 16, 2019, 09:35 | #2
Rishikesh (mrishi), Member
Fixed by compiling it on a compute node
As mentioned at the end of the previous post, compiling OpenFOAM on a compute node resolved the problem, and the parallel case now runs.


I am still curious about how to parallelize efficiently, though. I decomposed the domain into 32 parts with scotch, and the speed-up is definitely not impressive compared to what I was getting on my 6-core PC.


How does one go about optimizing the balance between calculation and communication time?

May 17, 2019, 10:10 | #3
Kmeti Rao (Krao), Senior Member
Hi Rishikesh,

Running on a higher number of processors does not always mean higher speed. If your grid has only a small number of cells and you use many processors, the speed actually drops because of inter-processor communication. The speed also depends on the decomposition method chosen. Try to measure how the simulation time depends on the number of processors; go through this link: MPIRun How many processors. I hope it helps.
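
If it helps, a rough scaling test can be scripted along these lines (just a sketch; it assumes the foamDictionary utility is available in your OpenFOAM version and that the case has a short, fixed endTime):
Code:
# run the same short case with different decompositions and compare ExecutionTime
for n in 2 4 8 16 32
do
    foamDictionary -entry numberOfSubdomains -set $n system/decomposeParDict
    decomposePar -force > log.decomposePar.$n 2>&1
    mpirun -np $n simpleFoam -parallel > log.simpleFoam.$n 2>&1
    grep "ExecutionTime" log.simpleFoam.$n | tail -1
done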

May 17, 2019, 13:06 | #4
Rishikesh (mrishi), Member
Hi Krao,
Thanks. Indeed, I am measuring the computing time with different decompositions: 6, 8 and 32 parts (using scotch).

Thanks for the link you shared; it helps to have a rule of thumb (>10k cells per process) while decomposing.
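
For reference, my decomposition setup is essentially the following (a minimal sketch, FoamFile header omitted; the subdomain count is what I vary between runs):
Code:
// system/decomposeParDict
numberOfSubdomains  8;       // varied between 6, 8 and 32; aiming for >10k cells per subdomain
method              scotch;  // scotch needs no additional coefficients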


Something strange happened to my 8-core parallel run when it crashed.
This is the end of the log:

Code:
PIMPLE: Iteration 49
MULES: Solving for alpha.air
air volume fraction, min, max = 0.141103 -11.7584 122.479
MULES: Solving for alpha.water
water volume fraction, min, max = 0.764804 -0.879164 7.36259
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42396e-14 1
Phase-sum volume fraction, min, max = 1 -12.6375 129.841
MULES: Solving for alpha.air
air volume fraction, min, max = 0.141067 -9037.89 70289.6
MULES: Solving for alpha.water
water volume fraction, min, max = 0.764802 -544.375 4225.82
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42395e-14 1
Phase-sum volume fraction, min, max = 0.999962 -9582.26 74515.4
MULES: Solving for alpha.air
air volume fraction, min, max = -12.3348 -3.03557e+09 2.35101e+10
MULES: Solving for alpha.water
water volume fraction, min, max = 0.0145568 -1.82503e+08 1.41343e+09
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42395e-14 1
Phase-sum volume fraction, min, max = -12.2262 -3.21807e+09 2.49235e+10
MULES: Solving for alpha.air
air volume fraction, min, max = -1.39731e+12 -3.39642e+20 2.63023e+21
MULES: Solving for alpha.water
water volume fraction, min, max = -8.40066e+10 -2.04193e+19 1.5813e+20
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42394e-14 1
Phase-sum volume fraction, min, max = -1.48132e+12 -3.60061e+20 2.78836e+21
smoothSolver:  Solving for Ux, Initial residual = 1, Final residual = 0.000659045, No Iterations 3
smoothSolver:  Solving for Uy, Initial residual = 1, Final residual = 0.00083677, No Iterations 3
smoothSolver:  Solving for Uz, Initial residual = 1, Final residual = 0.00106711, No Iterations 3
GAMG:  Solving for p_rgh, Initial residual = 1, Final residual = 0.0051549, No Iterations 6
GAMG:  Solving for p_rgh, Initial residual = 8.06574e-13, Final residual = 8.06574e-13, No Iterations 0
time step continuity errors : sum local = 2.60083e+53, global = 1.5896e+37, cumulative = 1.5896e+37
GAMG:  Solving for p_rgh, Initial residual = 1.02356e-12, Final residual = 1.02356e-12, No Iterations 0
GAMGPCG:  Solving for p_rgh, Initial residual = 1.02356e-12, Final residual = 1.02356e-12, No Iterations 0
time step continuity errors : sum local = 3.30051e+53, global = -1.54293e+37, cumulative = 4.66766e+35
DILUPBiCG:  Solving for epsilon, Initial residual = 1, Final residual = 9.82834e-08, No Iterations 2
bounding epsilon, min: -3.14592e+69 max: 1.43365e+70 average: 2.87178e+64
DILUPBiCG:  Solving for k, Initial residual = 1, Final residual = 1.79736e-06, No Iterations 1
bounding k, min: -2.89671e+58 max: 6.34189e+59 average: 3.21589e+54
PIMPLE: Iteration 50
MULES: Solving for alpha.air
air volume fraction, min, max = 4.34815e+48 -5.33765e+72 2.40282e+73
MULES: Solving for alpha.water
water volume fraction, min, max = -2.23802e+47 -3.21497e+71 1.44458e+72
MULES: Solving for alpha.oil
oil volume fraction, min, max = 3.08295e+07 -1.63942e+26 1.11028e+26
Phase-sum volume fraction, min, max = -2.04619e+49 -5.65915e+72 2.54728e+73
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          cn324 (PID 10640)
  MPI_COMM_WORLD rank: 3

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 10640 on node cn324 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
I suppose this error was caused by divergence. However, prior to this time step the run was converging within 12-15 outer iterations. Moreover, both my 6-core and 32-core runs have moved past this stage of the simulation successfully, which makes me question the reliability of these solutions. The image below illustrates the effect (velocity sampled at a point for different numbers of cores; the plot is not my own).

http://ww3.cad.de/foren/ubb/upl/F/Fr...eak_difcpu.pdf
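
What I plan to try next (my own guess at a remedy, not something suggested in this thread) is tightening the time-step control in system/controlDict so the interface Courant number stays small:
Code:
// excerpt from system/controlDict (sketch only; values are guesses)
adjustTimeStep  yes;
maxCo           0.5;
maxAlphaCo      0.5;    // read by interFoam-family solvers
maxDeltaT       0.001;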

Last edited by mrishi; May 17, 2019 at 14:52.

May 22, 2019, 09:46 | #5
Kmeti Rao (Krao), Senior Member
Hi Rishikesh,

Most of these problems have come up before and you can find most of the answers on CFD Online; often it is enough to copy and paste the error message into Google. Open MPI-fork() error. I hope this link helps.

Krao

May 22, 2019, 13:33 | #6
Rishikesh (mrishi), Member
Hi Krao,


Thanks for the link you shared. However, it does not describe the same issue: mine was caused by divergence inside OpenFOAM rather than by OpenMPI itself, due to how the information was partitioned during decomposition, as I mentioned in the post above.

My question was about how parallelization can affect the physical realism of the solution, and how one can minimize that effect. I apologize if that was not clear the way I originally put it.





Regards

September 6, 2019, 05:05 | #7
Jianrui Zeng (calf.Z), Senior Member
When I run my case on the HPC, a similar error appears:

mpirun noticed that process rank 0 with PID 25382 on node gs1016 exited on signal 9 (Killed).

I have no idea what the reason is. Any hint is appreciated.

September 6, 2019, 05:55 | #8
Kmeti Rao (Krao), Senior Member
Quote:
Originally Posted by calf.Z
When I run my case on the HPC, a similar error appears:

mpirun noticed that process rank 0 with PID 25382 on node gs1016 exited on signal 9 (Killed).

I have no idea what the reason is. Any hint is appreciated.
This error is probably related to memory. It would help if you could provide more information: how many cells your simulation has, how many processors you are using, and the total RAM assigned. It is also easier to understand these errors if you first run some simple, less complex test cases.

Regards,

Krao

September 6, 2019, 09:05 | #9
Jianrui Zeng (calf.Z), Senior Member
Quote:
Originally Posted by Krao
This error is probably related to memory. It would help if you could provide more information: how many cells your simulation has, how many processors you are using, and the total RAM assigned. It is also easier to understand these errors if you first run some simple, less complex test cases.

Regards,

Krao
Thank you for your reply. The mesh has 20 million cells and I have tried different numbers of processors, e.g. 64 and 128. The maximum available RAM is 600+ GB, which I think is enough, but maybe something is wrong with how the RAM is being used.
What's more, when I run a case with fewer cells, the error disappears.
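
For reference, a quick sanity check (rough numbers only, based on the figures above):
Code:
# cells per MPI rank:
echo $((20000000 / 128))    # = 156250 cells per rank, well above the ~10k rule of thumb
# resident memory of the solver processes on a compute node while the job runs
# (assumes ssh access to the node; adapt the pattern to your solver's name):
top -b -n 1 | grep -i foam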

September 6, 2019, 09:30 | #10
Kmeti Rao (Krao), Senior Member
Quote:
Originally Posted by calf.Z
Thank you for your reply. The mesh has 20 million cells and I have tried different numbers of processors, e.g. 64 and 128. The maximum available RAM is 600+ GB, which I think is enough, but maybe something is wrong with how the RAM is being used.
What's more, when I run a case with fewer cells, the error disappears.
Good that the error disappeared. With a large number of cells you can also try different decomposition strategies; I had a similar problem once using pimpleDyMFoam, and my supervisor recommended this to me.
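
For example, instead of scotch you could try something like this in system/decomposeParDict (just a sketch; the n values must multiply to numberOfSubdomains and should follow the shape of your geometry):
Code:
numberOfSubdomains  128;
method              hierarchical;
hierarchicalCoeffs
{
    n       (8 4 4);    // subdivisions in x, y and z; 8*4*4 = 128
    delta   0.001;
    order   xyz;
}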

September 6, 2019, 09:46 | #11
Jianrui Zeng (calf.Z), Senior Member
Quote:
Originally Posted by Krao
Good that the error disappeared. With a large number of cells you can also try different decomposition strategies; I had a similar problem once using pimpleDyMFoam, and my supervisor recommended this to me.
Thank you. I just use the scotch or simple method to decompose. Have you done any research on the efficiency of the different decomposition methods?
