Problem with OpenFoam 6 -parallel on remote HPC nodes |
May 15, 2019, 15:54 |
Problem with OpenFoam 6 -parallel on remote HPC nodes
|
#1 |
Member
Rishikesh
Join Date: Apr 2016
Posts: 63
Rep Power: 10 |
Hi,
I am trying to build OpenFOAM-6 in my home directory on an HPC cluster. When I submit a domain-decomposed job to the compute nodes with: Code:
mpirun -np 2 simpleFoam -parallel
it fails with: Code:
[cn364:03855] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 367
[cn364:03854] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 367
[cn364:03855] [[INVALID],INVALID]-[[36818,0],0] mca_oob_tcp_peer_try_connect: connect to 255.255.255.255:45715 failed: Network is unreachable (101)
[cn364:03854] [[INVALID],INVALID]-[[36818,0],0] mca_oob_tcp_peer_try_connect: connect to 255.255.255.255:45715 failed: Network is unreachable (101)
whereas the run without the -parallel flag does not show this error: Code:
mpirun -np 2 simpleFoam
My login node has OpenMPI 1.8.1 whereas the compute nodes have 1.10.4. As a result I initially had problems finding mpirun, mpicc, etc. when submitting jobs, but I got around that by adding both openmpi-x.x.x/bin directories to PATH, after which my serial jobs ran successfully on the compute node. However, the decomposed case still runs into the conflict shown above and I cannot figure out the reason behind it. This makes me think there is an MPI-related conflict somewhere. The closest account of my problem that I could find is here, where they advise configuring MPI, but I do not understand how to do that: https://users.open-mpi.narkive.com/k...-c-at-line-367
Should I (or can I) create a local installation of OpenMPI as well? I should add that I have set up a local GCC 8.0.1 in $HOME in order to be compatible with OF-6; the global GCC version is 4.4.7, which I found incompatible. At the time of building OpenFOAM, the relevant environment variable was set to $WM_MPLIB = SYSTEMOPENMPI. I would really appreciate help on how to diagnose the source of the problem.
PS: This is the PBS script I use to submit my job to a compute node: Code:
#PBS -l nodes=1:ppn=2
#PBS -q workq
#PBS -V

#EXECUTION SEQUENCE
#echo $HOME
#cd $HOME
export PATH=$HOME/local/bin:/usr/mpi/gcc/openmpi-1.10.4/bin:/usr/mpi/gcc/openmpi-1.8.1/bin:$PATH
export PKG_CONFIG_DISABLE_UNINSTALLED=true
export PKG_CONFIG_PATH=$HOME/local/lib/pkgconfig:$PKG_CONFIG_PATH
export HDF5_ROOT=$HOME/local
export CPATH=$HOME/local/include/:$CPATH
export LD_LIBRARY_PATH=/usr/lib64/compat-openmpi16/lib:/usr/mpi/gcc/openmpi-1.8.1/lib:/usr/mpi/gcc/openmpi-1.8.1/lib64:/usr/mpi/gcc/openmpi-1.10.4/lib:/usr/mpi/gcc/openmpi-1.10.4/lib64:$HOME/local/lib64:$HOME/local/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$HOME/local/lib64:$HOME/local/lib:$LIBRARY_PATH
export MANPATH=$HOME/local/share/man/:$MANPATH
#echo $MPI_ARCH_PATH
which gcc
which mpicc
cd $PBS_O_WORKDIR
mpirun -np 2 hello_c
mpirun -np 2 simpleFoam
The job output is: Code:
/home/mrishi
/usr/mpi/gcc/openmpi-1.10.4            <--- version on remote node
/home/mrishi/local/bin/gcc             <--- locally installed gcc
/usr/mpi/gcc/openmpi-1.10.4/bin/mpicc
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore did
not launch the job. This error was first reported for process rank 0; it may
have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command line
parameter option (remember that mpirun interprets the first unrecognized
command line token as the executable).

Node:       cn364
Executable: hello_c                    <--- this happens even on the login node
--------------------------------------------------------------------------
2 total processes failed to start

Edit: I am currently working on rerouting the build process (./Allwmake) by logging onto a compute node and executing it from there. This avoids clashes between MPI versions at the very least; I am hoping it lets decomposePar and the other utilities build properly, although I do not have very high hopes.
Last edited by mrishi; May 15, 2019 at 19:54. |
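A hedged suggestion, not from the original post: one way to avoid mixing the two OpenMPI versions is to put only the compute-node MPI (1.10.4) on PATH and LD_LIBRARY_PATH and to source the OpenFOAM environment inside the job script. The paths below are taken from this thread where possible; the OpenFOAM install location ($HOME/OpenFOAM/OpenFOAM-6) is an assumption and may differ. Code:
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -q workq

# Use only the compute-node OpenMPI (1.10.4); leave the 1.8.1 paths out entirely.
export PATH=/usr/mpi/gcc/openmpi-1.10.4/bin:$HOME/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.10.4/lib:/usr/mpi/gcc/openmpi-1.10.4/lib64:$HOME/local/lib64:$HOME/local/lib:$LD_LIBRARY_PATH

# Source the OpenFOAM-6 environment that was built against this MPI
# (install path is an assumption; adjust to your own layout).
source $HOME/OpenFOAM/OpenFOAM-6/etc/bashrc

cd $PBS_O_WORKDIR
which mpirun && mpirun --version    # sanity check: should report 1.10.4

mpirun -np 2 simpleFoam -parallel > log.simpleFoam 2>&1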
|
May 16, 2019, 09:35 |
Fixed by compiling it on a compute node
|
#2 |
Member
Rishikesh
Join Date: Apr 2016
Posts: 63
Rep Power: 10 |
As mentioned towards the end of the previous post, compiling OpenFOAM on a compute node resolved the problem, and the parallel case now runs (one way to do such a build is sketched below).
I am still curious about how to parallelize efficiently, though. I decomposed the domain into 32 parts with the scotch method, and the speed-up is definitely not high compared to what I was getting on my 6-core PC. How does one go about optimizing the balance between calculation and communication time? |
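Regarding the build itself, a sketch of how one might compile on a compute node: request an interactive PBS session and run Allwmake there. The queue name, core count and install path below are assumptions, not taken from the original posts. Code:
# Request an interactive session on a compute node (queue and core count are placeholders).
qsub -I -q workq -l nodes=1:ppn=8

# Then, on the compute node:
source $HOME/OpenFOAM/OpenFOAM-6/etc/bashrc   # install path is an assumption
export WM_NCOMPPROCS=8                        # compile in parallel on the node's cores
cd $WM_PROJECT_DIR
./Allwmake > log.Allwmake 2>&1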
|
May 17, 2019, 10:10 |
|
#3 |
Senior Member
Kmeti Rao
Join Date: May 2019
Posts: 145
Rep Power: 8 |
Hi Rishikesh,
Running on a higher number of processors does not always mean higher speed. If your grid has only a small number of cells and you use many processors, the speed drops because of the inter-processor communication. The speed also depends on the decomposition method chosen. Try to find out how the simulation time depends on the number of processors. Go through this link MPIRun How many processors. I hope it helps. |
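As a rough illustration of such a test (not from the original posts): the loop below re-decomposes the same case for several processor counts and times each run. It assumes an OpenFOAM 6 environment is already sourced, that system/decomposeParDict exists, and simpleFoam stands in for whatever solver is actually being used. Code:
#!/bin/bash
# Hypothetical strong-scaling check: run the same case on 2, 4, 8 and 16 ranks.
set -e
for np in 2 4 8 16; do
    foamDictionary -entry numberOfSubdomains -set $np system/decomposeParDict
    foamDictionary -entry method -set scotch system/decomposeParDict
    decomposePar -force > log.decomposePar.$np 2>&1
    start=$(date +%s)
    mpirun -np $np simpleFoam -parallel > log.simpleFoam.$np 2>&1
    end=$(date +%s)
    echo "np=$np  wall clock: $((end - start)) s"
done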
|
May 17, 2019, 13:06 |
|
#4 |
Member
Rishikesh
Join Date: Apr 2016
Posts: 63
Rep Power: 10 |
Hi Krao,
Thanks. Indeed, I am working on finding out the computing time with different decompositions: 6, 8 and 32 parts (using scotch decomposition). Thanks for the link you shared; it helps to have a rule of thumb (>10k cells per thread) while decomposing. Something strange happened to my 8-core parallel run when it crashed. This is the end of the transcript: Code:
PIMPLE: Iteration 49
MULES: Solving for alpha.air
air volume fraction, min, max = 0.141103 -11.7584 122.479
MULES: Solving for alpha.water
water volume fraction, min, max = 0.764804 -0.879164 7.36259
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42396e-14 1
Phase-sum volume fraction, min, max = 1 -12.6375 129.841
MULES: Solving for alpha.air
air volume fraction, min, max = 0.141067 -9037.89 70289.6
MULES: Solving for alpha.water
water volume fraction, min, max = 0.764802 -544.375 4225.82
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42395e-14 1
Phase-sum volume fraction, min, max = 0.999962 -9582.26 74515.4
MULES: Solving for alpha.air
air volume fraction, min, max = -12.3348 -3.03557e+09 2.35101e+10
MULES: Solving for alpha.water
water volume fraction, min, max = 0.0145568 -1.82503e+08 1.41343e+09
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42395e-14 1
Phase-sum volume fraction, min, max = -12.2262 -3.21807e+09 2.49235e+10
MULES: Solving for alpha.air
air volume fraction, min, max = -1.39731e+12 -3.39642e+20 2.63023e+21
MULES: Solving for alpha.water
water volume fraction, min, max = -8.40066e+10 -2.04193e+19 1.5813e+20
MULES: Solving for alpha.oil
oil volume fraction, min, max = 0.094093 -2.42394e-14 1
Phase-sum volume fraction, min, max = -1.48132e+12 -3.60061e+20 2.78836e+21
smoothSolver: Solving for Ux, Initial residual = 1, Final residual = 0.000659045, No Iterations 3
smoothSolver: Solving for Uy, Initial residual = 1, Final residual = 0.00083677, No Iterations 3
smoothSolver: Solving for Uz, Initial residual = 1, Final residual = 0.00106711, No Iterations 3
GAMG: Solving for p_rgh, Initial residual = 1, Final residual = 0.0051549, No Iterations 6
GAMG: Solving for p_rgh, Initial residual = 8.06574e-13, Final residual = 8.06574e-13, No Iterations 0
time step continuity errors : sum local = 2.60083e+53, global = 1.5896e+37, cumulative = 1.5896e+37
GAMG: Solving for p_rgh, Initial residual = 1.02356e-12, Final residual = 1.02356e-12, No Iterations 0
GAMGPCG: Solving for p_rgh, Initial residual = 1.02356e-12, Final residual = 1.02356e-12, No Iterations 0
time step continuity errors : sum local = 3.30051e+53, global = -1.54293e+37, cumulative = 4.66766e+35
DILUPBiCG: Solving for epsilon, Initial residual = 1, Final residual = 9.82834e-08, No Iterations 2
bounding epsilon, min: -3.14592e+69 max: 1.43365e+70 average: 2.87178e+64
DILUPBiCG: Solving for k, Initial residual = 1, Final residual = 1.79736e-06, No Iterations 1
bounding k, min: -2.89671e+58 max: 6.34189e+59 average: 3.21589e+54
PIMPLE: Iteration 50
MULES: Solving for alpha.air
air volume fraction, min, max = 4.34815e+48 -5.33765e+72 2.40282e+73
MULES: Solving for alpha.water
water volume fraction, min, max = -2.23802e+47 -3.21497e+71 1.44458e+72
MULES: Solving for alpha.oil
oil volume fraction, min, max = 3.08295e+07 -1.63942e+26 1.11028e+26
Phase-sum volume fraction, min, max = -2.04619e+49 -5.65915e+72 2.54728e+73
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          cn324 (PID 10640)
  MPI_COMM_WORLD rank: 3

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 10640 on node cn324 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
http://ww3.cad.de/foren/ubb/upl/F/Fr...eak_difcpu.pdf
Last edited by mrishi; May 17, 2019 at 14:52. |
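Two separate things are visible in this transcript: the fork() box is only a warning, while the job actually died on signal 8 (floating point exception), consistent with the unbounded volume fractions and the huge continuity errors above, i.e. the solution diverged. If one only wants to silence the fork warning itself, the message says it can be switched off via an MCA parameter, for example (the solver name is a placeholder): Code:
# Silences the fork() warning only; it does not fix the underlying divergence.
mpirun --mca mpi_warn_on_fork 0 -np 8 multiphaseInterFoam -parallel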
|
May 22, 2019, 09:46 |
|
#5 |
Senior Member
Kmeti Rao
Join Date: May 2019
Posts: 145
Rep Power: 8 |
Hi Rishikesh,
Most of these problems already exist on CFD Online and you can find answers to most of them; often it is enough to copy and paste the error message into Google. See Open MPI-fork() error. Hope this link helps. Krao |
|
May 22, 2019, 13:33 |
|
#6 |
Member
Rishikesh
Join Date: Apr 2016
Posts: 63
Rep Power: 10 |
Hi Krao,
Thanks for the link you shared. However, it does not cover the same issue: mine was tied to divergence inside OpenFOAM rather than to OpenMPI, caused by improper partitioning of information during decomposition, as I mentioned in the post above. My question was about how parallelization can interfere with the physical realism of the solution, and how one can minimize that. I apologize if it was not clear the way I put it originally. Regards |
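Not from the original posts, but one way to see how the mesh was actually split (and to spot badly balanced processor regions) is to write the cell-to-processor distribution as a field and inspect it along with the per-processor cell counts that decomposePar prints. A minimal sketch, assuming the case already has a decomposeParDict: Code:
# Write the cellDist field (which processor owns each cell) and list the cell counts.
decomposePar -force -cellDist > log.decomposePar 2>&1
grep "Number of cells" log.decomposePar
# Open the case in ParaView and colour by the cellDist field to see
# the shape of each processor region.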
|
September 6, 2019, 05:05 |
|
#7 |
Senior Member
Jianrui Zeng
Join Date: May 2018
Location: China
Posts: 157
Rep Power: 8 |
When I run my case on the HPC, a similar error appears:
mpirun noticed that process rank 0 with PID 25382 on node gs1016 exited on signal 9 (Killed).
I have no idea about the reason. Any hint is appreciated. |
|
September 6, 2019, 05:55 |
|
#8 | |
Senior Member
Kmeti Rao
Join Date: May 2019
Posts: 145
Rep Power: 8 |
Quote:
Regards, Krao |
|
September 6, 2019, 09:05 |
|
#9 | |
Senior Member
Jianrui Zeng
Join Date: May 2018
Location: China
Posts: 157
Rep Power: 8 |
Quote:
What's more, when I run a smaller case (fewer cells), the error disappears.
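A hedged observation, not confirmed in this thread: signal 9 together with a smaller case running fine often points to the job being killed for exceeding its memory (or walltime) allocation. If that is the suspicion, one might request more memory explicitly and record the solver's peak usage; the resource names, values and solver below are placeholders and depend on the cluster. Code:
#PBS -l nodes=1:ppn=16
#PBS -l mem=64gb              # placeholder value; check what a node actually provides
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
# GNU time reports "Maximum resident set size", so the next request can be sized properly.
/usr/bin/time -v mpirun -np 16 simpleFoam -parallel > log.simpleFoam 2>&1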
|
September 6, 2019, 09:30 |
|
#10 | |
Senior Member
Kmeti Rao
Join Date: May 2019
Posts: 145
Rep Power: 8 |
Quote:
|
|
September 6, 2019, 09:46 |
|
#11 |
Senior Member
Jianrui Zeng
Join Date: May 2018
Location: China
Posts: 157
Rep Power: 8 |
Thank you. I just use the scotch or simple method to decompose. Have you done any research on the efficiency of the different decomposition methods?
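Not from the original posts: a quick first check is to decompose the same case with each method and compare the per-processor cell counts (and then the actual run times). A sketch assuming an 8-way decomposition; the dictionary header and the simpleCoeffs values are illustrative. Code:
#!/bin/bash
# Hypothetical comparison of scotch vs simple decomposition on one case.
for method in scotch simple; do
    cat > system/decomposeParDict <<EOF
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

numberOfSubdomains 8;

method          $method;

simpleCoeffs
{
    n           (2 2 2);    // 2 x 2 x 2 = 8; only used when method is simple
    delta       0.001;
}
EOF
    decomposePar -force > log.decomposePar.$method 2>&1
    echo "== $method =="
    grep "Number of cells" log.decomposePar.$method
done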
|