|
OF211 with mvapich2 on redhat cluster, error when using more than 64 cores? |
|
October 31, 2013, 13:45 |
OF211 with mvapich2 on redhat cluster, error when using more than 64 cores?
|
#1 |
Member
Jack
Join Date: Dec 2011
Posts: 94
Rep Power: 15 |
Hi guys,
I am using a cluster (Red Hat 6.4) which does not support OpenMPI, so I compiled OF with MVAPICH2 (GCC 4.4.7 and mvapich2-1.9). I managed to run small jobs on up to 64 cores (4 nodes, 16 cores/node) without any error. BUT, when I tried to use 128 cores or more, I got the error shown below. I then re-compiled OF on another cluster which supports both OpenMPI and MVAPICH2 (on that cluster I can run on more than 512 cores with OpenMPI). A similar error came up: I cannot run on more than 64 cores with MVAPICH2! It is really weird. Have you seen this error before? How can I fix it? Thanks in advance! Regards, Code:
Error log from the first cluster:

[cli_47]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(436)...:
MPID_Init(371)..........: channel initialization failed
MPIDI_CH3_Init(285).....:
MPIDI_CH3I_CM_Init(1106): Error initializing MVAPICH2 ptmalloc2 library
....
.... (many of these)
[readline] Unexpected End-Of-File on file descriptor 9. MPI process died?
[mtpmi_processops] Error while reading PMI socket. MPI process died?
[child_handler] MPI process (rank: 43, pid: 92867) exited with status 1
[child_handler] MPI process (rank: 78, pid: 37914) exited with status 1
[readline] Unexpected End-Of-File on file descriptor 14. MPI process died?
[mtpmi_processops] Error while reading PMI socket. MPI process died?
[child_handler] MPI process (rank: 47, pid: 92871) exited with status 1
[child_handler] MPI process (rank: 69, pid: 37905) exited with status 1
[readline] Unexpected End-Of-File on file descriptor 16. MPI process died?
...
... (many of these)
Code:
Error log from the second cluster:

[cli_8]: aborting job: Fatal error in MPI_Init: Other MPI error
[cli_7]: aborting job: Fatal error in MPI_Init: Other MPI error
[cli_15]: aborting job: Fatal error in MPI_Init: Other MPI error
[cli_6]: aborting job: Fatal error in MPI_Init: Other MPI error
[cli_68]: aborting job: Fatal error in MPI_Init: Other MPI error
[cli_66]: aborting job: Fatal error in MPI_Init: Other MPI error
[proxy:0:1@compute-0-72.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:1@compute-0-72.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@compute-0-72.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:6@compute-0-75.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:6@compute-0-75.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:6@compute-0-75.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:7@compute-0-76.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:7@compute-0-76.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:7@compute-0-76.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@compute-0-10.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:2@compute-0-10.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@compute-0-10.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:3@compute-0-37.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:3@compute-0-37.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:3@compute-0-37.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:5@compute-0-40.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:5@compute-0-40.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:5@compute-0-40.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@compute-0-6.local] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@compute-0-6.local] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@compute-0-6.local] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@compute-0-6.local] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
|
October 31, 2013, 13:48 |
|
#2 |
Member
Jack
Join Date: Dec 2011
Posts: 94
Rep Power: 15 |
BTW: I compiled OF-2.1.1 with MVAPICH2 on Red Hat 6.4 as follows:
In etc/bashrc, change

WM_MPLIB=OPENMPI

to

WM_MPLIB=MPI

In etc/config/settings.sh, replace:

MPI)
    export FOAM_MPI=mpi
    export MPI_ARCH_PATH=/opt/mpi
    ;;

with:

MPI)
    export FOAM_MPI=mpi
    export MPI_HOME=/opt/apps/intel13/mvapich2/1.9
    export MPI_ARCH_PATH=$MPI_HOME
    _foamAddPath $MPI_ARCH_PATH/bin
    _foamAddLib  $MPI_ARCH_PATH/lib
    ;;

All the other compilation steps are the same as with OpenMPI, and there were no errors. Last edited by ripperjack; October 31, 2013 at 16:37.
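For reference, a minimal sketch of the rebuild that normally follows such a change. The source-tree location, the choice to rebuild only the Pstream layer, and the wrapper check are assumptions based on a standard OpenFOAM 2.1.1 installation, not details from the post above. Code:
# Re-source the environment so the new WM_MPLIB/MPI settings take effect
source $HOME/OpenFOAM/OpenFOAM-2.1.1/etc/bashrc

# Sanity check: make sure the MVAPICH2 compiler wrapper is the one found first
which mpicc && mpicc -show

# Rebuild the MPI-dependent layer; a full top-level ./Allwmake also works
cd $WM_PROJECT_DIR/src/Pstream && ./Allwmake

# The parallel decomposition libraries may also need rebuilding
cd $WM_PROJECT_DIR/src/parallel/decompose && ./Allwmake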
|
October 31, 2013, 22:34 |
|
#3 |
New Member
Jerome Vienne
Join Date: Oct 2013
Posts: 2
Rep Power: 0 |
Hi Ripperjack,
This is a known issue for the MVAPICH2 team: it occurs when third-party libraries interact with their internal memory allocator (ptmalloc). They received similar reports earlier with MPI programs embedding Perl and other external libraries. The interaction causes the libc.so memory functions to appear before the MVAPICH2 library (libmpich.so) in the dynamic shared-library ordering, which leads to the ptmalloc initialization failure. MVAPICH2 2.0a can handle this case and only prints a warning instead of crashing. There is also a way to avoid the problem by changing the link order of the libraries, but I don't remember exactly how it works. For the time being, can you please try the run-time parameter MV2_ON_DEMAND_THRESHOLD=<your job size>? With this parameter your application should continue without the registration-cache feature, though that may cause some performance degradation. You can also try MVAPICH2 2.0a. Thanks, Jerome
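A minimal sketch of how that run-time parameter can be passed to a job. MV2_ON_DEMAND_THRESHOLD itself is a standard MVAPICH2 environment variable, but the launcher, rank count, hostfile, and solver name below are only placeholders. Code:
# mpirun_rsh passes environment variables as KEY=VALUE on the command line
mpirun_rsh -np 128 -hostfile $PBS_NODEFILE \
    MV2_ON_DEMAND_THRESHOLD=128 interFoam -parallel

# with the Hydra launcher (mpiexec) the equivalent would be
mpiexec -n 128 -env MV2_ON_DEMAND_THRESHOLD 128 interFoam -parallel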
__________________
Jerome Vienne, Ph.D HPC Software Tools Group Texas Advanced Computing Center (TACC) viennej@tacc.utexas.edu | Phone: (512) 475-9322 Office: ROC 1.455B | Fax: (512) 475-9445 |
|
October 31, 2013, 22:52 |
|
#4 | |
Member
Jack
Join Date: Dec 2011
Posts: 94
Rep Power: 15 |
Quote:
Many thanks for your reply! I re-compiled OpenFOAM with MVAPICH2 2.0a and it worked! As you said, there is just a warning (shown below) but no error. I ran a test and OpenFOAM runs fine on more than 256 cores! Thanks again for your time! Best regards, Code:
WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support. |
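If you want to double-check that the new build really picks up the MVAPICH2 2.0a libraries rather than the old 1.9 install, a quick way is to inspect OpenFOAM's MPI layer with ldd; the exact library path below depends on your build options and is only an assumption. Code:
# Show which MPI shared library OpenFOAM's Pstream layer resolves at run time
ldd $FOAM_LIBBIN/$FOAM_MPI/libPstream.so | grep -i mpi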
August 30, 2014, 04:47 |
Similar problem
|
#5 |
New Member
lvcheng
Join Date: Aug 2014
Posts: 1
Rep Power: 0 |
Hi Ripperjack and Vienne,
I am new to Linux and I run into a similar problem when running FVCOM (an ocean model) with mvapich2_intel. The error is shown below (also attached as mistake.jpg), my PBS script is attached as PBS.jpg, and my .bash_profile is attached as baprofile.jpg. Could you give me some advice on how to solve it? Thanks a lot! lvcheng Code:
[node27:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8. MPI process died?
[node27:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[node28:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[node28:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[node27:mpispawn_0][child_handler] MPI process (rank: 0, pid: 3374) terminated with signal 11 -> abort job
[node27:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node node27 aborted: MPI process error (1)
[node28:mpispawn_1][child_handler] MPI process (rank: 4, pid: 3240) terminated with signal 11 -> abort job
|
|
|