|
March 31, 2016, 21:34 |
MPI send recv error
|
#1 |
Member
|
I am getting an MPI send/recv error after a certain number of iterations.
The lift and drag appear to converge smoothly and then, all of a sudden, this happens. The iteration at which the error occurs is different each time I run the code. I'm unable to debug it because this is a large case with about 700 processes. Any pointers? The send and receive counts appear to be different. I'm running SU2 v4.0.0.

cat su2768.e1192297

Code:
[291:std0317][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_poll_rc.c:1372] Intel MPI fatal error: ofa-v2-mlx5_0-1u DTO operation posted for [312:std0344] completed with error. status=0x8. cookie=0x40138
[291:std0317] unexpected DAPL connection event 0x4006 from 312
Fatal error in MPI_Sendrecv: Internal MPI error!, error stack:
MPI_Sendrecv(242)........: MPI_Sendrecv(sbuf=0x12a4b310, scount=19551, MPI_DOUBLE, dest=312, stag=0, rbuf=0x12a25a90, rcount=19215, MPI_DOUBLE, src=312, rtag=0, MPI_COMM_WORLD, status=0x7fffffffbc10) failed
PMPIDI_CH3I_Progress(780): (unknown)(): Internal MPI error!
[mpiexec@std0301] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 24 at host std0344 failed
[mpiexec@std0301] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@std0301] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
[mpiexec@std0301] main (../../ui/mpich/mpiexec.c:1125): process manager error waiting for completion

Dominic
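A note on the counts in that call: the scount/rcount mismatch in MPI_Sendrecv is not, by itself, an error. The two counts describe two different messages (the outgoing and the incoming halo), and each side's receive count only has to accommodate what its partner actually sends; the "DTO operation ... completed with error" lines come from Intel MPI's DAPL fabric layer. Below is a minimal sketch, not SU2 code, of the kind of pairwise exchange shown in the error stack, using the two counts from the log as hypothetical halo sizes and assuming a run with exactly two ranks:

Code:
// Minimal sketch (not SU2 code) of a pairwise MPI_Sendrecv halo exchange
// mirroring the call in the error stack.  The send count and receive count
// in one call refer to two different messages, so they need not match each
// other; each side's receive count only has to match what its partner sends.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {                       // the sketch assumes exactly two ranks
        if (rank == 0) std::fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    // Hypothetical halo sizes standing in for the 19551/19215 values in the
    // log: rank 0 sends 19551 doubles, rank 1 sends 19215.
    const int my_count      = (rank == 0) ? 19551 : 19215;
    const int partner       = 1 - rank;
    const int partner_count = (rank == 0) ? 19215 : 19551;

    std::vector<double> send_buf(my_count, 1.0);
    std::vector<double> recv_buf(partner_count, 0.0);

    // Exchange the two halos in a single, deadlock-free call.
    MPI_Sendrecv(send_buf.data(), my_count,      MPI_DOUBLE, partner, 0,
                 recv_buf.data(), partner_count, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}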
|
May 25, 2016, 00:02 |
|
#2 |
Senior Member
Heather Kline
Join Date: Jun 2013
Posts: 309
Rep Power: 14 |
Given your description of this as a large case, it may be an issue with running over the maximum memory available on the nodes. You could try more processors, or larger-memory nodes, as one solution. As some maintenance has been done recently, you may also want to try out a more recent version of the code. |
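One way to check the out-of-memory hypothesis (assuming Linux compute nodes) is to have every rank report its resident set size periodically and watch whether it grows toward the per-node limit as the run proceeds. Below is a minimal stand-alone diagnostic sketch, not part of SU2, that reads the VmRSS field from /proc/self/status and reduces it to the maximum across ranks:

Code:
// Stand-alone diagnostic sketch (not SU2 code): report the largest per-rank
// resident set size in the job, read from Linux /proc/self/status.
#include <mpi.h>
#include <cstdio>

// Return this process's resident set size in MB (VmRSS), or -1 on failure.
static double resident_mb() {
    std::FILE* f = std::fopen("/proc/self/status", "r");
    if (!f) return -1.0;
    char line[256];
    double mb = -1.0;
    while (std::fgets(line, sizeof(line), f)) {
        long kb = 0;
        if (std::sscanf(line, "VmRSS: %ld kB", &kb) == 1) {
            mb = kb / 1024.0;
            break;
        }
    }
    std::fclose(f);
    return mb;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One number per report: the largest per-rank footprint across the job.
    double mb = resident_mb(), max_mb = 0.0;
    MPI_Reduce(&mb, &max_mb, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("max resident memory per rank: %.1f MB\n", max_mb);

    MPI_Finalize();
    return 0;
}

The same VmRSS read could be dropped into the iteration loop of an instrumented build; it is wrapped in its own main here only so that it compiles on its own.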
|
May 25, 2016, 00:14 |
|
#3 |
Member
|
Heather,
I did try the recent development branch as well. To make sure that I don't run out of memory, I used just 2 processes per node x 480 nodes = 960 processes; each node has 125 GB. The problem still persists, so I gave up on using SU2 for this case, as I couldn't complete the verification cases for AIAA-DPW6. Something encouraging: it takes 3.2 seconds per step on 960 processes and 35 seconds per step on 96 processes, so the scalability is good. The mesh has 20 million nodes and 83 million elements; it is the NASA WB configuration mesh from AIAA-DPW6. That said, the AIAA-DPW5 meshes (which are coarser than the DPW6 ones) work very well without any problems.
Dominic
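Taking the quoted timings at face value, going from 96 to 960 processes gives a speedup of 35 / 3.2 ≈ 10.9 on 10x the process count, i.e. a strong-scaling efficiency of roughly (96 × 35) / (960 × 3.2) ≈ 109%, slightly better than ideal; that is consistent with the "good scalability" observation above.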
|
Tags |
mpi errors |
|