
MPI send recv error

March 31, 2016, 21:34   #1
Dominic Chandar (dominic), Member
Join Date: Mar 2009 | Location: United Kingdom | Posts: 31
I am getting an MPI send/recv error after a certain number of iterations. The lift and drag appear to converge smoothly, and then all of a sudden this happens. The iteration at which the error occurs is different each time I run the code, and I'm unable to debug it since this is a large case with about 700 processes. Any pointers? The send and recv counts in the failing call appear to be different.

I'm running SU2 v4.0.0

cat su2768.e1192297

[291:std0317][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_poll_rc.c:1372] Intel MPI fatal error: ofa-v2-mlx5_0-1u DTO operation posted for [312:std0344] completed with error. status=0x8. cookie=0x40138
[291:std0317] unexpected DAPL connection event 0x4006 from 312
Fatal error in MPI_Sendrecv: Internal MPI error!, error stack:
MPI_Sendrecv(242)........: MPI_Sendrecv(sbuf=0x12a4b310, scount=19551, MPI_DOUBLE, dest=312, stag=0, rbuf=0x12a25a90, rcount=19215, MPI_DOUBLE, src=312, rtag=0, MPI_COMM_WORLD, status=0x7fffffffbc10) failed

PMPIDI_CH3I_Progress(780):
(unknown)(): Internal MPI error!
[mpiexec@std0301] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 24 at host std0344 failed
[mpiexec@std0301] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@std0301] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
[mpiexec@std0301] main (../../ui/mpich/mpiexec.c:1125): process manager error waiting for completion
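
For what it's worth, differing send and receive counts in a single MPI_Sendrecv call are legal on their own: each count only has to be consistent with the count posted by the peer rank in the matching call, and in a halo exchange the two directions often carry different numbers of points. Here is a minimal sketch of the pattern, reusing the counts from the log as hypothetical buffer sizes (this is not SU2's actual communication code):

Code:
/* Minimal MPI_Sendrecv exchange sketch; run with exactly 2 ranks.
   Hypothetical sizes taken from the log above, not SU2's code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1); /* sketch assumes 2 ranks */

    int peer   = 1 - rank;                     /* ranks 0 and 1 pair up        */
    int scount = (rank == 0) ? 19551 : 19215;  /* doubles this rank sends      */
    int rcount = (rank == 0) ? 19215 : 19551;  /* doubles the peer sends back  */

    double *sbuf = malloc(scount * sizeof(double));
    double *rbuf = malloc(rcount * sizeof(double));
    for (int i = 0; i < scount; ++i) sbuf[i] = (double)rank;

    /* One combined send/receive per neighbor; MPI progresses both
       halves of the call internally, so this cannot deadlock. */
    MPI_Sendrecv(sbuf, scount, MPI_DOUBLE, peer, 0,
                 rbuf, rcount, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: sent %d doubles, received %d doubles\n",
           rank, scount, rcount);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
If counts like these are valid on both sides, the failure is more likely at the transport layer (the DAPL "DTO operation ... completed with error" line), which often means the peer rank died mid-transfer.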



Dominic

May 25, 2016, 00:02   #2
Heather Kline (hlk), Senior Member
Join Date: Jun 2013 | Posts: 309
Quote:
Originally Posted by dominic (post #1, quoted in full)

Given your description of a large case, it may be an issue with exceeding the maximum memory available on the nodes. As one solution, you could try more processors or larger-memory nodes. Since some maintenance has been done on the code recently, you may also want to try out a more recent version.
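
If it helps to test the memory theory, one low-effort check is to have each rank log its resident set size every few hundred iterations and watch whether the peak approaches the node limit before the crash; a rank killed by the OOM killer typically shows up on the other ranks as exactly this kind of fabric-level MPI failure. A minimal Linux-only sketch (it reads VmRSS from /proc/self/status; this is a hypothetical helper, not part of SU2):

Code:
#include <mpi.h>
#include <stdio.h>

/* Parse VmRSS (resident set size, in kB) from /proc/self/status.
   Returns 0 if the file or field cannot be read (non-Linux systems). */
static long rss_kb(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return 0;
    char line[256];
    long kb = 0;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

/* Print the largest per-rank resident set size in the communicator.
   Hypothetical diagnostic helper, not part of SU2. */
void report_max_rss(MPI_Comm comm) {
    long local = rss_kb(), max_kb = 0;
    int rank;
    MPI_Comm_rank(comm, &rank);
    MPI_Reduce(&local, &max_kb, 1, MPI_LONG, MPI_MAX, 0, comm);
    if (rank == 0)
        printf("max RSS across ranks: %.2f GB\n", max_kb / 1048576.0);
}
Calling report_max_rss(MPI_COMM_WORLD) from the iteration loop would show whether the crash correlates with a memory peak.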

May 25, 2016, 00:14   #3
Dominic Chandar (dominic), Member
Join Date: Mar 2009 | Location: United Kingdom | Posts: 31
Heather,

I did try the recent development branch as well. To make sure I don't run out of memory, I use just 2 processes per node x 480 nodes = 960 processes, and each node has 125 GB. The problem still persists. I have given up on SU2 for this case, as I couldn't complete the verification cases for AIAA-DPW6.

Something encouraging: it takes 3.2 seconds per step on 960 processes and 35 seconds per step on 96 processes, a 10.9x speedup for 10x the processes, so the scalability is good.

The mesh has 20 million nodes and 83 million elements; it is the NASA wing-body (WB) configuration mesh from AIAA-DPW6.

Having said that, the AIAA-DPW5 meshes (which are coarser than the DPW6 ones) work very well without any problems.

Dominic
