
Communication Deadlock in MPPICFoam Parallel Solver



May 29, 2021, 12:17   #1

Bradley Morgan (auburnhpc)
New Member
Join Date: May 2021
Posts: 1
I have posted this as an issue to the OpenFOAM-8 GitHub repository (#14), but I wanted to see if anyone here may also have some insight.

Our research team has also prepared a detailed slide deck with more simulation-specific findings, but the attachment is too large to upload here. Please see the issue for the attachment and the full debug summary.
https://github.com/OpenFOAM/OpenFOAM-8/issues/14

Summary
-------

OpenFOAM version 8 experiences what appears to be a communication deadlock in a scheduled send/receive operation.

The case in question attempts to solve a toy CFD problem evaluating airflow within a rectangular prism using parallel instances of MPPICFoam on decomposed input.
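
To illustrate the class of hang we suspect, here is a minimal, generic MPI sketch in C++ (plain MPI, not OpenFOAM source and not our actual case; the buffer names and sizes are made up purely for the sketch). Two ranks that both issue a blocking send before posting the matching receive will deadlock once the message no longer fits the eager protocol, which is the textbook shape of a blocked paired exchange:

// Generic illustration only -- not OpenFOAM code.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    int size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Assumes exactly two ranks for the sake of the illustration.
    const int peer = (rank == 0) ? 1 : 0;
    std::vector<char> sendBuf(1 << 22, 'x');  // large enough to defeat eager sends
    std::vector<char> recvBuf(1 << 22);

    // Both ranks send first, then receive: neither blocking send can complete
    // until the peer posts its receive, so both ranks stall here.
    MPI_Send(sendBuf.data(), static_cast<int>(sendBuf.size()), MPI_CHAR,
             peer, 1, MPI_COMM_WORLD);
    MPI_Recv(recvBuf.data(), static_cast<int>(recvBuf.size()), MPI_CHAR,
             peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    std::printf("rank %d of %d finished\n", rank, size);
    MPI_Finalize();
    return 0;
}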

Multiple OpenMPI configurations have been attempted, varying the release (e.g. 4.0.3, 2.1.6), the compiler (e.g. Intel, gcc), and the transport layer (e.g. ucx, openib), in conjunction with multiple builds of OpenFOAM 8. Blocking vs. non-blocking communication and a number of mpirun command-line tuning parameters (including varied world sizes) have also been tried, with no resolution.

To determine whether the file system was a factor, the case was run on both local and parallel (GPFS) storage; no difference in runtime behavior was observed.

Additionally, a number of case configuration values (e.g. mesh sizing, simulation times) have been varied without any effect.

For debugging purposes, the simulation deltaT was adjusted from 1e-3 to 1.0, which greatly reduces the time to failure.


decomposeParDict
===========

/*--------------------------------*- C++ -*----------------------------------*\
  =========                 |
  \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox
   \\    /   O peration     | Website:  https://openfoam.org
    \\  /    A nd           | Version:  8
     \\/     M anipulation  |
\*---------------------------------------------------------------------------*/
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "system";
    object      decomposeParDict;
}

numberOfSubdomains  3;

method          simple;

simpleCoeffs
{
    n       (3 1 1);
    delta   0.001;
}

controlDict
=======

/*--------------------------------*- C++ -*----------------------------------*\
  =========                 |
  \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox
   \\    /   O peration     | Website:  https://openfoam.org
    \\  /    A nd           | Version:  8
     \\/     M anipulation  |
\*---------------------------------------------------------------------------*/
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "system";
    object      controlDict;
}

application MPPICFoam;
startFrom startTime;
startTime 0.0;
stopAt endTime;
endTime 6.5;
deltaT 1.0;
writeControl timeStep;
writeInterval 1;
purgeWrite 0;
writeFormat ascii;
writePrecision 6;
writeCompression off;
timeFormat general;
timePrecision 6;
runTimeModifiable no;

OptimisationSwitches
{
    fileModificationSkew        60;
    fileModificationChecking    timeStampMaster;
    fileHandler                 uncollated;
    maxThreadFileBufferSize     2e9;
    maxMasterFileBufferSize     2e9;
    commsType                   blocking; // nonBlocking; // scheduled; // blocking;
    floatTransfer               0;
    nProcsSimpleSum             0;

    // Force dumping (at next timestep) upon signal (-1 to disable)
    writeNowSignal              -1; // 10;
    stopAtWriteNowSignal        -1;

    inputSyntax                 dot;
    mpiBufferSize               200000000;
    maxCommsSize                0;
    trapFpe                     1;
    setNaN                      0;
}

DebugSwitches
{
    UPstream    1;
    Pstream     1;
    processor   1;
    IFstream    1;
    OFstream    1;
}


Summary of Debug Output
------------------------------

The following debug output was generated using the above case configuration with an MPI world size of 3:

$ srun -N1 -n3 --pty /bin/bash
...
$ module load openfoam/8-ompi2
$ source /tools/openfoam-8/mpich/OpenFOAM-8/etc/bashrc
$ decomposePar -force
$ mpirun -np $SLURM_NTASKS MPPICFoam -parallel

The process tree of the job looks like ...

$ pstree -ac --show-parents -p -l 54148
systemd,1
└─slurmstepd,262636
└─bash,262643
└─mpirun,54148 -np 3 MPPICFoam -parallel
├─MPPICFoam,54152 -parallel
│ ├─{MPPICFoam},<tid>
│ ├─{MPPICFoam},<tid>
│ └─{MPPICFoam},<tid>
├─MPPICFoam,54153 -parallel
│ ├─{MPPICFoam},<tid>
│ ├─{MPPICFoam},<tid>
│ └─{MPPICFoam},<tid>
├─MPPICFoam,54154 -parallel
│ ├─{MPPICFoam},<tid>
│ ├─{MPPICFoam},<tid>
│ └─{MPPICFoam},<tid>
├─{mpirun},<tid>
├─{mpirun},<tid>
└─{mpirun},<tid>


The case output at the time of failure looks like ...

[0] UPstream::waitRequests : starting wait for 0 outstanding requests starting at 0
[0] UPstream::waitRequests : finished wait.
[0] UIPstream::read : starting read from:1 tag:1 comm:0 wanted size:1 commsType:scheduled
[0] UIPstream::read : finished read from:1 tag:1 read size:1 commsType:scheduled
[0] UIPstream::read : starting read from:2 tag:1 comm:0 wanted size:1 commsType:scheduled
[0] UIPstream::read : finished read from:2 tag:1 read size:1 commsType:scheduled
[0] UOPstream::write : starting write to:2 tag:1 comm:0 size:1 commsType:scheduled
[0] UOPstream::write : finished write to:2 tag:1 size:1 commsType:scheduled
[0] UOPstream::write : starting write to:1 tag:1 comm:0 size:1 commsType:scheduled
[0] UOPstream::write : finished write to:1 tag:1 size:1 commsType:scheduled
[2] UPstream::waitRequests : starting wait for 0 outstanding requests starting at 0
[2] UPstream::waitRequests : finished wait.
[2] UOPstream::write : starting write to:0 tag:1 comm:0 size:1 commsType:scheduled
[2] UOPstream::write : finished write to:0 tag:1 size:1 commsType:scheduled
[2] UIPstream::read : starting read from:0 tag:1 comm:0 wanted size:1 commsType:scheduled
[2] UIPstream::read : finished read from:0 tag:1 read size:1 commsType:scheduled
[1] UPstream::waitRequests : starting wait for 0 outstanding requests starting at 0
[1] UPstream::waitRequests : finished wait.
[1] UOPstream::write : starting write to:0 tag:1 comm:0 size:1 commsType:scheduled
[1] UOPstream::write : finished write to:0 tag:1 size:1 commsType:scheduled
[1] UIPstream::read : starting read from:0 tag:1 comm:0 wanted size:1 commsType:scheduled
[1] UIPstream::read : finished read from:0 tag:1 read size:1 commsType:scheduled

<... freeze ...>

Here, the communication schedule seems to be balanced, with all matching sends and receives (based on size and tag). However, the behavior indicates a blocked send or receive call.

The deadlock always seems to occur for size=1 send/recv operations.

*** The remaining content consists of gdb output from the MPI ranks.

The root mpirun process (54148) looks like it is stuck in a poll loop.

Rank 0 appears to be returning from Foam::Barycentric / Foam::BarycentricTensor<double> calls inside the particle-tracking code.

Ranks 1 and 2 appear to be waiting on PMPI_Alltoall communication.
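
As a general MPI note (standard MPI semantics, not anything specific to OpenFOAM internals), MPI_Alltoall is a blocking collective: no rank returns from it until every rank in the communicator has entered it, so ranks that reach the call early simply spin in the progress engine. A minimal sketch of the kind of size exchange seen in the rank 1 backtrace below (Pstream::exchangeSizes -> UPstream::allToAll) would look like the following; the variable names are ours, purely for illustration:

// Generic illustration of a blocking all-to-all size exchange -- not OpenFOAM code.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    int size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // One send size per peer, one receive size per peer.
    std::vector<int> sendSizes(size, 0);
    std::vector<int> recvSizes(size, 0);

    // Blocking collective: this call only completes once every rank in
    // MPI_COMM_WORLD has called it. A rank that never arrives leaves the
    // others waiting here indefinitely.
    MPI_Alltoall(sendSizes.data(), 1, MPI_INT,
                 recvSizes.data(), 1, MPI_INT, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}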

Root (MPI) Process
============

[node040 B-3-22471]$ mpirun -np $SLURM_NTASKS MPPICFoam -parallel > /dev/null 2>&1 &
[1] 54148

[node040 B-3-22471]$ ps -ef | grep hpcuser
hpcuser 47219 47212 0 09:08 pts/0 00:00:00 /bin/bash
hpcuser 54148 47219 1 09:54 pts/0 00:00:00 mpirun -np 3 MPPICFoam -parallel
hpcuser 54152 54148 72 09:54 pts/0 00:00:02 MPPICFoam -parallel
hpcuser 54153 54148 81 09:54 pts/0 00:00:02 MPPICFoam -parallel
hpcuser 54154 54148 81 09:54 pts/0 00:00:02 MPPICFoam -parallel
hpcuser 54166 47219 0 09:54 pts/0 00:00:00 ps -ef
hpcuser 54167 47219 0 09:54 pts/0 00:00:00 grep --color=auto hpcuser

[node040 B-3-22471]$ gdb -p 54148
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7


(gdb) frame
#0 0x00002aaaac17fccd in poll () from /usr/lib64/libc.so.6
(gdb) where
#0 0x00002aaaac17fccd in poll () from /usr/lib64/libc.so.6
#1 0x00002aaaab096fc6 in poll_dispatch (base=0x659370, tv=0x0) at ../../../../../../../openmpi-4.0.3/opal/mca/event/libevent2022/libevent/poll.c:165
#2 0x00002aaaab08ec80 in opal_libevent2022_event_base_loop (base=0x659370, flags=1) at ../../../../../../../openmpi-4.0.3/opal/mca/event/libevent2022/libevent/event.c:1630
#3 0x0000000000401438 in orterun (argc=5, argv=0x7fffffffaae8) at ../../../../../openmpi-4.0.3/orte/tools/orterun/orterun.c:178
#4 0x0000000000400f6d in main (argc=5, argv=0x7fffffffaae8) at ../../../../../openmpi-4.0.3/orte/tools/orterun/main.c:13
(gdb) n
Single stepping until exit from function poll,
which has no line number information.
< ... freeze ... >

Rank 0 Process
==========

[node040 B-3-22471]$ gdb -p 54152

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Attaching to process 54152
Reading symbols from /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/platforms/linux64GccDPInt32Debug/bin/MPPICFoam...done.
Reading symbols from /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/platforms/linux64GccDPInt32Debug/lib/liblagrangian.so...done.
Loaded symbols for /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/platforms/linux64GccDPInt32Debug/lib/liblagrangian.so
Reading symbols from /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/platforms/linux64GccDPInt32Debug/lib/liblagrangianIntermediate.so...done.
Loaded symbols for /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/platforms/linux64GccDPInt32Debug/lib/liblagrangianIntermediate.so
Reading symbols from /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/platforms/linux64GccDPInt32Debug/lib/liblagrangianTurbulence.so...done.
Loaded symbols for /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/platforms/linux64GccDPInt32Debug/lib/liblagrangianTurbulence.so
Reading symbols from /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/platforms/linux64GccDPInt32Debug/lib/libincompressibleTransportModels.so...done.
Loaded symbols for /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/platforms/linux64GccDPInt32Debug/lib/libincompressibleTransportModels.so
...

(gdb) frame
#0 0x00002aaaaacfd654 in Foam::BarycentricTensor<double>::d (this=0x7fffffff4620) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/OpenFOAM/lnInclude/BarycentricTensorI.H:159
159 return Vector<Cmpt>(this->v_[XD], this->v_[YD], this->v_[ZD]);

(gdb) where
#0 0x00000000004ceebd in Foam::Barycentric<double>::Barycentric (this=0x7fffffff4be0, va=@0x7fffffff4cc0: -0.13335, vb=@0x7fffffff4cc8: -0.13716, vc=@0x7fffffff4cd0: -0.13716,
vd=@0x7fffffff4cd8: -0.12953999999999999) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/OpenFOAM/lnInclude/BarycentricI.H:50
#1 0x00000000004b97f5 in Foam::BarycentricTensor<double>::z (this=0x7fffffff4c80) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/OpenFOAM/lnInclude/BarycentricTensorI.H:131
#2 0x00000000004aae13 in Foam::operator&<double> (T=..., b=...) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/OpenFOAM/lnInclude/BarycentricTensorI.H:177
#3 0x00000000004a6a1e in Foam::particle::position (this=0x2240a40) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/basic/lnInclude/particleI.H:280
#4 0x00002aaaaacf837d in Foam::particle::deviationFromMeshCentre (this=0x2240a40) at particle/particle.C:1036
#5 0x000000000051ac3a in Foam::KinematicParcel<Foam::particle>::move<Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > > > (
this=0x2240a40, cloud=..., td=..., trackTime=1) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/intermediate/lnInclude/KinematicParcel.C:309
#6 0x000000000050acc2 in Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> >::move<Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > > > (this=0x2240a40, cloud=..., td=..., trackTime=1) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/intermediate/lnInclude/MPPICParcel.C:102
#7 0x00000000004f22f3 in Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > >::move<Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > > > (this=0x7fffffff7220, cloud=..., td=..., trackTime=1) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/basic/lnInclude/Cloud.C:205
#8 0x00000000004f1e18 in Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > >::motion<Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > > > (this=0x7fffffff7220, cloud=..., td=...)
at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/intermediate/lnInclude/MPPICCloud.C:247
#9 0x00000000004da066 in Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > >::evolveCloud<Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > > > (this=0x7fffffff7220, cloud=..., td=...)
at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/intermediate/lnInclude/KinematicCloud.C:210
#10 0x00000000004c3497 in Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > >::solve<Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > > > (this=0x7fffffff7220, cloud=..., td=...) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/intermediate/lnInclude/KinematicCloud.C:114
#11 0x00000000004afc73 in Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > >::evolve (this=0x7fffffff7220)
at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/intermediate/lnInclude/MPPICCloud.C:169
#12 0x000000000049e61e in main (argc=2, argv=0x7fffffffa258) at ../DPMFoam.C:109


Rank 1 Process
=========

[node040 B-3-22471]$ gdb -p 54153

(gdb) frame
#0 0x00002aaac936812d in uct_mm_iface_progress (tl_iface=<optimized out>) at ../../../src/uct/sm/mm/base/mm_iface.c:365

(gdb) frame
#0 0x00002aaaca08848a in uct_rc_mlx5_iface_progress_cyclic (arg=<optimized out>) at ../../../../src/uct/ib/rc/accel/rc_mlx5_iface.c:183
183 }

(gdb) where
#0 0x00002aaaca088484 in uct_rc_mlx5_iface_progress_cyclic (arg=<optimized out>) at ../../../../src/uct/ib/rc/accel/rc_mlx5_iface.c:183
#1 0x00002aaac90b608a in ucs_callbackq_dispatch (cbq=<optimized out>) at /home/hpcuser/build/ucx/build/../src/ucs/datastruct/callbackq.h:211
#2 uct_worker_progress (worker=<optimized out>) at /home/hpcuser/build/ucx/build/../src/uct/api/uct.h:2592
#3 ucp_worker_progress (worker=0xb9f390) at ../../../src/ucp/core/ucp_worker.c:2530
#4 0x00002aaac8c6c6d7 in mca_pml_ucx_progress () from /tools/openmpi-4.0.3/gcc/4.8.5/ucx/lib/openmpi/mca_pml_ucx.so
#5 0x00002aaab91c780c in opal_progress () from /tools/openmpi-4.0.3/gcc/4.8.5/ucx/lib/libopen-pal.so.40
#6 0x00002aaab85111bd in ompi_request_default_wait_all () from /tools/openmpi-4.0.3/gcc/4.8.5/ucx/lib/libmpi.so.40
#7 0x00002aaab8565398 in ompi_coll_base_alltoall_intra_basic_linear () from /tools/openmpi-4.0.3/gcc/4.8.5/ucx/lib/libmpi.so.40
#8 0x00002aaab85240d7 in PMPI_Alltoall () from /tools/openmpi-4.0.3/gcc/4.8.5/ucx/lib/libmpi.so.40
#9 0x00002aaab2569953 in Foam::UPstream::allToAll (sendData=..., recvData=..., communicator=0) at UPstream.C:367
#10 0x00002aaab0a162c1 in Foam::Pstream::exchangeSizes<Foam::List<Foam::DynamicList<char, 0u, 2u, 1u> > > (sendBufs=..., recvSizes=..., comm=0) at db/IOstreams/Pstreams/exchange.C:158
#11 0x00002aaab0a15d0d in Foam::PstreamBuffers::finishedSends (this=0x7fffffff4fe0, recvSizes=..., block=true) at db/IOstreams/Pstreams/PstreamBuffers.C:106
#12 0x00000000004f2670 in Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > >::move<Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > > > (this=0x7fffffff7220, cloud=..., td=..., trackTime=1) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/basic/lnInclude/Cloud.C:283
#13 0x00000000004f1e18 in Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > >::motion<Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > > > (this=0x7fffffff7220, cloud=..., td=...)
at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/intermediate/lnInclude/MPPICCloud.C:247
#14 0x00000000004da066 in Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > >::evolveCloud<Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > > > (this=0x7fffffff7220, cloud=..., td=...)
at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/intermediate/lnInclude/KinematicCloud.C:210
#15 0x00000000004c3497 in Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > >::solve<Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > > > (this=0x7fffffff7220, cloud=..., td=...) at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/intermediate/lnInclude/KinematicCloud.C:114
#16 0x00000000004afc73 in Foam::MPPICCloud<Foam::KinematicCloud<Foam::Cloud<Foam::MPPICParcel<Foam::KinematicParcel<Foam::particle> > > > >::evolve (this=0x7fffffff7220)
at /mmfs1/tools/openfoam-8/debug/OpenFOAM-8/src/lagrangian/intermediate/lnInclude/MPPICCloud.C:169
#17 0x000000000049e61e in main (argc=2, argv=0x7fffffffa258) at ../DPMFoam.C:109


Rank 2 Process
=========

<see GitHub issue #14>

Last edited by auburnhpc; May 29, 2021 at 12:19. Reason: formatting

May 29, 2021, 17:55   #2

HPE
Senior Member
Herpes Free Engineer
Join Date: Sep 2019
Location: The Home Under The Ground with the Lost Boys
Posts: 931
openfoam.org uses https://bugs.openfoam.org (see https://bugs.openfoam.org/rules.php) instead of GitHub for bug reports. Just FYI.
auburnhpc likes this.






