|
foam-extend-3.2 Pstream: "MPI_ABORT was invoked" |
|
November 11, 2015, 17:33 |
foam-extend-3.2 Pstream: "MPI_ABORT was invoked"
|
#1 |
New Member
Brent Craven
Join Date: Oct 2015
Posts: 7
Rep Power: 11 |
I am having a similar issue with foam-extend-3.2. It installed with no problems and runs in parallel using the system Open MPI on a single node (up to 12 cores). But when I try to use more than one node, I get the following MPI_ABORT:
Code:
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I have also tried setting the following in the case controlDict, without success:
Code:
OptimisationSwitches
{
    commsType       nonBlocking;
}
[ Moderator note: moved from http://www.cfd-online.com/Forums/ope...end-3-2-a.html ]

Last edited by wyldckat; November 16, 2015 at 13:04. Reason: see "Moderator note:" |
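For reference, the multi-node runs described above are launched along these lines; the hostfile name and exact mpirun options here are illustrative placeholders, not the exact command used:
Code:
# hypothetical launch across 2 nodes x 12 cores with the system Open MPI
mpirun -np 24 --hostfile machines simpleFoam -parallel > log.simpleFoam 2>&1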
|
November 14, 2015, 23:30 |
foam-extend-3.2 Pstream: "MPI_ABORT was invoked"
|
#2 |
New Member
Brent Craven
Join Date: Oct 2015
Posts: 7
Rep Power: 11 |
Hi All,
I am having major problems getting foam-extend-3.2 to run across multiple nodes on a cluster (in fact, I have tried two different clusters with the same result). The code installed just fine and runs in serial and in parallel on a single node with decent scaling, so MPI seems to be working fine within a node. However, as soon as I try to bridge multiple nodes, I get the following MPI_ABORT error as soon as simpleFoam (or any other solver I have tested) enters the time loop:
Code:
Starting time loop

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 21 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I noticed that the Pstream library changed location from foam-extend-3.1 to foam-extend-3.2 and seems to have changed quite a bit. I wonder if that is part of the issue? |
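For context, a minimal decomposeParDict along these lines would match a 2x12-core run; the decomposition method shown is just an example, not necessarily the one used in the original case:
Code:
// system/decomposeParDict (illustrative sketch)
numberOfSubdomains  24;        // 2 nodes x 12 cores per node

method              scotch;    // assumed; any available method (metis, hierarchical, ...) applies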
|
November 16, 2015, 13:28 |
|
#3 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Quick answer: Please try using the parallel testing utility that exists for OpenFOAM and foam-extend. Instructions for foam-extend are provided here: http://www.cfd-online.com/Forums/ope...tml#post560394 - post #12
The other possibility that comes to mind is that the shell environment on the compute nodes is only partially loading the foam-extend environment variables, which would result in incompatible builds of simpleFoam being launched. One test I usually do for this is to launch mpirun with a shell script that simply dumps the current shell environment into a log file, so that I can examine what the shell environment looks like in each launched process. For example, a script containing this:
Code:
#!/bin/sh
export > log_env.$$ |
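One way to use it (the script name and process count below are placeholders) is to run the script itself through mpirun, so that every rank writes its own log_env.<pid> file, and then compare the files produced on the head node with the ones produced on the compute nodes:
Code:
# hypothetical usage; each rank dumps the environment it actually sees
chmod +x dump_env.sh
mpirun -np 24 --hostfile machines ./dump_env.sh
# afterwards, check PATH, LD_LIBRARY_PATH and WM_PROJECT_DIR in the log_env.* files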
|
November 16, 2015, 17:51 |
|
#4 |
New Member
Brent Craven
Join Date: Oct 2015
Posts: 7
Rep Power: 11 |
Hi Bruno,
Thanks for your recommendation. I used your script and looked at my shell environment, which looks fine to me (it shows the correct $PATH, including foam-extend-3.2, for all processes). So I don't think that's the problem.

I had to slightly modify the parallelTest utility to get it to compile in foam-extend-3.2, since it appears that Time.H no longer exists and I was getting "Time.H: No such file or directory." The source code is attached below.

I ran parallelTest using MPI across multiple nodes (2 nodes with 12 cores each) and here is the resulting stderr:
Code:
[13] slave sending to master 0 [13] slave receiving from master 0 [15] slave sending to master 0 [15] slave receiving from master 0 [23] slave sending to master 0 [23] slave receiving from master 0 [14] slave sending to master 0 [14] slave receiving from master 0 [22] slave sending to master 0 [22] slave receiving from master 0 [12] slave sending to master 0 [12] slave receiving from master 0 [20] slave sending to master [8] slave sending to master 0 [8] slave receiving from master 0 [9] 0 [20] slave receiving from master 0slave sending to master 0 [9] slave receiving from master 0 [0] master receiving from slave 1 [16] [1] slave sending to master 0 [1] slave receiving from master 0 [0] (0 1 2) [0] master receiving from slave 2 slave sending to master 0[19] slave sending to master 0 [19] slave receiving from master 0 [21] slave sending to master 0 [21] slave receiving from master 0 [16] [11] slave sending to master 0 [11] slave receiving from master 0 slave receiving from master [3] slave sending to master 0 [3] slave receiving from master 0 [6] slave sending to master 0 [6] slave receiving from master 0[7] slave sending to master 0 [7] slave receiving from master 0 0 [2] slave sending to master 0 [0] (0 1 2)[2] slave receiving from master 0 [0] master receiving from slave 3 [0] (0 1 2) [0] master receiving from slave 4 [10] slave sending to master 0 [10] slave receiving from master 0 [18] [5] slave sending to master 0 [18] slave receiving from master 0slave sending to master 0 [5] slave receiving from master 0 [4] slave sending to master 0 [0] (0 1 2) [0] master receiving from slave 5[4] [0] (0 1 2) [0] master receiving from slave 6 [0] (0 1 2) [0] master receiving from slave 7 [0] (0 1 2) [0] master receiving from slave 8 [0] (0 1 2) [0] master receiving from slave 9 [0] (0 1 2) [0] master receiving from slave 10 [0] (0 1 2) [0] master receiving from slave 11 [0] (0 1 2) [0] master receiving from slave 12 [0] (0 1 2) [0] master receiving from slave 13 [0] (0 1 2) [0] master receiving from slave 14 [0] (0 1 2) [0] master receiving from slave 15 [0] (0 1 2) [0] master receiving from slave 16 [0] (0 1 2) [0] master receiving from slave 17 slave receiving from master 0 [0] [17] slave sending to master 0 [17] slave receiving from master 0 (0 1 2) [0] master receiving from slave 18 [0] (0 1 2) [0] master receiving from slave 19 [0] (0 1 2) [0] master receiving from slave 20 [0] (0 1 2) [0] master receiving from slave 21 [0] (0 1 2) [0] master receiving from slave 22 [0] (0 1 2) [0] master receiving from slave 23 [0] (0 1 2) [0] master sending to slave 1 [0] [1] (0 1 2) master sending to slave 2 [0] [2] (0 1 2) master sending to slave 3 [0] master sending to slave 4 [0] master sending to slave 5 [0] master sending to slave 6 [0] master sending to slave 7 [0] master sending to slave 8 [0] master sending to slave 9 [0] master sending to slave 10 [0] master sending to slave 11 [0] master sending to slave 12 [5] (0 1 2) [3] (0 1 2) [4] (0 1 2) [8] (0 1 2) [6] (0 1 2) [11] (0 1 2) [0] master sending to slave 13 [0] master sending to slave 14 [0] master sending to slave 15 [0] master sending to slave 16 [0] master sending to slave 17 [0] master sending to slave 18 [0] master sending to slave 19 [0] master sending to slave 20 [0] master sending to slave 21 [0] master sending to slave 22 [0] master sending to slave 23 [10] (0 1 2) [7] (0 1 2) [9] (0 1 2) [13] (0 1 2) [12] [14] (0 1 2) [17] (0 1 2) [20] (0 1 2) [18] (0 1 2) [16] (0 1 2) [21] (0 1 2) [15] (0 1 2) [19] (0 1 2) [22] (0 1 2) [23] (0 1 2) (0 
1 2)
There are no error messages. But since the output is not synchronized, it's difficult to tell whether there is a problem or not. Does anything pop out at you?

Thanks for your help. I really appreciate it. |
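In case it helps anyone else hitting the same compile error, the modification was presumably just the renamed Time header, along these lines (this assumes foam-extend renamed Time.H to foamTime.H, and the file name of the copied utility source is a placeholder):
Code:
# hypothetical fix applied to a local copy of the parallelTest sources
sed -i 's/#include "Time\.H"/#include "foamTime.H"/' parallelTest.C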
|
November 17, 2015, 18:17 |
|
#5 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Hi Brent,
The output from parallelTest seems OK. Since it didn't crash, this means that at least the basic communication is working as intended with foam-extend's own Pstream mechanism.

I went back to see how you had tried to define the optimization flag and I then remembered that foam-extend does things a bit differently from OpenFOAM. Please check this post: http://www.cfd-online.com/Forums/ope...tml#post491522 - post #7

Oh, this is interesting... check this commit message as well: http://sourceforge.net/p/foam-extend...a0ca1f8ec3230/

If I understood it correctly, you can do the following:
Code:
mpirun ... simpleFoam -parallel -OptimisationSwitches commsType=nonBlocking
Bruno
|
|
November 18, 2015, 08:55 |
|
#6 |
New Member
Brent Craven
Join Date: Oct 2015
Posts: 7
Rep Power: 11 |
Hi Bruno,
Making sure commsType was set to 'nonBlocking' in this way seems to have solved my issue. Unfortunately, I wiped my previous test case, where I was trying to set it in the case controlDict, so I can't check why that approach didn't work. But regardless, it is now working and I am happy!

Thanks for your help with this! I really appreciate it.

Thanks,
Brent |
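For anyone who wants the setting to persist without passing it on every command line, it should also be possible to set it in foam-extend's installation-wide controlDict rather than the case-level one; this is a general suggestion, not something verified in this thread, and the path assumes a stock install:
Code:
// $WM_PROJECT_DIR/etc/controlDict (installation-wide dictionary, assumed location)
OptimisationSwitches
{
    commsType       nonBlocking;
}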
|
|
|