MPI error when trying to run su2 in parallel

#1 | April 24, 2019, 12:06 | Gui_AP (Guilherme Pimentel), New Member
Hello guys!

I need some help. I'm trying to run a case in parallel with the command below:


mpirun -n 3 SU2_CFD AhmedBody.cfg


But every time I do, I get this error:

--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node aerofleet-System-Product-Name exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I don't know why this error occurs. I have 16 GB of RAM and a 50 GB swap partition, and I'm using Open MPI. I can run this case in serial, but the calculation takes too long.

Can someone help me with this one? I'd appreciate it a lot.

Best wishes, thank you.

#2 | May 4, 2019, 16:39 | pcg (Pedro Gomes), Senior Member
Hi Guilherme,
Do you get any other output before that?
Assuming this is the case from the TestCases folder, it runs fine for me.
Cheers,
Pedro

#3 | May 9, 2019, 11:44 | Gui_AP (Guilherme Pimentel), New Member
Hello Pedro!

No, I didn't get any output before that.

#4 | May 10, 2019, 06:22 | pcg (Pedro Gomes), Senior Member
If the code does not start at all, I suspect a compilation issue. The typical one is compiling the code with one MPI version and running it with another.
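
One rough way to look for such a mismatch is to list the MPI libraries the solver binary is dynamically linked against and compare their location with the launcher you are calling. This is only a sketch, not an official SU2 procedure: the helper name linked_mpi_libs is made up for illustration, and it assumes a Linux system with ldd and an MPI library whose file name starts with libmpi.

```shell
# Print the MPI shared libraries a binary is dynamically linked against.
# ASSUMPTION: Linux with ldd available, and an MPI library named libmpi*.
# Prints nothing if the file is missing, static, or built without MPI.
linked_mpi_libs() {
    ldd "$1" 2>/dev/null | grep -i 'libmpi' || true
}

# Typical use (SU2_CFD must be on PATH):
#   linked_mpi_libs "$(command -v SU2_CFD)"
#   mpirun --version
```

If the library path printed by the first command and the mpirun found on your PATH belong to different installations, that would point to the mismatch described above.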

#5 | May 15, 2019, 10:13 | Gui_AP (Guilherme Pimentel), New Member
Hello Pedro!
I uninstalled and reinstalled everything and it seems to be working now. But now I'm facing another problem:
SU2 runs slower in parallel than in serial. I mean, the more cores I put in the command line (mpirun -n "x"), the slower it gets. Do you have any idea what it may be? Thank you ^^

#6 | May 16, 2019, 06:28 | pcg (Pedro Gomes), Senior Member
Hi Guilherme,

If the output is also scrambled, like iteration X being printed multiple times, that means you are launching multiple serial instances instead of a parallel run. Again, this happens when the MPI version used to run the code is not the same one used to compile it.

If that is not the case, please describe in detail the steps you are following to compile and run the code.
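
To check for repeated iteration lines without eyeballing the screen, one option is to capture the output in a file and count duplicates. The helper below is a hypothetical sketch. It assumes that each iteration row starts with an integer iteration counter in the first whitespace-separated column, which may not match the screen output of every SU2 version, and the file name run.log is just an example.

```shell
# Count iteration numbers that appear more than once in a captured log.
# ASSUMPTION: iteration rows start with an integer in the first
# whitespace-separated field; adjust the pattern for your SU2 version.
repeated_iterations() {
    awk '$1 ~ /^[0-9]+$/ { seen[$1]++ }
         END { n = 0; for (i in seen) if (seen[i] > 1) n++; print n }' "$1"
}

# Typical use:
#   mpirun -n 4 SU2_CFD inv_NACA0012.cfg > run.log 2>&1
#   repeated_iterations run.log
```

A result of 0 suggests each iteration was printed once. A large number would be consistent with several serial copies of the solver running side by side.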

#7 | May 16, 2019, 09:29 | Gui_AP (Guilherme Pimentel), New Member
Hello Pedro, thank you for helping me.

I faced that problem before (iterations being printed multiple times), but it's not happening anymore.

I first installed mpi4py and then I installed Open MPI.

Then I configured SU2 with the following command:

./configure --prefix=/$HOME/SU2 CXXFLAGS="-O3" --enable-mpi --with-cc=/$HOME/OpenMpi/bin/mpicc --with-cxx=/$HOME/OpenMpi/bin/mpicxx

Then I ran:

sudo make -j 8 install

I didn't get any error output and the installation seems to have been successful.

So I ran the quick start case with:

mpirun -n 4 SU2_CFD inv_NACA0012.cfg

However, running in parallel seems slower than running in serial. One curious thing I noticed is that, when the computation ends, it doesn't give me the output "the calculation finished in "n" cores!" or something like that.

If needed, I can paste the outputs here. Thank you again! ^^

#8 | May 16, 2019, 10:02 | Sanghera (Bhupinder Singh Sanghera), New Member
Hi Pedro,

I am also getting an error message when I try to run in parallel saying:

'mpirun noticed that process rank 2 with PID 0 on node node-xxx exited on signal 11 (Segmentation fault).'

When I run in serial I get the error:

'SU2_CFD:xxxxxx terminated with signal 11 at PC=xxxxxx SP=xxxxxxx'

I get both of these errors only for one particular case. When I run the Quickstart simulation, it works both in serial and in parallel. This seems quite weird to me. I read in another thread that it may have to do with RAM/swap memory?

#9 | May 20, 2019, 07:17 | pcg (Pedro Gomes), Senior Member
Hi Guilherme,

Type "which mpirun" in a terminal and check whether the executable that the OS finds is inside $HOME/OpenMpi/bin.
Alternatively, try running $HOME/OpenMpi/bin/mpirun -n 4 SU2_CFD inv_NACA0012.cfg
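
The same check can be scripted. The function below is a small sketch with a made-up name (mpirun_under_prefix). It only tests where the first mpirun on PATH lives and says nothing about how SU2 itself was built.

```shell
# Succeed (return 0) if the first mpirun on PATH lives under the given prefix.
# Returns 1 if mpirun is found somewhere else, 2 if there is no mpirun at all.
mpirun_under_prefix() {
    found=$(command -v mpirun) || return 2
    case "$found" in
        "$1"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Typical use:
#   if mpirun_under_prefix "$HOME/OpenMpi"; then
#       echo "mpirun comes from the Open MPI build in the home folder"
#   else
#       echo "mpirun comes from somewhere else: $(command -v mpirun)"
#   fi
```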

Hi Bhupinder,

Signal 11 is a segmentation fault. It may be caused by a problem with the mesh, by an invalid combination of settings, or by the code itself. Try starting from something you know works (like the quickstart) and go from there.

#10 | May 31, 2019, 08:46 | Gui_AP (Guilherme Pimentel), New Member
Hello Pedro, sorry for my very late response. I was focused on another project.

I typed "which mpirun" and I got "/usr/bin/mpirun".

When I tried "$HOME/OpenMpi/bin/mpirun -n 4 SU2_CFD inv_NACA0012.cfg" I got many errors:


It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[aerofleet-System-Product-Name:30145] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[aerofleet-System-Product-Name:30146] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[aerofleet-System-Product-Name:30147] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[aerofleet-System-Product-Name:30144] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[40496,1],3]
Exit code: 1
--------------------------------------------------------------------------

What does it mean?

Thank you again.

#11 | May 31, 2019, 20:36 | pcg (Pedro Gomes), Senior Member
You seem to have two MPI versions installed: one in a system location (/usr) and the other in your home folder.
Try compiling with:
./configure --prefix=/$HOME/SU2 CXXFLAGS="-O3" --enable-mpi --with-cc=/usr/bin/mpicc --with-cxx=/usr/bin/mpicxx
And running with:
mpirun -n ...
If that does not work, I have no idea.
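
To see every MPI launcher the shell could pick up, and in which order, something along these lines may help. It is only an illustrative sketch with a made-up function name. In bash, "type -a mpirun" gives similar information.

```shell
# Print every executable called mpirun found on PATH, in lookup order.
# The first line printed is the one a plain "mpirun" command would start.
list_mpirun_on_path() {
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        if [ -x "$dir/mpirun" ]; then
            printf '%s\n' "$dir/mpirun"
        fi
    done
    IFS=$old_ifs
}

# Typical use:
#   list_mpirun_on_path
```

More than one line of output means several installations are visible at once, so the compiler wrappers given to configure and the mpirun used at run time should be picked from the same one.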

#12 | June 3, 2019, 04:38 | nitish_anand (Nitish Anand, Netherlands), New Member
Quote (originally posted by Sanghera):
    Hi Pedro,

    I am also getting an error message when I try to run in parallel saying:

    'mpirun noticed that process rank 2 with PID 0 on node node-xxx exited on signal 11 (Segmentation fault).'

    When I run in serial I get the error:

    'SU2_CFD:xxxxxx terminated with signal 11 at PC=xxxxxx SP=xxxxxxx'

    I get both of these errors only for one particular case. When I run the Quickstart simulation, it works both in serial and in parallel. This seems quite weird to me. I read in another thread that it may have to do with RAM/swap memory?

Hey Bhupinder,

Have you made changes to the code? If you run it under gdb, you should get the exact function where the problem occurs.
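
As a sketch of what that could look like for a serial run: the helper below only prints a gdb command line so it can be inspected before use. The function name and the config file name case.cfg are placeholders. A build with debug symbols (for example with -g added to CXXFLAGS) usually gives a more detailed backtrace.

```shell
# Print a non-interactive gdb command that runs the solver and shows a
# backtrace if it crashes. Nothing is executed here; copy the printed
# command into a terminal once it looks right.
# $1 = solver executable, $2 = config file (both supplied by the caller).
gdb_backtrace_cmd() {
    printf 'gdb -batch -ex run -ex bt --args %s %s\n' "$1" "$2"
}

# Typical use:
#   gdb_backtrace_cmd SU2_CFD case.cfg
```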