April 24, 2019, 12:06
MPI error when trying to run su2 in parallel
#1
New Member
Guilherme Pimentel
Join Date: Mar 2019
Posts: 9
Rep Power: 7
Hello guys!
I need some help... I'm trying to run a case in parallel with the command below:

mpirun -n 3 SU2_CFD AhmedBody.cfg

But every time I do it, I get this error:

--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node aerofleet-System-Product-Name exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I don't know why this error occurs... I have 16 GB of RAM and a 50 GB swap partition, and I'm using Open MPI. I can run this case in serial, but the calculation takes too long. Can someone help me with this one? I'd appreciate it a lot.
Best wishes, thank you.
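A minimal diagnostic sketch for this kind of crash, assuming a bash shell and that SU2_CFD is on the PATH (the log file name is arbitrary): capture the full output of the failing run, and allow core dumps so the segmentation fault leaves something to inspect afterwards.

Code:
# Save everything mpirun and SU2 print, including the crash messages
mpirun -n 3 SU2_CFD AhmedBody.cfg 2>&1 | tee su2_parallel.log

# Allow core files, then rerun; a core dump can later be opened with gdb
ulimit -c unlimited
mpirun -n 3 SU2_CFD AhmedBody.cfg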
May 4, 2019, 16:39
#2
Senior Member
Pedro Gomes
Join Date: Dec 2017
Posts: 466
Rep Power: 13
Hi Guilherme,
Do you get any other output before that? Assuming this is the case from the TestCases, I can run it fine.
Cheers,
Pedro
May 9, 2019, 11:44
#3
New Member
Guilherme Pimentel
Join Date: Mar 2019
Posts: 9
Rep Power: 7
Hello Pedro!
No, I didn't get any other output before that.
May 10, 2019, 06:22
#4
Senior Member
Pedro Gomes
Join Date: Dec 2017
Posts: 466
Rep Power: 13
If the code does not start at all, I suspect a compilation issue; the typical one is using different MPI versions to compile and run the code.
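A quick way to check for that mismatch, sketched under the assumption that SU2 was built against a dynamically linked Open MPI: compare the launcher found on the PATH with the MPI library the binary actually links to.

Code:
# Which launcher does "mpirun" resolve to, and which MPI version is it?
which mpirun
mpirun --version

# Which libmpi is the SU2 executable linked against?
ldd "$(which SU2_CFD)" | grep -i mpi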
May 15, 2019, 10:13
#5
New Member
Guilherme Pimentel
Join Date: Mar 2019
Posts: 9
Rep Power: 7
Hello Pedro!
I uninstalled everything and installed it all again, and it seems to be working now. But now I'm facing another problem: SU2 is running more slowly in parallel than in serial; I mean, the more cores I put on the command line (mpirun -n "x"), the slower it gets... Do you have any idea what it might be? Thank you ^^
May 16, 2019, 06:28
#6
Senior Member
Pedro Gomes
Join Date: Dec 2017
Posts: 466
Rep Power: 13
Hi Guilherme,
If the output is also scrambled, e.g. iteration X being printed multiple times, that means you are launching multiple serial instances instead of one parallel run, which again happens when the MPI version used to run the code is not the same as the one used to compile it.
If that is not the case, please describe in detail the steps you are following to compile and run the code.
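One hedged way to test for this, assuming mpi4py is installed (it is mentioned in the next post) under the python interpreter used here: launch a tiny script that prints each rank and the communicator size.

Code:
# A genuinely parallel launch prints ranks 0-3 with size 4;
# four independent serial copies all print "rank 0 of 1"
mpirun -n 4 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print('rank', c.Get_rank(), 'of', c.Get_size())"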
May 16, 2019, 09:29
#7
New Member
Guilherme Pimentel
Join Date: Mar 2019
Posts: 9
Rep Power: 7
Hello Pedro, thank you for helping me.
I faced that problem before (iterations being printed multiple times), but it's not happening anymore. I first installed mpi4py and then I installed Open MPI. Then I compiled SU2 with the following command:

./configure --prefix=/$HOME/SU2 CXXFLAGS="-O3" --enable-mpi --with-cc=/$HOME/OpenMpi/bin/mpicc --with-cxx=/$HOME/OpenMpi/bin/mpicxx

Then I ran:

sudo make -j 8 install

I didn't get any error output and the installation seems to have been successful. So I ran the quick start case with:

mpirun -n 4 SU2_CFD inv_NACA0012.cfg

However, running in parallel seems slower than running in serial, and a curious thing I noticed is that when the computation ends, it doesn't give me the output "the calculation finished in 'n' cores!" or anything like that. If it's needed I can paste the outputs here. Thank you again! ^^
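A small note on the build step, as a sketch: an install prefix under $HOME should not need sudo, and running make under sudo resets the environment, which can make the build pick up a different MPI than intended. A plain non-root build with the same wrappers would look like this.

Code:
./configure --prefix=$HOME/SU2 CXXFLAGS="-O3" --enable-mpi \
    --with-cc=$HOME/OpenMpi/bin/mpicc --with-cxx=$HOME/OpenMpi/bin/mpicxx
make -j 8 install    # no sudo needed for a prefix inside $HOME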
May 16, 2019, 10:02
#8
New Member
Bhupinder Singh Sanghera
Join Date: Jun 2018
Posts: 5
Rep Power: 8
Hi Pedro,
I am also getting an error message when I try to run in parallel, saying:

'mpirun noticed that process rank 2 with PID 0 on node node-xxx exited on signal 11 (Segmentation fault).'

When I run in serial I get the error:

'SU2_CFD:xxxxxx terminated with signal 11 at PC=xxxxxx SP=xxxxxxx'

I get both of these errors only for one particular case. When I try to run the Quickstart simulation, it works both in serial and in parallel, which seems quite weird to me. I read on another thread that it may have to do with the RAM/swap memory?
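Since RAM/swap is one suspicion, a simple check (a sketch, assuming a Linux system with GNU time installed at /usr/bin/time; failing_case.cfg stands in for the actual configuration file) is to record the peak memory of the serial run of the failing case.

Code:
# "Maximum resident set size" in the report is the peak memory of the run;
# if it approaches the installed RAM, running out of memory is plausible
/usr/bin/time -v SU2_CFD failing_case.cfg

# Watching overall memory while the case runs is another rough indicator
free -h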
May 20, 2019, 07:17
#9
Senior Member
Pedro Gomes
Join Date: Dec 2017
Posts: 466
Rep Power: 13
Hi Guilherme,
Type "which mpirun" on a terminal and see if the executable that the OS finds is inside $HOME/OpenMpi/bin. Alternatively try running $HOME/OpenMpi/bin/mpirun -n 4 SU2_CFD inv_NACA0012.cfg Hi Bhupinder, Signal 11 is a segmentation fault, it may be a problem of the mesh, the combination of settings you are trying to use not being valid, or the code. Try starting from something you know it works (like the quickstart) and go from there. |
May 31, 2019, 08:46
#10
New Member
Guilherme Pimentel
Join Date: Mar 2019
Posts: 9
Rep Power: 7
Hello Pedro, sorry for my very late response... I was focused on another project...
I typed "which mpirun" and I got "/usr/bin/mpirun" When I tried " $HOME/OpenMpi/bin/mpirun -n 4 SU2_CFD inv_NACA0012.cfg" I got many errors: It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_mpi_init: ompi_rte_init failed --> Returned "(null)" (-43) instead of "Success" (0) -------------------------------------------------------------------------- -------------------------------------------------------------------------- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_mpi_init: ompi_rte_init failed --> Returned "(null)" (-43) instead of "Success" (0) -------------------------------------------------------------------------- -------------------------------------------------------------------------- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_mpi_init: ompi_rte_init failed --> Returned "(null)" (-43) instead of "Success" (0) -------------------------------------------------------------------------- -------------------------------------------------------------------------- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_mpi_init: ompi_rte_init failed --> Returned "(null)" (-43) instead of "Success" (0) -------------------------------------------------------------------------- *** An error occurred in MPI_Init *** on a NULL communicator *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, *** and potentially your MPI job) [aerofleet-System-Product-Name:30145] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed! *** An error occurred in MPI_Init *** on a NULL communicator *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, *** and potentially your MPI job) [aerofleet-System-Product-Name:30146] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed! 
*** An error occurred in MPI_Init *** on a NULL communicator *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, *** and potentially your MPI job) [aerofleet-System-Product-Name:30147] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed! *** An error occurred in MPI_Init *** on a NULL communicator *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, *** and potentially your MPI job) [aerofleet-System-Product-Name:30144] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed! -------------------------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. -------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[40496,1],3] Exit code: 1 -------------------------------------------------------------------------- What does it means? Thank you again... |
May 31, 2019, 20:36
#11
Senior Member
Pedro Gomes
Join Date: Dec 2017
Posts: 466
Rep Power: 13
You seem to have two MPI versions installed: one in a system location (/usr), the other in your home folder.
Try compiling with:

./configure --prefix=/$HOME/SU2 CXXFLAGS="-O3" --enable-mpi --with-cc=/usr/bin/mpicc --with-cxx=/usr/bin/mpicxx

and running with "mpirun -n ...". If that does not work I have no idea.
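A sketch of that rebuild, assuming the SU2 source tree provides the usual "make clean" target (cleaning first avoids mixing object files compiled against the other MPI):

Code:
make clean
./configure --prefix=/$HOME/SU2 CXXFLAGS="-O3" --enable-mpi \
    --with-cc=/usr/bin/mpicc --with-cxx=/usr/bin/mpicxx
make -j 8 install

# Run with the matching system launcher
mpirun -n 4 SU2_CFD inv_NACA0012.cfg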
June 3, 2019, 04:38
#12
New Member
Nitish Anand
Join Date: Sep 2016
Location: Netherlands
Posts: 12
Rep Power: 10
Have you made changes to the code? If you run it with gdb, you should see the exact function where the issue is.
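A minimal gdb sketch for the serial case, assuming SU2 was compiled with debug symbols (e.g. -g added to CXXFLAGS) so the backtrace shows function names; your_case.cfg stands in for the failing configuration.

Code:
# Launch the serial binary under the debugger
gdb --args SU2_CFD your_case.cfg
# Inside gdb:
#   run          <- starts the case and stops at the segmentation fault
#   backtrace    <- prints the call stack, i.e. the function where it crashed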
Tags: mpi, su2, su2 error