July 12, 2019, 06:23 |
OpenFOAM v1812 over InfiniBand
|
#1 |
New Member
Augustin
Join Date: Jan 2019
Posts: 6
Rep Power: 7 |
Hi,
I have a problem launching OpenFOAM with mpirun --hostfile. I have two servers running Ubuntu 18.04 with 32 cores each and OpenFOAM v1812. I've linked the two servers with an InfiniBand connection and would like to run a calculation across both machines, so that I can use 64 cores. The link itself works: with a simple "Hello world" script I can print "Hello world" 64 times, which is nice but not very useful. I use the command:
Code:
/usr/local/lib/openMPI-4/bin/mpirun -np 64 --hostfile hostfile --mca btl_openib_allow_ib true snappyHexMesh
and I get this error:
Code:
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore did
not launch the job.  This error was first reported for process
rank 16; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       oahu
Executable: /opt/openfoam1812/OpenFOAM-v1812/platforms/linux64GccDPInt32Opt/bin/snappyHexMesh
--------------------------------------------------------------------------
I also tried with the full path to snappyHexMesh, but then the slave server throws:
Code:
/opt/openfoam1812/OpenFOAM-v1812/platforms/linux64GccDPInt32Opt/bin/snappyHexMesh: error while loading shared libraries: libfiniteVolume.so: cannot open shared object file: No such file or directory
Cheers,
Augustin
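For reference, the "cannot open shared object file" error on the remote server usually means that the shell mpirun spawns there never sourced the OpenFOAM environment, so LD_LIBRARY_PATH is empty; note also that OpenFOAM utilities need the -parallel option to run as a single distributed job. A minimal sketch of two possible workarounds, assuming the same install path on both servers (the exact variables forwarded are an assumption, not something tested in this thread):
Code:
# Option 1: on the remote node, source the OpenFOAM environment in ~/.bashrc,
# near the top, before Ubuntu's "if not interactive, return" check:
#   source /opt/openfoam1812/OpenFOAM-v1812/etc/bashrc

# Option 2: forward the environment from the master when launching:
/usr/local/lib/openMPI-4/bin/mpirun -np 64 --hostfile hostfile \
    -x PATH -x LD_LIBRARY_PATH \
    --mca btl_openib_allow_ib true \
    snappyHexMesh -parallel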
|
July 14, 2019, 08:06 |
|
#2 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Quick answer: OpenFOAM's script foamJob can do the necessary tuning for you, so that you don't need to worry about writing all of the lengthy commands.
If you simply run:
Code:
foamJob -p -s snappyHexMesh
As for selecting InfiniBand by default, I believe that Open-MPI will try to use all available network interfaces and choose the best performing one.
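For context, judging from the "Executing:" line echoed in the logs later in this thread, foamJob -p -s roughly expands to something like the following; the process count comes from decomposeParDict and the paths from the local install, so treat this as a sketch rather than the script's literal behaviour:
Code:
mpirun -np <nProcs> -hostfile hostfile -x FOAM_SETTINGS \
    $WM_PROJECT_DIR/bin/foamExec snappyHexMesh -parallel | tee log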
|
|
July 15, 2019, 06:00 |
|
#3 |
New Member
Augustin
Join Date: Jan 2019
Posts: 6
Rep Power: 7 |
Hi, thanks for the answer. foamExec was not present in the v1812 version, so I added the executable from the v1806 version, but I got the following error:
Code:
cws@maui:~/Molokai/bench/run_32$ foamJob -p -s snappyHexMesh
Parallel processing using SYSTEMOPENMPI with 32 processors
Executing: /usr/local/lib/openMPI-4/bin/mpirun -np 32 -hostfile hostfile -x FOAM_SETTINGS /opt/openfoam1812/OpenFOAM-v1812/bin/foamExec snappyHexMesh -parallel | tee log
[maui:03969] Warning: could not find environment variable "FOAM_SETTINGS"
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:    oahu
  Local adapter: mlx5_0
  Local port:    1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   oahu
  Local device: mlx5_0
--------------------------------------------------------------------------
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  v1812                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : v1812 OPENFOAM=1812
Arch   : "LSB;label=32;scalar=64"
Exec   : snappyHexMesh -parallel
Date   : Jul 15 2019
Time   : 10:57:47
Host   : maui
PID    : 3978
I/O    : uncollated
[maui:3978 :0:3978] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
 0 /usr/lib/libucs.so.0(+0x1ec4c) [0x7fae62279c4c]
 1 /usr/lib/libucs.so.0(+0x1eec4) [0x7fae62279ec4]
===================
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: oahu
  Local PID:  26340
  Peer host:  maui
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node maui exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[maui:03969] 31 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[maui:03969] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[maui:03969] 31 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[maui:03969] 14 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
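One observation on the log above: both the openib BTL and UCX (the libucs.so frames in the backtrace) are in play, and with Open MPI 4.x the usual advice is to pick one transport explicitly rather than mixing them. A sketch of what that could look like here, not something tested in this thread, and the forwarded variables are assumptions:
Code:
# let UCX drive the InfiniBand hardware and disable the legacy openib BTL
/usr/local/lib/openMPI-4/bin/mpirun -np 32 -hostfile hostfile \
    --mca pml ucx --mca btl ^openib \
    -x PATH -x LD_LIBRARY_PATH \
    /opt/openfoam1812/OpenFOAM-v1812/bin/foamExec snappyHexMesh -parallel | tee log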
|
July 15, 2019, 18:12 |
|
#4 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Quick answer: Looks like Open-MPI 4 has gotten a lot pickier about how it works... A bit of online searching for "btl_openib_allow_ib" turned up this thread and its solution: https://github.com/open-mpi/ompi/issues/6300
Try running:
Code:
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx5_0:1"
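Those settings only help if they are visible to every rank, so either export them on both machines or pass the equivalent MCA parameters on the mpirun command line. A sketch of the command-line form (untested here; the adapter name is the one reported by ibstat):
Code:
mpirun -np 64 --hostfile hostfile \
    --mca btl_openib_allow_ib 1 \
    --mca btl_openib_if_include mlx5_0:1 \
    snappyHexMesh -parallel

# or forward the already-exported variables explicitly to the remote ranks:
mpirun -np 64 --hostfile hostfile \
    -x OMPI_MCA_btl_openib_allow_ib -x OMPI_MCA_btl_openib_if_include \
    snappyHexMesh -parallel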
|
|
July 16, 2019, 13:31 |
|
#5 |
New Member
Augustin
Join Date: Jan 2019
Posts: 6
Rep Power: 7 |
Hi Bruno,
I added the following lines to ~/.bashrc on both servers (with mlx5_0 for the first server and mlx5_1 for the second one, since the IB cable is plugged into a different port on each machine, which I can see with the command ibstat):
Code:
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx5_0:1"
I also added a link between my OpenMPI 4 and the OpenFOAM bin directory, so mpirun is now taken from the following location:
Code:
/opt/openfoam1812/OpenFOAM-v1812/bin/mpirun
I still get an error, but I am not sure what is causing it: either the first warning complaining about FOAM_SETTINGS, or the OpenFabrics device that is found but reported with no active ports, which is weird because ibstat gives:
Code:
CA 'mlx5_1'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.25.1020
    Hardware version: 0
    Node GUID: 0x98039b03000345d1
    System image GUID: 0x98039b03000345d0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 3
        LMC: 0
        SM lid: 3
        Capability mask: 0x2651e84a
        Port GUID: 0x98039b03000345d1
        Link layer: InfiniBand
Code:
Parallel processing using SYSTEMOPENMPI with 32 processors
Executing: /opt/openfoam1812/OpenFOAM-v1812/bin/mpirun -np 32 -hostfile hostfile -x FOAM_SETTINGS /opt/openfoam1812/OpenFOAM-v1812/bin/foamExec snappyHexMesh -parallel | tee log
[maui:16920] Warning: could not find environment variable "FOAM_SETTINGS"
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: oahu
--------------------------------------------------------------------------
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  v1812                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : v1812 OPENFOAM=1812
Arch   : "LSB;label=32;scalar=64"
Exec   : snappyHexMesh -parallel
Date   : Jul 16 2019
Time   : 18:22:11
Host   : maui
PID    : 16928
I/O    : uncollated
[maui:16928:0:16928] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
 0 /usr/lib/libucs.so.0(+0x1ec4c) [0x7f3c62cfec4c]
 1 /usr/lib/libucs.so.0(+0x1eec4) [0x7f3c62cfeec4]
===================
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: oahu
  Local PID:  2213
  Peer host:  maui
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node maui exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[maui:16920] 15 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[maui:16920] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[maui:16920] 14 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
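Since the active adapter is mlx5_0 on one server and mlx5_1 on the other, and the "no active ports" warning comes from oahu, it may be worth double-checking that each host's ~/.bashrc really selects its own active port. A sketch of one way to do that from a single shared ~/.bashrc (hostnames and adapter names are taken from this thread, but which host has which adapter is an assumption):
Code:
export OMPI_MCA_btl_openib_allow_ib=1
case "$(hostname -s)" in
    maui) export OMPI_MCA_btl_openib_if_include="mlx5_0:1" ;;
    oahu) export OMPI_MCA_btl_openib_if_include="mlx5_1:1" ;;
esac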
|
July 16, 2019, 17:53 |
|
#6 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Quick answer: I have a few suggestions to try and guide you in the right direction, since I will not be able to test this myself in the coming months (InfiniBand + Open-MPI 4 is hard to come by). The main suggestion: compile the small parallel test application with wmake and run it across both machines with the same mpirun and hostfile, to check whether basic MPI communication works outside of a full OpenFOAM solver run; a rough outline is sketched below.
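What that build-and-run step could look like (the directory name, application name and install location are assumptions for illustration, not the exact files discussed in this post):
Code:
# load the OpenFOAM environment, then build the test application with wmake
source /opt/openfoam1812/OpenFOAM-v1812/etc/bashrc
cd parallelMin            # folder with Test_parallelMin.C and a Make/ sub-folder
wmake                     # builds the executable (commonly into $FOAM_USER_APPBIN,
                          # depending on what Make/files specifies)

# run it across both hosts with the same MPI and hostfile as the real case
mpirun -np 4 -hostfile hostfile $FOAM_USER_APPBIN/Test_parallelMin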
|
|
July 17, 2019, 11:38 |
|
#7 |
New Member
Augustin
Join Date: Jan 2019
Posts: 6
Rep Power: 7 |
Hi Bruno,
I couldn't get your application compiled with wmake, but I compiled it directly with mpicc. In fact I had already tried code like that and it worked, but I still get the OpenFabrics warning:
Code:
cws@maui:~/Molokai/test_CFDonline/parallelMin$ mpirun -np 4 -hostfile hostfile Test_parallelMin
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: oahu
--------------------------------------------------------------------------
Process 1 on maui out of 4
Process 0 on maui out of 4
Process 2 on oahu out of 4
Process 3 on oahu out of 4
[maui:22732] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[maui:22732] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
That's why I don't understand what is going wrong: it works with a simple C code. I also unplugged all the Ethernet cables from the "slave" server to make sure the traffic was going through the InfiniBand link; the result is the same: it works with the simple C code but not with OpenFOAM.
Augustin
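Since the plain MPI test runs but OpenFOAM does not, it might help to pin down which transport the OpenFOAM run is actually dying on, for example by repeating it once over plain TCP and once with UCX forced and verbose. These are generic diagnostics, not commands taken from this thread:
Code:
# force plain TCP (no InfiniBand) to see whether the segfault is transport-related
mpirun -np 32 -hostfile hostfile --mca pml ob1 --mca btl tcp,self,vader \
    foamExec snappyHexMesh -parallel | tee log.tcp

# force UCX and make it more talkative
mpirun -np 32 -hostfile hostfile --mca pml ucx -x UCX_LOG_LEVEL=info \
    foamExec snappyHexMesh -parallel | tee log.ucx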
|
July 17, 2019, 20:03 |
|
#8 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Quick answer: Why didn't I think of this before... What I mean is that you should report this to the issue tracker at OpenFOAM.com, since it's their version: https://develop.openfoam.com/Develop...M-plus/issues/
They will certainly be interested in this issue, especially since it's possibly a compatibility issue with Open-MPI 4 and newer. I only connected the dots right now because of this error line you gave in a previous comment:
Code:
Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
These simple C examples mostly rely only on the shell environment and barely transfer any data, which is why they can run fine while OpenFOAM, which actually exchanges data between ranks, crashes.
|
July 18, 2019, 07:12 |
|
#9 |
New Member
Augustin
Join Date: Jan 2019
Posts: 6
Rep Power: 7 |
It looks like there is an OpenMPI problem, or something to do with the InfiniBand. I used the following code, which exchanges a variable between two processes:
https://github.com/wesleykendall/mpi...de/ping_pong.c
and I get:
Code:
cws@maui:~/Molokai/sendAndReceive$ mpirun -np 2 --hostfile host ping_pong
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: oahu
--------------------------------------------------------------------------
[maui:04742] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:121  Error: Failed to receive UCX worker address: Not found (-13)
[maui:04742] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:389  Error: Failed to resolve UCX endpoint for rank 1
[maui:04742] *** An error occurred in MPI_Send
[maui:04742] *** reported by process [4035313665,0]
[maui:04742] *** on communicator MPI_COMM_WORLD
[maui:04742] *** MPI_ERR_OTHER: known error not in list
[maui:04742] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[maui:04742] ***    and potentially your MPI job)
I have posted the issue; I hope they will find something.
Cheers,
Augustin
----
For future reference: https://develop.openfoam.com/Develop...us/issues/1379
Last edited by wyldckat; July 23, 2019 at 19:23. Reason: added "For future reference"
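At this point some UCX-level checks might narrow things down further, since the error comes from pml_ucx.c. These are generic diagnostics, not something suggested in the thread, and which adapter belongs to which host is an assumption:
Code:
ucx_info -v                                    # UCX version Open MPI is linked against
ucx_info -d | grep -i -e Transport -e Device   # devices and transports UCX can see

# point UCX explicitly at the active HCA port and forward the setting to both ranks
export UCX_NET_DEVICES=mlx5_1:1
mpirun -np 2 --hostfile host -x UCX_NET_DEVICES ping_pong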
|
August 9, 2019, 04:16 |
|
#10 |
New Member
Augustin
Join Date: Jan 2019
Posts: 6
Rep Power: 7 |
Hi,
I managed to get InfiniBand working on two new servers with the default OpenMPI (2.1.1) from apt-get. It still doesn't work on my other two servers with OpenMPI 4, so it looks like the problem is the OpenMPI version.
Augustin
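For anyone hitting the same thing, a quick way to confirm which Open MPI each node and the OpenFOAM build are actually using (generic checks, not commands from the original posts):
Code:
which mpirun && mpirun --version            # the MPI picked up first in PATH on this node
ompi_info | grep -i -e "Open MPI:" -e ucx   # Open MPI build details / UCX support
echo "$WM_MPLIB $FOAM_MPI"                  # the MPI flavour OpenFOAM was configured for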
|
Tags |
hostfile mpirun |
|
|
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
how to confirm that I have already use infiniband in OpenFOAM? | Detian Liu | OpenFOAM | 4 | February 19, 2022 04:20 |
Map of the OpenFOAM Forum - Understanding where to post your questions! | wyldckat | OpenFOAM | 10 | September 2, 2021 06:29 |
UNIGE February 13th-17th - 2107. OpenFOAM advaced training days | joegi.geo | OpenFOAM Announcements from Other Sources | 0 | October 1, 2016 20:20 |
OpenFOAM Training Jan-Apr 2017, Virtual, London, Houston, Berlin | cfd.direct | OpenFOAM Announcements from Other Sources | 0 | September 21, 2016 12:50 |
OpenFOAM and infiniband | mrangitschdowcom | OpenFOAM Installation | 5 | October 30, 2008 08:47 |