
OpenFoam v1812 over Infiniband

July 12, 2019, 06:23   #1
Augustin (Augustin.h), New Member, Join Date: Jan 2019, Posts: 6
Hi,

I have a problem launching OpenFOAM with mpirun --hostfile.

I have two servers running Ubuntu 18.04 with 32 cores each and OpenFOAM v1812.

I've linked the two servers with an InfiniBand connection, and I would like to run a calculation on both machines to use all 64 cores. The link is working: I tried a simple "Hello World" script and it prints "Hello World" 64 times, which is nice but not very useful.

I use the command:
/usr/local/lib/openMPI-4/bin/mpirun -np 64 --hostfile hostfile --mca btl_openib_allow_ib true snappyHexMesh
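where the hostfile simply lists both machines with their slot counts, something like this (contents shown only as an illustration):
Code:
maui slots=32
oahu slots=32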

I get this error
Code:
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 16; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       oahu
Executable: /opt/openfoam1812/OpenFOAM-v1812/platforms/linux64GccDPInt32Opt/bin/snappyHexMesh
--------------------------------------------------------------------------
The path to the executable is right and is the same on both servers.

I also tried with the full path of snappyHexMesh.

But then the slave server throws:
Code:
/opt/openfoam1812/OpenFOAM-v1812/platforms/linux64GccDPInt32Opt/bin/snappyHexMesh: error while loading shared libraries: libfiniteVolume.so: cannot open shared object file: No such file or directory
Does anyone have an idea?

Cheers

Augustin

July 14, 2019, 08:06   #2
Bruno Santos (wyldckat), Retired Super Moderator, Join Date: Mar 2009, Location: Lisbon, Portugal, Posts: 10,981, Blog Entries: 45
Quick answer: OpenFOAM's script foamJob can do the necessary tuning for you, so that you don't need to worry about writing all of the lengthy commands.

If you simply run:
Code:
foamJob -p -s snappyHexMesh
it will do everything else for you. It will even show, at the start of the execution, the command that was used to run mpirun.
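Note that with the -p option foamJob takes the number of processes from the case's system/decomposeParDict, so for a 64-core run across both machines that dictionary would need something along these lines (entries shown only as an illustration):
Code:
numberOfSubdomains  64;

method              scotch;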

As for selecting Infiniband by default, I believe that Open-MPI will try to use all available network interfaces and choose the best performing one.

July 15, 2019, 06:00   #3
Augustin (Augustin.h), New Member, Join Date: Jan 2019, Posts: 6
Hi, thanks for the answer. foamExec was not present in the v1812 version, so I added the executable from the v1806 version, but I got the following error:

Code:
cws@maui:~/Molokai/bench/run_32$ foamJob -p -s snappyHexMesh
Parallel processing using SYSTEMOPENMPI with 32 processors
Executing: /usr/local/lib/openMPI-4/bin/mpirun -np 32 -hostfile hostfile -x FOAM_SETTINGS /opt/openfoam1812/OpenFOAM-v1812/bin/foamExec snappyHexMesh -parallel | tee  log
[maui:03969] Warning: could not find environment variable "FOAM_SETTINGS"
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              oahu
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   oahu
  Local device: mlx5_0
--------------------------------------------------------------------------
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  v1812                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : v1812 OPENFOAM=1812
Arch   : "LSB;label=32;scalar=64"
Exec   : snappyHexMesh -parallel
Date   : Jul 15 2019
Time   : 10:57:47
Host   : maui
PID    : 3978
I/O    : uncollated
[maui:3978 :0:3978] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /usr/lib/libucs.so.0(+0x1ec4c) [0x7fae62279c4c]
    1  /usr/lib/libucs.so.0(+0x1eec4) [0x7fae62279ec4]
===================
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: oahu
  Local PID:  26340
  Peer host:  maui
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node maui exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[maui:03969] 31 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[maui:03969] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[maui:03969] 31 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[maui:03969] 14 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up

July 15, 2019, 18:12   #4
Bruno Santos (wyldckat), Retired Super Moderator, Join Date: Mar 2009, Location: Lisbon, Portugal, Posts: 10,981, Blog Entries: 45
Quick answer: Looks like Open-MPI 4 has gotten a lot pickier about how it works... A bit of online searching for "btl_openib_allow_ib" led me to this thread and its solution: https://github.com/open-mpi/ompi/issues/6300

Try running:
Code:
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx5_0:1"
before running foamJob.
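If you end up calling mpirun by hand instead of through foamJob, the same settings can also be passed as MCA parameters on the command line, roughly like this (illustrative, mirroring your original command):
Code:
mpirun --mca btl_openib_allow_ib true --mca btl_openib_if_include mlx5_0:1 -np 64 --hostfile hostfile snappyHexMesh -parallel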

July 16, 2019, 13:31   #5
Augustin (Augustin.h), New Member, Join Date: Jan 2019, Posts: 6
Hi Bruno,

I added the following lines to ~/.bashrc on both servers, with mlx5_0 for the first server and mlx5_1 for the second one, as the IB cable is plugged into a different adapter on each machine (I can see this with the command ibstat):

Code:
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx5_0:1"

I also added a link between my OpenMPI 4 installation and the OpenFOAM bin directory, so mpirun is now picked up from the following location:
Code:
/opt/openfoam1812/OpenFOAM-v1812/bin/mpirun
because it was complaining about not finding mpicc, mpirun, orterun, etc.
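For reference, the link is just a symlink along these lines (source path as on my system):
Code:
sudo ln -s /usr/local/lib/openMPI-4/bin/mpirun /opt/openfoam1812/OpenFOAM-v1812/bin/mpirun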

I still get an error, but I am not sure what is causing it.

I wonder if it is the first warning complaining about FOAM_SETTINGS, or the OpenFabrics device that is found but has no active port, which is weird because ibstat gives:


Code:
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.25.1020
        Hardware version: 0
        Node GUID: 0x98039b03000345d1
        System image GUID: 0x98039b03000345d0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 3
                LMC: 0
                SM lid: 3
                Capability mask: 0x2651e84a
                Port GUID: 0x98039b03000345d1
                Link layer: InfiniBand
I also attach my log below; I will keep looking into that OpenFabrics error.


Code:
Parallel processing using SYSTEMOPENMPI with 32 processors
Executing: /opt/openfoam1812/OpenFOAM-v1812/bin/mpirun -np 32 -hostfile hostfile -x FOAM_SETTINGS /opt/openfoam1812/OpenFOAM-v1812/bin/foamExec snappyHexMesh -parallel | tee  log
[maui:16920] Warning: could not find environment variable "FOAM_SETTINGS"
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: oahu
--------------------------------------------------------------------------
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  v1812                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : v1812 OPENFOAM=1812
Arch   : "LSB;label=32;scalar=64"
Exec   : snappyHexMesh -parallel
Date   : Jul 16 2019
Time   : 18:22:11
Host   : maui
PID    : 16928
I/O    : uncollated
[maui:16928:0:16928] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /usr/lib/libucs.so.0(+0x1ec4c) [0x7f3c62cfec4c]
    1  /usr/lib/libucs.so.0(+0x1eec4) [0x7f3c62cfeec4]
===================
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: oahu
  Local PID:  2213
  Peer host:  maui
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node maui exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[maui:16920] 15 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[maui:16920] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[maui:16920] 14 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up

July 16, 2019, 17:53   #6
Bruno Santos (wyldckat), Retired Super Moderator, Join Date: Mar 2009, Location: Lisbon, Portugal, Posts: 10,981, Blog Entries: 45
Quick answer: I have a few suggestions to try and guide you in the right direction, since I will not be able to test this myself in the coming months (Infiniband + Open-MPI 4 is hard to come by). So, the suggestions:
  1. On the blueCFD-Core project that I manage and work on, I have a test application there named "parallelMin", available here: https://github.com/blueCFD/OpenFOAM-...st/parallelMin
    • Download the files and folder structure for that folder. Then build it with the conventional OpenFOAM command:
      Code:
      wmake
    • The application is extremely bare-bones and does not link to OpenFOAM.
    • You can simply run it with:
      Code:
      mpirun -np 32 -hostfile hostfile parallelMin
    • It should give you text output with the MPI rank, processor name and number of processes in this job (a minimal sketch of what such a test looks like is shown after this list).
    • This will allow you to more easily isolate and conquer the specific MPI settings that you need.
  2. The other suggestion is that if you are unable to get Open-MPI to work with the test application above, then ask about this at the Open-MPI issue tracker, which I guess is this one: https://github.com/open-mpi/ompi/issues - or try their mailing list... or check their FAQ, which does have at least a few entries on this topic: https://www.open-mpi.org/faq/?catego...abrics#run-ucx
  3. Any chance you can go back to an older Open-MPI version, or is version 4 the only one you can use?
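For reference, a minimal sketch of what such a test program looks like (this is not the actual parallelMin source, just the same idea: print the rank, processor name and job size, with no data transfer):
Code:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);               /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
    MPI_Get_processor_name(name, &len);   /* host this rank runs on */

    printf("Process %d on %s out of %d\n", rank, name, size);

    MPI_Finalize();
    return 0;
}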

July 17, 2019, 11:38   #7
Augustin (Augustin.h), New Member, Join Date: Jan 2019, Posts: 6
Hi Bruno,

I couldn't get your application to compile with wmake, but I compiled it directly with mpicc. In fact I had already tried code like that and it worked, but I still get the OpenFabrics warning:

Code:
cws@maui:~/Molokai/test_CFDonline/parallelMin$ mpirun -np 4 -hostfile hostfile Test_parallelMin
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: oahu
--------------------------------------------------------------------------
Process 1 on maui out of 4
Process 0 on maui out of 4
Process 2 on oahu out of 4
Process 3 on oahu out of 4
[maui:22732] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[maui:22732] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
(I ran with only 4 CPUs for readability, but I get the same output, just with more lines, with 32 and 64 cores.)
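For reference, I compiled it with something along these lines (file names illustrative):
Code:
mpicc parallelMin.c -o Test_parallelMin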

That's why I don't understand why it is not working: it works with a simple C code. I also unplugged all the Ethernet cables from the "slave" server to make sure the traffic was going through the InfiniBand link; the result is the same: it works with the simple C code but not with OpenFOAM.

Augustin

July 17, 2019, 20:03   #8
Bruno Santos (wyldckat), Retired Super Moderator, Join Date: Mar 2009, Location: Lisbon, Portugal, Posts: 10,981, Blog Entries: 45
Quick answer: Why didn't I think of this before... What I mean is that you should report this to the issue tracker at OpenFOAM.com, since it's their version: https://develop.openfoam.com/Develop...M-plus/issues/

They will certainly be interested in this issue, especially since it's possibly a compatibility issue with Open-MPI 4 and newer.

I only connected the dots just now because of this error line you gave in a previous comment:
Code:
Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Also, have you tested with any other MPI test code that actually transfers data via MPI?
Because these simple C examples mostly rely only on the shell environment variables and don't transfer any data.
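Something as small as the following would already exercise a real transfer; it is only a rough sketch (run it with -np 2 and your hostfile):
Code:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        value = 42;
        /* send one integer from rank 0 to rank 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        /* receive it on rank 1 and print it */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}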

July 18, 2019, 07:12   #9
Augustin (Augustin.h), New Member, Join Date: Jan 2019, Posts: 6
It looks like there is an OpenMPI problem, or something to do with the InfiniBand. I used the following code, which exchanges a variable between two processes:

https://github.com/wesleykendall/mpi...de/ping_pong.c

and I get

Code:
cws@maui:~/Molokai/sendAndReceive$ mpirun -np 2 --hostfile host ping_pong
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: oahu
--------------------------------------------------------------------------
[maui:04742] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:121 Error: Failed to receive UCX worker address: Not found (-13)
[maui:04742] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:389 Error: Failed to resolve UCX endpoint for rank 1
[maui:04742] *** An error occurred in MPI_Send
[maui:04742] *** reported by process [4035313665,0]
[maui:04742] *** on communicator MPI_COMM_WORLD
[maui:04742] *** MPI_ERR_OTHER: known error not in list
[maui:04742] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[maui:04742] ***    and potentially your MPI job)
The small C script that you sent me is still working.

I posted on the forum; I hope they will find something.

Cheers

Augustin

----
For future reference: https://develop.openfoam.com/Develop...us/issues/1379

Last edited by wyldckat; July 23, 2019 at 19:23. Reason: added "For future reference"

August 9, 2019, 04:16   #10
Augustin (Augustin.h), New Member, Join Date: Jan 2019, Posts: 6
Hi,

I managed to get InfiniBand working on two new servers with the default OpenMPI (2.1.1) from apt-get.
It still doesn't work for my other two servers with OpenMPI 4, so it looks like the problem is the OpenMPI version.
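For reference, that is just the stock Ubuntu 18.04 packages, installed with something like this (package names from memory):
Code:
sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev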

Augustin
