
[cfMesh] Using cfMesh on HPC in parallel (with MPI) for large meshes - MPI_Bsend error

December 8, 2024, 06:18   #1
standingVortex
New Member
Join Date: Feb 2023
Posts: 4
Hello foamers,

I have been using cfMesh successfully for several years now (both on my local machine and in HPC environments) to create medium-sized meshes without any problems. This time, however, I need a large mesh of approx. 200M cells, which I would like to build on an HPC cluster using resources spread across multiple nodes.

As far as I understand, cfMesh by default uses all available CPU cores on a single node through shared-memory parallelization (SMP) with OpenMP, but I believe that is no longer sufficient, as the meshing procedure in my case is quite slow with that approach. I would therefore like to use MPI parallelization across, say, 5 nodes, where each node has 128 cores.

To do so, I did the following:
1) Prepared the case as usual: generated the FMS file and specified the corresponding meshDict settings.
2) Ran
Code:
preparePar
after setting
Code:
numberOfSubdomains 5;
in decomposeParDict, since I want to parallelize across 5 MPI tasks (a minimal decomposeParDict sketch is shown right after this list).
3) Prepared the cartesianMesh_SLURM.sh script below for the SLURM job scheduler.
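
For reference, this is roughly what my decomposeParDict for step 2 looks like (a minimal sketch; the method entry is only a placeholder, numberOfSubdomains is the setting that matters here):

Code:
// system/decomposeParDict (minimal sketch used with preparePar)
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

// One subdomain per MPI task, i.e. one per node in my setup
numberOfSubdomains 5;

// Decomposition method (placeholder / illustrative only)
method          scotch;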

cartesianMesh_SLURM.sh:

Code:
#!/bin/bash

#SBATCH --nodes=5                # Total number of nodes requested
#SBATCH --ntasks-per-node=1      # 1 MPI task per node

#SBATCH --ntasks=5

#SBATCH --cpus-per-task=128      # CPUs per task for the intra-node OpenMP parallelization cfMesh uses by default

#SBATCH --time=72:00:00
#SBATCH --mem=300000
#SBATCH --exclusive
#SBATCH --contiguous

#SBATCH --error=cartesianMesh.err
#SBATCH --output=cartesianMesh.out

# Load appropriate modules and make OpenFOAM available
module load OpenFOAM/v2206
source $FOAM_BASH

solver=cartesianMesh

# Run cfMesh with hybrid MPI + OpenMP parallelization
mpirun -np 5 $solver -parallel >> log.cartesianMesh
cartesianMesh starts running, but then it crashes. Below are the contents of the corresponding log files.

log.cartesianMesh:
Code:
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2206                                  |
|   \\  /    A nd           | Website:  www.openfoam.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : _76d719d1e6-20220624 OPENFOAM=2206 version=v2206
Arch   : "LSB;label=32;scalar=64"
Exec   : cartesianMesh -parallel
Date   : Dec 07 2024
Time   : 19:28:30
Host   : tcn941.local
PID    : 3685925
I/O    : uncollated
Case   : <path_to_rootFolder>
nProcs : 5
Hosts  :
(
    (tcn941.local 1)
    (tcn942.local 1)
    (tcn943.local 1)
    (tcn944.local 1)
    (tcn945.local 1)
)
Pstream initialized with:
    floatTransfer      : 0
    nProcsSimpleSum    : 0
    commsType          : nonBlocking
    polling iterations : 0
trapFpe: Floating point exception trapping enabled (FOAM_SIGFPE).
fileModificationChecking : Monitoring run-time modified files using timeStampMaster (fileModificationSkew 5, maxFileModificationPolls 20)
allowSystemOperations : Allowing user-supplied system call operations

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

Setting root cube size and refinement parameters
Root box (-46192.4 -42195.6 -49707.5) (56207.6 60204.4 52692.5)
Requested cell size corresponds to octree level 10
Refining boundary
Refining boundary boxes to the given size
Number of leaves per processor 1
Distributing leaves to processors
Finished distributing leaves to processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
cartesianMesh.err:

Code:
[tcn942.local.snellius.surf.nl:3684931] pml_ucx.c:738  Error: bsend: failed to allocate buffer
[tcn942.local.snellius.surf.nl:3684931] pml_ucx.c:882  Error: ucx send failed: No pending message
[tcn942:3684931] *** An error occurred in MPI_Bsend
[tcn942:3684931] *** reported by process [3133276161,1]
[tcn942:3684931] *** on communicator MPI_COMM_WORLD
[tcn942:3684931] *** MPI_ERR_OTHER: known error not in list
[tcn942:3684931] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[tcn942:3684931] ***    and potentially your MPI job)
I could not work out where this error was coming from, so after a bit of searching online I came across two different threads where it was suggested to set OMP_NUM_THREADS and MPI_BUFFER_SIZE.
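
From what I can tell (and this is just my reading of the MPI documentation, not of cfMesh internals), MPI_Bsend only succeeds if a sufficiently large user buffer has been attached beforehand with MPI_Buffer_attach, and OpenFOAM apparently sizes the buffer it attaches from the MPI_BUFFER_SIZE environment variable, which is presumably why increasing it was suggested. Just to illustrate what the error message refers to, here is a small standalone C sketch of that buffered-send pattern (not cfMesh code):

Code:
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // MPI_Bsend copies the outgoing message into a user-attached buffer;
    // if no buffer (or a too-small one) is attached, the buffered send fails.
    const int n = 1024;
    int bufSize = n*(int)sizeof(double) + MPI_BSEND_OVERHEAD;
    void* buf = malloc(bufSize);
    MPI_Buffer_attach(buf, bufSize);

    double msg[1024] = {0};
    if (rank == 0 && size > 1)
    {
        MPI_Bsend(msg, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        MPI_Recv(msg, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // Detaching waits for buffered sends to complete before the buffer is freed
    void* detached;
    int detachedSize;
    MPI_Buffer_detach(&detached, &detachedSize);
    free(buf);

    MPI_Finalize();
    return 0;
}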

Therefore I also tried adding the following lines to the SLURM script before executing cartesianMesh, but without any success either.

Code:
# Set OpenMP environment variable
# => number of threads OpenMP will use for shared-memory parallelism within each node
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    # 128

# Set MPI buffer size
export MPI_BUFFER_SIZE=200000000
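For clarity, with those additions the relevant part of cartesianMesh_SLURM.sh looks like this (only the two exports are new compared to the script above; I am assuming mpirun forwards the exported variables to the ranks on the other nodes):

Code:
# Load appropriate modules and make OpenFOAM available
module load OpenFOAM/v2206
source $FOAM_BASH

# Added: OpenMP threads per MPI task and a larger buffer for buffered MPI sends
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MPI_BUFFER_SIZE=200000000

solver=cartesianMesh

# Run cfMesh with hybrid MPI + OpenMP parallelization
mpirun -np 5 $solver -parallel >> log.cartesianMesh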
However, specifying MPI_BUFFER_SIZE and/or OMP_NUM_THREADS did not fix the issue; I still cannot run cartesianMesh in parallel with MPI.

I searched for a way of solving these MPI_Bsend/UCX issues, but unfortunately I could not figure out what to do.

Does anyone have any idea or suggestion as to what I am actually doing wrong, and how one can use this amazing meshing tool on clusters, with MPI and CPU resources distributed across multiple nodes, for large meshes?

I would greatly appreciate any insights.
Thank you so much!

Tags
cartesianmesh, cfmesh, hpc cluster, mpi, mpi errors





