
[cfMesh] Using cfMesh on HPC in parallel (with MPI) for large meshes - MPI_Bsend error

December 8, 2024, 06:18   #1
standingVortex
New Member
Join Date: Feb 2023
Posts: 4
Hello foamers,

I have been using cfMesh successfully for several years now (both on my local machine and in HPC environments) to create medium-sized meshes without any problems. This time, however, I need a large mesh of approx. 200M cells, which I would like to build on an HPC cluster using resources spread across multiple nodes.

As far as I understand, cfMesh by default uses all available CPU cores on a single node through shared-memory parallelization (SMP) with OpenMP, but I believe that is no longer sufficient, as the meshing procedure in my case is quite slow with that approach. I would therefore like to use MPI parallelization across, say, 5 nodes, where each node has 128 cores.

To do so, I did the following:
1) Prepared the case as usual: generated the FMS file and specified the corresponding meshDict settings.
2) Ran
Code:
preparePar
after setting
Code:
numberOfSubdomains 5;
in decomposeParDict, since I want to parallelize across 5 MPI tasks (a minimal decomposeParDict sketch is shown right after this list).
3) Prepared the cartesianMesh_SLURM.sh script below for the SLURM job scheduler.
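
For reference, this is roughly what my decomposeParDict for step 2 looks like (a minimal sketch; the method entry is only a placeholder, numberOfSubdomains is the setting that matters here):

Code:
// system/decomposeParDict (minimal sketch used with preparePar)
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

// One subdomain per MPI task, i.e. one per node in my setup
numberOfSubdomains 5;

// Decomposition method (placeholder / illustrative only)
method          scotch;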

cartesianMesh_SLURM.sh:

Code:
#!/bin/bash

#SBATCH --nodes=5                # Total number of nodes requested
#SBATCH --ntasks-per-node=1      # 1 MPI task per node

#SBATCH --ntasks=5

#SBATCH --cpus-per-task=128      # CPUs per task for the intra-node OpenMP parallelization cfMesh uses by default

#SBATCH --time=72:00:00
#SBATCH --mem=300000
#SBATCH --exclusive
#SBATCH --contiguous

#SBATCH --error=cartesianMesh.err
#SBATCH --output=cartesianMesh.out

# Load appropriate modules and make OpenFOAM available
module load OpenFOAM/v2206
source $FOAM_BASH

solver=cartesianMesh

# Run cfMesh with hybrid MPI + OpenMP parallelization
mpirun -np 5 $solver -parallel >> log.cartesianMesh
cartesianMesh starts running, but then it crashes. Below are the contents of the corresponding log files.

log.cartesianMesh:
Code:
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2206                                  |
|   \\  /    A nd           | Website:  www.openfoam.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : _76d719d1e6-20220624 OPENFOAM=2206 version=v2206
Arch   : "LSB;label=32;scalar=64"
Exec   : cartesianMesh -parallel
Date   : Dec 07 2024
Time   : 19:28:30
Host   : tcn941.local
PID    : 3685925
I/O    : uncollated
Case   : <path_to_rootFolder>
nProcs : 5
Hosts  :
(
    (tcn941.local 1)
    (tcn942.local 1)
    (tcn943.local 1)
    (tcn944.local 1)
    (tcn945.local 1)
)
Pstream initialized with:
    floatTransfer      : 0
    nProcsSimpleSum    : 0
    commsType          : nonBlocking
    polling iterations : 0
trapFpe: Floating point exception trapping enabled (FOAM_SIGFPE).
fileModificationChecking : Monitoring run-time modified files using timeStampMaster (fileModificationSkew 5, maxFileModificationPolls 20)
allowSystemOperations : Allowing user-supplied system call operations

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

Setting root cube size and refinement parameters
Root box (-46192.4 -42195.6 -49707.5) (56207.6 60204.4 52692.5)
Requested cell size corresponds to octree level 10
Refining boundary
Refining boundary boxes to the given size
Number of leaves per processor 1
Distributing leaves to processors
Finished distributing leaves to processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
Finished distributing load between processors
Distributing load between processors
cartesianMesh.err:

Code:
[tcn942.local.snellius.surf.nl:3684931] pml_ucx.c:738  Error: bsend: failed to allocate buffer
[tcn942.local.snellius.surf.nl:3684931] pml_ucx.c:882  Error: ucx send failed: No pending message
[tcn942:3684931] *** An error occurred in MPI_Bsend
[tcn942:3684931] *** reported by process [3133276161,1]
[tcn942:3684931] *** on communicator MPI_COMM_WORLD
[tcn942:3684931] *** MPI_ERR_OTHER: known error not in list
[tcn942:3684931] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[tcn942:3684931] ***    and potentially your MPI job)
I could not work out where this error was coming from, so after a bit of searching online I came across two different threads where it was suggested to set OMP_NUM_THREADS and MPI_BUFFER_SIZE.
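
From what I can tell (and this is just my reading of the MPI documentation, not of cfMesh internals), MPI_Bsend only succeeds if a sufficiently large user buffer has been attached beforehand with MPI_Buffer_attach, and OpenFOAM apparently sizes the buffer it attaches from the MPI_BUFFER_SIZE environment variable, which is presumably why increasing it was suggested. Just to illustrate what the error message refers to, here is a small standalone C sketch of that buffered-send pattern (not cfMesh code):

Code:
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // MPI_Bsend copies the outgoing message into a user-attached buffer;
    // if no buffer (or a too-small one) is attached, the buffered send fails.
    const int n = 1024;
    int bufSize = n*(int)sizeof(double) + MPI_BSEND_OVERHEAD;
    void* buf = malloc(bufSize);
    MPI_Buffer_attach(buf, bufSize);

    double msg[1024] = {0};
    if (rank == 0 && size > 1)
    {
        MPI_Bsend(msg, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        MPI_Recv(msg, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // Detaching waits for buffered sends to complete before the buffer is freed
    void* detached;
    int detachedSize;
    MPI_Buffer_detach(&detached, &detachedSize);
    free(buf);

    MPI_Finalize();
    return 0;
}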

Therefore I also tried adding the following lines to the SLURM script before executing cartesianMesh, but without any success either.

Code:
# Set OpenMP environment variable
# => number of threads OpenMP will use for shared-memory parallelism within each node
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    # 128

# Set MPI buffer size
export MPI_BUFFER_SIZE=200000000
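For clarity, with those additions the relevant part of cartesianMesh_SLURM.sh looks like this (only the two exports are new compared to the script above; I am assuming mpirun forwards the exported variables to the ranks on the other nodes):

Code:
# Load appropriate modules and make OpenFOAM available
module load OpenFOAM/v2206
source $FOAM_BASH

# Added: OpenMP threads per MPI task and a larger buffer for buffered MPI sends
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MPI_BUFFER_SIZE=200000000

solver=cartesianMesh

# Run cfMesh with hybrid MPI + OpenMP parallelization
mpirun -np 5 $solver -parallel >> log.cartesianMesh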
However, specifying MPI_BUFFER_SIZE and/or OMP_NUM_THREADS did not fix the issue; I still cannot run cartesianMesh in parallel with MPI.

I searched for a way of solving these MPI_Bsend/UCX issues, but unfortunately I could not figure out what to do.

Does anyone have any idea or suggestion as to what I am actually doing wrong, and how one can use this amazing meshing tool on clusters, with MPI and CPU resources distributed across multiple nodes, for large meshes?

I would greatly appreciate any insights.
Thank you so much!

Tags
cartesianmesh, cfmesh, hpc cluster, mpi, mpi errors





