MPI_Send: MPI_ERR_COUNT: invalid count argument |
December 24, 2023, 03:36
MPI_Send: MPI_ERR_COUNT: invalid count argument
#1
New Member
Ilya
Join Date: Jan 2021
Posts: 10
Rep Power: 5
Dear Colleagues,
I have encountered a problem with MPI_Send while running a very big case on the cluster:

Code:
[lxbk1208:00000] *** An error occurred in MPI_Send
[lxbk1208:00000] *** reported by process [3960864768,1024]
[lxbk1208:00000] *** on communicator MPI_COMM_WORLD
[lxbk1208:00000] *** MPI_ERR_COUNT: invalid count argument
[lxbk1208:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1208:00000] ***    and MPI will try to terminate your MPI job as well)

(The full log is in the attachment.)

I'm using chtMultiRegionSimpleFoam to solve a heat-transfer problem for a multilayer PCB with vias in great detail. The solver goes through a few regions with no problem, but when it proceeds to a very big region with 141,211,296 cells (I'm using 2048 processors, so about 68,950 cells per processor, which should be fine), it crashes with the error above.

The decomposition method is hierarchical. decomposeParDict:

Code:
numberOfSubdomains 2048;
method hierarchical;
coeffs
{
    n (32 64 1);
}

The cluster I'm using is called Virgo. It uses Slurm for task scheduling; more information is available at https://hpc.gsi.de/virgo/

I submit the job with the following command:

Code:
sbatch --ntasks=2048 --mem-per-cpu=4G --hint=multithread --partition=main --mincpus=32 slurmScripts/chtMultiRegionSimpleFoam.sh &

chtMultiRegionSimpleFoam.sh:

Code:
srun chtMultiRegionSimpleFoam -parallel

OpenFOAM is compiled with WM_LABEL_SIZE=64, WM_MPLIB=SYSTEMOPENMPI, and WM_ARCH_OPTION=64. The OpenFOAM version is ESI OpenFOAM v2306.

Our support team assumes that this error appears because the solver is calling MPI_Send with a negative count argument: the count argument is a signed 32-bit int, so it is likely overflowing in my case.

To work around this, I tried changing the MPI optimization parameters in controlDict, but without success:
1. Set pbufs.tuning to 1 to activate the new NBX algorithm.
2. Varied the nbx.min parameter between 1 and 100.
3. Tried setting nbx.tuning to 0 and 1.
4. Set maxCommsSize to 2147483647, which is 2^31 - 1.
5. Searched the forum for mentions of a similar problem.

What could be the cause, and how can this error be fixed? Thank you for your help.

Best regards,
Ilya
---
CBM Department
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, Germany

Last edited by wht; January 3, 2024 at 08:22.
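If the overflow theory is right, the numbers line up: 141,211,296 cells at 8 bytes per scalar is about 1.13 GB, still below INT_MAX bytes, but a vector field (three components) or several fields packed into one buffer crosses the 32-bit boundary. A minimal standalone sketch of that arithmetic and the guard a sender would need (plain MPI with hypothetical sizes, not OpenFOAM's actual code path):

Code:
#include <mpi.h>
#include <climits>
#include <cstdint>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Hypothetical worst case: the whole PCB_Copper region gathered to one
    // rank as a vector field (3 doubles per cell).
    const int64_t nCells    = 141211296;
    const int64_t byteCount = nCells * 3 * sizeof(double);  // 3389071104

    // MPI_Send takes its count as a plain C int (32-bit on 64-bit Linux);
    // anything above INT_MAX wraps negative -> MPI_ERR_COUNT.
    if (byteCount > INT_MAX)
    {
        std::printf("%lld bytes exceeds INT_MAX (%d): this send must be "
                    "chunked or avoided\n",
                    static_cast<long long>(byteCount), INT_MAX);
    }
    else
    {
        std::printf("%lld bytes fits a 32-bit count\n",
                    static_cast<long long>(byteCount));
    }

    MPI_Finalize();
    return 0;
}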
January 2, 2024, 09:06
#2
New Member
Ilya
Join Date: Jan 2021
Posts: 10
Rep Power: 5
Dear Colleagues,
I'm still working on a solution to this problem, but with no success. So far I have tried a few more things:

1. I built openmpi-4.1.2 from ThirdParty-v2306 with the -m64 flag and ran the solver with it on the cluster.

etc/bashrc from the OpenFOAM folder:

Code:
<...>
export WM_MPLIB=OPENMPI
<...>

All the compilation flags for OpenMPI (ompi_info):

Code:
<...>
configure command line: 'CFLAGS=-m64' 'FFLAGS=-m64' 'FCFLAGS=-m64' 'CXXFLAGS=-m64' '--prefix=/linux64Gcc/openmpi-4.1.2' '--with-max-info-key=255' '--with-max-info-val=512' '--with-max-object-name=128' '--with-max-datarep-string=256' '--with-wrapper-cflags=-m64' '--with-wrapper-cxxflags=-m64' '--with-wrapper-fcflags=-m64' '--disable-orterun-prefix-by-default' '--with-pmix' '--with-libevent' '--with-ompi-pmix-rte' '--with-orte=no' '--disable-oshmem' '--enable-shared' '--without-verbs' '--with-hwloc' '--with-ucx=/lustre/cbm/users/elizarov' '--with-slurm' '--enable-mca-no-build=btl-uct' '--enable-shared' '--disable-static' '--enable-mpi-fortran=none' '--with-sge'
<...>

My new wrapper script is:

Code:
#!/bin/bash
#SBATCH --job-name=solver
#SBATCH --time 8:00:00
#SBATCH --output Slurm-solver.out
orterun chtMultiRegionSimpleFoam -parallel

2. Tried switching off multithreading:

Code:
sbatch --ntasks=2048 --mem-per-cpu=4G --hint=nomultithread --partition=main --mincpus=32 slurmScripts/chtMultiRegionSimpleFoam.sh &

3. Changed the solving algorithm from GAMG to PCG.

4. Tried renumbering the problematic region (PCB_Copper) with renumberMesh.

Best regards,
Ilya
---
CBM Department
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, Germany

Last edited by wht; January 3, 2024 at 16:17.
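For what it's worth, rebuilding with -m64 cannot lift the limit in question: on 64-bit Linux (the LP64 model), -m64 widens pointers and longs, while int, and therefore the count parameter of MPI_Send, stays 32-bit. A minimal check (plain MPI C++, nothing OpenFOAM-specific):

Code:
#include <mpi.h>
#include <climits>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Under LP64, int is 4 bytes even in a -m64 build, so MPI_Send's
    // 'int count' argument is capped at INT_MAX elements.
    std::printf("sizeof(int) = %zu bytes, INT_MAX = %d\n",
                sizeof(int), INT_MAX);

    // MPI_Count (used by the MPI_*_x large-count routines) is wide enough
    // for big transfers, but the classic MPI_Send signature does not use it.
    std::printf("sizeof(MPI_Count) = %zu bytes\n", sizeof(MPI_Count));

    MPI_Finalize();
    return 0;
}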
January 2, 2024, 12:21
#3
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,715
Rep Power: 40
Don't start fiddling with the nbx tuning factors; they only really help for large problems with AMI, distributed mapping, etc., which is not likely your case here.
Does the error arise immediately after trying to solve PCB_BasePlate, or during it? (The initial residual of zero could be suspicious.) With MPI errors it is not always clear when or where they arise. They can also be the result of something else: for example, a zero-size check is triggered on one rank but inconsistently on another, and when the MPI exchange occurs the send/recv are completely mismatched. After checking your case (possibly with different decompositions), the first thing to try is setting FOAM_ABORT=true, which will at least give you a stack trace and might help identify how things got to the failure point.
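To illustrate the mismatch scenario: if every rank first learns the exact counts its peers intend to send, a zero-size skip can never be taken on one side only. A minimal sketch of that size-exchange pattern (an illustration of the idea only, not OpenFOAM's actual PstreamBuffers code):

Code:
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nProcs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

    // Hypothetical payload sizes: what this rank would send to each peer.
    // Some ranks send nothing, which is exactly the risky case.
    std::vector<int> sendCounts(nProcs, rank % 2);

    // Exchange the counts first: afterwards every rank knows precisely what
    // every peer will send, so a "skip empty message" branch is taken
    // consistently on both sides of each pair.
    std::vector<int> recvCounts(nProcs, 0);
    MPI_Alltoall(sendCounts.data(), 1, MPI_INT,
                 recvCounts.data(), 1, MPI_INT, MPI_COMM_WORLD);

    for (int peer = 0; peer < nProcs; ++peer)
    {
        if (peer != rank && recvCounts[peer] > 0)
        {
            std::printf("rank %d expects %d ints from rank %d\n",
                        rank, recvCounts[peer], peer);
        }
    }

    MPI_Finalize();
    return 0;
}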
January 3, 2024, 07:35
#4
New Member
Ilya
Join Date: Jan 2021
Posts: 10
Rep Power: 5
Thanks for the reply, Mark!
I've tried to get more information by setting FOAM_ABORT=true, adding the respective command to my routine:

Code:
<...>
export FOAM_ABORT=true
chtMultiRegionSimpleFoam -parallel >> log.chtMultiRegionSimpleFoam 2>&1

However, I don't see any additional output in the log (below):

Code:
[lxbk1159:00000] *** An error occurred in MPI_Send
[lxbk1159:00000] *** reported by process [4110286848,512]
[lxbk1159:00000] *** on communicator MPI_COMM_WORLD
[lxbk1159:00000] *** MPI_ERR_COUNT: invalid count argument
[lxbk1159:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1159:00000] *** and MPI will try to terminate your MPI job as well)

In my first message I forgot to add the output from the task scheduler. Here it is; maybe it is helpful:

Code:
slurmstepd: error: *** STEP 17666910.0 ON lxbk0997 CANCELLED AT 2024-01-03T12:23:58 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: lxbk1003: tasks 266-297: Killed
srun: error: lxbk0999: tasks 92-137: Killed
srun: error: lxbk1001: tasks 184-229: Killed
srun: error: lxbk1160: tasks 434-475: Killed
srun: error: lxbk0998: tasks 46-91: Killed
srun: error: lxbk1000: tasks 138-183: Killed
srun: error: lxbk1188: tasks 724-777: Killed
srun: error: lxbk1170: tasks 536-591: Killed
srun: error: lxbk1187: tasks 668-723: Killed
srun: error: lxbk1159: tasks 374-433: Killed
srun: error: lxbk1171: tasks 592-667: Killed
srun: error: lxbk1155: tasks 298-373: Killed
slurmstepd: error: mpi/pmix_v2: _errhandler: lxbk1002 [5]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.17666910.0:241]
srun: error: lxbk1002: tasks 230-265: Killed
srun: error: lxbk1233: tasks 778-861: Killed
srun: error: lxbk1235: tasks 986-1023: Killed
srun: error: lxbk0997: tasks 0-45: Killed
slurmstepd: error: mpi/pmix_v2: _errhandler: lxbk1161 [10]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.17666910.0:486]
srun: error: lxbk1161: tasks 476-535: Killed
slurmstepd: error: mpi/pmix_v2: _errhandler: lxbk1234 [16]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.17666910.0:945]
srun: error: lxbk1234: tasks 862-985: Killed

I also tried to set the Pstream debug flag in controlDict:

Code:
DebugSwitches { Pstream 1; }

OpenFOAM acknowledges the flag:

Code:
<...>
Overriding DebugSwitches according to controlDict
Pstream 1;
<...>

But I don't see any additional output there either.

About your question: the initial zero residuals in PCB_BasePlate are there because I applied a fixedTemperatureConstraint to this region to imitate isothermal cooling.

Best regards,
Ilya
---
CBM Department
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, Germany

Last edited by wht; January 23, 2024 at 16:56.
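The silence is consistent with MPI_ERRORS_ARE_FATAL: Open MPI aborts inside MPI_Send itself, so FOAM_ABORT never gets control and no stack trace is produced. As a standalone debugging idea (plain MPI, not an OpenFOAM switch), the default error handler can be swapped for MPI_ERRORS_RETURN so the failing call returns a code that can be logged before aborting; run with mpirun -np 2:

Code:
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // With MPI_ERRORS_RETURN, errors on this communicator no longer abort
    // the job inside the MPI call; they come back as return codes.
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dummy = 0;
    // Deliberately invalid count (-1) to provoke MPI_ERR_COUNT.
    int err = MPI_Send(&dummy, -1, MPI_INT,
                       (rank + 1) % 2, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(err, msg, &len);
        std::fprintf(stderr, "rank %d: MPI_Send failed: %s\n", rank, msg);
        // A real debugging hook would print a stack trace here.
    }

    MPI_Finalize();
    return 0;
}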
January 4, 2024, 04:50
#5
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,715
Rep Power: 40
Since you are in Darmstadt, you could see if you can harness some resources from https://www.mma.tu-darmstadt.de/mma_institute/mma_team/ to help you out (formally or informally).
January 8, 2024, 08:54
#6
New Member
Ilya
Join Date: Jan 2021
Posts: 10
Rep Power: 5
Dear Colleagues,
I have tried a few more things to solve this problem but didn't succeed, unfortunately. I have run out of ideas, except for the one that Mark has suggested (many thanks!).

1. Increased the number of subdomains from 2048 to 4096:

Code:
numberOfSubdomains 4096;
method simple;
coeffs
{
    n (64 32 2);
}

For the PCB_Copper region, this gives approximately 35,000 cells per processor. It makes me think that the problem comes not from the relative number of cells per processor but rather from the absolute number of cells in a region.

2. Tried OpenFOAM 10 from the Foundation instead of ESI and got the same error:

Code:
Solving for solid region PCB_Copper
[lxbk0957:1650843] *** An error occurred in MPI_Send
[lxbk0957:1650843] *** reported by process [1978794742,512]
[lxbk0957:1650843] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[lxbk0957:1650843] *** MPI_ERR_COUNT: invalid count argument
[lxbk0957:1650843] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk0957:1650843] *** and potentially your MPI job)
slurmstepd: error: *** STEP 18002547.0 ON lxbk0824 CANCELLED AT 2024-01-08T02:21:37 ***

3. Used hierarchical decomposition for 1024, 2048, and 4096 subdomains, and also tried ptscotch for 1024 subdomains.

4. Tried using IOranks with -fileHandler hostCollated on the same case, but with 1024 subdomains:

Code:
<...>
processors1024_0-127
processors1024_128-255
processors1024_256-383
processors1024_384-511
processors1024_512-639
processors1024_640-767
processors1024_768-895
processors1024_896-1023
<...>

Code:
export FOAM_IORANKS='(0 128 256 384 512 640 768 896)'
chtMultiRegionSimpleFoam -parallel -fileHandler hostCollated >> log.chtMultiRegionSimpleFoam 2>&1

I have attached a log for this try.

5. Tried playing around with the MPI_BUFFER_SIZE variable (the same entry in etc/controlDict): set it to 400,000,000 with no success; the default value is 20,000,000. (A sketch of the transfer-chunking idea that size limits like maxCommsSize revolve around is shown after this list.)
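For context on what a size cap like maxCommsSize is meant to achieve (my reading of it; the sketch below is an illustration with a tiny 16-byte chunk, not OpenFOAM's actual Pstream code): one oversized transfer becomes several smaller ones, each of which fits the 32-bit count. Run with mpirun -np 2:

Code:
#include <mpi.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Send nBytes from buf to dest in chunks no larger than chunkSize, so the
// int count passed to MPI_Send can never overflow.
static void chunkedSend(const char* buf, int64_t nBytes, int dest,
                        MPI_Comm comm, int64_t chunkSize)
{
    int64_t offset = 0;
    for (int tag = 0; offset < nBytes; ++tag)
    {
        const int n = static_cast<int>(std::min(chunkSize, nBytes - offset));
        MPI_Send(buf + offset, n, MPI_BYTE, dest, tag, comm);
        offset += n;
    }
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int64_t nBytes = 64;     // tiny here; billions in the real case
    const int64_t chunkSize = 16;  // the maxCommsSize analogue
    std::vector<char> data(nBytes, static_cast<char>(rank));

    if (rank == 0)
    {
        chunkedSend(data.data(), nBytes, 1, MPI_COMM_WORLD, chunkSize);
    }
    else if (rank == 1)
    {
        // The receiver walks the same chunk schedule.
        int64_t offset = 0;
        for (int tag = 0; offset < nBytes; ++tag)
        {
            const int n =
                static_cast<int>(std::min(chunkSize, nBytes - offset));
            MPI_Recv(data.data() + offset, n, MPI_BYTE, 0, tag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            offset += n;
        }
    }

    MPI_Finalize();
    return 0;
}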
One more thing that may help: I've noticed this warning while compiling OpenFOAM-10 with WM_LABEL_SIZE=64 ("specified bound between 9223372036854775808 and 18446744073709551615 exceeds maximum object size 9223372036854775807"):

Code:
<...>
In file included from /lustre/cbm/users/elizarov/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/List.H:316,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/HashTable.C:30,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/Istream.H:187,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/ISstream.H:39,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/IOstreams.H:38,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/VectorSpace.C:27,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/VectorSpace.H:232,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/Vector.H:44,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/vector.H:39,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/point.H:35,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/pointField.H:35,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/face.H:46,
                 from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/faceList.H:34,
                 from lnInclude/interpolation.H:35,
                 from lnInclude/interpolationCellPoint.H:36,
                 from interpolation/interpolation/interpolationCellPointWallModified/interpolationCellPointWallModified.H:44,
                 from interpolation/interpolation/interpolationCellPointWallModified/makeInterpolationCellPointWallModified.C:26:
In constructor 'Foam::List<T>::List(Foam::label, const T&) [with T = bool]',
    inlined from 'void Foam::volPointInterpolation::interpolateUnconstrained(const Foam::GeometricField<Type, Foam::fvPatchField, Foam::volMesh>&, Foam::GeometricField<Type, Foam::pointPatchField, Foam::pointMesh>&) const [with Type = Foam::Vector<double>]' at lnInclude/volPointInterpolationTemplates.C:62:14:
/lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/List.C:72:39: warning: 'void* __builtin_memset(void*, int, long unsigned int)' specified bound between 9223372036854775808 and 18446744073709551615 exceeds maximum object size 9223372036854775807 [-Wstringop-overflow=]
   72 |     List_ELEM((*this), vp, i) = a;
<...>

The version of OpenFOAM 10 is https://github.com/OpenFOAM/OpenFOAM...s/tag/20230119

P.S. I have also attached the output of the ompi_info command run on the cluster.

Best regards,
Ilya
---
CBM Department
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, Germany

Last edited by wht; January 29, 2024 at 07:03.
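The bounds in that warning are a clue: 9223372036854775808 to 18446744073709551615 is exactly the range a negative 64-bit value lands in once it is converted to the unsigned size_t that memset takes, so the compiler is effectively warning that a negative list size could reach the allocator. A tiny standalone illustration:

Code:
#include <cstdint>
#include <cstdio>

int main()
{
    // A 64-bit label that has gone negative (e.g. after overflow upstream).
    const int64_t badSize = -1;

    // Converted to the unsigned type memset/new[] expect, it becomes huge:
    // -1 -> 18446744073709551615, and INT64_MIN -> 9223372036854775808,
    // precisely the range quoted in the compiler warning.
    const uint64_t asUnsigned = static_cast<uint64_t>(badSize);

    std::printf("%lld -> %llu\n",
                static_cast<long long>(badSize),
                static_cast<unsigned long long>(asUnsigned));
    return 0;
}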
January 23, 2024, 16:53
#7
New Member
Ilya
Join Date: Jan 2021
Posts: 10
Rep Power: 5
Dear Colleagues,
If you have an opportunity to test my case on your system, it would be a great help. Meanwhile, I have filed a bug report at https://develop.openfoam.com/Develop.../-/issues/3092; however, the error is hard to reproduce, for obvious reasons.

My case can be found at https://sf.gsi.de/f/4db522c9b39b4125855f/?dl=1 (24.2 MB).

Requirements: 1024 CPUs (multithreading can be used), 4 GB RAM per processor, the Slurm workload manager, and OpenFOAM installed with WM_LABEL_SIZE=64. Simply run the ./Allrun script. The case uses the collated file format and OpenFOAM v2306.

Best regards,
Ilya
---
CBM Department
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, Germany
Tags
big model, cluster computing, mpi error