June 14, 2012, 20:29 |
SnappyHexmesh crashes with many processes
|
#1 |
New Member
Jos Ewert
Join Date: Jun 2012
Posts: 5
Rep Power: 14 |
Hi,
I am currently trying the tutorials on our cluster, but on the motorBike example in incompressible/pisoFoam/les/ the mesh generation seems to fail with 208 threads. I increased the background mesh in blockMeshDict by doubling the three cell-count values. If I run snappyHexMesh with 208 threads I get a large stack trace after iteration 5, probably right before it enters iteration 6. Strangely enough it runs perfectly fine singlethreaded and with only 104 processes. snappyHexMesh does not seem to run with fewer processes than subdomains, unless it is only 1 process. Does anyone have an idea what could go wrong? Here's the stack trace (output from ranks [135] and [137] is interleaved): Code:
6 18370624
[137] #0 Foam::error::printStack(Foam::Ostream&) in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libOpenFOAM.so"
[137] #1 Foam::sigSegv::sigHandler(int) in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libOpenFOAM.so"
[137] #2
[135] #0 Foam::error::printStack(Foam::Ostream&) in "/lib64/libc.so.6"
[137] #3 _SCOTCHdgraphMatchSyncColl in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libOpenFOAM.so"
[135] #1 Foam::sigSegv::sigHandler(int) in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[137] #4 _SCOTCHdgraphCoarsen in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[137] #5 in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libOpenFOAM.so"
[135] #2 at bdgraph_bipart_ml.c:0
[137] #6 in "/lib64/libc.so.6"
[135] #3 _SCOTCHdgraphMatchSyncColl at bdgraph_bipart_ml.c:0
[137] #7 _SCOTCHbdgraphBipartMl in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[135] #4 _SCOTCHdgraphCoarsen in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[137] #8 _SCOTCHbdgraphBipartSt in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[135] #5 in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[137] #9 at bdgraph_bipart_ml.c:0
[135] #6 at kdgraph_map_rb_part.c:0
[137] #10 at bdgraph_bipart_ml.c:0
[135] #7 _SCOTCHbdgraphBipartMl at kdgraph_map_rb_part.c:0
[137] #11 _SCOTCHkdgraphMapRbPart in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[135] #8 _SCOTCHbdgraphBipartSt in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[137] #12 _SCOTCHkdgraphMapSt in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[135] #9 in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[137] #13 SCOTCH_dgraphMapCompute at kdgraph_map_rb_part.c:0
[135] #10 in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[137] #14 SCOTCH_dgraphMap at kdgraph_map_rb_part.c:0
[135] #11 _SCOTCHkdgraphMapRbPart in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[137] #15 Foam::ptscotchDecomp::decompose(Foam::fileName const&, Foam::List<int> const&, Foam::List<int> const&, Foam::Field<double> const&, Foam::List<int>&) const in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotchDecomp.so"
[137] #16 Foam::ptscotchDecomp::decomposeZeroDomains(Foam::fileName const&, Foam::List<int> const&, Foam::List<int> const&, Foam::Field<double> const&, Foam::List<int>&) const in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[135] #12 _SCOTCHkdgraphMapSt in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotchDecomp.so"
[137] #17 Foam::ptscotchDecomp::decompose(Foam::polyMesh const&, Foam::Field<Foam::Vector<double> > const&, Foam::Field<double> const&) in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[135] #13 SCOTCH_dgraphMapCompute in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotchDecomp.so"
[137] #18 Foam::meshRefinement::balance(bool, bool, Foam::Field<double> const&, Foam::decompositionMethod&, Foam::fvMeshDistribute&) in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
[135] #14 SCOTCH_dgraphMap in "/home/kluster/openfoam/IntelMPI-gcc46//ThirdParty-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotch.so"
in "/home/kluster/openfoam/Inte[135] #15 lMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libautoMesh.so"Foam::ptscotchDecomp::decompose(Foam::fileName const&, Foam::List<int> const&, Foam::List<int> const&, Foam::Field<double> const&, Foam::List<int>&) const
[137] #19 Foam::meshRefinement::refineAndBalance(Foam::string const&, Foam::decompositionMethod&, Foam::fvMeshDistribute&, Foam::List<int> const&, double) in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotchDecomp.so"
[135] #16 Foam::ptscotchDecomp::decomposeZeroDomains(Foam::fileName const&, Foam::List<int> const&, Foam::List<int> const&, Foam::Field<double> const&, Foam::List<int>&) const in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libautoMesh.so"
[137] #20 Foam::autoRefineDriver::surfaceOnlyRefine(Foam::refinementParameters const&, int) in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotchDecomp.so"
[135] #17 Foam::ptscotchDecomp::decompose(Foam::polyMesh const&, Foam::Field<Foam::Vector<double> > const&, Foam::Field<double> const&) in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libautoMesh.so"
[137] #21 Foam::autoRefineDriver::doRefine(Foam::dictionary const&, Foam::refinementParameters const&, bool, Foam::dictionary const&) in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/4.0.3/libptscotchDecomp.so"
[135] #18 Foam::meshRefinement::balance(bool, bool, Foam::Field<double> const&, Foam::decompositionMethod&, Foam::fvMeshDistribute&) in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libautoMesh.so"
[137] #22 in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libautoMesh.so"
[135] #19 Foam::meshRefinement::refineAndBalance(Foam::string const&, Foam::decompositionMethod&, Foam::fvMeshDistribute&, Foam::List<int> const&, double)
[137] in "/home/kluster/openfoam/IntelMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/bin/snappyHexMesh"
[137] #23 __libc_start_main in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libautoMesh.so"
[135] #20 Foam::autoRefineDriver::surfaceOnlyRefine(Foam::refinementParameters const&, int) in "/lib64/libc.so.6"
[137] #24 in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libautoMesh.so"
[135] #21 Foam::autoRefineDriver::doRefine(Foam::dictionary const&, Foam::refinementParameters const&, bool, Foam::dictionary const&) in "/home/kluster/openfoam/IntelMPI-gcc46//OpenFOAM-2.1.0/platforms/linux64Gcc46DPOpt/lib/libautoMesh.so"
[135] #22
[137] at /usr/src/packages/BUILD/glibc-2.11.3/csu/../sysdeps/x86_64/elf/start.S:116 |
|
June 14, 2012, 20:51 |
|
#2 | ||
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Quote:
|
|||
June 16, 2012, 06:47 |
|
#3 | ||
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings to all!
@flami - Here's what I know: Quote:
Bruno
__________________
|
|||
June 16, 2012, 09:13 |
|
#4 |
New Member
Jos Ewert
Join Date: Jun 2012
Posts: 5
Rep Power: 14 |
Sorry, I made a typo: instead of threads I meant processes. I don't use OpenMP or any other kind of thread parallelisation; it's all MPI. I use IntelMPI.
I run this from the Allrun script:

runParallel snappyHexMesh 104 -overwrite -parallel

which then becomes:

mpirun -np 104 -ppn 8 -binding "map=scatter" snappyHexMesh 104 -overwrite -parallel

Anything above 104 processes (so 130, 156, 182, 208) doesn't seem to crash anymore but gets stuck in an iteration. It seems to be the same one for each count, so e.g. 208 always gets stuck in iteration 4 of the first step (sorry, I forgot what it was called). The processes sit at 100% core usage and don't progress; the first time, 208 processes were stuck for 4 hours until I cancelled the job.

That stack trace suddenly appeared after I changed the mesh in blockMeshDict; changing it again made snappyHexMesh get stuck again. It doesn't matter if I use gcc or the Intel compilers, it is always the same error. I might try OpenMPI too if I get the time; maybe I'll get lucky. |
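In case it helps anyone reproducing this: a parallel snappyHexMesh run like the one above picks up its decomposition settings from system/decomposeParDict, and the subdomain count there has to match the -np passed to mpirun. A minimal sketch of such a file (the values are illustrative, not the poster's actual settings): Code:
numberOfSubdomains 104;

method          ptscotch;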
|
June 16, 2012, 09:27 |
|
#5 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Ah, OK, now we're getting closer to the problem.
Do the CPUs you're using have Hyper-Threading? If so, does this mean that anything above 104 processes will strictly require using HT? And... does your mpirun command use the "machines" or "hosts" file? Or do you have this globally configured? Or does your machine indeed have 104 or 208 cores?
__________________
|
|
June 16, 2012, 15:05 |
|
#6 |
New Member
Jos Ewert
Join Date: Jun 2012
Posts: 5
Rep Power: 14 |
Yes, we have 208 physical cores. We use SLURM to distribute the MPI processes (it sets some environment variables that IntelMPI reads).
With 104 processes I explicitly run 8 processes per node, pinned 4 per CPU, so I only use half of the cores on each CPU. |
|
June 16, 2012, 18:26 |
|
#7 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Mmm... OK, then there are a few possibilities left:
__________________
|
|
June 23, 2012, 15:52 |
|
#8 | |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings to all!
Yesterday I was writing some stuff about "decomposeParDict" and went to look at the main one present in the folder "OpenFOAM-2.1.x/applications/utilities/parallelProcessing/decomposePar", which I hadn't looked at for several months now... and I saw this: Quote:
@flami: I haven't tested this yet, but I suggest that you give this method a try, because this might very well be an innocent way of showing people that when Scotch crashes due to a partition limit, we have to resort to the multiLevel method, which apparently provides the ability to use Scotch on a multi-level partition graph instead of a single-level partition graph! Oh, and since you are using it with snappyHexMesh, instead of "scotch" it will probably have to be "ptscotch". The other possibility is to use "ptscotch" on one level and "simple" or "hierarchical" on the other. Best regards, Bruno
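A minimal sketch of how such a multiLevel entry could look in decomposeParDict (the per-level subdomain counts here are placeholders; their product must equal numberOfSubdomains): Code:
numberOfSubdomains 208;

method          multiLevel;

multiLevelCoeffs
{
    level0
    {
        numberOfSubdomains 26;
        method ptscotch;
    }
    level1
    {
        numberOfSubdomains 8;
        method ptscotch;
    }
}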
__________________
|
||
July 2, 2012, 21:30 |
|
#9 |
New Member
Jos Ewert
Join Date: Jun 2012
Posts: 5
Rep Power: 14 |
Sadly, I do not have the hardware anymore (it was torn apart and sent to who-knows-where).
I might be able to test it on some other hardware, but sadly I cannot make any promises. Anyway, I found out that 128 processes with ptscotch is the limit at which the program will still run; anything above that (e.g. 130) will make it crash, no matter the problem size. I had the issue of the defaults being too small for 128, which is why I doubled every side. But yes, multiLevelCoeffs might help. For lazy people that don't understand how the problem is actually structured (i.e. me), something like this: Code:
level0
{
    [...]
    method ptscotch;
}
level1
{
    [...]
    method ptscotch;
}

Thanks for the help. |
|
July 18, 2012, 15:26 |
|
#10 |
New Member
Jos Ewert
Join Date: Jun 2012
Posts: 5
Rep Power: 14 |
Hi, I have access to a larger machine again and tested the multilevel suggestion.
Sadly it still crashes, but it seems to be at a different place: Code:
Surface refinement iteration 0
------------------------------
Marked for refinement due to surface intersection : 630 cells.
Marked for refinement due to curvature/regions : 0 cells.
Determined cells to refine in = 0.18 s
Selected for refinement : 630 cells (out of 655360)
Edge intersection testing:
    Number of edges : 2005059
    Number of edges to retest : 17979
    Number of intersected edges : 3239
Refined mesh in = 0.56 s
After refinement surface refinement iteration 0 : cells:659770 faces:2005059 points:685831
Cells per refinement level:
    0 654730
    1 5040
[0] Decomposition at level 0 :
[0]
[0]
[0] --> FOAM FATAL ERROR:
[0] bad set size -4
[0]
[0] From function List<T>::setSize(const label)
[0] in file /home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/src/OpenFOAM/lnInclude/List.C at line
[0] Domain 0
[0] Number of cells = 40976
[0] Number of inter-domain patches = 0
[0] Number of inter-domain faces = 0
[0] 322.
[0] FOAM parallel run aborting
[0]
[0] #0 Foam::error::printStack(Foam::Ostream&)
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

Local host: ic1n045 (PID 3374)
MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/libOpenFOAM.so"
[0] #1 Foam::error::abort() in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/libOpenFOAM.so"
[0] #2 Foam::List<int>::setSize(int) in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/bin/snappyHexMesh"
[0] #3 Foam::ptscotchDecomp::decomposeZeroDomains(Foam::fileName const&, Foam::List<int> const&, Foam::List<int> const&, Foam::Field<double> const&, Foam::List<int>&) const in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/openmpi-system/libptscotchDecomp.so"
[0] #4 Foam::ptscotchDecomp::decompose(Foam::List<Foam::List<int> > const&, Foam::Field<Foam::Vector<double> > const&, Foam::Field<double> const&) in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/openmpi-system/libptscotchDecomp.so"
[0] #5 Foam::multiLevelDecomp::decompose(Foam::List<Foam::List<int> > const&, Foam::Field<Foam::Vector<double> > const&, Foam::Field<double> const&, Foam::List<int> const&, int, Foam::Field<int>&) in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/libdecompositionMethods.so"
[0] #6 Foam::multiLevelDecomp::decompose(Foam::List<Foam::List<int> > const&, Foam::Field<Foam::Vector<double> > const&, Foam::Field<double> const&, Foam::List<int> const&, int, Foam::Field<int>&) in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/libdecompositionMethods.so"
[0] #7 Foam::multiLevelDecomp::decompose(Foam::polyMesh const&, Foam::Field<Foam::Vector<double> > const&, Foam::Field<double> const&) in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/libdecompositionMethods.so"
[0] #8 Foam::meshRefinement::balance(bool, bool, Foam::Field<double> const&, Foam::decompositionMethod&, Foam::fvMeshDistribute&) in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/libautoMesh.so"
[0] #9 Foam::meshRefinement::refineAndBalance(Foam::string const&, Foam::decompositionMethod&, Foam::fvMeshDistribute&, Foam::List<int> const&, double) in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/libautoMesh.so"
[0] #10 Foam::autoRefineDriver::surfaceOnlyRefine(Foam::refinementParameters const&, int) in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/libautoMesh.so"
[0] #11 Foam::autoRefineDriver::doRefine(Foam::dictionary const&, Foam::refinementParameters const&, bool, Foam::dictionary const&) in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/lib/libautoMesh.so"
[0] #12
[0] in "/home/ws/nm46/openfoam/SystemOMPI-gcc46/OpenFOAM-2.1.0/platforms/linux64GccDPOpt/bin/snappyHexMesh"
[0] #13 __libc_start_main in "/lib64/libc.so.6"
[0] #14

Both crash at the same place, except that at 256 the error says "bad set size -41". This is the decomposeParDict for 208 (it's a bit wrong, as I still had the core counts of the old system; I guess 26*8 would have been better): Code:
numberOfSubdomains 208;
method multiLevel;
multiLevelCoeffs
{
    level0
    {
        numberOfSubdomains 16;
        method ptscotch;
    }
    level1
    {
        numberOfSubdomains 13;
        method ptscotch;
    }
}

And for 256: Code:
numberOfSubdomains 256;
method multiLevel;
multiLevelCoeffs
{
    level0
    {
        numberOfSubdomains 64;
        method ptscotch;
    }
    level1
    {
        numberOfSubdomains 4;
        method ptscotch;
    }
}

The block mesh is (160 64 64), which leaves about 2550 blocks on each process for 256 processes. I do not really know what to do about that set size error, as it happens long before the maximum number of cells for the hex mesh is reached (IIRC it's 7 million), so I guess it is not related to that being too small. Maybe it is related to "maxLocalCells 100000;", as this now isn't reached that easily anymore (or at all, I'd guess). |
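For reference, the two refinement limits being discussed here live in the castellatedMeshControls section of snappyHexMeshDict; a sketch with placeholder values (not the poster's actual settings): Code:
castellatedMeshControls
{
    // per-processor cell limit during refinement; when a processor
    // exceeds this, refinement pauses and the mesh is rebalanced
    maxLocalCells 100000;

    // overall cell limit for the refinement phase
    maxGlobalCells 2000000;

    // ...
}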
|
July 18, 2012, 17:29 |
|
#11 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Hi flami,

Mmm... too bad. Here I was thinking that the multi-level thinga-ma-bob was the life saver...

What about "maxGlobalCells"? Do you have it set to a very high value?

My guess is that the best next step would be to file a bug report with this information: http://www.openfoam.org/mantisbt/

I also did a quick search in Scotch's code for any hard-coded values of 64, 128 or 256, but didn't find any suspicious-looking ones.

Trying to upgrade to a more recent Scotch library would also be a possibility, but it might be a serious pain in the neck if the library interfaces changed too much with the upgrade.

Best regards, Bruno
__________________
|
|
|
|