October 24, 2018, 12:02 |
"Failed Starting Thread 0"
|
#1 |
New Member
Eric Bringley
Join Date: Nov 2016
Posts: 14
Rep Power: 10 |
Dear all,
I've run into a problem: my job fails during file I/O with the message below.

Log:
Code:
PIMPLE: not converged within 3 iterations
[15]
[15]
[15] --> FOAM FATAL ERROR:
[15] Failed starting thread 0
[15]
[15]     From function void Foam::createThread(int, void *(*)(void *), void *)
[15]     in file POSIX.C at line 1422.
[15] FOAM parallel run exiting
[15]

Comparing the file counts of the last two time directories suggests the write at 0.007425 was cut short:
Code:
[processors]$ ls -1 0.0074 | wc -l
224
[processors]$ ls -1 0.007425/ | wc -l
110

Relevant details:
Does anyone have any ideas about why OpenFOAM is failing when it writes to file?
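For reference, the check above boils down to counting what each write actually produced. Something like this (a minimal sketch; the "processors" directory and the time names are taken from this particular case, so adjust as needed):
Code:
#!/bin/bash
# Sketch: count the files each recent write produced; an interrupted write
# shows up as a time directory with far fewer files than its neighbours.
# "processors" and the time names below are from this case -- adjust as needed.
cd processors
for t in 0.0074 0.007425; do
    echo "time $t: $(ls -1 "$t" 2>/dev/null | wc -l) files"
done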
|
April 25, 2019, 15:52 |
FOAM FATAL ERROR: Failed starting thread 0
|
#2 |
Member
Andrew O. Winter
Join Date: Aug 2015
Location: Seattle, WA, USA
Posts: 78
Rep Power: 11 |
Hi Eric,
Were you ever able to discern what was causing this issue? I've just run into the same error output while trying out OpenFOAM-5.x (compiled last Friday) using the collated fileHandler instead of the default uncollated.

So far, none of the 3 cases I've run with the uncollated format has produced this error (2 are complete and 1 is just past 50%), but the 1 case I've tried with the collated format failed at about 5.4 seconds of simulated time, roughly 1/4 of the total.

To give some details of the case: I'm modeling a piston-type wave maker using the olaDyMFlow solver from Pablo Higuera's OlaFlow solver + BC package, which is discussed on the forums in this thread. My model is similar to the wavemakerFlume tutorial, with modified flume geometry and 1 to 3 rectangular structures added using snappyHexMesh.

The hardware I'm running this on is a pair of Skylake nodes (Intel Xeon Platinum 8160), each with 2 sockets, 24 cores per socket, and 2 threads per core, i.e. 48 cores or 96 threads per node. The operating system details are:
Code:
Operating System: CentOS Linux 7 (Core)
CPE OS Name: cpe:/o:centos:centos:7
Kernel: Linux 3.10.0-957.5.1.el7.x86_64
Architecture: x86-64

Thanks in advance!

Slurm batch script:
Code:
#!/bin/bash

#SBATCH --job-name=case065              # job name
#SBATCH --account=DesignSafe-Motley-UW  # project allocation name (required if you have >1)
#SBATCH --partition=skx-normal          # queue name
#SBATCH --time=48:00:00                 # run time (D-HH:MM:SS)
#SBATCH --nodes=2                       # total number of nodes
#SBATCH --ntasks=96                     # total number of MPI tasks

module load intel/18.0.2
module load impi/18.0.2
export MPI_ROOT=$I_MPI_ROOT
source $WORK/OpenFOAM-5.x/etc/bashrc

cd $SCRATCH/Apr22/case065_W12ft_xR016in_yR-040in_xL016in_yL_056in_Broken_kOmegaSST_Euler_MeshV2_0_1

echo Preparing 0 folder...
if [ -d 0 ]; then
    rm -r 0
fi
cp -r 0.org 0

echo blockMesh meshing...
blockMesh > log.blockMesh

echo surfaceFeatureExtract extracting...
surfaceFeatureExtract > log.surfFeatExt

echo decomposePar setting up parallel case...
cp ./system/decompParDict_sHM ./system/decomposeParDict
decomposePar -copyZero > log.decomp_sHM

echo snappyHex meshing testStruct...
cp ./system/snappyHexMeshDict_testStruct ./system/snappyHexMeshDict
ibrun -np 96 snappyHexMesh -parallel -overwrite > log.sHM_testStruct

echo snappyHex meshing concBlocks...
cp ./system/snappyHexMeshDict_concBlocks ./system/snappyHexMeshDict
ibrun -np 96 snappyHexMesh -parallel -overwrite > log.sHM_concBlocks

echo reconstructParMesh rebuilding mesh...
reconstructParMesh -constant -mergeTol 1e-6 > log.reconMesh_sHM

echo reconstructPar rebuilding fields...
reconstructPar > log.reconFields_sHM
rm -r processor*

echo checking mesh quality...
checkMesh > log.checkMesh

echo Setting the fields...
setFields > log.setFields

echo decomposePar setting up parallel case...
cp ./system/decompParDict_runCase ./system/decomposeParDict
decomposePar > log.decomp_runCase

echo Mesh built, ICs set, and parallel decomposition complete
echo Begin running olaDyMFlow...
ibrun -np 96 olaDyMFlow -parallel > log.olaDyMFlow
echo Completed running olaDyMFlow

Code:
Preparing 0 folder...
blockMesh meshing...
surfaceFeatureExtract extracting...
decomposePar setting up parallel case...
snappyHex meshing testStruct...
snappyHex meshing concBlocks...
reconstructParMesh rebuilding mesh...
reconstructPar rebuilding fields...
checking mesh quality...
Setting the fields...
decomposePar setting up parallel case...
Mesh built, ICs set, and parallel decomposition complete
Begin running olaDyMFlow...
[52] [71] [74] [79] [83] [87] [94]
[52]
[52] --> FOAM FATAL ERROR:
[52] Failed starting thread 0
[52]
[52]     From function void Foam::createThread(int, void *(*)(void *), void *)
[52]     in file POSIX.C at line
[71]
[71] --> FOAM FATAL ERROR:
[71] Failed starting thread 0
[71]
[71]     From function void Foam::createThread(int, void *(*)(void *), void *)
[71]     in file POSIX.C at line
[74]
[74] --> FOAM FATAL ERROR:
[74] Failed starting thread 0
[74]
[74]     From function void Foam::createThread(int, void *(*)(void *), void *)
[74]     in file POSIX.C at line
[79]
[79] --> FOAM FATAL ERROR:
[79] Failed starting thread 0
[79]
[79]     From function void Foam::createThread(int, void *(*)(void *), void *)
[79]     in file POSIX.C at line
[83]
[83] --> FOAM FATAL ERROR:
[83] Failed starting thread 0
[83]
[83]     From function void Foam::createThread(int, void *(*)(void *), void *)
[83]     in file POSIX.C at line
[87]
[87] --> FOAM FATAL ERROR:
[87] Failed starting thread 0
[87]
[87]     From function void Foam::createThread(int, void *(*)(void *), void *)
[87]     in file POSIX.C at line
[94]
[94] --> FOAM FATAL ERROR:
[94] Failed starting thread 0
[94]
[94]     From function void Foam::createThread(int, void *(*)(void *), void *)
[94]     in file POSIX.C at line 1422.
[52] FOAM parallel run exiting
[52]
1422.
[71] FOAM parallel run exiting
[71]
1422.
[74] FOAM parallel run exiting
[74]
1422.
[83] FOAM parallel run exiting
[83]
1422.
[94] FOAM parallel run exiting
[94]
1422.
[79] FOAM parallel run exiting
[79]
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 79
1422.
[87] FOAM parallel run exiting
[87]
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 87
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 52
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 71
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 74
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 83
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 94
Completed running olaDyMFlow

Code:
TACC: Starting up job 3370514 TACC: Starting parallel tasks... /*---------------------------------------------------------------------------*\ | ========= | | | \\ / F ield | OpenFOAM: The Open Source CFD Toolbox | | \\ / O peration | Version: 5.x | | \\ / A nd | Web: www.OpenFOAM.org | | \\/ M anipulation | | \*---------------------------------------------------------------------------*/ Build : 5.x-7f7d351b741b Exec : /work/04697/winter89/stampede2/OpenFOAM-5.x/platforms/linux64IccDPInt32Opt/bin/olaDyMFlow -parallel Date : Apr 25 2019 Time : 07:11:35 Host : "c499-092.stampede2.tacc.utexas.edu" PID : 103053 I/O : uncollated Case : /scratch/04697/winter89/Apr22/case065_W12ft_xR016in_yR-040in_xL016in_yL_056in_Broken_kOmegaSST_Euler_MeshV2_0_1 nProcs : 96 Slaves : 95 ( "c499-092.stampede2.tacc.utexas.edu.103054" "c499-092.stampede2.tacc.utexas.edu.103055" "c499-092.stampede2.tacc.utexas.edu.103056" "c499-092.stampede2.tacc.utexas.edu.103057" "c499-092.stampede2.tacc.utexas.edu.103058" "c499-092.stampede2.tacc.utexas.edu.103059" "c499-092.stampede2.tacc.utexas.edu.103060" "c499-092.stampede2.tacc.utexas.edu.103061" "c499-092.stampede2.tacc.utexas.edu.103062" "c499-092.stampede2.tacc.utexas.edu.103063" "c499-092.stampede2.tacc.utexas.edu.103064" "c499-092.stampede2.tacc.utexas.edu.103065" "c499-092.stampede2.tacc.utexas.edu.103066" "c499-092.stampede2.tacc.utexas.edu.103067" "c499-092.stampede2.tacc.utexas.edu.103068" "c499-092.stampede2.tacc.utexas.edu.103069" "c499-092.stampede2.tacc.utexas.edu.103070" "c499-092.stampede2.tacc.utexas.edu.103071" "c499-092.stampede2.tacc.utexas.edu.103072" "c499-092.stampede2.tacc.utexas.edu.103073" "c499-092.stampede2.tacc.utexas.edu.103074" "c499-092.stampede2.tacc.utexas.edu.103075" "c499-092.stampede2.tacc.utexas.edu.103076" "c499-092.stampede2.tacc.utexas.edu.103077" "c499-092.stampede2.tacc.utexas.edu.103078" "c499-092.stampede2.tacc.utexas.edu.103079" "c499-092.stampede2.tacc.utexas.edu.103080" "c499-092.stampede2.tacc.utexas.edu.103081" "c499-092.stampede2.tacc.utexas.edu.103082" "c499-092.stampede2.tacc.utexas.edu.103083" "c499-092.stampede2.tacc.utexas.edu.103084" "c499-092.stampede2.tacc.utexas.edu.103085" "c499-092.stampede2.tacc.utexas.edu.103086" "c499-092.stampede2.tacc.utexas.edu.103087" "c499-092.stampede2.tacc.utexas.edu.103088" "c499-092.stampede2.tacc.utexas.edu.103089" "c499-092.stampede2.tacc.utexas.edu.103090" "c499-092.stampede2.tacc.utexas.edu.103091" "c499-092.stampede2.tacc.utexas.edu.103092" "c499-092.stampede2.tacc.utexas.edu.103093" "c499-092.stampede2.tacc.utexas.edu.103094" "c499-092.stampede2.tacc.utexas.edu.103095" "c499-092.stampede2.tacc.utexas.edu.103096" "c499-092.stampede2.tacc.utexas.edu.103097" "c499-092.stampede2.tacc.utexas.edu.103098" "c499-092.stampede2.tacc.utexas.edu.103099" "c499-092.stampede2.tacc.utexas.edu.103100" "c500-054.stampede2.tacc.utexas.edu.377907" "c500-054.stampede2.tacc.utexas.edu.377908" "c500-054.stampede2.tacc.utexas.edu.377909" "c500-054.stampede2.tacc.utexas.edu.377910" "c500-054.stampede2.tacc.utexas.edu.377911" "c500-054.stampede2.tacc.utexas.edu.377912" "c500-054.stampede2.tacc.utexas.edu.377913" "c500-054.stampede2.tacc.utexas.edu.377914" "c500-054.stampede2.tacc.utexas.edu.377915" "c500-054.stampede2.tacc.utexas.edu.377916" "c500-054.stampede2.tacc.utexas.edu.377917" "c500-054.stampede2.tacc.utexas.edu.377918" "c500-054.stampede2.tacc.utexas.edu.377919" "c500-054.stampede2.tacc.utexas.edu.377920" "c500-054.stampede2.tacc.utexas.edu.377921" "c500-054.stampede2.tacc.utexas.edu.377922" 
"c500-054.stampede2.tacc.utexas.edu.377923" "c500-054.stampede2.tacc.utexas.edu.377924" "c500-054.stampede2.tacc.utexas.edu.377925" "c500-054.stampede2.tacc.utexas.edu.377926" "c500-054.stampede2.tacc.utexas.edu.377927" "c500-054.stampede2.tacc.utexas.edu.377928" "c500-054.stampede2.tacc.utexas.edu.377929" "c500-054.stampede2.tacc.utexas.edu.377930" "c500-054.stampede2.tacc.utexas.edu.377931" "c500-054.stampede2.tacc.utexas.edu.377932" "c500-054.stampede2.tacc.utexas.edu.377933" "c500-054.stampede2.tacc.utexas.edu.377934" "c500-054.stampede2.tacc.utexas.edu.377935" "c500-054.stampede2.tacc.utexas.edu.377936" "c500-054.stampede2.tacc.utexas.edu.377937" "c500-054.stampede2.tacc.utexas.edu.377938" "c500-054.stampede2.tacc.utexas.edu.377939" "c500-054.stampede2.tacc.utexas.edu.377940" "c500-054.stampede2.tacc.utexas.edu.377941" "c500-054.stampede2.tacc.utexas.edu.377942" "c500-054.stampede2.tacc.utexas.edu.377943" "c500-054.stampede2.tacc.utexas.edu.377944" "c500-054.stampede2.tacc.utexas.edu.377945" "c500-054.stampede2.tacc.utexas.edu.377946" "c500-054.stampede2.tacc.utexas.edu.377947" "c500-054.stampede2.tacc.utexas.edu.377948" "c500-054.stampede2.tacc.utexas.edu.377949" "c500-054.stampede2.tacc.utexas.edu.377950" "c500-054.stampede2.tacc.utexas.edu.377951" "c500-054.stampede2.tacc.utexas.edu.377952" "c500-054.stampede2.tacc.utexas.edu.377953" "c500-054.stampede2.tacc.utexas.edu.377954" ) Pstream initialized with: floatTransfer : 0 nProcsSimpleSum : 0 commsType : nonBlocking polling iterations : 0 sigFpe : Enabling floating point exception trapping (FOAM_SIGFPE). fileModificationChecking : Monitoring run-time modified files using timeStampMaster (fileModificationSkew 10) allowSystemOperations : Allowing user-supplied system call operations // * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * // Create time Overriding OptimisationSwitches according to controlDict maxThreadFileBufferSize 2e+09; maxMasterFileBufferSize 2e+09; Overriding fileHandler to collated I/O : collated (maxThreadFileBufferSize 2e+09) Threading activated since maxThreadFileBufferSize > 0. Requires thread support enabled in MPI, otherwise the simulation may "hang". 
If thread support cannot be enabled, deactivate threading
by setting maxThreadFileBufferSize to 0 in $FOAM_ETC/controlDict

Create mesh for time = 0

Selecting dynamicFvMesh dynamicMotionSolverFvMesh
Selecting motion solver: displacementLaplacian
Selecting motion diffusion: uniform

PIMPLE: Operating solver in PISO mode

Reading field porosityIndex
Porosity NOT activated

Reading field p_rgh
Reading field U
Reading/calculating face flux field phi
Reading transportProperties

Selecting incompressible transport model Newtonian
Selecting incompressible transport model Newtonian
Selecting turbulence model type RAS
Selecting RAS turbulence model kOmegaSST
Selecting patchDistMethod meshWave
RAS
{
    RASModel        kOmegaSST;
    turbulence      on;
    printCoeffs     on;
    alphaK1         0.85;
    alphaK2         1;
    alphaOmega1     0.5;
    alphaOmega2     0.856;
    gamma1          0.555556;
    gamma2          0.44;
    beta1           0.075;
    beta2           0.0828;
    betaStar        0.09;
    a1              0.31;
    b1              1;
    c1              10;
    F3              false;
}

Reading g
Reading hRef
Calculating field g.h

No MRF models present

No finite volume options present
GAMGPCG: Solving for pcorr, Initial residual = 0, Final residual = 0, No Iterations 0
time step continuity errors : sum local = 0, global = 0, cumulative = 0
Reading/calculating face velocity Uf

Courant Number mean: 0 max: 0

Starting time loop

forces frontFaceForce: Not including porosity effects
forces backFaceForce: Not including porosity effects
forces leftFaceForce: Not including porosity effects
forces rightFaceForce: Not including porosity effects
forces bottomFaceForce: Not including porosity effects
forces topFaceForce: Not including porosity effects
Reading surface description: frontBox

Courant Number mean: 0 max: 0
Interface Courant Number mean: 0 max: 0
deltaT = 0.00119048
Time = 0.00119048

. . . . .

Courant Number mean: 0.0209079 max: 0.570672
Interface Courant Number mean: 0.00106573 max: 0.418045
deltaT = 0.00444444
Time = 5.4

PIMPLE: iteration 1
Point displacement BC on patch paddle
Displacement Paddles_paddle => 1(3.62)
GAMG: Solving for cellDisplacementx, Initial residual = 3.18878e-06, Final residual = 3.18878e-06, No Iterations 0
GAMG: Solving for cellDisplacementy, Initial residual = 0, Final residual = 0, No Iterations 0
GAMG: Solving for cellDisplacementz, Initial residual = 0, Final residual = 0, No Iterations 0
Execution time for mesh.update() = 0.72 s
GAMGPCG: Solving for pcorr, Initial residual = 1, Final residual = 5.77301e-06, No Iterations 7
time step continuity errors : sum local = 7.62535e-13, global = -1.05727e-13, cumulative = 1.68824e-06
smoothSolver: Solving for alpha.water, Initial residual = 0.000177732, Final residual = 5.84167e-09, No Iterations 2
Phase-1 volume fraction = 0.221957  Min(alpha.water) = -2.25391e-35  Max(alpha.water) = 1.00013
MULES: Correcting alpha.water
MULES: Correcting alpha.water
Phase-1 volume fraction = 0.221957  Min(alpha.water) = -1.10925e-22  Max(alpha.water) = 1.00013
smoothSolver: Solving for alpha.water, Initial residual = 0.00017774, Final residual = 5.9304e-09, No Iterations 2
Phase-1 volume fraction = 0.221957  Min(alpha.water) = -2.24798e-35  Max(alpha.water) = 1.00013
MULES: Correcting alpha.water
MULES: Correcting alpha.water
Phase-1 volume fraction = 0.221957  Min(alpha.water) = -9.97751e-23  Max(alpha.water) = 1.00013
GAMG: Solving for p_rgh, Initial residual = 0.00176017, Final residual = 9.41743e-06, No Iterations 2
time step continuity errors : sum local = 1.14858e-05, global = 1.35418e-07, cumulative = 1.82366e-06
GAMG: Solving for p_rgh, Initial residual = 1.43755e-05, Final residual = 1.02771e-07, No Iterations 4
time step continuity errors : sum local = 1.25308e-07, global = -1.47874e-08, cumulative = 1.80887e-06
GAMG: Solving for p_rgh, Initial residual = 1.71185e-06, Final residual = 4.65467e-09, No Iterations 5
time step continuity errors : sum local = 5.67765e-09, global = -2.01188e-09, cumulative = 1.80686e-06
smoothSolver: Solving for omega, Initial residual = 0.00147789, Final residual = 2.02482e-05, No Iterations 1
smoothSolver: Solving for k, Initial residual = 0.00709563, Final residual = 0.00018944, No Iterations 1
TACC: MPI job exited with code: 1
TACC: Shutdown complete. Exiting.
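Following the hint near the top of that log ("deactivate threading by setting maxThreadFileBufferSize to 0 in $FOAM_ETC/controlDict"), one thing I can try next is switching the threaded collated writes off, or dropping back to the uncollated handler for this one run. A rough sketch only, assuming the maxThreadFileBufferSize entry lives in the case's system/controlDict as the "Overriding OptimisationSwitches according to controlDict" line suggests (editing the dictionary by hand works just as well):
Code:
# Sketch: turn off the helper thread used by the collated writer by zeroing
# maxThreadFileBufferSize in the case controlDict (the log shows 2e+09 now).
sed -i 's/maxThreadFileBufferSize .*/maxThreadFileBufferSize 0;/' system/controlDict

# Or, if the build supports the -fileHandler option, bypass collation for this run:
# ibrun -np 96 olaDyMFlow -parallel -fileHandler uncollated > log.olaDyMFlow
As far as I understand, with maxThreadFileBufferSize set to 0 the collated files are still written, just without the helper thread, which is exactly the Foam::createThread call that is failing here.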
|
April 26, 2019, 06:45 |
Unsolved, but problem is intermittent...
|
#3 |
New Member
Eric Bringley
Join Date: Nov 2016
Posts: 14
Rep Power: 10 |
Hi Andrew,
I cannot say I have solved this problem, as variations of it still haunt me. I think most of it comes down to the cluster I'm using, which is largely outside my control; it probably stems from intermittent file system/storage access. Most of what I find online about problems like this (searching from an MPI standpoint) says to run with a debugger, which isn't practical when the problem is intermittent, irreproducible, and seemingly random.

Your log shows it failed at t = 5.4, and since that's a clean, round number, I'm guessing it's a write-out time step. You could be hitting the same suspected filesystem problems I was. I'd suggest a help ticket with TACC.

I hope you have more success than I did in solving this problem, or that it simply disappears and you can continue your work uninterrupted. Sorry I cannot be of more help.

Best,
Eric
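P.S. If the failure really is an intermittent filesystem hiccup at write time, one crude workaround would be to let the job restart the solver from the last written time whenever it dies. This is only a sketch: it assumes system/controlDict uses startFrom latestTime; so a rerun resumes from the most recent written time (a half-written time directory may need to be deleted first), and it borrows the solver name, core count and log name from Andrew's script above.
Code:
#!/bin/bash
# Sketch: retry the solver a few times after an intermittent crash.
# Assumes controlDict has "startFrom latestTime;" so each retry resumes
# from the most recent time directory that was written out.
for attempt in 1 2 3; do
    ibrun -np 96 olaDyMFlow -parallel >> log.olaDyMFlow && break
    echo "olaDyMFlow died on attempt $attempt; retrying from latestTime..."
done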