"Failed Starting Thread 0"

October 24, 2018, 12:02   #1
"Failed Starting Thread 0"
Eric Bringley (ebringley), New Member
Join Date: Nov 2016, Posts: 14
Dear all,

I've run into a problem: my job fails during file IO with this message:

Log:
Code:
PIMPLE: not converged within 3 iterations
[15] 
[15] 
[15] --> FOAM FATAL ERROR: 
[15] Failed starting thread 0
[15] 
[15]     From function void Foam::createThread(int, void *(*)(void *), void *)
[15]     in file POSIX.C at line 1422.
[15] 
FOAM parallel run exiting
[15]
File counts, run from inside the collated processors directory:
Code:
[processors]$ ls -1 0.0074 | wc -l
224
[processors]$ ls -1 0.007425/ | wc -l
110
I believe the log and file counts above show that this happens during a runTime.write(): the latest time directory contains only 110 files, compared with 224 in the previous complete time.
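
A minimal sketch of the same check looped over every time directory (assuming the collated output sits in a single processors directory, as the prompt above suggests) makes an interrupted write easy to spot:

Code:
# Count the files in each time directory under the collated processors
# directory; a partially written time shows up with a low count.
cd processors
for t in [0-9]*; do
    printf '%-12s %s files\n' "$t" "$(ls -1 "$t" | wc -l)"
done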


Relevant details:

  • I am running OpenFOAM 5.x in parallel on an HPC cluster, compiled with the Intel compilers and the Intel MPI library.
  • The problem is repeatable.
  • It has happened anywhere between 5 and 38 hours of wall-clock time, depending on the case, but in this latest case it seems to occur after the same amount of simulation time (e.g. start at time = 0, fail at time X, restart from X - writeTimeStep, fail at 2X - writeTimeStep).
  • The executable is a modified reactingFoam, coupled to a user-written Fortran library (not called during file IO).
  • I am using collated file IO to reduce the number of output files (see the controlDict sketch below).
  • The latest time step can be deleted, and a restart from that point will pass the time at which the write previously failed.


Does anyone have any ideas about why OpenFOAM is failing when it is writing to file?
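
For reference, a minimal sketch of the OptimisationSwitches involved, with illustrative values; these can live in $FOAM_ETC/controlDict or be overridden in the case's system/controlDict. Setting maxThreadFileBufferSize to 0 deactivates the background write thread used by the collated handler, so writes block until complete but should no longer depend on the Foam::createThread call that is failing here.

Code:
// Sketch of the relevant OptimisationSwitches (values illustrative), e.g. in
// $FOAM_ETC/controlDict or overridden in the case's system/controlDict:
OptimisationSwitches
{
    // Select the collated file handler (can also be chosen per run with the
    // -fileHandler option or the FOAM_FILEHANDLER environment variable).
    fileHandler collated;

    // 0 deactivates the background write thread: writes block until complete,
    // but no longer require thread support in MPI.
    maxThreadFileBufferSize 0;

    // Buffer used on the master when gathering data for collated writes.
    maxMasterFileBufferSize 2e9;
}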

April 25, 2019, 15:52   #2
FOAM FATAL ERROR: Failed starting thread 0
Andrew O. Winter (aow), Member
Join Date: Aug 2015, Location: Seattle, WA, USA, Posts: 78
Hi Eric,

Were you ever able to discern what was causing this issue?

I've just run into the same error output while trying out OpenFOAM-5.x (compiled last Friday) with the collated fileHandler instead of the default uncollated one. Of the 3 cases I've run with the uncollated format, none has had any errors (2 are complete and 1 is just past 50% completion), but the 1 case I've tried with the collated format failed at about 5.4 seconds of simulated time, roughly 1/4 of the total.

To provide some details of the case, I'm modeling a piston-type wave maker using the olaDyMFlow solver from Pablo Higuera's OlaFlow solver + BC package, which is discussed in its own thread on these forums. My model is similar to the wavemakerFlume tutorial, with modified flume geometry and 1 to 3 rectangular structures added using snappyHexMesh.

Also, the hardware I'm running this on is a pair of Skylake nodes (Intel Xeon Platinum 8160), each with 2 sockets, 24 cores per socket, and 2 threads per core, giving 48 cores or 96 threads per node. The operating system is reported as...
Code:
Operating System: CentOS Linux 7 (Core)
CPE OS Name: cpe:/o:centos:centos:7
Kernel: Linux 3.10.0-957.5.1.el7.x86_64
Architecture: x86-64
In case you or anyone else has any tips or clues to offer, I've posted my Slurm batch script and Slurm output file for building and running the case. The olaDyMFlow log file is really long so I omitted the middle portion where things were running smoothly.

Thanks in advance!

Slurm batch script:
Code:
#!/bin/bash
#SBATCH --job-name=case065              # job name
#SBATCH --account=DesignSafe-Motley-UW  # project allocation name (required if you have >1)
#SBATCH --partition=skx-normal          # queue name
#SBATCH --time=48:00:00                 # run time (D-HH:MM:SS)
#SBATCH --nodes=2                       # total number of nodes
#SBATCH --ntasks=96                     # total number of MPI tasks
module load intel/18.0.2
module load impi/18.0.2
export MPI_ROOT=$I_MPI_ROOT
source $WORK/OpenFOAM-5.x/etc/bashrc
cd $SCRATCH/Apr22/case065_W12ft_xR016in_yR-040in_xL016in_yL_056in_Broken_kOmegaSST_Euler_MeshV2_0_1
echo Preparing 0 folder...
if [ -d 0 ]; then
	rm -r 0
fi
cp -r 0.org 0
echo blockMesh meshing...
blockMesh > log.blockMesh
echo surfaceFeatureExtract extracting...
surfaceFeatureExtract > log.surfFeatExt
echo decomposePar setting up parallel case...
cp ./system/decompParDict_sHM ./system/decomposeParDict
decomposePar -copyZero > log.decomp_sHM
echo snappyHex meshing testStruct...
cp ./system/snappyHexMeshDict_testStruct ./system/snappyHexMeshDict
ibrun -np 96 snappyHexMesh -parallel -overwrite > log.sHM_testStruct
echo snappyHex meshing concBlocks...
cp ./system/snappyHexMeshDict_concBlocks ./system/snappyHexMeshDict
ibrun -np 96 snappyHexMesh -parallel -overwrite > log.sHM_concBlocks
echo reconstructParMesh rebuilding mesh...
reconstructParMesh -constant -mergeTol 1e-6 > log.reconMesh_sHM
echo reconstructPar rebuilding fields...
reconstructPar > log.reconFields_sHM
rm -r processor*
echo checking mesh quality...
checkMesh > log.checkMesh
echo Setting the fields...
setFields > log.setFields
echo decomposePar setting up parallel case...
cp ./system/decompParDict_runCase ./system/decomposeParDict
decomposePar > log.decomp_runCase
echo Mesh built, ICs set, and parallel decomposition complete
echo Begin running olaDyMFlow...
ibrun -np 96 olaDyMFlow -parallel > log.olaDyMFlow
echo Completed running olaDyMFlow
Slurm output file:
Code:
Preparing 0 folder...
blockMesh meshing...
surfaceFeatureExtract extracting...
decomposePar setting up parallel case...
snappyHex meshing testStruct...
snappyHex meshing concBlocks...
reconstructParMesh rebuilding mesh...
reconstructPar rebuilding fields...
checking mesh quality...
Setting the fields...
decomposePar setting up parallel case...
Mesh built, ICs set, and parallel decomposition complete
Begin running olaDyMFlow...
[52] 
[71] 
[74] 
[79] 
[83] 
[87] 
[94] 
[52] 
[52] --> FOAM FATAL ERROR: 
[52] Failed starting thread 0
[52] 
[52]     From function void Foam::createThread(int, void *(*)(void *), void *)
[52]     in file POSIX.C at line [71] 
[71] --> FOAM FATAL ERROR: 
[71] Failed starting thread 0
[71] 
[71]     From function void Foam::createThread(int, void *(*)(void *), void *)
[71]     in file POSIX.C at line [74] 
[74] --> FOAM FATAL ERROR: 
[74] Failed starting thread 0
[74] 
[74]     From function void Foam::createThread(int, void *(*)(void *), void *)
[74]     in file POSIX.C at line [79] 
[79] --> FOAM FATAL ERROR: 
[79] Failed starting thread 0
[79] 
[79]     From function void Foam::createThread(int, void *(*)(void *), void *)
[79]     in file POSIX.C at line [83] 
[83] --> FOAM FATAL ERROR: 
[83] Failed starting thread 0
[83] 
[83]     From function void Foam::createThread(int, void *(*)(void *), void *)
[83]     in file POSIX.C at line [87] 
[87] --> FOAM FATAL ERROR: 
[87] Failed starting thread 0
[87] 
[87]     From function void Foam::createThread(int, void *(*)(void *), void *)
[87]     in file POSIX.C at line [94] 
[94] --> FOAM FATAL ERROR: 
[94] Failed starting thread 0
[94] 
[94]     From function void Foam::createThread(int, void *(*)(void *), void *)
[94]     in file POSIX.C at line 1422.
[52] 
FOAM parallel run exiting
[52] 
1422.
[71] 
FOAM parallel run exiting
[71] 
1422.
[74] 
FOAM parallel run exiting
[74] 
1422.
[83] 
FOAM parallel run exiting
[83] 
1422.
[94] 
FOAM parallel run exiting
[94] 
1422.
[79] 
FOAM parallel run exiting
[79] 
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 79
1422.
[87] 
FOAM parallel run exiting
[87] 
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 87
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 52
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 71
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 74
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 83
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 94
Completed running olaDyMFlow
olaDyMFlow (truncated) log file:
Code:
TACC:  Starting up job 3370514 
TACC:  Starting parallel tasks... 
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  5.x                                   |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 5.x-7f7d351b741b
Exec   : /work/04697/winter89/stampede2/OpenFOAM-5.x/platforms/linux64IccDPInt32Opt/bin/olaDyMFlow -parallel
Date   : Apr 25 2019
Time   : 07:11:35
Host   : "c499-092.stampede2.tacc.utexas.edu"
PID    : 103053
I/O    : uncollated
Case   : /scratch/04697/winter89/Apr22/case065_W12ft_xR016in_yR-040in_xL016in_yL_056in_Broken_kOmegaSST_Euler_MeshV2_0_1
nProcs : 96
Slaves : 
95
(
"c499-092.stampede2.tacc.utexas.edu.103054"
"c499-092.stampede2.tacc.utexas.edu.103055"
"c499-092.stampede2.tacc.utexas.edu.103056"
"c499-092.stampede2.tacc.utexas.edu.103057"
"c499-092.stampede2.tacc.utexas.edu.103058"
"c499-092.stampede2.tacc.utexas.edu.103059"
"c499-092.stampede2.tacc.utexas.edu.103060"
"c499-092.stampede2.tacc.utexas.edu.103061"
"c499-092.stampede2.tacc.utexas.edu.103062"
"c499-092.stampede2.tacc.utexas.edu.103063"
"c499-092.stampede2.tacc.utexas.edu.103064"
"c499-092.stampede2.tacc.utexas.edu.103065"
"c499-092.stampede2.tacc.utexas.edu.103066"
"c499-092.stampede2.tacc.utexas.edu.103067"
"c499-092.stampede2.tacc.utexas.edu.103068"
"c499-092.stampede2.tacc.utexas.edu.103069"
"c499-092.stampede2.tacc.utexas.edu.103070"
"c499-092.stampede2.tacc.utexas.edu.103071"
"c499-092.stampede2.tacc.utexas.edu.103072"
"c499-092.stampede2.tacc.utexas.edu.103073"
"c499-092.stampede2.tacc.utexas.edu.103074"
"c499-092.stampede2.tacc.utexas.edu.103075"
"c499-092.stampede2.tacc.utexas.edu.103076"
"c499-092.stampede2.tacc.utexas.edu.103077"
"c499-092.stampede2.tacc.utexas.edu.103078"
"c499-092.stampede2.tacc.utexas.edu.103079"
"c499-092.stampede2.tacc.utexas.edu.103080"
"c499-092.stampede2.tacc.utexas.edu.103081"
"c499-092.stampede2.tacc.utexas.edu.103082"
"c499-092.stampede2.tacc.utexas.edu.103083"
"c499-092.stampede2.tacc.utexas.edu.103084"
"c499-092.stampede2.tacc.utexas.edu.103085"
"c499-092.stampede2.tacc.utexas.edu.103086"
"c499-092.stampede2.tacc.utexas.edu.103087"
"c499-092.stampede2.tacc.utexas.edu.103088"
"c499-092.stampede2.tacc.utexas.edu.103089"
"c499-092.stampede2.tacc.utexas.edu.103090"
"c499-092.stampede2.tacc.utexas.edu.103091"
"c499-092.stampede2.tacc.utexas.edu.103092"
"c499-092.stampede2.tacc.utexas.edu.103093"
"c499-092.stampede2.tacc.utexas.edu.103094"
"c499-092.stampede2.tacc.utexas.edu.103095"
"c499-092.stampede2.tacc.utexas.edu.103096"
"c499-092.stampede2.tacc.utexas.edu.103097"
"c499-092.stampede2.tacc.utexas.edu.103098"
"c499-092.stampede2.tacc.utexas.edu.103099"
"c499-092.stampede2.tacc.utexas.edu.103100"
"c500-054.stampede2.tacc.utexas.edu.377907"
"c500-054.stampede2.tacc.utexas.edu.377908"
"c500-054.stampede2.tacc.utexas.edu.377909"
"c500-054.stampede2.tacc.utexas.edu.377910"
"c500-054.stampede2.tacc.utexas.edu.377911"
"c500-054.stampede2.tacc.utexas.edu.377912"
"c500-054.stampede2.tacc.utexas.edu.377913"
"c500-054.stampede2.tacc.utexas.edu.377914"
"c500-054.stampede2.tacc.utexas.edu.377915"
"c500-054.stampede2.tacc.utexas.edu.377916"
"c500-054.stampede2.tacc.utexas.edu.377917"
"c500-054.stampede2.tacc.utexas.edu.377918"
"c500-054.stampede2.tacc.utexas.edu.377919"
"c500-054.stampede2.tacc.utexas.edu.377920"
"c500-054.stampede2.tacc.utexas.edu.377921"
"c500-054.stampede2.tacc.utexas.edu.377922"
"c500-054.stampede2.tacc.utexas.edu.377923"
"c500-054.stampede2.tacc.utexas.edu.377924"
"c500-054.stampede2.tacc.utexas.edu.377925"
"c500-054.stampede2.tacc.utexas.edu.377926"
"c500-054.stampede2.tacc.utexas.edu.377927"
"c500-054.stampede2.tacc.utexas.edu.377928"
"c500-054.stampede2.tacc.utexas.edu.377929"
"c500-054.stampede2.tacc.utexas.edu.377930"
"c500-054.stampede2.tacc.utexas.edu.377931"
"c500-054.stampede2.tacc.utexas.edu.377932"
"c500-054.stampede2.tacc.utexas.edu.377933"
"c500-054.stampede2.tacc.utexas.edu.377934"
"c500-054.stampede2.tacc.utexas.edu.377935"
"c500-054.stampede2.tacc.utexas.edu.377936"
"c500-054.stampede2.tacc.utexas.edu.377937"
"c500-054.stampede2.tacc.utexas.edu.377938"
"c500-054.stampede2.tacc.utexas.edu.377939"
"c500-054.stampede2.tacc.utexas.edu.377940"
"c500-054.stampede2.tacc.utexas.edu.377941"
"c500-054.stampede2.tacc.utexas.edu.377942"
"c500-054.stampede2.tacc.utexas.edu.377943"
"c500-054.stampede2.tacc.utexas.edu.377944"
"c500-054.stampede2.tacc.utexas.edu.377945"
"c500-054.stampede2.tacc.utexas.edu.377946"
"c500-054.stampede2.tacc.utexas.edu.377947"
"c500-054.stampede2.tacc.utexas.edu.377948"
"c500-054.stampede2.tacc.utexas.edu.377949"
"c500-054.stampede2.tacc.utexas.edu.377950"
"c500-054.stampede2.tacc.utexas.edu.377951"
"c500-054.stampede2.tacc.utexas.edu.377952"
"c500-054.stampede2.tacc.utexas.edu.377953"
"c500-054.stampede2.tacc.utexas.edu.377954"
)

Pstream initialized with:
    floatTransfer      : 0
    nProcsSimpleSum    : 0
    commsType          : nonBlocking
    polling iterations : 0
sigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).
fileModificationChecking : Monitoring run-time modified files using timeStampMaster (fileModificationSkew 10)
allowSystemOperations : Allowing user-supplied system call operations

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

Overriding OptimisationSwitches according to controlDict
    maxThreadFileBufferSize 2e+09;

    maxMasterFileBufferSize 2e+09;

Overriding fileHandler to collated
I/O    : collated (maxThreadFileBufferSize 2e+09)
         Threading activated since maxThreadFileBufferSize > 0.
         Requires thread support enabled in MPI, otherwise the simulation
         may "hang".  If thread support cannot be enabled, deactivate threading
         by setting maxThreadFileBufferSize to 0 in $FOAM_ETC/controlDict
Create mesh for time = 0

Selecting dynamicFvMesh dynamicMotionSolverFvMesh
Selecting motion solver: displacementLaplacian
Selecting motion diffusion: uniform

PIMPLE: Operating solver in PISO mode

Reading field porosityIndex

Porosity NOT activated

Reading field p_rgh

Reading field U

Reading/calculating face flux field phi

Reading transportProperties

Selecting incompressible transport model Newtonian
Selecting incompressible transport model Newtonian
Selecting turbulence model type RAS
Selecting RAS turbulence model kOmegaSST
Selecting patchDistMethod meshWave
RAS
{
    RASModel        kOmegaSST;
    turbulence      on;
    printCoeffs     on;
    alphaK1         0.85;
    alphaK2         1;
    alphaOmega1     0.5;
    alphaOmega2     0.856;
    gamma1          0.555556;
    gamma2          0.44;
    beta1           0.075;
    beta2           0.0828;
    betaStar        0.09;
    a1              0.31;
    b1              1;
    c1              10;
    F3              false;
}


Reading g

Reading hRef
Calculating field g.h

No MRF models present

No finite volume options present

GAMGPCG:  Solving for pcorr, Initial residual = 0, Final residual = 0, No Iterations 0
time step continuity errors : sum local = 0, global = 0, cumulative = 0
Reading/calculating face velocity Uf

Courant Number mean: 0 max: 0

Starting time loop

forces frontFaceForce:
    Not including porosity effects
forces backFaceForce:
    Not including porosity effects
forces leftFaceForce:
    Not including porosity effects
forces rightFaceForce:
    Not including porosity effects
forces bottomFaceForce:
    Not including porosity effects
forces topFaceForce:
    Not including porosity effects
Reading surface description:
    frontBox

Courant Number mean: 0 max: 0
Interface Courant Number mean: 0 max: 0
deltaT = 0.00119048
Time = 0.00119048
.
.
.
.
.
Courant Number mean: 0.0209079 max: 0.570672
Interface Courant Number mean: 0.00106573 max: 0.418045
deltaT = 0.00444444
Time = 5.4

PIMPLE: iteration 1
Point displacement BC on patch paddle
Displacement Paddles_paddle => 1(3.62)
GAMG:  Solving for cellDisplacementx, Initial residual = 3.18878e-06, Final residual = 3.18878e-06, No Iterations 0
GAMG:  Solving for cellDisplacementy, Initial residual = 0, Final residual = 0, No Iterations 0
GAMG:  Solving for cellDisplacementz, Initial residual = 0, Final residual = 0, No Iterations 0
Execution time for mesh.update() = 0.72 s
GAMGPCG:  Solving for pcorr, Initial residual = 1, Final residual = 5.77301e-06, No Iterations 7
time step continuity errors : sum local = 7.62535e-13, global = -1.05727e-13, cumulative = 1.68824e-06
smoothSolver:  Solving for alpha.water, Initial residual = 0.000177732, Final residual = 5.84167e-09, No Iterations 2
Phase-1 volume fraction = 0.221957  Min(alpha.water) = -2.25391e-35  Max(alpha.water) = 1.00013
MULES: Correcting alpha.water
MULES: Correcting alpha.water
Phase-1 volume fraction = 0.221957  Min(alpha.water) = -1.10925e-22  Max(alpha.water) = 1.00013
smoothSolver:  Solving for alpha.water, Initial residual = 0.00017774, Final residual = 5.9304e-09, No Iterations 2
Phase-1 volume fraction = 0.221957  Min(alpha.water) = -2.24798e-35  Max(alpha.water) = 1.00013
MULES: Correcting alpha.water
MULES: Correcting alpha.water
Phase-1 volume fraction = 0.221957  Min(alpha.water) = -9.97751e-23  Max(alpha.water) = 1.00013
GAMG:  Solving for p_rgh, Initial residual = 0.00176017, Final residual = 9.41743e-06, No Iterations 2
time step continuity errors : sum local = 1.14858e-05, global = 1.35418e-07, cumulative = 1.82366e-06
GAMG:  Solving for p_rgh, Initial residual = 1.43755e-05, Final residual = 1.02771e-07, No Iterations 4
time step continuity errors : sum local = 1.25308e-07, global = -1.47874e-08, cumulative = 1.80887e-06
GAMG:  Solving for p_rgh, Initial residual = 1.71185e-06, Final residual = 4.65467e-09, No Iterations 5
time step continuity errors : sum local = 5.67765e-09, global = -2.01188e-09, cumulative = 1.80686e-06
smoothSolver:  Solving for omega, Initial residual = 0.00147789, Final residual = 2.02482e-05, No Iterations 1
smoothSolver:  Solving for k, Initial residual = 0.00709563, Final residual = 0.00018944, No Iterations 1
TACC:  MPI job exited with code: 1 
TACC:  Shutdown complete. Exiting.
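
The I/O banner in the log above already warns that threaded collated writes require thread support in MPI and suggests deactivating threading otherwise. As a sketch of two possible workarounds (not a confirmed fix), one can fall back to the uncollated handler for a single run, or keep collated output but disable its write thread:

Code:
# Option 1 (sketch): override the case's collated setting for this run only.
ibrun -np 96 olaDyMFlow -parallel -fileHandler uncollated > log.olaDyMFlow

# ...or set it for every OpenFOAM tool launched from this shell or job script:
export FOAM_FILEHANDLER=uncollated

# Option 2 (sketch): keep collated output but disable its background thread,
# as suggested by the banner, by setting in system/controlDict:
#     OptimisationSwitches { maxThreadFileBufferSize 0; }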

April 26, 2019, 06:45   #3
Unsolved, but problem is intermittent...
Eric Bringley (ebringley), New Member
Join Date: Nov 2016, Posts: 14
Hi Andrew,


I cannot say I have solved this problem, as variations of it still haunt me. I think most of it comes down to problems with the cluster I'm using, which are largely outside my control, probably intermittent access to the file system/storage. Most of what I find online about problems like this (searching from an MPI standpoint) says to run with a debugger, which is impractical when the problem is intermittent, hard to reproduce, and seemingly random.

Your log file says it failed at t = 5.4, and since that is a clean, round number I'm guessing it is a write-out time step. You could be experiencing the same suspected filesystem problems I was.

I'd suggest opening a help ticket with TACC. I hope you have more success than I did in solving this problem, or that it simply disappears and you can continue your work uninterrupted. Sorry I cannot be of more help.
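
For completeness, the restart workaround described in the first post (delete the incomplete latest time and rerun from the previous complete write) looks roughly like the sketch below; the time-directory name and the launch line are reused from earlier in the thread purely for illustration.

Code:
# Sketch only: remove the partially written time from the collated output...
rm -rf processors/0.007425
# ...then, with startFrom set to latestTime in system/controlDict, resume:
ibrun -np 96 olaDyMFlow -parallel >> log.olaDyMFlow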



Best,
Eric
