reactingParcelFoam 2D crash in parallel, works fine in serial

Old   May 28, 2015, 13:15
Default reactingParcelFoam 2D crash in parallel, works fine in serial
  #1
Member
 
Ferdinand Pfender
Join Date: May 2013
Location: Berlin, Germany
Posts: 40
Hi everyone,

I'm solving a "simple" 2D channel flow of air with a water spray (similar to $FOAM_TUT/lagrangian/reactingParcelFoam/verticalChannel, just in 2D).

When I try to run this case in parallel, the solver crashes at the first injection time step with the following error message:
Code:
Solving 2-D cloud reactingCloud1

--> Cloud: reactingCloud1 injector: model1
Added 91 new parcels

[$HOSTNAME:31049] *** An error occurred in MPI_Recv
[$HOSTNAME:31049] *** reported by process [139954540642305,1]
[$HOSTNAME:31049] *** on communicator MPI_COMM_WORLD
[$HOSTNAME:31049] *** MPI_ERR_TRUNCATE: message truncated
[$HOSTNAME:31049] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[$HOSTNAME:31049] ***    and potentially your MPI job)
[$HOSTNAME:31035] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[$HOSTNAME:31035] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
In serial it runs fine without any error. If I change the number of processors ($WM_NCOMPPROCS), the solver sometimes hangs instead of crashing; htop then shows a lot of red CPU usage (kernel threads).

I found something on the net: someone had the same error and solved it by disabling functionObjects and cloudFunctions. That did not help in my case...
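To be explicit about what "disabling" means here: I just emptied the corresponding sub-dictionaries, roughly like this (a sketch, not my exact files; file names as in the verticalChannel tutorial):

Code:
// in system/controlDict
functions
{}

// in constant/reactingCloud1Properties
cloudFunctions
{}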
The decomposition method is also irrelevant; I checked both simple and scotch.
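The decomposition setup itself is nothing special, roughly this (a sketch with assumed subdomain count and coefficients, not my exact decomposeParDict):

Code:
numberOfSubdomains 4;

method          scotch;     // also tested: simple

simpleCoeffs
{
    n           ( 2 2 1 );
    delta       0.001;
}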

Maybe this thread is better placed in OpenFOAM Bugs? If someone can confirm this, I will also open an issue on the OpenFOAM-2.3.x bug tracker.
Tomorrow I'll check it in OF-2.4.x and in FE-3.1.

If somebody knows what to do, any help is appreciated. This case is somewhat urgent for me.

Thank you very much!

Old   June 11, 2015, 07:31
Default
  #2
Member
 
Join Date: Sep 2010
Location: Leipzig, Germany
Posts: 96
oswald
I'm having a similar problem with a Lagrangian tracking solver in parallel, based on icoUncoupledKinematicParcelFoam. It works at first, but after some time it crashes with the same error message as in your case.

Code:
[ran:7367] *** An error occurred in MPI_Waitall
[ran:7367] *** on communicator MPI_COMM_WORLD
[ran:7367] *** MPI_ERR_TRUNCATE: message truncated
[ran:7367] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 7367 on
node ran exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
And, as in your case, sometimes it just gets stuck somewhere without crashing. I tried to narrow down where it hangs, and it seems to be in KinematicCloud::evolve() in KinematicCloud.C, while getting the tracking data (the Info statements below are debug output I added):

Code:
template<class CloudType>
void Foam::KinematicCloud<CloudType>::evolve()
{
    Info << "start kinematicCloud.evolve" << endl;
    if (solution_.canEvolve())
    {
        Info << "solution can evolve, getting track data" << endl;
        typename parcelType::template
            TrackingData<KinematicCloud<CloudType> > td(*this);

        Info << "start solving" << endl;
        solve(td);
    }
}
When the solver hangs, the last output is "solution can evolve, getting track data", so the problem seems to occur right there.

When changing the commsType from nonBlocking to blocking in $WM_PROJECT_DIR/etc/controlDict (see the sketch after the error output), the error is:
Code:
[0]
[0]
[0] --> FOAM FATAL IO ERROR:
[0] error in IOstream "IOstream" for operation operator>>(Istream&, List<T>&) : reading first token
[0]
[0] file: IOstream at line 0.
[0]
[0]     From function IOstream::fatalCheck(const char*) const
[0]     in file db/IOstreams/IOstreams/IOstream.C at line 114.
[0]
FOAM parallel run exiting
[0]
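For reference, the switch I changed is the commsType entry in the OptimisationSwitches section of $WM_PROJECT_DIR/etc/controlDict, roughly (a sketch, other switches omitted):

Code:
OptimisationSwitches
{
    // valid values: blocking, scheduled, nonBlocking
    commsType       blocking;
}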

Last edited by oswald; June 11, 2015 at 08:26. Reason: new information

Old   August 4, 2015, 11:32
Default reproduced
  #3
New Member
 
Join Date: Dec 2013
Posts: 4
clockworker
Hi there!

I ran into the same error message in a case similar to
$FOAM_TUT/lagrangian/reactingParcelFoam/verticalChannel/

On the tutorial case I was able to reproduce the described behavior
with the following commands:

Code:
#!/bin/sh
cd ${0%/*} || exit 1    # run from this directory

# Source tutorial run functions
. $WM_PROJECT_DIR/bin/tools/RunFunctions

# create mesh
runApplication blockMesh

cp -r 0.org 0

# initialise with potentialFoam solution
runApplication potentialFoam

rm -f 0/phi

# run the solver
runApplication pyFoamDecompose.py . 4
runApplication pyFoamPlotRunner.py mpirun -np 4 reactingParcelFoam -parallel

# ----------------------------------------------------------------- end-of-file
The calculation hangs at

Code:
...
Courant Number mean: 1.705107874 max: 4.895575368
deltaT = 0.0004761904762
Time = 0.0109524

Solving 3-D cloud reactingCloud1
with htop showing CPU usage of ~ 100 % on all cores.

If I deactivate
Code:
dispersionModel none;//stochasticDispersionRAS;
I can reproduce the error message in the OP:

Code:
--> Cloud: reactingCloud1 injector: model1
[$Hostname:15844] *** An error occurred in MPI_Recv
[$Hostname:15844] *** on communicator MPI_COMM_WORLD
[$Hostname:15844] *** MPI_ERR_TRUNCATE: message truncated
[$Hostname:15844] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
I use Ubuntu 14.04.3 LTS with openfoam240. Can anyone else confirm this behaviour or even provide a solution?
Thank you very much for your time.

Last edited by clockworker; August 5, 2015 at 03:19. Reason: Politeness, anonymity. Sorry, first time posting.

Old   August 7, 2015, 04:03
Default
  #4
New Member
 
Join Date: Dec 2013
Posts: 4
clockworker
Hi there,
I think I stumbled upon a solution.
I changed the reactingCloud1Properties from

Code:
massTotal       8;
duration        10000;
to

Code:
massTotal       0.0008;
duration        1;
and the calculation continued without the error messages.
Hope this helps someone.

Old   August 10, 2015, 10:18
Default
  #5
Member
 
Ferdinand Pfender
Join Date: May 2013
Location: Berlin, Germany
Posts: 40
Hmm, this does not really help.

What you changed is the time frame of the injection and the mass that is injected during this time.
The injection starts at SOI and runs for the defined duration.

If you change these values, you will definitely get results you don't want.

Greets,
Ferdi

Old   August 10, 2015, 19:19
Default 3rd try
  #6
New Member
 
Join Date: Dec 2013
Posts: 4
clockworker
Hi Ferdi,

I was under the impression that you can maintain a constant mass flow rate if you change massTotal proportionally to the duration, according to this
HTML Code:
http://www.dhcae-tools.com/images/dhcaeLTSThermoParcelSolver.pdf
as long as duration is longer than endTime. I stand corrected if this is not the case.
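Quick check with the numbers from post #4, assuming SI units (kg and s): 8 kg over 10000 s is 8e-4 kg/s, and 0.0008 kg over 1 s is also 8e-4 kg/s, so the nominal average injection rate stays the same.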
Nonetheless, at home on 2 cores I was no longer able to reproduce that workaround; the error messages appear no matter what I do with massTotal or duration.

What I tried now was changing the injectionModel from patchInjection to coneNozzleInjection like this:

Code:
injectionModels
{
    model1
    {
        type            coneNozzleInjection;
        SOI             0.01;
        massTotal       8;
        parcelBasisType mass;
        injectionMethod disc;
        flowType        constantVelocity;
        UMag            40;
        outerDiameter   6.5e-3;
        innerDiameter   0;
        duration        10000;
        position        ( 12.5e-3 -230e-3 0 );
        direction       ( 1 0 0 );
        parcelsPerSecond 1e5;
        flowRateProfile constant 1;
        Cd              constant 0.9;
        thetaInner      constant 0.0;
        thetaOuter      constant 1.0;

        sizeDistribution
        {
            type        general;
            generalDistribution
            {
                distribution
                (
                    (10e-06      0.0025)
                    (15e-06      0.0528)
                    (20e-06      0.2795)
                    (25e-06      1.0918)
                    (30e-06      2.3988)
                    (35e-06      4.4227)
                    (40e-06      6.3888)
                    (45e-06      8.6721)
                    (50e-06      10.3153)
                    (55e-06      11.6259)
                    (60e-06      12.0030)
                    (65e-06      10.4175)
                    (70e-06      10.8427)
                    (75e-06      8.0016)
                    (80e-06      6.1333)
                    (85e-06      3.8827)
                    (90e-06      3.4688)
                );
            }
        }
    }
}
And now the error messages disappear. I don't know if coneNozzleInjection is applicable in 2D, but perhaps it provides a workaround. Or perhaps ManualInjection could be an alternative. I still have to try this on my case at work, which is also 2D.
Thanks Ferdi for taking the time.
Greetings
clockworker
