poor performance at massive parallel run using SGI cluster |

October 19, 2011, 11:28
poor performance at massive parallel run using SGI cluster
#1
Member
Matthias Walter
Join Date: Mar 2009
Location: Rostock, Germany
Posts: 63
Rep Power: 17
Hi all,
I would like to bring this issue to the forum since I am out of ideas for optimizing my setup. Some time ago I ran a case with 190 million cells on an SGI ICE cluster using 2048 cores. For parallelization I used MPI and MPT, which showed similar performance. An iteration step took approximately 10 s (pisoFoam plus an advection-diffusion equation). So far so good!

Now I would like to run a 350 million cell case on the same SGI ICE cluster using 2720 cores. The mesh is decomposed with scotch; each processor has approximately 125000 cells. The setup is identical to the previous case, but now an iteration step takes 120 s. I have already tried different MPT optimization flags without success. Most of the time is spent on the pressure correction (GAMG), although only a few iterations (at most 7) are needed. I use 3 nCorrectors and 2 nNonOrthogonalCorrectors (as before). The new mesh was created with refineMesh and shows no errors in checkMesh. If I limit the maximum number of iterations to two in the GAMG settings, an iteration step takes approximately 30 s, but there must be a better solution. Code:
Time = 0.0009171

Courant Number mean: 0.0001917142724 max: 0.287876642
DILUPBiCG: Solving for Ux, Initial residual = 3.052289654e-06, Final residual = 2.738676051e-10, No Iterations 1
DILUPBiCG: Solving for Uy, Initial residual = 5.981403037e-05, Final residual = 6.161745588e-09, No Iterations 1
DILUPBiCG: Solving for Uz, Initial residual = 6.441312049e-05, Final residual = 7.118668359e-09, No Iterations 1
GAMG: Solving for p, Initial residual = 0.001449305359, Final residual = 9.574983996e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 0.0001347876892, Final residual = 2.672140607e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 3.610848126e-05, Final residual = 7.369521828e-06, No Iterations 2
time step continuity errors : sum local = 5.649504504e-14, global = -4.718524356e-16, cumulative = -5.771567132e-15
GAMG: Solving for p, Initial residual = 0.0005869720633, Final residual = 4.412817839e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 6.206812369e-05, Final residual = 1.367993617e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 1.819577585e-05, Final residual = 5.044909251e-06, No Iterations 2
time step continuity errors : sum local = 3.866414897e-14, global = -5.610493932e-18, cumulative = -5.777177626e-15
GAMG: Solving for p, Initial residual = 4.005752672e-05, Final residual = 7.733093149e-06, No Iterations 2
GAMG: Solving for p, Initial residual = 1.005946488e-05, Final residual = 4.57022811e-06, No Iterations 2
GAMG: Solving for p, Initial residual = 5.315044337e-06, Final residual = 3.059376256e-06, No Iterations 2
time step continuity errors : sum local = 2.344610732e-14, global = -8.601005524e-18, cumulative = -5.785778632e-15
DILUPBiCG: Solving for F, Initial residual = 1.423119179e-06, Final residual = 1.152704014e-10, No Iterations 1
DILUPBiCG: Solving for LLMM, Initial residual = 0.0001600554571, Final residual = 1.266482924e-08, No Iterations 1
DILUPBiCG: Solving for MMMM, Initial residual = 6.3441275e-05, Final residual = 2.470064085e-09, No Iterations 1
DILUPBiCG: Solving for NNMM, Initial residual = 0.0001706087254, Final residual = 1.25219109e-08, No Iterations 1
DILUPBiCG: Solving for LFMFF_LDMMS, Initial residual = 5.309817933e-05, Final residual = 1.139502859e-08, No Iterations 1
DILUPBiCG: Solving for MFMFF_LDMMS, Initial residual = 5.665481015e-05, Final residual = 3.567372018e-08, No Iterations 1
DILUPBiCG: Solving for NFMFF_LDMMS, Initial residual = 5.804721854e-05, Final residual = 9.66232654e-09, No Iterations 1
ExecutionTime = 765.53 s  ClockTime = 814 s

Time = 0.000918

Courant Number mean: 0.0001917165609 max: 0.2878756917
DILUPBiCG: Solving for Ux, Initial residual = 3.052300658e-06, Final residual = 2.738577117e-10, No Iterations 1
DILUPBiCG: Solving for Uy, Initial residual = 5.97459875e-05, Final residual = 6.158196307e-09, No Iterations 1
DILUPBiCG: Solving for Uz, Initial residual = 6.448140805e-05, Final residual = 7.136262812e-09, No Iterations 1
GAMG: Solving for p, Initial residual = 0.001456275326, Final residual = 9.601357664e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 0.0001362132073, Final residual = 2.654885927e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 3.614924123e-05, Final residual = 7.118986273e-06, No Iterations 2
time step continuity errors : sum local = 5.455807911e-14, global = 2.641470572e-16, cumulative = -5.521631575e-15
GAMG: Solving for p, Initial residual = 0.0005872826127, Final residual = 4.418221094e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 6.214854047e-05, Final residual = 1.378557672e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 1.834875281e-05, Final residual = 5.077590028e-06, No Iterations 2
time step continuity errors : sum local = 3.890282517e-14, global = 1.330261159e-16, cumulative = -5.388605459e-15
GAMG: Solving for p, Initial residual = 4.023426514e-05, Final residual = 7.565715719e-06, No Iterations 2
GAMG: Solving for p, Initial residual = 9.852083357e-06, Final residual = 4.413431217e-06, No Iterations 2
GAMG: Solving for p, Initial residual = 5.164066985e-06, Final residual = 3.023006277e-06, No Iterations 2
time step continuity errors : sum local = 2.316040413e-14, global = 6.903163248e-17, cumulative = -5.319573826e-15
DILUPBiCG: Solving for F, Initial residual = 1.423187154e-06, Final residual = 1.152845032e-10, No Iterations 1
DILUPBiCG: Solving for LLMM, Initial residual = 0.0001611766838, Final residual = 1.291849175e-08, No Iterations 1
DILUPBiCG: Solving for MMMM, Initial residual = 6.407629502e-05, Final residual = 2.486206485e-09, No Iterations 1
DILUPBiCG: Solving for NNMM, Initial residual = 0.0001709372216, Final residual = 1.2537942e-08, No Iterations 1
DILUPBiCG: Solving for LFMFF_LDMMS, Initial residual = 5.310868952e-05, Final residual = 1.139492543e-08, No Iterations 1
DILUPBiCG: Solving for MFMFF_LDMMS, Initial residual = 5.666380492e-05, Final residual = 3.561810739e-08, No Iterations 1
DILUPBiCG: Solving for NFMFF_LDMMS, Initial residual = 5.80577229e-05, Final residual = 9.6605096e-09, No Iterations 1
ExecutionTime = 801.32 s  ClockTime = 850 s

In my opinion, 30000 more cells per processor cannot increase the iteration time from ~10 s to ~120 s. Are 125000 cells too much for one core? BTW, the CPU is a Nehalem EP (X5570) running at 2.93 GHz.

Best regards
Matthias
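
(The post does not include the dictionaries themselves. As a minimal sketch of what the described scotch decomposition and the GAMG iteration cap might look like; every value not stated above is an illustrative assumption, not taken from the post.) Code:
// system/decomposeParDict -- 2720 subdomains, scotch decomposition as described
numberOfSubdomains  2720;
method              scotch;

// system/fvSolution, solvers sub-dictionary -- capping the GAMG iterations at 2
p
{
    solver                  GAMG;
    smoother                GaussSeidel;   // smoother choice assumed, not stated in the post
    agglomerator            faceAreaPair;
    nCellsInCoarsestLevel   30;            // assumed; this value is discussed later in the thread
    cacheAgglomeration      true;
    tolerance               1e-07;
    relTol                  1e-02;
    maxIter                 2;             // the iteration cap mentioned above
}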

October 19, 2011, 22:58
#2
Member
Fábio César Canesin
Join Date: Mar 2010
Location: Florianópolis
Posts: 67
Rep Power: 16
Matthias, try using PCG for the pressure, or increase the number of cells in the coarsest level of GAMG. GAMG needs too much communication; that is what is driving up the cost of the simulation.
p
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-07;
    relTol          1e-02;
}

pFinal
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-08;
    relTol          0;
}
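
(Fábio's second suggestion, enlarging the coarsest GAMG level, is not shown in his post. A minimal sketch of what such an entry might look like, with purely illustrative values:) Code:
p
{
    solver                  GAMG;
    smoother                GaussSeidel;
    agglomerator            faceAreaPair;
    nCellsInCoarsestLevel   500;    // illustrative; larger than the usual 10-30 to cut the number of coarse levels
    cacheAgglomeration      true;
    mergeLevels             1;
    tolerance               1e-07;
    relTol                  1e-02;
}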

October 20, 2011, 11:17
#3
Senior Member
Kyle Mooney
Join Date: Jul 2009
Location: San Francisco, CA USA
Posts: 323
Rep Power: 18
I usually aim for 50k cells per processor but I haven't done too much quantification of the speeds.
I tweaked the GAMG settings for a parallel run and had pretty good speeds with this configuration (although I can't remember my original settings):

p GAMG
{
    agglomerator            faceAreaPair;
    nCellsInCoarsestLevel   30;
    cacheAgglomeration      true;
    directSolveCoarsest     false;
    nPreSweeps              1;
    nPostSweeps             2;
    nFinestSweeps           2;
    tolerance               1e-07;
    relTol                  0.0;
    smoother                GaussSeidel;
    mergeLevels             2;
    minIter                 0;
    maxIter                 10;
};

Here is another thread (although a little old at this point) where users discussed speedup with OpenFOAM's multigrid solvers. They link to some interesting presentations as well: http://www.cfd-online.com/Forums/ope...ward-step.html

October 20, 2011, 14:31
#4
Senior Member
Francois
Join Date: Jun 2010
Posts: 107
Rep Power: 21
I don't know whether the PCG solver with the DIC preconditioner will be faster; I don't have experience with domains this large.
What you could also try is using the GAMG solver as preconditioner. Something like: Code:
p
{
    solver          PCG;
    preconditioner
    {
        preconditioner          GAMG;
        tolerance               1e-10;
        relTol                  0;
        smoother                DICGaussSeidel;
        nPreSweeps              0;
        nPostSweeps             2;
        nFinestSweeps           2;
        cacheAgglomeration      false;
        nCellsInCoarsestLevel   10;
        agglomerator            faceAreaPair;
        mergeLevels             2;
    }
    tolerance       1e-10;
    relTol          0;
}

Kind regards,
Francois.

October 20, 2011, 14:52
#5
Senior Member
Alberto Passalacqua
Join Date: Mar 2009
Location: Ames, Iowa, United States
Posts: 1,912
Rep Power: 36
PCG methods scale better, even though they require more iterations, so the suggestion is correct.
The number of cells you are using (~120k/processor) should be fine too.

October 21, 2011, 07:35
#6
Member
Matthias Walter
Join Date: Mar 2009
Location: Rostock, Germany
Posts: 63
Rep Power: 17
I tested all the settings posted here, but none of them turned out to be the holy grail.
@Canesin, Alberto: using only PCG with DIC works best so far; an iteration now takes ~60 s.
@kmooney: an iteration takes about 70 s.
@Fransje: with this setup an iteration takes about 200 s, much more than with PCG or GAMG alone.
Next, I will test what happens with 3600 cores (~98k cells per processor). Maybe the time per iteration improves, or maybe the extra communication eats up all the savings.

October 21, 2011, 08:00
#7
Member
Fábio César Canesin
Join Date: Mar 2010
Location: Florianópolis
Posts: 67
Rep Power: 16
Add the following option to the controlDict:

commsType nonBlocking

Also try setting the PCG relative tolerance to 0.1 instead of 1e-02. But the most important thing is to look at the cluster infrastructure: how many cores are in one rack? How is the rack-to-rack communication? Maybe you need to create a local copy of the case on each node. Imagine you are sharing a folder on the head node of one rack and another rack three levels up in the hierarchy needs to download the mesh: it will need the time to transfer the fields and mesh plus three routing hops. There is an option in decomposeParDict that makes it possible to store the data locally.
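
(A rough sketch of where these settings live; the keywords and paths below are assumptions based on common OpenFOAM usage, not taken from the post, and the details vary between versions.) Code:
// Non-blocking MPI communications: usually set in $WM_PROJECT_DIR/etc/controlDict;
// many versions also accept this override in the case's system/controlDict.
OptimisationSwitches
{
    commsType       nonBlocking;   // alternatives: blocking, scheduled
}

// Keeping decomposed data on node-local disks: system/decomposeParDict
distributed     yes;
roots
(
    "/local/scratch/matthias/case"   // hypothetical path; one entry per slave node
);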

October 21, 2011, 08:42
#8
Member
Matthias Walter
Join Date: Mar 2009
Location: Rostock, Germany
Posts: 63
Rep Power: 17
I will try a relative tolerance of 0.1, but doesn't increasing the relative tolerance also impair the numerical solution?
The cluster has a parallel storage system connected by InfiniBand with two rails, so data transport should not be a problem. BTW, hopping from IB switch to IB switch takes only a few nanoseconds.

October 21, 2011, 09:24
#9
Senior Member
Anton Kidess
Join Date: May 2009
Location: Germany
Posts: 1,377
Rep Power: 30
I think you're wasting a lot of time by using nCellsInCoarsestLevel 10 or 30. At that point you spend more time on interpolation and restriction than you gain from the coarser mesh. Increasing nCellsInCoarsestLevel to 1000 should improve the GAMG performance. Also, the more levels you have, the more you have to communicate.
As Alberto already stated, though, PCG requires less communication, so with many processor boundaries you may never be able to beat it with GAMG.
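
(A minimal sketch of Kyle's dictionary from post #3 with the coarsest level enlarged along these lines; the other values are carried over purely for illustration.) Code:
p GAMG
{
    agglomerator            faceAreaPair;
    nCellsInCoarsestLevel   1000;   // enlarged coarsest level: fewer multigrid levels, less communication
    cacheAgglomeration      true;
    nPreSweeps              1;
    nPostSweeps             2;
    nFinestSweeps           2;
    smoother                GaussSeidel;
    mergeLevels             2;
    tolerance               1e-07;
    relTol                  0.0;
};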