Parallel Performance 2.1.1/AMI vs. 1.6-ext/GGI |
March 13, 2013, 11:34 |
Parallel Performance 2.1.1/AMI vs. 1.6-ext/GGI
|
#1 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Hello everybody,
I have found that the parallel performance of AMI is poor for simulations above ~100 cores. Description of my (test) case:
- 40M elements
- 4 different meshes coupled by GGI/AMI
- one of them rotating (turbine)
- transientSimpleDyMFoam
- partitions: 128, 256, 512, 1024
- versions: 1.6-ext and 2.1.1
Has anybody seen similar results, or suggestions for improving this?
Best regards,
Timo |
|
March 13, 2013, 13:16 |
|
#2 |
New Member
Marian Fuchs
Join Date: Dec 2010
Location: Berlin, Germany
Posts: 9
Rep Power: 16 |
Hello everyone,
and thanks to Timo for highlighting this important topic and sharing his experience with the community. Could you please add the speed-up plot comparing both methods for your test case to your post?
The principal outcome of the study was that the AMI performance in OpenFOAM-2.1.1 stagnates for parallel computations above approx. 100 cores (the speed-up is unity, measured relative to the performance at 128 cores). In contrast, the GGI in OpenFOAM-1.6-ext seems to perform fairly well ("globalFaceZones" was used during decomposition); the speed-up between 128 and 1024 cores is approx. 3.9.
best regards,
Marian |
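For readers who have not used it: "globalFaceZones" is a decomposeParDict entry in 1.6-ext that keeps the GGI face zones globally addressable on all processors so the GGI weights can be evaluated in a parallel run. A minimal sketch only; the method and zone names are placeholders, not taken from this case:
Code:
// system/decomposeParDict (1.6-ext) -- sketch
numberOfSubdomains  128;

method              metis;

// Face zones of the GGI interfaces, kept global across all processors
globalFaceZones
(
    rotorGgiZone
    statorGgiZone
);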
|
March 13, 2013, 13:38 |
|
#3 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Now it should be visible...
|
|
March 14, 2013, 15:33 |
|
#4 | |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Quote:
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request |
March 15, 2013, 08:19 |
|
#5 |
Senior Member
Niels Nielsen
Join Date: Mar 2009
Location: NJ - Denmark
Posts: 556
Rep Power: 27 |
Hi
I have a completely different result. It's based on a real pump geometry with 7 interfaces.
__________________
Linnemann PS. I do not do personal support, so please post in the forums. |
|
April 2, 2013, 07:20 |
|
#6 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Thanks for the suggestions.
For cases without an interface there is no performance problem. With 2.1.1 I get a segmentation fault with commsType blocking. The computational time at 128 cores is (almost) comparable.
@linnemann: you did the speed-up only up to 32 cores! BTW: how many elements do you have in total?
Best regards,
Timo |
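For context, commsType is one of the MPI-related optimisation switches and is normally set in the OptimisationSwitches section of the global controlDict. A sketch only, showing where the setting lives (not a claimed fix for the segfault):
Code:
// OptimisationSwitches, e.g. in $WM_PROJECT_DIR/etc/controlDict
// valid values: blocking, scheduled, nonBlocking
OptimisationSwitches
{
    commsType       nonBlocking;
}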
|
April 2, 2013, 07:30 |
|
#7 |
Senior Member
Niels Nielsen
Join Date: Mar 2009
Location: NJ - Denmark
Posts: 556
Rep Power: 27 |
Yes, I only did it up to 32 cores, but our cases are normally run on 12-24, so there is no need to go above that. The cell count is roughly 750k, all hex.
__________________
Linnemann PS. I do not do personal support, so please post in the forums. |
|
June 7, 2013, 09:16 |
|
#8 |
Senior Member
Hrvoje Jasak
Join Date: Mar 2009
Location: London, England
Posts: 1,907
Rep Power: 33 |
Please put all GGIs into a single patch (pair) and you will get massively better scaling.
Hrv
__________________
Hrvoje Jasak Providing commercial FOAM/OpenFOAM and CFD Consulting: http://wikki.co.uk |
|
June 7, 2013, 12:00 |
|
#9 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Hello Prof. Jasak,
do I understand you correctly that you recommend putting a GGI pair, i.e. the adjacent cells, on a single processor to get better performance? Henry already told me this, but I haven't tried it for the following reasons: the GGI patch has a cylindrical shape, which leads to a very poor distribution of the elements on the "GGI" processor, and the GGI patches have between 70k and 100k faces. With this method I would have to keep ~170k elements on one processor, which leads to a large imbalance given the aim of using ~40k elements per processor.
Best regards,
Timo |
|
June 7, 2013, 12:18 |
|
#10 |
Senior Member
Hrvoje Jasak
Join Date: Mar 2009
Location: London, England
Posts: 1,907
Rep Power: 33 |
No, what I said is that in a multi-stage machine you can take all rotating sides and put them into one ggi patch and all stationary sides and put them into another ggi patch.
The pair of patches then makes a single GGI interface, and this will make it run much faster: each GGI pair causes one additional parallel communication per iteration. I don't care about the GGI distribution across the various processors or the imbalance in GGI work. What matters is the balance of CELLS per processor, and this is easy to achieve.
What we saw in the previous picture is that having 7 GGI pairs ruins the performance, because they communicate 7 (additional) times instead of once.
Hope this helps,
Hrv
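To illustrate the merged pair described above, a single GGI patch pair in constant/polyMesh/boundary of 1.6-ext looks roughly like this (a minimal sketch; patch names, zone names and face counts are placeholders, not taken from this case):
Code:
rotorGgi
{
    type            ggi;
    nFaces          100000;
    startFace       1200000;
    shadowPatch     statorGgi;    // the other half of the single pair
    zone            rotorGgiZone;
    bridgeOverlap   false;
}

statorGgi
{
    type            ggi;
    nFaces          100000;
    startFace       1300000;
    shadowPatch     rotorGgi;
    zone            statorGgiZone;
    bridgeOverlap   false;
}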
__________________
Hrvoje Jasak Providing commercial FOAM/OpenFOAM and CFD Consulting: http://wikki.co.uk |
|
July 22, 2013, 09:29 |
|
#11 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Hello Prof. Jasak,
my file system crashed... So I tried to merge all rotating patches into one patch. The problem is: I have one interface that couples stationary-stationary, and this does not work with mixerGgiFvMesh. So I hacked the code: in mixerGgiFvMesh/mixerGgiFvMesh.C I replaced this block Code:
// Grab the ggi patches on the moving side
wordList movingPatches(dict_.subDict("slider").lookup("moving"));

forAll (movingPatches, patchI)
{
    const label movingSliderID =
        boundaryMesh().findPatchID(movingPatches[patchI]);

    if (movingSliderID < 0)
    {
        FatalErrorIn("void mixerGgiFvMeshTK::calcMovingMasks() const")
            << "Moving slider named " << movingPatches[patchI]
            << " not found. Valid patch names: "
            << boundaryMesh().names() << abort(FatalError);
    }

    const ggiPolyPatch& movingGgiPatch =
        refCast<const ggiPolyPatch>(boundaryMesh()[movingSliderID]);

    const labelList& movingSliderAddr = movingGgiPatch.zone();

    forAll (movingSliderAddr, faceI)
    {
        const face& curFace = f[movingSliderAddr[faceI]];

        forAll (curFace, pointI)
        {
            movingPointsMask[curFace[pointI]] = 1;
        }
    }
}
with the following, which works on a faceZone instead of a ggi patch: Code:
wordList movingFaceZones
(
    dict_.subDict("slider").lookup("movingFaceZones")
);

forAll (movingFaceZones, faceZoneI)
{
    Info<< "movingFaceZones Name: " << movingFaceZones[faceZoneI] << endl;

    faceZoneID zoneID(movingFaceZones[faceZoneI], faceZones());

    const labelList& movingSliderAddr = faceZones()[zoneID.index()];

    forAll (movingSliderAddr, faceI)
    {
        const face& curFace = f[movingSliderAddr[faceI]];

        forAll (curFace, pointI)
        {
            movingPointsMask[curFace[pointI]] = 1;
        }
    }
}
So, am I allowed to do it with the faceZone as written above? |
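For reference, a minimal sketch of the dynamicMeshDict slider entry that the modified lookup above would read (structure follows the standard mixerGgiFvMesh setup; zone names and values are placeholders):
Code:
// constant/dynamicMeshDict -- sketch only
dynamicFvMesh   mixerGgiFvMesh;    // or the locally modified class

mixerGgiFvMeshCoeffs
{
    coordinateSystem
    {
        type        cylindrical;
        origin      (0 0 0);
        axis        (0 0 1);
        direction   (1 0 0);
    }

    rpm         600;

    slider
    {
        // read by the hacked calcMovingMasks() instead of "moving"
        movingFaceZones ( rotorGgiZone );
    }
}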
|
July 22, 2013, 09:36 |
|
#12 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Another thing is the number of faces of the ggi.
With the 7-stage turbine: how many faces are on the GGI patches? I have ~200000 faces on the GGI, so I run into an n^2 problem. In OpenFOAM/interpolations/GGIInterpolation/GGIInterpolationWeights.C, around line 107, it says: Code:
// First, find a rough estimate of each slave and master facet
// neighborhood by filtering out all the faces located outside of
// an Axis-Aligned Bounding Box (AABB). Warning: This algorithm
// is based on the evaluation of AABB boxes, which is pretty fast;
// but still the complexity of the algorithm is n^2, wich is
// pretty bad for GGI patches composed of 100,000 of facets... So
// here is the place where we could certainly gain major speedup
// for larger meshes.
My question: how could I/we gain speedup for larger meshes? |
|
July 22, 2013, 17:40 |
|
#13 | |
Senior Member
Martin Beaudoin
Join Date: Mar 2009
Posts: 332
Rep Power: 22 |
Hello Timo,
> I have ~200000 faces on the GGI, so I run into an n^2 problem.
Well, that would be true if you were still using the AABB search algorithm for finding the GGI facet neighbours, or an old version of 1.6-ext. Almost 2 years ago, I introduced an octree-based search algorithm that speeds things up quite a bit when searching for GGI facet neighbours. This is now the default search algorithm for the GGI (take a look at the constructors of Foam::ggiPolyPatch), so you should no longer run into the n^2 problem you are mentioning.
Best,
Martin
|
July 23, 2013, 03:56 |
|
#14 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Hello Martin,
okay, I understand. Do you think it might be worth testing these parameters? And if so, do you have a rule of thumb for large facet counts? Code:
// For GGI patches larger than ~100K facets, your mileage may vary.
// So these 3 control parameters are adjustable using the following
// global optimization switches:
//
//     GGIOctreeSearchMinNLevel
//     GGIOctreeSearchMaxLeafRatio
//     GGIOctreeSearchMaxShapeRatio
Timo |
|
July 23, 2013, 10:41 |
|
#15 | |
Senior Member
Martin Beaudoin
Join Date: Mar 2009
Posts: 332
Rep Power: 22 |
Hello Timo,
> Do you think it might be worth testing these parameters?
Yup, for a large number of GGI facets, definitely. You can use the OptimisationSwitches section of your global controlDict file to play with these. Here are the default values, taken from GGIInterpolationQuickRejectTests.C: Code:
debug::optimisationSwitch("GGIOctreeSearchMinNLevel", 3)
debug::optimisationSwitch("GGIOctreeSearchMaxLeafRatio", 3)
debug::optimisationSwitch("GGIOctreeSearchMaxShapeRatio", 1)
> And if so, do you have a rule of thumb for large facet counts?
Not really. The default values I came up with are based on my own tests, using smaller meshes than yours. You can have a look at the header of octree.H for some comments on the values of those three parameters: Code:
The construction on the depth of the tree is:
- one can specify a minimum depth (though the tree will never be
  refined if all leaves contain <= 1 shapes)
- after the minimum depth two statistics are used to decide further
  refinement:
    - average number of entries per leaf (leafRatio). Since inside a
      leaf most algorithms are n or n^2 this value has to be small.
    - average number of leaves a shape is in. Because of bounding
      boxes, a single shape can be in multiple leaves. If the bbs are
      large compared to the leaf size this multiplicity can become
      extremely large and will become larger with more levels.
Best,
Martin
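In practice these three switches go into the OptimisationSwitches section Martin mentions; a minimal sketch with the default values quoted above (the exact file is installation-dependent, typically the global controlDict under the installation's etc directory):
Code:
// global controlDict -- OptimisationSwitches sketch, default values
OptimisationSwitches
{
    GGIOctreeSearchMinNLevel       3;
    GGIOctreeSearchMaxLeafRatio    3;
    GGIOctreeSearchMaxShapeRatio   1;
}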
|
April 25, 2014, 08:33 |
|
#16 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Hi Martin,
it has been quite a while, but I found my results again. I played around a bit (not really scientifically) with the parameters for a 14M element case and got a 5% performance improvement with the configuration 5-2-1.
Best,
Timo |
|