Parallel Performance 2.1.1/AMI vs. 1.6-ext/GGI |
March 13, 2013, 11:34 |
Parallel Performance 2.1.1/AMI vs. 1.6-ext/GGI
|
#1 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Hello everybody,
I have found that the parallel performance of AMI is poor for simulations above ~100 cores. Description of my (test) case:
- 40M elements
- 4 different meshes coupled by GGI/AMI
- one of them rotating (turbine)
- transientSimpleDyMFoam
- partitions: 128, 256, 512, 1024
- versions: 1.6-ext and 2.1.1
Has anybody seen similar results, or suggestions for improving this?
Best regards,
Timo |
|
March 13, 2013, 13:16 |
|
#2 |
New Member
Marian Fuchs
Join Date: Dec 2010
Location: Berlin, Germany
Posts: 9
Rep Power: 16 |
Hello everyone,
and thanks to Timo for highlighting this important topic and sharing his experience with the community. Could you please add the speed-up plot comparing both methods for your test case to your post?
The principal outcome of the study was that the AMI performance in OpenFOAM-2.1.1 stagnates for parallel computations above approx. 100 cores (the speed-up is unity, measured relative to the performance at 128 cores). In contrast, the GGI in OpenFOAM-1.6-ext seems to perform fairly well ("globalFaceZones" was used during decomposition); the speed-up between 128 and 1024 cores is approx. 3.9.
best regards,
Marian |
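For readers who have not used it: "globalFaceZones" is a decomposeParDict entry in 1.6-ext that keeps the GGI face zones globally addressable on all processors so the GGI weights can be evaluated in a parallel run. A minimal sketch only; the method and zone names are placeholders, not taken from this case:
Code:
// system/decomposeParDict (1.6-ext) -- sketch
numberOfSubdomains  128;

method              metis;

// Face zones of the GGI interfaces, kept global across all processors
globalFaceZones
(
    rotorGgiZone
    statorGgiZone
);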
|
March 13, 2013, 13:38 |
|
#3 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Now it should be visible...
|
|
March 14, 2013, 15:33 |
|
#4 | |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Quote:
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request |
March 15, 2013, 08:19 |
|
#5 |
Senior Member
Niels Nielsen
Join Date: Mar 2009
Location: NJ - Denmark
Posts: 556
Rep Power: 27 |
Hi
I have a completely different result. It's based on a real pump geometry with 7 interfaces.
__________________
Linnemann PS. I do not do personal support, so please post in the forums. |
|
April 2, 2013, 07:20 |
|
#6 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Thanks for the suggestions.
For cases without an interface there is no performance problem. With 2.1.1 I get a segmentation fault with commsType blocking. The computational time at 128 cores is (almost) comparable.
@linnemann: you did the speed-up only up to 32 cores! BTW: how many elements do you have in total?
Best regards,
Timo |
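For context, commsType is one of the MPI-related optimisation switches and is normally set in the OptimisationSwitches section of the global controlDict. A sketch only, showing where the setting lives (not a claimed fix for the segfault):
Code:
// OptimisationSwitches, e.g. in $WM_PROJECT_DIR/etc/controlDict
// valid values: blocking, scheduled, nonBlocking
OptimisationSwitches
{
    commsType       nonBlocking;
}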
|
April 2, 2013, 07:30 |
|
#7 |
Senior Member
Niels Nielsen
Join Date: Mar 2009
Location: NJ - Denmark
Posts: 556
Rep Power: 27 |
Yes, I only did it up to 32 cores, but our cases are normally run on 12-24, so there is no need to go above that. The cell count is roughly 750k, all hex.
__________________
Linnemann PS. I do not do personal support, so please post in the forums. |
|
June 7, 2013, 09:16 |
|
#8 |
Senior Member
Hrvoje Jasak
Join Date: Mar 2009
Location: London, England
Posts: 1,907
Rep Power: 33 |
Please put all GGIs into a single patch (pair) and you will get massively better scaling.
Hrv
__________________
Hrvoje Jasak Providing commercial FOAM/OpenFOAM and CFD Consulting: http://wikki.co.uk |
|
June 7, 2013, 12:00 |
|
#9 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Hello Prof. Jasak,
do I understand you correctly that you recommend putting a GGI pair, i.e. the adjacent cells, on a single processor to get better performance? Henry already told me this, but I haven't tried it for the following reasons: the GGI patch has a cylindrical shape, which leads to a very poor distribution of the elements on the "GGI" processor, and the GGI patches have between 70k and 100k faces. With this method I would have to keep ~170k elements on one processor, which leads to a large imbalance given the aim of using ~40k elements per processor.
Best regards,
Timo |
|
June 7, 2013, 12:18 |
|
#10 |
Senior Member
Hrvoje Jasak
Join Date: Mar 2009
Location: London, England
Posts: 1,907
Rep Power: 33 |
No, what I said is that in a multi-stage machine you can take all rotating sides and put them into one ggi patch and all stationary sides and put them into another ggi patch.
The pair of patches then makes a single GGI interface, and this will make it run much faster: each GGI pair causes one additional parallel communication per iteration. I don't care about the GGI distribution across the various processors or the imbalance in GGI work. What matters is the balance of CELLS per processor, and this is easy to achieve.
What we saw in the previous picture is that having 7 GGI pairs ruins the performance, because they communicate 7 (additional) times instead of once.
Hope this helps,
Hrv
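To illustrate the merged pair described above, a single GGI patch pair in constant/polyMesh/boundary of 1.6-ext looks roughly like this (a minimal sketch; patch names, zone names and face counts are placeholders, not taken from this case):
Code:
rotorGgi
{
    type            ggi;
    nFaces          100000;
    startFace       1200000;
    shadowPatch     statorGgi;    // the other half of the single pair
    zone            rotorGgiZone;
    bridgeOverlap   false;
}

statorGgi
{
    type            ggi;
    nFaces          100000;
    startFace       1300000;
    shadowPatch     rotorGgi;
    zone            statorGgiZone;
    bridgeOverlap   false;
}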
__________________
Hrvoje Jasak Providing commercial FOAM/OpenFOAM and CFD Consulting: http://wikki.co.uk |
|
July 22, 2013, 09:29 |
|
#11 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Hello Prof. Jasak,
my file system crashed... So I tried to merge all rotating patches into one patch. The problem is: I have one interface that couples stationary-stationary, and this does not work with mixerGgiFvMesh. So I hacked the code: in mixerGgiFvMesh/mixerGgiFvMesh.C I replaced this block Code:
// Grab the ggi patches on the moving side
wordList movingPatches(dict_.subDict("slider").lookup("moving"));

forAll (movingPatches, patchI)
{
    const label movingSliderID =
        boundaryMesh().findPatchID(movingPatches[patchI]);

    if (movingSliderID < 0)
    {
        FatalErrorIn("void mixerGgiFvMeshTK::calcMovingMasks() const")
            << "Moving slider named " << movingPatches[patchI]
            << " not found. Valid patch names: "
            << boundaryMesh().names() << abort(FatalError);
    }

    const ggiPolyPatch& movingGgiPatch =
        refCast<const ggiPolyPatch>(boundaryMesh()[movingSliderID]);

    const labelList& movingSliderAddr = movingGgiPatch.zone();

    forAll (movingSliderAddr, faceI)
    {
        const face& curFace = f[movingSliderAddr[faceI]];

        forAll (curFace, pointI)
        {
            movingPointsMask[curFace[pointI]] = 1;
        }
    }
}
with the following, which works on a faceZone instead of a ggi patch: Code:
wordList movingFaceZones
(
    dict_.subDict("slider").lookup("movingFaceZones")
);

forAll (movingFaceZones, faceZoneI)
{
    Info<< "movingFaceZones Name: " << movingFaceZones[faceZoneI] << endl;

    faceZoneID zoneID(movingFaceZones[faceZoneI], faceZones());

    const labelList& movingSliderAddr = faceZones()[zoneID.index()];

    forAll (movingSliderAddr, faceI)
    {
        const face& curFace = f[movingSliderAddr[faceI]];

        forAll (curFace, pointI)
        {
            movingPointsMask[curFace[pointI]] = 1;
        }
    }
}
So, am I allowed to do it with the faceZone as written above? |
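For reference, a minimal sketch of the dynamicMeshDict slider entry that the modified lookup above would read (structure follows the standard mixerGgiFvMesh setup; zone names and values are placeholders):
Code:
// constant/dynamicMeshDict -- sketch only
dynamicFvMesh   mixerGgiFvMesh;    // or the locally modified class

mixerGgiFvMeshCoeffs
{
    coordinateSystem
    {
        type        cylindrical;
        origin      (0 0 0);
        axis        (0 0 1);
        direction   (1 0 0);
    }

    rpm         600;

    slider
    {
        // read by the hacked calcMovingMasks() instead of "moving"
        movingFaceZones ( rotorGgiZone );
    }
}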
|
July 22, 2013, 09:36 |
|
#12 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Another thing is the number of faces of the ggi.
With the 7-stage turbine: how many faces are on the GGI patches? I have ~200000 faces on the GGI, so I run into an n^2 problem. In OpenFOAM/interpolations/GGIInterpolation/GGIInterpolationWeights.C, around line 107, it says: Code:
// First, find a rough estimate of each slave and master facet
// neighborhood by filtering out all the faces located outside of
// an Axis-Aligned Bounding Box (AABB). Warning: This algorithm
// is based on the evaluation of AABB boxes, which is pretty fast;
// but still the complexity of the algorithm is n^2, wich is
// pretty bad for GGI patches composed of 100,000 of facets... So
// here is the place where we could certainly gain major speedup
// for larger meshes.
My question: how could I/we gain speedup for larger meshes? |
|
July 22, 2013, 17:40 |
|
#13 | |
Senior Member
Martin Beaudoin
Join Date: Mar 2009
Posts: 332
Rep Power: 22 |
Hello Timo,
> I have ~200000 faces on the GGI, so I run into an n^2 problem.
Well, that would be true if you were still using the AABB search algorithm for finding the GGI facet neighbours, or an old version of 1.6-ext. Almost 2 years ago, I introduced an octree-based search algorithm that speeds things up quite a bit when searching for GGI facet neighbours. This is now the default search algorithm for the GGI (take a look at the constructors of Foam::ggiPolyPatch), so you should no longer run into the n^2 problem you are mentioning.
Best,
Martin
|
July 23, 2013, 03:56 |
|
#14 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Hello Martin,
okay, I understand. Do you think it might be worth testing these parameters? And if so, do you have a rule of thumb for large facet counts? Code:
// For GGI patches larger than ~100K facets, your mileage may vary.
// So these 3 control parameters are adjustable using the following
// global optimization switches:
//
//     GGIOctreeSearchMinNLevel
//     GGIOctreeSearchMaxLeafRatio
//     GGIOctreeSearchMaxShapeRatio
Timo |
|
July 23, 2013, 10:41 |
|
#15 | |
Senior Member
Martin Beaudoin
Join Date: Mar 2009
Posts: 332
Rep Power: 22 |
Hello Timo,
> Do you think it might be worth testing these parameters?
Yup, for a large number of GGI facets, definitely. You can use the OptimisationSwitches section of your global controlDict file to play with these. Here are the default values, taken from GGIInterpolationQuickRejectTests.C: Code:
debug::optimisationSwitch("GGIOctreeSearchMinNLevel", 3)
debug::optimisationSwitch("GGIOctreeSearchMaxLeafRatio", 3)
debug::optimisationSwitch("GGIOctreeSearchMaxShapeRatio", 1)
> And if so, do you have a rule of thumb for large facet counts?
Not really. The default values I came up with are based on my own tests, using smaller meshes than yours. You can have a look at the header of octree.H for some comments on the values of those three parameters: Code:
The construction on the depth of the tree is:
- one can specify a minimum depth (though the tree will never be
  refined if all leaves contain <= 1 shapes)
- after the minimum depth two statistics are used to decide further
  refinement:
    - average number of entries per leaf (leafRatio). Since inside a
      leaf most algorithms are n or n^2 this value has to be small.
    - average number of leaves a shape is in. Because of bounding
      boxes, a single shape can be in multiple leaves. If the bbs are
      large compared to the leaf size this multiplicity can become
      extremely large and will become larger with more levels.
Best,
Martin
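In practice these three switches go into the OptimisationSwitches section Martin mentions; a minimal sketch with the default values quoted above (the exact file is installation-dependent, typically the global controlDict under the installation's etc directory):
Code:
// global controlDict -- OptimisationSwitches sketch, default values
OptimisationSwitches
{
    GGIOctreeSearchMinNLevel       3;
    GGIOctreeSearchMaxLeafRatio    3;
    GGIOctreeSearchMaxShapeRatio   1;
}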
|
April 25, 2014, 08:33 |
|
#16 |
Member
Timo K.
Join Date: Feb 2010
Location: University of Stuttgart
Posts: 66
Rep Power: 16 |
Hi Martin,
it has been quite a while, but I found my results again. I played around a bit (not really scientifically) with the parameters for a 14M element case and got a 5% performance improvement with the configuration 5-2-1.
Best,
Timo |
|