CFD Online (www.cfd-online.com), Main CFD Forum

How to group cells to distribute onto parallel processors/threads?
October 31, 2020, 09:59 | #1
Sayan Bhattacharjee (aerosayan), Senior Member
I'm trying to parallelize my explicit compressible Euler FVM solver using OpenMP.

I want to distribute the cells onto 2 threads, but before doing so I want to group them into 2 distinct sets.

This is my current result: https://imgur.com/a/GrU9jvy

Cells of the same color are assigned to the same thread.

As can be seen, my current grouping algorithm needs improvement. At the moment I simply divide the cell list into 2 parts and hand one part to each thread. However, this produces the scattered distribution visible in the image, which is a problem for me.

I would like to minimize the intersection boundary between the cells owned by the 2 threads, i.e. form 2 groups of cells that are closely packed rather than scattered as shown in the image.

Is there any algorithm you can recommend for this?

October 31, 2020, 11:02 | #2
Alex (flotus1), Super Moderator
Domain decomposition is not the most straightforward load distribution mechanism to go along with OpenMP. That is usually coupled with MPI parallelization. I'm not saying it can't be done, or doesn't exist in the wild, just that it's not the most common combination.

What you are searching for falls under the term graph partitioning. There are quite a few implementations available; Scotch and Metis are two of the more commonly used ones.

BTW: if your colored graph reflects the numbering of your grid, there is a lot of performance up for grabs by renumbering the elements. The optimization goal is increased locality, i.e. elements that are direct spatial neighbors should also be as close as possible in their memory index. That increases cache reuse. Doing that first would also result in a much better domain decomposition when simply dividing the mesh along the memory index.
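
For reference, a minimal sketch of what a METIS call for this could look like (assuming the METIS 5.x C API; the tiny 4-cell adjacency graph below is made up purely for illustration):

Code:
// Partition a cell-connectivity graph into 2 parts with METIS 5.x.
// xadj/adjncy is the usual CSR adjacency: the neighbors of cell i are
// adjncy[xadj[i] .. xadj[i+1]-1].
#include <metis.h>
#include <cstdio>
#include <vector>

int main()
{
    // Toy graph: 4 cells in a row, 0-1-2-3 (illustration only).
    std::vector<idx_t> xadj   = {0, 1, 3, 5, 6};
    std::vector<idx_t> adjncy = {1, 0, 2, 1, 3, 2};

    idx_t nvtxs  = 4;   // number of cells (graph vertices)
    idx_t ncon   = 1;   // one balancing constraint (cell count)
    idx_t nparts = 2;   // number of threads/domains
    idx_t objval = 0;   // METIS returns the edge-cut here (a communication proxy)
    std::vector<idx_t> part(nvtxs);

    int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                                     nullptr, nullptr, nullptr, &nparts,
                                     nullptr, nullptr, nullptr,
                                     &objval, part.data());

    if (status == METIS_OK)
        for (idx_t i = 0; i < nvtxs; ++i)
            std::printf("cell %d -> domain %d\n", (int)i, (int)part[i]);
    return 0;
}
The returned part[] array is exactly the kind of cell-to-thread grouping asked for above, with the edge-cut (shared faces between domains) kept small by METIS.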

October 31, 2020, 11:38 | #3
Sayan Bhattacharjee (aerosayan), Senior Member
Quote:
Originally Posted by flotus1 View Post
Domain decomposition is not the most straightforward load distribution mechanism to go along with OpenMP. That is usually coupled with MPI parallelization. I'm not saying it can't be done, or doesn't exist in the wild, just that it's not the most common combination.

What you are searching for falls under the term graph partitioning. There are quite a few implementations available; Scotch and Metis are two of the more commonly used ones.

BTW: if your colored graph reflects the numbering of your grid, there is a lot of performance up for grabs by renumbering the elements. The optimization goal is increased locality, i.e. elements that are direct spatial neighbors should also be as close as possible in their memory index. That increases cache reuse. Doing that first would also result in a much better domain decomposition when simply dividing the mesh along the memory index.

Thanks for the reply. I'm new to solver development, so I don't know all of the terms very well. I'm trying to group the cells to improve performance by decreasing cache misses. As you proposed, I will renumber the cells within each group to decrease cache misses.

I also want to decompose the domain into multiple groups because I plan to implement MPI support in the future.

I was able to come up with a basic decomposition method: I simply sort the nodes by their x-coordinate and then use a node_to_cell connectivity map to find the cells connected to each of the sorted nodes.

I was able to create 4 and 8 groups of cells using this simple method.

This works pretty well for me.

RESULTS : https://imgur.com/a/EBa7QBg
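
For reference, a rough sketch of this kind of coordinate-sort grouping (the node_x and node_to_cell arrays below are hypothetical placeholders, not the actual solver data structures):

Code:
// Sketch of the x-coordinate splitting described above (names hypothetical).
// Nodes are sorted by x; each cell is assigned to the group of the first
// sorted node that touches it ("first touch wins").
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<int> split_by_x(const std::vector<double>& node_x,
                            const std::vector<std::vector<int>>& node_to_cell,
                            int n_cells, int n_groups)
{
    // Sort node indices by their x-coordinate.
    std::vector<int> order(node_x.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return node_x[a] < node_x[b]; });

    std::vector<int> cell_group(n_cells, -1);
    const std::size_t nodes_per_group = order.size() / n_groups + 1;

    for (std::size_t i = 0; i < order.size(); ++i)
    {
        const int group = std::min<int>(int(i / nodes_per_group), n_groups - 1);
        for (int c : node_to_cell[order[i]])
            if (cell_group[c] < 0) cell_group[c] = group;   // first touch wins
    }
    return cell_group;
}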

October 31, 2020, 11:46 | #4
Paolo Lampitella (sbaffini), Senior Member
Space-filling curves will work great for both of your tasks, the current one (shared-memory parallelization) and the future one (distributed parallelization).

Still, you have to deeply understand how cache access works in shared memory in order to properly arrange the shared-memory loops (which is where shared-memory parallelization actually happens).
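
As a very rough illustration (array names hypothetical), the shared-memory loop itself can be as simple as a statically scheduled OpenMP loop over the contiguous, locality-ordered cell array:

Code:
// Minimal OpenMP cell-update loop (names hypothetical). A static schedule
// hands each thread one contiguous, cache-friendly block of the already
// locality-ordered cell array.
#include <vector>

void update_cells(std::vector<double>& u,
                  const std::vector<double>& residual,
                  double dt_over_vol)
{
    const int n = static_cast<int>(u.size());
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        u[i] -= dt_over_vol * residual[i];
}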

October 31, 2020, 16:44 | #5
Alex (flotus1), Super Moderator
Quote:
Originally Posted by aerosayan View Post
I was able to come up with a basic decomposition method: I simply sort the nodes by their x-coordinate and then use a node_to_cell connectivity map to find the cells connected to each of the sorted nodes.

This works pretty well for me.
Approaches like that work great...up to a certain point. Once your geometry gets more complicated, or you need more than a handful of domains, load balancing becomes almost impossible, and communication volume goes through the roof.
Space-filling curves are another interesting approach which works to some extent. But they do a poor job at minimizing, let alone equalizing communication volumes. And they have other limitations, e.g. once you have more than one load-balancing constraint, apart from element count. They are great for ordering elements in a cache-friendly way, but not necessarily the best choice for domain decomposition.
It all depends on where you want to go with this. If the examples you have shown here are as complex as it will get, then domain decomposition based on a spatial coordinate is fine. If you want to go further, you will have to look for alternatives.

November 1, 2020, 07:25 | #6
Sayan Bhattacharjee (aerosayan), Senior Member
Quote:
Originally Posted by sbaffini View Post
Space-filling curves will work great for both of your tasks, the current one (shared-memory parallelization) and the future one (distributed parallelization).

Still, you have to deeply understand how cache access works in shared memory in order to properly arrange the shared-memory loops (which is where shared-memory parallelization actually happens).

I took your and flotus1's advice and used z-order-curve-based mesh cell renumbering.

RESULTS : https://imgur.com/a/gBQg20X


In case someone else wants to use the same approach, I'm sharing the source...



CODE : https://pastebin.com/AvzmJNJX

EXPLANATION : https://archive.is/Th2xk
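
(For anyone who prefers not to follow the links: a generic 2D z-order renumbering sketch, which is not the pastebin code, could look roughly like this; the grid spacing h and all array names are assumptions.)

Code:
// Generic 2D z-order (Morton) renumbering sketch -- NOT the pastebin code.
// Cell centers are quantized to an integer grid (spacing h) and the cells
// are then sorted by the interleaved key.
#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <numeric>
#include <vector>

// Spread the lowest 16 bits of x so there is one zero bit between them.
static uint32_t part1by1(uint32_t x)
{
    x &= 0x0000FFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

static uint32_t morton2d(uint32_t ix, uint32_t iy)
{
    return part1by1(ix) | (part1by1(iy) << 1);
}

// Returns a permutation: new_order[k] is the old index of the k-th cell.
// Assumes the quantized indices (x - xmin)/h and (y - ymin)/h fit in 16 bits.
std::vector<int> zorder_permutation(const std::vector<double>& xc,
                                    const std::vector<double>& yc,
                                    double xmin, double ymin, double h)
{
    std::vector<uint32_t> key(xc.size());
    for (std::size_t i = 0; i < xc.size(); ++i)
        key[i] = morton2d(static_cast<uint32_t>((xc[i] - xmin) / h),
                          static_cast<uint32_t>((yc[i] - ymin) / h));

    std::vector<int> new_order(xc.size());
    std::iota(new_order.begin(), new_order.end(), 0);
    std::sort(new_order.begin(), new_order.end(),
              [&](int a, int b) { return key[a] < key[b]; });
    return new_order;
}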

November 2, 2020, 09:03 | #7
Paolo Lampitella (sbaffini), Senior Member
Quote:
Originally Posted by flotus1 View Post
Space-filling curves are another interesting approach which works to some extent. But they do a poor job at minimizing, let alone equalizing communication volumes. And they have other limitations, e.g. once you have more than one load-balancing constraint, apart from element count. They are great for ordering elements in a cache-friendly way, but not necessarily the best choice for domain decomposition.
Of course I agree; the SFC suggestion was along the lines of a DIY solution with a great results/effort ratio, especially considering the shared/distributed requirement. However, with a little more coding you can add weights to the cells and take them into account together with the SFC ordering, which is slightly more flexible than plain SFC ordering.
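
A minimal sketch of that weighted splitting along an existing SFC ordering might look like this (the per-cell weight array is whatever work estimate you have; all names are hypothetical):

Code:
// Split an SFC-ordered cell list into n_parts chunks of roughly equal total
// weight. weight[i] is the estimated work of the i-th cell in SFC order.
#include <cstddef>
#include <vector>

std::vector<int> split_weighted(const std::vector<double>& weight, int n_parts)
{
    double total = 0.0;
    for (double w : weight) total += w;

    std::vector<int> part(weight.size());
    const double target = total / n_parts;   // ideal weight per chunk
    double acc = 0.0;
    int p = 0;
    for (std::size_t i = 0; i < weight.size(); ++i)
    {
        part[i] = p;
        acc += weight[i];
        if (acc >= target * (p + 1) && p < n_parts - 1)
            ++p;                             // start the next chunk
    }
    return part;
}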

November 5, 2020, 07:59 | #8
Sayan Bhattacharjee (aerosayan), Senior Member
Quote:
Originally Posted by flotus1 View Post
BTW: if your colored graph reflects the numbering of your grid, there is a lot of performance up for grabs by renumbering the elements. The optimization goal is increased locality, i.e. elements that are direct spatial neighbors should also be as close as possible in their memory index. That increases cache reuse. Doing that first would also result in a much better domain decomposition when simply dividing the mesh along the memory index.

https://imgur.com/KuEVYlA

I organized the cells using a space-filling curve.

I also placed the boundary cells at the front of the cell list. The boundary cells still follow the order prescribed by the space-filling curve, so they will be accessed in the same efficient order, which improves cache performance.

Is there anything else that can be done to improve performance?

November 5, 2020, 09:22 | #9
Alex (flotus1), Super Moderator
Well... you can make a whole career out of optimizing things further.

If we limit ourselves to grid renumbering:
You probably already do this, but just to make sure I didn't misunderstand your last post: each domain should have its share of boundary cells. If they all end up in the first domain, that would be a serious load-balancing problem and would introduce a huge communication overhead.
For the order of cells, z-order is a great starting point, mostly because it's easy to understand and implement. But from my own testing, different space-filling curves can give slightly better overall performance. I am currently using a Hilbert curve for our LBM solvers. Which one works best depends on the exact algorithm you are using, and most likely on the hardware used to run the code.
And for general-purpose domain decomposition, tools like Metis do a better job at minimizing the surfaces between different domains, i.e. minimizing communication between cores. For very simple geometries you can come up with a hand-crafted domain decomposition scheme that might work slightly better, but you will quickly run out of feasible ideas once the geometry gets more complex.

November 26, 2020, 18:00 | #10
Sayan Bhattacharjee (aerosayan), Senior Member
I made some upgrades: I moved the boundary cells to the end of the list and, based on the Morton ordering, renumbered the edges. This allowed me to find the total number of distinct faces/edges, set up the array that stores the flux value at these faces/edges, and figure out which edges make up each triangular cell.

Using the Octave code shown below, I loaded the 3 integer edge ids that make up each cell and analyzed, for each cell, how far apart the 3 flux values will be in the flux array.



Code:
clc;
clear all;

% Each row of the file holds the 3 edge ids of one triangular cell.
y = load("optimized_edges.grid");

e1 = y(:,1);
e2 = y(:,2);
e3 = y(:,3);

% Largest edge id, used to normalize the plot.
emax = max(max(y))

% Plot |e1-e2-e3|/emax as a rough per-cell measure of how far apart
% the cell's edge ids (and hence its flux values) sit in memory.
plot(abs(e1-e2-e3)/emax)

grid on;
title('Edge Distribution');
xlabel('Cell Index');
ylabel('abs(e1-e2-e3)/emax');
The 5 images attached show the results.


The linear upward progression of the line is expected; however, I'm interested in the spikes shown in the graph. These spikes show how far apart in memory the code has to jump in order to access all 3 flux values of a cell.

There are 234 boundary cells at the end of the list, so the large spikes there are expected. We can also see that, due to the initial Morton ordering of the cells, the spike heights rise and fall periodically, and there are regions where the curve is quite linear.


If possible, I would like to remove the spikes by another edge renumbering optimization.

Is there anything that can be done?

Thanks
- syn
Attached Images
File Type: png edge-distribution-1.png (24.7 KB, 8 views)
File Type: png edge-distribution-2.png (26.2 KB, 7 views)
File Type: png edge-distribution-3.png (26.8 KB, 7 views)
File Type: png edge-distribution-4.png (30.4 KB, 6 views)
File Type: png edge-distribution-5.png (30.1 KB, 4 views)

November 26, 2020, 18:11 | #11
Sayan Bhattacharjee (aerosayan), Senior Member
Mistakes were made.
I added the wrong figures in the previous post.
These are the correct images:
Attached Images
File Type: png edge-distribution-1.png (22.9 KB, 6 views)
File Type: png edge-distribution-2.png (26.2 KB, 5 views)
File Type: png edge-distribution-3.png (27.4 KB, 4 views)
File Type: png edge-distribution-4.png (28.0 KB, 3 views)
File Type: png edge-distribution-5.png (26.3 KB, 3 views)

November 26, 2020, 18:23 | #12
Paolo Lampitella (sbaffini), Senior Member
Besides the fact that I don't see why the quantity you plot should have any particular meaning (instead of, say, the maximum among e1-e2, e1-e3 and e2-e3), let me clarify that the way fluxes are computed in FV is actually the other way around.

That is, you loop over the edges/faces, grab the variables from the neighboring cells, compute the fluxes (and store them if you like), then distribute the fluxes back to the neighboring cells.

So, if your cells are already ordered, the only problem you might have is that you access the faces randomly. A first attempt, then, could be to renumber the faces not independently but by first looping over the ordered cells and then over their faces; see the sketch below.

Again, a profiler can tell you where the cache misses are.
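
A minimal sketch of that cell-driven face/edge renumbering, assuming a hypothetical cell_edges connectivity array (3 edges per triangular cell):

Code:
// Renumber edges in the order the (already SFC-ordered) cells first touch
// them, so the edge/flux arrays are visited with better locality.
// cell_edges[c] holds the old edge ids of cell c; names are hypothetical.
#include <vector>

std::vector<int> renumber_edges_by_cell_order(
    const std::vector<std::vector<int>>& cell_edges, int n_edges)
{
    std::vector<int> old_to_new(n_edges, -1);   // -1 means "not visited yet"
    int next_id = 0;
    for (const auto& edges : cell_edges)        // cells in cache-friendly order
        for (int e : edges)
            if (old_to_new[e] < 0)
                old_to_new[e] = next_id++;      // first visit fixes the new id
    return old_to_new;
}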

November 26, 2020, 18:34 | #13
Sayan Bhattacharjee (aerosayan), Senior Member
Thanks. I will try that.

November 27, 2020, 14:59 | #14
Bernd (zx-81), Member
Sort your cells according to their Morton-encoded cell center coordinates (an efficiently computable space-filling curve code), then simply partition the sorted array.

Code:
#include <cstdint>

// Spread the lowest 10 bits of x so that there are two zero bits between
// consecutive bits (assumes coordinates quantized to the range 0..1023).
inline uint32_t morton32_split3_( uint32_t x )
{
    x &= 1023;
    x = (x | (x << 16)) & 0x030000FF;
    x = (x | (x << 8))  & 0x0300F00F;
    x = (x | (x << 4))  & 0x030C30C3;
    x = (x | (x << 2))  & 0x09249249;
    return x;
}

// Interleave the bits of the three quantized coordinates into one 30-bit key.
inline uint32_t morton32_encode_( uint32_t x, uint32_t y, uint32_t z )
{
    return morton32_split3_(x) | (morton32_split3_(y) << 1) | (morton32_split3_(z) << 2);
}




Quote:
Originally Posted by aerosayan View Post
I'm trying to parallelize my explicit compressible Euler FVM solver using OpenMP. [...] I would like to minimize the intersection boundary between the cells owned by the 2 threads, i.e. form 2 groups of cells that are closely packed rather than scattered as shown in the image. Is there any algorithm you can recommend for this?

Tags
openmp, parallel code, solver