|
How to group cells to distribute onto parallel processors/threads? |
|
October 31, 2020, 09:59 |
How to group cells to distribute onto parallel processors/threads?
|
#1 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
I'm trying to parallelize my explicit compressible Euler FVM solver using OpenMP.
I want to distribute the cells onto 2 threads. But before doing that, I want to group them into 2 distinct sets. This is my current result: https://imgur.com/a/GrU9jvy Cells of the same color are assigned to the same thread.

As can be seen, my current grouping algorithm needs improvement. At the moment I simply divide the cell list into 2 parts and hand each part to a thread, which produces the scattered distribution shown in the image. This is problematic for me: I would like to minimize the intersection boundary between the cells of the 2 threads, i.e. form 2 groups of cells that are closely packed together rather than scattered. Is there any algorithm you can recommend for this?
|
October 31, 2020, 11:02 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Domain decomposition is not the most straightforward load distribution mechanism to go along with OpenMP; it's usually coupled with MPI parallelization. Not saying that it can't be done, or doesn't exist in the wild, just that it is not the most common combination.

What you are searching for falls under the term graph partitioning. There are quite a few implementations available; Scotch and Metis are two of the more commonly used ones.

BTW: if your colored graph reflects the numbering of your grid, there is a lot of performance up for grabs by renumbering the elements. The optimization goal is increased locality, i.e. elements that are direct spatial neighbors should also be as close as possible in their memory index. That increases cache reuse. Doing that first would also result in a much better domain decomposition when simply dividing the mesh along the memory index.
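To make the graph-partitioning route concrete, here is a minimal sketch (not from this thread) of feeding a cell-adjacency graph in CSR form to METIS 5's METIS_PartGraphKway. The CSR arrays xadj/adjncy, and how they are built from the mesh, are assumptions about the solver's data structures.

Code:
/* Sketch only: partition a cell-adjacency graph into nparts domains
 * with METIS 5.x. Neighbors of cell i are assumed to be stored in
 * adjncy[xadj[i]] .. adjncy[xadj[i+1]-1] (CSR format). */
#include <metis.h>
#include <stdio.h>

int partition_cells(idx_t ncells, idx_t *xadj, idx_t *adjncy,
                    idx_t nparts, idx_t *part /* out: ncells entries */)
{
    idx_t ncon   = 1;   /* one balance constraint: cell count */
    idx_t objval = 0;   /* METIS returns the edge-cut here */

    int status = METIS_PartGraphKway(&ncells, &ncon, xadj, adjncy,
                                     NULL, NULL, NULL,        /* no weights */
                                     &nparts, NULL, NULL, NULL,
                                     &objval, part);
    if (status != METIS_OK)
        return -1;

    printf("edge cut (faces shared between domains): %ld\n", (long)objval);
    return 0;  /* part[i] now holds the domain id of cell i */
}

The edge-cut reported by METIS is exactly the quantity the original question asks to minimize: the number of faces shared between cells owned by different threads/processes.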
|
October 31, 2020, 11:38 |
|
#3 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Thanks for the reply. I'm new to solver development, so I don't know all of the terms very well yet. I'm trying to group the cells to improve performance by decreasing cache misses, and as you proposed, I will be renumbering the cells within each group to reduce them further. I also wanted to decompose the domain into multiple groups because I plan to implement MPI support in the future.

I was able to come up with a basic decomposition method: I simply sort the nodes based on their x-coordinate and then use a node_to_cell connectivity map to find the cells connected to each of the sorted nodes. I was able to create 4 and 8 groups of cells using this simple method, and it works pretty well for me. RESULTS: https://imgur.com/a/EBa7QBg
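For readers who want to reproduce this coordinate-sweep decomposition, a minimal sketch follows. It assumes the node_to_cell map is stored in CSR form (node_cell_ptr/node_cell_idx) and that the goal is ngroups groups of roughly equal size; the array names are illustrative, not the poster's actual code.

Code:
/* Sketch: assign cells to groups by sweeping the nodes in order of
 * increasing x-coordinate. Each not-yet-assigned cell touching the
 * current node joins the current group; we move on to the next group
 * once the current one holds roughly ncells/ngroups cells. */
#include <stdlib.h>

static const double *g_node_x;   /* node x-coordinates for the comparator */

static int cmp_by_x(const void *a, const void *b)
{
    double xa = g_node_x[*(const int *)a], xb = g_node_x[*(const int *)b];
    return (xa > xb) - (xa < xb);
}

void decompose_by_x(int nnodes, int ncells, int ngroups,
                    const double *node_x,
                    const int *node_cell_ptr,  /* CSR node_to_cell map */
                    const int *node_cell_idx,
                    int *cell_group)           /* out: group id of each cell */
{
    int *order = malloc(nnodes * sizeof *order);
    for (int n = 0; n < nnodes; ++n) order[n] = n;
    for (int c = 0; c < ncells; ++c) cell_group[c] = -1;

    g_node_x = node_x;
    qsort(order, nnodes, sizeof *order, cmp_by_x);   /* nodes sorted by x */

    int group = 0, in_group = 0, target = (ncells + ngroups - 1) / ngroups;
    for (int i = 0; i < nnodes; ++i) {
        int n = order[i];
        for (int k = node_cell_ptr[n]; k < node_cell_ptr[n + 1]; ++k) {
            int c = node_cell_idx[k];
            if (cell_group[c] != -1) continue;       /* already assigned */
            cell_group[c] = group;
            if (++in_group >= target && group < ngroups - 1) {
                ++group;
                in_group = 0;
            }
        }
    }
    free(order);
}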
October 31, 2020, 11:46 |
|
#4 |
Senior Member
|
Space-filling curves will work great for both of your tasks, the current one (shared-memory parallelization) and the future one (distributed parallelization).
Still, you have to understand in some depth how cache access works in shared memory in order to properly arrange the shared-memory loops (which is where shared-memory parallelization actually happens).
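As an illustration of where that parallelization sits, here is a minimal OpenMP sketch (placeholder names, not code from this thread): once the cells are renumbered so that each thread's group is a contiguous index range, a static schedule hands every thread one compact block of the per-cell update loop.

Code:
/* Sketch only: explicit per-cell update split into contiguous chunks.
 * With a cache-friendly cell ordering, each thread's chunk corresponds
 * to a spatially compact group of cells. */
#include <omp.h>

void update_cells(int ncells, double *u, const double *residual, double dt)
{
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < ncells; ++c)
        u[c] += dt * residual[c];   /* stand-in for the real explicit update */
}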
|
October 31, 2020, 16:44 |
|
#5 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Space-filling curves are another interesting approach, and they work to some extent. But they do a poor job at minimizing, let alone equalizing, communication volumes, and they have other limitations, e.g. once you have more than one load-balancing constraint apart from element count. They are great for ordering elements in a cache-friendly way, but not necessarily the best choice for domain decomposition.

It all depends on where you want to go with this. If the examples you have shown here are as complex as it will get, then domain decomposition based on a spatial coordinate is fine. If you want to go further, you will have to look for alternatives.
November 1, 2020, 07:25 |
|
#6 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
I took your and flotus1's advice and used z-order-curve-based cell renumbering for the mesh. RESULTS: https://imgur.com/a/gBQg20X In case someone else wants to use the same approach, I'm sharing the source. CODE: https://pastebin.com/AvzmJNJX EXPLANATION: https://archive.is/Th2xk
November 2, 2020, 09:03 |
|
#7 | |
Senior Member
|
Quote:
|
November 5, 2020, 07:59 |
|
#8 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
https://imgur.com/KuEVYlA I organized the cells using a space-filling curve and moved the boundary cells to the front of the cell list. The boundary cells still follow the order prescribed by the space-filling curve, so they will be accessed in the same efficient order, which should help cache performance. Is there anything else that can be done to improve performance?
November 5, 2020, 09:22 |
|
#9 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Well... you can make a whole career out of optimizing things further.

If we limit ourselves to grid renumbering: you probably already do this, but just to make sure I didn't misunderstand your last post, each domain should have its share of boundary cells. If they all end up in the first domain, that would be a serious load balancing problem and introduce a huge communication overhead.

For the order of cells, z-order is a great starting point, mostly because it's easy to understand and implement. But from my own testing, different space-filling curves can give slightly better performance overall; I am currently using a Hilbert curve for our LBM solvers. Which one works best definitely depends on the exact algorithm you are using, and most likely on the hardware used to run the code.

And for general-purpose domain decomposition, tools like Metis do a better job at minimizing the surfaces between different domains, i.e. minimizing communication between cores. For very simple geometries you can come up with a hand-crafted domain decomposition scheme that might work slightly better, but you will quickly run out of feasible ideas once the geometry gets more complex.
|
November 26, 2020, 18:00 |
|
#10 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
I made some upgrades: I moved the boundary cells to the end of the list and, based on the Morton ordering, renumbered the edges. This allowed me to find the total number of distinct faces/edges, set up the array that stores the flux value at these faces/edges, and figure out which edges make up each triangular cell.
Using the Octave code shown below, I loaded the 3 integer edge ids which make up each cell and analyzed, for each cell, how far apart the 3 flux values will be in the flux array. Code:
clc; clear all;
y = load("optimized_edges.grid");
e1 = y(:,1); e2 = y(:,2); e3 = y(:,3);
emax = max(max(y))
plot(abs(e1-e2-e3)/emax)
grid on;
title('Edge Distribution');
xlabel('Cell Index');
ylabel('abs(e1-e2-e3)/emax');

The linear upward progression of the line is expected; what interests me are the spikes in the graph. The spikes show how far apart in memory the code has to jump in order to access all 3 flux values of a cell. There are 234 boundary cells at the end of the list, so the large spikes at the end are expected. We can also see that, due to the initial Morton ordering of the cells, the spike heights rise and fall periodically, and there are some regions where the curve is quite linear. If possible, I would like to remove the spikes with another edge renumbering optimization. Is there anything that can be done? Thanks - syn
|
November 26, 2020, 18:11 |
|
#11 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Mistakes were made.
I added the wrong figures in the previous post. These are the correct images: |
|
November 26, 2020, 18:23 |
|
#12 |
Senior Member
|
Besides the fact that I don't see why what you plot should have any particular meaning (instead of, say, the max between any of e1-e2, e1-e3, e2-e3), let me clarify that the way you do the fluxes in FV is actually the other way around.
That is, you loop on edges/faces, grab variables from the neighbor cells, compute the fluxes (and store them if you like), and distribute the fluxes back to the neighbor cells. So, if your cells are already ordered, the only problem you might have is that you access the faces randomly. A first attempt, then, instead of renumbering the faces independently, could be to do it by first looping on the ordered cells and then on their faces. Again, a profiler can tell you where the cache misses are.
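A minimal sketch of that cell-driven face renumbering (the connectivity array names are assumptions, not code from this thread): walk the already-ordered cells and give each face a new index the first time it is encountered.

Code:
/* Sketch only: faces get new ids in the order they are first visited
 * while walking the cache-ordered cells, so a later loop over faces
 * touches neighbor-cell data in nearly the same order as the cell loop. */
void renumber_faces_by_cell_order(int ncells, int nfaces,
                                  const int *cell_face, /* [3*ncells] old face ids per triangle */
                                  int *old_to_new)      /* out: size nfaces */
{
    for (int f = 0; f < nfaces; ++f)
        old_to_new[f] = -1;

    int next = 0;
    for (int c = 0; c < ncells; ++c) {
        for (int k = 0; k < 3; ++k) {
            int f = cell_face[3 * c + k];
            if (old_to_new[f] == -1)      /* first visit: assign the next index */
                old_to_new[f] = next++;
        }
    }
    /* old_to_new[] is then used to permute every face-based array
     * (face-to-cell connectivity, flux storage, boundary flags, ...). */
}

Whether this actually reduces cache misses is exactly the kind of thing the profiler mentioned above should confirm.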
|
November 26, 2020, 18:34 |
|
#13 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Thanks. I will try that.
|
|
November 27, 2020, 14:59 |
|
#14 | |
Member
Bernd
Join Date: Jul 2012
Posts: 42
Rep Power: 14 |
Sort your cells according to their Morton-encoded cell-center coordinates (an efficiently computable space-filling curve), then simply partition the sorted array.

Code:
inline uint32_t morton32_split3_( uint32_t x )
{
    x &= 1023;
    x = (x | (x << 16)) & 0x030000FF;
    x = (x | (x <<  8)) & 0x0300F00F;
    x = (x | (x <<  4)) & 0x030C30C3;
    x = (x | (x <<  2)) & 0x09249249;
    return x;
}

inline uint32_t morton32_encode_( uint32_t x, uint32_t y, uint32_t z )
{
    return morton32_split3_(x) | (morton32_split3_(y) << 1) | (morton32_split3_(z) << 2);
}
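A hedged sketch of the "sort, then partition" step on top of the encoder above. The 10-bit quantization of the cell-center coordinates, the bounding-box arguments, and the array names are assumptions; adapt them to the actual mesh data structures.

Code:
/* Sketch only: build a Morton key per cell from its quantized center
 * (morton32_split3_ keeps 10 bits per axis, hence the scaling to
 * [0, 1023]), sort the cell indices by key, then cut the sorted list
 * into nthreads contiguous chunks. Uses morton32_encode_ from above.
 * Assumes the bounding box has nonzero extent in every direction. */
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t key; int cell; } keyed_cell;

static int cmp_key(const void *a, const void *b)
{
    uint32_t ka = ((const keyed_cell *)a)->key;
    uint32_t kb = ((const keyed_cell *)b)->key;
    return (ka > kb) - (ka < kb);
}

void morton_order_and_partition(int ncells, const double *cx, const double *cy,
                                const double *cz, const double lo[3],
                                const double hi[3], int nthreads,
                                int *new_order,    /* out: cell ids in Morton order */
                                int *cell_thread)  /* out: owning thread of each cell */
{
    keyed_cell *keys = malloc(ncells * sizeof *keys);
    for (int c = 0; c < ncells; ++c) {
        uint32_t ix = (uint32_t)(1023.0 * (cx[c] - lo[0]) / (hi[0] - lo[0]));
        uint32_t iy = (uint32_t)(1023.0 * (cy[c] - lo[1]) / (hi[1] - lo[1]));
        uint32_t iz = (uint32_t)(1023.0 * (cz[c] - lo[2]) / (hi[2] - lo[2]));
        keys[c].key  = morton32_encode_(ix, iy, iz);
        keys[c].cell = c;
    }
    qsort(keys, ncells, sizeof *keys, cmp_key);

    int chunk = (ncells + nthreads - 1) / nthreads;
    for (int i = 0; i < ncells; ++i) {
        new_order[i] = keys[i].cell;            /* i-th cell in the new ordering */
        cell_thread[keys[i].cell] = i / chunk;  /* contiguous chunks -> threads */
    }
    free(keys);
}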
|
Tags |
openmp, parallel code, solver |