|
How to group cells to distribute onto parallel processors/threads? |
|
October 31, 2020, 09:59 |
How to group cells to distribute onto parallel processors/threads?
|
#1 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
I'm trying to parallelize my explicit compressible Euler FVM solver using OpenMP.
I want to distribute the cells onto 2 threads. But before doing that, I want to group them into 2 distinct sets. This is my current result: https://imgur.com/a/GrU9jvy Cells of the same color are assigned to the same thread.

As can be seen, my current grouping algorithm needs improvement. At the moment I simply divide the cell list into 2 parts and hand each part to a thread, which produces the scattered distribution shown in the image. This is problematic for me: I would like to minimize the intersection boundary between the cells of the 2 threads, i.e. form 2 groups of cells that are closely packed together rather than scattered. Is there any algorithm you can recommend for this?
|
October 31, 2020, 11:02 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Domain decomposition is not the most straightforward load distribution mechanism to go along with OpenMP; it's usually coupled with MPI parallelization. Not saying that it can't be done, or doesn't exist in the wild, just that it is not the most common combination.

What you are searching for falls under the term graph partitioning. There are quite a few implementations available; Scotch and Metis are two of the more commonly used ones.

BTW: if your colored graph reflects the numbering of your grid, there is a lot of performance up for grabs by renumbering the elements. The optimization goal is increased locality, i.e. elements that are direct spatial neighbors should also be as close as possible in their memory index. That increases cache reuse. Doing that first would also result in a much better domain decomposition when simply dividing the mesh along the memory index.
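To make the graph-partitioning route concrete, here is a minimal sketch (not from this thread) of feeding a cell-adjacency graph in CSR form to METIS 5's METIS_PartGraphKway. The CSR arrays xadj/adjncy, and how they are built from the mesh, are assumptions about the solver's data structures.

Code:
/* Sketch only: partition a cell-adjacency graph into nparts domains
 * with METIS 5.x. Neighbors of cell i are assumed to be stored in
 * adjncy[xadj[i]] .. adjncy[xadj[i+1]-1] (CSR format). */
#include <metis.h>
#include <stdio.h>

int partition_cells(idx_t ncells, idx_t *xadj, idx_t *adjncy,
                    idx_t nparts, idx_t *part /* out: ncells entries */)
{
    idx_t ncon   = 1;   /* one balance constraint: cell count */
    idx_t objval = 0;   /* METIS returns the edge-cut here */

    int status = METIS_PartGraphKway(&ncells, &ncon, xadj, adjncy,
                                     NULL, NULL, NULL,        /* no weights */
                                     &nparts, NULL, NULL, NULL,
                                     &objval, part);
    if (status != METIS_OK)
        return -1;

    printf("edge cut (faces shared between domains): %ld\n", (long)objval);
    return 0;  /* part[i] now holds the domain id of cell i */
}

The edge-cut reported by METIS is exactly the quantity the original question asks to minimize: the number of faces shared between cells owned by different threads/processes.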
|
October 31, 2020, 11:38 |
|
#3 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Thanks for the reply. I'm new to solver development, so I don't know all of the terms very well yet. I'm trying to group the cells to improve performance by decreasing cache misses, and as you proposed, I will be renumbering the cells within each group to reduce them further. I also wanted to decompose the domain into multiple groups because I plan to implement MPI support in the future.

I was able to come up with a basic decomposition method: I simply sort the nodes based on their x-coordinate and then use a node_to_cell connectivity map to find the cells connected to each of the sorted nodes. I was able to create 4 and 8 groups of cells using this simple method, and it works pretty well for me. RESULTS: https://imgur.com/a/EBa7QBg
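For readers who want to reproduce this coordinate-sweep decomposition, a minimal sketch follows. It assumes the node_to_cell map is stored in CSR form (node_cell_ptr/node_cell_idx) and that the goal is ngroups groups of roughly equal size; the array names are illustrative, not the poster's actual code.

Code:
/* Sketch: assign cells to groups by sweeping the nodes in order of
 * increasing x-coordinate. Each not-yet-assigned cell touching the
 * current node joins the current group; we move on to the next group
 * once the current one holds roughly ncells/ngroups cells. */
#include <stdlib.h>

static const double *g_node_x;   /* node x-coordinates for the comparator */

static int cmp_by_x(const void *a, const void *b)
{
    double xa = g_node_x[*(const int *)a], xb = g_node_x[*(const int *)b];
    return (xa > xb) - (xa < xb);
}

void decompose_by_x(int nnodes, int ncells, int ngroups,
                    const double *node_x,
                    const int *node_cell_ptr,  /* CSR node_to_cell map */
                    const int *node_cell_idx,
                    int *cell_group)           /* out: group id of each cell */
{
    int *order = malloc(nnodes * sizeof *order);
    for (int n = 0; n < nnodes; ++n) order[n] = n;
    for (int c = 0; c < ncells; ++c) cell_group[c] = -1;

    g_node_x = node_x;
    qsort(order, nnodes, sizeof *order, cmp_by_x);   /* nodes sorted by x */

    int group = 0, in_group = 0, target = (ncells + ngroups - 1) / ngroups;
    for (int i = 0; i < nnodes; ++i) {
        int n = order[i];
        for (int k = node_cell_ptr[n]; k < node_cell_ptr[n + 1]; ++k) {
            int c = node_cell_idx[k];
            if (cell_group[c] != -1) continue;       /* already assigned */
            cell_group[c] = group;
            if (++in_group >= target && group < ngroups - 1) {
                ++group;
                in_group = 0;
            }
        }
    }
    free(order);
}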
October 31, 2020, 11:46 |
|
#4 |
Senior Member
|
Space-filling curves will work great for both of your tasks, the current one (shared-memory parallelization) and the future one (distributed parallelization).
Still, you have to understand in some depth how cache access works in shared memory in order to properly arrange the shared-memory loops (which is where shared-memory parallelization actually happens).
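As an illustration of where that parallelization sits, here is a minimal OpenMP sketch (placeholder names, not code from this thread): once the cells are renumbered so that each thread's group is a contiguous index range, a static schedule hands every thread one compact block of the per-cell update loop.

Code:
/* Sketch only: explicit per-cell update split into contiguous chunks.
 * With a cache-friendly cell ordering, each thread's chunk corresponds
 * to a spatially compact group of cells. */
#include <omp.h>

void update_cells(int ncells, double *u, const double *residual, double dt)
{
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < ncells; ++c)
        u[c] += dt * residual[c];   /* stand-in for the real explicit update */
}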
|
October 31, 2020, 16:44 |
|
#5 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Space-filling curves are another interesting approach, and they work to some extent. But they do a poor job at minimizing, let alone equalizing, communication volumes, and they have other limitations, e.g. once you have more than one load-balancing constraint apart from element count. They are great for ordering elements in a cache-friendly way, but not necessarily the best choice for domain decomposition.

It all depends on where you want to go with this. If the examples you have shown here are as complex as it will get, then domain decomposition based on a spatial coordinate is fine. If you want to go further, you will have to look for alternatives.
November 1, 2020, 07:25 |
|
#6 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
I took your and flotus1's advice and used z-order-curve-based cell renumbering for the mesh. RESULTS: https://imgur.com/a/gBQg20X In case someone else wants to use the same approach, I'm sharing the source. CODE: https://pastebin.com/AvzmJNJX EXPLANATION: https://archive.is/Th2xk
November 2, 2020, 09:03 |
|
#7 | |
Senior Member
|
Quote:
|
November 5, 2020, 07:59 |
|
#8 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
https://imgur.com/KuEVYlA I organized the cells using a space-filling curve and moved the boundary cells to the front of the cell list. The boundary cells still follow the order prescribed by the space-filling curve, so they will be accessed in the same efficient order, which should help cache performance. Is there anything else that can be done to improve performance?
November 5, 2020, 09:22 |
|
#9 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Well... you can make a whole career out of optimizing things further.

If we limit ourselves to grid renumbering: you probably already do this, but just to make sure I didn't misunderstand your last post, each domain should have its share of boundary cells. If they all end up in the first domain, that would be a serious load balancing problem and introduce a huge communication overhead.

For the order of cells, z-order is a great starting point, mostly because it's easy to understand and implement. But from my own testing, different space-filling curves can give slightly better performance overall; I am currently using a Hilbert curve for our LBM solvers. Which one works best definitely depends on the exact algorithm you are using, and most likely on the hardware used to run the code.

And for general-purpose domain decomposition, tools like Metis do a better job at minimizing the surfaces between different domains, i.e. minimizing communication between cores. For very simple geometries you can come up with a hand-crafted domain decomposition scheme that might work slightly better, but you will quickly run out of feasible ideas once the geometry gets more complex.
|
November 26, 2020, 18:00 |
|
#10 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
I made some upgrades: I moved the boundary cells to the end of the list and, based on the Morton ordering, renumbered the edges. This allowed me to find the total number of distinct faces/edges, set up the array that stores the flux value at these faces/edges, and figure out which edges make up each triangular cell.
Using the Octave code shown below, I loaded the 3 integer edge ids which make up each cell and analyzed, for each cell, how far apart the 3 flux values will be in the flux array. Code:
clc; clear all;
y = load("optimized_edges.grid");
e1 = y(:,1); e2 = y(:,2); e3 = y(:,3);
emax = max(max(y))
plot(abs(e1-e2-e3)/emax)
grid on;
title('Edge Distribution');
xlabel('Cell Index');
ylabel('abs(e1-e2-e3)/emax');

The linear upward progression of the line is expected; what interests me are the spikes in the graph. The spikes show how far apart in memory the code has to jump in order to access all 3 flux values of a cell. There are 234 boundary cells at the end of the list, so the large spikes at the end are expected. We can also see that, due to the initial Morton ordering of the cells, the spike heights rise and fall periodically, and there are some regions where the curve is quite linear. If possible, I would like to remove the spikes with another edge renumbering optimization. Is there anything that can be done? Thanks - syn
|
November 26, 2020, 18:11 |
|
#11 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Mistakes were made.
I added the wrong figures in the previous post. These are the correct images: |
|
November 26, 2020, 18:23 |
|
#12 |
Senior Member
|
Besides the fact that I don't see why what you plot should have any particular meaning (instead of, say, the max between any of e1-e2, e1-e3, e2-e3), let me clarify that the way you do the fluxes in FV is actually the other way around.
That is, you loop on edges/faces, grab variables from the neighbor cells, compute the fluxes (and store them if you like), and distribute the fluxes back to the neighbor cells. So, if your cells are already ordered, the only problem you might have is that you access the faces randomly. A first attempt, then, instead of renumbering the faces independently, could be to do it by first looping on the ordered cells and then on their faces. Again, a profiler can tell you where the cache misses are.
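A minimal sketch of that cell-driven face renumbering (the connectivity array names are assumptions, not code from this thread): walk the already-ordered cells and give each face a new index the first time it is encountered.

Code:
/* Sketch only: faces get new ids in the order they are first visited
 * while walking the cache-ordered cells, so a later loop over faces
 * touches neighbor-cell data in nearly the same order as the cell loop. */
void renumber_faces_by_cell_order(int ncells, int nfaces,
                                  const int *cell_face, /* [3*ncells] old face ids per triangle */
                                  int *old_to_new)      /* out: size nfaces */
{
    for (int f = 0; f < nfaces; ++f)
        old_to_new[f] = -1;

    int next = 0;
    for (int c = 0; c < ncells; ++c) {
        for (int k = 0; k < 3; ++k) {
            int f = cell_face[3 * c + k];
            if (old_to_new[f] == -1)      /* first visit: assign the next index */
                old_to_new[f] = next++;
        }
    }
    /* old_to_new[] is then used to permute every face-based array
     * (face-to-cell connectivity, flux storage, boundary flags, ...). */
}

Whether this actually reduces cache misses is exactly the kind of thing the profiler mentioned above should confirm.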
|
November 26, 2020, 18:34 |
|
#13 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Thanks. I will try that.
|
|
November 27, 2020, 14:59 |
|
#14 | |
Member
Bernd
Join Date: Jul 2012
Posts: 42
Rep Power: 14 |
Sort your cells according to their Morton-encoded cell-center coordinates (an efficiently computable space-filling curve), then simply partition the sorted array.

Code:
inline uint32_t morton32_split3_( uint32_t x )
{
    x &= 1023;
    x = (x | (x << 16)) & 0x030000FF;
    x = (x | (x <<  8)) & 0x0300F00F;
    x = (x | (x <<  4)) & 0x030C30C3;
    x = (x | (x <<  2)) & 0x09249249;
    return x;
}

inline uint32_t morton32_encode_( uint32_t x, uint32_t y, uint32_t z )
{
    return morton32_split3_(x) | (morton32_split3_(y) << 1) | (morton32_split3_(z) << 2);
}
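A hedged sketch of the "sort, then partition" step on top of the encoder above. The 10-bit quantization of the cell-center coordinates, the bounding-box arguments, and the array names are assumptions; adapt them to the actual mesh data structures.

Code:
/* Sketch only: build a Morton key per cell from its quantized center
 * (morton32_split3_ keeps 10 bits per axis, hence the scaling to
 * [0, 1023]), sort the cell indices by key, then cut the sorted list
 * into nthreads contiguous chunks. Uses morton32_encode_ from above.
 * Assumes the bounding box has nonzero extent in every direction. */
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t key; int cell; } keyed_cell;

static int cmp_key(const void *a, const void *b)
{
    uint32_t ka = ((const keyed_cell *)a)->key;
    uint32_t kb = ((const keyed_cell *)b)->key;
    return (ka > kb) - (ka < kb);
}

void morton_order_and_partition(int ncells, const double *cx, const double *cy,
                                const double *cz, const double lo[3],
                                const double hi[3], int nthreads,
                                int *new_order,    /* out: cell ids in Morton order */
                                int *cell_thread)  /* out: owning thread of each cell */
{
    keyed_cell *keys = malloc(ncells * sizeof *keys);
    for (int c = 0; c < ncells; ++c) {
        uint32_t ix = (uint32_t)(1023.0 * (cx[c] - lo[0]) / (hi[0] - lo[0]));
        uint32_t iy = (uint32_t)(1023.0 * (cy[c] - lo[1]) / (hi[1] - lo[1]));
        uint32_t iz = (uint32_t)(1023.0 * (cz[c] - lo[2]) / (hi[2] - lo[2]));
        keys[c].key  = morton32_encode_(ix, iy, iz);
        keys[c].cell = c;
    }
    qsort(keys, ncells, sizeof *keys, cmp_key);

    int chunk = (ncells + nthreads - 1) / nthreads;
    for (int i = 0; i < ncells; ++i) {
        new_order[i] = keys[i].cell;            /* i-th cell in the new ordering */
        cell_thread[keys[i].cell] = i / chunk;  /* contiguous chunks -> threads */
    }
    free(keys);
}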
|
Tags |
openmp, parallel code, solver |