July 16, 2020, 05:42 |
How does parallelisation work?
|
#1 |
New Member
Martin
Join Date: Nov 2016
Posts: 6
Rep Power: 10 |
Hi everyone,
I am currently running parallelized calculation of a bi-periodic channel flow. The channel is divided into 4 parts which are associated with a processor. For instance, if we have 4 processors distributed in as follow: ______ | 1 | 3 | --------- -> flow direction | 2 | 4| ______ (___ represents the walls of the channel, | and --- represent the frontier of the domain attributed to each processor) I would have like to know how does and when information is transfered between processors. How can processors 3 and 4 work in parallel if they do not have the flow characteristics resulting from processors 1 and 2's calculation ? The same question is valid from the frontier between processor 1 and processor 2. There should be a continuous interaction between all processors but I don't understand how it works. Can someone explain it to me ? Thank you very much, Martin |
|
July 16, 2020, 08:28 |
|
#2 |
Member
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7 |
Without knowing the numerical method one cannot say. Moreover, are you talking about 4 cores on the same chip or four separate multicore chips?
In general, whatever the method, if the flow is incompressible there must be a global exchange of information once per time step. -- |
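To make that "global exchange" concrete, here is a minimal MPI sketch (my own illustration, not from this thread; the residual values are dummies): each rank sums its local squared residuals, and a single MPI_Allreduce gives every rank the global norm, which is the kind of collective step an incompressible solver needs at least once per time step.

```c
/* Minimal sketch of a global reduction, e.g. for a residual norm.
 * Build with: mpicc reduce.c -o reduce -lm */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical local sum of squared residuals over this rank's cells. */
    double local_res2 = 1.0e-6 * (rank + 1);

    /* The global exchange: every rank receives the sum over all ranks. */
    double global_res2;
    MPI_Allreduce(&local_res2, &global_res2, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global residual norm = %e\n", sqrt(global_res2));

    MPI_Finalize();
    return 0;
}
```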
|
July 16, 2020, 09:06 |
|
#3 |
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,754
Rep Power: 66 |
The | and --- are called inter-processor boundary faces, and they are tagged as such in the decomposed mesh. Eventually what you need in FVM is the face values (more correctly, the face fluxes) on these shared inter-processor boundary faces. The approach for determining the face fluxes is defined via your gradient interpolation scheme, which generally requires cell values on either side of the inter-processor faces.
In modern parallelised codes (pretty much anything that runs on MPI), the values from cells at neighboring processors are streamed to one another. That is, the left side of | sends its cell values to the right side and vice versa. It's straightforward to stream cell values of the adjacent cells (i.e. 1 layer deep). What is not trivial is how to stream cell values multiple layers deep, and that's why your discretization schemes at inter-processor boundaries are usually limited in many commercial codes. |
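As a minimal sketch of that one-layer exchange (my own illustration, not the poster's code, assuming a 1-D decomposition with one ghost cell per side): each rank swaps its boundary cell values with its left and right neighbors before computing the fluxes at the shared faces.

```c
/* Minimal sketch of a one-layer halo (ghost cell) exchange.
 * Build with: mpicc halo.c -o halo */
#include <mpi.h>
#include <stdio.h>

#define N 8  /* interior cells per rank (hypothetical size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[N+1] are ghost cells; u[1..N] are owned cells. */
    double u[N + 2];
    for (int i = 0; i < N + 2; i++)
        u[i] = (double)rank;  /* dummy initial field */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my last owned cell right while receiving the left neighbor's
     * last cell into my left ghost; then the mirror image.  At a
     * physical boundary (MPI_PROC_NULL) the transfer is skipped and the
     * ghost keeps its initial value. */
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                 &u[N + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Ghost cells now hold the neighbors' boundary values, so face
     * fluxes at the inter-processor faces can be computed locally. */
    printf("rank %d: ghosts = %g, %g\n", rank, u[0], u[N + 1]);

    MPI_Finalize();
    return 0;
}
```

Run with, e.g., mpirun -np 4 ./halo; the MPI_PROC_NULL neighbor makes the ranks touching the physical walls skip the exchange automatically.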
|
July 19, 2020, 07:00 |
|
#4 |
Senior Member
|
The idea is that each processor owns not only its own cells, as depicted by you, but also some from the neighboring processors. These ghost layers can be one or multiple cells deep; given the tradeoff between memory consumption and the amount of data exchanged, typically just a single layer is used.
This sounds more difficult than it is: you actually just have a larger-than-expected mesh on each processor and keep track of which part is effectively owned by the processor and which part actually belongs to the neighbors. At the start of each iteration, in the exact same way as you would need to initialize your variables on a single grid, you perform the parallel exchanges between neighboring processors. Once that is done, you can practically treat the computation on each processor as if it were serial. There are just a couple of caveats:

1) If your algorithm needs cell gradients, it is typically better to exchange them as well, instead of computing them from the exchanged values. So after you compute them, you exchange them too. If you need iterations to compute them (as required by some gradient computation methods), you exchange them after each iteration.

2) If you need to solve a linear system, say, because you are using an implicit method, the parallelization is needed there as well, but you need to see it as part of the linear system solver. In practice, if you use, say, SOR, the idea is that you exchange the variables solved for in the linear system (as opposed to the general variables used in the code) after each linear iteration. So you effectively work Jacobi-like between processors and SOR-like within each processor. That's typically a good compromise (see the sketch below). Not an expert here, but Krylov methods should then just need some global reductions to work on top of such a SOR-like preconditioner. |
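A minimal sketch of caveat 2 as I read it (my own illustration, not the poster's code, assuming a 1-D Laplacian with zero Dirichlet ends and a fixed iteration count; omega and NIT are illustrative choices): the ghost values of the solution vector are refreshed once per linear iteration, so the coupling across processors is Jacobi-like, while the in-place sweep over the owned unknowns is SOR-like.

```c
/* Minimal sketch of a Jacobi-between-ranks / SOR-within-rank solve
 * for the tridiagonal system  -x[i-1] + 2 x[i] - x[i+1] = b[i].
 * Build with: mpicc sor.c -o sor */
#include <mpi.h>
#include <stdio.h>

#define N 16      /* owned unknowns per rank */
#define NIT 200   /* fixed number of linear iterations */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    double x[N + 2] = {0.0};          /* solution plus 2 ghosts      */
    double b[N + 2] = {0.0};          /* dummy right-hand side       */
    for (int i = 1; i <= N; i++) b[i] = 1.0;
    const double omega = 1.5;         /* SOR relaxation factor       */

    for (int it = 0; it < NIT; it++) {
        /* Jacobi-like step across ranks: refresh ghosts once per
         * linear iteration, then freeze them during the sweep. */
        MPI_Sendrecv(&x[N], 1, MPI_DOUBLE, right, 0,
                     &x[0], 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&x[1],     1, MPI_DOUBLE, left,  1,
                     &x[N + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* SOR-like sweep inside the rank: fresh local values,
         * frozen ghost values at the inter-processor boundaries. */
        for (int i = 1; i <= N; i++) {
            double x_gs = 0.5 * (b[i] + x[i - 1] + x[i + 1]);
            x[i] += omega * (x_gs - x[i]);
        }
    }

    if (rank == 0) printf("x[1] = %g after %d iterations\n", x[1], NIT);
    MPI_Finalize();
    return 0;
}
```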
|
Tags |
information transfer, parallel calculation, processors |