|
Parallel processing of OpenFOAM cases on multicore processor??? |
|
November 20, 2013, 21:06

#21
New Member
Join Date: Dec 2012
Posts: 19
Rep Power: 13
I'm not sure why you are seeing a longer computation time, but I have a guess:
your longer run time could be due to a poor choice of the number of decomposition subdomains. When the domain is decomposed, the communication between the processes during the parallel run also takes time, which adds to the overall cost. In your case, if you decompose the domain into 3 or 5 parts instead of 4, you should see a different run time, because the communication overhead between the processes will decrease or increase. It is not always more efficient to split the domain into more parts.
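For reference, the number of subdomains is set in system/decomposeParDict; a minimal sketch of the relevant entries (the values here are only an example, adjust them to your case):
Code:
numberOfSubdomains 4;      // try e.g. 3, 4 and 5 and compare the run times

method          simple;

simpleCoeffs
{
    n           (2 2 1);   // the product of these must equal numberOfSubdomains
    delta       0.001;
}
After changing it, re-run decomposePar and then launch the solver with mpirun -np <N> <solver> -parallel.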

May 30, 2015, 04:15

#22
Member
Ali Shamooni
Join Date: Oct 2010
Posts: 44
Rep Power: 16
Quote:
Dear Edmund and Bruno, it seems that the Open MPI rank file cannot address hardware threads; I mean that when you have cores with HT enabled, a rankfile can only list the physical processors. Is there any solution?

Regards,
Ali
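For context, the rankfile format in question looks roughly like this (the host name and slot numbers are just placeholders), where, as far as I understand, the slot=socket:core entries refer to physical cores:
Code:
rank 0=node1 slot=0:0
rank 1=node1 slot=0:1
rank 2=node1 slot=1:0
rank 3=node1 slot=1:1
It is passed to mpirun with the --rankfile option.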

May 30, 2015, 09:20

#23
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Rep Power: 128
Quote:
Beyond that, a very quick search led me to this answer: http://stackoverflow.com/a/11761943
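In case it is useful: with more recent Open MPI versions, an alternative to listing hardware threads in a rankfile is to let the launcher treat them as slots and bind to them directly. A sketch (the exact flags depend on the Open MPI version):
Code:
mpirun --use-hwthread-cpus --bind-to hwthread -np 8 <solver> -parallel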

May 31, 2015, 08:35

#24
Member
Ali Shamooni
Join Date: Oct 2010
Posts: 44
Rep Power: 16
Quote:
Thanks for the quick response, it was helpful. I know that the maximum speedup from HT would be around 10-30% in some cases, when some processors become idle, e.g. in combustion problems; I refer you to the paper "An Empirical Study of Hyper-Threading in High Performance Computing Clusters".

OK, let's forget HT for the moment. I have another question: is there any report of OpenFOAM scalability above 32 processors, like "https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje", but without InfiniBand communication, i.e. with Ethernet communication among the nodes? The question may seem weird, but let me describe it in more detail. I'm not a pro in computer science, so excuse me for any mistakes.

We have 3 Supermicro servers, each with 2 Intel Xeon E5-2690 CPUs (2*10 cores). I connected them via Ethernet with Cat6 cables and a high-speed switch. The problem is that I can't reproduce the results of that page for the 1M-cell cavity case using 32 processors. The solution on 1 node scales well; however, when increasing to 2 and 3 nodes (40 and 60 processors respectively) there is no substantial speedup.

When I change the problem to a combustion case (PDE + ODE solutions), an interesting behaviour appears: the scalability of the ODE solution part is linear, but the PDE solution time stays the same as in the cavity case. So it occurred to me that this might be a problem of communication among the nodes, since the ODE part doesn't need any synchronisation while the PDEs do.

The conclusion: since the only major difference between my setup and the cluster in that report is the type of interconnect (Ethernet vs InfiniBand), it seems that this is the source of the lack of scalability under otherwise similar conditions. Is that true? Is there any report of a significant speedup using Ethernet communication among the nodes of a cluster?

Regards,
Ali
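For completeness, the runs across the nodes are launched roughly like this (the host names are placeholders for the three servers):
Code:
# "machines" hostfile, one line per node
node1 slots=20
node2 slots=20
node3 slots=20
followed by something like mpirun -np 60 --hostfile machines <solver> -parallel from the case directory.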

May 31, 2015, 18:58

#25
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Rep Power: 128
Hi Ali,
Quote:
http://www.cfd-online.com/Forums/har...tml#post518234 - post #8
Your cluster already falls within the details given in the image, namely that a 1 Gbps connection is not enough to support so many processors.

Best regards,
Bruno
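If you want to confirm what the Ethernet link actually delivers between two of the nodes, a quick check with iperf3 is one option (assuming it is installed on both machines; the host name is a placeholder):
Code:
# on the first node
iperf3 -s

# on the second node
iperf3 -c node1
For parallel CFD, the latency of the interconnect usually matters at least as much as the raw bandwidth, which is where InfiniBand makes the biggest difference.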

October 29, 2015, 07:31

#26
Senior Member
Join Date: Mar 2015
Posts: 250
Rep Power: 12
Quote:
Would you mind explaining this part of your quote in more detail? How can you tell, then, whether it is a CPU cache problem? What should be held in the cache? I can't imagine that even the L3 cache is big enough to hold the whole mesh.

Do you know of a tutorial or description of how to use the hierarchical decomposition method? I searched the user guide and the forum but didn't get a clue.

Best regards,
Kate

October 31, 2015, 09:44

#27
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Rep Power: 128
Hi Kate,
Quote:
Quote:
Best regards,
Bruno

November 2, 2015, 05:28

#28
Senior Member
Join Date: Mar 2015
Posts: 250
Rep Power: 12
Hi Bruno,
I understand your thought process, but what does this mean for a real simulation? The problem is that you can't actually see what is slowing down your parallel simulation, can you? My current procedure on a 2-socket machine, each socket having 6 cores and 3 memory channels, is the following:
1) Run the case in serial to have a reference
2) Run 2 processes on different sockets, core-bound
3) Run 4 processes, 2 on each socket, core-bound
4) The same with 6, 8, 10 and 12 processes
I run these test cases for 10 iterations each (is that enough?), see which one finishes fastest, and go with that configuration for this case. Is there any other method?

Regarding the hierarchical decomposition method: not really. I don't understand what it is supposed to do. A quick example:
Code:
hierarchicalCoeffs
{
    n       ( 3 1 2 );
    delta   0.001;
    order   xyz;
}
which, as far as I understand, should split the domain like this:
Code:
----------------------
I      I      I      I
----------------------
I      I      I      I
----------------------
How does the order of splitting affect the outcome?

Best regards,
Kate

Last edited by KateEisenhower; November 2, 2015 at 05:40. Reason: clarification
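For what it's worth, one way to compare the test runs afterwards is to look at the ExecutionTime line that OpenFOAM solvers print in the log at every time step (assuming each run writes its own log file; the file names here are made up):
Code:
grep "ExecutionTime" log.serial | tail -1
grep "ExecutionTime" log.np02   | tail -1
grep "ExecutionTime" log.np12   | tail -1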

November 2, 2015, 18:34

#29
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Rep Power: 128
Hi Kate,
Quote:
Quote:
Quote:
Keep in mind that OpenFOAM technically uses boundary conditions of type "processor" for communicating the data between subdomains. And since small changes in a boundary condition can affect the solution, more or fewer iterations might be needed to reach convergence. These can be either iterations at the level of the matrix solvers (e.g. GAMG) or outer iterations of the application solver (e.g. simpleFoam).
Quote:
To a lesser extent, the other objective is to have the simulation solved in the most efficient way possible, and simultaneously if possible. This can be tested by modifying the "incompressible/icoFoam/cavity" tutorial case to be 3D and then testing the various orders of decomposition. In theory, if all of the subdomains work through their equation matrices in the exact same order in parallel, this should be the optimal way to process the data.

From your ASCII drawing, the efficient way would be to have all 6 processes work from left to right, then one line down and left to right again, within their own subdomains, so that they are working side by side on the same parts of the matrices, at least for each pair of processes. I'm oversimplifying this, but it should become more apparent when testing a 3D cavity case with a uniform mesh and a uniform distribution of cells between the processes. Translating this to a real simulation isn't as straightforward, but it can at least help you reduce the number of tests you need to do when looking for the best decomposition.

For more complex meshes, the usual decomposition to go with is Scotch or Metis, since they use graph partitioning to try to minimise the number of faces needed for communication between subdomains.

Best regards,
Bruno
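For reference, the Scotch method needs nothing more than this in system/decomposeParDict (a sketch; the coefficients block is optional):
Code:
numberOfSubdomains 12;

method          scotch;

// optional, e.g. to weight subdomains on heterogeneous hardware:
// scotchCoeffs
// {
//     processorWeights ( 1 1 1 1 1 1 1 1 1 1 1 1 );
// }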

October 8, 2017, 01:08
Can you help me? Errors appear when I run in parallel, with the command "reconstructPar"

#30
Member
ESI
Join Date: Sep 2017
Posts: 49
Rep Power: 9
Hi, everyone.
I am running a case in parallel in OpenFOAM. When I run the command "reconstructPar -latestTime", errors appear. First, some of the face coordinates in the polyMesh contain a "word" where a number should be. Second, in the p file, symbols such as "^, $, &" appear inside the numbers, as shown here. I hope someone can help me.
thanh.jpg

October 8, 2017, 14:18

#31
New Member
Join Date: Dec 2012
Posts: 19
Rep Power: 13
Quote:
What solver did you use? It appears to me that your mesh has changed during the run; in that case you need to reconstruct the mesh first, and then reconstruct the fields.
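The usual sequence for that is along these lines (a sketch; check the -help output of both utilities, since the available options vary a bit between OpenFOAM versions):
Code:
reconstructParMesh -latestTime   # rebuild the mesh from the processor* directories first
reconstructPar -latestTime       # then rebuild the fields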

November 1, 2017, 10:25

#32
New Member
Join Date: Jul 2017
Posts: 10
Rep Power: 9
Quote:
Hi Edmund,
I tried to do a parallel calculation on two networked PCs, but the simulation does not run any further; it gets stuck as shown below. Please help me find my mistake.
[15:18][tec0683@rue-l020:/disk1/krishna/EinfacheRohre/bendtubeparalle/bendingtube]$ mpirun -np 8 -hostfile machines simpleFoam -parallel
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.1.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 2.1.1-221db2718bbb
Exec   : simpleFoam -parallel
Date   : Nov 01 2017
Time   : 15:18:49
Host   : "linxuman"
PID    : 13714

With regards,
Anna