Parallel processing of OpenFOAM cases on multicore processor???
February 20, 2010, 04:41   #1
Ghasem Akbari (New Member)
Dear experts,
I'm going to run a huge OpenFOAM case on a quad-core processor. I don't know whether it is possible in OpenFOAM to apply parallel processing on shared-memory computers. As far as I know, OpenFOAM uses MPI for parallel processing, and MPI is designed for distributed-memory computers. Also, I do not have the possibility of using clusters.
Is there any way to implement parallel processing on my shared-memory computer and use all cores of the CPU in the calculations?

Best regards.

February 20, 2010, 07:58   #2
Bruno Santos (Retired Super Moderator)
Greetings g.akbari,

OpenFOAM's ThirdParty package comes with OpenMPI. As far as I know, OpenMPI can automatically decide which communication protocol to use when communicating between processes, whether they are on the same computer or on different computers.
Nonetheless, you can look up in the OpenMPI manual how to explicitly choose the protocol, including "shared memory".
Then you can edit the script $WM_PROJECT_DIR/etc/foamJob and tweak the running options for OpenMPI!
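For illustration, here is a minimal sketch of forcing the shared-memory transport explicitly (this assumes OpenMPI 1.x, where the shared-memory BTL is named "sm"; check the manual of your OpenMPI version for the exact name, and substitute your own solver for icoFoam):
Code:
    mpirun --mca btl sm,self -np 4 icoFoam -parallel
The same --mca options can be added to the mpirun line inside foamJob.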

Sadly I don't have much experience with OpenMPI, so I only know it's possible.

Best regards,
Bruno Santos

February 20, 2010, 08:44   #3
Carlos Xisto (Member)
You just need to use decomposeParDict to make the partitions.

The user manual has a good tutorial on page 63.
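For reference, a minimal system/decomposeParDict for four subdomains could look like the sketch below (the simple method and the coefficient values are illustrative, not taken from the manual):
Code:
    numberOfSubdomains 4;

    method          simple;

    simpleCoeffs
    {
        n           (2 2 1);
        delta       0.001;
    }
Then run decomposePar, start the solver with something like mpirun -np 4 interFoam -parallel, and merge the results afterwards with reconstructPar.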

I run all my cases on my mini cluster with two quad-core Xeon processors and 8 GB of RAM.

Good luck

CX

February 20, 2010, 09:47   #4
Ghasem Akbari (New Member)
Thanks a lot. I performed the damBreak tutorial and it worked fine. Now my problem is that I gain no speed-up: the serial execution time is shorter than the parallel execution time. What's the reason for this behavior?

February 20, 2010, 10:05   #5
Carlos Xisto (Member)
I don't have an answer for that question.

The only thing I can say is that you will certainly attain faster convergence with parallel execution.

Did you try to run the damBreak case without MPI?

CX

February 20, 2010, 10:32   #6
Ghasem Akbari (New Member)
Yes, I executed damBreak twice:
with MPI and the number of processors set to 4, the calculation takes 122 s;
without MPI, in serial mode, it takes 112 s.
That is a speed-up of 112/122 ≈ 0.92, i.e. slower than serial. Maybe I should use a finer mesh to obtain a better speed-up.

February 20, 2010, 10:43   #7
Ghasem Akbari (New Member)
After using a finer mesh, the speed-up becomes larger than unity. Thank you very much for your comments, wyldckat and xisto.
Sincerely

November 18, 2010, 18:52   #8
Multi-Core Processors: Execution Time vs Clock Time
Daniel (Senior Member)
Hello all,

I have a single dual-core processor, and I am attempting to determine whether it is possible to use parallel processing with multiple cores rather than multiple processors. I ran wingMotion2D_pimpleDyMFoam three times, with 1, 2 and 3 "processors" specified each time in system/decomposeParDict. The results were predictable, in that the single-"processor" run required 1.5x the time of the dual-"processor" run. I then tried 3 processors just to see what would happen, and found that this required 99.9% of the execution time of the dual-"processor" run, but 150% of its clock time.

I did not know what to expect when running a 3-processor parallel simulation on a dual-core, single-processor system, but this anomaly was definitely not expected. Can anyone tell me why the clock and execution times would differ so much, but only when the number of processors in decomposeParDict exceeds the number of cores in the computer? Sure, it was a silly test, but now that I have strange results it does make me wonder.

Thanks,
Dan

November 19, 2010, 05:48   #9
Anton Kidess (Senior Member)
My guess: you're still doing the same amount of computation, so the CPU time is similar, but you're wasting lots of time on communication and process switching, so the wall-clock time is larger.
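For reference, OpenFOAM solvers report both quantities at every time step, so this overhead is easy to spot in the solver log (the numbers below are purely illustrative):
Code:
    ExecutionTime = 244.58 s  ClockTime = 367 s
ExecutionTime is the accumulated CPU time, while ClockTime is the elapsed wall-clock time.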

May 27, 2011, 20:57   #10
Prashant Gupta (New Member)
Hey guys,

I am facing similar problems here: my parallel run with 6 processors is taking far more time than the serial one. Does anyone have a clue what might be happening?

May 28, 2011, 20:18   #11
Bruno Santos (Retired Super Moderator)
Greetings Prashant,

Off the top of my head, here are a few pointers:
  1. The case is too small for running in parallel. A rule of thumb is a minimum of around 50k cells/core, but it also depends on the combination of solver, matrix solver and preconditioner.
  2. The case is too big and chaotic. Try running renumberMesh in parallel mode first, so the decomposed mesh is sorted out. I.e., run something like this:
    Code:
    foamJob -s -p renumberMesh
    Or something like this:
    Code:
    mpirun -np 6 renumberMesh -parallel
  3. The case is still too big, even when the mesh is renumbered. By this I mean that the real processor is taking too long to fetch data from very different sections of memory, which leads to a seriously non-optimized memory-access pattern. In other words: the 6 cores are mostly fetching directly from RAM, instead of taking advantage of the on-die cache system (L1, L2 and L3 cache).
    How to fix this? I don't know yet. All I know is that cache access is at least 10x faster than direct RAM access.
    I would suggest splitting the case into 2, 3, 4, 5, 6 and 12 sub-domains, to try to isolate whether it's a CPU cache problem. I've had a situation where a 6-core CPU was faster with 16 sub-domains than with 6.
  4. The decomposition method was not properly chosen/configured. Try the other decomposition methods and/or learn more about each one (see the snippet after this list). If your geometry isn't too complex, then metis/scotch won't help.
  5. Reproduce a benchmarked case, even an unofficial one. For example, the report from thread http://www.cfd-online.com/Forums/ope...v-cluster.html - this can help you figure out whether it's a solver-related problem, a configuration problem, or something you overlooked.
  6. Try disabling connection options in mpirun. For example:
    Quote:
    Originally Posted by pkr
    When using MPI_reduce, OpenMPI was trying to establish TCP through a different interface. The problem is solved if the following command is used:
    mpirun --mca btl_tcp_if_exclude lo,virbr0 -hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec interFoam -parallel

    The above command will prevent MPI from using certain network interfaces (lo and virbr0 in this case).
  7. Check out the mental notes I've got on my blog: Notes about running OpenFOAM in parallel - they are just notes for when I get some free time to write about this at openfoamwiki.net.
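Regarding item 4: switching the decomposition method is a one-line change in system/decomposeParDict; for example, scotch needs no coefficients at all (a sketch, assuming your OpenFOAM build includes scotch support):
Code:
    method          scotch;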
Best regards,
Bruno

December 15, 2012, 14:10   #12
ali jafari (Member)
Hi,

I have decided to buy a multi-core computer (an Intel CPU with 6 cores and 12 threads).

My question: does OpenFOAM use the CPU's threads for processing?

December 15, 2012, 15:41   #13
Bruno Santos (Retired Super Moderator)
Greetings Ali Jafari,

I know this has been discussed here on the forum, but I'm not in the mood to go searching.

A summary is as follows:
  1. Hyper-Threading (HT) was designed with user responsiveness in mind: users want a responsive system, even when certain applications are using a lot of processing power.
  2. This means that HT duplicates many of the capabilities of the CPU core, except for some of the more powerful units, such as the FPU (Floating-point Unit). Each core is therefore somewhat split into 2 HT parts sharing a single FPU.
  3. OpenFOAM (and pretty much any other CFD application) needs almost exclusive access to as many FPUs as the machine has.
  4. Using HT will therefore lead to 2 threads trying to shove numbers into a single FPU at nearly the same time. This actually isn't completely bad, since each thread can prepare the data for the FPU right after the previous thread, but that only yields an improvement of roughly 1 to 10%, depending on several details. Such an example can be seen in this simple (and unofficial) benchmark case: http://code.google.com/p/bluecfd-sin...SE_12.1_x86_64 - the 8-core column was actually 8 threads of an "i7 950 CPU, with 4 cores and with Hyper-Threading (HT) turned on".
  5. Another bottleneck is the RAM, cache and memory controller: having 6 or 12 threads accessing the RAM at nearly the same time leads to... well, barely any improvement.
  6. Which leads to the usual final conclusion: turn off HT in the BIOS/UEFI. Perhaps even overclock the CPU as well, which will give you an actual performance increase, at the cost of additional power consumption and increased heat production by the CPU... although, if done incorrectly, it will substantially decrease stability and reduce the life of the processor and/or motherboard.
For more about HT: http://en.wikipedia.org/wiki/Hyper-threading
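A quick way to check whether HT is currently active on Linux (a sketch; the exact output layout depends on your util-linux version):
Code:
    lscpu | grep -E 'Thread|Core|Socket'
If "Thread(s) per core" reports 2, Hyper-Threading is enabled.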



Best regards,
Bruno
Last edited by wyldckat; December 16, 2012 at 06:21. Reason: FPU isn't "Floating Processing Unit"... it's Floating-point Unit

December 16, 2012, 02:37   #14
ali jafari (Member)
Dear wyldckat,

Your explanation was very useful. Thank you very much!

Last edited by ali jafari; December 16, 2012 at 03:28.

March 4, 2013, 10:19   #15
eddi0907 (New Member)
Dear all,

It is a little bit late, but I want to share my findings on parallel runs in OpenFOAM.

I used the lid-driven cavity case for benchmarking.

I found that, for this case, the RAM-CPU communication was the bottleneck.

The best scaling I found was with only 2 cores per CPU (8 CPUs with 4 cores each, no HT) together with core binding, especially on dual-CPU machines.

Doing so, the case scaled almost linearly up to 16 cores, and was faster than using all 32 available cores.

Kind Regards.

Edmund

March 4, 2013, 18:16   #16
Bruno Santos (Retired Super Moderator)
Hi Edmund,

Can you share some more information about the characteristics of the machines you've used? Such as:
  • Which processor models and speeds?
    • With or without overclocking?
  • Which RAM types and speeds?
  • Was it over a normal Ethernet connection? 1 Gbps?
Best regards,
Bruno

March 5, 2013, 03:07   #17
eddi0907 (New Member)
Hi Bruno,

The processors are Xeon W5580 (4 cores) or X5680 (6 cores), running at 3.2 and 3.33 GHz respectively, without overclocking.
The memory is DDR3-1333.
I used normal 1 Gbps Ethernet.

The model size was 1 million cells.

Running on 2 cores, the speed-up is 2 as well.
Using 4 cores on the same CPU, the speed-up is only ~2.6, but using 4 cores on 2 CPUs the speed-up is nearly 4!
It seems to be the same when looking at the unofficial benchmarks (http://code.google.com/p/bluecfd-sin...SE_12.1_x86_64).

So, on a cluster where you use machines with more than one CPU, you need to do core binding and define which task runs on which CPU core, using an additional rankfile in OpenMPI.

Example with 2 dual-CPU machines (no matter whether they have 4 or 6 cores):

Code:
    mpirun -np 8 -hostfile ./hostfile.txt -rankfile ./rankfile.txt icoFoam -parallel

hostfile.txt:

Code:
    host_1
    host_2

rankfile.txt:

Code:
    rank 0=host_1 slot=0:0
    rank 1=host_1 slot=0:1
    rank 2=host_1 slot=1:0
    rank 3=host_1 slot=1:1
    rank 4=host_2 slot=0:0
    rank 5=host_2 slot=0:1
    rank 6=host_2 slot=1:0
    rank 7=host_2 slot=1:1

Now the job runs on 8 cores, distributed over cores 0 and 1 of each of the 4 CPUs, with a speed-up of more than 7.
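As a side note (an assumption about the installed OpenMPI version, not something stated above): with OpenMPI 1.4+, a layout like this can also be requested without an explicit rankfile, for example:
Code:
    mpirun -np 8 -hostfile ./hostfile.txt -npersocket 2 --bind-to-core icoFoam -parallel
On OpenMPI 1.8 and newer, the rough equivalent would be --map-by ppr:2:socket --bind-to core.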

Perhaps the newest generation of CPUs has faster CPU-RAM communication, and one can use 3 cores per CPU.

Kind Regards.

Edmund

Last edited by eddi0907; March 5, 2013 at 04:03.

March 5, 2013, 17:30   #18
Bruno Santos (Retired Super Moderator)
Hi Edmund,

Many thanks for sharing the information!

But I'm still wondering if there isn't some specific detail we're missing. I did some searching, and my question is: do you know how many memory channels your machines are using? In other words, are all of the memory slots filled with evenly sized RAM modules?
Because, correlating all of this information, my guess is that your machines only have 2 RAM modules assigned per socket... 4 modules in total per machine.

Either that, or 1 million cells is not enough for a full test!

Best regards,
Bruno

March 6, 2013, 04:05   #19
eddi0907 (New Member)
Hi Bruno,

The slots are all filled with equally sized DIMMs.

What do you mean by "full test"?
Up to 20 cores, 1 million cells is still at least 50k cells/core.

I don't want to tell stories.

Attached you can find an overview of the timings I measured, and the test case I used, in zip format.

Could you please cross-check the speed-up from 1 to 2 and 4 cores, to see whether only my hardware behaves badly?


Kind Regards.

Edmund
Attached Files
File Type: pdf cluster_openfoam_performance.pdf (62.5 KB, 432 views)
File Type: zip cavityFine.zip (15.6 KB, 104 views)

March 6, 2013, 06:32   #20
Bruno Santos (Retired Super Moderator)
Hi Edmund,

Thanks for sharing. I'll give it a try when I get an opening on our clusters.

In the meantime, check the following:
Best regards,
Bruno
