|
Parallel processing of OpenFOAM cases on multicore processor??? |
|
February 20, 2010, 04:41 |
Parallel processing of OpenFOAM cases on multicore processor???
|
#1 |
New Member
Ghasem Akbari
Join Date: Nov 2009
Posts: 7
Rep Power: 17 |
Dear experts,
I'm going to run a huge OpenFOAM case on a quad-core processor, and I don't know whether OpenFOAM can do parallel processing on a shared-memory computer. As far as I know, OpenFOAM uses MPI for parallel processing, and MPI is aimed at distributed-memory computers. I also don't have access to a cluster. Is there any way to run in parallel on my shared-memory computer and use all of the CPU's cores in the calculations? Best regards. |
|
February 20, 2010, 07:58 |
|
#2 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings g.akbari,
OpenFOAM's ThirdParty package comes with OpenMPI. As far as I know, OpenMPI automatically decides which communication protocol to use between processes, depending on whether they are on the same computer or on different computers. Nonetheless, you can look up in the OpenMPI manual how to explicitly choose a protocol, including "shared memory". You can then edit the script $WM_PROJECT_DIR/etc/foamJob and tweak the options it passes to OpenMPI! Sadly I don't have much experience with OpenMPI, so I only know that it's possible. Best regards, Bruno Santos |
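P.S.: for illustration, a minimal sketch of forcing the shared-memory transport by hand (assuming the OpenMPI 1.x series shipped with the ThirdParty package, where the shared-memory BTL is called "sm"; the solver and log name are just placeholders):

# Run the decomposed case on 4 local cores, restricting OpenMPI to the
# shared-memory and self transports (OpenMPI 1.x BTL names):
mpirun --mca btl self,sm -np 4 interFoam -parallel > log.interFoam 2>&1

On a single machine OpenMPI normally picks the shared-memory transport on its own, so this is mostly useful for verifying what it is doing.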
|
February 20, 2010, 08:44 |
|
#3 |
Member
|
You just need to set up a decomposeParDict to partition the case.
The user manual has a good tutorial on page 63. I run all my cases on my mini cluster with two quad-core Xeon processors and 8 GB of RAM. Good luck, CX |
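For reference, a minimal system/decomposeParDict along the lines of the damBreak tutorial (a sketch only — the 2 x 2 x 1 simple split is just an example; adjust numberOfSubdomains to the number of cores you actually have):

FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

numberOfSubdomains  4;          // one subdomain per core to be used
method              simple;     // geometric decomposition

simpleCoeffs
{
    n       (2 2 1);            // 2 x 2 x 1 = 4 pieces
    delta   0.001;
}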
|
February 20, 2010, 09:47 |
|
#4 |
New Member
Ghasem Akbari
Join Date: Nov 2009
Posts: 7
Rep Power: 17 |
Thanks a lot. I ran the damBreak tutorial and it worked fine. Now my problem is that I get no speed-up: the serial execution time is shorter than the parallel execution time. What is the reason for this behaviour?
|
|
February 20, 2010, 10:05 |
|
#5 |
Member
|
I don't have an answer for that question.
The only thing I can say is that you will certainly reach convergence faster with a parallel execution. Did you try to run the damBreak case without MPI? CX |
|
February 20, 2010, 10:32 |
|
#6 |
New Member
Ghasem Akbari
Join Date: Nov 2009
Posts: 7
Rep Power: 17 |
Yes, I executed damBreak two times:
using MPI with the number of processors set to 4, the calculation takes 122 s; without MPI, in serial mode, it takes 112 s. Maybe I should use a finer mesh to obtain better speed-up. |
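For anyone repeating the comparison, the sequence is roughly the following (a sketch based on the user-guide damBreak tutorial; the log file names are arbitrary and some versions also need the initial alpha1 field restored before setFields):

blockMesh                                   # build the mesh
setFields                                   # initialise the water column
interFoam > log.serial 2>&1                 # serial run

decomposePar                                # split the case per system/decomposeParDict
mpirun -np 4 interFoam -parallel > log.parallel 2>&1
reconstructPar                              # merge the processor* directories again

If I remember correctly, "foamJob -p interFoam" wraps the mpirun call for you.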
|
February 20, 2010, 10:43 |
|
#7 |
New Member
Ghasem Akbari
Join Date: Nov 2009
Posts: 7
Rep Power: 17 |
After using a finer mesh, the speed-up is now greater than unity. Thank you very much for your comments, wyldckat and xisto.
Sincerely |
|
November 18, 2010, 18:52 |
Multi-Core Processors: Execution Time vs Clock Time
|
#8 |
Senior Member
Daniel
Join Date: Jul 2009
Location: Montreal, Canada
Posts: 156
Rep Power: 17 |
Hello all,
I have a single dual-core processor, and I am trying to determine whether it is possible to use parallel processing with multiple cores rather than multiple processors. I ran wingMotion2D_pimpleDyMFoam three times, with 1, 2 and 3 "processors" specified each time in system/decomposeParDict.

The results were predictable in that the single-"processor" run required 1.5x the time of the dual-"processor" run. I then tried 3 processors just to see what would happen, and found that this required 99.9% of the execution time of the dual-core run, but 150% of its clock time. I did not know what to expect from running a 3-processor parallel simulation on a dual-core, single-processor system, but this anomaly was definitely not expected.

Can anyone tell me why the clock and execution times would differ so much, but only when #processors in decomposeParDict > #cores in the computer? Sure, it was a silly test, but now that I have strange results it does make me wonder. Thanks, Dan |
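As an aside, before choosing numberOfSubdomains it is worth checking how many physical cores (rather than hyper-threaded logical CPUs) the machine actually exposes. A quick sketch on Linux (lscpu comes with util-linux; the field names may differ slightly between versions):

# Sockets, cores per socket and threads per core
lscpu | grep -E "Socket|Core|Thread"

# Or simply count the logical processors
grep -c ^processor /proc/cpuinfo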
|
November 19, 2010, 05:48 |
|
#9 |
Senior Member
Anton Kidess
Join Date: May 2009
Location: Germany
Posts: 1,377
Rep Power: 30 |
My guess: you're still doing the same amount of computation, so the CPU time is similar, but you're wasting lots of time on communication and process switching, so the wall-clock time is larger.
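The two figures are easy to compare from the solver output itself, since OpenFOAM prints both on every time step (assuming the run was redirected to a file called log, which is only a convention here):

# ExecutionTime is the CPU time, ClockTime the elapsed wall-clock time;
# a widening gap hints at communication or scheduling overhead.
grep "ExecutionTime" log | tail -n 5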
|
|
May 27, 2011, 20:57 |
|
#10 |
New Member
Prashant Gupta
Join Date: Mar 2011
Location: Edinburgh
Posts: 29
Rep Power: 15 |
Hey guys,
I am facing a similar problem here: my parallel run with 6 processors is taking far longer than the serial one. Does anyone have a clue what might be happening? |
|
May 28, 2011, 20:18 |
|
#11 | |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings Prashant,
Off the top of my head, here are a few pointers:
Bruno
|
December 15, 2012, 14:10 |
|
#12 |
Member
ali jafari
Join Date: Sep 2012
Posts: 50
Rep Power: 14 |
Hi,
I have decided to buy a multi-core Intel computer (6 cores, 12 threads). My question: does OpenFOAM make use of the CPU's threads (i.e. Hyper-Threading) for processing? |
|
December 15, 2012, 15:41 |
|
#13 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings Ali Jafari,
I know this has been discussed here on the forum before, but I'm not in the mood to go searching. A summary is as follows:
Best regards, Bruno
Last edited by wyldckat; December 16, 2012 at 06:21. Reason: FPU isn't "Floating Processing Unit"... it's Floating-point Unit |
|
March 4, 2013, 10:19 |
|
#15 |
New Member
Join Date: Jan 2013
Posts: 15
Rep Power: 13 |
Dear all,
it is a little bit late, but I want to share my findings on parallel runs in OpenFOAM. I used the lid-driven cavity case for benchmarking and found that, for this case, RAM-CPU communication was the bottleneck. The best scaling I found came from using only 2 cores per CPU (8 CPUs with 4 cores each, no HT) together with core binding, especially on dual-CPU machines. Doing so, the case scaled almost linearly up to 16 cores and was faster than using all 32 available cores. Kind regards, Edmund |
|
March 4, 2013, 18:16 |
|
#16 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Hi Edmund,
Can you share some more information about the characteristics of the machines you've used? Such as:
Bruno
|
|
March 5, 2013, 03:07 |
|
#17 |
New Member
Join Date: Jan 2013
Posts: 15
Rep Power: 13 |
Hi Bruno,
The processors are Xeon W5580 (4 cores) or X5680 (6 cores), running at 3.2 and 3.33 GHz respectively, without overclocking. The memory is DDR3-1333, and the interconnect is ordinary 1 Gbps Ethernet. The model size was 1 million cells.

Running on 2 cores, the speed-up is 2. Using 4 cores on the same CPU, the speed-up is only ~2.6, but using 4 cores spread over 2 CPUs the speed-up is nearly 4! It seems to be the same in the unofficial benchmarks (http://code.google.com/p/bluecfd-sin...SE_12.1_x86_64).

So on a cluster where the machines have more than one CPU, you need to do core binding and define which task runs on which CPU core, using an additional rank file in OpenMPI. Example for 2 dual-CPU machines (no matter whether they have 4 or 6 cores each):

mpirun -np 8 -hostfile ./hostfile.txt -rankfile ./rankfile.txt icoFoam -parallel

hostfile.txt:
host_1
host_2

rankfile.txt:
rank 0 =host_1 slot=0:0
rank 1 =host_1 slot=0:1
rank 2 =host_1 slot=1:0
rank 3 =host_1 slot=1:1
rank 4 =host_2 slot=0:0
rank 5 =host_2 slot=0:1
rank 6 =host_2 slot=1:0
rank 7 =host_2 slot=1:1

Now the job runs on 8 cores, using cores 0 and 1 of each of the 4 CPUs, with a speed-up of more than 7. Perhaps the newest generation of CPUs has faster CPU-RAM communication, so that one could use 3 cores per CPU. Kind regards, Edmund

Last edited by eddi0907; March 5, 2013 at 04:03. |
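As an untested side note: later OpenMPI releases (1.8 and newer, if I am not mistaken) can express much the same placement without a hand-written rank file. The exact mapping depends on the hostfile slots and the OpenMPI version, so treat this as a sketch and verify the result with --report-bindings:

# 4 ranks per node, spread round-robin over the sockets, each pinned to a core
mpirun -np 8 -npernode 4 -hostfile ./hostfile.txt --map-by socket --bind-to core --report-bindings icoFoam -parallel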
|
March 5, 2013, 17:30 |
|
#18 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Hi Edmund,
Many thanks for sharing the information! But I'm still wondering if there isn't some specific detail we're missing. I did some searching and:
From correlating all of this information, my guess is that your machines only have 2 RAM modules populated per socket... 4 modules in total per machine. Either that, or 1 million cells is not enough for a full test! Best regards, Bruno
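P.S.: if it helps to check that guess, the DIMM population can usually be read out without opening the box. A sketch, assuming dmidecode is installed and you have root access:

# List the size and slot of every memory module; empty slots report "No Module Installed"
sudo dmidecode --type memory | grep -E "Size|Locator"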
|
|
March 6, 2013, 04:05 |
|
#19 |
New Member
Join Date: Jan 2013
Posts: 15
Rep Power: 13 |
Hi Bruno,
the slots are all filled with equal-sized DIMMs. What do you mean by "full test"? Up to 20 cores, 1 million cells still gives at least 50k cells per core. I don't want to tell stories: attached you can find an overview of the timings I got, plus the test case I used, in zip format. Could you please cross-check the speed-up from 1 to 2 and to 4 cores, to see whether it is only my hardware that behaves badly? Kind regards, Edmund |
|
March 6, 2013, 06:32 |
|
#20 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Hi Edmund,
Thanks for sharing. I'll give it a try when I get an opening on our clusters. In the meantime, check the following:
Bruno
|