Scaling of parallel computation? Solver/thread count combinations? |
February 2, 2017, 12:01
#1
Member
Join Date: Jun 2016
Posts: 31
Rep Power: 10
Hi,
I'm currently looking into the scaling of OpenFOAM 4.0 and OF Extend 3.1 while running cases in parallel on my local machine (i7-6800K, 6C/12T @ 4 GHz, 32 GB DDR4-2666 quad-channel, Windows 7 64-bit Ultimate, Linux VMs with OF running in VirtualBox), using 4, 8 and 12 threads respectively. I've searched a bit about parallel scaling, but I've noticed behaviour that is strange, at least to me. After reading https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje and this PDF http://www.dtic.mil/get-tr-doc/pdf?AD=ADA612337 I was quite confident that I'd get a nice, approximately linear speedup on my little processor, but that wasn't the case at all.

I started with a laminar Hagen-Poiseuille pipe flow with about 144k cells and pisoFoam. Using 12 threads gave the slowest simulation speed, 8 threads were a little faster and 4 threads were somewhere in the middle. I figured the case was too small to profit from 12 subdomains and tested a lid-driven cavity flow at Re = 1000, again with pisoFoam and 1.0E6 cells, so roughly 83.3E3 cells per thread. Interestingly, 12 threads were again the slowest, 8 threads were the fastest and 4 threads were somewhere in the middle. In OF Extend, 4 threads were actually the fastest. I've read the following here in the forum:

Quote:

Cavity, 1M cells, GAMG/GAMG solving for p/U:
12 threads: 726 s walltime
8 threads: 576 s
4 threads: 691 s

Cavity, 1M cells, GAMG/GAMG solving for p/U, OF Extend:
12 threads: 1044 s walltime
8 threads: 613 s
4 threads: 592 s

The scaling for the laminar pipe flow case is roughly as bad. What is the cause? I'd appreciate any help.

Oh, I forgot: I use OpenMPI and start the cases with "mpirun -np num_of_threads foamJob pisoFoam -parallel", which should be correct.
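In case the exact setup matters: the full workflow looks roughly like this (a minimal sketch; scotch decomposition and the 8-subdomain count are just what I used for the 8-thread runs, and the mpirun line is the plain variant without the foamJob wrapper):

Code:
// system/decomposeParDict
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

numberOfSubdomains  8;      // must match the -np value used below
method              scotch; // automatic decomposition, no manual splitting directions needed

Code:
# run from the case directory
decomposePar                                        # split the case into processor0..processorN-1
mpirun -np 8 pisoFoam -parallel > log.pisoFoam 2>&1
reconstructPar                                      # merge the decomposed results afterwards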
February 2, 2017, 13:03
#2
Senior Member
khedar
Join Date: Oct 2016
Posts: 111
Rep Power: 10
1. Can you share the walltime for 1 thread?
2. Maybe it is because of the virtual machines?
3. Maybe the cache of your processor is not as large as that of the machines used in the studies you quoted.
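For point 3 you could check what the VM actually exposes; a quick sketch for a standard Linux guest (assuming lscpu is available, and keeping in mind that VirtualBox does not necessarily pass the host's real cache/topology figures through to the guest):

Code:
lscpu | grep -iE 'model name|socket|core|thread|cache'   # cores, threads per core, L1/L2/L3 cache sizes
grep 'cache size' /proc/cpuinfo | sort -u                # cache size as reported by the guest kernel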
February 3, 2017, 04:48
#3
Member
Join Date: Jun 2016
Posts: 31
Rep Power: 10
1. 1681 s.
2. Probably. I'll try to run some benchmarks on a native Linux machine.
3. The cache size per core is actually the same: 20 MB for 8 cores on the Xeon E5 2687W and 15 MB for 6 cores on the i7-6800K.
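For reference, taking that 1681 s as the serial baseline for the OF 4.0 cavity case above, the speedup S = T_serial/T_N comes out to roughly 1681/691 ≈ 2.4 on 4 threads (≈ 61 % parallel efficiency), 1681/576 ≈ 2.9 on 8 threads (≈ 36 %) and 1681/726 ≈ 2.3 on 12 threads (≈ 19 %), i.e. the efficiency already drops off sharply beyond 4 threads.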
February 3, 2017, 10:03
#4
Senior Member
Hi,
Since you only have 6 physical cores, you cannot expect any improvement from using more than 6 processes (read the section on hyperthreading in the PDF). The virtualisation may also hurt a bit. I would advise running with 1, 2, 4 and 6 cores. For large enough cases (100k+ cells) I would expect 6 cores to be fastest. However, you have 4 memory channels, so beyond 4 processes you may see less than linear scaling, since 6 cores are then trying to reach memory over only 4 channels.

Regards,
Tom
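PS: To keep the comparison clean you can also pin the MPI ranks to physical cores. A rough sketch, assuming Open MPI 1.8 or newer (for the --bind-to/--map-by options) and scotch decomposition as an example:

Code:
// system/decomposeParDict: one subdomain per physical core
numberOfSubdomains  6;
method              scotch;

Code:
decomposePar -force                                  # re-decompose after changing numberOfSubdomains
mpirun -np 6 --bind-to core --map-by core pisoFoam -parallel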
February 3, 2017, 11:33
#5
Member
Join Date: Jun 2016
Posts: 31
Rep Power: 10
Thanks for your reply, it helped a lot. I've read the HT part, but I didn't see any setup info there. I thought you were supposed to use the full thread count, since my CPU is "only" under about 50 % load when running 6 processes, so I figured you'd have to use them all. But you're right: I now get the fastest result with 6 processes.

Still, the speedup isn't quite as good as I hoped; only about 3 times faster with 6 times as many processes seems bad. I'm going to investigate this on our cluster and tinker a bit with GAMG vs. PCG, other solvers and the cell count.
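The two pressure-solver variants I plan to compare would look roughly like this in system/fvSolution (a sketch based on the standard tutorial settings; the tolerances are placeholders, not tuned values):

Code:
p
{
    solver          GAMG;        // geometric-algebraic multigrid
    smoother        GaussSeidel;
    tolerance       1e-06;
    relTol          0.05;
}

Code:
p
{
    solver          PCG;         // preconditioned conjugate gradient
    preconditioner  DIC;         // simplified diagonal-based incomplete Cholesky
    tolerance       1e-06;
    relTol          0.05;
}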
January 13, 2022, 11:42
#6
New Member
Join Date: May 2021
Posts: 9
Rep Power: 5
Hi guys,
I guess hyperthreading generally does not help for simulations; in fact, I disabled hyperthreading on my desktop PC, which is a 6-core i7 as well.

GAMG also adds cost in parallel runs, because the agglomeration can extend across the decomposed mesh interfaces; you can read about this in the OF user guide. There are currently special agglomeration algorithms available for GAMG to reduce this additional inter-processor communication, as I understand it, but for me they didn't show any benefit (maybe I applied them in the wrong way). However, my benchmark was very small and consisted of only a few simulations (conducted on an HPC cluster).

Preconditioners mostly behave inconsistently in parallel; only the diagonal preconditioner seems to work reliably.

Maybe, if you use more cores on your desktop, it slows down because a normal desktop CPU architecture is not specifically designed for parallel simulations. Also, background processes such as your OS and other applications need some computational capacity, which is then not available for the simulation.

Hope that helps at least a little.
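For reference, the processor agglomeration I tried is switched on with an entry along these lines in fvSolution (a sketch; the processorAgglomerator keyword with the masterCoarsest option exists in the newer OpenFOAM releases I used, but the details differ between versions, so check your version's documentation):

Code:
p
{
    solver                  GAMG;
    smoother                GaussSeidel;
    tolerance               1e-06;
    relTol                  0.01;
    processorAgglomerator   masterCoarsest;   // gather the coarsest level onto fewer ranks to cut communication
}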
January 13, 2022, 12:09
#7
Member
Join Date: Jun 2016
Posts: 31
Rep Power: 10
Yes, I was not really aware of the hyperthreading issue at the time, but after almost 5 years I have updated my knowledge a little. In retrospect it seems obvious to start only as many parallel threads as there are physical CPU cores.
January 13, 2022, 12:24
#8
New Member
Join Date: May 2021
Posts: 9
Rep Power: 5
Yes, that's true, it is an old thread; to be honest, I only saw the date after I had posted. But I thought I might say something that could be helpful in case anyone else has the same problems, so I left it there. Actually, after 5 years I guess you have much more experience with this than I have.
January 13, 2022, 12:39
#9
Member
Join Date: Jun 2016
Posts: 31
Rep Power: 10
It surely is helpful for other people who might stumble on this thread.

As for the experience: maybe, maybe not; you never know who you're dealing with. At the moment I don't care much about scaling and just guesstimate how many cores to use. If it's not the optimal number, so be it. Seeing that I don't deal with large cases too often and don't even use OF anymore, it's not that important.