Openfoam parallel calculation performance study - Half performance on mpirun
November 29, 2017, 12:07 |
Openfoam parallel calculation performance study - Half performance on mpirun
|
#1 |
New Member
Jurado
Join Date: Nov 2017
Posts: 22
Rep Power: 9 |
Hello,
I think this topic may interest anyone who wonders about the parallel performance of OpenFOAM. I ran tests to measure how well OpenFOAM computes in parallel, on versions 5.0 and 4.1, using both the standard Ubuntu 16.04 installation (command and packages) and my own build from source; both showed the same behaviour.

My test measured, on cases ranging from 200k to 3M cells, the time to compute 150 iterations with various numbers of processors, from 1 up to the maximum the computer has. Hyper-threading was deactivated on computers that had it.

On every computer I tried, parallel performance reached a limit at about half the processors the computer has. By that I mean: if your computer has 20 processors, you gain almost nothing from running on 20 rather than on 10. For example, a case that needs 60 seconds on 10 processors needs 57 seconds on 20. And if I run 2 cases of 10 processors each at the same time, they need about 117 seconds in total, as if I had run them one after the other.

I tried options like --bind-to core and numactl --cpunodebind, but they do not overcome the issue. My guess is that either some Open MPI option is wrong, or the hardware limit is not the number of processors but perhaps the cache memory or the RAM frequency.

I also found this article, which tackles the issue but gives no tested solution: https://pdfs.semanticscholar.org/0e1...83cf857e19.pdf

Has anyone else seen the same problem and managed to overcome it?
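The benchmarking loop described above can be sketched as a short script. This is a hedged sketch only: simpleFoam and the case layout are placeholders for whatever solver and case you actually use, and foamDictionary is assumed to be available, as in recent OpenFOAM versions.

```shell
# Time 150 iterations of a solver at several core counts.
# simpleFoam and the case paths are placeholders.
run_scaling_test() {
    for n in 1 10 20 30 40; do
        # re-decompose the case for n subdomains
        foamDictionary -entry numberOfSubdomains -set "$n" system/decomposeParDict
        decomposePar -force > "log.decomposePar.$n"
        start=$(date +%s)
        mpirun -np "$n" simpleFoam -parallel > "log.run.$n"
        end=$(date +%s)
        echo "$n cores: $((end - start)) s for 150 iterations"
    done
}
```

Invoking `run_scaling_test` from inside a decomposable OpenFOAM case directory would reproduce the kind of sweep reported in this thread.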
|
November 30, 2017, 06:25 |
|
#2 |
Member
Hilbert
Join Date: Aug 2015
Location: Australia
Posts: 50
Rep Power: 11 |
Hi Jurado,
The scaling performance depends on the number of cells per core, so when you quote runtimes you also have to tell us which mesh they were run on. The point where the slowdown starts depends heavily on the hardware and on the compilation. A number quoted a lot on this website is 50k cells per core, which would put the optimum for your 200k-cell case at about 4 cores. My group is lucky to have access to a supercomputer in the top 100; on that machine OpenFOAM showed perfect speedup down to 10k cells per core, while another solver we tested kept a perfect speedup down to 4k cells per core.
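The 50k-cells-per-core rule of thumb quoted above turns into quick shell arithmetic; the 50,000 threshold is just the forum figure, not a hard limit.

```shell
# Rough optimum core count under the 50k-cells-per-core rule of thumb,
# for the mesh sizes discussed in this thread.
for cells in 218807 457583 849346 1682903 3064411; do
    echo "$cells cells -> about $((cells / 50000)) cores"
done
```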
|
November 30, 2017, 10:11 |
|
#3 |
New Member
Jurado
Join Date: Nov 2017
Posts: 22
Rep Power: 9 |
Hi Hillie,
Thank you for your answer and interest. For the installation and compilation I used the default Ubuntu installation with gcc 5.4. I also compiled from source myself with the Intel compiler icc 17 and a custom Open MPI (version 3), but the results were unfortunately the same.

My tests covered a wide range of cells per processor, summarized in the following table (rows: number of cells; columns: number of processors; entries: time in seconds for 150 iterations):

Cells      |    1 |  10 |  20 |  30 |  40
218,807    |   44 |   5 |   2 |   2 |   2
457,583    |  106 |  12 |   7 |   6 |   6
849,346    |  209 |  24 |  16 |  15 |  13
1,682,903  |  463 |  56 |  38 |  37 |  34
3,064,411  |  886 | 117 |  82 |  79 |  71

As one can see (except for the 200k-cell case, which finishes too fast to say much), between 20 and 40 processors there is almost no gain in computation time for meshes from 400k to 3M cells. So in my tests the number of cells per processor did not really seem to matter: at about maxProc/2 = 40/2 = 20 processors, the gain in time becomes insignificant for every case.

There is also a problem with running computations side by side. A computation on 10 processors needs, say, 50 seconds; two such computations on 2x10 processors need about 95 seconds, nearly as if they had run sequentially. This example was made on a server with 40 processors. Nevertheless, the rule of significant gain only up to maxProc/2 was seen on 4 other computers with 16 to 24 processors, all with the default Ubuntu installation.

So to me it seems there is an issue either with the default installation parameters of OpenFOAM and Open MPI, or with the hardware of standard computers. But the behaviour is oddly the same on several computers of different kinds and ages.
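From the timings in this post, speedup and parallel efficiency can be computed directly. A sketch using awk, with the times as quoted:

```shell
# Parallel speedup S(N) = t(1 core) / t(N cores), efficiency E = S/N,
# computed from the timings reported in the thread.
awk 'BEGIN {
    print "cells      S(10)  S(20)  S(40)  E(40)"
}
{
    printf "%-10s %5.1f  %5.1f  %5.1f  %4.2f\n", $1, $2/$3, $2/$4, $2/$6, $2/$6/40
}' <<'EOF'
218807   44   5   2   2   2
457583  106  12   7   6   6
849346  209  24  16  15  13
1682903 463  56  38  37  34
3064411 886 117  82  79  71
EOF
```

For the 3M-cell case this gives a 40-core speedup of about 12.5, i.e. roughly 31% parallel efficiency, which quantifies the plateau described above.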
|
December 1, 2017, 02:15 |
|
#4 |
Member
Hilbert
Join Date: Aug 2015
Location: Australia
Posts: 50
Rep Power: 11 |
Hi Jurado,
Interesting table, and I think something odd is indeed going on here. You said you switched off hyper-threading, but how many physical cores do the machines have? The 40-processor runs, for example: were they on a machine with 40 physical cores?
|
December 1, 2017, 07:15 |
|
#5 | |
New Member
Jurado
Join Date: Nov 2017
Posts: 22
Rep Power: 9 |
Yes, a lot of people I have discussed this issue with asked me that question. But the server on which I ran this test has no hyper-threading, and its processors are 40x Intel(R) Xeon(R) CPU E7-4870 @ 2.40 GHz. The computer on which I ran similar tests, with similar results, does have hyper-threading, but I knew how many physical processors it had, and the maxProc/2 rule applied without counting the threads: that computer has 20 processors plus 20 threads, and it stops gaining a significant amount of time at 10 processors. If you want to see the tests, I can attach a file with the simulations for a given cell count.
|
December 2, 2017, 18:18 |
|
#6 |
Member
Hilbert
Join Date: Aug 2015
Location: Australia
Posts: 50
Rep Power: 11 |
Hi Jurado,
You clearly have threads on that Xeon: https://ark.intel.com/products/53579...-GTs-Intel-QPI . For the rest it looks very system dependent. There are a couple of interesting threads on the web that report similar behaviour: Intel Core-i7 Hyperthreading and CFX, OpenFOAM + Hyperthreading
|
December 3, 2017, 08:12 |
|
#7 |
New Member
Jurado
Join Date: Nov 2017
Posts: 22
Rep Power: 9 |
Hi Hillie,
Thank you for helping me, but I still have some questions. That could indeed explain the behaviour I saw on the server, if it really has 20 cores rather than 40. But then it is strange that I see the same thing on the computer whose threads I do know (my own): it has 20 cores plus 20 threads, and beyond mpirun -np 10 it stops gaining significant time.

There is also something I still do not fully understand about threads and cores. I use the command "hwloc-ls" to see the processors and their layout on both the server and my computer (I have attached screenshots of the output). On my computer it shows the threads within each core, but on the server it does not; that is why I believed there were no threads. Moreover, the page you linked says this Xeon has 10 cores, which in my case would correspond to one of my NUMA nodes. Maybe there is a notion I am missing. Could you enlighten me please? Thank you for your help.
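As a cross-check on what hwloc-ls shows, logical CPUs and physical cores can be counted directly from /proc/cpuinfo on Linux. A sketch; note that on some virtual machines the "physical id" field is missing, so the second count can come out as 0.

```shell
# Count logical CPUs and unique (socket id, core id) pairs.
logical=$(grep -c '^processor' /proc/cpuinfo)
physical=$(awk -F': ' '/^physical id/ {p=$2} /^core id/ {print p ":" $2}' \
    /proc/cpuinfo | sort -u | wc -l)
echo "logical CPUs: $logical, physical cores: $physical"
# equal counts mean hyper-threading is off (or absent)
```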
|
December 3, 2017, 18:06 |
|
#8 |
Member
Hilbert
Join Date: Aug 2015
Location: Australia
Posts: 50
Rep Power: 11 |
Hi Jurado,
I had never heard of the "hwloc-ls" command. When I run it, it says "physical" at the bottom, indicating that it is showing the physical cores; can you confirm you see that as well? If those are indeed physical cores, then you have 2 CPUs with 10 cores each in your computer and 4 CPUs with 10 cores each in your server. That would suggest hyper-threading is not the problem.

The more I think about it, the more I suspect your memory bus. Once you go beyond one CPU, the memory bus must be quick enough to move the data between the CPUs. If the bus were very fast, then from the data in your table you would expect the time to drop in half going from 10 to 20 cores; but it doesn't. What is interesting is that the speedup is mesh dependent: you get less of a speedup for the heavier meshes than for the lighter ones. And once you go to 3 CPUs, the communication time completely determines the solution time.

Cheers,
|
December 3, 2017, 22:26 |
|
#9 |
Member
Join Date: Nov 2014
Posts: 92
Rep Power: 12 |
The speed of the RAM is a very important factor in determining how far the CPU count scales. Can you provide the RAM information by typing "lshw -c memory"?
Your mesh is far too small for 40 CPUs. Some posters say at least 50k cells per physical core should be used to achieve good scaling. I guess you have 480 physical cores, right? If yes, then your mesh size should be around 24M cells.
|
December 4, 2017, 14:48 |
|
#10 |
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,762
Rep Power: 66 |
I don't follow. The Xeon(R) CPU E7-4870 has 10 physical cores.
If you have a workstation/server with 2 of these sockets, you'll have 20 physical cores when hyper-threading is off, and your scaling should then stop at 20 cores. I see that from 10 to 20 cores you achieve sqrt(Ncores) scaling, which is still pretty good. By the way, I've run XiFOAM on 1200-1400 processors with as few as 3k-5k cells per core and achieved sqrt(Ncores) scaling. Or are you looking for linear scaling?
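The sqrt(Ncores) observation can be checked against the 3M-cell row of the table posted earlier: under sqrt scaling, doubling the cores should cut the time by about sqrt(2), roughly 1.41.

```shell
# Observed 10 -> 20 core time ratio for the 3,064,411-cell case vs sqrt(2).
awk 'BEGIN {
    t10 = 117; t20 = 82   # seconds, from the table in the thread
    printf "observed: %.2f, sqrt(2): %.2f\n", t10 / t20, sqrt(2)
}'
```

The observed ratio of about 1.43 is close to sqrt(2), consistent with the sqrt(Ncores) scaling described above.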
|
December 5, 2017, 05:31 |
|
#11 |
New Member
Jurado
Join Date: Nov 2017
Posts: 22
Rep Power: 9 |
Thank you for your interest in my topic.

Hi Hillie,
It is indeed the physical cores. I was also thinking that the communication between the CPUs was the problem and that it was somehow saturated when more than 2 CPUs work together. However, after studying how my server schedules work, I am not so sure. It uses all the CPUs whenever I run a computation with more than 4 ranks: for example, a run on 8 cores (mpirun -np 8) uses 2 cores of the first CPU, 2 of the second, 2 of the third and 2 of the fourth.

I tried the "numactl" option that binds a computation to one CPU (mpirun -np 10 numactl --cpunodebind=0 mySolverFoam -parallel). The results were consistent but not better: 20 cores confined to 2 CPUs (with numactl) need 75 s, versus 32 s for 20 cores spread over 4 CPUs. And when I run 20 cores + 20 cores (two jobs in parallel) with numactl, the time stays at 75 s, versus 74 s with the default mpirun placement. I have made a benchmark; if you are interested I could translate the main lines into English (Excel format).

What I now think might be the problem is that the cache memory gets flooded when I reach maxProc/2. Since the default placement spreads any computation over all the CPUs, each core may effectively get the L3 cache that would normally be shared with a second core.

Hi Lucky Tran and hokhay,
I see the problem; I may have used the terms "processors" and "cores" badly, since the definition of "processor" varies among people. When I said 40 processors, I should have said 4 CPUs with 10 cores each; that is, I run at most mpirun -np 40, and I also tried -np 30, 20 and 10. So my tests cover from 5,000 cells per core (40 cores for 200k cells) to 300,000 cells per core (10 cores for 3M cells). On the server with 40 cores (4 CPUs) I stop gaining time at 20 cores; on my computer with 2 CPUs (20 cores plus 20 threads) I stop gaining time at 10 cores (I did not include those figures here).

My RAM runs at 1.333 GHz on both the computer and the server. Should I try to overclock it to 1.6 GHz?
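The two placements compared in this post can be written out explicitly. A hedged sketch: mySolverFoam is the placeholder name from the post, --map-by/--bind-to are Open MPI options, and numactl --cpunodebind pins the job to one NUMA node.

```shell
# Spread ranks across sockets vs confine them to one NUMA node.
spread="mpirun -np 20 --map-by socket --bind-to core mySolverFoam -parallel"
confined="mpirun -np 20 numactl --cpunodebind=0 mySolverFoam -parallel"
echo "$spread"
echo "$confined"
```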
|
December 5, 2017, 10:29 |
|
#12 |
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,762
Rep Power: 66 |
Ok, now I get it. So 4x10 = 40.
It should not be the L3 cache, and do not overclock the RAM. There is indeed something fishy going on, since it runs slower with cpubind. Since you say it happens on every machine you touch... I am wondering: what is your benchmark case?
|
December 6, 2017, 11:37 |
|
#13 |
New Member
Jurado
Join Date: Nov 2017
Posts: 22
Rep Power: 9 |
Hi Lucky Tran,
I attached the test case I used, with around 1.6M cells. However, a colleague of mine ran a similar test on a different case with a different solver and reached the same conclusion, so I do not think the problem is specific to my simulation. Maybe if you test it on your own computer you will find the same behaviour.
|
December 6, 2017, 17:25 |
|
#14 |
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,762
Rep Power: 66 |
I can't seem to open the files. Can you check and maybe re-upload?
My first question is: what do the contents of your decomposeParDict look like? I hope you are using scotch or metis.
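For reference, a minimal scotch decomposeParDict of the kind being asked about looks like this; it is written via a heredoc here so it can be inspected, and 20 subdomains is just an example value.

```shell
# Write an example decomposeParDict; scotch needs no extra coefficients.
cat > decomposeParDict.example <<'EOF'
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

numberOfSubdomains  20;

method              scotch;
EOF
grep -c 'scotch' decomposeParDict.example
```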
|
December 6, 2017, 21:12 |
|
#15 |
Member
Join Date: Nov 2014
Posts: 92
Rep Power: 12 |
Hi Jurado,
Since there are 4 CPUs, I suppose there are 4 servers, right? Could you tell us how these servers are connected? In my experience, 10 Gbit Ethernet can only sustain up to 2 computers; an InfiniBand network is needed if you go beyond 2, otherwise the performance drops dramatically.
|
December 13, 2017, 09:19 |
|
#16 |
New Member
Jurado
Join Date: Nov 2017
Posts: 22
Rep Power: 9 |
Hi LuckyTran,
Indeed, it seems my file upload did not work; my constant folder is too big, so I can only upload the system one. But to answer the question: I use the scotch method in decomposeParDict.

Hi hokhay,
The 4 CPUs are inside one server; they are not separate machines, it is one station. I am really not sure the problem comes from the connection between the CPUs, or at least it is not the only cause. Consider a computation on 10 cores. By default the ranks are placed round-robin: 1 core on the first CPU, 1 on the second, 1 on the third, 1 on the fourth, then back to the first, and so on. In the end the first and second CPUs each run 3 cores and the third and fourth each run 2, for a total of 10 cores spread over the 4 CPUs (I attached a little drawing to illustrate). The computation time in this configuration is 25 s.

The other way is to reduce the communication between CPUs: numactl has an option, --cpunodebind, which binds a process to one CPU. One could expect this to be faster, since there is no inter-CPU communication; however, the time becomes 100 s.

Now I run 2 computations of 10 cores each (20 cores total), on copies of the same case in separate folders, so they should not interact and should each finish as if run alone. With the default placement (spread over all CPUs) the time is 52 s; we lose time as if they ran one after the other. With one computation bound to each CPU, the time stays at 100 s, the same as the single bound run, as expected.

Finally, 4 computations of 10 cores each: by default this needs 105 s; with cpunodebind it still needs 100 s.

These tests lead me to two conclusions. First, for a single 10-core run, spreading over all the CPUs beats confining to one CPU despite the inter-CPU communication; to me, the only explanation is that the L3 cache (shared by all the cores of one CPU) is somehow overloaded, and by spreading over the 4 CPUs we have 4x the cache available. Second, when every core of the station is busy (40 cores over 4 CPUs), the total available cache is the same either way, and cpunodebind is slightly faster (100 s versus 105 s by default), so communication costs us only about 5 s overall, which is negligible. So I think the problem really is the cache memory or something like it, since the RAM is very far from being full (I have checked that).
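The L3 hypothesis above can at least be inspected: Linux exposes each cache level, its size, and which CPUs share it under sysfs. A sketch, wrapped as a function since the paths vary by machine and may be absent in virtual machines.

```shell
# Print each cache level of CPU 0, its size, and the CPUs that share it.
show_caches() {
    for c in /sys/devices/system/cpu/cpu0/cache/index*; do
        [ -e "$c/level" ] || continue
        echo "L$(cat "$c/level") $(cat "$c/type"): $(cat "$c/size"), shared by CPUs $(cat "$c/shared_cpu_list")"
    done
}
```

On the 4-socket server discussed here, the L3 entry should list all 10 cores of one package as sharing it.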
|
December 13, 2017, 20:23 |
|
#17 |
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,762
Rep Power: 66 |
The cache is meant to be loaded. Besides, you cannot control the cache: it is not handled at the job-scheduler level, so if the cache were the culprit, the entire CPU would have to be faulty.
These symptoms are, however, consistent with a communications overhead, which may well be the RAM. Even if the RAM capacity (measured in GB) is not fully utilized, it must still be accessed fast enough (read/write speed, in GT/s). The other thing is hard-disk writes: are you perhaps saving a lot of files, so that your simulation is waiting on the hard disk? The disk is even slower than RAM and can also be a bottleneck. That's why I asked for the case.
|
December 14, 2017, 10:23 |
|
#18 | |
New Member
Jurado
Join Date: Nov 2017
Posts: 22
Rep Power: 9 |
In my benchmark I don't save the results, so I don't think it is the hard disk. But if it is the speed of the RAM, why would it be faster to divide the run between the 4 CPUs than to run on 1 CPU? The speed of the RAM is still the same, no?
|
December 14, 2017, 20:24 |
|
#19 |
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,762
Rep Power: 66 |
So you don't save the terminal output either, i.e. some sort of log?
I don't know your hardware setup, but I bet you have at least 2 banks of RAM (or possibly 4), and each CPU has fast access only to one particular bank. If you run all the cores on the same CPU, they all access RAM from the same bank. If the cores are spread across the other CPUs, you can spread across the RAM banks and therefore read faster (e.g. reading 1 GB from bank 1 and 1 GB from bank 2, rather than 2 GB from bank 1 only).

By the way, cpubind lets you lock a process to a particular core, but it does not guarantee task affinity on that core. It can also be that other tasks (which we often call processes) take priority over the OpenFOAM task, but that is more random and does not really explain the change in performance between the different arrangements.
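The bank layout described above is what NUMA nodes are; they can be counted from sysfs (numactl --hardware would show the same information, plus the inter-node access costs).

```shell
# Count NUMA nodes; each node has its own local memory controller/bank.
nodes=$(ls -d /sys/devices/system/node/node* 2>/dev/null | wc -l)
echo "NUMA nodes: $nodes"
```

On the 4-socket E7-4870 server in this thread, this should report 4 nodes.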
|
December 15, 2017, 04:20 |
|
#20 |
New Member
Jurado
Join Date: Nov 2017
Posts: 22
Rep Power: 9 |
Hi LuckyTran,
You are right, I do keep a log, but wouldn't that slow things down all the time? Can it explain the behaviour at maxProc/2? Thanks for your reply. Do you know a way to test whether the RAM is the issue, and a way to solve it if so?
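One standard way to test whether memory bandwidth is the ceiling is the STREAM benchmark. A hedged sketch: stream.c must be obtained separately from the STREAM homepage, the compile line assumes gcc with OpenMP, and the array-size flag just makes the working set larger than the caches.

```shell
# Build and run lines for STREAM; rerun with 1, 10, 20, 40 threads and
# see where the reported triad bandwidth stops growing. If it plateaus
# near 10-20 threads, the solver plateau likely has the same cause.
build="gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream"
run="OMP_NUM_THREADS=20 ./stream"
echo "$build"
echo "$run"
```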
|
Tags |
openmpi, parallel calculation, performance |
|
|