September 18, 2018, 08:46 |
Issues with poor performance in faster CPU
|
#1 |
Member
giovanni
Join Date: Sep 2017
Posts: 50
Rep Power: 9 |
Hi everyone!
I'm currently working with two types of machine for an OpenFOAM simulation for my thesis work. I'm sorry about my poor knowledge of hardware, but I cannot figure out why one machine, apparently more powerful than the other, is nevertheless much slower. Here are the CPU characteristics of the two.

First and faster machine:

processor : 27
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
stepping : 1
cpu MHz : 2593.881
cache size : 35840 KB
physical id : 1
siblings : 14
core id : 14
cpu cores : 14
apicid : 60
initial apicid : 60
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
bogomips : 5187.60
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual

And then my second machine, which is apparently better but shows very bad performance in computational time (far slower than the previous one):

processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
stepping : 4
microcode : 0x2000018
cpu MHz : 3399.996
cache size : 33792 KB
physical id : 1
siblings : 48
core id : 29
cpu cores : 24
apicid : 123
initial apicid : 123
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 5388.93
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual

Could anyone be so patient as to explain how I can improve the computational time of the second, slower one? Is it an issue related to the CPU architecture, or does it also depend on other parameters? Thanks |
|
September 19, 2018, 06:55 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
The first thing that comes to mind is, as always, memory.
Xeon v4 has 4 memory channels per CPU, Skylake-SP (Xeon Platinum) has 6. For optimal performance, all memory channels have to be populated with identical amounts of memory.
Other ideas: How many CPUs do these machines have? Not cores, but physical CPUs.
Apparently, SMT/Hyperthreading is deactivated on the first machine. You should do the same on the second machine. |
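The SMT observation can be read straight from the `/proc/cpuinfo` dumps in the first post: when `siblings` equals `cpu cores`, each core runs one thread, and when `siblings` is larger, SMT is active. A minimal Python sketch of that check follows. The function name `summarize_cpuinfo` is made up for illustration, and the code assumes the usual Linux x86 field names shown in the dumps.

```python
def summarize_cpuinfo(text):
    """Summarize socket count and SMT state from /proc/cpuinfo-style text.

    Assumes the Linux x86 field names 'physical id', 'siblings' and
    'cpu cores'. The last values seen for 'siblings' and 'cpu cores' are
    kept, which is fine when all sockets hold the same CPU model.
    """
    sockets = set()
    siblings = None
    cores = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line between logical processors
        key, value = key.strip(), value.strip()
        if key == "physical id":
            sockets.add(value)
        elif key == "siblings":
            siblings = int(value)
        elif key == "cpu cores":
            cores = int(value)
    return {
        "sockets": len(sockets),
        "cores_per_socket": cores,
        "threads_per_socket": siblings,
        # More hardware threads than cores on a socket means SMT is on.
        "smt_enabled": siblings is not None and cores is not None and siblings > cores,
    }
```

On a Linux machine this could be called as `summarize_cpuinfo(open("/proc/cpuinfo").read())`. Note that a single processor entry, like the ones posted above, is not enough to count sockets; the whole file is needed for that.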
|
September 19, 2018, 16:11 |
|
#3 |
Member
giovanni
Join Date: Sep 2017
Posts: 50
Rep Power: 9 |
Thanks!! On the second machine I have only two slots occupied!!!
Maybe that is the problem!

ProLiant-DL380-Gen10:~/OpenFOAM/innovation-2.2.x/run/1500sim$ sudo dmidecode -t memory | grep Size
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 32 GB
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 32 GB

Why should I deactivate hyperthreading? |
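The `dmidecode` listing above can be tallied automatically. The following is a minimal Python sketch, not a definitive tool: `count_dimms` and `covers_all_channels` are hypothetical helper names, and the input is assumed to be the text produced by `sudo dmidecode -t memory | grep Size`. The channel counts (6 per Skylake-SP CPU) come from the earlier reply.

```python
def count_dimms(size_output):
    """Count populated and empty DIMM slots in 'dmidecode -t memory' Size lines.

    Only lines that start with 'Size:' are considered, so other fields
    that merely contain the word 'Size' are ignored.
    """
    populated = 0
    empty = 0
    for line in size_output.splitlines():
        line = line.strip()
        if not line.startswith("Size:"):
            continue
        if "No Module Installed" in line:
            empty += 1
        else:
            populated += 1
    return populated, empty


def covers_all_channels(populated, sockets, channels_per_socket):
    """Necessary (not sufficient) condition for filling every memory channel.

    The count alone cannot tell WHICH slots are used; the board manual
    decides which slot belongs to which channel.
    """
    return populated >= sockets * channels_per_socket
```

For the output posted above this gives 2 populated and 22 empty slots, far short of the 12 modules a dual Skylake-SP board needs to have one DIMM on each channel.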
|
September 19, 2018, 18:53 |
|
#4 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
Quote:
SMT is known to cause a performance penalty in many cases involving CFD computations. We have seen many examples of this behavior in this thread alone. That's why it is often turned off, so nobody has to fiddle around with affinity settings. |
||
October 12, 2018, 15:29 |
|
#5 |
Member
giovanni
Join Date: Sep 2017
Posts: 50
Rep Power: 9 |
Hi! Thanks for the reply!
I followed your advice and filled up the DIMM slots with 32 GB modules. The performance increased a lot. Then, on the same machine that had the Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz installed, I decided to run another test with this setup:

> same number of cells (2.5 x 10^6)
> same DIMMs as before
> CPU changed to this one:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
stepping : 4
microcode : 0x2000018
cpu MHz : 3699.875
cache size : 25344 KB
physical id : 1
siblings : 8
core id : 26
cpu cores : 8
apicid : 116
initial apicid : 116
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags :
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 6386.72
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Speed now seems better, but unfortunately I noticed that there are very few cores. Would you suggest changing to another type of CPU for further improvement? I really need to reduce the computational time as much as possible (at the moment only one node is available)... |
|
October 13, 2018, 09:28 |
|
#6 | ||
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
Quote:
I find it a bit difficult to follow.
Quote:
How do you test? Same number of threads for both CPUs? Maximum number of threads available? When using 8 cores per CPU, the two models you compared should perform roughly the same, give or take 10%. Are you comparing this new CPU with SMT disabled against the old CPU with SMT enabled?
It would be helpful to have some actual numbers to compare the performance differences. It might help to distinguish between different kinds of errors in the setup.
Maybe I am missing something, but I still don't know if you are using single- or dual-CPU.
There is not really a faster CPU you could buy in Intel's lineup. The Xeon Platinum 8168 should not be significantly slower than any other CPU. Maybe you tested it with SMT on? Or maybe your test case shows negative scaling for a very high number of cores? If that is the case, you can simply reduce the number of cores your simulation runs on and distribute them evenly across both (?) CPUs. This should be the default behavior anyway. |
|||
October 15, 2018, 11:48 |
|
#7 |
Member
giovanni
Join Date: Sep 2017
Posts: 50
Rep Power: 9 |
Hi, thanks for your reply.
I have 12 DIMMs for 2 CPUs (there are 12 x 2 = 24 slots; I have populated one of the two slots available on each channel, with 32 GB per slot). I removed the previous CPU (Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz) and mounted the new one (Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz). With this configuration I can only use 8 x 2 cores (the Platinum instead had 48), and I simply compare the time to complete a simulation case with the maximum number of cores available for both tests. In both cases we tested dual-CPU. Hyperthreading is disabled. How can I verify my scaling? |
|
October 15, 2018, 12:01 |
|
#8 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
Quote:
With 16 threads on dual Xeon 8168 you should get about the same performance as with dual Xeon 6134. Otherwise you will have to dig into stuff like thread pinning and sub-NUMA clustering (formerly cluster on die)... |
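One common way to check scaling is to run the same case with several core counts and compare wall-clock times. Below is a minimal Python sketch of the bookkeeping. The helper names and the timing values are invented for illustration and are not measurements from this thread; speedup and efficiency are computed relative to the run with the fewest cores.

```python
def scaling_table(timings):
    """Return (cores, speedup, efficiency) rows for one case.

    timings maps core count -> wall-clock seconds for the SAME case.
    The run with the fewest cores is the baseline, so
    speedup = t_base / t and efficiency = speedup * cores_base / cores.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        efficiency = speedup * base_cores / cores
        rows.append((cores, speedup, efficiency))
    return rows


def best_core_count(timings):
    """Core count with the shortest wall-clock time."""
    return min(timings, key=timings.get)


# Hypothetical timings in seconds, for illustration only.
example_timings = {8: 1000.0, 16: 520.0, 32: 480.0, 48: 510.0}
for cores, speedup, efficiency in scaling_table(example_timings):
    print(f"{cores:3d} cores  speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
```

If the time stops dropping, or rises again, as cores are added, the case shows the negative scaling mentioned earlier, and running on fewer cores spread evenly across both CPUs is the better choice.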
||
October 29, 2018, 07:04 |
|
#9 |
Member
giovanni
Join Date: Sep 2017
Posts: 50
Rep Power: 9 |
Hi! Thanks for your advice. I've run some tests, and the best number of cores per simulation is in fact 16-18.
However, I noticed the following. When I run a single simulation on a single machine (whatever the simulation and whatever the hardware), using for example 16 of 48 cores, the speed (visible even by eye from the tail of the log in the terminal) is much higher than when I run two simulations in parallel on the same machine. Obviously, when I do this I'm careful not to exceed the cores available on my node; for example, if 48 cores are available, I usually use 16 + 16 cores for the two simulations. If possible, how can this problem be fixed? Thanks!! |
|
October 29, 2018, 14:34 |
|
#10 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
This is usually not a problem that can be fixed, unless you are running out of memory with 2 simulations running simultaneously.
The reason for the slowdown is, again, memory bandwidth limitation. An over-simplified example: Let's say the machine you are using has a peak memory bandwidth of 100 GB/s. Running one simulation on 16 cores uses 80 GB/s of memory bandwidth. Adding a second simulation that would also require 80 GB/s of memory bandwidth when running on 16 cores will obviously max out the peak memory bandwidth of the machine, and both simulations will run slower than a single simulation. |
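The arithmetic behind this over-simplified example can be written out in a few lines of Python. This is only a sketch under strong assumptions that go beyond the post: the function name is made up, runtime is taken to be purely bandwidth-bound, and the available bandwidth is assumed to be shared in proportion to demand.

```python
def shared_bandwidth_slowdown(peak_gb_s, demands_gb_s):
    """Estimated slowdown factor when jobs share memory bandwidth.

    Toy model: if the summed demand fits under the peak, nothing slows
    down (factor 1.0). Otherwise every job is stretched by
    total_demand / peak.
    """
    total = sum(demands_gb_s)
    if total <= peak_gb_s:
        return 1.0
    return total / peak_gb_s


# Numbers from the example: 100 GB/s peak, 80 GB/s per simulation.
one_job = shared_bandwidth_slowdown(100.0, [80.0])
two_jobs = shared_bandwidth_slowdown(100.0, [80.0, 80.0])
print(f"one simulation:  x{one_job:.2f}")
print(f"two simulations: x{two_jobs:.2f}")
```

In this toy model the two simulations together ask for 160 GB/s from a 100 GB/s machine, so each one takes about 1.6 times as long as it would alone. Real behavior also depends on caches, NUMA placement and how the solver overlaps computation with memory access.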