Issues with poor performance in faster CPU

Old   September 18, 2018, 08:46
Default Issues with poor performance in faster CPU
  #1
Member
 
giovanni
Join Date: Sep 2017
Posts: 50
Rep Power: 9
Hi everyone!
I'm currently running an OpenFOAM simulation for my thesis on two types of machine.
I apologize for my limited hardware knowledge, but I cannot figure out why one machine, which on paper should outperform the other, is actually much slower.

Here are the CPU characteristics of the two machines:

First (faster) machine:


processor : 27
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
stepping : 1
cpu MHz : 2593.881
cache size : 35840 KB
physical id : 1
siblings : 14
core id : 14
cpu cores : 14
apicid : 60
initial apicid : 60
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
bogomips : 5187.60
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual


And this is my second machine, which looks better on paper but performs very badly in terms of computation time (far slower than the first one):




processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
stepping : 4
microcode : 0x2000018
cpu MHz : 3399.996
cache size : 33792 KB
physical id : 1
siblings : 48
core id : 29
cpu cores : 24
apicid : 123
initial apicid : 123
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 5388.93
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual



Could anyone be patient enough to explain how I can improve the computation time of the second, slower machine? Is this an issue related to the CPU architecture, or does it also depend on other parameters?
Thanks

Old   September 19, 2018, 06:55
Default
  #2
Super Moderator
 
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
The first thing that would come to mind is -as always- memory.
Xeon V4 has 4 memory channels, Skylake-SP (Xeon Platinum) has 6 memory channels. For optimal performance, all memory channels have to be populated with identical amounts of memory.

Other ideas:
How many CPUs do these machines have? Not cores, but physical CPUs.
Apparently, SMT/Hyperthreading is deactivated on the first machine. You should do the same on the second machine.
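Both questions can be answered from the /proc/cpuinfo listings themselves. The following is a minimal sketch, not something from the thread: the helper name `summarize_cpuinfo` is hypothetical, and it assumes the usual Linux layout in which each logical processor is one blank-line-separated record. It compares `siblings` with `cpu cores` to detect SMT and counts distinct `physical id` values to count sockets.

```python
def summarize_cpuinfo(text):
    """Summarize /proc/cpuinfo-style text: sockets seen, cores per socket, SMT state."""
    needed = ("physical id", "siblings", "cpu cores")
    sockets = {}   # physical id -> (siblings, cpu cores)
    record = {}
    for line in text.splitlines() + [""]:
        if line.strip():
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
            continue
        # A blank line closes the record of one logical processor.
        if all(k in record for k in needed):
            sockets[record["physical id"]] = (
                int(record["siblings"]), int(record["cpu cores"]))
        record = {}
    if not sockets:
        raise ValueError("no complete processor record found")
    siblings, cores = next(iter(sockets.values()))
    return {
        # Counts only the physical ids present in the supplied text.
        "sockets": len(sockets),
        "cores_per_socket": cores,
        # More logical siblings than physical cores means SMT is on.
        "smt_enabled": siblings > cores,
    }


# Trimmed records taken from the two listings posted above.
first_machine = "physical id : 1\nsiblings : 14\ncpu cores : 14\n"
second_machine = "physical id : 1\nsiblings : 48\ncpu cores : 24\n"

print(summarize_cpuinfo(first_machine)["smt_enabled"])
print(summarize_cpuinfo(second_machine)["smt_enabled"])
```

On a real machine the full contents of /proc/cpuinfo would be passed in instead of the trimmed records, so that every socket shows up in the count.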

Old   September 19, 2018, 16:11
Default
  #3
Member
 
giovanni
Join Date: Sep 2017
Posts: 50
Rep Power: 9
Thanks! On the second machine only two slots are occupied!
Maybe that is the problem!




ProLiant-DL380-Gen10:~/OpenFOAM/innovation-2.2.x/run/1500sim$ sudo dmidecode -t memory | grep Size
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 32 GB
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 32 GB
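For longer listings, the same check can be scripted. This is only a sketch with a hypothetical helper name (`count_dimms`); it assumes the input is the text printed by the `dmidecode -t memory | grep Size` command shown above, with one `Size:` line per slot.

```python
def count_dimms(dmidecode_sizes):
    """Return (sizes of populated slots, number of empty slots)."""
    populated, empty = [], 0
    for line in dmidecode_sizes.splitlines():
        line = line.strip()
        if not line.startswith("Size:"):
            continue  # ignore anything that is not a per-slot size line
        value = line[len("Size:"):].strip()
        if value == "No Module Installed":
            empty += 1
        else:
            populated.append(value)
    return populated, empty


# Same pattern as the output above: 24 slots, one 32 GB module per group of 12.
listing = ("Size: No Module Installed\n" * 11 + "Size: 32 GB\n") * 2

modules, free_slots = count_dimms(listing)
print(f"{len(modules)} populated, {free_slots} empty: {modules}")
```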




Why should I deactivate Hyper-Threading?

Old   September 19, 2018, 18:53
Default
  #4
Super Moderator
 
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
Quote:
Originally Posted by gian93 View Post
thanks !! on second machine i have only two slot occupied!!!
maybe it is the problem!

why i should de activate hypertreading?
Only 2 DIMMs is definitely part of the problem. Memory is the MVP for CFD since it is usually bandwidth limited, especially with high core count CPUs. Throwing more money at "faster" CPUs usually does not help. You would need 5 additional DIMMs per CPU (based on the current memory population, I guess there are two CPUs installed?) in order to fix this.
SMT is known to cause a performance penalty in many cases involving CFD computations. We have seen many examples for this behavior in this thread alone. That's why it is often turned off so nobody has to fiddle around with affinity settings.
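To put a rough number on the missing DIMMs: theoretical peak bandwidth per CPU is the number of populated channels times the transfer rate times 8 bytes per transfer (64-bit DDR4 channel). The sketch below assumes DDR4-2666 modules, which is not stated anywhere in the thread, and the function name is made up for illustration. Real sustained bandwidth is lower than these peak figures.

```python
def peak_bandwidth_gbs(populated_channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak memory bandwidth of one CPU in GB/s."""
    return populated_channels * mega_transfers_per_s * bus_bytes / 1000.0


DDR4_MTS = 2666  # assumption: DDR4-2666 modules; the thread does not give the speed

one_dimm = peak_bandwidth_gbs(1, DDR4_MTS)   # current state: 1 DIMM per CPU
six_dimms = peak_bandwidth_gbs(6, DDR4_MTS)  # all 6 channels of Skylake-SP populated

print(f"1 channel : {one_dimm:.1f} GB/s per CPU")
print(f"6 channels: {six_dimms:.1f} GB/s per CPU")
print(f"ratio     : {six_dimms / one_dimm:.0f}x")
```

Under these assumptions, a CPU with a single DIMM has one sixth of the peak bandwidth it could have, which all 24 cores then have to share.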

Old   October 12, 2018, 15:29
Default
  #5
Member
 
giovanni
Join Date: Sep 2017
Posts: 50
Rep Power: 9
Hi! Thanks for the reply!
I followed your advice and populated all the DIMM slots with 32 GB modules.

Performance increased a lot. Then, on the same machine that had the Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz installed, I decided to run another test with this setup:
> same number of cells (2.5 * 10^6)

> same DIMMs as before

> CPU changed to this one:



processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
stepping : 4
microcode : 0x2000018
cpu MHz : 3699.875
cache size : 25344 KB
physical id : 1
siblings : 8
core id : 26
cpu cores : 8
apicid : 116
initial apicid : 116
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags :
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 6386.72
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:


Speed now seems better, but unfortunately I noticed that there are very few cores. Would you suggest changing to another type of CPU for further improvement? I really need to reduce the computation time as much as possible (at the moment only one node is available)...

Old   October 13, 2018, 09:28
Default
  #6
Super Moderator
 
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
Quote:
i've followed your advice and i've saturated all the DIMMs with 32 Gb .
6 DIMMs per CPU really would have been enough.

I find this a bit difficult to follow:
Quote:
The performance increased a lot but with the same machine that have installed the Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz, i decided to make another test with this set up :
>same number of cells (2,5 *10^6)

>same DIMMs as before
>change the CPU to this one: Xeon Gold 6134
Same DIMMs as before means what exactly? 2 DIMMs or all slots populated?
How do you test? Same number of threads for both CPUs? Maximum number of threads available? When using 8 cores per CPU, the two models you compared should perform roughly the same, give or take 10%.
Are you comparing this new CPU with SMT disabled against the old CPU with SMT enabled?
It would be helpful to have some actual numbers to compare the performance differences. It might help to distinguish between different kinds of errors in the setup.
Maybe I am missing something, but I still don't know if you are using single- or dual-CPU.

There is not really a faster CPU you could buy in Intel's lineup. The Xeon Platinum 8168 should not be significantly slower than any other CPU. Maybe you tested it with SMT on? Or maybe your test case shows negative scaling for a very high number of cores? If that is the case, you can simply reduce the number of cores your simulation runs on and distribute them evenly across both (?) CPUs. This should be the default behavior anyway.

Old   October 15, 2018, 11:48
Default
  #7
Member
 
giovanni
Join Date: Sep 2017
Posts: 50
Rep Power: 9
Hi, thanks for your reply.

I have 12 DIMMs for the 2 CPUs (there are 12 x 2 = 24 slots; I have populated one of the two slots per channel, with 32 GB per DIMM).

I removed the old CPUs (Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz) and installed the new ones (Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz). With this configuration I can only use 8 x 2 cores (the Platinum setup had 48 instead), and I simply compare the time to complete a simulation case using the maximum number of cores available in each test.

Both cases were tested with dual CPUs.
Hyper-Threading is disabled.

How can I verify my scaling?

Old   October 15, 2018, 12:01
Default
  #8
Super Moderator
 
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
Quote:
Who can i verify my scaling ?
Run the same case with an increasing number of threads. 1, 2, 4, 8, 16 and so on. Perfect scaling would mean that the simulation time is proportional to 1/number of cores. You won't get that beyond 12-16 cores. With very high core counts, some simulations can even take longer than with lower core counts.
With 16 threads on dual Xeon 8168 you should get about the same performance as with dual Xeon 6134. Otherwise you will have to dig into stuff like thread pinning and sub-NUMA clustering (formerly cluster on die)...
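Once the wall-clock times of such a series are collected, speedup and parallel efficiency follow directly. The sketch below shows one way to tabulate them; the function name and the timings are made up for illustration and are not measurements from this thread. Efficiency near 1.0 means near-perfect scaling, and a speedup that drops as cores are added is the negative scaling mentioned above.

```python
def scaling_table(timings):
    """timings maps core count -> wall time in seconds.

    Returns (cores, speedup, efficiency) tuples, relative to the run
    with the fewest cores (ideally the 1-core run).
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        # Speedup expressed in "equivalent cores" relative to the baseline run.
        speedup = base_time / timings[cores] * base_cores
        rows.append((cores, speedup, speedup / cores))
    return rows


# Hypothetical timings, only to show the shape of the output.
example = {1: 1600.0, 2: 820.0, 4: 430.0, 8: 240.0, 16: 160.0, 32: 170.0}

for cores, speedup, efficiency in scaling_table(example):
    print(f"{cores:3d} cores  speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")
```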

Old   October 29, 2018, 07:04
Default
  #9
Member
 
giovanni
Join Date: Sep 2017
Posts: 50
Rep Power: 9
Hi! Thanks for your advice. I ran some tests, and the best number of cores per simulation is in fact 16-18.
However, I noticed the following.

When I run a single simulation on a single machine (whatever the simulation and whatever the hardware), using for example 16 of 48 cores, the speed (visible even by eye from the tailed log in the terminal) is much higher than when I run two simulations in parallel on the same machine (obviously, when I do this I take care not to exceed the cores available on my node; for example, if 48 cores are available, I usually use 16 + 16 cores for the two simulations).
If possible, how can this problem be fixed?

Thanks!

Old   October 29, 2018, 14:34
Default
  #10
Super Moderator
 
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
This is usually not a problem that can be fixed, unless the cause is that you run out of memory with 2 simulations running simultaneously.
The reason for the slowdown is -again- memory bandwidth limitation. An over-simplified example: Let's say the machine you are using has a peak memory bandwidth of 100 GB/s. Running one simulation on 16 cores uses 80 GB/s of memory bandwidth. Adding a second simulation that would also require 80 GB/s of memory bandwidth when running on 16 cores will obviously max out the peak memory bandwidth of the machine, and both simulations will run slower than a single simulation.
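The same over-simplified example in numbers. This sketch assumes, purely for illustration, that the jobs are entirely bandwidth-bound and that the machine splits its bandwidth in proportion to demand; the function name is hypothetical and real hardware behaves less neatly.

```python
def shared_slowdown(peak_gbs, demands_gbs):
    """Slowdown factor of each job when all jobs run at the same time."""
    total = sum(demands_gbs)
    if total <= peak_gbs:
        return [1.0 for _ in demands_gbs]  # enough bandwidth for everyone
    share = peak_gbs / total               # fraction of its demand each job gets
    return [1.0 / share for _ in demands_gbs]


alone = shared_slowdown(100.0, [80.0])
together = shared_slowdown(100.0, [80.0, 80.0])

print("one simulation :", alone)
print("two simulations:", together)
```

In this model each of the two simulations gets 50 GB/s instead of 80 GB/s and takes about 1.6 times as long. Running them one after the other would take 2 times as long in total, so under these assumptions running both at once still finishes the pair sooner, even though each one looks slower in the log.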