
Poor scaling of dual-Xeon E5-2697A-V4

January 19, 2018, 05:16
Poor scaling of dual-Xeon E5-2697A-V4
  #1
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
After getting advice on this forum, I decided to run some scaling tests on our Dell R630 server, equipped with 2x Xeon E5-2697A-V4, 8x 16 GB 2400 MHz RAM and a 480 GB SSD. In addition, to benchmark against some larger cases, I ran the cavity case previously tested on an HPC at NTNU ( OpenFOAM Performance on Vilje ).

A .pdf of the results and a .txt file of the memory setup are attached. In the first part (the scaling tests), all cases are confidential hull designs, so I can't share much more information about those meshes. The cache was cleared between each of the benchmarks against Vilje using the command: sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
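For anyone repeating this, the cache clearing can be wrapped around each run roughly like this (a minimal sketch, not our exact script; the solver name and the re-decomposition between core counts are placeholders):
Code:
#!/bin/bash
# Clear the page cache before each run so earlier runs don't skew timings,
# then run the solver on n cores. Solver choice is a placeholder here.
for n in 1 2 4 8 16 32; do
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
    # (re-running decomposePar for each core count omitted for brevity)
    mpirun -np $n interFoam -parallel > log.n$n 2>&1
done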

As you can see in the .pdf, the performance with only 16 cores is much higher than with 32 cores, even with as many as ~27 million cells. Any idea why this is happening? I was expecting 32 cores to outperform 16 by far, even in the interDyMFoam test with 9 million cells. Is this assumption wrong?

Because of these results, I suspect there's something wrong with our setup, but I have no idea where to start looking. Any comments or recommendations are greatly appreciated.
Attached Files
File Type: pdf Scaling and benchmarking of Dell R630.compressed.pdf (118.7 KB, 52 views)
File Type: txt memInfo_short.txt (4.3 KB, 11 views)

Last edited by havref; January 19, 2018 at 06:08. Reason: Adding information

January 19, 2018, 09:12
Re: Poor scaling of dual-Xeon E5-2697A-V4
  #2
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
My guess is that the performance of your test case is limited by memory bandwidth rather than CPU power. You have two strong CPUs, but only eight memory channels. For CFD, two cores are usually enough to saturate one memory channel, so the best performance in your case should be achieved at that ratio of cores to memory channels. It doesn't surprise me that the performance drops when using 32 cores, but I don't believe the drop is due solely to too many cores competing for the limited memory bandwidth. For a larger test case 32 cores could do better, but I think 16 cores are optimal.
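To put rough numbers on that (simple peak-bandwidth arithmetic, assuming DDR4-2400 and 8 bytes per transfer; sustained bandwidth is lower in practice):
Code:
# Theoretical peak for the 8 populated channels of DDR4-2400:
# channels x MT/s x bytes per transfer
echo $((8 * 2400 * 8))   # 153600 MB/s, i.e. ~153.6 GB/s across both sockets
# Shared by 32 cores that is ~4.8 GB/s per core; shared by 16, ~9.6 GB/s.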

January 22, 2018, 04:38
  #3
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Thank you for the response. It makes sense when you put it like that. I'll do some tests using 24 cores etc. as well, to see where the peak performance is for these cases.
Quote:
Originally Posted by ErikAdr
For cfd two cores are usually enough to saturate one memory channel. The best performance in your case is achieved with that ratio between cores and memory channels.
So, I originally performed these scaling tests to see whether our server's performance scaled well before purchasing a second server. We were looking into the Xeon Gold 6154 3.0 GHz (18 cores), 6146 3.2 GHz (12 cores) and 6144 3.5 GHz (8 cores). As these new Xeon Scalable processors have 6 memory channels per processor, can I assume that both the 8- and 12-core processors are good choices purely based on the number of cores per memory channel? Or do you think that 12 cores will still saturate the memory channels to such a degree that performance decreases?
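Spelling out my own back-of-envelope arithmetic for those three, using the two-cores-per-channel rule of thumb:
Code:
# Cores per memory channel (Skylake-SP: 6 channels per socket):
for spec in "6154 18" "6146 12" "6144 8"; do
    set -- $spec
    echo "Xeon Gold $1: $2 cores / 6 channels = $(echo "scale=2; $2 / 6" | bc) cores/channel"
done
# 6154: 3.00  |  6146: 2.00  |  6144: 1.33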

January 22, 2018, 05:53
Re: Poor scaling of dual-Xeon E5-2697A-V4
  #4
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
I have no experience with OpenFOAM, so I don't feel qualified to give a definitive answer on this. But I would not choose fewer than 12 cores, and I would feel safer with 16. For my own in-house CFD code I would go for an EPYC 7281 or EPYC 7301, which both have 8 memory channels and cost just a fraction of Intel's Gold processors. If I had to select an Intel processor, I would go for the cheaper models like the Silver 4116 or Gold 6130, and then buy some more systems to 'fill the budget'. Perhaps you are able to get the high-end processors at a better price than I can. Look at https://www.spec.org/cpu2017/results to see benchmarks for various systems. I usually look at the results for bwaves_r and wrf_r, where the former is the most memory-intensive. Please also look at the other threads on this forum on the topic, e.g. 'Epyc vs Xeon Skylake SP'.

January 22, 2018, 05:57
  #5
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
I have no experience with OpenFOAM, but such a drastic decrease of performance seems rather unusual.
In order to check whether there is something wrong with your setup, you could run this benchmark; I have numbers for the same platform to compare against:
http://www.palabos.org/software/download
Find the benchmark in "examples/benchmarks/cavity3d". I would recommend running problem sizes 100 and 400 with 1, 2, 4, 8... cores.
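Building and running it should look roughly like this (a sketch; the exact build procedure depends on the Palabos version):
Code:
# From the Palabos source tree:
cd examples/benchmarks/cavity3d
make
# The problem size N is the command-line argument; vary the core count:
for n in 1 2 4 8 16 32; do
    mpirun -np $n ./cavity3d 100
    mpirun -np $n ./cavity3d 400
done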

Concerning Skylake-SP, I would strongly advise against it for non-commercial CFD software. You don't have to pay a per-core license, so AMD Epyc would give you much better performance/$.

January 22, 2018, 06:28
  #6
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Thank you Alex, I'll install and run through that benchmark shortly.

For the second server we were considering a quad-socket setup with one of the Skylake-SP processors. However, if two servers with dual Epyc 7351 (or similar) give better performance/$, we will definitely consider that. Even a single server with dual Epyc would probably be sufficient if it performs much better than the server we have now.

January 22, 2018, 06:53
  #7
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
I just realized that I did not run a full set of benchmarks on our newest Intel system, but it should be sufficient for a quick comparison:

2x AMD Epyc 7301
Code:
#threads          msu_100   msu_400   msu_1000
01(1 die)           9.369    12.720      7.840
02(2 dies)         17.182    24.809     19.102
04(4 dies)         33.460    48.814     49.291
08(8 dies)         56.289    95.870    105.716
16(8 dies)        102.307   158.212    158.968
32(8 dies)        169.955   252.729    294.178
2x Intel Xeon E5-2650v4
Code:
#threads   msu_100   msu_400
01           8.412    11.747
24          88.268   154.787
Full description of the systems can be found here: AMD Epyc CFD benchmarks with Ansys Fluent
Yours should slightly outperform my Intel setup with 16 and 32 cores active. Testing your setup with 24 cores might give worse results; this benchmark does not run too well when the problem size is not divisible by the number of active cores, or when the core count is not a power of 2.

January 22, 2018, 11:49
  #8
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Here are the results. I ran each benchmark 3 times; average values are given in the table on the left, with the individual values on the right. As you can see, there's a huge decrease in MSU when increasing to 32 cores. It should also be noted that there's a much larger variation between runs with 32 cores than with the rest. The operating system is Ubuntu 16.04 LTS.
Code:
2x Intel Xeon E5-2697A V4

Averages                   ||  Individual runs
#Cores  msu_100  msu_400   ||  msu_100                   | msu_400
 1       10.97    12.51    ||  10.9552 10.9675 10.9739   | 12.4802 12.5846 12.4516
 2       18.20    23.31    ||  18.1663 18.0220 18.4169   | 23.4069 23.2080 23.3125
 4       29.26    39.95    ||  29.2100 29.3525 29.2220   | 39.8766 39.5660 40.3934
 8       52.70    76.58    ||  52.6798 52.5828 52.8315   | 76.3575 76.0869 77.3047
16       76.97   123.01    ||  76.8920 77.2672 76.7622   | 123.351 123.381 122.295
24       84.23   141.68    ||  84.6862 83.7586 84.2461   | 140.979 141.109 142.966
32       39.61   113.68    ||  36.5778 43.7236 38.5174   | 119.524 116.610 104.910
Obviously there's something strange going on here, but I have no idea what to look for. Got any ideas? Let me know if you need more hardware info.

January 22, 2018, 12:01
  #9
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
I have never seen such a drastic decrease of performance with this benchmark on any system when using all cores. In fact, I have never observed lower performance with all cores used compared to fewer cores.
You might want to check system temperatures (sensors) and frequencies (sudo turbostat) while running the benchmark.
I assume hyperthreading is turned off and memory is populated correctly, i.e. 4 DIMMs per socket in the right slots?
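Both can be checked from a terminal (standard tools; dmidecode needs root, and turbostat option spellings vary a bit between versions):
Code:
# With HT off, "Thread(s) per core" should be 1:
lscpu | grep -E 'Socket|Core|Thread'
# Size, slot and configured speed of every DIMM:
sudo dmidecode -t memory | grep -E 'Size|Locator|Speed'
# Watch temperatures and frequencies while the benchmark runs:
watch -n 1 sensors
sudo turbostat --interval 5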

January 22, 2018, 12:22
  #10
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
I'll double-check the memory configuration, but I'm quite sure it is correct. Hyper-threading is turned off in the BIOS and only 32 cores are visible from the Ubuntu terminal.

The turbostat output and temperature monitoring are attached. I looked at the temperature over time, and after a short while it stabilized at the values seen in the .txt file. I did two runs, so both the temperature and turbostat output from each run are attached.
Attached Files
File Type: txt turbostat_32_2.txt (33.1 KB, 5 views)
File Type: txt Temps_32c_2.txt (2.1 KB, 2 views)
File Type: txt turbostat_32_1.txt (3.1 KB, 3 views)
File Type: txt Temps_32c_1.txt (2.1 KB, 2 views)
File Type: txt turbostat_32_3_MoreDetailed.txt (26.6 KB, 3 views)

January 22, 2018, 12:33
  #11
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Weird. If that turbostat output was taken during a run with 32 cores, all cores should be near 100% load (despite the memory bottleneck) and running at 2900 MHz (edit: 3100 MHz). Something is holding them back, and it is not CPU temperature or power draw.
What Linux kernel are you using and which MPI library?

January 22, 2018, 12:36
  #12
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
I agree. I attached another set of files in the post above; these were captured in the middle of the test.

Linux Kernel: 4.13.0-26-generic
MPI library: mpiexec (OpenRTE) 1.10.2
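(Queried with the usual commands:)
Code:
uname -r          # kernel release
mpirun --version  # Open MPI / OpenRTE version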


Edit: I did three new tests with 32 cores and N=400, which yielded the following results:
154.4, 157.4 and 138.3 msu. It does seem to vary quite a lot.

January 22, 2018, 12:53
  #13
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Slowly running out of ideas. Before agreeing that your hardware is cursed, we could try an even simpler stress test by running:
stress -c 32 -m 32
If CPU load or frequency goes down during this test, it might be a VRM throttling issue. Though I doubt that this is the case.
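One way to log frequencies while the stress test runs (timeout is from coreutils; turbostat flags may differ between versions):
Code:
# Record turbostat in the background while loading all 32 cores for 60 s:
sudo turbostat --interval 2 > turbostat_stress.log 2>&1 &
timeout 60 stress -c 32 -m 32
sudo pkill turbostat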

January 22, 2018, 13:23
  #14
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Thank you for helping out!
The turbostat output from the stress test is attached. Unless I am reading it incorrectly, the average frequency looks to be approximately 3100 MHz. Does that mean there is probably something wrong with the memory setup?

Edit: Added MemoryInformation.txt and the following text:
The memory setup of the server is attached. Originally the server was purchased with only 6 RAM sticks (I know), so two more with identical specs, but from a different vendor, were installed. Could this be the cause? Or is there any additional setup needed in the BIOS for all installed memory to work properly?

The server was (very roughly) tested both before and after the RAM upgrade and showed quite a bit of performance increase. However, if you suspect the RAM configuration is incorrect, I'll be happy to run the same Palabos tests with only 6x 16 GB RAM for a proper comparison.
Attached Files
File Type: txt Stresstest.txt (99.8 KB, 2 views)
File Type: txt MemoryInformation.txt (12.9 KB, 6 views)

Last edited by havref; January 22, 2018 at 13:42. Reason: Adding information

January 22, 2018, 16:26
  #15
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Memory seems to be populated correctly.
In the turbostat output you attached, did you stop the stress test somewhere in between? If so, there is a pretty high idle load on your machine. Any idea where it is coming from?

January 23, 2018, 10:31
  #16
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
I checked this morning, and both Xorg and compiz were using a surprising amount of CPU resources, each around 70% of one CPU core when idle.
First, I deactivated most of the unnecessary animations and effects in compiz. Second, and probably more importantly, I reinstalled the CPU microcode: instead of the Dell-provided drivers, I installed the open-source intel-microcode firmware package I found in the update manager.
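The same can be done from a terminal (package name as on Ubuntu 16.04; I used the graphical update manager):
Code:
# Find what is eating CPU while the machine is nominally idle:
ps aux --sort=-%cpu | head -n 10
# Install the open-source Intel microcode and verify it was applied:
sudo apt-get install intel-microcode
dmesg | grep microcode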

New results using the same cavity3d benchmark from Palabos:
Code:
#Cores	msu_100	msu_400
1	11.05	12.83
2	18.40	23.98
4	32.21	43.16
8	58.01	83.04
16	82.62	133.23
24	90.67	153.89
32	105.66	175.41
So, finally some results similar to those of your Intel setup. This is a great improvement over yesterday's results, and I guess it is closer to what is expected? Slightly better performance than AMD for single-core processing, but slower when more cores are used, due to the different number of memory channels?

January 23, 2018, 10:50
  #17
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
That is a good find and a pretty solid improvement.
Yes, AMD Epyc is much faster with many cores thanks to 8 memory channels per CPU, and slightly slower single-core due to the lower clock speed of 2.7 GHz.
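Rough theoretical peak-bandwidth arithmetic for the two dual-socket systems (assuming DDR4-2666 on the Epyc and DDR4-2400 on the Xeons; sustained numbers are lower):
Code:
echo $((16 * 2666 * 8))  # 2x Epyc 7301:    16 channels -> ~341 GB/s
echo $(( 8 * 2400 * 8))  # 2x E5-2697A v4:   8 channels -> ~154 GB/s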

January 23, 2018, 12:19
  #18
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Thank you so much for your help, flotus1!

And thanks to both you and ErikAdr for your advice regarding our next server as well. I'm looking into a few Supermicro boards with Epyc processors, and I now think we'll end up with one of the following instead of the Intel CPUs:
Code:
Processor        Euro per processor    Euro total build
7281 (2.1 GHz)          685                 6235
7301 (2.2 GHz)          870                 6600
7351 (2.4 GHz)         1140                 7140
Based on these off-the-website prices and the benchmarks here (https://www.servethehome.com/amd-epy...ks-and-review/), I'm having a hard time deciding whether the upgrade to the 7351 is worth it.

January 23, 2018, 12:36
  #19
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
I would avoid the 7281 because it only has half the L3 cache of the larger models.
The 7351 gets you 0.2 GHz more clock speed compared to the 7301 (2.9 GHz vs 2.7 GHz all-core turbo; these CPUs always run at maximum turbo speed for CFD workloads). Looking at the prices for the processors alone, this is not worth it at all for 7.4% more clock speed. However, looking at the total system cost, one might be tempted to do this upgrade. Personally, I don't think I would; the performance increase might be less than 5%. I can't make this decision for you.
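The 7.4% is just the all-core turbo ratio:
Code:
echo "scale=3; 2.9 / 2.7" | bc   # 1.074 -> ~7.4% more clock speed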

January 24, 2018, 05:26
  #20
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
For my own in-house CFD program I would not worry about the smaller L3 cache in the 7281. Per core it has 2 MB, and that is the same as for the top-range 32-core EPYC. Intel Gold has 1.375 MB per core. In the spec2017fp results, the 7281 and 7301 perform equally in 10 of the 13 test cases, but the 7301 is up to 16% faster in the remaining three. Try to compare: https://www.spec.org/cpu2017/results...128-01266.html and https://www.spec.org/cpu2017/results...128-01292.html
The price difference between the 7281 and 7301 is small, so if you believe OpenFOAM benefits from a larger cache, it is a small cost to choose the 7301 over the 7281. I don't know if that is the case. I think the 7351 is not worth the extra cost.
