|
January 19, 2018, 05:16 |
Poor scaling of dual-Xeon E5-2697A-V4
|
#1 |
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Rep Power: 11 |
After getting advice on this forum, I decided to run some scaling tests on our Dell R630 server, equipped with 2x Xeon E5-2697A-V4, 8x 16GB 2400MHz RAM and a 480GB SSD. In addition, to benchmark against some larger cases, I tested the cavity case previously run on an HPC system at NTNU: ( OpenFOAM Performance on Vilje ).
A .pdf of the results and a .txt file of the memory setup are attached. In the first part, the scaling tests, all cases are confidential hull designs, so I can't share much more information about those meshes. The cache was cleared between each of the benchmarks against Vilje using the command: sync && echo 3 | sudo tee /proc/sys/vm/drop_caches (a sketch of the full run loop is appended below). As you can see in the .pdf, the performance using only 16 cores is much higher than the performance with 32 cores, even with as many as ~27 million cells. Any idea why this is happening? I was expecting 32 cores to outperform 16 by far, even in the interDyMFoam test using 9 million cells. Is this assumption wrong? Because of these results, I suspect there is something wrong in our setup, but I have no idea where to start looking. Any comments or recommendations are greatly appreciated. Last edited by havref; January 19, 2018 at 06:08. Reason: Adding information |
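For reference, each scaling point was driven by a loop along these lines (a rough sketch; the exact solver call and the decomposeParDict edit are case-dependent, not the literal scripts used):
Code:
#!/bin/bash
# Run the case on an increasing number of cores, dropping the page
# cache before each run. Timings come from the ExecutionTime entries
# in each solver log.
for n in 1 2 4 8 16 24 32; do
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
    if [ "$n" -eq 1 ]; then
        interDyMFoam > run.log.1 2>&1            # serial reference run
    else
        sed -i "s/^numberOfSubdomains.*/numberOfSubdomains $n;/" system/decomposeParDict
        decomposePar -force > decompose.log.$n
        mpirun -np $n interDyMFoam -parallel > run.log.$n 2>&1
    fi
done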
|
January 19, 2018, 09:12 |
Re: Poor scaling of dual-Xeon E5-2697A-V4
|
#2 |
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
Rep Power: 10 |
My guess is that the performance of your test case is limited by memory bandwidth rather than CPU power. You have two strong CPUs, but only 8 memory channels. For CFD, two cores are usually enough to saturate one memory channel, and the best performance in your case is achieved at that ratio of cores to memory channels. It doesn't surprise me that performance drops when using 32 cores, but I don't believe the drop is due solely to too many cores competing for the limited memory bandwidth. For a larger test case 32 cores could do better, but I think 16 cores are optimal.
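If you want to verify the bandwidth ceiling directly, a pure bandwidth benchmark like STREAM shows it well. A rough sketch (the download URL and array size are my assumptions; pick an array size much larger than the combined L3 caches):
Code:
# Build STREAM with OpenMP and watch the Triad bandwidth plateau
# once roughly two cores per memory channel are busy.
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
for t in 1 2 4 8 16 32; do
    echo "threads: $t"
    OMP_NUM_THREADS=$t ./stream | grep "Triad"
done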
|
|
January 22, 2018, 04:38 |
|
#3 |
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Rep Power: 11 |
Thank you for the response. It makes sense when you put it like that. I'll do some tests using 24 cores etc. as well, to see where the peak performance is for these cases.
I originally performed these scaling tests to check how well our server scales before purchasing a second one. We were looking into the Xeon Gold 6154 3.0 GHz (18 cores), 6146 3.2 GHz (12 cores) and 6144 3.5 GHz (8 cores). As these new Xeon Scalable processors have 6 memory channels per socket, can I assume that both the 8- and 12-core processors are good choices purely based on the number of cores per memory channel? Or do you think that 12 cores will still saturate the memory channels to such a degree that performance decreases? |
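For reference, my cores-per-channel arithmetic for the dual-socket candidates (using the two-cores-per-channel rule of thumb from above):
Code:
# Dual-socket Skylake-SP: 2 x 6 = 12 memory channels per node
# 2x 6144 (8C):  16 cores / 12 channels = 1.33 cores per channel
# 2x 6146 (12C): 24 cores / 12 channels = 2.0  cores per channel
# 2x 6154 (18C): 36 cores / 12 channels = 3.0  cores per channel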
|
January 22, 2018, 05:53 |
Re: Poor scaling of dual-Xeon E5-2697A-V4
|
#4 |
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
Rep Power: 10 |
I have no experience with OpenFOAM, so I don't feel qualified to give a definitive answer on this. But I would not choose fewer than 12 cores, and I would feel safer with 16. For my own in-house CFD code I would go for an EPYC 7281 or EPYC 7301, which both have 8 memory channels and cost just a fraction of Intel's Gold processors. If I had to select an Intel processor, I would go for the cheaper models like the Silver 4116 or Gold 6130, and then buy some more systems to 'fill the budget'. Perhaps you are able to get the high-end processors at a better price than I can. Look at https://www.spec.org/cpu2017/results to see benchmarks for various systems. I usually look at the results for bwaves_r and wrf_r, where the former is the most memory-intensive. Please also look at the other threads on this forum on the topic, e.g. 'Epyc vs Xeon Skylake SP'.
|
|
January 22, 2018, 05:57 |
|
#5 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I have no experience with OpenFOAM, but such a drastic decrease of performance seems rather unusual.
To check whether something is wrong with your setup, you could run this benchmark; I have numbers from comparable platforms to compare against: http://www.palabos.org/software/download Find the benchmark in "examples/benchmarks/cavity3d". I would recommend running problem sizes 100 and 400 with 1, 2, 4, 8... cores. Concerning Skylake-SP, I would strongly advise against it for non-commercial CFD software. Since you don't have to pay a per-core license, AMD Epyc would give you much better performance/$. |
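Building and running it is straightforward; a sketch (archive name and output details from memory, check the download page):
Code:
# Unpack Palabos, build the cavity3d benchmark, then sweep core counts.
# MSU = mega site updates per second, printed at the end of each run.
tar xzf palabos-*.tgz
cd palabos-*/examples/benchmarks/cavity3d
make
for n in 1 2 4 8 16 32; do
    mpirun -np $n ./cavity3d 400    # problem size N=400
done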
|
January 22, 2018, 06:28 |
|
#6 |
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Rep Power: 11 |
Thank you Alex, I'll install and run that benchmark shortly.
For the second server we were considering a quad-socket setup with one of the Skylake-SP processors. However, if two servers with dual Epyc 7351 (or similar) give better performance/$, we will definitely consider them. Even a single server with dual Epyc would probably be sufficient if it performs much better than the server we have now. |
|
January 22, 2018, 06:53 |
|
#7 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I just realized that I did not run a full set of benchmarks on our newest Intel system, but what I have should be sufficient for a quick comparison:
2x AMD Epyc 7301
Code:
#threads      msu_100   msu_400   msu_1000
01 (1 die)      9.369    12.720      7.840
02 (2 dies)    17.182    24.809     19.102
04 (4 dies)    33.460    48.814     49.291
08 (8 dies)    56.289    95.870    105.716
16 (8 dies)   102.307   158.212    158.968
32 (8 dies)   169.955   252.729    294.178
Intel system:
Code:
#threads   msu_100   msu_400
01           8.412    11.747
24          88.268   154.787
Yours should slightly outperform my Intel setup with 32 and 16 cores active. Testing your setup with 24 cores might give worse results; this benchmark does not run too well when the total core count is not divisible by the number of active cores, or when that number is not a power of 2. |
|
January 22, 2018, 11:49 |
|
#8 |
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Rep Power: 11 |
Here are the results. I ran each benchmark 3 times; the averages are given in the left columns, with the individual runs on the right. As you can see, there's a huge decrease in MSU when going up to 32 cores. It should also be noted that the variation between runs is much larger at 32 cores than at the other core counts. The operating system is Ubuntu 16.04 LTS.
Code:
2x Intel Xeon E5-2697A V4
#Cores  msu_100  msu_400  ||  msu_100 (3 runs)            |  msu_400 (3 runs)
1         10.97    12.51  ||  10.9552  10.9675  10.9739   |  12.4802  12.5846  12.4516
2         18.20    23.31  ||  18.1663  18.0220  18.4169   |  23.4069  23.2080  23.3125
4         29.26    39.95  ||  29.2100  29.3525  29.2220   |  39.8766  39.5660  40.3934
8         52.70    76.58  ||  52.6798  52.5828  52.8315   |  76.3575  76.0869  77.3047
16        76.97   123.01  ||  76.8920  77.2672  76.7622   |  123.351  123.381  122.295
24        84.23   141.68  ||  84.6862  83.7586  84.2461   |  140.979  141.109  142.966
32        39.61   113.68  ||  36.5778  43.7236  38.5174   |  119.524  116.610  104.910 |
|
January 22, 2018, 12:01 |
|
#9 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I have never seen such a drastic decrease of performance with this benchmark on any system when using all cores. In fact, I have never observed lower performance with all cores in use compared to a smaller number of cores.
You might want to check system temperatures (sensors) and frequencies (sudo turbostat) while running the benchmark. I assume hyperthreading is turned off and memory is populated correctly, i.e. 4 DIMMs per socket in the right slots? |
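A sketch of how I would capture both during a 32-core run (assumes the lm-sensors package and a turbostat build matching your kernel; flag spellings may differ between versions):
Code:
# Terminal 1: start the 32-core benchmark.
# Terminal 2: log temperatures and effective per-core clocks.
watch -n 5 sensors                            # temperatures
sudo turbostat --interval 5 > turbostat.log   # Busy%, Bzy_MHz, PkgWatt per core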
|
January 22, 2018, 12:22 |
|
#10 |
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Rep Power: 11 |
I'll double-check the memory configuration, but I'm quite sure it is correct. Hyper-threading is turned off in the BIOS, and only 32 cores are visible from the Ubuntu terminal.
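One way to verify the population from the OS without opening the case (sketch; dmidecode ships with Ubuntu):
Code:
# List every DIMM slot with its size, speed and locator.
# Empty slots report "No Module Installed".
sudo dmidecode -t memory | grep -E "Locator|Size|Speed"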
The turbostat output and temperature monitoring files are attached. I watched the temperature over time, and after a short while it stabilized at the values seen in the .txt file. I did two runs, so I attached the temperature and turbostat output from both. |
|
January 22, 2018, 12:33 |
|
#11 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Weird. If that turbostat output was taken during a run with 32 cores, all cores should be near 100% load (despite the memory bottleneck) and running at 2900MHz (edit: 3100MHz). Something is holding them back, and it is not CPU temperature or power draw.
What Linux kernel are you using and which MPI library? |
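Both are quick to check (the version flag below is supported by all common MPI launchers):
Code:
uname -r           # running kernel
mpirun --version   # MPI implementation and version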
|
January 22, 2018, 12:36 |
|
#12 |
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Rep Power: 11 |
I agree. I attached another set of files to the post above; these were captured in the middle of the test.
Linux kernel: 4.13.0-26-generic
MPI library: mpiexec (OpenRTE) 1.10.2
Edit: I ran three new tests with 32 cores and N=400, which yielded the following results: 154.4, 157.4 and 138.3 MSU. It does seem to vary quite a lot. |
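Given that spread, one thing I still want to try is pinning the MPI ranks so they can't migrate between sockets. A sketch for our Open MPI version (the flags should exist in 1.10.x; to be verified with mpirun --help):
Code:
# One rank per core, ranks distributed round-robin across the two
# sockets; --report-bindings prints the map so the pinning can be checked.
mpirun -np 32 --bind-to core --map-by socket --report-bindings ./cavity3d 400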
|
January 22, 2018, 12:53 |
|
#13 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I'm slowly running out of ideas. Before agreeing that your hardware is cursed, we could try an even simpler stress test:
stress -c 32 -m 32
If CPU load or frequency drops during this test, it might be a VRM throttling issue, though I doubt that is the case. |
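To catch throttling in the act, the stress run can be time-limited and measured over the same window (a sketch; check the flags against your stress and turbostat versions):
Code:
# 32 CPU workers + 32 memory workers for 60 seconds; turbostat takes
# one averaged sample of frequency and package power over that minute.
stress -c 32 -m 32 -t 60 &
sudo turbostat sleep 60 > stress_turbostat.log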
|
January 22, 2018, 13:23 |
|
#14 |
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Rep Power: 11 |
Thank you for helping out!
Turbostat output from the stress test is attached. Unless I am reading it incorrectly, the average frequency looks to be approximately 3100MHz. Does that mean there is probably something wrong with the memory setup? Edit: Added Memoryinformation.txt and the following text: The memory setup of the server is attached. Originally the server was purchased with only 6 RAM sticks (I know), so two more with identical specs, but from a different vendor, were installed. Could this be the problem? Or is there any additional BIOS setup needed for all the installed memory to work properly? The server was (very roughly) tested both before and after the RAM upgrade and showed a noticeable performance increase. However, if you suspect the RAM configuration is incorrect, I'll be happy to run the same Palabos tests with only 6x16GB RAM for a proper comparison. Last edited by havref; January 22, 2018 at 13:42. Reason: Adding information |
|
January 22, 2018, 16:26 |
|
#15 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Memory seems to be populated correctly.
Regarding the turbostat output you attached: did you stop the stress test somewhere in between? If so, there is a pretty high idle load on your machine. Any idea where it is coming from? |
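A quick way to see what is consuming CPU time at idle (sketch):
Code:
# Top ten processes sorted by current CPU usage
ps aux --sort=-%cpu | head -n 10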
|
January 23, 2018, 10:31 |
|
#16 |
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Rep Power: 11 |
I checked this morning, and both Xorg and compiz were using a surprising amount of CPU resources, each around 70% of one CPU core when idle.
First, I deactivated most of the unnecessary animations and effects in compiz. Second, and probably more importantly, I reinstalled the latest microcode for the CPUs, this time not using the Dell-provided version but the open-source "Processor microcode firmware for Intel CPUs" from the intel-microcode package I found in the update manager. New results using the same cavity3d benchmark from Palabos:
Code:
#Cores  msu_100  msu_400
1         11.05    12.83
2         18.40    23.98
4         32.21    43.16
8         58.01    83.04
16        82.62   133.23
24        90.67   153.89
32       105.66   175.41 |
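For anyone finding this thread later: the same fix can be applied from the terminal (a sketch for Ubuntu 16.04; a reboot is needed before the new microcode is active):
Code:
sudo apt-get update
sudo apt-get install intel-microcode   # open-source packaging of Intel's microcode updates
sudo reboot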
|
January 23, 2018, 10:50 |
|
#17 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
That is a good find and a pretty solid improvement.
Yes, AMD Epyc is so much faster with many cores thanks to 8 memory channels per CPU and slightly slower in single-core due to the low clock speed of 2.7GHz. |
|
January 23, 2018, 12:19 |
|
#18 |
New Member
Håvard B. Refvik
Join Date: Jun 2015
Location: Norway
Posts: 17
Rep Power: 11 |
Thank you so much for your help, flotus1!
And thanks to both you and ErikAdr for your advice regarding our next server as well. I'm looking into a few Supermicro boards with Epyc processors, and I now think we'll end up with one of the following instead of the Intel CPUs:
Code:
Processor       Euro per processor   Euro total build
7281 (2.1GHz)                  685               6235
7301 (2.2GHz)                  870               6600
7351 (2.4GHz)                 1140               7140 |
|
January 23, 2018, 12:36 |
|
#19 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I would avoid the 7281 because it only has half the L3 cache of the larger models.
The 7351 gets you 0.2GHz more clock speed than the 7301 (2.9GHz vs 2.7GHz all-core turbo; these CPUs run at maximum turbo speed for CFD workloads). Looking at the prices for the processors alone, 7.4% more clock speed is not worth it at all. However, looking at the total system cost, one might be tempted to do this upgrade. Personally, I don't think I would; the performance increase might be less than 5%. But I can't make this decision for you. |
|
January 24, 2018, 05:26 |
|
#20 |
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
Rep Power: 10 |
For my own in-house CFD program I would not worry about the smaller L3 cache in the 7281. It has 2 MB per core, the same as the top-range EPYC with 32 cores. Intel Gold has 1.375 MB per core. In the SPEC CPU2017 FP suite, the 7281 and 7301 perform equally in 10 of 13 test cases, but the 7301 is up to 16% faster in the remaining three. Try to compare: https://www.spec.org/cpu2017/results...128-01266.html and https://www.spec.org/cpu2017/results...128-01292.html
The price difference between the 7281 and 7301 is small, so if you believe OpenFOAM benefits from a larger cache, it is a small cost to choose the 7301 over the 7281; I don't know whether that is the case. I don't think the 7351 is worth the extra cost. |
|
|
|