Xeon Gold Cascade Lake vs Epyc Rome - CFX & Fluent - Benchmarks (Windows Server 2019) |
February 2, 2020, 02:57 |
Xeon Gold Cascade Lake vs Epyc Rome - CFX & Fluent - Benchmarks (Windows Server 2019)
#1 |
Member
Join Date: Jul 2011
Posts: 53
Rep Power: 15 |
I have been benchmarking two PowerEdge machines from Dell: a 24-core R640 (Intel Cascade Lake) and a 32-core R6525 (AMD Epyc Rome), both running Windows Server 2019.

Specs:

R640
2 x Intel Xeon Gold 6246 (Cascade Lake), 12 cores, 4.1 GHz all-core turbo
12 x 16 GB 2933 MT/s RAM (dual rank)
Sub-NUMA clustering enabled

R6525
2 x AMD Epyc 7302 (Rome), 16 cores, 3.3 GHz all-core turbo
16 x 16 GB 3200 MT/s RAM (dual rank)
NPS set to 4

The R6525 is 15% cheaper than the R640 in the above spec; the rest of the specification is identical between the two machines.

I've run a number of the different official Fluent and CFX benchmarks from ANSYS. For CFX I used Intel MPI, and for Fluent the default IBM MPI. Averaged across the different benchmarks, the Epyc Rome system comes out ahead. Here is an example of my results - the Fluent and CFX charts are in the post below (the forum spam filter is breaking my balls).

Something interesting to note is the scaling on the AMD Epyc: there is a very clear performance peak at every multiple of 8 cores. Look at the aircraft_wing_14m Fluent benchmark, for example - there are scaling and performance peaks at 16, 24 and 32 cores. You do NOT want to run the AMD system at 26 cores; it is slower there than at 24 cores. I'm guessing this is related to the CPU architecture and the splitting of cores into CCXs.

Another interesting observation is that the Intel system runs both hot and power-hungry: approx. 550 W at full load with CPU temps of 80 C, compared to approx. 400 W at full load with CPU temps of 60 C for the AMD system.

The decision is clear for me: I'll be building a mini-cluster of four AMD Epyc Rome machines, for a total of 128 cores. The alternative would be to purchase five Intel Xeon Gold Cascade Lake systems (120 cores in total), which would be 30% more expensive and 10% slower overall! I could also go for six Intel machines, which ought to theoretically match the four AMD machines, but at a dizzying 50% price premium. AMD Epyc Rome really is EPIC for CFD applications!
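In case anyone wants to reproduce the scaling sweep, here is a minimal sketch of how it can be scripted - not my exact setup: the Fluent path and journal file below are placeholders, while -t, -g and -i are the standard Fluent batch flags for process count, no GUI, and journal input.

Code:
# Sketch: sweep a Fluent benchmark over core counts on Windows.
# FLUENT and JOURNAL are placeholder paths - adjust to your installation.
import subprocess

FLUENT = r"C:\Program Files\ANSYS Inc\v195\fluent\ntbin\win64\fluent.exe"
JOURNAL = r"C:\bench\aircraft_wing_14m.jou"  # journal that loads the case and iterates

for ncores in (8, 16, 24, 26, 32):
    # 3ddp = 3D double precision, -t<N> = number of processes,
    # -g = run without GUI, -i = read commands from the journal file
    subprocess.run([FLUENT, "3ddp", f"-t{ncores}", "-g", "-i", JOURNAL], check=True)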
February 2, 2020, 03:00 |
#2 |
Member
Join Date: Jul 2011
Posts: 53
Rep Power: 15 |
Fluent: [benchmark results posted as image]
CFX: [benchmark results posted as image]
February 2, 2020, 08:23 |
#3 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
Those are some very thorough investigations, with interesting results. Thanks for publishing.
Just out of curiosity, could you run one more comparison with the AMD system? It doesn't have to be a full scaling analysis; one more data point at max cores would suffice. The change I am interested in: drop the memory transfer speed to 2933 MT/s. I recently learned that this is the maximum frequency on Epyc Rome CPUs at which the Infinity Fabric and the memory can run in sync. Compared to 3200 MT/s, you should get a little less bandwidth, but much better memory access times.

Since you are on Windows, it should be easy to check the IF speed with HWiNFO or CPU-Z. See the bottom of page 10 in this documentation for reference: https://developer.amd.com/wp-content...56745_0.80.pdf
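To put a rough number on the bandwidth side of that trade-off, a back-of-the-envelope sketch (assuming the textbook values of 8 DDR4 channels per socket and 8 bytes per transfer):

Code:
# Theoretical peak memory bandwidth of a dual-socket Epyc Rome system
# at the two memory speeds (8 channels/socket, 8 bytes/transfer assumed).
CHANNELS = 2 * 8         # two sockets, eight DDR4 channels each
BYTES_PER_TRANSFER = 8   # one 64-bit transfer per channel

for mts in (3200, 2933):
    gb_per_s = CHANNELS * mts * 1e6 * BYTES_PER_TRANSFER / 1e9
    print(f"{mts} MT/s -> {gb_per_s:.0f} GB/s peak")

# 3200 MT/s -> 410 GB/s, 2933 MT/s -> 375 GB/s: roughly 8% less bandwidth,
# traded against a synchronous Infinity Fabric (lower memory latency).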
February 2, 2020, 19:52 |
#4 |
Member
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7 |
Did you run the same executable on each machine, or two different executables specifically compiled for each machine?
Which compilers were used, and which math libraries? Did you use MKL on the Intel system? Which compiler flags and optimizations?
February 2, 2020, 22:39 |
#5 | |
Member
Join Date: Nov 2011
Location: Czech Republic
Posts: 97
Rep Power: 15 |
Quote:
Thank you for your post! You get precompiled binaries for Windows/Linux, and that's all. As far as I know, both are using MKL, and the CFX solver is compiled with the Intel Fortran Compiler. It might be interesting to set the environment variable MKL_DEBUG_CPU_TYPE=5 on the AMD system to see if there is any impact on performance. Details can be found here.
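If you want to try it, a minimal sketch of a run with that variable set - the launch command below is a placeholder, so substitute your actual Fluent/CFX invocation. MKL_DEBUG_CPU_TYPE=5 is an undocumented MKL switch that forces the faster AVX2 code path on non-Intel CPUs; you could also set it system-wide with "setx MKL_DEBUG_CPU_TYPE 5" in a command prompt.

Code:
# Sketch: launch a solver on the AMD system with MKL_DEBUG_CPU_TYPE=5 set.
import os
import subprocess

env = os.environ.copy()
env["MKL_DEBUG_CPU_TYPE"] = "5"  # undocumented: force MKL's AVX2 dispatch on non-Intel CPUs

# Placeholder command - replace with the real benchmark launch line.
subprocess.run(["fluent", "3ddp", "-t32", "-g", "-i", "bench.jou"], env=env, check=True)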
February 3, 2020, 04:25 |
#6 |
Member
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7 |
OK, so you do not have special access to these commercial codes.
Here is a suggestion: try Nektar++. It comes as a precompiled binary, or you can download the source and compile it yourself. Run any 3D case (you have to set it up yourself) - channel/duct/pipe/lid-driven cavity - for, say, 100 steps, and if you can, use ~200 million nodes, or the highest count you can manage. Use polynomials of at least order 10 (20 or more would be nice).
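A sketch of what timing such a run could look like, assuming the incompressible Navier-Stokes solver that ships with Nektar++ (IncNavierStokesSolver) and a session file you have set up yourself - the file name is a placeholder, and the polynomial order is controlled by NUMMODES (= order + 1) in the session XML:

Code:
# Sketch: timed parallel runs of a Nektar++ case at a few core counts.
# session.xml is a placeholder for a case you set up yourself.
import subprocess
import time

for ncores in (16, 24, 32):
    t0 = time.time()
    subprocess.run(["mpiexec", "-n", str(ncores),
                    "IncNavierStokesSolver", "session.xml"], check=True)
    print(f"{ncores} cores: {time.time() - t0:.1f} s")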
February 3, 2020, 05:44 |
#7 | |
Member
Join Date: Jul 2011
Posts: 53
Rep Power: 15 |
Quote:
I've just tried, and it appears I am not able to change the memory speed on this PowerEdge R6525. It is locked at 3200 in the BIOS/iDRAC.
February 3, 2020, 06:03 |
#8 | |
Member
Join Date: Jul 2011
Posts: 53
Rep Power: 15 |
Quote:
I just tried this; there was no performance change in either Fluent or CFX.
February 3, 2020, 06:04 |
#9 | |
Member
Join Date: Jul 2011
Posts: 53
Rep Power: 15 |
Quote:
Sorry, I'm running Windows, and it looks like it's quite a lot of work to compile Nektar++.
February 3, 2020, 06:40 |
#10 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
Quote:
It would still be interesting to see what the IF is clocked at in your system. In CPU-Z, it should be the value of "NB frequency", and HWiNFO should have an entry for the Infinity Fabric.
February 3, 2020, 08:22 |
#11 | |
Member
Join Date: Jul 2011
Posts: 53
Rep Power: 15 |
Quote:
You were right! I had to change the power profile from "Maximum Performance" to "Custom", which was on a different page from the memory settings.

Fluent benchmark aircraft_wing_14m @ 32 cores:
3200 MT/s - 122.4 s
2933 MT/s - 160.8 s

HOWEVER! It would seem there is a bug in the Dell BIOS. When selecting the 2933 memory speed, it actually clocked the memory all the way down to 1600 MT/s - the memory clock is reported as 800 MHz in CPU-Z/HWiNFO, and since DDR transfers twice per clock, that is 2 x 800 = 1600 MT/s.

I've searched high and low and can't find an entry for NB frequency or Infinity Fabric...
February 4, 2020, 17:31 |
#12 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
Nice catch with that bug.
HWiNFO needs to be a recent version, maybe even a beta; I don't know if they have implemented this in a release version yet. Of course, the sensor reading could also simply fail because the tool is unfamiliar with your server hardware. In CPU-Z, you should find it in the Memory tab.

[Attachments: cpuz_nb.png, hwinfo_nb.png]
February 11, 2020, 13:01 |
#13 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23 |
It is interesting that the AMD system performs most efficiently in CFX when running at a core count that is a balanced multiple of the number of memory channels: 8, 16, 24. Performance always drops at a core count slightly above these numbers, as if the memory load has become unbalanced.
February 11, 2020, 13:25 |
#14 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
My take on this: it is caused by the chiplet design.
The 7302 should have 4 active chiplets with 4 cores each - maybe SLC could verify that with a look at lstopo under Linux. Running on 16 cores, each chiplet has 2 threads assigned to it, and each thread has access to the same amount of shared resources: L3 cache, chiplet-to-I/O-die bandwidth, memory bandwidth.

Going to 17 cores, one chiplet has to take on 3 threads - 50% more than all the others - leaving those 3 threads with significantly less of the shared resources. In addition, boost frequency is determined by the number of active threads per chiplet, so the cores on that chiplet may also clock lower than the rest. Since the slowest thread determines overall performance, this imbalance leads to a drop in performance.

A more traditional dual-socket system using monolithic CPU dies experiences similar contention for shared CPU resources, but the imbalance is much less pronounced.
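A toy illustration of that imbalance - distributing solver threads round-robin over the 8 chiplets of a dual-7302 system (assuming, as above, 4 active chiplets of 4 cores per socket):

Code:
# Toy model: place N threads round-robin on 8 chiplets (2 sockets x 4 CCDs);
# the worst-loaded chiplet gates overall solver speed.
CHIPLETS = 8  # assumed: 2 sockets x 4 active chiplets, 4 cores each

for n_threads in (16, 17, 24, 26, 32):
    load = [0] * CHIPLETS
    for t in range(n_threads):
        load[t % CHIPLETS] += 1
    print(f"{n_threads:2d} threads -> per-chiplet load {load}")

# 16 -> every chiplet runs 2 threads (balanced); 17 -> one chiplet runs 3
# (50% more than the rest); 26 -> a mix of 3s and 4s, matching the observed
# dip versus the balanced 24- and 32-core runs.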
February 12, 2020, 08:10 |
#15 | |
Member
Join Date: Dec 2016
Posts: 44
Rep Power: 10 |
Quote:
Given that BIOS bug, the real comparison is:
3200 MT/s - 122.4 s
1600 MT/s - 160.8 s

So the speedup scales with memory frequency roughly as (3200/1600)^0.4 = 2^0.4 = 1.32, which matches the measured ratio 160.8/122.4 = 1.31.

I have noticed a similar situation on another pair of systems, an Epyc 7551 vs a 6850K, in the Fluent benchmarks: +50% memory bandwidth (2400 vs 1600 MT/s) gave only about +30% speedup, i.e. scaling as roughly 1.5^0.66 with memory bandwidth.
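In other words, fitting a power law $t \propto \mathrm{bandwidth}^{-\alpha}$ to those two data points (a rough empirical fit, not a derived result):

\[
\frac{t_{1600}}{t_{3200}} = \left(\frac{3200}{1600}\right)^{\alpha}
\quad\Rightarrow\quad
\alpha = \frac{\ln(160.8/122.4)}{\ln 2} \approx 0.39
\]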
June 11, 2020, 11:00 |
single vs dual cpu
#16 | |
New Member
sida
Join Date: Dec 2019
Posts: 6
Rep Power: 6 |
June 11, 2020, 11:07 |
#17 |
Member
Join Date: Jul 2011
Posts: 53
Rep Power: 15 |
June 13, 2020, 17:35 |
|
#18 |
Member
Ivan
Join Date: Oct 2017
Location: 3rd planet
Posts: 34
Rep Power: 9 |
We bought a new dual Epyc 7301 system in 2018 for Ansys CFX work.
We run long calculations - 200+ hours per task - and on average about one run in ten (roughly a 10% chance) the CPUs stop and drop out within the first 15-60 hours, while the motherboard keeps running, and we lose all progress. It is not a temperature or BIOS problem - we reinstalled and rechecked everything many times. It is very uncomfortable for the job when you have deadlines.

Because of this, we want to buy a new dual-CPU Xeon-based cluster this year. We are too afraid of buying AMD Rome, even though it looks faster and more cost-effective on paper and in general tests. We are very tired of "dancing with drums" around this dual 7301.
June 13, 2020, 17:48 |
#19 |
New Member
sida
Join Date: Dec 2019
Posts: 6
Rep Power: 6 |