|
December 3, 2017, 17:50 |
AMD Epyc CFD benchmarks with Ansys Fluent
|
#1 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Earlier this year, AMD introduced a new CPU architecture called "Zen".
The most interesting CPUs in their lineup from a CFD perspective are definitely the "Epyc" CPUs. They consist of 4 dies connected through an interconnect called "Infinity Fabric". Each die has its own dual-channel memory controller, resulting in 8 memory channels per CPU. Some of these Epyc CPUs are 2S-scalable, which means 16 memory channels and a theoretical memory bandwidth of 341 GB/s (DDR4-2666) in a dual-socket node. Now that these CPUs and motherboards are finally available, it is time to run some CFD benchmarks and compare them to an Intel system.

System specifications

System "AMD"
CPU: 2x AMD Epyc 7301 (16 cores, 2.2 GHz base, 2.7 GHz all-core, 2.7 GHz single-core)
RAM: 16x 16GB Samsung 2Rx4 DDR4-2133 reg ECC (use DDR4-2666 if you buy a system)
Mainboard: Supermicro H11DSi
GPU: Nvidia Geforce GTX 960 4GB
SSD: Intel S3500 800GB
PSU: Seasonic Focus Plus Platinum 850W (80+ platinum)

System "INTEL"
CPU: 2x Intel Xeon E5-2650v4 (12 cores, 2.2 GHz base, 2.5 GHz all-core, 2.9 GHz single-core)
RAM: 8x 16GB Samsung 2Rx4 DDR4-2400 reg ECC
Mainboard: Supermicro X10DAX
GPU: Nvidia Quadro 2000 1GB
SSD: Intel S3500 800GB
PSU: Super Flower Golden Green HX 750W (80+ gold)

A note on memory: I would have liked to equip the AMD system with faster RAM, but there is no way I am buying memory at current prices. So I work with what I have. The difference in memory size is irrelevant: all benchmarks shown here fit in memory, and caches were cleared before each run.

Software:
Operating system: CentOS 7
Linux kernel: 4.14.3-1
Fluent version: 18.2
CPU governor: performance
SMT/Hyperthreading: off

Fluent performance

I used some of the official Fluent benchmarks provided by Ansys. For a detailed description of the cases see here: http://www.ansys.com/solutions/solut...ent-benchmarks
These benchmark results should be representative for many finite-volume solvers with MPI parallelization. The results given are solver wall-time in seconds. 
1) External Flow Over an Aircraft Wing (aircraft_2m), single precision
AMD, 1 core, 10 iterations: 179.3 s
INTEL, 1 core, 10 iterations: 194.6 s
AMD, 24 cores, 100 iterations: 92.6 s
INTEL, 24 cores, 100 iterations: 121.9 s
AMD, 32 cores, 100 iterations: 78.4 s

2) External Flow Over an Aircraft Wing (aircraft_14m), double precision (note that the default setting for this benchmark is single precision)
This is the benchmark AMD used for their demonstration video: https://www.youtube.com/watch?v=gdYYRRDJDUc
Since they apparently used a two-node setup and different processors, I decided to drop comparability and use double precision to mix things up.
AMD, 24 cores, 10 iterations: 93.8 s
INTEL, 24 cores, 10 iterations: 118.2 s
AMD, 32 cores, 10 iterations: 72.2 s

3) 4-Stroke spray-guided Gasoline Direct Injection model (ice_2m), double precision
AMD, 24 cores, 100 iterations: 220.2 s
INTEL, 24 cores, 100 iterations: 258.7 s
AMD, 32 cores, 100 iterations: 172.4 s

4) Flow through a combustor (combustor_12m)
AMD, 24 cores, 10 iterations: 339.6 s
INTEL, 24 cores, 10 iterations: 386.0 s
AMD, 32 cores, 10 iterations: 269.4 s

A note on power consumption
Since the systems differ in terms of GPU and PSU, take my values with a grain of salt. Measuring power draw at the wall (using a Brennenstuhl PM231 E), the systems are actually pretty similar.
AMD, idle: ~115 W
INTEL, idle: ~125 W
AMD, solving aircraft_2m on 32 cores: ~350 W
INTEL, solving aircraft_2m on 24 cores: ~320 W

The Verdict
A quite compelling comeback for AMD in terms of CFD performance. Their new Epyc lineup delivers exactly what Intel has only increased incrementally over the past few years: memory bandwidth. Although the AMD system in this benchmark ran slower DDR4-2133 instead of the maximum supported 2666 MT/s, it beats the Intel system even in terms of per-core performance. Using all its cores, it pulls ahead of Intel by up to 63%. Quite surprisingly, AMD even takes the lead in single-core performance. 
This might have to do with the relatively large caches (512KB L2 per core instead of 256KB) and comparatively low latencies for cache access. However, with a different single-core in-house code that is more compute-bound (results not shown above), Intel pulls slightly ahead thanks to its higher clock speed.

Speaking of clock speed: in my opinion, AMD is missing a spot in its lineup: a medium core-count CPU with higher clock speeds. The 16-core CPUs don't seem to be using their TDP entirely, so there should have been headroom for a higher-clocked variant to tackle Intel in the per-core performance sector. That would have made sense because Intel has not been idle in the meantime: their new Skylake-SP architecture offers 6 DDR4-2666 memory channels per CPU, variants with high clock speeds, and 4S scalability and beyond. So for users with high per-core license costs, Intel probably still has an edge over AMD.

Which brings us to cost: Epyc 7301 CPUs cost ~920€ - if they are available, which is still a problem. For that kind of money, all Intel has to offer are Xeon Silver with 12 cores and support for DDR4-2400. So if you are on a limited budget for a CFD workstation or need cost-efficient cluster nodes, you should consider AMD. They are back!

Last edited by flotus1; December 4, 2017 at 05:59. |
|
December 4, 2017, 05:49 |
|
#2 |
New Member
Join Date: May 2013
Posts: 26
Rep Power: 13 |
awesome!
Many thanks for your work and the great documentation of the details on what is compared here :-)
It would be very interesting to see whether Skylake (Xeon SP) brings big changes. Broadwell (v4) -> Skylake (v5):
- 4 -> 6 memory channels
- ring bus -> mesh interconnect: for cores, L3, memory, I/O
- different L2/L3 architecture |
|
December 4, 2017, 12:37 |
|
#3 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
It definitely would have been nice to have a Skylake-SP platform for comparison. But I won't get my hands on one any time soon. So all we can do is extrapolate their performance based on specifications, different benchmarks and the numbers that Intel is advertising:
https://www.intel.com/content/www/us...pc-fluent.html Here they claim "up to 60% improvement" over the Haswell-EP (v3) platform. This translates to roughly 42% improvement compared to Broadwell-EP (v4). |
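For what it's worth, the arithmetic behind that 42% (assuming Broadwell-EP gained roughly 12.5% over Haswell-EP in these workloads - my own estimate, not Intel's number):

```python
# Convert Intel's "up to 60% over Haswell-EP" claim into a gain over Broadwell-EP.
# Assumes Broadwell-EP was ~12.5% faster than Haswell-EP (estimate, not official).
skylake_vs_haswell = 1.60
broadwell_vs_haswell = 1.125
skylake_vs_broadwell = skylake_vs_haswell / broadwell_vs_haswell
print(f"{(skylake_vs_broadwell - 1) * 100:.0f}%")  # ~42%
```

A different assumption for the Broadwell-over-Haswell gain would shift the result accordingly.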
|
December 4, 2017, 12:52 |
|
#4 |
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18 |
I'm curious, did you try setting the core affinity on the 24-core EPYC simulation to ensure that it is using 6 cores per CCX? If the system decided to use 8 cores on one CCX, then you wouldn't be fully utilizing the memory bandwidth.
There is more improvement from 24 cores to 32 cores than I would have expected. |
|
December 4, 2017, 14:09 |
|
#5 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I did not mess with affinity settings or verify which of the 32 cores Fluent was using while solving on 24 cores.
But to me the values look quite ok. For the aircraft_2m benchmark, parallel efficiency is a whopping 81% on 24 cores and drops to 71% on 32 cores. Remember, running on 32 cores here means only two cores per memory channel. So there is some room for improvement even on high core counts. |
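For anyone wanting to check: these efficiency figures follow directly from the aircraft_2m timings in post #1, after scaling the 1-core, 10-iteration run up to 100 iterations.

```python
# Parallel efficiency = (serial time / parallel time) / number of cores
def efficiency(t_serial, t_parallel, cores):
    return t_serial / t_parallel / cores

t1 = 179.3 * 10  # AMD, 1 core, 10 iterations -> scaled to 100 iterations
print(f"24 cores: {efficiency(t1, 92.6, 24):.0%}")  # ~81%
print(f"32 cores: {efficiency(t1, 78.4, 32):.0%}")  # ~71%
```

Note this assumes the solver time per iteration is constant between the 10- and 100-iteration runs, which is only approximately true.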
|
December 4, 2017, 15:01 |
|
#6 |
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18 |
I am mainly curious to see if going for a 32-core EPYC was really worth the extra cost over a 24-core chip. Your analysis certainly suggests that it is, but such a huge improvement for the 32 over the 24 doesn't smell right to me (even recognizing that the parallel efficiency drops quite a bit). You are only adding cores and not memory bandwidth, so I would expect the difference to be much smaller.
I'm hypothesizing that the 24-core tests could be improved by setting the core affinity, bringing those results closer to the 32-core ones.

Edit - I just realized you're using 2x 16-core chips, not a single 32-core. In this case there is no way to use the memory bandwidth efficiently with 24 threads, since some cores will share a memory controller while others have their own. Hopefully someone else gets their hands on some of the bigger chips. A 16-thread benchmark would be interesting to see on your setup. |
|
December 4, 2017, 15:23 |
|
#7 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
There is a total of 8 memory controllers in the system, one for each die.
So it is no problem to use the full memory bandwidth efficiently with 24 cores active: 3 cores per die. Both CCX on a die have access to the die's memory controller with no performance penalty. |
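If someone did want to pin ranks explicitly, the target would be 3 ranks on each of the 8 dies. A sketch for generating such a core list, assuming cores are numbered consecutively within each die (an assumption - verify the actual layout with numactl --hardware before relying on it):

```python
# Build a core list that places 3 MPI ranks on each of the 8 dies,
# assuming core IDs 4d..4d+3 belong to die d (check with "numactl --hardware").
CORES_PER_DIE = 4
DIES = 8
RANKS_PER_DIE = 3

core_list = [die * CORES_PER_DIE + c
             for die in range(DIES)
             for c in range(RANKS_PER_DIE)]
# Pass the result to e.g. "taskset -c <list>" or the solver's own affinity options
print(",".join(map(str, core_list)))
```

With 24 ranks spread like this, every die's memory controller serves the same number of ranks.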
|
December 5, 2017, 16:40 |
|
#8 |
Senior Member
Micael
Join Date: Mar 2009
Location: Canada
Posts: 157
Rep Power: 18 |
Did the benchmark aircraft_14m with double precision on our 32-core cluster:
- 4x (dual E5-2637v3 (4-core, 3.5 GHz), 64GB DDR4-2133)
- interconnect = FDR Infiniband
- Red Hat Enterprise Linux 6.7
10 iterations took 74.8 sec. Would have never bet AMD would match this (it actually beats it by a bit, with still room for DDR4-2666) - pretty good news. |
|
December 6, 2017, 05:23 |
|
#9 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
That is indeed an interesting result. Based on the specifications of your cluster, I would have expected it to perform a bit better than my Epyc workstation - mainly because it is a pretty perfect setup for CFD, and I would expect parallel efficiency to be above 100% with that kind of hardware. Did you clear caches before running the benchmark? I found this to be essential for consistent results. If you have the time, you could try running the benchmark again on a single core.
|
|
December 6, 2017, 11:34 |
|
#10 |
Senior Member
Micael
Join Date: Mar 2009
Location: Canada
Posts: 157
Rep Power: 18 |
Yes I did clear the cache with (flush-cache).
Didn't have time for a single-core run, but did a single-node run with 8 cores: 477 sec. That was using 50GB of RAM out of the 64 available on the node. Now a more interesting result would be a comparison with a Scalable Xeon, most notably the Gold 6144, which might be the fastest one for Fluent. |
|
December 6, 2017, 11:51 |
|
#11 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
So scaling is in fact super-linear but the individual nodes are a little on the slow side...
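The numbers from the posts above bear this out - going from 8 cores on one node to 32 cores on four nodes cuts the wall time by more than a factor of four:

```python
# Super-linear scaling check for aircraft_14m (double precision, 10 iterations):
# 8 cores / 1 node: 477 s ; 32 cores / 4 nodes: 74.8 s
speedup = 477 / 74.8       # wall-time ratio, 1 node -> 4 nodes
node_efficiency = speedup / 4  # relative to 4x the cores and memory bandwidth
print(f"speedup {speedup:.1f}x, efficiency {node_efficiency:.0%}")
```

Efficiency well above 100% is typically explained by the per-node working set shrinking enough to fit better into the CPU caches.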
I second that a direct comparison with a Skylake-SP Xeon would be interesting. But a single Xeon gold 6144 costs about as much as I paid for the whole Epyc workstation. So I am not the one running these tests |
|
December 13, 2017, 16:52 |
|
#12 |
Senior Member
Join Date: Oct 2011
Posts: 242
Rep Power: 17 |
Many thanks flotus1 for sharing your results, it gives a good idea of the CPU's capabilities. Did you have the opportunity to test it further with other CFD software? I am considering ordering these in the next months as well. In the end, are you convinced?
|
|
December 14, 2017, 05:27 |
|
#13 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I am completely convinced that AMD Epyc is currently the best processor for parallel CFD, at least with a price/performance ratio in mind.
I did not run any other commercial CFD codes for testing. Do you have anything specific in mind? Our in-house OpenMP-parallel LB code also runs pretty well: more than 50% faster than on the Intel platform. The Palabos benchmark results for AMD (higher is better):
Code:
#threads     msu_100   msu_400   msu_1000
01 (1 die)     9.369    12.720      7.840
02 (2 dies)   17.182    24.809     19.102
04 (4 dies)   33.460    48.814     49.291
08 (8 dies)   56.289    95.870    105.716
16 (8 dies)  102.307   158.212    158.968
32 (8 dies)  169.955   252.729    294.178
Code:
#threads     msu_100   msu_400
01             8.412    11.747
24            88.268   154.787
Last edited by flotus1; December 14, 2017 at 08:39. |
|
December 14, 2017, 11:12 |
|
#14 |
Member
Knut Erik T. Giljarhus
Join Date: Mar 2009
Location: Norway
Posts: 35
Rep Power: 22 |
Thanks for sharing these results, flotus, impressive performance for sure. Some OpenFOAM benchmark cases would also be interesting to see. I have just ordered a new workstation myself and had to get an Intel-based system due to a variety of reasons, it would be nice to compare the difference.
|
|
December 14, 2017, 11:26 |
|
#15 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I would need rather specific directions on what to test with OpenFOAM and exactly how to do it. I never really used it apart from running some tutorials a few years ago, so I don't feel confident providing reliable results.
|
|
December 18, 2017, 04:53 |
|
#16 |
Member
Knut Erik T. Giljarhus
Join Date: Mar 2009
Location: Norway
Posts: 35
Rep Power: 22 |
I will try to run some benchmarks after I receive my workstation, then I will post the results along with the setup here.
|
|
January 15, 2018, 04:04 |
|
#17 |
Member
Ivan
Join Date: Oct 2017
Location: 3rd planet
Posts: 34
Rep Power: 9 |
What will be the current optimum price/performance configuration to buy?
My understanding:

System "AMD"
CPU: 2x AMD Epyc 7601 (32 cores, 2.2 GHz base, 3.2 GHz turbo) - will 32 cores give me extra power worth the money? Or are 16 cores more or less the optimum because of the DDR channels?
RAM: 16x 16GB Samsung 2Rx4 DDR4-2666 reg ECC (maybe some DDR4-3600 if there is any for the Supermicro H11DSi-NT motherboard). Which amount and speed do I need for the optimum with 2x16 and 2x32 cores?
Mainboard: Supermicro H11DSi-NT (for Ethernet speed, to add some more computers using the Ansys CFX parallel solver)
GPU: none
SSD: Samsung 850 512GB for the system
HDD: 4x 8TB Seagate Enterprise SATA III 3.5" - RAID 5 or 6 to make it safe (if one drive goes down, you can recover with this type of RAID)
PSU: be quiet! 1200W

As an alternative: maybe invest in a single Intel Xeon Phi 7290F (72 cores, 1.5 GHz base, 1.7 GHz turbo)? Does someone run/own such a computer with CFX/Fluent?
https://www.intel.com/content/www/us...548.1516002678
https://www.youtube.com/watch?v=I0U6ZMeVrB4 |
|
January 15, 2018, 04:48 |
|
#18 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Quote:
It costs less than 900$, compared to nearly 4000$ for the 32-core variant. You could build two systems with the same total number of cores and much higher total performance.

You need 16 DIMMs for this platform. Overclocking memory is no longer a thing with server platforms, so stick to DDR4-2666 maximum. There is no faster reg ECC available anyway.

Unless this is supposed to be a headless node, put in at least a small GPU like a GTX 1050 Ti. A 1200W power supply is a bit on the high side; the system as configured will never draw more than 400W. My power supply is rated for 850W (Seasonic Focus Plus Platinum) only because it has more connectors than the 750W variant.

Speaking of 10G Ethernet: you could give it a try, but in the end you might want to switch to Infiniband if you connect more nodes.

Xeon Phi is not an alternative unless you are running code developed specifically for this platform. Commercial software like Fluent and CFX does not make full use of the potential of this architecture; this is still under development. And even if it did, I highly doubt that it would outperform dual-Epyc for CFD workloads. There may not be many CFD benchmarks available for this type of processor, but that already tells us a lot: if it were actually faster than normal platforms for CFD, Ansys and Intel marketing would not stop bragging about it. |
|
January 15, 2018, 05:08 |
|
#19 |
Member
Ivan
Join Date: Oct 2017
Location: 3rd planet
Posts: 34
Rep Power: 9 |
Thank you!
Some small issues:
1. Do I need water cooling?
2. Server-like horizontal case or a large vertical tower?
3. If water cooling: four 120mm water radiators, one pair per CPU? My 18-core i9 with two 120mm water radiators is up to 110°C after 1 hour of solving. |
|
January 15, 2018, 05:18 |
|
#20 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I am using Noctua NH-U14S TR4-SP3 air coolers. The CPUs themselves run pretty cool thanks to the large surface area (and the soldered heatspreader, which the Intel i9 is lacking), so water cooling is completely unnecessary from a thermal point of view. If you do it for aesthetics or some other reason, go ahead.
The type of case is up to you; it depends on whether you prefer rackmount or workstation. I have a normal E-ATX workstation case - currently a Nanoxia Deep Silence 2, but switching to a Fractal Design Define XL R2 for better build quality. |
|
|
|