
AMD Epyc CFD benchmarks with Ansys Fluent

May 3, 2018, 08:31   #41
SLC (Member)
Quote (Originally Posted by flotus1):
To be honest, these numbers are lower (i.e. higher execution times) than I would expect for the kind of hardware you have, both single-core and parallel, at least when comparing against the results in the initial post here. We used different operating systems and different software versions, so there is that...

I hardly think that poor single-threaded performance is linked to the choice of memory. A single thread usually cannot saturate memory bandwidth on such a system. And even if this were the cause of the issue, the performance difference between single- and dual-rank memory is less than 10%.

Checklist
  • disable SMT in bios
  • also in bios: rank interleaving enabled (edit: well kind of pointless for single-rank DIMMs); channel interleaving enabled; socket interleaving disabled
  • do the processors run at their expected frequencies both in single- and multi-threaded workloads, even for longer periods of time? In Windows you can use CPU-Z for this
  • do the systems reach their expected performance in synthetic benchmarks? I would recommend AIDA64 memory benchmark and Cinebench R15 respectively.
Scaling on two nodes is a different topic that would have to be addressed once we are sure that each node individually performs as expected.
The quoted numbers for my system were with Hyperthreading switched off and the "maximum performance" profile selected in the Dell BIOS.

The CPU cores stay running solidly at 3.9 GHz during all the benchmarks.

I've run the trial version of AIDA64 and Cinebench R15, and the benchmark results are as expected when compared to other published Skylake-SP results.

As for interleaving options in the BIOS, the only interleaving-related parameter I can change is memory "node interleaving", where the default is "disabled" (which means NUMA is enabled). This is the Dell default and the recommended setting when using NUMA-aware OSes/applications.

I think the "problem" I'm seeing is that the single-thread memory bandwidth of Skylake-SP is actually pretty poor: lower than previous-generation Xeons and far lower than Epyc processors. See the table of results here: https://www.anandtech.com/show/11544...-the-decade/12

As a comparison, if I run the fluent benchmark on my laptop (Skylake mobile xeon), I get the following results:

System
CPU: 1x Intel Xeon E3-1535M v5
RAM: 4 x 16GB DDR4-2133 non-ECC
OS: Windows 10 Pro
Fluent: 19.0

1) External Flow Over an Aircraft Wing (aircraft_2m), single precision

INTEL Single Node, 1 core, 10 iterations: 202 s


So my Skylake laptop is actually 14 % faster on a single core than the Xeon Gold 6146.

Last edited by SLC; May 4, 2018 at 05:01.

May 3, 2018, 09:42   #42
Micael (Senior Member, Canada)
System
CPU: i7-4960X (6 cores, OC 4.6 GHz)
RAM: 64 GB DDR3-2133 (8x8GB)
OS: Windows 7
Fluent: 19.0

External Flow Over an Aircraft Wing (aircraft_2m), single precision
1 core, 10 iterations: 135 s
4 cores, 100 iterations: 380 s

May 3, 2018, 09:45   #43
SLC (Member)
Re-ran the benchmarks after a fresh system reboot as a sanity check (no settings changed) and got a better result for the aircraft_2m dual-node 32-core test (down from 87 to 77 seconds). I also tested aircraft_14m on 24 cores on a single node, as well as on 36 cores across two nodes (seeing as this is now the number of cores I can use with 2 HPC packs).


System

CPU: 2x Intel Xeon Gold 6146 (12 cores, 3.9 GHz all-core turbo, 4.2 GHz single-core turbo)
RAM: 12 x 8GB DDR4-2666 ECC (single rank)
Interconnect: 10 GbE
OS: Windows 10 Pro
Fluent: 19.0

1) External Flow Over an Aircraft Wing (aircraft_2m), single precision

INTEL Single Node, 1 core, 10 iterations: 234 s

INTEL Single Node, 24 cores, 100 iterations: 107 s

INTEL Dual Node, 32 cores, 100 iterations: 77 s

INTEL Dual Node, 36 cores, 100 iterations: 68 s


2) External Flow Over an Aircraft Wing (aircraft_14m), double precision

INTEL Single Node, 24 cores, 10 iterations: 141 s

INTEL Dual Node, 24 cores, 10 iterations: 101 s

INTEL Dual Node, 32 cores, 10 iterations: 84 s

INTEL Dual Node, 36 cores, 10 iterations: 77 s

INCORRECT BENCHMARKS, SEE THE UPDATED POST (#45) BELOW.



The time of 141 s for the aircraft_14m single-node 24-core run is a little disconcerting: it compares to the 118.2 s that you got on your Intel system, flotus1...

Edit: stupid question perhaps, but how are you guys actually running the benchmarks? Just opening a Fluent session manually, opening the case file, initializing, and then running the set number of iterations? Or are you running via batch/script? And what are you reporting as the benchmark time?

Last edited by SLC; May 4, 2018 at 07:46.

May 4, 2018, 05:55   #44
flotus1 (Alex, Super Moderator, Germany)
Quote (Originally Posted by SLC):
Edit: stupid question perhaps, but how are you guys actually running the benchmarks? Just opening a Fluent session manually, opening the case file, initializing, and then running the set number of iterations? Or are you running via batch/script? And what are you reporting as the benchmark time?
That is exactly what I did here. Open Fluent manually and load the benchmark case and data. Then in the TUI:
Code:
parallel timer reset
(iterate 10)
---wait for the simulation to finish---
parallel timer usage
I reported the total wall-clock time.
Edit: note that I did not initialize the case, as that would overwrite the data from the benchmark file.

But since we have different operating systems and Fluent versions, comparing results should be done with caution... if at all.

May 4, 2018, 07:45   #45
SLC (Member)
Ok, so that changed things. I had previously been initializing before running iterations.

So, the procedure for others, in case it isn't clear (a scripted version is sketched below):

- Open Fluent manually and load the benchmark case and data.
- Then in the TUI:
Code:
parallel timer reset
(iterate 10 or 100)
---wait for the simulation to finish---
parallel timer usage
- Report the total wall-clock time.
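
For anyone who prefers to script this, below is a rough batch-mode sketch in Python. Treat it as a sketch under stated assumptions, not a definitive recipe: the Fluent launcher is assumed to be on PATH as "fluent", the case file name aircraft_2m.cas and the process count are placeholders, the /exit confirmation syntax can vary by version, and the exact wording of the timer line in the transcript may differ. The journal commands themselves are just the TUI commands listed above.

Code:
# run_bench.py - hedged sketch of a scripted Fluent benchmark run
import re
import subprocess

# Journal: load case+data, reset the parallel timer, iterate,
# print timer usage, quit.
journal = """\
/file/read-case-data aircraft_2m.cas
/parallel/timer/reset
(iterate 100)
/parallel/timer/usage
/exit yes
"""
with open("bench.jou", "w") as f:
    f.write(journal)

# -g: no GUI, -t24: 24 local processes, -i: journal file;
# "3d" = 3D single precision solver ("3ddp" for double precision).
run = subprocess.run(["fluent", "3d", "-g", "-t24", "-i", "bench.jou"],
                     capture_output=True, text=True)

# Pull the total wall-clock line out of the transcript (format assumed).
match = re.search(r"Total wall-clock time.*", run.stdout)
print(match.group(0) if match else "Timer line not found; check transcript.")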

Updated results:

System
CPU: 2x Intel Xeon Gold 6146 (12 cores, 3.9 GHz all-core turbo, 4.2 GHz single-core turbo)
RAM: 12 x 8GB DDR4-2666 ECC (single rank)
Interconnect: 10 GbE
OS: Windows 10 Pro
Fluent: 19.0


1) External Flow Over an Aircraft Wing (aircraft_2m), single precision


INTEL Single Node, 1 core, 10 iterations: 165.6 s

INTEL Single Node, 24 cores, 100 iterations: 95.8 s

INTEL Dual Node, 32 cores, 100 iterations: 67.3 s

INTEL Dual Node, 36 cores, 100 iterations: 60.4 s


2) External Flow Over an Aircraft Wing (aircraft_14m), double precision

INTEL Single Node, 24 cores, 10 iterations: 108.0 s

INTEL Dual Node, 24 cores, 10 iterations: 79.2 s

INTEL Dual Node, 32 cores, 10 iterations: 66.2 s

INTEL Dual Node, 36 cores, 10 iterations: 61.7 s

May 4, 2018, 08:01   #46
flotus1 (Alex, Super Moderator, Germany)
Glad we got that out of the way. So the difference up to this point was that you initialized the simulation again after loading data from benchmark files?

Now to find out how much of a bottleneck 10G Ethernet is, I would run a simulation on a single machine with 18 cores and on both machines with 36 cores. Scaling should be nearly linear (i.e. execution times cut in half) if the case is large enough and the interconnect is not slowing things down.

May 4, 2018, 09:08   #47
SLC (Member)
Quote (Originally Posted by flotus1):
Glad we got that out of the way. So the difference up to this point was that you initialized the simulation again after loading data from benchmark files?

Now to find out how much of a bottleneck 10G Ethernet is, I would run a simulation on a single machine with 18 cores and on both machines with 36 cores. Scaling should be nearly linear (i.e. execution times cut in half) if the case is large enough and the interconnect is not slowing things down.
I never actually loaded the data, only the case file. Then I initialized it, for good measure.

I've run through the "official" benchmark script for Fluent; here are the results (the benchmark is single precision, with 25 timed iterations after 5 untimed iterations first):

Aircraft_wing_14m

[Results table and scaling plot attached below as results.png and results_plot.png]
Note the negative scaling when running on more than 20 cores on one machine (i.e., more than 10 cores per CPU).

"Node scaling" in going from 18 cores on one node to 36 cores on two nodes is 1.96 using the 10 GbE interconnect and Intel MPI. In other words, only 2 % away from perfectly linear scaling.

Out of interest, I disabled the 10 GbE connection and ran over a 1 GbE link instead; performance dropped by just 0.5 % for the 36-core run. So there is not a big difference between 1 GbE and 10 GbE for just two nodes.
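
For anyone reproducing the arithmetic, here is a minimal sketch. The wall times in the example are hypothetical, picked only so that the ratio matches the 1.96 quoted above; the real values are in the attached results.png.

Code:
def node_scaling(t_one_node: float, t_two_nodes: float) -> tuple[float, float]:
    """Return (speedup, parallel efficiency) going from one node to two."""
    speedup = t_one_node / t_two_nodes
    return speedup, speedup / 2.0

# Hypothetical example: 147 s on 18 cores (one node), 75 s on 36 cores (two).
speedup, eff = node_scaling(147.0, 75.0)
print(f"node scaling: {speedup:.2f}x, efficiency: {eff:.0%}")  # 1.96x, 98%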

Do you think I would have gotten perfectly linear scaling with infiniband?

We can compare my results to the results Ansys has published: https://www.ansys.com/solutions/solu...craft-wing-14m

Their 2 x Epyc 7601 32C results are as follows:

Code:
#Test: aircraft_wing_14m  
#Application: Fluent 18.1.0  
#Platform-Short: amd-epyc_7601,2200 
#Platform-Long: AMD white box,EPYC 7601, 64 cores, 2.2 GHz  
#Vendor-File: amd-epyc_7601,2200.txt 
#Details: 128GB_RAM 

#Processes    Machines    Core_Solver_Rating    Core_Solver_Speedup    Core_Solver_Efficiency 
    16             1            422.5455              16.000                     100.00% 
    32             1            639.1714              24.203                      75.63% 
    64             1            840.6714              31.833                      49.74% 
   128             2            1635.5892             61.933                      48.39%

Solver rating comparison:

16 cores

A single Intel Xeon Gold 6146 node: 368.3
A single Epyc 7601 node: 422.5

(Epyc is 14.7 % faster).

32 cores
Dual Intel Xeon Gold 6146 nodes: 724.7
A single Epyc 7601 node: 639.2

(Intel is 13.4 % faster).

I suspect I'm paying a lot of money for that 13.4 % improvement in 32-core performance!!
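
As a sanity check, the speedup and efficiency columns of the quoted table can be recomputed from the rating values alone. The sketch below assumes the rating is a throughput-style metric (higher is better) with the 16-core run as baseline, which is consistent with the numbers in the table.

Code:
# Recompute Core_Solver_Speedup and Core_Solver_Efficiency from the ratings.
ratings = {16: 422.5455, 32: 639.1714, 64: 840.6714, 128: 1635.5892}
base_cores, base_rating = 16, ratings[16]
for cores, rating in ratings.items():
    speedup = rating / base_rating * base_cores
    print(f"{cores:4d} cores: speedup {speedup:7.3f}, "
          f"efficiency {speedup / cores:6.2%}")

# The cross-vendor percentages above follow the same pattern:
print(f"16 cores, Epyc vs Xeon: {422.5 / 368.3 - 1:+.1%}")  # +14.7%
print(f"32 cores, Xeon vs Epyc: {724.7 / 639.2 - 1:+.1%}")  # +13.4%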
Attached images: results.png (21.0 KB), results_plot.png (23.2 KB)

May 4, 2018, 09:45   #48
flotus1 (Alex, Super Moderator, Germany)
Nice writeup!

Quote:
Do you think I would have gotten perfectly linear scaling with infiniband?
Probably. The benefit of InfiniBand over Ethernet, especially with a low number of nodes, is latency, not so much bandwidth. You might even see slightly super-linear node scaling, because with this testing method the number of cells per core decreases when using more nodes.
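
The back-of-the-envelope numbers behind that remark, assuming the aircraft_wing_14m case has roughly 14 million cells:

Code:
# Cells per core for the 18- vs 36-core runs; halving the per-core
# working set lets more of it fit in cache, which is what can push
# node scaling slightly past 2x.
cells = 14e6
for cores in (18, 36):
    print(f"{cores} cores: ~{cells / cores:,.0f} cells per core")
# 18 cores: ~777,778 cells per core; 36 cores: ~388,889 cells per core.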

Quote:
I suspect I'm paying a lot of money for that 13.4 % improvement in 32 core performance!!
Sure, but you knew that.
Although I must say that I had anticipated a slightly higher advantage for the dual-node Intel setup. Now if only AMD had bothered to release a 16-core variant with higher clock speeds.

May 30, 2018, 09:42   #49
Micael (Senior Member, Canada)
Some other interesting benchmarks:

Dual Xeon 6150 (18-core, 2.7 GHz), 12x16GB DDR4-2666
OS: CentOS 7
CPU governor: performance
SMT/Hyperthreading: off

As the new licensing rules now allow adding 4 cores on top of an HPC pack, I ran on all 36 cores as well.


1) External Flow Over an Aircraft Wing (aircraft_2m), single precision

FLUENT R182, 32-cores, 100 iterations: 75.4 s

FLUENT R190, 24-cores, 100 iterations: 88.1 s
FLUENT R190, 32-cores, 100 iterations: 74.3 s
FLUENT R190, 36-cores, 100 iterations: 67.3 s


2) External Flow Over an Aircraft Wing (aircraft_14m), double precision

FLUENT R182, 32-cores, 10 iterations: 73.7 s

FLUENT R190, 24-cores, 10 iterations: 85.3 s
FLUENT R190, 32-cores, 10 iterations: 73.4 s
FLUENT R190, 36-cores, 10 iterations: 70.3 s

June 23, 2018, 06:44   #50
Echidna (Member)
I know this is a forum about CFD, but would it be possible to run some FEA benchmark comparisons between AMD and Intel CPUs?

June 25, 2018, 18:21   #51
RobertB (Senior Member)
AnandTech made some interesting statements about the effect of the cache snooping scheme on OpenFOAM performance; the difference was reported to be around 20%.

https://www.anandtech.com/show/11544...f-the-decade/5

Has anyone else tried this?

July 8, 2018, 13:44   #52
Echidna (Member)
AMD has made some serious steps forward, and Intel is indeed in a very bad situation right now!

But I think that buying a first-generation Epyc at this time is not the best possible decision, unless someone needs a modern system ASAP. Second-generation Epyc is coming in 2019, based on the new 7 nm "Rome" architecture. The "Infinity Fabric" improvements in Gen 2 may make AMD the only viable option for server customers.

And even if 2nd-gen Epyc is too expensive when released, by then you should be able to buy 1st-gen Epyc at a noticeably lower price than today.

July 8, 2018, 16:28   #53
flotus1 (Alex, Super Moderator, Germany)
Waiting for a scheduled AMD release in 2019? Sounds like a bit of a stretch. I learned my lesson while waiting for Epyc 1st-gen availability.
There is always something new and shiny on the horizon of the hardware market, so the waiting game could always be played; I usually advise against it. But I would not wait for an AMD release in particular.
Besides, currently it is not the CPUs that make a CFD workstation expensive. 2x 16-core Epyc 7301: $1800. 16x16GB DDR4: $3000. And RAM prices probably won't come down in the foreseeable future.

July 8, 2018, 16:50   #54
Echidna (Member)
You're right that the waiting game with high-tech products is endless, but if Infinity Fabric is indeed improved in Gen-2 Epyc, then maybe the wait will be worth it.

As I am in the market for a new system, and I am still not 100% convinced that Epyc really beats Intel (even in price/performance, given that you can source some relatively cheap refurbished Xeons): would it be possible to send you an Ansys Mechanical benchmark file for a comparison between Epyc and E5 v4? If you can do this, please send me a PM.

July 9, 2018, 07:42   #55
flotus1 (Alex, Super Moderator, Germany)
Unfortunately, I don't have an Ansys license anymore.

November 12, 2018, 06:33   #56
o_mars_2010 (Osman, Member, Japan)
Hi flotus1,

Would you recommend the Ryzen Threadripper 2950X or the i9-9900K for CFD with ANSYS? I would really appreciate your reply and recommendations.
Thanks in advance.
