
AMD Epyc CFD benchmarks with Ansys Fluent

May 3, 2018, 08:31   #41
SLC (Member)
Quote (Originally Posted by flotus1):
To be honest, these numbers are lower (i.e. higher execution times) than I would expect for the kind of hardware you have, both single-core and parallel, at least when comparing against the results in the initial post here. We used different operating systems and different software versions, so there is that...

I hardly think that poor single-threaded performance is linked to the choice of memory. A single thread usually cannot saturate memory bandwidth on such a system. And even if this were the cause of the issue, the performance difference between single- and dual-rank memory is less than 10%.

Checklist
  • disable SMT in bios
  • also in bios: rank interleaving enabled (edit: well kind of pointless for single-rank DIMMs); channel interleaving enabled; socket interleaving disabled
  • do the processors run at their expected frequencies both in single- and multi-threaded workloads, even for longer periods of time? In Windows you can use CPU-Z for this
  • do the systems reach their expected performance in synthetic benchmarks? I would recommend AIDA64 memory benchmark and Cinebench R15 respectively.
Scaling on two nodes is a different topic that would have to be addressed once we are sure that each node individually performs as expected.
The quoted numbers for my system were with Hyperthreading switched off and the "maximum performance" profile selected in the Dell BIOS.

The CPU cores stay running solidly at 3.9 GHz during all the benchmarks.

I've run the trial version of AIDA64 and Cinebench R15, and the benchmark results are as expected when compared to other published Skylake-SP results.

As for interleaving options in the BIOS, the only interleaving-related parameter I can change is memory "node interleaving", where the default is "disabled" (which means NUMA is enabled). This is the Dell default and the recommended setting when using NUMA-aware OSes/applications.

I think the "problem" I'm seeing is that the single-thread memory bandwidth of Skylake-SP is actually pretty poor: lower than previous-generation Xeons and far lower than Epyc processors. See the table of results here: https://www.anandtech.com/show/11544...-the-decade/12

As a comparison, if I run the fluent benchmark on my laptop (Skylake mobile xeon), I get the following results:

System
CPU: 1x Intel Xeon E3-1535M v5
RAM: 4 x 16GB DDR4-2133 non-ECC
OS: Windows 10 Pro
Fluent: 19.0

1) External Flow Over an Aircraft Wing (aircraft_2m), single precision

INTEL Single Node, 1 core, 10 iterations: 202 s


So my Skylake laptop is actually 14 % faster on a single core than the Xeon Gold 6146.

Last edited by SLC; May 4, 2018 at 05:01.

May 3, 2018, 09:42   #42
Micael (Senior Member, Canada)
System
CPU: i7-4960X (6 cores, OC 4.6 GHz)
RAM: 64 GB DDR3-2133 (8x8GB)
OS: Windows 7
Fluent: 19.0

External Flow Over an Aircraft Wing (aircraft_2m), single precision
1 core, 10 iterations: 135 s
4 cores, 100 iterations: 380 s

May 3, 2018, 09:45   #43
SLC (Member)
Re-ran the benchmarks after a fresh system reboot as a sanity check (no settings changed) and got a better result for the aircraft_2m dual-node 32-core test (down from 87 to 77 seconds). I also tested aircraft_14m on 24 cores on a single node, as well as on 36 cores across two nodes (seeing as this is now the number of cores I can use with 2 HPC packs).


System

CPU: 2x Intel Xeon Gold 6146 (12 cores, 3.9 GHz all-core turbo, 4.2 GHz single-core turbo)
RAM: 12 x 8GB DDR4-2666 ECC (single rank)
Interconnect: 10 GbE
OS: Windows 10 Pro
Fluent: 19.0

1) External Flow Over an Aircraft Wing (aircraft_2m), single precision

INTEL Single Node, 1 core, 10 iterations: 234 s

INTEL Single Node, 24 cores, 100 iterations: 107 s

INTEL Dual Node, 32 cores, 100 iterations: 77 s

INTEL Dual Node, 36 cores, 100 iterations: 68 s


2) External Flow Over an Aircraft Wing (aircraft_14m), double precision

INTEL Single Node, 24 cores, 10 iterations: 141 s

INTEL Dual Node, 24 cores, 10 iterations: 101 s

INTEL Dual Node, 32 cores, 10 iterations: 84 s

INTEL Dual Node, 36 cores, 10 iterations: 77 s

INCORRECT BENCHMARKS, SEE THE UPDATED POST (#45) BELOW.



The time of 141 s for the aircraft_14m single-node 24-core run is a little disconcerting: it compares to the 118.2 s that you got on your Intel system, flotus1...

Edit: stupid question perhaps, but how are you guys actually running the benchmarks? Just opening a Fluent session manually, opening the case file, initializing, and then running the set number of iterations? Or are you running via batch/script? And what are you reporting as the benchmark time?

Last edited by SLC; May 4, 2018 at 07:46.

May 4, 2018, 05:55   #44
flotus1 (Alex, Super Moderator, Germany)
Quote (Originally Posted by SLC):
Edit: stupid question perhaps, but how are you guys actually running the benchmarks? Just opening a Fluent session manually, opening the case file, initializing, and then running the set number of iterations? Or are you running via batch/script? And what are you reporting as the benchmark time?
That is exactly what I did here. Open Fluent manually and load the benchmark case and data. Then in the TUI:
Code:
parallel timer reset
(iterate 10)
---wait for the simulation to finish---
parallel timer usage
I reported the total wall-clock time.
Edit: note that I did not initialize the case, as that would overwrite the data from the benchmark file.

But since we have different operating systems and Fluent versions, comparing results should be done with caution... if at all.

May 4, 2018, 07:45   #45
SLC (Member)
Ok, so that changed things. I had previously been initializing before running iterations.

So, the procedure for others, in case it isn't clear (a scripted version is sketched below):

- Open Fluent manually and load the benchmark case and data.
- Then in the TUI:
Code:
parallel timer reset
(iterate 10 or 100)
---wait for the simulation to finish---
parallel timer usage
- Report the total wall-clock time.
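
For anyone who prefers to script this, below is a rough batch-mode sketch in Python. Treat it as a sketch under stated assumptions, not a definitive recipe: the Fluent launcher is assumed to be on PATH as "fluent", the case file name aircraft_2m.cas and the process count are placeholders, the /exit confirmation syntax can vary by version, and the exact wording of the timer line in the transcript may differ. The journal commands themselves are just the TUI commands listed above.

Code:
# run_bench.py - hedged sketch of a scripted Fluent benchmark run
import re
import subprocess

# Journal: load case+data, reset the parallel timer, iterate,
# print timer usage, quit.
journal = """\
/file/read-case-data aircraft_2m.cas
/parallel/timer/reset
(iterate 100)
/parallel/timer/usage
/exit yes
"""
with open("bench.jou", "w") as f:
    f.write(journal)

# -g: no GUI, -t24: 24 local processes, -i: journal file;
# "3d" = 3D single precision solver ("3ddp" for double precision).
run = subprocess.run(["fluent", "3d", "-g", "-t24", "-i", "bench.jou"],
                     capture_output=True, text=True)

# Pull the total wall-clock line out of the transcript (format assumed).
match = re.search(r"Total wall-clock time.*", run.stdout)
print(match.group(0) if match else "Timer line not found; check transcript.")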

Updated results:

System
CPU: 2x Intel Xeon Gold 6146 (12 cores, 3.9 GHz all-core turbo, 4.2 GHz single-core turbo)
RAM: 12 x 8GB DDR4-2666 ECC (single rank)
Interconnect: 10 GbE
OS: Windows 10 Pro
Fluent: 19.0


1) External Flow Over an Aircraft Wing (aircraft_2m), single precision


INTEL Single Node, 1 core, 10 iterations: 165.6 s

INTEL Single Node, 24 cores, 100 iterations: 95.8 s

INTEL Dual Node, 32 cores, 100 iterations: 67.3 s

INTEL Dual Node, 36 cores, 100 iterations: 60.4 s


2) External Flow Over an Aircraft Wing (aircraft_14m), double precision

INTEL Single Node, 24 cores, 10 iterations: 108.0 s

INTEL Dual Node, 24 cores, 10 iterations: 79.2 s

INTEL Dual Node, 32 cores, 10 iterations: 66.2 s

INTEL Dual Node, 36 cores, 10 iterations: 61.7 s

May 4, 2018, 08:01   #46
flotus1 (Alex, Super Moderator, Germany)
Glad we got that out of the way. So the difference up to this point was that you initialized the simulation again after loading data from benchmark files?

Now to find out how much of a bottleneck 10G Ethernet is, I would run a simulation on a single machine with 18 cores and on both machines with 36 cores. Scaling should be nearly linear (i.e. execution times cut in half) if the case is large enough and the interconnect is not slowing things down.

May 4, 2018, 09:08   #47
SLC (Member)
Quote (Originally Posted by flotus1):
Glad we got that out of the way. So the difference up to this point was that you initialized the simulation again after loading data from benchmark files?

Now to find out how much of a bottleneck 10G Ethernet is, I would run a simulation on a single machine with 18 cores and on both machines with 36 cores. Scaling should be nearly linear (i.e. execution times cut in half) if the case is large enough and the interconnect is not slowing things down.
I never actually loaded the data, only the case file. Then I initialized it, for good measure.

I've run through the "official" benchmark script for Fluent; here are the results (the benchmark is single precision, with 25 timed iterations after 5 untimed iterations first):

Aircraft_wing_14m

[Results table and scaling plot attached below as results.png and results_plot.png]
Note the negative scaling when running on more than 20 cores on one machine (i.e., more than 10 cores per CPU).

"Node scaling" in going from 18 cores on one node to 36 cores on two nodes is 1.96 using the 10 GbE interconnect and Intel MPI. In other words, only 2 % away from perfectly linear scaling.

Out of interest, I disabled the 10 GbE connection and ran over a 1 GbE link instead; performance dropped by just 0.5 % for the 36-core run. So there is not a big difference between 1 GbE and 10 GbE for just two nodes.
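
For anyone reproducing the arithmetic, here is a minimal sketch. The wall times in the example are hypothetical, picked only so that the ratio matches the 1.96 quoted above; the real values are in the attached results.png.

Code:
def node_scaling(t_one_node: float, t_two_nodes: float) -> tuple[float, float]:
    """Return (speedup, parallel efficiency) going from one node to two."""
    speedup = t_one_node / t_two_nodes
    return speedup, speedup / 2.0

# Hypothetical example: 147 s on 18 cores (one node), 75 s on 36 cores (two).
speedup, eff = node_scaling(147.0, 75.0)
print(f"node scaling: {speedup:.2f}x, efficiency: {eff:.0%}")  # 1.96x, 98%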

Do you think I would have gotten perfectly linear scaling with infiniband?

We can compare my results to the results Ansys has published: https://www.ansys.com/solutions/solu...craft-wing-14m

Their 2 x Epyc 7601 32C results are as follows:

Code:
#Test: aircraft_wing_14m  
#Application: Fluent 18.1.0  
#Platform-Short: amd-epyc_7601,2200 
#Platform-Long: AMD white box,EPYC 7601, 64 cores, 2.2 GHz  
#Vendor-File: amd-epyc_7601,2200.txt 
#Details: 128GB_RAM 

#Processes    Machines    Core_Solver_Rating    Core_Solver_Speedup    Core_Solver_Efficiency 
    16             1            422.5455              16.000                     100.00% 
    32             1            639.1714              24.203                      75.63% 
    64             1            840.6714              31.833                      49.74% 
   128             2            1635.5892             61.933                      48.39%

Solver rating comparison:

16 cores

A single Intel Xeon Gold 6146 node: 368.3
A single Epyc 7601 node: 422.5

(Epyc is 14.7 % faster).

32 cores
Dual Intel Xeon Gold 6146 nodes: 724.7
A single Epyc 7601 node: 639.2

(Intel is 13.4 % faster).

I suspect I'm paying a lot of money for that 13.4 % improvement in 32-core performance!!
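
As a sanity check, the speedup and efficiency columns of the quoted table can be recomputed from the rating values alone. The sketch below assumes the rating is a throughput-style metric (higher is better) with the 16-core run as baseline, which is consistent with the numbers in the table.

Code:
# Recompute Core_Solver_Speedup and Core_Solver_Efficiency from the ratings.
ratings = {16: 422.5455, 32: 639.1714, 64: 840.6714, 128: 1635.5892}
base_cores, base_rating = 16, ratings[16]
for cores, rating in ratings.items():
    speedup = rating / base_rating * base_cores
    print(f"{cores:4d} cores: speedup {speedup:7.3f}, "
          f"efficiency {speedup / cores:6.2%}")

# The cross-vendor percentages above follow the same pattern:
print(f"16 cores, Epyc vs Xeon: {422.5 / 368.3 - 1:+.1%}")  # +14.7%
print(f"32 cores, Xeon vs Epyc: {724.7 / 639.2 - 1:+.1%}")  # +13.4%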
Attached images: results.png (21.0 KB), results_plot.png (23.2 KB)

May 4, 2018, 09:45   #48
flotus1 (Alex, Super Moderator, Germany)
Nice writeup!

Quote:
Do you think I would have gotten perfectly linear scaling with infiniband?
Probably. The benefit of InfiniBand over Ethernet, especially with a low number of nodes, is latency, not so much bandwidth. You might even see slightly super-linear node scaling, because with this testing method the number of cells per core decreases when using more nodes.
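
The back-of-the-envelope numbers behind that remark, assuming the aircraft_wing_14m case has roughly 14 million cells:

Code:
# Cells per core for the 18- vs 36-core runs; halving the per-core
# working set lets more of it fit in cache, which is what can push
# node scaling slightly past 2x.
cells = 14e6
for cores in (18, 36):
    print(f"{cores} cores: ~{cells / cores:,.0f} cells per core")
# 18 cores: ~777,778 cells per core; 36 cores: ~388,889 cells per core.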

Quote:
I suspect I'm paying a lot of money for that 13.4 % improvement in 32 core performance!!
Sure, but you knew that.
Although I must say that I had anticipated a slightly higher advantage for the dual-node Intel setup. Now if only AMD had bothered to release a 16-core variant with higher clock speeds.

May 30, 2018, 09:42   #49
Micael (Senior Member, Canada)
Some other interesting benchmarks:

Dual Xeon 6150 (18-core, 2.7 GHz), 12x16GB DDR4-2666
OS: CentOS 7
CPU governor: performance
SMT/Hyperthreading: off

As the new licensing rules now allow adding 4 cores on top of an HPC pack, I ran on all 36 cores as well.


1) External Flow Over an Aircraft Wing (aircraft_2m), single precision

FLUENT R182, 32-cores, 100 iterations: 75.4 s

FLUENT R190, 24-cores, 100 iterations: 88.1 s
FLUENT R190, 32-cores, 100 iterations: 74.3 s
FLUENT R190, 36-cores, 100 iterations: 67.3 s


2) External Flow Over an Aircraft Wing (aircraft_14m), double precision

FLUENT R182, 32-cores, 10 iterations: 73.7 s

FLUENT R190, 24-cores, 10 iterations: 85.3 s
FLUENT R190, 32-cores, 10 iterations: 73.4 s
FLUENT R190, 36-cores, 10 iterations: 70.3 s

June 23, 2018, 06:44   #50
Echidna (Member)
I know this is a forum about CFD, but would it be possible to run some FEA benchmark comparisons between AMD and Intel CPUs?

June 25, 2018, 18:21   #51
RobertB (Senior Member)
AnandTech made some interesting statements about the effect of the cache snooping scheme on OpenFOAM performance; the difference was reported to be around 20%.

https://www.anandtech.com/show/11544...f-the-decade/5

Has anyone else tried this?

July 8, 2018, 13:44   #52
Echidna (Member)
AMD has made some serious steps forward, and Intel is indeed in a very bad situation right now!

But I think that buying a first-generation Epyc at this time is not the best possible decision, unless someone needs a modern system ASAP. Second-generation Epyc is coming in 2019, based on the new 7 nm "Rome" architecture. The "Infinity Fabric" improvements in Gen 2 may make AMD the only viable option for server customers.

And even if 2nd-gen Epyc is too expensive when released, by then you should be able to buy 1st-gen Epyc at a noticeably lower price than today.

July 8, 2018, 16:28   #53
flotus1 (Alex, Super Moderator, Germany)
Waiting for a scheduled AMD release in 2019? Sounds like a bit of a stretch. I learned my lesson while waiting for Epyc 1st-gen availability.
There is always something new and shiny on the horizon of the hardware market, so the waiting game could always be played; I usually advise against it. But I would not wait for an AMD release in particular.
Besides, currently it is not the CPUs that make a CFD workstation expensive. 2x 16-core Epyc 7301: $1800. 16x16GB DDR4: $3000. And RAM prices probably won't come down in the foreseeable future.

July 8, 2018, 16:50   #54
Echidna (Member)
You're right that the waiting game with high-tech products is endless, but if Infinity Fabric is indeed improved in Gen-2 Epyc, then maybe the wait will be worth it.

As I am in the market for a new system, and I am still not 100% convinced that Epyc really beats Intel (even in price/performance, given that you can source some relatively cheap refurbished Xeons): would it be possible to send you an Ansys Mechanical benchmark file for a comparison between Epyc and E5 v4? If you can do this, please send me a PM.

July 9, 2018, 07:42   #55
flotus1 (Alex, Super Moderator, Germany)
Unfortunately, I don't have an Ansys license anymore.

November 12, 2018, 06:33   #56
o_mars_2010 (Osman, Member, Japan)
Hi flotus1,

Would you recommend the Ryzen Threadripper 2950X or the i9-9900K for CFD with ANSYS? I would really appreciate your reply and recommendations.
Thanks in advance.
