
AMD Genoa best configuration for 128 cores Fluent/Mechanical licenses

February 13, 2023, 06:10   #1
New Member
 
Chefbouza
Join Date: Oct 2021
Posts: 10
Dear members,

I have Fluent and Mechanical licenses with 3 HPC Packs, which allow parallel sessions of up to 132 cores. The new AMD Epyc Genoa processors are now available, and my company wants to invest in new hardware that best fits our license pack. Our main use is Fluent; Mechanical is used less frequently.
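If I counted correctly, this matches the usual HPC Pack scaling. The rule below is my assumption (n packs enable 2*4^n parallel cores, plus the 4 cores bundled with the solver license), but it reproduces the 132 quoted to us:

Code:
# Hedged sanity check of the licensed core count; the scaling rule
# (n HPC Packs -> 2 * 4**n cores, plus 4 cores included with the
# solver license) is my assumption, not official ANSYS documentation.
def max_cores(hpc_packs, base_cores=4):
    return base_cores + 2 * 4**hpc_packs

print(max_cores(3))  # -> 132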

The benchmarks show that the Genoa 9374F, with 32 cores per CPU, is the best per-core candidate for CFD/FEM workloads (high memory bandwidth and high frequency per core). One can therefore imagine that a configuration of 2 nodes (linked via InfiniBand), each with dual 9374F CPUs, is the best one.
CFD is our main use, and the peak memory bandwidth of the Genoa series is about 460 GB/s per socket. With this configuration, each memory channel therefore gets about 38.3 GB/s (460/12), and each core gets about 14.4 GB/s (460/32).
The base clock of the 9374F is 3.85 GHz (AMD data).
The downside of this configuration is the need for two nodes linked with InfiniBand, so I wonder whether an alternative single-node configuration would be suitable.

The alternative is a dual 9554 configuration with 64 cores per socket. That leads to a memory bandwidth of about 7.2 GB/s per core (460/64). The base clock of the 9554 is 3.1 GHz (AMD data).

So, to sum up:

Configuration                    Memory bandwidth per core    Base clock
2 nodes, dual 9374F each         14.4 GB/s                    3.85 GHz
1 node, dual 9554                 7.2 GB/s                    3.1 GHz
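For transparency, the per-core numbers above come from this trivial calculation (assuming the ~460 GB/s theoretical DDR5-4800 peak per socket; measured bandwidth will be lower):

Code:
# Back-of-the-envelope memory bandwidth per core for one Genoa socket.
PEAK_GBPS = 460.0     # theoretical peak per socket, 12 x DDR5-4800 channels
CHANNELS = 12

print(f"per channel: {PEAK_GBPS / CHANNELS:.1f} GB/s")    # ~38.3 GB/s
for name, cores in [("9374F", 32), ("9554", 64)]:
    print(f"{name}: {PEAK_GBPS / cores:.1f} GB/s per core")
# 9374F: 14.4 GB/s per core
# 9554:   7.2 GB/s per core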

I don’t have the prices yet, but one can expect the 2-node configuration to be considerably more expensive. So my question is: is the 2-node configuration worth choosing over the single-node one?

Thank you in advance for your advice and your help!

February 13, 2023, 07:00   #2
Super Moderator
 
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Coincidentally, AMD+Ansys have published benchmarks for exactly the CPUs you are looking for:
https://www.amd.com/system/files/doc...nerational.pdf

They show a 20-30% performance lead for 2x9554 vs 2x9374F.
It is a safe assumption that two nodes with 2x9374F each will roughly double the performance of a single such node. That tells you how much faster your simulations will run with two 64-core nodes vs. a single 128-core node.
It is up to you whether that much of a performance uplift is worth the increased hardware costs. But generally speaking, factoring in license costs and how much the engineers working with that system are paid, faster hardware is usually worth it.

February 13, 2023, 08:26   #3
New Member
 
Chefbouza
Join Date: Oct 2021
Posts: 10
Thank you very much, Alex!

February 19, 2023, 02:45   #4
Senior Member
 
Dongyue Li
Join Date: Jun 2012
Location: Beijing, China
Posts: 849
Yes. Go for two nodes.

I would suggest using two nodes and connecting them simply via Ethernet. Two workstations (64+64 cores) are much better than one workstation (128 cores): the former can achieve nearly a 2x speed-up, while a single 128-core machine never reaches 2x the speed of a 64-core machine of the same generation.

When I say nearly 2x, it depends on your communication setup. You can simply use the 10G Ethernet provided by the motherboard; it achieves a 1.8-1.95x speed-up, much like in the document ANSYS provides. For our own products, anything below 1.8x would be considered a failure, so expect well above 1.8x.

You can also choose two InfiniBand cards; at most they achieve a 2x speed-up, no more (in ANSYS's document, 2.04 is effectively 2, never more than 2.1). I would prefer the Ethernet ports provided by the motherboard, since they are much cheaper.
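If you want to check the scaling yourself, time the same case on one node and on both, and compute the speed-up (the wall-clock times below are made-up placeholders, not benchmark data):

Code:
# Estimate inter-node scaling efficiency from two timed runs of the same case.
t_one_node  = 1000.0   # seconds on one node (2x 9374F, 64 cores) -- placeholder
t_two_nodes =  525.0   # seconds on two nodes (128 cores)         -- placeholder

speedup = t_one_node / t_two_nodes
print(f"speed-up: {speedup:.2f}x, efficiency: {speedup / 2:.0%}")
# speed-up: 1.90x, efficiency: 95% -> within the 1.8-1.95x Ethernet range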
__________________
My OpenFOAM algorithm website: http://dyfluid.com
By far the largest Chinese CFD-based forum: http://www.cfd-china.com/category/6/openfoam
We provide lots of clusters to Chinese customers, and we are considering doing business overseas: http://dyfluid.com/DMCmodel.html

February 20, 2023, 04:58   #5
New Member
 
Chefbouza
Join Date: Oct 2021
Posts: 10
Quote:
Originally Posted by sharonyue View Post
Yes. Go for two nodes. [...]
Thank you, sharonyue, for your advice. In fact, I will go with an InfiniBand card, since the budget allows it.

February 20, 2023, 05:07   #6
New Member
 
Chefbouza
Join Date: Oct 2021
Posts: 10
I have a follow-up question concerning this configuration.

As I have 3 ANSYS HPC Pack licenses, the natural target is a 128-core hardware configuration. But I wonder whether it would be better to choose 2 nodes of 2x 9474F rather than 9374F. This would give a configuration of 192 cores rather than 128.
The advantage is that this hardware could serve other tasks at the same time as a 128-core ANSYS run, like parallel optimization of Matlab Simulink models with a high number of parallel designs.
My question is: assuming the budget is not limited, what would be the difference on ANSYS models (mainly CFD) between 128 cores of 9374F and 128 cores of 9474F?

Thank you in advance for your help

February 21, 2023, 04:15   #7
Super Moderator
 
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Pretty much zero difference between these CPUs when running the same core count, provided the threads are distributed similarly across all 8 CCDs, which Fluent should be able to handle.

Just to avoid nasty surprises: leftover "free" cores are nice, but don't expect them to be actually free. When you use them for other heavy lifting with Matlab, it will slow down both the Matlab runs and the Fluent run. That's because shared CPU resources -like last-level caches and memory bandwidth- are almost fully utilized by a Fluent simulation on 128 cores.
Additionally, I am not sure these CPUs are ideal for Matlab/Simulink. It's probably fine if your parallel optimization spawns several tasks that run independently, on a single core each.

February 21, 2023, 04:25   #8
New Member
 
Chefbouza
Join Date: Oct 2021
Posts: 10
Thank you Alex!

Quote:
Originally Posted by flotus1 View Post
Additionally, I am not sure these CPUs are ideal for Matlab/Simulink. It's probably fine if your parallel optimization spawns several tasks that run independently, on a single core each.
In my understanding, when you use the MathWorks Parallel Computing Toolbox, it creates a pool with a chosen number of workers, and each worker handles one design in parallel with the others, inside a "parfor" loop for example. The same applies to Simulink models with the "parsim" feature. Therefore, the more cores we have available (not necessarily free), the larger the pool of workers can be.

February 21, 2023, 05:01   #9
Super Moderator
 
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
That sounds like the ideal case.
I am by no means an expert with Matlab/Simulink; just something you might want to be aware of: https://de.mathworks.com/matlabcentr...n-amd-epyc-cpu
No idea if this has been fixed by now, or what caused the issues in the first place.

February 21, 2023, 09:53   #10
New Member
 
Chefbouza
Join Date: Oct 2021
Posts: 10
Quote:
Originally Posted by flotus1 View Post
[...] just something you might want to be aware of: https://de.mathworks.com/matlabcentr...n-amd-epyc-cpu
Thank you Alex!
I will check this.

February 22, 2023, 12:45   #11
Member
 
Matt
Join Date: May 2011
Posts: 44
Different solver, but I operate a 128-core CFD cluster composed of two 2P EPYC Rome nodes connected via 100 Gbps InfiniBand. I went with InfiniBand on the recommendation of the software vendor, and a gut feeling that the vastly reduced latency would help the explicit solver I use, which can run many time steps per second.

But when I monitored the actual network throughput over the InfiniBand adapters, it was shockingly low (less than 1 Gbps), even for a simulation with over a billion cells. So I would advise against buying top-of-the-line InfiniBand adapters; you can pick up surplus 40 Gbps InfiniBand adapters for a fraction of the price of new 100 Gbps+ parts.

I also have a sneaking suspicion that 10 Gbps Ethernet would be more than sufficient. You could always set that up first, and if your CPUs are obviously waiting on network traffic (evidenced by sub-98% utilization), then consider surplus InfiniBand interconnects. But don't spend thousands on the latest InfiniBand hardware for a 2-node cluster like I did.
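For what it's worth, this is roughly how the throughput can be sampled on Linux (a minimal sketch; the adapter name mlx5_0 and port 1 are assumptions, check /sys/class/infiniband/ on your nodes):

Code:
# Sample the InfiniBand receive counter over a 10 s window mid-solve.
# The sysfs data counters tick in 32-bit words, so multiply by 4 for bytes.
import time

COUNTER = "/sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data"

def read_words(path):
    with open(path) as f:
        return int(f.read())

before = read_words(COUNTER)
time.sleep(10)
after = read_words(COUNTER)

gbps = (after - before) * 4 * 8 / 10 / 1e9   # words -> bytes -> bits per second
print(f"average receive throughput: {gbps:.2f} Gbps")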
wkernkamp likes this.