
Ansys Fluent Multi-Node Bottleneck Question


Old   April 16, 2023, 05:33
Default Ansys Fluent Multi-Node Bottleneck Question
  #1
New Member
 
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Tyrmida is on a distinguished road
Good Day,

I am an IT infrastructure provider, so please excuse me if I don't know the ins and outs of CFD.

I have a client who is using Ansys Fluent to run simulations. For large projects they currently use a local HPC provider to speed up the processing, because it comes at 1/4 the cost of the big international names.

They are currently running a large project which will take 3 months to solve using 72 cores across 6 nodes, and I would like to offer them an alternative solution to speed this up. We have a large data center with a lot of free capacity and could work something out for them on our hardware.

What I don't understand properly is why they can't solve substantially faster using more cores. They pay per CPU-core-hour, which to my CFD-inexperienced mind doesn't add up: if you double the number of nodes and thus cores, shouldn't you be able to solve in half the time, minus a little overhead?

The senior engineer (a guy who really knows his stuff, even on the IT side) has told me that the efficiency drops substantially when adding more cores/nodes, so it isn't worth it (in cost or time).

How can I help them get the same efficiency on, for example, 144 cores?

Of course I don't have access to their current HPC provider, but I have some details. They are using:

PowerEdge C6320 servers
Xeon E5-2690v3 12-core CPUs
Some nodes have 128 GB and some 64 GB - I don't know the exact configuration, but I assume they were clever enough to populate all the memory channels
Interconnect is FDR InfiniBand

I don't have the detailed specifics of their model. All I know right now is that the interval they're solving is in milliseconds, so as I understand it there are a lot of iterations, which is why it is taking so long.

My question:

1. Where would the bottleneck possibly be if they were to use more nodes and therefore more cores? Is it the FDR IB that is limiting them? That is the only thing that makes sense to me. Would the same hardware on EDR IB sort this out?

2. What other information about what they are trying to solve can I ask for that will help get better guidance on where the bottleneck is?

License count is not a factor.

Thank you in advance - I hope this question was appropriate and made sense.
Tyrmida is offline   Reply With Quote

Old   April 16, 2023, 08:16
Default
  #2
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura about
Quote:
1. Where would the bottleneck possibly be if they were to use more nodes and therefore more cores? Is it the FDR IB that is limiting them? That is the only thing that makes sense to me. Would the same hardware on EDR IB sort this out?
Highly unlikely that the node interconnect is the problem here. FDR is plenty fast for this generation of hardware, with such low core counts.
If you need 100% certainty on this, you would have to monitor network traffic. Or compare intra-node vs. inter-node scaling for this case.

Quote:
2. What other information about what they are trying to solve can I ask for that will help get better guidance on where the bottleneck is?
Assuming their current hardware is set up properly, I see a few reasons that could cause poor scaling to this extent:
1) Extremely low cell count. Something in the order of 1 million cells or lower. In which case, the only hardware "solution" is running on the latest and greatest CPUs, with the highest clock speed. This obviously has limited potential.
2) Unnecessary file-I/O to slow storage. Like writing solution data at every time step or worse. Hardware solution could be using faster flash storage. But realistically, such a problem needs to be fixed by doing less file-I/O.
3) Serialization of the code. For example, if they are using Fluent UDFs that nobody bothered to parallelize. Again, the only hardware "solution" would be to run on the fastest available CPUs, with limited potential.
4) Lots of mesh interfaces, which are updated frequently. These have a tendency to scale poorly. Again, no real hardware solution here other than running on the fastest possible CPUs. Which won't fix the scaling issue, only decrease run time a bit.
5) A nasty edge-case where the default partitioning schemes used by Fluent produce bad results. Not something you can fix via hardware.

In short, you can't fix their scaling issue with a hardware solution.
You can decrease run time a bit with faster CPUs. And faster storage if file-I/O is an issue.
Getting better scaling is a software issue. They might want to get in contact with Ansys support directly to get that fixed.

Last edited by flotus1; April 16, 2023 at 09:30.
flotus1 is offline   Reply With Quote

Old   April 17, 2023, 02:41
Default
  #3
New Member
 
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Tyrmida is on a distinguished road
Good Day,

Thank you very much for your insight and guidance. I have had a word with the engineer and determined the following:

Quote:
1) Extremely low cell count. Something in the order of 1 million cells or lower. In which case, the only hardware "solution" is running on the latest and greatest CPUs, with the highest clock speed. This obviously has limited potential.
About 10 million, so definitely not that.

Quote:
2) Unnecessary file-I/O to slow storage. Like writing solution data at every time step or worse. Hardware solution could be using faster flash storage. But realistically, such a problem needs to be fixed by doing less file-I/O.
They are writing data every hour (which takes 3 minutes to write) because the HPC provider they are currently using has random power cuts or stops their work on a schedule for higher-priority jobs. I have no idea what their storage performance is like. I know they're using a Lustre filesystem, and it would be incredibly strange if it were slow, but you never know.

Quote:
3) Serialization of the code. For example, if they are using Fluent UDFs that nobody bothered to parallelize. Again, the only hardware "solution" would be to run on the fastest available CPUs, with limited potential.
I asked, and it's not their problem.

Quote:
4) Lots of mesh interfaces, which are updated frequently. These have a tendency to scale poorly. Again, no real hardware solution here other than running on the fastest possible CPUs. Which won't fix the scaling issue, only decrease run time a bit.
I asked, and they said they don't have an unusually high number of mesh interfaces in the model.

Quote:
5) A nasty edge-case where the default partitioning schemes used by Fluent produce bad results. Not something you can fix via hardware.
According to them, also not their problem.

They gave me the old benchmarks that they ran at the HPC provider quite a while ago (using truck_poly_14m), but those were done on an old Fluent version. I have asked them to run the aircraft wing benchmark for me so that I have something to compare with here.

But even on that old benchmark, these were their results per process count:

32 cores: 6.4049 s wall time / iteration
64 cores: 3.7408 s wall time / iteration
128 cores: 2.3823 s wall time / iteration
256 cores: 2.4047 s wall time / iteration

So it does look like there's some kind of a bottleneck around 80 cores.
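
For reference, a small Python sketch of the speedup and parallel efficiency behind these numbers, using the 32-core run as the baseline (the timings are the ones listed above):

Code:
# Speedup and parallel efficiency of the truck_poly_14m runs above,
# relative to the 32-core result.
wall_time = {32: 6.4049, 64: 3.7408, 128: 2.3823, 256: 2.4047}  # seconds per iteration

base_cores = 32
base_time = wall_time[base_cores]

for cores in sorted(wall_time):
    speedup = base_time / wall_time[cores]
    efficiency = speedup / (cores / base_cores)
    print(f"{cores:4d} cores: speedup {speedup:4.2f}x, efficiency {efficiency:5.1%}")

# Roughly: 100% at 32, ~86% at 64, ~67% at 128 and ~33% at 256 cores.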

I have asked them to run the benchmark on the current version at the HPC provider, and I will then compare it to what I have in my lab, because this lack of scalability is crazy. I feel really bad for people in your industry having to sit around waiting for months for simulations to run.

I will update when I have up-to-date benchmarks from their HPC provider and my lab to compare. Thank you very much.
Tyrmida is offline   Reply With Quote

Old   April 17, 2023, 12:12
Default
  #4
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura about
Quote:
They gave me the old benchmarks that they ran at the HPC provider quite a while ago (using truck_poly_14m), but those were done on an old Fluent version. I have asked them to run the aircraft wing benchmark for me so that I have something to compare with here.

But even on that old benchmark, these were their results per process count:

32 cores: 6.4049 s wall time / iteration
64 cores: 3.7408 s wall time / iteration
128 cores: 2.3823 s wall time / iteration
256 cores: 2.4047 s wall time / iteration
I think that solves it. There is something seriously wrong with the hardware or setup of their current HPC provider.
This benchmark is rather simple: a single-phase flow without any interfaces. It should scale well to over 1000 cores.

Take a look at these archived results: https://fluidcodes.com/customer-supp...ruck_poly_14m/
With FDR InfiniBand, they achieve pretty much 100% inter-node efficiency up to 256 cores / 16 nodes, while maintaining reasonable scaling beyond that up to 1000 cores.

Edit: just in case you need absolute performance numbers to compare: https://www.padtinc.com/2016/11/22/a...lyhedral-mesh/
0.625 seconds per iteration on 16 Xeon E5-2667v3 CPUs, i.e. 8 nodes or 128 cores. 82% parallel efficiency compared to running on a single core.

Last edited by flotus1; April 17, 2023 at 15:06.
flotus1 is offline   Reply With Quote

Old   April 18, 2023, 05:38
Default
  #5
New Member
 
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Tyrmida is on a distinguished road
Thank you very much for the assistance thus far.

I set up a test environment with two nodes and ran benchmarks on each node locally: one with 2 x Xeon E5-2697v2 and one with 2 x Xeon Gold 6240.

It seems that in this test environment the per-core efficiency drops dramatically as the core count increases. Can someone maybe let me know:

1. Are these normal and expected results?

2. Should I expect the same performance within margin if I do a multi-node run using msmpi/hpc-x on EDR Infiniband?

3. Would upgrading processors to Xeon Platinum over the Xeon Gold on the same memory/board give an increase other than what would be expected with the faster clock speed? In other words, would I still run into the scaling problem I am experiencing now?

4. If my results are slower than expected, where can I begin to diagnose the problem?

The benchmark was done using aircraft_wing_14m

Many thanks in advance; it is so much appreciated. My benchmark results:

Tyrmida is offline   Reply With Quote

Old   April 18, 2023, 06:24
Default
  #6
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura about
Quote:
1. Are these normal and expected results?
As a tendency: yes.
Intra-node scaling (which you tested here) is expected to drop below 100% with high thread counts, because the system gets starved of shared CPU resources, mostly memory bandwidth.
However, parallel efficiency in the range of 30% is much lower than I would expect on the hardware you tested. It might be a hardware problem, or an issue with core binding, i.e. several threads getting assigned to the same CPU core.

Quote:
2. Should I expect the same performance within margin if I do a multi-node run using msmpi/hpc-x on EDR Infiniband?
Inter-node scaling for CFD is different. You would expect 100% or higher efficiency when going from 1 to 2 nodes while doubling the total thread count, i.e. halved wall-clock time.
That's because the second node doubles the shared CPU resources, in contrast to increasing the thread count within a single node.

Quote:
3. Would upgrading processors to Xeon Platinum over the Xeon Gold on the same memory/board give an increase other than what would be expected with the faster clock speed? In other words, would I still run into the scaling problem I am experiencing now?
Xeon Platinum CPUs won't help much here. The intra-node scaling will be similar, at slightly increased total performance. Xeon Gold with moderate core counts were the go-to recommendation for that generation of CPUs.
You would need to find out first what causes these scaling problems.

Quote:
4. If my results are slower than expected, where can I begin to diagnose the problem?
Convert your results to the metric Ansys uses when publishing their results: "core solver rating" https://fluidcodes.com/customer-supp...g-terminology/
It mostly boils down to knowing how many iterations the benchmark is running by default. I don't have access to that information.
This will give you an indication of how far off your results are when running on all cores of a single node, because you can compare to published results from similar systems.
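
As a minimal sketch of that conversion in Python (assuming the total solver wall-clock time of a benchmark run is known; the iteration count in the example is hypothetical, not the official benchmark setting):

Code:
# Ansys-style solver rating: how many times the benchmark could be run
# in a 24-hour day, i.e. 86400 s divided by the total solver wall time.
def solver_rating(total_wall_time_s: float) -> float:
    return 86400.0 / total_wall_time_s

def solver_rating_from_iterations(seconds_per_iteration: float, iterations: int) -> float:
    # If only the time per iteration is known, multiply by the number of
    # iterations the benchmark runs to get the total solver wall time.
    return solver_rating(seconds_per_iteration * iterations)

# Hypothetical example: 2.4 s/iteration over 100 iterations -> rating of 360.
print(solver_rating_from_iterations(2.4, 100))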

Most common culprits for this kind of performance problem are, in this order:
1) Unbalanced memory population. Either by not filling all memory channels with at least one DIMM, mixing different DIMMs per memory channel, or losing memory channels due to hardware defects or poor contact. Also, clearing caches before running the benchmark can help. Memory management on Linux with default settings is not great for HPC. Run "echo 3 > /proc/sys/vm/drop_caches" as root.
2) Core binding problems, i.e. more than one solver thread running on a single core. Disable Hyperthreading in the BIOS to make things easier (this is best practice anyway), and check the reported loads of individual cores while running the benchmark. htop is a good first indication; a quick check for SMT is sketched below.
3) Thermal problems.
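
For point 2), one quick way to see whether SMT/Hyperthreading is still enabled on a Linux node is to compare logical CPUs against physical cores. A minimal sketch, assuming /proc/cpuinfo is available:

Code:
# Compare logical CPUs vs. physical cores from /proc/cpuinfo.
# If there are more logical CPUs than physical cores, SMT/Hyperthreading
# is enabled and core binding needs extra care.
logical = 0
cores_per_socket = {}
physical_id = None

with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("processor"):
            logical += 1
        elif line.startswith("physical id"):
            physical_id = line.split(":")[1].strip()
        elif line.startswith("cpu cores") and physical_id is not None:
            cores_per_socket[physical_id] = int(line.split(":")[1])

physical = sum(cores_per_socket.values())
print(f"logical CPUs: {logical}, physical cores: {physical}")
print("SMT/Hyperthreading appears to be", "ON" if logical > physical else "OFF")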
flotus1 is offline   Reply With Quote

Old   April 18, 2023, 06:35
Default
  #7
New Member
 
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Tyrmida is on a distinguished road
Oops, sorry, I am the weakest link today. This is what happens when you're overworked and short on sleep.

The memory configuration on the Xeon Gold is definitely not what I thought it was.

I was confusing the Xeon Gold with the E5-2697v2 unit, where I did fill all the memory slots.

In reality it is running 2 channels instead of 6 channels per CPU. I will go to the colo, fill up the memory slots, and re-test when I can, probably over the weekend. This would explain a lot.

So sorry, and thank you for your time. I will update when I have results that max out the memory performance.
Tyrmida is offline   Reply With Quote

Old   April 18, 2023, 06:56
Default
  #8
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura about
Yeah, that should make a pretty big difference.
It doesn't explain what is happening on the Ivy Bridge system, but one problem at a time. Maybe we can just ignore it once the Cascade Lake system is fixed.
flotus1 is offline   Reply With Quote

Old   April 19, 2023, 19:03
Default
  #9
New Member
 
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Tyrmida is on a distinguished road
Haha, I wish I could ignore it; however, considering I have a lot of these machines doing nothing, I'm sure they can help at least somewhat if used together.

I sorted out the memory configuration on the Xeon Gold machine, and it scaled pretty well up to 30 cores, I think. The fix was getting it to 6 channels per CPU. Updated results below.

I think the Ivy Bridge has a different memory problem. It is currently running on 1333 MHz modules. I need to dig through the pile and find 8 matching 1866 MHz modules, and hopefully that will sort it out.

I will update once I have the result of this. If I can get my slowest node and fastest node performing as expected I can start working on the multi-node scaling and start doing those tests to know what I am capable of helping with here.

Again thank you so much for all the assistance.

By the way, I find it strange that there isn't a large central database of CPU/Socket/MEM/Benchmark figures somewhere - is this just me? Particularly because it seems that CFD computation is a very specific thing and quite different from other forms of HPC computation that is done frequently.

Herewith my updated results (E5-2697v2 didn't change). The last column is the Ansys solver rating (86400 divided by wall time seconds).

Tyrmida is offline   Reply With Quote

Old   April 20, 2023, 02:59
Default
  #10
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura about
Quote:
If I can get my slowest node and fastest node performing as expected I can start working on the multi-node scaling and start doing those tests to know what I am capable of helping with here.
While it is theoretically possible to make efficient use of heterogeneous clusters with Fluent, it is also another layer of complexity.
Without counteracting this via load balancing, performance will be limited by the slowest system, while under-utilizing the faster nodes.


Quote:
By the way, I find it strange that there isn't a large central database of CPU/Socket/MEM/Benchmark figures somewhere - is this just me? Particularly because it seems that CFD computation is a very specific thing and quite different from other forms of HPC computation that is done frequently.
In the grand scheme of things, workloads like CFD and FEA are still a niche.
This is evidenced by the ever-widening gap between compute power and memory bandwidth of CPUs over the past decades. There have been some hardware advancements lately aimed specifically at this problem: 3D V-Cache from AMD, and HBM inside the CPU package from Intel, together with the decent bump from 8 or even 12 DDR5 memory channels.
On the other hand, memory bandwidth is relatively straightforward. Number of memory channels * transfer rate is a solid indicator of how much bandwidth the CPUs can provide. Stay within 2-4 cores per memory channel, and you have a somewhat decent machine for CFD.
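
As a rough illustration of that rule of thumb, here is a small sketch with assumed example configurations (theoretical peak only; real sustained bandwidth is lower):

Code:
# Theoretical peak memory bandwidth per node ~= channels * MT/s * 8 bytes.
# The configurations below are assumptions based on hardware mentioned in
# this thread, not measured values.
def peak_bandwidth_gb_s(channels: int, transfer_rate_mt_s: int) -> float:
    return channels * transfer_rate_mt_s * 8 / 1000.0  # GB/s

configs = [
    ("2 x E5-2690v3 (8 ch DDR4-2133, 24 cores)", 8, 2133, 24),
    ("2 x Xeon Gold 6240 (12 ch DDR4-2933, 36 cores)", 12, 2933, 36),
]

for name, channels, rate, cores in configs:
    bw = peak_bandwidth_gb_s(channels, rate)
    print(f"{name}: ~{bw:.0f} GB/s, {cores / channels:.1f} cores per channel")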
flotus1 is offline   Reply With Quote

Old   April 20, 2023, 18:47
Default
  #11
New Member
 
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Tyrmida is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
While it is theoretically possible to make efficient use of heterogeneous clusters with Fluent, it is also another layer of complexity.
Without counteracting this via load balancing, performance will be limited by the slowest system, while under-utilizing the faster nodes.
I am aware of this. The whole benefit I can offer my client on pricing, though, is to use an existing standby rack that is already available and only being used for Hyper-V replication of non-HPC machines, so its processing power is not being utilized at all. For now a heterogeneous cluster is really all I have to meet the price-point requirement. If the target price point can't be reached there is no point in this entire exercise, so we will see.

Hoping all the gear I ordered to get everything up to scratch arrives within 10 days; I can then start seriously testing multi-node and will update accordingly!
Tyrmida is offline   Reply With Quote

Old   April 20, 2023, 21:14
Default
  #12
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14
wkernkamp is on a distinguished road
Quote:
Originally Posted by Tyrmida View Post
I am aware of this. The whole benefit I can offer my client on pricing, though, is to use an existing standby rack that is already available and only being used for Hyper-V replication of non-HPC machines, so its processing power is not being utilized at all. For now a heterogeneous cluster is really all I have to meet the price-point requirement. If the target price point can't be reached there is no point in this entire exercise, so we will see.

Hoping all the gear I ordered to get everything up to scratch arrives within 10 days; I can then start seriously testing multi-node and will update accordingly!
Am I correct that you have basically two machine types, the older Ivy Bridge and the newer six-channel Xeon Gold?


If so, you can set up the slots for mpirun. The bandwidth ratio of the machines is 33/16, assuming DDR3-1866x4 and DDR4-2933x6. The following slot combinations give a well-balanced load between the machines (with essentially no waiting by the faster machines) on the OpenFOAM benchmark.

Code:
E5-2697v2     Gold 6240
    12              28
    14              32
    16              36
The higher slot combination may give the fastest completion, but you need some cores for the VMs too.
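
A rough sketch of how such a split can be derived from estimated per-node memory bandwidth (the channel counts and transfer rates below are assumptions for the dual-socket nodes discussed in this thread; measured benchmark timings should have the final say):

Code:
# Pick rank counts for a heterogeneous two-node run in proportion to
# estimated memory bandwidth (channels * transfer rate), so the slower
# node does not hold back the faster one. Hardware numbers are assumptions.
nodes = {
    "E5-2697v2": {"channels": 8,  "mt_s": 1866, "cores": 24},  # 2 sockets x 4 ch DDR3-1866
    "Gold 6240": {"channels": 12, "mt_s": 2933, "cores": 36},  # 2 sockets x 6 ch DDR4-2933
}

weight = {name: n["channels"] * n["mt_s"] for name, n in nodes.items()}

def matching_ranks(fast_ranks: int, fast: str = "Gold 6240", slow: str = "E5-2697v2") -> int:
    """Rank count on the slower node that gives roughly equal bandwidth per rank."""
    return round(fast_ranks * weight[slow] / weight[fast])

for gold_ranks in (28, 32, 36):
    print(f"Gold 6240: {gold_ranks} ranks -> E5-2697v2: {matching_ranks(gold_ranks)} ranks")
# Prints 12, 14 and 15 ranks for the Ivy Bridge node, close to the slot table above.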

Last edited by wkernkamp; April 20, 2023 at 22:20.
wkernkamp is offline   Reply With Quote

Old   April 20, 2023, 22:32
Default
  #13
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14
wkernkamp is on a distinguished road
The effective core scaling on the OpenFOAM benchmark for your two machine types is shown in the attachment. As you can see, your Ivy Bridge's memory is also not configured properly.
Attached Images
File Type: png dual_e5-2697v2_and_Gold6240.png (16.5 KB, 28 views)
wkernkamp is offline   Reply With Quote

Old   April 30, 2024, 17:03
Default
  #14
Member
 
Vojtech Betak
Join Date: Mar 2009
Location: Czech republic
Posts: 34
Rep Power: 18
betakv is on a distinguished road
Dear Robert

Have you found a solution to the problem you reported? I have a similar problem with the Xeon 6342.

Thank you very much for your reply.

Yours faithfully

Vojtech
betakv is offline   Reply With Quote
