April 16, 2023, 05:33
Ansys Fluent Multi-Node Bottleneck Question
#1
New Member
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Good Day,

I am an IT infrastructure provider, so please excuse me if I don't know the ins and outs of CFD. I have a client that uses Ansys Fluent to solve simulations. For large projects they currently use a local HPC provider to speed up the processing, because it comes at 1/4 the cost of the big international names. They are currently running a large project that will solve for 3 months using 72 cores across 6 nodes, and I would like to find a way to give them an alternative solution to speed this up. We have a large DC with a lot of free capacity and could work something out for them on our hardware.

What I am asking, and not understanding properly, is why they can't solve substantially faster using more cores. They pay per CPU-core-hour, which in my inexperienced (with CFD) mind doesn't add up: if you double the number of nodes, and thus cores, you should be able to solve in half the time minus a little overhead? The senior engineer (a guy who really knows his stuff, even on the IT side) has told me that efficiency drops substantially when adding more cores/nodes, so it isn't worth it on cost or time. How can I help them get the same efficiency on, for example, 144 cores?

Of course I don't have access to their current HPC provider, but I have some details. They are using:
- PowerEdge C6320 servers
- Xeon E5-2690v3 12-core CPUs
- Some nodes with 128 GB and some with 64 GB - I don't know the configuration, but I assume they were clever enough to fill all the memory channels
- FDR InfiniBand interconnect

I don't have the detailed specifics of their model. All I know right now is that the interval they're solving is milliseconds, so as I understand it there are a lot of iterations, which is why it takes so long.

My questions:
1. Where could the bottleneck be that prevents them from using more nodes and therefore more cores? Is it the FDR InfiniBand that is limiting them? That is the only thing that makes sense to me. Would the same hardware on EDR InfiniBand sort this out?
2. What other information about what they are trying to solve can I ask for that will help pin down where the bottleneck is?

License count is not a factor.

Thank you in advance - I hope this question was appropriate and made sense.
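On the "double the cores, halve the time" expectation: a minimal Amdahl-style sketch shows why this rarely holds once part of the work per time step is serial or spent on communication. The 5% serial fraction below is purely an assumed, illustrative number, not a measured one.

Code:
# assumed serial/communication fraction s = 0.05; speedup relative to 72 cores
awk 'BEGIN { s=0.05; t72 = s + (1-s)/72;
  for (n=72; n<=288; n*=2)
    printf "%4d cores -> %.2fx faster than 72 cores\n", n, t72/(s+(1-s)/n) }'
# prints roughly: 72 -> 1.00x, 144 -> 1.12x, 288 -> 1.19x

Even a small non-parallel share of each time step caps the benefit of adding cores, which is in line with what the senior engineer warned about.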
April 16, 2023, 08:16
#2
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
Regarding the interconnect being the bottleneck: if you need 100% certainty on this, you would have to monitor network traffic, or compare intra-node vs. inter-node scaling for this case.

Apart from the interconnect, the usual suspects for poor scaling are:

1) Extremely low cell count. Something in the order of 1 million cells or lower. In that case, the only hardware "solution" is running on the latest and greatest CPUs with the highest clock speed, which obviously has limited potential.
2) Unnecessary file I/O to slow storage, like writing solution data at every time step or worse. A hardware solution could be faster flash storage, but realistically such a problem needs to be fixed by doing less file I/O.
3) Serialization of the code. For example, if they are using Fluent UDFs that nobody bothered to parallelize. Again, the only hardware "solution" would be to run on the fastest available CPUs, with limited potential.
4) Lots of mesh interfaces which are updated frequently. These have a tendency to scale poorly. Again, no real hardware solution here other than running on the fastest possible CPUs, which won't fix the scaling issue, only decrease run time a bit.
5) A nasty edge case where the default partitioning schemes used by Fluent produce bad results. Not something you can fix via hardware.

In short, you can't fix their scaling issue with a hardware solution. You can decrease run time a bit with faster CPUs, and faster storage if file I/O is an issue. Getting better scaling is a software issue; they might want to contact Ansys support directly to get that fixed.

Last edited by flotus1; April 16, 2023 at 09:30.
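If they want to run that intra-node vs. inter-node comparison themselves, a minimal sketch could look like the following. The journal file, host file and core counts are placeholders, and exact Fluent launcher flags can differ between versions, so treat this as a starting point rather than a recipe.

Code:
#!/bin/bash
# Same case, fixed iteration count, timed at several core counts.
# bench.jou is a placeholder journal: read the case, run N iterations, exit.
for n in 12 24 36; do                      # all ranks on a single node
    fluent 3ddp -g -t${n} -i bench.jou > intra_${n}.log 2>&1
done
for n in 24 48 72; do                      # ranks spread over the nodes listed in hosts.txt
    fluent 3ddp -g -t${n} -cnf=hosts.txt -i bench.jou > inter_${n}.log 2>&1
done
# If time per iteration keeps dropping within one node but stalls as soon as a
# second node is involved, the interconnect (or its configuration) is the prime suspect.

If both sweeps scale about equally well, the interconnect is probably not the limiting factor and the points above become the more likely explanations.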
April 17, 2023, 02:41
#3
New Member
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Good Day,

Thank you very much for your insight and guidance. I have had a word with the engineer and went through your points with him.

They gave me their old benchmarks that they ran at the HPC provider quite a while ago (using truck_poly_14m), but it was done long ago on an old version. I have asked them to run the aircraft wing benchmark for me so that I have something to compare with here. Even on that old benchmark, these were their results:

Cores   Wall time per iteration (s)
  32    6.4049
  64    3.7408
 128    2.3823
 256    2.4047

So it does look like scaling starts to drop off somewhere around 80 cores and stops completely between 128 and 256 cores. I have asked them to run the benchmark on the current version at the HPC provider, and I will then compare it to what I have in my lab, because this lack of scalability is crazy. I feel really bad for people in your industry having to sit around waiting multiple months for simulations to run.

I will update when I have up-to-date benchmarks from their HPC provider and my lab to compare. Thank you very much.
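Treating the 32-core run as the baseline, speedup and parallel efficiency follow directly from those four numbers; a quick sketch (awk is just a convenient calculator here):

Code:
awk 'BEGIN {
  t[32]=6.4049; t[64]=3.7408; t[128]=2.3823; t[256]=2.4047;
  for (n=32; n<=256; n*=2)
    printf "%4d cores: speedup %.2fx, efficiency %3.0f%%\n",
           n, t[32]/t[n], 100*(t[32]*32)/(t[n]*n) }'
# ->  32 cores: speedup 1.00x, efficiency 100%
#     64 cores: speedup 1.71x, efficiency  86%
#    128 cores: speedup 2.69x, efficiency  67%
#    256 cores: speedup 2.66x, efficiency  33%

So relative to 32 cores, efficiency is already slipping at 128 cores and collapses entirely somewhere between 128 and 256.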
April 17, 2023, 12:12
#4
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
This benchmark (truck_poly_14m) is rather simple: a single-phase flow without any interfaces. It should scale well to over 1000 cores. Take a look at these archived results: https://fluidcodes.com/customer-supp...ruck_poly_14m/

With FDR InfiniBand, they achieve pretty much 100% inter-node efficiency up to 256 cores / 16 nodes, while maintaining reasonable scaling beyond that up to 1000 cores.

Edit: just in case you need absolute performance numbers to compare: https://www.padtinc.com/2016/11/22/a...lyhedral-mesh/
0.625 seconds per iteration on 16 Xeon E5-2667v3 CPUs, i.e. 8 nodes or 128 cores, at 82% parallel efficiency compared to running on a single core.

Last edited by flotus1; April 17, 2023 at 15:06.
April 18, 2023, 05:38
#5
New Member
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Thank you very much for the assistance thus far.

I set up a test environment with 2 nodes and ran benchmarks on each node locally: 2 x Xeon E5-2697v2 and 2 x Xeon Gold 6240 respectively. It seems that in this testing environment, efficiency per core drops dramatically as the core count increases. Can someone maybe let me know:

1. Are these normal and expected results?
2. Should I expect the same performance, within margin, if I do a multi-node run using msmpi/hpc-x on EDR InfiniBand?
3. Would upgrading to Xeon Platinum processors over the Xeon Gold on the same memory/board give an increase beyond what would be expected from the faster clock speed? In other words, would I still run into the scaling problem I am experiencing now?
4. If my results are slower than expected, where can I begin to diagnose?

The benchmark was done using aircraft_wing_14m.

Many thanks in advance, it is so much appreciated.

My benchmark results:
April 18, 2023, 06:24
#6
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
On question 1: intra-node scaling - which is what you tested here - is expected to drop below 100% with high thread counts, because the system gets starved for shared CPU resources, mostly memory bandwidth. However, parallel efficiency in the range of 30% is much lower than I would expect on the hardware you tested. This might be hardware problems, or an issue with core binding, i.e. several threads getting assigned to the same CPU core.

On question 2: multi-node scaling tends to hold up better than this, because the second node doubles the shared CPU resources, contrary to increasing thread count within a single node.

On question 3: you would need to find out first what causes these scaling problems.

On question 4: it mostly boils down to knowing how many iterations the benchmark runs by default; I don't have access to that information. That would give you an indication of how far off your results are when running on all cores of a single node, because you can compare to published results of similar systems. The most common culprits for this kind of performance problem are, in this order:

1) Unbalanced memory population. Either by not filling all memory channels with at least one DIMM, mixing different DIMMs per memory channel, or losing memory channels due to hardware defects or poor contact. Also, clearing caches before running the benchmark can help; memory management on Linux with default settings is not great for HPC. Run "echo 3 > /proc/sys/vm/drop_caches" as root.
2) Core binding problems, i.e. more than one solver thread running on a single core. Disable Hyper-Threading in the BIOS to make things easier (this is best practice anyway), and check the reported loads of individual cores while running the benchmark; htop is a good first indication.
3) Thermal problems.
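A rough first diagnostic pass along those three lines could look like this; the commands need root where noted, and package names can vary by distribution:

Code:
free -g                                     # how much RAM is tied up in the page cache
sync && echo 3 > /proc/sys/vm/drop_caches   # clear caches before benchmarking (as root)
dmidecode -t memory | grep -E "Size|Speed|Locator"   # DIMM population per channel (as root)
numactl --hardware                          # memory per NUMA node / socket
lscpu | grep -E "Socket|Core|Thread"        # 1 thread per core means Hyper-Threading is off
# while the benchmark runs, watch per-core load with htop (or top, pressing 1):
# each solver process should sit on its own core at close to 100%

Thermal problems usually show up as clock speeds sagging under sustained load, which can be watched with something like turbostat or by logging the frequencies reported in /proc/cpuinfo during the run.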
April 18, 2023, 06:35
#7
New Member
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Oops - sorry, I am the weakest link today. This is what happens when you're overworked and short on sleep.

The memory configuration on the Xeon Gold is definitely not what I thought it was. I was confusing it with the E5-2697v2 unit, where I did fill all the memory slots. In reality the Xeon Gold is running 2 channels instead of 6 channels per CPU. I will go to the colo, fill up the memory slots and re-test when I can - probably on the weekend. This would explain a lot. So sorry, and thank you for your time. I will update when I have results that max out the memory performance.
April 18, 2023, 06:56
#8
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
Yeah, that should make a pretty big difference.
It doesn't explain what is happening on the Ivy Bridge system, but one problem at a time. Maybe we can just ignore it once the Cascade Lake system is fixed.
April 19, 2023, 19:03
#9
New Member
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Haha, I wish I could ignore it; however, considering I have a lot of them doing nothing, I'm sure they can help at least somewhat if used together.

I sorted out the memory configuration on the Xeon Gold machine, and it now scales pretty well up to about 30 cores. The fix was getting it to 6 channels per CPU. Updated results below.

I think the Ivy Bridge has a different memory problem: it is currently running on 1333 MHz modules. I need to dig through the pile and find 8 matching 1866 MHz modules, and hopefully that will sort it out. I will update once I have the result. If I can get my slowest node and my fastest node performing as expected, I can start working on multi-node scaling and run those tests to know what I am capable of helping with here. Again, thank you so much for all the assistance.

By the way, I find it strange that there isn't a large central database of CPU/socket/memory/benchmark figures somewhere - is this just me? Particularly because CFD computation seems to be a very specific thing, quite different from other forms of HPC computation that are done frequently.

Herewith my updated results (the E5-2697v2 didn't change). The last column is the Ansys solver rating (86400 divided by wall time in seconds).
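For anyone reading along, that rating is just the wall time inverted and scaled to a day, so it is easy to reproduce; the 350-second wall time below is only a made-up example value:

Code:
awk 'BEGIN { wall=350; printf "solver rating: %.1f\n", 86400/wall }'   # -> solver rating: 246.9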
April 20, 2023, 02:59
#10
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
On using your mixed nodes together: without countering via load balancing, performance will be limited by the slowest system, while under-utilizing the faster nodes.

On why CFD is different from other HPC workloads: it is largely limited by memory bandwidth, evidenced by the ever-widening gap between compute and bandwidth of CPUs over the past decades. There have been some hardware advancements lately specifically aimed at this problem: 3D V-Cache from AMD, and HBM inside the CPU package from Intel, together with the decent bump from 8 or even 12 DDR5 memory channels.

On the other hand, memory bandwidth is relatively straightforward: number of memory channels * transfer rate is a solid indicator of how much bandwidth the CPUs can provide. Stay within 2-4 cores per memory channel, and you have a somewhat decent machine for CFD.
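Applied to the two test nodes in this thread, that rule of thumb gives roughly the following peak figures; the DIMM speeds (DDR4-2933 for the Gold 6240, DDR3-1866 for the E5-2697v2) are assumptions:

Code:
# peak bandwidth ~ sockets * channels * MT/s * 8 bytes
awk 'BEGIN { printf "dual Xeon Gold 6240: %.0f GB/s\n", 2*6*2933*8/1000 }'   # ~282 GB/s
awk 'BEGIN { printf "dual E5-2697v2:      %.0f GB/s\n", 2*4*1866*8/1000 }'   # ~119 GB/s

At 36 cores, the Gold 6240 box works out to 3 cores per memory channel, which sits inside the 2-4 cores per channel guideline.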
April 20, 2023, 18:47
#11
New Member
Robert Schmitt
Join Date: Apr 2023
Posts: 10
Rep Power: 3
Hoping all the gear I ordered to get everything up to scratch comes through in the next 10 days; then I can start seriously testing multi-node, and I will update accordingly!
April 20, 2023, 21:14
#12
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14
If you do end up running the two machine types together, you can set up the slots for the mpirun. The bandwidth ratio of the machines is 33/16, assuming DDR3-1866 x4 and DDR4-2933 x6. The following slot combinations give a well-balanced load between the machines (with essentially no waiting by the faster machine) on the OpenFOAM benchmark.

Code:
E5-2697v2   Gold 6240
       12          28
       14          32
       16          36

Last edited by wkernkamp; April 20, 2023 at 22:20.
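One way to express such a split, assuming OpenMPI and the usual OpenFOAM benchmark workflow (hostnames, file names and the solver call are placeholders to adapt):

Code:
cat > hostfile <<'EOF'
ivy-node    slots=14
gold-node   slots=32
EOF
# the case must be decomposed for the same total rank count (46 here)
mpirun --hostfile hostfile -np 46 simpleFoam -parallel | tee log.simpleFoam

The same idea carries over to Fluent, which takes its list of hosts via the -cnf option.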
April 20, 2023, 22:32
#13
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14
The effective core scaling on the OpenFOAM benchmark for your two machine types is shown in the attachment. As you can see, your Ivy Bridge's memory is also not configured properly.
April 30, 2024, 17:03
#14
Member
Vojtech Betak
Join Date: Mar 2009
Location: Czech republic
Posts: 34
Rep Power: 18
Dear Robert,

Have you found a solution to the problem you reported? I have a similar problem with the Xeon 6342. Thank you very much in advance for your reply.

Yours faithfully,
Vojtech