Updating CFD server E5-2697AV4 to something faster (x2/x3 the speed)

andy_ · October 11, 2024, 04:31

Quote:

Originally Posted by MFGT

The old server ran at a maximum clock speed of 2600 MHz, which is 25% less, so I don't understand why the performance is the same. We use the old 8x16 GB 2133 MHz RAM and, according to our IT, it is placed in the correct sockets. The data HDD is also carried over.

A plot of time vs number of cores (1, 2, 4, 8, 16, 32) with a decent sized CFD problem (e.g. a 2M mesh like the pinned OpenFOAM benchmark) will enable you to see when memory access starts to become the bottleneck. This may provide a clue to what's wrong. For example, the single core result may not line up with the benchmarks of others, the efficiency may start plummeting at low core counts (memory in wrong sockets), etc.

What tends to matter most for implicit CFD simulations is the number of memory channels. Your new system has either 8 or 16 depending on whether you have 1 or 2 processors. Your old dual Xeon system has 8 channels. So if your new system has only 1 processor I would expect implicit CFD performance to be in the same ballpark with the same memory. But information on parallel efficiency would help with being sure about where the bottleneck lies.

PS: If you are only using 8 memory chips with 2 processors then that is very likely the problem, since there are 16 memory channels. 8 more memory chips are likely to double the performance for runs using higher numbers of cores.
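As a rough illustration of why the channel count matters: the theoretical peak bandwidth of DDR4 is channels x transfer rate x 8 bytes per transfer (each channel is 64 bits wide). The sketch below compares the configurations discussed in this thread. The function name is only illustrative, the rates are the nominal DDR4 figures, and sustained bandwidth in a real CFD run will be lower than these peaks.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak DDR4 bandwidth in GB/s (64-bit channel = 8 bytes per transfer)."""
    bytes_per_transfer = 8
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Configurations mentioned in the thread (one DIMM per populated channel).
configs = {
    "8 channels, DDR4-2133 (8 DIMMs)": (8, 2133),
    "16 channels, DDR4-2133 (16 DIMMs)": (16, 2133),
    "16 channels, DDR4-3200 (16 DIMMs)": (16, 3200),
}

for label, (channels, rate) in configs.items():
    print(f"{label}: {peak_bandwidth_gbs(channels, rate):.1f} GB/s")
```

With the same 2133 MHz modules, going from 8 to 16 populated channels doubles the theoretical peak, which is the basis for the "likely to double" estimate for memory-bound runs.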

MFGT · October 11, 2024, 06:44

Thanks for your input Andy. Memory size is no issue, it's probably the speed.

And new RAM wasn't in the budget, unfortunately.

So we are working on the DIMM population order at the moment.
According to the manual (https://www.hpe.com/psnow/doc/a00038346enw) we have set the 8 DIMMs in all white slots of Channels C, D, G and H of both processors, so they should run in quad channel mode. There is just this note at 8 DIMMs ***:
*** Recommended only with processors that have 128 MB L3 cache or less.
Well, the 7F52 has 256MB, is this an issue? Should we indeed also use more than 8 DIMMs then?

Previously we had the 8 DIMMs in two channels per processor. Strangely enough this was faster?

Edit: found the error, 1 DIMM wasn't clicked in correctly.
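A badly seated module like this shows up as a mismatch between the number of DIMMs installed and the number the operating system reports. The sketch below checks an inventory for that. The record layout is hypothetical (on Windows the fields could be filled from the `Win32_PhysicalMemory` class, e.g. via `Get-CimInstance`), and the sample inventory is made up for illustration.

```python
from collections import Counter


def check_dimms(detected, expected_count):
    """Return a list of human-readable problems found in a DIMM inventory.

    `detected` is a list of dicts with hypothetical keys:
    'slot' (str), 'size_gb' (int) and 'speed_mhz' (int).
    """
    problems = []
    if len(detected) != expected_count:
        problems.append(
            f"expected {expected_count} DIMMs, but {len(detected)} detected"
        )
    sizes = Counter(d["size_gb"] for d in detected)
    if len(sizes) > 1:
        problems.append(f"mixed DIMM sizes: {dict(sizes)}")
    speeds = Counter(d["speed_mhz"] for d in detected)
    if len(speeds) > 1:
        problems.append(f"mixed DIMM speeds: {dict(speeds)}")
    return problems


# Made-up example: 8 x 16 GB installed, but only 7 modules are reported.
inventory = [
    {"slot": f"DIMM {i}", "size_gb": 16, "speed_mhz": 2133} for i in range(1, 8)
]
for problem in check_dimms(inventory, expected_count=8):
    print(problem)
```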

andy_ · October 11, 2024, 07:27

As far as I can see your system is behaving as expected when populated with insufficient memory chips. When budget allows, buying the extra memory chips will likely double the performance when using higher numbers of cores with an implicit CFD problem of reasonable size and hence bring the performance in line with what you had hoped.

I have no experience with how to configure too few memory chips because it is not something I have ever considered doing given it substantially reduces the effective number of cores available for largish CFD runs. If you run a reasonably large implicit CFD benchmark on 1, 2, 4, 8, 16, 32, 64 cores and plot the parallel efficiency it will likely show there is currently little to be gained by using more than around 16 cores. If the machine is also used for other types of simulations these may run with a better parallel efficiency.
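For reference, parallel efficiency is just the speedup divided by the number of cores. A minimal sketch of turning a set of timings into the numbers to plot follows. The wall times in the example are placeholders, not measurements from this thread, and the function assumes a 1-core run is included as the baseline.

```python
def scaling_table(wall_times):
    """Compute speedup and parallel efficiency relative to the 1-core run.

    `wall_times` maps core count -> wall-clock time (any consistent unit)
    and must contain an entry for 1 core.
    Returns a list of (cores, speedup, efficiency) tuples sorted by core count.
    """
    t1 = wall_times[1]
    rows = []
    for cores in sorted(wall_times):
        speedup = t1 / wall_times[cores]
        rows.append((cores, speedup, speedup / cores))
    return rows


# Placeholder timings in seconds, for illustration only.
example = {1: 1000.0, 2: 520.0, 4: 270.0, 8: 150.0, 16: 95.0, 32: 80.0, 64: 78.0}

for cores, speedup, efficiency in scaling_table(example):
    print(f"{cores:3d} cores: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")
```

The core count where the efficiency column starts to drop sharply is where memory access has become the limit.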

andy_ · October 11, 2024, 07:33

Quote:

Originally Posted by MFGT

Edit: found the error, 1 DIMM wasn't clicked in correctly.

Has this given you the expected overall performance, or has it brought the results from populating different slots in different configurations in line with expectations?

MFGT · October 11, 2024, 08:28

Compared to the old server we now have a Speedup of +64%

(test with flowbench simulation).
Both configurations used 60 cores with HT on.

30 cores without HT has a speedup of +36%, the wrong DIMM config with 60 cores had only +10%, the one with only 7 effective DIMMs even had -14%.

andy_ · October 11, 2024, 10:24

Quote:

Originally Posted by MFGT

30 cores without HT has a speedup of +36%, the wrong DIMM config with 60 cores had only +10%, the one with only 7 effective DIMMs even had -14%.

Thanks but have you got any information on parallel efficiency with cores using a single reasonably large implicit CFD test case? I am asking because it would be interesting to know how much performance is lost by not fully populating the memory slots. I had assumed when memory was the bottleneck it would be pretty much linear but if you can configure the cpu to memory connections perhaps this is not the case?

wkernkamp · October 11, 2024, 17:02

Quote:

Originally Posted by andy_

Thanks but have you got any information on parallel efficiency with cores using a single reasonably large implicit CFD test case? I am asking because it would be interesting to know how much performance is lost by not fully populating the memory slots. I had assumed when memory was the bottleneck it would be pretty much linear but if you can configure the cpu to memory connections perhaps this is not the case?

You can run the benchmark with the 8 memory channels and then repeat later with the 16 channels. You will see that single core performance will be essentially unchanged. However, as more cores come into use the memory bottleneck will make the 8 channel config fall back more and more.
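This behaviour can be sketched with a toy model in which speedup is linear until the populated memory channels saturate and flat afterwards. This is an idealisation under assumed parameters, not a measurement: `cores_per_channel` (how many cores one channel can feed at full speed) is a made-up value chosen for illustration.

```python
def modelled_speedup(cores, channels, cores_per_channel=2.0):
    """Toy bandwidth-limited speedup: linear until the channels saturate.

    `cores_per_channel` is an assumed number of cores that one memory channel
    can feed at full speed; it is a made-up parameter, not a measured value.
    """
    saturation = channels * cores_per_channel
    return float(min(cores, saturation))


for cores in (1, 2, 4, 8, 16, 32, 64):
    s8 = modelled_speedup(cores, channels=8)
    s16 = modelled_speedup(cores, channels=16)
    print(f"{cores:3d} cores: 8 channels -> {s8:5.1f}, 16 channels -> {s16:5.1f}")
```

In this model both configurations are identical at low core counts and the 8 channel one ends up at half the speed once both are saturated. Real codes will deviate from it because of caches, communication and any non-memory-bound phases.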

andy_ · October 11, 2024, 18:20

Quote:

Originally Posted by wkernkamp

You can run the benchmark with the 8 memory channels and then repeat later with the 16 channels. You will see that single core performance will be essentially unchanged. However, as more cores come into use the memory bottleneck will make the 8 channel config fall back more and more.

Yes but my question was about how much? I had expected the performance to be essentially halved with 8 instead of 16 dimms when memory access becomes the bottleneck for a largish model and the number of cores exceeding the number of memory channels. Yet the OP is reporting a 64% increase. I am not familiar with his benchmark which I think is an average of a range of different simulations (?) and so if there are some explicit ones and/or some small ones that are going to parallelise more efficiently on higher core numbers that may be the reason. If so, he is unlikely to see an equivalent increase when running his own large implicit problems. If not, then I would like to know how using the same dimms with the same effective number of memory channels (assuming they are?) a simulation that is being strongly limited by memory access can run at significantly different speeds.

In this case we have a lot of unknowns and it may not be possible to sort out quite what is going on without a widely used and understood CFD benchmark. The one pinned at the top of this forum has lots of results although I had to fiddle a bit to get it to run which I guess is going to put people off. The NAS parallel benchmarks were really useful for understanding this sort of thing but they didn't seem to catch on possibly because they produced a range of plots rather than a single number and later versions became rather supercomputer orientated.

MFGT · October 12, 2024, 10:37

Hi,

my test case is a relatively small flowbench simulation with up to 600,000 cells. I ran the exact same case several times with different configurations. When calculating the speedup, the result is very similar whether I consider a) overall runtime or b) average wall time per timestep.

With RAM limitation I don't mean the overall amount (the case needs less than 16 GB), but rather the slow 2133 MHz RAM, where up to 3200 MHz is now supported.

And I am sorry, I won't do any benchmarks with different software, as I can't spend time on non-project-related work in my job.

But I will rerun a full cycle simulation and compare the results as well. Here we talk about cases with up to 1.5 million cells, including detailed chemistry etc.

andy_ · October 13, 2024, 05:24

Thanks for the clarification. It will be interesting to see how much the 64% changes with a larger more representative simulation but without more information it looks like we will have to speculate about what might be going on and what will or will not bring improvements.

MFGT · October 15, 2024, 05:44

I have some numbers, but of course I didn't rerun full cycle simulations with various numbers of cores. My impression is that the average wall time per timestep is reduced by 34-37%, which corresponds to a speedup of roughly 52-58% for full cycle simulations.
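The conversion between the two figures is worth spelling out: a run that takes a fraction (1 - r) of the old wall time is 1 / (1 - r) times as fast. A small sketch (the function name is only illustrative):

```python
def speedup_from_time_reduction(reduction):
    """Convert a fractional reduction in wall time into a speedup gain.

    A run that takes (1 - reduction) of the old time is 1 / (1 - reduction)
    times as fast, i.e. a gain of 1 / (1 - reduction) - 1.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction) - 1.0


for reduction in (0.34, 0.37):
    gain = speedup_from_time_reduction(reduction)
    print(f"{reduction:.0%} less wall time -> {gain:.1%} speedup")
```

For reductions of 34% and 37% this gives gains of about 52% and 59%, close to the range quoted.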

We are happy with that, considering we only made the following changes:

more modern CPU
increase of base clock speed: 2.6 GHz -> 3.5 GHz
more memory channels per CPU, which we are not using yet (quad -> octa)

Constants:

same number of Cores/Threads
same RAM 8x16GB = 128GB (2133MHz)

Since the PassMark benchmark indicated +92% performance, I expect to get close to that value if we also upgrade the RAM (the old system allowed 2400 MHz, the new one could utilize 3200 MHz).

MFGT · October 15, 2024, 05:55

I ran a small test, 20°CA of an engine simulation, and analyzed the wall times. HT was activated all the time.

See the figure below.

We see an improvement up to 32 cores. However, HT works very well with this machine, as 60 cores gives another improvement of 35% compared to 32 cores.

andy_ · October 15, 2024, 10:19

That does not seem to be scaling as I would expect for a typical distributed memory implicit CFD code with a reasonably large grid running on a shared memory machine (e.g. the pinned OpenFOAM benchmark). Do you know how the solver in your program works and how it is parallelised? Can you provide a link to it, because my googling "flowbench simulation" hasn't thrown up anything obvious.

(And we are back to wanting to run something like the NAS parallel benchmarks in order to understand the performance).

wkernkamp · October 15, 2024, 16:01

Quote:

Originally Posted by MFGT

I ran a small test, 20°CA of an engine simulation, and analyzed the wall times. HT was activated all the time.

See the figure below.

We see an improvement up to 32 cores. However, HT works very well with this machine, as 60 cores gives another improvement of 35% compared to 32 cores.

I agree with andy_. If you configure your memory right, with more DIMMs and appropriate speed, you will get linear performance up to 16 cores, I would think. Then it will fall off and even go down when you exceed 32 cores.

MFGT · October 16, 2024, 03:15

Quote:

Originally Posted by andy_

That does not seem to be scaling as I would expect for a typical distributed memory implicit CFD code with a reasonably large grid running on a shared memory machine (e.g. the pinned OpenFOAM benchmark). Do you know how the solver in your program works and how it is parallelised? Can you provide a link to it, because my googling "flowbench simulation" hasn't thrown up anything obvious.

(And we are back to wanting to run something like the NAS parallel benchmarks in order to understand the performance).

I am using CONVERGE CFD where I perform engine simulations (injection and combustion) or simple flowbench simulation to optimize port performance.

Since I am only a user, I cannot install OpenFOAM (I have never worked with it) on my own and run a benchmark there, even if that would be very interesting. Furthermore, we are on Windows Server 2016, not Linux.

If you can guide me on how to run that benchmark I may be able to convince our IT to install and configure OpenFOAM, although I don't really care about performance against other CFD servers.

We see an improvement over our old server which is notable and may be further improved with e.g. 16x8GB 3200MHz DIMMs.
I don't need more RAM, since I have never exceeded 100GB of usage.

andy_ · October 16, 2024, 04:54

Quote:

Originally Posted by MFGT

I am using CONVERGE CFD where I perform engine simulations (injection and combustion) or simple flowbench simulation to optimize port performance.

Since I am only a user, I cannot install OpenFOAM (I have never worked with it) on my own and run a benchmark there, even if that would be very interesting. Furthermore, we are on Windows Server 2016, not Linux.

If you can guide me on how to run that benchmark I may be able to convince our IT to install and configure OpenFOAM, although I don't really care about performance against other CFD servers.

We see an improvement over our old server which is notable and may be further improved with e.g. 16x8GB 3200MHz DIMMs.
I don't need more RAM, since I have never exceeded 100GB of usage.

Does your test case involve an adaptive and/or moving grid?

If you are using a single commercial code then running it with representative models is likely to be the most relevant benchmark. If you use other codes to perhaps check how efficiently your current commercial code is implemented then they will need to perform the same simulation or at least the same type of simulation.

I am not familiar with the details of the CONVERGE code, but given the size of the efficiency improvements reported for version 4 it is likely still in the process of becoming well developed. This is perhaps to be expected for a newish code (assuming it is newish) and should improve with time if the company is competently run and profitable.

MFGT · October 16, 2024, 05:41

Quote:

Originally Posted by andy_

Does your test case involve an adaptive and/or moving grid?

If you are using a single commercial code then running it with representative models is likely to be the most relevant benchmark. If you use other codes to perhaps check how efficiently your current commercial code is implemented then they will need to perform the same simulation or at least the same type of simulation.

I am not familiar with the details of the CONVERGE code, but given the size of the efficiency improvements reported for version 4 it is likely still in the process of becoming well developed. This is perhaps to be expected for a newish code (assuming it is newish) and should improve with time if the company is competently run and profitable.

Yes, the results above involve moving surfaces and an adaptive grid. It was a 20°CA section of an engine simulation. We won't switch to another code because we are really happy with it.

The CONVERGE code isn't that new (more than 15/20 years old) and should be more than profitable (80% of engine developers worldwide use it).

I had a look at the results again and have to make some corrections. I was using the reported runtime for the speedup calculations, but a closer look showed that at short runtimes of 4 to 80 mins (which I had for the core variations) the impact of simulation setup and writing output becomes disproportionate. When calculating the speedup from the reported time for solving the transport equations only, the profile is the same, but higher.

E.g.:
16 Cores: 12.1 Speedup
32 Cores: 22.1 Speedup
48 Cores: 22.2 Speedup
60 Cores: 29.7 Speedup
64 Cores: 28.2 Speedup
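Dividing these speedups by the core count gives the parallel efficiency that was asked about earlier in the thread. A small sketch using the figures above, assuming the speedups are relative to a single-core run (which the post does not state explicitly):

```python
# Speedups as reported above (transport-equation solve time only).
reported_speedup = {16: 12.1, 32: 22.1, 48: 22.2, 60: 29.7, 64: 28.2}


def parallel_efficiency(speedups):
    """Return {cores: speedup / cores} for each entry."""
    return {cores: s / cores for cores, s in speedups.items()}


for cores, eff in parallel_efficiency(reported_speedup).items():
    print(f"{cores:2d} cores: speedup {reported_speedup[cores]:4.1f}, efficiency {eff:.2f}")
```

On these numbers the efficiency is roughly three quarters at 16 cores and drops below one half for 48 cores and above. Note that counts above 32 include hyperthreads on this machine, so they are not directly comparable with physical cores.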
