OpenFOAM benchmarks on various hardware

wkernkamp · March 28, 2024, 23:10

Interesting that the multiples of 24 show in the result.

MangoNrFive · March 29, 2024, 07:39

My Hypothesis is, when benchmarking all of the core-counts we would see a saw-tooth-pattern in the speedup.

My explanation for this is, that for example at 168 cores all of the CCDs are balanced nicely with 7 cores each. If we add just one more core, than the workload of each of the cores is only a tiny bit less (decreased by 169/168) but one of the CCDs now has 8 cores instead of 7 (so an increase of the workload of this one CCD by (8/7 * 168/169). So in the simulation, this one CCD acts as the weakest link. If we add yet another core, we get another tiny speedup by decreasing the workload of each core (now a decrease of 170/168 in total) we now have two "bad" CCDs but this second "bad" CCD doesn´t make it worse because just the weakest link counts. So 168 is best, 169 is a lot worse, and 170 a little better than 169 but still a lot worse than 168 and so on until 192 where the CCD are balanced again and we have the best performance.

flotus1 · March 29, 2024, 08:35

True in theory, when scattering threads across CCDs. In order to actually see this, quite a bit of effort would be necessary, in order to reduce run-to-run variance.
Additionally, with this many cores, the quality of domain decomposition can have a larger impact than adding/removing a few cores.

wkernkamp · March 30, 2024, 03:50

Quote:

Originally Posted by flotus1

True in theory, when scattering threads across CCDs. In order to actually see this, quite a bit of effort would be necessary, in order to reduce run-to-run variance.
Additionally, with this many cores, the quality of domain decomposition can have a larger impact than adding/removing a few cores.

I don't know if that is true. He had on his first run a recognizable pattern already each 24 cores.

AlexKaz · April 3, 2024, 06:12

I think the main difference of new Epyc results lies in only huge L3 cache. I saw with /proc/your_process_id/status, that binaries of snappyHexMesh and simpleFoam use about 400 MB of code with libraries. It is no surprising that if all the code lies in cache, then we sometimes are watching the big speed up. Also, as I seeing, the main code is very short, about some MB.

DVSoares · April 3, 2024, 08:31

Quote:

Originally Posted by AlexKaz

I think the main difference of new Epyc results lies in only huge L3 cache. I saw with /proc/your_process_id/status, that binaries of snappyHexMesh and simpleFoam use about 400 MB of code with libraries. It is no surprising that if all the code lies in cache, then we sometimes are watching the big speed up. Also, as I seeing, the main code is very short, about some MB.

Fully agree, especially considering how cache memory is faster than system RAM. Intel has a not too technical article on that: https://www.intel.com/content/www/us...-nutshell.html
I’m confident that anyone could find more scientific papers on computer science journals that cover this topic, if needed.

Counterdoc · April 16, 2024, 17:26

Apple Macbook Pro with M1 Max and 32 GB RAM running the natively compiled Openfoam version 2312.

# cores Wall time (s):
------------------------
8 85.57
6 102.25
4 135.12
2 240.02
1 433.18

Crowdion · April 17, 2024, 15:30

Lenovo ThinkStation P520c, Xeon W-2275 (HT Off), 4 x 32GB DDR4 2666Mhz
OpenFoam2312 (precompiled), Ubuntu 23.10.1, Motorbike_bench_template.tar.gz (default settings)

Meshing (real)
# cores | Meshing Wall time| Solver Wall time(s):
------------------------
1 | 10m5s | 790
2 | 7m13s | 412
4 | 4m10s | 205
6 | 2m57s | 153
8 | 2m26s | 134
12 | 2m2s | 118
14 | 2m15s | 116

fishladderguy · April 18, 2024, 04:38

Hi, I am quite new in CFD (2D-guy, water simulations). I have read this thread and learnt a lot about hardware, thanks everyone. I am going to build a server something like 2x Epyc (used) and 16x16 RAM. Question is should I prefer CPU frequency or L3 cache if other specs are about similar? Like Epyc 7532 (2.4 GHz/256 Mb) vs. Epyc 7542 (2,9 GHz/128 Mb). Or is there any notable difference? Thanks

wkernkamp · April 19, 2024, 19:48

Quote:

Originally Posted by fishladderguy

Hi, I am quite new in CFD (2D-guy, water simulations). I have read this thread and learnt a lot about hardware, thanks everyone. I am going to build a server something like 2x Epyc (used) and 16x16 RAM. Question is should I prefer CPU frequency or L3 cache if other specs are about similar? Like Epyc 7532 (2.4 GHz/256 Mb) vs. Epyc 7542 (2,9 GHz/128 Mb). Or is there any notable difference? Thanks

I recommend you go for cache because it helps memory performance which is critical for CFD. In addition, more cache usually means more chiplets. Each chiplet contributes it's own infiniti fabric lanes and memory channels. Flotus is the expert on the Epycs. Check with him if I am right to prefer the Epyc 7532.

Crowdion · April 20, 2024, 09:55

HPE DL360 Gen9 , 2 x E5-2643 v4 (HT Off), 16 x 16GB DDR4 2400Mhz (operates at 2133MHz, measured bandwidth (Intel MLC) is 105 GB/s)
OpenFoam2312 (precompiled), Ubuntu 23.10.1, Motorbike_bench_template.tar.gz (default settings)

Meshing (real)
# cores | Meshing Wall time| Solver Wall time(s):
------------------------
1 | 13m14s | 1000
2 | 8m17s | 479
4 | 4m40s | 219
6 | 3m26s | 156
8 | 2m50s | 122
12 | 2m34s | 94

P.S.

I tested how the number of populated RAM slots affects the actual RAM bandwidth and got some pretty weird results. When the number of RAM modules is 2 modules per 1 memory channel, i.e. total there are 16 2400 Mhz RAM modules installed, the RAM operating frequency is slightly reduced down to 2133 MHz, and the measured actual throughput is 105 GB/s. While when installing 1 RAM module per memory channel, i.e. total 8 RAM modules are installed, the RAM operating frequency is 2400 MHz, and the measured actual throughput is 103 GB/s.. That is, as the RAM frequency increases, the throughput decreases. This is quite a strange behavior.

wkernkamp · April 20, 2024, 17:00

Quote:

Originally Posted by Crowdion

HPE DL360 Gen9 , 2 x E5-2643 v4 (HT Off), 16 x 16GB DDR4 2400Mhz (operates at 2133MHz, measured bandwidth (Intel MLC) is 105 GB/s)

I tested how the number of populated RAM slots affects the actual RAM bandwidth and got some pretty weird results. When the number of RAM modules is 2 modules per 1 memory channel, i.e. total there are 16 2400 Mhz RAM modules installed, the RAM operating frequency is slightly reduced down to 2133 MHz, and the measured actual throughput is 105 GB/s. While when installing 1 RAM module per memory channel, i.e. total 8 RAM modules are installed, the RAM operating frequency is 2400 MHz, and the measured actual throughput is 103 GB/s.. That is, as the RAM frequency increases, the throughput decreases. This is quite a strange behavior.

The available bandwidth for your 8 channel system is:
DDR4-2133: 8*2133*8=136.5 GB/s
DDR4-2400: 8*2400*8=153.6 GB/s

Both your measurements do not reach the true limit. On a dual socket system, the measurement can be affected by one cpu reading from memory attached to another through the interconnect. This interconnect has it's own limits and of course a delay.

The nice thing of the openfoam motobike benchmark is that it does (on a properly set up system) have performance proportional to bandwidth. Does that benchmark show a difference in bandwidth between the 2133 and 2400 memory speeds?

I don't remember if on the HP DL360, you can force DDR-2400 speed when two slots are occupied. Most server systems have that option. I have found that on these XEON v1 through v4 systems, the fastest configuration is two dimms per channel of dual rank memory. All dimms must be the same to keep the system symmetric. Unsymmetric memory configurations incur large penalties. One dimm per channel, or single rank memory in one or more channels incurs small penalties.

wkernkamp · April 20, 2024, 17:07

Quote:

Originally Posted by Crowdion

HPE DL360 Gen9 , 2 x E5-2643 v4 (HT Off), 16 x 16GB DDR4 2400Mhz (operates at 2133MHz, measured bandwidth (Intel MLC) is 105 GB/s)
OpenFoam2312 (precompiled), Ubuntu 23.10.1, Motorbike_bench_template.tar.gz (default settings)

Meshing (real)
# cores | Meshing Wall time| Solver Wall time(s):
------------------------
1 | 13m14s | 1000
2 | 8m17s | 479
4 | 4m40s | 219
6 | 3m26s | 156
8 | 2m50s | 122
12 | 2m34s | 94

My fastest Xeon v4 system is in the thread here: OpenFOAM benchmarks on various hardware

It has this performance for comparison with your run:
Flow Calculation:
1 924.05
2 483.68
4 214.54
8 113.42
12 85.05

Your results is already looking good! Note that that run has two 16 core cpus that each have a proportionally larger cache, which helps memory access. So, I don't think you will be able to reach 85.05 on your system with E5-2643 v4 cpus.

Crowdion · April 20, 2024, 20:37

Quote:

Originally Posted by wkernkamp

The available bandwidth for your 8 channel system is:
DDR4-2133: 8*2133*8=136.5 GB/s
DDR4-2400: 8*2400*8=153.6 GB/s

Both your measurements do not reach the true limit. On a dual socket system, the measurement can be affected by one cpu reading from memory attached to another through the interconnect. This interconnect has it's own limits and of course a delay.

The nice thing of the openfoam motobike benchmark is that it does (on a properly set up system) have performance proportional to bandwidth. Does that benchmark show a difference in bandwidth between the 2133 and 2400 memory speeds?

I don't remember if on the HP DL360, you can force DDR-2400 speed when two slots are occupied. Most server systems have that option. I have found that on these XEON v1 through v4 systems, the fastest configuration is two dimms per channel of dual rank memory. All dimms must be the same to keep the system symmetric. Unsymmetric memory configurations incur large penalties. One dimm per channel, or single rank memory in one or more channels incurs small penalties.

Yes, I was surprised, that the actual peak bandwidth(BW) of my system was considerably lower than the your mentioned theoretical peak values for 2133 and 2400 MHz. Actually, 105GB/s for dual CPU config corresponds to DDR3 rates.
HP declares, that 2 DIMM/channel operates at 2133MHz, and 2400MHz at 1 DIMM/channel.

My DL360g9 has installed HP certified HPE 809082-091 single rank RAM.

I have removed 8 DIMMs to make 1 DIMM/channel configuration and rerun the benchmark and get the following results:

HPE DL360 Gen9 , 2 x E5-2643 v4 (HT Off), 8 x 16GB DDR4 2400Mhz (operates at 2400MHz)
OpenFoam2312 (precompiled), Ubuntu 23.10.1, Motorbike_bench_template.tar.gz (default settings)

Meshing (real)
# cores | Meshing Wall time| Solver Wall time(s):

"8 x 16GB" conf. | "16x 16GB" conf.
........"2400MHz" | "2133MHz"
--------------------------------------------------
1 | 12m2s | 902 | 13m14s | 1000
2 | 8m10s | 471 | 8m17s | 479
4 | 4m35s | 221 | 4m40s | 219
6 | 3m32s | 160 | 3m26s | 156
8 | 2m45s | 130 | 2m50s | 122
12 | 2m34s |115| 2m34s | 94

The single core performance is better for "8 x 16GB" config, whereas the multithread performance is better for "16 x 16GB" config. Hmm, a very strange situation.

I found in the Reddit thread reported measured BW values in different systems. It is reported ~138GB/S peak BW value for 2 x E5- 2683v4 (Supermicro X10DRG-OT+-CPU, 8x32GB DDR4 2400 Samsung RAM), which is much closer the theoretical peak BW value of 153.6 GB/s than mine measured

wkernkamp · April 20, 2024, 21:32

Quote:

Originally Posted by Crowdion

Yes, I was surprisedHPE DL360 Gen9 , 2 x E5-2643 v4 (HT Off), 8 x 16GB DDR4 2400Mhz (operates at 2400MHz)
OpenFoam2312 (precompiled), Ubuntu 23.10.1, Motorbike_bench_template.tar.gz (default settings)

Meshing (real)
# cores | Meshing Wall time| Solver Wall time(s):

"8 x 16GB" conf. | "16x 16GB" conf.
........"2400MHz" | "2133MHz"
--------------------------------------------------
1 | 12m2s | 902 | 13m14s | 1000
2 | 8m10s | 471 | 8m17s | 479
4 | 4m35s | 221 | 4m40s | 219
6 | 3m32s | 160 | 3m26s | 156
8 | 2m45s | 130 | 2m50s | 122
12 | 2m34s |115| 2m34s | 94

The single core performance is better for "8 x 16GB" config, whereas the multithread performance is better for "16 x 16GB" config. Hmm, a very strange situation.

I found in the Reddit thread reported measured BW values in different systems. It is reported ~138GB/S peak BW value for 2 x E5- 2683v4 (Supermicro X10DRG-OT+-CPU, 8x32GB DDR4 2400 Samsung RAM), which is much closer the theoretical peak BW value of 153.6 GB/s than mine measured

This is a fun puzzle!

I think that you need at least two ranks per channel to reach maximum throughput, because the controller alternates addressing these ranks for better performance. It is called rank interleaving I think.

Your best option is to go in the bios and see if you can force 2400 for the two-dimms-per-channel config. You may already have that setting, because the 94 sec result is pretty decent. Just checked and the e5-2643 v4 xeon has an unusually large cache of 20M. That is more than the 2.5xNcores which is the norm. So that is one reason it is doing so well.

If this is a machine you are going to use for a CFD project, you should look into getting a higher core count processor. They are cheap. Note that a bios upgrade is sometimes needed before the newer higher core count cpus work. I have had that problem. I have a pair of E5-2683 v4, which is the lower clocked 16-core cpu (versus the E5-2697A for which I showed the result). The 2683 will do the benchmark in 64 seconds, so not much slower. I also have the 18 core E5-2686 v4. Don't remember it's performance, probably 62 seconds. The Gygabyte motherboard has better memory performance for the same cpu. It has the ability to run two dimms at 2400 per channel as a feature per the manual.

Just looked on Ebay and the E5-2683 v4 is on offer for $25.

Counterdoc · April 21, 2024, 06:04

I am currently looking for a second hand server system. I found some offers on ebay for the Intel E7-8880 in different versions.

4x Intel Xeon E7-8880 v4 - 22 core @ 2.2 GHz base / 3.3 GHz turbo - DDR4 1866 MHz - 55 MB cache

4x Intel Xeon E7-8880 v3 - 18 core @ 2.2 GHz base / 3.1 GHz turbo - DDR4 1866 MHz - 45 MB cache

8x Intel Xeon E7-8880 v2 - 15 core @ 2.5 GHz base / 3.1 GHz turbo - DDR3 1600 MHz - 37,5 MB cache

Of course the v2 is the cheapest. It would also have the most cores. v2 = 120 cores, v3 = 72 cores, v4 = 88 cores.

I think v3 makes no sense as the prices are similar to v4 systems. But how would v2 compare with v4 when there is such a big difference in number of cores? Could DDR3 or the cache be the bottleneck?

wkernkamp · April 21, 2024, 15:01

Quote:

Originally Posted by Counterdoc

I am currently looking for a second hand server system. I found some offers on ebay for the Intel E7-8880 in different versions.

4x Intel Xeon E7-8880 v4 - 22 core @ 2.2 GHz base / 3.3 GHz turbo - DDR4 1866 MHz - 55 MB cache

4x Intel Xeon E7-8880 v3 - 18 core @ 2.2 GHz base / 3.1 GHz turbo - DDR4 1866 MHz - 45 MB cache

8x Intel Xeon E7-8880 v2 - 15 core @ 2.5 GHz base / 3.1 GHz turbo - DDR3 1600 MHz - 37,5 MB cache

Of course the v2 is the cheapest. It would also have the most cores. v2 = 120 cores, v3 = 72 cores, v4 = 88 cores.

I think v3 makes no sense as the prices are similar to v4 systems. But how would v2 compare with v4 when there is such a big difference in number of cores? Could DDR3 or the cache be the bottleneck?

The number of cores is less important than the total bandwidth. If you search "intel ark E7-8880" you will find that all three processors have the same 85 GB/s bandwidth. This means that the eight processor v2 system will have almost twice the performance of the other two with only four processors provided the memory is configured correctly. The E7 v2, v3 and v4 have a special memory controller "Jordan Creek" that allows two DDR 1333 dimms to act as one DDR 2666 unit. The bandwidth is then 2666*4*8/1000=85 GB/s. That is the speed limit for each of these cpus. The higher speeds DDR4 that the v3 and v4 allow are only useful when there is just one dimm per channel. You need eight dimms per processor to reach the bandwidth. These dimms are not expensive if you need to buy more.

The Xeon cpus get progressively more fuel efficient. So if power consumption is a concern, you should go for the v4 system.

Once you have bought your system, run the benchmark an post your result here. This is a good check to see if you have everything working correctly.

Counterdoc · April 23, 2024, 15:16

Quote:

Originally Posted by wkernkamp

The number of cores is less important than the total bandwidth. If you search "intel ark E7-8880" you will find that all three processors have the same 85 GB/s bandwidth. This means that the eight processor v2 system will have almost twice the performance of the other two with only four processors provided the memory is configured correctly. The E7 v2, v3 and v4 have a special memory controller "Jordan Creek" that allows two DDR 1333 dimms to act as one DDR 2666 unit. The bandwidth is then 2666*4*8/1000=85 GB/s. That is the speed limit for each of these cpus. The higher speeds DDR4 that the v3 and v4 allow are only useful when there is just one dimm per channel. You need eight dimms per processor to reach the bandwidth. These dimms are not expensive if you need to buy more.

The Xeon cpus get progressively more fuel efficient. So if power consumption is a concern, you should go for the v4 system.

Once you have bought your system, run the benchmark an post your result here. This is a good check to see if you have everything working correctly.

Thanks a lot for the explanation and recommendation! What about the SSDs? I don't want to have a bottleneck there either.

wkernkamp · April 23, 2024, 17:56

Quote:

Originally Posted by Counterdoc

Thanks a lot for the explanation and recommendation! What about the SSDs? I don't want to have a bottleneck there either.

You did not ask about SSD's before. In general, the disk storage is not a big factor because the data gets cached in RAM. So repeated reads from disk are read from RAM instead. Then there is the fact that these old servers typically come with some kind of SAS card. Those cards can be used in raid configs that have high speed read and write, or better, you can just attach SSD's to them. That is more economical because they use almost no power when not in use and have way faster random access.

What I tend to do is set up redundant zfs with the HDD's (cheap storage) and SSDs as cache disk and for keeping the log. For this purpose your old 128GB SSD's are perfect. The problem of the old server HDDs is that they are often end of life. With zfs you can correct disk errors when they occur. Mostly the old server SAS HDD's are fine, but there is an occasional failure that won't hurt if you run zfs. (Hardware raid is not as good for that.)

If you don't need a lot of storage you can just use one or two 1TB SSD's. Two mirrored disks read twice as fast and obviously have redundancy.

If you are looking to use nvme drives, you might be better of with a v3 or v4 system, because their bios usually allows boot from nvme and pcie splitting. My quanta grid server has an nvme slot on the motherboard. There are cheap pciex16 cards that allow say 4 nvme SSDs to be run of the pcie16x4 slot when split into 4xpciex4 chunks. Of course you can use specialized cards that handle multiple nvme and M.2 SATA drives on a pcie slot that has not been split.

The bios on v2 systems can sometimes be modified to allow boot from nmve. I have successfully done this on a Supermicro server. This server bios now also allows the pcie splitting.

oswald · August 7, 2024, 11:12

Results for OpenFOAM 9 on a dual EPYC 9684X with 4800 MHz DDR5 RAM:

Code:

# cores   Wall time (s):
------------------------
1  546.46
4  110.53
8   51.49
12  35.64
16  27.53
20  22.26
24  19.38
28  16.93
32  15.38
40  12.53
48  10.85
56   9.62
64   8.67
96   6.92
128  6.49
160  6.03
192  6.43

For a clean cache run with

Code:

sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
mpirun -np 160 --report-bindings --bind-to core --rank-by core --map-by numa simpleFoam -parallel > log.simpleFoam 2>&1

I get an ExecutionTime of 5.6s for 160 cores

March 29, 2024, 07:39		#762
MangoNrFive Member Philipp Wiedemer Join Date: Dec 2016 Location: Munich, Germany Posts: 42 Rep Power: 10	My Hypothesis is, when benchmarking all of the core-counts we would see a saw-tooth-pattern in the speedup. My explanation for this is, that for example at 168 cores all of the CCDs are balanced nicely with 7 cores each. If we add just one more core, than the workload of each of the cores is only a tiny bit less (decreased by 169/168) but one of the CCDs now has 8 cores instead of 7 (so an increase of the workload of this one CCD by (8/7 * 168/169). So in the simulation, this one CCD acts as the weakest link. If we add yet another core, we get another tiny speedup by decreasing the workload of each core (now a decrease of 170/168 in total) we now have two "bad" CCDs but this second "bad" CCD doesn´t make it worse because just the weakest link counts. So 168 is best, 169 is a lot worse, and 170 a little better than 169 but still a lot worse than 168 and so on until 192 where the CCD are balanced again and we have the best performance. wkernkamp likes this.

April 3, 2024, 06:12		#765
AlexKaz New Member Alexander Kazantcev Join Date: Sep 2019 Posts: 24 Rep Power: 7	I think the main difference of new Epyc results lies in only huge L3 cache. I saw with /proc/your_process_id/status, that binaries of snappyHexMesh and simpleFoam use about 400 MB of code with libraries. It is no surprising that if all the code lies in cache, then we sometimes are watching the big speed up. Also, as I seeing, the main code is very short, about some MB. DVSoares likes this.

April 16, 2024, 17:26		#767
Counterdoc New Member Marius Join Date: Sep 2022 Posts: 27 Rep Power: 4	Apple Macbook Pro with M1 Max and 32 GB RAM running the natively compiled Openfoam version 2312. # cores Wall time (s): ------------------------ 8 85.57 6 102.25 4 135.12 2 240.02 1 433.18 aparangement, wkernkamp, gerlero and 1 others like this. Last edited by Counterdoc; April 20, 2024 at 18:19.

April 17, 2024, 15:30		#768
Crowdion New Member DS Join Date: Jan 2022 Posts: 15 Rep Power: 4	Lenovo ThinkStation P520c, Xeon W-2275 (HT Off), 4 x 32GB DDR4 2666Mhz OpenFoam2312 (precompiled), Ubuntu 23.10.1, Motorbike_bench_template.tar.gz (default settings) Meshing (real) # cores \| Meshing Wall time\| Solver Wall time(s): ------------------------ 1 \| 10m5s \| 790 2 \| 7m13s \| 412 4 \| 4m10s \| 205 6 \| 2m57s \| 153 8 \| 2m26s \| 134 12 \| 2m2s \| 118 14 \| 2m15s \| 116 wkernkamp likes this.

April 18, 2024, 04:38	CPU frequency vs. L3 cache	#769
fishladderguy New Member Jamie Join Date: Apr 2024 Posts: 1 Rep Power: 0	Hi, I am quite new in CFD (2D-guy, water simulations). I have read this thread and learnt a lot about hardware, thanks everyone. I am going to build a server something like 2x Epyc (used) and 16x16 RAM. Question is should I prefer CPU frequency or L3 cache if other specs are about similar? Like Epyc 7532 (2.4 GHz/256 Mb) vs. Epyc 7542 (2,9 GHz/128 Mb). Or is there any notable difference? Thanks

March 28, 2024, 23:10		#761
wkernkamp Senior Member Will Kernkamp Join Date: Jun 2014 Posts: 372 Rep Power: 14	Interesting that the multiples of 24 show in the result.

March 29, 2024, 08:35		#763
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,428 Rep Power: 49	True in theory, when scattering threads across CCDs. In order to actually see this, quite a bit of effort would be necessary, in order to reduce run-to-run variance. Additionally, with this many cores, the quality of domain decomposition can have a larger impact than adding/removing a few cores.

April 20, 2024, 09:55		#771
Crowdion New Member DS Join Date: Jan 2022 Posts: 15 Rep Power: 4	HPE DL360 Gen9 , 2 x E5-2643 v4 (HT Off), 16 x 16GB DDR4 2400Mhz (operates at 2133MHz, measured bandwidth (Intel MLC) is 105 GB/s) OpenFoam2312 (precompiled), Ubuntu 23.10.1, Motorbike_bench_template.tar.gz (default settings) Meshing (real) # cores \| Meshing Wall time\| Solver Wall time(s): ------------------------ 1 \| 13m14s \| 1000 2 \| 8m17s \| 479 4 \| 4m40s \| 219 6 \| 3m26s \| 156 8 \| 2m50s \| 122 12 \| 2m34s \| 94 P.S. I tested how the number of populated RAM slots affects the actual RAM bandwidth and got some pretty weird results. When the number of RAM modules is 2 modules per 1 memory channel, i.e. total there are 16 2400 Mhz RAM modules installed, the RAM operating frequency is slightly reduced down to 2133 MHz, and the measured actual throughput is 105 GB/s. While when installing 1 RAM module per memory channel, i.e. total 8 RAM modules are installed, the RAM operating frequency is 2400 MHz, and the measured actual throughput is 103 GB/s.. That is, as the RAM frequency increases, the throughput decreases. This is quite a strange behavior. Last edited by Crowdion; April 20, 2024 at 13:52.

April 21, 2024, 06:04		#776
Counterdoc New Member Marius Join Date: Sep 2022 Posts: 27 Rep Power: 4	I am currently looking for a second hand server system. I found some offers on ebay for the Intel E7-8880 in different versions. 4x Intel Xeon E7-8880 v4 - 22 core @ 2.2 GHz base / 3.3 GHz turbo - DDR4 1866 MHz - 55 MB cache 4x Intel Xeon E7-8880 v3 - 18 core @ 2.2 GHz base / 3.1 GHz turbo - DDR4 1866 MHz - 45 MB cache 8x Intel Xeon E7-8880 v2 - 15 core @ 2.5 GHz base / 3.1 GHz turbo - DDR3 1600 MHz - 37,5 MB cache Of course the v2 is the cheapest. It would also have the most cores. v2 = 120 cores, v3 = 72 cores, v4 = 88 cores. I think v3 makes no sense as the prices are similar to v4 systems. But how would v2 compare with v4 when there is such a big difference in number of cores? Could DDR3 or the cache be the bottleneck?

August 7, 2024, 11:12	2x EPYC 9684X	#780
oswald Member Join Date: Sep 2010 Location: Leipzig, Germany Posts: 96 Rep Power: 16	Results for OpenFOAM 9 on a dual EPYC 9684X with 4800 MHz DDR5 RAM: Code: # cores Wall time (s): ------------------------ 1 546.46 4 110.53 8 51.49 12 35.64 16 27.53 20 22.26 24 19.38 28 16.93 32 15.38 40 12.53 48 10.85 56 9.62 64 8.67 96 6.92 128 6.49 160 6.03 192 6.43 For a clean cache run with Code: sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches" mpirun -np 160 --report-bindings --bind-to core --rank-by core --map-by numa simpleFoam -parallel > log.simpleFoam 2>&1 I get an ExecutionTime of 5.6s for 160 cores DVSoares, flotus1, techtuner and 5 others like this. Last edited by oswald; August 8, 2024 at 04:51. Reason: Corrected RAM frequency from 5600 MHz to 4800 MHz. Thanks to flotus1 for pointing it out! Added time for clean cache.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
How to contribute to the community of OpenFOAM users and to the OpenFOAM technology	wyldckat	OpenFOAM	17	November 10, 2017 16:54
UNIGE February 13th-17th - 2107. OpenFOAM advaced training days	joegi.geo	OpenFOAM Announcements from Other Sources	0	October 1, 2016 20:20
OpenFOAM Training Beijing 22-26 Aug 2016	cfd.direct	OpenFOAM Announcements from Other Sources	0	May 3, 2016 05:57
New OpenFOAM Forum Structure	jola	OpenFOAM	2	October 19, 2011 07:55
Hardware for OpenFOAM LES	LijieNPIC	Hardware	0	November 8, 2010 10:54