OpenFOAM benchmarks on various hardware

#701 | Will Kernkamp (wkernkamp), Senior Member | April 19, 2023, 13:21
Quote:
Originally Posted by Tibo99
Ok, thank you for the clarification!

So, changing this setting is probably the last thing I can do to push the envelope without affecting the hardware too much, right?

Regards,
Yes, I think so.

#702 | Joost (Lavos), New Member | April 28, 2023, 13:49
Little update on the hobo-cluster (8 dual-socket E5-2670 v1 nodes with DDR3-1333), now with InfiniBand (40 Gbit QDR) as it was meant to be. I'm basically seeing linear scaling with the GAMG solver and super-linear scaling with the PCG solver, as advised by the AWS team. In theory, the super-linear scaling comes from the small size of the benchmark relative to the CPU caches. The biggest lesson has been that the ~1 µs MPI latency offered by InfiniBand RDMA is really a must for scaling OpenFOAM across multiple nodes. I first had the Mellanox cards running in 10 Gbit Ethernet mode over the classic TCP stack, and scaling was just awful.

GAMG solver:
Code:
cores   Wall time (s)
16      130.1
32      65.2
48      43.1
64      31.5
80      26.5
96      21.5
112     18.3
128     16.2

PCG solver (flow calculation):
Code:
cores   Wall time (s)
16      130.0
32      65.0
48      41.9
64      28.9
80      22.4
96      17.8
112     15.1
128     13.4

I'm pretty happy with the results as-is. I might be able to get another 10-25% by overclocking the memory to 1666 and with some BIOS/InfiniBand tuning. Not sure it's worth the stability trade-off, though; I'd rather hook up an extra 4 nodes if the need arises.
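
For anyone who wants to check the scaling claim, the posted timings reduce to parallel efficiencies with a few lines of Python (a minimal sketch; the numbers are copied from the tables above, and efficiency is speedup over the 16-core baseline divided by the ideal speedup):

Code:
# Parallel efficiency relative to the 16-core baseline; values above 1.0
# indicate super-linear scaling (the per-core working set fits in cache).
gamg = {16: 130.1, 32: 65.2, 48: 43.1, 64: 31.5, 80: 26.5, 96: 21.5, 112: 18.3, 128: 16.2}
pcg  = {16: 130.0, 32: 65.0, 48: 41.9, 64: 28.9, 80: 22.4, 96: 17.8, 112: 15.1, 128: 13.4}

for name, runs in (("GAMG", gamg), ("PCG", pcg)):
    t16 = runs[16]
    for n, t in sorted(runs.items()):
        eff = (t16 / t) / (n / 16)  # measured speedup / ideal speedup
        print(f"{name} {n:3d} cores: efficiency {eff:.2f}")
PCG comes out around 1.2 at 128 cores while GAMG stays near 1.0, matching the linear vs. super-linear observation above.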

#703 | Will Kernkamp (wkernkamp), Senior Member | April 28, 2023, 21:17
Quote:
Originally Posted by Lavos
Little update on the hobo-cluster (8 dual-socket E5-2670 v1 nodes with DDR3-1333), now with InfiniBand (40 Gbit QDR) as it was meant to be. [...] I might be able to get another 10-25% by overclocking the memory to 1666 and with some BIOS/InfiniBand tuning. Not sure it's worth the stability trade-off, though; I'd rather hook up an extra 4 nodes if the need arises.
You could also look into upgrading to Ivy Bridge (Xeon E5 v2) instead of adding nodes. Processor prices are low: an E5-2697 v2 is $37.50 domestically with four-day delivery, or $32 from China. They can potentially run DDR3-1866. I have been able to push 1333 memory to 1866 in the past, but no guarantees! CPU frequencies are higher across more cores, and power consumption will be lower.
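
As a back-of-the-envelope check on what the memory upgrade buys (a sketch assuming the 4 DDR3 channels per socket these Xeons have):

Code:
# Theoretical peak DRAM bandwidth per socket: channels x 8 bytes x MT/s.
def peak_gbs(channels, mts):
    return channels * 8 * mts / 1000  # GB/s

print(peak_gbs(4, 1333))  # ~42.7 GB/s per socket at DDR3-1333
print(peak_gbs(4, 1866))  # ~59.7 GB/s per socket at DDR3-1866, +40%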

#704 | Joost (Lavos), New Member | May 2, 2023, 06:15
Quote:
Originally Posted by wkernkamp
You could also look into upgrading to Ivy Bridge (Xeon E5 v2) instead of adding nodes. [...]
I will experiment on a single node to see if the memory takes it. The $18 E5-2696 v2 is likely already sufficient, since OpenFOAM scaling is so memory-bandwidth bound. I just learned that Ivy Bridge introduced QPI home snoop mode, which should also provide significantly higher inter-socket bandwidth than the older generation. Definitely worth a try! It would be mad if we could push the benchmark below 0.1 s/iteration with less than $1k of janky old hardware.

#705 | Will Kernkamp (wkernkamp), Senior Member | May 2, 2023, 11:52
Quote:
Originally Posted by Lavos
I will experiment on a single node to see if the memory takes it. [...] It would be mad if we could push the benchmark below 0.1 s/iteration with less than $1k of janky old hardware.

Go for it!

#706 | Andrew (Malinator), New Member | May 24, 2023, 14:42 | Ryzen 7700X
Bench results for a modern workstation/desktop on a budget.

HW: AMD Ryzen 7700X (8-core Zen 4), MSI MAG B650, 2x16 GB DDR5 (XMP 6200 MT/s CL40, Hynix M-die based)
HW tuning: SMT off, PBO on, Curve Optimizer to reduce core voltage by 30 mV, memory timings and subtimings carefully optimized at 6200 MT/s (30-37-..., etc.), FCLK 2133 MHz

SW:
Win: Windows 10 Pro 22H2, WSL2, OF10 on Ubuntu 22.04.2
Lin: Fedora 38, kernel 6.2.15, OF10 compiled with the additional -march=znver4 flag

Results (average of 3 runs of benchv02 from the thread's first post):

Win10 + WSL2
Code:
cores   Flow calculation (s)   Meshing (s)
1       312.3                  636.5
2       189.2                  430.6
4       130.2                  243.3
6       112.5                  202.3
8       109.9                  184.2

Linux native
Code:
cores   Flow calculation (s)   Meshing (s)
1       331.5                  567.0
2       192.9                  399.4
4       126.2                  241.0
6       110.3                  209.4
8       105.9                  162.9

Conclusions: a decent machine for pre- and post-processing (note the respectable meshing times of ~160 s), and even capable of light calculations, but...
IMHO this particular model (and maybe the latest Ryzen consumer line as a whole, except for the *X3D models) cannot use the full capability of fast DDR5 modules. 6200-6400 MT/s is typically the highest sustainable speed, and memory bandwidth is still constrained by the Infinity Fabric. This particular CPU has a single CCD, which (reportedly, and consistent with my observations) likely makes FCLK another bottleneck in memory-read-heavy tasks. All in all, performance-wise in CFD workloads it is more like a 5800X3D, and it lags considerably behind rivals from the latest Intel 13*00K line, which can achieve higher memory bandwidth.
Still, a decent upgrade over 2-3 year old consumer hardware for a relatively quiet desktop workstation.
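
A rough check of the FCLK-bottleneck claim (the 32 bytes read per fabric clock per CCD is my assumption based on published Zen architecture overviews, not a measured figure):

Code:
# Dual-channel DDR5-6200 peak vs. what one CCD can read over the
# Infinity Fabric at FCLK 2133 MHz (assumed 32 B/clk read width).
dram_gbs = 2 * 8 * 6200 / 1000   # ~99.2 GB/s DRAM peak
ccd_read_gbs = 32 * 2133 / 1000  # ~68.3 GB/s into a single CCD

print(f"DRAM peak {dram_gbs:.1f} GB/s, CCD read link {ccd_read_gbs:.1f} GB/s")
# One CCD cannot drain the full DRAM bandwidth, consistent with FCLK
# becoming the bottleneck in memory-read-heavy tasks.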

#707 | Dongyue Li (sharonyue), Senior Member, Beijing, China | May 31, 2023, 01:54
Quote:
Originally Posted by ym92
Unfortunately, the only version I have installed is v2206. Is there a good test case which is available in both OF10 and v2206? Or maybe I can find a way to download that tutorial somewhere.
Please see the attachment (20M cells). It should work with v2206.

Alright, the attachment does not work; please download it here instead: https://www.cfd-china.com/assets/upl...9-2000w.tar.xz
__________________
My OpenFOAM algorithm website: http://dyfluid.com
By far the largest Chinese CFD forum: http://www.cfd-china.com/category/6/openfoam
We provide lots of clusters to Chinese customers, and we are considering doing business overseas: http://dyfluid.com/DMCmodel.html


#708 | Yan (aparangement), Member, Milano | July 11, 2023, 23:19
Just wondering if DDR5-6000 would be faster than this.

Quote:
Originally Posted by Malinator
Bench results for a modern workstation/desktop on a budget. [...] All in all, performance-wise in CFD workloads it is more like a 5800X3D, and it lags considerably behind rivals from the latest Intel 13*00K line, which can achieve higher memory bandwidth.

#709 | Yan (aparangement), Member, Milano | July 11, 2023, 23:41
Those numbers are just too good!

However, would you mind checking whether the runs with 48+ threads actually finished normally?

I am curious because the improvement is huge compared with the standard-L3 Epyc 7003 parts.

Quote:
Originally Posted by oswald
Hardware: 2x EPYC 7573X, 16x 32 GB DDR4
Software: Ubuntu 20.04.3, OF7

Code:
cores   Wall time (s)
1       492.5
4       113.53
8       57.91
12      39.68
16      31.88
20      28.08
24      25.14
28      24.14
32      22.34
40      21.49
48      17.17
56      12.53
64      11.55
I did not use core binding, which might explain the bad scaling behaviour when using 20 to 40 cores. Compared to my 2x EPYC 7543 workstation, this machine is ~33% faster on 64 cores.

#710 | oswald, Member, Leipzig, Germany | July 24, 2023, 03:42
Hi Yan,

I checked it and all runs finished as intended.

#711 | Alex (flotus1), Super Moderator, Germany | July 24, 2023, 05:56
The results look pretty tame compared to "normal" Epyc Milan without 3D V-Cache. Compared to my results with two 7543 from earlier in this thread:

Code:
#threads | 7543   | 7573X
=========|========|=======
01       | 471.92 | 492.5
02       | 227.14 | ---
04       | 108.51 | 113.53
08       |  52.11 | 57.91
16       |  28.81 | 31.88
32       |  18.11 | 22.34
48       |  15.46 | 17.17
64       |  13.81 | 11.55
I went through some effort to get the intermediate thread-count results as fast as possible, so the only reasonable comparison to draw here is at 64 threads. And that difference is well within expectations.

#712 | dab bence (danbence), Member | July 25, 2023, 11:06 | Genoa-X OpenFOAM performance information released
https://www.amd.com/system/files/doc...b-openfoam.pdf

#713 | Will Kernkamp (wkernkamp), Senior Member | July 25, 2023, 12:34
Quote:
Originally Posted by danbence
https://www.amd.com/system/files/doc...b-openfoam.pdf
That is based on a 100x40x40 grid, which is a really small problem. The benefit of the L3 cache shrinks as the problem gets larger.

#714 | L C, New Member | July 26, 2023, 14:45
Just take note that the link shows results for the new Genoa-X, which has a revised microarchitecture and increased L2 cache per core, so it likely scales differently than Milan-X.

#715 | Will Kernkamp (wkernkamp), Senior Member | July 26, 2023, 21:21
Quote:
Originally Posted by L C
Just take note that the link shows results for the new Genoa-X, which has a revised microarchitecture and increased L2 cache per core, so it likely scales differently than Milan-X.
OpenFOAM solution times at higher core counts are determined by memory bandwidth. The bandwidth to the various caches is much higher than the bandwidth to main memory; in fact, memory will not even come into play on these systems when the problem is small. So AMD, having the larger caches, is giving itself the maximum advantage.

I ran a dual Xeon v2 system on the Phoronix 30M- and 60M-cell OpenFOAM tests, comparing 2x E5-4627 v2 (16 cores) to 2x E5-2697 v2 (24 cores; the cores beyond 16 don't add much). The difference was quite large on the 30M problem in favor of the E5-2697 v2; however, it shrank on the 60M problem. I attribute the difference to the 50% larger cache of that processor. The larger the problem gets, the more the equal memory bandwidth equalizes the run times. On the 2M OpenFOAM benchmark, the 2x E5-4627 v2 completes in 100 seconds and the 2x E5-2697 v2 in 86 seconds. I tried to look up the openbenchmarking.org results but gave up; that website is badly in need of a usable interface.
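
To make the cache-versus-problem-size point concrete, a rough sketch (the ~1 kB of solver memory per cell is a common rule of thumb, and the 1152 MB of L3 assumes a Genoa-X 9684X socket; neither number is from the posts above):

Code:
# Does the working set fit in L3? Assume roughly 1 kB of solver memory
# per cell (rule of thumb, not a measurement).
def working_set_mb(cells, kb_per_cell=1.0):
    return cells * kb_per_cell / 1024

l3_mb = 1152  # assumed: Genoa-X 9684X, 12 CCDs x 96 MB stacked L3

for cells in (160e3, 2e6, 30e6, 60e6):
    ws = working_set_mb(cells)
    print(f"{cells / 1e6:5.2f}M cells: ~{ws:7.0f} MB, fits in L3: {ws <= l3_mb}")
The 100x40x40 case (0.16M cells) fits with room to spare, while the 30M- and 60M-cell Phoronix cases spill far into DRAM, which is where equal memory bandwidth starts to equalize run times.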

#716 | Yan (aparangement), Member, Milano | July 27, 2023, 05:22
Your 7543 is fast, for sure.

But I think a fair comparison would be running the case without MPI tuning (or with the same level of tuning, though that is sometimes difficult).

Quote:
Originally Posted by flotus1
The results look pretty tame compared to "normal" Epyc Milan without 3D V-Cache. [...] So the only reasonable comparison to draw here is at 64 threads. And that difference is well within expectations.

#717 | L C, New Member | July 27, 2023, 14:00
Quote:
Originally Posted by wkernkamp
OpenFOAM solution times at higher core counts are determined by memory bandwidth. [...] So AMD, having the larger caches, is giving itself the maximum advantage.
That's what I was trying to convey, just from the other side. When the problem is not constrained by main memory (i.e., it fits within the L3 cache), I'd expect Genoa to scale better than Rome because the cache hierarchy works more efficiently.

#718 | Johannes (mrlau), New Member | July 28, 2023, 14:26 | Slow 5800X3D
Hi everyone,

I'm having problems replicating the good results reported here for the 5800X3D.

I'm using two sticks of (I think) single-rank 3600 MT/s memory. DOCP is on in the BIOS, and I have populated the memory slots in accordance with the manual.

I have tried both benchmarks linked in this thread, on OpenFOAM v2306 and OpenFOAM 11, both packaged and compiled myself.

I have also tried turning SMT off in the BIOS, but that does not seem to make a major difference.

My OS is Ubuntu 22.04.2 LTS.

The best results I have gotten are:

Code:
cores   Wall time (s)
1       409.13
2       235.1
4       174.13
6       167.26
8       171.62
Quite far from what I have seen others report.

Under Windows the system performs fine in Cinebench, so I don't think it is a temperature problem. The PC uses custom-loop water cooling.

Any help getting the performance up would be greatly appreciated.
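
In case it helps with the diagnosis, here is a crude memory-bandwidth check (a minimal numpy sketch, not a calibrated STREAM run; ~57.6 GB/s is the theoretical dual-channel DDR4-3600 peak, 2 channels x 8 bytes x 3600 MT/s):

Code:
# Stream three large arrays through RAM and report effective bandwidth.
# A result far below what other 5800X3D owners see would point at the
# memory setup rather than at OpenFOAM itself.
import time
import numpy as np

n = 200_000_000         # three ~1.6 GB arrays, far beyond any cache
a = np.ones(n)
b = np.ones(n)

t0 = time.perf_counter()
c = a + b               # reads a and b, writes c: 24 bytes per element
dt = time.perf_counter() - t0
print(f"effective bandwidth: {24 * n / dt / 1e9:.1f} GB/s "
      f"(theoretical DDR4-3600 dual-channel peak: ~57.6 GB/s)")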

#719 | Will Kernkamp (wkernkamp), Senior Member | July 28, 2023, 15:15
I have not run this CPU myself, but I seem to remember that the RAM was run at 4800 MT/s. That would be 33% faster; if you can run your memory at that speed, you might reduce your run time by 20-25%.
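
The arithmetic behind that estimate, as a sketch (the bandwidth-bound fraction f is an assumption, not a measured value):

Code:
# Project wall time after a RAM speed bump, assuming a fraction f of the
# runtime scales inversely with the memory transfer rate.
def projected_time(t_old, mts_old, mts_new, f=0.8):
    return t_old * ((1 - f) + f * mts_old / mts_new)

# 8-core result above, 3600 -> 4800 MT/s (+33% bandwidth):
print(projected_time(171.62, 3600, 4800))  # ~137 s, roughly 20% faster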

#720 | Johannes (mrlau), New Member | July 28, 2023, 15:57
Quote:
Originally Posted by Simbelmynė
5800X3D, 2x8 GB DDR4 single-rank @ 3200 MT/s (14-14-14-14-28, 1T)
OFv9, OpenSUSE Tumbleweed, GCC 11.2, kernel 5.17.4

The 1-core result is amazing and the 6-core result is pretty decent as well. I assume this is the fastest dual-channel CPU for CFD right now. Well, at least until someone with a large wallet posts results for Alder Lake with DDR5 @ 6400+ MT/s. EDIT: missed the post a couple of pages back; the i5-12600 with DDR5 @ 6000 MT/s is indeed faster, and not terribly expensive with a B660 motherboard, so definitely better value if buying an entire new computer.
The single-core result is 33% faster than the 5900X (from this thread). The 5900X has a single-core boost up to 4.8 GHz while the 5800X3D only boosts to 4.5 GHz. Apparently the extra V-Cache matters more than the extra single-core speed.

Code:
cores   Simulation (s)   Meshing
1       314.21           12m23s
2       201.98           8m21s
4       149.98           5m05s
6       138.55           4m02s
Will update if I manage to push the memory and IF to 1800 MHz.

EDIT:
2x8 GB DDR4 single-rank @ 3800 MT/s (16-16-16-16-32, 1T)

Code:
cores   Simulation (s)   Meshing
1       304              12m14s
2       188              8m12s
4       135              4m58s
6       124              3m55s
8       122              3m28s
I have some results where the IF manages 2000 MHz, which admits 4000 MT/s in 1:1 mode. It is not fully stable though, so I need a few more days to learn this particular CPU. The interesting part is that higher IF speed also decreases L3 cache latency, so it does not only admit higher bandwidth.
These results were obtained with memory at 3200 MT/s and 3800 MT/s; even if the timings are a little better on the 3200 kit, my results should still be in the same ballpark, I would think.
