OpenFOAM benchmarks on various hardware

danbence · October 25, 2024, 15:54

An AMD performance brief compares the Turin 9755 (2x128) with the Genoa 9654 (2x96) and shows a 43% uplift on the composite OpenFOAM benchmarks they chose.

http://www.amd.com/content/dam/amd/e...b-openfoam.pdf

gumersindu · October 29, 2024, 02:41

Hi all,

I'm trying to run the benchmark attached to the original post, "bench_template.tar.gz", on my PC: 2 x Intel Xeon E5-2690 v4 (14 cores, 2.6 GHz, 35 MB L3) | 8 x 16GB DDR4 ECC | 1TB HDD | Ubuntu 24.04 LTS | OpenFOAM v2312

It looks like snappyHexMesh is failing to create the mesh. Is there an updated version maybe?

andy_ · October 29, 2024, 04:18

Quote:

Originally Posted by gumersindu

It looks like snappyHexMesh is failing to create the mesh. Is there an updated version maybe?

It didn't work for me either (Ubuntu 24.04), but the test case seemed to be one of the tutorials with, if memory serves, an increased grid density, so I ran that instead. I'm not in the office at the moment but will look for the script I used when I am back.

andy_ · October 29, 2024, 06:53

OK, some of what I did is coming back, but I am not an OpenFOAM user and was hacking to get something working rather than carefully setting up a benchmark.

Versions 11 and 12 of OpenFOAM are organised significantly differently and require different scripts. I don't know about earlier versions.

I ran version 12 something like this (change the lists of processor counts for meshing, solving and reporting the timings to taste):

Code:

#!/bin/bash
# PREPROCS="1 2 4 8 16"
# RUNPROCS="1 2 4 8 16"
PREPROCS=""
RUNPROCS="1"
TIMPROCS="1 2 4 8 16"

# Prepare cases
# Meshes one case per core count listed in PREPROCS (empty here, so nothing is meshed)
for i in $PREPROCS; do
    d=run_$i
    echo "Prepare case ${d}..."
    cp -r basecase $d
    cd $d

    cp $FOAM_TUTORIALS/resources/geometry/motorBike.obj.gz constant/geometry/
    surfaceFeatures > log.surfaceFeatures 2>&1
    blockMesh > log.blockMesh 2>&1

    if [ $i -eq 1 ] 
    then
        snappyHexMesh -overwrite > log.snappyHexMesh 2>&1
    else
        sed -i "s/numberOfSubdomains.*/numberOfSubdomains ${i};/" system/decomposeParDict
        decomposePar -copyZero > log.decomposePar 2>&1
        mpirun -np ${i} snappyHexMesh -overwrite -parallel > log.snappyHexMesh 2>&1
    fi
    cd ..
done

# Run cases
for i in $RUNPROCS; do
    echo "Run for ${i}..."
    cd run_$i
    if [ $i -eq 1 ] 
    then
        potentialFoam > log.potentialFoam 2>&1
        foamRun -solver incompressibleFluid > log.incompressibleFluid 2>&1
    else
        # mpirun -np ${i} patchSummary  -parallel > log.patchSummary 2>&1
        mpirun -np ${i} potentialFoam -parallel > log.potentialFoam 2>&1
        mpirun -np ${i} foamRun -solver incompressibleFluid -parallel > log.incompressibleFluid 2>&1
        reconstructPar -latestTime > log.reconstructPar 2>&1

        # foamRun -solver incompressibleFluid -parallel
        #mpirun -np ${i} foamRun -solver incompressibleFluid -parallel > log.simpleFoam 2>&1
    fi
    cd ..
done

# Extract times
echo "# cores   Wall time (s):"
echo "------------------------"
for i in $TIMPROCS; do
    echo "$i $(grep Execution "run_${i}/log.incompressibleFluid" | tail -n 1 | cut -d " " -f 3)"
done

and version 11 something like this:

Code:

#!/bin/bash

# Prepare cases
# This example runs on 1, 2 and 4 cores
for i in 1 2 4; do
    d=run_$i
    echo "Prepare case ${d}..."
    cp -r basecase $d
    cd $d
    if [ $i -eq 1 ] 
    then
        mv Allmesh_serial Allmesh
    fi
    sed -i "s/method.*/method scotch;/" system/decomposeParDict
    sed -i "s/numberOfSubdomains.*/numberOfSubdomains ${i};/" system/decomposeParDict
    time ./Allmesh
    cd ..
done

# Run cases
for i in 1 2 4; do
    echo "Run for ${i}..."
    cd run_$i
    if [ $i -eq 1 ] 
    then
        simpleFoam > log.simpleFoam 2>&1
    else
        mpirun -np ${i} simpleFoam -parallel > log.simpleFoam 2>&1
    fi
    cd ..
done

# Extract times
echo "# cores   Wall time (s):"
echo "------------------------"
for i in 1 2 4; do
    echo "$i $(grep Execution "run_${i}/log.simpleFoam" | tail -n 1 | cut -d " " -f 3)"
done
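
The "Extract times" step in both scripts takes the third field of the last `Execution` line, which is the ExecutionTime value. As far as I know, OpenFOAM solver logs print ClockTime (wall clock) on the same line, so here is a minimal sketch that reports both. It assumes log lines of the form `ExecutionTime = ... s  ClockTime = ... s`; the function name `last_times` is made up for illustration and is not part of the scripts above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts): print the last
# ExecutionTime and ClockTime values found in a solver log.
# Assumes lines of the form "ExecutionTime = 12.3 s  ClockTime = 13 s".
last_times() {
    awk '/^ExecutionTime/ { exec_t = $3; clock_t = $7 }
         END { if (exec_t != "") print exec_t, clock_t }' "$1"
}

# Example: loop over the same run_N directories the scripts above create.
for i in 1 2 4; do
    log="run_${i}/log.simpleFoam"
    if [ -f "$log" ]; then
        echo "$i $(last_times "$log")"
    fi
done
```

For the version 12 script the log name would be `log.incompressibleFluid` instead.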

The basecase was the motorBikeSteady tutorial with the following changes:

system/controlDict:
< endTime 500;
> endTime 100;

system/blockMeshDict:
< hex (0 1 2 3 4 5 6 7) (20 8 8) simpleGrading (1 1 1)
> hex (0 1 2 3 4 5 6 7) (40 16 16) simpleGrading (1 1 1)

system/decomposeParDict:
< numberOfSubdomains 16;
> numberOfSubdomains 2;

The first two change the number of iterations and the mesh size to match the earlier benchmark. I am not sure about the third; I may have been fiddling. I am not an OpenFOAM user and was making guesses at the likely meaning of parameters. What is really needed is for an OpenFOAM user to diff the earlier benchmark and the current tutorial and keep the parameter changes that are relevant. In any case, my benchmark results are broadly in line with expectations, so if there are differences they are small.
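
For anyone wanting to reproduce the basecase, the three edits listed above could be applied with `sed`, roughly as follows. This is a sketch only: the function name `apply_bench_changes` is hypothetical, it assumes the dictionaries contain the original lines exactly as shown in the diff, and it uses GNU `sed -i` as the scripts above do.

```shell
#!/bin/bash
# Hypothetical helper: apply the three basecase edits to a copy of the
# tutorial. Assumes the "<" lines from the diff are present in the files.
apply_bench_changes() {
    local case_dir=$1
    # 500 -> 100 iterations
    sed -i 's/^endTime .*/endTime 100;/' "$case_dir/system/controlDict"
    # double the background mesh resolution in each direction
    sed -i 's/(20 8 8)/(40 16 16)/' "$case_dir/system/blockMeshDict"
    # default subdomain count (the run scripts overwrite this per case)
    sed -i 's/^numberOfSubdomains .*/numberOfSubdomains 2;/' "$case_dir/system/decomposeParDict"
}

# Usage (the directory name is an assumption):
#   apply_bench_changes basecase
```

Since both run scripts rewrite `numberOfSubdomains` for each parallel case anyway, the third edit probably only matters as a default.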

wkernkamp · October 29, 2024, 19:56

Quote:

Originally Posted by andy_

OK some of what I did is coming back but I am not an openfoam user and was hacking to get something working rather than carefully setting up a benchmark.

Version 11 and 12 of openfoam are organised significantly differently and require different scripts. Don't know about earlier versions.

I ran version 12 something like this .....

The first two change the number of iterations and the mesh size hints to match. Not sure about the 3rd but I may have been fiddling. I am not an openfoam user and was making guesses at the likely meaning of parameters. What is really needed is for an openfoam user to diff the earlier benchmark and the current tutorial and keep the parameter changes that are relevant. Whatever, my benchmark results are broadly in line with expectations and so if there are differences they are small.

I did not see your result. From the looks of it you made the correct changes.

I find it extremely annoying that the basic call for the simpleFoam solution has been changed by openfoam.org. It has remained the same for OpenFOAM.com (OpenFOAM v2312). I remember losing all interest in the Ruby language because they kept changing the language so that you had to rewrite your code for every version. Developers that don't know that better is the enemy of good should be shot to save us all a lot of time.

andy_ · October 30, 2024, 06:24

Quote:

Originally Posted by wkernkamp

I did not see your result. From the looks of it you made the correct changes.

My results are a few posts up on the previous page and thanks for the confirmation.

wkernkamp · October 30, 2024, 22:24

Quote:

Originally Posted by andy_

My results are a few posts up on the previous page and thanks for the confirmation.

I looked at your previous posts. I was confused because some are quotes of other people's results. Is your current system the one doing just over 100 seconds? If it is, you can do ~64 seconds with two 16+ core CPUs. My fastest one does 60 seconds. They have the same total of 8 memory channels at 2400 MT/s as you. However, they have much more L3 cache and that is a factor too. Upgrade your BIOS to the latest version before installing the high core count CPUs!

gumersindu · October 31, 2024, 05:35

Hi all,

I finally modified the motorbike tutorial to match the same configuration as in the benchmark from the original post. I've attached the modified code which worked for v2312.

These are the results I got for this PC config: HP Z840 | 2 x Intel Xeon E5-2690 v4 (14 cores, 2.6 GHz, 35 MB L3) | 8 x 16GB DDR4 ECC | 1TB HDD | Ubuntu 24.04 LTS

Code:

cores  MeshTime(s)     RunTime(s)     
-----------------------------------
1      1403.79         1098.68        
2      949.89          551.16         
4      495.73          246.11         
6      361.35          163.72         
8      293.58          128.46         
12     244.06          99.28          
16     229.99          84.12          
20     186.59          78.14          
24     183.3           74.44          
28     177.25          72.7
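
To read scaling off a table like this, speedup and parallel efficiency can be derived from the RunTime column. The following is a minimal sketch; `speedup_table` is a hypothetical name, and it assumes the first input row is the 1-core run.

```shell
#!/bin/bash
# Hypothetical helper: read "cores time" pairs on stdin, with the 1-core
# row first, and print cores, speedup and parallel efficiency (%).
speedup_table() {
    awk 'NR == 1 { base = $2 }
         { printf "%d %.2f %.1f\n", $1, base / $2, 100 * base / ($2 * $1) }'
}

# RunTime(s) values copied from the table above (1, 16 and 28 cores).
speedup_table <<'EOF'
1 1098.68
16 84.12
28 72.7
EOF
```

For the 28-core row this works out to a speedup of about 15.1, or roughly 54% parallel efficiency.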

linuxguy123 · November 10, 2024, 15:56

The new Mac Mini M4 is very fast and really cheap and can be purchased with a 10 Gb Ethernet port.

How would a cluster (8 or 16) of Minis perform using a 10 Gb Ethernet backbone?

The plain Mini also has a Thunderbolt 4 port that can transfer data at 40 Gb/s, while the Pro Mini has a Thunderbolt 5 port that can transfer data at 120 Gb/s. I bet that a special router could be designed to give these machines incredible backbone bandwidth. Thunderbolt encapsulates PCIe. https://en.wikipedia.org/wiki/Thunderbolt_(interface)

wkernkamp · November 10, 2024, 17:40

Quote:

Originally Posted by linuxguy123

The new Mac Mini M4 is very fast and really cheap and can be purchased with a 10 GB Ethernet port.

How would a cluster (8 or 16) of Minis perform using a 10 GB Ethernet backbone ?

The plain Mini also has a Thunderbolt 4 port that can transfer data at 40Gb/s while the Pro Mini has a Thunderbolt 5 port that can transfer data at 120Gb/s. I bet that a special router could be designed to give these machines incredible backbone bandwidth. Thunderbolt encapsulates PCIe. https://en.wikipedia.org/wiki/Thunderbolt_(interface)

10 Gb Ethernet is plenty fast for a small cluster. Get one and publish the benchmark results?

andy_ · November 11, 2024, 07:01

Quote:

Originally Posted by linuxguy123

The new Mac Mini M4 is very fast and really cheap and can be purchased with a 10 GB Ethernet port.

About 15 years ago or so Apple brought out and advertised their "really fast" consumer chip when I was about to buy a cluster for the department I was working in at the time. I think it was a PowerPC chip (?) but cheapened for the consumer market. Despite my skepticism, the head of department was a decades-old Apple fan and I felt obliged to benchmark it.

So I contacted Apple to get some values for relevant benchmarks rather than the irrelevant ones PC publications tended to use and Apple was using in their advertising to demonstrate how much "faster" their chip was compared to current Intel chips. They didn't have any. So I asked to be put through to their internal technical support. Extraordinarily (to naive me) they didn't have that either. Technical support was provided by 3rd parties and so they put me through to a chain of shops which indeed had a small technical support department. Unfortunately it was technical support for what Apple customers tend to want to do with Apple computers (e.g. generating media using point and click) rather than crunching numbers. They were happy to give me access to the hardware but they had little idea what I was on about, and when I sat down to compile and run some benchmarks the Apple development environment had not even been installed. As expected, the benchmarks ran fast on tiny problems but slowly on normal sized problems. The department's cluster ended up using fairly expensive motherboards with fast memory support and the cheapest Intel chips (i.e. lowest clock speed) that supported it.

Given how Apple operates, their target market and how they price things, the possibility of any Apple hardware offering a general high technical performance for the money is pretty low. It is not zero though, and given the effectiveness of their marketing, people looking to purchase clusters to crunch numbers will benefit from relevant hard evidence (unless they are fanboys of course). Clusters of ARM chips may well be about to become a good choice for CFD but I rather doubt Apple will be the supplier because of their pricing.

Perhaps I should add that Apple may be a reasonable choice for a desktop if CFD is only part of what is done with the machine. Indeed for 6 years I used an Apple laptop for office, lab and presenting but less so for software development or running simulations like CFD.

linuxguy123 · November 11, 2024, 12:43

Quote:

Originally Posted by andy_

About 15 years ago

A lot has changed since then.

Quote:

So I contacted Apple to get some values for relevant benchmarks rather than the irrelevant ones PC publications tended to use and Apple was using in their advertising to demonstrate how much "faster" their chip was compared to current intel chips.

It is very easy to compare computers on various benchmarks these days without relying on the manufacturer to do so. Geekbench, for example.

Quote:

They didn't have any. So I asked to be put through to their internal technical support. Extraordinarly (to naive me) they didn't have that either. Technical support was provided by 3rd parties and so they put me through to a chain of shops which indeed had a small technical support department. Unfortunately it was technical support for what Apple customers tend to want to do with Apple computers (e.g. generating media using point and click) rather than crunching numbers. They were happy to give me access to the hardware but they had little idea what I was on about and when I sat down to compile and run some benchmarks the Apple development environment had not even been installed. As expected the benchmarks ran fast on tiny problems but slowly on normal sized problems.

What does this have to do with anything today ?

Quote:

Given how Apple operates, their target market and how they price things the possibility of any Apple hardware offering a general high technical performance for the money is pretty low.

Let me introduce you to the Mac Mini. https://www.apple.com/mac-mini/

10 cores, 16 GB RAM, 256 GB SSD, 3 Thunderbolt 4 ports, US$600. Can add 10 Gb Ethernet for $125. Please show me a faster unit of computing for less money.

Quote:

It is not zero though and given the effectiveness of their marketing people looking to purchase clusters to crunch numbers will benefit from relevant hard evidence (unless they are fanboys of course). Clusters of ARM chips may well be about to become a good choice for CFD but I rather doubt Apple will be the supplier because of their pricing.

I am not an Apple fan. I don't own a single Apple product. I'm just looking for the cheapest way to run CFD cases fast. If you can show why an M4 Mac Mini won't do that then I am all ears. Otherwise you are adding nothing to this conversation.

linuxguy123 · November 11, 2024, 12:54

Quote:

Originally Posted by wkernkamp

10 Gb ethernet is plenty fast for a small cluster. Get one and publish benchmark result?

I have never run a cluster. How does one predict the performance of a cluster given the performance of a single machine within that cluster ?

How big is a "small" cluster ? 10 nodes ? 20 ? 32 ?

M4 Mac Minis supposedly have a memory bandwidth of "120 GB/s". M4 Mac Mini Pros supposedly have "over half a terabyte/sec" of memory bandwidth. https://en.wikipedia.org/wiki/Apple_M4

AMD EPYC Rome (Zen2, 7002) has a memory bandwidth of ~ 200 GB/sec (single socket). 8 Channels of DDR4-3200.
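
The ~200 GB/s figure follows from the usual theoretical-peak arithmetic: channels x transfer rate x 8 bytes per transfer for a 64-bit DDR channel. A small sketch of that calculation (the function name `peak_bw_gbs` is made up; sustained bandwidth in practice is lower than this peak):

```shell
#!/bin/bash
# Theoretical peak = channels x transfer rate (MT/s) x 8 bytes per transfer
# (64-bit DDR channel), reported in GB/s. Sustained bandwidth is lower.
peak_bw_gbs() {
    awk -v ch="$1" -v mts="$2" 'BEGIN { printf "%.1f\n", ch * mts * 8 / 1000 }'
}

peak_bw_gbs 8 3200   # single-socket EPYC Rome, 8 channels of DDR4-3200 -> 204.8
peak_bw_gbs 8 2400   # 8 channels in total at 2400 MT/s (dual Xeon E5 v4) -> 153.6
```

The second call uses the 8 channels at 2400 MT/s mentioned earlier in the thread for the dual Xeon E5 v4 machines.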

xuegy · November 12, 2024, 20:01

M4 Mac mini base model 4P+6E

# cores Wall time (s):
------------------------
1 315.54
2 191.29
4 118.64
8 111.61

The efficiency cores are kind of useless. Can't wait to see M4 Pro/M4 Max results.

linuxguy123 · November 12, 2024, 20:10

Quote:

Originally Posted by xuegy

M4 Mac mini base model 4P+6E

# cores Wall time (s):
------------------------
1 315.54
2 191.29
4 118.64
8 111.61

The efficiency core is kind of useless. Can't wait to see M4 Pro/M4 Max results.

111 seconds is pretty good for a $600 off the shelf box. The Pro is supposed to have a lot more bandwidth.

10 M4 Minis in a cluster would be 15 seconds ?
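
A rough way to put numbers on that guess is single-node time divided by (nodes x parallel efficiency). The sketch below is only that: `cluster_time` is a hypothetical name and the efficiency value is an assumption, not a measurement. A case this small spread over 10 nodes on Ethernet may well scale worse.

```shell
#!/bin/bash
# Hypothetical estimate: cluster time = single-node time / (nodes * efficiency).
# The efficiency value is a guess, not a measurement.
cluster_time() {
    awk -v t="$1" -v n="$2" -v eff="$3" 'BEGIN { printf "%.1f\n", t / (n * eff) }'
}

cluster_time 111.61 10 1.0    # ideal scaling
cluster_time 111.61 10 0.75   # assumed 75% parallel efficiency
```

With the 111.61 s result above, ideal scaling gives about 11 s and an assumed 75% efficiency gives about 15 s.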

xuegy · November 12, 2024, 21:05

Quote:

Originally Posted by linuxguy123

111 seconds is pretty good for a $600 off the shelf box. The Pro is supposed to have a lot more bandwidth.

10 M4 Minis in a cluster would be 15 seconds ?

Better wait for the new Mac Studio with M4 Max.

linuxguy123 · November 12, 2024, 23:04

Quote:

Originally Posted by xuegy

Better wait for the new Mac Studio with M4 Max.

Not sure it will be more cost competitive. It will be faster but also probably 4-5x more expensive.

We'll see. We live in interesting times.

xuegy · November 12, 2024, 23:07

Quote:

Originally Posted by linuxguy123

Not sure it will be more cost competitive. It will be faster but also probably 4-5x more expensive.

We'll see. We live in interesting times.

I would value it based on memory bandwidth. $100 per 20GB/s is OK.

aparangement · November 15, 2024, 05:31

Quote:

Originally Posted by xuegy

Better wait for the new Mac Studio with M4 Max.

It seems that the M3 Max enlarges the L3 (LLC?) cache to 64 MB, and its practical (CPU only?) memory bandwidth is higher than the M2 Max's.

I really hope the M4 does the same.

It's a pity Asahi Linux only works up to the M2.

Kolan · November 17, 2024, 18:32

So I've gotten my hands on a dual E5-2630-v3 Xeon Workstation with 128 (8x16) gigs of ECC 2133 MHz RAM. It's an ASUS Z10PA-D8 motherboard.

I've installed Ubuntu 24.04 and OpenFOAM 2406 and ran gumersindu's updated benchmark.

Code:

cores  MeshTime(s)     RunTime(s)     
-----------------------------------
1      1692.75         1095.52        
2      1161.71         561.19         
4      575.49          252.13         
6      449.91          172.21         
8      371.42          140.73         
12     296.43          111.46         
16     272.53          98.67

I've also got an M4 Pro (12 Core 48GB) Mac Mini on the way.

For reference here is my M3 Max again.

Code:

cores  MeshTime(s)     RunTime(s)     
-----------------------------------
1      510.43          377.13         
2      311.33          209.7          
4      195.35          110.33         
6      145.09          77.5     
8      124.87          63.6                 
12     125.53          81.98

I'll post the M4 Pro when it arrives in a few weeks.


Last edited by Kolan; November 17, 2024 at 19:11. Reason: updated 6 and 12 core runs for the M3 Max

