OpenFOAM benchmarks on various hardware

May 27, 2022, 15:30 | #501
flotus1 (Alex), Super Moderator, Germany, Joined Jun 2012, Posts: 3,428
Well, I had a new workstation to play around with. Unfortunately, I can't get the benchmark to run properly.
I tried both compiling 2112 from source, as well as using the OpenFOAM 2112 installation from the OpenSUSE science repository.
The solver runs, but the mesh is not created properly, leading to solver run times of ~16s on a single core. I used the bench_template_v02.zip provided by Simbelmynė.
The problems are the same either way. Here are the mesh logs from the single-core directory:
blockMesh.txt
decomposePar.txt
snappyHexMesh.txt
surfaceFeatures.txt
Maybe one of you can point me in the right direction.

May 28, 2022, 01:26 | #502
wkernkamp (Will Kernkamp), Senior Member, Joined Jun 2014, Posts: 372
I have run OF v2112. My MeshQualityDict in system has this includeEtc:
#includeEtc "caseDicts/meshQualityDict"


You seem to have the one that calls out caseDicts/mesh/generation/meshQualityDict, as shown in the snappyHexMesh.txt file. That may be for OpenFOAM v9 (not sure).


If this doesn't solve it, I will upload my entire basecase directory. Just let me know.

May 28, 2022, 05:39 | #503
flotus1 (Alex), Super Moderator, Germany, Joined Jun 2012, Posts: 3,428
Thanks, I changed that line in meshQualityDict.
Unfortunately, that didn't do the trick. If you could provide me with a basecase and run script known to work with 2112, that would be great.

May 28, 2022, 13:46 | #504
wkernkamp (Will Kernkamp), Senior Member, Joined Jun 2014, Posts: 372
Here it is. Run it with run.tst. The file has a list of the numbers of nodes at the beginning. A little further down you can set prep=0 to avoid recalculating the mesh if you already have a valid one. In the loop that runs OpenFOAM itself, I remove the simpleFoam log files etc. so that a rerun can proceed. On the first try these files do not exist yet, so you will see an error message that you can ignore.
Attached Files
File Type: zip benchOpenFOAM.zip (21.2 KB, 176 views)

May 28, 2022, 18:39 | #505
flotus1 (Alex), Super Moderator, Germany, Joined Jun 2012, Posts: 3,428
Phew, that finally worked. If you don't mind, I would like to add your script to the first post of this thread, or link to your post. Please let me know if you are ok with that.

Anyway, here is my new toy. Well not actually mine, but I still got to play with it for a while.
Hardware: 2x AMD Epyc 7543, Gigabyte MZ72-HB0, 16x64GB DDR4-3200 (RDIMM, 2Rx4)
Bios settings: SMT disabled, workload tuning: HPC optimized, power settings: default, ACPI SRAT L3 cache as NUMA domain: enabled (results in 16 NUMA nodes)
Software: OpenSUSE Leap 15.3 with backport kernel, OpenFOAM v2112 compiled with gcc 11.2.1 using march=znver3, OpenMPI 4.1.4, scaling governor: performance, caches cleared before each run with "echo 3 > /proc/sys/vm/drop_caches"
Code:
simpleFoam run times for 100 iterations:
#threads | runtime/s
====================
01       | 471.92
02       | 227.14
04       | 108.51
08       |  52.11
16       |  28.81
32       |  18.11
48       |  15.46
64       |  13.81
Compared to the same OpenFOAM version from the OpenSUSE science repo, this runs a little faster. On 64 cores, that version takes around 14.9s.
Also, using one NUMA node per CCX is still a little faster than the usual recommendation of NPS=4. But of course that would have huge drawbacks for software that isn't NUMA-aware.
Tweaking bios settings can be tricky. I got consistently worse performance when tweaking the power settings more towards performance. There is probably still a little more to gain, but I'd rather not overdo it with bios settings on someone else's hardware.
I should also note that some of the runs with intermediate thread counts needed some hand-holding. E.g. the threads for the 02 run got mapped to cores on the same memory controller with default settings. Running with "mpirun -np 2 --bind-to core --rank-by core --map-by socket" fixes that.
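For context, here is a quick look at the scaling these numbers imply (a sketch in Python, using only the run times quoted above):

```python
# Parallel speedup and efficiency for the 2x Epyc 7543 run times above
# (simpleFoam, 100 iterations).
runtimes = {1: 471.92, 2: 227.14, 4: 108.51, 8: 52.11,
            16: 28.81, 32: 18.11, 48: 15.46, 64: 13.81}

serial = runtimes[1]
for n, t in sorted(runtimes.items()):
    speedup = serial / t
    efficiency = speedup / n
    print(f"{n:2d} cores: speedup {speedup:5.1f}x, efficiency {efficiency:4.0%}")
```

On 64 cores this works out to roughly a 34x speedup, i.e. just over 50% parallel efficiency, which is in line with a memory-bandwidth-bound solver once all memory channels are saturated.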

Last edited by flotus1; May 29, 2022 at 08:19.

May 28, 2022, 20:29 | #506
wkernkamp (Will Kernkamp), Senior Member, Joined Jun 2014, Posts: 372
Go right ahead and post my modification of the original script. Before you do, you might include the mpirun commands you used for certain cases. I have been doing similar things, as you can see from the number of mpirun variants that are commented out. I have a version somewhere that selects the command based on the number of cores. The strategy for setting run parameters will be different for each CPU, but it is still nice to have it in the script so that people can develop their own plan without having to reinvent the wheel.



Nice job evaluating the borrowed machine. I also found that BIOS tweaking does not do much, except that the memory has to be set for performance (obviously). I also don't bother setting the fans to maximum; some servers are very noisy that way, and the fans will spin up as needed.

May 29, 2022, 08:16 | #507
flotus1 (Alex), Super Moderator, Germany, Joined Jun 2012, Posts: 3,428
Well, the precise mpirun commands for consistent results vary with the number of threads. Someone else might be able to find a single command that works for all thread counts, but then there are still the variables of hardware, NUMA topology and MPI libraries. I don't think there is a "one size fits all" solution here.
I could try to go into more detail about what to look for, but it would end up being a rather lengthy post titled "how to benchmark correctly". Which, as pedants in the field may argue, we are all doing wrong anyway by leaving turbo boost enabled for such a short benchmark.
Maybe another day.

May 29, 2022, 18:53 | #508
wkernkamp (Will Kernkamp), Senior Member, Joined Jun 2014, Posts: 372
Agreed, but what I meant was to leave the default as is, but add your special case commented out with a short description explaining the specific use.

May 30, 2022, 03:24 | #509
flotus1 (Alex), Super Moderator, Germany, Joined Jun 2012, Posts: 3,428
There are many ways to achieve the same result, most of them more elegant than what I did: https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php

What I ended up using:
Code:
mpirun -np  2 --bind-to core --rank-by core --map-by socket
mpirun -np  4 --bind-to core --rank-by core --cpu-list 0,16,32,48
mpirun -np  8 --bind-to core --rank-by core --cpu-list 0,8,16,24,32,40,48,56
mpirun -np 16 --bind-to core --rank-by core --map-by numa
# same from here on
The goal with all of them being to spread the threads out as evenly as possible across the shared CPU resources. I recommend htop for a quick visual confirmation, otherwise check the output of report-bindings.
Also lscpu and lstopo to find out about the NUMA topology and shared resources like L3 cache. Which cores share a memory controller (IMC) needs to be figured out the hard way as far as I know: reading docs and such...
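As a complement to htop and --report-bindings, pinning can also be checked programmatically. A minimal sketch (Linux-only, Python standard library); launched under "mpirun --bind-to core", each rank should report exactly one allowed CPU:

```python
import os

# The affinity mask lists the CPU ids this process is allowed to run on.
# Under "mpirun --bind-to core" each rank sees a single id; without
# binding, a process typically sees every CPU in the machine.
allowed = os.sched_getaffinity(0)
print(f"pid {os.getpid()}: {len(allowed)} allowed CPUs: {sorted(allowed)}")
```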

Last edited by flotus1; May 30, 2022 at 04:29.

June 1, 2022, 09:05 | #510
masb (Marco Bernardes), Member, Joined May 2009, Posts: 59
Quote:
Originally Posted by wkernkamp View Post
This is still the same supermicro opteron server with the H8QG6-F Motherboard, 32x8Gb DDR3-1600 single rank, for 4x Opteron 6376.

I have found since the last post that the On Demand governor yields better results, because the opterons turbo higher when some cores are idling. Furthermore, I made some changes to the default openmpi process placement for np=2,12,24 and 48. The default tended to place processes together on adjacent integer cores. These cores share a single FPU, but also cache, so for cache this is good, but for openfoam it is not. (The difference is ~45% for the 2 core case.)


The baseline result before Overclock is:
1 2161.03
2 1045.07
4 506.82
8 249.7
12 193.92
16 145.46
24 110.93
32 93.86
48 87.21
64 85.53

After overclock using a motherboard base clock of 240 MHz instead of 200 MHz, the results are:
1 2112.27
2 1026.49
4 492.64
8 241.08
12 183.19
16 134.26
24 100.11
32 84.72
48 82.74
64 79.54

This overclock was accomplished with the OCNG5.3 BIOS. It is easy to do. Follow the instructions here: https://hardforum.com/threads/ocng5-...forms.1836265/

The temperatures did not go high, so the board can still be clocked higher. The RAM can also be overclocked; I will try 1866 MHz. In the past, execution time was roughly inversely proportional to RAM speed.
Hi Wkernkamp, thanks for the info. Would a machine with 4x AMD 16C 6282 also have a good performance as your system?

June 2, 2022, 16:24 | #511
wkernkamp (Will Kernkamp), Senior Member, Joined Jun 2014, Posts: 372
Quote:
Originally Posted by masb View Post
Hi Wkernkamp, thanks for the info. Would a machine with 4x AMD 16C 6282 also have a good performance as your system?

Probably similar. The 6300-series processors are an improvement over the 6200 series. With the CPUs so cheap, I think you could try with yours and upgrade the CPUs if necessary.


Note that messing with the BIOS is risky. You might cause your machine to no longer boot! Performance without overclocking is pretty decent due to the 16 available memory channels.

June 3, 2022, 09:09 | #512 | AMD Ryzen 4800H under WSL Ubuntu 20.04
masb (Marco Bernardes), Member, Joined May 2009, Posts: 59
AMD Ryzen 4800H:

# cores Wall time (s):
------------------------
Meshing Times:
1 1003.94
2 707.64
4 500.12
6 396.02
8 364.08

Flow Calculation:
1 753.92
2 486.19
4 351.89
6 329.93
8 323.98

June 3, 2022, 09:11 | #513 | AMD Threadripper 1950X under WSL Ubuntu 20.04
masb (Marco Bernardes), Member, Joined May 2009, Posts: 59
AMD Threadripper 1950X under WSL Ubuntu 20.04

# cores Wall time (s):
------------------------
Meshing Times:
1 1056.81
2 701.65
4 496.73
6 393.98
8 381.59
10 360.49
12 339.13
14 323.9
16 343.45

Flow Calculation:
1 822.07
2 498.66
4 350.45
6 326.8
8 324.14
10 319.38
12 314.45
14 315.73
16 324.57

June 4, 2022, 08:07 | #514 | Benchmark run on laptop with i7-11800H and 2x8GB (3200 MHz) on WSL2 Ubuntu 20.04
Erdi, New Member, Joined Jun 2022, Posts: 2
OpenFOAM benchmark run on a laptop (Dell XPS 15) with an i7-11800H and 2x8GB (3200 MHz), on WSL2 on Ubuntu 20.04 with OpenFOAM v9.

Out of curiosity I wanted to try the benchmark on my laptop. First I tried the default configuration, but the single-core run took a long time, so I changed the run.sh file to run with 8 cores directly. That started to thermal throttle a lot (what a surprise), so I then tried 6 cores and got:

Code:
real    7m38.091s
user    45m16.834s
sys     0m11.617s
Run for 6...
# cores   Wall time (s):
------------------------
6 367.37

June 5, 2022, 00:48 | #515
wkernkamp (Will Kernkamp), Senior Member, Joined Jun 2014, Posts: 372
Quote:
Originally Posted by Erdi View Post
OpenFOAM benchmark run on Laptop (Dell XPS 15) With i7-11800H and 2x8GB (3200MHZ) on WSL2 on Ubuntu 20.04 with openFOAMv9

6 367.37

Your performance is equal to my Dell r710 with dual E5649. That makes sense, because that server has six memory channels running at 1066 MT/s which is comparable to two at 3200 MT/s.
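The channel comparison can be checked with back-of-the-envelope arithmetic: theoretical peak bandwidth is channels x transfer rate x 8 bytes per 64-bit transfer. A sketch (real sustained bandwidth is lower, and DDR3 vs. DDR4 latencies differ):

```python
# Theoretical peak memory bandwidth in GB/s:
# channels * MT/s * 8 bytes per 64-bit transfer.
def peak_bw_gbs(channels, mts):
    return channels * mts * 8 / 1000

r710 = peak_bw_gbs(6, 1066)    # dual E5649: 2 sockets x 3 DDR3-1066 channels
laptop = peak_bw_gbs(2, 3200)  # i7-11800H: 2 DDR4-3200 channels
print(f"R710:   {r710:.1f} GB/s")   # ~51.2 GB/s
print(f"laptop: {laptop:.1f} GB/s") # ~51.2 GB/s
```

The two configurations land within a fraction of a percent of each other in theoretical peak, which is why the wall times come out so close.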



Quote:
Originally Posted by wkernkamp View Post
Dell Poweredge R710
12x4Gb Rdimm 1067Mhz

2xE5649 2.53ghz 6 cores per cpu:
Flow Calculation:
1 1486.54
2 880.04
4 422.03
6 342.61
8 317.83
10 333.38
12 307.18

2xX5675 3.07ghz 6 cores per cpu
Flow Calculation:
1 1322.84
2 787.4
4 375.77
6 305.44
8 286.3
12 278.02


Your CPU must be thermal throttling; otherwise you would get 305.44 s, like the 2x X5675, or better.

June 5, 2022, 00:50 | #516 | WSL2
wkernkamp (Will Kernkamp), Senior Member, Joined Jun 2014, Posts: 372
I don't know how the benchmark performs on WSL2; I have only run it on Linux. So that might be another issue, Erdi.

June 6, 2022, 05:31 | #517 | Two is better than one
masb (Marco Bernardes), Member, Joined May 2009, Posts: 59
Hi!

I was wondering whether running 2 benchmarks simultaneously would be better than 1 run after another. The results of the 2 runs were, surprisingly:


# cores Wall time (s):
------------------------
1 2 4 6 8 10 12 14 16
Meshing Times:
1 1151.36
2 857.94
4 623.2
6 563.06
8 537.3
10 526.86
12 518.92
14 523.49
16 569
Flow Calculation:
1 1034.82
2 763.45
4 550.1
6 523.57
8 542.37
10 600.15
12 625.04
14 668.28
16 710.25


# cores Wall time (s):
------------------------
1 2 4 6 8 10 12 14 16
Meshing Times:
1 1126.39
2 861.46
4 622.28
6 558.93
8 539.72
10 527.49
12 518.96
14 521.65
16 564.03
Flow Calculation:
1 1032.88
2 762.35
4 548.58
6 526.27
8 559.72
10 606.09
12 633.89
14 682.39
16 683.36

2 x 1 runs sequentially took, in the best case (12 cores), approximately 630 seconds in total (see previous posts).

1 x 2 runs simultaneously took, in the best case (6 cores each), approximately 526 seconds.

Conclusion: the 2 simultaneous runs were about 20% faster.
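The 20% figure checks out; it is a throughput-vs-latency trade-off (each run scales poorly past ~6 cores, so two half-width runs waste fewer core-seconds). The arithmetic, using the times quoted:

```python
# Two runs back-to-back on 12 cores vs. two runs side-by-side on 6 cores each
# (flow-calculation times from the posts above).
sequential = 2 * 314.45   # best 12-core time, run twice in a row
simultaneous = 526.27     # wall time of two concurrent 6-core runs
gain = sequential / simultaneous - 1
print(f"sequential:   {sequential:.0f} s")
print(f"simultaneous: {simultaneous:.0f} s")
print(f"throughput gain: {gain:.0%}")  # ~20%
```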

Any comments?


June 6, 2022, 05:53 | #518
Simbelmynė, Senior Member, Joined May 2012, Posts: 552
@masb, I do not understand your post. You have two recent posts, one of which is a 1950X with 16 cores that finishes the benchmark in about 314 seconds. I do not see how that is slower than your latest test.


It should also be noted that 314 seconds is about twice as long as my 1950X takes to finish the benchmark (specs available on the first page of this thread). WSL is not ideal, but if you do not access the file system through frequent saves it should be fast enough. My guess is slow memory and/or timings.

June 6, 2022, 05:56 | #519
Simbelmynė, Senior Member, Joined May 2012, Posts: 552
Quote:
Originally Posted by wkernkamp View Post
Your performance is equal to my Dell r710 with dual E5649. That makes sense, because that server has six memory channels running at 1066 MT/s which is comparable to two at 3200 MT/s.

Your cpu must be thermal throttling otherwise you would get 305.44 sec like the 2x X5675 or better.

You cannot make comparisons like that. There is a huge difference between some systems with identical theoretical bandwidth.

June 6, 2022, 07:42 | #520 | Sorry for the confusing posts
masb (Marco Bernardes), Member, Joined May 2009, Posts: 59
Firstly, I posted the benchmarks for the 1950X and Ryzen 4800H just as information. In the latest post I ran two benchmarks simultaneously under WSL on the 1950X. As I have to run lots of cases, I was trying to compare the performance of both approaches, sequential and simultaneous. The run of two cases simultaneously, using 6 cores each, was faster than running the same two cases sequentially on 12 cores:

sequentially:

run1: 314.45 seconds
run2: 314.45 seconds

total: run1 + run2 = 629 seconds


simultaneously:

run1 || run2: 526.27 seconds

Is it clear now?


Last edited by masb; June 6, 2022 at 11:40.
