OpenFOAM benchmarks on various hardware

August 25, 2023, 12:12   #721
cegan09 (Chris), New Member
I finally got around to running the benchmark on my system.

Dual EPYC 7532, 8x Samsung M393A2K43DB2-CWE 16 GB DIMMs per CPU at 3200 MHz
Supermicro H11DSi
OpenFOAM v2112 on Ubuntu 22.04.2 LTS

Code:
# Cores   Wall Time (s)
-----------------------------
64        202.76
56         18.17
48         19.46
40         21.11
32         23.11
28         24.26
24         27.72
20         32.71
16         40.48
12         52.69
 8         77.48
 4        165.22
 1        729.19
I've done no tuning at all on this machine other than turning SMT off.
I don't fully understand why the 64-core run takes so much longer, though I suspect it's just me being new and not understanding the options and system setup. I have seen the same thing in my real runs: using 64 cores causes a huge jump in solve time, but backing off even a couple of cores brings it back in line with where I expect things.

August 25, 2023, 23:06   #722
wkernkamp (Will Kernkamp), Senior Member
Nice result! Impressive on a first try too. Did you have anything else running during the 64-core run? That run leaves no cores or threads to spare, so background load would be my guess. I usually leave hyper-threading on so that other processes have a chance to get a thread.

August 26, 2023, 00:06   #723
CFDfan, Senior Member
Quote:
Originally Posted by cegan09
[system specs and benchmark table quoted from post #721]
Very good scaling results, except for the 64-core run. I would try running it with, say, 62 cores to see whether there is an improvement over the 56-core result.

August 26, 2023, 05:35   #724
flotus1 (Alex), Super Moderator
Quote:
Originally Posted by cegan09
I don't fully understand why the 64-core run takes so much longer, though I suspect it's just me being new and not understanding the options and system setup. I have seen the same thing in my real runs: using 64 cores causes a huge jump in solve time, but backing off even a couple of cores brings it back in line with where I expect things.
I can think of a few things that might contribute to the outlier at 64 threads:
heavy background processes, thermal throttling, sub-optimal core binding, missing memory channels, an excessive number of memory errors...

Here is what you could do to get to the bottom of it:
1) Install this: https://www.supermicro.com/de/soluti...re/superdoctor
It is a handy tool for monitoring a lot of things, like various temperature sensors you cannot read otherwise, or memory errors.
2) Check for background processes before running the benchmark, for example with top or htop.
3) See if all memory is recognized. SD5 can give you an idea; I like to check the output of dmidecode -t 17.
4) While the benchmark is running, check the CPU core frequencies with turbostat. If anything causes throttling, you will likely see the core frequencies drop.
5) Just before you run the benchmark, clear the caches: echo 3 | sudo tee /proc/sys/vm/drop_caches
You can also check the output of numactl -H to see how much free memory each NUMA node has.
6) Optionally, use NPS4 mode instead of NPS1. It's a BIOS setting. This won't get rid of the outlier, but it is the recommended setting for our workloads.
7) Take control of core binding, e.g. do the 64-thread solver run again with
mpirun -np 64 --bind-to core --rank-by core --map-by numa simpleFoam -parallel > log.simpleFoam 2>&1
A pre-flight sketch that chains several of these checks into one script is shown after this list.
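Here is such a minimal pre-flight sketch, combining checks 3), 4), 5) and 7) around a single solver run. It assumes Open MPI, an already sourced OpenFOAM environment, a decomposed case in the current directory, and that dmidecode, numactl and turbostat are installed; exact turbostat flags vary by version, so treat this as a starting point rather than a drop-in solution.

Code:
#!/bin/bash
# 3) confirm all DIMMs are detected (inspect the "Size:" entries)
sudo dmidecode -t 17 | grep "Size:"

# 5) drop caches and show free memory per NUMA node
echo 3 | sudo tee /proc/sys/vm/drop_caches
numactl -H

# 4) log core frequencies in the background during the solver run
sudo turbostat --interval 10 > turbostat.log 2>&1 &
TURBO_PID=$!

# 7) 64-rank solver run with explicit core binding
mpirun -np 64 --bind-to core --rank-by core --map-by numa \
    simpleFoam -parallel > log.simpleFoam 2>&1

# stop the frequency logger
sudo kill $TURBO_PID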

August 26, 2023, 12:06   #725
cegan09 (Chris), New Member
Yes, in retrospect, running with all cores becomes a problem when other stuff is running. I was just focused on the result numbers.

My first thought was that since only one CPU can talk to the M.2 drive, there were some issues with writing all the results to disk. But the obvious answer is that I had a remote desktop application running, which of course needs some processing power of its own.

I ran the benchmark with 63, 62, 61, and 60 cores, with the same program running, just for consistency.



Code:
# Cores   Wall Time (s)
-----------------------------
64        202.76
63         23.02
62         20.04
61         18.08
60         18.07
56         18.17
48         19.46
40         21.11
32         23.11
28         24.26
24         27.72
20         32.71
16         40.48
12         52.69
 8         77.48
 4        165.22
 1        729.19
So it looks like the ideal in my case is around 60 cores. I am sure I can improve things with more tuning, proper prep to make sure resources are all free, and core binding. But it's not important to me to squeeze those tiny bits of improvement from the runs yet. Once I have my workflow dialed in and finalized, I'll see if I can tune the actual simulations for those extra improvements.


Edit: just to add, I know I'm not thermal throttling; CPU1 maxes out at about 62°C and CPU2 at 52°C (the coolers feed into each other). All the RAM is recognized and running at the correct speed.

August 29, 2023, 13:29   #726
CFDfan, Senior Member
Quote:
Originally Posted by cegan09
[updated benchmark table and follow-up notes quoted from post #725]
The good thing is that you could run two simulations simultaneously (if you have enough RAM and licenses) with 30 cores each, since the time difference between 30 cores and 60 cores is less than 25%.

August 30, 2023, 10:58   #727
FliegenderZirkus, Member
If you run two simulations on one computer, they will have to share the available memory bandwidth, won't they? In this case the benchmark job takes 23 seconds on 32 cores with the remaining 32 cores sitting idle. When you start two such 32-core jobs alongside each other, they will each take much longer than 23 seconds to complete because they compete for memory bandwidth. I guess one could use the remaining cores for some other activity that is not memory-bound?

August 30, 2023, 13:56   #728
cegan09 (Chris), New Member
Probably, yes. It depends on what your greater need is: more results or faster results.

For fun I set up two benchmarks to use 30 cores each and ran them at the same time, doing nothing to control which cores got assigned to each. Times were 36.21 and 35.75 seconds. So yes, slower.
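If you do want two concurrent jobs to stay out of each other's way, one option is to pin each one to its own socket. A minimal sketch, assuming Open MPI, two decomposed case directories with placeholder names caseA and caseB, and that cores 0-31 sit on socket 0 and cores 32-63 on socket 1 with SMT off (check numactl -H or lscpu for the actual topology):

Code:
#!/bin/bash
# Pin each 30-core job to its own socket so the two runs do not share cores or NUMA nodes.
( cd caseA && mpirun -np 30 --cpu-set 0-29  --bind-to core simpleFoam -parallel > log.simpleFoam 2>&1 ) &
( cd caseB && mpirun -np 30 --cpu-set 32-61 --bind-to core simpleFoam -parallel > log.simpleFoam 2>&1 ) &
wait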

If I look at the "real world", meaning the analysis I run most often, I have the following solve times:
Single run on 30 cores when both CPUs are installed: ~19 hours
Single run on 60 cores with both CPUs: ~17 hours
Two runs in parallel with 30 cores each: ~35 hours

This is a really rough calculation: the average compute time for each time step, multiplied by the number of time steps, multiplied by the number of nose angles each run has. It basically comes out to a wash whether you run one at a time or two side by side; it will take about the same time to get there. So if I have a design with small tweaks between two versions, I'd probably choose to run both together, so I can come back in a couple of days and see which I like better. If I just have one, run it with 60.
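Just to spell out the arithmetic of that rough estimate, a throwaway sketch with made-up placeholder numbers (not the values from the runs above):

Code:
#!/bin/bash
# Placeholder numbers, purely illustrative.
avg_step=4.0      # average wall time per time step, seconds
n_steps=12000     # time steps per nose angle
n_angles=4        # nose angles per run
echo "estimated run time: $(echo "$avg_step * $n_steps * $n_angles / 3600" | bc -l) hours"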

Again, I've done nothing to try to optimize which cores are used or tune anything; I just brute-force throw two simulations at the machine. I'm sure I can dial things in as needed, but for the work I do it's not a big deal to be imperfectly optimized. These solve times are already so much faster than the old Xeon server I had before that I'm happy with the un-optimized setup.

August 30, 2023, 18:34   #729
naffrancois, Senior Member
"So if I have a design with small tweaks between two versions I'd probably pick to run both together."

You can also use a job scheduler such as slurm, or a simple script to schedule your runs.
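For the simple-script route, something along these lines is enough as a sketch; designA and designB are placeholder case directories and the core count is just an example:

Code:
#!/bin/bash
# Queue two decomposed OpenFOAM cases back to back, e.g. overnight.
for d in designA designB; do
    ( cd "$d" && mpirun -np 60 --bind-to core simpleFoam -parallel > log.simpleFoam 2>&1 )
done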

August 31, 2023, 23:52   #730
CFDfan, Senior Member
Quote:
Originally Posted by cegan09
So if I have a design with small tweaks between two versions, I'd probably choose to run both together, so I can come back in a couple of days and see which I like better. If I just have one, run it with 60.

That is what I've been doing on my 5975WX with 32 cores. Since the time difference between 30 and 15 cores is about 25%, and real life calls for testing various design scenarios, I run two of them simultaneously on 15 cores each. I do have plenty of RAM for that, though.

September 1, 2023, 03:25   #731
FliegenderZirkus, Member
This is interesting; could you maybe post some numbers on how long each scenario takes? I tried the same thing on a dual EPYC 7763 (128 cores in total) and found that I can't "cheat" the available memory bandwidth. In particular, the following two scenarios finish in pretty much exactly the same wall-clock time:
1) Two instances of the same simulation executed alongside each other on 64 cores each.
2) The same simulation executed twice on 128 cores sequentially (the second sim starts when the first finishes).

The job had about 60 million cells (so large enough to saturate the memory bandwidth) in STAR-CCM+. I guess this will depend on the simulated physics; in my case it was just simple air flow using the segregated solver.

September 1, 2023, 03:36   #732
flotus1 (Alex), Super Moderator
Running several instances simultaneously at lower thread counts is not faster with the benchmark in this thread.
This only works when something other than the memory subsystem (including last-level caches) is limiting parallel efficiency. For example:
  • Very low cell count, so parallelization overhead becomes dominant
  • Poor parallelization of some parts of the code, i.e. Amdahl's law (a worked example follows below)
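For reference, Amdahl's law puts an upper bound on the speedup over N cores when only a fraction p of the runtime parallelizes; the 95% figure below is purely illustrative, not a measured value for this benchmark:

$$S(N) = \frac{1}{(1-p) + p/N}, \qquad p = 0.95,\; N = 64 \;\Rightarrow\; S \approx \frac{1}{0.05 + 0.0148} \approx 15.4$$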

September 2, 2023, 23:50   #733
CFDfan, Senior Member
Quote:
Originally Posted by flotus1
Running several instances simultaneously at lower thread counts is not faster with the benchmark in this thread. [...]
Yes, you turned out to be right. I recorded the time of running two simulations in parallel at a lower core count, and there was no time benefit compared with running them in series at a higher core count. The only advantage (for me) was that I usually run those overnight and get both results in the morning.

September 20, 2023, 01:36   #734
mespinil, New Member
OK, so I finally managed to run this test.

CPU: Ryzen 7 7800X3D (8 cores, 5.2 GHz boost frequency, 2 memory channels)
RAM: 96 GB (2x 48 GB) DDR5-5600 CL40-40-40-89 1.25 V, AMD EXPO
At the moment I can only get 5400 MT/s with the latest BIOS and the EXPO profile activated.
Ubuntu (native Linux)
System cost: 1900 € (Spain)

My results:

# Cores   Meshing (min:s)   Flow calculation (s)
1         5:47.31           399.07
2         3:58.93           207.03
4         2:31.55           131.83
6         1:56.04           112.03
8         1:42.43           105.4

Competitor CPUs we have benchmark results for (for comparison):

Malinator

HW: AMD Ryzen 7700X (8-core Zen 4), MSI MAG B650, 2x 16 GB DDR5 (XMP 6200 MHz C40, Hynix M-die based)
HW tuning: SMT off, PBO on, curve optimizer reducing core voltage by 30 mV, memory timings and subtimings carefully optimized at 6200 MHz 30-37... etc., FCLK 2133 MHz
Linux native

Cores   Flow calculation (s)   Meshing (s)
1       331.5                  567.0
2       192.9                  399.4
4       126.2                  241.0
6       110.3                  209.4
8       105.9                  162.9

Simbelmynė (1)
5800X3D, 2x 8 GB DDR4 single-rank @ 3200 MT/s (14-14-14-14-28, 1T)
OFv9, OpenSUSE Tumbleweed, GCC 11.2, kernel 5.17.4

2x 8 GB DDR4 single-rank @ 3800 MT/s (16-16-16-16-32, 1T)

Code:
# Cores   Simulation (s)   Meshing (min:s)
1         304              12:14
2         188               8:12
4         135               4:58
6         124               3:55
8         122               3:28

Simbelmynė (2)
Intel 13900K (HT off), 32 GB DDR5 @ 7200 MT/s (34-44-44-96), Ubuntu 22.04, OpenFOAM v10

# Cores   Wall time (s)   Meshing (min:s)
1         301.118         7:45.887
2         164.46          5:32.672
4         101.268         3:24.995
8          70.3852        2:16.678

Conclusion: I feel pretty relieved that a first-ever build with no OC gets a good, or at the very least logical, result. I wanted 128 GB, but I could not get that with the currently available kits, so I bought a 96 GB kit and had to sacrifice some speed in the process. I would say the build is faring well against the closest build with a 7700X. Malinator's RAM is about 15% faster, so the X3D is probably adding roughly that 15% in extra performance (for this benchmark). Still, I would say the 5800X3D is pretty much the best bang for the buck in this segment, at least until fast DDR5 memory gets a lot cheaper.

Let me know what you think, or if you think I should re-run the test with different BIOS settings or something like that; I am happy to try.

Thanks to all of you for the build sharing and discussion. I don't think there is a better resource on the whole internet for making such an expensive and complex purchase.

September 20, 2023, 05:47   #735
wkernkamp (Will Kernkamp), Senior Member
Congratulations on your result. Also, very nice presentation with the other comparable results. What version of OpenFOAM are you running?

October 29, 2023, 14:39   #736
AlexKaz (Alexander Kazantcev), New Member
Dual Xeon Platinum 8352Y ES / 16x DDR4-3200 single-rank DIMMs / OpenFOAM v1812, precompiled for Xeon v4 only / no BIOS tuning

# Cores   Mesh time (s)   Wall time (s)
----------------------------------------
 1        921.61          705.96
 2        636.71          369.34
 4        345.68          177.93
 6        262.85          117.4
 8        217.23           92.17
12        195.86           66.83
16        159.11           54.4
20        131.71           46.87
24        127.67           42.7
26        138.43           40.81
28        121.93           38.91
30        124.79           37.92
32        124.38           37.57
34        124.64           36.2
36        130.59           35.83

October 29, 2023, 20:43   #737
wkernkamp (Will Kernkamp), Senior Member
Quote:
Originally Posted by AlexKaz
Dual Xeon Platinum 8352Y ES / 16x DDR4-3200 single-rank DIMMs / OpenFOAM v1812, precompiled for Xeon v4 only / no BIOS tuning

# Cores   Mesh time (s)   Wall time (s)
 1        921.61          705.96
30        124.79           37.92
32        124.38           37.57
34        124.64           36.2
36        130.59           35.83

There is a BIOS configuration with just 16 cores active per processor. It would be interesting to see if you get better performance with it.

October 30, 2023, 05:56   #738
AlexKaz (Alexander Kazantcev), New Member
Do you mean the SST-PP 2.0 function for the Y-series chips?

Platinum 8352Y
Intel® Speed Select Technology - Performance Profile (Intel® SST-PP):

Config   Active Cores   Base Frequency   TDP
1        24             2.3 GHz          185 W
2        16             2.6 GHz          185 W

High-priority cores: 12 at 2.40 GHz
Low-priority cores: 20 at 2.00 GHz

I'll check that configuration, but I think it is mainly marketing. Both v3 and v4 family chips can raise their frequencies to the maximum value, because the TDP package limit stays the same after some cores are disabled. As I saw earlier, both ES chips run at about 3.4 GHz with up to roughly 8-12 threads loaded.


October 30, 2023, 12:13   #739
AlexKaz (Alexander Kazantcev), New Member
With only 2x16 cores active:

# Threads   Mesh (s)   Wall (s)
 1          900.475    721      (single-core frequency with one core loaded: 3650-3680 MHz)
28          123.02      40.03
30          123.98      38.74
32          121.75      38.85
34          163.99      54.25
36          182.85      50.95

November 5, 2023, 20:07   #740
ztdep (p ding), Senior Member
Why are the Xeon Platinum series CPUs missing from the data?
