August 25, 2023, 12:12
#721
New Member
Chris
Join Date: Nov 2022
Posts: 18
Rep Power: 3
I finally got around to running the benchmark on my system.
Dual EPYC 7532, eight Samsung M393A2K43DB2-CWE 16 GB DIMMs at 3200 MHz per CPU, Supermicro H11DSi. OpenFOAM v2112 on Ubuntu 22.04.2 LTS. Code:
# Cores   Wall Time (s)
-----------------------
64        202.76
56         18.17
48         19.46
40         21.11
32         23.11
28         24.26
24         27.72
20         32.71
16         40.48
12         52.69
8          77.48
4         165.22
1         729.19
I don't fully understand why the 64-core run takes so much longer, though I suspect it's just me still being new and not understanding the options and system setup. I have seen the same thing in my real runs: using 64 cores causes a huge jump in solve time, but backing off even a couple of cores brings it back in line with what I expect.
August 25, 2023, 23:06
#722
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14
Nice result! Impressive for a first try, too. Did you have anything else running during the 64-core run? That run has no cores or threads to spare, so that would be my guess. I usually leave hyperthreading on so that other processes have a chance to get a thread.
August 26, 2023, 00:06
#723
Senior Member
Join Date: Jun 2011
Posts: 208
Rep Power: 16
August 26, 2023, 05:35
#724
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
Some heavy background processes, thermal throttling, sub-optimal core binding, missing memory channels, an excessive amount of memory errors... Here is what you could do to get to the bottom of it:
1) Install this: https://www.supermicro.com/de/soluti...re/superdoctor
It is a handy tool for monitoring a lot of things, like various temperature sensors you can not get otherwise, or memory errors.
2) Check for background processes before running the benchmark, for example with top or htop.
3) See if all memory is recognized. SD5 can give you an idea; I like to check the output of dmidecode -t 17.
4) When running the benchmark, you can check the CPU core frequencies with turbostat. If anything causes throttling, you will likely see the core frequencies drop.
5) Just before you run the benchmark, clear caches: echo 3 | sudo tee /proc/sys/vm/drop_caches
You can also check the output of numactl -H to see how much free memory each NUMA node has.
6) Optionally, use NPS4 mode instead of NPS1. It's a BIOS setting. This won't get rid of the outlier, but it is the recommended setting for our workloads.
7) Take control of core binding, e.g. do the 64-thread solver run again with:
mpirun -np 64 --bind-to core --rank-by core --map-by numa simpleFoam -parallel > log.simpleFoam 2>&1
A script condensing these checks follows below.
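To keep the steps together, here is a minimal pre-run sketch combining points 3), 4), 5) and 7). It is only a sketch: it assumes OpenMPI's mpirun, a case already decomposed for 64 ranks, and grep patterns that may differ between dmidecode versions. Code:
#!/bin/bash
# Sketch of a pre-run checklist based on the list above.

# 3) Installed DIMMs and their configured speed
sudo dmidecode -t 17 | grep -E "Size:|Configured"
# ...and free memory per NUMA node
numactl -H | grep free

# 5) Clear the page cache right before the run
echo 3 | sudo tee /proc/sys/vm/drop_caches

# 4) Log core frequencies in the background during the run
sudo turbostat --quiet --interval 5 > turbostat.log 2>&1 &
TURBO_PID=$!

# 7) Solver run with explicit core binding
mpirun -np 64 --bind-to core --rank-by core --map-by numa \
    simpleFoam -parallel > log.simpleFoam 2>&1

sudo kill $TURBO_PID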
August 26, 2023, 12:06
#725
New Member
Chris
Join Date: Nov 2022
Posts: 18
Rep Power: 3
Yes, in retrospect running with all cores becomes a problem when other stuff is running. I was just focused on the result numbers.
My first thought was that since only one CPU can talk to the M.2 drive, there were some issues with writing all the results to disk. But the obvious answer is that I had a remote desktop application running. Which, duh, that's going to need some processing to run. I re-ran the benchmark with 63, 62, 61, and 60 cores and the same program running, just for consistency (a scripted version of such a sweep is sketched after the table). Code:
# Cores   Wall Time (s)
-----------------------
64        202.76
63         23.02
62         20.04
61         18.08
60         18.07
56         18.17
48         19.46
40         21.11
32         23.11
28         24.26
24         27.72
20         32.71
16         40.48
12         52.69
8          77.48
4         165.22
1         729.19
Edit: just to add, I know I'm not thermal throttling. CPU1 maxes out at about 62°C and CPU2 at 52°C (the coolers feed into each other). All the RAM is recognized and running at the correct speed.
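For anyone repeating a sweep like this, a minimal sketch, assuming the benchmark case is already meshed, the OpenFOAM environment is sourced, and the solver log ends with the usual cumulative ClockTime lines (the grep is illustrative): Code:
#!/bin/bash
# Hypothetical sweep over core counts for the benchmark case.
for n in 64 63 62 61 60; do
    # Re-decompose the case into n subdomains
    foamDictionary -entry numberOfSubdomains -set $n system/decomposeParDict
    decomposePar -force > log.decomposePar.$n 2>&1
    mpirun -np $n simpleFoam -parallel > log.simpleFoam.$n 2>&1
    # The last ClockTime line in the log is the total wall time
    echo "$n cores: $(grep ClockTime log.simpleFoam.$n | tail -n 1)"
done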
August 29, 2023, 13:29
#726
Senior Member
Join Date: Jun 2011
Posts: 208
Rep Power: 16
August 30, 2023, 10:58
#727
Member
Join Date: Nov 2019
Posts: 96
Rep Power: 6
If you run two simulations on one computer, then they will have to share the available memory bandwidth, won't they? In this case the benchmark job takes 23 seconds on 32 cores with the remaining 32 cores sitting idle. When you start two such 32-core jobs alongside each other, they will each take much longer than 23 seconds to complete because they compete for memory bandwidth. I guess one can utilize the remaining cores for some other activity that is not memory bound?
August 30, 2023, 13:56
#728
New Member
Chris
Join Date: Nov 2022
Posts: 18
Rep Power: 3
Probably, yes. It depends on what your greater need is: more results or faster results.
For fun I set up two benchmarks to use 30 cores each and ran them at the same time, doing nothing to control which cores got assigned to each. Times were 36.21 and 35.75 seconds. So yes, slower.
If I look at the "real world", meaning the analysis I run most often, I have the following solve times:
Single run on 30 cores when both CPUs are installed: ~19 hours
Single run on 60 cores with both CPUs: ~17 hours
Two runs in parallel with 30 cores each: ~35 hours
This is a really rough calculation: average compute time per time step, multiplied by the number of time steps, multiplied by the number of nose angles each run has. It basically comes out to a wash whether you run one at a time or two side by side; it will take about the same time to get there. So if I have a design with small tweaks between two versions, I'd probably run both together so I can come back in a couple of days and see which I like better. If I just have one, run it with 60.
Again, I've done nothing to optimize which cores are used or to tune anything, just brute-force thrown two simulations at the machine (a socket-pinning sketch follows below). I'm sure I could dial things in as needed, but for the work I do it's not a big deal to be less than perfectly optimized. These solve times are already so much faster than the old Xeon server I had before that I'm happy even with the un-optimized setup.
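For controlling which cores two side-by-side jobs get, one option is to pin each job to its own socket. A minimal sketch, assuming OpenMPI, hypothetical case directories caseA and caseB, and a numbering where cores 0-31 sit on socket 0 and 32-63 on socket 1 (check yours with lscpu or numactl -H): Code:
#!/bin/bash
# Two concurrent 30-rank jobs, one pinned to each socket.
( cd caseA && mpirun -np 30 --cpu-set 0-29 --bind-to core \
      simpleFoam -parallel > log.simpleFoam 2>&1 ) &
( cd caseB && mpirun -np 30 --cpu-set 32-61 --bind-to core \
      simpleFoam -parallel > log.simpleFoam 2>&1 ) &
wait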
August 30, 2023, 18:34
#729
Senior Member
Join Date: Oct 2011
Posts: 242
Rep Power: 17
"So if I have a design with small tweaks between two versions I'd probably pick to run both together."
You can also use a job scheduler such as slurm, or a simple script to schedule your runs. |
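For the simple-script option, something like the following would run the cases back to back; the case directory names and core count are placeholders: Code:
#!/bin/bash
# Minimal sequential "scheduler": each queued case gets all 60 cores in turn.
for case in designA designB; do
    ( cd "$case" && mpirun -np 60 --bind-to core \
          simpleFoam -parallel > log.simpleFoam 2>&1 )
done
With Slurm you would instead submit each case with sbatch and let the scheduler serialize them according to the requested resources.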
August 31, 2023, 23:52
#730
Senior Member
Join Date: Jun 2011
Posts: 208
Rep Power: 16
Quote:
So if I have a design with small tweaks between two versions, I'd probably run both together so I can come back in a couple of days and see which I like better. If I just have one, run it with 60.
That is what I've been doing on my 5975WX with 32 cores. Since the time difference between 30 and 15 cores is only about 25%, and real life calls for testing various design scenarios, I run two of them simultaneously on 15 cores each. I have plenty of RAM for that, though.
September 1, 2023, 03:25
#731
Member
Join Date: Nov 2019
Posts: 96
Rep Power: 6
This is interesting, could you maybe post some numbers on how long each scenario takes? I tried the same thing on a dual EPYC 7763 (128 cores in total) and found that I can't "cheat" the available memory bandwidth. In particular, the following two scenarios finish in pretty much exactly the same wall clock time:
1) Two instances of the same simulation executed alongside each other on 64 cores each.
2) The same simulation executed twice on 128 cores sequentially (the second sim starts when the first finishes).
The job had about 60 million cells (so large enough to saturate the memory bandwidth) in STAR-CCM+. I guess this will depend on the simulated physics; in my case it was just simple air flow using the segregated solver.
September 1, 2023, 03:36
#732
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
Running several instances simultaneously on lower thread counts is not faster with the benchmark in this thread.
This only works when something other than the memory subsystem (including last-level caches) is limiting parallel efficiency.
Last edited by flotus1; September 1, 2023 at 10:29.
September 2, 2023, 23:50
#733
Senior Member
Join Date: Jun 2011
Posts: 208
Rep Power: 16
September 20, 2023, 01:36
#734
New Member
Join Date: Aug 2023
Posts: 2
Rep Power: 0
OK, so I finally got to run this test.
CPU: 7800X3D (8 cores, 5.2 GHz boost frequency, 2 memory channels)
RAM: 96 GB (2x48 GB) DDR5-5600 CL40-40-40-89 1.25V AMD EXPO. At the moment I could only get 5400 MHz with the latest BIOS and the EXPO profile activated.
OS: Ubuntu (Linux native)
System cost: 1900 (Spain)

My results:
Code:
# Cores   Meshing time   Flow calculation (s)
1         5:47.31        399.07
2         3:58.93        207.03
4         2:31.55        131.83
6         1:56.04        112.03
8         1:42.43        105.40

Competitor CPUs we have benchmarks for (for comparison):

Malinator
HW: AMD Ryzen 7700X (8-core Zen 4), MSI MAG B650, 2x16 GB DDR5 (XMP 6200 MHz C40, Hynix M-die based)
HW tuning: SMT off, PBO on, Curve Optimizer to reduce core voltage by 30 mV, memory timings and subtimings carefully optimized at 6200 MHz (30-37 etc.), FCLK 2133 MHz, Linux native
Code:
# Cores   Wall (flow calculation) time (s)   Meshing time (s)
1         331.5                              567.0
2         192.9                              399.4
4         126.2                              241.0
6         110.3                              209.4
8         105.9                              162.9

Simbelmynė (1)
5800X3D, 2x8 GB DDR4 Rank 1 @ 3200 MT/s (14-14-14-14-28, 1T)
OFv9, OpenSUSE Tumbleweed, GCC 11.2, kernel 5.17.4
2x8 GB DDR4 Rank 1 @ 3800 MT/s (16-16-16-16-32, 1T)
Code:
# Cores   Simulation (s)   Meshing (min.sec)
1         304              12m14
2         188               8m12
4         135               4m58
6         124               3m55
8         122               3m28

Simbelmynė (2)
Intel 13900K (HT off), 32 GB DDR5 @ 7200 MT/s (34-44-44-96), Ubuntu 22.04, OpenFOAM v10
Meshing (1, 2, 4, 8 cores): 7m45.887s, 5m32.672s, 3m24.995s, 2m16.678s
Code:
# Cores   Wall time (s)
1         301.118
2         164.46
4         101.268
8          70.3852

Conclusion: I feel pretty relieved that a first-ever build with no OC gets a good, or at the very least logical, result. I wanted 128 GB, but I could not get that with currently available kits, so I got a 96 GB kit and sacrificed some speed in the process. I would say the build is faring well against the closest comparison, the 7700X build. Malinator's RAM is about 15% faster, so the X3D is probably adding about 15% in extra performance (for this benchmark). Still, I would say the 5800X3D is pretty much the best bang for the buck in this segment, at least until fast DDR5 memory prices fall a lot.
Let me know what you think, or if you think I should re-run the test with different BIOS settings or anything like that, I am happy to try. Thanks to all of you for all the build sharing and discussion; I don't think there is a better resource on the whole internet for making such an expensive and complex purchase.
September 20, 2023, 05:47
#735
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14
Congratulations on your result. Also, very nice presentation with the other comparable results. What version of OpenFOAM are you running?
October 29, 2023, 14:39
#736
New Member
Alexander Kazantcev
Join Date: Sep 2019
Posts: 24
Rep Power: 7
Dual Xeon 8352Y ES / 16x 3200 MHz single-rank DIMMs / OpenFOAM v1812 precompiled for Xeon v4 only / no BIOS tuning
# Cores   Mesh time (s)   Wall time (s)
1         921.61          705.96
2         636.71          369.34
4         345.68          177.93
6         262.85          117.40
8         217.23           92.17
12        195.86           66.83
16        159.11           54.40
20        131.71           46.87
24        127.67           42.70
26        138.43           40.81
28        121.93           38.91
30        124.79           37.92
32        124.38           37.57
34        124.64           36.20
36        130.59           35.83
October 29, 2023, 20:43
#737
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14
There is a BIOS configuration with just 16 cores active per processor. It would be interesting to see if you get better performance with it.
October 30, 2023, 05:56
#738
New Member
Alexander Kazantcev
Join Date: Sep 2019
Posts: 24
Rep Power: 7
Are you referring to the SST-PP 2.0 function of the Y-chips?
Platinum 8352Y Intel® Speed Select Technology - Performance Profile (Intel® SST-PP):
Config   Active Cores   Base Frequency   TDP
1        24             2.3 GHz          185W
2        16             2.6 GHz          185W
(High priority cores: 12 at 2.40 GHz; low priority cores: 20 at 2.00 GHz)
I'll check such a configuration, but I think it is mainly marketing. Chips from both the v3 and v4 families can raise frequencies to the maximum value, because the TDP package limit stays the same after disabling some cores. As I saw earlier, both ES chips run at 3.4 GHz with up to about 8-12 threads loaded. The frequencies can be verified during a run; see the sketch below.
Last edited by AlexKaz; October 30, 2023 at 07:10.
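As a quick check of those frequencies while a run is under way, turbostat (already suggested earlier in the thread) can print per-core numbers; the intel-speed-select utility from the Linux kernel tools may also report the active SST-PP config level, though its availability and exact subcommands depend on your kernel tools version: Code:
# Print average and busy core frequencies every 5 seconds
sudo turbostat --quiet --show Core,CPU,Avg_MHz,Bzy_MHz --interval 5

# If intel-speed-select is installed (assumption), show the perf-profile info
sudo intel-speed-select perf-profile info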
October 30, 2023, 12:13
#739
New Member
Alexander Kazantcev
Join Date: Sep 2019
Posts: 24
Rep Power: 7
With only 2x16 cores active:
# Threads   Mesh (s)   Wall (s)
1           900.475    721      (single-thread frequency with one loaded CPU core is 3650-3680 MHz)
28          123.02      40.03
30          123.98      38.74
32          121.75      38.85
34          163.99      54.25
36          182.85      50.95
Last edited by AlexKaz; November 1, 2023 at 13:27.
November 5, 2023, 20:07
#740
Senior Member
Why are the Xeon Platinum series CPUs missing from the data?