August 25, 2023, 12:12
#721
New Member
Chris
Join Date: Nov 2022
Posts: 18
Rep Power: 3
I finally got around to running the benchmark on my system.
Dual EPYC 7532, eight Samsung M393A2K43DB2-CWE 16 GB DIMMs at 3200 MHz per CPU, Supermicro H11DSi. OpenFOAM v2112 on Ubuntu 22.04.2 LTS. Code:
# Cores   Wall Time (s)
-----------------------
64        202.76
56         18.17
48         19.46
40         21.11
32         23.11
28         24.26
24         27.72
20         32.71
16         40.48
12         52.69
8          77.48
4         165.22
1         729.19
I don't fully understand why the 64-core run takes so much longer, though I suspect it's just me still being new and not understanding the options and system setup. I have seen the same thing in my real runs: using 64 cores causes a huge jump in solve time, but backing off even a couple of cores brings it back in line with what I expect.
August 25, 2023, 23:06
#722
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14
Nice result! Impressive for a first try, too. Did you have anything else running during the 64-core run? That run has no cores or threads to spare, so that would be my guess. I usually leave hyperthreading on so that other processes have a chance to get a thread.
August 26, 2023, 00:06
#723
Senior Member
Join Date: Jun 2011
Posts: 208
Rep Power: 16
August 26, 2023, 05:35
#724
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
Some heavy background processes, thermal throttling, sub-optimal core binding, missing memory channels, an excessive amount of memory errors... Here is what you could do to get to the bottom of it:
1) Install this: https://www.supermicro.com/de/soluti...re/superdoctor
It is a handy tool for monitoring a lot of things, like various temperature sensors you can not get otherwise, or memory errors.
2) Check for background processes before running the benchmark, for example with top or htop.
3) See if all memory is recognized. SD5 can give you an idea; I like to check the output of dmidecode -t 17.
4) When running the benchmark, you can check the CPU core frequencies with turbostat. If anything causes throttling, you will likely see the core frequencies drop.
5) Just before you run the benchmark, clear caches: echo 3 | sudo tee /proc/sys/vm/drop_caches
You can also check the output of numactl -H to see how much free memory each NUMA node has.
6) Optionally, use NPS4 mode instead of NPS1. It's a BIOS setting. This won't get rid of the outlier, but it is the recommended setting for our workloads.
7) Take control of core binding, e.g. do the 64-thread solver run again with:
mpirun -np 64 --bind-to core --rank-by core --map-by numa simpleFoam -parallel > log.simpleFoam 2>&1
A script condensing these checks follows below.
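To keep the steps together, here is a minimal pre-run sketch combining points 3), 4), 5) and 7). It is only a sketch: it assumes OpenMPI's mpirun, a case already decomposed for 64 ranks, and grep patterns that may differ between dmidecode versions. Code:
#!/bin/bash
# Sketch of a pre-run checklist based on the list above.

# 3) Installed DIMMs and their configured speed
sudo dmidecode -t 17 | grep -E "Size:|Configured"
# ...and free memory per NUMA node
numactl -H | grep free

# 5) Clear the page cache right before the run
echo 3 | sudo tee /proc/sys/vm/drop_caches

# 4) Log core frequencies in the background during the run
sudo turbostat --quiet --interval 5 > turbostat.log 2>&1 &
TURBO_PID=$!

# 7) Solver run with explicit core binding
mpirun -np 64 --bind-to core --rank-by core --map-by numa \
    simpleFoam -parallel > log.simpleFoam 2>&1

sudo kill $TURBO_PID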
August 26, 2023, 12:06
#725
New Member
Chris
Join Date: Nov 2022
Posts: 18
Rep Power: 3
Yes, in retrospect running with all cores becomes a problem when other stuff is running. I was just focused on the result numbers.
My first thought was that since only one CPU can talk to the M.2 drive, there were some issues with writing all the results to disk. But the obvious answer is that I had a remote desktop application running. Which, duh, that's going to need some processing to run. I re-ran the benchmark with 63, 62, 61, and 60 cores and the same program running, just for consistency (a scripted version of such a sweep is sketched after the table). Code:
# Cores   Wall Time (s)
-----------------------
64        202.76
63         23.02
62         20.04
61         18.08
60         18.07
56         18.17
48         19.46
40         21.11
32         23.11
28         24.26
24         27.72
20         32.71
16         40.48
12         52.69
8          77.48
4         165.22
1         729.19
Edit: just to add, I know I'm not thermal throttling. CPU1 maxes out at about 62°C and CPU2 at 52°C (the coolers feed into each other). All the RAM is recognized and running at the correct speed.
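For anyone repeating a sweep like this, a minimal sketch, assuming the benchmark case is already meshed, the OpenFOAM environment is sourced, and the solver log ends with the usual cumulative ClockTime lines (the grep is illustrative): Code:
#!/bin/bash
# Hypothetical sweep over core counts for the benchmark case.
for n in 64 63 62 61 60; do
    # Re-decompose the case into n subdomains
    foamDictionary -entry numberOfSubdomains -set $n system/decomposeParDict
    decomposePar -force > log.decomposePar.$n 2>&1
    mpirun -np $n simpleFoam -parallel > log.simpleFoam.$n 2>&1
    # The last ClockTime line in the log is the total wall time
    echo "$n cores: $(grep ClockTime log.simpleFoam.$n | tail -n 1)"
done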
August 29, 2023, 13:29
#726
Senior Member
Join Date: Jun 2011
Posts: 208
Rep Power: 16
August 30, 2023, 10:58
#727
Member
Join Date: Nov 2019
Posts: 96
Rep Power: 6
If you run two simulations on one computer, then they will have to share the available memory bandwidth, won't they? In this case the benchmark job takes 23 seconds on 32 cores with the remaining 32 cores sitting idle. When you start two such 32-core jobs alongside each other, they will each take much longer than 23 seconds to complete because they compete for memory bandwidth. I guess one can utilize the remaining cores for some other activity that is not memory bound?
August 30, 2023, 13:56
#728
New Member
Chris
Join Date: Nov 2022
Posts: 18
Rep Power: 3
Probably, yes. It depends on what your greater need is: more results or faster results.
For fun I set up two benchmarks to use 30 cores each and ran them at the same time, doing nothing to control which cores got assigned to each. Times were 36.21 and 35.75 seconds. So yes, slower.
If I look at the "real world", meaning the analysis I run most often, I have the following solve times:
Single run on 30 cores when both CPUs are installed: ~19 hours
Single run on 60 cores with both CPUs: ~17 hours
Two runs in parallel with 30 cores each: ~35 hours
This is a really rough calculation: average compute time per time step, multiplied by the number of time steps, multiplied by the number of nose angles each run has. It basically comes out to a wash whether you run one at a time or two side by side; it will take about the same time to get there. So if I have a design with small tweaks between two versions, I'd probably run both together so I can come back in a couple of days and see which I like better. If I just have one, run it with 60.
Again, I've done nothing to optimize which cores are used or to tune anything, just brute-force thrown two simulations at the machine (a socket-pinning sketch follows below). I'm sure I could dial things in as needed, but for the work I do it's not a big deal to be less than perfectly optimized. These solve times are already so much faster than the old Xeon server I had before that I'm happy even with the un-optimized setup.
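For controlling which cores two side-by-side jobs get, one option is to pin each job to its own socket. A minimal sketch, assuming OpenMPI, hypothetical case directories caseA and caseB, and a numbering where cores 0-31 sit on socket 0 and 32-63 on socket 1 (check yours with lscpu or numactl -H): Code:
#!/bin/bash
# Two concurrent 30-rank jobs, one pinned to each socket.
( cd caseA && mpirun -np 30 --cpu-set 0-29 --bind-to core \
      simpleFoam -parallel > log.simpleFoam 2>&1 ) &
( cd caseB && mpirun -np 30 --cpu-set 32-61 --bind-to core \
      simpleFoam -parallel > log.simpleFoam 2>&1 ) &
wait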
August 30, 2023, 18:34
#729
Senior Member
Join Date: Oct 2011
Posts: 242
Rep Power: 17
"So if I have a design with small tweaks between two versions I'd probably pick to run both together."
You can also use a job scheduler such as slurm, or a simple script to schedule your runs. |
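For the simple-script option, something like the following would run the cases back to back; the case directory names and core count are placeholders: Code:
#!/bin/bash
# Minimal sequential "scheduler": each queued case gets all 60 cores in turn.
for case in designA designB; do
    ( cd "$case" && mpirun -np 60 --bind-to core \
          simpleFoam -parallel > log.simpleFoam 2>&1 )
done
With Slurm you would instead submit each case with sbatch and let the scheduler serialize them according to the requested resources.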
August 31, 2023, 23:52
#730
Senior Member
Join Date: Jun 2011
Posts: 208
Rep Power: 16
Quote:
So if I have a design with small tweaks between two versions, I'd probably run both together so I can come back in a couple of days and see which I like better. If I just have one, run it with 60.
That is what I've been doing on my 5975WX with 32 cores. Since the time difference between 30 and 15 cores is only about 25%, and real life calls for testing various design scenarios, I run two of them simultaneously on 15 cores each. I have plenty of RAM for that, though.
September 1, 2023, 03:25
#731
Member
Join Date: Nov 2019
Posts: 96
Rep Power: 6
This is interesting, could you maybe post some numbers on how long each scenario takes? I tried the same thing on a dual EPYC 7763 (128 cores in total) and found that I can't "cheat" the available memory bandwidth. In particular, the following two scenarios finish in pretty much exactly the same wall clock time:
1) Two instances of the same simulation executed alongside each other on 64 cores each.
2) The same simulation executed twice on 128 cores sequentially (the second sim starts when the first finishes).
The job had about 60 million cells (so large enough to saturate the memory bandwidth) in STAR-CCM+. I guess this will depend on the simulated physics; in my case it was just simple air flow using the segregated solver.
September 1, 2023, 03:36
#732
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
Running several instances simultaneously on lower thread counts is not faster with the benchmark in this thread.
This only works when something other than the memory subsystem (including last-level caches) is limiting parallel efficiency.
Last edited by flotus1; September 1, 2023 at 10:29.
September 2, 2023, 23:50
#733
Senior Member
Join Date: Jun 2011
Posts: 208
Rep Power: 16
September 20, 2023, 01:36
#734
New Member
Join Date: Aug 2023
Posts: 2
Rep Power: 0
OK, so I finally got to run this test.
CPU: 7800X3D (8 cores, 5.2 GHz boost frequency, 2 memory channels)
RAM: 96 GB (2x48 GB) DDR5-5600 CL40-40-40-89 1.25V AMD EXPO. At the moment I could only get 5400 MHz with the latest BIOS and the EXPO profile activated.
OS: Ubuntu (Linux native)
System cost: 1900 (Spain)

My results:
Code:
# Cores   Meshing time   Flow calculation (s)
1         5:47.31        399.07
2         3:58.93        207.03
4         2:31.55        131.83
6         1:56.04        112.03
8         1:42.43        105.40

Competitor CPUs we have benchmarks for (for comparison):

Malinator
HW: AMD Ryzen 7700X (8-core Zen 4), MSI MAG B650, 2x16 GB DDR5 (XMP 6200 MHz C40, Hynix M-die based)
HW tuning: SMT off, PBO on, Curve Optimizer to reduce core voltage by 30 mV, memory timings and subtimings carefully optimized at 6200 MHz (30-37 etc.), FCLK 2133 MHz, Linux native
Code:
# Cores   Wall (flow calculation) time (s)   Meshing time (s)
1         331.5                              567.0
2         192.9                              399.4
4         126.2                              241.0
6         110.3                              209.4
8         105.9                              162.9

Simbelmynė (1)
5800X3D, 2x8 GB DDR4 Rank 1 @ 3200 MT/s (14-14-14-14-28, 1T)
OFv9, OpenSUSE Tumbleweed, GCC 11.2, kernel 5.17.4
2x8 GB DDR4 Rank 1 @ 3800 MT/s (16-16-16-16-32, 1T)
Code:
# Cores   Simulation (s)   Meshing (min.sec)
1         304              12m14
2         188               8m12
4         135               4m58
6         124               3m55
8         122               3m28

Simbelmynė (2)
Intel 13900K (HT off), 32 GB DDR5 @ 7200 MT/s (34-44-44-96), Ubuntu 22.04, OpenFOAM v10
Meshing (1, 2, 4, 8 cores): 7m45.887s, 5m32.672s, 3m24.995s, 2m16.678s
Code:
# Cores   Wall time (s)
1         301.118
2         164.46
4         101.268
8          70.3852

Conclusion: I feel pretty relieved that a first-ever build with no OC gets a good, or at the very least logical, result. I wanted 128 GB, but I could not get that with currently available kits, so I got a 96 GB kit and sacrificed some speed in the process. I would say the build is faring well against the closest comparison, the 7700X build. Malinator's RAM is about 15% faster, so the X3D is probably adding about 15% in extra performance (for this benchmark). Still, I would say the 5800X3D is pretty much the best bang for the buck in this segment, at least until fast DDR5 memory prices fall a lot.
Let me know what you think, or if you think I should re-run the test with different BIOS settings or anything like that, I am happy to try. Thanks to all of you for all the build sharing and discussion; I don't think there is a better resource on the whole internet for making such an expensive and complex purchase.
September 20, 2023, 05:47
#735
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14
Congratulations on your result. Also, very nice presentation with the other comparable results. What version of OpenFOAM are you running?
October 29, 2023, 14:39
#736
New Member
Alexander Kazantcev
Join Date: Sep 2019
Posts: 24
Rep Power: 7
Dual Xeon 8352Y ES / 16x 3200 MHz single-rank DIMMs / OpenFOAM v1812 precompiled for Xeon v4 only / no BIOS tuning
# Cores   Mesh time (s)   Wall time (s)
1         921.61          705.96
2         636.71          369.34
4         345.68          177.93
6         262.85          117.40
8         217.23           92.17
12        195.86           66.83
16        159.11           54.40
20        131.71           46.87
24        127.67           42.70
26        138.43           40.81
28        121.93           38.91
30        124.79           37.92
32        124.38           37.57
34        124.64           36.20
36        130.59           35.83
October 29, 2023, 20:43
#737
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14
There is a BIOS configuration with just 16 cores active per processor. It would be interesting to see if you get better performance with it.
October 30, 2023, 05:56
#738
New Member
Alexander Kazantcev
Join Date: Sep 2019
Posts: 24
Rep Power: 7
Are you referring to the SST-PP 2.0 function of the Y-chips?
Platinum 8352Y Intel® Speed Select Technology - Performance Profile (Intel® SST-PP):
Config   Active Cores   Base Frequency   TDP
1        24             2.3 GHz          185W
2        16             2.6 GHz          185W
(High priority cores: 12 at 2.40 GHz; low priority cores: 20 at 2.00 GHz)
I'll check such a configuration, but I think it is mainly marketing. Chips from both the v3 and v4 families can raise frequencies to the maximum value, because the TDP package limit stays the same after disabling some cores. As I saw earlier, both ES chips run at 3.4 GHz with up to about 8-12 threads loaded. The frequencies can be verified during a run; see the sketch below.
Last edited by AlexKaz; October 30, 2023 at 07:10.
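As a quick check of those frequencies while a run is under way, turbostat (already suggested earlier in the thread) can print per-core numbers; the intel-speed-select utility from the Linux kernel tools may also report the active SST-PP config level, though its availability and exact subcommands depend on your kernel tools version: Code:
# Print average and busy core frequencies every 5 seconds
sudo turbostat --quiet --show Core,CPU,Avg_MHz,Bzy_MHz --interval 5

# If intel-speed-select is installed (assumption), show the perf-profile info
sudo intel-speed-select perf-profile info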
October 30, 2023, 12:13
#739
New Member
Alexander Kazantcev
Join Date: Sep 2019
Posts: 24
Rep Power: 7
With only 2x16 cores active:
# Threads   Mesh (s)   Wall (s)
1           900.475    721      (single-thread frequency with one loaded CPU core is 3650-3680 MHz)
28          123.02      40.03
30          123.98      38.74
32          121.75      38.85
34          163.99      54.25
36          182.85      50.95
Last edited by AlexKaz; November 1, 2023 at 13:27.
November 5, 2023, 20:07
#740
Senior Member
Why are the Xeon Platinum series CPUs missing from the data?