June 6, 2022, 23:59 |
|
#521 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14 |
Quote:
Yes, but only when something is wrong in the setup so that the possible bandwidth is not achieved. Otherwise, bandwidth is a key factor that translates directly into OpenFOAM performance. What I was saying is that his performance is in the right ballpark, except that, considering the more modern CPU and higher clock, I would expect it to be a bit better. Maybe it is thermal throttling, maybe WSL2. Maybe his CPU was having a slow day. I don't know. |
||
June 7, 2022, 02:23 |
AMD Threadripper 1950X Ubuntu 20.04, no WSL
|
#522 | |
Member
Marco Bernardes
Join Date: May 2009
Posts: 59
Rep Power: 17 |
Quote:
# cores   Wall time (s):
------------------------
1 2 4 6 8 10 12 14 16
Meshing Times:
1 1026.86
2 697.82
4 397
6 294.65
8 251.36
10 231.26
12 210.35
14 201.72
16 207.07
Flow Calculation:
1 852.77
2 510.34
4 220.9
6 181.68
8 160.85
10 153.79
12 144.88
14 145.64
16 143.53 |
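For anyone who wants to turn these wall times into scaling figures, here is a minimal sketch that computes speedup and parallel efficiency relative to the single-core run. The numbers are the flow-calculation times from the post above; the helper name `scaling` is only an illustration and is not part of the benchmark scripts.

```python
# Flow-calculation wall times (s) from the Threadripper 1950X post above,
# keyed by number of cores.
flow_times = {
    1: 852.77, 2: 510.34, 4: 220.9, 6: 181.68, 8: 160.85,
    10: 153.79, 12: 144.88, 14: 145.64, 16: 143.53,
}


def scaling(times):
    """Return {cores: (speedup, efficiency)} relative to the 1-core run.

    speedup    = T(1) / T(n)
    efficiency = speedup / n
    `scaling` is a hypothetical helper name, not part of the benchmark.
    """
    t1 = times[1]
    return {n: (t1 / t, t1 / t / n) for n, t in sorted(times.items())}


if __name__ == "__main__":
    for n, (speedup, efficiency) in scaling(flow_times).items():
        print(f"{n:>2} cores: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")
```

With these numbers the speedup levels off just under 6x from about 12 cores onward, which is the kind of plateau you would expect once memory bandwidth is the limit.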
||
June 7, 2022, 04:04 |
|
#523 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
That's A LOT of performance left on the table with WSL. I wonder if it can be tweaked in any way to yield better results, or if that's just the price of convenience.
|
|
June 7, 2022, 21:08 |
|
#524 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14 |
Seems that WSL is OK for 1 or 2 cores, but loses performance as you go beyond that. Is there some limit on the amount of resources that gets allocated to WSL? (Looks like 50% in your case, masb.)
|
|
June 9, 2022, 18:52 |
|
#525 | |
Senior Member
Join Date: May 2012
Posts: 551
Rep Power: 16 |
Quote:
And I am saying this is not true. As a general indicator, bandwidth is by far the most important metric for CFD. However, recent CPUs from AMD (and possibly Intel) have shown that bandwidth is not the entire story. Check out the results from the 5800X3D, for instance. It is really good in terms of performance per bandwidth. This started to be visible with Zen 2, most likely because Intel had only produced minor upgrades to its desktop CPUs for several years. Last edited by Simbelmynë; June 10, 2022 at 02:39. |
||
June 12, 2022, 21:03 |
|
#526 | ||
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14 |
Quote:
The 5800 itself gets almost proportionally better with bandwidth. Quote:
The memory bandwidth of the 2xE5-2687v2 is just under twice the bandwidth of your 5800X. The performance ratio is 122/84=1.4, so there has been an improvement, probably related to cache organization and cache capacity. The more cache can be utilized, the more your effective bandwidth goes up. So the improvement you are talking about is 40% in ten years. |
|||
June 13, 2022, 02:48 |
|
#527 | ||||
Senior Member
Join Date: May 2012
Posts: 551
Rep Power: 16 |
Quote:
I think more recent CPUs should be compared as well. Quote:
Here are some CPUs from 2017. All of them have dual-rank memory (compared to single-rank for the 5800X3D). If we look at the 3200 MT/s results, the first two HEDT CPUs have double the theoretical bandwidth and the 8700K has identical theoretical bandwidth. Quote:
Clearly there is a huge improvement where bandwidth is not the only answer. Memory latency and cache size likely play an important role as well. If you wish to compare HEDT with HEDT, then look at the results from the 3990X. This also gives an indication of how good the architecture is, even if it is one generation older than the 5800X3D. Quote:
With similar architecture and a huge cache then bandwidth is king. |
|||||
June 14, 2022, 04:35 |
|
#528 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14 |
I think Geon-Hong misstated his configuration. He must have 8 channels active. There is a comparable Threadripper 3960x in the data. Its single-core performance is better than Geon-Hong's, but it is bandwidth limited at 93 seconds. That one has four channels:
|
|
June 14, 2022, 06:05 |
|
#529 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14 |
I was wrong about the 3990x: it has only 4 memory channels.
8x1866 = 14928 MT/s for E5-2697v2(x2)
4x3200 = 12800 MT/s for 3960x
2x3200 = 6400 MT/s for 5800X3D
4x2666 = 10664 MT/s for 3990x

MT to complete benchmark:
Code:
CPU        DIMM  CH   MT/s    Benchm.        MT
E5-2697v2  1866   8  14928  x   84s   =  1253952
3960x      3200   4  12800  x   93s   =  1190400
5800X3D    3200   2   6400  x  139s   =   889600
3990x      2666   4  10664  x   63s   =   671832

E5-2697v2 = 1.41 x more MT to complete than 5800X3D
3960x = 1.34 x more MT to complete than 5800X3D
3990x = 1.32 x fewer MT to complete than 5800X3D

Level 3 Caches are:
Code:
CPU        Cache   Cores    Cache per     Work per
                   at Sat.  Core at Sat.  Core at Sat.
E5-2697v2   60 MB  24       2.5 MB         4.1%
3960x      128 MB  16       8 MB           6.2%
5800X3D     96 MB   6       16 MB         16.7%
3990x      256 MB  32       8 MB           3.1%

Last edited by wkernkamp; June 20, 2022 at 13:34. Reason: Added x2 for dual E5-2697v2 |
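The "MT to complete benchmark" figure is just DIMM speed times channel count times wall time at saturation. A minimal sketch of that arithmetic follows, using the figures from the post and taking the 3990x rate as 4 x 2666. The helper name `benchmark_mt` is hypothetical, and the result is based on theoretical transfer rate, not on measured memory traffic.

```python
# Theoretical transfer rate = DIMM speed (MT/s) x number of memory channels.
# "Benchmark MT" = transfer rate x wall time at saturation, as in the post.
# This is an upper bound from the spec sheet, not measured traffic.
systems = [
    # (CPU, DIMM MT/s, channels, benchmark time in s)
    ("E5-2697v2 (x2)", 1866, 8, 84),
    ("3960x", 3200, 4, 93),
    ("5800X3D", 3200, 2, 139),
    ("3990x", 2666, 4, 63),
]


def benchmark_mt(dimm_mts, channels, seconds):
    """Theoretical mega-transfers available over the whole run.

    Hypothetical helper name, used only for this illustration.
    """
    return dimm_mts * channels * seconds


totals = {cpu: benchmark_mt(d, ch, t) for cpu, d, ch, t in systems}
reference = totals["5800X3D"]

if __name__ == "__main__":
    for cpu, mt in totals.items():
        print(f"{cpu:<15} {mt:>8} MT  ({mt / reference:.2f} x 5800X3D)")
```

A ratio above 1 means the machine had more theoretical transfers available than the 5800X3D to finish the same case; a ratio below 1 means it needed fewer.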
|
June 14, 2022, 12:11 |
|
#530 |
Senior Member
Join Date: May 2012
Posts: 551
Rep Power: 16 |
@wkernkamp
I like the idea of total MT to run the benchmark. Even if we have no idea what the actual bandwidth usage was during the simulation, this at least gives a relation that is based on theoretical bandwidth as well as actual simulation time. It also illustrates the sometimes subtle differences between architectures. I was surprised by the large difference between the 3960X and 3990X; they both have the same L3 per core and the same architecture. I would have guessed that the 3960X would be faster due to its faster memory, but there may be other factors in play here. My guess is RAM timings, perhaps rank, and the Linux kernel being used. |
|
June 14, 2022, 12:28 |
|
#531 | |
Member
Kailee
Join Date: Dec 2019
Posts: 35
Rep Power: 6 |
Quote:
Way behind (not just on a different field, but in a different park) came WSL. Admittedly, this was about a year ago and I understand things have probably moved along, but it was clearly not a viable alternative unless you're just interested in tinkering. Out of my 60 cores total, 20 live on my VMware (data) server, which runs TrueNAS Core (4 cores) for the data and a compute VM with 16 cores; 32 cores are on a dedicated 4-socket bare-metal compute node, and a further 8 are in my workstation. This is a compromise that works surprisingly well in a 10Gb environment. Sorry for the anecdotal-only data. I'll try to find actual numbers. Kai. |
||
June 14, 2022, 23:10 |
for dual E5 2683 v4
|
#532 | ||
New Member
Alexander Kazantcev
Join Date: Sep 2019
Posts: 24
Rep Power: 7 |
Quote:
Quote:
After some optimizations, the dual 2683v4 runs the 32-thread solution in 67-68 seconds. HT on, NUMA on, COD on, 2133, dual rank, 8 DIMMs, foam v1812 (about the same for v2112). I think the main reasons are NUMA being on and the earliest microcode for CPUID 406F1, 0x0B00000B. Last edited by AlexKaz; June 17, 2022 at 09:31. |
|||
June 14, 2022, 23:42 |
|
#533 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14 |
Can you publish the full curve for the optimized machine? By the way, you should use 2400 MHz RDIMMs for best performance. I am interested in the result for 24 cores for comparison with the 2xE5-2697v2.
|
|
June 15, 2022, 04:19 |
|
#534 |
New Member
Alexander Kazantcev
Join Date: Sep 2019
Posts: 24
Rep Power: 7 |
Sorry, in my case it does not run at 2400 with 8 DIMMs. Only 7 DIMMs work at 2400. Such is the silicon lottery for used CPUs.
Last edited by AlexKaz; June 15, 2022 at 12:24. |
|
June 15, 2022, 12:24 |
|
#535 |
New Member
Alexander Kazantcev
Join Date: Sep 2019
Posts: 24
Rep Power: 7 |
I can add only times for 2133, 2 rank, 13-12-12-....
1 1535.27 1098.81
2 1018.75 550.63
4 573.74 257.45
8 364 135.52
10 339.37 101.29
12 321.41 97.4
14 266.07 94.89
16 258.09 82.39
18 237.39 84.1
20 210.51 75.66
22 236.61 78.39
24 200.13 71.75
26 213.59 76.73
28 186.62 69.07
30 189.12 73.23
31 195.98 70.99
32 182.98 68.03 |
|
June 15, 2022, 16:53 |
|
#536 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14 |
Thanks for posting. Interesting that there is quite a bit of fluctuation up and down as the number of cores goes up.
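To make the fluctuation concrete, here is a small sketch that lists the core counts where the time went up compared with the previous data point. It uses the dual E5-2683 v4 numbers posted above. The column meaning is an assumption: the post does not label the columns, so the third column is taken to be the flow-calculation time, following the meshing-then-flow order used elsewhere in the thread. The helper name `regressions` is illustrative only.

```python
# Rows from the dual E5-2683 v4 post: (cores, second column, third column).
# Assumption: the second column is meshing time and the third column is the
# flow-calculation wall time in seconds; the post itself does not label them.
rows = [
    (1, 1535.27, 1098.81), (2, 1018.75, 550.63), (4, 573.74, 257.45),
    (8, 364, 135.52), (10, 339.37, 101.29), (12, 321.41, 97.4),
    (14, 266.07, 94.89), (16, 258.09, 82.39), (18, 237.39, 84.1),
    (20, 210.51, 75.66), (22, 236.61, 78.39), (24, 200.13, 71.75),
    (26, 213.59, 76.73), (28, 186.62, 69.07), (30, 189.12, 73.23),
    (31, 195.98, 70.99), (32, 182.98, 68.03),
]


def regressions(rows, column=2):
    """Core counts whose time is higher than at the previous core count.

    Hypothetical helper; `column` selects which time column to inspect.
    """
    return [
        cur[0]
        for prev, cur in zip(rows, rows[1:])
        if cur[column] > prev[column]
    ]


if __name__ == "__main__":
    print("third column slower than previous point at:", regressions(rows))
    print("second column slower than previous point at:", regressions(rows, column=1))
```

On the third column this flags 18, 22, 26 and 30 cores, so the up-and-down pattern only appears once the run goes past 16 cores.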
|
|
June 16, 2022, 08:58 |
System76 Galago Ultrapro (2014 Laptop)
|
#537 |
New Member
Daniel
Join Date: Jun 2010
Posts: 14
Rep Power: 16 |
Hey guys,
Kudos to all for keeping this thread active. I am looking to (finally) replace my Galago Ultrapro, bought in 2014. I have been using it until it gets too close to fubar, and decided to run the benchmark on it to get a sense of the upgrade available with today's options. The system has an Intel(R) Core(TM) i7-4750HQ (clock 2GHz - 3.2GHz); data from lscpu: Code:
Vendor ID: GenuineIntel
  Model name: Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
    CPU family: 6
    Model: 70
    Thread(s) per core: 2
    Core(s) per socket: 4
    Socket(s): 1
    Stepping: 1
    CPU max MHz: 3200.0000
    CPU min MHz: 800.0000
...
Caches (sum of all):
  L1d: 128 KiB (4 instances)
  L1i: 128 KiB (4 instances)
  L2: 1 MiB (4 instances)
  L3: 6 MiB (1 instance)
  L4: 128 MiB (1 instance)
NUMA:
  NUMA node(s): 1
  NUMA node0 CPU(s): 0-7

Memory is DDR3 1600MHz 2x4GB in dual channel, supported by that large L4 cache. Bench results are: Code:
Meshing Times:
1 1522.67
2 971.47
3 740.59
4 584.75
Flow Calculation:
1 914.75
2 512.87
3 236.5
4 363.65

Cache hierarchy plays a central role in guaranteeing cores are properly fed and saturated with correct data (increased prefetching performance, etc.) - see how this cpu gets best fed with 3 threads, showcasing that no rule is 100% applicable to each cpu, in terms of OF performance.

Now moving to some of these DDR5 equipped notebooks with a reasonable gpu and let this guy here rest in pieces

Cheers |
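For reference, the theoretical peak bandwidth of a memory configuration like this one is transfer rate times bus width times channel count. The sketch below assumes the standard 64-bit (8-byte) data bus per DDR channel and reads "DDR3 1600MHz" as DDR3-1600, i.e. 1600 MT/s. The function name is illustrative, and the L4 cache is not part of this figure.

```python
# Assumption: standard 64-bit data bus per DDR channel (ECC bits not counted).
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gbs(mt_per_s, channels, bytes_per_transfer=BYTES_PER_TRANSFER):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    Hypothetical helper: transfers/s x bytes per transfer x channels.
    """
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    # DDR3-1600 in dual channel, as in the laptop above.
    print(f"DDR3-1600 x2: {peak_bandwidth_gbs(1600, 2):.1f} GB/s")
    # DDR4-3200 in dual channel, the 5800X3D configuration quoted earlier.
    print(f"DDR4-3200 x2: {peak_bandwidth_gbs(3200, 2):.1f} GB/s")
```

This only gives the spec-sheet ceiling for main memory; sustained bandwidth in a real run is lower and depends on timings, rank and the rest of the platform.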
|
June 17, 2022, 06:49 |
|
#538 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Sorry for barging right into the middle of this conversation, but the benchmark running faster on 3 cores than on 4 cores on a laptop can have so many other reasons. "Cache hierarchy" would be way down on my list for checking potential causes.
|
|
June 17, 2022, 11:07 |
|
#539 |
New Member
Daniel
Join Date: Jun 2010
Posts: 14
Rep Power: 16 |
Hey flotus1, your comments are always most welcome, no need to apologize
I’ve repeated the runs at least 5 times, without even X11 running and each one separately (in order to control temperatures); results didn’t vary by more than 5% - I just took the last run and put it here. At the end of the day, one has to assess the entire platform (hardware and host software) - Simbelmynë’s last post is all about that too. I confess that the laptop still serves my coding needs very well (no local compiling/running on it though), but its time has come |
|
June 20, 2022, 13:31 |
|
#540 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14 |
Quote:
Your machine is very interesting for the current discussion, because it has an exceptionally large cache. If we analyze the number of transactions required to complete the benchmark in the same way as I did above, we get: Code:
CPU        DIMM MT/s  Channels  MT/s  Benchm.  MT
i7-4750HQ  1600       2         3200  236.5s   756800 |
||