|
March 28, 2024, 23:10
|
#761 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Interesting that multiples of 24 show up in the results.
|
|
March 29, 2024, 07:39 |
|
#762 |
Member
Philipp Wiedemer
Join Date: Dec 2016
Location: Munich, Germany
Posts: 42
Rep Power: 10 |
My hypothesis is that if we benchmarked all of the core counts, we would see a saw-tooth pattern in the speedup.
My explanation: at 168 cores, for example, all of the CCDs are balanced nicely with 7 cores each. If we add just one more core, the workload of each core drops only a tiny bit (by a factor of 168/169), but one of the CCDs now has 8 cores instead of 7, so the workload of that one CCD increases by a factor of 8/7 * 168/169. In the simulation, this one CCD acts as the weakest link. If we add yet another core, we get another tiny per-core speedup (the per-core load is now 168/170 of the 168-core case) and a second "bad" CCD, but this second "bad" CCD doesn't make things worse, because only the weakest link counts. So 168 is best, 169 is a lot worse, 170 is a little better than 169 but still a lot worse than 168, and so on until 192, where the CCDs are balanced again and we have the best performance. |
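A minimal sketch of this weakest-link idea in Python (purely illustrative: it assumes a hypothetical machine with 24 CCDs, round-robin core placement, and a runtime set entirely by the busiest CCD):
Code:
# Sketch of the "weakest CCD" hypothesis: runtime is assumed to be set by
# the most heavily loaded CCD. All numbers are illustrative assumptions.
import math

N_CCD = 24        # assumed CCD count across both sockets
TOTAL_WORK = 1.0  # total work, arbitrary units

def relative_runtime(n_cores):
    work_per_core = TOTAL_WORK / n_cores
    # With even scattering, the busiest CCD holds ceil(n_cores / N_CCD) cores.
    busiest_ccd_cores = math.ceil(n_cores / N_CCD)
    return busiest_ccd_cores * work_per_core  # work carried by the weakest link

for n in (168, 169, 170, 180, 191, 192):
    print(n, round(relative_runtime(n), 5))
# Prints the saw-tooth: 169 is the worst point, 168 and 192 come out equal.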
|
March 29, 2024, 08:35 |
|
#763 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
True in theory, when scattering threads across CCDs. To actually see this, quite a bit of effort would be needed to reduce run-to-run variance.
Additionally, with this many cores, the quality of the domain decomposition can have a larger impact than adding or removing a few cores. |
|
March 30, 2024, 03:50 |
|
#764 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Quote:
|
||
April 3, 2024, 06:12 |
|
#765 |
New Member
Alexander Kazantcev
Join Date: Sep 2019
Posts: 24
Rep Power: 7 |
I think the main difference with the new Epyc results lies in the huge L3 cache alone. Looking at /proc/your_process_id/status, I saw that the snappyHexMesh and simpleFoam binaries use about 400 MB of code including libraries. It is not surprising that if all of that code fits in cache, we sometimes see a big speedup. Also, as far as I can see, the main code itself is quite small, only a few MB.
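For reference, a small sketch of how to read those numbers on Linux (pass the PID of a running simpleFoam process; VmExe is the executable's text segment, VmLib the mapped libraries, the script name is just an example):
Code:
# Sketch: print the code/library footprint of a running process on Linux.
# Usage: python3 vmsize.py <pid>   (e.g. the PID of a running simpleFoam)
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else "self"
fields = {}
with open(f"/proc/{pid}/status") as f:
    for line in f:
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()

# VmExe = text segment of the binary, VmLib = mapped shared libraries,
# VmRSS = total resident memory
for key in ("VmExe", "VmLib", "VmRSS"):
    print(key, fields.get(key, "n/a"))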
|
|
April 3, 2024, 08:31 |
|
#766 | |
New Member
Daniel
Join Date: Jun 2010
Posts: 14
Rep Power: 16 |
Quote:
I'm confident that anyone could find more scientific papers in computer science journals that cover this topic, if needed. |
||
April 16, 2024, 17:26 |
|
#767 |
New Member
Marius
Join Date: Sep 2022
Posts: 27
Rep Power: 4 |
Apple MacBook Pro with M1 Max and 32 GB RAM, running the natively compiled OpenFOAM version 2312.
# cores | Wall time (s)
------------------------
8 | 85.57
6 | 102.25
4 | 135.12
2 | 240.02
1 | 433.18

Last edited by Counterdoc; April 20, 2024 at 18:19. |
|
April 17, 2024, 15:30 |
|
#768 |
New Member
DS
Join Date: Jan 2022
Posts: 15
Rep Power: 4 |
Lenovo ThinkStation P520c, Xeon W-2275 (HT Off), 4 x 32GB DDR4 2666MHz
OpenFOAM 2312 (precompiled), Ubuntu 23.10.1, Motorbike_bench_template.tar.gz (default settings)

# cores | Meshing wall time (real) | Solver wall time (s)
------------------------
1 | 10m5s | 790
2 | 7m13s | 412
4 | 4m10s | 205
6 | 2m57s | 153
8 | 2m26s | 134
12 | 2m2s | 118
14 | 2m15s | 116 |
|
April 18, 2024, 04:38 |
CPU frequency vs. L3 cache
|
#769 |
New Member
Jamie
Join Date: Apr 2024
Posts: 1
Rep Power: 0 |
Hi, I am quite new to CFD (a 2D guy, water simulations). I have read this thread and learnt a lot about hardware, thanks everyone. I am going to build a server, something like 2x Epyc (used) and 16 x 16 GB RAM. The question is: should I prefer CPU frequency or L3 cache if the other specs are about the same? For example, Epyc 7532 (2.4 GHz / 256 MB) vs. Epyc 7542 (2.9 GHz / 128 MB). Or is there any notable difference? Thanks
|
|
April 19, 2024, 19:48 |
|
#770 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Quote:
|
||
April 20, 2024, 09:55 |
|
#771 |
New Member
DS
Join Date: Jan 2022
Posts: 15
Rep Power: 4 |
HPE DL360 Gen9, 2 x E5-2643 v4 (HT Off), 16 x 16GB DDR4 2400MHz (operates at 2133MHz; measured bandwidth (Intel MLC) is 105 GB/s)
OpenFOAM 2312 (precompiled), Ubuntu 23.10.1, Motorbike_bench_template.tar.gz (default settings)

# cores | Meshing wall time (real) | Solver wall time (s)
------------------------
1 | 13m14s | 1000
2 | 8m17s | 479
4 | 4m40s | 219
6 | 3m26s | 156
8 | 2m50s | 122
12 | 2m34s | 94

P.S. I tested how the number of populated RAM slots affects the actual RAM bandwidth and got some pretty weird results. With 2 modules per memory channel (16 x 2400 MHz modules in total), the RAM operating frequency drops slightly to 2133 MHz, and the measured throughput is 105 GB/s. With 1 module per channel (8 modules in total), the RAM runs at 2400 MHz, but the measured throughput is 103 GB/s. That is, as the RAM frequency increases, the throughput decreases, which is quite strange behavior.

Last edited by Crowdion; April 20, 2024 at 13:52. |
|
April 20, 2024, 17:00 |
|
#772 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Quote:
Your theoretical peak bandwidths are:

DDR4-2133: 8 * 2133 * 8 = 136.5 GB/s
DDR4-2400: 8 * 2400 * 8 = 153.6 GB/s

Neither of your measurements reaches the true limit. On a dual-socket system, the measurement can be affected by one CPU reading from memory attached to the other through the interconnect, which has its own limits and, of course, added latency. The nice thing about the OpenFOAM motorbike benchmark is that, on a properly set up system, performance is proportional to bandwidth. Does the benchmark show a difference between the 2133 and 2400 memory speeds?

I don't remember whether the HP DL360 lets you force DDR-2400 speed when two slots per channel are occupied; most server systems have that option. I have found that on these Xeon v1 through v4 systems, the fastest configuration is two DIMMs per channel of dual-rank memory. All DIMMs must be the same to keep the system symmetric; asymmetric memory configurations incur large penalties. One DIMM per channel, or single-rank memory in one or more channels, incurs small penalties. |
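The same arithmetic as a small script, in case anyone wants to plug in their own configuration (the socket count and channels per socket below are assumptions for this dual-socket setup):
Code:
# Theoretical peak memory bandwidth = channels * transfer rate (MT/s) * 8 bytes.
# Assumes a dual-socket system with 4 DDR4 channels per socket.
sockets = 2
channels_per_socket = 4
bytes_per_transfer = 8

for mt_s in (2133, 2400):
    gb_s = sockets * channels_per_socket * mt_s * bytes_per_transfer / 1000
    print(f"DDR4-{mt_s}: {gb_s:.1f} GB/s theoretical peak")
# -> DDR4-2133: 136.5 GB/s, DDR4-2400: 153.6 GB/s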
||
April 20, 2024, 17:07 |
|
#773 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Quote:
For comparison with your run, it has this performance:

Flow calculation:
# cores | Wall time (s)
------------------------
1 | 924.05
2 | 483.68
4 | 214.54
8 | 113.42
12 | 85.05

Your results are already looking good! Note that that run used two 16-core CPUs, each with a proportionally larger cache, which helps memory access. So I don't think you will be able to reach 85.05 on your system with E5-2643 v4 CPUs. |
||
April 20, 2024, 20:37 |
|
#774 | |
New Member
DS
Join Date: Jan 2022
Posts: 15
Rep Power: 4 |
Quote:
HP declares that 2 DIMMs/channel operate at 2133 MHz, and 1 DIMM/channel at 2400 MHz. My DL360 Gen9 has HP-certified HPE 809082-091 single-rank RAM installed. I removed 8 DIMMs to get a 1 DIMM/channel configuration, reran the benchmark, and got the following results:

HPE DL360 Gen9, 2 x E5-2643 v4 (HT Off), 8 x 16GB DDR4 2400MHz (operates at 2400MHz)
OpenFOAM 2312 (precompiled), Ubuntu 23.10.1, Motorbike_bench_template.tar.gz (default settings)

# cores | Meshing | Solver (s) | Meshing | Solver (s)
........| "8 x 16GB" conf. @ 2400MHz | "16 x 16GB" conf. @ 2133MHz
--------------------------------------------------
1 | 12m2s | 902 | 13m14s | 1000
2 | 8m10s | 471 | 8m17s | 479
4 | 4m35s | 221 | 4m40s | 219
6 | 3m32s | 160 | 3m26s | 156
8 | 2m45s | 130 | 2m50s | 122
12 | 2m34s | 115 | 2m34s | 94

The single-core performance is better with the "8 x 16GB" config, whereas the multi-core performance is better with the "16 x 16GB" config. Hmm, a very strange situation. I found measured bandwidth values for different systems reported in a Reddit thread: ~138 GB/s peak bandwidth for 2 x E5-2683 v4 (Supermicro X10DRG-OT+-CPU, 8 x 32GB DDR4 2400 Samsung RAM), which is much closer to the theoretical peak of 153.6 GB/s than my measured values.
||
April 20, 2024, 21:32 |
|
#775 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Quote:
I think you need at least two ranks per channel to reach maximum throughput, because the controller alternates addressing between the ranks for better performance; it is called rank interleaving, I think. Your best option is to go into the BIOS and see if you can force 2400 MHz for the two-DIMMs-per-channel config. You may already have that setting, because the 94 s result is pretty decent.

I just checked, and the E5-2643 v4 Xeon has an unusually large cache of 20 MB. That is more than the 2.5 MB per core which is the norm, so that is one reason it is doing so well. If this is a machine you are going to use for a CFD project, you should look into getting a higher-core-count processor; they are cheap. Note that a BIOS upgrade is sometimes needed before the newer, higher-core-count CPUs work. I have had that problem.

I have a pair of E5-2683 v4, which is the lower-clocked 16-core CPU (versus the E5-2697A for which I showed the result). The 2683 will do the benchmark in 64 seconds, so not much slower. I also have the 18-core E5-2686 v4; I don't remember its performance, probably around 62 seconds. The Gigabyte motherboard has better memory performance for the same CPU; per the manual, it can run two DIMMs per channel at 2400 MHz. I just looked on eBay and the E5-2683 v4 is on offer for $25. |
||
April 21, 2024, 06:04 |
|
#776 |
New Member
Marius
Join Date: Sep 2022
Posts: 27
Rep Power: 4 |
I am currently looking for a second-hand server system. I found some offers on eBay for the Intel E7-8880 in different versions.
4x Intel Xeon E7-8880 v4 - 22 cores @ 2.2 GHz base / 3.3 GHz turbo - DDR4 1866 MHz - 55 MB cache
4x Intel Xeon E7-8880 v3 - 18 cores @ 2.2 GHz base / 3.1 GHz turbo - DDR4 1866 MHz - 45 MB cache
8x Intel Xeon E7-8880 v2 - 15 cores @ 2.5 GHz base / 3.1 GHz turbo - DDR3 1600 MHz - 37.5 MB cache

Of course the v2 system is the cheapest. It would also have the most cores: v2 = 120 cores, v3 = 72 cores, v4 = 88 cores. I think v3 makes no sense, as the prices are similar to v4 systems. But how would v2 compare with v4 when there is such a big difference in the number of cores? Could DDR3 or the cache be the bottleneck? |
|
April 21, 2024, 15:01 |
|
#777 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Quote:
The number of cores is less important than the total bandwidth. If you search "intel ark E7-8880" you will find that all three processors have the same 85 GB/s bandwidth. This means that the eight-processor v2 system will have almost twice the performance of the other two with only four processors, provided the memory is configured correctly.

The E7 v2, v3 and v4 have a special memory controller ("Jordan Creek") that allows two DDR-1333 DIMMs to act as one DDR-2666 unit. The bandwidth is then 2666 * 4 * 8 / 1000 = 85 GB/s, which is the speed limit for each of these CPUs. The higher DDR4 speeds that the v3 and v4 allow are only useful when there is just one DIMM per channel. You need eight DIMMs per processor to reach that bandwidth; these DIMMs are not expensive if you need to buy more.

The Xeon CPUs get progressively more power efficient, so if power consumption is a concern, you should go for the v4 system. Once you have bought your system, run the benchmark and post your result here. It is a good check to see whether you have everything working correctly. |
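To put rough numbers on that comparison, a quick sketch (it assumes the ~85 GB/s per-socket limit from the ARK figures above applies to all three generations):
Code:
# Rough aggregate-bandwidth comparison of the three candidate systems,
# assuming ~85 GB/s per socket for every E7-8880 generation (see above).
PER_SOCKET_BW = 85  # GB/s

systems = {
    "4x E7-8880 v4 (88 cores)": 4,
    "4x E7-8880 v3 (72 cores)": 4,
    "8x E7-8880 v2 (120 cores)": 8,
}
for name, sockets in systems.items():
    print(f"{name}: ~{sockets * PER_SOCKET_BW} GB/s aggregate")
# The 8-socket v2 box ends up with roughly twice the total bandwidth.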
||
April 23, 2024, 15:16 |
|
#778 | |
New Member
Marius
Join Date: Sep 2022
Posts: 27
Rep Power: 4 |
Quote:
Thanks a lot for the explanation and recommendation! What about the SSDs? I don't want to have a bottleneck there either. |
||
April 23, 2024, 17:56 |
|
#779 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Quote:
What I tend to do is set up redundant ZFS with the HDDs (cheap storage) and SSDs as cache disk and for keeping the log. For this purpose your old 128GB SSDs are perfect. The problem with old server HDDs is that they are often near end of life. With ZFS you can correct disk errors when they occur. Mostly the old server SAS HDDs are fine, but there is the occasional failure, which won't hurt if you run ZFS. (Hardware RAID is not as good for that.) If you don't need a lot of storage you can just use one or two 1TB SSDs; two mirrored disks read twice as fast and obviously have redundancy.

If you are looking to use NVMe drives, you might be better off with a v3 or v4 system, because their BIOS usually allows booting from NVMe and PCIe splitting. My Quanta Grid server has an NVMe slot on the motherboard. There are cheap PCIe x16 cards that allow, say, 4 NVMe SSDs to be run off an x16 slot when it is split into four x4 chunks. Of course you can also use specialized cards that handle multiple NVMe and M.2 SATA drives on a PCIe slot that has not been split. The BIOS on v2 systems can sometimes be modified to allow booting from NVMe; I have successfully done this on a Supermicro server, and its BIOS now also allows PCIe splitting. |
||
August 7, 2024, 11:12 |
2x EPYC 9684X
|
#780 |
Member
Join Date: Sep 2010
Location: Leipzig, Germany
Posts: 96
Rep Power: 16 |
Results for OpenFOAM 9 on a dual EPYC 9684X with 4800 MHz DDR5 RAM:
Code:
# cores Wall time (s):
------------------------
1       546.46
4       110.53
8       51.49
12      35.64
16      27.53
20      22.26
24      19.38
28      16.93
32      15.38
40      12.53
48      10.85
56      9.62
64      8.67
96      6.92
128     6.49
160     6.03
192     6.43
Code:
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
mpirun -np 160 --report-bindings --bind-to core --rank-by core --map-by numa simpleFoam -parallel > log.simpleFoam 2>&1

Last edited by oswald; August 8, 2024 at 04:51. Reason: Corrected RAM frequency from 5600 MHz to 4800 MHz. Thanks to flotus1 for pointing it out! Added time for clean cache. |
|
|