Weak parallel efficiency of TR3990X-based workstation with Star-CCM+ |
April 12, 2021, 07:01 |
|
#1 |
New Member
Georg
Join Date: Dec 2020
Posts: 10
Rep Power: 6 |
Hello all,
I recently ran some benchmark tests with Star-CCM+ on my department's workstation and noticed that the parallel efficiency scales painfully badly with the number of cores (see the attached specifications and benchmark results). I have read in other posts that the Threadripper lineup is not ideal for CFD purposes due to the relatively small number of memory channels (4). But does this explain why the parallel efficiency drops to as little as 65% with only 8 cores? I would really appreciate any suggestions on how to find and possibly fix the bottleneck in the setup.
Best regards!
Attachments:
- Screenshot of representative benchmark test with Star-CCM+
- Workstation specifications
CPU: AMD Ryzen Threadripper 3990X 64-Core Processor
mem: Corsair Vengeance LPX, DDR4-3200, CL16 - 64 GB Dual Kit (128GB total)
graphics: Gigabyte GeForce GTX 1660 Ti OC 6G, 6144MB GDDR6
SSD: Gigabyte Aorus NVMe SSD, PCIe 4.0 M.2 Type 2280 |
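For readers unfamiliar with the metric, parallel efficiency is usually taken as the speedup relative to the single-core run divided by the number of cores. The following is a minimal Python sketch of that definition. The function names and the timings are hypothetical and are not taken from the attached benchmark; the timings are chosen only so that the result matches the 65% figure quoted above.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup of a parallel run relative to the single-core run."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial: float, t_parallel: float, n_cores: int) -> float:
    """Speedup divided by core count; 1.0 corresponds to ideal linear scaling."""
    return speedup(t_serial, t_parallel) / n_cores


if __name__ == "__main__":
    # Hypothetical timings (seconds per iteration), NOT from the thread:
    # a speedup of 5.2 on 8 cores corresponds to 65% efficiency.
    t1 = 100.0
    t8 = t1 / 5.2
    print(f"speedup    = {speedup(t1, t8):.2f}")
    print(f"efficiency = {parallel_efficiency(t1, t8, 8):.0%}")
```

Put differently, 65% efficiency on 8 cores means the run is only about 5.2 times faster than on a single core.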
|
April 12, 2021, 11:57 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
It is not entirely unexpected that you got less-than-linear scaling with 8 threads.
Whether 65% parallel efficiency is too low or not, I would not want to judge. Maybe we can narrow things down by answering a few questions:
- Are you running a double precision solver version?
- What about the absolute run time of your benchmark? Can you compare it to some other machine?
- Is the memory sitting in the right slots according to the motherboard manual? And is it actually running at DDR4-3200?
- Have you tried checking which physical cores these 8 threads are pinned to? Tools like htop and lstopo come in handy. Ideally, it should be one core from each of the 8 compute dies.
- Have you run any other popular benchmarks? This would allow you to stress-test your system (also keep an eye on temperatures and frequency), and compare to known good results. |
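As a rough complement to htop and lstopo, the set of logical CPUs a process is allowed to run on can be read on Linux without admin rights, at least for your own processes. This is only a sketch: the helper name `allowed_cpus` is made up here, `os.sched_getaffinity` is Linux-specific, and the result is the affinity mask, not the core a thread is running on at this moment. If the solver does not pin its processes, the mask will simply list every CPU. Mapping logical CPU IDs to compute dies still requires something like lstopo.

```python
import os
from typing import List


def allowed_cpus(pid: int = 0) -> List[int]:
    """Return the sorted logical CPU IDs that process `pid` may run on.

    pid=0 means the calling process. Linux only.
    """
    return sorted(os.sched_getaffinity(pid))


if __name__ == "__main__":
    # Replace 0 with the PID of a running solver process (hypothetical usage)
    # to see which logical CPUs it is restricted to.
    print(allowed_cpus(0))
```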
|
April 12, 2021, 13:00 |
|
#3 |
New Member
Georg
Join Date: Dec 2020
Posts: 10
Rep Power: 6 |
thanks for the quick reply, Alex! I appreciate the help.
(1) I am running the double precision version, which I now understand may cost some performance. To be honest, I never considered this and will have the mixed precision version installed!
(2) I compared the performance to another machine with similar specs (different CPU but same amount of RAM, see the attached screenshot). It requires about twice the time on a single core but scales linearly with the number of cores, and thus becomes as good as our problem child or better with 8 cores and more.
(3) We opened up the workstation and checked: 4 x 32 GB memory sticks are installed (1 per memory channel).
(4) Hyperthreading is deactivated, so the tasks must be running on physical cores. Unfortunately I don't have the necessary admin rights to check whether all compute dies are used, but I will look into this with our IT department.
(5) We haven't looked into other benchmark tests yet, but our IT department is planning to run Cinebench on it.
Upon reading through a spotlight presentation by the developers of Star-CCM+ outlining hardware requirements, I noticed they recommend 2 memory sticks per channel. Could this explain our issue? |
|
April 12, 2021, 13:46 |
|
#4 |
Member
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7 |
using the gnu compiler on the v3 system may be disadvantageous to intel chips. the intel compilers and mkl libs are now free for use in private and academia.
|
|
April 12, 2021, 13:55 |
|
#5 |
New Member
Georg
Join Date: Dec 2020
Posts: 10
Rep Power: 6 |
thanks for the info! But if this is an Intel issue it probably does not apply to the AMD TR3990X which is installed in our machine, does it?
|
|
April 12, 2021, 14:26 |
|
#6 |
Member
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7 |
right. u may be underestimating the v3 performance - that is all. fwiw, a couple of years ago i compared the performance of the TR gen 1 (16 core) against a 3930k using a fortran dns code, and the tr was about 10% faster than the 6-core intel chip. i expect things to be better with amd now, but i will not be buying an amd chip which is a rejig of zen-1. wait until zen-4.
|
|
April 12, 2021, 15:54 |
|
#7 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Also: check which transfer rate the memory is actually running at. Just because you bought memory rated for up to DDR4-3200 doesn't mean that it is running at that speed.
What they are probably referring to is the number of ranks per channel. There can be a performance difference on the order of 10% with one rank per channel vs. two. But since the DIMMs you bought are already dual-rank, you automatically have two ranks per channel. Again, provided they are in the right slots. |
|
April 13, 2021, 09:04 |
|
#8 |
New Member
Georg
Join Date: Dec 2020
Posts: 10
Rep Power: 6 |
Hey Alex, thanks so much! Very helpful once again.
(1) The local machine runs 'Ubuntu 20.04.2 LTS' with kernel version '5.8.0-48-generic'. The machine used for comparison runs 'Scientific Linux release 7.7 (Nitrogen)' with kernel version '3.10.0-1160.15.2.el7.x86_64'.
(2) The four DIMMs were indeed installed in the correct slots on the motherboard. We will check the actual transfer rate as soon as possible.
I read that the developers of Star-CCM+ advise setting the NUMA nodes per socket (NPS) to 4 for AMD Epyc processors. Do you think this should also be the case for the TR3990X? (We use Power-On-Demand licenses, so the number of nodes should not be an issue.)
I will post some performance updates once the IT department has the mixed precision version installed! |
|
April 13, 2021, 09:39 |
|
#9 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I was hoping that part of the problem might be an old Linux kernel. But that doesn't seem to be the case.
For Epyc Rome CPUs, NPS=4 is indeed the best setting. But the performance difference compared to NPS=1 is not huge, on the order of 10%. Not sure if a Zen2 Threadripper CPU/mobo has the same option available. It might only go up to NPS=2. |
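After changing the NPS setting in the BIOS, it can be worth confirming how many NUMA nodes the operating system actually sees. On Linux this is exposed under /sys/devices/system/node. The sketch below reads that directory; the function name `numa_nodes` is made up for illustration, and the `sysfs_root` parameter exists only so the function can be pointed at a different directory.

```python
import glob
import os
import re
from typing import Dict


def numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> Dict[int, str]:
    """Map each NUMA node ID to its CPU list string (e.g. '0-15').

    Returns an empty dict if the directory is missing (non-Linux system
    or a kernel without NUMA support).
    """
    nodes: Dict[int, str] = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if match is None:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as handle:
                nodes[int(match.group(1))] = handle.read().strip()
        except OSError:
            continue  # unreadable entry, skip it
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node {node_id}: CPUs {cpus}")
```

With NPS=1 one would expect a single node on a single-socket machine, and correspondingly more nodes with NPS=2 or NPS=4, if the board offers those options.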
|
April 14, 2021, 11:28 |
|
#10 |
New Member
Georg
Join Date: Dec 2020
Posts: 10
Rep Power: 6 |
quick update:
- Switching to the most recent mixed precision solver version decreased the single-core runtime by approx. 29%.
- We changed the NPS setting from Auto to 2, which decreased the efficiency a little at small core counts but seems to be quite beneficial for 16+ cores (up to 9% increase). NPS=4 was also possible but led to worse performance at core counts between 2 and 32.
So all in all, the scaling is still not great but the run times already look much better than a few days ago. We're hoping to get a further performance boost by installing the additional 4 memory sticks in order to operate two per memory channel.
Last edited by kiteguy; April 27, 2021 at 19:17. |
|
April 14, 2021, 12:57 |
|
#11 |
Senior Member
Join Date: May 2012
Posts: 551
Rep Power: 16 |
Why is it not possible to operate the memory at 3200 MT/s? This seems a bit odd to me.
If you have control of the BIOS you could also try to tweak the memory settings. Zen2 and Zen3 can see huge increases with the proper memory timings. I have not seen reports from the Threadripper series yet on this forum, and the only Threadripper I have access to is first generation, which has a rather crappy memory controller, so I cannot test it myself.
The Ryzen DRAM calculator has options for Threadripper, so you could try that out. Seeing that you have DDR4-3200 CL16 memory, perhaps you should not expect great success, but I think it is worth a try.
EDIT: Looking at the memory support page of your MB vendor, it seems that a large number of RAM kits have passed validation at 3600 MT/s. For instance, this kit "F4-3600C16Q-64GTZR" is dual rank @ CL16. The RAM support document even specifies the memory type (Samsung B-die, Hynix etc.). Running the Infinity Fabric 1:1 with the memory @ 3600 MT/s is likely the sweet spot for the 3990X as well (for Ryzen it is).
https://download.gigabyte.com/FileLi...eme_200304.pdf |
|
April 25, 2021, 19:18 |
|
#12 |
Senior Member
Chaotic Water
Join Date: Jul 2012
Location: Elgrin Fau
Posts: 438
Rep Power: 18 |
Kiteguy, you haven't actually described your simulation case, so I'll just point out that parallel efficiency also depends strongly on the number of Boundaries within the Region and the number of Interfaces (if any).
This is mentioned in the official Siemens Best Practices video (https://youtu.be/U9WUPEdX-6A) at 45:00. I guess your number of Boundaries is much lower than the hundreds mentioned in the video, yet I remember even a dozen being reported as a slowing factor in the Star-CCM+ section of the forum. Maybe that could be a reason? |
|
April 27, 2021, 16:53 |
8 channels versus 4 channels
|
#13 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 371
Rep Power: 14 |
Your comparison system is a dual-CPU configuration with a total of 8 memory channels, versus 4 channels for the Threadripper. This explains why the dual Xeons still scale linearly at 10 cores, while the Threadripper is already falling off. |
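The channel argument can be made concrete with a back-of-the-envelope estimate of theoretical peak memory bandwidth: channels x transfer rate x 8 bytes per transfer (64-bit DDR4 channel). The Python sketch below is an illustration under stated assumptions, not a measurement. It assumes the Threadripper memory really runs at DDR4-3200, which had not been confirmed in this thread. The memory speed of the comparison machine is not given, so the 8-channel line reuses the same rate purely to isolate the effect of the channel count. Sustained bandwidth in practice is lower than these peak numbers.

```python
def peak_bandwidth_gb_s(channels: int, transfer_rate_mt_s: float,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * transfer_rate_mt_s * 1e6 * bytes_per_transfer / 1e9


def bandwidth_per_core_gb_s(total_gb_s: float, active_cores: int) -> float:
    """Peak bandwidth share per core if all active cores stream from memory."""
    return total_gb_s / active_cores


if __name__ == "__main__":
    # Assumption: DDR4-3200 on 4 channels (rated speed, not verified).
    tr = peak_bandwidth_gb_s(channels=4, transfer_rate_mt_s=3200)
    # Hypothetical: same transfer rate on 8 channels, only to show the ratio.
    dual = peak_bandwidth_gb_s(channels=8, transfer_rate_mt_s=3200)
    print(f"4 channels: {tr:.1f} GB/s, 8 channels: {dual:.1f} GB/s")
    for cores in (1, 8, 16, 32, 64):
        share = bandwidth_per_core_gb_s(tr, cores)
        print(f"{cores:2d} cores on 4 channels: {share:.1f} GB/s per core")
```

At the same memory speed, the 8-channel machine has twice the peak bandwidth to share among its cores, which fits the observation that it keeps scaling linearly for longer.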
|
Tags |
parallel efficiency, performance benchmark, star ccm+, tr3990x |