CFD Online Logo CFD Online URL
www.cfd-online.com
[Sponsors]
Home > Forums > General Forums > Hardware

Weak parallel efficiency of TR3990X-based workstation with Star-CCM+

Register Blogs Community New Posts Updated Threads Search

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old   April 12, 2021, 07:01
Smile Weak parallel efficiency of TR3990X-based workstation with Star-CCM+
  #1
New Member
 
Georg
Join Date: Dec 2020
Posts: 10
Rep Power: 7
kiteguy is on a distinguished road
Hello all,

I recently ran some benchmark tests with Star-CCM+ on my departments workstation and noticed that the parallel efficiency scales painfully bad with the number of cores (see the attached specifications and benchmark results).

I have read in other posts that the Threadripper lineup is not ideal for CFD-purposes due to the relatively small number of available memory lanes (4). Does this however explain why the parallel efficiency drops to as little as 65% with 8 cores already?

I would really appreciate any suggestions on how to find and possibly fix the bottle-neck in the set-up.

Best regards!

Attachments:
- Screenshot of representative benchmark test with Star-CCM+
- Workstation specifications

CPU: AMD Ryzen Threadripper 3990X 64-Core Processor
mem: Corsair Vengeance LPX, DDR4-3200, CL16 - 64 GB Dual Kit (128GB total)
graphics: Gigabyte GeForce GTX 1660 Ti OC 6G, 6144MB GDDR6
SSD: Gigabyte Aorus NVMe SSD, PCIe 4.0 M.2 Type 2280
Attached Images
File Type: png Benchmark-RK.png (118.7 KB, 60 views)
File Type: png Specifications.png (66.5 KB, 33 views)
kiteguy is offline   Reply With Quote

Old   April 12, 2021, 11:57
Default
  #2
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura aboutflotus1 has a spectacular aura about
It is not entirely unexpected that you got less-than-linear scaling with 8 threads.
Whether 65% parallel efficiency is too low or not, I would not want to judge.

Maybe we can narrow things down by answering a few questions:
Are you running a double precision solver version?
What about the absolute run time of your benchmark? Can you compare it to some other machine?
Memory is sitting in the right slots according to the motherboard manual? And it's actually running at DDR4-3200?
Have you tried checking which physical cores these 8 threads are pinned to? tools like htop and lstopo come in handy. Ideally, it should be one core from each of the 8 compute dies.
Have you run any other popular benchmarks? This would allow you to stress-test your system (also keep an eye on temperatures and frequency), and compare to known good results.
flotus1 is offline   Reply With Quote

Old   April 12, 2021, 13:00
Default
  #3
New Member
 
Georg
Join Date: Dec 2020
Posts: 10
Rep Power: 7
kiteguy is on a distinguished road
thanks for the quick reply, Alex! I appreciate the help.

(1) I am running the double precision version, which I now understand may entail compromises on the performance. To be honest, I never considered this and will have the mixed version installed!
(2) I compared the performance to another machine with similar specs (other CPU but same amount of RAM, see the attached screenshot) which requires about twice the time on a single core but scales linearly with the number of cores and thus becomes as good as our problem child or better with 8 cores and more.
(3) We opened up the workstation and checked: 4*32 GB of memory sticks are installed (1 per memory lane)
(4) Hyperthreading is deactivated, so the tasks must be running on physical cores. Unfortunately I don't have the necessary admin rights to check whether all compute dies are used, but I will look into this with our IT department.
(5) We haven't looked into other benchmark tests yet, but our IT department is planning to run a Cinebench on it.

Upon reading through a spotlight presentation by the developers of Star-CCM+ outlining hardware requirements, I noticed they recommend 2 memory sticks per lane. Could this explain our issue?
Attached Images
File Type: png Benchmark-gb.png (99.5 KB, 27 views)
kiteguy is offline   Reply With Quote

Old   April 12, 2021, 13:46
Default
  #4
Member
 
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7
gnwt4a is on a distinguished road
using the gnu compiler on the v3 system may be disadvantageous to intel chips. the intel compilers and mkl libs are now free for use in private and academia.
gnwt4a is offline   Reply With Quote

Old   April 12, 2021, 13:55
Default
  #5
New Member
 
Georg
Join Date: Dec 2020
Posts: 10
Rep Power: 7
kiteguy is on a distinguished road
Quote:
Originally Posted by gnwt4a View Post
using the gnu compiler on the v3 system may be disadvantageous to intel chips. the intel compilers and mkl libs are now free for use in private and academia.
thanks for the info! But if this is an Intel issue it probably does not apply to the AMD TR3990X which is installed in our machine, does it?
kiteguy is offline   Reply With Quote

Old   April 12, 2021, 14:26
Default
  #6
Member
 
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7
gnwt4a is on a distinguished road
right. u may be underestimating the v3 performance - that is all. fwiw, a couple of years ago i compared the performance of the TR gen 1 (16 core) against a 3930k using a fortran dns code, and the tr was about 10% faster than the 6-core intel chip. i expect things to be better with amd now, but i will not be buying an amd chip which is a rejig of zen-1. wait until zen-4.
gnwt4a is offline   Reply With Quote

Old   April 12, 2021, 15:54
Default
  #7
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura aboutflotus1 has a spectacular aura about
Quote:
(1) I am running the double precision version, which I now understand may entail compromises on the performance. To be honest, I never considered this and will have the mixed version installed!
In a situation like this with a pretty severe memory bandwidth bottleneck, the single precision solver will be significantly faster. Only use DP if you really need it.

Quote:
(2) I compared the performance to another machine with similar specs (other CPU but same amount of RAM, see the attached screenshot) which requires about twice the time on a single core but scales linearly with the number of cores and thus becomes as good as our problem child or better with 8 cores and more.
It's a bit odd that the v3 Xeon catches up at 8 cores. Which Linux version are you running, and which kernel version?

Quote:
(3) We opened up the workstation and checked: 4*32 GB of memory sticks are installed (1 per memory lane)
Crack open the manual for your motherboard. It will contain a recommendation which exact DIMM slots need to be populated with 4 DIMMs.
Also: check which transfer rate the memory is actually running at. Just because you bought memory rated for up to DDR4-3200 doesn't mean that it is running at that speed.

Quote:
Upon reading through a spotlight presentation by the developers of Star-CCM+ outlining hardware requirements, I noticed they recommend 2 memory sticks per lane. Could this explain our issue?
No, one DIMM per channel is enough to get very close to peak performance. As long as they are in the correct slots and running at the advertised transfer rate. Again: check both.
What thy are probably referring to is the number of ranks per channel. There can be a performance difference in the order of 10% with one rank per channel vs. 2. But since the DIMMs you bought are already dual-rank, you automatically have two ranks per channel. Again, provided they are in the right slots.
flotus1 is offline   Reply With Quote

Old   April 13, 2021, 09:04
Default
  #8
New Member
 
Georg
Join Date: Dec 2020
Posts: 10
Rep Power: 7
kiteguy is on a distinguished road
Hey Alex, thanks so much! Very helpful once again.

(1) the local machine runs 'Ubuntu 20.04.2 LTS' and kernel version '5.8.0-48-generic'. The machine used for comparison hosts 'Scientific Linux release 7.7 (Nitrogen)' and kernel version '3.10.0-1160.15.2.el7.x86_64'

(2) The four DIMMs were indeed installed in the correct slots on the motherboard, we will check the actual transfer rate as soon as possible.

I read that the developers of Star-CCM+ advise to set the NUMA nodes per socket (NPS) to 4 for AMD Epyc processors. Do you think this should also be the case for the TR3990x? (We use Power-On-Demand licenses, so the number of nodes should not be an issue)

I will post some performance updates once the IT department has the mixed precision version installed!
kiteguy is offline   Reply With Quote

Old   April 13, 2021, 09:39
Default
  #9
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura aboutflotus1 has a spectacular aura about
I was hoping that part of the problem might be an old Linux kernel. But that doesn't seem to be the case.

For Epyc Rome CPUs, NPS=4 is indeed the best setting. But the performance difference to NPS=1 is not huge, in the order of 10%.
Not sure if a Zen2 Threadripper CPU/mobo has the same option available. It might only go up to NPS=2.
flotus1 is offline   Reply With Quote

Old   April 14, 2021, 11:28
Default
  #10
New Member
 
Georg
Join Date: Dec 2020
Posts: 10
Rep Power: 7
kiteguy is on a distinguished road
quick update:

- switching to the most recent mixed-solver version decreased the single-core runtime by approx. 29%
- we changed the NPS setting from Auto to 2, which decreased the efficiency a little at small core counts but seems to be quite beneficial for 16+ cores (up to 9% increase). NPS=4 was also possible but led to worse performance at core counts between 2 and 32.

So all in all, the scaling is still not great but the run-times already look much better than a few days ago. We're hoping to get a further performance boost by installing the additional 4 memory sticks in order to operate two per memory lane.

Last edited by kiteguy; April 27, 2021 at 19:17.
kiteguy is offline   Reply With Quote

Old   April 14, 2021, 12:57
Default
  #11
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 552
Rep Power: 16
Simbelmynė is on a distinguished road
Why is it not possible to operate the memory at 3200 MT/s? This seems a bit odd to me.


If you have control of the BIOS you could also try to tweak the memory settings. Zen2 and Zen3 can see huge increases with the proper memory timings. I have not seen reports from the Threadripper series yet on this forum and the only Threadripper I have access to is first generation which has a rather crappy memory controller, so I cannot test it myself.


The Ryzen DRAM calculator has options for Threadripper so you could try that out. Seeing that you have a 3200 MHz CL16 memory, perhaps you should not expect any greater success, but I think it is worth a try.






EDIT: Looking at the memory support page of your MB vendor it seems that it has a large amount of RAM that has passed 3600 MT/s. For instance, this kit "F4-3600C16Q-64GTZR" is dual rank @ CL16.


The RAM support document even specifies the memory type (Samsung B-die, Hynix etc.). Running the infinity fabric 1:1 and the memory @ 3600 MT/s is likely the sweet spot for the 3990X also (for Ryzen it is).


https://download.gigabyte.com/FileLi...eme_200304.pdf
Simbelmynė is offline   Reply With Quote

Old   April 25, 2021, 19:18
Default
  #12
cwl
Senior Member
 
Chaotic Water
Join Date: Jul 2012
Location: Elgrin Fau
Posts: 438
Rep Power: 18
cwl is on a distinguished road
Kiteguy, you haven't actually described your simulation case, so I'll randomly pay your attention to the fact that parallel efficiency depends much on amount of Boundaries within the Region and amount of Interfaces (if any) also.

It is mentioned in the official Siemens Best Practices video (https://youtu.be/U9WUPEdX-6A) at 45:00.
I guess you have amount of Boundaries much less than hundreds mentioned in the video, yet I remember even a dozen being a slowing factor reported in Star-CCM+ section of the forum.

Maybe that could be a reason?
cwl is offline   Reply With Quote

Old   April 27, 2021, 16:53
Default 8 channels versus 4 channels
  #13
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14
wkernkamp is on a distinguished road
Quote:
Originally Posted by kiteguy View Post
thanks for the quick reply, Alex! I appreciate the help.

(2) I compared the performance to another machine with similar specs (other CPU but same amount of RAM, see the attached screenshot) which requires about twice the time on a single core but scales linearly with the number of cores and thus becomes as good as our problem child or better with 8 cores and more.

Your comparison system is a dual cpu config with a total of 8 memory channels versus 4 channels for the threadripper. This explains why the dual xeons are still linear with cores at 10 cores, while the threadripper is already falling off.
wkernkamp is offline   Reply With Quote

Reply

Tags
parallel efficiency, performance benchmark, star ccm+, tr3990x


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Abysmal performance of 64 cores opteron based workstation for CFD Fauster Hardware 8 June 4, 2018 11:51
Problem with Application based on a faceZone in parallel psilkeit OpenFOAM Programming & Development 2 April 28, 2016 10:47
OpenFOAM with Inifiband & parallel efficiency LijieNPIC OpenFOAM 15 June 23, 2011 06:10
star cd 4.06 parallel problem whitemelon Siemens 3 October 23, 2008 15:42
Parallel STAR version Michael Schmid Siemens 1 November 7, 2001 10:41


All times are GMT -4. The time now is 16:28.