Weak parallel efficiency of TR3990X-based workstation with Star-CCM+

kiteguy · April 12, 2021, 07:01

Hello all,

I recently ran some benchmark tests with Star-CCM+ on my department's workstation and noticed that the parallel efficiency scales painfully badly with the number of cores (see the attached specifications and benchmark results).

I have read in other posts that the Threadripper lineup is not ideal for CFD purposes due to the relatively small number of available memory lanes (4). However, does this explain why the parallel efficiency drops to as little as 65% with only 8 cores?
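
For reference, parallel efficiency is usually defined as the speedup divided by the number of cores, i.e. T1 / (n * Tn). A minimal Python sketch of that definition follows; the function name and the timings are made up for illustration and are not taken from the attached benchmark.

```python
def parallel_efficiency(t_serial: float, t_parallel: float, n_cores: int) -> float:
    """Return speedup divided by core count: T1 / (n * Tn)."""
    if n_cores < 1 or t_serial <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive and n_cores >= 1")
    return t_serial / (n_cores * t_parallel)


# Hypothetical timings in seconds per iteration (NOT from this thread).
t1 = 10.0
t8 = t1 / (8 * 0.65)  # the 8-core time that corresponds to 65% efficiency

print(f"speedup on 8 cores: {t1 / t8:.2f}x")
print(f"parallel efficiency: {parallel_efficiency(t1, t8, 8):.0%}")
```

In other words, 65% efficiency on 8 cores means the run is only about 5.2 times faster than on a single core.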

I would really appreciate any suggestions on how to find and possibly fix the bottleneck in the setup.

Best regards!

Attachments:
- Screenshot of representative benchmark test with Star-CCM+
- Workstation specifications

CPU: AMD Ryzen Threadripper 3990X 64-Core Processor
mem: Corsair Vengeance LPX, DDR4-3200, CL16 - 64 GB Dual Kit (128GB total)
graphics: Gigabyte GeForce GTX 1660 Ti OC 6G, 6144MB GDDR6
SSD: Gigabyte Aorus NVMe SSD, PCIe 4.0 M.2 Type 2280

flotus1 · April 12, 2021, 11:57

It is not entirely unexpected that you got less-than-linear scaling with 8 threads.
Whether 65% parallel efficiency is too low or not, I would not want to judge.

Maybe we can narrow things down by answering a few questions:
Are you running a double precision solver version?
What about the absolute run time of your benchmark? Can you compare it to some other machine?
Memory is sitting in the right slots according to the motherboard manual? And it's actually running at DDR4-3200?
Have you tried checking which physical cores these 8 threads are pinned to? Tools like htop and lstopo come in handy. Ideally, it should be one core from each of the 8 compute dies.
Have you run any other popular benchmarks? This would allow you to stress-test your system (also keep an eye on temperatures and frequency), and compare to known good results.
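
As a rough sketch of the pinning check: on Linux, the set of CPUs a process is allowed to run on can normally be read from the `Cpus_allowed_list` field in `/proc/<pid>/status` without admin rights. The helper names below are hypothetical, and the script only reports the allowed CPUs. It does not tell you which compute die each CPU belongs to, which is where lstopo helps.

```python
import os
from typing import Set


def parse_cpu_list(text: str) -> Set[int]:
    """Expand a Linux CPU list such as '0-3,8,10-11' into a set of ints."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(pid: int) -> Set[int]:
    """Read the Cpus_allowed_list field from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    raise RuntimeError("Cpus_allowed_list not found")


if __name__ == "__main__":
    # Demo on this Python process; substitute the PID of a solver process.
    try:
        print(sorted(allowed_cpus(os.getpid())))
    except (OSError, RuntimeError) as exc:
        print(f"could not read affinity: {exc}")
```

If every solver process reports the full range of CPUs, nothing is pinned and the scheduler is free to pack them onto the same die.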

kiteguy · April 12, 2021, 13:00

thanks for the quick reply, Alex! I appreciate the help.

(1) I am running the double precision version, which I now understand may entail compromises on the performance. To be honest, I never considered this and will have the mixed version installed!
(2) I compared the performance to another machine with similar specs (other CPU but same amount of RAM, see the attached screenshot) which requires about twice the time on a single core but scales linearly with the number of cores and thus becomes as good as our problem child or better with 8 cores and more.
(3) We opened up the workstation and checked: 4*32 GB of memory sticks are installed (1 per memory lane)
(4) Hyperthreading is deactivated, so the tasks must be running on physical cores. Unfortunately I don't have the necessary admin rights to check whether all compute dies are used, but I will look into this with our IT department.
(5) We haven't looked into other benchmark tests yet, but our IT department is planning to run a Cinebench on it.

Upon reading through a spotlight presentation by the developers of Star-CCM+ outlining hardware requirements, I noticed they recommend 2 memory sticks per lane. Could this explain our issue?

gnwt4a · April 12, 2021, 13:46

using the gnu compiler on the v3 system may be disadvantageous to intel chips. the intel compilers and mkl libs are now free for use in private and academia.

kiteguy · April 12, 2021, 13:55

Quote:

Originally Posted by gnwt4a

using the gnu compiler on the v3 system may be disadvantageous to intel chips. the intel compilers and mkl libs are now free for use in private and academia.

thanks for the info! But if this is an Intel issue it probably does not apply to the AMD TR3990X which is installed in our machine, does it?

gnwt4a · April 12, 2021, 14:26

right. u may be underestimating the v3 performance - that is all. fwiw, a couple of years ago i compared the performance of the TR gen 1 (16 core) against a 3930k using a fortran dns code, and the tr was about 10% faster than the 6-core intel chip. i expect things to be better with amd now, but i will not be buying an amd chip which is a rejig of zen-1. wait until zen-4.

flotus1 · April 12, 2021, 15:54

Quote:

(1) I am running the double precision version, which I now understand may entail compromises on the performance. To be honest, I never considered this and will have the mixed version installed!

In a situation like this with a pretty severe memory bandwidth bottleneck, the single precision solver will be significantly faster. Only use DP if you really need it.
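
To illustrate the reasoning, here is a small sketch of how much floating-point data has to be streamed from memory per sweep over the mesh in each precision. It assumes that field data dominates the memory traffic; the case size and the function name are hypothetical and not taken from this thread.

```python
from array import array


def field_bytes(n_cells: int, n_fields: int, typecode: str) -> int:
    """Bytes needed to store n_fields floating-point values per cell."""
    return n_cells * n_fields * array(typecode).itemsize


# Hypothetical case size; neither number comes from this thread.
n_cells, n_fields = 5_000_000, 10

dp = field_bytes(n_cells, n_fields, "d")  # 8-byte doubles
sp = field_bytes(n_cells, n_fields, "f")  # 4-byte floats
print(f"double: {dp / 1e6:.0f} MB per sweep, single: {sp / 1e6:.0f} MB per sweep")
```

The real gain is smaller than a factor of two, since integer data such as mesh connectivity does not shrink with the floating-point precision.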

Quote:

(2) I compared the performance to another machine with similar specs (other CPU but same amount of RAM, see the attached screenshot) which requires about twice the time on a single core but scales linearly with the number of cores and thus becomes as good as our problem child or better with 8 cores and more.

It's a bit odd that the v3 Xeon catches up at 8 cores. Which Linux version are you running, and which kernel version?

Quote:

(3) We opened up the workstation and checked: 4*32 GB of memory sticks are installed (1 per memory lane)

Crack open the manual for your motherboard. It will contain a recommendation for which exact DIMM slots need to be populated when using 4 DIMMs.
Also: check which transfer rate the memory is actually running at. Just because you bought memory rated for up to DDR4-3200 doesn't mean that it is running at that speed.

Quote:

Upon reading through a spotlight presentation by the developers of Star-CCM+ outlining hardware requirements, I noticed they recommend 2 memory sticks per lane. Could this explain our issue?

No, one DIMM per channel is enough to get very close to peak performance, as long as they are in the correct slots and running at the advertised transfer rate. Again: check both.
What they are probably referring to is the number of ranks per channel. There can be a performance difference on the order of 10% with one rank per channel vs. two. But since the DIMMs you bought are already dual-rank, you automatically have two ranks per channel. Again, provided they are in the right slots.

kiteguy · April 13, 2021, 09:04

Hey Alex, thanks so much! Very helpful once again.

(1) the local machine runs 'Ubuntu 20.04.2 LTS' and kernel version '5.8.0-48-generic'. The machine used for comparison hosts 'Scientific Linux release 7.7 (Nitrogen)' and kernel version '3.10.0-1160.15.2.el7.x86_64'

(2) The four DIMMs were indeed installed in the correct slots on the motherboard, we will check the actual transfer rate as soon as possible.

I read that the developers of Star-CCM+ advise setting the NUMA nodes per socket (NPS) to 4 for AMD Epyc processors. Do you think this should also be the case for the TR3990X? (We use Power-On-Demand licenses, so the number of nodes should not be an issue.)

I will post some performance updates once the IT department has the mixed precision version installed!

flotus1 · April 13, 2021, 09:39

I was hoping that part of the problem might be an old Linux kernel. But that doesn't seem to be the case.

For Epyc Rome CPUs, NPS=4 is indeed the best setting. But the performance difference to NPS=1 is not huge, in the order of 10%.
Not sure if a Zen2 Threadripper CPU/mobo has the same option available. It might only go up to NPS=2.
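
To confirm that an NPS change in the BIOS actually reaches the operating system, the NUMA nodes visible to Linux can be listed from sysfs (`lscpu` and `numactl --hardware` report the same information). A minimal sketch, assuming a Linux system; the function name is made up for illustration.

```python
import os
import re


def numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict:
    """Map NUMA node id -> CPU list string as exposed by Linux sysfs."""
    nodes = {}
    for entry in os.listdir(sysfs_root):
        match = re.fullmatch(r"node(\d+)", entry)
        if match:
            with open(os.path.join(sysfs_root, entry, "cpulist")) as fh:
                nodes[int(match.group(1))] = fh.read().strip()
    return dict(sorted(nodes.items()))


if __name__ == "__main__":
    try:
        for node, cpus in numa_nodes().items():
            print(f"node{node}: CPUs {cpus}")
    except OSError as exc:
        print(f"NUMA info not available: {exc}")
```

With NPS=1 a single node should show up, with NPS=2 two nodes, and so on, each with its own share of the cores.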

kiteguy · April 14, 2021, 11:28

quick update:

- switching to the most recent mixed-solver version decreased the single-core runtime by approx. 29%
- we changed the NPS setting from Auto to 2, which decreased the efficiency a little at small core counts but seems to be quite beneficial for 16+ cores (up to 9% increase). NPS=4 was also possible but led to worse performance at core counts between 2 and 32.

So all in all, the scaling is still not great but the run-times already look much better than a few days ago. We're hoping to get a further performance boost by installing the additional 4 memory sticks in order to operate two per memory lane.

Simbelmynë · April 14, 2021, 12:57

Why is it not possible to operate the memory at 3200 MT/s? This seems a bit odd to me.

If you have control of the BIOS you could also try to tweak the memory settings. Zen2 and Zen3 can see huge increases with the proper memory timings. I have not seen reports from the Threadripper series yet on this forum and the only Threadripper I have access to is first generation which has a rather crappy memory controller, so I cannot test it myself.

The Ryzen DRAM calculator has options for Threadripper so you could try that out. Seeing that you have DDR4-3200 CL16 memory, perhaps you should not expect any greater success, but I think it is worth a try.

EDIT: Looking at the memory support page of your MB vendor, it seems that a large number of RAM kits have been validated at 3600 MT/s. For instance, this kit "F4-3600C16Q-64GTZR" is dual rank @ CL16.

The RAM support document even specifies the memory type (Samsung B-die, Hynix etc.). Running the infinity fabric 1:1 and the memory @ 3600 MT/s is likely the sweet spot for the 3990X also (for Ryzen it is).

https://download.gigabyte.com/FileLi...eme_200304.pdf

cwl · April 25, 2021, 19:18

Kiteguy, you haven't actually described your simulation case, so I'll just draw your attention to the fact that parallel efficiency also depends strongly on the number of Boundaries within the Region and the number of Interfaces (if any).

It is mentioned in the official Siemens Best Practices video (https://youtu.be/U9WUPEdX-6A) at 45:00.
I guess your number of Boundaries is far below the hundreds mentioned in the video, yet I remember even a dozen being reported as a slowing factor in the Star-CCM+ section of the forum.

Maybe that could be a reason?

wkernkamp · April 27, 2021, 16:53

Quote:

Originally Posted by kiteguy

thanks for the quick reply, Alex! I appreciate the help.

(2) I compared the performance to another machine with similar specs (other CPU but same amount of RAM, see the attached screenshot) which requires about twice the time on a single core but scales linearly with the number of cores and thus becomes as good as our problem child or better with 8 cores and more.

Your comparison system is a dual-CPU configuration with a total of 8 memory channels versus 4 channels for the Threadripper. This explains why the dual Xeons still scale linearly at 10 cores, while the Threadripper is already falling off.
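
The channel argument can be put into rough numbers. Theoretical peak bandwidth is channels x transfer rate x 8 bytes, since each DDR4 channel is 64 bits wide. The sketch below uses the DDR4-3200 rating from the first post for the Threadripper; the thread does not state the Xeon's memory speed, so DDR4-2133 is only an assumed placeholder, and the function name is made up.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: float, bus_bytes: int = 8) -> float:
    """Theoretical peak in GB/s: channels * MT/s * bytes per transfer / 1000."""
    return channels * mega_transfers * bus_bytes / 1000.0


tr = peak_bandwidth_gbs(channels=4, mega_transfers=3200)
# ASSUMPTION: DDR4-2133 for the dual Xeon; the thread does not give this value.
xeon = peak_bandwidth_gbs(channels=8, mega_transfers=2133)

print(f"TR 3990X, 4 channels: {tr:.1f} GB/s peak")
print(f"  per core at  8 cores: {tr / 8:.1f} GB/s")
print(f"  per core at 64 cores: {tr / 64:.1f} GB/s")
print(f"Dual Xeon, 8 channels (assumed DDR4-2133): {xeon:.1f} GB/s peak")
```

These are upper bounds only. Sustained bandwidth is lower in practice, but the per-core share shows how quickly a 4-channel system runs out of headroom as the core count grows.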
