Poor performance with Star-CCM+ on a Dell R7425

January 25, 2022, 18:08   #1
josman84 (Joseph Osman) - New Member
Originally posted here: Memory for AMD Epyc CPUs

I am dealing with a related issue, so I hope I am ok with jumping on this thread rather than starting a new one...

I have just purchased a Dell R7425 with:
2x Epyc 7601 32 core (64 cores total)
16 x Hynix 16GB 2Rx8 PC4-2666V-R ECC (256 GB total)

I am running STAR-CCM+ and getting really poor scaling... around 30% parallel efficiency. The maximum speedup I get is something like 20x on 64 cores (i.e. efficiency = speedup/cores = 20/64 ≈ 31%) - hopefully the attached image does a decent job of illustrating it.

I have tried everything I can think of; the only suggestion left is to change the memory.
There is a single line in a hardware recommendations document from Siemens which states:
"Use 2 memory sticks per memory channel"
As you will see from the spec above, I have only 1 stick per channel
(ranks etc are not covered in the doc)

So my question: is it really feasible that this is what is hurting me? It seems like such a specific detail, and if it really were such a performance killer I would expect to see it referenced all over the place!

STAR-CCM+ support says I should try changing the memory... which I am more than happy to do if I know it will work, but I would really like a little more reassurance before I go further down the wrong track!

For reference, I tried the same benchmark case on a cloud server using Epyc 7551 CPUs and achieved significantly better performance, so I am confident there is some extra performance to unlock...

Any pointers would be greatly appreciated!
Attached Images
File Type: png Epyc_Scaling.png (50.6 KB, 46 views)


January 26, 2022, 03:31   #2
flotus1 (Alex) - Super Moderator
In light of the rather intense discussion over there, I should point out once more what kind of performance difference we are talking about here: it is on the order of 10-15% for 1 vs. 2 ranks per channel.
It is rather unlikely this could give you the performance increase you want.
Also, according to your description, you already have 2 ranks per channel. Adding another 2 would most likely force a 1st gen Epyc system to drop memory frequency, while gaining very little from the increased rank interleave. The returns diminish sharply: going from 1 to 2 ranks per channel, the performance increase is decent; going from 2 to 4, the benefit is MUCH smaller.
According to Dell's own documentation, they drop memory frequency to DDR4-2400 with 2 ranks per channel, and further down to DDR4-2133 with 4 ranks per channel: https://www.dell.com/support/manuals...2-882a1e9361b6

This is difficult to diagnose from the safety of my armchair, all I can do is give you some pointers:
1) General lack of memory bandwidth for 64 cores. Less-than-ideal scaling is to be expected in this scenario. I get slightly above 50% with the OpenFOAM benchmark on my dual 7551 workstation. That's why I usually recommend staying below 4 cores per memory channel with per-core licensed CFD software.
2) Scaling also depends on the case you are running. Double precision scales worse than single precision. Lots of interfaces can have a negative impact on parallel efficiency. Other bottlenecks like file I/O can limit scaling...
3) Bios settings. It may be possible to manually set the memory to DDR4-2666 in such a scenario. DISCLAIMER: if this fails, you will likely have to reset the bios entirely in order to make the machine post again. Memory interleaving should be set to "channel" or left at auto. cTDP for the CPUs could be configured lower than ideal, worth checking. And the general power profile should be set to performance.
4) Cooling issue/thermal throttling. Worth checking which frequency the CPUs are running at under the all-core load.
5) Memory population gone wrong. A Dell R7425 should have a total of 32 DIMM slots. With only 16 DIMMs populated, they need to be in the correct slots for optimal performance. One DIMM per channel, refer to the manual linked above. The DIMMs should probably all be in the white slots.
6) Other undiagnosed hardware problems. Maybe a DIMM has failed, maybe a DIMM is not seated properly, maybe CPU installation went wrong so a few memory channels are missing. The latter is a relatively common problem with Epyc CPUs. Check whether all 16 DIMMs/256GB show up in Bios and in the OS.
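For points 5 and 6, a quick sanity check from within the OS can save some guesswork. A minimal sketch, assuming a standard Linux install with dmidecode available (the exact labels in the dmidecode output vary a little between versions):

Code:
# List populated DIMM slots with their size and configured speed
sudo dmidecode -t memory | grep -iE "locator|size|speed" | grep -v "No Module"
# Total memory visible to the OS; should be close to 256 GB
free -g
# Both sockets and all 64 physical cores should show up
lscpu | grep -E "^CPU\(s\)|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core"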

BTW, I moved this to its own thread, because the original thread had already been pulled from another hijacked thread. Threadception...

Last edited by flotus1; January 26, 2022 at 05:11.
flotus1 is offline   Reply With Quote

Old   January 26, 2022, 19:41
Default
  #3
New Member
 
Joseph Osman
Join Date: Aug 2011
Posts: 7
Rep Power: 15
josman84 is on a distinguished road
Thanks @flotus1 - your detailed response is much appreciated! and thanks for moving to a new thread.

Do I understand correctly that having 1 stick of dual rank per channel is roughly equivalent to 2 sticks of single rank?
And therefore with my current setup I have essentially fulfilled the requirement:
"Use 2 memory sticks per memory channel"
Either way, as you say, it seems unlikely that I am going to double my performance by changing the memory config (unless it is currently imbalanced or something).

I know I won't get perfect scaling, but I think the fact that I get much better performance on a cloud server running Epyc 7551 indicates that something is wrong (do you know of any difference between the 7551 and 7601 other than the clock speed?)

You raise an interesting point about CPU installation commonly going wrong... but everything appears present and correct in the bios. Would this not be detected and flagged?

I am running out of ideas!!
Thanks again

January 27, 2022, 03:07   #4
flotus1 (Alex) - Super Moderator
Quote:
Do I understand correctly that having 1 stick of dual rank per channel is roughly equivalent to 2 sticks of single rank?
I forgot to mention that explicitly: yes, pretty much the same in terms of performance
I can only speculate what prompted Siemens to put "Use 2 memory sticks per memory channel" in their documentation. And when I do that, I tend to ramble on.
So let's stick with the theory that this is a "one stone, many birds" solution. Simply filling up all DIMM slots (most boards don't have more than 2 slots per channel) eliminates a lot of potential problems with memory population, and also ensures at least 2 ranks per channel.

Quote:
do you know of any difference between 7551 and 7601 other than the clock speed?
They are exactly the same, apart from small differences in clock speeds.
BTW, looking at normalized scaling can be misleading. In the end, we only care about absolute performance. How do the 2 systems compare in that regard?

Quote:
but everything appears present and correct in the bios. Would this not be detected and flagged?
No, you often don't get a clear warning when this happens. Also: I have heard of instances where memory reported in bios/IPMI and operating system is not the same. Sounds crazy, I know, but still worth double-checking.

Quote:
I am running out of ideas!!
There are still some points on my list you did not comment on.
Also, running some other benchmarks could help pin down potential problems. Tests that allow you to compare your results to similar known-good systems, or theoretical values.
Which ones exactly depends on which OS you are running.

January 27, 2022, 06:20   #5
josman84 (Joseph Osman) - New Member
Thanks @flotus1

Quote:
Originally Posted by flotus1
I forgot to mention that explicitly: yes, pretty much the same in terms of performance
I can only speculate what prompted Siemens to put "Use 2 memory sticks per memory channel" in their documentation. And when I do that, I tend to ramble on.
So let's stick with the theory that this is a "one stone, many birds" solution. Simply filling up all DIMM slots (most boards don't have more than 2 slots per channel) eliminates a lot of potential problems with memory population, and also ensures at least 2 ranks per channel.
I used to work in the STAR-CCM+ product management team that produces these documents, and I wouldn't say it is impossible for something like this to get lost in translation and then slip through the net... I have asked some ex-colleagues and more or less got the same conclusion as you on this point.

In case you are interested, LINK to the full document
It is not the latest version, but it mentions my particular chip

Quote:
Originally Posted by flotus1
They are exactly the same, apart from small differences in clock speeds.
BTW, looking at normalized scaling can be misleading. In the end, we only care about absolute performance. How do the 2 systems compare in that regard?
Agree - it is the absolute performance which is important, rather than the scaling within a single node...
The closest comparison I have been able to do is on the Rescale Cloud.
I believe they run VMs on Amazon / Azure etc
A single node of Epyc 7551 has 60 cores (I assume they keep 4 spare for system overheads, virtualisation etc.? - a bit out of my depth here!) and 240 GB.
This setup runs an iteration of my benchmark in around 0.7s
My setup runs an iteration in around 1.1s

Quote:
Originally Posted by flotus1
No, you often don't get a clear warning when this happens. Also: I have heard of instances where memory reported in bios/IPMI and operating system is not the same. Sounds crazy, I know, but still worth double-checking.
Ok, this is the next thing I will look into - I did see another message which made me suspicious as the OS loaded up.

Quote:
Originally Posted by flotus1
There are still some points on my list you did not comment on.
1) Yes, agreed that I won't get perfect scaling, but the info from Siemens still suggests this should be a decent setup.
2) I am using "mixed precision" in STAR-CCM+, which effectively means single precision in my case.
3) BIOS settings for memory speed are 2666 MHz by default - I have not changed this. Otherwise I have applied all the performance/turbo settings etc. I also tried with turbo switched off and still get a serious drop-off at high core counts.
4) I have observed that N cores seem to be running at 2.7 GHz when running a simulation on N cores, whilst idle cores run at 2.2 GHz. This was not a very robust test though, so any suggestions of a good way to do this are welcome!
5) Yes, all DIMMs are in the white slots...
6) One thing I did try was to pull CPU2 from the system and put all 16 sticks of RAM on CPU1 (i.e. 2 sticks per channel). The system then couldn't detect the hard disk/boot the OS, so I assumed this is not a valid setup (although the Dell manual seems to suggest it is - could this be a sign of something else wrong perhaps?). With CPU1 removed and CPU2 present, the system wouldn't even power on (that is probably expected). Finally, with both CPUs in place but all the RAM on CPU1, it did actually "work", but the performance was even worse! (I don't know if this tells us anything - it does at least indicate that my current RAM config is not the absolute worst it could be?!)


Quote:
Originally Posted by flotus1
Also, running some other benchmarks could help pin down potential problems. Tests that allow you to compare your results to similar known-good systems, or theoretical values.
Which ones exactly depends on which OS you are running.
Yes, it had occurred to me that there may be some non-CFD benchmark tests I could try to check the overall system performance, but this is a bit outside my comfort zone... any pointers on where to start?

I am using CentOS 8.2 at the moment, but have also tried Ubuntu and Windows. Windows performance was significantly worse than what I have now, and my GPU was not detected (a sign of something else wrong? Or just that this platform isn't really designed to be a Windows desktop?)

Thanks again @flotus1, this is really appreciated!!

January 27, 2022, 07:02   #6
flotus1 (Alex) - Super Moderator
Quote:
One thing I did try was to pull CPU2 from the system and put all 16 sticks of RAM on CPU1 (i.e. 2 sticks per channel). The system then couldn't detect the hard disk/boot the OS, so I assumed this is not a valid setup (although the Dell manual seems to suggest it is - could this be a sign of something else wrong perhaps?). With CPU1 removed and CPU2 present, the system wouldn't even power on (that is probably expected). Finally, with both CPUs in place but all the RAM on CPU1, it did actually "work", but the performance was even worse! (I don't know if this tells us anything - it does at least indicate that my current RAM config is not the absolute worst it could be?!)
You can run these systems with only 1 CPU installed in the right socket. But you have to refer to the block diagram first in order to find out which devices are attached to each CPU. Probably not worth the hassle at this point.
Yes, not having any memory on one CPU is about the worst case you can create here. Aside from also reducing the memory on the other CPU.

All right, so we are on Linux.
So you could run the OpenFOAM benchmark in the pinned thread here to compare your system to similar ones. But first things first:
Please run the command "sudo dmidecode -t 17" and post/attach the output here.
Also run "numactl -H" and show the output.
While running a benchmark, you can use htop to get a quick overview of which cores/hwthreads are being loaded. Maybe the scheduler messes things up.
Speaking of schedulers: "echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor" should get you the best performance.
For monitoring CPU frequencies, turbostat is one of the better options on Linux.
And before running any memory-intensive benchmark, it can be worth clearing caches first via "echo 3 | sudo tee /proc/sys/vm/drop_caches"
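Putting those together, a minimal pre-benchmark sketch (assuming turbostat is installed and the cpufreq sysfs interface exists on your install; adjust as needed):

Code:
# Drop the page cache so the run starts from a clean slate
echo 3 | sudo tee /proc/sys/vm/drop_caches
# Keep all cores at maximum performance (path may differ depending on the cpufreq driver)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# In a second terminal, watch per-core load and frequency while the solver runs
sudo turbostat --interval 5 --show Core,CPU,Busy%,Bzy_MHz,PkgWatt

Column names differ slightly between turbostat versions; running it without --show works as well, it is just noisier.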

You have SMT turned off in bios, right?
We can also check if one or more of your DIMMs is failing. With ECC memory, this will not necessarily cause crashes, but can reduce performance if all errors are correctable.
Supermicro's SuperDoctor tool gives me a quick overview on my Supermicro board. Maybe Dell has a similar tool? Or logs ECC errors in bios?
Linux can also log ECC errors https://serverfault.com/questions/64...rrors-in-linux
Been a while since I had to dig this deep, maybe this is no longer state-of-the-art. Errors logged this way are cleared during reboot, so you would have to run a memory-intensive benchmark first, then check if errors occurred.
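For reference, a minimal sketch of that check on a system where the EDAC driver is loaded (edac-util comes from the edac-utils package; the sysfs counters work without it):

Code:
# Corrected/uncorrected error counts per memory controller since boot
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
# Per-DIMM summary, if edac-utils is installed
edac-util -v

As noted above, run the memory-heavy benchmark first and check the counters afterwards, since they are cleared on reboot.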

February 2, 2022, 11:06   #7
josman84 (Joseph Osman) - New Member
Thanks again!
OK, I have some answers... please see the attached outputs from the dmidecode and numactl commands.
Using turbostat I was able to confirm that my average CPU frequency goes up to close to the max turbo frequency on all cores, so that looks good.

This didn't work - no such file:
"Speaking of schedulers: "echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor" should get you the best performance."
Not entirely sure what the intention was here, but let me know if this is something I should dig into. I don't have the cpu* folders, and couldn't find cpufreq or scaling_governor.

Regarding SMT - I have seen that mentioned previously in other places, but I cannot see anywhere to set it in the BIOS... There is a "System Memory Testing" option which would have the same acronym, but I don't think it's the same thing, is it?
Could it have a different name on a Dell machine perhaps?


I have not yet found anywhere that logs the ECC errors, but I will continue to dig (my Linux skills are very limited!)

Unless you see something in the attached files, it sounds like the next step is to try the OpenFOAM benchmark - agree?

Many thanks
Joe

February 2, 2022, 12:04   #8
flotus1 (Alex) - Super Moderator
Quote:
This didn't work - no such file:
That means you can't attach the output? Or that numactl and dmidecode are not found on your system? It is entirely possible that one or both tools are not part of a standard installation. But they should be found in the official repositories. Use whatever method you normally use to install them. I can't really help you with that, since I don't use CentOS.
But having these outputs would be really helpful for diagnosing any problems related to memory population and configuration.

"echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor" is the command I use on my systems if I want maximum CPU performance all the time.
It is possible that the default way of changing the CPU frequency governor in CentOS 8 is different. My bad for using "scheduler" as a synonym, which is something else.
But since you already confirmed that your CPU cores are running close to maximum turbo frequency, let's not focus on that for now. This is chasing single-digit performance differences, and we have bigger fish to fry.
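For whenever you want to come back to it: on CentOS/RHEL the usual way to get the same effect is through tuned or cpupower (from the kernel-tools package) rather than writing to sysfs directly. A sketch, assuming those packages are installed; and note that if no cpufreq driver is loaded at all (which would explain the missing scaling_governor files), the bios may be handling P-states itself and these commands won't have much to change:

Code:
# Ready-made profile that includes the performance governor and other tuning
sudo tuned-adm profile throughput-performance
# Or set the governor directly
sudo cpupower frequency-set -g performance
# Check what is currently in effect
sudo cpupower frequency-info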

SMT means simultaneous multi-threading. Splitting hairs over minor differences in implementation aside, it is what Intel calls Hyper-Threading: exposing two hardware threads to the OS per physical CPU core. It should be rare these days, but leaving it on can lead to performance regressions. Better to switch it off, especially while trying to diagnose performance issues.
There HAS to be a way to turn it off in bios. Apparently, Dell hides it under the option "Logical Processor". Set that to disabled.
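Whether the toggle actually took effect can be double-checked from Linux, e.g.:

Code:
# 64 CPUs with 1 thread per core means SMT is off; 128 with 2 threads per core means it is still on
lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core"
# On reasonably recent kernels this reports the SMT state directly
cat /sys/devices/system/cpu/smt/active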

February 2, 2022, 12:08   #9
josman84 (Joseph Osman) - New Member
dmidecode and numactl did work, and I thought I had attached the outputs, but it seems they didn't come through...
Trying again now!
Attached Files
File Type: txt dmidecode.txt (18.7 KB, 9 views)
File Type: txt numactl.txt (1,021 Bytes, 10 views)

February 2, 2022, 12:11   #10
josman84 (Joseph Osman) - New Member
OK, attaching seemed to work this time, and yes, I can confirm I have logical processors turned off.
I don't think I have mentioned it yet, but the STAR-CCM+ benchmark output indicates that as the core count increases, the time spent in MPI overhead goes up dramatically... I think that is about the only clue I really have so far!

February 2, 2022, 12:27   #11
flotus1 (Alex) - Super Moderator
Yeah, I can see that too from the files.

Two minor issues I can see from the output:
1) Configured memory speed is only DDR4-2400. Which is the official spec for this machine with 2 ranks per channel installed. Compared to an otherwise identical machine actually running DDR4-2666, this loses around 10% performance at high thread counts. You can try setting it to DDR4-2666 manually, but it might not work on an OEM machine.
2) You can see in the numactl output that node 0 is already half full. Linux memory management is not the best at handling these situations (a quick way to keep an eye on per-node usage is sketched below). Clearing caches before running memory-intensive workloads can really help; it depends on how much memory is actually required.
Clear caches once via
echo 3 | sudo tee /proc/sys/vm/drop_caches
then immediately run your benchmark on all 64 cores.
Other than that, everything is as it should be.
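A quick way to watch how memory is spread across the NUMA nodes before and after clearing caches (numastat ships with the numactl package):

Code:
# Size and free memory per NUMA node; node 0 filling up on its own is what you want to avoid
numactl -H | grep -E "node [0-9]+ (size|free)"
# Per-node breakdown that also shows how much is file cache
numastat -m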

Since there are still a lot of unknowns about the simulation you are trying to run, trying the well-established OpenFOAM benchmark next would be a good idea.

Quote:
but the STAR-CCM+ benchmark output indicates that as the cores increase, the time spent in MPI overhead goes up dramatically.
How much are we talking about here? I.e. what percentage of the total wall time if such a metric is available?

February 2, 2022, 12:37   #12
josman84 (Joseph Osman) - New Member
On 64 cores the MPI overhead is 45%; for the same case on 16 cores it is 8%.

February 2, 2022, 16:58   #13
flotus1 (Alex) - Super Moderator
That's quite a lot.
Now we can either try to pin down whether there is anything special about your CCM+ simulation that makes it very sensitive to small differences in the system it's running on.
Or we first check whether your system is within expectations for a well-behaved benchmark *winkwink* OpenFOAM.
