
Can small CFD cases run entirely in CPU cache?



Old   July 3, 2024, 05:56
  #1
New Member
 
Tom
Join Date: Jan 2023
Posts: 4
I'm planning a hardware upgrade and have read through the very useful info on this forum. I've completed the full questionnaire at the bottom of this post, but here's the TLDR:

- My current hardware is a single CPU with 4 cores and 8 MB L3 cache.
- On this hardware, a certain OpenFOAM case consumes a maximum of 425 MB of RAM (total across 4 processes).
- I'm planning to upgrade to a dual CPU system with each CPU having 256 MB L3 cache.

Question: If I run the same case on the new system without any other non-essential processes, would the data reside entirely in cache, and therefore avoid any bottlenecks associated with RAM? Are there specific hardware or software settings to force this behaviour?

Thanks to everyone who contributes here, it's really helped with my upgrade.

====================================

Questionnaire answers:

Which software do you intend to use? OpenFOAM
Are you limited by license constraints? I.e. does your software license only allow you to run on N threads? No
What type of simulations do you want to run? And what's the maximum cell count? RANS, steady, incompressible, 0.1 - 2 million cells
If there is a budget, how high is it? £2k
What kind of setting are you in? Hobbyist? Student? Academic research? Engineer? Engineer
Where can you source your new computer? Buying a complete package from a large OEM? Assemble it yourself from parts? Are used parts an option? Self-assembled with used MB/CPU and everything else new
Which part of the world are you from? It's cool if you don't want to tell, but since prices and availability vary depending on the region, this can sometimes be relevant. Particularly if it's not North America or Europe. UK
Anything else that people should know to help you better?

Old   July 3, 2024, 15:50
  #2
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 365
Yes, it looks like your problem would just fit in cache. The system will do this automatically, because every memory access first checks whether the requested data is already in cache.
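
For reference, a quick way to see what the OS reports for the cache hierarchy (a minimal sketch for Linux, assuming the standard lscpu and getconf utilities are installed):

Code:
# per-level cache sizes as seen by the OS
lscpu | grep -i cache
# more detailed cache parameters (size, line size, associativity)
getconf -a | grep -i CACHE

That makes it easy to confirm the 2 x 256 MB L3 figure on the new machine before comparing run times.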

Old   July 3, 2024, 16:09
  #3
New Member
 
Tom
Join Date: Jan 2023
Posts: 4
Thanks Will, that's good news.

Once I have the new system running, I'll do some experiments to investigate the benefit of fitting the whole case in cache: perhaps running it with a range of mesh sizes, on a single core (to eliminate any effects from processor boundaries), and with single/mixed/double precision.

I'm expecting a kink in the curve around 512 MB, which for my setup corresponds to around 300k cells at double precision.
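
For what it's worth, a rough sketch of how such a sweep could be scripted in bash; the nCellsK entry, file names and cell counts below are hypothetical placeholders, and it assumes controlDict is set to run a fixed number of iterations:

Code:
#!/bin/bash
# Sweep mesh sizes on a single pinned core and record the wall time of each run.
# Assumes blockMeshDict reads its resolution from a user-defined "nCellsK" entry.
for n in 50 100 200 300 400 600; do   # approximate cell counts in thousands
    foamDictionary -entry nCellsK -set $n system/blockMeshDict
    blockMesh > log.blockMesh.$n
    /usr/bin/time -f "${n}k cells: %e s" taskset -c 0 simpleFoam > log.simpleFoam.$n 2>> sweep_times.txt
done

Plotting seconds per iteration against cell count should make any kink around the cache capacity visible.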

Old   July 3, 2024, 19:42
  #4
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 365
Cache also helps CFD when the case does not fit entirely in cache. My 7800X3D is much faster than the regular Ryzen 7000-series CPUs thanks to the 3D V-Cache, giving a total of 96 MB of L3 cache.

Old   July 4, 2024, 04:54
  #5
Super Moderator
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,426
https://www.anandtech.com/show/16529...milan-review/4

If we REALLY want to specialize a build for low thread count and in-cache execution, it may be worth keeping the inter-core latency in mind. The gist of it: sharing data across two different sockets can be fairly slow with Epyc.
The solution here would be going with one of the SKUs with extra L3 cache, e.g. AMD Epyc 7373X, 7573X or 7773X. I don't know how much that would help in your case, or whether they fit your budget.

Old   July 8, 2024, 03:01
  #6
New Member
 
Tom
Join Date: Jan 2023
Posts: 4
Thanks Alex, that's a very interesting article.

Unfortunately the Zen3 CPUs don't fall within my budget, but the article also points out something else that I hadn't appreciated: regardless of which SKU you select, any individual core only has access to a small portion of the overall L3 cache.

I think it means that, in order for my case to make use of all the available L3 cache on a pair of 7532 CPUs (32 cores each, 4 cores per CCX), it would need to run across at least 16 cores, ensuring at least one core per CCX is in use. Whether the system would distribute the processes across the maximum number of CCXs (to maximise cache) or the fewest (to minimise latency), I'm not sure.

Obviously this distinction is irrelevant if you're using all the available cores, but with a mesh size of 300k cells (= 512 MB in memory), using all 64 available cores would result in ~5k cells/core, which is on the low side for efficient parallel scaling, particularly with multigrid.
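
Writing that arithmetic out (taking the 4-cores-per-CCX figure above at face value):

\[
\frac{2 \times 32\ \text{cores}}{4\ \text{cores/CCX}} = 16\ \text{CCXs},
\qquad
\frac{300\,000\ \text{cells}}{64\ \text{cores}} \approx 4\,700\ \text{cells/core}
\]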

All very interesting, and I expect lots of experimentation to find the optimum settings!

Old   July 8, 2024, 07:20
  #7
Super Moderator
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,426
Good catch. That's going a bit beyond my knowledge of CPU architecture, so I don't really know how well L3 cache slices on different CCXs or even CCDs are utilized. If they are, it definitely doesn't happen without a latency and bandwidth penalty.
NUMA topology probably plays a role here. I would expect a single NUMA node per CPU (NPS=1) to perform better for your use case. This is configurable in the BIOS.
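Once the machine is up, the effect of the NPS setting can be checked from Linux (a minimal sketch, assuming the numactl and hwloc packages are installed):

Code:
# with NPS=1 this should report one NUMA node per socket (two nodes in total)
numactl --hardware
# text-mode view of sockets, L3 cache domains and cores
lstopo-no-graphics --no-io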

Core placement is fairly straightforward if you are using OpenFOAM. You can use the additional options for mpirun to affect it:
https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php
--bind-to core --report-bindings --cpu-list ...
That would be the most straightforward way to force the solver processes to spawn and stay on certain physical cores. --report-bindings just gives you confirmation; it doesn't affect the binding.
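
Putting the pieces together, a sketch of a complete launch for a case decomposed into 16 subdomains (one rank per CCX, as discussed above). The l3cache mapping keyword comes from the mpirun man page linked above, but option spelling and availability can vary between Open MPI versions, and the explicit core list is only a placeholder:

Code:
# numberOfSubdomains 16 in system/decomposeParDict
decomposePar
# Variant A: let Open MPI place one rank per L3 cache domain
mpirun -np 16 --map-by l3cache --bind-to core --report-bindings simpleFoam -parallel > log.run 2>&1
# Variant B: pin ranks to an explicit list of physical cores
mpirun -np 16 --bind-to core --cpu-list <explicit core IDs> --report-bindings simpleFoam -parallel > log.run 2>&1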

Old   July 9, 2024, 10:36
  #8
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 365
Quote:
Originally Posted by asltpo
Thanks Alex, that's a very interesting article.

....


Obviously this distinction is irrelevant if you're using all the available cores, but with a mesh size of 300k cells (= 512MB in memory) then using all 64 available cores would result in ~5k cells/core which is on the low side for efficient parallel scaling particularly with multigrid.

All very interesting, and I expect lots of experimentation to find the optimum settings!

You don't need the case to fit completely in L3 cache for the calculation to benefit from that cache. If you have 1/10 of the 512 MB in cache, there will be at least 10 memory-to-cache transfers per pass over the data. Let's say there are 20 transfers of 51.2 MB, which is about 1000 MB per iteration. With a bandwidth of, say, 100 GB/s, that would take 0.01 seconds/iteration. That compares to an estimated calculation time per iteration of ~0.05 seconds. If my numbers are accurate (not sure of that, but ballpark), you are looking at a ~20% loss compared to running fully in cache.
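
In equation form, the same ballpark estimate reads:

\[
t_\text{mem} \approx \frac{20 \times 51.2\ \text{MB}}{100\ \text{GB/s}} \approx 0.01\ \text{s/iteration},
\qquad
\frac{t_\text{mem}}{t_\text{calc}} \approx \frac{0.01\ \text{s}}{0.05\ \text{s}} = 20\,\%
\]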
