|
Can small CFD cases run entirely in CPU cache? |
|
July 3, 2024, 06:56 |
Can small CFD cases run entirely in CPU cache?
|
#1 |
New Member
Tom
Join Date: Jan 2023
Posts: 4
Rep Power: 3 |
I'm planning a hardware upgrade and have read through the very useful info on this forum. I've completed the full questionnaire at the bottom of this post, but here's the TLDR:
- My current hardware is a single CPU with 4 cores and 8 MB L3 cache.
- On this hardware a certain OpenFoam case consumes a maximum of 425 MB RAM (total across 4 processes).
- I'm planning to upgrade to a dual CPU system with each CPU having 256 MB L3 cache.

Question: If I run the same case on the new system without any other non-essential processes, would the data reside entirely in cache, and therefore avoid any bottlenecks associated with RAM? Are there specific hardware or software settings to force this behaviour?

Thanks to everyone who contributes here, it's really helped with my upgrade.

====================================
Questionnaire answers:

Which software do you intend to use?
OpenFoam

Are you limited by license constraints? I.e. does your software license only allow you to run on N threads?
No

What type of simulations do you want to run? And what's the maximum cell count?
RANS, steady, incompressible, 0.1 - 2 million cells

If there is a budget, how high is it?
£2k

What kind of setting are you in? Hobbyist? Student? Academic research? Engineer?
Engineer

Where can you source your new computer? Buying a complete package from a large OEM? Assemble it yourself from parts? Are used parts an option?
Self-assembled with used MB/CPU and everything else new

Which part of the world are you from? It's cool if you don't want to tell, but since prices and availability vary depending on the region, this can sometimes be relevant. Particularly if it's not North America or Europe.
UK

Anything else that people should know to help you better? |
|
July 3, 2024, 16:50 |
|
#2 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Yes, it looks like your problem would just fit in cache. The system will do this automatically, because every memory access is preceded by a check of whether that specific memory is already in cache.
|
|
July 3, 2024, 17:09 |
|
#3 |
New Member
Tom
Join Date: Jan 2023
Posts: 4
Rep Power: 3 |
Thanks Will, that's good news.
Once I have the new system running I'll do some experiments to investigate the benefit of fitting the whole case in cache. Perhaps I'll run it with a range of mesh sizes, on a single core to eliminate any effects from processor boundaries, and with single/mixed/double precision. I'm expecting a kink in the curve around 512 MB, which for my setup corresponds to around 300k cells at double precision. |
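Something like this is what I have in mind for each pass of the sweep (just a sketch, assuming a simpleFoam case, that I set the resolution in system/blockMeshDict before each pass, and that GNU time is available for the peak-memory readout):

blockMesh > log.blockMesh 2>&1                      # after setting the resolution in system/blockMeshDict
/usr/bin/time -v simpleFoam > log.simpleFoam 2>&1   # single core, no decomposePar
grep ExecutionTime log.simpleFoam | tail -1         # solver-reported run time

/usr/bin/time -v also reports the maximum resident set size, which should confirm where the 512 MB crossover actually sits. For the precision comparison I believe OpenFOAM has to be rebuilt with WM_PRECISION_OPTION set to SP, DP or SPDP.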
|
July 3, 2024, 20:42 |
|
#4 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Cache also helps CFD when the case does not fit entirely in cache. My 7800X3D is much faster than the regular Ryzen 7000X CPUs due to the 3D V-Cache and a total of 96 MB of L3 cache.
|
|
July 4, 2024, 05:54 |
|
#5 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
https://www.anandtech.com/show/16529...milan-review/4
If we REALLY want to specialize a build for low thread count and in-cache execution, it may be worth keeping the inter-core latency in mind. The gist of it: sharing data across two different sockets can be fairly slow with Epyc. The solution here would be to use just one of the SKUs with extra L3 cache, e.g. AMD Epyc 7373X, 7573X or 7773X. I don't know how much that would help in your case, or whether they fit your budget. |
|
July 8, 2024, 04:01 |
|
#6 |
New Member
Tom
Join Date: Jan 2023
Posts: 4
Rep Power: 3 |
Thanks Alex, that's a very interesting article.
Unfortunately the Zen 3 CPUs don't fall within my budget, but the article also points out something else that I hadn't appreciated: regardless of which SKU you select, any individual core only has access to a small portion of the overall L3 cache.

I think that means that, in order for my case to make use of all the available L3 cache on a pair of 7532 CPUs (32 cores each, 4 cores per CCX), it would need to run across at least 16 cores, ensuring at least one core per CCX is in use. Whether the system would distribute the processes across the maximum number of CCXs (to maximise cache) or the fewest (to minimise latency), I'm not sure.

Obviously this distinction is irrelevant if you're using all the available cores, but with a mesh size of 300k cells (= 512 MB in memory), using all 64 available cores would result in ~5k cells/core, which is on the low side for efficient parallel scaling, particularly with multigrid.

All very interesting, and I expect lots of experimentation to find the optimum settings! |
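(For what it's worth, once the machine is built I believe the cache topology, i.e. how the cores group under each L3 slice, can be inspected with hwloc's lstopo, assuming it's installed:

lstopo --no-io      # cores grouped under each L3 slice, i.e. per CCX
lscpu | grep -i l3  # the OS's summary of the L3 cache

which should make it obvious how many CCXs the ranks can actually spread over.)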
|
July 8, 2024, 08:20 |
|
#7 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
Good catch. That's going a bit beyond my knowledge of CPU architecture, so I don't really know how well L3 cache slices on different CCXs or even CCDs are utilized. If they are, it definitely doesn't happen without a latency and bandwidth penalty.
NUMA topology probably plays a role here. I would expect a single NUMA node per CPU (NPS=1) to perform better for your use case. This is configurable in the BIOS.

Core placement is fairly straightforward if you are using OpenFOAM. You can use the additional options for mpirun to affect it: https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php

--bind-to core --report-bindings --cpu-list ...

would be the most straightforward way to force the solver processes to spawn and stay on certain physical cores. --report-bindings just gives you confirmation; it doesn't affect the binding. |
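As a rough sketch of what that could look like for a 16-rank run (assuming Open MPI 4.x, a case already decomposed into 16 subdomains with decomposePar, and that your Open MPI build exposes the L3cache mapping object):

mpirun -np 16 --map-by ppr:1:l3cache --bind-to core --report-bindings simpleFoam -parallel > log.simpleFoam 2> log.bindings

ppr:1:l3cache requests one rank per L3 cache, i.e. one per CCX, which is the "maximise cache" placement you described; the bindings report goes to stderr (log.bindings here) so you can verify where the ranks landed.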
|
July 9, 2024, 11:36 |
|
#8 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
The calculation doesn't need to fit completely in L3 cache to benefit from that cache. If your cache is 1/10th of the 512 MB working set, there will be at least 10 memory-to-cache transfers per iteration. Let's say there are 20 transfers of 51.2 MB, which is about 1000 MB per iteration. With a bandwidth of, say, 100 GB/s that would take 0.01 seconds per iteration. That compares to an estimated calculation time per iteration of ~0.05 seconds. If my numbers are accurate (not sure of that, but ballpark), you are looking at a 20% loss compared to running fully in cache. |
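If you want to redo that estimate with measured numbers later, the arithmetic is easy to script; a quick sketch with the same assumed figures plugged in:

awk 'BEGIN {
    traffic = 20 * 51.2       # MB moved between RAM and cache per iteration (assumed)
    bw      = 100 * 1024      # memory bandwidth in MB/s (assumed 100 GB/s)
    calc    = 0.05            # estimated compute time per iteration in seconds (assumed)
    t = traffic / bw          # time spent on the RAM-to-cache traffic
    printf "traffic %.0f MB/iter, transfer %.3f s, overhead %.0f%% of compute\n", traffic, t, 100*t/calc
}'

Swap in a measured bandwidth (e.g. from a STREAM run) and a measured time per iteration to tighten it up.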
|
|