|
Can small CFD cases run entirely in CPU cache? |
|
July 3, 2024, 06:56 |
Can small CFD cases run entirely in CPU cache?
|
#1 |
New Member
Tom
Join Date: Jan 2023
Posts: 4
Rep Power: 3 |
I'm planning a hardware upgrade and have read through the very useful info on this forum. I've completed the full questionnaire at the bottom of this post, but here's the TLDR:
- My current hardware is a single CPU with 4 cores and 8 MB L3 cache.
- On this hardware a certain OpenFoam case consumes a maximum of 425 MB RAM (total across 4 processes).
- I'm planning to upgrade to a dual CPU system with each CPU having 256 MB L3 cache.

Question: If I run the same case on the new system without any other non-essential processes, would the data reside entirely in cache, and therefore avoid any bottlenecks associated with RAM? Are there specific hardware or software settings to force this behaviour?

Thanks to everyone who contributes here, it's really helped with my upgrade.

====================================
Questionnaire answers:

Which software do you intend to use?
OpenFoam

Are you limited by license constraints? I.e. does your software license only allow you to run on N threads?
No

What type of simulations do you want to run? And what's the maximum cell count?
RANS, steady, incompressible, 0.1 - 2 million cells

If there is a budget, how high is it?
£2k

What kind of setting are you in? Hobbyist? Student? Academic research? Engineer?
Engineer

Where can you source your new computer? Buying a complete package from a large OEM? Assemble it yourself from parts? Are used parts an option?
Self-assembled with used MB/CPU and everything else new

Which part of the world are you from? It's cool if you don't want to tell, but since prices and availability vary depending on the region, this can sometimes be relevant. Particularly if it's not North America or Europe.
UK

Anything else that people should know to help you better? |
|
July 3, 2024, 16:50 |
|
#2 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Yes, it looks like your problem would just fit in cache. The system will do this automatically, because every memory access is preceded by a check of whether that specific memory is already in cache.
|
|
July 3, 2024, 17:09 |
|
#3 |
New Member
Tom
Join Date: Jan 2023
Posts: 4
Rep Power: 3 |
Thanks Will, that's good news.
Once I have the new system running I'll do some experiments to investigate the benefit of fitting the whole case in cache. Perhaps I'll run it with a range of mesh sizes, on a single core to eliminate any effects from processor boundaries, and with single/mixed/double precision. I'm expecting a kink in the curve around 512 MB, which for my setup corresponds to around 300k cells at double precision. |
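Something like this is what I have in mind for each pass of the sweep (just a sketch, assuming a simpleFoam case, that I set the resolution in system/blockMeshDict before each pass, and that GNU time is available for the peak-memory readout):

blockMesh > log.blockMesh 2>&1                      # after setting the resolution in system/blockMeshDict
/usr/bin/time -v simpleFoam > log.simpleFoam 2>&1   # single core, no decomposePar
grep ExecutionTime log.simpleFoam | tail -1         # solver-reported run time

/usr/bin/time -v also reports the maximum resident set size, which should confirm where the 512 MB crossover actually sits. For the precision comparison I believe OpenFOAM has to be rebuilt with WM_PRECISION_OPTION set to SP, DP or SPDP.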
|
July 3, 2024, 20:42 |
|
#4 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Cache also helps CFD when the case does not fit entirely in cache. My 7800X3D is much faster than the regular Ryzen 7000X CPUs due to the 3D V-Cache and a total of 96 MB of L3 cache.
|
|
July 4, 2024, 05:54 |
|
#5 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
https://www.anandtech.com/show/16529...milan-review/4
If we REALLY want to specialize a build for low thread count and in-cache execution, it may be worth keeping the inter-core latency in mind. The gist of it: sharing data across two different sockets can be fairly slow with Epyc. The solution here would be to use just one of the SKUs with extra L3 cache, e.g. AMD Epyc 7373X, 7573X or 7773X. I don't know how much that would help in your case, or whether they fit your budget. |
|
July 8, 2024, 04:01 |
|
#6 |
New Member
Tom
Join Date: Jan 2023
Posts: 4
Rep Power: 3 |
Thanks Alex, that's a very interesting article.
Unfortunately the Zen 3 CPUs don't fall within my budget, but the article also points out something else that I hadn't appreciated: regardless of which SKU you select, any individual core only has access to a small portion of the overall L3 cache.

I think that means that, in order for my case to make use of all the available L3 cache on a pair of 7532 CPUs (32 cores each, 4 cores per CCX), it would need to run across at least 16 cores, ensuring at least one core per CCX is in use. Whether the system would distribute the processes across the maximum number of CCXs (to maximise cache) or the fewest (to minimise latency), I'm not sure.

Obviously this distinction is irrelevant if you're using all the available cores, but with a mesh size of 300k cells (= 512 MB in memory), using all 64 available cores would result in ~5k cells/core, which is on the low side for efficient parallel scaling, particularly with multigrid.

All very interesting, and I expect lots of experimentation to find the optimum settings! |
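(For what it's worth, once the machine is built I believe the cache topology, i.e. how the cores group under each L3 slice, can be inspected with hwloc's lstopo, assuming it's installed:

lstopo --no-io      # cores grouped under each L3 slice, i.e. per CCX
lscpu | grep -i l3  # the OS's summary of the L3 cache

which should make it obvious how many CCXs the ranks can actually spread over.)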
|
July 8, 2024, 08:20 |
|
#7 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,428
Rep Power: 49 |
Good catch. That's going a bit beyond my knowledge of CPU architecture, so I don't really know how well L3 cache slices on different CCXs or even CCDs are utilized. If they are, it definitely doesn't happen without a latency and bandwidth penalty.
NUMA topology probably plays a role here. I would expect a single NUMA node per CPU (NPS=1) to perform better for your use case. This is configurable in the BIOS.

Core placement is fairly straightforward if you are using OpenFOAM. You can use the additional options for mpirun to affect it: https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php

--bind-to core --report-bindings --cpu-list ...

would be the most straightforward way to force the solver processes to spawn and stay on certain physical cores. --report-bindings just gives you confirmation; it doesn't affect the binding. |
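As a rough sketch of what that could look like for a 16-rank run (assuming Open MPI 4.x, a case already decomposed into 16 subdomains with decomposePar, and that your Open MPI build exposes the L3cache mapping object):

mpirun -np 16 --map-by ppr:1:l3cache --bind-to core --report-bindings simpleFoam -parallel > log.simpleFoam 2> log.bindings

ppr:1:l3cache requests one rank per L3 cache, i.e. one per CCX, which is the "maximise cache" placement you described; the bindings report goes to stderr (log.bindings here) so you can verify where the ranks landed.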
|
July 9, 2024, 11:36 |
|
#8 | |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
The calculation doesn't need to fit completely in L3 cache to benefit from that cache. If your cache is 1/10th of the 512 MB working set, there will be at least 10 memory-to-cache transfers per iteration. Let's say there are 20 transfers of 51.2 MB, which is about 1000 MB per iteration. With a bandwidth of, say, 100 GB/s that would take 0.01 seconds per iteration. That compares to an estimated calculation time per iteration of ~0.05 seconds. If my numbers are accurate (not sure of that, but ballpark), you are looking at a 20% loss compared to running fully in cache. |
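If you want to redo that estimate with measured numbers later, the arithmetic is easy to script; a quick sketch with the same assumed figures plugged in:

awk 'BEGIN {
    traffic = 20 * 51.2       # MB moved between RAM and cache per iteration (assumed)
    bw      = 100 * 1024      # memory bandwidth in MB/s (assumed 100 GB/s)
    calc    = 0.05            # estimated compute time per iteration in seconds (assumed)
    t = traffic / bw          # time spent on the RAM-to-cache traffic
    printf "traffic %.0f MB/iter, transfer %.3f s, overhead %.0f%% of compute\n", traffic, t, 100*t/calc
}'

Swap in a measured bandwidth (e.g. from a STREAM run) and a measured time per iteration to tighten it up.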
|
|