|
If memory bound : how to improve performance? |
|
July 5, 2021, 18:40 |
If memory bound : how to improve performance?
|
#1 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
This is a very important question.
Using the roofline performance model ( https://youtu.be/IrkNZG8MJ64 ), I observed that my code is probably memory bound: most of it is vectorized and performs only small operations (addition, multiplication, division) on large arrays over N cells. That makes sense, because we load a lot of data for N cells but do very little arithmetic on it, and the vectorized code is quite fast, so the memory bus may not be able to deliver data to the CPU before the computations finish. I haven't verified this yet, as I don't know how to do a roofline analysis for vectorized code, or how to instrument the code to see whether the CPU is stalled waiting for data from RAM. Even if the memory bus can keep up for a single-threaded run, it probably won't with multiple threads, since all CPU cores share the same memory bus, so the code will be memory bound. Moreover, since my code is explicit, the operations (except the flux calculations) are extremely simple, somewhat similar to DAXPY (more on this later).

From this excellent ANL lecture ( https://youtu.be/TPP5LavGEiI ), specifically at this timestamp ( https://youtu.be/TPP5LavGEiI?t=1670 ), we can see how the performance of DAXPY, DGEMV, and DGEMM differ. From the attached snapshot of the video, DAXPY and DGEMV have a severe limit on their theoretical peak performance because they are memory bound under the roofline model. DGEMM, on the other hand, has extremely high theoretical performance, because it performs many mathematical operations per byte of data, and modern implementations reuse the cache by iterating through the matrices in blocks that fit in cache. The presenter therefore recommends DGEMM-based (matrix * matrix) formulations for solving complex systems of linear equations: DGEMM has extremely high theoretical performance, while DAXPY and DGEMV are almost dead on arrival. So, counter-intuitively, DGEMM-based solutions would be faster and scale better even though they perform significantly more operations.

My issue is that, since my explicit code uses simple operations, it probably has the same performance limitations as DAXPY. So my question is: for memory-bound codes, how do we improve performance?

PS: In the future, I may have to write an implicit solver to get better performance. |
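To put a rough number on it, here is how I estimate the arithmetic intensity of a DAXPY-style update (my own back-of-the-envelope sketch, with an assumed memory bandwidth, not figures from the video):

Code:
/* DAXPY-like update: y[i] = a*x[i] + y[i]
 * Work per element   : 2 flops (1 multiply + 1 add)
 * Traffic per element: read x[i] (8 B) + read y[i] (8 B) + write y[i] (8 B) = 24 B
 * Arithmetic intensity ~ 2 / 24 ~ 0.083 flop/byte
 * Assuming ~20 GB/s of sustained memory bandwidth, the roofline bound is
 * min(peak, 0.083 * 20) ~ 1.7 GFLOPS, no matter how well the loop vectorizes. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}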
|
July 5, 2021, 19:04 |
|
#2 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
One potential solution is to run the explicit iterations on parts of the domain at a time, the way DGEMM uses blocking to improve cache reuse.
For example, if NCELLS = 30,000, solve in blocks of 5,000 cells, computing the residuals and solution updates for those 5,000 cells before moving on. This way, the per-cell data such as surface normals, face areas, rho, rho*u, rho*v, rho*e, etc., stays resident in the L2 cache, and the code might no longer be memory bound. L1 cache misses aren't that bad; reducing L2 and L3 misses will be the important part. Not completely sure yet, I need to study this. |
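As a sketch of what I mean (the kernel names are hypothetical stand-ins for my real flux/update routines, and face-based fluxes that cross block boundaries would need extra care):

Code:
/* Hypothetical per-block kernels, stand-ins for the real solver routines. */
void compute_fluxes(int first, int last);   /* residuals for cells [first, last) */
void update_solution(int first, int last);  /* explicit time update for the same cells */

enum { NCELLS = 30000, BLOCK = 5000 };

void iterate_blocked(void)
{
    /* Run both stages on one block while its data is still resident in cache,
     * instead of streaming the whole mesh through each stage separately. */
    for (int first = 0; first < NCELLS; first += BLOCK) {
        int last = (first + BLOCK < NCELLS) ? first + BLOCK : NCELLS;
        compute_fluxes(first, last);
        update_solution(first, last);
    }
}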
|
July 5, 2021, 19:18 |
|
#3 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Another solution would be to do more computation and read less from memory.
For example, I currently store arrays like ainv = 1.0/a, rhoinv = 1.0/rho, and many more. Keeping extra arrays around is bad: my i3 CPU has 2 x 256 KB of 4-way associative L2 cache, which can theoretically hold roughly 512,000 / 8 = 64,000 double-precision numbers. If I stop reading things like ainv and rhoinv from memory, I can fit more of my problem's cells into the L2 cache and work on bigger loop chunks, which should give some performance improvement. Additionally, computing ainv = 1.0/a and rhoinv = 1.0/rho on the fly increases the code's arithmetic intensity (in roofline-model terms), pushing it further towards being CPU bound. Once it's CPU bound, I can parallelize it and get real speedups. PS: @sbfanni, this is probably why the code I optimized for you was faster in some cases but slower in others; it was most probably memory bound too. |
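A small sketch of what I mean (made-up variable names; the point is that rhoinv is recomputed inside the kernel instead of being kept as yet another array):

Code:
/* Hypothetical fused kernel that already needs rho anyway.
 * Keeping a persistent rhoinv[] array would (a) occupy cache space that could
 * hold more cells and (b) need its own pass to fill it; recomputing 1/rho here
 * costs a few extra flops but raises the arithmetic intensity instead. */
void primitives(int n, const double *rho, const double *rhou, const double *rhov,
                double *u, double *v)
{
    for (int i = 0; i < n; ++i) {
        double rhoinv = 1.0 / rho[i];   /* recomputed, not loaded from a rhoinv[] array */
        u[i] = rhou[i] * rhoinv;
        v[i] = rhov[i] * rhoinv;
    }
}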
|
July 5, 2021, 23:48 |
|
#4 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
If the first post was difficult to follow: basically, my single-threaded vectorized code was so fast that data could not be loaded from memory into the CPU caches quickly enough across repeated iterations. Multi-threading was actually slower, because the memory system couldn't feed all the cores fast enough.
I definitely finished the race faster than everyone else, but I was running in the opposite direction. |
|
July 6, 2021, 05:22 |
|
#5 |
Senior Member
|
My advice, based entirely on heuristics, is to always compute instead of store-and-retrieve, unless the computation itself needs loops, brings in additional arrays, or is generally complex (e.g., several if branches).
Of course it is, in general, better to perform a division only once, in a vectorized fashion, but then you probably need two vectors instead of one (you might need rho both in the numerator and the denominator of the same expression).

If you look at how cache sizes have evolved, that pretty much signals that being memory bound is going to be the norm in HPC. There are technical reasons for this, so my layman's answer to the problem is simply: don't go into a region where technical advances are not going to save you in the foreseeable future.

Maybe, though I don't know how this would work out, you could try to further exploit the L3 cache. That is, since L3 typically receives what is evicted from L1/L2, instead of doing whole computations at once, do them in multiple steps, using a single temporary array to store the intermediate results. There is certainly a size and an amount of computation for which this works better than not doing it. Yet, honestly, you would now be trashing code readability for good. |
|
July 6, 2021, 05:46 |
|
#6 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
This seems like a really nice technique. Could you please share an example? I don't understand how to use it. |
||
July 6, 2021, 06:12 |
|
#7 |
Senior Member
|
Well, I actually overgeneralized. This should work only for an exclusive cache inclusion policy, where L3 is used as a victim cache (that is, it collects what is evicted from the higher-level caches).
In this scenario, the idea would be not to "overcompute" (as I usually would), but to split things into pieces. You do the first piece: it is small, brings in fewer arrays than usual, and should be faster than usual. Then, when you start on the second piece, you are hopefully retrieving data from L3 only, not from RAM. The second piece is small as well, and so on; you get the idea.

I have no idea how to optimize this, nor any real example to share, and I don't even know whether it has some other flaw (besides the fact that it shouldn't work for inclusive cache policies). As I said, I typically work on the other end of the optimization spectrum (just don't care, unless something non-obvious shows up), so I have never actually tried this. If I had to generalize, the statement would be: take into account exactly how your cache works.

EDIT: a more accurate statement would be, in fact, how your compiler uses your cache.
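Just to put the idea on paper anyway (untested, hypothetical array names, and only meaningful if the scratch array really does stay in L3 between passes; otherwise it is strictly more traffic):

Code:
enum { N = 30000 };
static double tmp[N];   /* single reusable scratch array for intermediates */

/* The fused version needs a, b, c and d live at the same time:
 *     out[i] = (a[i]*b[i] + c[i]) / d[i];
 * Split into steps, each touching fewer arrays; between steps the scratch
 * array should, hopefully, be evicted only as far as L3, not back to RAM. */
void staged(const double *a, const double *b, const double *c,
            const double *d, double *out)
{
    for (int i = 0; i < N; ++i) tmp[i] = a[i] * b[i];
    for (int i = 0; i < N; ++i) tmp[i] += c[i];
    for (int i = 0; i < N; ++i) out[i] = tmp[i] / d[i];
}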
|
July 6, 2021, 13:23 |
|
#8 |
Senior Member
Filippo Maria Denaro
Join Date: Jul 2010
Posts: 6,882
Rep Power: 73 |
I am no longer involved in specific programming topics like this, but my old experience with cache misses reminds me of the specific options the compiler offers.
Many years ago I learned about a lot of these issues from KAP, which provided the optimized source code for you. Are you sure your hand-programming effort is still more worthwhile than using the suitable compiler options? |
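For example, with GCC the kind of options meant here would look something like the following (real flags, but whether they actually help is entirely case-dependent, so measure; the file name is just a placeholder):

Code:
# -O3                    : aggressive optimization, including auto-vectorization
# -march=native          : use the full instruction set of the build machine (AVX etc.)
# -funroll-loops         : unroll loops where profitable
# -fprefetch-loop-arrays : insert software prefetches for array loops
# -fopt-info-vec-missed  : report loops the compiler failed to vectorize
gcc -O3 -march=native -funroll-loops -fprefetch-loop-arrays \
    -fopt-info-vec-missed solver.c -o solver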
|
July 6, 2021, 14:11 |
|
#9 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
It's not really "difficult" in the strictest sense; it's about understanding the hardware and using it to its fullest potential. My previous SIMD-based optimizations did work well, but I focused on the wrong thing: instead of decreasing latency, I should have focused on increasing throughput.

As sbaffini said, memory bandwidth may not improve much in the foreseeable future, and we've almost reached the limit of increasing CPU GFLOPS. Based on the roofline model (the video I shared first, and https://hpc-wiki.info/hpc/Performance_model ), we're most likely going to be limited by memory bandwidth for the foreseeable future, so developing our code with these two things in mind is quite important. Compiler options will not help when the data layout and access pattern are wrong.

From the ANL video in the first post and the attached picture, you can see that the Intel i7 on the presenter's machine has a theoretical peak of 56 GFLOPS. DAXPY and DGEMV are limited to maximum performances of only 1.6 and 3.4 GFLOPS respectively, while DGEMM is able to reach 54 GFLOPS. The presenter even jokes that they weren't smart enough to figure this out earlier, but now they write all their complex formulations using DGEMM or other BLAS level-3 functions to get the best performance they can. I'm not an expert, but I would like to learn more about this and possibly use it. |
||
July 6, 2021, 14:22 |
|
#10 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Something that's *really* worth the effort is to land the code's performance right at the bend (the ridge point) of the roofline curve. That's where the CPU consumes data exactly as fast as it arrives. That's difficult in practice, though, so I'll probably make the code slightly compute bound for a ~2.31 GHz CPU, since that's what I have and most likely what HPC clusters will also have. I don't want to make it too compute bound, as that would be wasted performance. |
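For reference, the ridge point is just peak compute divided by memory bandwidth. With assumed (not measured) numbers for my machine, the back-of-the-envelope calculation looks like this:

Code:
#include <stdio.h>

int main(void)
{
    /* Assumed figures, to be replaced with measured ones:
     * 2.31 GHz, 2 cores, 4 doubles per AVX vector, 2 flops per FMA. */
    double peak_gflops   = 2.31 * 2 * 4 * 2;   /* ~37 GFLOPS, theoretical */
    double bandwidth_gbs = 20.0;               /* assumed sustained bandwidth, GB/s */

    /* Ridge point: arithmetic intensity where the memory roof meets the compute roof. */
    double ridge = peak_gflops / bandwidth_gbs;          /* flop/byte */
    printf("ridge point      ~ %.1f flop/byte\n", ridge);

    /* Any kernel with intensity I is bounded by min(peak, I * bandwidth). */
    double I = 0.083;                                    /* DAXPY-like update */
    double bound = I * bandwidth_gbs;
    if (bound > peak_gflops) bound = peak_gflops;
    printf("DAXPY-like bound ~ %.1f GFLOPS\n", bound);
    return 0;
}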
||
July 6, 2021, 19:41 |
|
#11 |
New Member
Francisco
Join Date: Sep 2018
Location: Portugal
Posts: 27
Rep Power: 8 |
If I may barge in!
I was lurking in this thread and it reminded me of some material I stumbled upon today, completely by chance, while skimming through this page. It's long and old, and I haven't read all of it, but it might be worthwhile to test whether it still holds up on today's architectures. More specifically, since an explicit solver is all about matrix multiplications, you can check out the algorithms in section 6 directly. I'd also like to add: I might be off here, and I'm not sure whether what's in there qualifies as general knowledge, but I thought it was better to post it anyway. |
|
July 7, 2021, 03:33 |
|
#12 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Quite useful. I had the older version, but it's nice to get the BSD version, as they seem to have made it look more modern and presented the details in a consistent manner. BSD rules. |
||
July 7, 2021, 04:01 |
|
#13 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Hey, I found a good way to use this. From @ships26's attached PDF, section 6.2.1: how to do matrix multiplication in a somewhat cache-optimized manner.

Naive code:
Code:
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * mul2[k][j];   // <- non-sequential access over k

Transposed code (copy mul2 once so the innermost loop walks memory sequentially):
Code:
double tmp[N][N];

for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        tmp[i][j] = mul2[j][i];                     // <- transpose matrix mul2

for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * tmp[j][k];    // <- sequential access over k
 |
||
July 7, 2021, 06:44 |
|
#14 |
Member
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7 |
Row-column matrix multiplication is for textbooks only.
The answer to the question in your title is tiled matrix multiplication, which is an absolute must on both GPUs and CPUs. It can reduce the number of RAM (or GPU buffer) fetches by up to a factor of 2, and at least one write to the buffer cannot be avoided. Also, you can buy a used R9 280X GPU for about 150 euros that will do a (say) 200x200 matrix multiplication an order of magnitude faster than any Intel CPU not using AVX-512, and the larger the matrix, the bigger the speedup. And you don't have to write your own GPU DGEMM, because AMD now makes its own GPU version available for free, and it is expertly done. |
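For the CPU side, a plain-C sketch of what tiling looks like (the tile size TS is a tuning knob, chosen here so that three TS x TS tiles of doubles fit comfortably in a 256 KB L2; this illustrates the idea and is not a tuned BLAS kernel):

Code:
enum { N = 1200, TS = 64 };   /* 3 tiles of 64x64 doubles ~ 96 KB */

/* res += mul1 * mul2, processed tile by tile so each tile is reused many
 * times while it is resident in cache, instead of being re-fetched from RAM.
 * The i-k-j order keeps the innermost accesses to mul2 and res sequential. */
void dgemm_tiled(double res[N][N], double mul1[N][N], double mul2[N][N])
{
    for (int ii = 0; ii < N; ii += TS)
        for (int kk = 0; kk < N; kk += TS)
            for (int jj = 0; jj < N; jj += TS)
                for (int i = ii; i < ii + TS && i < N; ++i)
                    for (int k = kk; k < kk + TS && k < N; ++k) {
                        double a = mul1[i][k];
                        for (int j = jj; j < jj + TS && j < N; ++j)
                            res[i][j] += a * mul2[k][j];
                    }
}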
|
Tags |
blas, solver development |
|
|