|
If memory bound : how to improve performance? |
|
July 5, 2021, 18:40 |
If memory bound : how to improve performance?
|
#1 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
This is a very important question.
Using the roofline performance model ( https://youtu.be/IrkNZG8MJ64 ), I observed that my code is probably memory bound: most of it is vectorized and performs only small operations (addition, multiplication, division) on large arrays over N cells. That makes sense, because we load a lot of data for N cells but do very little arithmetic on it, and the vectorized code is quite fast, so the memory bus may not be able to deliver data to the CPU before the computations finish. I haven't verified this yet, as I don't know how to do a roofline analysis for vectorized code, or how to instrument the code to see whether the CPU is stalled waiting for data from RAM. Even if the memory bus can keep up for a single-threaded run, it probably won't with multiple threads, since all CPU cores share the same memory bus, so the code will be memory bound. Moreover, since my code is explicit, the operations (except the flux calculations) are extremely simple, somewhat similar to DAXPY (more on this later).

From this excellent ANL lecture ( https://youtu.be/TPP5LavGEiI ), specifically at this timestamp ( https://youtu.be/TPP5LavGEiI?t=1670 ), we can see how the performance of DAXPY, DGEMV, and DGEMM differ. From the attached snapshot of the video, DAXPY and DGEMV have a severe limit on their theoretical peak performance because they are memory bound under the roofline model. DGEMM, on the other hand, has extremely high theoretical performance, because it performs many mathematical operations per byte of data, and modern implementations reuse the cache by iterating through the matrices in blocks that fit in cache. The presenter therefore recommends DGEMM-based (matrix * matrix) formulations for solving complex systems of linear equations: DGEMM has extremely high theoretical performance, while DAXPY and DGEMV are almost dead on arrival. So, counter-intuitively, DGEMM-based solutions would be faster and scale better even though they perform significantly more operations.

My issue is that, since my explicit code uses simple operations, it probably has the same performance limitations as DAXPY. So my question is: for memory-bound codes, how do we improve performance?

PS: In the future, I may have to write an implicit solver to get better performance. |
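To put a rough number on it, here is how I estimate the arithmetic intensity of a DAXPY-style update (my own back-of-the-envelope sketch, with an assumed memory bandwidth, not figures from the video):

Code:
/* DAXPY-like update: y[i] = a*x[i] + y[i]
 * Work per element   : 2 flops (1 multiply + 1 add)
 * Traffic per element: read x[i] (8 B) + read y[i] (8 B) + write y[i] (8 B) = 24 B
 * Arithmetic intensity ~ 2 / 24 ~ 0.083 flop/byte
 * Assuming ~20 GB/s of sustained memory bandwidth, the roofline bound is
 * min(peak, 0.083 * 20) ~ 1.7 GFLOPS, no matter how well the loop vectorizes. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}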
|
July 5, 2021, 19:04 |
|
#2 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
One potential solution is to run the explicit iterations on parts of the domain at a time, the way DGEMM uses blocking to improve cache reuse.
For example, if NCELLS = 30,000, solve in blocks of 5,000 cells, computing the residuals and solution updates for those 5,000 cells before moving on. This way, the per-cell data such as surface normals, face areas, rho, rho*u, rho*v, rho*e, etc., stays resident in the L2 cache, and the code might no longer be memory bound. L1 cache misses aren't that bad; reducing L2 and L3 misses will be the important part. Not completely sure yet, I need to study this. |
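As a sketch of what I mean (the kernel names are hypothetical stand-ins for my real flux/update routines, and face-based fluxes that cross block boundaries would need extra care):

Code:
/* Hypothetical per-block kernels, stand-ins for the real solver routines. */
void compute_fluxes(int first, int last);   /* residuals for cells [first, last) */
void update_solution(int first, int last);  /* explicit time update for the same cells */

enum { NCELLS = 30000, BLOCK = 5000 };

void iterate_blocked(void)
{
    /* Run both stages on one block while its data is still resident in cache,
     * instead of streaming the whole mesh through each stage separately. */
    for (int first = 0; first < NCELLS; first += BLOCK) {
        int last = (first + BLOCK < NCELLS) ? first + BLOCK : NCELLS;
        compute_fluxes(first, last);
        update_solution(first, last);
    }
}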
|
July 5, 2021, 19:18 |
|
#3 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Another solution would be to do more computation and read less from memory.
For example, I currently store arrays like ainv = 1.0/a, rhoinv = 1.0/rho, and many more. Keeping extra arrays around is bad: my i3 CPU has 2 x 256 KB of 4-way associative L2 cache, which can theoretically hold roughly 512,000 / 8 = 64,000 double-precision numbers. If I stop reading things like ainv and rhoinv from memory, I can fit more of my problem's cells into the L2 cache and work on bigger loop chunks, which should give some performance improvement. Additionally, computing ainv = 1.0/a and rhoinv = 1.0/rho on the fly increases the code's arithmetic intensity (in roofline-model terms), pushing it further towards being CPU bound. Once it's CPU bound, I can parallelize it and get real speedups. PS: @sbfanni, this is probably why the code I optimized for you was faster in some cases but slower in others; it was most probably memory bound too. |
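A small sketch of what I mean (made-up variable names; the point is that rhoinv is recomputed inside the kernel instead of being kept as yet another array):

Code:
/* Hypothetical fused kernel that already needs rho anyway.
 * Keeping a persistent rhoinv[] array would (a) occupy cache space that could
 * hold more cells and (b) need its own pass to fill it; recomputing 1/rho here
 * costs a few extra flops but raises the arithmetic intensity instead. */
void primitives(int n, const double *rho, const double *rhou, const double *rhov,
                double *u, double *v)
{
    for (int i = 0; i < n; ++i) {
        double rhoinv = 1.0 / rho[i];   /* recomputed, not loaded from a rhoinv[] array */
        u[i] = rhou[i] * rhoinv;
        v[i] = rhov[i] * rhoinv;
    }
}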
|
July 5, 2021, 23:48 |
|
#4 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
If the first post was difficult to follow: basically, my single-threaded vectorized code was so fast that data could not be loaded from memory into the CPU caches quickly enough across repeated iterations. Multi-threading was actually slower, because the memory system couldn't feed all the cores fast enough.
I definitely finished the race faster than everyone else, but I was running in the opposite direction. |
|
July 6, 2021, 05:22 |
|
#5 |
Senior Member
|
My advice, based entirely on heuristics, is to always compute instead of store-and-retrieve, unless the computation itself needs loops, brings in additional arrays, or is generally complex (e.g., several if branches).
Of course it is, in general, better to perform a division only once, in a vectorized fashion, but then you probably need two vectors instead of one (you might need rho both in the numerator and the denominator of the same expression).

If you look at how cache sizes have evolved, that pretty much signals that being memory bound is going to be the norm in HPC. There are technical reasons for this, so my layman's answer to the problem is simply: don't go into a region where technical advances are not going to save you in the foreseeable future.

Maybe, though I don't know how this would work out, you could try to further exploit the L3 cache. That is, since L3 typically receives what is evicted from L1/L2, instead of doing whole computations at once, do them in multiple steps, using a single temporary array to store the intermediate results. There is certainly a size and an amount of computation for which this works better than not doing it. Yet, honestly, you would now be trashing code readability for good. |
|
July 6, 2021, 05:46 |
|
#6 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
This seems like a really nice technique. Could you please share an example? I don't understand how to use it. |
||
July 6, 2021, 06:12 |
|
#7 |
Senior Member
|
Well, I actually overgeneralized. This should work only for an exclusive cache inclusion policy, where L3 is used as a victim cache (that is, it collects what is evicted from the higher-level caches).
In this scenario, the idea would be not to "overcompute" (as I usually would), but to split things into pieces. You do the first piece: it is small, brings in fewer arrays than usual, and should be faster than usual. Then, when you start on the second piece, you are hopefully retrieving data from L3 only, not from RAM. The second piece is small as well, and so on; you get the idea.

I have no idea how to optimize this, nor any real example to share, and I don't even know whether it has some other flaw (besides the fact that it shouldn't work for inclusive cache policies). As I said, I typically work on the other end of the optimization spectrum (just don't care, unless something non-obvious shows up), so I have never actually tried this. If I had to generalize, the statement would be: take into account exactly how your cache works.

EDIT: a more accurate statement would be, in fact, how your compiler uses your cache.
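Just to put the idea on paper anyway (untested, hypothetical array names, and only meaningful if the scratch array really does stay in L3 between passes; otherwise it is strictly more traffic):

Code:
enum { N = 30000 };
static double tmp[N];   /* single reusable scratch array for intermediates */

/* The fused version needs a, b, c and d live at the same time:
 *     out[i] = (a[i]*b[i] + c[i]) / d[i];
 * Split into steps, each touching fewer arrays; between steps the scratch
 * array should, hopefully, be evicted only as far as L3, not back to RAM. */
void staged(const double *a, const double *b, const double *c,
            const double *d, double *out)
{
    for (int i = 0; i < N; ++i) tmp[i] = a[i] * b[i];
    for (int i = 0; i < N; ++i) tmp[i] += c[i];
    for (int i = 0; i < N; ++i) out[i] = tmp[i] / d[i];
}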
|
July 6, 2021, 13:23 |
|
#8 |
Senior Member
Filippo Maria Denaro
Join Date: Jul 2010
Posts: 6,882
Rep Power: 73 |
I am no longer involved in specific programming topics like this, but my old experience with cache misses reminds me of the specific options the compiler offers.
Many years ago I learned about a lot of these issues from KAP, which provided the optimized source code for you. Are you sure your hand-programming effort is still more worthwhile than using the suitable compiler options? |
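For example, with GCC the kind of options meant here would look something like the following (real flags, but whether they actually help is entirely case-dependent, so measure; the file name is just a placeholder):

Code:
# -O3                    : aggressive optimization, including auto-vectorization
# -march=native          : use the full instruction set of the build machine (AVX etc.)
# -funroll-loops         : unroll loops where profitable
# -fprefetch-loop-arrays : insert software prefetches for array loops
# -fopt-info-vec-missed  : report loops the compiler failed to vectorize
gcc -O3 -march=native -funroll-loops -fprefetch-loop-arrays \
    -fopt-info-vec-missed solver.c -o solver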
|
July 6, 2021, 14:11 |
|
#9 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
It's not really "difficult" in the strictest sense; it's about understanding the hardware and using it to its fullest potential. My previous SIMD-based optimizations did work well, but I focused on the wrong thing: instead of decreasing latency, I should have focused on increasing throughput.

As sbaffini said, memory bandwidth may not improve much in the foreseeable future, and we've almost reached the limit of increasing CPU GFLOPS. Based on the roofline model (the video I shared first, and https://hpc-wiki.info/hpc/Performance_model ), we're most likely going to be limited by memory bandwidth for the foreseeable future, so developing our code with these two things in mind is quite important. Compiler options will not help when the data layout and access pattern are wrong.

From the ANL video in the first post and the attached picture, you can see that the Intel i7 on the presenter's machine has a theoretical peak of 56 GFLOPS. DAXPY and DGEMV are limited to maximum performances of only 1.6 and 3.4 GFLOPS respectively, while DGEMM is able to reach 54 GFLOPS. The presenter even jokes that they weren't smart enough to figure this out earlier, but now they write all their complex formulations using DGEMM or other BLAS level-3 functions to get the best performance they can. I'm not an expert, but I would like to learn more about this and possibly use it. |
||
July 6, 2021, 14:22 |
|
#10 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Something that's *really* worth the effort is to land the code's performance right at the bend (the ridge point) of the roofline curve. That's where the CPU consumes data exactly as fast as it arrives. That's difficult in practice, though, so I'll probably make the code slightly compute bound for a ~2.31 GHz CPU, since that's what I have and most likely what HPC clusters will also have. I don't want to make it too compute bound, as that would be wasted performance. |
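For reference, the ridge point is just peak compute divided by memory bandwidth. With assumed (not measured) numbers for my machine, the back-of-the-envelope calculation looks like this:

Code:
#include <stdio.h>

int main(void)
{
    /* Assumed figures, to be replaced with measured ones:
     * 2.31 GHz, 2 cores, 4 doubles per AVX vector, 2 flops per FMA. */
    double peak_gflops   = 2.31 * 2 * 4 * 2;   /* ~37 GFLOPS, theoretical */
    double bandwidth_gbs = 20.0;               /* assumed sustained bandwidth, GB/s */

    /* Ridge point: arithmetic intensity where the memory roof meets the compute roof. */
    double ridge = peak_gflops / bandwidth_gbs;          /* flop/byte */
    printf("ridge point      ~ %.1f flop/byte\n", ridge);

    /* Any kernel with intensity I is bounded by min(peak, I * bandwidth). */
    double I = 0.083;                                    /* DAXPY-like update */
    double bound = I * bandwidth_gbs;
    if (bound > peak_gflops) bound = peak_gflops;
    printf("DAXPY-like bound ~ %.1f GFLOPS\n", bound);
    return 0;
}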
||
July 6, 2021, 19:41 |
|
#11 |
New Member
Francisco
Join Date: Sep 2018
Location: Portugal
Posts: 27
Rep Power: 8 |
If I may barge in!
I was lurking in this thread and it reminded me of some material I stumbled upon today, completely by chance, while skimming through this page. It's long and old, and I haven't read all of it, but it might be worthwhile to test whether it still holds up on today's architectures. More specifically, since an explicit solver is all about matrix multiplications, you can check out the algorithms in section 6 directly. I'd also like to add: I might be off here, and I'm not sure whether what's in there qualifies as general knowledge, but I thought it was better to post it anyway. |
|
July 7, 2021, 03:33 |
|
#12 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Quite useful. I had the older version, but it's nice to get the BSD version, as they seem to have made it look more modern and presented the details in a consistent manner. BSD rules. |
||
July 7, 2021, 04:01 |
|
#13 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Hey, I found a good way to use this. From @ships26's attached PDF, section 6.2.1: how to do matrix multiplication in a somewhat cache-optimized manner.

Naive code:
Code:
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * mul2[k][j];   // <- non-sequential access over k

Transposed code (copy mul2 once so the innermost loop walks memory sequentially):
Code:
double tmp[N][N];

for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        tmp[i][j] = mul2[j][i];                     // <- transpose matrix mul2

for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * tmp[j][k];    // <- sequential access over k
 |
||
July 7, 2021, 06:44 |
|
#14 |
Member
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7 |
Row-column matrix multiplication is for textbooks only.
The answer to the question in your title is tiled matrix multiplication, which is an absolute must on both GPUs and CPUs. It can reduce the number of RAM (or GPU buffer) fetches by up to a factor of 2, and at least one write to the buffer cannot be avoided. Also, you can buy a used R9 280X GPU for about 150 euros that will do a (say) 200x200 matrix multiplication an order of magnitude faster than any Intel CPU not using AVX-512, and the larger the matrix, the bigger the speedup. And you don't have to write your own GPU DGEMM, because AMD now makes its own GPU version available for free, and it is expertly done. |
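For the CPU side, a plain-C sketch of what tiling looks like (the tile size TS is a tuning knob, chosen here so that three TS x TS tiles of doubles fit comfortably in a 256 KB L2; this illustrates the idea and is not a tuned BLAS kernel):

Code:
enum { N = 1200, TS = 64 };   /* 3 tiles of 64x64 doubles ~ 96 KB */

/* res += mul1 * mul2, processed tile by tile so each tile is reused many
 * times while it is resident in cache, instead of being re-fetched from RAM.
 * The i-k-j order keeps the innermost accesses to mul2 and res sequential. */
void dgemm_tiled(double res[N][N], double mul1[N][N], double mul2[N][N])
{
    for (int ii = 0; ii < N; ii += TS)
        for (int kk = 0; kk < N; kk += TS)
            for (int jj = 0; jj < N; jj += TS)
                for (int i = ii; i < ii + TS && i < N; ++i)
                    for (int k = kk; k < kk + TS && k < N; ++k) {
                        double a = mul1[i][k];
                        for (int j = jj; j < jj + TS && j < N; ++j)
                            res[i][j] += a * mul2[k][j];
                    }
}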
|
Tags |
blas, solver development |
|
|