|
Motivation and introduction for learning assembly coding |
|
January 14, 2021, 16:32 |
Motivation and introduction for learning assembly coding
|
#1 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
I'm really thankful to this forum and its members for helping me out in my hobby project. However, I have seen that many members, even super experienced ones, don't want to mess with understanding the machine code (let's say assembly, for brevity) generated by the compiler.
I understand why diving deep into the assembly code might not seem very interesting, or even worth the effort, to most. Unfortunately it is very useful, and people are missing out on the benefits of understanding and analyzing it. This post is to start a discussion, and I'm willing to help anyone interested in learning more.

PART 1 : What are the benefits of understanding assembly code?

Here are a few benefits of understanding assembly code :

- We know how fast a small portion of code will run, since the Intel/AMD CPU manuals tell us how fast (throughput and latency) any particular machine instruction (IMUL, ADD, SUB, ...) is, and we can see whether the generated assembly uses the widest SIMD vector registers: XMM (okay), YMM (fast, and most common on the Intel i3/i5/i7 family), ZMM (really fast, and available on special CPUs). So, if you see that your generated assembly code uses YMM registers, you're in for a good time. That means your code is using AVX2 vectorization and not the slower SSE vectorization.

- If you know how the assembly code works, you know exactly how your data needs to be laid out to gain maximum performance. (Hint : 1D arrays are best. Linear access is best.)

You might want to store the conserved state vector (CSV) interleaved, as vector = [rho, rho*u, rho*v, rho*E][rho, rho*u, rho*v, rho*E]... You might think that having the data close together like that will improve your cache performance. You're right. So what's the problem? The problem is that you're doomed to using only the slow XMM registers; that's what the compiler will generate in order to be safe. You can force the compiler to generate AVX2 vectorized code that uses YMM registers, but in order to use the data in that form (say, when a CSV to PSV conversion is required), the AVX2 code will need machine instructions that interleave the data, permute the data, rotate the data, and do other very complex operations. This will cause your AVX2 "optimized" code to run slower than the code the compiler generated at first.

What's the "right" way in this context? Store the data as vector = [rho, rho, rho, rho][rho*u, rho*u, rho*u, rho*u][rho*v, rho*v, rho*v, rho*v][rho*E, rho*E, rho*E, rho*E] and access it as arrays U1[i], U2[i], U3[i], U4[i], etc., where you use pointers to set the first element of each of those arrays. (A concrete sketch of the two layouts follows at the end of this post.) I have done this, and every loop is vectorized by the compiler to use AVX2 instructions! This is 4X faster than the serial code for double precision data, and 8X faster for single precision data. That's because YMM registers are 256 bits wide and can fit 4 double or 8 single precision values respectively, and then every vector operation will add/multiply/subtract/divide those 4 or 8 numbers at once, in a single instruction!

If your code isn't using SIMD vectorization, you're wasting performance. Sorry, that's mathematically proven.

PART 2 : How do we compile and study the assembly code... coming soon.. |
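To make the two layouts concrete, here's a minimal sketch (the names are mine, for illustration only, not from any particular solver): Code:
#include <cstddef>
#include <vector>

// Interleaved layout: [rho, rho*u, rho*v, rho*E][rho, rho*u, ...]
// Good locality, but conversions force the compiler toward XMM code.
std::vector<double> csv_interleaved;   // size = 4 * ncells

// Split component arrays (the layout recommended above): one
// contiguous block, with U1..U4 pointing at each component.
std::vector<double> block;             // size = 4 * ncells
double *U1, *U2, *U3, *U4;

void setup(std::size_t ncells)
{
    block.resize(4 * ncells);
    U1 = block.data() + 0 * ncells;    // all rho
    U2 = block.data() + 1 * ncells;    // all rho*u
    U3 = block.data() + 2 * ncells;    // all rho*v
    U4 = block.data() + 3 * ncells;    // all rho*E
}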
|
January 14, 2021, 17:10 |
|
#2 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
PART 2 : How do we compile and study the assembly code
I hope some of you are interested in diving deep into the generated assembly code and optimizing our operations for maximum performance.

Let's look at the operation UN = U that is done at every timestep. We copy over the values inside U from the previous timestep and use them to solve for our current timestep. This is a very important operation, as it sits right in the middle of our solver loop, and I don't want it to be slow. Since it's a very simple operation, you'll be able to understand what's going on and focus on the method. Here's the C++ code: Code:
// Store solution from the previous time step for use in mstage iterations
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
  UN1_[i] = U1_[i]; // mark <- NOTE this string "mark"
  UN2_[i] = U2_[i]; // mark
  UN3_[i] = U3_[i]; // mark
  UN4_[i] = U4_[i]; // mark
}
Computers love 1D arrays and linear access. Give them some.

Now, before we go any further... you'll need some tools. TOOLS required : the g++ compiler, the objdump tool, and the grep tool, running under a GNU/Linux OS. We want to pass the -g flag to g++ to ensure that our source code and debug symbols are included in the compiled binary. This will allow us to understand the assembly code easily. Compile: Code:
$ g++ -g -O3 -fopenmp -mavx2 my_awesome_code.cpp
-fopenmp allows the OpenMP directives used in the C++ code, and -mavx2 allows the compiler to emit AVX2 instructions. Since I've used both, I included the flags here; they're not strictly required.

Our task is to analyze the machine code generated by the compiler. However, since we can't read machine code directly, we use a disassembler to turn it into assembly code. objdump is our disassembler of choice. Here's the command: Code:
$ objdump -SC -Mintel a.out | grep 'mark' -A30
-Mintel tells the disassembler to use Intel assembly syntax. Trust me, it's life saving: you don't want to read AT&T assembly syntax. a.out is our executable. This will print all of the annotated assembly code, roughly 9,000 to 9,000,000 lines depending on your project size. So we need something to make it easier to search through this blob of text and find our region of interest. That's why grep is used: grep 'mark' -A30 searches for the string "mark" (see the C++ code) and prints the 30 lines of assembly after each match. This lets us find the interesting assembly in an instant. Change "mark" to whatever you like. Here's the assembly code: Code:
objdump -SC -Mintel a.out | grep 'mark' -A30
UN1_[i] = U1_[i]; // mark
3259: c4 c1 7d 10 64 05 00 vmovupd ymm4,YMMWORD PTR [r13+rax*1+0x0]
3260: c4 c1 78 11 24 02 vmovups XMMWORD PTR [r10+rax*1],xmm4
3266: c4 c3 7d 19 64 02 10 vextractf128 XMMWORD PTR [r10+rax*1+0x10],ymm4,0x1
326d: 01
UN2_[i] = U2_[i]; // mark
326e: c4 c1 7d 10 14 04 vmovupd ymm2,YMMWORD PTR [r12+rax*1]
3274: c4 c1 78 11 14 01 vmovups XMMWORD PTR [r9+rax*1],xmm2
327a: c4 c3 7d 19 54 01 10 vextractf128 XMMWORD PTR [r9+rax*1+0x10],ymm2,0x1
3281: 01
UN3_[i] = U3_[i]; // mark
3282: c5 fd 10 1c 03 vmovupd ymm3,YMMWORD PTR [rbx+rax*1]
3287: c4 c1 78 11 1c 00 vmovups XMMWORD PTR [r8+rax*1],xmm3
328d: c4 c3 7d 19 5c 00 10 vextractf128 XMMWORD PTR [r8+rax*1+0x10],ymm3,0x1
3294: 01
UN4_[i] = U4_[i]; // mark
3295: c4 c1 7d 10 24 03 vmovupd ymm4,YMMWORD PTR [r11+rax*1]
329b: c5 f8 11 24 07 vmovups XMMWORD PTR [rdi+rax*1],xmm4
32a0: c4 e3 7d 19 64 07 10 vextractf128 XMMWORD PTR [rdi+rax*1+0x10],ymm4,0x1
32a7: 01
32a8: 48 83 c0 20 add rax,0x20
32ac: 48 39 85 68 fd ff ff cmp QWORD PTR [rbp-0x298],rax
32b3: 75 a4 jne 3259 <main+0xab9>
32b5: 39 d6 cmp esi,edx
32b7: 74 3a je 32f3 <main+0xb53>
32b9: 48 63 c6 movsxd rax,esi
UN1_[i] = U1_[i]; // mark
32bc: c4 c1 7b 10 44 c5 00 vmovsd xmm0,QWORD PTR [r13+rax*8+0x0]
32c3: c4 c1 7b 11 04 c2 vmovsd QWORD PTR [r10+rax*8],xmm0
UN2_[i] = U2_[i]; // mark
32c9: c4 c1 7b 10 04 c4 vmovsd xmm0,QWORD PTR [r12+rax*8]
32cf: c4 c1 7b 11 04 c1 vmovsd QWORD PTR [r9+rax*8],xmm0
UN3_[i] = U3_[i]; // mark
32d5: c5 fb 10 04 c3 vmovsd xmm0,QWORD PTR [rbx+rax*8]
32da: c4 c1 7b 11 04 c0 vmovsd QWORD PTR [r8+rax*8],xmm0
UN4_[i] = U4_[i]; // mark
32e0: c4 c1 7b 10 04 c3 vmovsd xmm0,QWORD PTR [r11+rax*8]
32e6: c5 fb 11 04 c7 vmovsd QWORD PTR [rdi+rax*8],xmm0
32eb: 48 83 c0 01 add rax,0x1
32ef: 39 c2 cmp edx,eax
32f1: 7f c9 jg 32bc <main+0xb1c>
Notice how we can see YMM registers right after the annotated C++ code. You can now tweak your code to optimize further if you want.

(EDIT : You'll notice that it's using VMOVUPS, where U stands for unaligned. I might want it to use VMOVAPS, where A stands for aligned. On older hardware, VMOVAPS is significantly faster. See the sketch at the end of this post.)

(EDIT : For clarification, you'll still need to profile your code. Just because it uses YMM registers doesn't mean it will be fast in every case. The case shown above is a sample, so profile your own code. In many cases where you don't have large arrays of data, serial code will be faster, since CPUs can do roughly 10^8 (IIRC) operations in a second. If your arrays are very small, you will want to stick with serial code.)

Notice how we can see XMM registers at the bottom. Those are there as a safety feature. SIMD vector instructions only work on vectors whose lengths are multiples of 4, 8, 16, etc. This means that in some arrays there will be leftover elements. Those elements also need to be calculated, and that's what the XMM code handles.

( EDIT : The previously shown code took 10.0 seconds for 100,000 iterations in both serial and SIMD vectorized execution. So where did our 4X or 8X boost go? Remember this golden tidbit: "SIMD vectorization is used to do a single operation over a large amount of data." In the code shown above we weren't really doing a single operation, we were doing four operations. My optimization was to split that up into 4 loops: Code:
// Store solution from the previous time step for use in mstage iterations
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
  UN1_[i] = U1_[i]; // mark
}

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
  UN2_[i] = U2_[i]; // mark
}

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
  UN3_[i] = U3_[i]; // mark
}

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{
  UN4_[i] = U4_[i]; // mark
}
Why was this version faster? The cache was going to be hot for this type of linear access, and looking at the assembly we can see that the "serial" code at the end of the loop, which handles the leftover elements, was completely unrolled in this version. To be honest, I didn't know that this version would be faster, but now that I know how the assembly looks for fast loops, I know what to look out for in the future. In the previous version, the operations on the leftover elements using XMM registers used a conditional loop, the kind we see with -O2 optimization. The complete loop unrolling we see in this version is what -O3 optimization produces. This is how I was able to tell why this version was faster! I knew from previous experience how -O3 vs -O2 optimized code looks, and which is faster. Do you now see why I say that learning assembly will help you a lot? Code:
UN4_[i] = U4_[i]; // mark
39c2: c4 c1 7d 10 24 33 vmovupd ymm4,YMMWORD PTR [r11+rsi*1]
39c8: c4 c1 78 11 24 37 vmovups XMMWORD PTR [r15+rsi*1],xmm4
39ce: c4 c3 7d 19 64 37 10 vextractf128 XMMWORD PTR [r15+rsi*1+0x10],ymm4,0x1
39d5: 01
The "serial" code was actually SSE vectorized; I just called it "serial" for ease of understanding. AVX2 is 2X faster than SSE. )

Now you know how easy it is to analyze your assembly code. There's a lot more to learn, but the intention of this post was to show how easy the analysis is, and to start a discussion.

Thanks and regards ~sayan

Last edited by aerosayan; January 14, 2021 at 19:49. Reason: added optimization tips |
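On the VMOVUPS vs VMOVAPS edit above: the compiler can only emit aligned moves when it knows the data is aligned. Here's a minimal sketch of one way to get 32-byte aligned storage, matching the width of a YMM register (this is my illustration using C++17 std::aligned_alloc, not code from the solver above): Code:
#include <cstddef>
#include <cstdlib>

double* alloc_aligned_doubles(std::size_t n)
{
    // std::aligned_alloc (C++17) requires the size to be a multiple of
    // the alignment, so round the byte count up to a multiple of 32.
    std::size_t bytes = (n * sizeof(double) + 31) / 32 * 32;
    return static_cast<double*>(std::aligned_alloc(32, bytes));
}
// Free with std::free(). Telling the compiler about the alignment,
// e.g. with GCC's __builtin_assume_aligned, is what lets it emit
// aligned moves like VMOVAPS.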
|
January 15, 2021, 06:06 |
|
#3 |
Senior Member
|
Just a disclaimer for myself... of course, if I could, I would like to be good at everything (yet, probably, assembly would still not make it into the top 5 in this case). But available time and brainpower are limited, and typically you invest them where better returns are expected.
Of course, in theory, assembly is almost fundamental in programming. But if I barely understand C++, I'm not going to look at assembly with any great return. In the time I would spend optimizing a certain loop, I could implement, say, a new turbulence model in a code, and I would perceive that as something with a better ROI. You know, we are who we are. However, keep this rolling: for the exact same reasons above, it is going to help a lot of us |
|
January 15, 2021, 06:44 |
|
#4 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
I agree with many of the things you said. However, I respectfully disagree with the statement about implementing a new turbulence model instead of optimizing loops. Maybe that was just a simple example, but I would rather optimize the loop to run as fast as possible, then go for the turbulence model, if my future work depends on that simple loop. Why would I want a shiny new turbulence model when I know that my code will still be slow, and I need to fix that before it becomes unfixable? "Do one thing, but do it well." That's been my motto.

The UN=U operation is a very simple example, but a very crucial one. This operation happens every iteration. I'm not sure the solution I provided here is even the best one. I could in theory run a single loop from i=0 to i=nrows*ncols*4-1, since all 4 arrays are allocated in a single contiguous memory block. That would be even faster. (A sketch of this fused loop follows at the end of this post.)

(EDIT : I did the tests. I converted the whole thing into a single loop. The SSE vectorized code took 15 seconds, while the AVX2 vectorized code only took 6.13 seconds!!! A 2X improvement over SSE. That's what we expected from this operation!)

(EDIT : This was done with -O2 optimization, since many people still like to use it. With -O3 optimization, both get compiled to the AVX2 vectorized form and take 6.19 seconds!)

The important parts of C++ aren't difficult; the new modern things are. We shouldn't even be using most of the new modern things. If someone knows basic C, they can do just fine. OOP is overrated and slow as heck. I'm a full time C++ dev, and I hate overuse of OOP with a burning passion. Look up Orthodox C++. I don't agree with everything said in the article (like only using printf for printing), but it and the links provided tell you the most important things necessary for C++ to run fast. Surprise: it's mostly C-like code. Overuse of modern OOP is a horrible decision. https://gist.github.com/bkaradzic/2e39896bc7d8c34e042b

OOP is long dead, long live Data Oriented Design. https://youtu.be/yy8jQgmhbAU

I'm sure you already know about CPU caches and other things since you have so much experience, but in case someone else in the forum wants to learn about them, this video by Scott Meyers is absolutely critical. https://youtu.be/WDIkqP4JbkE

Learning assembly coding is actually simple. I know this stuff because I was interested in OS development and made a small kernel. If you want to learn assembly coding easily, this 16 video playlist by Creel would be more than enough to get you started. https://www.youtube.com/watch?v=rxsB...tFT1Rmh3uJp7kA

I know many will disagree with worrying too much about the generated assembly, but as I showed above, learning this gives me immense freedom and confidence to write my code knowing that it will perform well. I still have a lot to learn, but learning assembly has been one of my best decisions. |
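To make the fused-loop idea above concrete, here's a minimal sketch (my illustration; UN_ and U_ are assumed to point at the start of each contiguous 4*ncells block, matching the layout described earlier): Code:
// Hedged sketch: because the four component arrays sit back-to-back in
// one allocation, the four copy loops collapse into a single linear copy.
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{
  UN_[i] = U_[i];
}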
||
January 15, 2021, 07:47 |
|
#5 | ||
Senior Member
|
I don't want to spoil this thread with additional personal considerations (instead, I invite you to keep it rolling with everything you feel is useful for us to understand the matter), but:
Quote:
Still, if I have to give a more specific example in CFD, MPI is something you should specifically favor over assembly. It makes the difference between running efficiently on any known system and, basically, just on your PC. Quote:
See, there is indeed a very specific reason for you to be into that. Not that making a good CFD code wouldn't be enough of a reason, but you still have a very specific background that most of us really don't have. |
|||
January 15, 2021, 08:20 |
|
#6 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
PART 3 : How to measure performance?
You'll definitely want to measure the performance of your code. Even though the assembly may look fine, the code may still run slower, as we saw above. There's nothing better than actually profiling your code. You could use many tools for the job; I generally use the following: perf, valgrind, google-benchmark.

perf - A very important tool used by Linux kernel devs to measure the performance of their low latency code. It's a little difficult to use at first, but it's extremely fast. Look at the image attached: perf report highlights which assembly instruction is consuming what % of our execution time. This is a blessing. It becomes difficult to understand sometimes, but if you know perf well, you are going to have a good time. Installing perf can be difficult, as it depends on the kernel. Expect a difficulty curve when trying to install and use it for the first time. Here's a basic intro : https://youtu.be/M6ldFtwWup0 Here's another basic intro by Chandler Carruth : https://youtu.be/nXaxk27zwlk Chandler goes into the nitty gritty details that you might need to understand perf better.

valgrind - Great tool for finding memory leak bugs, and its submodule cachegrind is extremely good for performance analysis. The problem is that valgrind is extremely slow. So, if you want to measure performance quickly using valgrind and cachegrind, you'll have to annotate in the code where you want to measure; this is advanced stuff and I'll not be covering it for now. To look at the results produced by cachegrind, use the GUI tool kcachegrind. It's a really good GUI tool and makes the performance results easy to understand. kcachegrind : https://kcachegrind.github.io/html/Home.html

google-benchmark - A micro-benchmark library in which you can test how fast certain sections of your code run. Say you want to decide whether to use a tree or an array for an operation: you write a simple proof of concept and test it in google-benchmark to see which runs faster. Since this is a micro-benchmarking library, you can't test your whole code's performance in it, only small parts. See Chandler's video to understand some basics about google-benchmark. google-benchmark : https://github.com/google/benchmark |
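To give a taste of google-benchmark, here's a minimal sketch of benchmarking the UN = U copy from the earlier posts (the array sizes and names are mine; build and link against the library as described in its README): Code:
#include <benchmark/benchmark.h>
#include <vector>

// Micro-benchmark of the state-copy loop at a few problem sizes.
static void BM_CopyState(benchmark::State& state)
{
    const long ncells = static_cast<long>(state.range(0));
    std::vector<double> U(4 * ncells, 1.0), UN(4 * ncells);
    for (auto _ : state)
    {
        for (long i = 0; i < 4 * ncells; ++i)
            UN[i] = U[i];
        // Keep the compiler from optimizing the copy away.
        benchmark::DoNotOptimize(UN.data());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_CopyState)->Arg(1 << 10)->Arg(1 << 16)->Arg(1 << 20);
BENCHMARK_MAIN();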
|
January 15, 2021, 18:59 |
|
#7 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
PART - 4 : ALWAYS profile your code...
Use the perf tool to profile your code if you need to know how much performance cost each instruction has. This gives very good knowledge of exactly what's making your code slow. Look at the attached image: you can see that VDIVPD, i.e. a division operation, is hogging a major portion of the computation time.

However, this test is not enough. You'll want to run your code and see how long it actually runs. You'll want to make little changes to the code and see how long it takes to finish execution. This code ran in 69 seconds (averaged over 4 trials), doing 4*100,000*ncells*4 iterations in each trial: (<- I don't completely trust this result for now. But it looks promising.) Code:
// Iterate over all cells
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{ // <- loop i
  U_[i] = UN_[i]-(1.0/static_cast<fxp>(mstage-(m-1)))*dt/cvols_[i]*R_[i];
} // -> loop i
And here's the variant with the divisions reordered: Code:
// Iterate over all cells
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{ // <- loop i
  U_[i] = UN_[i]-dt/cvols_[i]/static_cast<fxp>(mstage-(m-1))*R_[i];
} // -> loop i
|
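Since both variants above divide inside the loop, here's a hedged sketch of a third variant one could try: hoist the stage-constant factor out of the loop and use precomputed reciprocal volumes rcvols_ (which appear in a later post), so the loop body has no division at all. This is my illustration, not a measured result: Code:
// Assumed: rcvols_[i] == 1.0/cvols_[i], computed once outside the
// solver loop. stage_fac folds dt and the stage coefficient together.
fxp const stage_fac = dt/static_cast<fxp>(mstage-(m-1));
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{ // <- loop i
  U_[i] = UN_[i]-stage_fac*rcvols_[i]*R_[i];
} // -> loop i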
|
April 11, 2021, 00:45 |
|
#8 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Part 5 : Finding where you are in the code
I have previously explained how to find where you are in the code, using a marker and the grep tool to search through objdump output. All of that is correct, but the compiler has a habit of reordering operations to ensure better performance, which makes the assembly a little more difficult to read. Maybe the section of code you're investigating is so performance critical that you can't afford even a single unnecessary addition, inside or outside the loop, so you can't cleanly mark the end of the loop. In those cases, you can use the assembly NOP instruction to identify where you are. NOP essentially tells the CPU to do nothing at all, so it has almost no performance cost. Considering the code given below, we want to find out where the loop ends in the assembly. Right after the end of the loop, we will see the NOP machine instruction that we inserted in the C++ code using asm("nop") Code:
// Setup the CSV
//
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{
  U1_[i] = UN1_[i]; // xdawg
}

asm("nop");
Code:
U1_[i] = UN1_[i]; // xdawg 8488: c5 f8 10 24 03 vmovups xmm4,XMMWORD PTR [rbx+rax*1] 848d: c4 e3 5d 18 44 03 10 vinsertf128 ymm0,ymm4,XMMWORD PTR [rbx+rax*1+0x10],0x1 8494: 01 8495: c4 c1 78 11 04 04 vmovups XMMWORD PTR [r12+rax*1],xmm0 849b: c4 c3 7d 19 44 04 10 vextractf128 XMMWORD PTR [r12+rax*1+0x10],ymm0,0x1 84a2: 01 84a3: 48 83 c0 20 add rax,0x20 84a7: 48 39 c2 cmp rdx,rax 84aa: 75 dc jne 8488 <LOTS OF CODE HERE ... WE JUST REMOVED THEM FOR CLARITY> 8554: c5 f8 77 vzeroupper <- HERE'S OUR END OF LOOP asm("nop"); 8557: 90 nop <- HERE's OUR NOP INSTRUCTION |
|
April 11, 2021, 08:16 |
|
#9 | |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,291
Rep Power: 35 |
Quote:
My guess is that in the first code, (1/something) is computed first, which may be done by a specialized algorithm. In the second code, dt/cvols is calculated, which is not as fast as 1/something. |
||
May 16, 2021, 18:26 |
|
#10 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Part 6 : Always freakin' profile your code for cache misses
Coding in C++ has amazing advantages, except for all of the curses required to make it work. Cache misses suck. Instruction cache misses aren't as much of a problem as data cache misses, and among data cache misses, L3 cache misses cause horrible slowdowns.

If you're starting to optimize your code, profiling with a simple profiler like perf is not enough. Perf is a great tool, but in many cases when writing numerical codes, you'll also want a cache performance profile. The best tool for this (in my own experience) is cachegrind, and to see the results you use the kcachegrind GUI tool.

In this part, I will share how profiling my code using cachegrind reduced L3 cache misses from 114,075,777 to 868,001. The numbers by themselves don't mean much, as everything is relative; what matters is whether they increase or decrease. If they decrease, our performance is improving.

My faulty code was the main solution update loop: Code:
#pragma omp simd linear(i:1)
for(i=0; i<ncells*4; ++i)
{
  U_[i] = UN_[i] + alpha*rcvols_[i]*R_[i];
}
Unfortunately, this is trash code. Utter garbage, beyond any salvage. And it wasn't removed earlier, because I didn't profile the cache misses it produced, especially the L3 cache misses. Here's the result from cachegrind. Code:
==36349==
==36349== I   refs:      2,097,055,417
==36349== I1  misses:        2,384,958
==36349== LLi misses:            3,215
==36349== I1  miss rate:          0.11%
==36349== LLi miss rate:          0.00%
==36349==
==36349== D   refs:      1,004,739,022  (685,037,493 rd + 319,701,529 wr)
==36349== D1  misses:      135,452,262  ( 97,769,886 rd +  37,682,376 wr)
==36349== LLd misses:      114,075,777  ( 86,001,879 rd +  28,073,898 wr)
==36349== D1  miss rate:          13.5% (       14.3%   +        11.8%  )
==36349== LLd miss rate:          11.4% (       12.6%   +         8.8%  )
==36349==
==36349== LL refs:         137,837,220  (100,154,844 rd +  37,682,376 wr)
==36349== LL misses:       114,078,992  ( 86,005,094 rd +  28,073,898 wr)
==36349== LL miss rate:            3.7% (        3.1%   +         8.8%  )
After noticing this behavior, I used kcachegrind to figure out where the slowdown was. As shown in cache-1.jpg, it's the solution update loop. Doing some manual testing, I figured out that the performance degradation was due to using ncells*4. Apparently, for some unknown reason, the data wants to be accessed in groups of ncells. I don't know why this is the case. This is why we can see that around 81 million L3 cache misses happened in that loop. Completely unacceptable. So, I rewrote the code. Code:
#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{ // <- loop i
  U1_[i] = UN1_[i] + alpha*rcvols_[i]*R1_[i];
} // -> loop i

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{ // <- loop i
  U2_[i] = UN2_[i] + alpha*rcvols_[i]*R2_[i];
} // -> loop i

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{ // <- loop i
  U3_[i] = UN3_[i] + alpha*rcvols_[i]*R3_[i];
} // -> loop i

#pragma omp simd linear(i:1)
for(i=0; i<ncells; ++i)
{ // <- loop i
  U4_[i] = UN4_[i] + alpha*rcvols_[i]*R4_[i];
} // -> loop i
Code:
==36599==
==36599== I   refs:      2,097,125,455
==36599== I1  misses:        2,384,984
==36599== LLi misses:            3,239
==36599== I1  miss rate:          0.11%
==36599== LLi miss rate:          0.00%
==36599==
==36599== D   refs:      1,019,785,554  (700,075,010 rd + 319,710,544 wr)
==36599== D1  misses:      135,472,760  ( 97,781,885 rd +  37,690,875 wr)
==36599== LLd misses:          868,001  (    516,180 rd +     351,821 wr)
==36599== D1  miss rate:          13.3% (       14.0%   +        11.8%  )
==36599== LLd miss rate:           0.1% (        0.1%   +         0.1%  )
==36599==
==36599== LL refs:         137,857,744  (100,166,869 rd +  37,690,875 wr)
==36599== LL misses:           871,240  (    519,419 rd +     351,821 wr)
==36599== LL miss rate:            0.0% (        0.0%   +         0.1%  )
I'm not sure if 90,000 L3 misses is good. I don't think it is, but I will investigate that at a later date. Probably my data alignment is all messed up or something. I don't know.

Last edited by aerosayan; May 16, 2021 at 18:35. Reason: changed 0.3% to 0.1% |
|
May 16, 2021, 23:47 |
|
#11 |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,291
Rep Power: 35 |
I had a calculation affected by this issue recently, where the 8 processor run did the job in 12 seconds while the 1 processor run took 300 seconds. It turned out that one loop had too much going on.
Now slowly breaking down things in smaller groups :-D |
|
May 17, 2021, 06:11 |
|
#12 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Why does that happen? I have to look into it, but I don't know why the single threaded long loop performed poorly. It currently makes no sense to me. I always thought CPUs like big arrays; that's what SIMD tutorials recommend, and it technically makes sense. This is a mystery to me. EDIT : Are you using the __restrict__ keyword? (a small sketch of what I mean follows below) |
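For anyone following along, here's a minimal sketch of what the __restrict__ question refers to (my example, not Arjun's code). Without it, the compiler must assume the pointers may alias, and can fall back to scalar code or runtime overlap checks: Code:
// __restrict__ (a GCC/Clang extension) promises the compiler that the
// two arrays never overlap, so it can vectorize without alias checks.
void update(double* __restrict__ u, const double* __restrict__ r,
            double alpha, long n)
{
    for (long i = 0; i < n; ++i)
        u[i] += alpha * r[i];
}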
|
June 13, 2021, 16:13 |
|
#13 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Part 7 : Use the Load, Calculate, Store method to write easily optimizable code
Older compilers sometimes have difficulty optimizing code when an equation is very long and complex. It makes sense to write code that will be optimized easily even by older compilers. The simplest method for me has been the Load, Calculate, Store method. It's as easy as doing your complex equation in multiple steps instead of writing one single big equation, like we did in this piece of code from my solver. Here we load the data, do some calculations, and store the result back. The compiler loves this! Code:
/// Calculate pressure for all cells
/// @param psi : pressure
/// @param rho : density
/// @param e   : energy
/// @param u   : x component of velocity
/// @param v   : y component of velocity
/// @param g   : gamma-1
///             > ensure it's gamma-1 not gamma
/// @param n   : total cells
///
template<typename fxp>
NCX_FORCE_INLINE
void calc_pressure(fxp * ncxrestrict psi, fxp * ncxrestrict rho,
                   fxp * ncxrestrict e,   fxp * ncxrestrict u,
                   fxp * ncxrestrict v,   fxp g, szt n)
{
  #pragma omp simd
  for(szt i=0; i<n; ++i)
  {
    //
    // > EQN :
    //
    // psi[i] = g*rho[i]*( e[i] - 0.5*(u[i]*u[i] + v[i]*v[i]) );
    //
    //          ^ ^        ^      ^---------------------^ part_1
    //          ^ ^        ^    ^-------------------------^ part_2
    //          ^ ^      ^--------------------------------^ part_3
    //        ^-------------------------------------------^ part_4
    //
    fxp const ii_u   = u[i];
    fxp const ii_v   = v[i];
    fxp const ii_e   = e[i];
    fxp const ii_rho = rho[i];

    fxp const uu     = ii_u * ii_u;
    fxp const vv     = ii_v * ii_v;

    fxp const part_1 = uu + vv;
    fxp const part_2 = static_cast<fxp>(0.5) * part_1;
    fxp const part_3 = ii_e - part_2;
    fxp const part_4 = g * ii_rho * part_3;

    psi[i] = part_4; //> mk-calc-pressure
  }
}
|
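For completeness, here's a hypothetical call site (the sizes and values are mine; note that g is passed as gamma-1, exactly as the doc-comment above warns): Code:
#include <cstddef>
#include <vector>

using szt = std::size_t; // assumed to match the szt used above

void demo(szt ncells)
{
    std::vector<double> psi(ncells), rho(ncells, 1.0), e(ncells, 2.5),
                        u(ncells, 0.1), v(ncells, 0.0);
    double const gamma = 1.4;
    // Pass gamma-1, not gamma.
    calc_pressure<double>(psi.data(), rho.data(), e.data(),
                          u.data(), v.data(), gamma - 1.0, ncells);
}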
|
June 14, 2021, 04:57 |
|
#14 | |
Senior Member
|
Quote:
|
||
June 14, 2021, 08:14 |
|
#15 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Not a vague question. My boss wrote his solvers and his whole CFD software to be compiled with GCC 4.8. Apparently users at NASA and in the US Army used older machines, which generally didn't have new operating systems and new compilers. 10 years earlier, he used to support even older versions of GCC. In those older versions (I think < GCC 4.8), you don't have the OpenMP simd directives to force the compiler to do aggressive vectorization. In such cases, you have to depend on the auto-vectorization provided by the compiler, so you have to write the code in such a way that it's easy for the compiler to auto-vectorize.

In game engine development for XBOX, PS2, PS3, etc., they also use older compilers (more specifically, they sometimes used Microsoft compilers, which are garbage), and they definitely don't use OpenMP. So even there, they write code using the Load, Calculate, Store method to allow auto-vectorization.

The equation psi[i] = g*rho[i]*... gets vectorized even if I don't break it up into parts, because the OpenMP simd directive forces the compiler to do aggressive vectorization on that equation. But in case I ever need to support older compilers that don't have OpenMP SIMD directives, I would like it to be auto-vectorized. Obviously, this is more noticeable in larger equations that span multiple lines.

I know there's an overhead to writing out the code in multiple parts, but I still like to do it, as it simplifies the code structure for me, and most importantly, I can understand the disassembly very easily, since every part of the equation is on a separate line.

It might not be 100% necessary for Fortran if you use Intel compilers, since they do aggressive optimizations. However, I don't trust gfortran to always produce optimized code. GCC on the other hand is more mature, thus I prefer to use this technique in C/C++ codes.

I have also used the const keyword to tell the compiler that the values of the different parts won't change once initialized, so it can do further aggressive optimizations. To add some further optimization, I use the ncxrestrict macro, which is just #define ncxrestrict __restrict__, since C/C++ codes generally suffer from aliasing issues. Thus the C++ code should be as fast as Fortran-77. I need to run tests someday to verify. |
||
June 14, 2021, 09:02 |
|
#16 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
I mentioned previously that writing the equation out in multiple parts makes the equation easier to understand. Even if you're not extremely skilled at reading assembly code, you can still understand it if you write out the equation in multiple parts.
As shown in this section of disassembled code, the annotated source easily shows you exactly which line produces which assembly instructions. This is very good, since you can understand it knowing just the basics of assembly instructions like ADD, SUB, MUL, DIV, etc. Code:
fxp const ii_u = u[i]; 3c70: c4 c1 79 10 34 12 vmovupd xmm6,XMMWORD PTR [r10+rdx*1] fxp const ii_v = v[i]; 3c76: c4 c1 79 10 3c 11 vmovupd xmm7,XMMWORD PTR [r9+rdx*1] fxp const ii_u = u[i]; 3c7c: c4 c3 4d 18 44 12 10 vinsertf128 ymm0,ymm6,XMMWORD PTR [r10+rdx*1+0x10],0x1 3c83: 01 fxp const ii_v = v[i]; 3c84: c4 c3 45 18 54 11 10 vinsertf128 ymm2,ymm7,XMMWORD PTR [r9+rdx*1+0x10],0x1 3c8b: 01 fxp const ii_e = e[i]; 3c8c: c4 c1 79 10 2c 10 vmovupd xmm5,XMMWORD PTR [r8+rdx*1] 3c92: c4 c3 55 18 4c 10 10 vinsertf128 ymm1,ymm5,XMMWORD PTR [r8+rdx*1+0x10],0x1 3c99: 01 fxp const uu = ii_u * ii_u; 3c9a: c5 fd 59 c0 vmulpd ymm0,ymm0,ymm0 fxp const ii_rho = rho[i]; 3c9e: c5 f9 10 34 16 vmovupd xmm6,XMMWORD PTR [rsi+rdx*1] fxp const vv = ii_v * ii_v; 3ca3: c5 ed 59 d2 vmulpd ymm2,ymm2,ymm2 fxp const part_1 = uu + vv; 3ca7: c5 fd 58 c2 vaddpd ymm0,ymm0,ymm2 fxp const part_2 = half * part_1; 3cab: c5 fd 59 c4 vmulpd ymm0,ymm0,ymm4 fxp const part_3 = ii_e - part_2; 3caf: c5 f5 5c c0 vsubpd ymm0,ymm1,ymm0 fxp const ii_rho = rho[i]; 3cb3: c4 e3 4d 18 4c 16 10 vinsertf128 ymm1,ymm6,XMMWORD PTR [rsi+rdx*1+0x10],0x1 3cba: 01 fxp const part_4 = g * ii_rho * part_3; 3cbb: c5 f5 59 cb vmulpd ymm1,ymm1,ymm3 3cbf: c5 fd 59 c1 vmulpd ymm0,ymm0,ymm1 psi[i] = part_4; //> mk-calc-pressure 3cc3: c5 f8 11 04 11 vmovups XMMWORD PTR [rcx+rdx*1],xmm0 3cc8: c4 e3 7d 19 44 11 10 vextractf128 XMMWORD PTR [rcx+rdx*1+0x10],ymm0,0x1 |
|
July 31, 2021, 05:38 |
|
#17 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Part 8 : easy method to extract assembly code
This is an important trick; it has saved me HOURS of development time. Let's say I want to see the code generated by some arbitrary piece of code. For this example, we will consider a call to printf. Well, how do we do that? The previous marker method I showed only works on a per line basis; it's not meant for extracting the code of a whole section. And during development, I tend to extract code from whole sections, not individual lines. Here comes the NOP instruction. Code:
// we'll use this
#define NOP __asm("nop");
//
NOP;
printf("hella fine\n");
NOP;
//
Now, we will use sed to extract all of the code between these two statements: Code:
$ objdump -w -SC --no-show-raw-insn -Mintel myExecutable | sed -n '/NOP/,/NOP/p'
NOP;
10de: nop
10df: lea    rsi,[rip+0xf95]        # 207b <_IO_stdin_used+0x7b>
10e6: mov    edi,0x1
10eb: xor    eax,eax
10ed: call   1090 <__printf_chk@plt>
printf("hella fine\n");
NOP;
I have previously shown how to use objdump; this is an intuitive extension of that method to get better results. For ease of use, I have made a bash function out of this. Code:
# objdump code between two markers
# my executable file name is in file run.txt, so I can use this command anywhere
# and I don't have to manually enter my executable file name
#
obseg()
{
    objdump -w -SC --no-show-raw-insn -Mintel `cat run.txt` | sed -n '/'"$1"'/,/'"$1"'/p'
}
Code:
$ obseg NOP
Hopefully, it was useful to some. |
|
Tags |
assembly code, introduction, tutorial |
|
|