CFD Online Logo CFD Online URL
www.cfd-online.com
[Sponsors]
Home > Forums > General Forums > Main CFD Forum

Fortran->Assembly : How to remove redundant non-vectorized code?

Register Blogs Community New Posts Updated Threads Search

Like Tree2Likes
  • 1 Post By aerosayan
  • 1 Post By pcosta

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old   November 17, 2020, 10:47
Default Fortran->Assembly : How to remove redundant non-vectorized code?
  #1
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
SUMMARY :


When gfortran vectorizes an equation such as x = y*z , the compiler will only use vector registers (such as YMM) and vector opcodes, if and only if the compiler knows that all of the data (x,y,z) can be loaded on said vector registers. However if the compiler doesn't know whether all of the data can be loaded onto the registers, it will generate redundant code such that the remaining data can be operated upon.


This redundant code will generally not use the highest available SIMD vector registers and will bloat the executable causing ICACHE misses or preventing function/subroutine inlining optimizations.


I would like to remove this redundant code since I know that all of my data (x,y,z) will fit perfectly on the YMM vector registers.


DETAILS :

Code compiled and run on Intel i3 CPU with AVX2 SIMD support.


I observed that for the sample operation x = y*z, with array lengths of 8 each, the compiler is able to generate only vectorized code for the equation x = y*z. This happens since 8 integers can be perfectly fit onto a AVX2 YMM register.

CODE :

Code:
! COMPILE : gfortran -g -O3 -mavx2 -march=skylake fortran_vectorization_1a.dis.F90
! DISASM  : objdump -d -SC -Mintel a.out

program main
implicit none
integer *4, dimension(8) :: x, y, z

! Initialize
x = 0
y = (/10, 11, 12, 13, 14, 15, 16, 17/)
z = (/1 ,  2,  3,  4,  5,  6,  7,  8/)

! Vector multiplication
x(:) = y*z

! Prevent dead code elimination
print *, x

end program
GODBOLT LINK : https://gcc.godbolt.org/z/GW9E5W

ASSEMBLY : Slightly cleaned up

Code:
00000000000011b0 <MAIN__>:
program main
    11b0:    55                       push   rbp
program main
    11b8:    48 89 e5                 mov    rbp,rsp
    11bb:    41 54                    push   r12
    11bd:    48 83 e4 e0              and    rsp,0xffffffffffffffe0
    11c1:    48 81 ec c0 02 00 00     sub    rsp,0x2c0
y = (/10, 11, 12, 13, 14, 15, 16, 17/)
    11c8:    c5 f9 6f 0d 90 0e 00     vmovdqa xmm1,XMMWORD PTR [rip+0xe90]        # 2060 <A.0.3879>
    11cf:    00 
    11d0:    c5 f9 6f 15 98 0e 00     vmovdqa xmm2,XMMWORD PTR [rip+0xe98]        # 2070 <A.0.3879+0x10>
    11d7:    00 
z = (/1 ,  2,  3,  4,  5,  6,  7,  8/)
    11d8:    c5 f9 6f 1d 60 0e 00     vmovdqa xmm3,XMMWORD PTR [rip+0xe60]        # 2040 <A.1.3881>
    11df:    00 
    11e0:    c5 f9 6f 25 68 0e 00     vmovdqa xmm4,XMMWORD PTR [rip+0xe68]        # 2050 <A.1.3881+0x10>
    11e7:    00 
y = (/10, 11, 12, 13, 14, 15, 16, 17/)
    11e8:    c5 f8 29 4c 24 20        vmovaps XMMWORD PTR [rsp+0x20],xmm1
    11ee:    c5 f8 29 54 24 30        vmovaps XMMWORD PTR [rsp+0x30],xmm2
x(:) = y*z
    11f4:    c5 fd 6f 6c 24 20        vmovdqa ymm5,YMMWORD PTR [rsp+0x20]
z = (/1 ,  2,  3,  4,  5,  6,  7,  8/)
    120a:    c5 f8 29 5c 24 40        vmovaps XMMWORD PTR [rsp+0x40],xmm3
    1210:    c5 f8 29 64 24 50        vmovaps XMMWORD PTR [rsp+0x50],xmm4
x(:) = y*z
    1223:    c4 e2 55 40 44 24 40     vpmulld ymm0,ymm5,YMMWORD PTR [rsp+0x40]
x(:) = y*z
    123d:    c5 fd 7f 04 24           vmovdqa YMMWORD PTR [rsp],ymm0
We can clearly see above that only the vector registers and vector operators were used for the x=y*z operation.


However when we increase the length of the arrays to 9, one integer can't be fit on a YMM vector register. So, the compiler has to generate extra non-vectorized code to take care of the last integer.


CODE :


Code:
! COMPILE : gfortran -g -O3 -mavx2 -march=skylake fortran_vectorization_1b.dis.F90
! DISASM  : objdump -d -SC -Mintel a.out

program main
implicit none
integer *4, dimension(9) :: x, y, z

! Initialize
x = 0
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
z = (/1 ,  2,  3,  4,  5,  6,  7,  8,  9/)

! Vector multiplication
x(:) = y*z

! Prevent dead code elimination
print *, x

end program
GODOT LINK : https://gcc.godbolt.org/z/bdrsa7


ASSEMBLY : Slightly cleaned up...


Code:
00000000000011b0 <MAIN__>:
program main
    11b0:    55                       push   rbp
    11b1:    48 89 e5                 mov    rbp,rsp
    11b4:    41 54                    push   r12
    11b6:    48 83 e4 e0              and    rsp,0xffffffffffffffe0
    11ba:    48 81 ec 00 03 00 00     sub    rsp,0x300
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
    11c1:    8b 15 d9 0e 00 00        mov    edx,DWORD PTR [rip+0xed9]        # 20a0 <A.0.3879+0x20>
z = (/1 ,  2,  3,  4,  5,  6,  7,  8,  9/)
    11c7:    8b 05 93 0e 00 00        mov    eax,DWORD PTR [rip+0xe93]        # 2060 <A.1.3881+0x20>
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
    11cd:    c5 f9 6f 0d ab 0e 00     vmovdqa xmm1,XMMWORD PTR [rip+0xeab]        # 2080 <A.0.3879>
    11d4:    00 
z = (/1 ,  2,  3,  4,  5,  6,  7,  8,  9/)
    11d5:    89 84 24 a0 00 00 00     mov    DWORD PTR [rsp+0xa0],eax
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
    11dc:    c5 f9 6f 15 ac 0e 00     vmovdqa xmm2,XMMWORD PTR [rip+0xeac]        # 2090 <A.0.3879+0x10>
    11e3:    00 
x(:) = y*z
    11e4:    0f af c2                 imul   eax,edx
z = (/1 ,  2,  3,  4,  5,  6,  7,  8,  9/)
    11e7:    c5 f9 6f 1d 51 0e 00     vmovdqa xmm3,XMMWORD PTR [rip+0xe51]        # 2040 <A.1.3881>
    11ee:    00 
    11ef:    c5 f9 6f 25 59 0e 00     vmovdqa xmm4,XMMWORD PTR [rip+0xe59]        # 2050 <A.1.3881+0x10>
    11f6:    00 
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
    11f7:    c5 f8 29 4c 24 40        vmovaps XMMWORD PTR [rsp+0x40],xmm1
    11fd:    c5 f8 29 54 24 50        vmovaps XMMWORD PTR [rsp+0x50],xmm2
x(:) = y*z
    1203:    89 44 24 20              mov    DWORD PTR [rsp+0x20],eax
    1207:    c5 fd 6f 6c 24 40        vmovdqa ymm5,YMMWORD PTR [rsp+0x40]
z = (/1 ,  2,  3,  4,  5,  6,  7,  8,  9/)
    1224:    c5 f8 29 9c 24 80 00     vmovaps XMMWORD PTR [rsp+0x80],xmm3
    122b:    00 00 
    122d:    c5 f8 29 a4 24 90 00     vmovaps XMMWORD PTR [rsp+0x90],xmm4
    1234:    00 00 
x(:) = y*z
    1243:    c4 e2 55 40 84 24 80     vpmulld ymm0,ymm5,YMMWORD PTR [rsp+0x80]
    124a:    00 00 00 
y = (/10, 11, 12, 13, 14, 15, 16, 17, 18/)
    124d:    89 54 24 60              mov    DWORD PTR [rsp+0x60],edx
x(:) = y*z
    1264:    c5 fd 7f 04 24           vmovdqa YMMWORD PTR [rsp],ymm0
We can clearly see above that the compiler had to use imul to take care of the integer multiplication that couldn't be fit onto the vector registers.


This is a good feature. However it becomes a problem when we use allocatable arrays. The compiler doesn't know the size of the array so it HAS TO generate both the vectorized and non-vectorized versions of the code. This can be seen in the next example.


CODE :


Code:
! COMPILE : gfortran -g -O3 -mavx2 -march=skylake fortran_vectorization_1c.dis.F90
! DISASM  : objdump -d -SC -Mintel --visualize-jumps a.out

program main
implicit none
integer *4, dimension(:), allocatable :: x, y, z
integer *4 :: n

read *, n

! Allocate
allocate(x(n))
allocate(y(n))
allocate(z(n))

! Initialize
x = 0
y(n/2) = n
z(n/2) = n/2

! Vector multiplication
x(:) = y*z

! Prevent dead code elimination
print *, x
print *, y
print *, z

! Scrub clean
deallocate(x,y,z)
end program
GODOT LINK : https://gcc.godbolt.org/z/1fdzK8


ASSEMBLY : The generated code is too large. I'm only posting the x=y*z part.



Code:
x(:) = y*z
    1498:    c4 c1 7e 6f 0c 07        vmovdqu ymm1,YMMWORD PTR [r15+rax*1]
    149e:    c4 c2 75 40 04 04        vpmulld ymm0,ymm1,YMMWORD PTR [r12+rax*1]
    14a4:    c4 c1 7e 7f 04 06        vmovdqu YMMWORD PTR [r14+rax*1],ymm0
    14aa:    48 83 c0 20              add    rax,0x20
    14ae:    48 39 c2                 cmp    rdx,rax
    14b1:    75 e5                    jne    1498 <MAIN__+0x278>
    14b3:    48 89 ca                 mov    rdx,rcx
    14b6:    48 83 e2 f8              and    rdx,0xfffffffffffffff8
    14ba:    48 8d 42 01              lea    rax,[rdx+0x1]
    14be:    48 39 ca                 cmp    rdx,rcx
    14c1:    0f 84 aa 01 00 00        je     1671 <MAIN__+0x451>
    14c7:    c5 f8 77                 vzeroupper 
    14ca:    48 8d 50 ff              lea    rdx,[rax-0x1]
    14ce:    41 8b 34 97              mov    esi,DWORD PTR [r15+rdx*4]
    14d2:    41 0f af 34 94           imul   esi,DWORD PTR [r12+rdx*4]
    14d7:    41 89 34 96              mov    DWORD PTR [r14+rdx*4],esi
    14db:    48 8d 50 01              lea    rdx,[rax+0x1]
    14df:    48 39 d1                 cmp    rcx,rdx
    14e2:    7c 7b                    jl     155f <MAIN__+0x33f>
    14e4:    41 8b 34 87              mov    esi,DWORD PTR [r15+rax*4]
    14e8:    41 0f af 34 84           imul   esi,DWORD PTR [r12+rax*4]
    14ed:    41 89 34 86              mov    DWORD PTR [r14+rax*4],esi
    14f1:    48 8d 70 02              lea    rsi,[rax+0x2]
    14f5:    48 39 f1                 cmp    rcx,rsi
    14f8:    7c 65                    jl     155f <MAIN__+0x33f>
    14fa:    41 8b 3c 97              mov    edi,DWORD PTR [r15+rdx*4]
    14fe:    41 0f af 3c 94           imul   edi,DWORD PTR [r12+rdx*4]
    1503:    41 89 3c 96              mov    DWORD PTR [r14+rdx*4],edi
    1507:    48 8d 50 03              lea    rdx,[rax+0x3]
    150b:    48 39 d1                 cmp    rcx,rdx
    150e:    7c 4f                    jl     155f <MAIN__+0x33f>
    1510:    41 8b 3c b7              mov    edi,DWORD PTR [r15+rsi*4]
    1514:    41 0f af 3c b4           imul   edi,DWORD PTR [r12+rsi*4]
    1519:    41 89 3c b6              mov    DWORD PTR [r14+rsi*4],edi
    151d:    48 8d 70 04              lea    rsi,[rax+0x4]
    1521:    48 39 f1                 cmp    rcx,rsi
    1524:    7c 39                    jl     155f <MAIN__+0x33f>
    1526:    41 8b 3c 97              mov    edi,DWORD PTR [r15+rdx*4]
    152a:    41 0f af 3c 94           imul   edi,DWORD PTR [r12+rdx*4]
    152f:    41 89 3c 96              mov    DWORD PTR [r14+rdx*4],edi
    1533:    48 8d 50 05              lea    rdx,[rax+0x5]
    1537:    48 39 d1                 cmp    rcx,rdx
    153a:    7c 23                    jl     155f <MAIN__+0x33f>
    153c:    41 8b 3c b7              mov    edi,DWORD PTR [r15+rsi*4]
    1540:    48 83 c0 06              add    rax,0x6
    1544:    41 0f af 3c b4           imul   edi,DWORD PTR [r12+rsi*4]
    1549:    41 89 3c b6              mov    DWORD PTR [r14+rsi*4],edi
    154d:    48 39 c1                 cmp    rcx,rax
    1550:    7c 0d                    jl     155f <MAIN__+0x33f>
    1552:    41 8b 04 94              mov    eax,DWORD PTR [r12+rdx*4]
    1556:    41 0f af 04 97           imul   eax,DWORD PTR [r15+rdx*4]
    155b:    41 89 04 96              mov    DWORD PTR [r14+rdx*4],eax
We can clearly see that at the top, the vector registers and operations were used. However, since the compiler HAS TO be sure that it will work on arrays of all lengths, the compiler will generate 7 more IMUL operations to take care of 7 possible integers that couldn't be fit onto a AVX2 register.


We can clearly see how much bloat is generated. We know that we will only pass in an array of integers which has length of 8 or multiples of 8 to this vector operation. So, we don't actually need the rest of the 7 redundant IMUL operations. However since the arrays x,y,z are allocatalbe, the compiler doesn't know this.

Is there anyway to tell the compiler to only generate vectorized code like we see in the first case?


CONCLUSION:

The three examples given above show that the compiler generates bloat when it doesn't know for sure that all of the data can be fit perfectly on AVX2 vector registers.


This seems like a good opportunity for optimization.


I would even call it necessary. Just look at the bloat generated for the following equation :


Code:
! Find the area of a triangle by using cross product
 vols = 0.5*sqrt((x1*(y2-y3))**2.0 + (x2*(y3-y1))**2.0 + (x3*(y1-y2))**2.0)

GODOT LINK : https://gcc.godbolt.org/z/oxcWzd



Thanks
praveen likes this.

Last edited by aerosayan; November 17, 2020 at 10:56. Reason: :) gets translated to smilies. Changed vols(:) to vols
aerosayan is offline   Reply With Quote

Old   November 17, 2020, 15:26
Default
  #2
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
So I was able to use GODBOLT to try my code in Intel's Fortran compiler (x86-64 IFORT 21.1.9) and was able to generate satisfactory assembly code.

I don't have access to an IFORT compiler, so I can't use it, I'm still looking for how I could achieve such results in GFORTRAN compiler.

Here's what I did :

Looked up this Intel article "Data Allignment to Assist Vectorization" by Rakesh Krishnaiyer.

URL : https://software.intel.com/content/www/us/en/develop/articles/data-alignment-to-assist-vectorization.html

ARCHIVE : https://archive.vn/RHEIe

I saw that we could give the compiler some information about the value N, so that it would be able to optimize the code based on it. As shown in the first example, we should get purely vectorized code when N%8==0. This is due to the YMM vector registers being able to hold 8 integers.

So, I used !DIR$ ASSUME_ALIGNED X: 32 to tell the compiler that we have X aligned on 32.

Then I used !DIR$ ASSUME (MOD(N,8) .EQ. 0) to tell the compiler that N will be a multiple of 8.

I kept rest of the code same.

Code:
program main
implicitnone
integer *4 :: n, i
integer *4, dimension(:), allocatable :: x, y, z

read *, n

! Allocate
allocate(x(n))
allocate(y(n))
allocate(z(n))

! Initialize
x = 0
y(n/2) = n
z(n/2) = n/2

! Vector multiplication
!DIR$ ASSUME_ALIGNED X: 32
!DIR$ ASSUME (MOD(N,8) .EQ. 0)
do i=1,n
    x(i) = y(i)*z(i)
enddo

! Prevent dead code elimination
print *, x
print *, y
print *, z

! Scrub clean
deallocate(x,y,z)
endprogram
The generated assembly code for the loop of x(i) = y(i)*z(i) was :


Code:
..B1.15:
vmovdqu  ymm0, YMMWORD PTR [r13+r12*4]
vpmulld  ymm1, ymm0, YMMWORD PTR [r14+r12*4]
vmovdqu  YMMWORD PTR [rax+r12*4], ymm1
add      r12, 8
cmp      r12, rcx
jb       ..B1.15

Here's the live GODBOLT link for the code shown above : https://gcc.godbolt.org/z/PPf1zE
Here's the live GODBOLT link for the code to calculate the volume of triangles using cross product : https://gcc.godbolt.org/z/Y5orh1





Now that I was able to optimize the code in IFORT, I know that it is a viable optimization. Unfortunately, I still don't know how to do it in GFORTRAN. Hopefully, someone will be able to help if they know more about GFORTRAN's and GCC's compiler directives.

I will keep looking too.

Last edited by aerosayan; November 17, 2020 at 15:46. Reason: Added godbolt link
aerosayan is offline   Reply With Quote

Old   November 18, 2020, 03:26
Default
  #3
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
I optimized the solution for GFORTRAN compiler, but I'm not too happy with it.


The solution is to let the compiler know that the loop ends at a multiple of 8. Here's how I did it.



Code:
! Vector multiplication
do i=1,AND(n,-8)
    x(i) = y(i)*z(i)
end do
The equation AND(N,-8) is mathematically equivalent to N - MOD(N,8).


This lets the compiler know that it doesn't need to generate the redundant non-vectorized code. However I'm not too happy with it since gfortran should have compiler directives to make this easier.


Anyway, here's the full code.


Code:
! COMPILE : gfortran -g -O3 -mavx2 -march=skylake fortran_vectorization_1g.dis.F90
! DISASM  : objdump -d -SC -Mintel --visualize-jumps a.out

program main
implicit none
integer *4, dimension(:), allocatable :: x, y, z
integer *4 :: n, i

read *, n

! Allocate
allocate(x(n))
allocate(y(n))
allocate(z(n))

! Initialize
do i=1,n
    x(i) = 0
    y(i) = i
    z(i) = 2*i
end do

! Loop truncation math
print *, "AND(N,-8)    :", AND(N,-8)
print *, "N - MOD(N,8) :", N - MOD(N,8)

! Vector multiplication
do i=1,AND(n,-8)
    x(i) = y(i)*z(i)
end do


! Prevent dead code elimination
print *, "RESULT       :"
do i=1,n
    print *, y(i), "         x", z(i) ,"=", x(i)
end do

! Scrub clean
deallocate(x,y,z)
end program
Here's the live GODBOLT link : https://gcc.godbolt.org/z/MW7cPG


Here's the assembly code of the required operation x(i) = y(i)*z(i) :


Code:
    x(i) = y(i)*z(i)
    15a8:    48 8b 4c 24 18           mov    rcx,QWORD PTR [rsp+0x18]
    15ad:    c5 fe 6f 24 01           vmovdqu ymm4,YMMWORD PTR [rcx+rax*1]
    15b2:    48 8b 4c 24 10           mov    rcx,QWORD PTR [rsp+0x10]
    15b7:    c5 fd 7f 64 24 20        vmovdqa YMMWORD PTR [rsp+0x20],ymm4
    15bd:    c4 e2 5d 40 04 01        vpmulld ymm0,ymm4,YMMWORD PTR [rcx+rax*1]
    15c3:    48 8b 4c 24 08           mov    rcx,QWORD PTR [rsp+0x8]
    15c8:    c5 fe 7f 04 01           vmovdqu YMMWORD PTR [rcx+rax*1],ymm0
    15cd:    48 83 c0 20              add    rax,0x20
    15d1:    48 39 d0                 cmp    rax,rdx
    15d4:    75 d2                    jne    15a8 <MAIN__+0x378>
We can clearly see that only vector operations were used. However I'm not too happy with the MOV operations in there. I'll look at it later.


Running the executable and entering N=8 gives the result:


Code:
8
 AND(N,-8)    :           8
 N - MOD(N,8) :           8
 RESULT       :
           1          x           2 =           2
           2          x           4 =           8
           3          x           6 =          18
           4          x           8 =          32
           5          x          10 =          50
           6          x          12 =          72
           7          x          14 =          98
           8          x          16 =         128
Running the executable and entering N=31 gives the result :


Code:
31
 AND(N,-8)    :          24
 N - MOD(N,8) :          24
 RESULT       :
           1          x           2 =           2
           2          x           4 =           8
           3          x           6 =          18
           4          x           8 =          32
           5          x          10 =          50
           6          x          12 =          72
           7          x          14 =          98
           8          x          16 =         128
           9          x          18 =         162
          10          x          20 =         200
          11          x          22 =         242
          12          x          24 =         288
          13          x          26 =         338
          14          x          28 =         392
          15          x          30 =         450
          16          x          32 =         512
          17          x          34 =         578
          18          x          36 =         648
          19          x          38 =         722
          20          x          40 =         800
          21          x          42 =         882
          22          x          44 =         968
          23          x          46 =        1058
          24          x          48 =        1152
          25          x          50 =           0
          26          x          52 =           0
          27          x          54 =           0
          28          x          56 =           0
          29          x          58 =           0
          30          x          60 =           0
          31          x          62 =           0
We can clearly see that AND(N,-8) == N - MOD(N,8). And we can clearly see that the vector operation happened only till the I==24.


This shows that the optimization was a success.

Also observed HUGE optimization for the code that calculates volume of triangles : https://gcc.godbolt.org/z/1or59d


However, I'm looking for input from other Fortran experts who know if we can achieve this result in an easier way as done in IFORT.

Last edited by aerosayan; November 18, 2020 at 03:42. Reason: Added godbolt link
aerosayan is offline   Reply With Quote

Old   November 18, 2020, 10:22
Default
  #4
Senior Member
 
sbaffini's Avatar
 
Paolo Lampitella
Join Date: Mar 2009
Location: Italy
Posts: 2,195
Blog Entries: 29
Rep Power: 39
sbaffini will become famous soon enoughsbaffini will become famous soon enough
Send a message via Skype™ to sbaffini
You are certainly learning something, and I am certainly not an expert, but never were any of my codes so good that the next thing to optimize was at this level.

May I ask what sort of improvements are you achieving for your test code on triangle areas and what, if any, in your actual code? Is it actually worth the effort?
sbaffini is offline   Reply With Quote

Old   November 18, 2020, 11:34
Default
  #5
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
Quote:
Originally Posted by sbaffini View Post
You are certainly learning something, and I am certainly not an expert, but never were any of my codes so good that the next thing to optimize was at this level.

May I ask what sort of improvements are you achieving for your test code on triangle areas and what, if any, in your actual code? Is it actually worth the effort?

I was just playing around when I observed this code bloat. I was implementing the triangle volume calculation for my CFD solver when I observed that the actual AVX2 vectorized assembly code (which uses YMM registers) was only around 14 lines while the generated SSE vectorized redundant code (which uses XMM registers) was around 7*14 lines. And the volume calculation equation wasn't even that complex.



Imagine how much amount of bloat would be present in a generic number crunching function/subroutine.


The method described here has tremendous advantages and applications:


- Almost all of my vectorized operations can now be supercharged by removing the redundant code. All I need to do, is to ensure that I fill up the YMM registers completely. For example : If there are 31 cells, I will add an extra dummy cell to make the cell list have 32 elements.



- This would allow less Instruction Cache misses and would allow the compiler to basically inline a lot of functions/subroutines, thus removing any cost of function/subroutine call. It would make sense to use such an optimization when we know that everything else will be accelerated.



- I just used the volume calculation as an example, since it was sufficiently complicated and it was the first equation that I tried to optimize. Since ALL our cell data (eg, rho, u, v, E etc) is organized in 1D arrays, we can SIMD vectorize most of our CFD solver code (I agree it is tough to do and I need more experience with it).


- The optimization is surprisingly portable. A YMM register would fit four 64 bit doubles. To get the same optimization, I just need to use AND(N,-4). For 32 bit floats, I would need AND(N,-8) since eight floats can be fit on a YMM register. When I need to run my code on a machine that supports AVX512, I would need to simply change the -8 and -4 part. This can be done using a #define macro. If the machine doesn't support AVX2, and only supports SSE, the code will still compile and run if we define the optimization as AND(N, MACRO_SIMD_MAX_PACKED_ELEMENTS_BLAH_BLAH).

Last edited by aerosayan; November 18, 2020 at 11:40. Reason: Added last line.
aerosayan is offline   Reply With Quote

Old   November 19, 2020, 13:24
Default
  #6
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
I did some testing to see if I could get the same optimization while continuing to use Fortran's amazing array slice operations.


Don't wanna write ugly code with xyz(i) 's everywhere.

Here's how to do it : We just need to explicitly define the range of operation i.e 1,AND(N,-8) and the compiler understands that it can optimize.

Code:
! Vector operartion
vols(1:AND(N,-8)) = 0.5*sqrt((x1*(y2-y3))**2.0 + (x2*(y3-y1))**2.0 + (x3*(y1-y2))**2.0)

GODBOLT LINK : https://gcc.godbolt.org/z/7PoKzf


Assembly code generated :
Code:
        and     ebx, -8
        jle     .L6
        movsx   rbx, ebx
        vmovaps ymm4, YMMWORD PTR .LC2[rip]
        sal     rbx, 2
        xor     eax, eax
.L9:
        vmovups ymm3, YMMWORD PTR [r9+rax]
        vmovups ymm1, YMMWORD PTR [r11+rax]
        vmovups ymm0, YMMWORD PTR [r15+rax]
        vsubps  ymm2, ymm3, ymm1
        vsubps  ymm1, ymm1, ymm0
        vsubps  ymm0, ymm0, ymm3
        vmulps  ymm2, ymm2, YMMWORD PTR [r13+0+rax]
        vmulps  ymm1, ymm1, YMMWORD PTR [r14+rax]
        vmulps  ymm0, ymm0, YMMWORD PTR [r10+rax]
        vmulps  ymm1, ymm1, ymm1
        vfmadd231ps     ymm1, ymm2, ymm2
        vfmadd132ps     ymm0, ymm1, ymm0
        vsqrtps ymm0, ymm0
        vmulps  ymm0, ymm0, ymm4
        vmovups YMMWORD PTR [r12+rax], ymm0
        add     rax, 32
        cmp     rax, rbx
        jne     .L9
aerosayan is offline   Reply With Quote

Old   November 20, 2020, 06:37
Default
  #7
New Member
 
Pedro Costa
Join Date: Jul 2017
Posts: 9
Rep Power: 9
pcosta is on a distinguished road
https://fortran-lang.discourse.group/


this forum may be better suited for a detailed discussion about the Fortran programming language.
sbaffini likes this.
pcosta is offline   Reply With Quote

Reply

Tags
assembly, fortran 90, solver deveopment


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Patankar CFD FORTRAN 90 Code FVM siddiquesil Main CFD Forum 19 October 25, 2022 06:21
Parallel Fortran 95 code using Openmp Shiv1510 Main CFD Forum 5 February 14, 2020 12:36
CGNS fortran code compiling with intel visual fortran xe dokeun Main CFD Forum 1 April 5, 2012 22:01
Comparison between C/C++ and Fortran? rick Main CFD Forum 45 September 6, 2011 01:52
CFD Code with Fortran murat Main CFD Forum 0 October 11, 2009 18:05


All times are GMT -4. The time now is 09:56.