CFD Online Logo CFD Online URL
www.cfd-online.com
[Sponsors]
Home > Forums > General Forums > Main CFD Forum

gfortran couldn't even vectorize a simple 2D matrix addition

Register Blogs Community New Posts Updated Threads Search

Like Tree14Likes
  • 1 Post By aerosayan
  • 1 Post By mermeruim
  • 1 Post By aerosayan
  • 2 Post By agd
  • 1 Post By sbaffini
  • 2 Post By aerosayan
  • 1 Post By digger
  • 2 Post By andy_
  • 3 Post By sbaffini

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old   January 13, 2021, 04:29
Default gfortran couldn't even vectorize a simple 2D matrix addition
  #1
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
I observed that my 2D solver code wasn't getting vectorized when compiling with gfortran.
So, I wrote an extremely simple code to test if the fault was mine.


Turns out, even the simple code (2D matrix addition) didn't vectorize, while 1D array addition vectorized well.

(EDIT : When I say it didn't vectorize, I mean that it didn't vectorize using the highest vector register my machine has i.e AVX2 YMM registers.)


GODBOLT LINK : https://godbolt.org/z/6Wx3ob

Code:
! COMPILE :  -O3 -fopenmp -mavx2 -march=native -ftree-vectorize -ftree-vectorizer-verbose=3 -fopt-info-vec-missed

program main
implicit none
real(8), dimension(200,150):: z, x, y
real(8), dimension(200)  :: c, a, b
integer(4) :: i,j

a(:) = 1.0d0
b(:) = 1.0d0
a(100) = 2.0d0
b(100) = 2.0d0

!$omp simd
do i=1,200
    c(i) = a(i) + b(i)
end do

x(:,:) = 1.0d0
y(:,:) = 2.0d0
x(50,75) = 2.0d0
y(50,75) = 4.0d0

!$omp simd collapse(2)
do j=1,150
do i=1,200
    z(i,j) = x(i,j) + y(i,j)
end do
end do

print *, c(1)  , c(100)   , c(200)
print *, z(1,1), z(50, 75), z(200, 150)
end program
You can see in the godbolt link I provided, Intel's fortran compiler (IFORT) is able to vectorize the matrix addition very well, while gfortran (even the latest version) fails miserably. I even used OpenMP SIMD directives. Although, I'm not an expert in OpenMP, I think my code is correct.


What's going on here?
Is it my fault (I hope I missed something) or is it the fault of gfortran (that would be horrible)?


My intuition tells me that it's the fault of gfortran. Since gfortran couldn't even vectorize x( : , : ) = 1.0d0 or y( : , : ) = 2.0d0. IFORT did. (EDIT : After in-depth review, it looks like this section was partially vectorized.)



Thanks and Regards
~sayan

Last edited by aerosayan; January 13, 2021 at 12:18. Reason: fixing wrong information
aerosayan is offline   Reply With Quote

Old   January 13, 2021, 07:34
Default
  #2
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
In C/C++, I was able vectorize operations over a 2D matrix by accessing it as a 1D array.
@sbafinni This is the method I was talking about during our last conversation.



GODBOLT LINK : https://godbolt.org/z/xqvoP3
sbaffini likes this.
aerosayan is offline   Reply With Quote

Old   January 13, 2021, 13:18
Default Possible Solution
  #3
New Member
 
Join Date: Jan 2021
Posts: 1
Rep Power: 0
mermeruim is on a distinguished road
Quote:
Originally Posted by aerosayan View Post
I observed that my 2D solver code wasn't getting vectorized when compiling with gfortran.
So, I wrote an extremely simple code to test if the fault was mine.


Turns out, even the simple code (2D matrix addition) didn't vectorize, while 1D array addition vectorized well.

(EDIT : When I say it didn't vectorize, I mean that it didn't vectorize using the highest vector register my machine has i.e AVX2 YMM registers.)


GODBOLT LINK : https://godbolt.org/z/6Wx3ob

Code:
! COMPILE :  -O3 -fopenmp -mavx2 -march=native -ftree-vectorize -ftree-vectorizer-verbose=3 -fopt-info-vec-missed

program main
implicit none
real(8), dimension(200,150):: z, x, y
real(8), dimension(200)  :: c, a, b
integer(4) :: i,j

a(:) = 1.0d0
b(:) = 1.0d0
a(100) = 2.0d0
b(100) = 2.0d0

!$omp simd
do i=1,200
    c(i) = a(i) + b(i)
end do

x(:,:) = 1.0d0
y(:,:) = 2.0d0
x(50,75) = 2.0d0
y(50,75) = 4.0d0

!$omp simd collapse(2)
do j=1,150
do i=1,200
    z(i,j) = x(i,j) + y(i,j)
end do
end do

print *, c(1)  , c(100)   , c(200)
print *, z(1,1), z(50, 75), z(200, 150)
end program
You can see in the godbolt link I provided, Intel's fortran compiler (IFORT) is able to vectorize the matrix addition very well, while gfortran (even the latest version) fails miserably. I even used OpenMP SIMD directives. Although, I'm not an expert in OpenMP, I think my code is correct.


What's going on here?
Is it my fault (I hope I missed something) or is it the fault of gfortran (that would be horrible)?


My intuition tells me that it's the fault of gfortran. Since gfortran couldn't even vectorize x( : , : ) = 1.0d0 or y( : , : ) = 2.0d0. IFORT did. (EDIT : After in-depth review, it looks like this section was partially vectorized.)



Thanks and Regards
~sayan
Hello Mr. aerosayan
I manage the compile the code the outputs are:
2,4,2;3,6,3
When I inspect the your code posted, I see directly that your code is not in UPPERCASE, try PROGRAM MAIN, and make sure you know the general syntax. I suggest using some "IDE" with debugger like geany can solve non-future problem, you may have.
Old version of Fortran must be in UPPERCASE.
aerosayan likes this.
mermeruim is offline   Reply With Quote

Old   January 13, 2021, 13:25
Default
  #4
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
Quote:
Originally Posted by mermeruim View Post
Hello Mr. aerosayan
I manage the compile the code the outputs are:
2,4,2;3,6,3
When I inspect the your code posted, I see directly that your code is not in UPPERCASE, try PROGRAM MAIN, and make sure you know the general syntax. I suggest using some "IDE" with debugger like geany can solve non-future problem, you may have.
Old version of Fortran must be in UPPERCASE.

Hello mermeuim, Thanks for your effort.
Unfortunately, your suggestion wasn't helpful to me this time. I hope you'll be able to help me in future.
My problem wasn't about compilation, it was about SIMD vectorization (advanced topic).
May I respectfully suggest letting others answer this question? I'm afraid you didn't understand the question and my problem.
Although kudos for effort. I hope to see you around here.
mermeruim likes this.
aerosayan is offline   Reply With Quote

Old   January 13, 2021, 15:12
Default
  #5
agd
Senior Member
 
Join Date: Jul 2009
Posts: 357
Rep Power: 19
agd is on a distinguished road
mermeruim,


At least as far back as F77 fortran has been insensitive to case.
aerosayan and mermeruim like this.
agd is offline   Reply With Quote

Old   January 14, 2021, 05:13
Default
  #6
Member
 
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7
gnwt4a is on a distinguished road
gfortran with intel chips is a sorry tale.


intel used to give linux-ifort for free but this is long gone.


if u are going to use intel chips for linear algebra and multithreading,
the mkl library is free for download (and unbeatable), and u can also
download for free the pthreads for omp use. u will have to register first.


also, there is an intel site that will give you all the appropriate flags and
libraries for linking mkl and pthreads with gfortran.
--
gnwt4a is offline   Reply With Quote

Old   January 14, 2021, 14:30
Default
  #7
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
Moving back to using only C++.
I have more experience and control in C++.

If gfortran can't handle such simple optimizations, not worth my effort.
aerosayan is offline   Reply With Quote

Old   January 14, 2021, 14:43
Default
  #8
Senior Member
 
sbaffini's Avatar
 
Paolo Lampitella
Join Date: Mar 2009
Location: Italy
Posts: 2,184
Blog Entries: 29
Rep Power: 39
sbaffini will become famous soon enoughsbaffini will become famous soon enough
Send a message via Skype™ to sbaffini
Not related directly, but intel now provides oneapi, which is free (https://www.scivision.dev/intel-oneapi-fortran-install/)
aerosayan likes this.
sbaffini is offline   Reply With Quote

Old   January 17, 2021, 18:08
Default
  #9
New Member
 
Join Date: Jan 2021
Posts: 18
Rep Power: 5
digger is on a distinguished road
Quote:
Originally Posted by aerosayan View Post
I observed that my 2D solver code wasn't getting vectorized when compiling with gfortran.
So, I wrote an extremely simple code to test if the fault was mine.


Turns out, even the simple code (2D matrix addition) didn't vectorize, while 1D array addition vectorized well.

(EDIT : When I say it didn't vectorize, I mean that it didn't vectorize using the highest vector register my machine has i.e AVX2 YMM registers.)


GODBOLT LINK : https://godbolt.org/z/6Wx3ob

Code:
! COMPILE :  -O3 -fopenmp -mavx2 -march=native -ftree-vectorize -ftree-vectorizer-verbose=3 -fopt-info-vec-missed

program main
implicit none
real(8), dimension(200,150):: z, x, y
real(8), dimension(200)  :: c, a, b
integer(4) :: i,j

a(:) = 1.0d0
b(:) = 1.0d0
a(100) = 2.0d0
b(100) = 2.0d0

!$omp simd
do i=1,200
    c(i) = a(i) + b(i)
end do

x(:,:) = 1.0d0
y(:,:) = 2.0d0
x(50,75) = 2.0d0
y(50,75) = 4.0d0

!$omp simd collapse(2)
do j=1,150
do i=1,200
    z(i,j) = x(i,j) + y(i,j)
end do
end do

print *, c(1)  , c(100)   , c(200)
print *, z(1,1), z(50, 75), z(200, 150)
end program
You can see in the godbolt link I provided, Intel's fortran compiler (IFORT) is able to vectorize the matrix addition very well, while gfortran (even the latest version) fails miserably. I even used OpenMP SIMD directives. Although, I'm not an expert in OpenMP, I think my code is correct.


What's going on here?
Is it my fault (I hope I missed something) or is it the fault of gfortran (that would be horrible)?


My intuition tells me that it's the fault of gfortran. Since gfortran couldn't even vectorize x( : , : ) = 1.0d0 or y( : , : ) = 2.0d0. IFORT did. (EDIT : After in-depth review, it looks like this section was partially vectorized.)



Thanks and Regards
~sayan



Dear aerosayan,

could you please briefly introduce the meaning of the vectorization for me? What I can understand from your code it seems to simply does the addition of some dummy numbers? It just because I am curious.

Thanks for your effort
digger is offline   Reply With Quote

Old   January 17, 2021, 23:31
Default
  #10
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
Quote:
Originally Posted by digger View Post
Dear aerosayan,

could you please briefly introduce the meaning of the vectorization for me? What I can understand from your code it seems to simply does the addition of some dummy numbers? It just because I am curious.

Thanks for your effort

SIMD (Single Instruction Multiple Data) vectorization is a method of thread level parallelization that allows you to complete more work in less time on a single CPU core.


for operation x[i] = y[i]+z[i], in normal serial code, only one y and z value will be added to fill up one x value.


for operation x[i] = y[i]+z[i], in vectorized serial code, 2 or 4 or 8 or 16 values of y and z will be added in a single instruction to fill up the 2 or 4 or 8 or 16 values of x.


This is possible since machine level mathematical instructions work on registers. The SIMD resgisters have enough memory to hold 2 or 4 or 8 or 16 values together.



I'll let someone important from Intel explain this : https://youtu.be/_OJmxi4-twY
aero_head and digger like this.
aerosayan is offline   Reply With Quote

Old   January 18, 2021, 01:10
Default
  #11
New Member
 
Join Date: Jan 2021
Posts: 18
Rep Power: 5
digger is on a distinguished road
Quote:
Originally Posted by aerosayan View Post
SIMD (Single Instruction Multiple Data) vectorization is a method of thread level parallelization that allows you to complete more work in less time on a single CPU core.


for operation x[i] = y[i]+z[i], in normal serial code, only one y and z value will be added to fill up one x value.


for operation x[i] = y[i]+z[i], in vectorized serial code, 2 or 4 or 8 or 16 values of y and z will be added in a single instruction to fill up the 2 or 4 or 8 or 16 values of x.


This is possible since machine level mathematical instructions work on registers. The SIMD resgisters have enough memory to hold 2 or 4 or 8 or 16 values together.



I'll let someone important from Intel explain this : https://youtu.be/_OJmxi4-twY



Thank you. Can I ask you one more thing? Do I understand correctly that what you mentioned is basically a feature of most parallel codes? I mean that any core has its own array x[i] and works on it and after operation the data can be gathered in some big array? Probably I still miss the key feature of vectorization.

I haven’t watched the video yet, maybe this will clarify my doubts.

From your experience, is it better to start out writing a code using some parallel libraries from the very beginning or is it possible to parallelize a code at the end?

Thank you once again.
digger is offline   Reply With Quote

Old   January 18, 2021, 01:34
Default
  #12
Senior Member
 
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8
aerosayan is on a distinguished road
Quote:
Originally Posted by digger View Post
From your experience, is it better to start out writing a code using some parallel libraries from the very beginning or is it possible to parallelize a code at the end?

Experts in this forum (though I can't say for all) will recommend to use MPI or OpenMP for most tasks. Use OpenMP if you're new to parallel programming, since it's easy to learn and use. Use MPI if you're a little bit more advanced and you need to run your code in HPC clusters. Most probably you won't need to use MPI if you're a beginner.


You don't need to follow what I'm doing, since it is more trouble than it's worth for many people, and you need to understand Assembly language and CPU architecture moderately well.

If you want to learn Assembly language or understand why I'm not following in the same path as the experts, i will include a link to one of my posts : Motivation and introduction for learning assembly coding





Quote:
Originally Posted by digger View Post
Do I understand correctly that what you mentioned is basically a feature of most parallel codes? I mean that any core has its own array x[i] and works on it and after operation the data can be gathered in some big array? Probably I still miss the key feature of vectorization.

I haven’t watched the video yet, maybe this will clarify my doubts.

SIMD vectorization is not a feature of most parallel codes. Most parallel codes will be written to scale well to thousands of HPC clusters using MPI and/or OpenMP.


SIMD vectorization is about getting the maximum performance out of a single CPU core.


SIMD vector registers are registers in the CPU that can hold say 8 integers at once.


For the operation x[i] = y[i] + z[i], lets say that 3 registers YMM1, YMM2, YMM3 were used.


What the CPU will do, is to load y[i=1,8] into YMM2 and z[i=1,8] into YMM3.


Then using a single add instruction, calculate YMM1 as YMM1 = YMM2 + YMM3


Since there are 8 integers in the YMM registers, all of the 8 integers will be added in a single CPU instruction!

This is called vectorization.

This essentially means that you get a 8X speed improvement than a non-vectorized serial code.
aerosayan is offline   Reply With Quote

Old   January 18, 2021, 06:08
Default
  #13
New Member
 
Join Date: Jan 2021
Posts: 18
Rep Power: 5
digger is on a distinguished road
Great explanation! Now I understand the main idea. I didn’t know that the processors have such abilities to speed up algebra and make several additions at once.



Thank you for your suggestion, I will focus on OMPI.
aerosayan likes this.
digger is offline   Reply With Quote

Old   January 18, 2021, 06:27
Default
  #14
Senior Member
 
andy
Join Date: May 2009
Posts: 292
Rep Power: 18
andy_ is on a distinguished road
Quote:
Originally Posted by aerosayan View Post
My intuition tells me that it's the fault of gfortran. Since gfortran couldn't even vectorize x( : , : ) = 1.0d0 or y( : , : ) = 2.0d0. IFORT did. (EDIT : After in-depth review, it looks like this section was partially vectorized.)
A native compiler can be expected to optimise better than a non-native one but mastering how to optimise effectively with a particular compiler can be expected to take some time and effort. For example, if you can optimise effectively using gcc then there is a fair chance the same can be achieved in gfortran given how they are related.

If one is prepared to devote significant time and effort to achieving high efficiency for matrix operations then the native high performance matrix library (e.g. BLAS) for the particular hardware is likely to be the best option. MKL was mentioned earlier but the best choice will tend to vary with hardware and circumstances. It also tends to sidestep C aliasing issues leading to effectively comparable performance with Fortran.

Again if one is prepared to put in significant effort to raise computational efficiency using 1D arrays is likely to offer the higher general performance. Need to make sure the optimiser knows the offset is a positive constant which may require a directive if, for example, it is passed as a variable in an argument list.

There are of course a range of other ways of coding that optimises performance depending on the form of hardware acceleration present. Back in the early 80s the industrial CFD code I was involved in developing used to have substantially different versions of of key routines adapted for sequential computers, IBMs with small vector registers or Crays with larger vector registers. Low level algorithms were changed as well restructuring do loops and caching. Simpler algorithms with less dependencies that converged more slowly per iteration but vectorised efficiently could become preferable when substantial numbers of vector registers were available. Involved a significant amount of development effort though which I suspect is rarely going to be cost effective these days.
sbaffini and aerosayan like this.
andy_ is offline   Reply With Quote

Old   January 18, 2021, 06:48
Default
  #15
Senior Member
 
Filippo Maria Denaro
Join Date: Jul 2010
Posts: 6,842
Rep Power: 73
FMDenaro has a spectacular aura aboutFMDenaro has a spectacular aura aboutFMDenaro has a spectacular aura about
Many many years ago I worked using Fortran on the Convex vector computer... well, I think I just wasted time supposing to get some speed-up ...


I will never be an expert in IT and HPC, but I think CFD is some different stuff ...
FMDenaro is offline   Reply With Quote

Old   January 18, 2021, 07:57
Default
  #16
Senior Member
 
sbaffini's Avatar
 
Paolo Lampitella
Join Date: Mar 2009
Location: Italy
Posts: 2,184
Blog Entries: 29
Rep Power: 39
sbaffini will become famous soon enoughsbaffini will become famous soon enough
Send a message via Skype™ to sbaffini
Quote:
Originally Posted by FMDenaro View Post
I will never be an expert in IT and HPC, but I think CFD is some different stuff ...
It is, probably just like driving a car is different from being a mechanic or an engineer or a car seller. But, the things are related, and how much you (should) dig into one or the other is probably dependent on your specific task/occupation.

I think there is a specific balance which is dictated by the most efficient way to obtain what you need. Some need to test a new physical model they developed, some others need to sell this model in their code, some others yet just need to use this model... some will have to answer phone calls on why this code that was payed a lot, when using that very model, crashes on the new fancy cluster in the department.
FMDenaro, aerosayan and aero_head like this.
sbaffini is offline   Reply With Quote

Reply

Tags
fortran, ifort, openmp


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
How do I make steady state SIMPLE solver to transient? cfdnewb123 Main CFD Forum 5 April 22, 2020 12:49
[snappyHexMesh] SHM Layer Addition Phase dickcruz OpenFOAM Meshing & Mesh Conversion 4 November 1, 2018 07:05
simple question about OpenFoam matrix class TurbJet OpenFOAM 2 August 10, 2018 20:53
What is this FOAM warning: Unknown matrix type combination means? chengdi OpenFOAM Running, Solving & CFD 1 May 13, 2017 21:34
SIMPLE algorithm in 3D cylindrical coordinates zouchu Main CFD Forum 1 January 20, 2014 17:02


All times are GMT -4. The time now is 23:59.