gfortran couldn't even vectorize a simple 2D matrix addition

aerosayan · January 13, 2021, 05:29

I observed that my 2D solver code wasn't getting vectorized when compiling with gfortran.
So, I wrote an extremely simple code to test if the fault was mine.

Turns out, even the simple code (2D matrix addition) didn't vectorize, while 1D array addition vectorized well.

(EDIT : When I say it didn't vectorize, I mean that it didn't vectorize using the highest vector register my machine has i.e AVX2 YMM registers.)

GODBOLT LINK : https://godbolt.org/z/6Wx3ob

Code:

! COMPILE :  -O3 -fopenmp -mavx2 -march=native -ftree-vectorize -ftree-vectorizer-verbose=3 -fopt-info-vec-missed

program main
implicit none
real(8), dimension(200,150):: z, x, y
real(8), dimension(200)  :: c, a, b
integer(4) :: i,j

a(:) = 1.0d0
b(:) = 1.0d0
a(100) = 2.0d0
b(100) = 2.0d0

!$omp simd
do i=1,200
    c(i) = a(i) + b(i)
end do

x(:,:) = 1.0d0
y(:,:) = 2.0d0
x(50,75) = 2.0d0
y(50,75) = 4.0d0

!$omp simd collapse(2)
do j=1,150
do i=1,200
    z(i,j) = x(i,j) + y(i,j)
end do
end do

print *, c(1)  , c(100)   , c(200)
print *, z(1,1), z(50, 75), z(200, 150)
end program

You can see in the godbolt link I provided, Intel's fortran compiler (IFORT) is able to vectorize the matrix addition very well, while gfortran (even the latest version) fails miserably. I even used OpenMP SIMD directives. Although, I'm not an expert in OpenMP, I think my code is correct.

What's going on here?
Is it my fault (I hope I missed something) or is it the fault of gfortran (that would be horrible)?

My intuition tells me that it's the fault of gfortran. Since gfortran couldn't even vectorize x( : , : ) = 1.0d0 or y( : , : ) = 2.0d0. IFORT did. (EDIT : After in-depth review, it looks like this section was partially vectorized.)

Thanks and Regards
~sayan

aerosayan · January 13, 2021, 08:34

In C/C++, I was able vectorize operations over a 2D matrix by accessing it as a 1D array.
@sbafinni This is the method I was talking about during our last conversation.

GODBOLT LINK : https://godbolt.org/z/xqvoP3

mermeruim · January 13, 2021, 14:18

Quote:

Originally Posted by aerosayan

I observed that my 2D solver code wasn't getting vectorized when compiling with gfortran.
So, I wrote an extremely simple code to test if the fault was mine.

Turns out, even the simple code (2D matrix addition) didn't vectorize, while 1D array addition vectorized well.

(EDIT : When I say it didn't vectorize, I mean that it didn't vectorize using the highest vector register my machine has i.e AVX2 YMM registers.)

GODBOLT LINK : https://godbolt.org/z/6Wx3ob

Code:

! COMPILE :  -O3 -fopenmp -mavx2 -march=native -ftree-vectorize -ftree-vectorizer-verbose=3 -fopt-info-vec-missed

program main
implicit none
real(8), dimension(200,150):: z, x, y
real(8), dimension(200)  :: c, a, b
integer(4) :: i,j

a(:) = 1.0d0
b(:) = 1.0d0
a(100) = 2.0d0
b(100) = 2.0d0

!$omp simd
do i=1,200
    c(i) = a(i) + b(i)
end do

x(:,:) = 1.0d0
y(:,:) = 2.0d0
x(50,75) = 2.0d0
y(50,75) = 4.0d0

!$omp simd collapse(2)
do j=1,150
do i=1,200
    z(i,j) = x(i,j) + y(i,j)
end do
end do

print *, c(1)  , c(100)   , c(200)
print *, z(1,1), z(50, 75), z(200, 150)
end program

You can see in the godbolt link I provided, Intel's fortran compiler (IFORT) is able to vectorize the matrix addition very well, while gfortran (even the latest version) fails miserably. I even used OpenMP SIMD directives. Although, I'm not an expert in OpenMP, I think my code is correct.

What's going on here?
Is it my fault (I hope I missed something) or is it the fault of gfortran (that would be horrible)?

My intuition tells me that it's the fault of gfortran. Since gfortran couldn't even vectorize x( : , : ) = 1.0d0 or y( : , : ) = 2.0d0. IFORT did. (EDIT : After in-depth review, it looks like this section was partially vectorized.)

Thanks and Regards
~sayan

Hello Mr. aerosayan
I manage the compile the code the outputs are:
2,4,2;3,6,3
When I inspect the your code posted, I see directly that your code is not in UPPERCASE, try PROGRAM MAIN, and make sure you know the general syntax. I suggest using some "IDE" with debugger like geany can solve non-future problem, you may have.
Old version of Fortran must be in UPPERCASE.

aerosayan · January 13, 2021, 14:25

Quote:

Originally Posted by mermeruim

Hello Mr. aerosayan
I manage the compile the code the outputs are:
2,4,2;3,6,3
When I inspect the your code posted, I see directly that your code is not in UPPERCASE, try PROGRAM MAIN, and make sure you know the general syntax. I suggest using some "IDE" with debugger like geany can solve non-future problem, you may have.
Old version of Fortran must be in UPPERCASE.

Hello mermeuim, Thanks for your effort.
Unfortunately, your suggestion wasn't helpful to me this time. I hope you'll be able to help me in future.

My problem wasn't about compilation, it was about SIMD vectorization (advanced topic).
May I respectfully suggest letting others answer this question? I'm afraid you didn't understand the question and my problem.

Although kudos for effort. I hope to see you around here.

agd · January 13, 2021, 16:12

mermeruim,

At least as far back as F77 fortran has been insensitive to case.

gnwt4a · January 14, 2021, 06:13

gfortran with intel chips is a sorry tale.

intel used to give linux-ifort for free but this is long gone.

if u are going to use intel chips for linear algebra and multithreading,
the mkl library is free for download (and unbeatable), and u can also
download for free the pthreads for omp use. u will have to register first.

also, there is an intel site that will give you all the appropriate flags and
libraries for linking mkl and pthreads with gfortran.
--

aerosayan · January 14, 2021, 15:30

Moving back to using only C++.
I have more experience and control in C++.

If gfortran can't handle such simple optimizations, not worth my effort.

sbaffini · January 14, 2021, 15:43

Not related directly, but intel now provides oneapi, which is free (https://www.scivision.dev/intel-oneapi-fortran-install/)

digger · January 17, 2021, 19:08

Quote:

Originally Posted by aerosayan

I observed that my 2D solver code wasn't getting vectorized when compiling with gfortran.
So, I wrote an extremely simple code to test if the fault was mine.

Turns out, even the simple code (2D matrix addition) didn't vectorize, while 1D array addition vectorized well.

(EDIT : When I say it didn't vectorize, I mean that it didn't vectorize using the highest vector register my machine has i.e AVX2 YMM registers.)

GODBOLT LINK : https://godbolt.org/z/6Wx3ob

Code:

! COMPILE :  -O3 -fopenmp -mavx2 -march=native -ftree-vectorize -ftree-vectorizer-verbose=3 -fopt-info-vec-missed

program main
implicit none
real(8), dimension(200,150):: z, x, y
real(8), dimension(200)  :: c, a, b
integer(4) :: i,j

a(:) = 1.0d0
b(:) = 1.0d0
a(100) = 2.0d0
b(100) = 2.0d0

!$omp simd
do i=1,200
    c(i) = a(i) + b(i)
end do

x(:,:) = 1.0d0
y(:,:) = 2.0d0
x(50,75) = 2.0d0
y(50,75) = 4.0d0

!$omp simd collapse(2)
do j=1,150
do i=1,200
    z(i,j) = x(i,j) + y(i,j)
end do
end do

print *, c(1)  , c(100)   , c(200)
print *, z(1,1), z(50, 75), z(200, 150)
end program

You can see in the godbolt link I provided, Intel's fortran compiler (IFORT) is able to vectorize the matrix addition very well, while gfortran (even the latest version) fails miserably. I even used OpenMP SIMD directives. Although, I'm not an expert in OpenMP, I think my code is correct.

What's going on here?
Is it my fault (I hope I missed something) or is it the fault of gfortran (that would be horrible)?

My intuition tells me that it's the fault of gfortran. Since gfortran couldn't even vectorize x( : , : ) = 1.0d0 or y( : , : ) = 2.0d0. IFORT did. (EDIT : After in-depth review, it looks like this section was partially vectorized.)

Thanks and Regards
~sayan

Dear aerosayan,

could you please briefly introduce the meaning of the vectorization for me? What I can understand from your code it seems to simply does the addition of some dummy numbers? It just because I am curious.

Thanks for your effort

aerosayan · January 18, 2021, 00:31

Quote:

Originally Posted by digger

Dear aerosayan,

could you please briefly introduce the meaning of the vectorization for me? What I can understand from your code it seems to simply does the addition of some dummy numbers? It just because I am curious.

Thanks for your effort

SIMD (Single Instruction Multiple Data) vectorization is a method of thread level parallelization that allows you to complete more work in less time on a single CPU core.

for operation x[i] = y[i]+z[i], in normal serial code, only one y and z value will be added to fill up one x value.

for operation x[i] = y[i]+z[i], in vectorized serial code, 2 or 4 or 8 or 16 values of y and z will be added in a single instruction to fill up the 2 or 4 or 8 or 16 values of x.

This is possible since machine level mathematical instructions work on registers. The SIMD resgisters have enough memory to hold 2 or 4 or 8 or 16 values together.

I'll let someone important from Intel explain this : https://youtu.be/_OJmxi4-twY

digger · January 18, 2021, 02:10

Quote:

Originally Posted by aerosayan

SIMD (Single Instruction Multiple Data) vectorization is a method of thread level parallelization that allows you to complete more work in less time on a single CPU core.

for operation x[i] = y[i]+z[i], in normal serial code, only one y and z value will be added to fill up one x value.

for operation x[i] = y[i]+z[i], in vectorized serial code, 2 or 4 or 8 or 16 values of y and z will be added in a single instruction to fill up the 2 or 4 or 8 or 16 values of x.

This is possible since machine level mathematical instructions work on registers. The SIMD resgisters have enough memory to hold 2 or 4 or 8 or 16 values together.

I'll let someone important from Intel explain this : https://youtu.be/_OJmxi4-twY

Thank you. Can I ask you one more thing? Do I understand correctly that what you mentioned is basically a feature of most parallel codes? I mean that any core has its own array x[i] and works on it and after operation the data can be gathered in some big array? Probably I still miss the key feature of vectorization.

I haven’t watched the video yet, maybe this will clarify my doubts.

From your experience, is it better to start out writing a code using some parallel libraries from the very beginning or is it possible to parallelize a code at the end?

Thank you once again.

aerosayan · January 18, 2021, 02:34

Quote:

Originally Posted by digger

From your experience, is it better to start out writing a code using some parallel libraries from the very beginning or is it possible to parallelize a code at the end?

Experts in this forum (though I can't say for all) will recommend to use MPI or OpenMP for most tasks. Use OpenMP if you're new to parallel programming, since it's easy to learn and use. Use MPI if you're a little bit more advanced and you need to run your code in HPC clusters. Most probably you won't need to use MPI if you're a beginner.

You don't need to follow what I'm doing, since it is more trouble than it's worth for many people, and you need to understand Assembly language and CPU architecture moderately well.

If you want to learn Assembly language or understand why I'm not following in the same path as the experts, i will include a link to one of my posts : Motivation and introduction for learning assembly coding

Quote:

Originally Posted by digger

Do I understand correctly that what you mentioned is basically a feature of most parallel codes? I mean that any core has its own array x[i] and works on it and after operation the data can be gathered in some big array? Probably I still miss the key feature of vectorization.

I haven’t watched the video yet, maybe this will clarify my doubts.

SIMD vectorization is not a feature of most parallel codes. Most parallel codes will be written to scale well to thousands of HPC clusters using MPI and/or OpenMP.

SIMD vectorization is about getting the maximum performance out of a single CPU core.

SIMD vector registers are registers in the CPU that can hold say 8 integers at once.

For the operation x[i] = y[i] + z[i], lets say that 3 registers YMM1, YMM2, YMM3 were used.

What the CPU will do, is to load y[i=1,8] into YMM2 and z[i=1,8] into YMM3.

Then using a single add instruction, calculate YMM1 as YMM1 = YMM2 + YMM3

Since there are 8 integers in the YMM registers, all of the 8 integers will be added in a single CPU instruction!

This is called vectorization.

This essentially means that you get a 8X speed improvement than a non-vectorized serial code.

digger · January 18, 2021, 07:08

Great explanation! Now I understand the main idea. I didn’t know that the processors have such abilities to speed up algebra and make several additions at once.

Thank you for your suggestion, I will focus on OMPI.

andy_ · January 18, 2021, 07:27

Quote:

Originally Posted by aerosayan

My intuition tells me that it's the fault of gfortran. Since gfortran couldn't even vectorize x( : , : ) = 1.0d0 or y( : , : ) = 2.0d0. IFORT did. (EDIT : After in-depth review, it looks like this section was partially vectorized.)

A native compiler can be expected to optimise better than a non-native one but mastering how to optimise effectively with a particular compiler can be expected to take some time and effort. For example, if you can optimise effectively using gcc then there is a fair chance the same can be achieved in gfortran given how they are related.

If one is prepared to devote significant time and effort to achieving high efficiency for matrix operations then the native high performance matrix library (e.g. BLAS) for the particular hardware is likely to be the best option. MKL was mentioned earlier but the best choice will tend to vary with hardware and circumstances. It also tends to sidestep C aliasing issues leading to effectively comparable performance with Fortran.

Again if one is prepared to put in significant effort to raise computational efficiency using 1D arrays is likely to offer the higher general performance. Need to make sure the optimiser knows the offset is a positive constant which may require a directive if, for example, it is passed as a variable in an argument list.

There are of course a range of other ways of coding that optimises performance depending on the form of hardware acceleration present. Back in the early 80s the industrial CFD code I was involved in developing used to have substantially different versions of of key routines adapted for sequential computers, IBMs with small vector registers or Crays with larger vector registers. Low level algorithms were changed as well restructuring do loops and caching. Simpler algorithms with less dependencies that converged more slowly per iteration but vectorised efficiently could become preferable when substantial numbers of vector registers were available. Involved a significant amount of development effort though which I suspect is rarely going to be cost effective these days.

FMDenaro · January 18, 2021, 07:48

Many many years ago I worked using Fortran on the Convex vector computer... well, I think I just wasted time supposing to get some speed-up ...

I will never be an expert in IT and HPC, but I think CFD is some different stuff ...

sbaffini · January 18, 2021, 08:57

Quote:

Originally Posted by FMDenaro

I will never be an expert in IT and HPC, but I think CFD is some different stuff ...

It is, probably just like driving a car is different from being a mechanic or an engineer or a car seller. But, the things are related, and how much you (should) dig into one or the other is probably dependent on your specific task/occupation.

I think there is a specific balance which is dictated by the most efficient way to obtain what you need. Some need to test a new physical model they developed, some others need to sell this model in their code, some others yet just need to use this model... some will have to answer phone calls on why this code that was payed a lot, when using that very model, crashes on the new fancy cluster in the department.

January 13, 2021, 05:29	gfortran couldn't even vectorize a simple 2D matrix addition	#1
aerosayan Senior Member Sayan Bhattacharjee Join Date: Mar 2020 Posts: 495 Rep Power: 8	I observed that my 2D solver code wasn't getting vectorized when compiling with gfortran. So, I wrote an extremely simple code to test if the fault was mine. Turns out, even the simple code (2D matrix addition) didn't vectorize, while 1D array addition vectorized well. (EDIT : When I say it didn't vectorize, I mean that it didn't vectorize using the highest vector register my machine has i.e AVX2 YMM registers.) GODBOLT LINK : https://godbolt.org/z/6Wx3ob Code: ! COMPILE : -O3 -fopenmp -mavx2 -march=native -ftree-vectorize -ftree-vectorizer-verbose=3 -fopt-info-vec-missed program main implicit none real(8), dimension(200,150):: z, x, y real(8), dimension(200) :: c, a, b integer(4) :: i,j a(:) = 1.0d0 b(:) = 1.0d0 a(100) = 2.0d0 b(100) = 2.0d0 !$omp simd do i=1,200 c(i) = a(i) + b(i) end do x(:,:) = 1.0d0 y(:,:) = 2.0d0 x(50,75) = 2.0d0 y(50,75) = 4.0d0 !$omp simd collapse(2) do j=1,150 do i=1,200 z(i,j) = x(i,j) + y(i,j) end do end do print , c(1) , c(100) , c(200) print , z(1,1), z(50, 75), z(200, 150) end program You can see in the godbolt link I provided, Intel's fortran compiler (IFORT) is able to vectorize the matrix addition very well, while gfortran (even the latest version) fails miserably. I even used OpenMP SIMD directives. Although, I'm not an expert in OpenMP, I think my code is correct. What's going on here? Is it my fault (I hope I missed something) or is it the fault of gfortran (that would be horrible)? My intuition tells me that it's the fault of gfortran. Since gfortran couldn't even vectorize x( : , : ) = 1.0d0 or y( : , : ) = 2.0d0. IFORT did. (EDIT : After in-depth review, it looks like this section was partially vectorized.) Thanks and Regards ~sayan Last edited by aerosayan; January 13, 2021 at 13:18. Reason: fixing wrong information

January 13, 2021, 08:34		#2
aerosayan Senior Member Sayan Bhattacharjee Join Date: Mar 2020 Posts: 495 Rep Power: 8	In C/C++, I was able vectorize operations over a 2D matrix by accessing it as a 1D array. @sbafinni This is the method I was talking about during our last conversation. GODBOLT LINK : https://godbolt.org/z/xqvoP3 sbaffini likes this.

January 13, 2021, 16:12		#5
agd Senior Member Join Date: Jul 2009 Posts: 358 Rep Power: 19	mermeruim, At least as far back as F77 fortran has been insensitive to case. aerosayan and mermeruim like this.

January 14, 2021, 15:43		#8
sbaffini Senior Member Paolo Lampitella Join Date: Mar 2009 Location: Italy Posts: 2,195 Blog Entries: 29 Rep Power: 39	Not related directly, but intel now provides oneapi, which is free (https://www.scivision.dev/intel-oneapi-fortran-install/) aerosayan likes this.

January 18, 2021, 07:08		#13
digger New Member Join Date: Jan 2021 Posts: 18 Rep Power: 5	Great explanation! Now I understand the main idea. I didn’t know that the processors have such abilities to speed up algebra and make several additions at once. Thank you for your suggestion, I will focus on OMPI. aerosayan likes this.

January 14, 2021, 06:13		#6
gnwt4a Member EM Join Date: Sep 2019 Posts: 59 Rep Power: 7	gfortran with intel chips is a sorry tale. intel used to give linux-ifort for free but this is long gone. if u are going to use intel chips for linear algebra and multithreading, the mkl library is free for download (and unbeatable), and u can also download for free the pthreads for omp use. u will have to register first. also, there is an intel site that will give you all the appropriate flags and libraries for linking mkl and pthreads with gfortran. --

January 14, 2021, 15:30		#7
aerosayan Senior Member Sayan Bhattacharjee Join Date: Mar 2020 Posts: 495 Rep Power: 8	Moving back to using only C++. I have more experience and control in C++. If gfortran can't handle such simple optimizations, not worth my effort.

January 18, 2021, 07:48		#15
FMDenaro Senior Member Filippo Maria Denaro Join Date: Jul 2010 Posts: 6,896 Rep Power: 73	Many many years ago I worked using Fortran on the Convex vector computer... well, I think I just wasted time supposing to get some speed-up ... I will never be an expert in IT and HPC, but I think CFD is some different stuff ...

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
How do I make steady state SIMPLE solver to transient?	cfdnewb123	Main CFD Forum	6	December 3, 2024 10:03
[snappyHexMesh] SHM Layer Addition Phase	dickcruz	OpenFOAM Meshing & Mesh Conversion	4	November 1, 2018 08:05
simple question about OpenFoam matrix class	TurbJet	OpenFOAM	2	August 10, 2018 21:53
What is this FOAM warning: Unknown matrix type combination means?	chengdi	OpenFOAM Running, Solving & CFD	1	May 13, 2017 22:34
SIMPLE algorithm in 3D cylindrical coordinates	zouchu	Main CFD Forum	1	January 20, 2014 18:02