|
gfortran couldn't even vectorize a simple 2D matrix addition |
|
January 13, 2021, 05:29 |
gfortran couldn't even vectorize a simple 2D matrix addition
|
#1 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
I observed that my 2D solver code wasn't getting vectorized when compiling with gfortran.
So, I wrote an extremely simple code to test whether the fault was mine. It turns out that even the simple code (2D matrix addition) didn't vectorize, while the 1D array addition vectorized well. (EDIT : When I say it didn't vectorize, I mean that it didn't vectorize using the widest vector registers my machine has, i.e. the AVX2 YMM registers.) GODBOLT LINK : https://godbolt.org/z/6Wx3ob Code:
! COMPILE : -O3 -fopenmp -mavx2 -march=native -ftree-vectorize -ftree-vectorizer-verbose=3 -fopt-info-vec-missed
program main
  implicit none
  real(8), dimension(200,150) :: z, x, y
  real(8), dimension(200)     :: c, a, b
  integer(4) :: i, j

  a(:) = 1.0d0
  b(:) = 1.0d0
  a(100) = 2.0d0
  b(100) = 2.0d0

  !$omp simd
  do i = 1, 200
     c(i) = a(i) + b(i)
  end do

  x(:,:) = 1.0d0
  y(:,:) = 2.0d0
  x(50,75) = 2.0d0
  y(50,75) = 4.0d0

  !$omp simd collapse(2)
  do j = 1, 150
     do i = 1, 200
        z(i,j) = x(i,j) + y(i,j)
     end do
  end do

  print *, c(1), c(100), c(200)
  print *, z(1,1), z(50,75), z(200,150)
end program
What's going on here? Is the fault mine (I hope I missed something) or gfortran's (that would be horrible)? My intuition tells me it's gfortran's fault, since gfortran couldn't even vectorize x( : , : ) = 1.0d0 or y( : , : ) = 2.0d0, while IFORT did. (EDIT : After an in-depth review, it looks like this section was partially vectorized.) Thanks and Regards ~sayan Last edited by aerosayan; January 13, 2021 at 13:18. Reason: fixing wrong information |
|
January 13, 2021, 08:34 |
|
#2 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
In C/C++, I was able to vectorize operations over a 2D matrix by accessing it as a 1D array.
@sbafinni This is the method I was talking about during our last conversation. GODBOLT LINK : https://godbolt.org/z/xqvoP3 |
|
January 13, 2021, 14:18 |
Possible Solution
|
#3 | |
New Member
Join Date: Jan 2021
Posts: 1
Rep Power: 0 |
Quote:
I managed to compile the code; the outputs are: 2, 4, 2; 3, 6, 3. When I inspected the code you posted, I saw immediately that it is not in UPPERCASE; try PROGRAM MAIN, and make sure you know the general syntax. I suggest using an IDE with a debugger, such as Geany, to avoid problems in the future. Old versions of Fortran had to be in UPPERCASE. |
||
January 13, 2021, 14:25 |
|
#4 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Hello mermeruim, thanks for your effort. Unfortunately, your suggestion wasn't helpful to me this time; I hope you'll be able to help me in the future. My problem wasn't about compilation; it was about SIMD vectorization (an advanced topic). May I respectfully suggest letting others answer this question? I'm afraid you didn't understand the question or my problem, although kudos for the effort. I hope to see you around here. |
||
January 13, 2021, 16:12 |
|
#5 |
Senior Member
Join Date: Jul 2009
Posts: 358
Rep Power: 19 |
mermeruim,
At least as far back as FORTRAN 77, Fortran has been case-insensitive. |
|
January 14, 2021, 06:13 |
|
#6 |
Member
EM
Join Date: Sep 2019
Posts: 59
Rep Power: 7 |
gfortran with Intel chips is a sorry tale.
Intel used to give Linux ifort away for free, but that is long gone. If you are going to use Intel chips for linear algebra and multithreading, the MKL library is a free download (and unbeatable), and you can also download Intel's threading runtime for OpenMP use for free. You will have to register first. There is also an Intel site that will give you all the appropriate flags and libraries for linking MKL and the threading runtime with gfortran. |
|
January 14, 2021, 15:30 |
|
#7 |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Moving back to using only C++.
I have more experience and control in C++. If gfortran can't handle such simple optimizations, it's not worth my effort. |
|
January 14, 2021, 15:43 |
|
#8 |
Senior Member
|
Not directly related, but Intel now provides oneAPI, which is free (https://www.scivision.dev/intel-oneapi-fortran-install/)
|
|
January 17, 2021, 19:08 |
|
#9 | |
New Member
Join Date: Jan 2021
Posts: 18
Rep Power: 5 |
Quote:
Dear aerosayan, could you please briefly explain to me what vectorization means? From what I can understand, your code seems simply to add some dummy numbers. I'm just curious. Thanks for your effort. |
||
January 18, 2021, 00:31 |
|
#10 | |
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
SIMD (Single Instruction, Multiple Data) vectorization is a form of data-level parallelism that lets you complete more work in less time on a single CPU core. For the operation x[i] = y[i] + z[i], normal serial code adds one y value and one z value to produce one x value per instruction. In vectorized code, 2, 4, 8, or 16 values of y and z are added in a single instruction to produce 2, 4, 8, or 16 values of x. This is possible because machine-level arithmetic instructions operate on registers, and SIMD registers are wide enough to hold 2, 4, 8, or 16 values at once. I'll let someone important from Intel explain this : https://youtu.be/_OJmxi4-twY |
||
January 18, 2021, 02:10 |
|
#11 | |
New Member
Join Date: Jan 2021
Posts: 18
Rep Power: 5 |
Quote:
Thank you. Can I ask you one more thing? Do I understand correctly that what you mentioned is basically a feature of most parallel codes? I mean, does each core have its own array x[i] that it works on, with the data gathered into one big array after the operation? Probably I'm still missing the key feature of vectorization. I haven't watched the video yet; maybe it will clarify my doubts. From your experience, is it better to start out writing a code using parallel libraries from the very beginning, or is it possible to parallelize a code at the end? Thank you once again. |
||
January 18, 2021, 02:34 |
|
#12 | ||
Senior Member
Sayan Bhattacharjee
Join Date: Mar 2020
Posts: 495
Rep Power: 8 |
Quote:
Experts in this forum (though I can't speak for all of them) will recommend using MPI or OpenMP for most tasks. Use OpenMP if you're new to parallel programming, since it's easy to learn and use. Use MPI if you're a bit more advanced and need to run your code on HPC clusters. Most probably you won't need MPI as a beginner. You don't need to follow what I'm doing, since it is more trouble than it's worth for many people, and it requires a moderately good understanding of assembly language and CPU architecture. If you want to learn assembly language, or understand why I'm not following the same path as the experts, I'll include a link to one of my posts : Motivation and introduction for learning assembly coding Quote:
SIMD vectorization is not a feature of most parallel codes. Most parallel codes are written to scale to thousands of cores on HPC clusters using MPI and/or OpenMP; SIMD vectorization is about getting the maximum performance out of a single CPU core. SIMD vector registers are CPU registers that can hold, say, 8 integers at once. For the operation x[i] = y[i] + z[i], let's say three registers YMM1, YMM2, and YMM3 are used. The CPU loads y[i=1,8] into YMM2 and z[i=1,8] into YMM3, then, using a single add instruction, calculates YMM1 = YMM2 + YMM3. Since each YMM register holds 8 integers, all 8 additions happen in a single CPU instruction! This is called vectorization, and it essentially gives you an 8x speed improvement over non-vectorized serial code. |
|||
January 18, 2021, 07:08 |
|
#13 |
New Member
Join Date: Jan 2021
Posts: 18
Rep Power: 5 |
Great explanation! Now I understand the main idea. I didn't know that processors have such abilities to speed up algebra and perform several additions at once.
Thank you for your suggestion; I will focus on OpenMP. |
|
January 18, 2021, 07:27 |
|
#14 | |
Senior Member
andy
Join Date: May 2009
Posts: 322
Rep Power: 18 |
Quote:
If one is prepared to devote significant time and effort to achieving high efficiency for matrix operations, then the native high-performance matrix library for the particular hardware (e.g. a tuned BLAS) is likely to be the best option. MKL was mentioned earlier, but the best choice will tend to vary with hardware and circumstances. Such libraries also tend to sidestep C aliasing issues, giving effectively comparable performance to Fortran. Again, if one is prepared to put in significant effort to raise computational efficiency, using 1D arrays is likely to offer the higher general performance. You need to make sure the optimiser knows the offset is a positive constant, which may require a directive if, for example, it is passed as a variable in an argument list. There are of course a range of other coding approaches that optimise performance, depending on the form of hardware acceleration present. Back in the early 80s, the industrial CFD code I was involved in developing had substantially different versions of key routines adapted for sequential computers, IBMs with small vector registers, or Crays with larger vector registers. Low-level algorithms were changed as well, restructuring do loops and caching. Simpler algorithms with fewer dependencies, which converged more slowly per iteration but vectorised efficiently, could become preferable when substantial numbers of vector registers were available. That involved a significant amount of development effort, though, which I suspect is rarely going to be cost-effective these days. |
||
January 18, 2021, 07:48 |
|
#15 |
Senior Member
Filippo Maria Denaro
Join Date: Jul 2010
Posts: 6,896
Rep Power: 73 |
Many, many years ago I worked in Fortran on a Convex vector computer... well, I think I just wasted time hoping to get some speed-up...
I will never be an expert in IT and HPC, but I think CFD is rather different stuff ... |
|
January 18, 2021, 08:57 |
|
#16 | |
Senior Member
|
Quote:
I think there is a specific balance, dictated by the most efficient way to obtain what you need. Some need to test a new physical model they developed, some others need to sell that model in their code, some others yet just need to use the model... and some will have to answer phone calls about why a code that was paid a lot, when using that very model, crashes on the new fancy cluster in the department. |
||
Tags |
fortran, ifort, openmp |
|
|