CFD Online (www.cfd-online.com) — Forums > OpenFOAM > OpenFOAM Installation

[Other] howto optimize OpenFOAM for Core i7 CPU using extended instruction set

January 23, 2015, 05:49   #1
cutter
Senior Member
Join Date: Mar 2010
Location: Germany
Posts: 154
Hi,

I tried to compile OpenFOAM with optimizations for some new Core i7 CPUs that support AVX2 and FMA. As far as I understand, the default settings target the generic x86_64 instruction set. I forced the compiler to optimize for the extended instruction set by adding the -march=corei7 flag in /wmake/rules/linux64Gcc/c++Opt and /wmake/rules/linux64Gcc/cOpt. The compiler picked up the settings, but my first benchmarks did not show any noticeable effect. I've been running my cases on a single thread in order to rule out MPI wait times and measure raw CPU performance.
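For reference, here is a minimal sketch of the kind of edit described above, demonstrated on a stand-in copy of wmake/rules/linux64Gcc/c++Opt (the variable names match stock OpenFOAM rule files, but verify against your version; on a real tree, run the sed on the actual file under $WM_PROJECT_DIR and keep the .bak backup):

```shell
# Sketch: append a machine-specific flag to the optimized-build rule.
workdir=$(mktemp -d)
# stand-in for wmake/rules/linux64Gcc/c++Opt:
printf 'c++DBUG =\nc++OPT = -O3\n' > "$workdir/c++Opt"
# append the flag to the c++OPT line (same pattern works for cOPT in cOpt):
sed -i 's/^c++OPT *=/& -march=native/' "$workdir/c++Opt"
grep '^c++OPT' "$workdir/c++Opt"   # -> c++OPT = -march=native -O3
```

After changing the real rule files, the tree has to be rebuilt (e.g. via Allwmake) for the flag to take effect.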

I've got two questions regarding this issue:

1. Is this the best or correct way to set the compiler flags?

2. What performance gain can be expected from optimized binaries?

Many Thanks
Cutter

January 24, 2015, 11:16   #2
Bruno Santos (wyldckat)
Retired Super Moderator
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Greetings Cutter,

In theory, AVX should speed up mathematical operations in any application once it's compiled with the necessary options. I'm not sure, though, if and how much OpenFOAM takes advantage of this; such vectorization is usually left to the compiler either way.

It also depends on the GCC version you're using. It's possible that your GCC is new enough to apply this optimization by default, which would explain why you see no performance difference with and without the option.

Therefore, please provide the following details:
  • CPU model you're using.
  • GCC version you're using.
  • Linux Distribution you're using.
I ask this so that it's easier to diagnose what might be the reason why this is either already working or not working at all.

Best regards,
Bruno

January 25, 2015, 03:26   #3
Francesco Del Citto (fra76)
Senior Member
Join Date: Mar 2009
Location: Zürich Area, Switzerland
Posts: 237
Hi all,

I have the same experience as Cutter. Over time I have tried many combinations of OpenFOAM versions, GCC releases, CPUs and operating systems, without getting any measurable improvement from machine-specific optimisation.
My last test was a few weeks ago, with GCC 4.9.2 on very recent hardware with two different CPUs. The -march option was correctly applied in both cases, and compilation took about three times longer, but the running time of the motorBike tutorial was almost exactly the same, both for meshing and solution.
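One quick way to confirm whether a machine-specific build actually emitted the new instructions is to disassemble a compiled binary and count AVX/FMA mnemonics (a sketch; the library name below is just an example from a standard OpenFOAM install, any solver binary or .so works):

```shell
# Count AVX (ymm-register) and FMA instructions in a compiled library;
# a plain generic x86_64 build should report (close to) zero.
objdump -d "$FOAM_LIBBIN/libfiniteVolume.so" | grep -c -E 'ymm|vfmadd'
```

If the count is near zero even with -march set, the compiler enabled the instruction set but found nothing it could vectorize, which would explain identical run times.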

It would be interesting to know if anyone has a different experience and could point out the compiler options used.

Best regards,
Francesco

January 30, 2015, 10:19   #4
cutter
Senior Member
Join Date: Mar 2010
Location: Germany
Posts: 154
Hi,

Thanks to both of you for the initial feedback!

I'm currently targeting the following two CPU models (obtained via cat /proc/cpuinfo and g++ --version):
* Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz, g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7)
* Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, g++ (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16)

I'm currently doing the research on the first of the two machines, which is running Fedora release 19 (Schrödinger’s Cat) with a KDE desktop installation:
Code:
$ uname -a
Linux hostname 3.14.23-100.fc19.x86_64 #1 SMP Thu Oct 30 18:36:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
As far as I understand, this version of g++ does not enable the extended instruction set by default. This can be checked with the following command:
Code:
g++ -dM -E -x c /dev/null | grep -i -e avx -e fma
<<no output here>>
When specifying the concrete architecture via the -march option the compiler activates the optimizations and defines the corresponding preprocessor variables:
Code:
g++ -march=core-avx2 -dM -E -x c /dev/null | grep -i -e avx -e fma
#define __core_avx2__ 1
#define __AVX__ 1
#define __FP_FAST_FMAF 1
#define __FMA__ 1
#define __AVX2__ 1
#define __tune_core_avx2__ 1
#define __core_avx2 1
#define __FP_FAST_FMA 1
The same thing happens when I let the compiler choose the instruction set using the -march=native option:
Code:
$ g++ -march=native -dM -E -x c /dev/null | grep -i -e avx -e fma
#define __core_avx2__ 1
#define __AVX__ 1
#define __FP_FAST_FMAF 1
#define __FMA__ 1
#define __AVX2__ 1
#define __tune_core_avx2__ 1
#define __core_avx2 1
#define __FP_FAST_FMA 1
Hope that helps to shed some light on the issue.

Best Regards
Cutter

February 2, 2015, 01:24   #5
Francesco Del Citto (fra76)
Senior Member
Join Date: Mar 2009
Location: Zürich Area, Switzerland
Posts: 237
Hi Cutter,

Nice checks!
Now we know the compiler is doing its job, or at least enabling the instruction sets specific to the CPUs, as I think we all expected.

Now the questions are: does the compiler actually use them when building OpenFOAM, and does this make any difference to the execution time?

Francesco

August 29, 2015, 11:27   #6
Lianhua Zhu (zhulianhua)
Member
Join Date: Aug 2011
Location: Wuhan, China
Posts: 35
Hi Francesco,
Recently I compared the performance of OpenFOAM built with icc and with gcc.
The two configurations are:
#1. icc 15.0.0, OpenFOAM-2.4.0, running on an E5-2680 v3 @ 2.5 GHz, compiled with the -xHost -O3 flags, OS: CentOS 6.5 x64, DDR4 RAM
#2. gcc 4.8.1, OpenFOAM-2.3.0, running on an E5-2697 v2 @ 2.7 GHz, compiled with the default -m64 flag, OS: CentOS 7.0 x64, DDR3 RAM

NOTE a): -xHost makes icc/icpc (or icl) check the CPU and use the highest level of extended instructions it supports.
NOTE b): The E5-2680 v3 supports AVX2 instructions while the E5-2697 v2 doesn't.

I ran the cavity flow case in $FOAM_TUT/incompressible/icoFoam/cavity without modifying any files in it, using only one process.

Results:
The icc configuration (#1) takes 0.16 s.
The gcc configuration (#2) takes 0.15 s.

You see, almost the same!
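One caveat worth noting: the stock cavity case finishes in a fraction of a second, so start-up costs dominate the timing. A sketch of how one might scale it up for a steadier measurement (paths and dictionary layout as in OpenFOAM 2.x, where blockMeshDict lives in constant/polyMesh; verify against your version before running):

```shell
# Refine the cavity mesh 10x in each direction so the solver runs long
# enough to time meaningfully, then time the solver alone.
cp -r "$FOAM_TUTORIALS/incompressible/icoFoam/cavity" cavityBench
cd cavityBench
sed -i 's/(20 20 1)/(200 200 1)/' constant/polyMesh/blockMeshDict
blockMesh > log.blockMesh
time icoFoam > log.icoFoam
```

With a run time of minutes rather than milliseconds, any AVX2/FMA effect on the linear-solver kernels should be easier to separate from noise.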

Hope this testing helps,

--
Lianhua


December 28, 2015, 21:19   #7
Bruno Santos (wyldckat)
Retired Super Moderator
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Greetings to all!

I've had this thread on my to-do list and I haven't reached a solution yet. Nonetheless, I've done some basic tests that at least give a feeling for the scale-up we can hope for. The repository is available here: https://github.com/wyldckat/avxtest

The source code does not depend on OpenFOAM and only needs GCC (4.7 or newer) to build. The summary results (on an AMD A10-7850K) were as follows:
  • float (single precision):
    • x86 FPU: 44478.285 ms
    • x86 AVX: 6253.096 ms
  • double (double precision):
    • x86 FPU: 44543.217 ms
    • x86 AVX: 13095.627 ms
Which makes for an interesting result: for those who think that single-precision calculations on 64-bit processors will be faster than double precision... they are actually wasting electricity by not investing in the more accurate result.

As for OpenFOAM, I still need to look into this in more detail. The compiler should be able to vectorize things on its own, but it seems the code must be written in a way that lets the compiler see "oh, this I can vectorize like so and so".

Best regards,
Bruno


