OpenFOAM with ICC - Intel's ICC is "allergic" to AMD CPUs
This is yet another one of those "rants on my blog" that I do once in a while.
So today I was trying to reproduce this report: http://www.openfoam.org/mantisbt/view.php?id=590 - I was curious about this, because ICC is known for having great optimization on Intel processors (or at least it had in the past), as well as being completely free for non-commercial use on Linux.
The catch here is that I don't have an Intel CPU in my home computer: I've got an AMD 1055T x6.
But that hasn't stopped me from compiling OpenFOAM on Ubuntu x86_64 with ICC. The thing is, today was the first time I took that build out for a spin.
My surprise (and the origin of this rant) was this message that appeared in front of me:
Code:
Fatal Error: This program was not built to run on the processor in your system. The allowed processors are: Intel(R) Pentium(R) 4 and compatible Intel processors with Intel(R) Streaming SIMD Extensions 3 (Intel(R) SSE3) instruction support.
Wow... I knew it wouldn't work in optimized mode, due to the several warnings written all over various documents, as well as what's documented on Wikipedia: http://en.wikipedia.org/wiki/Intel_C...iler#Criticism
So I went to search more on this error and voilà: http://developer.amd.com/documentati...292005119.aspx - this was the first link that popped up when searching for this message:
Quote:
This program was not built to run on the processor in your system. The allowed processors are: Intel
So, what to do next?
Using a crowbar is out of the question... which is why I then remembered this bug report: http://www.openfoam.org/mantisbt/view.php?id=322
Quote:
wmake/rules/linux64Icc/c++Opt should be edited to remove the -SSE3 option.
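In practice, that means changing the c++OPT line in that rules file and rebuilding. A minimal sketch of what I mean, assuming the stock 2.1.x line carries the -xSSE3 switch next to the optimization flags (the exact contents may differ between versions):
Code:
# hypothetical before/after of the c++OPT line in wmake/rules/linux64Icc/c++Opt:
#   before:  c++OPT = -xSSE3 -O2 -no-prec-div
#   after:   c++OPT = -O2 -no-prec-div
sed -i -e 's/ -xSSE3//' wmake/rules/linux64Icc/c++Opt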
And rebuild!
And why go through all of this trouble? Because I want to know!
PS: To answer the first comment:
Code:
icc --version
icc (ICC) 12.1.4 20120410
Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.
(Non-commercial version for Linux.)
__________________________
Continuing investigation has led to the following results.
Mental notes:
- Nice list of safe flags for Gcc: http://en.gentoo-wiki.com/wiki/Safe_Cflags
- OpenFOAM 2.1.x
- motorBike 2.1.x case from this page: http://code.google.com/p/bluecfd-sin...untimes202_211 - it uses snappyHexMesh in parallel as well
- Processor AMD 1055T x6, using DDR2 800MHz memory
- Only a single run was done, so the timings may have a 5 to 10 s margin of error... (rough speed-up figures from these timings are sketched right after this list)
- Single core:
- sHM 300s, simpleFoam 1614s <-- Gcc 4.6.1 -O3
- sHM 300s, simpleFoam 1399s <-- Gcc 4.6.1 -march=amdfam10 -O2 -pipe
- sHM 245s, simpleFoam 1409s <-- Icc -O2 -no-prec-div
- sHM 242s, simpleFoam 1369s <-- Icc -O2 -no-prec-div -msse3
- Parallel 2 cores:
- sHM 222s, simpleFoam 1296s <-- Gcc 4.6.1 -O3
- sHM 198s, simpleFoam 867s <-- Gcc 4.6.1 -march=amdfam10 -O2 -pipe
- sHM 176s, simpleFoam 894s <-- Icc -O2 -no-prec-div
- sHM 173s, simpleFoam 884s <-- Icc -O2 -no-prec-div -msse3
- Parallel 4 cores:
- sHM 160s, simpleFoam 1196s <-- Gcc 4.6.1 -O3
- sHM 130s, simpleFoam 726s <-- Gcc 4.6.1 -march=amdfam10 -O2 -pipe
- sHM 163s, simpleFoam 788s <-- Icc -O2 -no-prec-div
- sHM 159s, simpleFoam 771s <-- Icc -O2 -no-prec-div -msse3
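Turning those simpleFoam timings into speed-up factors for the fastest build (the Gcc -march=amdfam10 -O2 -pipe one), just as a quick sanity check of the scaling:
Code:
# speed-up vs. the single-core run, simpleFoam, Gcc -march=amdfam10 -O2 -pipe build
awk 'BEGIN { t1=1399; t2=867; t4=726;
             printf "2 cores: %.2fx    4 cores: %.2fx\n", t1/t2, t1/t4 }'
# prints roughly: 2 cores: 1.61x    4 cores: 1.93x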
Comments
Good point, I forgot to mention in the blog post. edit: I've added it now.
Nonetheless, this issue with ICC and AMD processors has been around for several years now, as you can see from the references on Wikipedia. And I don't think they (Intel) have any plans to change this point of view.
Posted July 28, 2012 at 05:10 by wyldckat
Updated July 28, 2012 at 05:10 by wyldckat (see "edit:") -
AFAIK, since icc 10 or 10.1 there has been a special option, -xO:
Code:
-x<codes>  generate specialized code to run exclusively on processors
           indicated by <codes> as described below
    W  Intel Pentium 4 and compatible Intel processors
    P  Intel(R) Core(TM) processor family with Streaming SIMD
       Extensions 3 (SSE3) instruction support
    T  Intel(R) Core(TM)2 processor family with SSSE3
    O  Intel(R) Core(TM) processor family.  Code is expected to run
       properly on any processor that supports SSE3, SSE2 and SSE
       instruction sets
    S  Future Intel processors supporting SSE4 Vectorizing Compiler
       and Media Accelerator instructions
Code:
icc --version
icc (ICC) 10.1 20080801
Copyright (C) 1985-2008 Intel Corporation.  All rights reserved.
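So on an AMD box, which supports SSE3 but is not an Intel processor, the workaround with that older compiler would be the O code on the compile line. A quick sketch (the file name is just for illustration):
Code:
# icc 10.x: -xO targets generic SSE3-capable hardware (see the help text above)
icc -O2 -xO -c mySolver.C -o mySolver.o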
Posted July 28, 2012 at 10:27 by SergeAS -
How did I not see this before... In the last link I posted, there is this other link: http://developer.amd.com/Assets/Comp...f-61004100.pdf (edit: the AMD website is being revamped and the PDF got lost... best to Google for "CompilerOptQuickRef-61004100.pdf")
It indicates using the option "-msse3" (and avoiding "-ax"), which in Icc 12.1 is the equivalent of the now deprecated option "-xO" you've written about!
My machine is currently building another set of OpenFOAM 2.1.x with this option and then I'll run some more tests to compare performance times! If it fares well, it might be a good idea to propose including in OpenFOAM the crazy "WM_COMPILER" option "IccAMD", roughly along the lines sketched below.
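What I have in mind is more or less this (untested, and glossing over whatever else etc/bashrc and the etc/ config scripts expect from the compiler name):
Code:
# rough sketch of a hypothetical "IccAMD" set of wmake rules (untested):
cp -r wmake/rules/linux64Icc wmake/rules/linux64IccAMD
sed -i -e 's/-xSSE3/-msse3/' wmake/rules/linux64IccAMD/c++Opt
# then set WM_COMPILER=IccAMD in etc/bashrc and rebuild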
Serge, many thanks for the info! Although I can almost bet that Gcc 4.6 can still outrun Icc.
Posted July 28, 2012 at 14:01 by wyldckat
Updated May 18, 2014 at 13:46 by wyldckat (see "edit:") -
It would be interesting to see comparative benchmarks of gcc and icc on OF.
PS: In my own density-based solver (for compressible reacting flows, from subsonic to hypersonic) icc beat gcc by about 10%, but those were quite old compilers (icc 10.1 vs gcc 4.1) and processors (AMD Opteron 285 and Intel Xeon 5120).
UPD: I ran a small test of my current problem, measuring the "pure" performance of the solver (serial version) with 3 different compilers (gcc 4.1.0, gcc 4.5.1 and icc 10.1), and got the following results (one time step):
gcc 4.1.0 - 14.7 sec
gcc 4.5.1 - 14.95 sec
icc 10.1 - 9.45 sec
icc in this case produces code that is faster by more than 30%
(The test was performed on an AMD Opteron 285.)
Posted July 28, 2012 at 16:08 by SergeAS
Updated July 29, 2012 at 08:59 by SergeAS -
I've finally managed to complete the benchmarks! That "-msse3" option did bring good results! But then I went and got the flags for GCC + 1055T, and it did even better!
edit: I updated the last part of the blog post
For your Opteron 285, looks like these are the perfect gcc options:
Quote:
Originally Posted by http://en.gentoo-wiki.com/wiki/Safe_Cflags/AMD#2xx.2F8xx_Opteron
-march=opteron -O2 -pipe
Posted August 1, 2012 at 06:37 by wyldckat
Updated August 1, 2012 at 06:38 by wyldckat (see "edit:") -
In my solver I'm using the following option set for gcc:
Code:
-O3 -ffast-math -funroll-loops -fprefetch-loop-arrays -frerun-loop-opt
-ffloat-store -frerun-cse-after-loop -fschedule-insns2 -fno-strength-reduce
-fexpensive-optimizations -fcse-follow-jumps -falign-loops=32 -falign-jumps=32
-falign-functions=32 -falign-labels=32 -fcse-skip-blocks -ftree-vectorize
-msse3 -msse2 -msse -mmmx -m3dnow -mfpmath=sse -mtune=opteron
PS: In the case of the OpenMP solver, both compilers have horrible performance/scaling on NUMA systems with Opterons, but gcc 4.5.1 is a little bit faster than icc 10.1.
In the case of a dual Opteron 285 node with 16 GB (2x2 cores), on the same test (6M nodes, ~10 GB overall task size), the results are as follows:
icc 10.1 OpenMP - 8.15 sec (~29% parallel efficiency on 4 cores)
gcc 4.5.1 OpenMP - 7.95 sec (~47% parallel efficiency on 4 cores)
For NUMA systems, MPI is the more appropriate programming model (my test task has ~41% parallel efficiency... on 54 cores, in the case of MPI).
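For reference, those efficiency figures line up with the serial timings above, taking parallel efficiency as the serial time divided by (number of cores times parallel time):
Code:
# parallel efficiency = T_serial / (N_cores * T_parallel), using the timings above
awk 'BEGIN { printf "icc 10.1:  %.0f%%\n", 100*9.45/(4*8.15);
             printf "gcc 4.5.1: %.0f%%\n", 100*14.95/(4*7.95) }'
# prints roughly: icc 10.1: 29%   gcc 4.5.1: 47%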
PPS: One small remark:
the -pipe option speeds up compiling OF, but not the execution of the solver code.
Quote:
-pipe
Use pipes rather than temporary files for communication between the various stages of compilation.
Posted August 1, 2012 at 16:16 by SergeAS
Updated August 1, 2012 at 17:00 by SergeAS -
Many thanks for the info! I thought that the "-pipe" option was something for optimizing the way the cache was handled in the core... I should have read the manual.
But why are you using "-ffast-math"? AFAIK, that option is hated in the CFD world, because it doesn't offer full precision or reproducibility!
edit: OpenMP is pointless in OpenFOAM, because it doesn't take advantage of multi-threading capabilities, nor is OpenFOAM thread-safe in some cases!
Posted August 1, 2012 at 17:16 by wyldckat -