Intel's MPI and performance tools in OpenFOAM

hplum · December 7, 2007, 07:18

Hi,

I work as an applications engineer at Intel and have been involved in running OpenFOAM on Intel platforms, in particular in testing Intel's MPI product as a replacement for Open MPI.

I saw OpenFOAM (simpleFoam) running over 30% faster with Intel's MPI than with Open MPI on crucial benchmarks of an important end user of OpenFOAM who is also an Intel cooperation partner.

This setup also made it possible to use Intel's performance analysis tools with OpenFOAM.

Is this interesting for you? Have you ever thought about a model for linking OpenFOAM with commercial MPI libraries?

I would appreciate your interest in this story ...

asaha · December 7, 2007, 23:40

Hello Hans,

It was interesting to note that Intel's MPI runs faster than the Open MPI implementation in OpenFOAM. I would be keen to check for the same performance improvement on Opteron-based platforms. It would be great if you could help me get a version of Intel's MPI for OpenFOAM 1.4.1.

msrinath80 · December 8, 2007, 08:41

Hans, when you say 30% faster, I would like to know what basis you use for that comparison. Do you mean to imply that the parallel speedup for, say, a 2-CPU job is 30% higher when using Intel's MPI as opposed to Open MPI? If it's just the solution time, then I don't think there is anything surprising there, as Intel's compilers/libraries are optimized to work exceptionally well on Intel platforms (which, by the way, are rarely used in any of the clusters at my university). On a related note, what exactly is your stand on using multi-core systems (dual/quad/octa) for parallel CFD computing? Is Intel aware that memory bandwidth is the real bottleneck when switching to multi-core systems?

hplum · December 10, 2007, 07:43

Hello,

To respond to the comments of asaha and Srinath:

- Intel MPI is a commercial product, so it is not as easy to get as Open MPI. That's why I was wondering whether the OpenFOAM creators could think of a model for building against other MPIs that users might have a license for.

- The over 30% comes >>just<< out of the MPI. It compares 2 runs on exactly the same cluster (here an Intel 4-core cluster with 16 nodes, that is 64 processors), Open MPI vs. Intel MPI. All timings except the MPI coincide; only the MPI makes the difference.

msrinath80 · December 10, 2007, 12:38

Hello Hans,

I am not sure I understand your answer. When you compare the speedup (for example see [1]) I wish to know if Intel's MPI gives a 30% increase when compared to Open MPI.

Secondly, you mention that you ran the tests on 16 nodes (each node featuring a quad core CPU). How did you assign the processes:

i) Did you schedule one MPI process per node, so that each process would then communicate through an interconnect? If so, which interconnect (Gigabit, InfiniBand, Quadrics etc.)?

ii) Did you schedule MPI processes by filling each node and then moving to the next one? In this case, for 2 and 4 processes, all program instances would run on the same quad-core node. Only when moving to 6 or 8 processes would the interconnect be used.

iii) What case did you run (as in how big)? How much RES memory did the serial run consume? How many time steps did you run the case for?

[1] http://www.cfd-online.com/OpenFOAM_D...es/1/4626.html
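The two placement schemes asked about in (i) and (ii) can be illustrated with a minimal Python sketch. It assumes the 16-node, quad-core layout mentioned in the thread; the function names `place_ranks` and `uses_interconnect` and the policy labels `"fill"` and `"round_robin"` are hypothetical and do not correspond to any MPI launcher option.

```python
def place_ranks(n_procs, n_nodes, cores_per_node, policy):
    """Return the node index assigned to each MPI rank (illustrative only)."""
    if n_procs > n_nodes * cores_per_node:
        raise ValueError("more processes than available cores")
    if policy == "fill":
        # Scheme (ii): fill every core of a node before moving to the next one.
        return [rank // cores_per_node for rank in range(n_procs)]
    if policy == "round_robin":
        # Scheme (i): spread ranks over nodes, wrapping around if needed.
        return [rank % n_nodes for rank in range(n_procs)]
    raise ValueError(f"unknown policy: {policy}")


def uses_interconnect(placement):
    """True if the ranks live on more than one node."""
    return len(set(placement)) > 1


if __name__ == "__main__":
    # Assumed layout from the thread: 16 nodes with 4 cores each.
    for n_procs in (2, 4, 6, 8):
        for policy in ("fill", "round_robin"):
            placement = place_ranks(n_procs, 16, 4, policy)
            print(n_procs, policy, placement, uses_interconnect(placement))
```

With the "fill" policy, 2 and 4 processes stay on a single node and the interconnect only comes into play from 6 processes onward, which is why the placement matters when comparing MPI libraries.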

asaha · December 11, 2007, 00:37

I have run my parallel cases on both Xeon- and Opteron-based machines. With Open MPI, the parallel performance of the Opteron-based machines is far superior to that of the Xeon-based machines.

hplum · December 11, 2007, 05:51

Hi,

The comparisons are as follows:

I ran simpleFoam 2 times; in both runs >>everything<< coincides:

- test case, compilers, compiler flags, run command, the cluster to run on, mapping of processes to the parallel nodes of the cluster

>>except<<

- I link Intel's MPI in the first run, Open MPI in the second run

Then the first run, just by managing the message passing better, gives a 30% better runtime (not speedup), e.g. 230 s instead of 300 s wall clock time.

This happens throughout all the different styles of mapping, be it 1 process per node, 2, or 4.

I'm afraid I cannot disclose details of the test case as it's confidential.
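The example figures of 230 s and 300 s can be read in two ways, which is worth spelling out. The short Python sketch below uses only those two numbers from the post; the helper names `runtime_reduction` and `throughput_gain` are hypothetical.

```python
def runtime_reduction(t_old, t_new):
    """Fraction of wall clock time saved relative to the old run."""
    return (t_old - t_new) / t_old


def throughput_gain(t_old, t_new):
    """Fractional increase in runs completed per unit of time."""
    return t_old / t_new - 1.0


if __name__ == "__main__":
    t_open_mpi = 300.0   # seconds, example figure from the post
    t_intel_mpi = 230.0  # seconds, example figure from the post
    print(f"wall clock time saved: {runtime_reduction(t_open_mpi, t_intel_mpi):.1%}")
    print(f"more runs per unit time: {throughput_gain(t_open_mpi, t_intel_mpi):.1%}")
```

For these example figures, the run is about 30% faster in the throughput sense (300/230 is roughly 1.30), while the wall clock time itself drops by about 23%.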

msrinath80 · December 11, 2007, 14:34

Thank you, Hans. As I mentioned before, if there is no improvement in the speedup, I doubt anyone will be that interested. Throw in a few more processors and you will always get better runtimes if the speedup stays close to linear. Besides, to my knowledge the Intel architecture is rarely used in clusters anymore. We had a very prominent Xeon-based cluster at our university, but that was 3-4 years ago. Now everything has been changed to AMD (HyperTransport technology) and/or IBM POWER (far superior memory bandwidth, generous L3 cache etc.).

fra76 · December 12, 2007, 05:47

Srinath, I'm sorry but I don't agree with you.
If you can save 30% of computational time just by changing the MPI library, it's a huge improvement, even if the parallel speedup doesn't change. Saving 30% of the time means running 30% more simulations in the same time, and in the end it means saving the money for buying and maintaining 30% more hardware while having the same performance.

About the cluster, it's not true that Intel CPUs are not used anymore, trust me!

@Hans: what if you have a proprietary network (like Myrinet or Quadrics, instead of Gigabit)?

Francesco

jens_klostermann · December 12, 2007, 14:52

Hi,

Of course it would be nice if somebody who already has a license for Intel's MPI could get their licensed software to work with (in this case) OpenFOAM. Since OpenFOAM is under the GNU license, what keeps Intel from supplying an interface between their proprietary MPI and OpenFOAM in source code or binary form (like the NVIDIA drivers for Linux)?

One point I didn't get: was the 30% wall time reduction for an Ethernet interconnect only, or is it also valid for an InfiniBand interconnect? If it is only for Ethernet, there is also a promising (also up to 30% wall time reduction) "free" alternative: GAMMA, which is already implemented/supported by OpenFOAM.

@Srinath: We did some benchmarking with Intel and Opteron systems, and for various cases the Intel systems are mostly as fast or even faster.

Jens

msrinath80 · December 12, 2007, 15:21

Francesco, I respect your opinion. My observations are merely based on experience and discussions with people who have been working in High Performance Computing (HPC) for quite some time. AMD still rules over Intel when you factor in the price and power consumption and compare the performance. IBM still rules over both AMD and Intel when it comes to processors suited for HPC. The price of IBM servers is of course exorbitant.

I'm no AMD fanboy. I still respect Intel and their processors, but only when it comes to desktop use. In fact I chose Intel over AMD when I bought my Fujitsu notebook, simply because Intel supports free software 3D graphics drivers. Nevertheless I will elaborate on the reason for my skepticism. Without mentioning the size of the test case there really is no reason to get excited over a 30% improvement. Intel is famous for posting benchmarks of commercial CFD codes (e.g. Fluent) and claiming superiority over AMD. However, their benchmarks are based on relatively small test cases. Increase the size of the problem and Intel struggles to match the performance that AMD can deliver (thanks to its HyperTransport technology, which reduces the FSB bottleneck by providing separate paths to memory and to all other PC components through the motherboard chipset). This is also why Intel processors have a larger L2 cache, in an attempt to offset the loss in performance that comes from having to use the Front Side Bus (which manages both memory and I/O communications) every time. I chose to believe neither Intel nor AMD when it came to benchmarking. I did all the tests myself (some of which, those I could get around to summarizing, I posted in this forum). In the majority of the parallel tests I saw that AMD gave much better results. Let me see Intel give me a better speedup at a reasonable price and I will gladly recommend it.

As regards Intel MPI, I will admit outright that I am biased towards open solutions. The very fact that Intel releases its very own compiler and MPI libraries is clear evidence that its processors have some performance pathways that are not documented, so that gcc and other free alternatives cannot exploit them. In other words, Intel (like any other company, including AMD) wants to get additional revenue by promoting the use of its compilers etc. Nothing surprising there, eh. I'm sure there are folks who love to get the best of both worlds (free and commercial) as long as it benefits them. But then again this is the consequence of practical choices, isn't it? I choose open solutions not just because they are free but also because they promote growth and productivity more than commercial alternatives.

hplum · December 13, 2007, 08:19

All,

Interesting discussion. In terms of judging the 30%, I agree with Francesco Del Citto - what counts is the run time and the number of simulations you can run per unit of time - speedup is a largely overestimated (and also abused) measure. I can easily make an application 2x slower but improve its parallel speedup ....

I also understand the scepticism - let me assure you that the measurements were done with significant test cases, but sorry, no more details are possible.

As to the interconnect used for those 30%: it was InfiniBand ...
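The remark that an application can be made 2x slower while its parallel speedup improves can be illustrated with a toy model. The Python sketch below assumes a perfectly divisible compute part plus a fixed communication overhead per parallel run; this model and all numbers in it are hypothetical and are not taken from the confidential benchmark.

```python
def wall_time(serial_compute_s, n_procs, comm_overhead_s):
    """Toy model: compute time divides evenly, communication cost is fixed."""
    if n_procs == 1:
        return serial_compute_s  # no message passing in a serial run
    return serial_compute_s / n_procs + comm_overhead_s


def speedup(serial_compute_s, n_procs, comm_overhead_s):
    """Parallel speedup relative to the serial run of the same code."""
    return (wall_time(serial_compute_s, 1, comm_overhead_s)
            / wall_time(serial_compute_s, n_procs, comm_overhead_s))


if __name__ == "__main__":
    n_procs = 16
    comm = 5.0  # seconds of communication overhead, hypothetical
    for label, compute in (("original code", 100.0), ("2x slower code", 200.0)):
        t = wall_time(compute, n_procs, comm)
        s = speedup(compute, n_procs, comm)
        print(f"{label}: wall time {t:.2f} s, speedup {s:.2f}")
```

In this model the slower code shows the higher speedup, because the unchanged communication cost becomes a smaller share of its runtime, yet its wall clock time is still worse. A faster MPI library reduces the communication term instead, which improves the wall clock time directly.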

caw · December 13, 2007, 08:41

Hans,

thank you for this statement. IMHO run time on a given number of CPUs is the only measure that counts. It goes with speedup hand in hand anyway.

Talking about Intel MPI: I suppose you will have difficulties introducing this into an open source community...
But what about a contribution, let's say an "Open-Foam-Intel-MPI-Special-Edition"....that would be nice, right? ;-))

Kind regards
Christian

hplum · December 13, 2007, 09:45

Christian,

Thanks for the comment.
Well, it could be that OpenFOAM's installation provides a branch for linking other MPIs: if a client has this library, it works; if not, it doesn't, so he continues to use Open MPI. Actually pretty easy. Thanks to the nice encapsulation in libPstream.so, this could keep 99.9% of the code / compilation untouched.

fra76 · December 13, 2007, 11:30

That's true. I compiled OpenFOAM on a proprietary network using their MPI library. It's really very easy!
And the performance improvement was fantastic in my case, especially with a small number of cells per CPU.

alberto · December 16, 2007, 15:51

Just a small suggestion.

There are various simple test cases in the literature, with all details needed to reproduce them. For example, a simple but computationally intensive test case is a direct numerical simulation in a channel flow, with predetermined flow conditions and solver settings.

This kind of test case is easily scalable, adaptable to high computational resources, and not covered by secrecy agreements.

This would allow Intel to make results public, with detailed information and specific hints on how to get them, increasing its credibility.

With kind regards,
Alberto

alberto · December 16, 2007, 15:58

Just some links to test cases:

- Ercoftac database: http://cfd.mace.manchester.ac.uk/cgi-bin/cfddb/ezdb.cgi?ercdb+search+retrieve+&& &*%%%%dm=Line

- iCFD database (cases with detailed results too) http://cfd.cineca.it/cfd
