Performance of interFoam running in parallel

hsieh · September 9, 2006, 20:58

Dear OpenFOAM experts:

I ran interFoam on an AMD X2 3800+ CPU (3 GB DDR, about 390,000 elements in total) using LAM (simple decomposition). It actually took longer to run in parallel than in serial. I noticed that it required more iterations to converge at each time step when running in parallel. Also, extra communication is needed in parallel. Is there any way to improve the parallel performance?

I am setting up a cluster with 6 workstations (each with either dual CPUs or a dual-core CPU). A Gigabit switch connects the 6 boxes. Most of my work uses interFoam/lesInterFoam. Now I am wondering whether it is even worth setting up the cluster (most of the cases are less than 1 million elements).

I would appreciate it if someone could shed some light on this.

Pei

gschaider · September 11, 2006, 05:09

It might be a problem with the dual-cores. Both processors have to share the same memory-bandwidth.

I recently did some benchmarks with a 600k-cells simpleFoam-case. The cluster has dual-core dual-cpu nodes connected with Gigabit Ethernet (2 interfaces, channel bonded).

The speedups for all processes on 1 node are
N=1 1. (not too surprising)
N=2 2.06 (quite OK, only one core used per CPU)
N=3 2.09 (Oops)
N=4 2.69 (We're rising again)

If I do the same benchmark with only one process per node on up to 4 nodes (all communication over the network) I get
N=2 1.99
N=4 3.72
which is not spectacular but OK.
So I guess it's NOT an OpenFOAM problem but an architecture problem (benchmarks with Fluent on the same machines hint in the same direction).
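
To put the two sets of numbers side by side, it can help to convert each speedup into a parallel efficiency (speedup divided by process count). The following is only a small sketch using the figures quoted in this post; the function name is made up for illustration.

```python
def parallel_efficiency(speedup, n_procs):
    """Return speedup divided by process count (1.0 means ideal scaling)."""
    return speedup / n_procs


# Speedups quoted in the post above.
all_on_one_node = {1: 1.0, 2: 2.06, 3: 2.09, 4: 2.69}
one_process_per_node = {2: 1.99, 4: 3.72}

for label, results in (
    ("all on one node", all_on_one_node),
    ("one process per node", one_process_per_node),
):
    for n_procs, speedup in sorted(results.items()):
        eff = parallel_efficiency(speedup, n_procs)
        print(f"{label}: N={n_procs} speedup={speedup:.2f} efficiency={eff:.0%}")
```

With four processes the efficiency is roughly 67% when they all share one node and roughly 93% when each has a node to itself, which is the gap the post attributes to the hardware rather than to OpenFOAM.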

mattijs · September 11, 2006, 05:51

What processors are these? Do the latest models still have these memory bandwidth problems?

Also does the channel bonding have any effect? I think OpenMPI does it automatically?

gschaider · September 11, 2006, 07:16

Hi Mattijs!

The processors I did the benchmarks with were dual-core Opteron 275s. There is still the possibility that the motherboard doesn't handle things as well as it could (it's a Tyan S2892) or that the kernel doesn't allocate memory optimally (it's only a 2.6.9; as far as I remember there have been some changes in the way memory gets allocated on NUMA-like architectures, but I don't know in which version).
On the other hand, the official Fluent benchmarks show no such effects on the dual-core machines, which might indicate that something is wrong with my setup (but I suspect they distribute the processes optimally across the nodes, as I did for the second set).

To be honest, I didn't measure the effect of the channel bonding (it doesn't make things worse, that much I made sure of). The MoBo had two interfaces anyway, so the only cost of setting it up was the cost of the patch cables. The load distribution happens on a per-connection basis (this means that one connection can only send at 1 GBit, but a second connection can send at the same bandwidth at the same time, verified with iperf). Per-packet distribution (which could give 2 GBit per connection) should be as easy to set up, but MIGHT have issues with out-of-order packets (and I figured that if I run more than one process per node, 2x1 GBit is as good as 1x2 GBit).

hsieh · September 11, 2006, 15:12

Hi,

Thanks for the feedback.

I have done some more testing and this is what I found:

interFoam was used, with one Gigabit switch and nothing special (no channel bonding etc.).

N=1 1.
N=4 2.47 (two dual core systems)
N=6 3.6 (two dual core systems + 1 dual CPU system)

pei

olwi · September 13, 2006, 09:19

A note on Bernhard's post from Sept 11: I have similar experiences running FLUENT (sorry) on a large cluster of nodes, each node having 2 AMD64 CPUs, and a Gb switch. When we use only one CPU per node, speedup is reasonable, but when we use both CPUs on each node it's very bad. The problem is that both CPUs want to use the same network interface at more or less the same time. Latency is a lot greater in that case. This is consistent with Bernhard's tests on dual core.

A note on the positive side: As long as our problems have at least 150,000-200,000 cells per CPU, statistics are quite OK even when we fill the nodes. You "just" need to give the CPUs a LOT of work between each data exchange...

/Ola
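
As a rough check of that rule of thumb against the case that started this thread, the sketch below divides the roughly 390,000 cells from the first post by a few process counts. The threshold is the lower end of the range quoted above, which comes from FLUENT runs on a different cluster, so treat it as an assumption rather than an OpenFOAM guideline; the function name is made up for illustration.

```python
def cells_per_process(total_cells, n_procs):
    """Average number of cells each process gets with an even decomposition."""
    return total_cells / n_procs


TOTAL_CELLS = 390_000   # case size from the first post
THRESHOLD = 150_000     # lower end of the 150,000-200,000 range (assumed to carry over)

for n_procs in (1, 2, 4, 6):
    per_proc = cells_per_process(TOTAL_CELLS, n_procs)
    verdict = "at or above" if per_proc >= THRESHOLD else "below"
    print(f"N={n_procs}: {per_proc:,.0f} cells per process ({verdict} the threshold)")
```

On these numbers only the one- and two-process runs stay above the threshold; at four and six processes each process has well under 100,000 cells.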

pannala · September 13, 2006, 09:44

Did you have shmem enabled on the nodes? If so, communication within a node should be a memory-to-memory copy instead of going through the network interface. The only concern, depending on how the memory is managed, might be that one is filling up the pipeline or getting a lot of cache conflicts when running large problems, but that should not happen with small (~10K/CPU) problems. I do not have much experience benchmarking OF or Fluent but should be experimenting with OF in the near future. I will definitely share the results when I have them.

Cheers!

Sreekanth

gschaider · September 13, 2006, 16:27

Concerning the performance of memory-intensive applications on dual-core machines, I found this paper:

www.novell.com/collateral/4622016/4622016.pdf

I'm afraid the situation with OpenFOAM is quite similar to the STREAM benchmark shown in Figure 1: no good speedup with dual-cores.

If the number of cores is the same, single-core SMP makes more Foam than dual-core. Much more. (At least for AMD dual-cores. Does anyone have access to Intels?)
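
One way to picture the shared-bandwidth argument is a toy model, sketched below under strong assumptions: every process gets the same amount of work, the run is as slow as its slowest process, and the cores on one socket split a fixed memory bandwidth evenly. The ratio of 1.35 between a socket's bandwidth and what one core can use on its own is not a measured value; it is picked purely for illustration, and the function name is made up.

```python
def modelled_speedup(busy_cores_per_socket, socket_bw_ratio):
    """Toy speedup estimate for a memory-bound run.

    busy_cores_per_socket: number of busy cores on each socket, e.g. [2, 1].
    socket_bw_ratio: socket memory bandwidth divided by the bandwidth one
        core can use when running alone (an assumed, not measured, value).
    """
    n_procs = sum(busy_cores_per_socket)
    # Rate of one process relative to a process that has a socket to itself.
    rates = [min(1.0, socket_bw_ratio / k) for k in busy_cores_per_socket if k > 0]
    # Equal work per process: the slowest process sets the overall run time.
    return n_procs * min(rates)


BW_RATIO = 1.35  # assumption chosen for illustration only

for layout in ([1, 0], [1, 1], [2, 1], [2, 2]):
    print(f"busy cores per socket {layout}: "
          f"modelled speedup {modelled_speedup(layout, BW_RATIO):.3f}")
```

With this made-up ratio the model gives about 2.0 for one core on each of two sockets, about 2.0 again when a third process has to share a socket, and about 2.7 for four processes, which is the same flat-then-rising shape as the single-node benchmark earlier in the thread. It says nothing about network effects.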

hsieh · September 14, 2006, 10:15

Hi, Bernhard,

I have two Dell Precision 380 systems, each has a Pentium D 3.2 GHz CPU. I will be happy to do some testing.

I made a mistake in my earlier post. The numbers there were based on ExecutionTime. For N=6, the real-time speedup is only about 2. That is, clockTime is much longer than executionTime. I guess the parallel run spent a lot of time communicating.
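
To avoid mixing up the two timings, both speedups can be pulled from the solver logs. This is a minimal sketch that assumes the logs contain lines of the form `ExecutionTime = ... s  ClockTime = ... s`; the sample log text and all numbers in it are made up for illustration and are not results from this thread.

```python
import re

# Assumed format of the solver's timing lines, e.g.
# "ExecutionTime = 12.5 s  ClockTime = 20 s"
TIME_LINE = re.compile(
    r"ExecutionTime\s*=\s*([\d.eE+-]+)\s*s\s+ClockTime\s*=\s*([\d.eE+-]+)\s*s"
)


def last_times(log_text):
    """Return (execution_time, clock_time) from the last timing line, or None."""
    matches = TIME_LINE.findall(log_text)
    if not matches:
        return None
    cpu, wall = matches[-1]
    return float(cpu), float(wall)


def speedups(serial_log, parallel_log):
    """Return (speedup from ExecutionTime, speedup from ClockTime)."""
    s_cpu, s_wall = last_times(serial_log)
    p_cpu, p_wall = last_times(parallel_log)
    return s_cpu / p_cpu, s_wall / p_wall


# Made-up log excerpts, for illustration only.
serial_log = """Time = 0.1
ExecutionTime = 50 s  ClockTime = 50 s
Time = 0.2
ExecutionTime = 100 s  ClockTime = 100 s
"""
parallel_log = """Time = 0.1
ExecutionTime = 12 s  ClockTime = 26 s
Time = 0.2
ExecutionTime = 25 s  ClockTime = 50 s
"""

cpu_speedup, wall_speedup = speedups(serial_log, parallel_log)
print(f"speedup from ExecutionTime: {cpu_speedup:.2f}")
print(f"speedup from ClockTime:     {wall_speedup:.2f}")
```

The ClockTime figure is the one that matters for turnaround; a large gap between the two suggests the processes spend much of their time waiting, for example on communication.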

I am running another test: 4 CPUs, with only 1 CPU (or 1 core) per workstation. HyperThreading was disabled on all workstations as suggested.

Bernhard, I am wondering if you can share how your cluster is set up. I think that my cluster is very primitive, and I am hoping that I can follow your setup to improve the efficiency, since I am just getting into the cluster area.

Pei

