April 19, 2005, 05:34 |
|
#1 |
New Member
Steffen Jahnke
Join Date: Mar 2009
Posts: 14
Rep Power: 17 |
In order to test OpenFOAM's LES performance, we used the channelOodles solver to compute a simple turbulent channel flow similar to the one described in the channel395 tutorial, but with Re_tau = 180 (64x64x64 cells, 2 blocks, and a simple grading that stretches the mesh towards the walls). We tested all available time schemes and different solvers for the pressure (AMG with NCells = 26000, and ICCG with an absolute convergence tolerance of 1e-5). The timestep was set to 0.005, yielding an average CFL number of 1.1 during the simulation. Running OpenFOAM in parallel on 4 CPUs of a Linux cluster, we found that:

1) Using AMG is much slower than ICCG. Is this expected, or are there options other than NCells that have a significant influence on performance?

2) Compared with our in-house Fortran code running the same test case on a single CPU, OpenFOAM still needs almost 10 times more computation time per time step. We should mention that our explicit LES code is limited to single-block domains, is optimised with the Intel Fortran compiler, and uses an optimised pressure solver (FFT in the periodic directions).

2.1) The question concerning this statement is: is the difference purely the price we have to pay for a very flexible code infrastructure, or are there built-in options one can play with to get better performance?

2.2) Is it worth using another compiler (e.g. Intel), or are the differences only small?

Thanks for help in advance
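(For orientation, the pressure-solver choice described above is made in the case's system/fvSolution dictionary. The following is only a sketch in the old single-line solver syntax of that era; the exact argument list for the AMG entry (tolerance, relative tolerance, coarsest-level cell count) is an assumption from memory and should be checked against your installation.)

    solvers
    {
        // conjugate gradient with incomplete-Cholesky preconditioning,
        // absolute tolerance 1e-05, relative tolerance 0
        p               ICCG 1e-05 0;

        // alternative AMG entry; the last number would be the coarsest-level
        // cell count (the "NCells" mentioned above) -- argument order assumed
        // p            AMG 1e-05 0 26000;
    }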
|
April 19, 2005, 05:51 |
|
#2 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
1) Yes, the current implementation of AMG has poor performance running in parallel, but this can be improved by choosing a largish number of cells at the coarsest level, and it could be improved further if the coarsest level were solved with a direct solver on the master node. However, for channel LES cases we do not find that AMG offers any performance improvement over ICCG.

2) I would expect an implicit p-U code to be significantly slower than an optimised explicit FFT code, but I am surprised it's a factor of 10; I would expect 4 or 5.

2.1) You are probably finding that the cost of the pressure solution dominates the computation. How many PISO correctors are you doing? How many iterations is the pressure solver doing? Is your code running compressible or incompressible? If the case is compressible, you will find a significant gain in performance if you also run the implicit code compressible, because the diagonal dominance of the pressure equation improves convergence enormously.

2.2) You might get speed improvements with other compilers, although I have not found the Intel compiler better than gcc. I have tried the KAI compiler in the past and got a reasonable improvement in performance; perhaps the PathScale or Portland compilers would be even better, but I haven't tried them yet.
|
April 19, 2005, 05:56 |
|
#3 |
Senior Member
Gavin Tabor
Join Date: Mar 2009
Posts: 181
Rep Power: 17 |
On point 2): is the in-house Fortran code structured or unstructured? If it's limited to single-block domains, that suggests the addressing might be significantly simpler, which might give a further speedup.
Gavin |
|
April 19, 2005, 06:22 |
|
#4 |
New Member
Steffen Jahnke
Join Date: Mar 2009
Posts: 14
Rep Power: 17 |
Additional notes to:
2.1) 2 PISO correctors; the first corrector step needs around 170 iterations and the second around 150. Our code is incompressible.
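(For readers following along: the number of PISO correctors is set in the PISO sub-dictionary of the case's system/fvSolution, along these lines. This is a minimal sketch, not the poster's actual file.)

    PISO
    {
        nCorrectors              2;    // two pressure-correction sweeps per time step
        nNonOrthogonalCorrectors 0;    // extra corrections for non-orthogonal meshes
    }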
|
April 19, 2005, 06:23 |
|
#5 |
Senior Member
Eugene de Villiers
Join Date: Mar 2009
Posts: 725
Rep Power: 21 |
What is the interconnect on your cluster? If it is just fast Ethernet or gigabit Ethernet, I would expect this to massively decrease the parallel performance of channelOodles' implicit solver. Even if you had InfiniBand or a 4-way Opteron, running only 260000 cells over 4 processors is not going to get you near 100% efficiency (I get ~85% for 500k cells on a quad Opteron). Please run the case on a single processor and compare that to your 4-way timings.
Explicit codes have much less of a comms overhead. Combine a 50% efficiency with Henry's factor of 4-5 and you have your 10x slowdown.
|
April 19, 2005, 06:28 |
|
#6 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
With 170 iterations in the first corrector and 150 in the second, it sounds like PISO is not converging very well, which is probably due to your Courant number being quite high. Are these numbers from early in the run or after the flow has developed?
|
|
April 19, 2005, 06:58 |
|
#7 |
New Member
Steffen Jahnke
Join Date: Mar 2009
Posts: 14
Rep Power: 17 |
I need to correct my previous posting: the
number of iterations in the first step was around 140 and in the second around 50. This is for the converged flow. |
|
July 19, 2005, 05:49 |
|
#8 |
Member
Ralph
Join Date: Mar 2009
Posts: 40
Rep Power: 17 |
Hi,
I did a few parallel calculations to check the capabilities of OpenFOAM. For this I used 2 Pentiums at 3.2 GHz connected with gigabit Ethernet.

1) Amongst others I ran the OpenFOAM tutorial "channelOodles". I just decomposed it into 2 parts and didn't make any other changes (the mesh consists of 60,000 cells). Comparing the decomposed case (running on two processors) to the undecomposed case (one processor), I got a speedup of about 1.5. Is that a realistic value?

2) When doing the same test with other cases and solvers I got very different results. Not surprisingly, when the number of cells is lower, the speedup is lower. Elsewhere on this forum I read about a speedup of 1.3 when running the "cavity" tutorial on a network with 2 processors. Is that really realistic? The cavity tutorial consists of just 400 cells.

3) Could anyone give me a rough estimate of what speedup I can expect, depending on the mesh size and the number of processors I use?

Ralph
|
July 19, 2005, 06:00 |
|
#9 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
What values of the parallel running control parameters did you choose and what difference in performance did you get when you changed them?
|
|
July 19, 2005, 06:07 |
|
#10 |
Member
Ralph
Join Date: Mar 2009
Posts: 40
Rep Power: 17 |
Hello Henry,
I'm not quite sure what you mean by running control parameters. I left all parameters unchanged compared to the original tutorial case. The simulation ran at a Courant number of about 0.3. I just did a simple decomposition of the "channelOodles" case into 2 equal parts; decomposition in the wall-normal direction resulted in the highest speedup (1.5).
Ralph
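(A simple decomposition like this is configured in the case's system/decomposeParDict. A minimal sketch for a two-way split in the wall-normal direction, assuming y is the wall-normal axis as in the channel tutorial, might look like this.)

    numberOfSubdomains  2;

    method              simple;

    simpleCoeffs
    {
        n               (1 2 1);    // 1 x 2 x 1 split: cut the domain in y only
        delta           0.001;
    }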
|
July 19, 2005, 06:26 |
|
#11 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
Take a look at this thread:
http://www.cfd-online.com/OpenFOAM_D...ges/1/819.html Also, are your machines connected via a cross-over cable or a switch? If a switch, is it a good one?
|
July 19, 2005, 06:39 |
|
#12 |
Member
Ralph
Join Date: Mar 2009
Posts: 40
Rep Power: 17 |
Thanks Henry,
I'll try playing with the parameters and check their influence. The machines are connected by a switch, and I was told it is a good one. Maybe you could tell me what you think of the speedup of 1.5 for "channelOodles" (60,000 cells) on 2 processors: is that rather a good or a bad value?
Ralph
|
July 19, 2005, 06:48 |
|
#13 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
I would expect better speed-up than that but because the case has two sets of cyclic planes you can end up with quite a large communications overhead if you split the domain between either pair. I suggest you decompose the case by splitting between the walls, i.e. in the y-direction if I remember correctly. Also I expect that you will get better speed-up by using floatTransfer.
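(As an aside for readers trying this: floatTransfer is not a case setting but a global parallel switch. The sketch below shows where it sits in later OpenFOAM versions, under OptimisationSwitches in etc/controlDict; the exact location and the companion switches in the 2005-era release may have differed, so treat everything except floatTransfer as an assumption.)

    OptimisationSwitches
    {
        floatTransfer    1;             // exchange processor-boundary data as floats, not doubles
        nProcsSimpleSum  0;             // threshold controlling how global sums are gathered
        commsType        nonBlocking;   // blocking / nonBlocking / scheduled
    }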
|
|
July 19, 2005, 12:03 |
|
#14 |
Member
Ralph
Join Date: Mar 2009
Posts: 40
Rep Power: 17 |
I did some calculations. The results are not very satisfying. Maybe you could tell me what you think about them.
I ran "channelOodles" with different numbers of mesh cells. I ran each case both on 1 processor (not decomposed) and on 2 processors (decomposed). For the decomposed cases I split the mesh between the walls. The following table shows the speedup when going from the 1-processor case to the 2-processor case:

Ncells     speedup
40,000     1.35
60,000     1.50  ("original" tutorial case)
80,000     1.55
100,000    1.65
120,000    1.65

I did all parallel calculations both with and without "floatTransfer". The results DID NOT change. I refined the mesh by adding points in the spanwise direction, which means that the number of cells on the processor interfaces also increases. But I did one calculation of the 120,000-cell case with refinement in the wall-normal direction (so the number of cells on the interface is not changed by the refinement), and, what really surprised me, I got the same simulation times as above for both the 1- and the 2-processor run.
Ralph
|
July 19, 2005, 12:13 |
|
#15 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
I get a marked improvement when using floatTransfer and I am surprised you don't. It appears your case is limited by the global sums (i.e. latency) rather than bandwidth, otherwise you would have seen a difference when using floatTransfer and when changing the refinement direction. I think you should run some tests on your network performance to see where the bottleneck is. It might also be interesting if you could run the tests with a cross-over cable instead of the switch.
|
|
July 19, 2005, 14:40 |
|
#16 |
Member
Ralph
Join Date: Mar 2009
Posts: 40
Rep Power: 17 |
Thanks Henry for all the information.
Next I will do some network checks and hope that I find the bottleneck. Could anyone tell me what a good speedup would be for my problem, so that I have some orientation? Or does anyone have experience with how the speedup depends on problem size?
Ralph
|
July 19, 2005, 16:20 |
|
#17 |
Senior Member
Eugene de Villiers
Join Date: Mar 2009
Posts: 725
Rep Power: 21 |
With only two processors you should be getting very near 100% speedup, and no less than 95%.
|
|
July 19, 2005, 16:33 |
|
#18 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
I would agree if the two processors were on a shared-memory or NUMA machine, but here they communicate across a Gigabit switch, in which case I would estimate the speedup will be less than 90%.
|
|
July 20, 2005, 06:30 |
|
#19 |
Member
Ralph
Join Date: Mar 2009
Posts: 40
Rep Power: 17 |
Thanks Eugene, thanks Henry.
I assume that by 100% speedup you mean half the calculation time of a single-processor run?
Ralph
|
July 20, 2005, 08:53 |
|
#20 |
Senior Member
Eugene de Villiers
Join Date: Mar 2009
Posts: 725
Rep Power: 21 |
I made up some quick numbers for two LES channels. In all cases float transfer is on; the rest is stock.

Two machines:
1. P4 3.0 GHz gigabit cluster (P4)
2. Opteron 2x246 (O2)

For the 60k stock channel run I get the following timings:
P4 single: 137s
P4 two: 87s
O2 single: 118s
O2 two: 64s

P4 parallel x2 cpu efficiency: 79.3%
O2 parallel x2 cpu efficiency: 92.2%

These numbers are misleading though. A 60k mesh with 1200 communicating faces is quite heavy on the comms. I therefore made a 480k mesh and re-ran the test on the P4s. This time the picture is a lot different:

P4 parallel x2 cpu efficiency: 96.7%

That's very close to 100% speedup. As you can see, the question of parallel efficiency is not straightforward, and any code that claims it can consistently provide this performance is doing something ... well, let's just say "special" and leave it at that.

A quick additional stat: the cell-to-comm-face ratio for the 60k case is 50:1, while the same stat for the 480k case is 100:1. Additionally, there might be issues unrelated to comms performance (like cache size) that can also influence the calculation times, skewing the scaling results. All in all, a less than trivial question.

Note: cpu efficiency calculated as (0.5*1cpu time)/(2cpu time)*100
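(As a worked check of that formula against the Opteron timings above: (0.5 * 118 s) / 64 s * 100 = 92.2%, i.e. the two-CPU run takes 64 s where perfect scaling would predict 59 s.)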
|