Performance issues with channelOodLes

April 19, 2005, 05:34   #1
Steffen Jahnke (steja), New Member

To test OpenFOAM's LES performance we used the channelOodLes solver to compute a simple turbulent channel flow, similar to the one described in the channel395 tutorial but with Re_tau = 180 (64x64x64 cells, 2 blocks, and simple grading to cluster cells close to the walls). We tested all available time schemes and different pressure solvers (AMG with nCells = 26000, and ICCG with an absolute convergence tolerance of 1e-5). The time step was set to 0.005, yielding an average CFL number of about 1.1 during the simulation.
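For reference, the pressure solver selection described above is made in the case's system/fvSolution dictionary. A rough sketch in the single-line solver syntax of that OpenFOAM release (keyword spelling and the AMG argument order are from memory and may differ between versions):

    solvers
    {
        // ICCG: absolute tolerance 1e-05, relative tolerance 0
        p    ICCG 1e-05 0;

        // AMG alternative: tolerance, relative tolerance, nCells at the coarsest level
        // p    AMG 1e-05 0 26000;
    }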

Running OpenFOAM in parallel on 4 CPUs of a Linux cluster, we found that:

1)
Using AMG is much slower than ICCG. Is that expected, or are there options other than nCells that have a significant influence on performance?


2)
Compared with our in-house Fortran code running the same test case on a single CPU, OpenFOAM still needs almost 10 times more computation time per time step. We should mention that our explicit LES code is limited to single-block domains and is optimized with the Intel Fortran compiler; it also uses an optimized pressure solver (FFT in the periodic directions).

2.1)
The question concerning this statement is: is the difference simply the price we have to pay for a very flexible code infrastructure, or are there built-in options one can play with to get better performance?

2.2)
Is it worth using another compiler (e.g. Intel), or are the differences only small?

Thanks in advance for any help.

April 19, 2005, 05:51   #2
henry, Senior Member

1) Yes, the current implementation of AMG has poor performance when running in parallel, but this can be improved by choosing a largish number of cells at the coarsest level, and could be improved further if the coarsest level were solved using a direct solver on the master node. However, for channel LES cases we do not find that AMG offers any performance improvement over ICCG.

2) I would expect an implicit p-U code to be significantly slower than an optimised explicit FFT code, but I am surprised it's a factor of 10; I would expect 4 or 5.

2.1) You are probably finding that the cost of the pressure solution dominates the computation. How many PISO correctors are you using, and how many iterations is the pressure solver taking?

Is your code running compressible or incompressible? If the case is compressible you will find a significant gain in performance if you also run the implicit code compressible, because the diagonal dominance of the pressure equation improves convergence enormously.
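For reference, the corrector count is set in the PISO sub-dictionary of system/fvSolution; a minimal sketch with the keyword names used in the tutorials of that era (check your version):

    PISO
    {
        nCorrectors              2;   // pressure correctors per time step
        nNonOrthogonalCorrectors 0;   // extra corrections for non-orthogonal meshes
    }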

2.2) You might get speed improvements with other compilers, although I have not found the Intel compiler better than gcc. I have tried the KAI compiler in the past and got a reasonable improvement in performance; perhaps the PathScale or Portland compilers would be even better, but I haven't tried them yet.

April 19, 2005, 05:56   #3
Gavin Tabor (grtabor), Senior Member

On point 2): is the in-house Fortran code structured or unstructured? If it's limited to single-block domains, that suggests the addressing might be significantly simpler, which might give a further speedup.

Gavin

April 19, 2005, 06:22   #4
Steffen Jahnke (steja), New Member

Additional notes on 2.1): we use 2 PISO correctors; the first corrector step needs about 170 iterations and the second about 150.
Our code is incompressible.

April 19, 2005, 06:23   #5
Eugene de Villiers (eugene), Senior Member

What is the interconnect on your cluster? If it is just fast Ethernet or Gigabit Ethernet, I would expect it to massively decrease the parallel performance of channelOodles' implicit solver. Even with InfiniBand or a 4-way Opteron, running only 260,000 cells over 4 processors is not going to get you near 100% efficiency (I get ~85% for 500k cells on a quad Opteron). Please run the case on a single processor and compare that to your 4-way timings.

Explicit codes have much less of a comms overhead. Combine a 50% parallel efficiency with Henry's factor of 4-5 and you have your 10x slowdown.

April 19, 2005, 06:28   #6
henry, Senior Member

With 170 iterations in the first corrector and 150 in the second, it sounds like PISO is not converging very well, which is probably due to your Courant number being quite high. Are these numbers from early in the run or after the flow has developed?

April 19, 2005, 06:58   #7
Steffen Jahnke (steja), New Member

I need to correct my previous posting: the number of iterations in the first corrector step was around 140 and in the second around 50. This is for the converged flow.

July 19, 2005, 05:49   #8
Ralph (ralph), Member

Hi,
I did a few parallel calculations to check the capabilities of OpenFOAM, using 2 Pentium 4 machines at 3.2 GHz connected via Gigabit Ethernet.

1) Among other things I ran the OpenFOAM tutorial "channelOodles". I just did a decomposition into 2 parts and made no other changes (the mesh consists of 60,000 cells). Comparing the decomposed case (running on two processors) to the undecomposed case (one processor), I got a speedup of about 1.5.
Is that a realistic value?

2) When doing the same test with other cases and solvers I got very different results. Not surprisingly, when the number of cells is lower, the speedup is lower. Elsewhere on this forum I read about a speedup of 1.3 when running the "cavity" tutorial on a network with 2 processors. Is that really realistic, given that the cavity tutorial consists of just 400 cells?

3) Could anyone give me a rough estimate of what speedup I can expect, depending on the mesh size and the number of processors used?

Ralph

July 19, 2005, 06:00   #9
henry, Senior Member

What values of the parallel running control parameters did you choose and what difference in performance did you get when you changed them?

July 19, 2005, 06:07   #10
Ralph (ralph), Member

Hello Henry,
I'm not quite sure what you mean by running control parameters. I left all parameters unchanged compared to the original tutorial case.
The simulation ran at a Courant number of about 0.3.
I did just a simple decomposition of the "channelOodles" case into 2 equal parts; decomposition in the wall-normal direction resulted in the highest speedup (1.5).
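For reference, a simple decomposition into two parts across the channel corresponds roughly to the following system/decomposeParDict (assuming y is the wall-normal direction):

    numberOfSubdomains 2;

    method          simple;

    simpleCoeffs
    {
        n           (1 2 1);   // 1 division in x and z, 2 across the channel (y)
        delta       0.001;     // cell-skew tolerance, the usual default
    }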
Ralph

July 19, 2005, 06:26   #11
henry, Senior Member

Take a look at this thread:

http://www.cfd-online.com/OpenFOAM_D...ges/1/819.html

Also, are your machines connected via a cross-over cable or a switch? If a switch, is it a good one?

July 19, 2005, 06:39   #12
Ralph (ralph), Member

Thanks Henry,
I'll try playing with the parameters and check their influence.
The machines are connected by a switch, and I was told it is a good one.
Maybe you could tell me what you think of the speedup of 1.5 for "channelOodles" (60,000 cells) on 2 processors. Is that rather a good or a bad value?
Ralph

July 19, 2005, 06:48   #13
henry, Senior Member

I would expect a better speed-up than that, but because the case has two sets of cyclic planes you can end up with quite a large communications overhead if you split the domain between either pair. I suggest you decompose the case by splitting between the walls, i.e. in the y-direction if I remember correctly. I also expect that you will get a better speed-up by using floatTransfer.
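For reference, floatTransfer is one of the global optimisation switches, normally set in the installation's etc/controlDict (or a user copy of it); a sketch, with switch names from memory, so check your version:

    OptimisationSwitches
    {
        floatTransfer   1;   // transfer parallel boundary data as floats rather than doubles
        nProcsSimpleSum 0;   // related parallel tuning parameter, left at its default here
    }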

July 19, 2005, 12:03   #14
Ralph (ralph), Member

I did some calculations, and the results are not very satisfying. Maybe you could tell me what you think about them.
I ran "channelOodles" with different mesh sizes, each case both on 1 processor (not decomposed) and on 2 processors (decomposed). For the decomposed cases I split the mesh between the walls. The following table shows the speedup of the 2-processor case over the 1-processor case.

Ncells    speedup
 40,000   1.35
 60,000   1.50 ("original" tutorial case)
 80,000   1.55
100,000   1.65
120,000   1.65

I did all parallel calculations both with and without "floatTransfer". The results did not change.

I refined the mesh by adding points in the spanwise direction, which means the number of cells on the processor interfaces increases as well. However, I ran one calculation of the 120,000-cell case with refinement in the wall-normal direction instead (so the number of cells on the interface is unchanged by the refinement), and to my surprise I got the same simulation times as above for both the 1- and 2-processor runs.

Ralph

July 19, 2005, 12:13   #15
henry, Senior Member

I get a marked improvement when using floatTransfer and I am surprised you don't. It appears your case is limited by the global sums (i.e. latency) rather than bandwidth; otherwise you would have seen a difference from floatTransfer and from the refinement direction. I think you should run some tests on your network performance to see where the bottleneck is. It might also be interesting if you could run the tests with a cross-over cable instead of the switch.

July 19, 2005, 14:40   #16
Ralph (ralph), Member

Thanks Henry for all the information.
Next I will do some network checks and hope to find the bottleneck.

Could anyone tell me what a good speedup would be for my problem, so that I have some point of reference? Or does anyone have experience with how the speedup depends on problem size?

Ralph

July 19, 2005, 16:20   #17
Eugene de Villiers (eugene), Senior Member

With only two processors you should be getting very close to 100% speedup, certainly no less than 95%.

July 19, 2005, 16:33   #18
henry, Senior Member

I would agree if the two processors were on a shared-memory or NUMA machine, but here they communicate across a Gigabit switch, in which case I would estimate the speedup will be less than 90%.

July 20, 2005, 06:30   #19
Ralph (ralph), Member

Thanks Eugene, thanks Henry
I assume that by 100% speedup you mean half the calculation time of a single-processor run?
Ralph

July 20, 2005, 08:53   #20
Eugene de Villiers (eugene), Senior Member

I put together some quick numbers for two LES channel cases. In all cases float transfer is on; the rest is stock.
Two machines:
1. P4 3.0GHz Gigabit cluster (P4)
2. Opteron 2x246 (O2)

For the 60k stock channel run I get the following timings:
P4 single: 137s
P4 two: 87s

O2 single: 118s
O2 two: 64s

P4 parallel x2 cpu efficiency: 79.3%
O2 parallel x2 cpu efficiency: 92.2%

These numbers are misleading though. A 60k mesh with 1200 communicating faces is quite heavy on the comms. I therefore made a 480k mesh and re-ran the test on the P4s. This time the picture is a lot different:

P4 parallel x2 cpu efficiency: 96.7%

That's very close to 100% speedup. As you can see, the question of parallel efficiency is not straightforward, and any code that claims it can consistently provide this performance is doing something ... well, let's just say "special" and leave it at that.

A quick additional stat: the cell to comm-face ratio for the 60k case is 50:1, while for the 480k case it is 100:1. Additionally, there might be issues unrelated to comms performance (like cache size) that can also influence the calculation times, skewing the scaling results.

All in all, a less than trivial question.

Note: CPU efficiency is calculated as (0.5 * single-CPU time) / (two-CPU time) * 100.
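Worked through with the O2 numbers above, for example: (0.5 * 118 s) / 64 s * 100 = 92.2%.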
