|
October 21, 2008, 13:40 |
|
#1 |
Senior Member
John Deas
Join Date: Mar 2009
Posts: 160
Rep Power: 17 |
Hi,
I wanted to launch several runs of one case on a multicore machine at the same time. Since each process is completely independent, I hoped that the computation time of one run wouldn't affect the others too much. On the contrary, I observed an increase in the calculation time spent on each timestep, as shown in the following chart (the results are not very precise, I used a timer to get them quickly, but the trends are representative). There is a very big increase in computation time, and it seems to occur in steps (e.g. with 3 or 4 cores the results are about the same).

To determine whether the problem comes from OpenFOAM or not, I checked whether another application would show the same behaviour. I chose to launch several runs of Scilab computing a basic matrix inversion. The command was: for i=1:10000, inv(rand(1000,1000)); The following results were obtained. As one can see, the performance of each run is relatively unaffected by the others until almost all the cores are occupied by Scilab processes.

I am using a Dell workstation with 2 Intel E5450 processors, each consisting of 4 cores clocked at 3 GHz. The operating system is RHEL 5. Is my problem coming from OpenFOAM, and should I adjust some parameters? I suspect that OpenFOAM makes many more accesses to RAM than Scilab does, and maybe slows down the memory accesses of the other processes. Has somebody witnessed the same behaviour with OpenFOAM?

Thank you, JD |
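(For anyone who wants to reproduce this kind of test, a minimal sketch of how the concurrent runs can be launched and timed — the run directories run1..run4 and icoFoam as the solver are placeholders, not necessarily what was used here:)

    #!/bin/sh
    # Launch N independent copies of the same case and time each one.
    # Assumes the case has been copied into run1, run2, ... beforehand.
    N=4
    for i in $(seq 1 $N); do
        ( cd run$i && /usr/bin/time -o ../time$i.log icoFoam > log 2>&1 ) &
    done
    wait   # block until all background runs have finished
    cat time*.log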
|
October 21, 2008, 13:42 |
|
#2 |
Senior Member
John Deas
Join Date: Mar 2009
Posts: 160
Rep Power: 17 |
Sorry for the image sizes, didn't check that!
|
|
October 21, 2008, 14:16 |
|
#3 |
Senior Member
Mark Olesen
Join Date: Mar 2009
Location: https://olesenm.github.io/
Posts: 1,715
Rep Power: 40 |
We have dual-core, dual-CPU servers.
With both OpenFOAM and STAR-CD we get better performance if we use a single core from each CPU and split across more machines, rather than using all available CPUs and cores. Apparently the memory bottleneck is much more significant than the additional network traffic caused by splitting across more machines. With quad-cores, the memory bottleneck will look even worse.

I think your best option is to run each case in parallel and run the cases sequentially. In that case, you should look at how the parallel speed-up behaves for a single case. |
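(A rough sketch of that approach with an OpenMPI-based OpenFOAM install — the hostnames, the core count, and icoFoam as the solver are made-up examples:)

    # Decompose the case into 4 subdomains (set numberOfSubdomains 4
    # in system/decomposeParDict first), then run one process per node
    # so each process gets a full CPU's worth of memory bandwidth.
    decomposePar
    mpirun --hostfile machines -np 4 icoFoam -parallel
    # where the "machines" file lists one slot per node:
    #   node1 slots=1
    #   node2 slots=1
    #   node3 slots=1
    #   node4 slots=1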
|
October 21, 2008, 14:44 |
|
#4 |
New Member
Juergen Neubauer
Join Date: Mar 2009
Location: Los Angeles, CA, USA
Posts: 2
Rep Power: 0 |
John and Mark,
just out of curiosity: which processor architecture do you use? AMD or Intel? Xeon? Opteron? Is NUMA active? What does 'uname -a' report? I'm about to build a multiprocessor machine for work with OpenFOAM, and I'd like to collect some experience from multi-core users. Thanks a lot. Ciao, Juergen |
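(For anyone wanting to check their own box: a couple of commands that should show this on most Linux systems, assuming the numactl package is installed:)

    numactl --hardware                            # lists NUMA nodes and per-node memory if NUMA is active
    dmesg | grep -i numa                          # kernel boot messages mention NUMA initialisation
    grep "physical id" /proc/cpuinfo | sort -u    # count of distinct physical CPU sockets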
|
October 21, 2008, 15:28 |
|
#5 |
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21 |
If you search the forum, you'll find that many of us have experimented with dual/quad core offerings from AMD and Intel and posted the results here.
|
|
October 22, 2008, 03:35 |
|
#6 |
Senior Member
Markus Rehm
Join Date: Mar 2009
Location: Erlangen (Germany)
Posts: 184
Rep Power: 17 |
Hi John,
matrix inversion is computationally very dense, which means that memory traffic is comparatively small and much of the work can be done in the CPU cache. (Roughly: inverting an n x n matrix takes on the order of n^3 floating-point operations on only 8*n^2 bytes of data, so for n = 1000 each byte fetched from memory takes part in about a hundred operations; a CFD sparse-matrix solver performs only a handful of operations per byte, so it hammers the memory bus instead.)

By the way: which solver and problem size did you benchmark? It depends a lot on the case and its size. You should benchmark your own cases and decide whether it is better to fill up the whole cluster or leave half of the processors empty, as Mark mentioned.

Regards, Markus. |
|
October 29, 2008, 09:43 |
|
#7 |
Senior Member
John Deas
Join Date: Mar 2009
Posts: 160
Rep Power: 17 |
First, thank you for your answers, and sorry for not replying as promptly myself.
@Juergen: uname -a gave me: Linux lambda30 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux. Regarding NUMA being enabled or not, I have no idea.

@Mark: "I think your best chance is to run each case in parallel and run the cases sequentially." Doing this, am I not adding an inter-process communication slowdown on top of the existing memory bandwidth slowdown?

@Florian: what processors are you using?

I also have results from another calculation on this machine, performed with Fluent on an 8-million-cell case. That case was run in parallel, and the results are reported below, together with the speedups obtained when running concurrent OpenFOAM cases at the same time. The OpenFOAM case is a plane channel with 300,000 calculation points; I am running a modified version of icoFoam which lets me sustain a constant pressure gradient and compute some statistics on the fly (like what channelOodles did before using controlDict functions). The Fluent case is an 8-million-cell rectangular domain with a velocity inlet, a pressure outlet, and symmetries on the sides. One astonishing result is that Fluent run in parallel scales better than individual small OpenFOAM cases run simultaneously.

As I said, the Scilab results were obtained by inverting 1000x1000 matrices. Since cat /proc/cpuinfo reports a cache size of 6000 KB, I suspect the inversion is done largely in cache, which would explain the good speedup.

Based on what I have read on this forum, since the datasets used in CFD are often large, memory bandwidth is currently the most limiting factor. Multi-core Xeon CPUs are poorly suited to such tasks because all their cores share a single front side bus to access memory. The Opteron architecture, however, is not subject to this limitation: every core has a direct path to memory, since the memory controller sits on the processor die.

As my laboratory is willing to invest in a cluster, I guess a good base could be a bunch of multi-core Opterons connected with InfiniBand. I don't understand, then, why so many clusters seem to use Xeon machines. Do they have some kind of improvement that makes their shared bandwidth efficient anyway? |
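(One way to test the bandwidth hypothesis directly, rather than inferring it from solver timings, would be a pure memory-bandwidth benchmark such as STREAM, run alone and then in several concurrent copies. A sketch, assuming gcc and stream.c from http://www.cs.virginia.edu/stream/; note the array-size macro is N in older versions of stream.c and STREAM_ARRAY_SIZE in newer ones:)

    # Build STREAM with a working set well beyond the 6 MB cache.
    gcc -O3 -DN=20000000 stream.c -o stream
    # Run 1, 2, 4 and 8 copies at once and compare per-copy bandwidth.
    for n in 1 2 4 8; do
        echo "=== $n concurrent copies ==="
        for i in $(seq 1 $n); do ./stream > stream_${n}_${i}.log & done
        wait
        grep "Triad:" stream_${n}_*.log
    done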
|
November 28, 2008, 14:54 |
|
#8 |
Senior Member
John Deas
Join Date: Mar 2009
Posts: 160
Rep Power: 17 |
I posted this one on CFD Online. Not everybody might track posts there and, well, as I am mainly doing calculations with OpenFOAM, I value comments from this forum more.
CPUs have become much faster than memory access, which puts a lot of pressure on the FSB paradigm, conceived long ago when memory access was comparatively fast. The CPU cache was developed to limit accesses to main memory, but, because of the large amount of data needed to solve the Navier-Stokes equations on a large domain, the cache becomes less effective: it needs to be refilled frequently with data from RAM.

HyperTransport from AMD is one solution, as it removes the FSB and allows the Opteron CPU to be connected directly to memory. However, recent Xeons provide several FSBs (one per core), which also circumvents the saturation of a single FSB and might explain why Xeons equip such a large percentage of clusters despite their use of the "old" FSB technology.

What is your take on this? |
|
January 15, 2009, 04:27 |
|
#9 |
Member
florian
Join Date: Mar 2009
Location: Mannheim - Vincennes - Valenciennes, Deutchland - France
Posts: 34
Rep Power: 17 |
John Deas -> I use Dell workstations with 2 Intel Xeon quad-core processors and 64 GB of memory; the machines are connected together with a Gbit network.
My speedup is nearly linear. I can't make a nice plot because I am never alone on the machines. Here are the results obtained by ENEA (the Italian national agency for new technologies, energy and the environment): http://www.eneagrid.enea.it/papers_presentations/papers/NapoliEScience2008_09_CRESCO.pdf With InfiniBand their results are linear. The study shows that the bandwidth of the interconnect becomes a major criterion as the number of cores increases; with Gbit bandwidth the effect of this limitation appears beyond about 20 cores. Your machines are maybe connected with only 100 Mbit Ethernet bandwidth. |
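(An easy way to rule that out, assuming eth0 is the interface carrying the MPI traffic:)

    /sbin/ethtool eth0 | grep Speed    # needs root; a gigabit link reports "Speed: 1000Mb/s"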
|
January 15, 2009, 04:44 |
|
#10 |
New Member
VIJAYAKUMAR R
Join Date: Mar 2009
Location: BANGALORE, KARNATAKA, INDIA
Posts: 20
Rep Power: 17 |
I need to run a combustion problem. I used XiFoam as the solver, but some errors come up while running. I need to know which solver is good for combustion problems, and a few guidelines for solving them.
|
|
|
|