November 20, 2013, 08:50
Cluster Parallelization Performance

#1
Member
Join Date: Apr 2009
Posts: 36
Rep Power: 17
I have been fortunate enough to be given some hardware to try to set up an OpenFOAM cluster. I have it up and running with two nodes at the moment, but I am getting unexpectedly poor performance. I am hoping someone can provide some input as to where to look. Here is the info:
Hardware: two identical HP Z400s, each with a Xeon W3550 @ 3.07 GHz (4 cores) and 11.7 GB of memory. They are connected via a Linksys RVS4000 gigabit switch. I have used iperf and can vouch that the machines are transferring at gigabit speed. The OpenFOAM version is 2.2. Both machines run Ubuntu 13.10 and have identical setups.

My test case is the pimpleDyMFoam tutorial mixerVesselAMI2D. I have thrown away the default mesh and created two levels of refinement; the first level of refinement is 307200 cells. I have three results for the first case: one with a single core (no parallel option), one with a single node and 4 processors, and one with both nodes and 8 processors. I am using the scotch decomposition method (per the tutorial) for both single-host and multi-host runs. The 2-node case is decomposed as:
Code:
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "system";
    object      decomposeParDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

//- Force AMI to be on single processor. Can cause imbalance with some
//  decomposers.
//singleProcessorFaceSets ((AMI -1));

numberOfSubdomains 8;

method          scotch;

distributed     yes;

roots           ( );

// ************************************************************************* //
To launch the cases:
Parallel: time mpirun -hostfile hostfile pimpleDyMFoam -parallel > log
Single core: time pimpleDyMFoam > log
The hostfile looks like
Code:
192.168.0.3 slots=4
192.168.0.4 slots=4
Results of the first level of mesh refinement look like:
Code:
Single Core Run
real    79m05.605s
user    78m17.394s
sys     0m44.688s

Single Host 4 core machine
real    42m49.394s
user    168m27.428s
sys     0m13.658s

Full Parallel Run
real    60m58.221s
user    104m3.251s
sys     137m15.823s
Code:
Single Node 4 Core
real    65m20.622s
user    256m19.924s
sys     0m56.965s

Full Parallel Run
real    58m50.084s
user    143m23.455s
sys     90m40.328s
So, as I'm trying this, I'm seeing a couple of things. Firstly, I suppose the AMI may be causing issues? I wouldn't expect the AMI to degrade the parallel performance so much. Also, I am seeing something about a
Code:
nCellsInCoarsestLevel 10;
Anyways, I will keep churning through these -- but any insight or help is appreciated; thanks!

edit: The single node job finished 10 min faster than projected. Barely a 10% speedup on the 2 node, 8 processor run.

edit 2: Job finished with modified nCellsInCoarsestLevel. I went back to the "fine" case and set that value:
Code:
nCellsInCoarsestLevel 550;  // ~ sqrt(300000)
Code:
Single Core Run
real    79m05.605s
user    78m17.394s
sys     0m44.688s

Single Host 4 core machine
real    42m49.394s
user    168m27.428s
sys     0m13.658s

Full Parallel Run
real    60m58.221s
user    104m3.251s
sys     137m15.823s

Full Parallel Run, Modified nCellsInCoarsestLevel
real    57m39.171s
user    90m7.315s
sys     138m22.254s
Code:
singleProcessorFaceSets ((AMI -1));

Last edited by minger; November 20, 2013 at 12:00. Reason: added run with nCellsInCoarsestLevel
November 21, 2013, 18:45

#2
Member
Join Date: Apr 2009
Posts: 36
Rep Power: 17
It seems that the AMI and/or dynamic mesh motion was SEVERELY slowing the parallelization down. I went to a more basic test case, and chose the pitzDaily simpleFoam test. Results are:
Code:
================================
pitzDaily

Single Host 4 core machine
real    0m23.332s
user    1m24.179s
sys     0m0.688s

Full Parallel
real    0m48.747s
user    1m5.969s
sys     1m56.091s

================================
pitzDaily Fine - 49k cells

Single Host 4 core machine
real    2m33.846s
user    10m8.397s
sys     0m0.982s

Full Parallel
real    2m36.021s
user    5m30.923s
sys     4m32.839s

================================
pitzDaily xFine - 195k cells

Single Host 4 core machine
real    45m59.531s
user    182m16.379s
sys     0m6.847s

Full Parallel
real    19m44.253s
user    61m9.221s
sys     16m59.335s
It does raise the question as to whether it's the AMI or the DyM that is causing the slowdown.
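The pattern in these results (only the xFine case benefits from the second node) is consistent with the per-rank problem size being too small to hide gigabit-ethernet latency. A rough cells-per-rank estimate for 8 ranks (a sketch; the ~12k figure for the default pitzDaily mesh is my assumption, while the 49k and 195k counts are from the runs above):

```shell
# Cells per MPI rank for each pitzDaily variant when decomposed into 8.
# 12225 is the usual default pitzDaily cell count (assumption); 49000 and
# 195000 are the fine/xFine counts reported above.
for cells in 12225 49000 195000; do
  echo "$cells cells -> $((cells / 8)) cells per rank"
done
```

An often-quoted rule of thumb is that each rank needs on the order of tens of thousands of cells before computation outweighs communication on commodity gigabit interconnects; only the 195k-cell case (about 24k cells per rank) clears that bar here, which would explain why it is the only one that scales across the two nodes.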