January 8, 2007, 22:41
#1
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
Hi OpenFOAMers,
I just finished testing OpenFOAM speedup on an 8-CPU (16-core) Opteron machine loaded with 60 GB RAM. The results are pretty impressive, given that the latency in an SMP-like system is really very low. First, some technical details of the hardware used:

[cfd@sunfire icoFoam]$ uname -a
Linux sunfire 2.6.9-42.0.2.ELlargesmp #1 SMP Tue Aug 22 18:52:10 CDT 2006 x86_64 x86_64 x86_64 GNU/Linux

(Basically Scientific Linux 4.x)

[cfd@sunfire ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 885
stepping        : 2
cpu MHz         : 2613.696
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext lm 3dnowext 3dnow pni
bogomips        : 5230.07
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

[... processors 1-15 are identical Dual Core AMD Opteron(tm) Processor 885 entries; physical ids run from 0 to 7, each with core ids 0 and 1 ...]

[cfd@sunfire ~]$ free -mot
             total       used       free     shared    buffers     cached
Mem:         59923      48535      11387          0         83      37052
Swap:        59839          6      59833
Total:      119763      48541      71221

checkMesh output reads:

[cfd@sunfire icoFoam]$ checkMesh . one_sq_cyl_3d_unsteady_wtavg_4_2_cpus
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.3                                   |
|   \\  /    A nd           | Web:      http://www.openfoam.org               |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/

Exec   : checkMesh . one_sq_cyl_3d_unsteady_wtavg_4_2_cpus
Date   : Jan 08 2007
Time   : 18:48:18
Host   : sunfire
PID    : 13166
Root   : /home/cfd/OpenFOAM/cfd-1.3/run/tutorials/icoFoam
Case   : one_sq_cyl_3d_unsteady_wtavg_4_2_cpus
Nprocs : 1

Create time

Create polyMesh for time = constant

Time = constant

Boundary definition OK.
Number of points:    11042070
           edges:    32811382
           faces:    32498784
  internal faces:    31878048
           cells:    10729472
boundary patches:    4
     point zones:    0
      face zones:    0
      cell zones:    0

Checking topology and geometry ...
Point usage check OK.
Upper triangular ordering OK.
Topological cell zip-up check OK.
Face vertices OK.
Face-face connectivity OK.
Basic topo ok ...
Checking patch topology for multiply connected surfaces ...
    Patch           Faces     Points    Surface
    ChannelWalls    604352    604734    ok (not multiply connected)
    ObstacleWalls   6144      6240      ok (not multiply connected)
    vinlet          5120      5265      ok (not multiply connected)
    poutlet         5120      5265      ok (not multiply connected)
Patch topo ok ...
Topology check done.

Domain bounding box: min = (-1.165 -0.02 -0.05) max = (0.705 0.02 0.05) meters.
Checking geometry...
Boundary openness in x-direction = -7.92611080393212e-19
Boundary openness in y-direction = -3.33563488923205e-14
Boundary openness in z-direction = 1.36264413686285e-15
Boundary closed (OK).
Max cell openness = 2.49399995957591e-21  Max aspect ratio = 1.74011094308106.  All cells OK.
Minimum face area = 1.95312499999909e-07. Maximum face area = 7.44396594083542e-06. Face area magnitudes OK.
Min volume = 1.8600559925712e-10. Max volume = 7.06069406503132e-09. Total volume = 0.00747000000001155. Cell volumes OK.
Mesh non-orthogonality Max: 0 average: 0
Non-orthogonality check OK.
Face pyramids OK.
Max skewness = 2.84219262700115e-10 percent. Face skewness OK.
Minimum edge length = 0.000312499999999998. Maximum edge length = 0.00312657168090757.
All angles in faces are convex or less than 10 degrees concave.
Face flatness (1 = flat, 0 = butterfly) : average = 1  min = 1
All faces are flat in that the ratio between projected and actual area is > 0.8
Geometry check done.

Number of cells by type:
    hexahedra:    10729472
    prisms:       0
    wedges:       0
    pyramids:     0
    tet wedges:   0
    tetrahedra:   0
    polyhedra:    0
Number of regions: 1 (OK).

Mesh OK.

Time = 0
No mesh.

End

fvSchemes reads:

// FoamX Case Dictionary.

FoamFile
{
    version         2.0;
    format          ascii;
    root            "/home/madhavan/OpenFOAM/madhavan-1.3/run/tutorials/icoFoam";
    case            "";
    instance        "system";
    local           "";
    class           dictionary;
    object          fvSchemes;
}

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

ddtSchemes
{
    default         CrankNicholson 1;
}

gradSchemes
{
    default         Gauss linear;
    grad(p)         Gauss linear;
}

divSchemes
{
    default         none;
    div(phi,U)      Gauss linear;
}

laplacianSchemes
{
    default         none;
    laplacian(nu,U) Gauss linear corrected;
    laplacian(1|A(U),p) Gauss linear corrected;
}

interpolationSchemes
{
    default           linear;
    interpolate(HbyA) linear;
}

snGradSchemes
{
    default         corrected;
}

fluxRequired
{
    default         no;
    p;
}

// ************************************************************************* //

And fvSolution:

// FoamX Case Dictionary.

FoamFile
{
    version         2.0;
    format          ascii;
    root            "/home/madhavan/OpenFOAM/madhavan-1.3/run/tutorials/icoFoam";
    case            "";
    instance        "system";
    local           "";
    class           dictionary;
    object          fvSolution;
}

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

solvers
{
//  p               ICCG 1e-06 0;
    p               AMG 1e-06 0 25;
    U               BICCG 1e-05 0;
}

PISO
{
    momentumPredictor yes;
    nCorrectors     2;
    nNonOrthogonalCorrectors 0;
    pRefCell        0;
    pRefValue       0;
}

// ************************************************************************* //

The case I ran was laminar unsteady vortex shedding past a square cylinder in a rectangular channel. The solver was a slightly modified version of icoFoam. Modifications included the calculation of lift/drag coefficients [stolen from Frank Bos ;)] and the writing of time-averaged velocity/pressure fields and velocity probes (stolen from oodles). 19 probeLocations were defined in all the simulations. The speedup was calculated as follows:

Speedup from 'N' CPUs = (ClockTime for a serial run) / (ClockTime for a parallel run with 'N' CPUs)

The time step chosen was 0.02 seconds. This ensured that the maximum Courant number stayed well below 1 (typically around 0.4). Starting from time t = 0, the simulation was run up to t = 0.68 seconds (i.e. 34 time steps). 'writeFormat' in controlDict was set to 'binary' and 'writePrecision' to 15. Metis decomposition was used with equal processor weighting throughout. All parallel runs were dedicated (i.e. only I was using the machine). LAM MPI was used throughout. In each of the parallel runs, the total RES memory reported by 'top' was around 10.2 GB.

Keeping in mind that dual-core chips are memory-bandwidth limited, two parallel configurations were tested:

1. In the first configuration, only one core from each physical processor was used. This was possible using the 'taskset' command in GNU/Linux, which allows one to hard-request specific cores (i.e. override the kernel's CPU affinity mask). The command also ensures that until the process quits, it stays locked to the user-specified set of CPUs. The maximum number of CPUs for this configuration was therefore 8.

2. The second configuration used both cores of each processor. Thus a 4-CPU run hard-requested two physical processors, and so on. In this configuration one could go up to 16 CPUs in total.
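Incidentally, for anyone repeating the exercise, the speedup numbers can be pulled straight out of the solver logs; here is a minimal sketch (the log file names are placeholders for however you name your runs):

# Extract the final ClockTime from each run's log and compute the
# speedup relative to the serial run. Assumes logs named log.serial,
# log.2, log.4, ... with the usual "ExecutionTime = ... ClockTime = ..."
# lines, where ClockTime's value is the 7th whitespace-separated field.
serial=$(grep 'ClockTime' log.serial | tail -1 | awk '{print $7}')
for n in 2 4 8 16; do
    par=$(grep 'ClockTime' log.$n | tail -1 | awk '{print $7}')
    echo "N=$n  speedup = $(echo "scale=2; $serial / $par" | bc)"
done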
The speedup results are available here: http://www.ualberta.ca/~madhavan/openfoam_speedup.eps

It can be seen that the first parallel configuration (i.e. using just 8 CPUs, one core from each physical processor) exhibits what appears to be super-linear speedup. This is explained in the following Wikipedia entry: http://en.wikipedia.org/wiki/Speedup. Has anyone experienced this with OpenFOAM before? The second parallel configuration (i.e. 16 cores) displays acceptable speedup as well; however, the maximum speedup in that case was around 15.2 using 16 cores, whereas a slightly higher speedup (15.964) was obtained with just 8 CPUs in the first configuration. Also noteworthy is that the memory-bandwidth limitation when using both cores does not seem to detrimentally impair the speedup.

A sample log file from an 8-CPU run is shown below. Each of the eight processes prints the OpenFOAM 1.3 banner and an 'Exec : icoFoam . one_sq_cyl_3d_unsteady_wtavg_4_8_cpus -parallel' line at startup; the repeated copies are omitted here:

/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.3                                   |
|   \\  /    A nd           | Web:      http://www.openfoam.org               |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/

Exec : icoFoam . one_sq_cyl_3d_unsteady_wtavg_4_8_cpus -parallel

[0] Date   : Dec 27 2006
[0] Time   : 11:17:35
[0] Host   : sunfire
[0] PID    : 5607
[0] Root   : /home/madhavan/OpenFOAM/madhavan-1.3/run/tutorials/icoFoam
[0] Case   : one_sq_cyl_3d_unsteady_wtavg_4_8_cpus
[0] Nprocs : 8
[0] Slaves :
[0] 7
[0] (
[0]   sunfire.5608
[0]   sunfire.5609
[0]   sunfire.5610
[0]   sunfire.5611
[0]   sunfire.5612
[0]   sunfire.5613
[0]   sunfire.5614
[0] )

[... the equivalent Date/Time/Host/PID/Root/Case/Nprocs headers from processes [1]-[7] (PIDs 5608-5614) are omitted ...]

Create time

Create mesh for time = 0

Reading transportProperties
Reading field p
Reading field U
Reading/calculating face flux field phi
Creating field Umean
Creating field pMean
Reading probeLocations
Constructing probes

Starting time loop

Time = 0.02

Mean and max Courant Numbers = 0 0.0799610193770155
BICCG: Solving for Ux, Initial residual = 0.999999999999942, Final residual = 1.72057068708726e-06, No Iterations 2
BICCG: Solving for Uy, Initial residual = 0, Final residual = 0, No Iterations 0
BICCG: Solving for Uz, Initial residual = 0, Final residual = 0, No Iterations 0
AMG: Solving for p, Initial residual = 1, Final residual = 9.48240838699873e-07, No Iterations 264
time step continuity errors : sum local = 6.34770499582916e-11, global = -4.66773069030591e-12, cumulative = -4.66773069030591e-12
AMG: Solving for p, Initial residual = 0.000327390016863783, Final residual = 9.50144270815434e-07, No Iterations 125
time step continuity errors : sum local = 7.58317575730968e-08, global = -7.09519972870107e-09, cumulative = -7.09986745939137e-09
Wall patch = 0 Wall patch name = ChannelWalls
Uav = (1 0 0) Aref = 1 nu = nu [0 2 -1 0 0 0 0] 1.00481e-06
DragCoefficient = 2.39031097936705e-05 pressureDragCoefficient = 1.10457835730627e-19 viscDragCoefficient = 2.39031097936704e-05
LiftCoefficient = -2.7464517768576e-08
Wall patch = 1 Wall patch name = ObstacleWalls
Uav = (1 0 0) Aref = 1 nu = nu [0 2 -1 0 0 0 0] 1.00481e-06
DragCoefficient = 1.53640797773116e-05 pressureDragCoefficient = 1.51743063164737e-05 viscDragCoefficient = 1.89773460837957e-07
LiftCoefficient = 2.19062878500774e-10
ExecutionTime = 429.61 s  ClockTime = 430 s

[... time steps 0.04, 0.06, 0.08 and 0.1 follow the same pattern, with mean/max Courant numbers around 0.052 / 0.50-0.63 and the same residual and force-coefficient reporting; at Time = 0.1 the log reads ExecutionTime = 1757.44 s, ClockTime = 1758 s ...]

I would appreciate it if anyone shared their thoughts/comments in this regard. I have just finished compiling OpenFOAM with mvapi (InfiniBand) support through Open MPI and plan to run the same case for a comparison.
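For completeness, the runs were launched under LAM MPI; the general session looks like the sketch below (the hostfile name and contents and the case name are placeholders - on a single SMP box the hostfile is just the local machine with a CPU count):

# Boot the LAM runtime, run the decomposed case pinned to one core per
# socket (first parallel configuration), then shut the runtime down.
lamboot -v hostfile
mpirun -np 8 taskset -c 0,2,4,6,8,10,12,14 icoFoam . case_name -parallel > case_name/log 2>&1
lamhalt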
January 9, 2007, 08:53
#2
Senior Member
Eugene de Villiers
Join Date: Mar 2009
Posts: 725
Rep Power: 21
This is remarkable. I have a couple of 8-way Opteron VX50s in the office and they do not show anywhere near this kind of performance.
In fact, a single CPU on the 8-way performs significantly worse than a 3 GHz Northwood P4. It was explained to me that the cache-coherency communication on the 8-way introduces an overhead that cripples this architecture. I also ran extensive memory tests with STREAM to measure CPU-memory bandwidth, and the tests reported that the maximum achievable bandwidth (around 3.2 GB/s) was not between the CPU and local memory, but rather with a neighbouring memory bank. To me this reeks of an error in the BIOS/OS-assigned affinity between memory banks and CPUs. If I disconnect the top board (i.e. downgrade to a 4-way), the machine becomes a screamer, with scaling similar to what you report. Possibly your Scientific Linux has a better NUMA module, or the Sun mobo has addressed the 8-way issue (I use a Tyan board with SuSE 10.0). However, there is no way you can get a 16x speedup with 8 cores. Super-linear speedup might give you something like 8.5 on 8 cores, never 16.
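For anyone who wants to check the affinity issue on their own box, here is a minimal sketch of the kind of comparison I mean, assuming numactl is installed and STREAM is compiled (the './stream' binary name and the node numbers are placeholders):

# Pin the benchmark to the cores of node 0, then force its memory onto
# the local bank versus a remote bank; local should win by a wide margin.
numactl --cpunodebind=0 --membind=0 ./stream   # local memory
numactl --cpunodebind=0 --membind=1 ./stream   # neighbouring memory bank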
January 9, 2007, 10:34
#3
Senior Member
Join Date: Mar 2009
Location: My oyster
Posts: 124
Rep Power: 17
Hi,
This result is as impressive as it is puzzling. How exactly did you turn off the second core of each CPU? Is it possible that a single core with twice the cache it normally gets would give such a tremendous speedup?

Ziad
January 9, 2007, 10:43
#4
Senior Member
Join Date: Mar 2009
Location: My oyster
Posts: 124
Rep Power: 17
One last thing: to compare apples with apples one should, I imagine, run the serial case on one core as well and then compute the speedup...
January 9, 2007, 11:44
#5
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
Firstly, thank you for all your comments. Indeed, the results are true. I have not performed repeatability tests yet, but I am fairly confident I will be able to reproduce them. In any case, I will run the 4- and 8-CPU tests once more just to be sure.

@Eugene: If you like, I can contact the system administrator in my department who bought and commissioned the machine to find out exactly which mobo and RAM are used. Just let me know what information you need. I can also find the exact release of Scientific Linux used. BTW, here is a paper where 8 CPUs give a speedup of 11 or so: http://www.jncasr.ac.in/kirti/current_science.pdf

"How exactly did you turn off the second core of each CPU?"

A very good question indeed. The answer: I did not. The 'taskset' command in Linux dictates processor affinity only, which means I get a say in placing the first instance of icoFoam on a certain processor core, the second instance elsewhere, and so on. I do this through mpirun as follows.

A 4-CPU case in the first parallel configuration, i.e. one core from each CPU:

nohup mpirun -np 4 taskset -c 0,2,4,6 icoFoam . case_name -parallel > case_name/log 2>&1 &

And a 4-CPU case in the second parallel configuration, i.e. two cores from each CPU:

nohup mpirun -np 4 taskset -c 0,1,2,3 icoFoam . case_name -parallel > case_name/log 2>&1 &

Now, how do I know whether or not I am requesting individual cores? If we look carefully at the output of /proc/cpuinfo, we see that every two logical CPUs listed share the same physical id. Thus for this machine the physical CPUs are arranged as follows (three columns: physical CPU, core 1, core 2):

Physical CPU    core1    core2
0               0        1
1               2        3
2               4        5
3               6        7
4               8        9
5               10       11
6               12       13
7               14       15

I think the reason for the speedup is that when only one core is used from each physical processor, that core still has access to the L1/L2 cache of the other core (which is not being used by any other process). As a result, the number of cache hits increases dramatically. However, I will need more expert opinion before I conclude this to be the cause.

"I imagine, run serial on one core as well and then compute the speedup..."

Yes, the serial run was also run on one core only. Of course, the other core was sitting idle, so by the previous argument even the serial run had access to the L1/L2 cache of the other core.

The other thing I would like to mention is that throughout each run, none of the icoFoam instances jumped from CPU to CPU, as the default Linux scheduler would normally do while trying to balance the load on the machine. In essence, I bound the processes to specific CPU cores and they never left them until the parallel run finished. This can be seen in the 'top' command output. I wonder if this can have an effect?
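If anyone wants to build the same core map on their own machine, it can be read straight out of /proc/cpuinfo; a minimal sketch:

# Print each logical processor together with its physical id and core id.
# Logical CPUs sharing a physical id are the two cores of one Opteron die.
grep -E '^(processor|physical id|core id)' /proc/cpuinfo | paste - - -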
January 9, 2007, 12:45
#6
Senior Member
Eugene de Villiers
Join Date: Mar 2009
Posts: 725
Rep Power: 21
I guess you could get a very large super-linear speedup if your case is small compared to the cache size. L2 cache latency is 5-10 times lower than that of main memory, so that would account for the difference.
Any info on your hardware and on NUMA in Scientific Linux would make for interesting reading.
January 9, 2007, 18:37
#7
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
My apologies: the X-axis should read "number of cores", NOT "number of CPUs". It basically boils down to how one defines a CPU. In a strictly practical sense, each core is a central processing unit: there is no hyperthreading or the like involved, so when we refer to a core we are referring to a processing unit (one of the two cores on the same die).
But I guess changing the X-axis to read "number of cores" will make my point clear. The fact remains that I can very easily choose which core to run on. Thanks for the correction.
January 9, 2007, 18:43
#8
Senior Member
Join Date: Mar 2009
Location: My oyster
Posts: 124
Rep Power: 17
You're very welcome. It is an interesting case either way, and I can honestly say I learned a few things in there. How about posting the corrected curve? And do you guys do any multiphase?
January 9, 2007, 19:21
#9
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
Corrections in place:
Based on cores: http://www.ualberta.ca/~madhavan/openfoam_speedup.eps
Based on CPUs: http://www.ualberta.ca/~madhavan/ope...eedup_CPUs.eps
The 'based on CPUs' curve is normalized using the clock time for a run on 1 CPU (i.e. one that uses both cores), because a core counts as a physical processing unit even if it is etched on the same die. Am I making sense here? I'm still not sure about this. I feel the 'number of cores' comparison is the least confusing.
Interesting that you should mention multiphase. My PhD revolves around DNS of fluid-fluid systems. I plan to start with something like icoFSIfoam, solving Newton's linear and angular momentum laws for a solid particle instead of the elasticity equations, and later move on to fluid particles. Any suggestions are most welcome!
January 9, 2007, 19:24
#10
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
That second curve does not sound right. Someone correct me?
January 9, 2007, 19:36
#11
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
Without getting into the technicalities of how a CPU is defined, we can conclude from the first graph that the difference observed comes either from the memory-bandwidth limitation degrading the speedup when both cores are used, OR from each core being able to access the L1/L2 cache of its sibling in the first configuration.
January 9, 2007, 21:53
#12
Senior Member
Join Date: Mar 2009
Location: My oyster
Posts: 124
Rep Power: 17
Well, one can define it any possible way, but to be able to compare with the paper you quoted you should use their definition.
About multiphase: I am a multiphase consultant, and that is why I am interested in OF. There is room for creativity since the source code is freely available. My background is actually in aerospace and stability methods for flow-regime prediction. Your thesis sounds quite interesting (and ambitious!). The solid-particle approach shouldn't be too difficult, since solid mechanics is much better understood than fluid mechanics and there is tons of literature on fluid/structure interaction (that is basically what it boils down to, and you are definitely using the right code since you don't have to couple externally). Bubbles, on the other hand, will prove quite challenging. Without getting into the details, you'll probably need to take an energy-balance approach that includes the surface energy (read: surface-tension dependent) between the continuous phase and the discrete phase. It should be doable as long as you are not going as far as bubble burst, collisions and merging. This is the "esoteric" side of things. I would expect a lot of empirical correlations, even at a DNS level. Yada yada yada! It's easy to talk about it when you have the luxury of not having to do it yourself. Good luck, dude!
June 6, 2007, 19:11
#13
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
I know this is really late information, but for those interested, the specs of the machine used in the above scale-up tests are here [1].
[1] http://www.sun.com/servers/x64/x4600/index.xml
June 6, 2007, 19:30
#14
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
And here are the tech specs:
[the spec listing did not survive in the archived page]
December 27, 2007, 15:14
#15
Member
David P. Schmidt
Join Date: Mar 2009
Posts: 72
Rep Power: 17
Hi,
Running our own home-grown OpenFOAM CFD application produced super-linear speedup on the NCSA's Mercury cluster. You can google the specs, but if memory serves, it is a cluster of dual Itanium 2 nodes connected with Myrinet. We were super-linear up to 8 CPUs and then started to drop off a little. It was not a big case (350K cells), which was probably a factor. My student has theorized that the Itaniums have nice big caches, and with the upper-triangular ordering inherent in OF we were getting more and more cache hits.
December 29, 2007, 16:23
#16
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
Thanks for the info, David.
January 31, 2008, 11:17
#17
Member
Christian Lindbäck
Join Date: Mar 2009
Posts: 55
Rep Power: 17
When I look in /proc/cpuinfo I do indeed see the "physical id". But should I use the "processor" number or the "core id" number with the "taskset -c" flag?
Best regards,
Christian Svensson
August 22, 2009, 04:59
#18
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21
Wow, that means 8 processors were used just as if there were 11 processors, right? Wow. How did you do that in OpenFOAM to achieve the largest speed-up? I am very interested!
__________________
~ Daniel WEI
Boeing Research & Technology - China
Beijing, China
March 3, 2015, 06:36
What 'computations' to include in speedup test?
#19
Member
Olie
Join Date: Oct 2013
Posts: 51
Rep Power: 13
Hi,
I'm conducting a speedup test at the moment and just wondered: how do you decide what to include in the timing? Running a solver for m time steps with several probes WITHOUT then reconstructing the data for all m time steps from each of your N processor directories is obviously a lot cheaper than doing the same and reconstructing it all at the end - so how do you decide whether or not to include the reconstruction? One could argue that reconstruction is one of the penalties incurred by running the solver in parallel (whereas all the data would already have been in place had you run in serial), and so it should be included in the timing. On the other hand, that depends on whether you care about having the full-domain flow data for all m time steps (if you only care about the probes' data, reconstruction isn't an issue). So how is this decided? Is it purely "if you need all the data reconstructed at the end, you have to include it in the timing; if you don't, you don't"? That seems a bit grey to me! A sketch of what I mean by timing the pieces separately is below.

Thanks,
Olie
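Timing the solve and the reconstruction as separate steps would at least let the speedup be quoted both ways; a minimal sketch (the solver name, core count and log names are placeholders, and the flag syntax is for recent OpenFOAM versions, which no longer take the case path on the command line):

# Time the parallel solve and the reconstruction as separate steps, so
# the speedup can be quoted with and without the reconstruction penalty.
time mpirun -np 8 icoFoam -parallel > log.solve 2>&1
time reconstructPar > log.reconstruct 2>&1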