January 8, 2007, 22:41
#1
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
Hi OpenFOAMers,
I just finished testing OpenFOAM speedup on an 8-CPU (16-core) Opteron machine loaded with 60 GB RAM. The results are pretty impressive, given that the latency in an SMP-like system is really very low. First, some technical details of the hardware used:

[cfd@sunfire icoFoam]$ uname -a
Linux sunfire 2.6.9-42.0.2.ELlargesmp #1 SMP Tue Aug 22 18:52:10 CDT 2006 x86_64 x86_64 x86_64 GNU/Linux

(Basically Scientific Linux 4.x)

[cfd@sunfire ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 885
stepping        : 2
cpu MHz         : 2613.696
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext lm 3dnowext 3dnow pni
bogomips        : 5230.07
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

[... processors 1-15 are identical Dual Core AMD Opteron(tm) Processor 885 entries; physical ids run from 0 to 7, each with core ids 0 and 1 ...]

[cfd@sunfire ~]$ free -mot
             total       used       free     shared    buffers     cached
Mem:         59923      48535      11387          0         83      37052
Swap:        59839          6      59833
Total:      119763      48541      71221

checkMesh output reads:

[cfd@sunfire icoFoam]$ checkMesh . one_sq_cyl_3d_unsteady_wtavg_4_2_cpus
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.3                                   |
|   \\  /    A nd           | Web:      http://www.openfoam.org               |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/

Exec   : checkMesh . one_sq_cyl_3d_unsteady_wtavg_4_2_cpus
Date   : Jan 08 2007
Time   : 18:48:18
Host   : sunfire
PID    : 13166
Root   : /home/cfd/OpenFOAM/cfd-1.3/run/tutorials/icoFoam
Case   : one_sq_cyl_3d_unsteady_wtavg_4_2_cpus
Nprocs : 1

Create time

Create polyMesh for time = constant

Time = constant

Boundary definition OK.
Number of points:    11042070
           edges:    32811382
           faces:    32498784
  internal faces:    31878048
           cells:    10729472
boundary patches:    4
     point zones:    0
      face zones:    0
      cell zones:    0

Checking topology and geometry ...
Point usage check OK.
Upper triangular ordering OK.
Topological cell zip-up check OK.
Face vertices OK.
Face-face connectivity OK.
Basic topo ok ...
Checking patch topology for multiply connected surfaces ...
    Patch           Faces     Points    Surface
    ChannelWalls    604352    604734    ok (not multiply connected)
    ObstacleWalls   6144      6240      ok (not multiply connected)
    vinlet          5120      5265      ok (not multiply connected)
    poutlet         5120      5265      ok (not multiply connected)
Patch topo ok ...
Topology check done.

Domain bounding box: min = (-1.165 -0.02 -0.05) max = (0.705 0.02 0.05) meters.
Checking geometry...
Boundary openness in x-direction = -7.92611080393212e-19
Boundary openness in y-direction = -3.33563488923205e-14
Boundary openness in z-direction = 1.36264413686285e-15
Boundary closed (OK).
Max cell openness = 2.49399995957591e-21  Max aspect ratio = 1.74011094308106.  All cells OK.
Minimum face area = 1.95312499999909e-07. Maximum face area = 7.44396594083542e-06. Face area magnitudes OK.
Min volume = 1.8600559925712e-10. Max volume = 7.06069406503132e-09. Total volume = 0.00747000000001155. Cell volumes OK.
Mesh non-orthogonality Max: 0 average: 0
Non-orthogonality check OK.
Face pyramids OK.
Max skewness = 2.84219262700115e-10 percent. Face skewness OK.
Minimum edge length = 0.000312499999999998. Maximum edge length = 0.00312657168090757.
All angles in faces are convex or less than 10 degrees concave.
Face flatness (1 = flat, 0 = butterfly) : average = 1  min = 1
All faces are flat in that the ratio between projected and actual area is > 0.8
Geometry check done.

Number of cells by type:
    hexahedra:    10729472
    prisms:       0
    wedges:       0
    pyramids:     0
    tet wedges:   0
    tetrahedra:   0
    polyhedra:    0
Number of regions: 1 (OK).

Mesh OK.

Time = 0
No mesh.

End

fvSchemes reads:

// FoamX Case Dictionary.

FoamFile
{
    version         2.0;
    format          ascii;
    root            "/home/madhavan/OpenFOAM/madhavan-1.3/run/tutorials/icoFoam";
    case            "";
    instance        "system";
    local           "";
    class           dictionary;
    object          fvSchemes;
}

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

ddtSchemes
{
    default         CrankNicholson 1;
}

gradSchemes
{
    default         Gauss linear;
    grad(p)         Gauss linear;
}

divSchemes
{
    default         none;
    div(phi,U)      Gauss linear;
}

laplacianSchemes
{
    default         none;
    laplacian(nu,U) Gauss linear corrected;
    laplacian(1|A(U),p) Gauss linear corrected;
}

interpolationSchemes
{
    default           linear;
    interpolate(HbyA) linear;
}

snGradSchemes
{
    default         corrected;
}

fluxRequired
{
    default         no;
    p;
}

// ************************************************************************* //

And fvSolution:

// FoamX Case Dictionary.

FoamFile
{
    version         2.0;
    format          ascii;
    root            "/home/madhavan/OpenFOAM/madhavan-1.3/run/tutorials/icoFoam";
    case            "";
    instance        "system";
    local           "";
    class           dictionary;
    object          fvSolution;
}

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

solvers
{
//  p               ICCG 1e-06 0;
    p               AMG 1e-06 0 25;
    U               BICCG 1e-05 0;
}

PISO
{
    momentumPredictor yes;
    nCorrectors     2;
    nNonOrthogonalCorrectors 0;
    pRefCell        0;
    pRefValue       0;
}

// ************************************************************************* //

The case I ran was laminar unsteady vortex shedding past a square cylinder in a rectangular channel. The solver was a slightly modified version of icoFoam. Modifications included the calculation of lift/drag coefficients [stolen from Frank Bos ;)] and the writing of time-averaged velocity/pressure fields and velocity probes (stolen from oodles). 19 probeLocations were defined in all the simulations. The speedup was calculated as follows:

Speedup from 'N' CPUs = (ClockTime for a serial run) / (ClockTime for a parallel run with 'N' CPUs)

The time step chosen was 0.02 seconds. This ensured that the maximum Courant number stayed well below 1 (typically around 0.4). Starting from time t = 0, the simulation was run up to t = 0.68 seconds (i.e. 34 time steps). 'writeFormat' in controlDict was set to 'binary' and 'writePrecision' to 15. Metis decomposition was used with equal processor weighting throughout. All parallel runs were dedicated (i.e. only I was using the machine). LAM MPI was used throughout. In each of the parallel runs, the total RES memory reported by 'top' was around 10.2 GB.

Keeping in mind that dual-core chips are memory-bandwidth limited, two parallel configurations were tested:

1. In the first configuration, only one core from each physical processor was used. This was possible using the 'taskset' command in GNU/Linux, which allows one to hard-request specific cores (i.e. override the kernel's CPU affinity mask). The command also ensures that until the process quits, it stays locked to the user-specified set of CPUs. The maximum number of CPUs for this configuration was therefore 8.

2. The second configuration used both cores of each processor. Thus a 4-CPU run hard-requested two physical processors, and so on. In this configuration one could go up to 16 CPUs in total.
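Incidentally, for anyone repeating the exercise, the speedup numbers can be pulled straight out of the solver logs; here is a minimal sketch (the log file names are placeholders for however you name your runs):

# Extract the final ClockTime from each run's log and compute the
# speedup relative to the serial run. Assumes logs named log.serial,
# log.2, log.4, ... with the usual "ExecutionTime = ... ClockTime = ..."
# lines, where ClockTime's value is the 7th whitespace-separated field.
serial=$(grep 'ClockTime' log.serial | tail -1 | awk '{print $7}')
for n in 2 4 8 16; do
    par=$(grep 'ClockTime' log.$n | tail -1 | awk '{print $7}')
    echo "N=$n  speedup = $(echo "scale=2; $serial / $par" | bc)"
done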
The speedup results are available here: http://www.ualberta.ca/~madhavan/openfoam_speedup.eps

It can be seen that the first parallel configuration (i.e. using just 8 CPUs, one core from each physical processor) exhibits what appears to be super-linear speedup. This is explained in the following Wikipedia entry: http://en.wikipedia.org/wiki/Speedup. Has anyone experienced this with OpenFOAM before? The second parallel configuration (i.e. 16 cores) displays acceptable speedup as well; however, the maximum speedup in that case was around 15.2 using 16 cores, whereas a slightly higher speedup (15.964) was obtained with just 8 CPUs in the first configuration. Also noteworthy is that the memory-bandwidth limitation when using both cores does not seem to detrimentally impair the speedup.

A sample log file from an 8-CPU run is shown below. Each of the eight processes prints the OpenFOAM 1.3 banner and an 'Exec : icoFoam . one_sq_cyl_3d_unsteady_wtavg_4_8_cpus -parallel' line at startup; the repeated copies are omitted here:

/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.3                                   |
|   \\  /    A nd           | Web:      http://www.openfoam.org               |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/

Exec : icoFoam . one_sq_cyl_3d_unsteady_wtavg_4_8_cpus -parallel

[0] Date   : Dec 27 2006
[0] Time   : 11:17:35
[0] Host   : sunfire
[0] PID    : 5607
[0] Root   : /home/madhavan/OpenFOAM/madhavan-1.3/run/tutorials/icoFoam
[0] Case   : one_sq_cyl_3d_unsteady_wtavg_4_8_cpus
[0] Nprocs : 8
[0] Slaves :
[0] 7
[0] (
[0]   sunfire.5608
[0]   sunfire.5609
[0]   sunfire.5610
[0]   sunfire.5611
[0]   sunfire.5612
[0]   sunfire.5613
[0]   sunfire.5614
[0] )

[... the equivalent Date/Time/Host/PID/Root/Case/Nprocs headers from processes [1]-[7] (PIDs 5608-5614) are omitted ...]

Create time

Create mesh for time = 0

Reading transportProperties
Reading field p
Reading field U
Reading/calculating face flux field phi
Creating field Umean
Creating field pMean
Reading probeLocations
Constructing probes

Starting time loop

Time = 0.02

Mean and max Courant Numbers = 0 0.0799610193770155
BICCG: Solving for Ux, Initial residual = 0.999999999999942, Final residual = 1.72057068708726e-06, No Iterations 2
BICCG: Solving for Uy, Initial residual = 0, Final residual = 0, No Iterations 0
BICCG: Solving for Uz, Initial residual = 0, Final residual = 0, No Iterations 0
AMG: Solving for p, Initial residual = 1, Final residual = 9.48240838699873e-07, No Iterations 264
time step continuity errors : sum local = 6.34770499582916e-11, global = -4.66773069030591e-12, cumulative = -4.66773069030591e-12
AMG: Solving for p, Initial residual = 0.000327390016863783, Final residual = 9.50144270815434e-07, No Iterations 125
time step continuity errors : sum local = 7.58317575730968e-08, global = -7.09519972870107e-09, cumulative = -7.09986745939137e-09
Wall patch = 0 Wall patch name = ChannelWalls
Uav = (1 0 0) Aref = 1 nu = nu [0 2 -1 0 0 0 0] 1.00481e-06
DragCoefficient = 2.39031097936705e-05 pressureDragCoefficient = 1.10457835730627e-19 viscDragCoefficient = 2.39031097936704e-05
LiftCoefficient = -2.7464517768576e-08
Wall patch = 1 Wall patch name = ObstacleWalls
Uav = (1 0 0) Aref = 1 nu = nu [0 2 -1 0 0 0 0] 1.00481e-06
DragCoefficient = 1.53640797773116e-05 pressureDragCoefficient = 1.51743063164737e-05 viscDragCoefficient = 1.89773460837957e-07
LiftCoefficient = 2.19062878500774e-10
ExecutionTime = 429.61 s  ClockTime = 430 s

[... time steps 0.04, 0.06, 0.08 and 0.1 follow the same pattern, with mean/max Courant numbers around 0.052 / 0.50-0.63 and the same residual and force-coefficient reporting; at Time = 0.1 the log reads ExecutionTime = 1757.44 s, ClockTime = 1758 s ...]

I would appreciate it if anyone shared their thoughts/comments in this regard. I have just finished compiling OpenFOAM with mvapi (InfiniBand) support through Open MPI and plan to run the same case for a comparison.
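For completeness, the runs were launched under LAM MPI; the general session looks like the sketch below (the hostfile name and contents and the case name are placeholders - on a single SMP box the hostfile is just the local machine with a CPU count):

# Boot the LAM runtime, run the decomposed case pinned to one core per
# socket (first parallel configuration), then shut the runtime down.
lamboot -v hostfile
mpirun -np 8 taskset -c 0,2,4,6,8,10,12,14 icoFoam . case_name -parallel > case_name/log 2>&1
lamhalt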
January 9, 2007, 08:53
#2
Senior Member
Eugene de Villiers
Join Date: Mar 2009
Posts: 725
Rep Power: 21
This is remarkable. I have a couple of 8-way Opteron VX50s in the office and they do not show anywhere near this kind of performance.
In fact, a single CPU on the 8-way performs significantly worse than a 3 GHz Northwood P4. It was explained to me that the cache-coherency communication on the 8-way introduces an overhead that cripples this architecture. I also ran extensive memory tests with STREAM to measure CPU-memory bandwidth, and the tests reported that the maximum achievable bandwidth (around 3.2 GB/s) was not between the CPU and local memory, but rather with a neighbouring memory bank. To me this reeks of an error in the BIOS/OS-assigned affinity between memory banks and CPUs. If I disconnect the top board (i.e. downgrade to a 4-way), the machine becomes a screamer, with scaling similar to what you report. Possibly your Scientific Linux has a better NUMA module, or the Sun mobo has addressed the 8-way issue (I use a Tyan board with SuSE 10.0). However, there is no way you can get a 16x speedup with 8 cores. Super-linear speedup might give you something like 8.5 on 8 cores, never 16.
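For anyone who wants to check the affinity issue on their own box, here is a minimal sketch of the kind of comparison I mean, assuming numactl is installed and STREAM is compiled (the './stream' binary name and the node numbers are placeholders):

# Pin the benchmark to the cores of node 0, then force its memory onto
# the local bank versus a remote bank; local should win by a wide margin.
numactl --cpunodebind=0 --membind=0 ./stream   # local memory
numactl --cpunodebind=0 --membind=1 ./stream   # neighbouring memory bank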
January 9, 2007, 10:34
#3
Senior Member
Join Date: Mar 2009
Location: My oyster
Posts: 124
Rep Power: 17
Hi,
This result is as impressive as it is puzzling. How exactly did you turn off the second core of each CPU? Is it possible that a single core with twice the cache it normally gets would give such a tremendous speedup?

Ziad
January 9, 2007, 10:43
#4
Senior Member
Join Date: Mar 2009
Location: My oyster
Posts: 124
Rep Power: 17
One last thing: to compare apples with apples one should, I imagine, run the serial case on one core as well and then compute the speedup...
January 9, 2007, 11:44
#5
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
Firstly, thank you for all your comments. Indeed, the results are true. I have not performed repeatability tests yet, but I am fairly confident I will be able to reproduce them. In any case, I will run the 4- and 8-CPU tests once more just to be sure.

@Eugene: If you like, I can contact the system administrator in my department who bought and commissioned the machine to find out exactly which mobo and RAM are used. Just let me know what information you need. I can also find the exact release of Scientific Linux used. BTW, here is a paper where 8 CPUs give a speedup of 11 or so: http://www.jncasr.ac.in/kirti/current_science.pdf

"How exactly did you turn off the second core of each CPU?"

A very good question indeed. The answer: I did not. The 'taskset' command in Linux dictates processor affinity only, which means I get a say in placing the first instance of icoFoam on a certain processor core, the second instance elsewhere, and so on. I do this through mpirun as follows.

A 4-CPU case in the first parallel configuration, i.e. one core from each CPU:

nohup mpirun -np 4 taskset -c 0,2,4,6 icoFoam . case_name -parallel > case_name/log 2>&1 &

And a 4-CPU case in the second parallel configuration, i.e. two cores from each CPU:

nohup mpirun -np 4 taskset -c 0,1,2,3 icoFoam . case_name -parallel > case_name/log 2>&1 &

Now, how do I know whether or not I am requesting individual cores? If we look carefully at the output of /proc/cpuinfo, we see that every two logical CPUs listed share the same physical id. Thus for this machine the physical CPUs are arranged as follows (three columns: physical CPU, core 1, core 2):

Physical CPU    core1    core2
0               0        1
1               2        3
2               4        5
3               6        7
4               8        9
5               10       11
6               12       13
7               14       15

I think the reason for the speedup is that when only one core is used from each physical processor, that core still has access to the L1/L2 cache of the other core (which is not being used by any other process). As a result, the number of cache hits increases dramatically. However, I will need more expert opinion before I conclude this to be the cause.

"I imagine, run serial on one core as well and then compute the speedup..."

Yes, the serial run was also run on one core only. Of course, the other core was sitting idle, so by the previous argument even the serial run had access to the L1/L2 cache of the other core.

The other thing I would like to mention is that throughout each run, none of the icoFoam instances jumped from CPU to CPU, as the default Linux scheduler would normally do while trying to balance the load on the machine. In essence, I bound the processes to specific CPU cores and they never left them until the parallel run finished. This can be seen in the 'top' command output. I wonder if this can have an effect?
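If anyone wants to build the same core map on their own machine, it can be read straight out of /proc/cpuinfo; a minimal sketch:

# Print each logical processor together with its physical id and core id.
# Logical CPUs sharing a physical id are the two cores of one Opteron die.
grep -E '^(processor|physical id|core id)' /proc/cpuinfo | paste - - -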
January 9, 2007, 12:45
#6
Senior Member
Eugene de Villiers
Join Date: Mar 2009
Posts: 725
Rep Power: 21
I guess you could get a very large super-linear speedup if your case is small compared to the cache size. L2 cache latency is 5-10 times lower than that of main memory, so that would account for the difference.
Any info on your hardware and on NUMA in Scientific Linux would make for interesting reading.
January 9, 2007, 18:37
#7
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
My apologies: the X-axis should read "number of cores", NOT "number of CPUs". It basically boils down to how one defines a CPU. In a strictly practical sense, each core is a central processing unit: there is no hyperthreading or the like involved, so when we refer to a core we are referring to a processing unit (one of the two cores on the same die).
But I guess changing the X-axis to read "number of cores" will make my point clear. The fact remains that I can very easily choose which core to run on. Thanks for the correction.
January 9, 2007, 18:43
#8
Senior Member
Join Date: Mar 2009
Location: My oyster
Posts: 124
Rep Power: 17
You're very welcome. It is an interesting case either way, and I can honestly say I learned a few things in there. How about posting the corrected curve? And do you guys do any multiphase?
January 9, 2007, 19:21
#9
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
Corrections in place:
Based on cores: http://www.ualberta.ca/~madhavan/openfoam_speedup.eps
Based on CPUs: http://www.ualberta.ca/~madhavan/ope...eedup_CPUs.eps
The 'based on CPUs' curve is normalized using the clock time for a run on 1 CPU (i.e. one that uses both cores), because a core counts as a physical processing unit even if it is etched on the same die. Am I making sense here? I'm still not sure about this. I feel the 'number of cores' comparison is the least confusing.
Interesting that you should mention multiphase. My PhD revolves around DNS of fluid-fluid systems. I plan to start with something like icoFSIfoam, solving Newton's linear and angular momentum laws for a solid particle instead of the elasticity equations, and later move on to fluid particles. Any suggestions are most welcome!
January 9, 2007, 19:24
#10
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
That second curve does not sound right. Someone correct me?
January 9, 2007, 19:36
#11
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
Without getting into the technicalities of how a CPU is defined, we can conclude from the first graph that the difference observed comes either from the memory-bandwidth limitation degrading the speedup when both cores are used, OR from each core being able to access the L1/L2 cache of its sibling in the first configuration.
January 9, 2007, 21:53
#12
Senior Member
Join Date: Mar 2009
Location: My oyster
Posts: 124
Rep Power: 17
Well, one can define it any possible way, but to be able to compare with the paper you quoted you should use their definition.
About multiphase: I am a multiphase consultant, and that is why I am interested in OF. There is room for creativity since the source code is freely available. My background is actually in aerospace and stability methods for flow-regime prediction. Your thesis sounds quite interesting (and ambitious!). The solid-particle approach shouldn't be too difficult, since solid mechanics is much better understood than fluid mechanics and there is tons of literature on fluid/structure interaction (that is basically what it boils down to, and you are definitely using the right code since you don't have to couple externally). Bubbles, on the other hand, will prove quite challenging. Without getting into the details, you'll probably need to take an energy-balance approach that includes the surface energy (read: surface-tension dependent) between the continuous phase and the discrete phase. It should be doable as long as you are not going as far as bubble burst, collisions and merging. This is the "esoteric" side of things. I would expect a lot of empirical correlations, even at a DNS level. Yada yada yada! It's easy to talk about it when you have the luxury of not having to do it yourself. Good luck, dude!
June 6, 2007, 19:11
#13
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
I know this is really late information, but for those interested, the specs of the machine used in the above scale-up tests are here [1].
[1] http://www.sun.com/servers/x64/x4600/index.xml
June 6, 2007, 19:30
#14
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
And here are the tech specs:
[the spec listing did not survive in the archived page]
December 27, 2007, 15:14
#15
Member
David P. Schmidt
Join Date: Mar 2009
Posts: 72
Rep Power: 17
Hi,
Running our own home-grown OpenFOAM CFD application produced super-linear speedup on the NCSA's Mercury cluster. You can google the specs, but if memory serves, it is a cluster of dual Itanium 2 nodes connected with Myrinet. We were super-linear up to 8 CPUs and then started to drop off a little. It was not a big case (350K cells), which was probably a factor. My student has theorized that the Itaniums have nice big caches, and with the upper-triangular ordering inherent in OF we were getting more and more cache hits.
December 29, 2007, 16:23
#16
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21
Thanks for the info, David.
January 31, 2008, 11:17
#17
Member
Christian Lindbäck
Join Date: Mar 2009
Posts: 55
Rep Power: 17
When I look in /proc/cpuinfo I do indeed see the "physical id". But should I use the "processor" number or the "core id" number with the "taskset -c" flag?
Best regards,
Christian Svensson
August 22, 2009, 04:59
#18
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21
Wow, that means 8 processors were used just as if there were 11 processors, right? Wow. How did you do that in OpenFOAM to achieve the largest speed-up? I am very interested!
__________________
~ Daniel WEI
Boeing Research & Technology - China
Beijing, China
March 3, 2015, 06:36
What 'computations' to include in speedup test?
#19
Member
Olie
Join Date: Oct 2013
Posts: 51
Rep Power: 13
Hi,
I'm conducting a speedup test at the moment and just wondered: how do you decide what to include in the timing? Running a solver for m time steps with several probes WITHOUT then reconstructing the data for all m time steps from each of your N processor directories is obviously a lot cheaper than doing the same and reconstructing it all at the end - so how do you decide whether or not to include the reconstruction? One could argue that reconstruction is one of the penalties incurred by running the solver in parallel (whereas all the data would already have been in place had you run in serial), and so it should be included in the timing. On the other hand, that depends on whether you care about having the full-domain flow data for all m time steps (if you only care about the probes' data, reconstruction isn't an issue). So how is this decided? Is it purely "if you need all the data reconstructed at the end, you have to include it in the timing; if you don't, you don't"? That seems a bit grey to me! A sketch of what I mean by timing the pieces separately is below.

Thanks,
Olie
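Timing the solve and the reconstruction as separate steps would at least let the speedup be quoted both ways; a minimal sketch (the solver name, core count and log names are placeholders, and the flag syntax is for recent OpenFOAM versions, which no longer take the case path on the command line):

# Time the parallel solve and the reconstruction as separate steps, so
# the speedup can be quoted with and without the reconstruction penalty.
time mpirun -np 8 icoFoam -parallel > log.solve 2>&1
time reconstructPar > log.reconstruct 2>&1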