|
June 6, 2007, 15:59 |
|
#1 |
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21 |
Hello OF lovers,
I'm back with another parallel study, this time on an Intel machine with 8 cores (i.e. two quad-core CPUs). The machine has around 8 GB of memory, so I had to restrict myself to a 6.5-7 GB vortex-shedding case. Here is an excerpt from cat /proc/cpuinfo (the entries for logical processors 1-7 are identical apart from the processor, physical id, core id and bogomips fields; physical id 0 covers logical CPUs 0, 1, 4 and 5, and physical id 1 covers 2, 3, 6 and 7):

madhavan@frzserver:~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping        : 7
cpu MHz         : 1998.000
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips        : 4659.18
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

The actual clock speed rises to 2.33 GHz once there is some activity on any of the CPU cores. Anyway, without further ado, here are the parallel scale-up results:

http://www.ualberta.ca/~madhavan/ope..._frzserver.jpg

As you can see, the scale-up is very good for 2 processors (in fact slightly better than the ideal case). However, as one moves to 4 and 8 cores, the situation deteriorates rapidly. STREAM [1] memory-bandwidth benchmarks show that the quad-core has superior memory bandwidth (approx. 2500 MB/s) compared to, for instance, an AMD Opteron 248 (approx. 1600 MB/s). However, once that bandwidth is split among 4 cores, we end up with only about 625 MB/s per core. Even if the Opteron in question were a dual-core, its per-core memory bandwidth would still work out to at least 800 MB/s, a definite edge over 625 MB/s. Interestingly, the 4 MB shared L2 cache in the quad-core does not seem to help the scale-up in the 4- or 8-core cases.

I would also like to draw your attention to the red dot (triangle) in the above graph. This quad-core is basically an MCM (multi-chip module): two dual-core dies packaged together in one processor. That red triangle is the 2-core scale-up result obtained when I placed both processes on the same physical CPU (i.e. on one of the dual-core units). If I let one process use one dual-core unit and the other process use the other dual-core unit on the same processor, I got a slightly better result, but still well below the ideal curve.

The bottom line, based on the above results, is that Intel quad-cores are not really good news. Of course, this is just one kind of study. I would appreciate more feedback from others with access to quad-core systems (both Intel and AMD).

[1] http://www.cs.virginia.edu/stream/ |
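For reference, the package/core mapping summarised above can be pulled straight out of /proc/cpuinfo with a one-liner along these lines (a rough sketch; it assumes the stock Linux field names, and paste simply joins each group of three lines onto one row):

  grep -E 'processor|physical id|core id' /proc/cpuinfo | paste - - -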
|
June 6, 2007, 16:03 |
|
#2 |
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21 |
I forgot to mention that the good (slightly better than ideal) 2-core result you see in the above graph was obtained by scheduling both icoFoam processes on different physical CPUs. In other words, each icoFoam instance enjoyed the full 2500 MB/s.
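For anyone who wants to repeat the placement experiment, a rough sketch follows ("." and "myCase" stand in for the actual root/case arguments, and the pgrep/sed plumbing is only illustrative). Since LAM normally starts the ranks through its lamd daemons, setting affinity on mpirun itself may not stick, so the sketch re-pins the running processes; logical CPUs 0 and 2 sit on different physical packages, while 0 and 1 share one (see the cpuinfo summary in post #1):

  mpirun -np 2 icoFoam . myCase -parallel &
  sleep 10                                     # give the ranks time to start
  pids=$(pgrep -f 'icoFoam.*-parallel')
  taskset -p -c 0 $(echo "$pids" | sed -n 1p)  # first rank on package 0
  taskset -p -c 2 $(echo "$pids" | sed -n 2p)  # second rank on package 1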
|
|
June 7, 2007, 05:12 |
|
#3 |
New Member
Nicolas Coste
Join Date: Mar 2009
Location: Marseilles, France
Posts: 11
Rep Power: 17 |
Did you make a test with a smaller case, such that it could fit in the cache?
Is your kernel tuned? |
|
June 7, 2007, 05:44 |
|
#4 |
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21 |
Nope... did not do that. The kernel is not specially tuned, as far as I know. If you can be more specific, I can provide more information.
The important thing I want to know is whether anyone has found a different trend on an Intel quad-core. |
|
June 7, 2007, 06:14 |
|
#5 |
New Member
Nicolas Coste
Join Date: Mar 2009
Location: Marseilles, France
Posts: 11
Rep Power: 17 |
If I remember correctly, on an SMP machine the kernel.shmmax value (which can be tuned in /etc/sysctl.conf) can affect the performance of the MPI communicator... see the LAM/MPI user guide. A value that is too high can also hurt performance.
Correct me if I'm wrong, but with MPICH, for example, you can set something like P4_GLOBMEMSIZE. Running ipcs -a can give you more information. |
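To make those knobs concrete, a short sketch (the 64 MB figure is only an illustrative value, and P4_GLOBMEMSIZE applies to MPICH's ch_p4 device, not to LAM):

  cat /proc/sys/kernel/shmmax       # largest single SysV segment, in bytes
  cat /proc/sys/kernel/shmall       # system-wide total, in pages
  ipcs -a                           # segments, semaphores and queues in use

  export P4_GLOBMEMSIZE=67108864    # MPICH/ch_p4 shared-memory pool, 64 MB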
|
June 7, 2007, 09:59 |
|
#6 |
New Member
Richard Morgans
Join Date: Mar 2009
Posts: 16
Rep Power: 17 |
We are (hopefully, if it runs tonight!) benchmarking a quad-core (Q6600) with an interacting bluff-body vortex-shedding case.
We're interested in this discussion and will keep you posted. Rick |
|
June 7, 2007, 10:17 |
|
#7 |
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21 |
Thanks. Looking forward to your results.
@Nicolas: I am aware of the P4_GLOBMEMSIZE environment variable for MPICH. With LAM, however, I've never had any complaints about a shared-memory shortage. The default value of 33554432 (i.e. 32 MB) is what comes up when I issue cat /proc/sys/kernel/shmmax. Do you have any pointers? |
|
June 7, 2007, 11:45 |
|
#8 |
New Member
Nicolas Coste
Join Date: Mar 2009
Location: Marseilles, France
Posts: 11
Rep Power: 17 |
Which communicator do you use? If LAM, you can use the sysv RPI module to use shared memory between ranks on the same node... something like mpirun -ssi rpi sysv.
Even though the application is different, on large Oracle-type clusters they increase shmmax to 2 GB; increasing it to 128 or 256 MB should do the trick here... too high a value can decrease performance. The kernel should be tuned for parallel computation on every front, memory and network alike; by default, the Linux kernel values are not appropriate for parallel computation (SMP and distributed) or for NFS services. If you want the best performance you must adjust your kernel to your computation. In summary, small cases need practically no adjustment (except for the network), but as soon as you fall out of the memory cache... :-> |
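Spelled out as a full command line, with hypothetical host file and case names (OpenFOAM 1.4-style root/case arguments; besides sysv, LAM also ships tcp and usysv RPI modules, and the shared-memory ones only help ranks that share a node):

  lamboot hostfile
  mpirun -np 4 -ssi rpi sysv icoFoam . myCase -parallel
  lamhalt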
|
June 7, 2007, 23:25 |
|
#9 |
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21 |
Thanks for your input. Just one observation: wall-clock times are better indicators for parallel speedup estimates. Or did you actually mean wall-clock times?
In any case, based on what I see you have a speedup close to 2.3 for 4 processes, so I guess our results agree. |
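For what it is worth, the arithmetic behind those numbers (the times below are placeholders, not measurements): speedup is the serial wall-clock time divided by the N-process wall-clock time, and efficiency divides that again by N:

  T1=5240; TN=2310; N=4      # elapsed seconds, illustrative values only
  echo "$T1 $TN $N" | awk '{printf "speedup = %.2f, efficiency = %.2f\n", $1/$2, $1/($2*$3)}'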
|
June 7, 2007, 23:32 |
|
#10 |
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21 |
@Nicolas
Thanks for the pointers. I don't, however, see the need to tweak any of those settings. Based on my Google search, the majority of people who change shmmax et al. are working with databases (e.g. Oracle). I searched the PETSc and LAM/MPI mailing lists to see whether anyone has reported a significant improvement from tweaking those settings, and I could not find anything. If you have seen benefits, can you state specifically the nature of the parallel case you tested (how many cells, etc.) and the difference in speedup between the default setting of 32 MB and 128 or 256 MB? Once again, thanks for your help! |
|
June 8, 2007, 05:43 |
|
#11 |
New Member
Nicolas Coste
Join Date: Mar 2009
Location: Marseilles, France
Posts: 11
Rep Power: 17 |
OK, I'm too busy to do more tests on our cluster and, unfortunately, too busy to come to the workshop...
An interesting benchmark would be to go beyond a single Ethernet link, i.e. to use multiple Ethernet devices on the same host with OpenMPI, since OpenMPI can drive several Ethernet devices on one host; LAM doesn't like this (because of lamd)... When I have more time I will try it. Have a nice day.
PS: when you post benchmark results, please include the OS, compiler and communicator. |
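The Open MPI run I have in mind would look roughly like this (eth0/eth1 and the case name are placeholders; btl_tcp_if_include is the MCA parameter that tells the TCP transport which interfaces it may use, so both links can carry traffic at once):

  mpirun -np 8 --mca btl tcp,sm,self \
         --mca btl_tcp_if_include eth0,eth1 \
         icoFoam . myCase -parallel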
|
June 12, 2007, 09:14 |
|
#12 |
Senior Member
Srinath Madhavan (a.k.a pUl|)
Join Date: Mar 2009
Location: Edmonton, AB, Canada
Posts: 703
Rep Power: 21 |
Update: I changed SHMMAX and SHMALL in /etc/sysctl.conf and rebooted. /proc/sys/kernel/shm* displays the new values I set. However, the speedup was again the same 1.2 when using both cores of the same physical CPU. Apparently this has no effect on parallel speedup.
OS: RHEL 4.x (Scientific Linux 4.1)
Compiler: stock compiler supplied with OpenFOAM
Communicator: LAM |
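For anyone repeating this, the change amounts to something like the following (values illustrative; shmmax is in bytes, shmall in pages, and the same two settings can be made permanent in /etc/sysctl.conf and re-read with sysctl -p instead of rebooting):

  sysctl -w kernel.shmmax=268435456   # 256 MB for a single segment
  sysctl -w kernel.shmall=65536       # 65536 x 4 kB pages = 256 MB in total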
|
June 14, 2007, 09:10 |
|
#13 |
New Member
Nicolas Coste
Join Date: Mar 2009
Location: Marseilles, France
Posts: 11
Rep Power: 17 |
Hi,
Following this discussion, I've done some tests on a small 2D case (100,000 cells) using OpenMPI.
First, compiling OpenFOAM 1.4 with -march=nocona on an EM64T processor reduces the global CPU time by about 10% on a serial case and 5% on a parallel case.
Second, I see the same trend in speedup, i.e. 2 cores: 1.46, 4 cores: 2.06.
A word of warning: since I have only one quad-core CPU, the 4-core result may be polluted by OS activity (CentOS 4). |
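In case it helps anyone reproduce the -march=nocona experiment: the optimisation flags wmake uses live in the per-architecture rules files, but the exact path and variable name vary between OpenFOAM versions, so the line below is only a pointer to where to look before appending the flag and rebuilding (e.g. with the top-level Allwmake):

  grep "O3" $WM_PROJECT_DIR/wmake/rules/linux*Gcc*/c++Opt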
|
February 5, 2008, 06:26 |
|
#14 |
Senior Member
Eugene de Villiers
Join Date: Mar 2009
Posts: 725
Rep Power: 21 |
You can't have dedicated memory channels for each core unless you are prepared to make the CPU package a lot bigger to incorporate the additional pins needed for the extra 64-bit interfaces. In addition, the FSB-based design does not allow for individual memory channels; those will only appear on Intel chips when CSI is released with Nehalem later this year.
The memory bottleneck is one of the main reasons AMD CPUs with HyperTransport can still be competitive in some applications. Unfortunately, there isn't much you can do to make things faster other than buying lower-latency memory. You only have around 10 GB/s of memory bandwidth shared between 8 cores. Bring back Rambus, I say. |
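To put numbers on how that aggregate bandwidth divides up, the STREAM benchmark already cited in post #1 can be run with increasing thread counts; a rough sketch, assuming stream.c has been downloaded from [1] and that its array size has been made large enough to spill out of the 4 MB caches:

  gcc -O3 -fopenmp stream.c -o stream
  for n in 1 2 4 8; do
      echo "== $n thread(s) =="
      OMP_NUM_THREADS=$n ./stream | grep -E 'Copy|Triad'
  done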
|
|
|