
OpenFOAM 13 Intel quadcore parallel results

#1 - June 6, 2007, 15:59 - msrinath80 (Srinath Madhavan, a.k.a. pUl|)
Hello OF lovers,

I'm back with another parallel study, this time on an Intel machine with 8 cores (i.e. two quad-core CPUs). The machine has around 8 GB of memory, so I had to restrict myself to a 6.5-7 GB vortex-shedding case.

Here is an excerpt from cat /proc/cpuinfo:

madhavan@frzserver:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4659.18
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

The remaining seven entries (processors 1 to 7) are identical apart from bogomips (all around 4655) and the core topology, which is:

processor 1 : physical id 0, core id 2
processor 2 : physical id 1, core id 0
processor 3 : physical id 1, core id 2
processor 4 : physical id 0, core id 1
processor 5 : physical id 0, core id 3
processor 6 : physical id 1, core id 1
processor 7 : physical id 1, core id 3

The actual clock speed rises to 2.33 GHz once there is some activity on any of the CPU cores.
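Incidentally, the 1998 MHz shown above is just the idle frequency reported by the CPU frequency scaling driver. If your kernel exposes the cpufreq sysfs interface (an assumption; paths and available governors vary by kernel and driver), you can check the governor and, for benchmarking, keep the cores at full clock with something like:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor        # current governor (e.g. ondemand)
# as root: hold all cores at full clock for the duration of a benchmark
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $c; done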

Anyway, without further ado, here are the parallel scale-up results:

http://www.ualberta.ca/~madhavan/ope..._frzserver.jpg

As you can see, the scale-up is very good for 2 processes (in fact slightly better than ideal). However, as one moves to 4 and 8 cores, the situation deteriorates rapidly.

STREAM [1] memory bandwidth benchmarks show that the quad-core has superior aggregate memory bandwidth (approx. 2500 MB/s) compared to, for instance, an AMD Opteron 248 (approx. 1600 MB/s). However, if that bandwidth is split among 4 cores, we end up with roughly 625 MB/s per core. Even if the Opteron in question were a dual-core, its per-core memory bandwidth would still work out to at least 800 MB/s, which is a definite edge over 625 MB/s. Interestingly, the 4 MB of L2 cache shared by each pair of cores in the quad-core does not seem to help the scale-up in the 4- and 8-core cases.
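To see how the aggregate bandwidth divides up as more cores pull on the same front-side bus, one option is to run STREAM [1] itself with an increasing thread count. A rough sketch, assuming the OpenMP-enabled stream.c from the STREAM site and a gcc recent enough to support -fopenmp (not the exact build used for the numbers above):

gcc -O3 -fopenmp stream.c -o stream            # build the OpenMP version of STREAM
for n in 1 2 4 8; do
    echo "== $n thread(s) =="
    OMP_NUM_THREADS=$n ./stream | grep Triad   # the Triad line reports bandwidth in MB/s
done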

I would also like to draw your attention to the red triangle in the above graph. A quad-core of this kind is essentially an MCM (multi-chip module): two dual-core dies packaged as one processor. The red triangle is the 2-core scale-up result obtained when I placed both processes on the same physical CPU (i.e. on one of the two dual-core dies). If instead I placed one process on each dual-core die of the same package, I got a slightly better result, but still well below the ideal curve.

The bottom line, based on the above results, is that Intel quad-cores are not really good news for this kind of workload. Of course, this is just one kind of study. I would appreciate feedback from others with access to quad-core systems (both Intel and AMD).

[1] http://www.cs.virginia.edu/stream/

#2 - June 6, 2007, 16:03 - msrinath80 (Srinath Madhavan, a.k.a. pUl|)
I forgot to mention that the good (slightly better than ideal) 2-core result in the above graph was obtained by scheduling the two icoFoam processes on different physical CPUs. In other words, each icoFoam instance enjoyed the full ~2500 MB/s of memory bandwidth.
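For anyone wanting to reproduce the placement experiments: on Linux the placement can be forced with taskset after the ranks have started. A minimal sketch (the case name is hypothetical, and the logical CPU numbers come from the cpuinfo listing above, where CPUs 0 and 2 sit on different physical ids; check your own machine before copying this):

mpirun -np 2 icoFoam . myCase -parallel > log 2>&1 &   # hypothetical case name
sleep 5                                                # give the two ranks time to start
pids=$(pgrep icoFoam)                                  # PIDs of the two icoFoam ranks
set -- $pids
taskset -cp 0 $1       # pin rank 0 to logical CPU 0 (physical id 0)
taskset -cp 2 $2       # pin rank 1 to logical CPU 2 (physical id 1)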

#3 - June 7, 2007, 05:12 - nikos (Nicolas Coste)
Did you run a test with a smaller case, such that it could fit in the cache?
Is your kernel tuned?

#4 - June 7, 2007, 05:44 - msrinath80 (Srinath Madhavan, a.k.a. pUl|)
Nope, I did not try that. The kernel is tuned, to the best of my knowledge. If you can be more specific, I can provide more information.

The important thing I want to know is whether anyone has found a different trend on an Intel quad-core.

#5 - June 7, 2007, 06:14 - nikos (Nicolas Coste)
If I remember correctly, on an SMP machine the kernel.shmmax value (which can be tuned in /etc/sysctl.conf) can affect the performance of the MPI communicator; see the LAM/MPI user guide.

A value that is too high can also hurt performance.

Correct me if I am wrong, but with MPICH, for example, you can set something like P4_GLOBMEMSIZE.

ipcs -a can give you more information.
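For reference, the kind of change being suggested looks roughly like this (the values are purely illustrative, not a recommendation):

# /etc/sysctl.conf (illustrative values)
kernel.shmmax = 268435456      # largest single shared-memory segment, here 256 MB
kernel.shmall = 2097152        # total shared memory, in pages

sysctl -p                      # apply without rebooting
ipcs -lm                       # show the shared-memory limits now in force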

#6 - June 7, 2007, 09:59 - rmorgans (Richard Morgans)
We are (hopefully, if it runs tonight!) benchmarking a quad-core (Q6600) with an interacting bluff-body vortex-shedding case.

We're interested in this discussion and will keep you posted.

Rick

#7 - June 7, 2007, 10:17 - msrinath80 (Srinath Madhavan, a.k.a. pUl|)
Thanks. Looking forward to your results.

@Nicolas: I am aware of the P4_GLOBMEMSIZE environment variable for MPICH. With LAM, however, I've never had any complaints about a shared-memory shortage. The default value of 33554432 (32 MB) is what comes up when I issue cat /proc/sys/kernel/shmmax.

Do you have any pointers?

#8 - June 7, 2007, 11:45 - nikos (Nicolas Coste)
Which communicator do you use? If LAM, you can use the sysv RPI module to use shared memory on the same node, something like mpirun -ssi rpi sysv.

Even though the application is different, on large Oracle-type clusters they increase shmmax to 2 GB.

Increasing it to 128 or 256 MB should do the trick; too high a value can decrease performance.

The kernel should be tuned for parallel computation on both sides, memory and network.

By default, Linux kernel values are not appropriate for parallel computation (SMP and distributed) or NFS services. If you want to obtain the best performance, you must adjust your kernel to your computation.

In summary, small cases need practically no adjustment (except for the network), but as soon as you fall out of the memory cache ...

:->

#9 - June 7, 2007, 23:25 - msrinath80 (Srinath Madhavan, a.k.a. pUl|)
Thanks for your input. Just one observation: wall-clock times are better indicators for parallel speedup estimates. Or did you actually mean wall-clock times?

But based on what I see, you have a speedup close to 2.3 for 4 processes, so I guess our results are in agreement.
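One simple way to collect wall-clock numbers for such estimates is /usr/bin/time; a minimal sketch (the case name is hypothetical):

/usr/bin/time -p icoFoam . myCase > log.serial 2> time.serial
/usr/bin/time -p mpirun -np 4 icoFoam . myCase -parallel > log.np4 2> time.np4
# speedup = serial elapsed (real) time / parallel elapsed (real) time
t1=$(awk '/^real/ {print $2}' time.serial)
t4=$(awk '/^real/ {print $2}' time.np4)
echo "speedup on 4 cores: $(echo "$t1 / $t4" | bc -l)"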

#10 - June 7, 2007, 23:32 - msrinath80 (Srinath Madhavan, a.k.a. pUl|)
@Nicolas

Thanks for the pointers. However, I don't see the need to tweak any of those settings. Based on my Google searching, most people who change shmmax et al. are working with databases (e.g. Oracle). I searched the PETSc and LAM/MPI mailing lists to see whether anyone has reported a significant improvement from tweaking those settings, and could not find anything. If you have seen benefits, can you state specifically the nature of the parallel case you tested (how many cells, etc.) and what the difference in speedup was between the default setting of 32 MB and 128 or 256 MB?

Once again, thanks for your help!

#11 - June 8, 2007, 05:43 - nikos (Nicolas Coste)
OK, I'm too busy to run tests on our cluster and, unfortunately, too busy to come to the workshop ...

An interesting benchmark would be to go beyond a single Ethernet link: I mean using multiple Ethernet devices on the same host with OpenMPI, since OpenMPI can handle multiple Ethernet devices on the same host.

LAM doesn't like this (because of lamd) ...

When I have more time I will try it.

Have a nice day.

PS: when you post a benchmark, please include the OS, compiler and communicator.

#12 - June 12, 2007, 09:14 - msrinath80 (Srinath Madhavan, a.k.a. pUl|)
Update: I changed SHMMAX and SHMALL in /etc/sysctl.conf and rebooted. /proc/sys/kernel/shm* displays the new values I set. However, the speedup was again the same 1.2 when using two cores on the same physical CPU. Apparently it has no effect on the parallel speedup.

OS: RHEL 4.x (Scientific Linux 4.1)
Compiler: stock compiler supplied with OpenFOAM
Communicator: LAM

#13 - June 14, 2007, 09:10 - nikos (Nicolas Coste)
Hi,

Following this discussion I have done some tests on a small 2D case (100,000 cells) using OpenMPI.

First, compiling OF 1.4 with -march=nocona on an EM64T processor reduces the overall CPU time by about 10% for a serial case and 5% for a parallel case.

Second, I get the same results concerning speedup, i.e.
2 cores: 1.46
4 cores: 2.06

A warning: since I have only one quad-core CPU, the 4-core result may be polluted by OS activity (CentOS 4).
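For anyone wanting to try the same flag: the optimisation flags live in OpenFOAM's wmake rules. The exact directory depends on the architecture and compiler settings of your install (linux64Gcc4 below is only an example), so check $WM_PROJECT_DIR/wmake/rules first. A rough sketch:

# inspect the current optimised C++ flags (adjust the directory to your WM_ARCH/compiler)
cat $WM_PROJECT_DIR/wmake/rules/linux64Gcc4/c++Opt
# edit that file so the optimisation line reads, for example:
#   c++OPT = -march=nocona -O3
# then rebuild the libraries and solvers
cd $WM_PROJECT_DIR && ./Allwmake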

#14 - February 5, 2008, 06:26 - eugene (Eugene de Villiers)
You can't have dedicated memory channels for each core unless you are prepared to make the CPU package a lot bigger to accommodate the additional pins needed for the 64-bit interfaces. In addition, the FSB-based design does not allow for individual memory channels; those will only appear on Intel chips when CSI is released with Nehalem later this year.

The memory bottleneck is one of the main reasons AMD CPUs with HyperTransport can still be competitive in some applications.

Unfortunately, there isn't much you can do to make things faster other than buying lower-latency memory. You only have around 10 GB/s of memory bandwidth shared between 8 cores.

Bring back Rambus, I say.
