4-CPU motherboard for CFD

petrile_83 · November 6, 2011, 04:54

Has anyone used 4-CPU motherboards for CFD? With a 4-CPU motherboard and four Opteron 6174 processors you can build a compact 48-core machine. For example, this motherboard:

http://www.supermicro.com/Aplus/moth...x0/H8QGi-F.cfm

Which setup would be faster?

Hardware 1:
cluster of 2 computers
2-CPU motherboards with two Opteron 6174 in each machine

Hardware 2:
1 computer
four Opteron 6174
4-CPU motherboard

sail · November 7, 2011, 16:06

It should be setup number 2. Communication within a single machine is way faster.

RobertB · November 7, 2011, 18:02

Aren't the processors for 2-processor boards and 4-processor boards different part numbers? For Opterons they are now the 2000 and 8000 series, I believe. Typically the 4-processor-per-board models are significantly more expensive.

I would think that, with a decent interconnect, the two-board solution is probably faster and cheaper. You will probably get more memory bandwidth per core with the two-board solution.

markusrehm · November 8, 2011, 05:05

Hi,

The 4-processor solution with the 12-core Opterons (6100 series, aka Magny-Cours) is very popular in the HPC community at the moment because it offers a very good price/performance ratio. The soon-to-be-available 6200 series (aka Interlagos) should fit into the same board. The memory bandwidth is also quite nice. For a benchmark you might read here:

http://www.anandtech.com/show/3894/s...clash-dellr815

If you can wait: Interlagos prices should be even more competitive, but the first benchmarks of the already available desktop FX series indicate that you need some compiler tuning to get full performance:

http://www.phoronix.com/scan.php?pag...ompilers&num=1

Regards, Markus.

kyle · November 9, 2011, 10:47

CFD performance with unstructured grids on AMD's multi-socket boards is extremely poor. This article from AnandTech tries to investigate why. I am assuming that Interlagos won't fix this entirely.

The best price/performance for CFD available now is far and away Intel's desktop chips. Four i5 2400 machines, which you can build for as little as $300 each, would blow your two choices out of the water. With just four machines you can get away with just a gig-e network.

Or you could wait a week and get the new Intel Sandy Bridge E chips, which have six cores and an absolutely ridiculous amount of memory bandwidth. The machines would cost a little more than ones using the current Sandy Bridge chips, but the performance should be significantly higher as well. It would definitely be way cheaper, and way faster, than buying server-class hardware from AMD.

markusrehm · November 10, 2011, 04:49

I doubt that Euler3d results are representative of general CFD performance. On this system

http://www.cfd-online.com/Forums/ope...tml#post314891

the speedup was almost linear.

Also Gigabit Ethernet interconnects are not a good choice if you want top performance.

From my point of view you are better off with Intel chips at the moment if the licensing model of your CFD code is per core. If licensing doesn't matter, Opterons are often the better alternative. But as we saw before, this is not generally valid, so it is best to run benchmarks of your own code before buying.

Regards, Markus.

kyle · November 10, 2011, 11:25

Gigabit ethernet is good enough for very small clusters. I had a four node cluster with gigabit ethernet that scaled from one to four nodes at 90% efficiency. Infiniband would take that up to what, 93%? For the money I could just buy another node and get ~20% speedup instead of ~3%.
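The arithmetic behind this argument can be sketched in a few lines. This is only an illustration under stated assumptions: the 90% figure is the measured one from the post, the 93% InfiniBand figure is the guess from the post, and the efficiency of a five-node cluster is not given anywhere, so two hypothetical cases are shown.

```python
def speedup(nodes: int, efficiency: float) -> float:
    """Speedup over a single node for a cluster at the given parallel efficiency."""
    return nodes * efficiency

gige_4 = speedup(4, 0.90)   # measured: four nodes on gigabit Ethernet
ib_4 = speedup(4, 0.93)     # guessed in the post: same four nodes on InfiniBand
gige_5 = speedup(5, 0.90)   # assumption: efficiency stays at 90% with a fifth node

gain_ib = ib_4 / gige_4 - 1       # relative gain from the faster interconnect
gain_node = gige_5 / gige_4 - 1   # relative gain from one more node

# Efficiency a five-node cluster would need for exactly a 20% gain over four nodes
breakeven_eff = 1.2 * gige_4 / 5

print(f"InfiniBand gain: {gain_ib:.1%}")
print(f"Fifth node gain at 90% efficiency: {gain_node:.1%}")
print(f"Efficiency needed for a 20% gain: {breakeven_eff:.1%}")
```

If efficiency held at 90%, the fifth node would give about 25%; the ~20% quoted above corresponds to efficiency dropping to roughly 86% on five nodes.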

AMD just is not competitive right now. With traditional CFD on unstructured grids, performance is dominated by memory bandwidth, memory latency and caching, all of which are areas in which Intel has a significant advantage. Clock speed doesn't really matter: I overclocked my machines from 3.4 GHz to 4.0 GHz and only saw a tiny speedup.

Regardless of per-core licensing issues, if you have a fixed amount of money to spend then buying Intel systems will give you the fastest cluster.

All of this only holds true for traditional CFD on unstructured meshes. If you are using structured meshes or a Lattice Boltzmann code like Exa, then AMD likely DOES make sense.

abdul099 · November 11, 2011, 16:46

Another point to consider is energy consumption. My privately owned AMD CPU is slower than the Xeon in my workstation and needs more energy. This is no issue as long as it doesn't run for a long time, but when it's up 24/7 and under full load, it makes a huge difference. In Germany, it makes a difference of 50 bucks on the electricity bill per node in just a year. And the AMD would need to run at least 20% longer to get the same results.
It's a shame, as I don't like the total market control and pricing policy of Intel, but at least at the moment, AMD can't compete with the power and efficiency of Intel CPUs.

USiller · December 18, 2011, 13:10

I recently had the chance to run a small benchmark comparing a two-socket Xeon X5675 system (24 logical cores with Hyper-Threading, 3.06 GHz) and the new AMD Opteron 6274 (32 cores, 2.1 GHz). I ran the DLR turbomachinery solver TRACE on a multi-block mesh of an axial compressor stage. The OS was openSUSE 12.1 in both cases, with Open MPI used for parallelization.

The results at a glance

machine     jobs  cores per job  timesteps/minute (over all jobs)
XeonX5675   3     4              30.57
XeonX5675   3     8              33.93
XeonX5675   4     6              34.09

Opt6274     4     4              26.79
Opt6274     4     8              37.57

The main conclusions (from my perspective)

- Hyper-Threading on the Xeon is only effective in case of imperfect load balancing, at least for this number-crunching-intensive code.
- Sharing one FPU between two cores on the Opteron system is the better deal for CFD: the test with 4*8 cores is about 40% faster than the one with 4*4 cores (one FPU per process).
- The Opteron is the better deal, especially for a four-socket system with InfiniBand interconnect, resulting in much lower hardware costs.
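As a quick check of these conclusions, the ratios can be recomputed from the table above. This is a minimal sketch that only re-uses the posted numbers. Reading the 3 x 8 Xeon run as the Hyper-Threading case is an assumption based on the 12 physical cores of a two-socket X5675 system, and the dictionary layout and variable names are made up for illustration.

```python
# timesteps/minute summed over all jobs, keyed by (machine, jobs, cores per job)
results = {
    ("XeonX5675", 3, 4): 30.57,
    ("XeonX5675", 3, 8): 33.93,
    ("XeonX5675", 4, 6): 34.09,
    ("Opt6274", 4, 4): 26.79,
    ("Opt6274", 4, 8): 37.57,
}

def ratio(a, b):
    """Throughput of run a relative to run b."""
    return results[a] / results[b]

# Opteron: all 32 cores (shared FPUs) versus 16 processes (one FPU each)
opteron_full_vs_half = ratio(("Opt6274", 4, 8), ("Opt6274", 4, 4))

# Xeon: 24 processes (assumed Hyper-Threading case) versus 12 processes
xeon_ht_vs_no_ht = ratio(("XeonX5675", 3, 8), ("XeonX5675", 3, 4))

# Best Opteron run versus best Xeon run
opteron_vs_best_xeon = ratio(("Opt6274", 4, 8), ("XeonX5675", 4, 6))

print(f"Opteron 4x8 vs 4x4: {opteron_full_vs_half:.2f}")   # 1.40
print(f"Xeon 3x8 vs 3x4:    {xeon_ht_vs_no_ht:.2f}")       # 1.11
print(f"Opteron vs Xeon:    {opteron_vs_best_xeon:.2f}")   # 1.10
```

The first ratio reproduces the "about 40%" figure. On these numbers the best Opteron run is roughly 10% ahead of the best Xeon run in total throughput.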

gskillas · April 11, 2012, 06:14

Dear Mr. Siller

Just to make sure I understand your benchmark correctly: you ran three or four separate jobs at once, which together used all cores available on the system.

Could it be that if you use all cores for one job (and make sure that no processor switches happen, emptying the INT/CMD/FPU pipelines) the results may look different? (And yes, I agree, HT is not relevant for CFD).
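One way to rule out such processor switches in a test is to pin each process to a fixed core. The following is a minimal sketch, assuming Linux and Python's `os.sched_getaffinity` / `os.sched_setaffinity`. The helper name `pin_to_one_cpu` is hypothetical, and MPI launchers usually have their own binding options, which are not shown here.

```python
import os

def pin_to_one_cpu():
    """Pin the calling process to a single CPU.

    Returns the new affinity set, or None on platforms without
    sched_setaffinity (for example macOS or Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently run on
    target = min(allowed)               # arbitrary choice: lowest allowed CPU id
    os.sched_setaffinity(0, {target})   # pid 0 means the calling process
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    print("Now restricted to CPUs:", pin_to_one_cpu())
```

In a real run each solver process would be given a different core, so that the scheduler cannot move them around during the computation.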

I am asking because I have to make the decision between Opteron 62XX and E5-26YY, and there are different aspects to consider. Of the benchmarks at

http://www.amd.com/de/products/serve...t-servers.aspx

ROMS and WRFv3 are interesting for CFD applications, while according to

http://investors.ansys.com/releaseDe...leaseID=662929

it seems to me that the 6174 processor can only win in certain rather artificial situations. If anything, you need to consider the 6276 as the direct E5-26YY competitor.

Best regards,

George Skillas

USiller · April 16, 2012, 09:08

Hi Mr. Skillas,

You are right: I started the same computation n times on the machine and measured the time to finish a specific number of timesteps. While for the Interlagos and the Xeon without HT all runs finished at almost the same time, the HT-on case had very different running times (differences of up to 10%).

My little benchmark is far from answering even the most important questions in the matrix of factors relevant to parallel computing.

We used the following strategy to answer the question:
- We have no per-core licensing issue with our CFD solver, which simplifies things a lot.
- Comparing the hardware costs of a Xeon-based 2-socket server and an Interlagos 4-socket server (both with IB interconnect), we came up with approximately half the hardware cost per core for the AMD system; the lower clock speed of the AMD is already taken into account.

Last week we received our HPC cluster from Delta Computer GmbH (Hamburg), and we are now looking forward to testing again in-house.

Best regards,
Ulrich Siller

CapSizer · April 16, 2012, 17:49

Ulrich, it would be great if you could keep us informed about what you find. I am particularly interested in seeing how well your application scales on a node compared to how well it scales across nodes. There seems to be quite a lot of uncertainty about whether it is really better to run with many cores on a motherboard (call it pure shared memory), or if it is faster to have more nodes, but not so many cores per motherboard.
