Dual-CPU workstation vs. 2-node cluster of single-CPU workstations

Verdi · April 27, 2011, 06:53

I want to run OpenFOAM, ANSYS CFX and ANSYS Mechanical using Linux as the operating system. I have two options for the hardware setup. Which of the two will perform best?

For pre-processing and post-processing I have another workstation available. The main goal of the new machines is to provide computational power.

The two options I have are:
-Option 1
Dual-CPU workstation with Xeon X5690.

-Option 2
Two workstations with i7 990X or W3690 in a 2-node cluster.

I don't think the price differs much. But how about the performance?

Looking at the CPU benchmark (http://www.cpubenchmark.net/high_end_cpus.html) the W3690 seems to be the fastest CPU. But when I use two of these in a 2-node cluster, will this still be faster than a dual X5690 in a single workstation? My feeling tells me that a dual-CPU machine would have more efficient communication between the two CPUs. But will there be a significant difference in a real-life scenario?

JBeilke · April 27, 2011, 06:57

The i7-2600 is probably a lot faster than the Xeons. So get 4 of them and just use 2 cores per processor.

Verdi · April 28, 2011, 19:29

Quote:

Originally Posted by JBeilke

The i7-2600 is probably a lot faster than the Xeons. So get 4 of them and just use 2 cores per processor.

According to the PassMark benchmark the i7-2600 is slower... But the price is also much lower. I think I can get a 6-node cluster with the 2600 for the same price as a dual Xeon X5690 workstation. This could be option 3?

But I would still like to know which is the best way to go, looking at the two options from my first post. My feeling tells me that 2 CPUs on one motherboard with a direct connection between them are more efficient when the CPUs have to work together. Is my feeling correct? And if there is a difference, would you notice it in a real-world case running parallel CFD applications?

JBeilke · April 29, 2011, 04:18

Forget the usual benchmarks. CFD calculations require a lot of memory bandwidth. Even if you use only two cores of a modern Intel CPU, your speedup is not linear, and it gets worse as you use more cores.

And when you use only 2 cores per processor, the i7-2600 is the fastest one:
http://www.xbitlabs.com/articles/cpu...600k-990x.html

I'm not sure about 2 CPUs on one board.
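
To make the memory bandwidth argument concrete, here is a minimal sketch of a toy model, not a measurement. It assumes each solver process needs a fixed amount of memory bandwidth and that the socket can only deliver a fixed total; both figures below are hypothetical. The function name `bandwidth_limited_speedup` is made up for this illustration.

```python
def bandwidth_limited_speedup(cores, per_core_demand_gbs, socket_bandwidth_gbs):
    """Toy model: speedup is linear in the core count until the cores
    together ask for more memory bandwidth than the socket can deliver."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    # Number of cores the memory system can keep fully fed (never below one).
    max_fed_cores = max(1.0, socket_bandwidth_gbs / per_core_demand_gbs)
    return min(float(cores), max_fed_cores)


if __name__ == "__main__":
    # Hypothetical figures: 8 GB/s wanted per process, 20 GB/s per socket.
    for n in (1, 2, 4, 6):
        s = bandwidth_limited_speedup(n, 8.0, 20.0)
        print(f"{n} cores: speedup {s:.2f}, parallel efficiency {s / n:.0%}")
```

A real solver bends away from linear scaling earlier and more smoothly than this hard cap, but the model shows why extra cores on the same memory channels stop paying off.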

abdul099 · May 1, 2011, 11:26

It depends a lot on the connection between the workstations when using the i7. For communication, the main issue is not the link bandwidth but the latency. With plain Ethernet, my feeling is that it's slower, but I don't have any measurements.

Martin Hegedus · May 2, 2011, 23:57

I agree with Joern Beilke, the main bottleneck is the memory bandwidth, especially for unstructured solvers where the memory access pattern is close to random. Structured solvers have an advantage. If possible, you should size the CPUs to your memory bandwidth. Then decide how many machines you need.

abdul099 · May 4, 2011, 04:00

I agree only partially. You can't compare requirements for a stand-alone machine and a machine which will be part of a small "cluster".

Memory bandwidth is important when running the whole case on a single machine, because the full case needs data from the same memory and processes might block each other while reading or writing.

But when running the case in parallel on more than one machine, every machine has to handle only a part of the full model and needs less memory. And much of the memory access can happen at the same time on different machines. Therefore memory bandwidth becomes less critical as the number of machines increases, but the latency of the communication interface becomes very important. That's how every cluster is built: InfiniBand connection to get low latency, but no special memory.
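
A rough way to see why latency matters more than link bandwidth for small messages is the usual "latency plus size over bandwidth" estimate. The sketch below assumes that simple model. The 125 MB/s figure is just the nominal 1 Gbit/s line rate; the latency values and the InfiniBand bandwidth are hypothetical placeholders, not measurements.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_s):
    """Estimated time to deliver one message, in microseconds.

    1 MB/s equals 1 byte per microsecond, so bytes / (MB/s) is already in us.
    """
    return latency_us + message_bytes / bandwidth_mb_s


def latency_share(message_bytes, latency_us, bandwidth_mb_s):
    """Fraction of the transfer time spent waiting on latency alone."""
    return latency_us / transfer_time_us(message_bytes, latency_us, bandwidth_mb_s)


if __name__ == "__main__":
    # name: (assumed latency in us, assumed bandwidth in MB/s)
    links = {
        "gigabit Ethernet (assumed 50 us)": (50.0, 125.0),
        "InfiniBand (assumed 2 us, 3000 MB/s)": (2.0, 3000.0),
    }
    for name, (lat, bw) in links.items():
        for size in (1_000, 1_000_000):
            t = transfer_time_us(size, lat, bw)
            share = latency_share(size, lat, bw)
            print(f"{name}: {size} bytes -> {t:.1f} us, latency share {share:.0%}")
```

With these placeholder numbers, small halo-exchange style messages are dominated by latency, while large transfers are dominated by bandwidth.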

lalula2 · May 4, 2011, 22:58

I agree with what abdul099 said. You will need a very good communication interface between the nodes.
My dual Xeon E5620 machine runs faster than my small cluster (2 PCs with AMD X6 1090T), even though the AMDs have a higher clock speed than the Xeons. I suspect this is because the communication between the 2 PCs is not good enough to handle such large bandwidth.
I think a cluster is effective when you want to solve a case with a very large mesh, where a single workstation's memory is insufficient for the calculation. If the mesh is small enough for a single workstation to handle, I still prefer to run it on a single machine.

kyle · May 6, 2011, 17:38

lalula2, the speed difference is not because of limited bandwidth between the two machines, it is because of limited bandwidth from the AMD CPUs to their memory. Your Xeon processors are much, much better for CFD than your AMD processors simply because they have faster access to the system memory.

Unless your decomposition method is very poor, with just 2 nodes your simulation is not going to be bottlenecked by gigabit network speeds.

Verdi · May 11, 2011, 07:24

Thank you all for your replies! It is becoming a bit clearer to me.

So I have two different cases with different potential bottlenecks... For the dual-CPU single workstation, memory bandwidth is the important factor.
For a cluster, the memory bandwidth per machine is less important, but the network connection will determine the overall performance.

Looking at the two options from my first post, I think I will go for the dual-CPU workstation. It is easier to set up and to maintain.
Only when I want to scale up the number of CPUs can a cluster with cheap CPUs and memory and a good network connection be cheaper for the same performance.

abdul099 · May 22, 2011, 08:17

kyle, the amount of data to be exchanged is not very large. Therefore you are right, the network bandwidth is not the bottleneck. But don't forget the poor latency of Ethernet. There is a good reason why nearly all clusters have an InfiniBand connection between the nodes.

And to be honest, the Xeon E-series CPUs are just crap compared to the X-series. And that is because the E-series does NOT have fast memory access. I haven't tested it, but just from the CPU architecture, the Phenom X6 should have faster memory access than a Xeon E (and of course is much slower than a Xeon X).

kyle · May 22, 2011, 21:17

abdul,

This thread was strictly about filesystem bandwidth. Of course both memory bandwidth and network bandwidth are extremely important. The highest memory bandwidth per core, as well as the lowest memory latency, comes from a Sandy Bridge i5 or i7 with 2133 MHz memory. If you run a dual-socket Xeon X-series system, you can get a much higher memory bandwidth per system, but that isn't really meaningful. Memory bandwidth per dollar is the much more important number.

Check out the CPU benchmarks on http://techreport.com. They run memory bandwidth and latency tests, as well as a CFD benchmark, for every CPU right after it comes out.

lalula2 · May 31, 2011, 03:41

abdul,

There isn't much difference between the X and E series (E5620 and above) of Xeon processors. Both are triple channel, and the X series is more expensive. What you get is a higher clock speed; the rest is pretty much the same. You call it crap just because it is 100-200 MHz slower in clock speed?? Have you compared it in terms of MHz per dollar?

The Phenom X6 only runs at 21 GB/s dual-channel memory bandwidth, compared to the Xeon E series, which runs at 25.6 GB/s with triple-channel memory.
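
For reference, these two figures can be reproduced from the usual peak-bandwidth formula: channels times transfers per second times 8 bytes per transfer on a 64-bit channel. The sketch below assumes DDR3-1333 for the dual-channel system and DDR3-1066 for the triple-channel one; those memory speeds are my assumption, chosen because they match the numbers quoted above. The per-core split uses 6 cores for the Phenom X6 and 4 cores for the E5620.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


if __name__ == "__main__":
    # (label, channels, assumed MT/s, physical cores)
    systems = [
        ("Phenom X6, dual-channel DDR3-1333 (assumed)", 2, 1333, 6),
        ("Xeon E5620, triple-channel DDR3-1066 (assumed)", 3, 1066, 4),
    ]
    for label, channels, rate, cores in systems:
        total = peak_bandwidth_gb_s(channels, rate)
        print(f"{label}: {total:.1f} GB/s total, {total / cores:.1f} GB/s per core")
```

These are theoretical peaks only; sustained bandwidth in a solver is lower, but the per-core ratio gives a feel for how the two systems compare.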

abdul099 · June 3, 2011, 13:37

lalula,

you're right. I mixed it up because in our company, all E-series Xeon processors are based on the Core architecture, while the X-series processors are the newer ones based on the Nehalem architecture. So there is a huge difference between the integrated memory controller of the Nehalem CPUs and the memory access through a front-side bus of the Core CPUs.
As the Phenom is more similar to a Nehalem CPU (AMD had an integrated memory controller much earlier than Intel), the old Core-based Xeons should be easily beaten even by a Phenom X6, like I wrote before.

Anyway, MHz per $ doesn't mean everything. If I pick 200 Pentium I 120 MHz machines out of the trash bin, I get a lot of "MHz / $", but it's not fast and not a good choice even though it's cheap.
Whether one system beats another depends heavily on the specific case, and the comparison should be "performance / $", not "MHz / $".

evcelica · September 26, 2011, 20:13

Somewhat off topic but also interesting: I did some benchmarking with my i7-2600K system overclocked to 4.8 GHz and a dual Xeon X5675 system, both running two cores.
I was running a non-linear buckling analysis.
The i7 system ran nearly twice as fast as the dual Xeon system. So the per-core performance of the i7 turned out to be much, much better than that of the Xeons.

aerogt3 · March 9, 2012, 08:27

This thread has been very helpful so far. Does anyone know the difference between Sandy Bridge and Sandy Bridge-E as far as CFD goes? For example, these two similarly priced CPUs:

http://www.newegg.com/Product/Produc...82E16819115082

http://www.newegg.com/Product/Produc...82E16819117270

Is the first one a 1P CPU and the second intended for 2P system use?

evcelica · March 21, 2012, 20:16

The number at the pound sign in EX-#XXX states how many CPUs can be put on a single board. Sandy Bridge CPUs are all single socket; Sandy Bridge-E has both single and dual socket processors, with 4-socket parts to come.
Sandy Bridge has dual-channel memory, Sandy Bridge-E has quad-channel.
Between those two CPUs the E3 would probably blow away the E5 in most everyday tasks, but the quad-channel memory might make it a little closer race for CFD. But I would still put my money on the E3 since it has a much higher clock speed.
If you're going dual socket, then the E5 would be the way to go.

There may be more differences but those are the ones I know of.
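
As a small illustration of that naming rule, here is a sketch that reads the socket count out of a model string of the form EX-#XXX. The function name `max_sockets` is made up, and the model numbers used are only examples of the naming pattern, not necessarily the two CPUs linked above.

```python
import re

# E3 / E5 / E7, a dash, then four digits; the first digit is the socket count.
_XEON_NAME = re.compile(r"^E(?P<family>[357])-(?P<sockets>\d)\d{3}")


def max_sockets(model):
    """Return the maximum number of CPUs per board encoded in a Xeon model name."""
    match = _XEON_NAME.match(model.strip().upper())
    if match is None:
        raise ValueError(f"not an EX-#XXX style model name: {model!r}")
    return int(match.group("sockets"))


if __name__ == "__main__":
    for name in ("E3-1230", "E5-2620", "E5-4650"):
        print(f"{name}: up to {max_sockets(name)} socket(s)")
```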

aerogt3 · March 22, 2012, 04:43

Quote:

Originally Posted by evcelica

The number at the pound sign in EX-#XXX states how many CPUs can be put on a single board. Sandy Bridge CPUs are all single socket; Sandy Bridge-E has both single and dual socket processors, with 4-socket parts to come.
Sandy Bridge has dual-channel memory, Sandy Bridge-E has quad-channel.
Between those two CPUs the E3 would probably blow away the E5 in most everyday tasks, but the quad-channel memory might make it a little closer race for CFD. But I would still put my money on the E3 since it has a much higher clock speed.
If you're going dual socket, then the E5 would be the way to go.

There may be more differences but those are the ones I know of.

Great info! I need a dual socket processor, so you've settled it for me. Thanks a bunch!

Whitebear · September 2, 2013, 04:09

ANSYS Workbench is very slow on the Xeon X5690 CPU.
