What Hardware for CFX

ghorrocks · September 27, 2012, 19:32

Quote:

The SPEC fp benchmarks do not capture this

I never said it did. The CPU2006fp benchmark is a good proxy for single processor performance. You can extract an accurate benchmark for multi-processor performance from a combination of the CPU2006fp and CPU2006fp_rate benchmarks - but this is a bit more complex and I did not explain that in a simple forum posting.

Len - Having said that, I agree with all your points. Another comment is that two separate machines, each with a single CPU, are often faster than a single machine with the same two CPUs in it. Recent motherboards are much improved here, but the general trend is still there.

Which brings up the point of motherboard quality - a good motherboard with high bandwidth is important for multi-processor use. A few years back I had a machine which turned out to have a poor motherboard, and when it was swapped for a machine with the same CPU but a better motherboard, parallel operation ran twice as fast. I think motherboards are more reliable nowadays; most quality motherboards which support the top CPUs are pretty good.

Quote:

The only reason I got a 6 core is because when using all 4 cores of a 4 core processor the computer is pretty much worthless for anything else. With the 6 core I can run on 4 and still actually use my computer for other stuff if I want. When I had a 4 core, I only ran on 3 for this very reason.

Doing stuff on the remaining cores will still be pretty painfully slow, regardless of whether 0, 1 or 2 cores are free on a 6 core machine. It is still best to avoid the 6 core machines, even if you want to do other stuff at the same time.

Big Len · September 28, 2012, 04:52

OK, first up, I did make an error with the original quote.

The information was based on per-core performance. Basically it was saying that if you turn off 2 cores of a six core machine, then for the same number of job partitions you will decrease the solution time. This is how most data is presented regarding CFX, as it is obviously the licensing costs that dominate. I would imagine this effect may also become more pronounced as you compare chips that are actually 4 core vs 6 core. (Not to mention that the only E5 6 core with the cfx-life-giving 8GT/s system bus costs 75% more.)

I look at it this way (using dell prices)

For $6,400 I can have 16 cores at 2.9GHz with a total system bus of 16GT/s

or

For $6,600 I can have 16 cores at 3.3GHz with a total system bus of 32GT/s

One of these systems will blow the other out of the water ...
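The comparison above can be put on a per-unit footing. The following is only a rough sketch in Python that uses the prices, core counts, clocks and bus figures quoted in the post. The class and method names are made up for illustration, and "core-GHz" (cores times clock) is a crude proxy that ignores architectural differences between the chips.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One of the two quoted configurations (figures taken from the post)."""
    price_usd: float
    cores: int
    clock_ghz: float
    bus_gts: float  # "total system bus" figure as quoted

    def usd_per_core(self) -> float:
        return self.price_usd / self.cores

    def usd_per_core_ghz(self) -> float:
        # Crude proxy: treats cores x clock as the available compute.
        return self.price_usd / (self.cores * self.clock_ghz)

    def usd_per_gts(self) -> float:
        return self.price_usd / self.bus_gts


option_a = Option(price_usd=6400, cores=16, clock_ghz=2.9, bus_gts=16)
option_b = Option(price_usd=6600, cores=16, clock_ghz=3.3, bus_gts=32)

for name, opt in (("A", option_a), ("B", option_b)):
    print(
        f"Option {name}: ${opt.usd_per_core():.2f} per core, "
        f"${opt.usd_per_core_ghz():.2f} per core-GHz, "
        f"${opt.usd_per_gts():.2f} per GT/s"
    )
```

On these numbers the second option costs about 3% more per core but roughly half as much per GT/s of bus, which is the point the post is making.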

Big Len · September 28, 2012, 05:08

Quote:

Originally Posted by ghorrocks

I never said it did. The CPU2006fp benchmark is a good proxy for single processor performance. You can extract an accurate benchmark for multi-processor performance from a combination of the CPU2006fp and CPU2006fp_rate benchmarks - but this is a bit more complex and I did not explain that in a simple forum posting.

Hi ghorrocks, I did not mean to sound confrontational in my post - I was merely being terse to not have my point lost in a sea of words.

ghorrocks · September 28, 2012, 08:30

No offense taken. It's all good.

It is important for opinions to be expressed clearly, and if something is wrong then say so. You have obviously done some work and research in this area and your opinion is a good contribution to the forum.

shreyasr · October 9, 2012, 06:02

Hi everybody!

This is a great discussion.
I had a few related questions..
Hope it's okay that I post them on this thread :

1. Why is a dual socket/processor array better than a single socket processor, with the same number of cores ?

2. Let's say you have a dual socket Xeon E5 processor, with a speed of 2.6 GHz and then a single socket E5, with a speed of 3.6GHz; both with the same DDR3, 1600MHz RAM. Which would you prefer, and which would be faster for CFX ?

3. How important is cache memory in CFX simulations ?

4. How exactly does Intel's Turbo boost help with CFX ? Does it mean that the processors will run at the max turbo-boosted speed throughout the run ?

Looking forward to your responses !

-shreyas

evcelica · October 9, 2012, 08:51

Dual socket would be better since each socket has its own memory channels, so you would have 8 memory channels instead of "only" 4 with a single socket. Memory bandwidth seems to be our bottleneck in CFX, so I would go for the dual socket.
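To put rough numbers on the memory channel argument, here is a minimal sketch that computes theoretical peak bandwidth. It assumes the DDR3-1600 memory from the question (1600 MT/s on a 64-bit, i.e. 8-byte, channel) and the 4 versus 8 channels mentioned above. The 8-core count used for the per-core figure is an assumption for illustration only, and sustained bandwidth in practice is lower than the theoretical peak.

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak in GB/s: transfers per second x bytes per transfer x channels."""
    return transfer_rate_mts * 1e6 * bus_bytes * channels / 1e9


single_socket = peak_bandwidth_gbs(1600, channels=4)
dual_socket = peak_bandwidth_gbs(1600, channels=8)

cores = 8  # hypothetical: same total core count in both layouts
print(f"single socket: {single_socket:.1f} GB/s total, {single_socket / cores:.1f} GB/s per core")
print(f"dual socket:   {dual_socket:.1f} GB/s total, {dual_socket / cores:.1f} GB/s per core")
```

With the same number of cores, the dual socket layout has twice the peak bandwidth available per core, which matters if the solver is bandwidth bound as suggested.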

I don't think cache would matter much in larger problems with high RAM usage. I don't know for sure though.

Intel's "turbo boost" just increases the CPU clock speed under load and depending on how many cores are being used and if the temperature/power load is low enough. It would probably be max turbo boost with one core running, and decreasing clock as more cores are used.

bookie56 · October 9, 2012, 16:12

Hi guys!
I am glad I started this thread....it has been a fountain of information regarding different aspects of running CFX...

Thank you to all that have posted here!!

Much appreciated!

bookie56

evcelica · October 26, 2012, 20:54

I posted this on the Hardware forum but I thought I would share here too:

Just thought I'd share the somewhat unexpected results of my 2 node "cluster". I'm using two identical 6-core i7-3930K computers overclocked to 4.4 GHz, each with 32GB of 2133MHz RAM. They are connected using Intel gigabit and I'm using Platform MPI running ANSYS CFX v14.

Benchmark case has ~4 million nodes - steady state thermal with multiple domains.

When comparing:
1 computer running 4 cores to
2 computers running 4 cores each

My speedup shows to be 2.22 times faster!
So much for linear scaling. Has anyone else seen this? It just seems a little odd to me, though I'm definitely happy about it!
This is something to consider if anyone has been thinking about adding a second node.

I'd also be happy to do a little benchmarking against some dual socket XEON-E5 machines to compare the old 1 vs. 2 node question. I can set my CPU and memory frequency to whatever to make the test more even.

Thinking about this more, perhaps a cluster of single socket nodes would scale better than dual sockets since you would have twice as many interconnects, where dual sockets would be sharing one lane? Perhaps the E5-2643 is not the best choice then, and instead maybe the i7-3820 would take its place, as it is almost $600 cheaper? Even my 6 cores are several hundred cheaper than the E5-2643.

EDIT:
After running it a few more times I realized during my single node simulation I accidentally had the CPU downclocked to 3.8GHz instead of 4.4. So the 15.6% overclock gave me the extra 11% speed per node. Running it again with the same 4.4GHz clock speed on all nodes I got 99.5% efficient scaling. Sorry for the misinformation.

shreyasr · October 27, 2012, 03:18

Hi Eric,

That's an interesting observation. However, wouldn't one expect ~2X performance increase in such a mini cluster setup, assuming both the i7's have the same configuration ?
Why do you find it odd ?

I'd be very interested to know the benchmarking results with the Xeon E5's, especially since I am in the process of figuring out the optimum configuration to upgrade to in my office, with respect to CFX.

So far, in my benchmarking tests with our current computers :
Case :
Steady, Incompressible, subsonic flow
Geometry : complete hydraulic passages of a centrifugal pump, Frozen rotor config. ~2 Million cells.

I've found a 2X speedup with a dual socket (3.0GHz quad core), comparing with a single socket quad core (2.4GHz processor). They both have exactly the same RAM, ~533MHz, DDR2.

I've also found that a Westmere (Quad core 2.4GHz, dual socket config), with 1.3GHz DDR3 RAM completed the same simulation 3.5 hours earlier (46% speedup) , compared to my existing dual socket 3.0GHz quad core.
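The phrase "3.5 hours earlier (46% speedup)" can be read two ways, and the two readings imply different run times. The post does not state the actual wall-clock times, so the sketch below only back-calculates what each reading would imply. The function names are made up for illustration.

```python
def runtimes_if_time_reduced(saved_hours: float, fraction: float) -> tuple[float, float]:
    """Reading 1: the new run takes `fraction` less wall time than the old one."""
    old = saved_hours / fraction
    return old, old - saved_hours


def runtimes_if_rate_increased(saved_hours: float, fraction: float) -> tuple[float, float]:
    """Reading 2: the new machine's throughput is (1 + fraction) times the old one's."""
    # old - old / (1 + f) = saved  =>  old = saved * (1 + f) / f
    old = saved_hours * (1 + fraction) / fraction
    return old, old - saved_hours


old1, new1 = runtimes_if_time_reduced(3.5, 0.46)
old2, new2 = runtimes_if_rate_increased(3.5, 0.46)
print(f"46% less time:      old {old1:.1f} h, new {new1:.1f} h")
print(f"46% more throughput: old {old2:.1f} h, new {new2:.1f} h")
```

Quoting the two wall-clock times directly avoids the ambiguity when comparing results between posters.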

Based on the above observations, I'd be a little sceptical about parallel single socket configurations being able to beat the performance of dual socket configurations. Extending that further, I also think, when it comes to interconnects, it's probably the speed of the interconnects (Gig-eth/infiniband) which would make a noticeable difference rather than the number of interconnects. That's also what ANSYS swear by, though I understand it is really based on the application and the number of computers/cores being connected together.

Please feel free to correct me if I am wrong.

Came across this interesting document which is somewhat relevant (though it's old) : http://www.hpcadvisorycouncil.com/pdf/CFX_Analysis.pdf

Once again, looking forward to your benchmark study with the Xeon E5 2643's.

evcelica · October 27, 2012, 06:43

Thanks for sharing your benchmarking data.

I just found it odd since it's better than 2x faster; I was thinking "perfect" scaling would be 100% faster only, not 122%. Looking through some of the Fluent benchmarks I do see some rare cases where they get better than 100% scaling going to two nodes, but not often.
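The "100% faster" versus "122% faster" wording maps onto the usual speedup and parallel efficiency definitions. A minimal sketch, using the 2.22x figure reported earlier in the thread; the function names are illustrative only.

```python
def percent_faster(speedup: float) -> float:
    """Convert a speedup factor (e.g. 2.22x) into 'percent faster' (e.g. 122%)."""
    return (speedup - 1.0) * 100.0


def parallel_efficiency(speedup: float, resource_ratio: float) -> float:
    """Speedup divided by the factor by which the hardware was increased."""
    return speedup / resource_ratio


reported = 2.22  # 2 nodes x 4 cores versus 1 node x 4 cores
print(f"{percent_faster(reported):.0f}% faster")
print(f"efficiency: {parallel_efficiency(reported, 2):.0%}")
```

An efficiency above 100% is what is meant by super-linear scaling: doubling the hardware gave more than double the throughput.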

I was thinking for smaller clusters a few single socket i7s would have a higher performance/price ratio than dual socket XEONs.

If scaling to a large cluster, I really know nothing about clusters or interconnects or how they work, so maybe I shouldn't have said anything. I was just thinking each cpu would have its own interconnect instead of sharing one, I'm probably wrong though.

shreyasr · October 27, 2012, 08:15

Now that you've put it that way, it does seem strange and the difference seems high enough to warrant attention(?).
What do you think is contributing to the extra 22%?

If price is brought into the picture, from what I've read so far, I'd be inclined to agree with you regarding the higher performance/price of a mini cluster of 3rd generation i7's.
But, in such a scenario, I'm concerned about a very reliable, but relatively simple way of managing/administration. I would really want it to be open source/free.

I would like to know :
1. Do you use cluster applications/job schedulers to manage this mini cluster ?
If yes, which one ?
If no , how are you distributing your simulation? Is it via specifying the nodes in the cfx config file ?

2. Which OS are you using on both these computers?

ghorrocks · October 27, 2012, 08:24

Super-linear speed up (i.e. parallel efficiency greater than 1) generally means the benchmark did not run properly on the single node case. Usually this is because it is too large to fit fully into memory, so it had to swap/page some out to disk. The parallel ones are smaller and do not require paging - so they run faster than the expected acceleration.

In your case you have 32GB RAM, which should be big enough to fit this model. But memory fragmentation and other processes could still be the reason.

shreyasr · October 27, 2012, 08:52

Hi Glenn,
If that were the case, does it also mean that Erik would probably get different speedup results on re-running the single node job ?

evcelica · October 27, 2012, 10:31

Quote:

Originally Posted by shreyasr

I would like to know :
1. Do you use cluster applications/job schedulers to manage this mini cluster ?
If yes, which one ?
If no , how are you distributing your simulation? Is it via specifying the nodes in the cfx config file ?

2. Which OS are you using on both these computers?

I'm distributing it by specifying the nodes in the cfx config file.
I'm using Windows 7 x64 Ultimate.

I'm going to run the simulation again on each node separately, and make sure they are each the same speed.
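For readers who have not set up a distributed run, the host list can also be given on the solver command line instead of in the config file. The sketch below only builds the argument list and does not run anything. The host names and file name are hypothetical, and the flag names (-def, -par-dist, -start-method) and the start method string are written from memory of the CFX solver documentation, so treat them as assumptions and check them against your installed version.

```python
def host_list(hosts: dict[str, int]) -> str:
    """Format hosts as 'name*partitions' entries separated by commas."""
    return ",".join(f"{name}*{count}" for name, count in hosts.items())


def solver_command(def_file: str, hosts: dict[str, int], start_method: str) -> list[str]:
    """Build (but do not execute) a distributed-parallel solver command line."""
    return [
        "cfx5solve",
        "-def", def_file,
        "-par-dist", host_list(hosts),     # assumed flag name, verify locally
        "-start-method", start_method,     # assumed flag name, verify locally
    ]


cmd = solver_command(
    "benchmark_case.def",                  # hypothetical file name
    {"node1": 4, "node2": 4},              # hypothetical host names, 4 partitions each
    "Platform MPI Distributed Parallel",   # assumed start method label
)
print(" ".join(f'"{arg}"' if " " in arg else arg for arg in cmd))
```

Building the list rather than a single string keeps the start method, which contains spaces, as one argument if the command is later passed to something like subprocess.run.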

ghorrocks · October 28, 2012, 07:01

No, I am not suggesting you are getting different speed for different nodes.

I am saying that because the job is large it has the potential to run slower than optimal for many reasons - memory being one, but there are others (e.g. disk I/O). So when you use this slower than optimal run as a baseline for speedup factors you get factors greater than 100%.

The benchmark simulation I use is "Benchmark.def", which can be found in the examples directory. It is quite a small simulation so it will definitely run properly on any reasonable CFD computer. But because it is small, it is not good at testing more than about 4 processes.

As I am sure you are aware, there is no such thing as a universal benchmark.

evcelica · October 31, 2012, 11:57

EDIT:
After running it a few more times I realized during my single node simulation I accidentally had the CPU downclocked to 3.8GHz instead of 4.4. So the 15.6% Overclock gave me the extra 11% speed per node. Running it again with the same 4.4GHz clock speed on all nodes I got 99.5% efficient scaling. Sorry for the misinformation.
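The correction can be checked with a little arithmetic. This is a sketch that uses only the figures quoted in the thread (3.8 and 4.4 GHz, the reported 2.22x, and the 99.5% efficiency on the rerun) and assumes the two-node time was the same in both comparisons, which the post does not state explicitly.

```python
clock_ratio = 4.4 / 3.8            # overclock relative to the accidental 3.8 GHz baseline
reported_speedup = 2.22            # 2 nodes at 4.4 GHz versus 1 node at 3.8 GHz
rerun_efficiency = 0.995           # 2 nodes versus 1 node, all at 4.4 GHz
node_ratio = 2

# Speedup once both runs use the same clock.
like_for_like_speedup = rerun_efficiency * node_ratio

# How much faster the single node got from the clock increase alone
# (assumes the 2-node time did not change between the two comparisons).
single_node_gain = reported_speedup / like_for_like_speedup

# Fraction of the clock increase that showed up as solver speed.
clock_sensitivity = (single_node_gain - 1.0) / (clock_ratio - 1.0)

print(f"clock increase:      {clock_ratio - 1.0:.1%}")
print(f"single node gain:    {single_node_gain - 1.0:.1%}")
print(f"clock sensitivity:   {clock_sensitivity:.2f}")
```

The clock ratio works out to a little under 16%, close to the figure quoted, and the single node gain comes out near the quoted 11%. That the solver gains less than the full clock increase fits the earlier suggestion that memory bandwidth, not core clock alone, limits CFX.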

ghorrocks · October 31, 2012, 20:26

Doing good benchmarks is not easy. There are lots of gotchas.

shreyasr · January 9, 2013, 03:12

Hi Erik,
I was wondering if you had the time to get further benchmarking tests done with the dual socket E5's?

evcelica · January 9, 2013, 11:49

Quote:

Originally Posted by shreyasr

Hi Erik,
I was wondering if you had the time to get further benchmarking tests done with the dual socket E5's?

No problem,
Just to be clear, I have two i7 machines, not dual socket XEONs. But I'd be happy to do some benchmarking with the i7's, just send me the cfx file and tell me how you would like it run.

I'm guessing I could also estimate a dual XEON E5 machine's speed pretty well by downclocking my processors to whatever speed the XEONs run at, and lowering my memory frequencies and timings to match server memory, which would be 1600MHz @ 11-11-11-28. I'm sure it won't be perfect, but it should be quite close.

I can PM you my email if you're interested.

