Some ANSYS Benchmarks: 3 node i7 vs Dual and Quad Xeon |
January 13, 2015, 12:42 |
Some ANSYS Benchmarks: 3 node i7 vs Dual and Quad Xeon
|
#1 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23 |
Here are the results of some CFX benchmarks I have been doing and collecting for a while:
Model:
Geometry: 1 m x 1 m x 5 m long duct
Mesh: 100 x 100 x 500 "cubes", all 1 x 1 x 1 cm (5M cells)
Flow: default water enters at 10 m/s at 300 K and exits the other side at 0 Pa; walls are at 400 K
High Resolution turbulence and advection; everything else default
Double precision: ON
20 iterations (you must tighten your convergence criteria or it will converge in fewer iterations)

The i7s are 3930K/4930K @ 4.2 GHz, each with 64 GB of 2133 MHz RAM, connected with 20 Gbps InfiniBand. The dual Xeons have 128 GB of RAM and the quad Xeon has 256 GB, all at 1600 MHz with memory channels balanced properly. (Performance was atrocious before they were balanced; the quad-CPU Xeon was running at only 46% of its current speed with unbalanced memory.)

I am comparing "CFD solver wall clock" times, not "total wall clock" times.

I added Acasas' results to the plot, thanks for sharing! I'll gladly add anyone else's results to the plot as well if they feel like running the benchmark.

[Attachment: CFX Benchmark.jpg]
|
January 13, 2015, 13:07 |
|
#2 |
Member
Antonio Casas
Join Date: May 2013
Location: world
Posts: 85
Rep Power: 13 |
Thanks for this info.
|
|
January 13, 2015, 16:37 |
|
#3 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings to all!
@Erik: Many thanks for this data! I see that you've restricted the core usage on the E5-4617 to only 4 cores per socket when you had 6 cores available. I guess the performance wasn't worth registering? And I'm assuming they're all first-generation Xeon E5 models. A note before continuing: for the 4x E5-4617 configuration running at 8 cores total, the performance seems consistent with using only 2 cores per socket, hence more memory bandwidth and a higher maximum CPU frequency were available per core. Let's see if I can do some mathematical estimations based only on the specs at ark.intel.com and then compare them with the results you've gotten, taking into account only using 4 cores on the E5-4617:
Now comes the really hard part, factoring in both details:
The cluster is hard to estimate because of the performance drop related to using an InfiniBand interconnect... uhm, actually, the scale-up is pretty much linear with the InfiniBand interconnect: the three-i7 cluster is 2.97 times faster than one i7. Then the problem is related to the 2 layers of overclocking, which don't provide a proper scale-up estimate. Let me review the mathematics assuming OC at maximum stock performance... it's the bOC entries: OK, it looks like the overclocking only helps marginally to get additional performance, which is usually expected from overclocking for HPC.

Side note: the "lithography boost" is something I've seen many times, namely where the same CPU design can gain performance roughly in inverse proportion to the lithography reduction.

Erik, do you happen to have at least one run of the cluster (or just one i7) without the overclock, for a similar comparison? In other words, how much did each machine actually gain with the dual-layer OC? And by the way, how much is each solution spending on electricity for each respective test?

Best regards,
Bruno
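(Editor's note: a minimal sketch of the kind of spec-based estimate discussed above, assuming a memory-bandwidth-bound solver scales roughly with aggregate theoretical bandwidth, i.e. channels x transfer rate x 8 bytes x sockets. The channel counts and RAM speeds are the ones mentioned in this thread; the scaling model and labels are illustrative, not Bruno's actual calculation.)

Code:
# Back-of-envelope estimate of relative CFD solver speed from published specs.
# Illustrative assumption: performance ~ aggregate theoretical memory bandwidth.

systems = {
    # name: (memory channels per socket, DDR transfer rate in MT/s, sockets)
    "i7-4930K (1 node)": (4, 2133, 1),
    "2x E5-2680":        (4, 1600, 2),
    "4x E5-4617":        (4, 1600, 4),
}

def aggregate_bandwidth_gbs(channels, mts, sockets):
    """Theoretical peak memory bandwidth in GB/s (8 bytes per transfer)."""
    return channels * mts * 8 * sockets / 1000.0

baseline = aggregate_bandwidth_gbs(*systems["i7-4930K (1 node)"])
for name, spec in systems.items():
    bw = aggregate_bandwidth_gbs(*spec)
    print(f"{name:18s} {bw:6.1f} GB/s  ~{bw / baseline:.2f}x vs one i7 node")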
|
January 15, 2015, 06:01 |
|
#4 |
Member
Antonio Casas
Join Date: May 2013
Location: world
Posts: 85
Rep Power: 13 |
Hi, this is what I've got on those computers:
On the i7-3820 @ 3.6 GHz with DDR3 SDRAM PC3-12800 @ 800 MHz, with 4 real cores and 8 threads and with affinity fully set, it took 1598 s wall time.

On the dual Xeon E5-2650 v3, 20 real cores, no Hyper-Threading, overclocking on, DDR4-2133 RAM (1066 MHz), it took 533 s wall time.

The full thread is here: http://www.cfd-online.com/Forums/har...tml#post527593
|
February 5, 2015, 12:22 |
|
#5 |
New Member
Sylvain Boulanger
Join Date: Nov 2014
Posts: 17
Rep Power: 12 |
Thank you guys, this is great information.
I do have a question regarding this approach. It seems that the general consensus is that the limiting factor for CFD computers is their memory bandwidth. Yet the theoretical CPU GHz is always taken into account when estimating a given system's performance. If the memory bandwidth is maxed out, the CPU is basically idling for the better part of any given second. Bruno is stating something to that effect. Why would overclocking yield marginal results, but baseline frequency be an important factor in the estimate? I get that the overclocking of the i7-4930K is just an 8% frequency increase, but if we look at the E5-2680, it is a 30% increase (2.7 GHz baseline, 3.5 GHz OC). I know that these sorts of calculations are trying to simply evaluate a rather complex system, but would you be able to provide more information on this?
|
February 7, 2015, 05:30 |
|
#6 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings Sylvain,
Quote:
Quote:
On the other hand, the turbo feature on the E5-2680 and other similar processors is not an overclocking feature; it is indicated as a stable frequency at which the CPU is rated to run. Essentially, the stock vs. turbo range gives us an idea that it's able to operate properly within that range. In addition, how this turbo feature shifts into gear depends on the model of the CPU itself; for example, from what I've seen:
Mmm... another analogy comes to mind, regarding memory bandwidth and the number of memory channels... imagine this:
Oh, and what about the CPU's cache... an analogy could be the scanner's book preloader, namely being able to scan one book while already having another book on hold that has already been retrieved from the trunk of the car.

Best regards,
Bruno

PS: Sorry, apparently I woke up in a creative writing mood.

edit: Have a look at this post as well: http://www.cfd-online.com/Forums/har...tml#post366672 (post #17)
|
February 9, 2015, 18:42 |
|
#7 |
New Member
Sylvain Boulanger
Join Date: Nov 2014
Posts: 17
Rep Power: 12 |
Thank you Bruno for your creative writing indeed.
If I understood correctly, you agree with my first statement, per your analogy of the smartest person ever who nonetheless can't put on glasses: CPU power has little impact when the memory bandwidth is maxed out. For your second analogy about memory bandwidth, what you're saying is: maximise the number of memory channels (road lanes) and maximise the memory frequency (road speed). What I don't get is when you use the same analogy for CPUs. You seem to say that all CPUs are the same with 0.1 s. So, when assessing the hardware requirements for a new system, why is the baseline CPU frequency or achievable boost frequency taken into account?

To support what I'm saying, I would like to point out the data provided in the first post of this thread. The first thing I noticed is the difference between the i7-4930K in the 1/2/3-node configuration and in the 3-node configuration. Looking at the results for 3 and 4 cores, we see that they're pretty much the same. This would suggest that the 4th core in the 1-node configuration is underused or a nuisance to the other cores. The same thing happens with 6 and 8 cores: there is only a 5% performance increase per added core when it should be around 17% assuming good scalability. And for 9 and 12 cores, the increase is 5% per core when it should be around 11%.

The other thing I noticed is that the performance difference between the i7-4930K (3-node configuration) and the E5-4617 matches the memory frequency increase to within about 5%.
Memory frequency increase: 2133/1600 = 1.33
Performance increase, 4 cores: 1.013/0.798 = 1.27
If we look at the data with 8 cores, the results are slightly different.
Performance increase, 8 cores: 2.034/1.767 = 1.15
The performance increase is not as much as we could expect, but the scalability of the E5-4617 between 4 and 8 cores is greater than 1:
(1.767/0.798)/(8/4) = 1.11

Here it would be nice to know if the core distribution was 4+0+0+0 or 1+1+1+1 for 4 cores (and likewise 4+4+0+0 or 2+2+2+2 for 8 cores). Scalability beyond 1 would imply that either something was restraining the 4-core distribution or the 8-core distribution has something more to work with. It could be a motherboard feature like NUMA, but this is getting beyond my knowledge. I think that the 1+1+1+1 distribution (and 2+2+2+2) was not used, because then all the board's memory bandwidth and all the motherboard features would have been available from the start; that way, the best scalability that could have been achieved would have been 1. So this "the whole is better than the sum of its parts" hypothesis could explain why there is a smaller than expected performance increase from the 8-core E5-4617 to the 8-core i7-4930K: the reference, the E5-4617, has a feature that the i7-4930K doesn't have. Hence the 15% performance increase instead of the expected 27% from the memory speed-up.

Now, for the E5-4617, something obviously happened between 8 and 12 cores. I cannot explain any of it, especially since the scalability between 12 and 16 cores is 1. A system that is memory-bandwidth limited should behave like the E5-2680: its scalability is 0.73 between 8 and 12 cores and 0.79 between 12 and 16 cores. Can someone provide an explanation for this?

Bruno, I looked at the link you provided and here's what I realised. Please tell me if this is the proper answer to my initial question about the CPU frequency being taken into account in a system's performance estimate. A system's scalability will be very close to one if the memory bandwidth is not fully used. Once the memory bandwidth is fully used, though the scalability will fall, a system will still benefit significantly from a higher total CPU frequency. Hence, the total CPU frequency available in a given system should be a secondary criterion when choosing a system.
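(Editor's note: for readers following the arithmetic, a small sketch of the ratios computed above. The performance values are the plot figures quoted in this post; the helper name is just for illustration.)

Code:
# Scalability check: measured speed-up vs the ideal (linear) speed-up
# when going from cores_lo to cores_hi cores. Performance values are the
# figures quoted above (arbitrary units, higher is better).

def scalability(perf_lo, perf_hi, cores_lo, cores_hi):
    """Measured speed-up divided by the ideal core-count ratio (1.0 = perfect scaling)."""
    return (perf_hi / perf_lo) / (cores_hi / cores_lo)

# E5-4617, 4 -> 8 cores: super-linear scaling (> 1)
print(f"E5-4617 4->8 cores:  {scalability(0.798, 1.767, 4, 8):.2f}")

# Memory frequency ratio vs measured i7-4930K / E5-4617 performance ratios
print(f"memory ratio:        {2133 / 1600:.2f}")
print(f"perf ratio, 4 cores: {1.013 / 0.798:.2f}")
print(f"perf ratio, 8 cores: {2.034 / 1.767:.2f}")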
|
February 10, 2015, 16:40 |
|
#8 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Quote:
Quote:
Quote:
Therefore, assuming it's 3 nodes with 3 cores each... mmm, I think the problem here is that the bottleneck is actually the cache per socket. It's 3 MB vs 4 MB per core. This means that by having 1 more MB per core, the number of times each core has to go fetch another blob of data from RAM is reduced. This is usually known as "cache misses" and has a very big impact on performance; here's a very good explanation on the topic: http://stackoverflow.com/a/16699282 - which initially does lead to the same conclusion you're getting, but the writer also indicates that it's not that simple. And having more memory-access bandwidth available per core also helps.

This is actually not the first time I've seen this... but I'm not able to find the blog post I'm thinking about. The idea was that a 2-socket machine with 16 cores total (8+8) was slower than a 4-socket machine also with 16 cores (4+4+4+4), even though the total speed was almost the same.

Quote:
And very likely the core configurations on the E5-4617 are populated per socket, which would explain the boost in performance, since it cuts the cache misses in half.

Quote:
As for your question and based on all of this, the order of prevalence for CFD should be something like this:
Quote:
In addition, a few years ago I did a test with a simple large test case in OpenFOAM, where a single-socket 6-core AMD 1055T wouldn't scale much beyond a factor of 3 vs a single-core run; the main bottleneck is that this CPU has only 2 memory channels. But the crazy detail was that when I over-scheduled 16 processes onto only 6 cores, the wall-clock runtime was smaller than when using only 6 processes. The explanation for this, which should also explain similar situations, is that by aligning memory accesses we can get a bit more performance.
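(Editor's note: a minimal sketch of the cache-miss effect mentioned above, in Python with NumPy. The array size is illustrative and not taken from any benchmark in this thread. Walking an array along its contiguous dimension touches each cache line once, while walking across the stride fetches a new cache line for almost every element, so the second loop is typically much slower despite doing the same arithmetic.)

Code:
import time
import numpy as np

n = 8192
a = np.random.rand(n, n)   # C-order: each row is contiguous in memory (~512 MB total)

# Row-wise accumulation: sequential memory access, prefetch friendly.
t0 = time.perf_counter()
acc = np.zeros(n)
for i in range(n):
    acc += a[i, :]
t_rows = time.perf_counter() - t0

# Column-wise accumulation: stride of n*8 bytes, so nearly every element
# lands in a different cache line (a "cache miss" per element).
t0 = time.perf_counter()
acc = np.zeros(n)
for j in range(n):
    acc += a[:, j]
t_cols = time.perf_counter() - t0

print(f"row-wise    (contiguous): {t_rows:.2f} s")
print(f"column-wise (strided):    {t_cols:.2f} s")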
|
February 10, 2015, 19:17 |
|
#9 |
New Member
Sylvain Boulanger
Join Date: Nov 2014
Posts: 17
Rep Power: 12 |
Quote:
Core distribution of the 1/2/3-node configuration:
Point 1: 4+0+0 (4 cores total)
Point 2: 4+4+0 (8 cores total)
Point 3: 4+4+4 (12 cores total)
3 points on the curve

Core distribution of the 3-node configuration:
Point 1: 1+1+1 (3 cores total)
Point 2: 2+2+2 (6 cores total)
Point 3: 3+3+3 (9 cores total)
Point 4: 4+4+4 (12 cores total)
4 points on the curve

Quote:
|
July 8, 2016, 11:28 |
Request for more explanation about memory bandwidth
|
#11 |
New Member
M-G
Join Date: Apr 2016
Posts: 28
Rep Power: 10 |
Dear all,
I'm a little confused about the definition of memory bandwidth. Please let me know which of these statements is incorrect.

1. For CFD, max memory bandwidth (e.g. 76.8 GB/s) divided by the CPU's number of cores is the main performance metric.
2. For the Intel Xeon E5-2698 v4, which has 20 cores and a maximum of 4 memory channels (when 4 RAM modules are installed in the correct slots), that gives 76.8 GB/s / 20 = 3.84 GB/s per core.
3. Most DDR4 memory modules on the market have at least about 10 GB/s read/write speed, which means that 10 GB/s - 3.84 GB/s = 6.16 GB/s of the RAM modules' bandwidth is wasted in this configuration.
4. The four memory channels do not each have an independent 76.8 GB/s of bandwidth; I mean that 76.8 GB/s is the maximum possible bandwidth the CPU could ever have in the best configuration, so 4 channels x 76.8 GB/s = 307.2 GB/s is incorrect.
5. If all of the above is correct, then the highest memory bandwidth available on a Xeon E5 CPU is 76.8 GB/s. So, considering an X99-chipset motherboard that allows DDR4-3200 and the fastest available memory module in write bandwidth, such as the Corsair CMK16GX4M4B3200C15 4GB (13,204 MB/s write speed), we come to this conclusion: 76.8 GB/s / 13.2 GB/s = 5.82, which means CPUs with more than 6 cores are not suitable for CFD because they waste memory module bandwidth. Also, a lower number of CPU cores would be a bottleneck for using the memory modules' bandwidth, because such a RAM module cannot respond with more than 13.2 GB/s and the CPU would be idle for a fraction of the time.
6. A larger CPU cache may compensate a little for the above-mentioned idle time caused by the memory bandwidth bottleneck, but I don't know how much, or how to calculate it.
7. The E5-2643 v4 would be better than the E5-2687W v4 for CFD, although the latter has a higher clock speed and price. So what would the E5-2699 v4, with 22 cores, be good for?

Thanks for taking the time to read my notes.
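(Editor's note: a small sketch of the per-core arithmetic in statements 1 and 2 above, applied to the CPUs named in this post; whether this per-core share is really the right metric is exactly what the replies below discuss. Core counts and the 76.8 GB/s peak are the published figures for these Broadwell-EP Xeons on ark.intel.com.)

Code:
# Per-core share of the published peak memory bandwidth (statements 1 and 2).
# (cores, max memory bandwidth in GB/s) as listed on ark.intel.com.

cpus = {
    "Xeon E5-2698 v4":  (20, 76.8),
    "Xeon E5-2643 v4":  (6,  76.8),
    "Xeon E5-2687W v4": (12, 76.8),
    "Xeon E5-2699 v4":  (22, 76.8),
}

for name, (cores, bw) in cpus.items():
    print(f"{name:18s} {bw / cores:5.2f} GB/s per core")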
|
July 11, 2016, 14:18 |
|
#12 |
Senior Member
Robert
Join Date: Jun 2010
Posts: 117
Rep Power: 17 |
@wyldckat: instead of the fancy theoretical calculations, why not just use the published SPECfp_rate measurements?
In my experience they provide a pretty good estimate of CCM+ performance on a chip.
|
July 24, 2016, 12:51 |
|
#13 |
New Member
Sylvain Boulanger
Join Date: Nov 2014
Posts: 17
Rep Power: 12 |
M-G,
The main problem with what you wrote is that you compare CPU memory bandwidth against RAM memory bandwidth when in fact they are one and the same thing. Memory bandwidth is dictated by the number of memory channels, the DIMM frequency and the motherboard. The advertised memory bandwidth for a given CPU is based on the DIMM frequency that is guaranteed to be stable by the manufacturer. Here's how it is calculated:

Memory bandwidth = DIMM frequency x 8 bytes (64 bits) x # of channels

So in the case of the Xeon E5-2698 v4 we have:

Memory bandwidth = 2400 MHz x 8 bytes x 4 channels = 76.8 GB/s, as advertised on Intel ARK.

Now, if you were to use that CPU in the X99 setup and were able to run it at 3200 MHz, the memory bandwidth would be 102.4 GB/s. This is all theoretical performance based on the hardware characteristics.

As for your optimization problem regarding the number of cores and memory bandwidth, it does not exist as soon as the case complexity reaches a certain level, which happens very quickly in CFD. By that I mean that for CFD analysis, the amount of data produced by the CPU will be larger than what the memory bandwidth can handle for "relatively normal" case complexity, hence the memory bottleneck.
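(Editor's note: a small sketch of the formula above, for anyone who wants to plug in their own DIMM speed and channel count; the example values are the ones used in this post.)

Code:
# Theoretical peak memory bandwidth = transfer rate (MT/s) x 8 bytes per transfer x channels.

def memory_bandwidth_gbs(transfer_rate_mts, channels):
    """Peak bandwidth in GB/s for a given DDR transfer rate and channel count."""
    return transfer_rate_mts * 8 * channels / 1000.0

# Xeon E5-2698 v4 at its rated DDR4-2400, 4 channels: 76.8 GB/s
print(memory_bandwidth_gbs(2400, 4))

# Same 4 channels running DDR4-3200 on an X99 board: 102.4 GB/s
print(memory_bandwidth_gbs(3200, 4))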
|
September 2, 2016, 13:00 |
|
#14 |
New Member
M-G
Join Date: Apr 2016
Posts: 28
Rep Power: 10 |
Dear Sylvain,
So you mean SPECfp_rate 2006 results are not applicable to CFD cases? I see that 4 of the 17 tests in CFP2006 are CFD-based. Would you please explain more?
|
December 15, 2016, 07:57 |
|
#15 |
New Member
Join Date: Jan 2015
Posts: 29
Rep Power: 11 |
Quote:
|