128 core cluster E5-26xx V4 processor choice for Ansys FLUENT
June 8, 2017, 10:46 |
128 core cluster E5-26xx V4 processor choice for Ansys FLUENT
|
#1 |
New Member
Ramón
Join Date: Mar 2016
Location: The Netherlands
Posts: 11
Rep Power: 10 |
Dear fellow CFD engineers and enthusiasts,
As an R&D department we are trying to significantly scale up our CFD solving capabilities. Currently we are using a single machine with dual Xeon E5-2637 V3 CPUs (8 cores) and 64 GB of memory. This machine is used for CFD simulations in Ansys FLUENT with the SIMPLE solver, with either steady k-epsilon Realizable or transient SAS/DES/LES turbulence modelling. All simulations use FGM partially premixed combustion modelling. Mesh sizes are very case/project dependent but range between 3 and 17 million cells.

We are considering a scale-up towards 128 cores (thus 3 Ansys HPC license packs with a single Ansys FLUENT solver license). However, I am getting a bit lost in the world of CPU specifications, memory speeds, interconnects, and where the bottleneck lies between solving time and communication time. Ansys, being a professional independent software supplier, does not give specific hardware advice, only feedback on configuration proposals. Our hardware supplier appears to lack the specific knowledge of flow simulations needed to help us with our decision. Our budget is not determined yet; first we would like to know what it will cost us to get the best solution possible.

The cluster will consist of a master node and multiple slave nodes. The only differences between the master and slave nodes will be that the master has extra internal storage and a better GPU. The following specifications are considered at the moment:
- All nodes interconnected with Mellanox InfiniBand
- Dual SSDs in RAID-0 for each machine (I know that a normal HDD should be sufficient)
- 8 GB/core RDIMM 2400 MT/s memory
- No GPU yet, as we are not using the COUPLED solver at the moment, but a mounting possibility will be present
- Dual-socket E5-2683 V4 processors in the initial specification

The E5-2683 V4 'only' runs at 2.1 GHz and I have the feeling that I can get much more simulation performance from one of the other E5-26xx V4 processors available. For example:
- E5-2680 v4: more bus and memory speed per core, slightly higher clock, one extra server needed (5 instead of 4).
- E5-2667 v4: much more bus and memory speed per core, much higher clock, but also twice as many servers needed (8 instead of 4). Will this negatively influence the possible communication bottleneck? Given the other thread (Socket 2011-3 processors - an overview) I should pick this one?

I would very much appreciate advice on how to make this choice, or simply which of the above (or other available E5-26xx V4 processors) to pick.

Kind regards,
F1aerofan
|
June 8, 2017, 11:59 |
|
#2 |
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18 |
Keep in mind that total memory bandwidth and total cache size are the most important factors, much more important than frequency or number of cores. All E5 V4 CPUs that support 2400 MHz memory have the same memory bandwidth.
If your cases are "large" (20 million+ elements), you would also likely see a benefit from increased network bandwidth using 8 machines vs 4 (8 InfiniBand cards vs 4). Of the options you suggested, the E5-2667 v4 is almost certainly the "fastest", since it has the most memory bandwidth. It is also far and away the most expensive and power hungry.

You might also want to look at something like the E5-2620 v4. It takes a slight memory bandwidth hit compared to the 2667, but it is MUCH cheaper and uses less power:
- 8 machines with 2x 8-core E5-2620 v4: ~1100 GB/s, 1360 W TDP, $6,672 (CPUs only)
- 8 machines with 2x 8-core E5-2667 v4: ~1230 GB/s, 1920 W TDP, $32,912
- 4 machines with 2x 16-core E5-2683 V4: ~615 GB/s, 960 W TDP, $14,768

My suggestion? Get 16 machines with the E5-2620 v4 and use the savings to buy another HPC Pack!

And as you admit, the SSDs are pointless here. Fluent can only save to one machine. Money is much better spent on making one really fast storage node rather than making them all sort-of fast.
|
June 8, 2017, 14:05 |
|
#3 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
First things first: AMD will release its new Naples platform on June 20th. Intel's new Skylake-EP processors are also expected to launch "mid 2017".
In contrast to the incremental speedups we saw between the last few generations of CPUs, this new generation will significantly increase performance for memory-bound workloads like CFD. We are talking about a 50% increase in performance per node or more. So you should only buy a Broadwell-EP based cluster if you absolutely cannot delay the purchase for another 1-2 months. Another positive side-effect might be that Intel seems to be re-evaluating its CPU pricing now that it is facing competition again. But this is only a guess based on the pricing of their Skylake-X CPUs; I cannot guarantee that this will be the case for the server CPUs.

That being said, if you want to buy Broadwell-EP, aim for the lowest number of cores per node with the highest clock speed to maximize performance. Inter-node communication is usually not the bottleneck with a relatively small number of nodes and an InfiniBand interconnect. You often see super-linear speedup when increasing the number of nodes. Since you will be paying quite a lot for the software licenses, it is recommended not to cheap out on the hardware, to make the most of the expensive licenses. Those 16-core CPUs are definitely the wrong choice. My recommendation would be the E5-2667v4.
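That super-linear speedup is usually attributed to cache effects: spreading a fixed-size case over more nodes shrinks the per-node working set until a noticeable fraction of it fits in L3 cache. A toy model of this effect, with all numbers being illustrative assumptions rather than measurements:

Code:
# Toy model of super-linear scaling: splitting a fixed-size case over more
# nodes shrinks the per-node working set, so a larger fraction of it fits in
# L3 cache and each core effectively sees faster memory.
# All numbers below are illustrative assumptions, not measurements.

CASE_GB = 40.0          # assumed memory footprint of the whole case
L3_GB_PER_NODE = 0.08   # e.g. 2 sockets x 40 MB L3 per node
CACHE_SPEEDUP = 5.0     # assumed speed ratio of cache-resident vs DRAM-bound work

def speedup(nodes):
    per_node_gb = CASE_GB / nodes
    in_cache = min(1.0, L3_GB_PER_NODE / per_node_gb)   # fraction served from L3
    time_per_node = (1.0 - in_cache) + in_cache / CACHE_SPEEDUP
    return nodes / time_per_node   # relative to one node running purely from DRAM

for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} nodes -> speedup {speedup(n):6.2f}")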
|
June 8, 2017, 19:10 |
|
#4 |
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18 |
From Dell's website:
- E5-2620 v4 w/ 32 GB RAM: $3,143 per node ($25k for 128 cores)
- E5-2667 v4 w/ 32 GB RAM: $6,387 per node ($52k for 128 cores)
- E5-2683 v4 w/ 64 GB RAM: $6,289 per node ($25k for 128 cores)

If he has a fixed $50k to spend, then the sweet spot is probably something like 12 of the E5-2620 machines (192 cores) plus another HPC pack.

And agreed, it is an exceptionally bad time to be buying hardware for a memory-bandwidth limited application.
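A quick budget sketch along those lines, using the Dell per-node prices quoted above. These cover CPUs and RAM only; InfiniBand, switch and cabinet come on top, which is why 12 rather than 15 E5-2620 nodes is the realistic sweet spot. The HPC Pack limits of 8/32/128/512 cores for 1/2/3/4 packs follow the usual Ansys progression, but verify that against your own license terms.

Code:
# Budget sketch: how many nodes/cores a fixed budget buys for each CPU option,
# and how many Ansys HPC Packs that core count would require (assumed limits).

BUDGET = 50_000
options = {
    "E5-2620 v4": (3143, 16),   # (price per node, cores per dual-socket node)
    "E5-2667 v4": (6387, 16),
    "E5-2683 v4": (6289, 32),
}
hpc_pack_limits = [8, 32, 128, 512]   # assumed cores enabled by 1..4 HPC Packs

for name, (price, cores_per_node) in options.items():
    nodes = BUDGET // price
    cores = nodes * cores_per_node
    packs = next(i + 1 for i, lim in enumerate(hpc_pack_limits) if lim >= cores)
    print(f"{name}: {nodes} nodes, {cores} cores, ${price * nodes:,}, "
          f"needs {packs} HPC pack(s)")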
|
June 9, 2017, 03:31 |
|
#5 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
When I look up their offers, the cheapest configuration for a single node with 2x E5-2620v4 is ~$5000. I factored in that you need an InfiniBand network adapter and at least 64 GB of RAM, because they don't offer 4 GB modules even in their cheapest server line. And then we still have the additional cost for a server cabinet, mounting material, installation and the rest of the networking hardware, which will increase slightly with a higher number of nodes.
Another thing to keep in mind is that the cases he runs are rather small. His smaller cases will not scale well on such a high number of cores, so I think it is better to have fewer, faster cores.

Last edited by flotus1; June 9, 2017 at 04:43.
|
June 9, 2017, 03:31 |
|
#6 |
New Member
Ramón
Join Date: Mar 2016
Location: The Netherlands
Posts: 11
Rep Power: 10 |
Thank you all for the very fast responses! So if I may summarize your advice:
- Total memory bandwidth and cache size are very important.
- The lowest number of cores per node with the highest clock speed per chip is fastest for CFD, given that InfiniBand is generally not the bottleneck and super-linear speedup is observed. So looking at memory bandwidth and bus speed per core is a good indicator of calculation speed for these memory-intensive processes.
- So for pure speed, take the E5-2667 V4.
- But the E5-2620 V4 is much cheaper and less power hungry, so getting more of these cores across more server blades may be cheaper, even if I need an extra Ansys HPC pack (and do not fully utilize that software license).
- But please wait for AMD Epyc (Naples) or Intel Skylake-EP, because these will be epic?!

I have two follow-up questions though.

Memory: We will go for the fastest available, so DDR4 at 2400 MT/s. But with the above proposed processors there are 4 memory channels per processor, so 8 in total for the dual socket. The proposed server blades have 24 DIMM slots available and would then feature 8x 16 GB RDIMM, 2400 MT/s, Single Rank, x8 Data Width. Would this be good enough as a configuration?

AMD Epyc or Skylake-EP: So AMD Epyc will launch on June 20th; then you have 2x 32 cores with 2x 8 memory channels available in a dual-socket configuration. While the first advice was to go with fewer cores at higher frequency and bandwidth (E5-2667 V4), is this increase from 16 to 64 cores per node somehow justified by the increase from 8 to 16 memory channels? Or can we just take AMD's claimed 2.5x speedup over Intel on seismic data for granted? I thought that in general AMD fell short of Intel most of the time when it came to CFD?

We indeed have little time available/patience left, since we already started inquiries for new hardware back in February. But you guys give better advice in a shorter time than my suppliers! And with a June 20th launch date, what is your experience of how long it takes before fully supported hardware can be bought as server racks and is supported by the software suppliers?

Thanks!
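To check my own understanding of the memory question, a quick sanity check of the proposed layout. All values are taken from the proposal above; cores per node depends on which CPU is finally chosen, and note that with 32-core nodes this works out to 4 GB/core rather than the 8 GB/core in the original specification.

Code:
# Sanity check of the proposed memory layout: dual socket, 4 channels per CPU,
# 8 x 16 GB RDIMM-2400 per node. Values taken from the proposal above.

sockets = 2
channels_per_socket = 4
dimms = 8
dimm_size_gb = 16
cores_per_node = 2 * 16            # assuming dual E5-2683 v4; use 2 * 8 for the 2667

total_channels = sockets * channels_per_socket
assert dimms % total_channels == 0, "populate every channel evenly"
print(f"{dimms // total_channels} DIMM(s) per channel")            # 1 -> all channels used
print(f"{dimms * dimm_size_gb} GB per node")                       # 128 GB
print(f"{dimms * dimm_size_gb / cores_per_node:.1f} GB per core")  # 4.0 with 32-core nodes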
|
June 9, 2017, 03:42 |
|
#7 | |||
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Edit: if the blades have 24 DIMM slots they might have put you on the wrong (more expensive) type of server. Tell them you do not need 24 DIMM slots or crazy expandability, just pure compute performance.

It was true that AMD could not beat Intel in terms of performance over the last few years. But based on the specifications we have seen so far, this might change with the new generation.

Again, your hardware vendor should know when they start selling the new products. Edit: at least we will have some more reliable benchmark figures and the actual lineup by the end of this month. This should help you with your decision.

Last edited by flotus1; June 9, 2017 at 05:42.
|
June 10, 2017, 16:53 |
|
#8 | |
New Member
Ramón
Join Date: Mar 2016
Location: The Netherlands
Posts: 11
Rep Power: 10 |
I am sure a lot of people are very curious about the upcoming benchmarks of the new processor types, so let's hope for fast benchmarks! Any advice on where to look for the best available CFD-related benchmarks?

For the time being, I will check with my hardware supplier on possible configurations and delivery times for Epyc, and on the pricing for the above E5 proposals.

Thanks a lot (again)!
|
June 11, 2017, 18:54 |
|
#9 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I initially understood that you wanted to run a single job on the whole cluster. Apart from a certain overhead from parallelization, the total amount of memory required should remain more or less constant, so you need less memory per core as you increase the number of cores.
But in the end you can't have too much memory. With great (computational) power comes larger model size.
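As a rough illustration of that point. The ~2 GB per million cells figure below is only a coarse rule of thumb and varies a lot with combustion and turbulence models; treat it as an assumption, not a Fluent specification.

Code:
# Rough memory-per-core estimate for a fixed-size case spread over more cores.
# The ~2 GB per million cells figure is an assumed rule of thumb, not a spec.

cells_million = 17                       # largest case mentioned in the thread
case_memory_gb = 2.0 * cells_million     # ~34 GB total, roughly independent of core count

for cores in (16, 64, 128, 192):
    print(f"{cores:3d} cores -> ~{case_memory_gb / cores:4.1f} GB needed per core")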
|
June 11, 2017, 20:12 |
|
#10 |
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,754
Rep Power: 66 |
HPC with CFD is very easy. The bottlenecks are always the RAM bandwidth and the interconnect (for which you have already selected InfiniBand). The more RAM you have (the larger the model size), the more true this is. Just get the highest RAM bandwidth you can afford and enough GB of capacity for what you want to model. Unlike other fields, in CFD it is rare to need a high core count with low memory usage. It does happen, for example when you want to get your solution 25% faster by doubling the cores used, but this is non-optimal usage of limited compute resources. Nowadays, with the northbridge on the CPU die, getting more RAM bandwidth means getting a fast enough CPU for the RAM configuration that you have, which makes the selection process even easier.
I highly recommend a quad-channel setup (as opposed to triple channel). These are very niche, which makes choosing easier!

And forget about GPUs. They are a bunch of extra headaches for no real benefit. GPU acceleration only works for the COUPLED solver, which few people use. I really like the COUPLED solver, but chances are you will be using SIMPLE/PISO or the density-based solver. The COUPLED solver scales less than linearly. GPUs do not have that many double-precision units, and for the cost of a GPU the best you can get is roughly linear scaling compared with CPU power. You can just as easily buy another node without any of the handicaps.

SSDs on write nodes are surprisingly useful. Often you want to write a lot of data, which is especially true if you are doing LES for example, and this can give you a serious speed-up, since writing to disk is the slowest operation of all and you need to write to disk constantly if you are saving data every time step. You can then transfer from the SSD to slower disks automatically. For the environments where you normally put these things, I am always paranoid that an HDD will fail, and I can definitely understand wanting SSDs on all machines. It happens a lot in Florida: one day it is too hot and the A/C cannot keep your cluster cool.

I would not wait for the next hardware release. However, with a hardware release there is often a drop in prices for previous-generation hardware. I would plan for current hardware and wait for a price drop (if any). Recently, the price drops have not been spectacular.

In summary, I agree with your outlook so far. Just get the most memory bandwidth you can afford and then everything else will fall into place.
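To put the write-speed point in numbers, a rough sketch. The file size per save and the disk throughput figures are assumptions for illustration only, not measurements of any particular system.

Code:
# Rough estimate of time spent writing transient data, e.g. exporting results
# frequently during an LES run. File sizes and throughputs are assumed values.

data_per_save_gb = 5.0     # assumed size of one data export
saves = 2000               # e.g. one export per time step over a run

for disk, mb_per_s in (("single HDD", 150), ("SATA SSD", 500), ("NVMe / RAID-0 SSD", 2000)):
    hours = data_per_save_gb * 1024 / mb_per_s * saves / 3600
    print(f"{disk:>18}: ~{hours:4.1f} h of pure write time over {saves} saves")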
|
June 12, 2017, 05:37 |
|
#11 | ||
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I agree with the larger part of your comment, but I have a slightly different opinion on a few points.
Of course you can buy similar CFD performance with 128 cores right now by simply using more nodes with lower core counts. But this comes at a higher price, and depending on which part of the world you live in, the higher power consumption might also be a factor.

What I really disagree on is dropping prices for older hardware. Intel has never done this in the past; they simply move on to the next generation. What they have done instead is reduce the prices for new hardware now that AMD is competitive again. The 10-core i7-6950X was sold at ~$1700. With AMD's "Threadripper" (who comes up with these names?) being announced, the MSRP for the 10-core i9-7900X dropped to $999. And if Intel does not reduce costs for the next generation of server CPUs, AMD will.
|
June 14, 2017, 06:02 |
|
#12 | |
New Member
Ramón
Join Date: Mar 2016
Location: The Netherlands
Posts: 11
Rep Power: 10 |
I have a follow-up question regarding the InfiniBand. I now see that there are multiple flavours of the Mellanox ConnectX-3 dual-port adapters:
- 10GbE or 40GbE
- normal copper cables (SFP) or with SR optics (QSFP)

Given that both of them probably give low latency and decent speed, which one should I choose? Or is this one also simple: get the fastest one to ensure it is not a bottleneck?
||
June 14, 2017, 06:40 |
|
#13 | |
Senior Member
Blanco
Join Date: Mar 2009
Location: Torino, Italy
Posts: 193
Rep Power: 17 |
You should prefer motherboards having 4 channels connecting each CPU to the RAM banks instead of those having only 3. This is of great importance in 3D CFD, as the bottleneck is usually the CPU-RAM communication. It is also important to ensure that all the CPU-RAM channels on your motherboard are filled, so avoid configurations like 2x 32 GB RAM banks if you have a 4-channel motherboard, otherwise you'll waste speed (and money).
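A quick illustration of what under-populating the channels costs, using peak-bandwidth arithmetic only (real-world penalties from losing interleaving can be even larger):

Code:
# Peak bandwidth lost by leaving memory channels empty on a quad-channel CPU.
# Simple arithmetic: peak bandwidth scales with the number of populated channels.

mts = 2400                          # DDR4-2400
per_channel_gbs = mts * 8 / 1000    # 19.2 GB/s per channel

for populated in (4, 3, 2, 1):
    print(f"{populated}/4 channels populated: {populated * per_channel_gbs:5.1f} GB/s "
          f"per socket ({populated / 4:.0%} of peak)")
# e.g. 2 x 32 GB on a 4-channel board gives the capacity but only half the bandwidth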
|
June 14, 2017, 09:05 |
|
#14 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I don't know why triple-channel memory was mentioned. Ever since the first generation of Xeon E5 CPUs released in 2012, they have all had quad-channel memory controllers. Edit: correction, apparently until the third generation of E5 processors there were low-end E5-24xx variants with only 3 memory channels; pretty much pointless for anything that requires computing power. I haven't seen a single motherboard with less than 1 DIMM slot per channel for these CPUs. Well, except for one mini-ITX board with only two slots, but that is a different story.
I assumed you understood that memory has to be populated symmetrically with one (or two) DIMMs per channel, since you got this right in the first place.

Optical cables can be used to cover longer distances, but I assume all the nodes will be close to the switch, so you don't need them either. 40 Gbit/s InfiniBand should strike a balance between cost and performance. At the very least you should not buy cards with older standards, because they also have higher latencies and are simply outdated.

Last edited by flotus1; June 14, 2017 at 21:03.
|
June 14, 2017, 13:20 |
|
#15 |
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18 |
If you are only looking for QDR (40 Gb/s) InfiniBand, that stuff is dirt cheap on eBay: roughly $20 for a card and $200 for a switch. This is two generations behind what is currently available. Don't let your hardware vendor charge you a ton of money for this.
Use copper QSFP+. No need for optical. |
|
June 15, 2017, 02:39 |
|
#16 | ||
New Member
Ramón
Join Date: Mar 2016
Location: The Netherlands
Posts: 11
Rep Power: 10 |
Great tip and I will check with my ICT department, but I figure they want the new 'warranty included' systems. |
|
June 21, 2017, 05:22 |
|
#17 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
June 20th has come and gone...
Apparently, availability of the more interesting Epyc processors will be an issue for at least a few more months. AMD starts with the flagship 32-core processors, sold by "selected partners". The versions with lower core counts will follow over the next months, with availability increasing over the course of 2017... That's close to my definition of a paper launch.

Unless you really don't need the performance within the next few months, I would now recommend buying a cluster based on the Broadwell-EP Xeon processors.
|
June 21, 2017, 06:54 |
|
#18 |
New Member
Join Date: May 2013
Posts: 26
Rep Power: 13 |
Is there any chance Epyc won't be much faster in CFD compared to the upcoming Skylake-SP?
They do have a clear advantage in memory bandwidth: no matter how many cores (8, 16, 24 or 32), it is always 8-channel DDR4.
|
June 21, 2017, 09:26 |
|
#19 | ||
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
|
|
June 22, 2017, 02:59 |
|
#20 |
New Member
Ramón
Join Date: Mar 2016
Location: The Netherlands
Posts: 11
Rep Power: 10 |
The 8-channel DDR4 is not the only benefit, right? The frequency of the memory itself is also 2666 MHz instead of the 2400 MHz for the Intel parts? I am a bit surprised about the clock speeds, though, because the highest non-boost clock is only 2.4 GHz (16-core), or an all-core boost of 2.9 GHz (24- or 16-core).
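Working out the theoretical per-socket numbers (peak bandwidth arithmetic only; sustained bandwidth and latency will of course differ):

Code:
# Theoretical peak memory bandwidth per socket: channels x MT/s x 8 bytes.
# For memory-bound CFD this number matters more than the base clock speed.

platforms = {
    "Broadwell-EP (E5-26xx v4)": (4, 2400),   # 4 channels, DDR4-2400
    "Epyc 'Naples'":             (8, 2666),   # 8 channels, DDR4-2666
}

for name, (channels, mts) in platforms.items():
    print(f"{name}: {channels * mts * 8 / 1000:.1f} GB/s peak per socket")
# -> 76.8 GB/s vs 170.6 GB/s: more than twice the per-socket peak bandwidth,
#    which is why the lower base clocks matter less for memory-bound solvers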
Well, this week I have been given some clarity on the timeline for the investment. As long as there is some solid benchmarking done before the end of July and our hardware partners are able to supply systems with Epyc from September onwards, we may still consider this.
|
|
|