128 core cluster E5-26xx V4 processor choice for Ansys FLUENT

F1aerofan · June 8, 2017, 10:46

Dear fellow CFD engineers and enthusiasts

,

As an R&D department we are trying to significantly scale up our CFD solving capabilities. Currently we are using a single machine, with dual CPU Xeon E5-2637 V3 (8 cores) and 64 GB memory. This machine is used for CFD simulations with Ansys FLUENT with the SIMPLE solver, with either steady k-epsilon Realizable or transient SAS/DES/LES turbulence modelling. All simulations are simulated with FGM partially premixed combustion modelling. Meshes sizes are very case/project dependent but range between 3 and 17 million cells.

We are considering a scale up towards 128 cores (thus 3 Ansys HPC license packs with a single Ansys FLUENT solver license). However, I am getting a bit lost in the world of CPU specifications, memory speeds, interconnections and where the bottleneck lies with solving time versus communication time.

Ansys is being a professional independent software supplier by not giving specific hardware advice but providing feedback on configuration proposals. Our hardware supplier appears to have not enough specific knowledge with flow simulations to help us with our decision. Our budget is not determined yet, first we would like to know what it will cost us if we get the best solution possible.

The cluster will exists out of a master node and multiple slave nodes. The only differences between the master and slave nodes will be that the master has extra internal storage and a better GPU. The following specifications are considered at the moment:
- All nodes will be interconnected with Mellanox Infinityband
- Dual SSD in Raid-0 for each machine (I know that normal HHD should be sufficient)
- 8 GB/core RDIMM 2400 MT/s memory
- No GPU yet, as we are not using COUPLED solver at the moment, but mounting possibility will be present.
- Dual socket E5-2683 V4 processors in initial specification.

The E5-2683 V4 'only' runs at 2.1 GHz and I have the feeling that I can get much more simulation performance with one of the other choices of E5-26xx V4 processors available. For example:
- E5-2680 v4; more bus and memory speed per core, slightly more Ghz, one extra server needed (5 instead of 4).
- E5-2667 v4; much more bus and memory speed per core, much more Ghz, but also 2 times more servers needed (8 instead of 4). Will this negatively influence the possible communication bottleneck?. Given the other thread (Socket 2011-3 processors - an overwiew) I should pick this one?

I would very much appreciate advice on how to make a choice, or simply which one of the above to choose or other E5-26xx V4 available processors to consider.

Kind regards

F1aerofan

kyle · June 8, 2017, 11:59

Keep in mind that total memory bandwidth and total cache size is the most important factor. Much more important than frequency or number of cores. All E5 V4 CPUs that support 2400mhz memory have the same memory bandwidth.

If your cases are "large" (20 million+ elements), you would also likely see a benefit from increased network bandwidth using 8 machines vs 4 (8 Infiniband cards vs 4).

Of the options you suggested, the E5-2667 v4 is almost certainly the "fastest," since it has the most memory bandwidth. It is also far and away the most expensive and power hungry.

You might also want to look at something like 2620 v4. It's a slight memory bandwidth hit over the 2667, but it's MUCH cheaper and uses less power.

8 machines with 2x 8 core E5-2620 v4: ~1100 GB/s, 1360 Watts TDP, $6,672 (CPUs only)
8 machines with 2x 8 core E5-2667 v4: ~1230 GB/s, 1920 Watts TDP, $32,912
4 machines with 2x 16 core E5-2683 V4: ~615 GB/s, 960 Watts TDP, $14,768

My suggestion? Get 16 machines with the E5-2620 v4 and use the savings to buy another HPC Pack!

And as you admit, the SSD's are pointless here. Fluent can only save to one machine. Money is much better spent on making one really fast storage node rather than making them all sort-of fast.

flotus1 · June 8, 2017, 14:05

First things first: AMD will release its new Naples platform on June 20th. Intels new Skylake-EP processors are also expected to launch "mid 2017".
In contrast to the incremental speedup we saw between the last few generations of CPUs this new generation will significantly increase performance for memory-bound workloads like CFD. We are talking about a 50% increase in performance per node or more.
So you should only buy a Broadwell-EP based cluster if you absolutely can not delay the purchase for another 1-2 months. Another positive side-effect might be that Intel seems to re-evaluate the CPU pricing now that they are facing competition again. But this is only a guess based on the pricing for their Skylake-X CPUs, I can not guarantee that this will be the case for the server CPUs.

That being said, if you want to buy Broadwell-EP aim for the lowest amount of cores per node with the highest clock speed to maximize performance. Inter-node communication is usually not the bottleneck with a relatively small amount of nodes and infiniband interconnect. You often see super-linear speedup when increasing the amount of nodes.
Since you will be paying quite a lot for the software licenses, it is recommended not to cheap out on the hardware to make the most of the expensive licenses. Those 16-core CPUs are definitely the wrong choice. My recommendation would be the E5-2667v4.

Quote:

My suggestion? Get 16 machines with the E5-2620 v4 and use the savings to buy another HPC Pack!

An interesting idea. However, you will have to consult your hardware vendor if there are any actual savings when going from 8 nodes with fast CPUs to 16 nodes with these CPUs. I highly doubt that.

kyle · June 8, 2017, 19:10

From Dell's website:

E5-2620 v4 w/32 GB RAM: $3143 ($25k for 128 cores)
E5-2667 v4 w/32 GB RAM: $6387 ($52k for 128 cores)
E5-2683 v4 w/64 GB RAM: $6289 ($25k for 128 cores)

If he has a fixed $50k to spend, then the sweet spot is probably something like 12 of the E5-2620 machines (192 cores) plus another HPC pack.

And agreed it is an exceptionally bad time to be buying hardware for a memory-bandwidth limited application.

flotus1 · June 9, 2017, 03:31

When I look up their offers, the cheapest configuration for a single node with 2xE5-2620v4 is ~5000$. I factored in that you need an infiniband network adapter and at least 64 GB of RAM because they don't offer 4GB modules even in their cheapest server line. And then we still have the additional cost for a server cabinet, mounting material, installation and the rest of the networking hardware. Which will slightly increase with a higher number of nodes.
Another thing to keep in mind is that the cases they run are rather small. His smaller cases will not scale well on such a high number of cores, so I think it is better to have less faster cores.

F1aerofan · June 9, 2017, 03:31

Thank you all for the very fast responses!

So if I may summarize your advice:
- Total memory bandwidth and cache size are very important
- Lowest amount of cores per node with highest clock speed per chip is fastest for CFD, given that infinity band is generally not the bottleneck and super linear speedup is observed. So indeed looking at memory bandwidth and bus speed per core would be a good indicator for calculation speed for these memory intensive processes.
- So for pure speed take the E5-2667 V4
- But the E5-2620 V4 is much cheaper and less power hungry, so getting more cores of these with even more server blades may be cheaper even if I get another extra Ansys HPC pack (and not fully utilize that software license).
- But please wait for the AMD Epyc (Naples) or Intel Skylake-EP because these will be epic?!

I have two follow up questions though.

Memory:
We will go for the fastest available, so DDR4 at 2400 MT/s. But With the above proposed processors there are 4 memory lanes per processor, so 8 in total for the dual socket. The server blades proposed have 24 DIMM locations available, and will then feature 8 times “16 GB RDIMM, 2400 MT/s, Single Rank, x8 Data Width”. Would this be good enough as a configuration?

AMD Epyc or SkyLake-EP:
So that AMD Epyc will launch 20th June, then you have 2x32 cores with 2x8 memory lanes available in dual socket configuration. So while the first advice was to go with less cores with higher frequency and bandwidth (E5-2667 V4), this increase from 16 to 64 cores per node is somehow justified by having an increase from 8 to 16 memory lanes? Or can we just take AMD’s seismic data 2.5x speedup w.r.t. Intel for granted? I thought that in general AMD most of the time fell short of Intel when it came to CFD?

We indeed have little time available/patience left since we already started inquiries for new hardware back in February.

But you guys give better advice in shorter time then my suppliers!

And with a 20th June launch date, what is your experience when full supported hardware can be sold as server racks and is supported by the software suppliers?

Thanks!

flotus1 · June 9, 2017, 03:42

Quote:

Memory:
We will go for the fastest available, so DDR4 at 2400 MT/s. But With the above proposed processors there are 4 memory lanes per processor, so 8 in total for the dual socket. The server blades proposed have 24 DIMM locations available, and will then feature 8 times “16 GB RDIMM, 2400 MT/s, Single Rank, x8 Data Width”. Would this be good enough as a configuration?

Do you really need 128GB per node? Based on the size of your cases I would have concluded that 64GB is enough. If you have the choice, use 8 DIMMs with dual-rank organization. it is slightly faster than single-rank.
Edit: if the blades have 24 DIMM slots they might have put you on the wrong (more expensive) type of server. Tell them you do not need 24 DIMM slots or crazy expandability, just pure compute performance.

Quote:

AMD Epyc or SkyLake-EP:
So that AMD Epyc will launch 20th June, then you have 2x32 cores with 2x8 memory lanes available in dual socket configuration. So while the first advice was to go with less cores with higher frequency and bandwidth (E5-2667 V4), this increase from 16 to 64 cores per node is somehow justified by having an increase from 8 to 16 memory lanes? Or can we just take AMD’s seismic data 2.5x speedup w.r.t. Intel for granted? I thought that in general AMD most of the time fell short of Intel when it came to CFD?

Just as with Intel processors, you would not buy the highest core count model for CFD. Rumor has it that AMD will also offer a 16-core "Epyc" processor. And no, you should definitely not take their marketing figures for granted and wait for some independent benchmarks.
It was true that AMD could not beat Intel in terms of performance over the last few years. But based on the specifications we saw so far this might change with the new generation.

Quote:

And with a 20th June launch date, what is your experience when full supported hardware can be sold as server racks and is supported by the software suppliers?

I can't help you with that. AMD claims that June 20th is not a paper-launch but that their products will be available on that very date. But take that with a pinch of salt.
Again, your hardware vendor should know when they start selling the new products.
Edit: at least we will have some more reliable benchmarks figures and the actual lineup by the end of this month. This should help you with your decision.

F1aerofan · June 10, 2017, 16:53

Quote:

Originally Posted by flotus1

Do you really need 128GB per node? Based on the size of your cases I would have concluded that 64GB is enough. If you have the choice, use 8 DIMMs with dual-rank organization. it is slightly faster than single-rank.
Edit: if the blades have 24 DIMM slots they might have put you on the wrong (more expensive) type of server. Tell them you do not need 24 DIMM slots or crazy expandability, just pure compute performance.

In general, I would say yes, we do need the 8 GB/core. Currently on the 8 cores machine, with the cases of 17 Million cells or more, with the combustion pdf look-up tables and radiation with D.O. we run into the 60+ GB memory usage. Problem appears to be in the storage of the pdf look-up tables which consume a lot of memory. But, I will go check up on the rank possibility and the DIMM slots.

I am sure a lot of people are very curious about the upcoming benchmarks of the new processor types then, so hope for fast benchmarks! Any advice where to look for the best fast available CFD-related benchmarks? For the time being, I will check with my hardware supplier on possibilities for configurations and delivery times of Epic, and the pricing for the above E5 proposals.

Thanks a lot (again)!

flotus1 · June 11, 2017, 18:54

I initially understood that you wanted to run a single job on the whole cluster. Apart from a certain overhead from parallelization, the total amount of memory required should remain more or less constant, thus less memory per core as you increase the number of cores.
But in the end you can't have too much memory. With great (computational) power comes larger model size

LuckyTran · June 11, 2017, 20:12

HPC with CFD is very easy. The bottlenecks are always the RAM bandwidth and the interconnect (which you have already selected infiniband). The more RAM you have (the larger the model size) the more true this is. Just get the highest RAM bandwidth you can afford and enough GB capacity for what you want to model. Unlike other fields, in CFD it's rare to need a high core count & low mem usage. It does happen, like when you want to get your solution 25% faster by doubling the cores used, but this is non-optimum usage of limited compute resources (you . Nowadays with the northbridge on the cpu-die, getting more ram bandwidth means getting a fast enough cpu for the ram configuration that you have which makes the selection process even easier.

I highly recommend a quad-channel setup (as opposed to triple channel). These are very niche, which makes choosing easier!

And forget about GPU's. It is a bunch of extra headaches for no real benefit. It only works for the COUPLED solver, which few people use. I really like the COUPLED solver, but chances are you will be using SIMPLE/PISO or the density-based solver. The COUPLED solver scales less than linearly. The number of double precision units on GPU's are not that many, and for the cost of a GPU, the best you can get is a linearly scaling compared with CPU power. You can just as easily buy another node without any of the handicaps.

SSD's on write nodes are surprisingly very useful. Often you want to write a lot of data, especially true if you're doing LES for example and this can give you serious speed-up since writing to disk is the slowest of all and you have consistent need to write to the disk if you're saving data every time-step. You can then transfer from the SSD to slower disks automatically. For the environments where you normally put these things, I'm always paranoid the HDD will fail and I can definitely understand wanting SSD's on all machines. It happens a lot in Florida, one day it is too hot and the A/C cannot keep your cluster cool.

I would not wait for the next hardware release. However, with a hardware release there's often a drop in prices for previous generation hardware. I would plan to current hardware and wait for a price drop (if any). Recently, the price drops have not been spectacular.

In summary I agree with your outlook so far. Just get most memory bandwidth you can afford and then everything else will fall into place.

flotus1 · June 12, 2017, 05:37

I agree with the larger part of your comment, but I have a slightly different opinion on this:

Quote:

I would not wait for the next hardware release. However, with a hardware release there's often a drop in prices for previous generation hardware. I would plan to current hardware and wait for a price drop (if any). Recently, the price drops have not been spectacular.

It is true that the general rule of thumb is "buy hardware when you need it". Delaying the purchase to wait for some new iteration is usually not necessary. But to quote what you said:

Quote:

HPC with CFD is very easy. The bottlenecks are always the RAM bandwidth [...]

AMDs "Epyc" and Intels "Skylake-EP" will be a huge leap forward in exactly this metric, featuring 8xDDR4-2666 and 6xDDR4-2666 memory controllers respectively. So I think it is justified to make an exception from the rule in this particular case.
Of course you can buy similar CFD performance in 128 cores right now by simply using more nodes with lower core counts. But this comes at a higher price and depending on which part of the world you live in the higher power consumption might also be a factor.
What I really disagree on is dropping prices for older hardware. Intel has never done this in the past, they simply move on to the next generation. What they have done instead is reducing the prices for new hardware since AMD is competitive again. The 10-core I7-6950x was sold at ~1700$. With AMDs "Threadripper" (who comes up with these names?) being announced the MRSP for the 10-core I9-7900X dropped to 999$. And if Intel does not reduce costs for the next generation of server CPUs, AMD will.

F1aerofan · June 14, 2017, 06:02

Quote:

Originally Posted by LuckyTran

I highly recommend a quad-channel setup (as opposed to triple channel). These are very niche, which makes choosing easier!

What do you mean with this? I can follow single and dual rank, and a little bit of RDIMM and LRDIMM but this one I cannot follow.

I have a follow-up question regarding the Infiniband. I now see that there is multiple flavours for the Mellanox ConnectX-3 Dual port adapters;
- 10GbE or 40GbE
- normal copper cables (SFP) or with SR optics (QSFP)

Given that both of them probably give low latency and decent speed. Which one to choose? Or is this one also simple; get the fastest one to ensure it is no bottleneck?

Blanco · June 14, 2017, 06:40

Quote:

Originally Posted by F1aerofan

What do you mean with this? I can follow single and dual rank, and a little bit of RDIMM and LRDIMM but this one I cannot follow.

It was referred to the motherboard-RAM connection and not to the RAM type.

You should prefer motherboards having 4 channels connecting each CPU to the RAM banks instead of those having only 3. This is of great importance in CFD-3D as the bottleneck is usually the CPU-RAM communication.

It is also important to assure that all the CPU-RAM channels you have on your motherboard are filled, so avoid configurations like 2x32 GB RAM banks if you have a 4 channel motherboard, otherwise you'll waste speed (and money

).

flotus1 · June 14, 2017, 09:05

I don't know why triple-channel memory was mentioned. Ever since the first generation of Xeon E5 CPUs released in 2012 they all had quad-channel memory controllers. Edit: correction, apparently until the third generation of E5 processors there were low-end variants E2-24xx with only 3 memory channels. Pretty much pointless for anything that requires computing power. I haven't seen a single motherboard with less than 1 DIMM slot per channel for these CPUs.Well except for one mini-ITX board with only two slots, but that is a different story.
I assumed you understood that memory has to be populated symmetrically with one (or two) DIMMs per channel since you got this right in the first place.

Quote:

I have a follow-up question regarding the Infiniband. I now see that there is multiple flavours for the Mellanox ConnectX-3 Dual port adapters;
- 10GbE or 40GbE
- normal copper cables (SFP) or with SR optics (QSFP)

Since your cluster will be relatively small you don't have to worry about fancy tree topologies for your infiniband network. You simply connect all nodes to one switch, that's it. Consequently, you don't need infiniband cards with two ports. A single port is enough.
Optical cables can be used to cover longer distances, but I assume all the nodes will be close to the switch so you don't need them either.
40 Gbit/s infiniband should strike a balance between cost and performance. At least you should not buy cards with lower standards because the also have higher latencies and are simply outdated.

kyle · June 14, 2017, 13:20

If you are only looking for QDR (40Gb/s) Infiniband, that stuff is dirt cheap on Ebay. Like $20 for a card and $200 for a switch. This is two generations behind what is currently available. Don't let your hardware vendor charge you a ton of money for this.

Use copper QSFP+. No need for optical.

F1aerofan · June 15, 2017, 02:39

Quote:

Originally Posted by flotus1

I assumed you understood that memory has to be populated symmetrically with one (or two) DIMMs per channel since you got this right in the first place.

You assumed correctly.

Quote:

Originally Posted by flotus1

Optical cables can be used to cover longer distances, but I assume all the nodes will be close to the switch so you don't need them either.
40 Gbit/s infiniband should strike a balance between cost and performance.

Very clear, thank you!

Quote:

Originally Posted by kyle

If you are only looking for QDR (40Gb/s) Infiniband, that stuff is dirt cheap on Ebay.

Great tip and I will check with my ICT department, but I figure they want the new 'warranty included' systems.

flotus1 · June 21, 2017, 05:22

June 20th has come and gone...
Apparently, availability of the more interesting Epyc processors will be an issue for at least a few more months. AMD starts with the flagship 32-core processors sold by "selected partners". The versions with lower core counts will follow within the next months with availability increasing over the course of 2017...That's close to my definition of a paper-launch.
Unless you really don't need the performance within the next few months, I would now recommend buying a Cluster based on the Broadwell-EP Xeon processors.

hpvd · June 21, 2017, 06:54

is there any chance Epyc wont be much faster in CFD compared to upcomming skylake SP?

They do have a clear advantage in Memory bandwith:
no matter how many cores (8, 16, 24 or 32): always 8 channel DDR4

flotus1 · June 21, 2017, 09:26

Quote:

no matter how many cores (8, 16, 24 or 32): always 8 channel DDR4

Do you have reason to suspect that Skylake-EP will have fewer memory channels with fewer cores?

Quote:

is there any chance Epyc wont be much faster in CFD compared to upcomming skylake SP?

There are two main reasons why I would like to see independent testing first. First of all clock speeds for the low core count models are surprisingly low. And then there is the issue of CCX interconnect which gets even more complicated on dual-socket systems. This could mean some rather high latencies for far memory or cache access. We will see how this affects performance. Plus the issue of compiler optimization or lack thereof with intel compilers...

F1aerofan · June 22, 2017, 02:59

The 8 channel DDR4 is not the only benefit right? The frequency of the memory itself is also 2666 Mhz instead of the 2400 MHz for the Intel? I am a bit surprised about the clockspeeds, because the highest non-boost is only 2.4 GHz (16 core) or all-core-boost of 2.9 (24 or 16 core).

Well this week I have been given some clarity on the timeline for the investment. So as long as there is some solid benchmarking done before the end of July and our hardware partners are able to supply systems with Epyc from september onwards, then we may still consider this.

June 8, 2017, 10:46	128 core cluster E5-26xx V4 processor choice for Ansys FLUENT	#1
F1aerofan New Member Ramón Join Date: Mar 2016 Location: The Netherlands Posts: 11 Rep Power: 10	Dear fellow CFD engineers and enthusiasts, As an R&D department we are trying to significantly scale up our CFD solving capabilities. Currently we are using a single machine, with dual CPU Xeon E5-2637 V3 (8 cores) and 64 GB memory. This machine is used for CFD simulations with Ansys FLUENT with the SIMPLE solver, with either steady k-epsilon Realizable or transient SAS/DES/LES turbulence modelling. All simulations are simulated with FGM partially premixed combustion modelling. Meshes sizes are very case/project dependent but range between 3 and 17 million cells. We are considering a scale up towards 128 cores (thus 3 Ansys HPC license packs with a single Ansys FLUENT solver license). However, I am getting a bit lost in the world of CPU specifications, memory speeds, interconnections and where the bottleneck lies with solving time versus communication time. Ansys is being a professional independent software supplier by not giving specific hardware advice but providing feedback on configuration proposals. Our hardware supplier appears to have not enough specific knowledge with flow simulations to help us with our decision. Our budget is not determined yet, first we would like to know what it will cost us if we get the best solution possible. The cluster will exists out of a master node and multiple slave nodes. The only differences between the master and slave nodes will be that the master has extra internal storage and a better GPU. The following specifications are considered at the moment: - All nodes will be interconnected with Mellanox Infinityband - Dual SSD in Raid-0 for each machine (I know that normal HHD should be sufficient) - 8 GB/core RDIMM 2400 MT/s memory - No GPU yet, as we are not using COUPLED solver at the moment, but mounting possibility will be present. - Dual socket E5-2683 V4 processors in initial specification. The E5-2683 V4 'only' runs at 2.1 GHz and I have the feeling that I can get much more simulation performance with one of the other choices of E5-26xx V4 processors available. For example: - E5-2680 v4; more bus and memory speed per core, slightly more Ghz, one extra server needed (5 instead of 4). - E5-2667 v4; much more bus and memory speed per core, much more Ghz, but also 2 times more servers needed (8 instead of 4). Will this negatively influence the possible communication bottleneck?. Given the other thread (Socket 2011-3 processors - an overwiew) I should pick this one? I would very much appreciate advice on how to make a choice, or simply which one of the above to choose or other E5-26xx V4 available processors to consider. Kind regards F1aerofan chaitanyaarige and Svetlana like this.

June 8, 2017, 11:59		#2
kyle Senior Member Join Date: Mar 2009 Location: Austin, TX Posts: 160 Rep Power: 18	Keep in mind that total memory bandwidth and total cache size is the most important factor. Much more important than frequency or number of cores. All E5 V4 CPUs that support 2400mhz memory have the same memory bandwidth. If your cases are "large" (20 million+ elements), you would also likely see a benefit from increased network bandwidth using 8 machines vs 4 (8 Infiniband cards vs 4). Of the options you suggested, the E5-2667 v4 is almost certainly the "fastest," since it has the most memory bandwidth. It is also far and away the most expensive and power hungry. You might also want to look at something like 2620 v4. It's a slight memory bandwidth hit over the 2667, but it's MUCH cheaper and uses less power. 8 machines with 2x 8 core E5-2620 v4: ~1100 GB/s, 1360 Watts TDP, $6,672 (CPUs only) 8 machines with 2x 8 core E5-2667 v4: ~1230 GB/s, 1920 Watts TDP, $32,912 4 machines with 2x 16 core E5-2683 V4: ~615 GB/s, 960 Watts TDP, $14,768 My suggestion? Get 16 machines with the E5-2620 v4 and use the savings to buy another HPC Pack! And as you admit, the SSD's are pointless here. Fluent can only save to one machine. Money is much better spent on making one really fast storage node rather than making them all sort-of fast. Svetlana likes this.

June 8, 2017, 19:10		#4
kyle Senior Member Join Date: Mar 2009 Location: Austin, TX Posts: 160 Rep Power: 18	From Dell's website: E5-2620 v4 w/32 GB RAM: $3143 ($25k for 128 cores) E5-2667 v4 w/32 GB RAM: $6387 ($52k for 128 cores) E5-2683 v4 w/64 GB RAM: $6289 ($25k for 128 cores) If he has a fixed $50k to spend, then the sweet spot is probably something like 12 of the E5-2620 machines (192 cores) plus another HPC pack. And agreed it is an exceptionally bad time to be buying hardware for a memory-bandwidth limited application. chaitanyaarige and Svetlana like this.

June 9, 2017, 03:31		#5
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,428 Rep Power: 49	When I look up their offers, the cheapest configuration for a single node with 2xE5-2620v4 is ~5000$. I factored in that you need an infiniband network adapter and at least 64 GB of RAM because they don't offer 4GB modules even in their cheapest server line. And then we still have the additional cost for a server cabinet, mounting material, installation and the rest of the networking hardware. Which will slightly increase with a higher number of nodes. Another thing to keep in mind is that the cases they run are rather small. His smaller cases will not scale well on such a high number of cores, so I think it is better to have less faster cores. Last edited by flotus1; June 9, 2017 at 04:43.

June 9, 2017, 03:31		#6
F1aerofan New Member Ramón Join Date: Mar 2016 Location: The Netherlands Posts: 11 Rep Power: 10	Thank you all for the very fast responses! So if I may summarize your advice: - Total memory bandwidth and cache size are very important - Lowest amount of cores per node with highest clock speed per chip is fastest for CFD, given that infinity band is generally not the bottleneck and super linear speedup is observed. So indeed looking at memory bandwidth and bus speed per core would be a good indicator for calculation speed for these memory intensive processes. - So for pure speed take the E5-2667 V4 - But the E5-2620 V4 is much cheaper and less power hungry, so getting more cores of these with even more server blades may be cheaper even if I get another extra Ansys HPC pack (and not fully utilize that software license). - But please wait for the AMD Epyc (Naples) or Intel Skylake-EP because these will be epic?! I have two follow up questions though. Memory: We will go for the fastest available, so DDR4 at 2400 MT/s. But With the above proposed processors there are 4 memory lanes per processor, so 8 in total for the dual socket. The server blades proposed have 24 DIMM locations available, and will then feature 8 times “16 GB RDIMM, 2400 MT/s, Single Rank, x8 Data Width”. Would this be good enough as a configuration? AMD Epyc or SkyLake-EP: So that AMD Epyc will launch 20th June, then you have 2x32 cores with 2x8 memory lanes available in dual socket configuration. So while the first advice was to go with less cores with higher frequency and bandwidth (E5-2667 V4), this increase from 16 to 64 cores per node is somehow justified by having an increase from 8 to 16 memory lanes? Or can we just take AMD’s seismic data 2.5x speedup w.r.t. Intel for granted? I thought that in general AMD most of the time fell short of Intel when it came to CFD? We indeed have little time available/patience left since we already started inquiries for new hardware back in February. But you guys give better advice in shorter time then my suppliers! And with a 20th June launch date, what is your experience when full supported hardware can be sold as server racks and is supported by the software suppliers? Thanks! chaitanyaarige likes this.

June 11, 2017, 18:54		#9
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,428 Rep Power: 49	I initially understood that you wanted to run a single job on the whole cluster. Apart from a certain overhead from parallelization, the total amount of memory required should remain more or less constant, thus less memory per core as you increase the number of cores. But in the end you can't have too much memory. With great (computational) power comes larger model size

June 11, 2017, 20:12		#10
LuckyTran Senior Member Lucky Join Date: Apr 2011 Location: Orlando, FL USA Posts: 5,763 Rep Power: 66	HPC with CFD is very easy. The bottlenecks are always the RAM bandwidth and the interconnect (which you have already selected infiniband). The more RAM you have (the larger the model size) the more true this is. Just get the highest RAM bandwidth you can afford and enough GB capacity for what you want to model. Unlike other fields, in CFD it's rare to need a high core count & low mem usage. It does happen, like when you want to get your solution 25% faster by doubling the cores used, but this is non-optimum usage of limited compute resources (you . Nowadays with the northbridge on the cpu-die, getting more ram bandwidth means getting a fast enough cpu for the ram configuration that you have which makes the selection process even easier. I highly recommend a quad-channel setup (as opposed to triple channel). These are very niche, which makes choosing easier! And forget about GPU's. It is a bunch of extra headaches for no real benefit. It only works for the COUPLED solver, which few people use. I really like the COUPLED solver, but chances are you will be using SIMPLE/PISO or the density-based solver. The COUPLED solver scales less than linearly. The number of double precision units on GPU's are not that many, and for the cost of a GPU, the best you can get is a linearly scaling compared with CPU power. You can just as easily buy another node without any of the handicaps. SSD's on write nodes are surprisingly very useful. Often you want to write a lot of data, especially true if you're doing LES for example and this can give you serious speed-up since writing to disk is the slowest of all and you have consistent need to write to the disk if you're saving data every time-step. You can then transfer from the SSD to slower disks automatically. For the environments where you normally put these things, I'm always paranoid the HDD will fail and I can definitely understand wanting SSD's on all machines. It happens a lot in Florida, one day it is too hot and the A/C cannot keep your cluster cool. I would not wait for the next hardware release. However, with a hardware release there's often a drop in prices for previous generation hardware. I would plan to current hardware and wait for a price drop (if any). Recently, the price drops have not been spectacular. In summary I agree with your outlook so far. Just get most memory bandwidth you can afford and then everything else will fall into place. F1aerofan likes this.

June 14, 2017, 13:20		#15
kyle Senior Member Join Date: Mar 2009 Location: Austin, TX Posts: 160 Rep Power: 18	If you are only looking for QDR (40Gb/s) Infiniband, that stuff is dirt cheap on Ebay. Like $20 for a card and $200 for a switch. This is two generations behind what is currently available. Don't let your hardware vendor charge you a ton of money for this. Use copper QSFP+. No need for optical.

June 21, 2017, 05:22		#17
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,428 Rep Power: 49	June 20th has come and gone... Apparently, availability of the more interesting Epyc processors will be an issue for at least a few more months. AMD starts with the flagship 32-core processors sold by "selected partners". The versions with lower core counts will follow within the next months with availability increasing over the course of 2017...That's close to my definition of a paper-launch. Unless you really don't need the performance within the next few months, I would now recommend buying a Cluster based on the Broadwell-EP Xeon processors.

June 21, 2017, 06:54		#18
hpvd New Member Join Date: May 2013 Posts: 26 Rep Power: 13	is there any chance Epyc wont be much faster in CFD compared to upcomming skylake SP? They do have a clear advantage in Memory bandwith: no matter how many cores (8, 16, 24 or 32): always 8 channel DDR4

June 22, 2017, 02:59		#20
F1aerofan New Member Ramón Join Date: Mar 2016 Location: The Netherlands Posts: 11 Rep Power: 10	The 8 channel DDR4 is not the only benefit right? The frequency of the memory itself is also 2666 Mhz instead of the 2400 MHz for the Intel? I am a bit surprised about the clockspeeds, because the highest non-boost is only 2.4 GHz (16 core) or all-core-boost of 2.9 (24 or 16 core). Well this week I have been given some clarity on the timeline for the investment. So as long as there is some solid benchmarking done before the end of July and our hardware partners are able to supply systems with Epyc from september onwards, then we may still consider this. akidess likes this.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
looking for a smart interface matlab fluent	chary	FLUENT	24	June 18, 2021 10:07
Superlinear speedup in OpenFOAM 13	msrinath80	OpenFOAM Running, Solving & CFD	18	March 3, 2015 06:36
Problem in using parallel process in fluent 14	Tleja	FLUENT	3	September 13, 2013 11:54
problem in using parallel process in fluent 14	aydinkabir88	FLUENT	1	July 10, 2013 03:00
Fluent on a Windows cluster	Erwin	FLUENT	4	October 22, 2002 12:39