|
Hardware-Configuration for Fluent HPC-Pack (8x) |
|
January 29, 2014, 11:03 |
Hardware-Configuration for Fluent HPC-Pack (8x)
|
#1 |
New Member
Johannes Haas
Join Date: Jan 2014
Posts: 9
Rep Power: 12 |
Hello,
I need to buy a new workstation for CFD in Fluent 15.0. License-wise I will be limited to a split-up Solver and Pre/Post license combined with a single HPC Pack (8x parallel). The main models used are turbulence (k-epsilon, k-omega (Menter), enhanced wall treatment), heat transfer (radiation, discrete ordinates), some DPM, and rarely Eulerian multiphase/VOF (2 species). Mesh sizes range from 5 to 20 million cells, mostly poly or mixed hex/tet. Most cases will be steady state, though some will be transient (several minutes of physical time, time steps of about 0.1 s).

My budget for the workstation is about €10,000, although I am not sure it is even possible to use up this budget with the small number of licenses to be fed. I thought about getting the following core components:

- 2x Xeon E5-2637 v2 (quad-core, 3.5 GHz)
- 8x 16 GB DDR3 ECC RAM (is a total of 128 GB too much?)
- NVIDIA Quadro 4000

Would it be smarter to have more than 8 cores in total? As I am limited to 8x parallel, I thought it would be better to get only 8 cores but to maximize clock speed and memory bandwidth. On the other hand, I fear that performance might suffer when I am pre/post-processing and solving on all 8 cores at the same time.

Can anyone give me some advice on how to tackle this the right way? I am completely new to the whole field of hardware questions. Any help would be highly appreciated!

Thanks in advance,
Johannes |
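For a rough feel of how much RAM those mesh sizes actually need, here is a minimal sketch. The ~2 GB per million cells figure is only a commonly quoted ballpark for Fluent, not an official number, and the headroom factor is an assumption; real usage depends on cell types, enabled models (DPM, DO radiation, species) and single vs. double precision.

Code:
# Rough RAM sizing for 5-20 million cell cases (ballpark figures only).
GB_PER_MILLION_CELLS = 2.0   # assumed rule of thumb for Fluent
OVERHEAD_FACTOR = 1.5        # assumed headroom for OS, partitioning, post data

for million_cells in (5, 10, 20):
    est_gb = million_cells * GB_PER_MILLION_CELLS * OVERHEAD_FACTOR
    print(f"{million_cells:>2} M cells -> roughly {est_gb:.0f} GB RAM incl. headroom")

# Prints roughly 15 / 30 / 60 GB, so 64 GB would already be comfortable
# and 128 GB is generous rather than necessary for these cases.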
|
January 29, 2014, 18:48 |
|
#2 |
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18 |
128 GB is a ton of memory. Anything that uses anywhere near that much memory would take ages (days if not weeks) to run on only 8 cores.
With that kind of budget and your software constraints, why not buy two machines so you can pre/post while a simulation is running? That would surely maximize your software licenses. Other than that, what you propose sounds ideal. |
|
January 29, 2014, 21:27 |
|
#3 |
New Member
CFD
Join Date: Jan 2013
Posts: 23
Rep Power: 13 |
I second kyle that 128 GB of memory could be overkill. Instead (if I were you, and if you only need one workstation with that budget), I would invest more on the GPU side, since Fluent 15.0 now supports solver computation on the GPU (theoretical speedup of up to 2.5 times).
http://www.ansys.com/Products/ANSYS+...id+Solver+15-0
Maybe an NVIDIA Tesla could help speed up your computation, though I am not sure what the real-world performance of the Fluent solver on a GPU looks like, or whether it needs a special licence. Maybe someone with experience can shed some light on this. |
|
January 30, 2014, 03:58 |
|
#4 |
New Member
Johannes Haas
Join Date: Jan 2014
Posts: 9
Rep Power: 12 |
Hello,
First of all, thanks for your input so far!

In the company where I worked before, we had computing times of approx. 3-5 days per case. Since I will mostly be doing single case studies, this is not too much of a problem. I totally agree that computation times this long would ruin any project that relies on running a larger number of cases or on parametrization, but for my needs the case does not have to converge in a matter of minutes.

The point about GPU computing sounds very appealing. I will have to talk to my contact at ANSYS about it. If it does not require a separate license, I will definitely invest in a more advanced GPU and cut some RAM if the budget requires it.

Concerning the idea of two separate workstations: what should I focus on when choosing the pre/post workstation's hardware? Does it depend just as much on RAM/CPU as the solver machine, or should I mainly focus on good graphics and a fast SSD to load cases more quickly?

Thank you for your suggestions!
Kind regards,
Johannes |
|
February 4, 2014, 04:46 |
|
#5 |
Member
Kim Bindesbøll Andersen
Join Date: Oct 2010
Location: Aalborg, Denmark
Posts: 39
Rep Power: 16 |
Regarding pre/post workstation:
Don't overdo the graphics on your pre/post machine. I have worked with both a Quadro 2000 and a FirePro V7900, where the latter should have double the performance. However, when rendering a large mesh or drawing streamlines in Post I cannot tell any difference, so I guess these operations are more CPU dependent.

From my (pre/post) workstation to my cluster I share a 1 Gbit network with the rest of the office, and that is too slow when loading cases from the cluster. So be sure to have at least a 1 Gbit connection allocated for this purpose alone. I don't think the load-time difference between an SSD and 10k or 15k disks in RAID0 or RAID5 is large (and it disappears entirely if you are working through a 1 Gbit network); see the rough transfer-time estimate below.

The only thing you can use many cores for in pre/post is multi-domain meshing, so go for clock speed rather than core count if that is not important to you. I guess Intel i7 models with 4 memory channels or the Xeon E5-1600 v2 series might be worth considering, since a dual-CPU motherboard isn't relevant there?

Kind regards
Kim |
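To put some rough numbers on the network point, here is a back-of-the-envelope estimate. The 8 GB case/data file size is just an assumed example for a 10-20 M cell case, and the sustained rates are typical real-world figures, not measurements.

Code:
# Back-of-the-envelope load-time estimate for a large Fluent case/data file.
FILE_GB = 8.0   # assumed example file size

rates_mb_per_s = {
    "1 Gbit/s office network (~70% efficiency)": 1000 / 8 * 0.7,  # ~87 MB/s
    "10k/15k RPM RAID array (assumed)": 300.0,
    "SATA SSD (assumed)": 500.0,
}

for medium, rate in rates_mb_per_s.items():
    seconds = FILE_GB * 1024 / rate
    print(f"{medium:45s} ~{seconds:4.0f} s")

# The network is clearly the bottleneck: at ~90 MB/s it takes minutes to pull
# a large case from the cluster, which is why a dedicated (or faster) link
# matters more than SSD vs. spinning disks in the workstation itself.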
|
February 4, 2014, 05:19 |
|
#6 |
New Member
Johannes Haas
Join Date: Jan 2014
Posts: 9
Rep Power: 12 |
First of all, thanks for the input on the pre/post workstation.
Regarding the CPUs: since I will definitely be going for a dual-CPU workstation to make use of the extra RAM, the Xeon E5-1600 series and the i7s won't make the cut.

Does anyone have experience with the new Fluent 15.0 GPU computing? I just saw that the price of a high-end GPU is well above €3,000. I did expect them to be expensive, but I didn't know they cost that much... Maybe spending about €3.5k on a Tesla K20, money which I could also spend elsewhere (storage system, disk capacity, even more processor cores...), is a bit too much. Is it worth the money, or will I probably end up wasting €3.5k on a GPU that won't make much of a difference in the end?

Kind regards,
Johannes |
|
February 4, 2014, 15:53 |
|
#7 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Don't expect too much from GPU acceleration for "normal" CFD applications.
The 2.5 times speedup measured by the marketing divisions of ANSYS and NVIDIA may hold true for some special solvers like DO radiation modeling. But since you already have a decent hardware setup with 8 cores, the benefit from a €3,000 GPU will rather be in the range of 5-10 percent when solving NS-like equations; the rough Amdahl-style estimate below shows why. As far as I know, the GPU even costs you an additional HPC license, so there may be no benefit at all if you are restricted to 8 HPC licenses. You could also have a look at this thread: http://www.cfd-online.com/Forums/flu...nt-14-5-a.html

If most of your simulations include radiation, it may be worth considering. Otherwise, every other component will have a better cost/performance ratio.

I think the recommendation of the i7 CPU was for the pre/post workstation, where a single CPU is definitely enough; an i7 would be a good choice there. For the solver workstation you are on the right path with a dual-CPU setup aiming for maximum memory bandwidth and core speed. Just make sure to get RAM specified for 1866 MHz, since the v2 versions of the latest Xeon CPUs support this. |
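A minimal sketch of that reasoning, assuming (purely for illustration) that the GPU only accelerates part of each iteration, e.g. the linear solve, by some factor. Both the offloaded fraction and the local speedup are assumptions, not Fluent measurements.

Code:
# Amdahl-style estimate of GPU benefit when only part of the work is offloaded.
def overall_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Amdahl's law: only 'accelerated_fraction' of the time benefits."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / local_speedup)

for frac in (0.1, 0.5, 0.8):      # assumed share of iteration time offloaded
    for sp in (2.0, 5.0):         # assumed speedup of that offloaded part
        print(f"fraction={frac:.0%}, local speedup={sp:.0f}x "
              f"-> overall {overall_speedup(frac, sp):.2f}x")

# Only when the offloaded portion dominates (e.g. heavy DO radiation cases)
# does the overall gain approach the advertised ~2.5x; with a small offloaded
# share the net gain stays in the 5-10 percent range.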
|
February 5, 2014, 04:35 |
|
#8 |
New Member
Johannes Haas
Join Date: Jan 2014
Posts: 9
Rep Power: 12 |
Hello,
Thanks for the input on the GPU performance! I will most likely be doing some radiation every now and then, but as it is not my everyday business I am seriously considering dropping the GPU and investing in other things. GPU computing would take one HPC license away from the pack, which means I would run Fluent on 7 cores plus 1 GPU (without needing to buy another license).

As for the post-processing workstation: our "standard" computers, on which we also do some CAD, are Intel i7-3770 machines with 16 GB of RAM and a Quadro 2000. If that is sufficient, I would just do the post-processing on one of the standard computers.

Best regards,
Johannes |
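If the 7-core-plus-GPU split turns out to be the way to go, a batch launch might look roughly like the sketch below. The -gpgpu launcher option is how newer Fluent versions request GPU acceleration, but whether 15.0 accepts exactly this syntax on a given install, and how it draws from the HPC Pack, should be confirmed with ANSYS; treat the flags and the journal file name here as assumptions.

Code:
# Minimal sketch: launch Fluent in batch with 7 solver processes and one GPGPU,
# matching the "7 cores + 1 GPU from one HPC Pack" idea above. Verify the
# -gpgpu flag and its license usage against the Fluent 15.0 documentation.
import subprocess

cmd = [
    "fluent", "3ddp",      # 3D double-precision solver
    "-t7",                 # 7 CPU solver processes (leaves one HPC task for the GPU)
    "-gpgpu=1",            # request one GPU for the solver (assumed flag)
    "-g",                  # run without the GUI
    "-i", "run_case.jou",  # hypothetical journal: read case, iterate, write data
]

subprocess.run(cmd, check=True)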
|
March 3, 2015, 13:47 |
|
#9 |
New Member
Join Date: Mar 2013
Location: Canada
Posts: 22
Rep Power: 13 |
On my nowhere near optimized setup, I'm seeing a >2X speed up running the same simulation with GPGPU vs. without.
Configuration:
Intel Core i7-5820K (6 physical cores, 12 logical with Hyper-Threading)
64 GB DDR4 RAM
EVGA NVIDIA Titan Z 12 GB (2 GPUs per card x 2 cards)

Without GPU:
---------------------------------------------
       |     CPU Time Usage (Seconds)
ID     |    User      Kernel     Elapsed
---------------------------------------------
host   |   19.1719    7.95313    5169.02
n0     | 4577.23     30.7344     5167.12
n1     | 4815.64     30.6719     5167.14
n2     | 4875.17     28.6094     5167.16
n3     | 4871.88     32.6094     5167.16
n4     | 4964.59     33.7813     5167.17
n5     | 5006.16     30.3125     5167.17
n6     | 4853.14     26.3125     5167.19
n7     | 4951.61     26.2656     5167.19
n8     | 4770.55     22.0938     5167.2
n9     | 4802.89     24.9844     5167.2
n10    | 4875.47     25.9063     5167.2
n11    | 4990.36     34.9531     5167.22
---------------------------------------------
Total  | 58373.9    355.188      -
---------------------------------------------

Model Timers (Host)
Flow Model Time: 4216.608 sec (WALL), 5.406 sec (CPU), count 1071
Discrete Phase Model Time: 640.485 sec (WALL), 0.625 sec (CPU), count 1071
Other Models Time: 3.259 sec
Total Time: 4860.352 sec

Model Timers
Flow Model Time: 4218.424 sec (WALL), 3915.688 sec (CPU), count 1071
K-Epsilon Turbulence Model Time: 147.884 sec (WALL), 136.516 sec (CPU), count 1071
Species Combustion Model Time: 315.171 sec (WALL), 295.516 sec (CPU), count 1071
Temperature Model Time: 179.116 sec (WALL), 165.406 sec (CPU), count 1071
Other Models Time: 3.555 sec
Total Time: 4864.150 sec

Performance Timer for 1071 iterations on 12 compute nodes
Average wall-clock time per iteration: 4.607 sec
Global reductions per iteration: 158 ops
Global reductions time per iteration: 0.000 sec (0.0%)
Message count per iteration: 86923 messages
Data transfer per iteration: 156.294 MB
LE solves per iteration: 7 solves
LE wall-clock time per iteration: 3.711 sec (80.5%)
LE global solves per iteration: 17 solves
LE global wall-clock time per iteration: 0.025 sec (0.5%)
LE global matrix maximum size: 74
AMG cycles per iteration: 27.017 cycles
Relaxation sweeps per iteration: 3413 sweeps
Relaxation exchanges per iteration: 1164 exchanges
Time-step updates per iteration: 0.09 updates
Time-step wall-clock time per iteration: 0.002 sec (0.0%)
Total wall-clock time: 4934.609 sec

With GPU:
---------------------------------------------
       |     CPU Time Usage (Seconds)
ID     |    User      Kernel     Elapsed
---------------------------------------------
host   |   19.2656    4.17188    2545.19
n0     | 1496.56    653.141      2543.16
n1     | 1726.08    398.047      2543.18
n2     | 1876.38    478.969      2543.18
n3     | 1868.2     486.609      2543.18
n4     | 1804.19    547.969      2543.18
n5     | 1744.97    418.516      2543.18
n6     | 1507.89    607.234      2543.16
n7     | 1863.16    491.313      2543.16
n8     | 1774.27    421.969      2543.16
n9     | 1864.2     490.875      2543.16
n10    | 1843.53    483.125      2543.16
n11    | 1837.56    517.938      2543.16
---------------------------------------------
Total  | 21226.3   5999.88       -
---------------------------------------------

Model Timers (Host)
Flow Model Time: 1510.616 sec (WALL), 1.250 sec (CPU), count 1150
Discrete Phase Model Time: 599.887 sec (WALL), 0.188 sec (CPU), count 1150
Other Models Time: 3.201 sec
Total Time: 2113.704 sec

Model Timers
Flow Model Time: 1510.901 sec (WALL), 1447.391 sec (CPU), count 1150
K-Epsilon Turbulence Model Time: 138.914 sec (WALL), 136.281 sec (CPU), count 1150
Species Combustion Model Time: 297.500 sec (WALL), 296.016 sec (CPU), count 1150
Temperature Model Time: 168.848 sec (WALL), 167.422 sec (CPU), count 1150
Other Models Time: 3.200 sec
Total Time: 2119.363 sec

Performance Timer for 1150 iterations on 12 compute nodes
Average wall-clock time per iteration: 1.905 sec
Global reductions per iteration: 93 ops
Global reductions time per iteration: 0.000 sec (0.0%)
Message count per iteration: 10387 messages
Data transfer per iteration: 34.910 MB
LE solves per iteration: 6 solves
LE wall-clock time per iteration: 0.104 sec (5.4%)
LE global solves per iteration: 0 solves
LE global wall-clock time per iteration: 0.000 sec (0.0%)
LE global matrix maximum size: 295178
AMG cycles per iteration: 9.542 cycles
Relaxation sweeps per iteration: 105 sweeps
Relaxation exchanges per iteration: 106 exchanges
Time-step updates per iteration: 0.09 updates
Time-step wall-clock time per iteration: 0.002 sec (0.1%)
Total wall-clock time: 2190.798 sec

Ref: http://www.nvidia.com/content/tesla/...-userguide.pdf
I have also measured that overall performance still responds very strongly to host CPU and memory subsystem speed, because the host relies on these to set up the matrices and send them to the GPGPU for solving. What this means is that you should not skimp on the CPU in favour of the GPU, but instead build a balanced system.

With some further tweaking I've brought the solving time down to only 1700 seconds; by comparison, a couple-of-years-old Dell Precision T3500 with a Xeon X5650 and no GPGPU took over 18000 seconds to perform the same simulation, so a >10X improvement! If anyone knows how to improve the GPU utilization rate, I'm all ears!

Edit: On the other hand, using the GPU on my laptop (GTX 580M) was actually slower than without it on a shorter run of the same transient model as above (only 50 iterations instead of 1000+). http://www.cfd-online.com/Forums/flu...pu-fluent.html

Last edited by Daveo643; March 4, 2015 at 12:22. |
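A quick sanity check of the ">2X" claim, recomputed from the two Performance Timer blocks quoted above (per-iteration time is the fairer metric, since the two runs did different iteration counts):

Code:
# Recompute the GPU speedup from the wall-clock figures in the post above.
cpu_only = {"iterations": 1071, "total_wall_s": 4934.609, "per_iter_s": 4.607}
with_gpu = {"iterations": 1150, "total_wall_s": 2190.798, "per_iter_s": 1.905}

total_speedup = cpu_only["total_wall_s"] / with_gpu["total_wall_s"]
per_iter_speedup = cpu_only["per_iter_s"] / with_gpu["per_iter_s"]

print(f"Total wall-clock speedup: {total_speedup:.2f}x")   # ~2.25x
print(f"Per-iteration speedup:    {per_iter_speedup:.2f}x")  # ~2.42x
# Both measures back up the ">2X" observation for this particular case.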
March 3, 2015, 14:25 |
|
#10 |
New Member
Join Date: Mar 2013
Location: Canada
Posts: 22
Rep Power: 13 |
If I were to do this again, I would change a few things...
Contrary to what others have stated, I have noticed severe disk thrashing even with 64 GB of RAM, on a different but not huge transient model of ~800k cells that took many iterations to converge. I think it is because the matrices headed for the GPGPU may be copied in main memory - I don't know for sure.

Unfortunately, apart from workstation/server motherboards ($$$$$), you are usually limited to 64 GB, as is the case with the Intel X99 chipset. However, there is one prosumer board that does support up to 128 GB of ECC DDR4 RAM, the ASRock X99 WS. I actually had one of these but returned it for an ASUS X99-PRO because the former didn't have on-board WiFi... now I'm wondering if I made a mistake in that choice over silly WiFi. If the size and memory requirements of your models allow it, I would definitely max out the full 128 GB of RAM and disable disk paging in Windows.

If your setup is not mission-critical, I'd skip the Xeons and buy an unlocked Haswell or Haswell-E. Just for testing, I'm running my i7-5820K overclocked to 4 GHz (water-cooled) with 8x 8 GB Corsair Vengeance DDR4-2666 memory (a rough bandwidth estimate is sketched below). It has been stable running Prime95 and a whole bunch of simulations without issue, and core temperatures do not exceed 75 degrees Celsius under full simulation load. The Haswell Xeons have the same amount of cache per physical core as the consumer Haswell-E, but are multiplier-locked and have a lower TDP.

I picked up the two Titan Zs relatively inexpensively because they are a discontinued model, with NVIDIA transitioning to the Maxwell GPU architecture. I have not seen a significant speed-up from utilizing the second Titan Z; it might be better used in a separate computer, or it may need more tuning of the settings.

Last edited by Daveo643; March 4, 2015 at 13:26. |
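Since memory bandwidth keeps coming up in this thread, here is the rough theoretical peak for the quad-channel DDR4-2666 setup described above. This is illustrative arithmetic only; sustained bandwidth in practice (e.g. measured with STREAM) is noticeably lower.

Code:
# Theoretical peak memory bandwidth of a quad-channel DDR4-2666 setup.
CHANNELS = 4               # i7-5820K / X99 is quad-channel
TRANSFER_RATE_MT_S = 2666  # DDR4-2666 as used above
BYTES_PER_TRANSFER = 8     # 64-bit channel width

peak_gb_s = CHANNELS * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1000
print(f"Theoretical peak: ~{peak_gb_s:.0f} GB/s")   # ~85 GB/s

# For comparison, a dual-socket Xeon E5 v2 system with DDR3-1866 gives
# 4 x 1866 x 8 / 1000 ~= 60 GB/s per socket (~120 GB/s total), which is
# one reason the dual-CPU route suits bandwidth-bound CFD solvers.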
|
|
|