April 28, 2017, 13:38 |
GPU acceleration in Ansys Fluent
|
#1 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
The topic of GPU acceleration for Ansys Fluent sometimes seems to be shrouded in mystery. So I ran a few benchmarks to answer some frequently asked questions and get a snapshot of the capability of this feature in 2017.
Flow Setup:
Benchmark case: 3D lid driven cavity in a cubical domain
Grid resolution: 64x64x64 -> 262144 cells
Reynolds number: 10000
Solver type: pressure-based, steady
Turbulence model: standard k-epsilon
Number of iterations: 100, reporting interval 10
Default settings whenever possible

Software/Hardware:
Operating system: Opensuse Leap 42.1
Fluent version: Ansys Fluent 18.0
CPU: Intel Xeon W3670, 6 cores, 3.2GHz, HT disabled
Memory: 24 GB DDR3-1333 ECC triple-channel
GPU: Quadro 5000 (theoretical compute performance: 722 GFLOPS single, 361 GFLOPS double, memory bandwidth: 126 GB/s, memory size: 2.5 GB GDDR5)

1) Coupled algorithm
As stated in this guide, GPU acceleration works best if the linear solver fraction is high, which is usually the case when using the coupled solver. Fluent reported it to be around 60% or higher in all cases shown here. Without further ado:
So obviously GPU acceleration works under the right circumstances. Using only one CPU core, adding the GPU results in a speed-up of 50-60% in single precision (SP) and double precision (DP), respectively. But you can already see the diminishing returns with higher CPU core counts.

2) SIMPLE algorithm
Using the SIMPLE algorithm, the picture is completely different. The linear solver fraction without a GPU is just below 30% for all cases, so GPU acceleration as it is currently implemented in Ansys Fluent cannot be as effective. This is a caveat that Ansys is aware of and that is clearly stated in the more in-depth reviews of this feature. As expected, solution times are much higher with GPU "acceleration". To be clear: this is not new information, Ansys never claimed that GPU acceleration was worth it with the SIMPLE algorithm.

3) Pairing "high-end" CPUs with slow GPUs
You might expect to be on the safe side as long as you are using the coupled solver. But we could already see the diminishing returns in case 1 with higher CPU core counts.
We increase the discrepancy with different hardware: 2x Xeon E5-2687W, 128GB (16x8GB) DDR3-1600 reg ECC, Quadro 4000 (theoretical compute performance: 486 GFLOPS SP, 243 GFLOPS DP, memory bandwidth: 89.9 GB/s, memory size: 2 GB GDDR5)
While solution times with a GPU and one CPU core are slightly lower than without a GPU, there is a huge performance penalty when using the GPU along with 14 CPU cores. This is despite the fact that the linear solver fraction is 60% without a GPU. So clearly, a low-end GPU will slow down fast CPUs even if the other criteria for using GPU acceleration are met.

4) Consumer-grade graphics cards
Let's see what a cheap consumer-grade graphics card can do for GPU acceleration. The hardware in this test: 2x Xeon E5-2650v4, 128GB (8x16GB) DDR4-2400 reg ECC, Geforce GTX 1060 6GB (theoretical compute performance: 4372 GFLOPS SP, 137 GFLOPS DP, memory bandwidth: 192 GB/s, memory size: 6 GB GDDR5). Note that there was a suspended computation residing in memory, so the numbers might not be representative of the absolute performance of this processor type.
The conclusion: GPU acceleration in Ansys Fluent definitely works with cheap gaming graphics cards. Even in DP, the performance gains from the GPU are quite remarkable given its low DP performance. This might indicate that the workload in this benchmark is not entirely compute-bound. Memory and PCIe transfers might also be important. However, the GPU is still a huge bottleneck as soon as we are using more CPU cores.

5) Q&A
Question: When can I use GPU acceleration?
Answer:
1) You need to use the right solver in the first place, for example the coupled flow solver or the DO radiation model. Switching from SIMPLE or its variants to coupled just to use GPU acceleration is probably not the best idea.
2) Your model must fit into the GPU memory. You can estimate the amount of memory needed with the formulas in section 4 of the guide mentioned earlier.
The benchmark I ran used ~0.5 GB of VRAM in single precision and ~1 GB in double precision. Again: if your model does not fit in the GPU memory, you currently cannot use GPU acceleration. GPU memory from dual-GPU cards or from more than one card does stack, so you can use this to simulate larger models.

Question: Which GPUs can I use for GPU acceleration in Ansys Fluent?
Answer: Ansys only recommends Tesla compute cards for this purpose. However, you can use virtually any recent Nvidia GPU. Yes, even Geforce cards; I verified this with a GTX 1060. That being said, not all GPUs are created equal. The main differentiation lies in the DP compute performance. Nearly all modern Geforce and Quadro GPUs have a DP/SP performance ratio of 1/32. A Quadro P6000, one of the most expensive GPUs you can buy right now, has a theoretical peak performance of 11758 GFLOPS SP but only 367 GFLOPS DP. That is just about the same as the seriously outdated Quadro 5000 I used in this test. This is not an issue if you want to compute in SP, but a colossal waste of money if you want to perform simulations in DP. In this case you will have to buy a Tesla card. Be careful though: even some of the Tesla cards now have reduced DP capabilities because their target application is deep learning. One of the last exceptions to this rule that is still somewhat relevant today is the first generation of Titan GPUs ("Kepler"), released in 2013 and 2014 (Titan, Titan Black, Titan Z). They have a DP/SP ratio of 1/3 and can be bought used for a reasonable price.

Question: Should I spend extra money on a compute GPU when buying a new Fluent workstation?
Answer: For a "general purpose" workstation with a limited budget, the answer is probably no. You are better off spending excess money on more CPU performance in most cases. You might consider it only when you have maxed out CPU performance, or if you are sure that you mostly use the solvers that benefit from GPU acceleration and your models are small enough.
Edit: here is a nearly exhaustive list of Nvidia GPUs with high DP capabilities:
Last edited by flotus1; April 29, 2017 at 12:25. |
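The DP/SP ratios mentioned in the opening post can be checked directly from the theoretical peak figures it quotes. The following is only a small sketch of that arithmetic; the function names are made up for illustration, and the GFLOPS values are the ones given in the post, not measurements.

```python
# Theoretical peak throughput in GFLOPS (single, double), as quoted in the opening post.
CARDS = {
    "Quadro 5000": (722, 361),
    "Quadro 4000": (486, 243),
    "Geforce GTX 1060 6GB": (4372, 137),
    "Quadro P6000": (11758, 367),
}


def sp_per_dp(sp_gflops, dp_gflops):
    """Return how many times lower the DP peak is compared to the SP peak."""
    return sp_gflops / dp_gflops


def report(cards):
    """Return one formatted line per card, with the DP/SP ratio written as 1/n."""
    lines = []
    for name, (sp, dp) in cards.items():
        lines.append(f"{name}: DP/SP = 1/{sp_per_dp(sp, dp):.0f}")
    return lines


if __name__ == "__main__":
    for line in report(CARDS):
        print(line)
```

The two old Fermi-based Quadro cards come out at 1/2, while the GTX 1060 and the Quadro P6000 both land at roughly 1/32, which is why the P6000 offers about the same DP peak as the Quadro 5000.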
|
April 30, 2017, 05:26 |
|
#2 |
New Member
Deutschland / Germany
Join Date: Aug 2016
Posts: 8
Rep Power: 10 |
Thank you for the very nice review. I use Ansys Mechanical, and yesterday I just bought a GTX Titan to test GPU acceleration. However, Ansys only supports Tesla and Quadro K6000/K5000, and the program cannot run when I request the GPU acceleration feature. How did you force Ansys to run with a Geforce card? With a DP performance of about 1.2 TFLOPS, 6 GB of RAM and a price of 230 euro, the Titan should be worth considering.
|
|
April 30, 2017, 05:55 |
|
#3 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I didn't need to force anything. The GTX Graphics card just worked out of the box as soon as I selected the additional GPU in the Fluent launcher. Maybe the software behavior is different in Ansys Mechanical, but I can only speculate because I have not used it in years.
I guess you are already familiar with the basics? http://www.cadfem.de/fileadmin/CADFE...CADFEM_GPU.pdf Do you have one of the officially supported cards to check if everything else is set up correctly? |
|
April 30, 2017, 05:58 |
|
#4 |
New Member
Deutschland / Germany
Join Date: Aug 2016
Posts: 8
Rep Power: 10 |
No, I do not. The Tesla is too expensive, and so is the Quadro K6000.
The Quadro 6000 is officially supported, but its DP performance is only 500 GFLOPS, less than half that of the GTX Titan. Here's the error from Ansys Mechanical:
However, according to the Ansys website, Ansys Fluent also supports only Tesla and Quadro. Yet you can still run it with a Geforce card. That's weird.
Last edited by atomdie; April 30, 2017 at 07:10. |
|
April 30, 2017, 09:27 |
|
#5 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Well, seems like you are out of luck. This is a whitelist for "supported" hardware. I got Ansys mechanical running on a workstation with a Quadro K5000, but with a Quadro 4000 I get the same error message.
I personally think that this is a questionable move by Ansys and Nvidia, since there is not really a technical reason for this restriction. Proprietary software... In the past there have been ways to spoof Quadro cards using their Geforce counterparts. I don't know if this is still possible with this generation, or if it will work at all, since the Quadro K6000 has some more shaders activated and twice the amount of VRAM compared to the Titan. It might be closer to a Tesla K20X. I don't recommend it, I just wanted to let you know. |
|
April 30, 2017, 11:50 |
|
#6 |
New Member
Deutschland / Germany
Join Date: Aug 2016
Posts: 8
Rep Power: 10 |
Yep, it seems that Nvidia only wants to sell as many expensive cards as possible. Technically, the CUDA libraries (cuSPARSE, cuBLAS) do not care about the difference between Quadro, Tesla and Geforce.
|
|
May 5, 2017, 04:58 |
GPU is not worth it
|
#7 |
New Member
Deutschland / Germany
Join Date: Aug 2016
Posts: 8
Rep Power: 10 |
After a whole week of working with the CUDA libraries for sparse matrices (cuSolver, cuBLAS, cuSPARSE) to solve the system Ax=b, I can only say that the GPU is not worth it.
In order to solve Ax=b, there are two approaches:
- Direct solver: costs a lot of memory, only usable for small models.
- Iterative solver: uses less memory, but needs many iterations to achieve an acceptable residual error.

We are talking about large models, hence the direct solver is useless. Ansys provides two kinds of iterative solver: Preconditioned Conjugate Gradient (PCG) and Jacobi Conjugate Gradient.

For the PCG solver on the GPU, the matrix in sparse form needs to be copied to GPU memory. A solution vector is needed, plus another matrix for the preconditioner. GPU memory is very limited, and for this reason it is impossible to copy the whole matrix of a large model to the GPU. How does Ansys do this? Ansys only uses the GPU for some "vector operations" to save GPU memory. This means that for large models there is not much benefit from the GPU. For small models, where the whole matrix can be copied to the GPU, GPU computing is faster than the CPU. You need a bigger model? Buy more GPUs (multi-GPU computing).

Now we need to consider the price of the GPU. Officially, Ansys supports Tesla (which costs a few thousand euro) and Quadro K6000 (also very expensive). In the benchmark from Ansys, they used 4 GPUs. GPU memory is only 6-12 GB; in comparison with 64-128 GB of CPU RAM, that is nothing. If you have 10000 euro for a Tesla which only does some "vector operations" because of its 12 GB of memory, please build another dual-Xeon system instead.

In conclusion, maybe GPU computing in FEM is "marketing" by Nvidia. They want to sell expensive cards to "scientists" and "professional engineers". They come to Ansys and offer their GPUs. |
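To make the memory argument above more tangible, here is a minimal sketch of a conjugate gradient solver with a Jacobi (diagonal) preconditioner. It is not the Ansys implementation and not CUDA code: it is plain Python with a dict-per-row sparse format chosen purely for illustration, and `poisson_1d` is a hypothetical test problem. What it shows is which objects have to stay resident during the iteration: the sparse matrix, the preconditioner, and the vectors `x`, `r`, `z`, `p` and `ap`.

```python
def matvec(rows, x):
    """Multiply a sparse matrix (list of {column: value} dicts) by a vector."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def jacobi_pcg(rows, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A with Jacobi-preconditioned CG.

    Returns the solution and the number of iterations performed.
    """
    n = len(b)
    inv_diag = [1.0 / rows[i][i] for i in range(n)]  # Jacobi preconditioner
    x = [0.0] * n
    b_norm = dot(b, b) ** 0.5
    if b_norm == 0.0:
        return x, 0  # zero right-hand side: x = 0 is already the solution
    r = list(b)  # residual r = b - A x for the initial guess x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    for iteration in range(1, max_iter + 1):
        ap = matvec(rows, p)  # the only operation that touches the matrix
        alpha = rz / dot(p, ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        if dot(r, r) ** 0.5 / b_norm < tol:
            return x, iteration
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


def poisson_1d(n):
    """Build the tridiagonal (-1, 2, -1) matrix as a small SPD test problem."""
    rows = []
    for i in range(n):
        row = {i: 2.0}
        if i > 0:
            row[i - 1] = -1.0
        if i < n - 1:
            row[i + 1] = -1.0
        rows.append(row)
    return rows


if __name__ == "__main__":
    A = poisson_1d(50)
    b = [1.0] * 50
    x, iterations = jacobi_pcg(A, b)
    residual = max(abs(bi - ai) for bi, ai in zip(b, matvec(A, x)))
    print(f"stopped after {iterations} iterations, max residual {residual:.2e}")
```

Everything except `matvec` is a pure vector operation, so a solver that cannot hold the matrix on the GPU can only offload those vector updates, which matches the limitation described in the post.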
|
June 11, 2017, 21:01 |
|
#8 |
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,751
Rep Power: 66 |
I just wanted to say this is a jewel of a thread. I did my own sniffing under the hood back in the day and quickly found the GPU limitations severely taxing. Nearly a decade later, GPU memory has not gone up significantly and the number of double precision units is still painfully low.
The NVIDIA boys have it rough. They are so used to designing high-end enthusiast cards that consumers will actually buy; now they have a hard time dealing with logical customers. Why buy a high-end card for GPU computing that will only help some of the time when I can just as easily double my CPU power & memory? |
|
June 16, 2017, 07:13 |
|
#9 |
Member
Join Date: Mar 2014
Posts: 56
Rep Power: 12 |
The actual benefit of GPU acceleration in Fluent calculations has been a topic among my colleagues and me for some time now. We haven't had any clear knowledge of the topic thus far, although a healthy dose of scepticism has been part of our discussions.
I would like to state that this review is excellent work and highly appreciated - thank you flotus1 for posting your results on this forum! In my opinion, posts like this elevate the status of CFD Online above so many other forums by simply providing much-needed information for other CFD enthusiasts. |
|
November 24, 2017, 23:41 |
|
#10 |
New Member
pong
Join Date: Apr 2009
Posts: 5
Rep Power: 17 |
Does anyone have a test for the Quadro K5000 on Fluent or CFX? From my understanding, we will benefit from GPU calculation only with the supported cards, which start from the K5000. Other cards can be used, but we will not get any benefit from them.
|
|
November 25, 2017, 05:44 |
|
#11 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Please read the test and the following conversation.
It was clearly shown that any Nvidia (CUDA) graphics card can be used in Fluent 18.0 for GPU acceleration and they actually speed things up with the right solver. Only Ansys mechanical refuses to use graphics cards that are not on their whitelist. I don't know how CFX behaves in this regard. You won't get better results for the SIMPLE solver with one of the GPUs that Ansys recommends. Poor performance here is a numerical constraint, not an artificial marketing constraint. Of course, Ansys can limit GPU support in Fluent the same way they did with their other products whenever they want. |
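The "numerical constraint" can be illustrated with Amdahl's law, using the linear solver fractions reported in the opening post (around 60% for the coupled solver, just below 30% for SIMPLE). This is only a sketch under a strong assumption: the GPU accelerates nothing but the linear solver and adds no transfer overhead, which the benchmarks show is optimistic. The function names and the example solver speed-up of 2x are hypothetical.

```python
def overall_speedup(solver_fraction, solver_speedup):
    """Amdahl's law: speed-up of a whole iteration when only the
    linear-solver share of the runtime is accelerated."""
    if not 0.0 <= solver_fraction <= 1.0:
        raise ValueError("solver_fraction must be between 0 and 1")
    if solver_speedup <= 0.0:
        raise ValueError("solver_speedup must be positive")
    return 1.0 / ((1.0 - solver_fraction) + solver_fraction / solver_speedup)


def speedup_limit(solver_fraction):
    """Upper bound, reached only if the accelerated part took no time at all."""
    if solver_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - solver_fraction)


if __name__ == "__main__":
    # Fractions taken from the opening post; the 2x solver speed-up is an assumption.
    for label, fraction in (("coupled", 0.60), ("SIMPLE", 0.30)):
        print(
            f"{label}: 2x faster linear solver -> "
            f"{overall_speedup(fraction, 2.0):.2f}x overall, "
            f"limit {speedup_limit(fraction):.2f}x"
        )
```

Even with an infinitely fast GPU, a 30% linear solver fraction caps the gain at roughly 1.4x, so any real overhead easily turns the "acceleration" into a slowdown, regardless of which card is used.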
|
November 26, 2017, 10:02 |
|
#12 |
New Member
Join Date: May 2013
Posts: 26
Rep Power: 13 |
I absolutely share the general conclusion of this thread:
- The Nvidia guys are (very) good marketing guys.
- A CPU is the most universal approach for speeding up a solution.

When looking at this topic, one should be aware that
- there has been much progress regarding the DP performance of GPUs over the last years (generations)
- there is an enormous difference between the chips within each generation: from the consumer variant, through Quadro for graphics (<6000; in most generations the Quadro 6000 is similar to a DP Tesla regarding DP performance), to the DP Tesla

Looking at the initial post, the following cards were benchmarked:
for 1) & 2) Quadro 5000: 361 GFLOPS DP, 2.5 GB
for 3) Quadro 4000: 243 GFLOPS DP, 2 GB
for 4) GTX 1060: 137 GFLOPS DP, 6 GB

In comparison, the best in class from the list in the initial post looks like this:
Tesla P100: 5300 GFLOPS DP, 16 GB
edit: but there is also a newer V100: 7000 GFLOPS DP, 16 GB

-> That's a DP performance difference of up to x38.68 (raw, P100)! And a memory size difference of up to x8!

Of course there was also great progress for CPUs: more cores. But this may not pay off perfectly, because base frequencies have mostly stayed the same or have even been lowered because of the higher core counts, and in newer CPU generations they drop under load, especially when instructions like AVX2 and newer are used.

=> So it would be very interesting to see independent (non Nvidia / non Intel / non Ansys) real-world benchmarks using one of these latest and most expensive GPUs in comparison to one of the latest, biggest Xeons...
Does the picture change a little bit over time? Or is it still mostly marketing for most real-world use cases?
What do you think?
Last edited by hpvd; November 28, 2017 at 06:58. |
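The "up to" factors in the post above can be reproduced from the listed numbers. This is a small sketch of that arithmetic only; the dictionary layout and the function name are assumptions made for illustration. Note that the largest DP factor and the largest memory factor come from different cards.

```python
# (DP GFLOPS, memory in GB) as listed in the post above.
BENCHMARKED = {
    "Quadro 5000": (361, 2.5),
    "Quadro 4000": (243, 2.0),
    "GTX 1060": (137, 6.0),
}
TESLA_P100 = (5300, 16.0)


def ratios(reference, cards):
    """Return {card: (DP ratio, memory ratio)} of the reference relative to each card."""
    ref_dp, ref_mem = reference
    return {name: (ref_dp / dp, ref_mem / mem) for name, (dp, mem) in cards.items()}


if __name__ == "__main__":
    for name, (dp_ratio, mem_ratio) in ratios(TESLA_P100, BENCHMARKED).items():
        print(f"P100 vs {name}: x{dp_ratio:.1f} DP, x{mem_ratio:.1f} memory")
```

The DP factor of roughly x38.7 is measured against the GTX 1060, whereas the x8 memory factor is measured against the Quadro 4000; against the GTX 1060 the P100 has less than three times the memory.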
|
November 26, 2017, 10:33 |
|
#13 |
New Member
Join Date: May 2013
Posts: 26
Rep Power: 13 |
edit: just added some more details directly in post above
|
|
November 26, 2017, 12:04 |
|
#14 | |
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,751
Rep Power: 66 |
Quote:
The i9-7900X, for example, costs a measly $1000 but can also put out 500 DP GFLOPS. The story hasn't changed. It's mostly still a marketing scheme (from a CFD point of view). There are non-CFD applications where GPU computing does awesome things. Where the story has gotten slightly better though (courtesy of Pascal) is the ability of the P100 to share MB memory with the CPU. Also with Pascal, Nvidia achieved a 2:1 single-to-double ratio. In my opinion this is super-important in order for the DP performance to scale correctly with the card and make the GPU competitive for high-precision super-computing. I say super-computing because economies of scale don't benefit consumers. Although I've eagerly awaited the release of this architecture, I admit I'm not up-to-date on the detailed benchmarks. But with that price tag I don't see it making much of a difference for consumers. |
||
November 26, 2017, 13:18 |
|
#15 | |
New Member
Join Date: May 2013
Posts: 26
Rep Power: 13 |
Quote:
Sure, you are absolutely right with this remark!
To spin this even further, for professional use cases one should look at the total cost of ownership (TCO): you not only have to spend money to buy these things (CPU/GPU), you also have costs for energy, space and administration, and with these included the picture may look different...
Just a tiny example: compare a setup of 3 powerful machines with one of 10 mid-power machines. With more machines you need:
- more accompanying hardware (motherboards, power supplies, drives, cases, network)
- more time to set up and maintain (new updates etc.)
- more space to put these machines
- more energy for the machines themselves and in some cases for cooling
- more...
edit: I almost forgot one very big point:
- the software licences needed
It's not that easy to determine and compare the "real" TCOs. To my mind, CPUs will still win against GPUs in nearly every real-world scenario. But I'm not sure. And of course it also depends on the type of CPUs used... To be sure, we need independent benchmarks for the latest GPUs and CPUs :-) |
||
November 27, 2017, 04:51 |
|
#16 |
Senior Member
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 530
Rep Power: 20 |
We should look at tasks, which can really benefit from GPU computing.
My first idea would be surface meshing. This is a very interactive and time-consuming task when we deal with complex geometries. All we do there is somehow connected to triangles, curvature refinement, proximity detection and refinement... So we might even be able to borrow some of the hard-wired OpenGL algorithms for this. At least the creation and manipulation of triangles is THE task for which a GPU is designed. |
|
November 27, 2017, 13:54 |
|
#17 |
Member
Join Date: Dec 2016
Posts: 44
Rep Power: 9 |
Only one consumer card, the Titan Black, is supported in Ansys Mechanical, if you do this:
https://www.computerbase.de/forum/sh....php?t=1680741
But the card is then no longer bootable. It only runs on Windows with the standard Windows K5200 driver, as a secondary card. And a new Quadro driver cannot be installed; the installer crashes. |
|
January 26, 2018, 20:48 |
|
#18 |
New Member
cae
Join Date: Jan 2018
Posts: 2
Rep Power: 0 |
All,
I'd recommend reviewing the following two URLs to better understand NVIDIA GPU support for ANSYS Fluent and other ANSYS applications. www.nvidia.com/ansys ANSYS Fluent GPU-ready app webpage If you still have any questions, please contact us at CAE AT NVidia DOT com. Cheers. |
|
January 26, 2018, 21:20 |
|
#19 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
If you feel that any specific aspect is missing or that the conclusions drawn are flawed, I would recommend addressing it directly.
I will gladly re-run or add a few benchmarks with a Tesla V100. Contact me via PM if you want to send over a few samples. |
|
January 26, 2018, 21:59 |
|
#20 |
New Member
cae
Join Date: Jan 2018
Posts: 2
Rep Power: 0 |
This thread started back in April 2017 and some have replied back with comments on some of the Qs.
We strongly recommend that everyone review what we have already published (the two URLs in the previous post) and then ask specific questions. We can summarize some points here that may help everyone:
1. Problems that contain less than a few million cells do not gain speed from GPUs because of communication overheads incurred in transferring matrices from or to CPUs. However, speedup is significant for meshes that contain tens and hundreds of millions of cells because the overhead is relatively small compared to the computing time in the AMG solver. This is noted in the previous URLs.
2. For Ansys Fluent, the entire model has to fit in the GPU memory. Approx. 1M cells need about 4 GB of GPU memory.
3. Ansys Mechanical has published a set of GPUs that are recommended or certified. Someone has already posted a screenshot in this thread.
4. Gaming/consumer-grade cards are not benchmarked for the professional apps. Only Quadro & Tesla cards are.
5. We don't make any claims when features aren't supported or don't work. We always recommend that everyone run their own tests, as a variety of factors play a role in performance (type of CPU core, host memory, I/O, GPU, etc.). Also, not everyone has enough licenses to run on all cores. Each GPU is equivalent to a single CPU core.
6. Everyone has their own metrics to deduce the value of GPUs. Apart from performance, we also look at the cost-benefit ratio that includes hardware & licensing costs.
Hope this helps. |
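Point 2 gives a rule of thumb that is easy to turn into a quick feasibility check. The sketch below assumes the quoted figure of about 4 GB per million cells scales linearly and that memory from several identical cards adds up, as stated in the opening post; the function names are hypothetical, and the rule does not say which precision it refers to. For the 64x64x64 benchmark it predicts about 1 GB, which matches the ~1 GB reported for double precision in the opening post (single precision needed about half of that).

```python
GB_PER_MILLION_CELLS = 4.0  # rule of thumb quoted in the post above


def estimated_gpu_memory_gb(cells, gb_per_million_cells=GB_PER_MILLION_CELLS):
    """Estimate the GPU memory needed for a model with the given cell count."""
    return cells / 1e6 * gb_per_million_cells


def fits_on_gpu(cells, gpu_memory_gb, gpu_count=1):
    """Return True if the estimate fits into the combined memory of
    gpu_count identical GPUs (assuming their memory simply adds up)."""
    return estimated_gpu_memory_gb(cells) <= gpu_memory_gb * gpu_count


if __name__ == "__main__":
    benchmark_cells = 64 ** 3  # lid driven cavity case from the opening post
    print(f"benchmark case: ~{estimated_gpu_memory_gb(benchmark_cells):.2f} GB")
    print("fits on a 2.5 GB Quadro 5000:", fits_on_gpu(benchmark_cells, 2.5))
    print("2M cells on one 6 GB card:", fits_on_gpu(2_000_000, 6.0))
    print("2M cells on two 6 GB cards:", fits_on_gpu(2_000_000, 6.0, gpu_count=2))
```

For real cases the memory formulas in the guide referenced in the opening post should take precedence over this linear estimate.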
|