8x icoFoam speed up with Cufflink CUDA solver library

April 12, 2012, 14:27, #1
Kyle Mooney (kmooney), Senior Member, San Francisco, CA USA
Howdy foamers,

Last night I took a leap of faith and installed the Cufflink library for OpenFOAM-ext. It appears to reformat OpenFOAM sparse matrices, send them to an NVIDIA card, and use the built-in CUSP linear algebra routines to accelerate various flavors of CG solvers.

I found it here:
http://code.google.com/p/cufflink-library/
I had to hack the compile setup a little to avoid the MPI/parallel stuff, as I'm not ready to delve that deep into it quite yet. Other than that, installation was pretty straightforward.

I ran the icoFoam cavity tutorial at various mesh sizes and plotted the execution times. I figured I would share the results with the community. Keep in mind that this was a really quick A-B comparison to satisfy my own curiosity. Solver tolerances were matched between the CPU and GPU runs.
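
For anyone curious how the comparison was kept fair: only the pressure solver entry in system/fvSolution changes between the two runs, roughly as in the fragment below. The exact Cufflink solver name and the controlDict libs entry may differ from what I show here, so check the Cufflink README for the proper spellings.

Code:
// fragment of system/fvSolution (inside the solvers { } block); illustrative only
p
{
    // CPU baseline:
    // solver          PCG;
    // preconditioner  DIC;

    // GPU run through Cufflink (needs something like
    //   libs ("libCufflink.so");
    // in system/controlDict -- name assumed, see the Cufflink README):
    solver          cufflink_CG;

    // identical stopping criteria for both runs
    tolerance       1e-06;
    relTol          0;
}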

A little bit about my machine:
Intel Core i7, 8 GB RAM
NVIDIA GeForce GTX 260
openSUSE 11.4
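
In case anyone wants to repeat the sweep, a loop along these lines will do it (a rough sketch rather than my exact script; it assumes one copy of the cavity case per mesh density, with each blockMeshDict already edited to the target resolution):

Code:
# time icoFoam on progressively finer cavity meshes
for N in 100 200 400 800; do
    cd cavity_$N
    blockMesh > log.blockMesh
    /usr/bin/time -f "N=$N: %e s" icoFoam > log.icoFoam 2> ../time.$N
    cd ..
done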

April 12, 2012, 14:59, #2
Anton Kidess (akidess), Senior Member, Germany
Great, thanks for sharing! Are you using single or double precision?

April 12, 2012, 15:00, #3
Kyle Mooney (kmooney), Senior Member, San Francisco, CA USA
It was all run in DP.

April 13, 2012, 04:05, #4
Vincent RIVOLA (vinz), Senior Member, France
Dear Kyle,

How many cores did you use on your CPU, and how many on your GPU? Did you use the same number in each case?
Is N the number of cells in your graph?

April 13, 2012, 04:43, #5
Alberto Passalacqua (alberto), Senior Member, Ames, Iowa, United States
Any comparison with SpeedIT plugin for OpenFOAM?

April 13, 2012, 11:02, #6
Kyle Mooney (kmooney), Senior Member, San Francisco, CA USA
Hi Vincent,

The CPU runs were done on just a single core, and yes, N is the number of cells in the domain. I'm actually kind of curious how many multiprocessors the video card ended up using; I'm not sure how the library decides on grid/block/thread allocations.

Alberto, I was considering trying the SpeedIT library, but I think only the single-precision version is free of charge. Do you have any experience with it?

April 13, 2012, 19:09, #7
Robert (lordvon), Senior Member
Thanks for posting this! I just bought a GTX 550 to make use of it, and I will post some data. I will be comparing against transient parallel CPU simulations with GGI.

April 14, 2012, 16:54, #8
Daniel P. Combest (chegdan), Senior Member, St. Louis, USA
Very nice Kyle,

I'm glad to see someone using Cufflink. I've been reluctant to post anything since it's still developing: I haven't been able to add more solvers or publicize Cufflink, since I'm writing the last little bit of my thesis and wouldn't have time to fix bugs if everyone started having problems.

@Kyle
What version of CUSP and CUDA were you using?
What was the type of problem you were solving?
What was your mesh composition (cell shape and count, number of faces, boundary conditions, etc.)?
What is your motherboard, and/or the bandwidth between it and your GPU?
Which preconditioners did you use?

@those interested
* there are plans to port it to the SGI version in the next few months (unless someone wants to help).
* I had no plans to do anything with Windows or Mac... but if there is interest, this could be a nice project.
* If you want to add more CUDA based linear system solver/preconditioners this can be done by contributing to Cufflink directly (cufflink-library.googlecode.com) or to the CUSP (http://code.google.com/p/cusp-library/) project. If it is contributed to CUSP, then it will be included in Cufflink.
* In general, the CUDA-based solvers work more effectively if your problem is solved using many inner iterations (linear system solver iterations) and less effectively if outer iterations dominate. This is due to the cost of shuttling data back and forth to the GPU. So I would expect results to be different for a steady-state solver that relies on lots of outer iterations, where you would use a relTol rather than an absolute tolerance as the stopping criterion (see the sketch after this list).
* Lastly, I would stay away from single precision, as the smoothed aggregation preconditioner had some single-precision issues in earlier versions of CUSP; so Cufflink (though it can be compiled in single precision) is meant for double precision.
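
To make the inner/outer iteration point concrete, here is roughly what I mean in fvSolution terms (solver names and numbers are only illustrative, not Cufflink documentation):

Code:
// Favours the GPU: a tight absolute tolerance means many inner (linear
// solver) iterations per call, so copying the matrix to the card is
// amortised over a lot of work.
p
{
    solver      cufflink_CG;   // illustrative GPU solver name
    tolerance   1e-07;
    relTol      0;
}

// Favours the CPU: a relative tolerance stops each call after a modest
// residual drop, so every outer iteration pays the host<->GPU transfer
// for only a handful of inner iterations.
// p
// {
//     solver          PCG;
//     preconditioner  DIC;
//     tolerance       1e-07;
//     relTol          0.1;
// }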

Anyway, Kyle...good work.

Edit: The multi-GPU version works, but it is still in development. I'm not current on the speedup numbers for the parallel version (yes, there is a speedup; it is already included in the Cufflink download), but it is getting some work from another group to use UVA and GPUDirect, with testing on large clusters. Any multi-GPU improvements will be updated in the Google Code repository.

Last edited by chegdan; April 14, 2012 at 17:01. Reason: forgot to add something about multi-gpu

April 15, 2012, 13:41, #9
Daniel P. Combest (chegdan), Senior Member, St. Louis, USA
Since I'm not selling anything or making money from Cufflink (it's open source), I think I can make a few comments.

* Though we always compare everything against the base case of the non-preconditioned (bi)conjugate gradient, one should be fair when looking at the CUDA-based linear system solvers: compare the best OpenFOAM solver against the best of the GPU solvers (both, of course, against the non-preconditioned base case). And if your metric is how quickly the problem gets solved, ask whether multiple CPUs running in parallel (e.g. using GAMG) will beat a single high-end GPU, several high-end GPUs, or even multiple lower-cost GPUs.

* I definitely think GPU solvers have their place, and they will have a tremendous impact once the bottlenecks are worked through. What is important now is to understand where heterogeneous computing thrives and outperforms our current computing paradigm. I gave an example in the previous post about inner versus outer iterations.

* The speedup is highly dependent on the hardware and the problem being solved. You might even see some variability in the numbers if you ran the test a few times. One can have a really amazing setup but still a mediocre cluster if the communication between nodes is slow.

* There are known differences in precision between the GPU and the CPU, i.e. double precision vs. extended double precision (http://en.wikipedia.org/wiki/Extended_precision), and I have wondered whether this loss of a few decimal places could also account for some of the extra speed (I'm no expert; this is just thinking out loud).

* There is a lot of hype used to sell these GPU solvers, so be careful of that. Fact: when the GPU is used in the right situation (in algorithms that can be parallelized, i.e. linear system solvers), there is amazing and real speedup.

I hope people find this helpful.

April 16, 2012, 23:01, #10
Robert (lordvon), Senior Member
Hello all, the Cufflink installation instructions page says that a complete recompilation of OpenFOAM is required, under the heading 'Changes in lduInterface.H'.

Could someone give more details about how to do this?

April 16, 2012, 23:07, #11
Daniel P. Combest (chegdan), Senior Member, St. Louis, USA
Quote (lordvon):
The Cufflink installation instructions page says that a complete recompilation of OpenFOAM is required, under the heading 'Changes in lduInterface.H'. Could someone give more details about how to do this?
Yeah, first of all, make sure you are using the -ext (extend) version. If you have compiled OF before, then this will be easy. You just need to take the lduInterface.H provided in the Cufflink folder (maybe save a copy of your old lduInterface.H) and place it in the

OpenFOAM/OpenFOAM-1.6-ext/src/OpenFOAM/matrices/lduMatrix/lduAddressing/lduInterface

folder and recompile. Of course this is only necessary for multi-GPU usage. Then just recompile the OpenFOAM install, recompile Cufflink, and you should be good (in theory). This may throw off your git or svn repo, so if/when you update the ext you may get some warnings.
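
Roughly, the steps look like this (a sketch only; the paths assume a default OpenFOAM-1.6-ext install under $HOME/OpenFOAM and that the Cufflink lduInterface.H sits at the top of a checkout in $HOME/cufflink-library, so adjust to your layout):

Code:
# back up the stock header and drop in the Cufflink one
DST=$HOME/OpenFOAM/OpenFOAM-1.6-ext/src/OpenFOAM/matrices/lduMatrix/lduAddressing/lduInterface
cp $DST/lduInterface.H $DST/lduInterface.H.orig
cp $HOME/cufflink-library/lduInterface.H $DST/

# recompile OpenFOAM-1.6-ext, then Cufflink itself
cd $WM_PROJECT_DIR && ./Allwmake
# ...then rebuild Cufflink following its own build instructions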

Also, I just noticed that you were going to use GGI. This may be a problem, as it will take some more thought to program Cufflink to work with all the interfaces (as of now Cufflink only works with the processor interfaces, i.e. nothing special like cyclics) and the regular boundary conditions.

April 16, 2012, 23:10, #12
Robert (lordvon), Senior Member
Thanks, but the recompiling part is what I was asking about. Just some simple instructions, please.

April 16, 2012, 23:19, #13
Daniel P. Combest (chegdan), Senior Member, St. Louis, USA
Oh... this may be difficult if there are errors. To recompile OF-ext, just type

Code:
foam
and that will take you to the right OpenFOAM directory, and then type

Code:
./Allwmake
and then go get a coffee. If all runs smoothly it will compile fine; if not, you will be an expert by the time you get it working again.
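
(If the foam alias is not defined, the OpenFOAM environment probably hasn't been sourced yet; assuming the default install location, do something like this first:)

Code:
source $HOME/OpenFOAM/OpenFOAM-1.6-ext/etc/bashrc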

April 17, 2012, 05:27, #14
Lukasz Miroslaw (Lukasz), Member, Poland
Quote (alberto):
Any comparison with SpeedIT plugin for OpenFOAM?
Actually, we did compare icoFoam on the CPU vs. the GPU. Here is a link to a more detailed report.

We analyzed cavity3D with up to 9M cells for transient and steady-state flows, run on an Intel Dual Core and an Intel Xeon E5620. Accelerations were up to 33x and 5.4x, respectively.

Larger cases, such as the motorbike (simpleFoam, 32M cells), had to be run on a cluster. If you are interested, you may take a look at this report.

April 17, 2012, 12:23, #15
Robert (lordvon), Senior Member
Lukasz, your presentation link says that a memory bottleneck was the cause of the lack of speedup with PISO. However, the reference for that figure says:
Quote:
OpenFOAM implements the PISO method using the GAMG method, which was not ported to the GPU.
Two things:
- I am pretty sure you can just change the linear solver while still using PISO.
- This means that a memory bottleneck was not the cause; the GPU simply wasn't being used!

Could someone verify this, please?

Here is a link to the referenced paper:
http://am.ippt.gov.pl/index.php/am/a...ewFile/516/196

Last edited by lordvon; April 17, 2012 at 12:52.

April 18, 2012, 08:39, #16
Lukasz Miroslaw (Lukasz), Member, Poland
Thanks for your comments!

There were two tests in our publication. We compared SpeedIT with CG and a diagonal preconditioner on the GPU against 1) pisoFoam with CG and a DIC/DILU preconditioner, and 2) pisoFoam with GAMG.

The quoted sentence meant that GAMG was used on the CPU. That procedure was not ported to the GPU, and therefore SpeedIT was not as successful in terms of acceleration. Maybe you are right; the CPU side should have been emphasized more in that sentence.

In a few days we will publish a report where an AMG preconditioner was used, which converges faster than a diagonal preconditioner.

April 18, 2012, 09:33, #17
Robert (lordvon), Senior Member
Hi Lukasz, thanks for the reply. So are you saying that the dark bar ("diagonal vs. diagonal", tiny speedup) represents the CG solver with a diagonal preconditioner, while the lighter bar ("diagonal vs. other", no speedup) represents the GAMG solver with a diagonal preconditioner? It seemed to me from the text that only the preconditioners were varied, not the solvers.

The speedup chart caption:
Quote:
Fig. 6. Acceleration of GPU-accelerated OpenFOAM solvers against the corresponding original CPU implementations with various preconditioners of linear solvers. Dark bars show the acceleration of diagonally preconditioned GPU-accelerated solvers over the CPU implementations with recommended preconditioners – GAMG for the PISO, and DILU/DIC for the simpleFoam and potentialFoam solvers.
Here is the whole paragraph of my quote above:
Quote:
The results for the GPU acceleration are presented in Fig. 6 and show that the acceleration of the PISO algorithm is hardly noticeable. This is a result of the fact that OpenFOAM implements the PISO method using the Geometric Agglomerated Algebraic Multigrid (GAMG) method, which was not ported to the GPU.

April 18, 2012, 15:20, #18
Lukasz Miroslaw (Lukasz), Member, Poland
Quote (lordvon):
It seemed to me from the text that only the preconditioners were varied, not the solvers.
You are correct. I asked my colleagues, and indeed the preconditioners were varied, not the solvers. Sorry about the misleading reply; the tests were done a while ago.

BTW, which solvers do you think would be worthwhile to accelerate, given that CG is already accelerated on the GPU?

April 22, 2012, 14:30, #19
Robert (lordvon), Senior Member
Hmm... the icoFoam tutorial listed on the Cufflink installation instructions page has some lines of code to create a custom solver directory named 'my_icoFoam'. I tried it out and it weirdly deleted all of my solvers... No problem, I just had to remove and reinstall OpenFOAM-1.6-ext.
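
For anyone else trying this, the safer route seems to be cloning into the user project directory instead of touching the installed solvers; a sketch of the conventional approach (not the tutorial's exact commands):

Code:
# copy icoFoam into the user project area instead of the OpenFOAM tree
mkdir -p $WM_PROJECT_USER_DIR/applications/solvers
cd $WM_PROJECT_DIR/applications/solvers/incompressible
cp -r icoFoam $WM_PROJECT_USER_DIR/applications/solvers/my_icoFoam
cd $WM_PROJECT_USER_DIR/applications/solvers/my_icoFoam
mv icoFoam.C my_icoFoam.C

# edit Make/files so it builds into the user bin, e.g.
#   my_icoFoam.C
#   EXE = $(FOAM_USER_APPBIN)/my_icoFoam
wmake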

April 22, 2012, 14:38, #20
Robert (lordvon), Senior Member
Oh, and Lukasz: it seems the speedup comparison in the presentation you linked, which showed a speedup from porting the matrix solves to the GPU for the SIMPLE and potential solvers but none for PISO, is misleading: in the PISO case the GPU was not even being used. This is good news, because the figure implied that GPU acceleration would give no benefit with PISO by its very nature. GGI is implemented with PISO/PIMPLE, and that is what I want to use Cufflink with. In fact there may still be a speedup with PISO if a solver other than GAMG is used (in the implementation you referenced).
