8x icoFoam speed up with Cufflink CUDA solver library

atg · November 6, 2012, 06:56

You don't need cufflink to demonstrate this drop in performance with multiple GPUs.

Just fire up the nbody simulation on your 2050, note the FLOPS figure in the lower left corner, and then try it with two GPUs. When I do that with a 2090, the 2090 alone always beats the 2090 plus Quadro 600 by a significant margin. I think it is down to the overhead of communicating back and forth with the CPU. Newer versions of CUDA and the Kepler thing are supposed to alleviate this to some degree as far as I know, which admittedly isn't very far!

Good Luck, and thanks for posting.

alquimista · November 6, 2012, 12:24

atg, I agree that there is a bottleneck, but it should be largely compensated for by using two GPUs. I'll test the nbody simulation on systems with two Tesla 2050s and two GTX 690s.

alquimista · November 6, 2012, 12:37

Checking the post again, I see that you are using two different kinds of GPU card. In that case the Quadro 600 slows the system down, because the 2090 must wait for the Quadro 600 to finish its task, so there is no benefit.
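To make the "fast card waits for the slow card" argument concrete, here is a rough sketch. It assumes the work is split evenly across devices and that the run finishes only when the slowest device is done; that split is an assumption, not something confirmed for nbody or cufflink. The throughput values are hypothetical placeholders, not measurements of a 2090 or a Quadro 600, and `equal_split_time` is a name made up for this illustration.

```python
def equal_split_time(work, rates):
    """Wall time when work is divided evenly and every device must finish.

    work  -- total amount of work (arbitrary units)
    rates -- per-device throughput (work units per second)
    """
    share = work / len(rates)
    # The run ends when the slowest device completes its share.
    return max(share / rate for rate in rates)


# Hypothetical throughputs in arbitrary units, NOT measured values.
fast, slow = 10.0, 1.0
work = 100.0

t_fast_only = equal_split_time(work, [fast])
t_mixed = equal_split_time(work, [fast, slow])
t_matched = equal_split_time(work, [fast, fast])

print(f"fast card alone:    {t_fast_only:.1f} s")
print(f"fast + slow card:   {t_mixed:.1f} s")
print(f"two fast cards:     {t_matched:.1f} s")
```

Under these assumptions the mixed pair is slower than the fast card alone, while a matched pair halves the time. Communication overhead with the CPU is left out of this model entirely.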

I have just run nbody with one 690 and with two 690s, and the results are:

1 device: 903.968 single-precision GFLOP/s at 20 flops per interaction
2 devices: 1649.659 single-precision GFLOP/s at 20 flops per interaction

So the scaling is reasonable and the proper use of several GPUs is justified. I would expect similar behavior in cufflink.
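For anyone who wants the scaling as a number, the two figures above work out as follows. This is just arithmetic on the reported GFLOP/s values; "efficiency" here means speedup divided by device count, which is a common convention rather than anything nbody reports itself.

```python
# GFLOP/s values as reported by the nbody runs above.
one_device = 903.968
two_devices = 1649.659

speedup = two_devices / one_device   # how much faster two devices are
efficiency = speedup / 2             # fraction of ideal 2x scaling

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# prints "speedup: 1.82x, efficiency: 91%"
```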
