|
How do people even make use of super computers for CFD? |
|
July 5, 2011, 16:55 |
How do people even make use of super computers for CFD?
|
#1 |
Member
Kevin
Join Date: May 2011
Posts: 33
Rep Power: 15 |
Admittedly, I'm a bit of a novice when it comes to parallel computing, but from what I've seen so far, anything more than 4 cores has essentially no benefit. When I first started, I was really excited about the possibility of using Amazon's EC2, but now that seems completely useless. Is that right?
|
|
July 5, 2011, 18:03 |
|
#2 |
Senior Member
Kent Wardle
Join Date: Mar 2009
Location: Illinois, USA
Posts: 219
Rep Power: 21 |
There is a huge difference in architecture between a cloud system and a supercomputer. When you talk about parallel scalability to many processors, the most important thing I have seen in running CFD in parallel is the speed of the interconnect between nodes, and then, of course, the speed of the cores themselves. If your interconnect is gigabit Ethernet (1 Gb/s), you won't see much improvement above some tens of processors. New supercomputers typically have QDR InfiniBand interconnects, which run at 40 Gb/s per link.
While I have a bit of experience running OpenFOAM on clusters and supercomputers on up to a few thousand processors, I am not so familiar with trying to do it on a cloud system. Apparently, Amazon does have custom HPC-type cloud instances with a 10-gigabit Ethernet interconnect, and they claim this can match the performance of more standard HPC systems. Their 'cloud' may simply be a normal cluster in itself, and if so I am not sure what the advantage of EC2 would be other than on-demand access. Again, I know little about these systems, as my original assumption was precisely your final conclusion--they are relatively useless for large-scale CFD. Perhaps someone who knows more can chime in if I am wrong. |
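To put rough numbers on why the interconnect matters more and more as you add cores, here is a toy surface-to-volume estimate (my own made-up figures, ignoring real effects such as multiple exchanges per pressure solve): each core's partition is treated as a cube, so compute scales with its volume while the halo data it must exchange scales with its surface, and the latency and bandwidth of the link decide when communication starts to dominate. Code:
# Toy surface-to-volume model of communication vs. compute per core.
# All numbers are illustrative assumptions, not measurements: cubic
# partitions, one halo exchange per iteration, 8 bytes per halo cell,
# and round sustained-flops / latency / bandwidth figures.

def comm_to_compute_ratio(cells_per_core, bandwidth_Bps, latency_s,
                          flops_per_s=5e9, flops_per_cell=2000,
                          bytes_per_halo_cell=8):
    n = cells_per_core ** (1.0 / 3.0)        # edge of the cubic partition
    halo_cells = 6 * n * n                   # one cell layer on each face
    t_comm = 6 * latency_s + halo_cells * bytes_per_halo_cell / bandwidth_Bps
    t_comp = cells_per_core * flops_per_cell / flops_per_s
    return t_comm / t_comp

links = {
    "gigabit ethernet": (125e6, 50e-6),      # ~125 MB/s, ~50 us latency (assumed)
    "QDR InfiniBand":   (4e9,   2e-6),       # ~4 GB/s,   ~2 us latency (assumed)
}

for cells in (100_000, 10_000, 1_000):
    row = "  ".join(f"{name}: {comm_to_compute_ratio(cells, bw, lat):.3f}"
                    for name, (bw, lat) in links.items())
    print(f"{cells:>7} cells/core -> comm/compute  {row}")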
|
July 5, 2011, 18:23 |
|
#3 |
Member
Kevin
Join Date: May 2011
Posts: 33
Rep Power: 15 |
The Amazon product I was looking at is the HPC offering on EC2. It supposedly provides a 10-gigabit interconnect, so maybe it actually would be fast enough.
I was a bit pessimistic because parallel computing on multicore processors seemed to reach diminishing returns very quickly (nearly zero benefit going from 2 to 4 processors for the geometries I've tried). I couldn't imagine a supercomputer having faster connections between its processors than a multicore chip does, but I am pretty ignorant about much of this. |
|
July 5, 2011, 18:29 |
|
#4 |
Senior Member
Kent Wardle
Join Date: Mar 2009
Location: Illinois, USA
Posts: 219
Rep Power: 21 |
Well, but you also have to consider the problem size. Speedup saturates once you drop below a certain number of mesh points per processor. On QDR InfiniBand systems, for the type of problems I do (interFoam-based), this is typically around 5k-10k polyhedral cells per processor. How large are the problems you have tried?
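As a rough illustration of what that rule of thumb implies (my own back-of-the-envelope sketch; the 5k and 10k thresholds are the figures quoted above, the mesh sizes are just examples): Code:
# Back-of-the-envelope: given a mesh size and a cells-per-core threshold
# below which communication starts to dominate, estimate the largest core
# count still worth using. Thresholds are the rough figures quoted above.

def max_useful_cores(n_cells, min_cells_per_core):
    return max(1, n_cells // min_cells_per_core)

for n_cells in (100_000, 1_000_000, 10_000_000):
    for threshold in (5_000, 10_000):
        print(f"{n_cells:>10} cells, >= {threshold:>6} cells/core "
              f"-> at most ~{max_useful_cores(n_cells, threshold)} cores")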
|
|
July 5, 2011, 19:56 |
|
#5 |
Member
Kevin
Join Date: May 2011
Posts: 33
Rep Power: 15 |
The cases I've been running are around 100,000 tetrahedral cells. Going from 1 to 2 processors yields around a 40% increase in performance, and going from 2 to 4 yields an additional 10% at most. I don't suppose polyhedral meshes have better parallel performance, do they? I suppose it's possible, since each polyhedral cell has more neighbors than a tet cell and thus adds CPU work without adding proportionally more communication.
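Plugging those observed speedups into Amdahl's law gives a feel for how much of the run is effectively serial or communication-bound. This is a rough sketch using only the figures quoted above, reading "an additional 10%" as 10% on top of the two-core time; the "serial fraction" lumps together true serial work and communication overhead. Code:
# Back out an effective serial fraction from an observed speedup using
# Amdahl's law: S(p) = 1 / (f + (1 - f)/p)  =>  f = (p/S - 1) / (p - 1).
# "Serial fraction" here is a catch-all for anything that does not
# parallelise, including MPI communication.

def serial_fraction(speedup, procs):
    return (procs / speedup - 1.0) / (procs - 1.0)

# Figures quoted above: ~1.4x on 2 cores, ~1.4 * 1.1 = ~1.54x on 4 cores.
for procs, s in ((2, 1.4), (4, 1.54)):
    f = serial_fraction(s, procs)
    print(f"{procs} cores, speedup {s:.2f} -> effective serial fraction ~{f:.2f}, "
          f"speedup ceiling ~{1.0 / f:.1f}x")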
|
|
July 6, 2011, 05:44 |
|
#6 |
Senior Member
Nilesh Rane
Join Date: Apr 2010
Posts: 122
Rep Power: 16 |
I would say the most crucial thing affecting parallel efficiency is the CFD algorithm itself; hardware issues are, to me, secondary. Most current CFD algorithms are good for serial processing, but they are not ideal for parallel processing. If one can use specialized algorithms on parallel machines, one can get near-ideal parallel scalability even on thousands of processors. CFD has yet to mature for highly parallel hardware.
As an example, consider this: if you are doing a matrix inversion and the domain is spread over many processors, most conventional algorithms, like Gaussian elimination, require the whole matrix on a single processor. That means the components of the matrix need to be transferred back and forth between master and slave nodes all the time, and as we all know, this is the bottleneck for speed. Instead, there are methods which simply eliminate this data transfer and do the matrix inversion locally on each processor, largely independent of the other processors (read: very little dependency), and these give high parallel efficiency. The power of thousands of processors alone isn't enough; one needs to know how to use it.
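One concrete flavor of this trade-off is block-Jacobi iteration, where each processor factorises and solves only its own diagonal block, and the coupling to the rest of the domain enters only through the previous iterate (which is what would be communicated). Below is a minimal serial sketch of that idea, purely for illustration; it is not OpenFOAM code and not necessarily what the poster had in mind. Code:
# Minimal serial sketch of block-Jacobi iteration: each "processor" owns a
# diagonal block of A and solves it locally; the off-diagonal coupling only
# enters via the previous iterate, which is the part that would need to be
# communicated between processors.
import numpy as np

def block_jacobi(A, b, block_size, iters=100):
    n = len(b)
    x = np.zeros(n)
    blocks = [slice(i, min(i + block_size, n)) for i in range(0, n, block_size)]
    # Pre-invert each local diagonal block (purely local work).
    local_inv = [np.linalg.inv(A[s, s]) for s in blocks]
    for _ in range(iters):
        x_new = np.empty_like(x)
        for s, Binv in zip(blocks, local_inv):
            # b minus the contribution from all *other* blocks ("communication")
            r = b[s] - A[s, :] @ x + A[s, s] @ x[s]
            x_new[s] = Binv @ r
        x = x_new
    return x

# Diagonally dominant test matrix (block-Jacobi is not guaranteed to converge otherwise).
rng = np.random.default_rng(0)
A = rng.random((40, 40)) + 40 * np.eye(40)
b = rng.random(40)
x = block_jacobi(A, b, block_size=10)
print("residual norm:", np.linalg.norm(A @ x - b))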
__________________
Imagination is more important than knowledge..
|
|
July 6, 2011, 05:54 |
|
#7 |
Super Moderator
Niklas Nordin
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 693
Rep Power: 29 |
You might find this interesting. I ran this a few weeks ago, and as you can see there is a lot you can gain
when you increase the number of CPUs. Code:
# Scaling test on the KTH/PDC cluster.
# http://www.pdc.kth.se/resources/comp...dgren/Hardware
# Each node consists of a 24-core Cray XE6 with a Gemini network.
# Test case is the ERCOFTAC ufr2-02 case,
# LES of flow around a square cylinder in a channel
# http://openfoamwiki.net/index.php/Be...coftac_ufr2-02
#
# speedup = timeRef / time
# eff     = speedup / ( cores / coresRef )
# Afact   = nCells*nIter / ( nCores * time )   (number of cell iterations per core and sec)

# pimpleFoam, 1000 iterations, 3.33 M cells, PCG, constant timestep, ~0.1 CFL
#cores   #time   #kCells/core   #speedup   #eff   #Afact
    24   23881          138.8       1       1       5810
    48   11857           69.4       2.0     1.0     5851
    96    5113           34.7       4.7     1.2     6784
   120    3940           27.8       6.1     1.2     7043
   240    1655           13.9      14.4     1.4     8384
   480     914            6.9      26.1     1.3     7590
   960     664            3.5      36.0     0.9     5224
  1200     658            2.8      36.3     0.7     4217
  2400     524            1.4      45.6     0.5     2648

# pimpleFoam, 1000 iterations, 9.69 M cells, PCG, constant timestep, ~0.25 CFL
#cores   #time   #kCells/core   #speedup   #eff   #Afact
   120   18679           80.8       1       1       4323
   240    8264           40.4       2.3     1.1     4886
   480    3727           20.2       5.0     1.3     5417
   960    1860           10.1      10.0     1.3     5427
  1200    1515            8.1      12.3     1.2     5330
  2400    1034            4.0      18.1     0.9     3905

# pimpleFoam, 1000 iterations, 26.84 M cells, PCG, constant timestep, ~0.1 CFL
#cores   #time    #kCells/core   #speedup   #eff   #Afact
   120   68699          223.7       1       1       3256
   240   34235          111.8       2.0     1.0     3267
   480   15880           55.9       4.3     1.1     3521
   960    7327           28.0       9.4     1.2     3816
  1200    5846           22.4      11.8     1.2     3826
  2400    2593           11.2      26.5     1.3     4313
  4800    1918            5.6      35.8     0.9     2915
  9600    1387*           2.8      49.5     0.6     2016
  9600    1278**          2.8      53.8     0.7     2188
 * startup took 109 s
** subtracted the startup phase

# Afact seems to peak around 10k cells/core.
# Trying to keep cells/core constant at 10 k = 240k/node,
# switching to a constant CFL number
# in order to try and keep the number of pressure iterations equal.

# pimpleFoam, 1000 iterations, GAMG, variable timestep, 0.5 CFL
#cores   #ncells    #kCells/core   #time   #Afact
    24    237900            9.9      672    14751
    48    477288            9.9      840    11838
   120   1200115           10.0     1558     6419
   240   2419874           10.1     2524     3995
   480   4819176           10.0     3873     2592

# pimpleFoam, 1000 iterations, PCG, variable timestep, 0.5 CFL
#cores   #ncells    #kCells/core   #time   #Afact
    24    237900            9.9      573    17299
    48    477288            9.9      663    14998
   120   1200115           10.0      863    11589
   240   2419874           10.1     1094     9216
   480   4819176           10.0     1413     7105 |
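For anyone who wants to reproduce the derived columns, here is a small script (mine, using only the published numbers from the first table above) that recomputes speedup, efficiency, and Afact from the core counts and wall times. Code:
# Recompute the derived columns for the 3.33 M-cell PCG case above.
# speedup = timeRef / time, eff = speedup / (cores / coresRef),
# Afact   = nCells * nIter / (nCores * time)   [cell-iterations per core per second]

n_cells, n_iter = 3.33e6, 1000
runs = [(24, 23881), (48, 11857), (96, 5113), (120, 3940), (240, 1655),
        (480, 914), (960, 664), (1200, 658), (2400, 524)]   # (cores, seconds)

cores_ref, time_ref = runs[0]
print(f"{'cores':>6} {'kCells/core':>12} {'speedup':>8} {'eff':>5} {'Afact':>7}")
for cores, t in runs:
    speedup = time_ref / t
    eff = speedup / (cores / cores_ref)
    afact = n_cells * n_iter / (cores * t)
    print(f"{cores:>6} {n_cells / cores / 1e3:>12.1f} {speedup:>8.1f} {eff:>5.1f} {afact:>7.0f}")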
|
July 6, 2011, 11:00 |
|
#8 | |
Senior Member
Anton Kidess
Join Date: May 2009
Location: Germany
Posts: 1,377
Rep Power: 30 |
Quote:
|
||
July 6, 2011, 11:51 |
|
#9 | |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,290
Rep Power: 34 |
Quote:
You cannot invert a matrix locally without communicating with the other processors. The only case where you can do it is where the matrix is block-diagonal and each block lies entirely within one processor. |
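For what it's worth, that block-diagonal special case is easy to see numerically (a throwaway check of mine, nothing more): the inverse of a block-diagonal matrix is just the block-wise inverses, so each block can be inverted with no knowledge of the others, while any off-diagonal coupling breaks that. Code:
# Throwaway check: inverting a block-diagonal matrix block by block matches
# the full inverse; adding off-diagonal coupling breaks the equivalence.
import numpy as np

rng = np.random.default_rng(1)
B1 = rng.random((3, 3)) + 3 * np.eye(3)
B2 = rng.random((3, 3)) + 3 * np.eye(3)
Z = np.zeros((3, 3))

A_block = np.block([[B1, Z], [Z, B2]])             # no coupling between blocks
local_inv = np.block([[np.linalg.inv(B1), Z], [Z, np.linalg.inv(B2)]])
print(np.allclose(np.linalg.inv(A_block), local_inv))    # True: local inverses suffice

A_coupled = A_block.copy()
A_coupled[0, 5] = A_coupled[5, 0] = 1.0            # a single off-diagonal coupling term
print(np.allclose(np.linalg.inv(A_coupled), local_inv))  # False: coupling needs global info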
||
August 22, 2011, 10:44 |
|
#10 |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
This makes sense. And what algorithm would you use for large cases? GAMG, or PCG, for an unsteady case? Thanks
__________________
~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China |
|
June 20, 2012, 18:26 |
|
#11 | |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Quote:
I had a very, very long startup time (an hour) when trying to use a thousand CPUs. Any ideas?
__________________
~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China |
||
June 21, 2012, 02:26 |
|
#12 |
Super Moderator
Niklas Nordin
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 693
Rep Power: 29 |
On which architecture?
One thing I've noticed is that on a Cray, if you have CRAY_ROOTFS=DSL you will get that behaviour. |
|
June 21, 2012, 10:57 |
|
#13 | |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Quote:
I am wondering whether other architectures have a similar environment variable to set. But anyway, here is the architecture I am using. Any suggestions?
__________________
~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China |
||
June 21, 2012, 12:39 |
|
#14 |
Super Moderator
Niklas Nordin
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 693
Rep Power: 29 |
OK, I see that it's not a Cray, so it's not that.
Are you using the system MPI, or are you compiling OpenMPI yourself?

If you are using the ThirdParty option to compile OpenMPI yourself, it is absolutely crucial that you add the --with-openib flag to $configOpts in the Allwmake script in the ThirdParty folder. It is also important that when you compile it, the hardware matches the cluster hardware and the InfiniBand libs are available. Sometimes the login/submit node differs in this respect, in which case you need to submit the compilation as a job.

And last, if you are using SYSTEMOPENMPI, you need to make sure the library paths to the InfiniBand libs are in LD_LIBRARY_PATH, otherwise it will fall back to using something else. You need to find where these libs are located, go into config/settings.sh, and add these under the SYSTEMOPENMPI option:

_foamAddLib /directoryToWhereInfinibandIsLocated
_foamAddLib /directoryToSomethingThatIBMightNeed

and maybe also this:

_foamAddPath /directoryToOPENMPIBIN |
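A quick way to sanity-check which MPI you are actually picking up and whether InfiniBand support is there before launching a big job (a throwaway script of mine, not part of OpenFOAM; it assumes mpirun and ompi_info are on your PATH and that your stack uses the openib BTL). Code:
# Throwaway diagnostic: check which OpenMPI is first on PATH, whether it was
# built with the openib (InfiniBand) BTL, and whether libibverbs is visible
# on LD_LIBRARY_PATH. Adjust names to your site; this is just a sketch.
import os
import shutil
import subprocess

print("mpirun found at:", shutil.which("mpirun"))

# ompi_info lists the MCA components OpenMPI was compiled with; if no
# InfiniBand-capable BTL shows up, MPI will quietly fall back to TCP.
info = subprocess.run(["ompi_info"], capture_output=True, text=True).stdout
print("openib BTL compiled in:", "openib" in info)

# libibverbs is the user-space InfiniBand library; make sure the runtime
# linker can actually find it through LD_LIBRARY_PATH.
ld_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
hits = [d for d in ld_dirs if os.path.isdir(d)
        and any(f.startswith("libibverbs") for f in os.listdir(d))]
print("libibverbs on LD_LIBRARY_PATH:", hits if hits else "not found")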
|
June 21, 2012, 12:45 |
|
#15 | |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Quote:
Thanks a lot, I will talk to the system administrator to double-check the openib issue (to be honest, I am always worried about this, especially that different compute nodes might use different settings; it is a little tricky). Anyway, I will try it and keep you posted. In the meanwhile, would you mind testing my cases to see what happens on your cluster? Could you give me your email so that I can send you the download address? Thanks
__________________
~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China |
||
June 21, 2012, 12:55 |
|
#16 |
Super Moderator
Niklas Nordin
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 693
Rep Power: 29 |
Sure,
it's niklas dot nordin @ nequam dot se |
|
September 6, 2012, 02:08 |
|
#17 | ||
Senior Member
Nilesh Rane
Join Date: Apr 2010
Posts: 122
Rep Power: 16 |
Quote:
Quote:
My point was that if one judiciously modifies the algorithm to make it parallel-friendly, one can get very good scaling without compromising the quality of the results. Just increasing the number of processors is not a very bright idea.
__________________
Imagination is more important than knowledge..
|
|||
September 6, 2012, 04:49 |
|
#18 | ||
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,290
Rep Power: 34 |
Quote:
Quote:
You made some assumptions that seem to be working for your special case, but that does not tell me how to use the power of thousands of processors for what I am doing. You still cannot invert a matrix locally without communicating, and you still cannot invert a matrix by ignoring a few off-diagonals and doing less communication. If that were true, we would have developed lots of methods around it. What you are assuming is that you are the only smarty pants and all the others are mindless stupids. There is a reason we do things the way we do, and the reason is that people have found out that it is really not possible to just ignore a few things here and there and make it work. |
|||
September 6, 2012, 06:02 |
|
#19 | |
Senior Member
Nilesh Rane
Join Date: Apr 2010
Posts: 122
Rep Power: 16 |
Quote:
Just my last post here. I was talking about an algorithm developed by NASA and used extensively by them for their hypersonic flight designs, extra-terrestrial probes, reactive flows, etc. So I am not considering myself a "smarty pants", you see, nor did I say that others here are "mindless stupids"; I am merely stating my observations and opinions. By the way, something applicable to the whole supersonic and hypersonic regime is not that special a case, now is it? There are algorithms which are more "parallel friendly" than others, e.g. Krylov subspace solvers. The computational physics people have been using them for years, and they do not invert the matrix at all. I have done some literature survey out of interest and drawn my conclusions from that. You can choose to ignore my opinions if you feel I am wrong; I did not force anyone to accept my views. I stand by my view, you stand by yours. But do it politely.
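For readers following along: the point about Krylov solvers is that they never form an inverse at all. They only touch the matrix through matrix-vector products plus a few dot products, which is what makes them comparatively friendly to distribute (each process applies its own rows of the matrix, and only the dot products and halo exchanges need communication). Here is a minimal serial conjugate-gradient sketch, just to show which operations are involved; it is my own illustration, assuming a symmetric positive-definite system, not code from any of the solvers discussed above. Code:
# Plain conjugate gradients: the matrix only ever appears inside A @ p
# (a matrix-vector product); the rest is dot products and vector updates.
# In a distributed solver these are exactly the operations that get
# parallelised: local matvecs with a halo exchange, plus global reductions
# for the dot products. No inverse is ever formed.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=500):
    x = np.zeros_like(b)
    r = b - A @ x                      # residual
    p = r.copy()                       # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p                     # the only place the matrix is used
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r                 # a global reduction in a parallel code
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Small symmetric positive-definite test problem.
rng = np.random.default_rng(2)
M = rng.random((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.random(50)
x = conjugate_gradient(A, b)
print("residual norm:", np.linalg.norm(A @ x - b))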
__________________
Imagination is more important than knowledge..
|
||