
How do people even make use of supercomputers for CFD?



July 5, 2011, 16:55   #1
murrdpirate (Kevin)
Admittedly, I'm a bit of a novice when it comes to parallel computing, but from what I've seen so far, anything more than 4 cores has essentially no benefit. When I first started, I was really excited about the possibility of using Amazon's EC2, but now that seems completely useless. Is that right?

July 5, 2011, 18:03   #2
kwardle (Kent Wardle, Illinois, USA)
There is a huge difference in architecture between a cloud system and a supercomputer. When you talk about parallel scalability to many processors, the most important thing I have seen in running CFD in parallel is the speed of the interconnect between nodes and then, of course, the speed of the cores themselves. If your interconnect is 1 Gb/s (i.e. gigabit Ethernet) you won't see much improvement above some tens of processors. New supercomputers typically have QDR InfiniBand interconnects, which signal at 10 Gb/s per lane (40 Gb/s for a standard 4x link).

While I have a bit of experience running OpenFOAM on clusters and supercomputers on up to a few thousand processors, I am not so familiar with trying to do it on a cloud system. Apparently, Amazon does have custom HPC-type cloud instances with a 10 Gb/s interconnect. They claim this can match the performance of more standard HPC systems. Their 'cloud' may simply be a normal cluster in itself, and if so I am not sure what the advantage of EC2 would be other than on-demand access. Again, I know little about these systems, as my original assumption was precisely your final conclusion: they are relatively useless for large-scale CFD. Perhaps someone who knows more can chime in if I am wrong.
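For anyone wondering what class of interconnect a given cluster node actually has, a couple of quick checks (a minimal sketch; it assumes the ethtool and infiniband-diags tools are installed, and eth0 / node002 are placeholder names):

Code:
# Ethernet link speed of a node (eth0 is a placeholder interface name)
ethtool eth0 | grep Speed

# InfiniBand link rate, if an HCA is present (ibstat comes with infiniband-diags)
ibstat | grep -i rate

# Rough node-to-node latency check (node002 is a placeholder hostname)
ping -c 10 node002 | tail -1

Gigabit Ethernet shows up as 1000Mb/s in ethtool, while a QDR InfiniBand port typically reports a rate of 40 (Gb/s).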

July 5, 2011, 18:23   #3
murrdpirate (Kevin)
The Amazon product I was looking at is the HPC EC2 offering. It supposedly provides a 10 Gb/s (10-gigabit) connection, so maybe it actually would be fast enough.

I was a bit pessimistic because parallel computing on multicore processors seemed to reach diminishing returns very quickly (nearly zero benefit going from 2 to 4 processors for the geometries I've tried). I couldn't imagine a supercomputer having faster connections between its processors than a multicore chip, but I am pretty ignorant about much of this.

July 5, 2011, 18:29   #4
kwardle (Kent Wardle, Illinois, USA)
Well, but you also have to consider the problem size. You are going to see the maximum speedup at around some number of mesh cells per processor. On QDR InfiniBand systems, for the type of problems I do (interFoam-based), this is typically around 5k-10k polyhedral cells per processor. How large are the problems you have tried?
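As a quick sanity check before requesting cores, the cells-per-core figure for a planned decomposition can be computed with plain shell arithmetic (a minimal sketch; NCELLS and NCORES are placeholder values, and the 10k threshold is just the rule of thumb mentioned above):

Code:
#!/bin/bash
# Hypothetical helper: report cells per core for a planned run.
NCELLS=3330000   # total cell count, e.g. taken from checkMesh output
NCORES=240       # number of MPI ranks you plan to request

PERCORE=$(( NCELLS / NCORES ))
echo "cells per core: $PERCORE"

# Warn when the decomposition drops below ~10k cells per core,
# where communication cost typically starts to dominate.
if [ "$PERCORE" -lt 10000 ]; then
    echo "warning: fewer than 10k cells/core -- expect the scaling to flatten out"
fi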

July 5, 2011, 19:56   #5
murrdpirate (Kevin)
The cases I've been running are around 100,000 tetrahedral cells. Going from 1 to 2 processors yields around a 40% increase in performance, and going from 2 to 4 yields an additional 10% at most. I don't suppose polyhedral meshes have better parallel performance, do they? I suppose it's possible, since each polyhedral cell has more neighbors than a tet cell and thus adds CPU work without adding proportionally more communication.
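A back-of-the-envelope Amdahl's-law check of those figures (a rough illustration only; it treats the 40% two-core gain as exact and ignores communication cost entirely):

Code:
# Estimate the parallel fraction implied by the observed 2-core speedup
# (Amdahl's law) and the resulting best-case 4-core speedup.
awk 'BEGIN {
    s2 = 1.4                            # observed speedup on 2 cores
    p  = (1 - 1/s2) / (1 - 1.0/2)       # implied parallel fraction (~0.57)
    s4 = 1 / ((1 - p) + p/4)            # best-case speedup on 4 cores (~1.75)
    printf "parallel fraction ~ %.2f, best-case 4-core speedup ~ %.2fx\n", p, s4
}'

Even with a perfect interconnect, a 1.4x speedup on 2 cores caps the 4-core speedup near 1.75x; the observed ~1.5x is lower still, which is consistent with communication overhead adding to an already large serial fraction at this problem size.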

July 6, 2011, 05:44   #6
nileshjrane (Nilesh Rane)
I would say the most crucial thing affecting parallel efficiency is the CFD algorithm itself. Hardware issues are, to me, secondary. Most current CFD algorithms are good for serial processing, but they are not ideal for parallel processing. If one can use specialized algorithms on parallel machines, one can get near-ideal parallel scalability even on thousands of processors. CFD has yet to mature for highly parallel hardware.

As an example, consider this. If you are doing a matrix inversion and the domain is spread over many processors, then most conventional algorithms, like Gaussian elimination, require the whole matrix on a single processor. That means all the components of the matrix have to be transferred back and forth between master and slave nodes all the time, and as we all know, this is the bottleneck for speed. Instead, there are methods that simply eliminate this data transfer and do the matrix inversion locally on each processor, independent of the other processors (read: very little dependency). Thus they give high parallel efficiency.

The power of thousands of processors alone isn't enough. One needs to know how to use it.

July 6, 2011, 05:54   #7
niklas (Niklas Nordin, Stockholm, Sweden)
You might find this interesting. I did this a few weeks ago, and as you can see there is a lot you can gain when you increase the number of CPUs.

Code:
# Scaling test on the KTH/PDC cluster. 
# http://www.pdc.kth.se/resources/comp...dgren/Hardware
# Each node consists of 24 core Cray XE6 with a Gemini network

# Test case is the ERCOFTAC ufr2-02 case, 
# LES of flow around a square cylinder in a channel
# http://openfoamwiki.net/index.php/Be...coftac_ufr2-02

# speedup = timeRef / time
# eff = speedup / ( cores / coresRef )
# Afact = nCells*nIter/( nCores * time ) (number of cell iterations per core and second)

# pimpleFoam 1000 iterations 3.33 M cells, PCG, constant timestep, ~0.1 CFL
#cores  #time    #kCells/core  #speedup #eff    #Afact
  24   23881     138.8           1       1       5810
  48   11857      69.4          2.0      1.0     5851
  96    5113      34.7          4.7      1.2     6784
 120    3940      27.8          6.1      1.2     7043
 240    1655      13.9         14.4      1.4     8384
 480     914       6.9         26.1      1.3     7590
 960     664       3.5         36.0      0.9     5224
1200     658       2.8         36.3      0.7     4217
2400     524       1.4         45.6      0.5     2648


# pimpleFoam 1000 iterations 9.69 M cells, PCG, constant timestep, ~0.25 CFL
#cores  #time    #kCells/core  #speedup #eff    #Afact
 120    18679     80.8           1       1       4323
 240     8264     40.4          2.3      1.1     4886
 480     3727     20.2          5.0      1.3     5417
 960     1860     10.1         10.0      1.3     5427
1200     1515      8.1         12.3      1.2     5330
2400     1034      4.0         18.1      0.9     3905


# pimpleFoam 1000 iterations 26.84 M cells, PCG, constant timestep, ~0.1 CFL
#cores  #time    #kCells/core  #speedup #eff    #Afact
 120    68699    223.7           1       1       3256
 240    34235    111.8          2.0      1.0     3267
 480    15880     55.9          4.3      1.1     3521
 960     7327     28.0          9.4      1.2     3816
1200     5846     22.4         11.8      1.2     3826
2400     2593     11.2         26.5      1.3     4313
4800     1918      5.6         35.8      0.9     2915
9600     1387*     2.8         49.5      0.6     2016
9600     1278**    2.8         53.8      0.7     2188

* startup took 109 s
** subtracted the startup-phase 

# Afact seems to be max around 10k cells/core.
# trying to keep cells/core constant at 10 k = 240k/node
# switching to constant CFL number 
# in order to try and keep the number of pressure iterations equal

# pimpleFoam 1000 iterations, GAMG, variable timestep, 0.5 CFL
#cores  #ncells  #kCells/core  #time   #Afact
  24     237900      9.9        672     14751
  48     477288      9.9        840     11838
 120    1200115     10.0       1558      6419
 240    2419874     10.1       2524      3995
 480    4819176     10.0       3873      2592


# pimpleFoam 1000 iterations, PCG, variable timestep, 0.5 CFL
#cores  #ncells  #kCells/core  #time   #Afact
  24     237900      9.9        573    17299
  48     477288      9.9        663    14998
 120    1200115     10.0        863    11589
 240    2419874     10.1       1094     9216
 480    4819176     10.0       1413     7105
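The speedup, eff and Afact columns follow directly from the definitions in the listing header; a minimal awk sketch that reproduces the 240-core row of the 3.33M-cell case (the 24-core row is the reference, nIter = 1000):

Code:
# speedup = timeRef / time
# eff     = speedup / (cores / coresRef)
# Afact   = nCells * nIter / (nCores * time)
awk 'BEGIN {
    ncells = 3330000; niter = 1000
    coresRef = 24; timeRef = 23881      # reference row (24 cores)
    cores = 240; time = 1655            # row to reproduce
    speedup = timeRef / time
    eff     = speedup / (cores / coresRef)
    afact   = ncells * niter / (cores * time)
    printf "speedup %.1f  eff %.2f  Afact %.0f\n", speedup, eff, afact
}'

This prints speedup 14.4, eff 1.44 and Afact 8384, matching the table (which rounds eff to one decimal); the superlinear efficiency in the mid-range is likely the usual cache effect of shrinking per-core working sets.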

July 6, 2011, 11:00   #8
akidess (Anton Kidess, Germany)
Quote:
Originally Posted by nileshjrane
As an example, consider this. If you are doing a matrix inversion and the domain is spread over many processors, then most conventional algorithms, like Gaussian elimination, require the whole matrix on a single processor. That means all the components of the matrix have to be transferred back and forth between master and slave nodes all the time, and as we all know, this is the bottleneck for speed. Instead, there are methods that simply eliminate this data transfer and do the matrix inversion locally on each processor, independent of the other processors (read: very little dependency). Thus they give high parallel efficiency.
I can't imagine any code working like this, and certainly OpenFOAM doesn't! OpenFOAM applies special "processor" patch boundary conditions on the boundaries that appear after domain decomposition, and only the patch-neighbour values are exchanged. Also, latency might be a larger problem than bandwidth.

July 6, 2011, 11:51   #9
arjun (Arjun, Nurenberg, Germany)
Quote:
Originally Posted by nileshjrane
Instead, there are methods that simply eliminate this data transfer and do the matrix inversion locally on each processor, independent of the other processors (read: very little dependency). Thus they give high parallel efficiency.

The power of thousands of processors alone isn't enough. One needs to know how to use it.

You cannot invert a matrix locally without communicating with the other processors. The only case where you can is when the matrix is block diagonal and each block lies entirely within one processor.

August 22, 2011, 10:44   #10
lakeat (Daniel WEI 老魏, Beijing, China)
Quote:
Originally Posted by arjun
You cannot invert a matrix locally without communicating with the other processors. The only case where you can is when the matrix is block diagonal and each block lies entirely within one processor.
This makes sense. And what algorithm would you use for large cases: GAMG or PCG, for an unsteady case? Thanks.

June 20, 2012, 18:26   #11
lakeat (Daniel WEI 老魏, Beijing, China)
Quote:
Originally Posted by niklas
You might find this interesting. I did this a few weeks ago, and as you can see there is a lot you can gain when you increase the number of CPUs.
Hi Niklas,

I had a very long startup time (an hour) when trying to use a thousand CPUs. Any ideas?

June 21, 2012, 02:26   #12
niklas (Niklas Nordin, Stockholm, Sweden)
On which architecture?
One thing that I've noticed is that on the Cray, if you have

CRAY_ROOTFS=DSL

you will get that behaviour.
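For context on that variable: on the Cray XE line, CRAY_ROOTFS=DSL selects the dynamic-shared-libraries root filesystem, which is generally what lets dynamically linked binaries (such as a standard OpenFOAM build) run on the compute nodes at all. A minimal check of the job environment:

Code:
# Check whether the Cray DSL root filesystem is selected in the job environment
# (assumes a Cray XE-type system; harmless elsewhere, where the variable is unset)
echo "CRAY_ROOTFS is set to: ${CRAY_ROOTFS:-<unset>}"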

June 21, 2012, 10:57   #13
lakeat (Daniel WEI 老魏, Beijing, China)
Quote:
Originally Posted by niklas
On which architecture?
One thing that I've noticed is that on the Cray, if you have

CRAY_ROOTFS=DSL

you will get that behaviour.
Thanks. What did you mean by "CRAY_ROOTFS=DSL"? I googled a little and found this page, and according to it, setting "CRAY_ROOTFS=DSL" is actually helpful.

I am wondering whether other architectures have a similar environment variable to set.

But anyway, here is the architecture I am using.

Any suggestions?

June 21, 2012, 12:39   #14
niklas (Niklas Nordin, Stockholm, Sweden)
OK, I see that it's not a Cray, so it's not that.

Are you using the system MPI or are you compiling OpenMPI yourself?

If you are using the ThirdParty option to compile OpenMPI yourself, it is absolutely crucial that you add the --with-openib flag to $configOpts in the Allwmake script in the ThirdParty folder.

It is also important that when you compile it, you make sure that the hardware is the same as the cluster hardware and that the InfiniBand libs are available. Sometimes the login/submit node can differ in this respect, in which case you need to submit the compilation as a job.

And last, if you are using SYSTEMOPENMPI, you need to make sure that you have the library paths to the InfiniBand libs in LD_LIBRARY_PATH; otherwise it will fall back to using something else. You need to find where these libs are located, then go into config/settings.sh and add them under the SYSTEMOPENMPI option:
_foamAddLib /directoryToWhereInfinibandIsLocated
_foamAddLib /directoryToSomethingThatIBMightNeed

and maybe also this:
_foamAddPath /directoryToOPENMPIBIN
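A few checks along those lines, run on the cluster itself (a sketch; exact library names and locations vary between systems, and the openib grep assumes a classic OpenMPI build):

Code:
# Was OpenMPI built with the openib (InfiniBand) byte-transfer layer?
ompi_info | grep -i openib

# Are the InfiniBand verbs libraries visible to the runtime linker /
# present on LD_LIBRARY_PATH?
ldconfig -p | grep libibverbs
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i verbs

# Which mpirun is actually picked up (system MPI vs ThirdParty build)?
which mpirun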

June 21, 2012, 12:45   #15
lakeat (Daniel WEI 老魏, Beijing, China)
Quote:
Originally Posted by niklas
OK, I see that it's not a Cray, so it's not that.

Are you using the system MPI or are you compiling OpenMPI yourself?

If you are using the ThirdParty option to compile OpenMPI yourself, it is absolutely crucial that you add the --with-openib flag to $configOpts in the Allwmake script in the ThirdParty folder.

It is also important that when you compile it, you make sure that the hardware is the same as the cluster hardware and that the InfiniBand libs are available. Sometimes the login/submit node can differ in this respect, in which case you need to submit the compilation as a job.

And last, if you are using SYSTEMOPENMPI, you need to make sure that you have the library paths to the InfiniBand libs in LD_LIBRARY_PATH; otherwise it will fall back to using something else. You need to find where these libs are located, then go into config/settings.sh and add them under the SYSTEMOPENMPI option:
_foamAddLib /directoryToWhereInfinibandIsLocated
_foamAddLib /directoryToSomethingThatIBMightNeed

and maybe also this:
_foamAddPath /directoryToOPENMPIBIN


Thanks a lot. I will talk to the system manager to double-check the openib issue. (You know, I am always worried about this issue; in particular I am afraid that different compute nodes might use different settings, which is a little tricky.)

Anyway, I will try and keep you posted. In the meantime, would you mind testing my case to see what happens on your cluster? Could you give me your email so that I can send you the download address?

Thanks

June 21, 2012, 12:55   #16
niklas (Niklas Nordin, Stockholm, Sweden)
Sure, it's niklas dot nordin @ nequam dot se

September 6, 2012, 02:08   #17
nileshjrane (Nilesh Rane)
Quote:
Originally Posted by akidess
I can't imagine any code working like this, and certainly OpenFOAM doesn't! OpenFOAM applies special "processor" patch boundary conditions on the boundaries that appear after domain decomposition, and only the patch-neighbour values are exchanged. Also, latency might be a larger problem than bandwidth.
I was talking in a general sense, not about OpenFOAM specifically. And it was just one example to emphasize my point that algorithms matter more than the number of processors.

Quote:
Originally Posted by arjun
You cannot invert a matrix locally without communicating with the other processors. The only case where you can is when the matrix is block diagonal and each block lies entirely within one processor.
You cannot do it completely independently, yes; that's why I mentioned in brackets "read: very little dependency". The algorithm I worked with during my Master's work on hypersonic flows makes the assumption that du/dy >> du/dx, where u is the velocity parallel to the wall, x is along the wall, and y is perpendicular to it. Here the x-y grid is the transformed one. This is a valid assumption for wall-bounded hypersonic flows. Because of it, one can linearize the longitudinal gradients and reduce the dependence of points on a given x-line to little more than the previous collinear point on that line. So in practical terms, one doesn't need information beyond the boundary cells of each block (associated with one processor) when the decomposition is done in the longitudinal direction only. And the algorithm still gives results as good as a complete matrix inversion would have. Mind you, this is a fully coupled hypersonic reacting-flow code, not just a block-diagonal code. The elegance lies in the simple but appropriate simplification. (NOTE: I might sound vague here; sorry, I couldn't explain the method well, I guess.)

My point was that if one judiciously modifies the algorithm to make it parallel-processor friendly, one can get very good scaling without compromising the quality of the results. Just increasing the number of processors is not a very bright idea.

September 6, 2012, 04:49   #18
arjun (Arjun, Nurenberg, Germany)
Quote:
Originally Posted by nileshjrane
You cannot do it completely independently, yes; that's why I mentioned in brackets "read: very little dependency". The algorithm I worked with during my Master's work on hypersonic flows makes the assumption that du/dy >> du/dx, where u is the velocity parallel to the wall, x is along the wall, and y is perpendicular to it. Here the x-y grid is the transformed one. This is a valid assumption for wall-bounded hypersonic flows. Because of it, one can linearize the longitudinal gradients and reduce the dependence of points on a given x-line to little more than the previous collinear point on that line. So in practical terms, one doesn't need information beyond the boundary cells of each block (associated with one processor) when the decomposition is done in the longitudinal direction only. And the algorithm still gives results as good as a complete matrix inversion would have. Mind you, this is a fully coupled hypersonic reacting-flow code, not just a block-diagonal code. The elegance lies in the simple but appropriate simplification. (NOTE: I might sound vague here; sorry, I couldn't explain the method well, I guess.)

My point was that if one judiciously modifies the algorithm to make it parallel-processor friendly, one can get very good scaling without compromising the quality of the results. Just increasing the number of processors is not a very bright idea.

Quote:
Originally Posted by nileshjrane
Instead, there are methods that simply eliminate this data transfer and do the matrix inversion locally on each processor, independent of the other processors (read: very little dependency). Thus they give high parallel efficiency.

The power of thousands of processors alone isn't enough. One needs to know how to use it.
What you did is of no use unless you can show that it could be used by everyone.
You made some assumptions that seem to work for your special case, but that does not tell me how to use the power of thousands of processors for what I am doing.

You still cannot invert a matrix locally without communicating, and you still cannot invert a matrix by ignoring a few off-diagonals and doing less communication. If that were true, we would have developed lots of methods around it. What you are assuming is that you are the only smarty pants and all the others are mindless stupids. There is a reason we do things the way we do, and the reason is that people have found out that it is really not possible to just ignore a few things here and there and make things work.

September 6, 2012, 06:02   #19
nileshjrane (Nilesh Rane)
Quote:
Originally Posted by arjun
What you did is of no use unless you can show that it could be used by everyone.
You made some assumptions that seem to work for your special case, but that does not tell me how to use the power of thousands of processors for what I am doing.

You still cannot invert a matrix locally without communicating, and you still cannot invert a matrix by ignoring a few off-diagonals and doing less communication. If that were true, we would have developed lots of methods around it. What you are assuming is that you are the only smarty pants and all the others are mindless stupids. There is a reason we do things the way we do, and the reason is that people have found out that it is really not possible to just ignore a few things here and there and make things work.
That was rude. I don't want to pollute the thread, so I am backing off from it.

Just my last post here. I was talking about an algorithm developed by NASA and used extensively by them for their hypersonic flight designs, extra-terrestrial probes, reactive flows, etc. So I am not considering myself a "smarty pants", you see, nor did I say that others here are "mindless stupids". I am merely stating my observations and opinions. By the way, something that is applicable to the whole supersonic and hypersonic regime is not that special a case, now is it?

There are algorithms that are more "parallel friendly" than others, e.g. Krylov subspace solvers. The computational physics people have been using them for years. They don't invert the matrix at all. I have done some literature survey out of interest, and that is where I drew my conclusion from.

You can choose to ignore my opinions if you feel I am wrong. I did not force anyone to accept my views. I stand by my view, you stand by yours. But do it politely.
