|
How do people even make use of super computers for CFD? |
|
July 5, 2011, 16:55 |
How do people even make use of super computers for CFD?
|
#1 |
Member
Kevin
Join Date: May 2011
Posts: 33
Rep Power: 15 |
Admittedly, I'm a bit of a novice when it comes to parallel computing, but from what I've seen so far, anything more than 4 cores has essentially no benefit. When I first started, I was really excited about the possibility of using Amazon's EC2, but now that seems completely useless. Is that right?
|
|
July 5, 2011, 18:03 |
|
#2 |
Senior Member
Kent Wardle
Join Date: Mar 2009
Location: Illinois, USA
Posts: 219
Rep Power: 21 |
There is a huge difference in architecture between a cloud system and a supercomputer. When you talk about parallel scalability to many processors, the most important thing I have seen in running CFD in parallel is the speed of the interconnect between nodes, and then, of course, the speed of the cores themselves. If your interconnect is gigabit Ethernet (1 Gb/s), you won't see much improvement above some tens of processors. New supercomputers typically have QDR InfiniBand interconnects, which run at 40 Gb/s per link.
While I have a bit of experience running OpenFOAM on clusters and supercomputers on up to a few thousand processors, I am not so familiar with trying to do it on a cloud system. Apparently, Amazon does have custom HPC-type cloud instances with a 10-gigabit Ethernet interconnect, and they claim this can match the performance of more standard HPC systems. Their 'cloud' may simply be a normal cluster in itself, and if so I am not sure what the advantage of EC2 would be other than on-demand access. Again, I know little about these systems, as my original assumption was precisely your final conclusion--they are relatively useless for large-scale CFD. Perhaps someone who knows more can chime in if I am wrong. |
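To put rough numbers on why the interconnect matters more and more as you add cores, here is a toy surface-to-volume estimate (my own made-up figures, ignoring real effects such as multiple exchanges per pressure solve): each core's partition is treated as a cube, so compute scales with its volume while the halo data it must exchange scales with its surface, and the latency and bandwidth of the link decide when communication starts to dominate. Code:
# Toy surface-to-volume model of communication vs. compute per core.
# All numbers are illustrative assumptions, not measurements: cubic
# partitions, one halo exchange per iteration, 8 bytes per halo cell,
# and round sustained-flops / latency / bandwidth figures.

def comm_to_compute_ratio(cells_per_core, bandwidth_Bps, latency_s,
                          flops_per_s=5e9, flops_per_cell=2000,
                          bytes_per_halo_cell=8):
    n = cells_per_core ** (1.0 / 3.0)        # edge of the cubic partition
    halo_cells = 6 * n * n                   # one cell layer on each face
    t_comm = 6 * latency_s + halo_cells * bytes_per_halo_cell / bandwidth_Bps
    t_comp = cells_per_core * flops_per_cell / flops_per_s
    return t_comm / t_comp

links = {
    "gigabit ethernet": (125e6, 50e-6),      # ~125 MB/s, ~50 us latency (assumed)
    "QDR InfiniBand":   (4e9,   2e-6),       # ~4 GB/s,   ~2 us latency (assumed)
}

for cells in (100_000, 10_000, 1_000):
    row = "  ".join(f"{name}: {comm_to_compute_ratio(cells, bw, lat):.3f}"
                    for name, (bw, lat) in links.items())
    print(f"{cells:>7} cells/core -> comm/compute  {row}")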
|
July 5, 2011, 18:23 |
|
#3 |
Member
Kevin
Join Date: May 2011
Posts: 33
Rep Power: 15 |
The Amazon product I was looking at is the HPC offering on EC2. It supposedly provides a 10-gigabit interconnect, so maybe it actually would be fast enough.
I was a bit pessimistic because parallel computing on multicore processors seemed to reach diminishing returns very quickly (nearly zero benefit going from 2 to 4 processors for the geometries I've tried). I couldn't imagine a supercomputer having faster connections between its processors than a multicore chip does, but I am pretty ignorant about much of this. |
|
July 5, 2011, 18:29 |
|
#4 |
Senior Member
Kent Wardle
Join Date: Mar 2009
Location: Illinois, USA
Posts: 219
Rep Power: 21 |
Well, but you also have to consider the problem size. Speedup saturates once you drop below a certain number of mesh points per processor. On QDR InfiniBand systems, for the type of problems I do (interFoam-based), this is typically around 5k-10k polyhedral cells per processor. How large are the problems you have tried?
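As a rough illustration of what that rule of thumb implies (my own back-of-the-envelope sketch; the 5k and 10k thresholds are the figures quoted above, the mesh sizes are just examples): Code:
# Back-of-the-envelope: given a mesh size and a cells-per-core threshold
# below which communication starts to dominate, estimate the largest core
# count still worth using. Thresholds are the rough figures quoted above.

def max_useful_cores(n_cells, min_cells_per_core):
    return max(1, n_cells // min_cells_per_core)

for n_cells in (100_000, 1_000_000, 10_000_000):
    for threshold in (5_000, 10_000):
        print(f"{n_cells:>10} cells, >= {threshold:>6} cells/core "
              f"-> at most ~{max_useful_cores(n_cells, threshold)} cores")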
|
|
July 5, 2011, 19:56 |
|
#5 |
Member
Kevin
Join Date: May 2011
Posts: 33
Rep Power: 15 |
The cases I've been running are around 100,000 tetrahedral cells. Going from 1 to 2 processors yields around a 40% increase in performance, and going from 2 to 4 yields an additional 10% at most. I don't suppose polyhedral meshes have better parallel performance, do they? I suppose it's possible, since each polyhedral cell has more neighbors than a tet cell and thus adds CPU work without adding proportionally more communication.
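Plugging those observed speedups into Amdahl's law gives a feel for how much of the run is effectively serial or communication-bound. This is a rough sketch using only the figures quoted above, reading "an additional 10%" as 10% on top of the two-core time; the "serial fraction" lumps together true serial work and communication overhead. Code:
# Back out an effective serial fraction from an observed speedup using
# Amdahl's law: S(p) = 1 / (f + (1 - f)/p)  =>  f = (p/S - 1) / (p - 1).
# "Serial fraction" here is a catch-all for anything that does not
# parallelise, including MPI communication.

def serial_fraction(speedup, procs):
    return (procs / speedup - 1.0) / (procs - 1.0)

# Figures quoted above: ~1.4x on 2 cores, ~1.4 * 1.1 = ~1.54x on 4 cores.
for procs, s in ((2, 1.4), (4, 1.54)):
    f = serial_fraction(s, procs)
    print(f"{procs} cores, speedup {s:.2f} -> effective serial fraction ~{f:.2f}, "
          f"speedup ceiling ~{1.0 / f:.1f}x")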
|
|
July 6, 2011, 05:44 |
|
#6 |
Senior Member
Nilesh Rane
Join Date: Apr 2010
Posts: 122
Rep Power: 16 |
I would say the most crucial thing affecting parallel efficiency is the CFD algorithm itself; hardware issues are, to me, secondary. Most current CFD algorithms are good for serial processing, but they are not ideal for parallel processing. If one can use specialized algorithms on parallel machines, one can get near-ideal parallel scalability even on thousands of processors. CFD has yet to mature for highly parallel hardware.
As an example, consider this: if you are doing a matrix inversion and the domain is spread over many processors, most conventional algorithms, like Gaussian elimination, require the whole matrix on a single processor. That means the components of the matrix need to be transferred back and forth between master and slave nodes all the time, and as we all know, this is the bottleneck for speed. Instead, there are methods which simply eliminate this data transfer and do the matrix inversion locally on each processor, largely independent of the other processors (read: very little dependency), and these give high parallel efficiency. The power of thousands of processors alone isn't enough; one needs to know how to use it.
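One concrete flavor of this trade-off is block-Jacobi iteration, where each processor factorises and solves only its own diagonal block, and the coupling to the rest of the domain enters only through the previous iterate (which is what would be communicated). Below is a minimal serial sketch of that idea, purely for illustration; it is not OpenFOAM code and not necessarily what the poster had in mind. Code:
# Minimal serial sketch of block-Jacobi iteration: each "processor" owns a
# diagonal block of A and solves it locally; the off-diagonal coupling only
# enters via the previous iterate, which is the part that would need to be
# communicated between processors.
import numpy as np

def block_jacobi(A, b, block_size, iters=100):
    n = len(b)
    x = np.zeros(n)
    blocks = [slice(i, min(i + block_size, n)) for i in range(0, n, block_size)]
    # Pre-invert each local diagonal block (purely local work).
    local_inv = [np.linalg.inv(A[s, s]) for s in blocks]
    for _ in range(iters):
        x_new = np.empty_like(x)
        for s, Binv in zip(blocks, local_inv):
            # b minus the contribution from all *other* blocks ("communication")
            r = b[s] - A[s, :] @ x + A[s, s] @ x[s]
            x_new[s] = Binv @ r
        x = x_new
    return x

# Diagonally dominant test matrix (block-Jacobi is not guaranteed to converge otherwise).
rng = np.random.default_rng(0)
A = rng.random((40, 40)) + 40 * np.eye(40)
b = rng.random(40)
x = block_jacobi(A, b, block_size=10)
print("residual norm:", np.linalg.norm(A @ x - b))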
__________________
Imagination is more important than knowledge..
|
|
July 6, 2011, 05:54 |
|
#7 |
Super Moderator
Niklas Nordin
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 693
Rep Power: 29 |
You might find this interesting. I ran this a few weeks ago, and as you can see there is a lot you can gain
when you increase the number of CPUs. Code:
# Scaling test on the KTH/PDC cluster.
# http://www.pdc.kth.se/resources/comp...dgren/Hardware
# Each node consists of a 24-core Cray XE6 with a Gemini network.
# Test case is the ERCOFTAC ufr2-02 case,
# LES of flow around a square cylinder in a channel
# http://openfoamwiki.net/index.php/Be...coftac_ufr2-02
#
# speedup = timeRef / time
# eff     = speedup / ( cores / coresRef )
# Afact   = nCells*nIter / ( nCores * time )   (number of cell iterations per core and sec)

# pimpleFoam, 1000 iterations, 3.33 M cells, PCG, constant timestep, ~0.1 CFL
#cores   #time   #kCells/core   #speedup   #eff   #Afact
    24   23881          138.8       1       1       5810
    48   11857           69.4       2.0     1.0     5851
    96    5113           34.7       4.7     1.2     6784
   120    3940           27.8       6.1     1.2     7043
   240    1655           13.9      14.4     1.4     8384
   480     914            6.9      26.1     1.3     7590
   960     664            3.5      36.0     0.9     5224
  1200     658            2.8      36.3     0.7     4217
  2400     524            1.4      45.6     0.5     2648

# pimpleFoam, 1000 iterations, 9.69 M cells, PCG, constant timestep, ~0.25 CFL
#cores   #time   #kCells/core   #speedup   #eff   #Afact
   120   18679           80.8       1       1       4323
   240    8264           40.4       2.3     1.1     4886
   480    3727           20.2       5.0     1.3     5417
   960    1860           10.1      10.0     1.3     5427
  1200    1515            8.1      12.3     1.2     5330
  2400    1034            4.0      18.1     0.9     3905

# pimpleFoam, 1000 iterations, 26.84 M cells, PCG, constant timestep, ~0.1 CFL
#cores   #time    #kCells/core   #speedup   #eff   #Afact
   120   68699          223.7       1       1       3256
   240   34235          111.8       2.0     1.0     3267
   480   15880           55.9       4.3     1.1     3521
   960    7327           28.0       9.4     1.2     3816
  1200    5846           22.4      11.8     1.2     3826
  2400    2593           11.2      26.5     1.3     4313
  4800    1918            5.6      35.8     0.9     2915
  9600    1387*           2.8      49.5     0.6     2016
  9600    1278**          2.8      53.8     0.7     2188
 * startup took 109 s
** subtracted the startup phase

# Afact seems to peak around 10k cells/core.
# Trying to keep cells/core constant at 10 k = 240k/node,
# switching to a constant CFL number
# in order to try and keep the number of pressure iterations equal.

# pimpleFoam, 1000 iterations, GAMG, variable timestep, 0.5 CFL
#cores   #ncells    #kCells/core   #time   #Afact
    24    237900            9.9      672    14751
    48    477288            9.9      840    11838
   120   1200115           10.0     1558     6419
   240   2419874           10.1     2524     3995
   480   4819176           10.0     3873     2592

# pimpleFoam, 1000 iterations, PCG, variable timestep, 0.5 CFL
#cores   #ncells    #kCells/core   #time   #Afact
    24    237900            9.9      573    17299
    48    477288            9.9      663    14998
   120   1200115           10.0      863    11589
   240   2419874           10.1     1094     9216
   480   4819176           10.0     1413     7105 |
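For anyone who wants to reproduce the derived columns, here is a small script (mine, using only the published numbers from the first table above) that recomputes speedup, efficiency, and Afact from the core counts and wall times. Code:
# Recompute the derived columns for the 3.33 M-cell PCG case above.
# speedup = timeRef / time, eff = speedup / (cores / coresRef),
# Afact   = nCells * nIter / (nCores * time)   [cell-iterations per core per second]

n_cells, n_iter = 3.33e6, 1000
runs = [(24, 23881), (48, 11857), (96, 5113), (120, 3940), (240, 1655),
        (480, 914), (960, 664), (1200, 658), (2400, 524)]   # (cores, seconds)

cores_ref, time_ref = runs[0]
print(f"{'cores':>6} {'kCells/core':>12} {'speedup':>8} {'eff':>5} {'Afact':>7}")
for cores, t in runs:
    speedup = time_ref / t
    eff = speedup / (cores / cores_ref)
    afact = n_cells * n_iter / (cores * t)
    print(f"{cores:>6} {n_cells / cores / 1e3:>12.1f} {speedup:>8.1f} {eff:>5.1f} {afact:>7.0f}")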
|
July 6, 2011, 11:00 |
|
#8 | |
Senior Member
Anton Kidess
Join Date: May 2009
Location: Germany
Posts: 1,377
Rep Power: 30 |
Quote:
|
||
July 6, 2011, 11:51 |
|
#9 | |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,290
Rep Power: 34 |
Quote:
You cannot invert a matrix locally without communicating with the other processors. The only case where you can do it is where the matrix is block-diagonal and each block lies entirely within one processor. |
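For what it's worth, that block-diagonal special case is easy to see numerically (a throwaway check of mine, nothing more): the inverse of a block-diagonal matrix is just the block-wise inverses, so each block can be inverted with no knowledge of the others, while any off-diagonal coupling breaks that. Code:
# Throwaway check: inverting a block-diagonal matrix block by block matches
# the full inverse; adding off-diagonal coupling breaks the equivalence.
import numpy as np

rng = np.random.default_rng(1)
B1 = rng.random((3, 3)) + 3 * np.eye(3)
B2 = rng.random((3, 3)) + 3 * np.eye(3)
Z = np.zeros((3, 3))

A_block = np.block([[B1, Z], [Z, B2]])             # no coupling between blocks
local_inv = np.block([[np.linalg.inv(B1), Z], [Z, np.linalg.inv(B2)]])
print(np.allclose(np.linalg.inv(A_block), local_inv))    # True: local inverses suffice

A_coupled = A_block.copy()
A_coupled[0, 5] = A_coupled[5, 0] = 1.0            # a single off-diagonal coupling term
print(np.allclose(np.linalg.inv(A_coupled), local_inv))  # False: coupling needs global info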
||
August 22, 2011, 10:44 |
|
#10 |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
This makes sense. And what algorithm would you use for large cases? GAMG, or PCG, for an unsteady case? Thanks
__________________
~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China |
|
June 20, 2012, 18:26 |
|
#11 | |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Quote:
I had a very, very long startup time (an hour) when trying to use a thousand CPUs. Any ideas?
__________________
~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China |
||
June 21, 2012, 02:26 |
|
#12 |
Super Moderator
Niklas Nordin
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 693
Rep Power: 29 |
On which architecture?
One thing I've noticed is that on a Cray, if you have CRAY_ROOTFS=DSL you will get that behaviour. |
|
June 21, 2012, 10:57 |
|
#13 | |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Quote:
I am wondering whether other architectures have a similar environment variable to set. But anyway, here is the architecture I am using. Any suggestions?
__________________
~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China |
||
June 21, 2012, 12:39 |
|
#14 |
Super Moderator
Niklas Nordin
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 693
Rep Power: 29 |
OK, I see that it's not a Cray, so it's not that.
Are you using the system MPI, or are you compiling OpenMPI yourself?

If you are using the ThirdParty option to compile OpenMPI yourself, it is absolutely crucial that you add the --with-openib flag to $configOpts in the Allwmake script in the ThirdParty folder. It is also important that when you compile it, the hardware matches the cluster hardware and the InfiniBand libs are available. Sometimes the login/submit node differs in this respect, in which case you need to submit the compilation as a job.

And last, if you are using SYSTEMOPENMPI, you need to make sure the library paths to the InfiniBand libs are in LD_LIBRARY_PATH, otherwise it will fall back to using something else. You need to find where these libs are located, go into config/settings.sh, and add these under the SYSTEMOPENMPI option:

_foamAddLib /directoryToWhereInfinibandIsLocated
_foamAddLib /directoryToSomethingThatIBMightNeed

and maybe also this:

_foamAddPath /directoryToOPENMPIBIN |
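A quick way to sanity-check which MPI you are actually picking up and whether InfiniBand support is there before launching a big job (a throwaway script of mine, not part of OpenFOAM; it assumes mpirun and ompi_info are on your PATH and that your stack uses the openib BTL). Code:
# Throwaway diagnostic: check which OpenMPI is first on PATH, whether it was
# built with the openib (InfiniBand) BTL, and whether libibverbs is visible
# on LD_LIBRARY_PATH. Adjust names to your site; this is just a sketch.
import os
import shutil
import subprocess

print("mpirun found at:", shutil.which("mpirun"))

# ompi_info lists the MCA components OpenMPI was compiled with; if no
# InfiniBand-capable BTL shows up, MPI will quietly fall back to TCP.
info = subprocess.run(["ompi_info"], capture_output=True, text=True).stdout
print("openib BTL compiled in:", "openib" in info)

# libibverbs is the user-space InfiniBand library; make sure the runtime
# linker can actually find it through LD_LIBRARY_PATH.
ld_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
hits = [d for d in ld_dirs if os.path.isdir(d)
        and any(f.startswith("libibverbs") for f in os.listdir(d))]
print("libibverbs on LD_LIBRARY_PATH:", hits if hits else "not found")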
|
June 21, 2012, 12:45 |
|
#15 | |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Quote:
Thanks a lot, I will talk to the system administrator to double-check the openib issue (to be honest, I am always worried about this, especially that different compute nodes might use different settings; it is a little tricky). Anyway, I will try it and keep you posted. In the meanwhile, would you mind testing my cases to see what happens on your cluster? Could you give me your email so that I can send you the download address? Thanks
__________________
~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China |
||
June 21, 2012, 12:55 |
|
#16 |
Super Moderator
Niklas Nordin
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 693
Rep Power: 29 |
Sure,
it's niklas dot nordin @ nequam dot se |
|
September 6, 2012, 02:08 |
|
#17 | ||
Senior Member
Nilesh Rane
Join Date: Apr 2010
Posts: 122
Rep Power: 16 |
Quote:
Quote:
My point was that if one judiciously modifies the algorithm to make it parallel-friendly, one can get very good scaling without compromising the quality of the results. Just increasing the number of processors is not a very bright idea.
__________________
Imagination is more important than knowledge..
|
|||
September 6, 2012, 04:49 |
|
#18 | ||
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,290
Rep Power: 34 |
Quote:
Quote:
You made some assumptions that seem to be working for your special case, but that does not tell me how to use the power of thousands of processors for what I am doing. You still cannot invert a matrix locally without communicating, and you still cannot invert a matrix by ignoring a few off-diagonals and doing less communication. If that were true, we would have developed lots of methods around it. What you are assuming is that you are the only smarty pants and all the others are mindless stupids. There is a reason we do things the way we do, and the reason is that people have found out that it is really not possible to just ignore a few things here and there and make it work. |
|||
September 6, 2012, 06:02 |
|
#19 | |
Senior Member
Nilesh Rane
Join Date: Apr 2010
Posts: 122
Rep Power: 16 |
Quote:
Just my last post here. I was talking about an algorithm developed by NASA and used extensively by them for their hypersonic flight designs, extra-terrestrial probes, reactive flows, etc. So I am not considering myself a "smarty pants", you see, nor did I say that others here are "mindless stupids"; I am merely stating my observations and opinions. By the way, something applicable to the whole supersonic and hypersonic regime is not that special a case, now is it? There are algorithms which are more "parallel friendly" than others, e.g. Krylov subspace solvers. The computational physics people have been using them for years, and they do not invert the matrix at all. I have done some literature survey out of interest and drawn my conclusions from that. You can choose to ignore my opinions if you feel I am wrong; I did not force anyone to accept my views. I stand by my view, you stand by yours. But do it politely.
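For readers following along: the point about Krylov solvers is that they never form an inverse at all. They only touch the matrix through matrix-vector products plus a few dot products, which is what makes them comparatively friendly to distribute (each process applies its own rows of the matrix, and only the dot products and halo exchanges need communication). Here is a minimal serial conjugate-gradient sketch, just to show which operations are involved; it is my own illustration, assuming a symmetric positive-definite system, not code from any of the solvers discussed above. Code:
# Plain conjugate gradients: the matrix only ever appears inside A @ p
# (a matrix-vector product); the rest is dot products and vector updates.
# In a distributed solver these are exactly the operations that get
# parallelised: local matvecs with a halo exchange, plus global reductions
# for the dot products. No inverse is ever formed.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=500):
    x = np.zeros_like(b)
    r = b - A @ x                      # residual
    p = r.copy()                       # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p                     # the only place the matrix is used
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r                 # a global reduction in a parallel code
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Small symmetric positive-definite test problem.
rng = np.random.default_rng(2)
M = rng.random((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.random(50)
x = conjugate_gradient(A, b)
print("residual norm:", np.linalg.norm(A @ x - b))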
__________________
Imagination is more important than knowledge..
|
||