October 14, 2010, 12:08 |
Large case parallel efficiency
|
#1 |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Hi foamers,
I'm frustrated by extremely slow parallel performance. My cases are unsteady 3D LES of incompressible external flow. With a mesh of around 2M cells it works fine: I use 24 or 48 CPUs and the speed looks not bad. But with a mesh of around 9M cells, running on 96 or 128 CPUs, the simulation barely advanced for a very long time (more than a week). So, dear all, what are your ideas, suggestions, and experience? Any advice would be highly appreciated!
__________________
~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China |
|
October 14, 2010, 12:14 |
|
#2 |
Senior Member
Vincent RIVOLA
Join Date: Mar 2009
Location: France
Posts: 283
Rep Power: 18 |
The last computations I did, with about 8 million cells and a solver derived from rhoSimpleFoam, ran slower on 32 cores than on 16, so I decided to stick with 16.
However, I would be really happy to know how to improve this kind of behaviour... Vincent |
|
October 14, 2010, 12:24 |
|
#3 |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Yes, I ran into the same situation: adding CPUs decreases the speed, sometimes by a lot. |
|
October 14, 2010, 13:29 |
|
#4 |
Member
Fábio César Canesin
Join Date: Mar 2010
Location: Florianópolis
Posts: 67
Rep Power: 16 |
It is related to the need for communication...
Every new division added to a domain means more synchronization is needed at the boundaries. The speedup you get is a compromise between the increase in computational power (more cores) and the increase in communication. If you keep adding cores, at some point the cost of communication overtakes the gain in computational power.
What can be done? The first solution is to improve the communication: a better network, and bypassing the kernel. The kernel adds overhead to communication over TCP/IP, so you should use something like InfiniBand or Myrinet.
The second solution is to improve the locality of your problem: maybe decrease the number of subdomains per compute node and then solve the linear system in parallel within each compute node...
Hope it helps.
Fábio C. Canesin |
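To make the compute-versus-communication trade-off concrete, here is a toy strong-scaling model in Python. It is only a sketch under stated assumptions: the cost constants (`t_cell`, `t_face`, `latency`, `n_exchanges`, `n_reductions`), the cubic-subdomain shape, and the two latency values are all hypothetical placeholders, not measurements from any cluster mentioned in this thread.

```python
import math


def step_time(n_cells, n_cores, latency, t_cell=2e-6, t_face=1e-8,
              n_exchanges=200, n_reductions=400):
    """Toy estimate of wall-clock seconds per time step.

    Every constant is a hypothetical placeholder:
      t_cell       compute cost per cell per step
      t_face       transfer cost per processor-boundary face per exchange
      latency      fixed cost of one message
      n_exchanges  halo exchanges per step
      n_reductions global reductions (e.g. solver dot products) per step
    """
    cells_per_core = n_cells / n_cores
    compute = t_cell * cells_per_core
    if n_cores == 1:
        return compute  # serial run: no communication at all
    # Assume roughly cubic subdomains, so the halo scales with volume^(2/3).
    halo_faces = 6.0 * cells_per_core ** (2.0 / 3.0)
    exchange = n_exchanges * (latency + t_face * halo_faces)
    # A tree-based global reduction costs about log2(n_cores) message latencies.
    reduction = n_reductions * latency * math.log2(n_cores)
    return compute + exchange + reduction


def best_core_count(n_cells, candidates, latency):
    """Core count from `candidates` with the smallest modelled step time."""
    return min(candidates, key=lambda n: step_time(n_cells, n, latency))


if __name__ == "__main__":
    cores = [16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
    for name, latency in [("slow network", 5e-5), ("fast network", 2e-6)]:
        best = best_core_count(9_000_000, cores, latency)
        print(f"{name}: modelled optimum at {best} cores")
```

The compute term falls as 1/N while the reduction term grows with log2(N), so the modelled optimum moves to fewer cores as message latency rises. Real clusters add contention and load imbalance that this sketch ignores.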
|
October 15, 2010, 01:06 |
|
#5 | |
Senior Member
Alberto Passalacqua
Join Date: Mar 2009
Location: Ames, Iowa, United States
Posts: 1,912
Rep Power: 36 |
Quote:
Best,
__________________
Alberto Passalacqua GeekoCFD - A free distribution based on openSUSE 64 bit with CFD tools, including OpenFOAM. Available as in both physical and virtual formats (current status: http://albertopassalacqua.com/?p=1541) OpenQBMM - An open-source implementation of quadrature-based moment methods. To obtain more accurate answers, please specify the version of OpenFOAM you are using. |
||
October 18, 2010, 08:52 |
|
#6 |
Senior Member
Eugene de Villiers
Join Date: Mar 2009
Posts: 725
Rep Power: 21 |
We regularly run LES on large meshes with large numbers of CPUs with excellent speedup. Some things to keep in mind:
• Beyond a certain number of CPUs, you need to move to InfiniBand or a similar interconnect. Gigabit Ethernet just won't hack it. Where the switch needs to occur depends on the case size, CPU speed and many other factors, but as a rule of thumb, I would say anything above 32 cores requires InfiniBand.
• Decomposition matters. If you can use a simpler decomposition like hierarchical, do. Try to keep the number of processor boundaries to a minimum (within reason). I suggest you experiment with different decompositions like (16 2 1), (8 4 1), etc. It can make a really massive difference.
• Check that the slow-down is not due to some kind of disk activity, NFS or similar bottleneck. If you have function objects or similar that read/write to disk a lot, or have your case on a slow disk, you might want to distribute your case so that each processor's data set/mesh is local to the node it is being used on. (Check the distributed keyword in decomposeParDict and the manual entry on decomposePar.)
• If you have an InfiniBand network, you either have to relink Pstream against an MPI that supports the OFED hardware stack or recompile OpenMPI to support InfiniBand, otherwise your InfiniBand will be wasted.
Hope this helps. |
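For experimenting with splits such as (16 2 1) and (8 4 1), a small Python helper can list every candidate for a given core count and print the matching dictionary entries. This is a sketch: the function names are made up for illustration, and the keyword layout (`numberOfSubdomains`, `method`, `hierarchicalCoeffs` with `n`, `delta`, `order`) follows what is commonly seen in OpenFOAM tutorials, so check it against the version you are running.

```python
def hierarchical_candidates(n_cores):
    """All ordered (nx, ny, nz) triples whose product equals n_cores."""
    triples = []
    for nx in range(1, n_cores + 1):
        if n_cores % nx:
            continue
        rest = n_cores // nx
        for ny in range(1, rest + 1):
            if rest % ny == 0:
                triples.append((nx, ny, rest // ny))
    return triples


def decompose_par_dict(n):
    """Format the body of a decomposeParDict for one (nx, ny, nz) split.

    The FoamFile header is omitted; the keyword layout is an assumption
    to be verified against your own OpenFOAM version.
    """
    nx, ny, nz = n
    return (
        f"numberOfSubdomains {nx * ny * nz};\n"
        "method          hierarchical;\n"
        "hierarchicalCoeffs\n"
        "{\n"
        f"    n           ({nx} {ny} {nz});\n"
        "    delta       0.001;\n"
        "    order       xyz;\n"
        "}\n"
    )


if __name__ == "__main__":
    for split in hierarchical_candidates(32):
        print(split)
    print(decompose_par_dict((16, 2, 1)))
```

Timing a few time steps for each candidate is then a matter of running decomposePar and the solver once per split.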
|
October 18, 2010, 15:49 |
|
#7 | ||
Senior Member
Alberto Passalacqua
Join Date: Mar 2009
Location: Ames, Iowa, United States
Posts: 1,912
Rep Power: 36 |
Quote:
Quote:
Best, |
|||
October 18, 2010, 17:44 |
|
#8 |
Senior Member
Eugene de Villiers
Join Date: Mar 2009
Posts: 725
Rep Power: 21 |
Honestly, I haven't tried it. What I have read about scotch so far is that it produces decompositions similar to metis. For large numbers of cpus, this kind of approach simply doesn't cut the mustard. You end up with too many processors connected to too many others and parallel efficiency suffers. Somewhere there is an optimum between number of inter-processor connections vs. number of processor faces. You can see this easily by comparing a hierarchical decomposition like (128 1 1) with (64 2 1) and (8 4 4). The best performance will not be (128 1 1) or (8 4 4). (128 1 1) has a very large (processor face)/cell ratio, but the smallest number of (processor boundaries)/cell. For most cases (8 4 4) will be at the other extreme. Both are a disaster in terms of scalability - I have seen (64 2 1) run twice as fast as (128 1 1), (8 4 4) is even worse than (128 1 1). Extreme domain shapes probably influence matrix solvers as well. I must stress that this is all highly situational. If the number of CPUs is small, decomposition doesn't really matter. Cells/proc also affect scalability a lot.
Some kind of ultimate "self-optimising", hardware and algorithm aware decomposition would make a very cool Ph.D. project. At its simplest, you could just use dynamic load balancing techniques to optimise hierarchical decomposition coefficients at run time. Beyond this, you could look into profiling Pstream communication and developing decomposition methods that can be configured to perform best given a particular set of algorithms. After working on the parallel hierarchical algorithm to allow snappyHexMesh to do dynamic load balancing, I was very interested in developing something like this. Unfortunately, it turned out to be rather difficult and there were more pressing matters to attend to. We can only hope that someone with more time, energy and bright ideas will come along to save us from the current crop of sub-optimal methods. |
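The two competing quantities described above (processor faces versus processor-to-processor connections) can be counted exactly for an idealised case. The Python sketch below assumes a structured box mesh cut by straight planes; the 400 x 150 x 150 mesh is a made-up 9M-cell example, and the function name is hypothetical. It only counts interfaces and says nothing about which split actually runs fastest.

```python
def box_decomposition_metrics(mesh, split):
    """Interface statistics for a structured box mesh cut into a regular grid.

    mesh  = (Nx, Ny, Nz) cells per direction (hypothetical box mesh)
    split = (px, py, pz) subdomains per direction, as in hierarchical `n`
    Each interface is counted once, not once per side.
    """
    (nx, ny, nz), (px, py, pz) = mesh, split
    # Each of the (p - 1) cutting planes in a direction exposes one full
    # cross-section of the mesh as processor faces.
    processor_faces = ((px - 1) * ny * nz
                       + (py - 1) * nx * nz
                       + (pz - 1) * nx * ny)
    # Number of neighbouring subdomain pairs that share a face.
    connections = ((px - 1) * py * pz
                   + px * (py - 1) * pz
                   + px * py * (pz - 1))
    # An interior subdomain has up to two neighbours per split direction.
    max_neighbours = sum(min(2, p - 1) for p in split)
    return {
        "processor_faces": processor_faces,
        "faces_per_cell": processor_faces / (nx * ny * nz),
        "connections": connections,
        "max_neighbours": max_neighbours,
    }


if __name__ == "__main__":
    mesh = (400, 150, 150)  # 9M cells, made-up dimensions
    for split in [(128, 1, 1), (64, 2, 1), (8, 4, 4)]:
        print(split, box_decomposition_metrics(mesh, split))
```

For this box, (128 1 1) has the most processor faces and the fewest connections, (8 4 4) the reverse, and (64 2 1) sits in between, which matches the two extremes described in the post.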
|
October 19, 2010, 08:31 |
|
#9 |
Senior Member
Thomas Jung
Join Date: Mar 2009
Posts: 102
Rep Power: 17 |
There are tons of points... One other, perhaps trivial, thing not yet mentioned, which I just found out: MPICH on our cluster was not configured to use shared-memory communication, so it was going through the loopback device. I found that in some cases I can gain a lot of speed by using shared-memory communication instead. I don't know why, but a configuration without shared memory seems to be the default in MPICH...
|
|
October 19, 2010, 09:15 |
|
#10 |
Member
Simon Lapointe
Join Date: May 2009
Location: Québec, Qc, Canada
Posts: 33
Rep Power: 17 |
I've been running OpenFOAM on large meshes and high number of CPUs (up to 512) and the speedup was quite good. As it has been mentioned earlier, an Infiniband connection is necessary to achieve good performance on large parallel cases and we've also found that linking Pstream against the system compiled OpenMPI library supporting Infiniband makes a huge difference.
Concerning the distribution method, I've always used metis and obtained satisfactory results (my cases are mostly 3D airfoils). Eugene's post suggesting to use hierarchical decomposition if possible seems interesting and I might try it (along with scotch) in the near future. I'm curious about the input of other members on this topic. |
|
October 21, 2010, 06:01 |
|
#11 |
Member
Flavio Galeazzo
Join Date: Mar 2009
Location: Karlsruhe, Germany
Posts: 34
Rep Power: 18 |
My experience with large cases is very close to Simon's. I have run LES cases of up to 10 million nodes on up to 256 cores, with parallel efficiency around 85%, always using Metis as the decomposition method. The machine has an InfiniBand interconnect, and I compiled OpenFOAM against the system-compiled OpenMPI.
For smaller cases, using grids of up to 2 million nodes on a Linux cluster with Gigabit Ethernet, I got good scalability only up to 16-20 cores (4-5 machines). |
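For anyone comparing numbers like the 85% quoted above, parallel efficiency is usually computed relative to a reference run. A minimal Python sketch follows; the function names are illustrative, and the timings in the usage lines are invented, not taken from any run in this thread.

```python
def speedup(t_ref, t_n):
    """How many times faster the n-core run is than the reference run."""
    return t_ref / t_n


def parallel_efficiency(t_ref, n_ref, t_n, n):
    """Efficiency of a run on n cores relative to a reference run on n_ref cores.

    t_ref and t_n are wall-clock times for the same amount of simulated work.
    1.0 means ideal scaling from the reference run.
    """
    return (t_ref * n_ref) / (t_n * n)


if __name__ == "__main__":
    # Invented timings: 100 s on 1 core, 2 s on 64 cores.
    print(speedup(100.0, 2.0), parallel_efficiency(100.0, 1, 2.0, 64))
    # Invented timings where doubling the cores makes the run slower.
    print(parallel_efficiency(10.0, 16, 12.0, 32))
```

An efficiency below 0.5 after doubling the core count means the larger run is slower in absolute terms, which is the situation reported earlier in the thread for 16 versus 32 cores.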
|
October 21, 2010, 12:22 |
|
#12 |
Senior Member
Alberto Passalacqua
Join Date: Mar 2009
Location: Ames, Iowa, United States
Posts: 1,912
Rep Power: 36 |
We have the same experience you had, Flavio, with our LES of micro-reactors (>= 10^6 cells) using metis/scotch (scotch actually seems slightly better, even if the difference with respect to metis is not dramatic).
Compiling against MPI libraries optimized for the architecture is key, of course. |
|
October 24, 2010, 01:19 |
|
#13 |
Member
Andy Jones
Join Date: Sep 2010
Location: Columbus, Ohio
Posts: 78
Rep Power: 16 |
Hello
You might consider the Scalasca diagnostic toolset. I am unsure which HPC platforms are supported, but Cray XT and IBM Blue Gene are. There is also Kojak, the precursor to Scalasca, which runs on more systems. Both give exhaustive information on bottlenecks, problems and system performance, complete with screenshots/charts/logs.
Scalasca: http://www.fz-juelich.de/jsc/scalasca/overview/
Kojak: http://www.fz-juelich.de/jsc/kojak/platforms/
Kojak Supported Platforms
• Instrumentation, Measurement, and Analysis
◦ Linux IA-32, IA-64, and EM64T/x86_64 clusters with GNU, PGI, or Intel compilers
◦ IBM Power3 / Power4 / Power5 / Power6 based clusters
◦ SGI MIPS based clusters (O2k, O3k)
◦ SGI IA-64 based clusters (Altix)
◦ SUN Solaris Sparc and x86 based clusters
◦ DEC/HP Alpha based clusters
◦ Generic UNIX workstation (clusters)
• Instrumentation and Measurement only
◦ IBM BlueGene/L and BlueGene/P
◦ Cray T3E, XD1 and X1, XT3, XT4
◦ SiCortex
◦ NEC SX
◦ Hitachi SR-8000
I do not know anything about the learning curve or the install. It's at least worth a glance. |
|
January 10, 2011, 20:45 |
|
#14 |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Good discussion, thank you all. I will try these suggestions and keep you posted.
One of the main reminders for me is NFS write speed. I will try to distribute the data. |
|
January 18, 2011, 00:11 |
|
#15 | |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Quote:
PFLAGS = -DOMPI_SKIP_MPICXX
PINC = -I$(MPI_ARCH_PATH)/include
PLIBS = -L$(MPI_ARCH_PATH)/lib -lmpi
Is this setting ok, or what? Thanks |
||
January 18, 2011, 00:28 |
|
#16 |
Senior Member
Alberto Passalacqua
Join Date: Mar 2009
Location: Ames, Iowa, United States
Posts: 1,912
Rep Power: 36 |
Also, take a look at the study presented at the Open Source CFD Conference 2010:
G. Shainer et al., OpenFOAM optimizations for Scale
It might give some information of interest to you. |
|
January 18, 2011, 01:12 |
|
#17 |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Gotcha, thanks, I am testing...
The problem, which I am not sure about, is this: in our high-performance computing center there are different kinds of nodes, and it seems not all of the compute nodes use InfiniBand. Some compute nodes are quite old. I am wondering whether it is possible to apply for the NSF compute nodes at Illinois.
Also, I am not clear about the disk activity. My jobs are submitted via the SGE job-management system, and I do not have the right to access the compute nodes, which means ssh computing.node.XXX.edu doesn't work. So I am wondering: when you use a job-management system like SGE, how do you set the "root" directories so that the data is distributed?
See my PLIBS now,
Code:
[wei@opteron]$ echo $PLIBS
-pthread -L/afs/crc.edu/x86_64_linux/openmpi/1.3.2/gnu/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
Thanks, |
|
January 18, 2011, 02:01 |
|
#18 |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Just noticed that they were using:
– Six-Core Intel X5670 @ 2.93 GHz CPUs
– Memory: 24GB per node
– OS: CentOS5U4, OFED 1.5.1 InfiniBand SW stack
Meanwhile I'm kind of frustrated, because mine is either
32 HP DL160 G6 servers, Dual Quad-Core 2.27 GHz L5520 Intel Nehalem nodes (8 cores per node, 256 total cores), 12 GB RAM each
or
393 HP DL165 G6 servers, Dual Six-Core 2.4 GHz AMD Opteron Model 2431 64/32 bit (12 cores per node, 4716 total cores), 12 GB RAM, 1 x 160 GB SATA disk
So yesterday.. |
|
March 4, 2011, 15:21 |
|
#19 |
Senior Member
Daniel WEI (老魏)
Join Date: Mar 2009
Location: Beijing, China
Posts: 689
Blog Entries: 9
Rep Power: 21 |
Hello all,
Some findings and updates; I hope this can help non-professionals like me. InfiniBand support is critical for speed: to run a mid-size case that needs many nodes, it is strongly advised to build the code against the InfiniBand libraries.
But I also have some further questions:
1. How many grid points do you usually allocate per CPU?
2. I am still not clear on what makes hierarchical a better option than metis. Are you aware of any general rules, or do you have any experience where it is far better than metis? If not, I'm going to stay with metis.
Thanks |
|
March 9, 2011, 04:35 |
|
#20 |
Member
Flavio Galeazzo
Join Date: Mar 2009
Location: Karlsruhe, Germany
Posts: 34
Rep Power: 18 |
I have the same experience as you about Infiniband, Daniel. It is crucial to get good speed up with more than 4-5 machines.
I normally allocate the nodes for simulation aiming for 1 second per time step, which is a good value in the system I work with (similar to your "old" cluster). The number of grid points per node depends largely on the complexity of the solver. I can allocate more grid points with a less complex solver, say an incompressible LES, than with a complex reacting flow solver. |
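As a rough reference for the cells-per-core question, the Python sketch below simply divides the mesh sizes and core counts quoted in the posts above. The sizes were given as approximate ("around 2M", "about 8 million"), grid nodes are treated as equivalent to cells, and the labels and function name are my own.

```python
# (label, approximate cells or grid points, cores) as quoted in the posts above.
REPORTED_RUNS = [
    ("Daniel, 2M mesh", 2_000_000, 24),
    ("Daniel, 2M mesh", 2_000_000, 48),
    ("Daniel, 9M mesh (stalled)", 9_000_000, 96),
    ("Daniel, 9M mesh (stalled)", 9_000_000, 128),
    ("Vincent, 8M mesh", 8_000_000, 16),
    ("Vincent, 8M mesh (slower)", 8_000_000, 32),
    ("Flavio, 10M, InfiniBand", 10_000_000, 256),
    ("Flavio, 2M, Gigabit Ethernet", 2_000_000, 16),
    ("Flavio, 2M, Gigabit Ethernet", 2_000_000, 20),
]


def cells_per_core(n_cells, n_cores):
    """Average number of cells assigned to each core."""
    return n_cells / n_cores


if __name__ == "__main__":
    for label, n_cells, n_cores in REPORTED_RUNS:
        load = cells_per_core(n_cells, n_cores)
        print(f"{label:<30} {n_cores:>4} cores {load:>10.0f} per core")
```

By this count the 9M runs that stalled carried roughly as many cells per core as the 2M runs that worked, so cells per core alone does not seem to explain the slowdown. The interconnect and decomposition points raised earlier look like the more likely factors.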