Large case parallel efficiency

lakeat · October 14, 2010, 12:08

Hi foamers,

I feel mad about my extremely slow parallel computing efficiency.
They are unsteady 3D LES case. incompressible external flow.

When the grid number is around 2M, it works fine, I use 24 or 48 cpus and looks not bad, but when the grid number is around 9M, I try to use 128 cpu or 96 cpu, but the simulation just did not move for a quite long time, (more than a week).

so dear all, what's your idea, and what is your suggestion and your experience.

Any ideas and advice would be highly appreciated!

vinz · October 14, 2010, 12:14

The last computations I did with about 8 million cells using a solver derived from rhoSimpleFoam were running slower on 32cores than on 16 so I decided to stick to 16.
However, I would be really happy to know how to improve this kind of behaviour...

Vincent

lakeat · October 14, 2010, 12:24

Quote:

Originally Posted by vinz

The last computations I did with about 8 million cells using a solver derived from rhoSimpleFoam were running slower on 32cores than on 16 so I decided to stick to 16.
However, I would be really happy to know how to improve this kind of behaviour...

Vincent

Yes, i met the same situation, increase cpus but the speed is decreased, even decreased a lot.

Canesin · October 14, 2010, 13:29

It is related with the needed for communication...

Every nem devision added to a domain means new synchronization is needed at boundaries ... The speed you earn is a compromise between increase in computational power (more cores) and increase in communication ...
In the case of adding every time more cores.. you need to know that the gain of communication will some time overtake the gain in computational power..

What can be done ??

The first solution is to improve the communication, better network and bypass the kernel .. the kernel generates overhead in communication using TCP/IP .. so you should use something like Infiniband or Myrinet ..

The second solution is to improve the locality of your problem, maybe decrease the number of domain subdivision by per compute node and them solve the linear system parallel in each compute node...

Hope it helps..

Fábio C. Canesin

alberto · October 15, 2010, 01:06

Quote:

Originally Posted by lakeat

Hi foamers,

I feel mad about my extremely slow parallel computing efficiency.
They are unsteady 3D LES case. incompressible external flow.

When the grid number is around 2M, it works fine, I use 24 or 48 cpus and looks not bad, but when the grid number is around 9M, I try to use 128 cpu or 96 cpu, but the simulation just did not move for a quite long time, (more than a week).

so dear all, what's your idea, and what is your suggestion and your experience.

Any ideas and advice would be highly appreciated!

It depends on a lot of factors, so at least you should say what method did you use to decompose your case, what architecture (multicore? how do CPU/nodes are distributed, meaning how many cores per node?) you are running on, and also if you are use the version of OpenMPI provided by OpenCFD or you recompiled OpenFOAM against the system libraries. Also, what compiler did you use?

Best,

eugene · October 18, 2010, 08:52

We regularly run LES on large meshes with large numbers of CPUs with excellent speedup. Some things to keep in mind:

Beyond a certain number of CPUs, you need to move to infiniband or similar interconnect. Gigeth just wont hack it. Where the switch needs to occur depends on the case size, cpu speed and many other factors, but as rule of thumb, I would say anything above 32 cores requires infiniband.

Decomposition matters. If you can use a simpler decomposition like hierarchical, do. Try to keep the number of processor boundaries to a minimum (within reason). I suggest you experiment with different decompositions like (16 2 1), (8 4 1), etc. It can make a really massive difference.

Check that the slow-down is not due to some kind of disk activity, nfs or similar bottle-neck. If you have function objects or similar that read/write to disk a lot or have your case on a slow disk, you might want to distribute your case so each processor data set/mesh is local to the node it is being used on. (Check the distributed key word in decomposeParDict and the manual entry on decomposePar)

If you have an infiniband network, you either have to relink Pstream against an MPI that supports the OFED hardware stack or recompile OpenMPI to support infiniband, otherwise your infiniband will be wasted.

Hope this helps.

alberto · October 18, 2010, 15:49

Quote:

Originally Posted by eugene

We regularly run LES on large meshes with large numbers of CPUs with excellent speedup.

Yes, same experience here with our LES on micro-reactors.

Quote:

Decomposition matters. If you can use a simpler decomposition like hierarchical, do.

Out of curiosity, what is your experience with scotch? We had interesting results with it too, but it would be helpful to know other experiences.

Best,

eugene · October 18, 2010, 17:44

Honestly, I haven't tried it. What I have read about scotch so far is that it produces decompositions similar to metis. For large numbers of cpus, this kind of approach simply doesn't cut the mustard. You end up with too many processors connected to too many others and parallel efficiency suffers. Somewhere there is an optimum between number of inter-processor connections vs. number of processor faces. You can see this easily by comparing a hierarchical decomposition like (128 1 1) with (64 2 1) and (8 4 4). The best performance will not be (128 1 1) or (8 4 4). (128 1 1) has a very large (processor face)/cell ratio, but the smallest number of (processor boundaries)/cell. For most cases (8 4 4) will be at the other extreme. Both are a disaster in terms of scalability - I have seen (64 2 1) run twice as fast as (128 1 1), (8 4 4) is even worse than (128 1 1). Extreme domain shapes probably influence matrix solvers as well. I must stress that this is all highly situational. If the number of CPUs is small, decomposition doesn't really matter. Cells/proc also affect scalability a lot.

Some kind of ultimate "self-optimising", hardware and algorithm aware decomposition would make a very cool Ph.D. project. At its simplest, you could just use dynamic load balancing techniques to optimise hierarchical decomposition coefficients at run time. Beyond this, you could look into profiling Pstream communication and developing decomposition methods that can be configured to perform best given a particular set of algorithms.
After working on the parallel hierarchical algorithm to allow snappyHexMesh to do dynamic load balancing, I was very interested in developing something like this. Unfortunately, it turned out to be rather difficult and there were more pressing matters to attend too. We can only hope that someone with more time, energy and bright ideas will come along to save us from the current crop of sub-optimal methods.

tehache · October 19, 2010, 08:31

there are tons of points... one other, perhaps trivial, but not yet mentioned thing I just found out: mpich on our cluster was not configured to use shared memory communication, thus using loopback device. I found that in some cases I can gain a lot of speed over this using the shared memory communication. Dont know why, but configuration without shared memory enabled seems to be default in mpich...

Simon Lapointe · October 19, 2010, 09:15

I've been running OpenFOAM on large meshes and high number of CPUs (up to 512) and the speedup was quite good. As it has been mentioned earlier, an Infiniband connection is necessary to achieve good performance on large parallel cases and we've also found that linking Pstream against the system compiled OpenMPI library supporting Infiniband makes a huge difference.

Concerning the distribution method, I've always used metis and obtained satisfactory results (my cases are mostly 3D airfoils). Eugene's post suggesting to use hierarchical decomposition if possible seems interesting and I might try it (along with scotch) in the near future.

I'm curious about the input of other members on this topic.

flavio_galeazzo · October 21, 2010, 06:01

My experience with large cases is very close to Simon one. I have run LES cases up to 10 million nodes on up to 256 cores, with parallel eficiency around 85%, always using Metis as decomposition strategy. The machine has Infiniband interconnect, and I have compiled OpenFoam with the system compiled OpenMPI.

About smaller cases, using grids up to 2 million nodes and a Linux cluster with gigabit ethernet, I got good scalability only up to 16-20 cores (4-5 machines).

alberto · October 21, 2010, 12:22

We have the same experience you had Flavio, with out LES on micro-reactors (>= 10^6 cells) using metis/scotch (scotch is actually slightly better it seems, even if the difference with respect to metis is not amazing).

Compiling against MPI libraries optimized for the architecture is key of course.

andyj · October 24, 2010, 01:19

Hello
You might consider the Scalasca Diagnostic Toolset. I am unsure what HPC formats are supported, but Cray Xt and IBM Blue Gene are.
There is also Kojak, the precursor to Scalasca, which runs on more systems. Both give exhaustive info on bottlenecks and problems and system performance, complete with screenshots/charts/logs.
http://www.fz-juelich.de/jsc/scalasca/overview/

Kojak:
http://www.fz-juelich.de/jsc/kojak/platforms/
Kojak Supported Platforms
•Instrumentation, Measurement, and Analysis
◦Linux IA-32, IA-64, and EM64T/x86_64 clusters with GNU, PGI, or Intel compilers
◦IBM Power3 / Power4 / Power5 / Power6 based clusters
◦SGI Mips based clusters (O2k, O3k)
◦SGI IA-64 based clusters (Altix)
◦SUN Solaris Sparc and x86 based clusters
◦DEC/HP Alpha based clusters
◦Generic UNIX workstation (clusters)
•Instrumentation and Measurement only
◦IBM BlueGene/L and BlueGene/P
◦Cray T3E, XD1 and X1, XT3, XT4
◦SiCortex
◦NEC SX
◦Hitachi SR-8000

I do not know anything about the learning curve or install. Its at least worth a glance.

--------------------------------------------------------------------------------

lakeat · January 10, 2011, 20:45

Good discussions, thank you all, I will try and keep you posted.

One of the major reminding for me is nfs writing speed. I will try to distribute the data.

lakeat · January 18, 2011, 00:11

Quote:

Originally Posted by Simon Lapointe

I've been running OpenFOAM on large meshes and high number of CPUs (up to 512) and the speedup was quite good. As it has been mentioned earlier, an Infiniband connection is necessary to achieve good performance on large parallel cases and we've also found that linking Pstream against the system compiled OpenMPI library supporting Infiniband makes a huge difference.

Concerning the distribution method, I've always used metis and obtained satisfactory results (my cases are mostly 3D airfoils). Eugene's post suggesting to use hierarchical decomposition if possible seems interesting and I might try it (along with scotch) in the near future.

I'm curious about the input of other members on this topic.

Dear Simon, could you tell me what do you mean by saying "linking Pstream against the system compiled OpenMPI"?

1 PFLAGS = -DOMPI_SKIP_MPICXX
2 PINC = -I$(MPI_ARCH_PATH)/include
3 PLIBS = -L$(MPI_ARCH_PATH)/lib -lmpi

Is this setting ok, or what?

Thanks

alberto · January 18, 2011, 00:28

Also, take a look at the study presented at the Open Source CFD Conference 2010:

G. Shainer et al., OpenFOAM optimizations for Scale

They might give some information of interest for you.

lakeat · January 18, 2011, 01:12

Gotcha, Thanks, I am testing..

The problem is , which I am not sure about:
I see in our high performance center, there are different nodes, it seems not all the compute nodes are using Infiniband. Some computing nodes are quite old. I am wondering if it is possible to apply the nsf computing node at Illinois.

And also, concerning the disk activity, I am not clear. My job are submitted via SGE management system, I do not have the right to access the computing node, which means ssh computing.node.XXX.edu, doesn't work. So I am wondering, when you are using job-management system like SGE, how did you set the "root" directories? So to let the data distributed??

See my PLIBS now,

Code:

[wei@opteron]$ echo $PLIBS
-pthread -L/afs/crc.edu/x86_64_linux/openmpi/1.3.2/gnu/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

Now, after re-setting the PFLAGS, PINC, and PLIBS, I then just wclean and recompile the src/Pstreams lib, So is this okay?

Thanks,

lakeat · January 18, 2011, 02:01

Just noticed that they were using
–
Six-Core Intel X5670 @ 2.93 GHz CPUs
–
Memory: 24GB per node
–
OS: CentOS5U4, OFED 1.5.1 InfiniBand SW stack

While Im kind of frustrated, for mine is
32 HP DL160 G6 servers
Dual Quad-Core, 2.27 GHz L5520 Intel Nehalem nodes (8 cores per node, 256 total cores), 12 GB RAM each
or
393 HP DL165 G6 servers
Dual Six-Core 2.4 GHz AMD Opteron Model 2431 64/32 bit (12 cores per node, 4716 total cores), 12 GB RAM, 1 x 160 GB SATA Disk

So yesterday..

lakeat · March 4, 2011, 15:21

Hello all,

Some findings and updates, hope this can help those non-professionals like me.

Infiniband support is critical for speed, to run a mid size case that need many nodes, it is strongly adviced to build the code against infiniband libs.

But I also got another question,
1. Usually how many grid points you guys allocate for each cpu?

2. I am still not clear how you make Hierarchical a better option than metis. Are you aware of any general rules, or do you have any experience that it is super better than metis. If not, im gonna stay with metis.

Thanks

flavio_galeazzo · March 9, 2011, 04:35

I have the same experience as you about Infiniband, Daniel. It is crucial to get good speed up with more than 4-5 machines.

I normally allocate the nodes for simulation aiming for 1 second per time step, which is a good value in the system I work with (similar to your "old" cluster). The number of grid points per node depends largely on the complexity of the solver. I can allocate more grid points with a less complex solver, say an incompressible LES, than with a complex reacting flow solver.

October 14, 2010, 12:08	Large case parallel efficiency	#1
lakeat Senior Member Daniel WEI (老魏) Join Date: Mar 2009 Location: Beijing, China Posts: 689 Blog Entries: 9 Rep Power: 21	Hi foamers, I feel mad about my extremely slow parallel computing efficiency. They are unsteady 3D LES case. incompressible external flow. When the grid number is around 2M, it works fine, I use 24 or 48 cpus and looks not bad, but when the grid number is around 9M, I try to use 128 cpu or 96 cpu, but the simulation just did not move for a quite long time, (more than a week). so dear all, what's your idea, and what is your suggestion and your experience. Any ideas and advice would be highly appreciated! __________________ ~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China Email

October 14, 2010, 13:29		#4
Canesin Member Fábio César Canesin Join Date: Mar 2010 Location: Florianópolis Posts: 67 Rep Power: 16	It is related with the needed for communication... Every nem devision added to a domain means new synchronization is needed at boundaries ... The speed you earn is a compromise between increase in computational power (more cores) and increase in communication ... In the case of adding every time more cores.. you need to know that the gain of communication will some time overtake the gain in computational power.. What can be done ?? The first solution is to improve the communication, better network and bypass the kernel .. the kernel generates overhead in communication using TCP/IP .. so you should use something like Infiniband or Myrinet .. The second solution is to improve the locality of your problem, maybe decrease the number of domain subdivision by per compute node and them solve the linear system parallel in each compute node... Hope it helps.. Fábio C. Canesin lakeat, deji, zhulianhua and 5 others like this.

October 18, 2010, 08:52		#6
eugene Senior Member Eugene de Villiers Join Date: Mar 2009 Posts: 725 Rep Power: 21	We regularly run LES on large meshes with large numbers of CPUs with excellent speedup. Some things to keep in mind: Beyond a certain number of CPUs, you need to move to infiniband or similar interconnect. Gigeth just wont hack it. Where the switch needs to occur depends on the case size, cpu speed and many other factors, but as rule of thumb, I would say anything above 32 cores requires infiniband. Decomposition matters. If you can use a simpler decomposition like hierarchical, do. Try to keep the number of processor boundaries to a minimum (within reason). I suggest you experiment with different decompositions like (16 2 1), (8 4 1), etc. It can make a really massive difference. Check that the slow-down is not due to some kind of disk activity, nfs or similar bottle-neck. If you have function objects or similar that read/write to disk a lot or have your case on a slow disk, you might want to distribute your case so each processor data set/mesh is local to the node it is being used on. (Check the distributed key word in decomposeParDict and the manual entry on decomposePar) If you have an infiniband network, you either have to relink Pstream against an MPI that supports the OFED hardware stack or recompile OpenMPI to support infiniband, otherwise your infiniband will be wasted. Hope this helps. lakeat, bennn, zhulianhua and 8 others like this.

October 18, 2010, 17:44		#8
eugene Senior Member Eugene de Villiers Join Date: Mar 2009 Posts: 725 Rep Power: 21	Honestly, I haven't tried it. What I have read about scotch so far is that it produces decompositions similar to metis. For large numbers of cpus, this kind of approach simply doesn't cut the mustard. You end up with too many processors connected to too many others and parallel efficiency suffers. Somewhere there is an optimum between number of inter-processor connections vs. number of processor faces. You can see this easily by comparing a hierarchical decomposition like (128 1 1) with (64 2 1) and (8 4 4). The best performance will not be (128 1 1) or (8 4 4). (128 1 1) has a very large (processor face)/cell ratio, but the smallest number of (processor boundaries)/cell. For most cases (8 4 4) will be at the other extreme. Both are a disaster in terms of scalability - I have seen (64 2 1) run twice as fast as (128 1 1), (8 4 4) is even worse than (128 1 1). Extreme domain shapes probably influence matrix solvers as well. I must stress that this is all highly situational. If the number of CPUs is small, decomposition doesn't really matter. Cells/proc also affect scalability a lot. Some kind of ultimate "self-optimising", hardware and algorithm aware decomposition would make a very cool Ph.D. project. At its simplest, you could just use dynamic load balancing techniques to optimise hierarchical decomposition coefficients at run time. Beyond this, you could look into profiling Pstream communication and developing decomposition methods that can be configured to perform best given a particular set of algorithms. After working on the parallel hierarchical algorithm to allow snappyHexMesh to do dynamic load balancing, I was very interested in developing something like this. Unfortunately, it turned out to be rather difficult and there were more pressing matters to attend too. We can only hope that someone with more time, energy and bright ideas will come along to save us from the current crop of sub-optimal methods. lakeat, nisha, vishwakarma and 9 others like this.

October 19, 2010, 08:31		#9
tehache Senior Member Thomas Jung Join Date: Mar 2009 Posts: 102 Rep Power: 17	there are tons of points... one other, perhaps trivial, but not yet mentioned thing I just found out: mpich on our cluster was not configured to use shared memory communication, thus using loopback device. I found that in some cases I can gain a lot of speed over this using the shared memory communication. Dont know why, but configuration without shared memory enabled seems to be default in mpich... lakeat likes this.

October 14, 2010, 12:14		#2
vinz Senior Member Vincent RIVOLA Join Date: Mar 2009 Location: France Posts: 283 Rep Power: 18	The last computations I did with about 8 million cells using a solver derived from rhoSimpleFoam were running slower on 32cores than on 16 so I decided to stick to 16. However, I would be really happy to know how to improve this kind of behaviour... Vincent

October 19, 2010, 09:15		#10
Simon Lapointe Member Simon Lapointe Join Date: May 2009 Location: Québec, Qc, Canada Posts: 33 Rep Power: 17	I've been running OpenFOAM on large meshes and high number of CPUs (up to 512) and the speedup was quite good. As it has been mentioned earlier, an Infiniband connection is necessary to achieve good performance on large parallel cases and we've also found that linking Pstream against the system compiled OpenMPI library supporting Infiniband makes a huge difference. Concerning the distribution method, I've always used metis and obtained satisfactory results (my cases are mostly 3D airfoils). Eugene's post suggesting to use hierarchical decomposition if possible seems interesting and I might try it (along with scotch) in the near future. I'm curious about the input of other members on this topic. lakeat likes this.

October 21, 2010, 06:01		#11
flavio_galeazzo Member Flavio Galeazzo Join Date: Mar 2009 Location: Karlsruhe, Germany Posts: 34 Rep Power: 18	My experience with large cases is very close to Simon one. I have run LES cases up to 10 million nodes on up to 256 cores, with parallel eficiency around 85%, always using Metis as decomposition strategy. The machine has Infiniband interconnect, and I have compiled OpenFoam with the system compiled OpenMPI. About smaller cases, using grids up to 2 million nodes and a Linux cluster with gigabit ethernet, I got good scalability only up to 16-20 cores (4-5 machines).

October 21, 2010, 12:22		#12
alberto Senior Member Alberto Passalacqua Join Date: Mar 2009 Location: Ames, Iowa, United States Posts: 1,912 Rep Power: 36	We have the same experience you had Flavio, with out LES on micro-reactors (>= 10^6 cells) using metis/scotch (scotch is actually slightly better it seems, even if the difference with respect to metis is not amazing). Compiling against MPI libraries optimized for the architecture is key of course. __________________ Alberto Passalacqua GeekoCFD - A free distribution based on openSUSE 64 bit with CFD tools, including OpenFOAM. Available as in both physical and virtual formats (current status: http://albertopassalacqua.com/?p=1541) OpenQBMM - An open-source implementation of quadrature-based moment methods. To obtain more accurate answers, please specify the version of OpenFOAM you are using.

October 24, 2010, 01:19		#13
andyj Member Andy Jones Join Date: Sep 2010 Location: Columbus, Ohio Posts: 78 Rep Power: 16	Hello You might consider the Scalasca Diagnostic Toolset. I am unsure what HPC formats are supported, but Cray Xt and IBM Blue Gene are. There is also Kojak, the precursor to Scalasca, which runs on more systems. Both give exhaustive info on bottlenecks and problems and system performance, complete with screenshots/charts/logs. http://www.fz-juelich.de/jsc/scalasca/overview/ Kojak: http://www.fz-juelich.de/jsc/kojak/platforms/ Kojak Supported Platforms •Instrumentation, Measurement, and Analysis ◦Linux IA-32, IA-64, and EM64T/x86_64 clusters with GNU, PGI, or Intel compilers ◦IBM Power3 / Power4 / Power5 / Power6 based clusters ◦SGI Mips based clusters (O2k, O3k) ◦SGI IA-64 based clusters (Altix) ◦SUN Solaris Sparc and x86 based clusters ◦DEC/HP Alpha based clusters ◦Generic UNIX workstation (clusters) •Instrumentation and Measurement only ◦IBM BlueGene/L and BlueGene/P ◦Cray T3E, XD1 and X1, XT3, XT4 ◦SiCortex ◦NEC SX ◦Hitachi SR-8000 I do not know anything about the learning curve or install. Its at least worth a glance. --------------------------------------------------------------------------------

January 10, 2011, 20:45		#14
lakeat Senior Member Daniel WEI (老魏) Join Date: Mar 2009 Location: Beijing, China Posts: 689 Blog Entries: 9 Rep Power: 21	Good discussions, thank you all, I will try and keep you posted. One of the major reminding for me is nfs writing speed. I will try to distribute the data. __________________ ~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China Email

January 18, 2011, 00:28		#16
alberto Senior Member Alberto Passalacqua Join Date: Mar 2009 Location: Ames, Iowa, United States Posts: 1,912 Rep Power: 36	Also, take a look at the study presented at the Open Source CFD Conference 2010: G. Shainer et al., OpenFOAM optimizations for Scale They might give some information of interest for you. __________________ Alberto Passalacqua GeekoCFD - A free distribution based on openSUSE 64 bit with CFD tools, including OpenFOAM. Available as in both physical and virtual formats (current status: http://albertopassalacqua.com/?p=1541) OpenQBMM - An open-source implementation of quadrature-based moment methods. To obtain more accurate answers, please specify the version of OpenFOAM you are using.

January 18, 2011, 01:12		#17
lakeat Senior Member Daniel WEI (老魏) Join Date: Mar 2009 Location: Beijing, China Posts: 689 Blog Entries: 9 Rep Power: 21	Gotcha, Thanks, I am testing.. The problem is , which I am not sure about: I see in our high performance center, there are different nodes, it seems not all the compute nodes are using Infiniband. Some computing nodes are quite old. I am wondering if it is possible to apply the nsf computing node at Illinois. And also, concerning the disk activity, I am not clear. My job are submitted via SGE management system, I do not have the right to access the computing node, which means ssh computing.node.XXX.edu, doesn't work. So I am wondering, when you are using job-management system like SGE, how did you set the "root" directories? So to let the data distributed?? See my PLIBS now, Code: [wei@opteron]$ echo $PLIBS -pthread -L/afs/crc.edu/x86_64_linux/openmpi/1.3.2/gnu/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl Now, after re-setting the PFLAGS, PINC, and PLIBS, I then just wclean and recompile the src/Pstreams lib, So is this okay? Thanks, __________________ ~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China Email

January 18, 2011, 02:01		#18
lakeat Senior Member Daniel WEI (老魏) Join Date: Mar 2009 Location: Beijing, China Posts: 689 Blog Entries: 9 Rep Power: 21	Just noticed that they were using – Six-Core Intel X5670 @ 2.93 GHz CPUs – Memory: 24GB per node – OS: CentOS5U4, OFED 1.5.1 InfiniBand SW stack While Im kind of frustrated, for mine is 32 HP DL160 G6 servers Dual Quad-Core, 2.27 GHz L5520 Intel Nehalem nodes (8 cores per node, 256 total cores), 12 GB RAM each or 393 HP DL165 G6 servers Dual Six-Core 2.4 GHz AMD Opteron Model 2431 64/32 bit (12 cores per node, 4716 total cores), 12 GB RAM, 1 x 160 GB SATA Disk So yesterday.. __________________ ~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China Email

March 4, 2011, 15:21		#19
lakeat Senior Member Daniel WEI (老魏) Join Date: Mar 2009 Location: Beijing, China Posts: 689 Blog Entries: 9 Rep Power: 21	Hello all, Some findings and updates, hope this can help those non-professionals like me. Infiniband support is critical for speed, to run a mid size case that need many nodes, it is strongly adviced to build the code against infiniband libs. But I also got another question, 1. Usually how many grid points you guys allocate for each cpu? 2. I am still not clear how you make Hierarchical a better option than metis. Are you aware of any general rules, or do you have any experience that it is super better than metis. If not, im gonna stay with metis. Thanks __________________ ~ Daniel WEI ------------- Boeing Research & Technology - China Beijing, China Email

March 9, 2011, 04:35		#20
flavio_galeazzo Member Flavio Galeazzo Join Date: Mar 2009 Location: Karlsruhe, Germany Posts: 34 Rep Power: 18	I have the same experience as you about Infiniband, Daniel. It is crucial to get good speed up with more than 4-5 machines. I normally allocate the nodes for simulation aiming for 1 second per time step, which is a good value in the system I work with (similar to your "old" cluster). The number of grid points per node depends largely on the complexity of the solver. I can allocate more grid points with a less complex solver, say an incompressible LES, than with a complex reacting flow solver.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Postprocessing large data sets in parallel	evrikon	OpenFOAM Post-Processing	28	June 28, 2016 04:43
Superlinear speedup in OpenFOAM 13	msrinath80	OpenFOAM Running, Solving & CFD	18	March 3, 2015 06:36
Parelleling Efficiency	kassiotis	OpenFOAM	0	June 19, 2009 15:12
Parallel efficiency channel flow	maka	OpenFOAM Running, Solving & CFD	1	December 8, 2005 13:58
Post-processing of a large transient case	Flav	Siemens	2	September 28, 2004 07:19