Parallel Performance of Large Case

LESlie · September 2, 2013, 11:57

Dear Foamers,

After searching for a while on the forum about the performance of parallelization I decided to open this new thread to be able to specify better my problem.

I am doing a speed-up test of an industrial application of two phase flow. The mesh is 64 million cells, and the solver is compressibleInterFoam. Each node of the cluster contains 2 eight cores CPU Sandy Bridge clocked at 2.7 GHz (AVX), let 16 cores / node. Each node has 64 GB of memory, let 4 GB / core.

Respecting the rule of thumb of 50 k cells/processor, I have carried out the performance study on 1024 (62,500 cells/processor), 512 and 256 processors respectively using the scotch decomposition method. The real physical time to compute a same simulated period is 1.35, 1.035, and 1 (normalized by the physical time of using 256 processors) respectively, thus leaving the 1024 the least efficient configuration and the 256 the most efficient.

Any help or suggestion will be very appreciated.

Lieven · September 2, 2013, 19:14

Hi Leslie,

A few remarks:
1. If I were you, I would test with an even lower number of processors. From what I can see, you don't gain anything by increasing the number from 256 to 512 (ok, 3%, let's call this negligible) but you might have the same when increasing from 128 to 256.
2. Your 50 k cells/processor "rule of thumb" is strongly cluster dependent. For the cluster I'm using, this is in the order of 300 k cells/processor
3. Make sure the physical time is "sufficiently long" so that overhead like reading the grid, creating the fields etc. becomes negligible with respect to the simulation.
4. Also, in case you didn't do it, switch off all writing of data when you do the test. Your I/O speed might mess up your test

Cheers,

L

LESlie · September 3, 2013, 05:22

Hi Lieven, Thanks for the tips.

Actually the 50k cells/processor was verified on this cluster by a coarser mesh of 8 millions (exactly the same set-up) where 128 processors have proven to be the most efficient, which is quite reasonable.

Then I simply did a "refineMesh" to get the 64 million mesh. In fact going from 256 to 512, I LOSE 3%.

Quote:

Originally Posted by Lieven

1. If I were you, I would test with an even lower number of processors. From what I can see, you don't gain anything by increasing the number from 256 to 512 (ok, 3%, let's call this negligible) but you might have the same when increasing from 128 to 256.
2. Your 50 k cells/processor "rule of thumb" is strongly cluster dependent. For the cluster I'm using, this is in the order of 300 k cells/processor

wyldckat · September 3, 2013, 19:00

Greetings to all!

@Leslie: There are several important details that should be kept in mind, some of which I can think of right now:

Did you specifically compile OpenFOAM with the system's GCC, along with the proper compilation flags?
Along with the correct compilation flags, did you try to use GCC 4.5, or 4.6 or 4.7, for building OpenFOAM?
Which MPI are you using? Is it the one that the cluster uses?
Has the cluster been benchmarked with any specific dedicated tools, so that fine tuning was performed for it?
Is the cluster using any specific communication system, such as Infiniband?
1. And if so, is your job being properly launched using that particular communication system?
Which file sharing system is being used? For example, NFS v3 can do some damage to HPC communications...
Is processor affinity being used in the job options?
Have you tested more than one kind of decomposition method?
What about RAM and CPU characteristics? Because Sandy bridge covers a very broad array of possible processors, including features such as 4 memory channels. More specifically:
- RAM:
  1. How many slots are being used on the motherboards?
  2. What's the frequency of the memory modules? 1066, 1333, 1600 MHz?
  3. Are the memory modules and motherboard using ECC?
- CPU:
  1. Which exact model of CPU?
  2. 8 cores each CPU? Are you certain? Isn't it 8 threads in 4 cores (2 threads per core)?
When it comes to the core count distribution, 128/256/512/1024 cores, is this evenly distributed among all machines? Or does it allocate in order of arrival, therefore only allocating the minimum amount of machines?

I ended up writing a list longer than I was planning...
Anyway, all of these influence how things perform. And I didn't even mention specific BIOS configurations, since it's Intel CPUs

.

Best regards,
Bruno

LESlie · September 4, 2013, 09:15

Thanks a lot Bruno, these things will keep me busy for a while.

I will get back to the forum when I have the solution.

haakon · September 5, 2013, 10:12

I would, as an even more important factor than the list of wyldckat, put the attention to the pressure solver. My experience is as follows:

The multigrid solvers (GAMG) are quick (in therms of walltime), but does not at all scale well in parallel. They require around 100k cells/process for the parallel efficiency to be acceptable.
The conjugate gradient solvers (PCG) are slow in terms of walltime, but scale extremely well in parallel. As low as 10k cells/process can be effective.

In conclusion, my experience is that the fastest way to get results is the PCG solvers, with as few cells as possible per process, in the order og 10-40k. If that is not possible, go with the GAMG solver, and keep the number of cells per process above 100k.

For me it seems like you are using a GAMG solver for the pressure equation, that could in a very simple way explain the bad scaling.

If you are interested, you can consult my study here: https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje This is done on an Intel Sandy Bridge cluster with 16 core/node, as you have.

LESlie · September 5, 2013, 10:28

Hi Håkon,

Indeed I am using the GAMG for pressure. But why this bad scaling behavior did not occur for a 8 million cell mesh with exactly the same set up? (see the attachment)

Cheers,
Leslie

haakon · September 5, 2013, 10:37

Quote:

Originally Posted by LESlie

Hi Håkon,

Indeed I am using the GAMG for pressure. But why this bad scaling behavior did not occur for a 8 million cell mesh with exactly the same set up? (see the attachment)

Cheers,
Leslie

That is a really good question, which I cannot answer immediately.

Anyway, I now also see that you use a compressible solver. Then some of my considerations might not necessarily be correct, as I have only worked with and tested incompressible flows. The pressure is some quite different things (physically) in compressible and incompressible flows...

Andrea_85 · September 5, 2013, 11:28

Hi all,

I am also planning to move my simulations on a big cluster (Blue Gene/Q). I will have to do scalability tests before starting the real job, so this thread is of particular interest to me. The project is not yet started so i cannot post here my results at the moment (but i hope in few weeks), but i am trying to understand as much as possibile in advance.

so here my questions, related to bruno's post:

Can you please clarify a bit points 1, 5 and 7 of your post?

Thanks

andrea

dkokron · September 6, 2013, 01:01

All,

The only way to answer this question is to profile the code to see where it is slowing down as it is scaled to larger process counts. I have had some success using the TAU tools (http://www.cs.uoregon.edu/research/tau/about.php) with the simpleFoam and pimpleFoam incompressible solvers from OpenFOAM-2.2.x to get subroutine level timings. I am willing to try getting some results from compressible/interFoam if you can provide me your 8M cell case.

Dan

LESlie · September 6, 2013, 12:15

Thanks Dan for proposing. In fact the 8M case was with a reasonable scaling behavior, the problem is with 64M. You think TAU can handle such big case?

I was hoping to use extrae + paraver to get the communication time. Anyone here who is experienced with these analysis tools I will be grateful!

The problem is I don't find a proper way to limit the traces generated by extrae, for example for the depthCharge2D case, 6 minutes run generates 137GB traces and after parsing it's 29 GB. paraver couldn't handle this because of too much data. And this is just for a 12800 cell case...

Does anyone have a proposal for a suitable performance analysis tool?

Cheers and have a good weekend!

wyldckat · September 7, 2013, 05:03

Greetings to all!

@Andrea:

Quote:

Originally Posted by Andrea_85

so here my questions, related to bruno's post:

Can you please clarify a bit points 1, 5 and 7 of your post?

Let's see:

Quote:

Originally Posted by wyldckat

1. Did you specifically compile OpenFOAM with the system's GCC, along with the proper compilation flags?

Some examples:

Quote:

Originally Posted by wyldckat

5. Is the cluster using any specific communication system, such as Infiniband?

And if so, is your job being properly launched using that particular communication system?

Well, Infiniband provides a low latency and high speed connection infrastructure, which is excellent for HPC, because it doesn't rely on traditional Ethernet protocols. As for choosing this as the primary connection, that is usually done automatically by the MPI, specially in clusters that are using their own MPI installation.
For a bit more on this topic, read:

Quote:

Originally Posted by wyldckat

7. Is processor affinity being used in the job options?

One example: http://www.cfd-online.com/Forums/har...tml#post356954

You might find more information from the links at this blog post of mine: Notes about running OpenFOAM in parallel

Best regards,
Bruno

dkokron · September 7, 2013, 16:48

LESlie,

TAU can handle such a small case. The trouble is getting TAU to properly instrument OpenFOAM. It takes a bit of patience, but it works very well. Once the code is instrumented and built, you can use the resulting executable for any case.

Dan

haakon · September 9, 2013, 04:47

In addition to TAU, you can also use IPM ( http://ipm-hpc.sourceforge.net/ ) to find out where time is spent in your solver. I find this very useful. You can either use a simple LD_PRELOAD to link your OpenFOAM solver to IPM at runtime, this works sufficiently well to find out what takes time (calculation, I/O, different MPI calls etc).

You can also create a special purpose solver, and insert markers in the code, in that case you can find out what parts of the solution process that uses most time. This is a very powerful feature. The IPM library has very low overhead, such that you can use it on production runs, without suffering in terms of performance.

I have written a small guide on how to profile an OpenFOAM solver with IPM here: https://www.hpc.ntnu.no/display/hpc/...filing+Solvers if that is of any interest.

andrea.pasquali · June 10, 2014, 13:57

Dear haakon,

I am interested in parallel performace/anaylisis for the refinement stage in snappyHexMesh.
I found your instruction regarding IPM very useful but I have problems to compile it correctly within OpenFOAM.

I'd like to control the MPI execution in the "autoRefineDriver.C", where it refine the grid.
I changed the "Make/options" file and compile it with "wmake libso" as follows:

Code:

sinclude $(GENERAL_RULES)/mplib$(WM_MPLIB)
sinclude $(RULES)/mplib$(WM_MPLIB)

EXE_INC = \
    -I$(LIB_SRC)/parallel/decompose/decompositionMethods/lnInclude \
    -I$(LIB_SRC)/dynamicMesh/lnInclude \
    -I$(LIB_SRC)/finiteVolume/lnInclude \
    -I$(LIB_SRC)/lagrangian/basic/lnInclude \
    -I$(LIB_SRC)/meshTools/lnInclude \
    -I$(LIB_SRC)/fileFormats/lnInclude \
    -I$(LIB_SRC)/edgeMesh/lnInclude \
    -I$(LIB_SRC)/surfMesh/lnInclude \
    -I$(LIB_SRC)/triSurface/lnInclude

LIB_LIBS = \
    -ldynamicMesh \
    -lfiniteVolume \
    -llagrangian \
    -lmeshTools \
    -lfileFormats \
    -ledgeMesh \
    -lsurfMesh \
    -ltriSurface \
    -ldistributed \
    -L/home/pasquali/Tmp/ipm/lib/libipm.a \
    -L/home/pasquali/Tmp/ipm/lib/libipm.so \
    -lipm

My questions are:
1) What are "GENERAL_RULES" and "RULES"? How should I define them?
2) Which lib for IPM should I use? The static (.a) or the dynamic (.so)?
3) "-lipm" is not found. Only if I remove it I can compile correctly the library "libautoMesh.so"
4) Once I compiled the lib "libautoMesh.so", I added the line

Code:

#include "/usr/include/mpi/mpi.h"

in the file "autoRefineDriver.C". But re-compiling it I got the error:

Quote:

In file included from /usr/include/mpi/mpi.h:1886:0,
from autoHexMesh/autoHexMeshDriver/autoRefineDriver.C:41:
/usr/include/mpi/openmpi/ompi/mpi/cxx/mpicxx.h:33:17: fatal error: mpi.h: No such file or directory
compilation terminated.
make: *** [Make/linux64GccDPOpt/autoRefineDriver.o] Error 1

I am using the openmpi of my pc under "/usr/include/mpi". I would like to test it on my pc before to go on the cluster.

Thank you in advance for any help

Best regards

Andrea Pasquali

andrea.pasquali · June 12, 2014, 09:07

I compiled it!
It works

Andrea

arnaud6 · February 12, 2015, 14:28

Hello all!

I have recently come up with some issues regarding parallel jobs. I am running potentialFoam and simpleFoam on several cluster nodes. I am experiencing really different running times depending on the nodes selected.

I am observing this behaviour:
Running on a single switch, the case is running as expected with let's say 80 seconds per iteration.
Running the same job across multiple switches, each iteration takes 250 sec, so 3 times more.

I am running with openfoam-2.3.1 and mpirun-1.6.5 and using InfiniBand.

For pressure, I use the GAMG solver (I will try with the PCG solver to see if I can get more consistent results). The cell count is roughly 1M cells/ proc which should be fine.

I am using scotch decomposition method.

Hello I am coming back to you with more information.

I have run the Test-Parallel of OpenFOAM and the output looks fine for me.
Here is an example of the log file

PHP Code:

Create time

[0]
Starting transfers
[0]
[0] master receiving from slave 1
[144]
Starting transfers
[144]
[144] slave sending to master 0
[144] slave receiving from master 0
[153]
Starting transfers
[153]
[153] slave sending to master 0
[153] slave receiving from master 0[/PHP]

I don't know how to interpret all the processor numbers at the end of the test but I don't find them really useful. Should I get more information from this Test-Parallel ?

I want to emphasize that the IB fabric seems to work correctly as we don't observe any issue running commercial grade CFD applications.

We have built mpich3.1.3 from source and we observe exactly the same behaviour as using openmpi (slow across switches and fast in a single switch) so this suggests it is not mpi-related.

Has anyone experienced this behaviour running parallel openfoam jobs ?

And I would like to add that if I increased the numbers of processors I am using, I am experiencing even worse results (in this case I am running on mutiple switches). The run is completely stuck while iterating !!

If you have any further information of what I should check and try, that would be very much appreciated!

andrea.pasquali · February 12, 2015, 17:58

I notice the same behavior but running without infiniband it goes faster for me!
I did not investigate it yet, but I guess is something concerning with infiniband+mpi+openfoam. I used openfoam 2.2.0.
Andrea

arnaud6 · February 13, 2015, 10:39

Ah that's really interesting, I will give a try with an ethernet connection instead of IB so.

Andrea, did you notice different running times when running on nodes on a single switch ?

wyldckat · February 14, 2015, 12:38

Greetings to all!

Have you two used renumberMesh? Both in serial and in parallel mode?
This will rearrange the cell order, so that the final matrices for solving are as close to diagonal as possible, which improves considerably the performance!

Best regards,
Bruno

September 2, 2013, 11:57	Parallel Performance of Large Case	#1
LESlie New Member NaiXian Leslie Lu Join Date: Jun 2009 Location: France Posts: 26 Rep Power: 17	Dear Foamers, After searching for a while on the forum about the performance of parallelization I decided to open this new thread to be able to specify better my problem. I am doing a speed-up test of an industrial application of two phase flow. The mesh is 64 million cells, and the solver is compressibleInterFoam. Each node of the cluster contains 2 eight cores CPU Sandy Bridge clocked at 2.7 GHz (AVX), let 16 cores / node. Each node has 64 GB of memory, let 4 GB / core. Respecting the rule of thumb of 50 k cells/processor, I have carried out the performance study on 1024 (62,500 cells/processor), 512 and 256 processors respectively using the scotch decomposition method. The real physical time to compute a same simulated period is 1.35, 1.035, and 1 (normalized by the physical time of using 256 processors) respectively, thus leaving the 1024 the least efficient configuration and the 256 the most efficient. Any help or suggestion will be very appreciated. __________________ Cheers, Leslie LU

September 3, 2013, 19:00		#4
wyldckat Retired Super Moderator Bruno Santos Join Date: Mar 2009 Location: Lisbon, Portugal Posts: 10,981 Blog Entries: 45 Rep Power: 128	Greetings to all! @Leslie: There are several important details that should be kept in mind, some of which I can think of right now: Did you specifically compile OpenFOAM with the system's GCC, along with the proper compilation flags? Along with the correct compilation flags, did you try to use GCC 4.5, or 4.6 or 4.7, for building OpenFOAM? Which MPI are you using? Is it the one that the cluster uses? Has the cluster been benchmarked with any specific dedicated tools, so that fine tuning was performed for it? Is the cluster using any specific communication system, such as Infiniband? And if so, is your job being properly launched using that particular communication system? Which file sharing system is being used? For example, NFS v3 can do some damage to HPC communications... Is processor affinity being used in the job options? Have you tested more than one kind of decomposition method? What about RAM and CPU characteristics? Because Sandy bridge covers a very broad array of possible processors, including features such as 4 memory channels. More specifically: RAM: How many slots are being used on the motherboards? What's the frequency of the memory modules? 1066, 1333, 1600 MHz? Are the memory modules and motherboard using ECC? CPU: Which exact model of CPU? 8 cores each CPU? Are you certain? Isn't it 8 threads in 4 cores (2 threads per core)? When it comes to the core count distribution, 128/256/512/1024 cores, is this evenly distributed among all machines? Or does it allocate in order of arrival, therefore only allocating the minimum amount of machines? I ended up writing a list longer than I was planning... Anyway, all of these influence how things perform. And I didn't even mention specific BIOS configurations, since it's Intel CPUs . Best regards, Bruno LESlie, m_mousavi88, hxaxtma and 2 others like this. __________________ OpenFOAM: FAQ \| Getting started Forum: How to get help, to post code/output and forum guide Read this before sending me PM

September 4, 2013, 09:15		#5
LESlie New Member NaiXian Leslie Lu Join Date: Jun 2009 Location: France Posts: 26 Rep Power: 17	Thanks a lot Bruno, these things will keep me busy for a while. I will get back to the forum when I have the solution. __________________ Cheers, Leslie LU

September 5, 2013, 10:12		#6
haakon Senior Member Join Date: Dec 2011 Posts: 111 Rep Power: 20	I would, as an even more important factor than the list of wyldckat, put the attention to the pressure solver. My experience is as follows: The multigrid solvers (GAMG) are quick (in therms of walltime), but does not at all scale well in parallel. They require around 100k cells/process for the parallel efficiency to be acceptable. The conjugate gradient solvers (PCG) are slow in terms of walltime, but scale extremely well in parallel. As low as 10k cells/process can be effective. In conclusion, my experience is that the fastest way to get results is the PCG solvers, with as few cells as possible per process, in the order og 10-40k. If that is not possible, go with the GAMG solver, and keep the number of cells per process above 100k. For me it seems like you are using a GAMG solver for the pressure equation, that could in a very simple way explain the bad scaling. If you are interested, you can consult my study here: https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje This is done on an Intel Sandy Bridge cluster with 16 core/node, as you have. wyldckat, Bernhard, fumiya and 4 others like this.

September 6, 2013, 12:15		#11
LESlie New Member NaiXian Leslie Lu Join Date: Jun 2009 Location: France Posts: 26 Rep Power: 17	Thanks Dan for proposing. In fact the 8M case was with a reasonable scaling behavior, the problem is with 64M. You think TAU can handle such big case? I was hoping to use extrae + paraver to get the communication time. Anyone here who is experienced with these analysis tools I will be grateful! The problem is I don't find a proper way to limit the traces generated by extrae, for example for the depthCharge2D case, 6 minutes run generates 137GB traces and after parsing it's 29 GB. paraver couldn't handle this because of too much data. And this is just for a 12800 cell case... Does anyone have a proposal for a suitable performance analysis tool? Cheers and have a good weekend! __________________ Cheers, Leslie LU

September 2, 2013, 19:14		#2
Lieven Senior Member Lieven Join Date: Dec 2011 Location: Leuven, Belgium Posts: 299 Rep Power: 23	Hi Leslie, A few remarks: 1. If I were you, I would test with an even lower number of processors. From what I can see, you don't gain anything by increasing the number from 256 to 512 (ok, 3%, let's call this negligible) but you might have the same when increasing from 128 to 256. 2. Your 50 k cells/processor "rule of thumb" is strongly cluster dependent. For the cluster I'm using, this is in the order of 300 k cells/processor 3. Make sure the physical time is "sufficiently long" so that overhead like reading the grid, creating the fields etc. becomes negligible with respect to the simulation. 4. Also, in case you didn't do it, switch off all writing of data when you do the test. Your I/O speed might mess up your test Cheers, L

September 5, 2013, 11:28		#9
Andrea_85 Senior Member Andrea Ferrari Join Date: Dec 2010 Posts: 319 Rep Power: 17	Hi all, I am also planning to move my simulations on a big cluster (Blue Gene/Q). I will have to do scalability tests before starting the real job, so this thread is of particular interest to me. The project is not yet started so i cannot post here my results at the moment (but i hope in few weeks), but i am trying to understand as much as possibile in advance. so here my questions, related to bruno's post: Can you please clarify a bit points 1, 5 and 7 of your post? Thanks andrea

September 6, 2013, 01:01		#10
dkokron Member Dan Kokron Join Date: Dec 2012 Posts: 33 Rep Power: 14	All, The only way to answer this question is to profile the code to see where it is slowing down as it is scaled to larger process counts. I have had some success using the TAU tools (http://www.cs.uoregon.edu/research/tau/about.php) with the simpleFoam and pimpleFoam incompressible solvers from OpenFOAM-2.2.x to get subroutine level timings. I am willing to try getting some results from compressible/interFoam if you can provide me your 8M cell case. Dan

September 7, 2013, 16:48		#13
dkokron Member Dan Kokron Join Date: Dec 2012 Posts: 33 Rep Power: 14	LESlie, TAU can handle such a small case. The trouble is getting TAU to properly instrument OpenFOAM. It takes a bit of patience, but it works very well. Once the code is instrumented and built, you can use the resulting executable for any case. Dan

September 9, 2013, 04:47		#14
haakon Senior Member Join Date: Dec 2011 Posts: 111 Rep Power: 20	In addition to TAU, you can also use IPM ( http://ipm-hpc.sourceforge.net/ ) to find out where time is spent in your solver. I find this very useful. You can either use a simple LD_PRELOAD to link your OpenFOAM solver to IPM at runtime, this works sufficiently well to find out what takes time (calculation, I/O, different MPI calls etc). You can also create a special purpose solver, and insert markers in the code, in that case you can find out what parts of the solution process that uses most time. This is a very powerful feature. The IPM library has very low overhead, such that you can use it on production runs, without suffering in terms of performance. I have written a small guide on how to profile an OpenFOAM solver with IPM here: https://www.hpc.ntnu.no/display/hpc/...filing+Solvers if that is of any interest. wyldckat, LESlie, Tobi and 2 others like this.

June 12, 2014, 09:07		#16
andrea.pasquali Senior Member Andrea Pasquali Join Date: Sep 2009 Location: Germany Posts: 142 Rep Power: 17	I compiled it! It works Andrea __________________ Andrea Pasquali

February 12, 2015, 14:28		#17
arnaud6 New Member Join Date: May 2013 Posts: 23 Rep Power: 13	Hello all! I have recently come up with some issues regarding parallel jobs. I am running potentialFoam and simpleFoam on several cluster nodes. I am experiencing really different running times depending on the nodes selected. I am observing this behaviour: Running on a single switch, the case is running as expected with let's say 80 seconds per iteration. Running the same job across multiple switches, each iteration takes 250 sec, so 3 times more. I am running with openfoam-2.3.1 and mpirun-1.6.5 and using InfiniBand. For pressure, I use the GAMG solver (I will try with the PCG solver to see if I can get more consistent results). The cell count is roughly 1M cells/ proc which should be fine. I am using scotch decomposition method. Hello I am coming back to you with more information. I have run the Test-Parallel of OpenFOAM and the output looks fine for me. Here is an example of the log file PHP Code: Create time [0] Starting transfers [0] [0] master receiving from slave 1 [144] Starting transfers [144] [144] slave sending to master 0 [144] slave receiving from master 0 [153] Starting transfers [153] [153] slave sending to master 0 [153] slave receiving from master 0[/PHP] I don't know how to interpret all the processor numbers at the end of the test but I don't find them really useful. Should I get more information from this Test-Parallel ? I want to emphasize that the IB fabric seems to work correctly as we don't observe any issue running commercial grade CFD applications. We have built mpich3.1.3 from source and we observe exactly the same behaviour as using openmpi (slow across switches and fast in a single switch) so this suggests it is not mpi-related. Has anyone experienced this behaviour running parallel openfoam jobs ? And I would like to add that if I increased the numbers of processors I am using, I am experiencing even worse results (in this case I am running on mutiple switches). The run is completely stuck while iterating !! If you have any further information of what I should check and try, that would be very much appreciated!

February 12, 2015, 17:58		#18
andrea.pasquali Senior Member Andrea Pasquali Join Date: Sep 2009 Location: Germany Posts: 142 Rep Power: 17	I notice the same behavior but running without infiniband it goes faster for me! I did not investigate it yet, but I guess is something concerning with infiniband+mpi+openfoam. I used openfoam 2.2.0. Andrea __________________ Andrea Pasquali

February 13, 2015, 10:39		#19
arnaud6 New Member Join Date: May 2013 Posts: 23 Rep Power: 13	Ah that's really interesting, I will give a try with an ethernet connection instead of IB so. Andrea, did you notice different running times when running on nodes on a single switch ?

February 14, 2015, 12:38		#20
wyldckat Retired Super Moderator Bruno Santos Join Date: Mar 2009 Location: Lisbon, Portugal Posts: 10,981 Blog Entries: 45 Rep Power: 128	Greetings to all! Have you two used renumberMesh? Both in serial and in parallel mode? This will rearrange the cell order, so that the final matrices for solving are as close to diagonal as possible, which improves considerably the performance! Best regards, Bruno __________________ OpenFOAM: FAQ \| Getting started Forum: How to get help, to post code/output and forum guide Read this before sending me PM

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Large test case for running OpenFoam in parallel	fhy	OpenFOAM Running, Solving & CFD	23	April 6, 2019 10:55
Running AMI case in parallel	Kaskade	OpenFOAM Running, Solving & CFD	3	March 14, 2016 16:58
Parallel Moving Mesh Bug for Multi-patch Case	albcem	OpenFOAM	0	May 21, 2009 01:23
Parallel Performance of Fluent	Soheyl	FLUENT	2	October 30, 2005 07:11
PC vs. Workstation	Tim Franke	Main CFD Forum	5	September 29, 1999 16:01