|
February 16, 2015, 10:19 |
|
#21 |
Senior Member
Andrea Pasquali
Join Date: Sep 2009
Location: Germany
Posts: 142
Rep Power: 17 |
Hi,
I did not see different running times when running on nodes on a single switch. My test was with mesh generation, with the refinement stage in snappyHexMesh. As I said, I have not investigated it in detail yet. I only tried recompiling MPI and OpenFOAM once with Intel 12, but still got the same (bad) performance with InfiniBand... Andrea
__________________
Andrea Pasquali |
|
February 18, 2015, 06:09 |
|
#22 |
New Member
Join Date: May 2013
Posts: 23
Rep Power: 13 |
Hello,
So I have tried renumberMesh before solving, and it looks like it has improved the performance a bit on both single and multiple switches, reducing the running time by ~10%. But I still can't see why the running times are so slow across multiple switches. renumberMesh or not, we should get roughly the same running time whichever nodes are selected, right? |
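For reference, a rough sketch of that step on an already decomposed case (the core count and solver name below are just placeholders):

Code:
  # renumber the decomposed mesh in place, then run the solver as usual
  mpirun -np 8 renumberMesh -parallel -overwrite
  mpirun -np 8 yourSolver -parallel > log.solver 2>&1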
|
February 22, 2015, 15:12 |
|
#23
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings to all!
@arnaud6: Quote:
Problem is that when 3 switches are used, the tree becomes a lot larger and is sectioned into 3 parts, making it a bit harder to map out communications. Commercial CFD software might already take these kinds of configurations into account, either by asking the InfiniBand controls to adjust accordingly, or by having the CFD software balance things out on its own: placing sub-domains close to each other on the machines that share a switch and keeping communication with machines connected to other switches to a minimum. But when you use OpenFOAM, you're probably not taking this into account. I haven't had to deal with this myself, so I have no idea how this is properly configured, but there are at least a few things I can imagine that could work:
Bruno |
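As an illustration of the placement idea above (a hypothetical sketch, with made-up host names, slot counts and solver): with Open MPI's default fill-by-slot mapping, listing the hosts of each switch consecutively in the hostfile keeps consecutive MPI ranks, and therefore consecutively numbered sub-domains, on nodes that share a switch.

Code:
  # hosts.txt -- host names and slot counts are hypothetical
  # switch 1
  node-a1 slots=16
  node-a2 slots=16
  # switch 2
  node-b1 slots=16
  node-b2 slots=16

  mpirun --hostfile hosts.txt -np 64 yourSolver -parallel

Whether this actually helps depends on the decomposition numbering neighbouring sub-domains close together (e.g. hierarchical), so it is worth timing a few iterations either way.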
||
February 26, 2015, 06:15 |
|
#24 |
New Member
Join Date: May 2013
Posts: 23
Rep Power: 13 |
Hi Bruno,
Thanks for your ideas! I am looking at the PCG solvers. Would you advise using the combination of PCG for p and PBiCG for the other variables, or using PCG for p and keeping the other variables on a smoothSolver/GaussSeidel? In my case it looks like p is the most difficult to solve (at least it is the variable that takes the longest to solve at each iteration). The difficulty is that I don't have much control over which nodes, and thus which switches, will be selected when I submit my parallel job... I will see what I can do with the IB support. |
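For concreteness, the two combinations side by side as they would appear in system/fvSolution (field names, preconditioners and tolerances are placeholders, not recommendations):

Code:
  solvers
  {
      p
      {
          solver          PCG;
          preconditioner  DIC;
          tolerance       1e-06;
          relTol          0.01;
      }

      // option 1: Krylov solver for the other variables
      "(U|k|epsilon)"
      {
          solver          PBiCG;
          preconditioner  DILU;
          tolerance       1e-06;
          relTol          0.1;
      }

      // option 2: smoothSolver/GaussSeidel for the other variables instead
      //"(U|k|epsilon)"
      //{
      //    solver      smoothSolver;
      //    smoother    GaussSeidel;
      //    nSweeps     1;
      //    tolerance   1e-06;
      //    relTol      0.1;
      //}
  }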
|
October 24, 2015, 16:37 |
|
#25
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Hi arnaud6,
Quote:
The best I could tell you, back then and now, is to try running a few iterations yourself with each configuration. Even the GAMG matrix solver can sometimes be improved if you fine-tune the parameters and do some trial-and-error sessions with your case, because these parameters depend on the case size and on how the sub-domains in the case are structured. Either way, I hope you managed to figure this out on your own. Best regards, Bruno |
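As an illustration of the kind of GAMG knobs that tend to get tuned case by case (the values below are generic starting points, not settings from this thread):

Code:
  p
  {
      solver                 GAMG;
      smoother               GaussSeidel;
      agglomerator           faceAreaPair;
      nCellsInCoarsestLevel  200;   // often scaled with case size and core count
      mergeLevels            1;
      nPreSweeps             0;
      nPostSweeps            2;
      cacheAgglomeration     on;
      tolerance              1e-06;
      relTol                 0.01;
  }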
||
November 4, 2015, 11:49 |
|
#26
New Member
Join Date: Nov 2012
Posts: 27
Rep Power: 14 |
Hi Bruno,
indeed. In my experience, how the sub-domains are structured has a strong impact on performance, so I chose to decompose manually. My problem now is as follows: I am running a DNS case (22 million cells) using buoyantPimpleFoam (OF V2.4). The case is a long pipe with an inlet and an outlet. The fluid is air, and the inlet Re is about 5400. To get better scalability, I use PCG for the pressure equation. If I use the perfect gas equation of state, the number of iterations is around 100, which is acceptable. If I use icoPolynomial or rhoConst to describe the density, the number of iterations is around 4000! If I use GAMG for the p equation, the number of iterations is under 5, but the scalability is poor above 500 cores. Does anyone have any opinion? How can I improve the PCG solver to decrease the number of iterations? Thank you.
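For context, one way to lay out the sub-domains by hand for a long pipe is a hierarchical split mostly along the pipe axis; the counts below are placeholders (the manual method with an explicit cell-to-processor file is the other option):

Code:
  // system/decomposeParDict (sketch)
  numberOfSubdomains  512;

  method              hierarchical;

  hierarchicalCoeffs
  {
      n       (128 2 2);   // mostly along the pipe axis
      delta   0.001;
      order   xyz;
  }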
|
||
November 7, 2015, 12:52 |
|
#27
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Quote:
__________________
|
||
March 25, 2016, 06:03 |
|
#28 |
New Member
Join Date: May 2013
Posts: 23
Rep Power: 13 |
Sorry for getting back so late on this one. The problem was Open MPI 1.6.5: as soon as I switched to Open MPI 1.8.3, the slowness disappeared!
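In case it helps anyone else, a quick way to check which MPI build an OpenFOAM installation is actually picking up (output will of course differ per install):

Code:
  mpirun --version            # Open MPI version being used
  which mpirun                # which MPI build is first on the PATH
  echo $WM_MPLIB $FOAM_MPI    # MPI flavour OpenFOAM was compiled against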
|