
OpenFOAM on AMD GPUs. Container from Infinity Hub: user experiences and performance

February 23, 2023, 04:28   #1
OpenFOAM on AMD GPUs. Container from Infinity Hub: user experiences and performance
New Member
 
Alexis Espinosa
Join Date: Aug 2009
Location: Australia
Posts: 20
AMD recently provided an OpenFOAM container capable of running on AMD GPUs.


It is in their Infinity Hub:
https://www.amd.com/en/technologies/...y-hub/openfoam

My questions are:

- What have the community's experiences been with this OpenFOAM container on AMD GPUs?
- Are you seeing significant performance improvements compared with CPU-only solvers?

Thanks a lot,
Alexis

(PS. I will start using it and post my experiences too)
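For anyone else wanting to try it: AMD's ROCm containers are typically run by passing the KFD and DRI devices through to Docker. A minimal sketch follows; the image name and tag here are placeholders, so check the Infinity Hub page for the exact ones:

Code:
# Pull the OpenFOAM image (image name/tag are placeholders; see the Infinity Hub page)
docker pull amdih/openfoam:<tag>

# Run it with the ROCm GPU devices passed through to the container
docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --security-opt seccomp=unconfined \
    amdih/openfoam:<tag> bash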


February 9, 2024, 04:51   #2
Senior Member
 
M. Montero
Join Date: Mar 2009
Location: Madrid
Posts: 155
Hi,

Were you able to launch any simulation using the GPU version? Does it run 100% on the GPU, or is only the pressure solver offloaded to the GPU?

Do you know whether it might be compatible with NVIDIA GPUs, so it could be tested there?

Best Regards
Marcelino

April 28, 2024, 10:15   #3
OpenFOAM on AMD GPUs. Container from Infinity Hub: Experiences with Radeon VII
New Member
 
Tom
Join Date: Dec 2015
Location: Melbourne, Australia
Posts: 11
Thought I'd share my experiences with this!

Unfortunately, my finding with this setup has been that it remains much faster to solve on the CPU than on the GPU.

I used the HPC_Motorbike example and the code provided by AMD in the Docker container (no longer available at that link, by the way) as-is on my Radeon VII. For the CPU runs, I modified the case to suit a typical CPU-based set of solvers, using the standard tutorial fvSolution files.


Results are as follows. Times shown are simpleFoam total ClockTime to 20 iterations, and time per iteration excluding the first time step:
  • GPU: 473 seconds; 20.8 s per iteration
  • CPU, with 'GPU-aligned' solvers: 343 seconds; 16.7 s per iteration
  • CPU, with 'normal' solvers: 205 seconds; 9.9 s per iteration
Velocity and pressure solvers for each run were as follows (a sketch of the CPU fvSolution entries follows this list):
  • GPU: PETSc-bcgs & PETSc-cg
  • CPU, with 'GPU-aligned' solvers: DILUPBiCGStab & DICPCG
  • CPU, with 'normal' solvers, per tutorial: smoothSolver & GAMG
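For reference, roughly what the CPU fvSolution entries looked like; this is a sketch, and the tolerances shown are illustrative rather than the exact values from my runs:

Code:
solvers
{
    // 'GPU-aligned' CPU run: same Krylov methods as the PETSc GPU setup
    p
    {
        solver          PCG;        // i.e. DICPCG
        preconditioner  DIC;
        tolerance       1e-07;
        relTol          0.1;
    }
    U
    {
        solver          PBiCGStab;  // i.e. DILUPBiCGStab
        preconditioner  DILU;
        tolerance       1e-08;
        relTol          0.1;
    }
}
// The 'normal' CPU run used the tutorial defaults instead:
// p with GAMG (GaussSeidel smoother), U with smoothSolver.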
The GPU appears seldom used, with sporadic spikes in utilisation that barely exceed 40% of the GPU pipe. Most of the time within each iteration seems to be spent doing not much (I/O, maybe?). Unsurprisingly, the first iteration is much longer as the model is read into VRAM, which is easy to see, but subsequent iterations are also slower than with similar solvers on the CPU. To account for this, the per-iteration times above are taken from iterations 2-20.
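If you want to watch the utilisation yourself, polling rocm-smi on the host while the solver runs is enough:

Code:
# poll GPU utilisation and VRAM once per second while simpleFoam runs
watch -n 1 rocm-smi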


I get that GPUs are made for large models, but I am already close to the 16 GB of VRAM even with this model (5,223,573 cells); that works out to roughly 3 kB of VRAM per cell, so the Medium model (~9M cells, I think) would need on the order of 28 GB, and indeed I can't run it because I run out of VRAM. I'm running this on my desktop PC for fun; I don't even want to know how much faster this would be on my usual solving machine (48-core Xeon).



So, in summary, based on my experiences with a Radeon VII and the Small HPC_Motorbike case:

  • the GPU is half as fast as the CPU when the CPU uses its native solvers
  • the GPU is roughly 20-25% slower per iteration than the CPU when the CPU uses the less-efficient 'GPU-aligned' solvers
The next step, I think, is to find more GPUs and test the scaling of larger models (love an excuse to keep scouring eBay for deals, hehe).

Cheers,
Tom

April 28, 2024, 10:23   #4
New Member
 
Tom
Join Date: Dec 2015
Location: Melbourne, Australia
Posts: 11
Quote:
Originally Posted by be_inspired
Hi,

Were you able to launch any simulation using the GPU version? Does it run 100% on the GPU, or is only the pressure solver offloaded to the GPU?

Do you know whether it might be compatible with NVIDIA GPUs, so it could be tested there?

Best Regards
Marcelino
100% GPU, as far as I'm aware. All solvers are PETSc.

The initial run script appears flexible enough to support CUDA devices too. I've not dug any deeper and don't have a suitable GPU to test with, sorry.

Code:
Available Options: HIP or CUDA
Only HIP is mentioned in the fvSolution file, though, so I'd guess that the PETSc solver has been tuned for AMD.
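If someone with an NVIDIA card wants to try it, my guess (untested) is that the PETSc options in fvSolution would swap the HIP types for their CUDA counterparts, something like:

Code:
mat_type    mpiaijcusparse;   // instead of mpiaijhipsparse
vec_type    cuda;             // instead of hip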

April 28, 2024, 14:51   #5
Senior Member
 
Domenico Lahaye
Join Date: Dec 2013
Posts: 802
Blog Entries: 1
Thanks for your input. Much appreciated.

1/ Can you confirm that the bulk of the solve time goes into the pressure solve (independent of CPU vs. GPU)?

2/ How do you precondition PETSc-CG for the pressure solve?

3/ Are you willing to walk the extra mile and compare two flavours of PETSc-CG?

Flavour-1: using AMG to precondition PETSc-CG, allowing AMG to do a set-up at each linear-system solve.

Flavour-2: using AMG to precondition PETSc-CG (so far identical to Flavour-1), this time freezing the hierarchy that AMG constructs.
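In terms of the caching dictionary of the PETSc interface, I would expect the two flavours to look roughly like this; a sketch only, using just the 'always' and 'periodic' update modes that already appear in this thread:

Code:
// Flavour-1: rebuild the AMG set-up at every linear-system solve
preconditioner
{
    update always;
}

// Flavour-2: approximate a frozen AMG hierarchy by rebuilding only
// every N solves, with N chosen larger than the total number of
// solves in the run
preconditioner
{
    update periodic;
    periodicCoeffs
    {
        frequency 100000;
    }
}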

April 29, 2024, 05:42   #6
New Member
 
Tom
Join Date: Dec 2015
Location: Melbourne, Australia
Posts: 11
Quote:
Originally Posted by dlahaye
Thanks for your input. Much appreciated.

1/ Can you confirm that the bulk of the solve time goes into the pressure solve (independent of CPU vs. GPU)?

2/ How do you precondition PETSc-CG for the pressure solve?

3/ Are you willing to walk the extra mile and compare two flavours of PETSc-CG?

Flavour-1: using AMG to precondition PETSc-CG, allowing AMG to do a set-up at each linear-system solve.

Flavour-2: using AMG to precondition PETSc-CG (so far identical to Flavour-1), this time freezing the hierarchy that AMG constructs.

1) I don't have a specific ClockTime breakdown, but it would appear so, yes.
2) PETSc-CG is preconditioned using PETSc's GAMG (smoothed aggregation):

Code:
p
    {
        solver          petsc;
        petsc
        {               
            options
            {
                ksp_type  cg;
                ksp_cg_single_reduction  true;
                ksp_norm_type none;
                mat_type    mpiaijhipsparse; //HIPSPARSE
                vec_type    hip;

                //preconditioner 
                pc_type gamg;
                pc_gamg_type "agg"; // smoothed aggregation                                                                            
                pc_gamg_agg_nsmooths "1"; // number of smooths for smoothed aggregation (not smoother iterations)                      
                pc_gamg_coarse_eq_limit "100";
                pc_gamg_reuse_interpolation true;
                pc_gamg_aggressive_coarsening "2"; //square the graph on the finest N levels
                pc_gamg_threshold "-1"; // increase to 0.05 if coarse grids get larger                                                 
                pc_gamg_threshold_scale "0.5"; // thresholding on coarse grids
                pc_gamg_use_sa_esteig true;

                // mg_level config
                mg_levels_ksp_max_it "1"; // use 2 or 4 if problem is hard (i.e stretched grids)
                mg_levels_esteig_ksp_type cg; //max_it "1"; // use 2 or 4 if problem is hard (i.e stretched grids)                     

                // coarse solve (indefinite PC in parallel with 2 cores)                                                               
                mg_coarse_ksp_type "gmres";
                mg_coarse_ksp_max_it "2";
        
                // smoother (cheby)                                                                                                    
                mg_levels_ksp_type chebyshev;
                mg_levels_ksp_chebyshev_esteig "0,0.05,0,1.1";
                mg_levels_pc_type "jacobi";
                
            }

            caching
            {
                matrix
                {
                    update always;
                }

                preconditioner
                {
                    //update always;     
                    update periodic;

                    periodicCoeffs
                    {
                        frequency  40;
                    }
                }
            }
        }
        tolerance       1e-07;
        relTol          0.1;
    }
3/ Sure, happy to. I'll need some guidance on how to set those flavours up.

April 29, 2024, 06:16   #7
Senior Member
 
Domenico Lahaye
Join Date: Dec 2013
Posts: 802
Blog Entries: 1
Thanks again.

It appears that by setting

Code:
periodicCoeffs
    {
       frequency  40;
    }
you already have a blend between Flavour-1 (frequency 1) and Flavour-2 (frequency infinity). My question has thus been answered.

I have two follow-up questions, if you allow.

1/ How does the runtime of PETSc-GAMG compare with that of OpenFOAM-native GAMG (the latter used as a preconditioner, to be fair)?

2/ Do you see statistics of the PETSc-GAMG coarsening printed anywhere? It would be interesting to compare these statistics (in particular the geometric and algebraic complexities) with those of OpenFOAM-native GAMG. The latter can easily be obtained by inserting debug switches in system/controlDict:

Code:
// see /opt/OpenFOAM/OpenFOAM-v1906/etc/controlDict for a complete list of DebugSwitches
DebugSwitches
{
    GAMG                2;
    GAMGAgglomeration   0;
    GAMGInterface       0;
    GAMGInterfaceField  0;
    GaussSeidel         0;
    fvScalarMatrix      0;
    lduMatrix           0;
    lduMesh             0;
}
