March 22, 2023, 11:33
No OpenMP Performance Gain
#1
New Member
Paul
Join Date: Jul 2018
Posts: 5
Rep Power: 8
Hello all,
I have been playing around a bit with different compilation and execution options for SU2 in an attempt to wring out as much performance as possible on my workstation, and noticed something funny: I am not seeing any speedup when using OpenMP. I configured with meson using:

Code:
CXXFLAGS="-O3 -march=znver3" ./meson.py build --prefix=<installdir> -Dwith-mpi=enabled -Dwith-omp=true

built and installed with:

Code:
./ninja -C build install

and then run the case (turbulent channel flow with ~650,000 points) using:

Code:
export OMP_NUM_THREADS=<number_of_threads>
mpirun -np 1 SU2_CFD -t <number_of_threads> channel.cfg

Any thoughts? What am I missing? Thanks for any help!

-Paul

PS. I should add that I have run this on a Ryzen 9 5950X using Manjaro Linux and a Ryzen 9 7950X using both Arch Linux under WSL and Manjaro Linux (dual boot).

Last edited by GomerOfDoom; March 22, 2023 at 11:35. Reason: Added machine info
March 22, 2023, 11:47
Adding MPI results
#2
New Member
Paul
Join Date: Jul 2018
Posts: 5
Rep Power: 8
Here is a chart showing the same OpenMP scaling, with the MPI scaling added for comparison.
March 22, 2023, 17:38
#3
Senior Member
Pedro Gomes
Join Date: Dec 2017
Posts: 466
Rep Power: 14
What does htop look like while SU2 is running? You may need to look into thread-pinning settings (mpirun --bind-to none, as a first try).
But at those point counts per thread on a single machine, plain MPI is likely to be faster, unless you are using multigrid, for which fewer partitions tend to improve convergence. -Denable-mixedprec should give you a reasonable boost if you are running implicit.
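To make those two suggestions concrete, here is a rough sketch of what they look like on the command line, reusing Paul's original configure options (--bind-to none is OpenMPI syntax; the mixed-precision switch is assumed to be a boolean meson option, so check the exact spelling against your SU2 version):

Code:
# First try: stop OpenMPI from pinning the single rank (and all of its threads) to one core
export OMP_NUM_THREADS=16
mpirun -np 1 --bind-to none SU2_CFD -t 16 channel.cfg

# Reconfigure with the mixed-precision option Pedro mentions, then rebuild
CXXFLAGS="-O3 -march=znver3" ./meson.py build --prefix=<installdir> \
    -Dwith-mpi=enabled -Dwith-omp=true -Denable-mixedprec=true
./ninja -C build install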
March 23, 2023, 11:04
HTOP Shows No Additional Threads
#4
New Member
Paul
Join Date: Jul 2018
Posts: 5
Rep Power: 8
Pedro,
Thanks for responding! As it turns out, htop shows only one or two cores working, regardless of whether I specify 1, 2, 4, 8, or 16 threads. Interesting! So, I re-ran using:

Code:
mpirun -np 1 --bind-to none SU2_CFD -t <num_threads> channel.cfg

Thanks for all of your help!

-Paul

Last edited by GomerOfDoom; March 23, 2023 at 11:06. Reason: grammatical
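If it helps anyone else debugging thread placement, the OpenMP runtime can also report what it is doing via the standard OMP_* environment variables (a sketch; these are OpenMP 4.0+ variables and should be honoured by libgomp and other runtimes):

Code:
export OMP_DISPLAY_ENV=true    # print the effective OpenMP settings at startup
export OMP_PLACES=cores        # one place per physical core
export OMP_PROC_BIND=close     # keep threads on neighbouring cores
mpirun -np 1 --bind-to none SU2_CFD -t <num_threads> channel.cfg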
March 24, 2023, 16:13
#5
Senior Member
Pedro Gomes
Join Date: Dec 2017
Posts: 466
Rep Power: 14
I don't know what sets the defaults, TBH. --bind-to numa is preferred, and using our OpenMP strategy across multiple NUMA nodes usually results in reduced performance.
Can you post the scaling now compared to MPI, for my curiosity?
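For illustration, the "fewer ranks spanning fewer NUMA/cache domains" idea could look roughly like the sketch below. Consumer Ryzen parts such as the 5950X usually report a single NUMA node, so the two CCDs (L3 cache domains) are the closer analogue on that hardware; the mapping options shown are OpenMPI's:

Code:
# Hypothetical hybrid layout: one MPI rank per L3/CCD domain, 8 OpenMP threads each
export OMP_NUM_THREADS=8
mpirun -np 2 --map-by l3cache --bind-to l3cache SU2_CFD -t 8 channel.cfg

# On a machine that exposes multiple NUMA nodes, the equivalent would be
# mpirun -np <num_numa_nodes> --map-by numa --bind-to numa SU2_CFD -t <threads_per_node> channel.cfg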
March 27, 2023, 10:53
#6
New Member
Paul
Join Date: Jul 2018
Posts: 5
Rep Power: 8
Hi Pedro,
Thanks again for your help. Here is the plot of the MPI vs. OpenMP scaling with "--bind-to none" (note that the colors are switched from my previous plot). The OpenMP scaling now behaves as expected. I'm sure there are some other knobs I could turn (for example, SMT is enabled on this machine, I used the GNU compilers, etc.), so I might do a little more fiddling. I'll post an update if I find any other significant improvements.
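On the SMT point, one low-effort experiment (a sketch; the count of 16 assumes the 16-core 5950X/7950X, and OMP_PLACES assumes the runtime honours the standard placement variables) is to restrict the run to physical cores rather than hardware threads:

Code:
# Check how many physical cores vs. hardware threads the machine exposes
lscpu | grep -iE "core|thread"

# Keep one OpenMP thread per physical core instead of per SMT sibling
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
mpirun -np 1 --bind-to none SU2_CFD -t 16 channel.cfg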