|
March 30, 2022, 16:20 |
Hybrid OpenMP+MPI optimisation
|
#1 |
New Member
Marco
Join Date: Mar 2014
Posts: 8
Rep Power: 12 |
Hi All,
I am trying to optimize a run using the hybrid parallelization approach implemented in SU2. I am running on 2 nodes, with 2 tasks per node and 16 CPUs per task, using the following SLURM script:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_WAIT_POLICY=ACTIVE

mpirun -n $SLURM_NTASKS --bind-to none SU2_CFD -t $SLURM_CPUS_PER_TASK Config.cfg

The mesh has 20M cells, and the run works fine with sensible results. The problem I am having, however, is that I cannot optimise the performance of the approach, and I keep receiving the following warning:

WARNING: On 4 MPI ranks the coloring efficiency was less than 0.875 (min value was 0.0625). Those ranks will now use a fallback strategy, better performance may be possible with a different value of config option EDGE_COLORING_GROUP_SIZE (default 512).

I have tried different values of EDGE_COLORING_GROUP_SIZE (e.g., 32, 64, 128, 512, 1028), but I keep receiving the same message. If anybody can shed some light on this, that would be much appreciated! Thanks a lot for your help. |
|
March 31, 2022, 07:28 |
|
#2 |
Senior Member
Pedro Gomes
Join Date: Dec 2017
Posts: 466
Rep Power: 14 |
Hi Marco,
If you have multigrid turned on, that message might be for the coarse grids (where it does not matter much). I assume the nodes have 2 CPUs of 16 cores each? In general you never want to bind to none; ranks should bind to NUMA nodes, and the threads should bind to cores. Finally, with only 2 nodes and a large mesh, it is possible that the communication costs are not high enough to offset the OpenMP overhead. How much slower is it compared to just MPI? You can try using more tasks and fewer CPUs per task. |
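For reference, a minimal sketch of what this binding advice could look like in the job script, keeping the original 2 tasks x 16 CPUs layout. The mpirun flags assume Open MPI; OMP_PLACES and OMP_PROC_BIND are standard OpenMP environment variables. Adapt the names if your MPI or OpenMP runtime differs:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Pin each OpenMP thread to its own core, packed next to its MPI rank
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# Bind each MPI rank to a NUMA node instead of leaving it unbound
mpirun -n $SLURM_NTASKS --map-by numa --bind-to numa \
    SU2_CFD -t $SLURM_CPUS_PER_TASK Config.cfg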
|
March 31, 2022, 09:03 |
|
#3
New Member
Marco
Join Date: Mar 2014
Posts: 8
Rep Power: 12 |
Hi Pedro,
Thanks for your feedback. Just to answer your questions:
If we use only MPI, performance degrades noticeably during heavy usage of the cluster, especially if SU2 runs on multiple nodes. We can obviously run on a single node, but that dramatically increases the waiting time in the queue! What we are hoping to achieve with the hybrid approach is to minimize MPI communication, so that performance is less affected by how the rest of the cluster is being used (if that makes sense!). Thanks again for your help. Marco
|
March 31, 2022, 13:27 |
|
#4 |
Senior Member
Pedro Gomes
Join Date: Dec 2017
Posts: 466
Rep Power: 14 |
Understood. For that CPU it will be essential to bind to NUMA nodes and not use fewer than 1 MPI rank per NUMA node.
From the "(min value was 0.0625)" (i.e. 1/16) it looks like the coloring is failing on at least one rank; since the mesh is large, you may try increasing the group size further. |
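As a concrete (illustrative) example of trying a larger group size than the values listed above, one could bump the option in the SU2 configuration file before resubmitting; the value 2048 below is just a guess to show the idea, not a recommendation:

# Replace the existing setting in Config.cfg (assumes the option is already present;
# otherwise simply append the line "EDGE_COLORING_GROUP_SIZE= 2048" to the file)
sed -i 's/^EDGE_COLORING_GROUP_SIZE.*/EDGE_COLORING_GROUP_SIZE= 2048/' Config.cfg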
|
April 5, 2022, 07:02 |
|
#5 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Not that I would know the first thing about SU2 in particular, but for hybrid parallelization with MPI+OpenMP it is often necessary to take the underlying hardware into account.
AMD EPYC 7702 CPUs are quite complex in that regard. For each CPU you have:
- 64 cores in total
- 4 dual-channel memory controllers (i.e. up to 4 NUMA nodes per CPU with NPS=4)
- 8 dies, 2 per memory controller
- 2 "CCX" per die, each with its own segment of L3 cache shared by 4 cores
For an MPI+OpenMP approach, my first order of business would be to set NPS=4 in the BIOS. Consult the output of lstopo, lscpu or numactl --hardware to see how many NUMA nodes you have. Then limit each OpenMP region to a single NUMA node (now 4 per CPU, or 8 per node, with 16 CPU cores each), and let MPI handle communication across NUMA nodes. Depending on the communication/synchronization requirements of the solver, it might even be necessary to go one step further: each OpenMP region only spans a single segment of L3 cache (containing 4 CPU cores). |
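A quick way to inspect that layout on the compute nodes (lstopo ships with hwloc and is often packaged as lstopo-no-graphics; lscpu and numactl are usually available by default):

# Number and size of NUMA nodes, plus L3 cache per core group
lscpu | grep -iE 'numa|l3'
# Memory per NUMA node and the node distance matrix
numactl --hardware
# Full picture: packages, NUMA nodes, L3 groups ("CCX"), cores
lstopo-no-graphics --no-io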
|
April 5, 2022, 19:21 |
|
#6 |
New Member
Marco
Join Date: Mar 2014
Posts: 8
Rep Power: 12 |
Hello Alex, thanks a lot for your answer, but I must say I am a bit lost!
Unfortunately altering the BIOS is not an option, since this is a shared HPC facility and I do not have control over it. Secondly, I ran the numactl --hardware command and below is the output I got:

[[kelvin2] ~]$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 128426 MB
node 0 free: 94622 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 129020 MB
node 1 free: 120385 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 size: 64508 MB
node 2 free: 55138 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 3 size: 64496 MB
node 3 free: 57355 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 4 size: 129004 MB
node 4 free: 110361 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 5 size: 129020 MB
node 5 free: 122277 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 6 size: 64508 MB
node 6 free: 58994 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 7 size: 64508 MB
node 7 free: 51406 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  12  12  32  32  32  32
  1:  12  10  12  12  32  32  32  32
  2:  12  12  10  12  32  32  32  32
  3:  12  12  12  10  32  32  32  32
  4:  32  32  32  32  10  12  12  12
  5:  32  32  32  32  12  10  12  12
  6:  32  32  32  32  12  12  10  12
  7:  32  32  32  32  12  12  12  10

And there I am kind of lost! :-) I understand that the 128 cores available are divided into 8 nodes (dies??), and the size/free lines should be the memory available and free for each "node", but the meaning of the node distances is unclear to me. Moreover (and apologies for my ignorance), I am not really sure how to interpret what you wrote about "let MPI handle communication across each NUMA node". Does it mean that, for the hardware I have, I should limit the shared memory operations (-t) to 8? Thanks again! |
|
April 5, 2022, 19:42 |
|
#7 |
Senior Member
Pedro Gomes
Join Date: Dec 2017
Posts: 466
Rep Power: 14 |
Yep, a minimum of 8 tasks per node with 16 CPUs per task, and those tasks should bind to NUMA nodes.
Given the L3 cache detail, 16 tasks per node with 8 CPUs each may indeed be better. With fewer CPUs per task it will also be easier to find a suitable color group size that is efficient. |
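In job-script terms, a sketch of the first layout described above might look like this (again assuming Open MPI's mpirun; the 16-tasks-by-8-CPUs variant would swap the two numbers and could bind to L3 caches, e.g. --bind-to l3cache, rather than NUMA nodes):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8      # one MPI rank per NUMA node
#SBATCH --cpus-per-task=16       # one OpenMP thread per core in that NUMA node

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=close

mpirun -n $SLURM_NTASKS --map-by numa --bind-to numa \
    SU2_CFD -t $SLURM_CPUS_PER_TASK Config.cfg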
|
April 6, 2022, 05:59 |
|
#8
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
1 NUMA node with NPS=4 corresponds to a memory controller. Each CPU has 4 dual-channel memory controllers. CPU dies are one more layer of segmentation, with 2 dies per memory controller. And lastly, each die has two "CCX", each with its own separate chunk of L3 cache.

"Distance" in this output gives you a first rough idea of how fast communication is between the individual NUMA nodes. E.g. the 1st line, 1st column entry is 10, meaning that communicating within the first NUMA node is relatively fast. On the other end, the 1st line, 8th column entry is 32, so communication between cores on NUMA nodes 0 and 7 is much slower. Don't read too much into that for now; it just tells us what we already know: intra-node communication is faster than inter-node communication. Hence my recommendation of keeping OpenMP regions contained within a NUMA node.

But another issue sticks out: memory population on this machine is unbalanced. You can see it from the different sizes of the NUMA nodes. This should really be avoided, otherwise it can cause performance regressions with these CPUs. It strikes me as very odd that an HPC facility would run their nodes like this.

I wish I could help you more, but I am not familiar with the nomenclature of SU2. |
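For completeness, a quick filter over the same numactl output quoted above makes the imbalance easy to spot (nothing new here, it just isolates the per-node sizes):

# Installed memory per NUMA node; on a balanced configuration these would all be equal
numactl --hardware | grep ' size:'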
April 6, 2022, 08:29 |
|
#9
New Member
Marco
Join Date: Mar 2014
Posts: 8
Rep Power: 12 |
This is something I have been discussing with the HPC system guys for more than 2 years, but for whatever reason they do not want to address the problem! Thanks again for your explanation, it is much appreciated. |
|
|