|
March 2, 2022, 05:59 |
Benchmark fpmem
|
#1 |
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
Rep Power: 10 |
The STREAM benchmark tests the memory bandwidth, even though floating point operations are performed. In the benchmark, the number of floating point operations doesn't exceed the number of loads. This is likely also the case for many CFD programs, but not for higher order solvers based on Cartesian grids, where the ratio between the number of floating point operations and the number of loads can be much larger.
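For reference (this is not taken from fpmem.c, just the standard STREAM triad): each iteration does 2 loads, 1 store and 2 FLOPs, so the number of FLOPs never exceeds the number of loads.
Code:
// STREAM "triad" kernel for reference: 2 loads, 1 store and 2 FLOPs per
// iteration, so the FLOPs/load ratio stays at 1.
void triad(double* a, const double* b, const double* c, double s, long n)
{
    for (long i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
}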
Optimizing in HPC is often about minimizing reading from memory. The work can be split into smaller chunks, where as much work as possible is done on each chunk before the next chunk of memory is processed. The relevant size of such chunks should be determined. I have made a benchmark, fpmem, that gives the floating point performance for various combinations of floating point operations per load and the size of the array processed (a sketch of the idea follows below). The benchmark doesn't do any real work, but it can be compiled, linked and run in about 5 minutes. The instructions for compiling, linking and using the benchmark are given in the first few lines of the source file. It requires a recent C++ compiler (-std=c++17) and MPI. It uses AVX2 when compiled with -D_USE_INTRINSIC. See the instructions. I hope that some will care to use the benchmark and post their results. The benchmark is made to run on one CPU, and if used on a large cluster the performance will just increase linearly with the number of CPUs. I don't have access to EPYC Milan or newer Xeons on socket LGA4189, so results from these would be very interesting to me. I have attached the benchmark (fpmem.c) and the results for my newly built system with an Intel i5-12600. Edit: I have uploaded a new version that corrects an error which affected the reported performance values by up to about 10%. Last edited by ErikAdr; March 3, 2022 at 05:55. |
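Not the actual fpmem.c kernel, just a minimal sketch of the idea: repeating R multiply-adds on each loaded element raises the FLOPs/load ratio without touching more memory (R and the coefficients here are arbitrary, chosen only for illustration).
Code:
#include <cstddef>

// Sketch only, not the fpmem.c source: R multiply-adds are applied to each
// loaded element, so the FLOPs/load ratio grows with R while the memory
// traffic (one load of x[i], one store of y[i]) stays the same.
template <int R>
void fpmem_like_kernel(const double* x, double* y, std::size_t n)
{
    const double a = 1.0000001, b = 0.9999999;   // arbitrary coefficients
    for (std::size_t i = 0; i < n; ++i) {
        double t = x[i];
        for (int r = 0; r < R; ++r)
            t = a * t + b;                        // 2 FLOPs per repeat
        y[i] = t;
    }
}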
|
March 2, 2022, 07:17 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,426
Rep Power: 49 |
Here are some results from my system (2x Epyc 7551). First one with all 64 threads, second with only 8 threads pinned to the first CCD. The latter makes it behave like a single 1st-gen Ryzen CPU with very low clock speeds and extremely crappy memory.
7551_a.txt Code:
compiled: mpiCC -D_USE_INTRINSIC -std=c++17 -O3 -march=native -c fpmem.c (gcc version 9.2.0)
run: mpirun -np 64 ./fpmem 30 24
System: 2x AMD Epyc 7551, 16x32GB DDR4-2666 2Rx4, OpenSUSE Leap 15.3, 5.3.18-150300.59.46-default

Performance (Gflops) using 64 processes with AVX2
FLOPs/load:   0.50       1       2       4       8      16      32      64
Array size
  8kB:      316.30  319.22  515.52  917.54  1050.67  1116.92  892.12  659.82
 16kB:      320.53  319.74  518.03  918.43  1050.43  1114.73  892.49  660.28
 32kB:      273.02  295.91  512.90  900.98  1043.95  1116.21  892.29  660.26
 64kB:      277.78  295.18  512.84  903.64  1043.42  1116.13  892.17  659.73
128kB:      272.64  296.74  509.96  897.51  1040.41  1116.53  893.17  659.66
256kB:      228.56  291.10  486.24  862.42  1021.68  1110.92  893.14  659.84
512kB:      138.33  271.48  460.28  831.43   999.17  1096.32  892.43  659.86
  1MB:      110.29  170.01  312.09  673.73   898.32  1051.14  888.12  658.28
  2MB:       12.53   25.13   50.02  125.73   227.88   425.46  831.72  656.62
  4MB:       11.63   23.28   46.65  116.91   211.14   398.37  765.78  656.24
  8MB:       11.63   23.29   46.59  116.60   210.99   398.59  767.66  655.67
 16MB:       11.64   23.33   46.78  116.84   211.22   398.66  768.35  655.37
 32MB:       11.69   23.47   47.00  117.33   211.95   400.49  765.00  655.42
 64MB:       11.72   23.61   47.29  118.19   213.55   406.43  752.34  652.60
Code:
compiled: mpiCC -D_USE_INTRINSIC -std=c++17 -O3 -march=native -c fpmem.c (gcc version 9.2.0)
run: mpirun -np 8 --bind-to core --rank-by core --map-by core ./fpmem 30 24
System: 2x AMD Epyc 7551, 16x32GB DDR4-2666 2Rx4, OpenSUSE Leap 15.3, 5.3.18-150300.59.46-default

Performance (Gflops) using 8 processes with AVX2
FLOPs/load:   0.50      1      2      4      8     16     32     64
Array size
  8kB:       37.58  38.02  61.04  110.03  125.33  134.04  108.10  80.87
 16kB:       38.58  39.24  63.00  110.46  126.50  134.51  108.40  80.88
 32kB:       38.08  38.47  62.44  110.13  124.67  134.28  107.54  80.63
 64kB:       38.22  38.61  62.78  110.24  124.63  134.33  107.81  80.78
128kB:       35.09  38.83  62.52  110.25  125.04  134.35  108.12  80.78
256kB:       32.88  39.08  62.60  110.24  125.85  135.03  107.83  80.72
512kB:       17.93  38.64  62.25  108.93  125.67  135.34  108.25  80.75
  1MB:       13.75  22.19  38.58   86.55  113.44  131.61  107.92  80.62
  2MB:        1.57   3.14   6.25   15.77   28.78   53.64  103.39  80.47
  4MB:        1.46   2.90   5.81   14.63   26.32   49.65   96.09  80.49
  8MB:        1.45   2.90   5.82   14.62   26.34   49.57   96.13  80.47
 16MB:        1.46   2.91   5.83   14.64   26.45   49.66   96.57  80.35
 32MB:        1.46   2.92   5.83   14.65   26.37   49.83   96.01  80.51
 64MB:        1.47   2.97   5.99   15.05   27.31   52.29   92.34  79.90
Last edited by flotus1; March 2, 2022 at 16:34. |
|
March 2, 2022, 09:26 |
|
#3 | |
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
Rep Power: 10 |
Quote:
Compared to the i5-12600, it looks like the two systems have about the same ratio between computational performance (with AVX2) and memory bandwidth. The Zen core has one FMA 'engine' (8 FLOPs/cycle), whereas Zen 2, Zen 3 and newer Intel cores have two 'engines' and do 16 FLOPs/cycle with AVX2. This is seen especially for the small arrays that can be contained within the first level cache. The Zen cores deliver about half the performance of the Intel cores at the same clock speed, but the 7551 has a lot of cores. Looking at the line for 64MB arrays, it is seen that both systems are memory bound up to about the point where there are 32 FLOPs/load: the performance numbers double each time the FLOPs/load ratio doubles, so the data supplied by the memory is the limiting factor. At 64 FLOPs/load both systems are CPU-bound. Thanks for running the benchmark! Last edited by ErikAdr; March 2, 2022 at 16:25. |
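As a sanity check of that crossover, one can compare against a simple roofline-style bound, min(peak Gflops, bandwidth x arithmetic intensity). The sketch below uses made-up ceiling numbers (not measured values) and assumes 8-byte doubles; it only illustrates where the memory-bound region ends.
Code:
#include <algorithm>
#include <cstdio>

// Roofline-style estimate: attainable Gflops = min(peak, BW * intensity).
// peak_gflops and bw_gbps are placeholder numbers, not measurements.
int main()
{
    const double peak_gflops = 650.0;   // assumed compute ceiling
    const double bw_gbps     = 120.0;   // assumed sustained memory bandwidth
    for (double flops_per_load = 0.5; flops_per_load <= 64.0; flops_per_load *= 2.0) {
        const double intensity = flops_per_load / 8.0;           // FLOPs per byte, 8-byte doubles
        const double bound = std::min(peak_gflops, bw_gbps * intensity);
        std::printf("%5.1f FLOPs/load -> bound %7.1f Gflops\n", flops_per_load, bound);
    }
    return 0;
}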
|
March 2, 2022, 14:04 |
|
#4 |
Senior Member
Join Date: May 2012
Posts: 551
Rep Power: 16 |
Another benchmark? Count me in!
Not sure what it means but here you go: Code:
System: Ryzen 3700X, 16GB DDR4 SingleRank @ 3200 MT/s, GCC 9.3, Ubuntu 20.04

Performance (Gflops) using 8 processes with AVX2
FLOPs/load:   0.50      1      2      4      8     16     32     64
Array size
  8kB:       71.31  118.02  164.55  318.22  368.69  373.22  286.99  203.55
 16kB:       59.24  121.15  163.89  315.79  367.57  376.81  287.23  202.66
 32kB:       62.78  105.93  160.60  278.80  351.75  358.89  286.54  203.30
 64kB:       62.71  105.87  163.70  280.15  354.65  372.24  287.88  203.20
128kB:       62.05  103.28  159.84  275.72  350.16  369.65  286.98  203.37
256kB:       46.90   95.48  149.06  273.69  343.32  364.09  284.49  203.07
512kB:       35.50   74.42  142.31  237.92  334.94  356.40  283.07  202.85
  1MB:       35.69   70.13  128.29  261.14  331.96  352.53  282.27  202.63
  2MB:       15.54   14.64   43.74  131.91  181.58  271.45  277.96  201.78
  4MB:        1.80    3.60    7.16   18.21   31.90   61.04  117.54  199.16
  8MB:        1.75    3.50    7.10   18.51   32.45   60.25  115.02  199.27
 16MB:        1.75    3.49    7.00   17.69   31.56   60.00  115.75  199.15
 32MB:        1.76    3.55    6.98   17.62   31.79   59.44  115.03  199.75
 64MB:        1.78    3.51    7.09   17.72   32.07   61.32  119.82  191.47
Code:
System: 2 x Xeon E5-2673v3, 128 GB DDR4 Dual rank @2133 MT/s, GCC 8.3, Debian 10

Performance (Gflops) using 24 processes with AVX2
FLOPs/load:   0.50      1      2      4      8     16     32     64
Array size
  8kB:      181.82  227.95  370.67  544.75  592.74  594.93  479.71  373.98
 16kB:      220.25  237.19  400.54  552.96  609.18  592.86  482.14  374.02
 32kB:       63.79  125.82  224.20  484.24  540.32  585.78  337.38  374.26
 64kB:       63.68  124.11  238.68  475.64  594.16  584.14  479.93  374.18
128kB:       55.34   93.03  189.14  392.51  542.51  579.75  481.11  374.07
256kB:       34.44   66.66  134.06  301.36  463.48  575.17  481.45  374.03
512kB:       33.57   64.77  129.26  287.09  448.37  571.43  482.26  373.94
  1MB:       31.43   57.68  112.56  260.22  413.58  552.70  478.22  373.60
  2MB:        4.98   10.13   20.27   50.92   92.27  173.84  338.16  370.65
  4MB:        4.73    9.46   18.85   47.05   84.63  159.23  303.62  370.16
  8MB:        4.69    9.39   18.77   46.72   83.82  158.12  300.67  369.71
 16MB:        4.65    9.34   18.64   46.53   83.17  157.03  299.76  369.34
 32MB:        4.64    9.23   18.42   45.97   82.80  155.93  298.16  369.09
 64MB:        4.63    9.20   18.39   45.90   82.48  155.30  297.76  369.05 |
|
March 2, 2022, 16:19 |
|
#5 |
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
Rep Power: 10 |
I can understand there is a need for an explanation of how to interpret the results. I took it in steps. First I ran STREAM with different array sizes to test the cache and memory bandwidths. Then I looked at floating point performance in cases with several floating point operations for each load. In HPC, some problems have a low value of the FLOPs/load ratio, and others a very high value. For CFD the ratio is usually low, but for multiplication of two large matrices the ratio is very high. For small values the performance is limited by the memory bandwidth, and for high values it is limited by the CPU's ability to crunch numbers. My interest is typically in the intermediate range, say with ratios from 4 to 64, where it is not evident what limits the performance.
I don't know how to include a text file, but I have attached the results for the i5-12600 again. Please look at it.

Looking at the first column, for a ratio of 0.5, it is seen that the performance is highest for very small arrays. Arrays of 8kB and 16kB can be contained within the 1st level cache, which is the fastest cache. At 32kB the performance is lower, since the 1st level cache is a little too small and the bandwidth starts to be limited by the slower 2nd level cache. From 64kB to 256kB the performance is nearly constant and determined by the bandwidth of the 2nd level cache. For larger arrays the bandwidth of the 3rd level cache starts to play a role, and from array sizes of 8MB and larger, the performance is limited by the bandwidth of the RAM. All performance figures in the first column are in this way determined by the bandwidth of the memory level in which the arrays can be contained. The calculation includes two equal sized arrays, but the size specified in the table is for each array.

The column at the right, for a ratio of 64, is much easier to interpret. Here all performance figures are about the same, independent of the array size. The performance is determined solely by the CPU's ability to crunch numbers.

For most intermediate columns the performance is about constant for the smaller array sizes, where the CPU is the limiting factor, but at some point the memory system that contains the larger arrays gets too slow, and the performance shifts to being limited by the memory bandwidth. The benchmark shows where this happens! |
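To make the "which memory level holds the arrays" reading a bit more concrete, here is a very rough classification sketch. It assumes example cache sizes (48kB L1 and 1.25MB L2 per core, 18MB shared L3) and that each process streams two arrays of the listed size; real behavior is gradual, so the transitions in the measured tables are smoother than this.
Code:
#include <cstdio>

// Rough classification of where the working set lives. Each process streams
// two arrays; L1 and L2 are taken as private per core, L3 as shared by all
// processes. The cache sizes are examples only and differ between CPUs.
const char* level(double array_kb, int nprocs)
{
    const double l1_kb = 48.0, l2_kb = 1280.0, l3_kb = 18.0 * 1024.0; // example sizes
    const double per_core = 2.0 * array_kb;      // two arrays per process
    const double shared   = nprocs * per_core;   // total footprint seen by L3
    if (per_core <= l1_kb) return "L1";
    if (per_core <= l2_kb) return "L2";
    if (shared   <= l3_kb) return "L3";
    return "RAM";
}

int main()
{
    for (double kb = 8.0; kb <= 64.0 * 1024.0; kb *= 2.0)
        std::printf("array %8.0f kB -> mostly served from %s\n", kb, level(kb, 6));
    return 0;
}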
|
March 2, 2022, 16:59 |
|
#6 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,426
Rep Power: 49 |
You can wrap code tags around any text, and it will appear formatted the same way as in a plain text file.
[CODE] text goes here[/CODE ] <- remove the space Code:
text goes here Last edited by flotus1; March 3, 2022 at 06:02. |
|
March 3, 2022, 05:48 |
|
#7 |
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
Rep Power: 10 |
I have made a version that also shows the corresponding memory bandwidth. There were a lot of figures before and even more now.....
Code:
System: 12th Gen Intel(R) Core(TM) i5-12600; 2 channels SR DDR5 @ 6000

Bandwidth (GB/s) using 6 processes with AVX2 (ver 1.2)
FLOPs/load:   0.50       1       2       4       8      16      32      64
Array size
  8kB:      2353.36  1291.34  1281.65  799.87  514.75  282.19  129.43  55.39
 16kB:      2675.92  1403.25  1297.78  815.08  518.52  284.52  127.70  55.40
 32kB:      1833.83  1123.37  1158.65  775.19  497.57  271.84  126.96  55.17
 64kB:      1145.61  1031.83   989.96  768.46  471.57  269.89  125.24  55.16
128kB:      1083.94  1034.94   990.89  771.72  472.04  267.54  124.16  55.18
256kB:      1144.07  1035.63   990.41  769.20  471.71  269.90  122.44  54.95
512kB:      1081.49   982.72   931.36  742.99  461.37  262.46  121.02  54.19
  1MB:       444.34   450.78   448.27  450.23  399.18  253.15  124.23  52.94
  2MB:       306.17   266.65   261.62  241.13  209.98  198.56  122.54  53.65
  4MB:       185.51   154.16   117.49  111.73  111.77  110.09  101.85  53.46
  8MB:        90.26    91.51    90.55   90.74   90.13   89.45   88.06  53.36
 16MB:        81.98    83.36    82.49   82.81   82.35   81.84   78.58  53.40
 32MB:        77.44    79.98    79.24   79.25   79.10   78.72   78.07  53.53
 64MB:        76.76    78.39    77.67   77.81   77.65   77.27   76.70  53.61

Performance (Gflops) using 6 processes with AVX2 (ver 1.2)
FLOPs/load:   0.50      1      2      4      8     16     32     64
Array size
  8kB:       98.06  107.61  213.61  299.95  364.61  388.00  350.53  297.73
 16kB:      111.50  116.94  216.30  305.65  367.29  391.22  345.84  297.76
 32kB:       76.41   93.61  193.11  290.70  352.45  373.77  343.86  296.55
 64kB:       47.73   85.99  164.99  288.17  334.03  371.10  339.19  296.50
128kB:       45.16   86.24  165.15  289.40  334.36  367.86  336.27  296.59
256kB:       47.67   86.30  165.07  288.45  334.13  371.11  331.62  295.38
512kB:       45.06   81.89  155.23  278.62  326.81  360.89  327.76  291.28
  1MB:       18.51   37.57   74.71  168.84  282.75  348.08  336.45  284.56
  2MB:       12.76   22.22   43.60   90.42  148.74  273.02  331.88  288.34
  4MB:        7.73   12.85   19.58   41.90   79.17  151.37  275.85  287.36
  8MB:        3.76    7.63   15.09   34.03   63.84  122.99  238.51  286.81
 16MB:        3.42    6.95   13.75   31.06   58.33  112.53  212.83  287.02
 32MB:        3.23    6.66   13.21   29.72   56.03  108.24  211.45  287.72
 64MB:        3.20    6.53   12.94   29.18   55.00  106.25  207.73  288.17

Looking at the column for 0.5 FLOPs/load, the performance of the memory system can be seen. From 64kB to 256kB the results are almost constant, showing the performance of the second level cache. From 8MB and up, it is the bandwidth of the RAM that limits the performance. In the bandwidth figures I have included one write for every two reads, like in the STREAM benchmark. The bandwidth for large arrays is very similar to the figures from STREAM. Looking at the computational performance, it has dropped a bit compared to the results posted earlier. I made a mistake that affects the results from 4 FLOPs/load and up. The performance is reduced by about 10% for 4 FLOPs/load, 5% for 8 FLOPs/load and 2.5% for 16 FLOPs/load. I have uploaded a corrected version in the first post. It is also corrected in the version attached here, which also reports memory bandwidth. Last edited by ErikAdr; March 3, 2022 at 07:44. |
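If I read that counting correctly, the bandwidth table can be reproduced from the Gflops table by assuming 8-byte doubles and one write per two reads. A small sketch (the exact bookkeeping in fpmem.c may differ):
Code:
#include <cstdio>

// Reconstruct a bandwidth figure from a measured Gflops value, assuming
// 8-byte doubles and STREAM-style counting of one write per two reads.
// This mirrors the description above; the real bookkeeping may differ.
double bandwidth_gbps(double gflops, double flops_per_load)
{
    const double loads_per_sec  = gflops / flops_per_load;   // in 1e9 loads/s
    const double bytes_per_load = 8.0 * 1.5;                  // 8-byte load + half a write
    return loads_per_sec * bytes_per_load;
}

int main()
{
    // Example: 64MB arrays at 64 FLOPs/load on the i5-12600 above.
    std::printf("%.1f GB/s\n", bandwidth_gbps(288.17, 64.0));  // ~54 GB/s, close to the 53.61 in the table
    return 0;
}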
|
March 9, 2022, 18:55 |
|
#8 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 365
Rep Power: 14 |
Code:
System: 4xOpteron 6376 32x 8GB DDR3-1600

Bandwidth (GB/s) using 64 processes with AVX2 (ver 1.2)
FLOPs/load:   0.50       1       2       4       8      16      32      64
Array size
  8kB:      1429.63  774.41  385.02  326.77  331.47  195.80  143.08  77.17
 16kB:      1237.93  719.69  445.85  341.76  310.33  195.59  142.24  77.11
 32kB:      1238.08  718.77  491.09  362.15  312.95  191.37  141.70  76.94
 64kB:      1228.22  717.10  471.26  379.37  312.74  190.14  140.95  76.57
128kB:      1124.11  649.13  446.28  372.39  307.63  189.93  140.25  76.67
256kB:      1087.40  628.51  444.04  370.25  306.16  186.85  140.14  76.70
512kB:       693.82  516.11  424.20  331.35  272.41  174.58  136.23  75.53
  1MB:       211.77  180.41  181.49  191.25  183.45  151.17  124.84  73.27
  2MB:       124.33  119.58  119.68  120.33  120.66  118.90  114.47  72.71
  4MB:       120.83  119.24  119.48  119.93  119.58  118.78  114.82  72.70
  8MB:       121.15  119.52  119.48  119.96  119.84  118.92  115.04  72.64
 16MB:       121.22  119.55  119.73  120.29  120.00  119.02  115.03  72.60
 32MB:       121.22  121.16  120.98  121.69  121.66  120.93  117.75  73.43
 64MB:       121.26  121.13  121.02  121.57  121.60  120.98  117.40  73.53

Performance (Gflops) using 64 processes with AVX2 (ver 1.2)
FLOPs/load:   0.50      1      2      4      8     16     32     64
Array size
  8kB:       59.57  64.53  64.17  122.54  234.79  269.22  387.51  414.76
 16kB:       51.58  59.97  74.31  128.16  219.81  268.94  385.23  414.49
 32kB:       51.59  59.90  81.85  135.81  221.68  263.13  383.76  413.54
 64kB:       51.18  59.76  78.54  142.26  221.53  261.44  381.75  411.58
128kB:       46.84  54.09  74.38  139.65  217.90  261.15  379.84  412.08
256kB:       45.31  52.38  74.01  138.84  216.87  256.92  379.55  412.24
512kB:       28.91  43.01  70.70  124.26  192.96  240.05  368.97  405.98
  1MB:        8.82  15.03  30.25   71.72  129.94  207.86  338.10  393.82
  2MB:        5.18   9.96  19.95   45.12   85.47  163.49  310.03  390.84
  4MB:        5.03   9.94  19.91   44.97   84.70  163.32  310.98  390.74
  8MB:        5.05   9.96  19.91   44.98   84.89  163.51  311.57  390.42
 16MB:        5.05   9.96  19.95   45.11   85.00  163.66  311.54  390.21
 32MB:        5.05  10.10  20.16   45.63   86.17  166.28  318.90  394.67
 64MB:        5.05  10.09  20.17   45.59   86.13  166.34  317.95  395.21 |
|
March 9, 2022, 22:57 |
|
#9 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 365
Rep Power: 14 |
Recompiled with OpenMP and two threads per process (running 32 processes instead of 64). The GFlops are much improved because this CPU shares cache and FPU between two integer cores. A rough sketch of the threading idea follows after the results.
Code:
System: 4xOpteron 6376 32x DDR3-1600

Bandwidth (GB/s) using 32 processes with AVX2 (ver 1.2)
FLOPs/load:   0.50       1       2       4       8      16      32      64
Array size
  8kB:       358.73  328.31  336.01  287.65  294.05  251.21  153.21  88.35
 16kB:       565.12  457.82  431.06  375.55  366.33  291.31  167.76  88.58
 32kB:       783.87  574.87  539.15  462.21  426.45  330.46  172.52  90.89
 64kB:       953.99  658.27  619.53  523.85  471.55  334.78  171.77  96.58
128kB:      1090.23  716.07  671.24  528.28  506.23  329.29  178.33  97.10
256kB:       973.01  602.55  564.47  492.98  458.04  324.14  170.09  95.76
512kB:       986.28  613.41  578.11  509.80  469.13  327.15  170.53  95.80
  1MB:       613.50  522.80  493.97  455.18  413.47  307.43  168.32  94.67
  2MB:       200.24  183.15  185.47  185.90  186.86  182.11  153.70  92.87
  4MB:       120.47  118.79  118.98  119.18  118.86  118.71  117.33  96.64
  8MB:       120.78  118.75  118.92  119.13  118.84  118.72  118.29  97.05
 16MB:       120.87  118.92  119.01  119.67  119.00  119.05  118.82  82.24
 32MB:       120.92  120.62  120.93  121.25  121.08  121.19  118.45  84.99
 64MB:       120.95  120.49  121.03  121.43  121.28  121.20  117.56  86.36

Performance (Gflops) using 32 processes with AVX2 (ver 1.2)
FLOPs/load:   0.50      1      2      4      8     16     32     64
Array size
  8kB:       14.95  27.36   56.00  107.87  208.28  345.42  414.96  474.90
 16kB:       23.55  38.15   71.84  140.83  259.48  400.55  454.34  476.12
 32kB:       32.66  47.91   89.86  173.33  302.07  454.39  467.23  488.56
 64kB:       39.75  54.86  103.25  196.44  334.01  460.32  465.20  519.11
128kB:       45.43  59.67  111.87  198.11  358.58  452.78  482.99  521.93
256kB:       40.54  50.21   94.08  184.87  324.45  445.70  460.66  514.69
512kB:       41.10  51.12   96.35  191.18  332.30  449.83  461.85  514.93
  1MB:       25.56  43.57   82.33  170.69  292.88  422.72  455.86  508.84
  2MB:        8.34  15.26   30.91   69.71  132.36  250.40  416.28  499.20
  4MB:        5.02   9.90   19.83   44.69   84.20  163.23  317.78  519.45
  8MB:        5.03   9.90   19.82   44.67   84.18  163.25  320.36  521.63
 16MB:        5.04   9.91   19.84   44.88   84.29  163.70  321.81  442.06
 32MB:        5.04  10.05   20.15   45.47   85.76  166.63  320.82  456.83
 64MB:        5.04  10.04   20.17   45.54   85.90  166.65  318.40  464.21 |
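Purely as an illustration of that setup (not the actual fpmem.c code): the per-process loop can be split over two OpenMP threads, so the two integer cores of a Bulldozer module work on the same data while sharing one FPU. Compile with -fopenmp in addition to the usual MPI flags.
Code:
#include <cstddef>

// Sketch of splitting the per-process kernel over two OpenMP threads, so the
// two integer cores of a module share the loop while sharing one FPU.
// Compile with -fopenmp; this is not the real fpmem.c kernel.
void kernel_omp(const double* x, double* y, std::size_t n, int repeats)
{
    #pragma omp parallel for num_threads(2) schedule(static)
    for (std::size_t i = 0; i < n; ++i) {
        double t = x[i];
        for (int r = 0; r < repeats; ++r)
            t = 1.0000001 * t + 0.9999999;   // 2 FLOPs per repeat
        y[i] = t;
    }
}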
|
March 10, 2022, 20:13 |
|
#10 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 365
Rep Power: 14 |
I don't understand why the GFlops don't keep increasing as the flop_load_ratio goes up. Anyone have an answer?
|
|
March 11, 2022, 09:30 |
|
#11 | |
Member
Erik Andresen
Join Date: Feb 2016
Location: Denmark
Posts: 35
Rep Power: 10 |
Quote:
The Gflops do increase for your Opteron system. See the computational performance in the table at the bottom. The table at the top shows the amount of data read from memory, corresponding to the computational performance in the lower table. The GB/s decays when the computational performance becomes the limiting factor, which is the case when the flop_load_ratio is high. The performance is either limited by memory bandwidth or by computational performance, and the test gives a picture of which is the limiting factor for various array sizes and flop_load_ratios. Hope this helped. |
|
March 11, 2022, 09:41 |
|
#12 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 365
Rep Power: 14 |
It does for my results, but not for the others. If I run the 128 case, it also drops. Why would a higher number of repeats lead to reduced flops?
|
|