November 23, 2017, 14:03 |
Kernel for new CPUs
|
#1 |
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
Hey,
Tried a 7940X setup today with a fresh CentOS installation. As a benchmark I used Palabos (cavity3d). The performance was abysmal, to say the least.
At N=100 the result for 1 thread was about 4.5 msu.
With the 4.14 kernel I managed to increase this to:
1 thread: 9 msu
4 threads: 29 msu
8 threads: 43 msu
Compared to the old CPUs in the reference list, I feel that I am doing something wrong. What do you think, it should be higher, right? http://wiki.palabos.org/plb_wiki:benchmark:cavity_n100
Installing OpenFOAM now to get some regular CFD benchmark data. |
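To put the scaling numbers above in perspective, the parallel efficiency can be computed directly from them. This is only a minimal sketch in Python using the msu values quoted in this post; the function name is made up for illustration.

```python
def parallel_efficiency(single_thread_msu, multi_thread_msu, threads):
    """Fraction of ideal linear scaling reached at the given thread count."""
    return multi_thread_msu / (single_thread_msu * threads)


# msu values reported above for the 7940X with the 4.14 kernel
results = {1: 9.0, 4: 29.0, 8: 43.0}

for threads, msu in results.items():
    eff = parallel_efficiency(results[1], msu, threads)
    print(f"{threads} threads: {msu:5.1f} msu, efficiency {eff:.0%}")
```

With these inputs the efficiency comes out at roughly 81% for 4 threads and 60% for 8 threads, so the scaling falls off noticeably even on top of the low single-thread value.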
|
November 23, 2017, 15:25 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Seems like there is something wrong. My laptop processor (I5-4210U) is getting 7.3 MLUPS on one core.
Did you recompile after the kernel update? Did you clear caches before running the benchmark?
Code:
# free && sync && echo 3 > /proc/sys/vm/drop_caches && free
Which memory do you have? Did you try to bind the process to a core when solving on one core?
Edit: You might want to try a few ordinary benchmarks first. Being able to compare your results with known results for the same hardware helps in finding possible causes of bad performance. Maybe even use a test installation of Windows; the tools available there are easier to use, in my opinion.
Last edited by flotus1; November 23, 2017 at 19:04. |
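For the question about binding the process to a core, one common approach on Linux is the `taskset` utility from util-linux. The sketch below only builds the command line; the helper name is made up, and the benchmark invocation `./cavity3d 100` is assumed to be run from the directory holding the binary.

```python
import shlex


def pinned_command(command, core=0):
    """Prefix a command with `taskset -c <core>` so it stays on one core."""
    return ["taskset", "-c", str(core)] + list(command)


cmd = pinned_command(["./cavity3d", "100"], core=0)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run() to actually launch it
```

If the single-thread result does not change with pinning, process migration between cores is probably not the cause of the low numbers.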
|
November 24, 2017, 08:35 |
|
#3 | ||||
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
Quote:
Yes, did a "make clean" and "make". At least it was a 2x speed improvement with the new kernel. No, but after testing this suggestion it is still around 9.5-9.9 msu.
Quote:
Corsair Vengeance LPX 3200 MHz, 4x8 GB. Not sure if XMP is on or not. I am SSHing into the computer, so I cannot check at the moment.
Quote:
Quote:
UPDATE: I have now also checked the benchmark on my 7600k, and it gives 5.4 msu under Ubuntu 16.04. The 7600k should be able to outperform the I5-4210U, unless you have it extremely overclocked (which seems unlikely in a laptop), right?
I compile it using the makefile (no changes). I run it with "./cavity3d 100" or "mpirun -np 1 ./cavity3d 100". Am I doing anything differently here?
Last edited by Simbelmynė; November 24, 2017 at 08:39. Reason: Update |
|||||
November 24, 2017, 09:08 |
|
#4 | ||||
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Quote:
Maybe an energy-saving option in your OS? Maybe deactivated in the BIOS? Anyway, if your CPU is running at 3.1 GHz, that would explain the mediocre performance. You could try locking the clock speed to a higher value once you have access to the BIOS again and see if this changes anything.
Quote:
For MPI, there must be some environment variables or runtime options controlling thread affinity. I am not an MPI expert yet. But if the process ran on one specific core all along, this is not the cause of your problem.
Quote:
But no, it is running at 2.7 GHz single-core turbo with dual-channel DDR3-1600. Linux Mint 18.2.
Quote:
But I have to say that is a neat and handy CFD benchmark...
Last edited by flotus1; November 25, 2017 at 20:42. |
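Regarding the runtime options for affinity mentioned above: assuming Open MPI is the MPI library in use, its `mpirun` accepts `--bind-to core` and `--report-bindings`. Other MPI implementations use different flags, so treat this as a sketch under that assumption; the helper name is made up.

```python
import shlex


def bound_mpirun(executable, ranks, args=()):
    """Build an Open MPI launch line that binds each rank to one core."""
    # --bind-to core and --report-bindings are Open MPI mpirun options;
    # this is an assumption about the MPI library, not something the
    # thread has confirmed at this point.
    return ["mpirun", "-np", str(ranks), "--bind-to", "core",
            "--report-bindings", executable, *args]


cmd = bound_mpirun("./cavity3d", 1, ["100"])
print(shlex.join(cmd))
```

The `--report-bindings` output shows which core each rank ended up on, which helps to check whether a single-rank run really stayed on one core.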
|||||
November 27, 2017, 10:32 |
|
#5 |
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
I have done some testing with OpenFOAM, and it seems that Ubuntu correctly applies Turbo Boost to increase the frequency of the 7600K CPU. However, CentOS does not appear to apply Turbo Boost at all to the 7940X CPU.
Not sure if this is a monitoring issue; I have yet to install i7z on CentOS. So far I only use
Code:
$ lscpu | grep "MHz"
My main suspicion is that I have to modify the BIOS and turn off some power-saving options. Will try it when I have the possibility.
UPDATE: I can now verify that lscpu is not showing the frequency correctly under the 4.14 kernel (it is OK under the 4.10 kernel). I tested upgrading the Ubuntu installation to 4.14 and got the same problem as in CentOS. Furthermore, I have now also tested i7z under CentOS, and it shows the correct turbo frequency. (I still get 9.86 msu on the 7940X, and even though that feels a bit low, the CPU seems to work reasonably when compared to the OpenFOAM results.)
Last edited by Simbelmynė; November 27, 2017 at 16:12. Reason: Update |
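If the lscpu check is repeated across several kernels, the "MHz" lines can be collected with a small parser. A minimal sketch; the sample text is made up in the usual lscpu layout and is not measured data, and the function name is hypothetical.

```python
def parse_lscpu_mhz(text):
    """Collect the 'MHz' entries from lscpu-style 'name: value' lines."""
    freqs = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep or "MHz" not in name:
            continue
        try:
            freqs[name.strip()] = float(value)
        except ValueError:
            pass  # skip entries that are not plain numbers
    return freqs


# Made-up sample in the layout printed by `lscpu | grep MHz`; not real data.
sample = """CPU MHz:               3100.000
CPU max MHz:           4400.0000
CPU min MHz:           1200.0000"""

freqs = parse_lscpu_mhz(sample)
print(freqs)
```

On a real system the input would come from `subprocess.run(["lscpu"], capture_output=True, text=True).stdout`. As noted in the update above, the reported value itself may be wrong under the 4.14 kernel, so this only automates the reading and does not make it more trustworthy.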
|
December 2, 2017, 07:02 |
|
#6 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
As I just found out myself while setting up my new workstation: "sudo turbostat" shows the actual CPU frequency on Linux. While "lscpu" and "cat /proc/cpuinfo | grep MHz" always worked fine for me on Intel systems, including turbo frequencies, they seem to show only base frequencies under certain circumstances.
Btw: AMD Epyc 7301 with DDR4-2133 memory and 2.7GHz clock speed hits around 9.2MLUPS single-core in this benchmark. You might want to overclock the uncore/cache/ring/mesh/whateveritscallednow on your CPU. It is known to be quite low and the cause for some mediocre benchmark results where Skylake-X gets beaten by its predecessors. |
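As far as I understand it, turbostat is more reliable here because it derives the busy frequency from the APERF and MPERF hardware counters instead of the value the kernel exports: MPERF ticks at the fixed TSC rate and APERF at the actual clock, both only while the core is not idle. A sketch of that arithmetic with hypothetical counter deltas (not measured data); the function name is made up.

```python
def busy_mhz(tsc_mhz, delta_aperf, delta_mperf):
    """Average clock while not idle: TSC rate scaled by APERF/MPERF deltas."""
    if delta_mperf <= 0:
        raise ValueError("MPERF did not advance; the core was idle")
    return tsc_mhz * delta_aperf / delta_mperf


# Hypothetical counter deltas over one sampling interval (not measured data):
# a 3.1 GHz TSC with APERF advancing 43/31 times as fast as MPERF.
print(busy_mhz(3100.0, 4_300_000, 3_100_000))
```

With these made-up deltas the result is 4300 MHz, i.e. a core nominally clocked at 3.1 GHz that actually ran at 4.3 GHz while busy.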
|
December 2, 2017, 09:43 |
|
#7 |
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
Thank you for the suggestions! I do not run calculations on my 7600k, but it is annoying anyway, so I will try to figure out how to solve it
It seems that the Epyc 7301 and 7940X are quite equally matched in single core in this benchmark. The 7940X is turbo boosting to about 4.3 GHz, but it only has 4 memory channels. I expect the difference to be much larger at higher thread count. I have also tested an 8700k and it yields approx. 13 msu in the same benchmark, so very powerful in single threaded simulations. In the OF motorbike benchmark it hits a wall at 4 threads, with very minor improvement after that. |
|
December 2, 2017, 12:37 |
|
#8 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
||
December 13, 2017, 09:41 |
|
#9 |
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
I have an interesting observation.
Running the Palabos benchmark in a Virtualbox using Ubuntu 17.10 I get 11.6 msu with the 7600k cpu. This is a rather massive improvement from the 5.4 msu I get when booting into Ubuntu 16.04 with the same machine. Not sure if the difference is attributed to the Virtualbox being Ubuntu 17.10 (as opposed to the 16.04 I have installed) or if it has something to do with the Virtualbox itself. Btw, changing "uncore/cache/ring/mesh/whateveritscallednow" had no effect. |
|
December 13, 2017, 12:04 |
|
#10 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Weird. I suspect that something is still bottlenecking your I9 CPU. Did you already try a clean install with a more recent Linux version? I am currently running Opensuse Tumbleweed on my AMD workstation which works quite well.
|
|
December 13, 2017, 18:11 |
|
#11 |
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
OK so I have done some more testing (best kernel in bold text).
Still the cavity3d 100 test case from Palabos, single thread.
7600K
Linux Mint 18.3 - approx. 6 msu (kernel 4.10.0)
Ubuntu 16.04 - approx. 6 msu (kernel 4.10.0)
Ubuntu 16.04 (Virtualbox) - 5.9 msu (kernel 4.10.0)
Ubuntu 17.10 (Virtualbox) - 11.6 msu (kernel 4.13.0)
8700k
CentOS 7.4.1708 - 2.8 msu (kernel 3.10.0) (yes I double checked this one!)
CentOS 7.4.1708 - 12.8 msu (kernel 4.14.2)
Threadripper 1950X
CentOS 7.4.1708 - 8.1 msu (kernel 3.10.0)
CentOS 7.4.1708 - 7.6 msu (kernel 4.14.2)
I9 7940X
CentOS 7.4.1708 - 4.5 msu (kernel 3.10.0)
CentOS 7.4.1708 - 10.7 msu (kernel 4.14.1-1) *updated*
Epyc 7301 (from flotus1)
CentOS 7 - 9.2 msu (kernel 4.14-3-1)
OpenSUSE Tumbleweed - 9.37 msu (kernel 4.14-3-1)
The top results for each CPU are more or less in line with the frequency and IPC of each model, with one big exception - the EPYC - which performs much better for some reason. Perhaps the 4.14.2 kernel is not good enough and still limits the 7940X and the Threadripper 1950X?
It is very clear that the kernel has a dramatic impact on the performance of this benchmark. However, it seems that the latest is not always the greatest. This is true for the Threadripper case only, though.
I will keep a close eye on the kernel releases, now that there is once again competition in the high-end segment of computing, leading to more frequent releases of new models.
Last edited by Simbelmynė; December 14, 2017 at 10:10. Reason: Updated some values |
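The kernel effect in the list above is easier to compare as a ratio per CPU. A minimal sketch using only the CentOS values quoted in this post; the function name is made up.

```python
# Single-thread msu from the list above: (kernel 3.10.0, kernel 4.14.x)
results = {
    "8700k": (2.8, 12.8),
    "Threadripper 1950X": (8.1, 7.6),
    "I9 7940X": (4.5, 10.7),
}


def kernel_speedup(old_msu, new_msu):
    """Ratio of the newer-kernel result to the 3.10.0 result."""
    return new_msu / old_msu


for cpu, (old, new) in results.items():
    print(f"{cpu}: {kernel_speedup(old, new):.2f}x")
```

This gives roughly 4.6x for the 8700k and 2.4x for the 7940X, while the Threadripper drops slightly to about 0.94x, which is the "latest is not always the greatest" case.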
|
December 14, 2017, 05:08 |
|
#12 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
IIRC the 9.2 single-core were on CentOS 7 with Kernel version 4.14-3-1
Now with Tumbleweed (same Kernel) this slightly improved to 9.37. |
|
December 14, 2017, 07:52 |
|
#13 |
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
OK, thanks! I will try out the 4.14-3-1 kernel and see if that helps the Threadripper.
How about the results. Do you think this benchmark is bandwidth limited for 1 thread? It doesn't seem to be when looking at the Intel line-up. But your EPYC clearly says otherwise. |
|
December 14, 2017, 08:43 |
|
#14 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I don't think so, a memory bandwidth limit on a single core would be rather unusual. Apart from that, your I9 CPU has more memory bandwidth available to a single core, AMD Epyc only has 2 memory channels per die, your CPU has 4 with even higher frequency.
Which MPI library did you use, and how did you install it? I downloaded openmpi 3.0 and compiled it from source: https://www.open-mpi.org/software/ompi/v3.0/
And which compiler version are you using? |
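The bandwidth argument above can be checked with a rough estimate. The sketch below uses the theoretical peak per memory channel (transfer rate times 8 bytes) and, as a loudly labelled assumption, a D3Q19 lattice in double precision where each population is read and written once per site update. DDR4-3200 for the 7940X also assumes that XMP is actually active, which was not confirmed earlier in the thread.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak: channels x transfer rate x 8 bytes per transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000.0


def traffic_gb_s(msu, bytes_per_update=304):
    """Memory traffic for a given update rate.

    Assumption: 19 populations * 8 bytes * (1 read + 1 write) = 304 bytes
    per lattice site update; the real code may move more or less data.
    """
    return msu * 1e6 * bytes_per_update / 1e9


epyc_die = peak_bandwidth_gb_s(2, 2133)  # DDR4-2133, 2 channels per die
i9_7940x = peak_bandwidth_gb_s(4, 3200)  # DDR4-3200, assumes XMP is active

print(f"Epyc 7301, one die: {epyc_die:.1f} GB/s peak")
print(f"I9 7940X:           {i9_7940x:.1f} GB/s peak")
print(f"Traffic at 10.7 msu: {traffic_gb_s(10.7):.2f} GB/s")
```

Under these assumptions a single core at about 10 msu moves only a few GB/s, far below either platform's theoretical peak, which is consistent with the view that the single-thread result is not limited by memory bandwidth.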
|
January 3, 2018, 10:26 |
|
#15 |
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
Ok, so I have realized that the test case is most likely flawed. After testing all currently stable kernels I got very inconclusive results.
The compiler itself seems important (but not always; on CentOS the kernel version seemed more important). Using Linux Mint 18.3 with version 5.3.1 of g++, my 8700k managed 6.9 msu with the 4.13.0 kernel. However, when using version 7.2 of g++ under Ubuntu 17.10 (4.13.0 kernel), I got 14 msu. Installing g++ 7.1 on the Linux Mint installation resulted in a benchmark value of 623 msu!!
So now I have tested the cavity3d example under examples/showCases/ instead. It gives the same results regardless of compiler and kernel. It would be interesting to hear what results you get when running that case with 1 thread.
I changed the code to suppress output and HDD read/write, e.g.:
Code:
const T logT = (T)1/(T) 1;
const T imSave = (T)10/(T) 1;
const T vtkSave = (T)10/(T) 1;
Code:
time mpirun -np 6 ./cavity3d
Threadripper 1950X (16 threads): 2m01s
Last edited by Simbelmynė; January 3, 2018 at 11:40. Reason: Spelling :P, added 1950X and 7940X |
|
January 3, 2018, 15:00 |
|
#16 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I applied the changes to the source code as you suggested. Output for dual AMD Epyc 7301:
Code:
as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d> time mpirun -np 32 cavity3d
omega= 1.53846
Writing Gif ...
step 0; t=0; av energy=6.12745098e-07; av rho=1
Time spent during previous iteration: 0
step 5000; t=1; av energy=1.906470281e-06; av rho=0.9999633641
Time spent during previous iteration: 0.001128533
step 10000; t=2; av energy=1.907210258e-06; av rho=0.9999266731
Time spent during previous iteration: 0.001125073
step 15000; t=3; av energy=1.90721169e-06; av rho=0.9998899833
Time spent during previous iteration: 0.001120143
step 20000; t=4; av energy=1.907211692e-06; av rho=0.9998532949
Time spent during previous iteration: 0.001123033
step 25000; t=5; av energy=1.907211692e-06; av rho=0.9998166078
Time spent during previous iteration: 0.001139313
step 30000; t=6; av energy=1.907211692e-06; av rho=0.999779922
Time spent during previous iteration: 0.001128202
step 35000; t=7; av energy=1.907211692e-06; av rho=0.9997432376
Time spent during previous iteration: 0.001129653
step 40000; t=8; av energy=1.907211692e-06; av rho=0.9997065546
Time spent during previous iteration: 0.001125473
step 45000; t=9; av energy=1.907211692e-06; av rho=0.9996698728
Time spent during previous iteration: 0.001126403
Writing Gif ...
Saving VTK file ...
step 50000; t=10; av energy=1.907211692e-06; av rho=0.9996331925
Time spent during previous iteration: 0.001134143

real 0m59.133s
user 29m53.143s
sys 1m29.464s
as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d> time mpirun -np 1 cavity3d
omega= 1.53846
Writing Gif ...
step 0; t=0; av energy=6.12745098e-07; av rho=1
Time spent during previous iteration: 0
step 5000; t=1; av energy=1.906470281e-06; av rho=0.9999633641
Time spent during previous iteration: 0.021283929
step 10000; t=2; av energy=1.907210258e-06; av rho=0.9999266731
Time spent during previous iteration: 0.021312529
step 15000; t=3; av energy=1.90721169e-06; av rho=0.9998899833
Time spent during previous iteration: 0.021273699
step 20000; t=4; av energy=1.907211692e-06; av rho=0.9998532949
Time spent during previous iteration: 0.021283039
step 25000; t=5; av energy=1.907211692e-06; av rho=0.9998166078
Time spent during previous iteration: 0.021346609
step 30000; t=6; av energy=1.907211692e-06; av rho=0.999779922
Time spent during previous iteration: 0.021269809
step 35000; t=7; av energy=1.907211692e-06; av rho=0.9997432376
Time spent during previous iteration: 0.021288709
step 40000; t=8; av energy=1.907211692e-06; av rho=0.9997065546
Time spent during previous iteration: 0.021296029
step 45000; t=9; av energy=1.907211692e-06; av rho=0.9996698728
Time spent during previous iteration: 0.021267211
Writing Gif ...
Saving VTK file ...
step 50000; t=10; av energy=1.907211692e-06; av rho=0.9996331925
Time spent during previous iteration: 0.02127318

real 17m56.751s
user 17m54.621s
sys 0m0.237s
as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d> gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/7/lto-wrapper
OFFLOAD_TARGET_NAMES=hsa:nvptx-none
Target: x86_64-suse-linux
Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,ada,go --enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver --enable-checking=release --disable-werror --with-gxx-include-dir=/usr/include/c++/7 --enable-ssp --disable-libssp --disable-libvtv --disable-libcc1 --enable-plugin --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --with-gcc-major-version-only --enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function --program-suffix=-7 --without-system-libunwind --enable-multilib --with-arch-32=x86-64 --with-tune=generic --build=x86_64-suse-linux --host=x86_64-suse-linux
Thread model: posix
gcc version 7.2.1 20171020 [gcc-7-branch revision 253932] (SUSE Linux)
as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d> |
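The two "real" times in the output above give the parallel speedup of the dual Epyc 7301 directly. A minimal sketch that parses the `time` format and computes speedup and efficiency for the 32 ranks used; the helper name is made up.

```python
import re


def parse_time(text):
    """Convert the `time` format, e.g. '17m56.751s', to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


serial = parse_time("17m56.751s")   # real time with -np 1
parallel = parse_time("0m59.133s")  # real time with -np 32
ranks = 32

speedup = serial / parallel
efficiency = speedup / ranks
print(f"speedup {speedup:.1f}x, parallel efficiency {efficiency:.0%}")
```

This works out to a speedup of about 18.2x on 32 ranks, or roughly 57% parallel efficiency, for this rather small case.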
|
January 3, 2018, 15:35 |
|
#17 |
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
Fantastic, thank you!
So the dual Epyc is twice as fast as the Threadripper in this benchmark!
It is interesting to note that the 1950X is actually (slightly) faster than the 7940X, even at 14 threads. I did not expect that. The 8700k is superior in single-threaded performance (as it should be; the previous results were really confusing).
Do you think the system size is large enough to utilize the dual Epyc bandwidth, or will larger sizes yield even bigger differences compared to the 1950X and 7940X? Of course twice the speed is huge in its own right, I'm just asking from a pure price/performance viewpoint. |
|
January 4, 2018, 08:08 |
|
#18 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
The system size here is 50x50x50? Then I would expect the gap to become larger with increased problem size.
|
|
January 4, 2018, 10:09 |
|
#19 |
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
Running at N=100 (one million cells):
1950X: 29m16s
7940X: 37m39s
The AMD system is 29% faster. At N=50 (above) the AMD system is 36% faster, so the gap becomes smaller, but it is still visibly in favor of the Threadripper. |
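For reference, the 29% figure follows from the ratio of the two run times at N=100. A minimal sketch with made-up function names; note that "faster" is taken here as the ratio of run times minus one, not as the fraction of time saved.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to seconds."""
    return 60 * minutes + seconds


def percent_faster(slow_s, fast_s):
    """How much faster the quicker system is, as a percentage of throughput."""
    return (slow_s / fast_s - 1.0) * 100.0


t_1950x = to_seconds(29, 16)  # 1950X at N=100
t_7940x = to_seconds(37, 39)  # 7940X at N=100

print(f"1950X is {percent_faster(t_7940x, t_1950x):.0f}% faster at N=100")
```

Measured as time saved instead (1 - 1756/2259), the same two runs give about 22%, so it is worth stating which definition is used when comparing with other posts.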
|
January 4, 2018, 10:51 |
|
#20 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
My bad, I thought you were referring to parallel execution times compared to the Epyc setup.