Kernel for new CPUs

November 23, 2017, 14:03	Kernel for new CPUs	#1
Simbelmynë Senior Member Join Date: May 2012 Posts: 552 Rep Power: 16
Hey,

I tried a 7940X setup today with a fresh CentOS installation. As a benchmark I used Palabos (cavity3d).

The performance was abysmal, to say the least. At N=100 the result for 1 thread was about 4.5 msu. With the 4.14 kernel I managed to increase that to:

1 thread: 9 msu
4 threads: 29 msu
8 threads: 43 msu

Compared to the old CPUs in the reference, I feel that I am doing something wrong. What do you think, it should be higher, right?

http://wiki.palabos.org/plb_wiki:benchmark:cavity_n100

Installing OpenFOAM now to get some regular CFD benchmark data.

November 23, 2017, 15:25		#2
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,427 Rep Power: 49
Seems like there is something wrong. My laptop processor (I5-4210U) is getting 7.3 MLUPS on one core.

Did you recompile after the kernel update?
Did you clear caches before running the benchmark?
Code:
# free && sync && echo 3 > /proc/sys/vm/drop_caches && free
Did you make sure the CPU is running at maximum frequency while solving?
Which memory do you have?
Did you try to bind the process to a core when solving on one core?

Edit: You might want to try a few ordinary benchmarks first. Being able to compare your results with known results for the same hardware helps in finding possible causes of bad performance. Maybe even use a test installation of Windows. The tools you have there are easier to use, in my opinion.

Last edited by flotus1; November 23, 2017 at 19:04.

November 24, 2017, 08:35		#3
Simbelmynë Senior Member Join Date: May 2012 Posts: 552 Rep Power: 16

Quote:
Originally Posted by flotus1
Seems like there is something wrong. My laptop processor (I5-4210U) is getting 7.3 MLUPS on one core.

Seems reasonable!

Quote:
Originally Posted by flotus1
Did you recompile after the kernel update?

Yes, I did a "make clean" and "make". At least the new kernel gave a 2x speed improvement.

Quote:
Originally Posted by flotus1
Did you clear caches before running the benchmark?

No, but after testing this suggestion it is still around 9.5-9.9 msu.

Quote:
Originally Posted by flotus1
Did you make sure the CPU is running at maximum frequency while solving?

This is interesting. I don't see any use of Turbo Boost. The frequency just stays at 3.1 GHz on all cores. While testing the same benchmark on my 7600k (Ubuntu 16.04) I get the same behavior. Not sure why Turbo Boost is not kicking in.

Quote:
Originally Posted by flotus1
Which memory do you have?

Corsair Vengeance LPX 3200 MHz, 4x8 GB. Not sure if XMP is on or not. I am SSHing into the computer, so I cannot check at the moment.

Quote:
Originally Posted by flotus1
Did you try to bind the process to a core when solving on one core?

No. How do I do that? I can see that the same core is being used throughout the benchmark, though.

Quote:
Originally Posted by flotus1
Edit: You might want to try a few ordinary benchmarks first. Being able to compare your results with known results for the same hardware helps in finding possible causes of bad performance. Maybe even use a test installation of Windows. The tools you have there are easier to use, in my opinion.

Yeah, perhaps a dual boot with Windows is good anyway. I will test it!

UPDATE: I have now also checked the benchmark on my 7600k, and it gives 5.4 msu under Ubuntu 16.04. The 7600k should be able to outperform the I5-4210U, unless you have it extremely overclocked (seems unlikely in a laptop though), right?

I compile it using the makefile (no changes).
I run it with "./cavity3d 100"
or "mpirun -np 1 ./cavity3d 100"

Am I doing anything differently here?

Last edited by Simbelmynë; November 24, 2017 at 08:39. Reason: Update

November 24, 2017, 09:08		#4
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,427 Rep Power: 49

Quote:
Originally Posted by Simbelmynë
This is interesting. I don't see any use of Turbo Boost. The frequency just stays at 3.1 GHz on all cores. While testing the same benchmark on my 7600k (Ubuntu 16.04) I get the same behavior. Not sure why Turbo Boost is not kicking in.

Another reason to test a different OS. With Opensuse and Mint (and Windows of course) I never had any issues with turbo not being used.
Maybe an energy-saving option in your OS? Maybe it is deactivated in the BIOS?
Anyway, if your CPU is running at 3.1 GHz, that would explain the mediocre performance. You could try locking the clock speed to a higher value once you have access to the BIOS again and see if this changes anything.

Quote:
Originally Posted by Simbelmynë
No. How do I do that? I can see that the same core is being used throughout the benchmark, though.

You can try using taskset for any program.
For MPI, there must be some environment variables or runtime options controlling thread affinity. I am not an MPI expert yet. But if the process ran on one specific core all along, this is not the cause of your problem.
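A minimal sketch of the taskset suggestion above, assuming a Linux machine with `taskset` (util-linux) installed. The core index 0 is an arbitrary choice for illustration, and the helper names are hypothetical; only `./cavity3d 100` comes from the thread.

```python
import os
import shlex


def pinned_command(core, command):
    """Prefix a command with `taskset -c <core>` so it runs on one logical CPU."""
    if core < 0:
        raise ValueError("core index must be non-negative")
    return ["taskset", "-c", str(core)] + list(command)


def pin_current_process(core):
    """Pin this Python process (and children it forks later) to one core.

    os.sched_setaffinity only exists on Linux, so report False elsewhere.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, {core})  # pid 0 means "the calling process"
    return True


if __name__ == "__main__":
    # Core 0 is an assumption; any core the scheduler allows would do.
    cmd = pinned_command(0, ["./cavity3d", "100"])
    print(shlex.join(cmd))  # taskset -c 0 ./cavity3d 100
```

The printed line can be pasted into a shell as-is. As noted in the post, pinning only matters if the process was actually migrating between cores.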

Quote:
Originally Posted by Simbelmynë
UPDATE: I have now also checked the benchmark on my 7600k, and it gives 5.4 msu under Ubuntu 16.04. The 7600k should be able to outperform the I5-4210U, unless you have it extremely overclocked (seems unlikely in a laptop though), right?

I probably would if I could.

But no, it is running at 2.7 GHz single-core turbo with dual-channel DDR3-1600, under Linux Mint 18.2.

Quote:
Originally Posted by Simbelmynë
I compile it using the makefile (no changes).
I run it with "./cavity3d 100"
or "mpirun -np 1 ./cavity3d 100"

Am I doing anything differently here?

Same here. I tried versions 1.5 and 2.0 of the program, with no significant differences.

But I have to say that is a neat and handy CFD benchmark...

Last edited by flotus1; November 25, 2017 at 20:42.

November 27, 2017, 10:32		#5
Simbelmynë Senior Member Join Date: May 2012 Posts: 552 Rep Power: 16
I have done some testing with OpenFOAM, and it seems that Ubuntu correctly applies Turbo Boost to increase the frequency of the 7600K CPU. However, CentOS does not appear to apply Turbo Boost at all to the 7940X CPU. Not sure if this is a monitoring issue; I have yet to install i7z on CentOS. So far I only use
Code:
$ lscpu | grep "MHz"
which I believe is not giving correct readings. However, it gives no indication whatsoever that the frequency changes, which is strange.

My main suspicion is that I have to modify the BIOS and turn off some power-saving options. Will try it when I have the possibility.

UPDATE: I can now verify that lscpu is not showing the frequency correctly under the 4.14 kernel (it is OK under the 4.10 kernel). I tried upgrading the Ubuntu installation to 4.14 and got the same problem as in CentOS. Furthermore, I have now also tested i7z under CentOS, and it shows the correct turbo frequency.

(I still get 9.86 msu on the 7940X, and even though that feels a bit low, comparing with the OpenFOAM results suggests the CPU works reasonably.)

Last edited by Simbelmynë; November 27, 2017 at 16:12. Reason: Update

December 2, 2017, 07:02		#6
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,427 Rep Power: 49
As I just found out myself while setting up my new workstation: sudo turbostat shows the actual CPU frequency on Linux. While "lscpu" and "cat /proc/cpuinfo | grep MHz" always worked fine for me on Intel systems, including turbo frequencies, they seem to show only base frequencies under certain circumstances.

Btw: AMD Epyc 7301 with DDR4-2133 memory and 2.7 GHz clock speed hits around 9.2 MLUPS single-core in this benchmark.

You might want to overclock the uncore/cache/ring/mesh/whateveritscallednow on your CPU. Its frequency is known to be quite low and to be the cause of some mediocre benchmark results where Skylake-X gets beaten by its predecessors.
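For logging the `/proc/cpuinfo` readings while a benchmark runs, a small parser along these lines might help. It is only a sketch, assuming the usual x86 Linux layout with one `cpu MHz` line per logical CPU; the function names are made up for illustration. As the posts above point out, these readings may show only the base frequency under some kernels, so turbostat or i7z remain the better check.

```python
import re

# Matches lines such as "cpu MHz\t\t: 3100.000" in /proc/cpuinfo.
_MHZ_RE = re.compile(r"^cpu MHz\s*:\s*([0-9.]+)", re.MULTILINE)


def parse_cpu_mhz(cpuinfo_text):
    """Return the reported clock (MHz) of every logical CPU, in file order."""
    return [float(value) for value in _MHZ_RE.findall(cpuinfo_text)]


def frequency_span(mhz_values):
    """Return (lowest, highest) reported clock, or None if nothing was found."""
    if not mhz_values:
        return None
    return min(mhz_values), max(mhz_values)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            readings = parse_cpu_mhz(handle.read())
    except OSError:
        readings = []  # not Linux, or /proc is unavailable
    print(frequency_span(readings))
```

If the highest value never rises above the base clock while a single-threaded run is active, either turbo is off or the reading is unreliable, which is exactly the ambiguity described in this thread.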

December 2, 2017, 09:43		#7
Simbelmynë Senior Member Join Date: May 2012 Posts: 552 Rep Power: 16
Thank you for the suggestions! I do not run calculations on my 7600k, but it is annoying anyway, so I will try to figure out how to solve it.

It seems that the Epyc 7301 and 7940X are quite equally matched in single core in this benchmark. The 7940X is turbo boosting to about 4.3 GHz, but it only has 4 memory channels. I expect the difference to be much larger at higher thread count.

I have also tested an 8700k, and it yields approx. 13 msu in the same benchmark, so it is very powerful in single-threaded simulations. In the OF motorbike benchmark it hits a wall at 4 threads, with very minor improvement after that.

December 2, 2017, 12:37		#8
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,427 Rep Power: 49

Quote:
Originally Posted by Simbelmynë
I expect the difference to be much larger at higher thread count.

Indeed.

Running the 100 benchmark size on all 32 cores I get 162 MLUPS.

December 13, 2017, 09:41		#9
Simbelmynë Senior Member Join Date: May 2012 Posts: 552 Rep Power: 16
I have an interesting observation. Running the Palabos benchmark in VirtualBox with Ubuntu 17.10, I get 11.6 msu on the 7600k CPU. This is a rather massive improvement over the 5.4 msu I get when booting into Ubuntu 16.04 on the same machine. Not sure if the difference comes from the guest running Ubuntu 17.10 (as opposed to the 16.04 I have installed) or from VirtualBox itself.

Btw, changing "uncore/cache/ring/mesh/whateveritscallednow" had no effect.

December 13, 2017, 12:04		#10
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,427 Rep Power: 49	Weird. I suspect that something is still bottlenecking your I9 CPU. Did you already try a clean install with a more recent Linux version? I am currently running Opensuse Tumbleweed on my AMD workstation which works quite well.

December 13, 2017, 18:11		#11
Simbelmynë Senior Member Join Date: May 2012 Posts: 552 Rep Power: 16
OK, so I have done some more testing (best kernel for each CPU marked with *). Still the cavity3d 100 test case from Palabos, single thread.

7600K
Linux Mint 18.3 - approx. 6 msu (kernel 4.10.0)
Ubuntu 16.04 - approx. 6 msu (kernel 4.10.0)
Ubuntu 16.04 (Virtualbox) - 5.9 msu (kernel 4.10.0)
Ubuntu 17.10 (Virtualbox) - 11.6 msu (kernel 4.13.0) *

8700k
CentOS 7.4.1708 - 2.8 msu (kernel 3.10.0) (yes, I double checked this one!)
CentOS 7.4.1708 - 12.8 msu (kernel 4.14.2) *

Threadripper 1950X
CentOS 7.4.1708 - 8.1 msu (kernel 3.10.0) *
CentOS 7.4.1708 - 7.6 msu (kernel 4.14.2)

I9 7940X
CentOS 7.4.1708 - 4.5 msu (kernel 3.10.0)
CentOS 7.4.1708 - 10.7 msu (kernel 4.14.1-1) * (updated)

Epyc 7301 (from flotus1)
CentOS 7 - 9.2 msu (kernel 4.14-3-1)
OpenSUSE Tumbleweed - 9.37 msu (kernel 4.14-3-1)

The top results for each CPU are more or less in line with the frequency and IPC of each model, with one big exception, the EPYC, which performs much better for some reason. Perhaps the 4.14.2 kernel is not good enough and still limits the 7940X and the Threadripper 1950X?

It is very clear that the kernel has a dramatic impact on the performance of this benchmark. However, it seems that the latest is not always the greatest, although this is true only in the Threadripper case.

I will keep a close eye on the kernel releases, now that there is once again competition in the high-end segment of computing, leading to more frequent releases of new models.

Last edited by Simbelmynë; December 14, 2017 at 10:10. Reason: Updated some values
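To put a number on the kernel effect, the old-versus-new ratios can be worked out from the CentOS figures in the post above. This is just arithmetic on the posted values; the dictionary layout and function names are illustrative, not anything from Palabos.

```python
# msu values copied from the post: (kernel 3.10.0, newer 4.14.x kernel).
RESULTS_MSU = {
    "8700k": (2.8, 12.8),
    "Threadripper 1950X": (8.1, 7.6),
    "I9 7940X": (4.5, 10.7),  # newer kernel here is 4.14.1-1, not 4.14.2
}


def kernel_speedup(old_msu, new_msu):
    """Ratio of new to old throughput; below 1.0 means the new kernel is slower."""
    if old_msu <= 0:
        raise ValueError("old throughput must be positive")
    return new_msu / old_msu


def summarize(results):
    """Map each CPU to its speedup factor, rounded to two decimals."""
    return {
        cpu: round(kernel_speedup(old, new), 2)
        for cpu, (old, new) in results.items()
    }


if __name__ == "__main__":
    for cpu, factor in summarize(RESULTS_MSU).items():
        print(f"{cpu}: {factor:.2f}x")
```

The factors come out at roughly 4.6x for the 8700k and 2.4x for the 7940X, while the Threadripper drops to about 0.94x, which is the "latest is not always the greatest" case.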

December 14, 2017, 05:08		#12
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,427 Rep Power: 49
IIRC the 9.2 single-core result was on CentOS 7 with kernel version 4.14-3-1. Now with Tumbleweed (same kernel) this slightly improved to 9.37.

December 14, 2017, 07:52		#13
Simbelmynë Senior Member Join Date: May 2012 Posts: 552 Rep Power: 16
OK, thanks! I will try out the 4.14-3-1 kernel and see if that helps the Threadripper.

How about the results? Do you think this benchmark is bandwidth limited for 1 thread? It doesn't seem to be when looking at the Intel line-up. But your EPYC clearly says otherwise.

December 14, 2017, 08:43		#14
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,427 Rep Power: 49
I don't think so; a memory bandwidth limit on a single core would be rather unusual. Apart from that, your I9 CPU has more memory bandwidth available to a single core: AMD Epyc only has 2 memory channels per die, while your CPU has 4 running at an even higher frequency.

Which MPI library did you use and how did you install it? I downloaded openmpi 3.0 and compiled it from source:
https://www.open-mpi.org/software/ompi/v3.0/
And which compiler version are you using?

January 3, 2018, 10:26		#15
Simbelmynë Senior Member Join Date: May 2012 Posts: 552 Rep Power: 16
OK, so I have realized that the test case is most likely flawed. After testing all currently stable kernels I got very inconclusive results. The compiler itself seems important (but not always; on CentOS the kernel version seemed more important).

Using Linux Mint 18.3 with g++ 5.3.1 and the 4.13.0 kernel, my 8700k managed 6.9 msu. However, with g++ 7.2 under Ubuntu 17.10 (4.13.0 kernel) I got 14 msu. Installing g++ 7.1 on the Linux Mint installation resulted in a benchmark value of 623 msu!!

So now I have tested the cavity3d example under examples/showCases/ instead. It gives the same results regardless of compiler and kernel. It would be interesting to hear what results you get when running that case with 1 thread.

I changed the code to suppress output and HDD read/write, e.g.:
Code:
const T logT = (T)1/(T)1;
const T imSave = (T)10/(T)1;
const T vtkSave = (T)10/(T)1;
With 1 thread I got 10m20s, and with all 6 threads it decreased to 3m10s real time, measured by:
Code:
time mpirun -np 6 ./cavity3d
Intel 7940X (14 threads): 2m45s
Threadripper 1950X (16 threads): 2m01s

Last edited by Simbelmynë; January 3, 2018 at 11:40. Reason: Spelling :P, added 1950X and 7940X
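The 1-thread and 6-thread timings above translate into a parallel speedup and efficiency as follows. This is a sketch using the textbook definitions (speedup = serial time / parallel time, efficiency = speedup / process count); the helper names are illustrative, and the parser only handles the `XmYs` form that `time` prints here.

```python
import re

# Accepts stamps like "10m20s", "3m10s" or "0m59.133s".
_TIME_RE = re.compile(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s")


def to_seconds(stamp):
    """Convert a `time`-style stamp to seconds."""
    match = _TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


def speedup(serial_s, parallel_s):
    """How many times faster the parallel run finished."""
    return serial_s / parallel_s


def efficiency(serial_s, parallel_s, processes):
    """Speedup per process; 1.0 would be perfect scaling."""
    return speedup(serial_s, parallel_s) / processes


if __name__ == "__main__":
    serial = to_seconds("10m20s")
    parallel = to_seconds("3m10s")
    print(f"speedup:    {speedup(serial, parallel):.2f}")
    print(f"efficiency: {efficiency(serial, parallel, 6):.2f}")
```

With the posted times this gives a speedup of about 3.3 on 6 processes, so a bit over half of ideal scaling.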

January 3, 2018, 15:00		#16
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,427 Rep Power: 49

I applied the changes to the source code as you suggested. Output for dual AMD Epyc 7301:

Code:

as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d> time mpirun -np 32 cavity3d
omega= 1.53846
Writing Gif ...
step 0; t=0; av energy=6.12745098e-07; av rho=1
Time spent during previous iteration: 0
step 5000; t=1; av energy=1.906470281e-06; av rho=0.9999633641
Time spent during previous iteration: 0.001128533
step 10000; t=2; av energy=1.907210258e-06; av rho=0.9999266731
Time spent during previous iteration: 0.001125073
step 15000; t=3; av energy=1.90721169e-06; av rho=0.9998899833
Time spent during previous iteration: 0.001120143
step 20000; t=4; av energy=1.907211692e-06; av rho=0.9998532949
Time spent during previous iteration: 0.001123033
step 25000; t=5; av energy=1.907211692e-06; av rho=0.9998166078
Time spent during previous iteration: 0.001139313
step 30000; t=6; av energy=1.907211692e-06; av rho=0.999779922
Time spent during previous iteration: 0.001128202
step 35000; t=7; av energy=1.907211692e-06; av rho=0.9997432376
Time spent during previous iteration: 0.001129653
step 40000; t=8; av energy=1.907211692e-06; av rho=0.9997065546
Time spent during previous iteration: 0.001125473
step 45000; t=9; av energy=1.907211692e-06; av rho=0.9996698728
Time spent during previous iteration: 0.001126403
Writing Gif ...
Saving VTK file ...
step 50000; t=10; av energy=1.907211692e-06; av rho=0.9996331925
Time spent during previous iteration: 0.001134143

real    0m59.133s
user    29m53.143s
sys     1m29.464s
as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d> time mpirun -np 1 cavity3d
omega= 1.53846
Writing Gif ...
step 0; t=0; av energy=6.12745098e-07; av rho=1
Time spent during previous iteration: 0
step 5000; t=1; av energy=1.906470281e-06; av rho=0.9999633641
Time spent during previous iteration: 0.021283929
step 10000; t=2; av energy=1.907210258e-06; av rho=0.9999266731
Time spent during previous iteration: 0.021312529
step 15000; t=3; av energy=1.90721169e-06; av rho=0.9998899833
Time spent during previous iteration: 0.021273699
step 20000; t=4; av energy=1.907211692e-06; av rho=0.9998532949
Time spent during previous iteration: 0.021283039
step 25000; t=5; av energy=1.907211692e-06; av rho=0.9998166078
Time spent during previous iteration: 0.021346609
step 30000; t=6; av energy=1.907211692e-06; av rho=0.999779922
Time spent during previous iteration: 0.021269809
step 35000; t=7; av energy=1.907211692e-06; av rho=0.9997432376
Time spent during previous iteration: 0.021288709
step 40000; t=8; av energy=1.907211692e-06; av rho=0.9997065546
Time spent during previous iteration: 0.021296029
step 45000; t=9; av energy=1.907211692e-06; av rho=0.9996698728
Time spent during previous iteration: 0.021267211
Writing Gif ...
Saving VTK file ...
step 50000; t=10; av energy=1.907211692e-06; av rho=0.9996331925
Time spent during previous iteration: 0.02127318

real    17m56.751s
user    17m54.621s
sys     0m0.237s
as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d> gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/7/lto-wrapper
OFFLOAD_TARGET_NAMES=hsa:nvptx-none
Target: x86_64-suse-linux
Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,ada,go --enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver --enable-checking=release --disable-werror --with-gxx-include-dir=/usr/include/c++/7 --enable-ssp --disable-libssp --disable-libvtv --disable-libcc1 --enable-plugin --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --with-gcc-major-version-only --enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function --program-suffix=-7 --without-system-libunwind --enable-multilib --with-arch-32=x86-64 --with-tune=generic --build=x86_64-suse-linux --host=x86_64-suse-linux
Thread model: posix
gcc version 7.2.1 20171020 [gcc-7-branch revision 253932] (SUSE Linux) 
as01449@localhost:~/benchmarks/palabos-v2.0r0/examples/showCases/cavity3d>
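The per-iteration timings in the output above can be averaged to compare the 32-process and 1-process runs independently of startup cost. The following is a sketch that assumes each "Time spent during previous iteration" value is the time in seconds of a single lattice iteration, and that the zero reported at step 0 should be ignored; the function names are hypothetical.

```python
import re
from statistics import mean

_ITER_RE = re.compile(r"Time spent during previous iteration:\s*([0-9.eE+-]+)")


def mean_iteration_time(log_text):
    """Average the non-zero per-iteration timings found in a cavity3d log."""
    times = [float(value) for value in _ITER_RE.findall(log_text)]
    times = [t for t in times if t > 0.0]  # step 0 reports 0
    if not times:
        raise ValueError("no iteration timings found in log")
    return mean(times)


def iteration_speedup(serial_log, parallel_log):
    """Ratio of mean serial to mean parallel iteration time."""
    return mean_iteration_time(serial_log) / mean_iteration_time(parallel_log)


if __name__ == "__main__":
    # Two-line excerpts from the runs above, just to show the call.
    serial = (
        "Time spent during previous iteration: 0\n"
        "Time spent during previous iteration: 0.021283929\n"
    )
    parallel = (
        "Time spent during previous iteration: 0\n"
        "Time spent during previous iteration: 0.001128533\n"
    )
    print(f"{iteration_speedup(serial, parallel):.1f}x")
```

Applied to the full logs, the ratio comes out at roughly 18.9, slightly above the wall-clock ratio of about 18.2 from the `real` lines (17m56.751s versus 0m59.133s). That gap would be consistent with startup and output overhead, though that is only a guess.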


January 3, 2018, 15:35		#17
Simbelmynë Senior Member Join Date: May 2012 Posts: 552 Rep Power: 16
Fantastic, thank you! So the dual Epyc is twice as fast as the Threadripper in this benchmark!

It is interesting to note that the 1950X is actually (slightly) faster than the 7940X even at 14 threads. I did not expect that. The 8700k is superior in single-threaded performance (as it should be; the previous results were really confusing).

Do you think the system size is large enough to utilize the dual Epyc bandwidth, or will larger sizes yield even bigger differences compared to the 1950X and 7940X? Of course twice the speed is huge in its own right; I'm just asking from a pure price/performance viewpoint.

January 4, 2018, 08:08		#18
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,427 Rep Power: 49	The system size here is 50x50x50? Then I would expect the gap to become larger with increased problem size.

January 4, 2018, 10:09		#19
Simbelmynë Senior Member Join Date: May 2012 Posts: 552 Rep Power: 16
Running at N=100 (one million cells):
1950X: 29m16s
7940X: 37m39s
The AMD system is 29% faster. At N=50 (above) the AMD system was 36% faster, so the gap becomes smaller, but it is still visibly in favor of the Threadripper.
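The 29% and 36% figures can be reproduced from the posted run times if "X% faster" is read as the ratio of run times minus one. That reading is an assumption, but it matches both numbers; the helper names below are illustrative.

```python
def mmss(minutes, seconds):
    """Convert a minutes/seconds pair to seconds."""
    return 60 * minutes + seconds


def percent_faster(slow_s, fast_s):
    """How much faster the quicker system is, as a percentage of its own time."""
    if slow_s <= 0 or fast_s <= 0:
        raise ValueError("run times must be positive")
    return (slow_s / fast_s - 1.0) * 100.0


# (7940X time, 1950X time) in seconds, taken from the posts above.
RUNS = {
    "N=50": (mmss(2, 45), mmss(2, 1)),
    "N=100": (mmss(37, 39), mmss(29, 16)),
}

if __name__ == "__main__":
    for case, (intel_s, amd_s) in RUNS.items():
        print(f"{case}: 1950X is {percent_faster(intel_s, amd_s):.0f}% faster")
```

Note that the two cases use different thread counts per CPU (14 on the 7940X, 16 on the 1950X, as stated for the N=50 run), so the percentages compare whole systems rather than cores.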

January 4, 2018, 10:51		#20
flotus1 Super Moderator Alex Join Date: Jun 2012 Location: Germany Posts: 3,427 Rep Power: 49
My bad, I thought you were referring to parallel execution times compared to the Epyc setup.
