March 22, 2018, 11:08 |
|
#41 |
Senior Member
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 539
Rep Power: 20 |
You are using just ONE processor. So you have only half of the memory bandwidth.
|
|
March 22, 2018, 11:14 |
|
#42 |
Member
Johan Roenby
Join Date: May 2011
Location: Denmark
Posts: 93
Rep Power: 21 |
But when flotus1 is running on 16 of his 32 cores, I thought he was effectively using just one of his CPUs, which in my understanding only communicates with the 8 RAM slots associated with that CPU. Did I misunderstand this?
|
|
March 22, 2018, 11:33 |
|
#43 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Running on 16 of 32 cores, mpirun with default settings spreads the active cores as evenly as possible across all NUMA nodes. I confirmed this by looking at which cores are actually doing any work using htop. So my results with 16 cores will definitely be better than a single CPU running 16 cores. A better estimate for 16 cores on a single CPU would be my result on 32 cores multiplied by 2.
If you want, I can do a few runs pinning all threads to one CPU so you can compare your results. Which Linux kernel version are you running? If it is the default kernel version of Ubuntu 16.04, it might be too old to use the full potential of your CPU. You would have to use the HWE kernel to get better results. Is SMT turned off already? |
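A minimal sketch (not flotus1's exact commands) of how such a pinned run could be set up with Open MPI and numactl; the solver name and the NUMA-node numbering are assumptions and may differ between machines and Open MPI versions:
Code:
# Assumption: on a dual-socket EPYC, NUMA nodes 0-3 belong to the first socket
# (check with "numactl --hardware"). Restrict all 16 ranks and their memory
# allocations to that socket:
mpirun -np 16 numactl --cpunodebind=0-3 --membind=0-3 simpleFoam -parallel
# Verify the placement while it runs, e.g. with htop.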
|
March 22, 2018, 11:42 |
|
#44 |
Senior Member
Join Date: May 2012
Posts: 552
Rep Power: 16 |
I am not sure this is a general Linux problem. I think it is a bug in the Palabos benchmark (I have not been able to confirm any dependence on the kernel version in my OpenFOAM benchmarks).
|
|
March 22, 2018, 11:57 |
|
#45 | |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Quote:
The maximum clock speed differences will likely account for it not being strictly 2x faster on the 2x 7301. As for capacity per module: the more RAM there is, the higher the expected latency, if I remember correctly, so the 8 GB modules should be an itty-bitty-tiny bit faster than the 16 GB ones.
edit: I didn't notice that others had already answered.
Last edited by wyldckat; March 22, 2018 at 11:58. Reason: see "edit:" |
||
March 22, 2018, 14:14 |
|
#46 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Reran the benchmark with 16 cores pinned to a single CPU: 84.4s execution time.
Surprisingly close to my prediction of twice the result for 32 cores. |
|
March 22, 2018, 15:26 |
|
#47 | |
Member
Johan Roenby
Join Date: May 2011
Location: Denmark
Posts: 93
Rep Power: 21 |
Quote:
# cores   Wall time (s):
------------------------
 1        1008.95
 2         582.33
 4         273.67
 6         174.61
 8         126.35
12         123.35
16          85.05
So on 16 cores, I am now comfortable that things are OK (I reran it 20 times and all runs were in the range 83-88 s). It is interesting to see that the runs with idle cores available are not really affected by whatever else is apparently running alongside my simulation; I guess those background threads just find one of the idle cores to work on. |
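A quick way to check where those background threads actually end up, assuming the sysstat package is installed (the 2-second interval is just an example):
Code:
# Per-core utilisation every 2 seconds; cores left idle by the solver
# should show ~100 %idle while background threads hop onto them.
mpstat -P ALL 2
# Or watch it interactively with one meter per core:
htop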
||
March 22, 2018, 16:36 |
|
#48 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
roenby, could you post the output of lscpu please?
|
|
March 22, 2018, 16:41 |
|
#49 |
Member
Johan Roenby
Join Date: May 2011
Location: Denmark
Posts: 93
Rep Power: 21 |
Code:
roenby@aref:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7351 16-Core Processor
Stepping:              2
CPU MHz:               2400.000
CPU max MHz:           2400,0000
CPU min MHz:           1200,0000
BogoMIPS:              4799.73
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
NUMA node2 CPU(s):     8-11
NUMA node3 CPU(s):     12-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx cpb hw_pstate retpoline retpoline_amd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca |
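Since lscpu only shows the CPU-to-node mapping, a complementary check (hedged suggestion, not something roenby posted) is numactl --hardware, which also lists how much memory is attached to each of the four NUMA nodes:
Code:
numactl --hardware
# Expect four nodes with roughly equal "size" entries if the memory
# channels are populated evenly; a node reporting size 0 MB has no local DIMMs.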
|
March 23, 2018, 08:20 |
|
#50 |
Member
Giovanni Medici
Join Date: Mar 2014
Posts: 48
Rep Power: 12 |
First of all I want to thank everyone for the help.
As flotus1 suggested, our machine was scaling quite badly. I tried switching off hyperthreading, and things got a little bit better. Moreover, I checked the position of the two 32 GB RAM DIMMs and they were OK (A1 B1). We have now installed 8 x 8 GB DDR4 DIMMs, slightly faster (2400 MHz), so as to populate slots (A1 A2 A3 A4, B1 B2 B3 B4). Therefore all 4 memory channels of each socket are now populated. The results reported here were obtained with 2x E5 2630 v3 2.4 GHz, hyperthreading ON (i.e. 32 threads), and 32 GB allocated to the OracleVM: Code:
# cores   Wall time (s):   Speed-up:
------------------------------------
 1        1032.88          1.00
 2         577.14          1.79
 4         328.22          3.15
 6         262.23          3.94
 8         258.98          3.99
12         247.23          4.18
16         236.92          4.36
24         281.56          3.67
30         342.56          3.02
32         391.9           2.64
I'm running under Windows Server 2012 R2, with OF_1712 (ESI distribution), and therefore I'm relying on OracleVM. The VM does not allow me to allocate all of the available RAM (otherwise, I think, the host OS could collapse), so I'm not quite sure every thread/core is accessing the RAM in the fastest way.
Thank you !!!! |
|
March 23, 2018, 09:57 |
|
#51 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I guess the memory allocation in Oracle VM is your problem: https://blogs.oracle.com/wim/underst...-oracle-vm-xen
This also explains why it seemed like the memory was mis-configured with 2 DIMMs. Unfortunately, I have no idea how to improve this. Maybe by asking Oracle support... |
|
March 23, 2018, 20:35 |
|
#52 | |||||
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
(Somewhat) Quick answers:
Quote:
Quote:
Quote:
That and/or use a Linux Kernel compiled to be able to connect to the OracleVM on the host, so that it could gain additional direct-metal-access capabilities. Never used it myself, but I guess that OracleVM has something like that. Quote:
Quote:
Right now, blueCFD-Core is mostly only good enough as a convenient replacement for other virtualization strategies, so that you don't need to leave Windows to use OpenFOAM. Performance-wise, it's not great. But if you want to take full advantage of your hardware, you should install a Linux distribution natively, or at least use extremely efficient virtualization software. That, and build OpenFOAM from source code with dedicated flags for your CPU model... don't use pre-built packages. |
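A minimal sketch of what such CPU-specific build flags could look like; this assumes the GCC toolchain and the standard wmake rules layout, and the exact file path and options can differ between OpenFOAM versions:
Code:
# In $WM_PROJECT_DIR/wmake/rules/linux64Gcc/c++Opt, extend the optimisation
# flags so the compiler targets the local CPU instead of generic x86-64:
#     c++OPT = -O3 -march=native -mtune=native
# Then rebuild from source:
export WM_NCOMPPROCS=16          # number of parallel build jobs
./Allwmake > log.Allwmake 2>&1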
||||||
March 24, 2018, 04:38 |
|
#53 | |
Member
Giovanni Medici
Join Date: Mar 2014
Posts: 48
Rep Power: 12 |
Quote:
Thanks wyldckat for the fast and comprehensive answer. I will definitely investigate the IPMI capabilities of our server (namely the Dell PowerEdge R430). BlueCFD looks like a really interesting option for users who cannot (for whatever reason) switch completely to Linux. |
||
March 26, 2018, 10:27 |
|
#54 |
New Member
Join Date: Dec 2017
Posts: 5
Rep Power: 9 |
Hi everyone,
I tried to run the OF benchmark with 2x AMD EPYC 7351 and 16x 8 GB DDR4 2666 MHz, with OpenFOAM 5.0 on Ubuntu 16.04. I ran the calculation with the processes bound to core, bound to socket, and with no binding, on 16 and 32 cores. The results are below: HTML Code:
# cores   Wall time (s):   Wall time (s):   Wall time (s):
          core             socket           none
----------------------------------------------------------
 1        922
16        153.34            55.7             65.78
32         70.8             38.68            38.8
Do you think it could come from the fact that hyper-threading is on?
Thanks in advance |
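For reference, a hedged sketch of how those three binding modes are typically selected with Open MPI's mpirun; the solver name is an assumption, and flag spellings can vary slightly between Open MPI versions:
Code:
mpirun -np 16 --bind-to core   simpleFoam -parallel   # each rank pinned to one core
mpirun -np 16 --bind-to socket simpleFoam -parallel   # each rank pinned to one socket
mpirun -np 16 --bind-to none   simpleFoam -parallel   # kernel scheduler decides placement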
|
March 26, 2018, 10:33 |
|
#55 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I don't think it has to do with SMT. I had it turned off and tried a few binding options but ended up with the same poor performance you observed. I still have no clue what causes it.
May I ask which exact memory type you are using? |
|
March 26, 2018, 10:42 |
|
#56 |
New Member
Join Date: Dec 2017
Posts: 5
Rep Power: 9 |
Here is my config: Code:
Handle 0x0053, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x001A
        Error Information Handle: 0x0052
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: P2-DIMMH1
        Bank Locator: P1_Node0_Channel7_Dimm0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2667 MHz
        Manufacturer: Samsung
        Serial Number: 030C18C6
        Asset Tag: P2-DIMMH1_AssetTag (date:17/05)
        Part Number: M393A1G40EB2-CTD
        Rank: 1
        Configured Clock Speed: 2667 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
Which binding option do you usually use? |
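For anyone wanting to pull the same information on their own machine, this output looks like what dmidecode prints for memory devices; treat the exact invocation as a hedged suggestion:
Code:
sudo dmidecode --type 17   # one "Memory Device" block per DIMM slot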
|
March 26, 2018, 10:49 |
|
#57 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
For this benchmark I ended up using no binding option at all, which gave the best overall results. I don't use OpenFOAM for my work; a simple bind to core, to avoid messing up caches and memory access, is usually enough for the solvers I use.
|
|
March 26, 2018, 13:00 |
|
#58 |
New Member
Join Date: Dec 2017
Posts: 5
Rep Power: 9 |
Thanks a lot. By turning off multi-threading I get the following results:
Code:
# cores   Wall time (s):   Wall time (s):   Wall time (s):
          core             socket           none
----------------------------------------------------------
16         81.23            60.52            61.91
32         37.37            36.94            39.67 |
|
March 26, 2018, 17:13 |
|
#59 | |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings to all!
Quote:
This means that for the 16-process run, those 16 processes were fighting for access to the 8 memory channels on the first socket. When binding per socket, the processes were likely distributed in a balanced way, namely 8 cores on each socket. This is clearer when compared with the 32-core runs, where the results are nearly the same.
Side note: If you are trying to pinpoint which mode is best, I strongly suggest doing several runs in each mode, because the majority of the results seem to be within the statistical margin of error, i.e. the latest results with 32 cores look mostly identical regardless of the assignment mode.
Best regards,
Bruno |
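A minimal sketch of such a repeated-run comparison; the solver invocation and case setup are assumptions rather than anything posted in this thread, and only the looping idea is the point:
Code:
# Run each binding mode 5 times and collect wall times for comparison.
for mode in core socket none; do
    for i in 1 2 3 4 5; do
        mpirun -np 32 --bind-to $mode simpleFoam -parallel > log.$mode.$i 2>&1
        # OpenFOAM solver logs report lines such as "ExecutionTime = ... s"
        grep "ExecutionTime" log.$mode.$i | tail -1
    done
done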
||
March 26, 2018, 18:49 |
|
#60 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
What I tried, among other options, was explicitly binding threads to certain cores, making sure the distribution was optimal, at least in theory. The same method worked for other solvers, but I still ended up with low performance for most thread counts in OpenFOAM.
|
|
|
|