|
August 20, 2018, 18:45 |
Efficiency of dual socket node in Fluent
|
#1 |
New Member
Arthur Piquet
Join Date: Mar 2013
Posts: 18
Rep Power: 13 |
Hi everybody!
I’m running some scalability tests on a machine using Fluent v18.2 and I’ve found some strange behavior in bandwidth, latency and CP. The machine is a HPC with 32 nodes of dual i5 with 16 cores each. Meaning 32 per node. 1024 overall. No hyper threading. The machine is not mine First, on my scalability test I found that Fluent starts losing efficiency from 100k/proc. Isn’t that strange ? According to Fluent benchmark, efficiency starts to be bad around 10k/proc. My test case is 40m cells with keps, really simple. I just change the number of proc to reduce the nb cells per proc. I’m using MPi InfiniBand. Secondly, when i’m testing bandwidth or latency on 32 proc on one node, I found that inter connectivity is bad between the two processor. Inside a processor (16 core) I have 10Mb/s but between the 2 proc I have 2Mb/s of bandwidth . If I run my 40m case on 512 cores mapped as 32*16 (16 nodes - node full) It will take more time than mapping 16*32 (32 nodes - node half full). My question , what the purpose of dual processor per node if only half is good to run my case? I’m losing half of my HPC here... Maybe I need to activate something in the bios to improve the inter-processor connectivity ? Inter-node connectivity is good though (InfiniBand) ~7Mb/s Thks! |
|
August 21, 2018, 05:08 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
In order not to draw false conclusions about inter-node scaling, you should first check intra-node scaling, i.e. run the job on one node with 1-32 cores.
Here you will probably see one of the reasons for what you observed when running the cluster nodes full vs. half full: scaling on a single node will be less than ideal due to memory bandwidth limitations. This should make it obvious why running the cluster nodes half full is more efficient than running half the nodes fully occupied. Running jobs like this is common practice in memory-bound HPC workloads with per-core licensing.

Inter-socket bandwidth in a NUMA system will always be worse than intra-socket bandwidth. This is nothing to worry about, just a consequence of the implementation: data has to be sent over some kind of interconnect between the sockets.

Non-ideal scaling when comparing a fully occupied node vs. a half-occupied node does not necessarily stem from poor inter-socket bandwidth and latency. MPI with default settings should distribute the threads across both CPUs even when only half of all cores are used. Again, it is more likely a consequence of memory-bound execution, unless you pinned the 16 threads to the first CPU when running 16 threads per node.

Edit:
Since the system is not yours, make sure that it is not running other heavy jobs during your tests. Both load on the nodes and the node interconnect could distort your findings. |
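To see the memory-bandwidth saturation described above, a rough STREAM-triad-style loop can be run with 1, 2, 4, ... 32 threads on one node and the aggregate bandwidth compared; it should flatten out well before all 32 cores are in use. The array size, the use of OpenMP, and the thread-count sweep are assumptions for illustration; the canonical tool for this measurement is the STREAM benchmark itself.

Code:
/* Rough STREAM-triad-style sketch to watch aggregate memory bandwidth
 * saturate as more cores of a node are used. Run repeatedly with
 * OMP_NUM_THREADS=1,2,4,...,32; thread pinning (e.g. OMP_PROC_BIND,
 * OMP_PLACES) is left to the environment. Sizes are assumptions. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (64L * 1024 * 1024)   /* 64M doubles per array, far beyond cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    /* first-touch initialization so pages land near the threads using them */
    #pragma omp parallel for
    for (long i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; ++i)
        c[i] = a[i] + scalar * b[i];        /* triad: two reads + one write */
    double t1 = omp_get_wtime();

    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("%d threads: %.1f GB/s\n", omp_get_max_threads(), gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}

If the 16-thread and 32-thread numbers end up close to each other, the node is memory-bound and the half-full vs. full-node timings in the original post are exactly what one would expect.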
Tags |
benchmarking, dual cpu, fluent, hpc |
|
|