|
August 20, 2018, 18:45 |
Efficiency of dual socket node in Fluent
|
#1 |
New Member
Arthur Piquet
Join Date: Mar 2013
Posts: 18
Rep Power: 13 |
Hi everybody!
I’m running some scalability tests on a machine using Fluent v18.2 and I’ve found some strange behavior in bandwidth, latency and CP. The machine is a HPC with 32 nodes of dual i5 with 16 cores each. Meaning 32 per node. 1024 overall. No hyper threading. The machine is not mine First, on my scalability test I found that Fluent starts losing efficiency from 100k/proc. Isn’t that strange ? According to Fluent benchmark, efficiency starts to be bad around 10k/proc. My test case is 40m cells with keps, really simple. I just change the number of proc to reduce the nb cells per proc. I’m using MPi InfiniBand. Secondly, when i’m testing bandwidth or latency on 32 proc on one node, I found that inter connectivity is bad between the two processor. Inside a processor (16 core) I have 10Mb/s but between the 2 proc I have 2Mb/s of bandwidth . If I run my 40m case on 512 cores mapped as 32*16 (16 nodes - node full) It will take more time than mapping 16*32 (32 nodes - node half full). My question , what the purpose of dual processor per node if only half is good to run my case? I’m losing half of my HPC here... Maybe I need to activate something in the bios to improve the inter-processor connectivity ? Inter-node connectivity is good though (InfiniBand) ~7Mb/s Thks! |
|
August 21, 2018, 05:08 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
In order not to draw false conclusions about inter-node scaling, you should first check intra-node scaling, i.e. run the job on one node with 1-32 cores.
Here you will probably see one of the reasons for what you observed when running the cluster nodes full vs. half full: scaling on a single node will be less than ideal due to memory bandwidth limitations. This should make it obvious why running the cluster nodes half full is more efficient than running half the nodes fully occupied. Running jobs like this is common practice in memory-bound HPC workloads with per-core licensing.

Inter-socket bandwidth in a NUMA system will always be worse than intra-socket bandwidth. This is nothing to worry about, just a consequence of the implementation: data has to be sent over some kind of interconnect between the sockets.

Non-ideal scaling when comparing a fully occupied node vs. a half-occupied node does not necessarily stem from poor inter-socket bandwidth and latency. MPI with default settings should distribute the threads across both CPUs even when only half of all cores are used. Again, it is more likely a consequence of memory-bound execution, unless you pinned the 16 threads to the first CPU when running 16 threads per node.

Edit:
Since the system is not yours, make sure that it is not running other heavy jobs during your tests. Both load on the nodes and the node interconnect could distort your findings. |
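To see the memory-bandwidth saturation described above, a rough STREAM-triad-style loop can be run with 1, 2, 4, ... 32 threads on one node and the aggregate bandwidth compared; it should flatten out well before all 32 cores are in use. The array size, the use of OpenMP, and the thread-count sweep are assumptions for illustration; the canonical tool for this measurement is the STREAM benchmark itself.

Code:
/* Rough STREAM-triad-style sketch to watch aggregate memory bandwidth
 * saturate as more cores of a node are used. Run repeatedly with
 * OMP_NUM_THREADS=1,2,4,...,32; thread pinning (e.g. OMP_PROC_BIND,
 * OMP_PLACES) is left to the environment. Sizes are assumptions. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (64L * 1024 * 1024)   /* 64M doubles per array, far beyond cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    /* first-touch initialization so pages land near the threads using them */
    #pragma omp parallel for
    for (long i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; ++i)
        c[i] = a[i] + scalar * b[i];        /* triad: two reads + one write */
    double t1 = omp_get_wtime();

    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("%d threads: %.1f GB/s\n", omp_get_max_threads(), gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}

If the 16-thread and 32-thread numbers end up close to each other, the node is memory-bound and the half-full vs. full-node timings in the original post are exactly what one would expect.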
Tags |
benchmarking, dual cpu, fluent, hpc |
|
|