
Performance Improvements by core locking

#1 | October 8, 2010, 10:34 | RobertB (Senior Member)
HP-MPI, as used by CCM+, allows you to lock the solver threads to specific cores on a machine. Below is some benchmarking that shows the advantage of doing so.

The cluster has 16 nodes, each with two quad-core Xeons, connected by an InfiniBand backbone. It runs Linux.

This was the only model running on the cluster at the time. If other models had been running, it is probable that the cases where the nodes were not fully loaded by this model would have run somewhat slower.



The model is a 27 million cell polyhedral conjugate heat transfer case.

The command to lock cores looks something like

-mpi hp:"-prot -IBV -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7"

You add it to the end of the STAR-CCM+ command line.


-prot prints out how HP-MPI is passing messages, in this case InfiniBand between nodes and shared memory within a node.


-IBV tells it to use InfiniBand (the IB verbs interface).

-cpu_bind=MAP_CPU tells it which cores to bind the threads to. In this case all eight physical cores on a node (with hyperthreading these Xeons also report cores 8-15, but those are virtual and share the same physical cores as 0-7).
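For reference, a complete launch line would look roughly like the one below. The -np/-machinefile/-batch part and the hosts.txt/mycase.sim names are just placeholders for whatever your site normally uses; only the trailing -mpi hp:"..." switch is the addition being discussed here.

starccm+ -np 64 -machinefile hosts.txt -batch mycase.sim -mpi hp:"-prot -IBV -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7"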


32 processors / 16 nodes (no CPU locking): ~27 seconds per iteration
32 processors / 16 nodes (locked to cores 0,6): ~26 seconds per iteration

64 processors / 16 nodes (no CPU locking): ~21 seconds per iteration
64 processors / 16 nodes (locked to cores 0,3,5,7): ~14 seconds per iteration

96 processors / 16 nodes (locked to cores 0,1,2,4,5,6): ~9.5 seconds per iteration
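To make the mapping concrete, the 64-process run above places four threads on each of the 16 nodes, so its switch would read something like

-mpi hp:"-prot -IBV -cpu_bind=MAP_CPU:0,3,5,7"

(reconstructed from the core list quoted above rather than copied from the actual run script).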

Since the above nodes were not fully loaded (there were unused cores), I ran the following tests on fully loaded nodes.

32 processors / 4 nodes (no CPU locking): ~44 seconds per iteration
32 processors / 4 nodes (locked to cores 0-7): ~29 seconds per iteration


I believe the lessons here are (admittedly from a single model/cluster):

1) You will not get good scalability if you do not core lock.
2) You can run nodes fully loaded, with only a small degradation, if you core lock.

Since it is a free software switch, it would appear to be worth doing.

It also works with the HP-MPI versions of STAR-CD but I don't have any recent benchmark data.

#2 | October 8, 2010, 13:05 | f-w (Senior Member)
Nice write-up; do you know if there is a similar option for MPICH?

#3 | October 10, 2010, 10:14 | RobertB (Senior Member)
I'm afraid I only use Windows for pre-processing, so I haven't looked into it.

If you use STAR-CD 3.26, there is an HP-MPI version available if you ask.

Another important thing for maximizing performance is to set the scene update frequency so that scenes refresh only every 10 iterations rather than every iteration, so that they do not slow the solver down. If you select all the scenes at once you can edit the update frequency for all of them simultaneously.

#4 | October 11, 2010, 10:42 | TMG (Member)
No, there is no equivalent set of options for MPICH. MPICH does not do native InfiniBand either, which means terrible comparative performance in an InfiniBand environment. MPICH is the MPI of last resort, used only when nothing else works or HP-MPI just isn't available for a specific build.

#5 | October 13, 2010, 10:03 | Thomas Larsen, "Larsen" (New Member)
We have a similar setup here at work, so I tested the command line yesterday. I used a 3.5 million cell simulation on 8 workstations with 2x6 cores each and InfiniBand, a total of 96 parallel processes.

We got an improvement of ~5% in both a steady case and one with a rotating region (with interface updates). Our gain isn't as large as RobertB's; I suspect it has to do with lower inter-node network communication. I am going to do a couple more tests, since a 5% performance gain is considerable here, where we run simulations continuously.
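For our 2x6-core workstations (no hyperthreading) the binding list simply enumerates all twelve physical cores, so the switch I tried was along the lines of

-mpi hp:"-prot -IBV -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7,8,9,10,11"

(shown here as an illustration of the switch format rather than an exact copy of our command line).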

Just out of curiosity, what kind of company and simulations do you usually work with, RobertB?

#6 | October 13, 2010, 15:24 | RobertB (Senior Member)
I'm following this up with adapco and will let you know what comes out.

We had, however, also seen sizeable gains on a previous Opteron-based system using STAR-CD, HP-MPI and core locking.

You may almost have 'too many' threads, as you will only have about 30K cells per thread; my smallest thread size was about 10x that, and for the fully loaded node cases it was about 30x. Your balance of interprocess communication to matrix solution time may therefore be different from mine.
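To put rough numbers on that (simple division using the counts quoted in this thread): 3.5 million cells / 96 processes is roughly 36,000 cells per thread, whereas 27 million / 96 is roughly 280,000 and 27 million / 32 (the fully loaded 4-node runs) is roughly 840,000.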

Do you see much gain from using 4 threads per CPU as against 6 threads? Do the processors have enough memory bandwidth? Are you using Intel or AMD processors?

I work in the gas turbine industry and the case was a conjugate heat transfer model used to design a cooling system.

#7 | October 17, 2010, 12:22 | Thomas Larsen, "Larsen" (New Member)
Well, we usually only run low cell count cases with a lot of iterations (6-DOF, moving mesh and so on), so this is a typical case for us. It might not be optimal with so few cells per node, but we usually have to have results overnight.

We use AMD processors and work mostly with marine applications here. I am looking forward to hearing CD-adapco's response!

#8 | October 22, 2010, 08:59 | RobertB (Senior Member)
The basic reply (thanks to adapco/Intel) is:

If you have Xeon processors with hyperthreading, you have two options to get the best performance.

1) If you leave hyperthreading on, you need to lock cores

The scheduler cannot differentiate between real and virtual cores and can therefore put two threads on a single physical core, which slows things down. If you are fully loading the cores, this slowdown is of the order of 50% greater runtime.

2) Alternatively you can turn off hyperthreading in the BIOS.

In this case all cores are real processors and the scheduler does a decent job of assigning threads to cores.
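A quick way to check which core IDs are real and which are the hyperthreaded siblings (a Linux sysfs query, not part of the reply above) is

cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list

Each line lists the logical CPUs that share one physical core; on the nodes above you would expect pairs such as 0,8 and 1,9, which is why binding to cores 0-7 puts one thread on each physical core.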

Larsen, this would explain why you see little difference, as you have AMD chips with no virtual cores.

Hope this helps
