
Performance Improvements by core locking

#1 | October 8, 2010, 10:34 | RobertB (Senior Member)
HP-MPI, as used by CCM+, allows you to lock the solver threads to specific cores on a machine. Below is some benchmarking that shows the advantage of doing so.

The cluster has 16 nodes, each with two quad-core Xeons, connected by an InfiniBand backbone. It runs Linux.

This was the only model running on the cluster at the time. If other models had been running, it is probable that the cases where the nodes were not fully loaded by this model would have run somewhat slower.



The model is a 27 million cell polyhedral conjugate heat transfer case.

The command to lock cores looks something like

-mpi hp:"-prot -IBV -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7"

You add it to the end of the STAR-CCM+ command line.


-prot prints out how HP-MPI is passing messages, in this case InfiniBand between nodes and shared memory within a node.


-IBV tells it to use InfiniBand (the IB verbs interface).

-cpu_bind=MAP_CPU tells it which cores to bind the threads to. In this case all eight physical cores on a node (with hyperthreading these Xeons also report cores 8-15, but those are virtual and share the same physical cores as 0-7).
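For reference, a complete launch line would look roughly like the one below. The -np/-machinefile/-batch part and the hosts.txt/mycase.sim names are just placeholders for whatever your site normally uses; only the trailing -mpi hp:"..." switch is the addition being discussed here.

starccm+ -np 64 -machinefile hosts.txt -batch mycase.sim -mpi hp:"-prot -IBV -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7"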


32 processors / 16 nodes (no CPU locking): ~27 seconds per iteration
32 processors / 16 nodes (locked to cores 0,6): ~26 seconds per iteration

64 processors / 16 nodes (no CPU locking): ~21 seconds per iteration
64 processors / 16 nodes (locked to cores 0,3,5,7): ~14 seconds per iteration

96 processors / 16 nodes (locked to cores 0,1,2,4,5,6): ~9.5 seconds per iteration
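To make the mapping concrete, the 64-process run above places four threads on each of the 16 nodes, so its switch would read something like

-mpi hp:"-prot -IBV -cpu_bind=MAP_CPU:0,3,5,7"

(reconstructed from the core list quoted above rather than copied from the actual run script).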

Since the above nodes were not fully loaded (there were unused cores), I ran the following tests on fully loaded nodes.

32 processors / 4 nodes (no CPU locking): ~44 seconds per iteration
32 processors / 4 nodes (locked to cores 0-7): ~29 seconds per iteration


I believe the lessons here are (admittedly from a single model/cluster):

1) You will not get good scalability if you do not core lock.
2) You can run nodes fully loaded, with only a small degradation, if you core lock.

Since it is a free software switch, it would appear to be worth doing.

It also works with the HP-MPI versions of STAR-CD but I don't have any recent benchmark data.

#2 | October 8, 2010, 13:05 | f-w (Senior Member)
Nice write-up; do you know if there is a similar option for MPICH?

#3 | October 10, 2010, 10:14 | RobertB (Senior Member)
I'm afraid I only use Windows for pre-processing, so I haven't looked into it.

If you use STAR-CD 3.26, there is an HP-MPI version available if you ask.

Another important thing for maximizing performance is to set the scene update frequency so that scenes refresh only every 10 iterations rather than every iteration, so that they do not slow the solver down. If you select all the scenes at once you can edit the update frequency for all of them simultaneously.

#4 | October 11, 2010, 10:42 | TMG (Member)
No, there is no equivalent set of options for MPICH. MPICH does not do native InfiniBand either, which means terrible comparative performance in an InfiniBand environment. MPICH is the MPI of last resort, used only when nothing else works or HP-MPI just isn't available for a specific build.

#5 | October 13, 2010, 10:03 | Thomas Larsen, "Larsen" (New Member)
We have a similar setup here at work, so I tested the command line yesterday. I used a 3.5 million cell simulation on 8 workstations with 2x6 cores each and InfiniBand, a total of 96 parallel processes.

We got an improvement of ~5% in both a steady case and one with a rotating region (with interface updates). Our gain isn't as large as RobertB's; I suspect it has to do with lower inter-node network communication. I am going to do a couple more tests, since a 5% performance gain is considerable here, where we run simulations continuously.
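For our 2x6-core workstations (no hyperthreading) the binding list simply enumerates all twelve physical cores, so the switch I tried was along the lines of

-mpi hp:"-prot -IBV -cpu_bind=MAP_CPU:0,1,2,3,4,5,6,7,8,9,10,11"

(shown here as an illustration of the switch format rather than an exact copy of our command line).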

Just out of curiosity, what kind of company and simulations do you usually work with, RobertB?

#6 | October 13, 2010, 15:24 | RobertB (Senior Member)
I'm following this up with adapco and will let you know what comes out.

We had, however, also seen sizeable gains on a previous Opteron-based system using STAR-CD, HP-MPI and core locking.

You may almost have 'too many' threads, as you will only have about 30K cells per thread; my smallest thread size was about 10x that, and for the fully loaded node cases it was about 30x. Your balance of interprocess communication to matrix solution time may therefore be different from mine.
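To put rough numbers on that (simple division using the counts quoted in this thread): 3.5 million cells / 96 processes is roughly 36,000 cells per thread, whereas 27 million / 96 is roughly 280,000 and 27 million / 32 (the fully loaded 4-node runs) is roughly 840,000.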

Do you see much gain from using 4 threads per CPU as against 6 threads? Do the processors have enough memory bandwidth? Are you using Intel or AMD processors?

I work in the gas turbine industry and the case was a conjugate heat transfer model used to design a cooling system.

#7 | October 17, 2010, 12:22 | Thomas Larsen, "Larsen" (New Member)
Well, we usually only run low cell count cases with a lot of iterations (6-DOF, moving mesh and so on), so this is a typical case for us. It might not be optimal with so few cells per node, but we usually have to have results overnight.

We use AMD processors and work mostly with marine applications here. I am looking forward to hearing CD-adapco's response!

#8 | October 22, 2010, 08:59 | RobertB (Senior Member)
The basic reply (thanks to adapco/Intel) is:

If you have Xeon processors with hyperthreading, you have two options to get the best performance.

1) If you leave hyperthreading on, you need to lock cores

The scheduler cannot differentiate between real and virtual cores and can therefore put two threads on a single physical core, which slows things down. If you are fully loading the cores, this slowdown is of the order of 50% greater runtime.

2) Alternatively you can turn off hyperthreading in the BIOS.

In this case all cores are real processors and the scheduler does a decent job of assigning threads to cores.
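A quick way to check which core IDs are real and which are the hyperthreaded siblings (a Linux sysfs query, not part of the reply above) is

cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list

Each line lists the logical CPUs that share one physical core; on the nodes above you would expect pairs such as 0,8 and 1,9, which is why binding to cores 0-7 puts one thread on each physical core.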

Larsen, this would explain why you see little difference, as you have AMD chips with no virtual cores.

Hope this helps
