FLOP/clock-cycle

#1 etna (New Member), July 18, 2013, 17:42
Hi there,

Reading about the efficiency and performance of CFD simulations, I often find sentences like this: ... When running a typical CFD simulation on a cluster, the cores spend most of their time waiting for new data to arrive in the caches, which gives low performance from a FLOPs/s point of view, i.e., the realistic FLOPs/clock-cycle is far below the theoretical FLOPs/clock-cycle.

Example, from a recent OpenFOAM cluster benchmark: a simulation using AMD Interlagos CPUs (theoretically 8 FLOPs/clock-cycle) is only 10% faster than the same simulation run on AMD Fangio CPUs (same as Interlagos but capped at a maximum of 2 FLOPs/clock-cycle). Note: in theory, the simulation on Interlagos CPUs should be 4 times faster than on Fangio CPUs!
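One common way to reason about a gap like this is a roofline-style estimate, where attainable performance is the lower of the compute peak and (memory bandwidth × FLOPs per byte moved). The sketch below is only an illustration: the clock speed, per-core bandwidth, and arithmetic intensity are assumed values, not figures from the benchmark mentioned above, and the function name is hypothetical.

```python
def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Roofline estimate: the lower of the compute and memory ceilings."""
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)


# All numbers are illustrative assumptions, not benchmark data.
CLOCK_GHZ = 2.0    # assumed core clock
MEM_BW_GBS = 3.0   # assumed memory bandwidth available to one core, GB/s
INTENSITY = 0.25   # assumed FLOPs per byte moved for a sparse CFD kernel

fast_core = attainable_gflops(8 * CLOCK_GHZ, MEM_BW_GBS, INTENSITY)  # 8 FLOPs/cycle
slow_core = attainable_gflops(2 * CLOCK_GHZ, MEM_BW_GBS, INTENSITY)  # 2 FLOPs/cycle

print(f"8 FLOPs/cycle core: {fast_core:.2f} GFLOP/s attainable")
print(f"2 FLOPs/cycle core: {slow_core:.2f} GFLOP/s attainable")
print(f"modelled speedup:   {fast_core / slow_core:.2f}x")
```

With these assumed values both cores hit the same memory ceiling, so the modelled speedup is 1x rather than 4x. The full 4x only appears in this model when the kernel performs enough FLOPs per byte to reach the compute ceiling.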

Question 1:

Are the cores 'waiting' due to:
a) slow core-to-RAM communication?
b) slow communication between different cores (partitions) in a cluster?
c) both, depending on the core loading (nr. of CFD grid cells per core)?

Question 2:

How can I increase the realistic FLOPs/clock-cycle?
- if a), then I want to run my simulation on as many cores as possible (lower the nr. of cells per core)
- if b), then I want to run my simulation on as few cores as possible (increase the nr. of cells per core)
- if c), then I want to run with an optimum nr. of cells per core

Question 3:

How do I find an 'optimum nr. of cells per core'?

Is this number the same for cores with a high theoretical FLOPs/clock-cycle (for example 8) and a low theoretical FLOPs/clock-cycle (for example 2 or even 1)?
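One practical way to approach Question 3 is a strong-scaling test: run the same case on several core counts, record the wall-clock time per iteration, and compute speedup and parallel efficiency. The sketch below assumes such timings are available. The mesh size, the timings, the 0.8 efficiency threshold, and the function names are all hypothetical placeholders, not measurements from any real cluster.

```python
def scaling_table(n_cells, timings):
    """Build (cores, cells_per_core, speedup, efficiency) rows.

    timings maps core count -> wall-clock seconds per iteration.
    Speedup and efficiency are relative to the smallest core count.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, n_cells // cores, speedup, efficiency))
    return rows


def largest_efficient_run(rows, min_efficiency=0.8):
    """Pick the largest core count whose efficiency meets the threshold."""
    candidates = [row for row in rows if row[3] >= min_efficiency]
    return max(candidates, key=lambda row: row[0])


# Hypothetical timings for a hypothetical 32-million-cell case.
timings = {64: 10.0, 128: 5.2, 256: 2.9, 512: 2.0, 1024: 1.8}
rows = scaling_table(32_000_000, timings)
best = largest_efficient_run(rows)

for cores, cells_per_core, speedup, efficiency in rows:
    print(f"{cores:5d} cores  {cells_per_core:7d} cells/core  "
          f"speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")
print("largest run above threshold:", best[0], "cores")
```

Because the result depends on measured timings, the same procedure can be repeated on each type of core rather than assuming the answer carries over from one CPU to another.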

#2 kyle (Senior Member), July 18, 2013, 19:36
The answer is, as you probably expect, "c".

If you have slow RAM, or your mesh is not stored efficiently in memory, then your CPUs will spend a lot of time waiting for data to be transferred from memory.

If your network is slow, your domain is decomposed inefficiently, or your case is split across too many cores, the CPUs will be spending a lot of time waiting for data to cross the network.
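The two kinds of waiting described above can be put into a rough per-iteration cost model: a memory term that scales with the cells per core, and a network term that scales with the partition surface plus a fixed latency cost. This is a simplified sketch only. Every parameter value below is an assumption chosen for illustration, the cubic-partition halo estimate is a simplification, and cache effects and shared memory bandwidth within a socket are ignored.

```python
def time_per_iteration(n_cells, cores, *, bytes_per_cell=1000.0, mem_bw=3e9,
                       halo_bytes_per_cell=100.0, net_bw=1e9,
                       latency=5e-6, messages=60):
    """Return (memory_time, network_time) in seconds for one iteration.

    All default parameter values are illustrative assumptions.
    """
    cells_per_core = n_cells / cores
    # Memory term: every cell's data is streamed from RAM once per iteration.
    t_memory = cells_per_core * bytes_per_cell / mem_bw
    # Network term: halo of a roughly cubic partition (6 faces) plus latency.
    halo_cells = 6.0 * cells_per_core ** (2.0 / 3.0)
    t_network = messages * latency + halo_cells * halo_bytes_per_cell / net_bw
    return t_memory, t_network


def network_fraction(n_cells, cores):
    """Share of the modelled iteration time spent waiting on the network."""
    t_memory, t_network = time_per_iteration(n_cells, cores)
    return t_network / (t_memory + t_network)


for cores in (100, 1000, 10000):
    print(cores, "cores:", f"{network_fraction(10_000_000, cores):.0%} network wait")
```

In this model the network share grows as the case is split across more cores, which matches the trade-off described above: fewer cells per core reduces the memory wait per core but makes communication relatively more expensive.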

The optimum is very hard to define and really depends on your requirements. For many people the cost of hardware is a negligible expense, so they will use twice as many cores for only a 10% speedup. For others, hardware expense is a huge concern. Commercial software often costs more per core than the hardware, which can really affect which hardware makes sense.
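The "twice as many cores for only a 10% speedup" case can be turned into a quick calculation of what that choice costs in core-hours. The sketch below uses only those two numbers from the post. The function name is hypothetical, and it assumes cost is charged per core-hour, whether for hardware or for licenses.

```python
def relative_core_hours(core_factor, speedup):
    """Core-hours of the larger run relative to the smaller run."""
    return core_factor / speedup


core_factor = 2.0  # twice as many cores (from the post)
speedup = 1.10     # 10% faster (from the post)

extra = relative_core_hours(core_factor, speedup)
efficiency = speedup / core_factor

print(f"core-hours used:     {extra:.2f}x the smaller run")
print(f"parallel efficiency: {efficiency:.0%} for the added cores step")
```

Under that per-core-hour assumption, the larger run consumes roughly 1.8 times the resources to finish 10% sooner, which is acceptable only when turnaround time matters more than cost.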

My feeling is many people, myself included, spend so much time obsessing over and researching hardware that any benefit in doing so is eaten up by the time it takes!

#3 evcelica (Erik, Senior Member), July 19, 2013, 00:19
Quote (originally posted by kyle):
My feeling is many people, myself included, spend so much time obsessing over and researching hardware that any benefit in doing so is eaten up by the time it takes!

I'm definitely guilty of that as well!

#4 etna (New Member), July 21, 2013, 17:18
Thank you, kyle, for your quick response and very clear explanation!

Yep, I was expecting the answer to the 1st question to be c).

In principle I also agree with your observation that the value of searching for the cluster 'sweet spot' (optimum nr. of cells per core) is often overestimated (a loss of time).

Do you think the whole idea of finding the cluster 'sweet spot' is also irrelevant when we are talking about relatively large clusters (> 10,000 cores, where each simulation can be run on 500, 1000, 2000 or even 4000 cores)?

I expect that running loads of simulations a bit more efficiently (close to the cores' sweet spot) can add up to quite a nice saving in time over a year...

What confuses me additionally is the fact that different cores have wildly different theoretical FLOPs/clock-cycle figures...

If I have one 10,000-core cluster consisting of cores with max. 2 FLOPs/clock-cycle and another one with 10,000 cores having max. 8 FLOPs/clock-cycle, how do I choose the simulation strategy for each cluster?

If I want efficiency, should I run all my simulations on the first cluster using 'only' 500 cores, while going for 4000 cores on the second cluster? Or vice versa?

If someone could explain it in layman's terms I would be grateful!