Parallel speedup scales better than number of CPUs

August 20, 2020, 10:00
Mike Worth
I've got a model that has a fairly low mesh cell count (~80k), with a big AMI boundary running up through it that will significantly limit how it can be decomposed. I don't have the knowledge to decide how many CPUs would be best.

As such I decided to do a quick scaling test, where I run the first 1 ms of simulated time over and over with different numbers of CPUs, recording how long it took each time. I also calculated the speedup, i.e. the time for 1 CPU divided by the time for each CPU count.

I ran all of this on an AWS c5a.8xlarge machine (32 virtual CPUs, so 16 proper cores). The results are tabulated below:
CPUs	Time/s	Speedup
1	128.86	1
2	43.68	2.95
3	32.09	4.015
4	45.94	2.804
5	40.26	3.2
6	23.85	5.402
7	34.44	3.741
8	17.6	7.321
9	21.82	5.905
10	19.16	6.725
11	22.15	5.817
12	20.75	6.21
13	19.62	6.567
14	28.57	4.51
15	36.98	3.484
16	20.14	6.398
What strikes me as odd is that the 2 and 3 core results suggest a speedup greater than the increase in computational power: twice the cores solves in a third of the time. Have I missed something, or is there something funny going on with my approach?
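
One way to make the odd points stand out is to divide each speedup by its CPU count, which gives the parallel efficiency; anything above 1 is faster than ideal scaling. The following is only a rough sketch, assuming a tab-separated table with a header row and the three columns shown above (CPU count, time, speedup). The function name scaling_efficiency is made up for illustration.

```shell
#!/bin/sh
# Sketch: append parallel efficiency (speedup / cpuCount) to each data row
# of a tab-separated scaling table. The first line is assumed to be a header.
# Rows with efficiency above 1 are flagged as super-linear.
scaling_efficiency() {
    awk -F'\t' 'NR > 1 && $1 > 0 {
        eff = $3 / $1
        flag = (eff > 1) ? "\tsuper-linear" : ""
        printf "%d\t%s\t%s\t%.3f%s\n", $1, $2, $3, eff, flag
    }' "$1"
}
```

Called as scaling_efficiency log.scalingTest, this would flag only the 2 and 3 CPU rows of the table above (efficiencies of roughly 1.48 and 1.34); every other row stays below 1.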

I'm using scotch decomposition, and my (not very polished) script is this:
#!/bin/bash
maxCpus=16       # Try all CPU counts up to this value
runLength=0.001  # How much simulated time to run for with each CPU count

. ${WM_PROJECT_DIR:?}/bin/tools/RunFunctions        # Tutorial run functions

# Assumption: $solver and $simFinish were not defined in the script as posted.
# Here both are read from system/controlDict before endTime is overwritten.
solver=$(getApplication)
simFinish=$(foamDictionary system/controlDict -entry endTime -value)

sed -i "/^endTime/c\endTime         $runLength;" system/controlDict

printf 'cpuCount\texecutionTime\tSpeedUp\n' > log.scalingTest

# Serial reference run
runApplication $solver
executionTimeSerial=$(grep ExecutionTime log.${solver} | tail -n1 | cut -d' ' -f3)

printf '1\t%s\t1\n' "$executionTimeSerial" >> log.scalingTest
echo "Execution Time: $executionTimeSerial s"

mv log.${solver} log.${solver}.1CPUs

for cpuCount in $(seq 2 $maxCpus)
do
  foamDictionary system/decomposeParDict -entry numberOfSubdomains -set $cpuCount
  runApplication decomposePar
  # Copy the parallel mesh modifiers into every processor directory
  find . -maxdepth 1 -name "processor*" -type d | while read -r procDir
  do
      cp include/meshModifiers.parallel "$procDir/constant/polyMesh/meshModifiers"
  done

  runParallel $solver

  executionTime=$(grep ExecutionTime log.${solver} | tail -n1 | cut -d' ' -f3)
  speedUp=$(echo "scale=3; $executionTimeSerial / $executionTime" | bc)
  printf '%s\t%s\t%s\n' "$cpuCount" "$executionTime" "$speedUp" >> log.scalingTest
  echo "Execution Time: $executionTime s"
  echo "Speed Up (over serial): $speedUp"
  # Clean up so that the next decomposition starts from scratch
  rm -r processor* log.decomposePar
  mv log.${solver} log.${solver}.${cpuCount}CPUs
done

# Restore the original endTime (double quotes so that $simFinish is expanded)
sed -i "/^endTime/c\endTime         $simFinish;" system/controlDict

echo "Results:"
cat log.scalingTest
August 20, 2020, 11:08
Senior Member
Gerhard Holzinger
If you plot your speed-up vs. the number of CPUs, you will see an initial rise followed by a levelling-off, with quite some noise superimposed.

Why the "noise"? Some numbers of CPUs distribute the load more favourably among the CPUs, while other numbers (one more or one less) distribute the load more unfavourably.
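
To check this for a given CPU count, the decomposePar log can be inspected before it is deleted. The sketch below assumes that the log contains one line per processor of the form "Number of cells = N", which is what decomposePar normally prints; the function name cell_imbalance is made up for illustration.

```shell
#!/bin/sh
# Sketch: report how unevenly the cells are spread over the processors,
# based on the per-processor "Number of cells = N" lines of a decomposePar log.
# Summary lines such as "Max number of cells = ..." do not match and are ignored.
cell_imbalance() {
    awk '/^[[:space:]]*Number of cells = / {
        cells = $NF + 0
        n++
        sum += cells
        if (cells > max) max = cells
    }
    END {
        if (n == 0) { print "no per-processor cell counts found"; exit 1 }
        mean = sum / n
        printf "processors=%d mean=%.1f max=%d imbalance=%.1f%%\n", n, mean, max, 100 * (max / mean - 1)
    }' "$1"
}
```

Called as cell_imbalance log.decomposePar, it prints the mean and maximum cell count and how far the busiest processor sits above the mean. Cell count is only a proxy for load, so this does not capture the cost of the AMI boundary or of communication.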
August 20, 2020, 11:29
Mike Worth
The initial rise, followed by a levelling off (and after a while dropping back down) is exactly what I was expecting. The thing that threw me was the points above the x=y line that you've plotted.

Is it genuinely the case that for my setup I can expect 2xCPU to run 3 times faster than 1xCPU, or is this output a sign that I've done something silly?
August 20, 2020, 12:35
New Member
Wenyuan Fan
Hi Mike,

Could you please run your simulations for a longer time, say, 10 ms, then calculate the time it takes for the last 1 ms for each simulation?
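
A rough sketch of how that last-window time could be pulled out of a solver log, presumably so that start-up cost no longer counts towards the measurement. It assumes the usual log layout, where a "Time = t" line opens each time step and an "ExecutionTime = x s" line closes it; the function name window_time is made up for illustration.

```shell
#!/bin/sh
# Sketch: execution time spent on the time steps after simulated time $2,
# read from the solver log $1.
# "before" is the ExecutionTime reported at the last step with Time <= $2,
# "last" is the final ExecutionTime in the log.
window_time() {
    awk -v start="$2" '
        /^Time = /          { t = $3 + 0 }
        /^ExecutionTime = / { if (t <= start + 0) before = $3 + 0; last = $3 + 0 }
        END                 { printf "%.2f\n", last - before }
    ' "$1"
}
```

For a 10 ms run, window_time log.${solver}.${cpuCount}CPUs 0.009 would give the time taken for the last 1 ms, which could then replace the whole-run ExecutionTime in the speedup calculation.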
August 21, 2020, 18:03
Patti Michelle Sheaffer
Is there any auto-partitioner in OF? It is sometimes said that a decomposition is best when the number of cells on the interfaces between partitions is minimal, but that may also depend on the specifics of the flow...
August 21, 2020, 18:30
Senior Member
Super linear speed-up!!!

Look for that on the web.
