Parallel speedup scales better than number of CPUs

MikeWorth · August 20, 2020, 10:00

I've got a model that has a fairly low mesh cell count (~80k), with a big AMI boundary running up through it that will significantly limit how it can be decomposed. I don't have the knowledge to decide how many CPUs would be best.

As such I decided to do a quick scaling test, where I run the first 1ms of simulated time over and over with different numbers of CPUs, recording how long it took each time. I also calculated the speedup for each run, i.e. the time for 1 CPU divided by the time for that run.

I ran all of this on an AWS c5a.8xlarge machine (32 virtual CPUs, so 16 proper cores). The results are tabulated below:

Code:

CPUs	Time/s	Speedup
1	128.86	1
2	43.68	2.95
3	32.09	4.015
4	45.94	2.804
5	40.26	3.2
6	23.85	5.402
7	34.44	3.741
8	17.6	7.321
9	21.82	5.905
10	19.16	6.725
11	22.15	5.817
12	20.75	6.21
13	19.62	6.567
14	28.57	4.51
15	36.98	3.484
16	20.14	6.398

What strikes me as odd is that the 2 and 3 CPU results show a speedup greater than the increase in computational power: with twice the cores, the case solves in about a third of the time. Have I missed something, or is there something funny going on with my approach?
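To make the odd rows easier to spot, one option is to add a parallel efficiency column (speedup divided by CPU count) to the table; anything above 1 is a better-than-linear result. This is only a minimal sketch: the function name `add_efficiency` is hypothetical, and it assumes a tab-separated table with a header row, in the same layout as the one above.

```shell
# Append a parallel-efficiency column (speedup / cpuCount) to a scaling table.
# Input on stdin: tab-separated "cpuCount<TAB>time<TAB>speedup" with a header row.
# add_efficiency is a hypothetical helper name, not part of OpenFOAM.
add_efficiency() {
    awk -F'\t' 'NR == 1 { print $0 "\tEfficiency"; next }
        {
            eff = $3 / $1
            flag = (eff > 1) ? "\tsuperlinear" : ""
            printf "%s\t%.3f%s\n", $0, eff, flag
        }'
}

# Example with three rows taken from the table above
printf 'CPUs\tTime/s\tSpeedup\n1\t128.86\t1\n2\t43.68\t2.95\n8\t17.6\t7.321\n' | add_efficiency
```

The same function could be fed the full results file, e.g. `add_efficiency < log.scalingTest`.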

I'm using scotch decomposition, and my (not very polished) script is this:

Code:

maxCpus=16        # Try all CPU counts up to this value
runLength=0.001   # Simulated time to run for with each CPU count

. ${WM_PROJECT_DIR:?}/bin/tools/RunFunctions        # Tutorial run functions
solver=$(getApplication)

# Keep a copy of controlDict so the original endTime can be restored at the end
cp system/controlDict system/controlDict.orig
sed -i "/^endTime/c\endTime         $runLength;" system/controlDict

./Allrun.pre

echo "cpuCount	executionTime	SpeedUp" > log.scalingTest

# Serial reference run
runApplication $solver
executionTimeSerial=$(grep ExecutionTime log.${solver} | tail -n1 | cut -d' ' -f3)

echo "1	${executionTimeSerial}	1" >> log.scalingTest
echo "Execution Time: $executionTimeSerial s"

mv log.${solver} log.${solver}.1CPUs

# Parallel runs
for cpuCount in $(seq 2 $maxCpus)
do
  foamDictionary system/decomposeParDict -entry numberOfSubdomains -set $cpuCount

  runApplication decomposePar

  # Each processor directory needs the parallel version of meshModifiers
  find . -maxdepth 1 -name "processor*" -type d | while read -r procDir
  do
      cp include/meshModifiers.parallel "$procDir/constant/polyMesh/meshModifiers"
  done

  runParallel $solver

  executionTime=$(grep ExecutionTime log.${solver} | tail -n1 | cut -d' ' -f3)
  speedUp=$(echo "scale=3; $executionTimeSerial / $executionTime" | bc)

  echo "${cpuCount}	${executionTime}	${speedUp}" >> log.scalingTest
  echo "Execution Time: $executionTime s"
  echo "Speed Up (over serial): $speedUp"

  # Clean up so the next CPU count starts from a fresh decomposition
  rm -r processor* log.decomposePar
  mv log.${solver} log.${solver}.${cpuCount}CPUs
done

# Restore the original controlDict (and with it the original endTime)
mv system/controlDict.orig system/controlDict

echo "Results:"
cat log.scalingTest

Thanks,
Mike

GerhardHolzinger · August 20, 2020, 11:08

If you plot your speed-up vs. the CPUs, then you will see an initial rise which is followed by leveling-off with quite some noise super-imposed.

Why the "noise"? Some numbers of CPUs distribute the load more favourably among the CPUs, while other numbers (one more or one less) distribute the load more unfavourably.
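One way to check how evenly a given CPU count splits the mesh is to read the per-processor cell counts out of the decomposePar output. The sketch below is only an illustration: `cell_balance` is a hypothetical helper, and it assumes the log contains one line of the form "Number of cells = N" per processor, which may differ between OpenFOAM versions.

```shell
# Summarise cell counts per processor from decomposePar output read on stdin.
# Assumes lines of the form "    Number of cells = 12345" (one per processor).
# cell_balance is a hypothetical helper name, not part of OpenFOAM.
cell_balance() {
    awk '/^[[:space:]]*Number of cells = / {
            n++; sum += $NF
            if ($NF > max) max = $NF
        }
        END {
            if (n == 0) { print "no processor cell counts found"; exit 1 }
            mean = sum / n
            printf "processors=%d mean=%.1f max=%d imbalance=%.1f%%\n", n, mean, max, 100 * (max / mean - 1)
        }'
}

# Example with made-up cell counts for three processors
printf 'Processor 0\n    Number of cells = 30000\nProcessor 1\n    Number of cells = 20000\nProcessor 2\n    Number of cells = 25000\n' | cell_balance
```

In the scaling script this would have to be called as `cell_balance < log.decomposePar` before that log is deleted. The busiest processor sets the pace, so a large imbalance figure for a particular CPU count would be consistent with a dip in the speedup curve.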

MikeWorth · August 20, 2020, 11:29

The initial rise, followed by a levelling off (and after a while dropping back down) is exactly what I was expecting. The thing that threw me was the points above the x=y line that you've plotted.

Is it genuinely the case that for my setup I can expect 2xCPU to run 3 times faster than 1xCPU, or is this output a sign that I've done something silly?

Wenyuan · August 20, 2020, 12:35

Hi Mike,

Could you please run your simulations for a longer time, say, 10 ms, then calculate the time it takes for the last 1 ms for each simulation?
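The time spent on the last 1 ms can be pulled out of the solver log by subtracting the ExecutionTime reported at 9 ms from the final one. A rough sketch follows; `time_after` is a hypothetical helper, and it assumes the log has "Time = <t>" lines followed by "ExecutionTime = <s> s  ClockTime = <s> s" lines, so the patterns may need adjusting for a particular OpenFOAM version.

```shell
# Print the ExecutionTime spent after simulated time tStart, from a solver log on stdin.
# Assumes log lines "Time = <t>" and "ExecutionTime = <s> s  ClockTime = <s> s".
# time_after is a hypothetical helper name, not part of OpenFOAM.
time_after() {
    awk -v tStart="$1" '
        /^Time = /          { t = $3 + 0 }
        /^ExecutionTime = / {
            if (t <= tStart + 1e-12) before = $3   # last report at or before tStart
            last = $3
        }
        END { printf "%.2f\n", last - before }'
}

# Example with a made-up three-step log: time spent after t = 0.009
printf 'Time = 0.008\nExecutionTime = 80 s  ClockTime = 81 s\nTime = 0.009\nExecutionTime = 90.5 s  ClockTime = 91 s\nTime = 0.01\nExecutionTime = 100.75 s  ClockTime = 102 s\n' | time_after 0.009
```

Comparing this figure across CPU counts leaves out whatever start-up cost is incurred in the first few time steps.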

pattim · August 21, 2020, 18:03

Is there an automatic partitioner in OF? It is sometimes said that a decomposition is best when the number of cells on the interfaces between partitions is minimal, but that may also depend on the specifics of the flow...

Thanks!

Quote:

Originally Posted by GerhardHolzinger

If you plot your speed-up vs. the CPUs, then you will see an initial rise which is followed by leveling-off with quite some noise super-imposed.

Why the "noise"? Some numbers of CPUs distribute the load more favourably among the CPUs, while other numbers (one more or one less) distribute the load more unfavourably.

joegi.geo · August 21, 2020, 18:30

Super linear speed-up!!!

Look for that on the web.
