|
December 8, 2016, 16:02 |
CFX scalability with MPI
|
#1 |
New Member
Join Date: Jan 2015
Posts: 29
Rep Power: 11 |
For those who have been running CFX with the 'local parallel MPI' option: how well does CFX scale for you?
I'm currently using a Xeon E5 v3 workstation with 2 CPUs, each with 8 cores (16 with hyper-threading). My experience running a single-stage turbine steady simulation with mesh sizes from 2M up to 8M cells is that 16 cores is barely faster than 8 cores, while 8 cores is ~60% faster than 4 cores. The speed-up on 8 cores is roughly 6x compared to serial mode. I'm wondering whether people see a similar level of scalability, and whether there is any general advice for improving the speed-up when running with more partitions. Later on I need to run URANS with a ~40M-cell mesh, and it would be good to keep the run time down. So far, all I could think of is to keep the job on a single socket, i.e. using 1 CPU, to avoid communication cost between sockets. |
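A rough sanity check on those numbers: fitting Amdahl's law to the reported 6x speed-up on 8 cores gives an implied serial fraction, and the prediction it yields for 16 cores can be compared with the observed "barely faster" result. This is purely illustrative, uses only the figures quoted above, and deliberately ignores memory-bandwidth saturation, which is the more likely real limit here.
Code:
# Fit Amdahl's law S(n) = 1/(f + (1 - f)/n) to the reported speed-up
# (6x on 8 cores) and predict the 16-core speed-up. Illustrative only.
def amdahl_speedup(n, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def fit_serial_fraction(n, observed_speedup):
    # Invert S(n) = 1/(f + (1 - f)/n) for f.
    return (1.0 / observed_speedup - 1.0 / n) / (1.0 - 1.0 / n)

f = fit_serial_fraction(8, 6.0)            # ~0.048
print(f"implied serial fraction: {f:.3f}")
print(f"Amdahl prediction for 16 cores: {amdahl_speedup(16, f):.1f}x")
# Prints roughly 9.3x, well above the observed "barely faster than 8
# cores", so something other than serial work is throttling the scaling.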
|
December 8, 2016, 17:58 |
|
#2 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144 |
* Turn hyperthreading off. It does not help CFX simulations; you need to use physical cores (a quick way to check the physical core count is sketched below this list).
* Xeon workstations will lose considerable parallel performance due to bottlenecks in things like the memory bus. If you report a 6x speed-up on 8 cores then you are doing pretty well; this is about as good as you are going to get.
* Distributed parallel speed-up is better. This is because you have not only multiple cores but also multiple memory buses and all the other hardware that goes with them.
* But at around 8 to 16 cores on distributed parallel you will start to have scaling problems if you are using ethernet. You will need to consider high-speed interconnects like InfiniBand.
* If you are looking at large systems (a few hundred cores or more), then the design of these systems is very complex. To get good performance you need to carefully design many factors. You can't just buy lots of workstations and hook them up - your speed-up will be terrible. A big investment like this will require careful design and testing to ensure it works well.
Note that none of my comments above mention CFX. These factors are common to any software running on multiple cores, so the issue is not unique to CFX. |
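As a minimal sketch of the physical-cores point above (assuming Linux and the third-party psutil package; the CPU numbering is machine-specific, so treat it as illustrative rather than a recipe), the following checks physical versus logical core counts and pins the current process to one logical CPU per physical core:
Code:
# Sketch: report physical vs. logical cores and pin to physical cores only.
# Assumes Linux and the third-party 'psutil' package; core IDs are illustrative.
import os
import psutil

physical = psutil.cpu_count(logical=False)   # real cores
logical = psutil.cpu_count(logical=True)     # includes hyper-threaded siblings

print(f"physical cores: {physical}, logical cores: {logical}")

if physical and logical and logical > physical:
    # On many Linux boxes the first 'physical' logical CPU IDs map to distinct
    # physical cores, but verify with 'lscpu -e' on your own machine.
    os.sched_setaffinity(0, set(range(physical)))
    print(f"pinned to logical CPUs 0..{physical - 1}")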
|
December 8, 2016, 18:33 |
|
#3 |
New Member
Join Date: Jan 2015
Posts: 29
Rep Power: 11 |
Thanks Glenn - did you mean I would have scaling problems once the total number of cores across all distributed systems exceeds 16, or once it exceeds 16 per computer?
|
|
December 9, 2016, 01:14 |
|
#4 |
Senior Member
Join Date: Feb 2011
Posts: 496
Rep Power: 18 |
||
December 9, 2016, 01:57 |
|
#5 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144 |
Let me clarify: I would expect a distributed parallel run with 16 partitions (either 16 nodes x 1 partition per node, or 8 nodes x 2 partitions per node) to start slowing down unless you have a high-speed interconnect.
Or, put another way: I would expect a distributed parallel run with 8 partitions, as 2 nodes with 4 partitions per node, to start slowing down on ethernet, as the network speed will be the bottleneck there.
Disclaimer: It has been a few years since I did parallel benchmarks, so my rules of thumb might be a bit out of date. |
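To put a very rough number on the interconnect point, the sketch below compares how long a per-iteration halo exchange might take over gigabit ethernet versus FDR InfiniBand. The halo size, bytes per face, and exchanges per iteration are all assumed values, not measurements from any CFX run, so read the output as an order-of-magnitude illustration only.
Code:
# Back-of-envelope halo-exchange cost per iteration, with assumed numbers
# (halo faces, bytes per face, exchanges per iteration) - illustrative only.
halo_faces_per_partition = 50_000      # assumed interface faces per partition
bytes_per_face = 8 * 10                # assumed: ~10 double-precision values
exchanges_per_iteration = 20           # assumed messages per iteration

payload = halo_faces_per_partition * bytes_per_face * exchanges_per_iteration

links = {
    "1 Gb/s ethernet": 1e9 / 8,        # ~125 MB/s usable, ignoring latency
    "FDR InfiniBand":  56e9 / 8,       # ~7 GB/s nominal
}

for name, bandwidth_bytes_per_s in links.items():
    seconds = payload / bandwidth_bytes_per_s
    print(f"{name}: ~{seconds * 1e3:.1f} ms per iteration just moving halos")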
|
|
|
Similar Threads |
Thread | Thread Starter | Forum | Replies | Last Post |
CFX vs. FLUENT | turbo | CFX | 4 | April 13, 2021 09:08 |
Problem running cfx on hpc | beyonder1 | CFX | 4 | September 14, 2015 03:35 |
MPI code on multiple nodes, scalability and best practice | t.teschner | Hardware | 0 | October 7, 2014 06:07 |
CFX pressure in Simulations problem | nasdak | CFX | 1 | April 14, 2010 14:22 |
PhD using CFX | Rui | CFX | 9 | May 28, 2007 06:59 |