|
December 8, 2016, 16:02 |
CFX scalability with MPI
|
#1 |
New Member
Join Date: Jan 2015
Posts: 29
Rep Power: 11 |
For those who have been running CFX with the 'local parallel MPI' option: how well does CFX scale for you?
I'm currently using a Xeon E5 v3 workstation with 2 CPUs, each with 8 cores (16 with hyper-threading). My experience running a single-stage turbine steady simulation with mesh sizes from 2M up to 8M cells is that 16 cores is barely faster than 8 cores, while 8 cores is ~60% faster than 4 cores. The speed-up on 8 cores is roughly 6x compared to serial mode. I'm wondering whether people see a similar level of scalability, and whether there is any general advice for improving the speed-up when running with more partitions. Later on I need to run URANS with a ~40M-cell mesh, and it would be good to keep the run time down. So far, all I could think of is to keep the job on a single socket, i.e. using 1 CPU, to avoid communication cost between sockets. |
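A rough sanity check on those numbers: fitting Amdahl's law to the reported 6x speed-up on 8 cores gives an implied serial fraction, and the prediction it yields for 16 cores can be compared with the observed "barely faster" result. This is purely illustrative, uses only the figures quoted above, and deliberately ignores memory-bandwidth saturation, which is the more likely real limit here.
Code:
# Fit Amdahl's law S(n) = 1/(f + (1 - f)/n) to the reported speed-up
# (6x on 8 cores) and predict the 16-core speed-up. Illustrative only.
def amdahl_speedup(n, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def fit_serial_fraction(n, observed_speedup):
    # Invert S(n) = 1/(f + (1 - f)/n) for f.
    return (1.0 / observed_speedup - 1.0 / n) / (1.0 - 1.0 / n)

f = fit_serial_fraction(8, 6.0)            # ~0.048
print(f"implied serial fraction: {f:.3f}")
print(f"Amdahl prediction for 16 cores: {amdahl_speedup(16, f):.1f}x")
# Prints roughly 9.3x, well above the observed "barely faster than 8
# cores", so something other than serial work is throttling the scaling.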
|
December 8, 2016, 17:58 |
|
#2 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144 |
* Turn hyperthreading off. It does not help CFX simulations; you need to use physical cores (a quick way to check the physical core count is sketched below this list).
* Xeon workstations will lose considerable parallel performance due to bottlenecks in things like the memory bus. If you report a 6x speed-up on 8 cores then you are doing pretty well; this is about as good as you are going to get.
* Distributed parallel speed-up is better. This is because you have not only multiple cores but also multiple memory buses and all the other hardware that goes with them.
* But at around 8 to 16 cores on distributed parallel you will start to have scaling problems if you are using ethernet. You will need to consider high-speed interconnects like InfiniBand.
* If you are looking at large systems (a few hundred cores or more), then the design of these systems is very complex. To get good performance you need to carefully design many factors. You can't just buy lots of workstations and hook them up - your speed-up will be terrible. A big investment like this will require careful design and testing to ensure it works well.
Note that none of my comments above mention CFX. These factors are common to any software running on multiple cores, so the issue is not unique to CFX. |
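As a minimal sketch of the physical-cores point above (assuming Linux and the third-party psutil package; the CPU numbering is machine-specific, so treat it as illustrative rather than a recipe), the following checks physical versus logical core counts and pins the current process to one logical CPU per physical core:
Code:
# Sketch: report physical vs. logical cores and pin to physical cores only.
# Assumes Linux and the third-party 'psutil' package; core IDs are illustrative.
import os
import psutil

physical = psutil.cpu_count(logical=False)   # real cores
logical = psutil.cpu_count(logical=True)     # includes hyper-threaded siblings

print(f"physical cores: {physical}, logical cores: {logical}")

if physical and logical and logical > physical:
    # On many Linux boxes the first 'physical' logical CPU IDs map to distinct
    # physical cores, but verify with 'lscpu -e' on your own machine.
    os.sched_setaffinity(0, set(range(physical)))
    print(f"pinned to logical CPUs 0..{physical - 1}")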
|
December 8, 2016, 18:33 |
|
#3 |
New Member
Join Date: Jan 2015
Posts: 29
Rep Power: 11 |
Thanks Glenn - did you mean I would have scaling problems once the total number of cores across all distributed systems exceeds 16, or once it exceeds 16 per computer?
|
|
December 9, 2016, 01:14 |
|
#4 |
Senior Member
Join Date: Feb 2011
Posts: 496
Rep Power: 18 |
||
December 9, 2016, 01:57 |
|
#5 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144 |
Let me clarify: I would expect a distributed parallel run with 16 partitions (either 16 nodes x 1 partition per node, or 8 nodes x 2 partitions per node) to start slowing down unless you have a high-speed interconnect.
Or, put another way: I would expect a distributed parallel run with 8 partitions, as 2 nodes with 4 partitions per node, to start slowing down on ethernet, as the network speed will be the bottleneck there.
Disclaimer: It has been a few years since I did parallel benchmarks, so my rules of thumb might be a bit out of date. |
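To put a very rough number on the interconnect point, the sketch below compares how long a per-iteration halo exchange might take over gigabit ethernet versus FDR InfiniBand. The halo size, bytes per face, and exchanges per iteration are all assumed values, not measurements from any CFX run, so read the output as an order-of-magnitude illustration only.
Code:
# Back-of-envelope halo-exchange cost per iteration, with assumed numbers
# (halo faces, bytes per face, exchanges per iteration) - illustrative only.
halo_faces_per_partition = 50_000      # assumed interface faces per partition
bytes_per_face = 8 * 10                # assumed: ~10 double-precision values
exchanges_per_iteration = 20           # assumed messages per iteration

payload = halo_faces_per_partition * bytes_per_face * exchanges_per_iteration

links = {
    "1 Gb/s ethernet": 1e9 / 8,        # ~125 MB/s usable, ignoring latency
    "FDR InfiniBand":  56e9 / 8,       # ~7 GB/s nominal
}

for name, bandwidth_bytes_per_s in links.items():
    seconds = payload / bandwidth_bytes_per_s
    print(f"{name}: ~{seconds * 1e3:.1f} ms per iteration just moving halos")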
|
|
|
Similar Threads |
Thread | Thread Starter | Forum | Replies | Last Post |
CFX vs. FLUENT | turbo | CFX | 4 | April 13, 2021 09:08 |
Problem running cfx on hpc | beyonder1 | CFX | 4 | September 14, 2015 03:35 |
MPI code on multiple nodes, scalability and best practice | t.teschner | Hardware | 0 | October 7, 2014 06:07 |
CFX pressure in Simulations problem | nasdak | CFX | 1 | April 14, 2010 14:22 |
PhD using CFX | Rui | CFX | 9 | May 28, 2007 06:59 |