How well does CFX parallel scale?

danieru · February 20, 2012, 03:26

Hi,

We're trying to run CFX parallel jobs on our cluster, and as near as I can tell, CFX seems to be scaling terribly.

As an example, our nodes each have 4 CPUs with 12 cores each. When running CFX with 4 processes on a node, each process only utilizes 25% of the core it's running on. At 8 processes it's only 8% per core, and at 48 processes it's reported down at 1 or 2% per process! This is with no other processes competing for CPU cycles.

Can anyone comment on their experience with how well the CFX parallel solver scales, both on a single node with multiple cores and across nodes?

ghorrocks · February 20, 2012, 06:03

Sorry, wrote the reply below assuming you are running distributed parallel, but on a second read of your question I think you are running local parallel. Oh well, I will leave the reply up anyway just in case it is relevant. I will tack the local parallel comments on the bottom.

It is not as simple as you seem to imply. When you are trying to pipe 12 cores' worth of computations down a single network pipe you are going to have a major bottleneck unless you have a super-duper network. What network are you running? If you are on Ethernet then forget it, it will never work. You will need InfiniBand, Myrinet or one of the other high-end networks to get decent scaling out of this.

And it does not stop there - you need a motherboard with a fast pipe from CPU to the network adapter. I have seen differences of 4x between motherboards with the same CPU. A quality motherboard is essential. And then there is the network switch as well.

It is highly unlikely your poor scale up is due to CFX. CFX has run out to thousands of CPUs with good speedups, so something is wrong with your setup. And if you have got all the software set up correctly then it probably means you need to buy a really expensive network.

********

If you are getting this performance on local parallel then check that you are running the correct parallel solver for your CPU. If your solver has not detected the CPU correctly it will run with the default solver, which is really slow. Likewise for the CPU setup: the solver should detect it and optimise the setup for it.

The speed up you are reporting is pretty bad and something is wrong. CFX has good parallel performance so it is unlikely to simply be CFX.

Can you describe the simulation you are using to get these performance and speedup figures?

danieru · February 20, 2012, 06:40

Hi ghorrocks,

Thanks for your reply. From what you've said, it's probably a configuration issue and not a CFX scalability issue, which is good news because it means the performance can be improved.

We would like to run distributed parallel, but we're first just trying to get CFX up and running smoothly with local parallel jobs.

Quick hardware rundown:

We have a new cluster with 318 nodes. Our interconnect is 40 Gb/s InfiniBand and each node has:
- 4 x AMD Opteron 6238 (Interlagos) 12-core 2.6 GHz processors (48 cores total)
- 128GB RAM

...so I'm confident we have sufficient hardware specs to get decent performance, so now to the configuration...

The simulation is dead simple; just running the StaticMixer.def file that comes in the examples dir.

Quote:

If you are getting this performance on local parallel then check that you are running the correct parallel solver for your CPU. If your solver has not detected the CPU correctly it will run with the default solver, which is really slow. Likewise for the CPU setup: the solver should detect it and optimise the setup for it.

Ah! Those are some of the clues I'm looking for! Are there different executables for different CPU architectures (Intel vs AMD)? We installed the Linux x64 executables, and I'm only seeing one 'cfx5solve' executable in the 'CFX/bin' dir. How can I determine if the executable has correctly detected the AMD arch?

A lot of the software we use is compiled from source for our specific architecture, but in the ANSYS case it seems the installer put prebuilt Linux x64 executables in place.

I've looked through quite a bit of ANSYS documentation, but I'll keep digging. I am new to the software suite so if I've missed this information somewhere any pointers are appreciated!

ghorrocks · February 20, 2012, 06:45

If you are just using the static mixer tutorial then it is not big enough to get a good scale up with 48 cores. I would remesh the static mixer to a far finer mesh, something which will have a solver time of a few minutes on 48 cores (so an hour or more when run serial).

I am at home at the moment so cannot direct you where to look for the executable. Check the documentation to see if it gives you a pointer. Also if your CPU or OS is really new it might not be recognised.

danieru · February 20, 2012, 11:08

One of our users supplied us with a larger mesh that has about 7 million cells, so we'll do some tests with it and see what the wall time is for serial vs some parallel runs. While the tests are running, I'll continue digging through docs for more information about optimizing the runs for a specific arch.

ghorrocks · February 20, 2012, 17:53

Also, what time are you comparing? You should compare solver wall times. Do not compare total simulation time as that contains the setup and packup stuff and that is not parallel, so will distort the results.
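To make that comparison concrete, speedup and parallel efficiency can be worked out from the solver wall times alone. Below is a minimal Python sketch; the function names and the timings are made-up placeholders for illustration, not measurements from this thread.

```python
def speedup(serial_wall_s: float, parallel_wall_s: float) -> float:
    """Speedup = serial solver wall time / parallel solver wall time."""
    if serial_wall_s <= 0 or parallel_wall_s <= 0:
        raise ValueError("wall times must be positive")
    return serial_wall_s / parallel_wall_s


def efficiency(serial_wall_s: float, parallel_wall_s: float, n_procs: int) -> float:
    """Parallel efficiency = speedup / number of processes (1.0 is ideal)."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup(serial_wall_s, parallel_wall_s) / n_procs


# Hypothetical solver wall times in seconds (placeholders only).
serial = 3600.0
runs = {4: 1000.0, 8: 520.0, 48: 120.0}
for n, wall in sorted(runs.items()):
    print(f"{n:>3} procs: speedup {speedup(serial, wall):5.2f}, "
          f"efficiency {efficiency(serial, wall, n):.2f}")
```

Feed in only the solver wall time for each run, not the total job time, so the serial setup and packup phases do not drag the efficiency figure down.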

danieru · February 27, 2012, 06:34

Hi,

Just wanted to report back that we identified and solved the issue. It wasn't an issue with CFX itself, but rather a configuration issue with our job scheduler, SLURM. SLURM was confining all the processes to one core, hence the drop-off in CPU utilization per process.

Typically, SLURM likes to launch all the processes itself via srun, but since CFX spawns its own processes when running in parallel, we needed to explicitly tell SLURM to allocate the resources (in this case the number of cores) that CFX will launch processes for, even though SLURM itself isn't launching the processes. Once we had this configured correctly we got good performance from CFX, since it was able to utilize a core per process!
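The utilization figures from the first post fit this diagnosis: N busy processes sharing a single core get at most 100/N percent each, which gives 25% for 4 processes and about 2% for 48. A quick way to spot this kind of confinement is to check the CPU affinity from inside the job allocation. The Python sketch below is only an illustration with hypothetical helper names; it relies on `os.sched_getaffinity`, which is available on Linux only. The thread does not say which SLURM options were changed, so none are shown here.

```python
import os


def allowed_cpus() -> int:
    """Number of CPUs this process is allowed to run on.

    os.sched_getaffinity exists on Linux only; elsewhere fall back to
    os.cpu_count(), which ignores any affinity restriction.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def max_cpu_share(n_procs: int, n_allowed: int) -> float:
    """Upper bound on average CPU percent per process when n_procs
    busy processes share n_allowed cores."""
    if n_procs < 1 or n_allowed < 1:
        raise ValueError("n_procs and n_allowed must be at least 1")
    return min(100.0, 100.0 * n_allowed / n_procs)


print(f"CPUs available to this process: {allowed_cpus()}")
for n in (4, 8, 48):
    print(f"{n:>2} processes on 1 core: at most {max_cpu_share(n, 1):.1f}% each")
```

Run from the batch script just before the solver starts, a result of 1 on a 48-core node would point at the scheduler allocation rather than at CFX.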

So, thanks for helping narrow down possible contributors to the problem :-)

monkey1 · February 29, 2012, 08:53

Hi danieru!
Just a little hint that we got from CFX support concerning CFX scaling on multiprocessors. To get a good speedup, ANSYS recommends something like 250,000 cells per core. With fewer you will lose time due to inter-core communication, and with a lot more each core will simply be "overloaded".

So for your case of about 7 million cells, you should use about 28 cores to see a speedup.
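The arithmetic behind that figure is just the cell count divided by the target cells per core, rounded up. A small Python sketch, assuming the 250,000 cells-per-core rule of thumb quoted above (the function name is hypothetical):

```python
def cores_for_mesh(n_cells: int, cells_per_core: int = 250_000) -> int:
    """Suggested core count: mesh cells / target cells per core, rounded up."""
    if n_cells < 1 or cells_per_core < 1:
        raise ValueError("n_cells and cells_per_core must be at least 1")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-n_cells // cells_per_core)


print(cores_for_mesh(7_000_000))  # 28
```

The target is a rule of thumb, so treat the result as a starting point for a scaling test rather than a fixed answer.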

danieru · February 29, 2012, 09:30

Quote:

Originally Posted by monkey1

Hi danieru!
Just a little hint that we got from CFX support concerning CFX scaling on multiprocessors. To get a good speedup, ANSYS recommends something like 250,000 cells per core. With fewer you will lose time due to inter-core communication, and with a lot more each core will simply be "overloaded".

So for your case of about 7 million cells, you should use about 28 cores to see a speedup.

Hej monkey1,

That's great info! We'll pass that on to CFX users on our cluster. Thanks for taking a moment to post:-)

Far · February 29, 2012, 10:17

Good info. Thanks.

Lance · February 29, 2012, 10:20

We've seen almost linear scaling up to 256 cores with 50,000 hexas per core, so I would say that the optimum number of cells/core is both problem and cluster dependent.

The CFX manual talks about a minimum of 30,000 nodes/partition for tetrahedrals and 75,000 nodes/partition for hexahedrals, but actual numbers could be either lower or higher.
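Those minimums translate into a rough upper bound on the partition count for a given mesh. The Python sketch below uses the two figures quoted above; the function name and the 1.5 million node mesh are hypothetical, and the input is mesh nodes, not cells.

```python
# Minimum mesh nodes per partition, as quoted from the CFX manual above.
MIN_NODES_PER_PARTITION = {"tetrahedral": 30_000, "hexahedral": 75_000}


def max_partitions(mesh_nodes: int, element_type: str) -> int:
    """Largest partition count that keeps every partition at or above the minimum."""
    if mesh_nodes < 1:
        raise ValueError("mesh_nodes must be at least 1")
    minimum = MIN_NODES_PER_PARTITION[element_type]
    return max(1, mesh_nodes // minimum)


# Hypothetical mesh with 1.5 million nodes.
print(max_partitions(1_500_000, "tetrahedral"))  # 50
print(max_partitions(1_500_000, "hexahedral"))   # 20
```

As noted above, the practical limit on a particular cluster may sit on either side of this bound.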


