January 27, 2009, 13:15 |
parallel performance
|
#1 |
Guest
Posts: n/a
|
I am testing CFX-11 for parallel computing on an Intel quad-core machine with 4 GB of RAM. My domain is a pipe with about 10,000 elements in the cross section and 256 nodes streamwise (about 3 million nodes in total). I have found that the serial run performs very well, fully loading a single processor, while the parallel run is a disaster: the CPU load drops and the wall-clock time per timestep is higher than for the serial run! Could this depend on the domain shape, which is almost topologically cubic, or is there something else? Have I made a mistake? I use PVM and automatic partitioning.
|
|
January 27, 2009, 17:22 |
Re: parallel performance
|
#2 |
Guest
Posts: n/a
|
Hi,
Are you running on Windows? PVM does not run too well on Windows. Try MPICH instead.
Glenn Horrocks
|
January 27, 2009, 18:12 |
Re: parallel performance
|
#3 |
Guest
Posts: n/a
|
Here's one possibility. Each solver process needs some extra memory in addition to the memory for the matrices and variable values on the mesh. I'll call these two types the 'solver memory' and the 'job memory'. So:
a) For 1 CFX process you need (100% job memory + solver memory).
b) For 2 CFX processes, dividing the job 50% per process, you need 2*(50% job memory + solver memory).
c) For 3 CFX processes, dividing the job 33% per process, you need 3*(33% job memory + solver memory), etc.
So your parallel run will have a higher total memory requirement than the single-process version because of the extra copies of the 'solver memory'. Do you have enough memory? Are you hitting the pagefile in the parallel runs (but not in the single-processor case)? Cheers, andy
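A minimal sketch of that arithmetic, using made-up job and solver memory figures (not measured CFX values), just to show how the fixed per-process overhead adds up:

```python
# Hypothetical numbers: a job needing ~3 GB in serial plus an assumed
# ~0.2 GB of fixed 'solver memory' per process.
def total_memory_gb(n_processes, job_memory_gb=3.0, solver_overhead_gb=0.2):
    """Total RAM used when the job is split evenly across n_processes."""
    return n_processes * (job_memory_gb / n_processes + solver_overhead_gb)

for n in (1, 2, 4):
    print(f"{n} process(es): ~{total_memory_gb(n):.1f} GB")
# 1 process(es): ~3.2 GB
# 2 process(es): ~3.4 GB
# 4 process(es): ~3.8 GB
```

The total grows slowly with the process count, so on a 4 GB machine the parallel run can tip into swapping even though the serial run fits.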
|
January 28, 2009, 12:53 |
Re: parallel performance
|
#4 |
Guest
Posts: n/a
|
As Glenn suggested, use MPICH. PVM hangs on Windows, which is probably the greatest source of delay.
The case you are running is only 10k nodes, so it probably won't scale that well. Available memory won't be an issue, but there is additional computational overhead for the solver and some communication overhead. On large models these are amortized over a large number of nodes and aren't noticeable, but you will generally see parallel efficiency drop off as your partitions fall below about 100k nodes each. That said, with MPICH you should see some improvement in run time; I just wouldn't expect 2 processors to be twice as fast (maybe 1.2 to 1.5 times faster). Make your mesh bigger and you'll see better parallel efficiency. -CycLone
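A rough illustration of that trade-off, treating each timestep as per-partition compute plus a fixed per-process overhead (all constants here are invented for illustration, not CFX measurements):

```python
def speedup(total_nodes, n_procs, cost_per_node=1.0, overhead=2_700.0):
    """Serial time divided by parallel time under a fixed-overhead model."""
    serial_time = total_nodes * cost_per_node
    parallel_time = total_nodes / n_procs * cost_per_node + overhead
    return serial_time / parallel_time

for nodes in (10_000, 3_000_000):
    print(f"{nodes:>9} nodes on 2 cores: ~{speedup(nodes, 2):.2f}x")
#    10000 nodes on 2 cores: ~1.30x  (overhead eats much of the gain)
#  3000000 nodes on 2 cores: ~2.00x  (overhead is negligible)
```

The same fixed overhead that barely matters on a large partition dominates a small one, which is why efficiency falls off as partitions shrink.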
|
January 28, 2009, 17:44 |
Re: parallel performance
|
#5 |
Guest
Posts: n/a
|
You are probably right about PVM - I've never used Windows with CFX, and you have a knack for answering questions well here! However, I'll just point out that the OP's problem is 10,000 elements in *each* of 256 streamwise sections, which works out to roughly 2.6 million elements and about 3 million nodes in total, as the original post said. (It sounds like an extruded or structured mesh.)
I would certainly agree that 10,000 nodes is too small to scale well in parallel. However, the actual problem size of 3 million nodes does sound about the size that would use most of a 4 GB machine's memory. Hence my suggestion - but I don't have access to a suitable problem to estimate the actual memory consumption accurately just now, so I freely admit it's just a half-educated guess! Best wishes, andy2o
|
January 29, 2009, 12:33 |
Re: parallel performance
|
#6 |
Guest
Posts: n/a
|
Ah! I missed that. I thought it was 10k nodes total.
3 million nodes (hex) will probably require ~3 GB of RAM. The per-process solver memory overhead is still pretty small, so I doubt it would push him over the limit, but it may be close with other applications running on the same machine. PVM will definitely be an issue (it has been dropped from v12 beta altogether), so let's see how MPICH works for him. -CycLone
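For reference, a back-of-the-envelope check of that figure, assuming the common rule of thumb of roughly 1 GB of solver memory per million hex nodes (an assumed average; actual usage depends on the physics, precision and mesh type):

```python
nodes = 3_000_000
gb_per_million_nodes = 1.0   # assumed rule of thumb, not a measured CFX value
print(f"~{nodes / 1e6 * gb_per_million_nodes:.1f} GB")   # ~3.0 GB
```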
|
January 29, 2009, 16:26 |
Re: parallel performance
|
#7 |
Guest
Posts: n/a
|
On the Intel Core 2, the shared front-side bus can also become a bottleneck, since all cores compete for the same memory bandwidth. This might be part of the problem.
|