May 21, 2002, 07:38 |
Parallel speed up
|
#1 |
Guest
Posts: n/a
|
Hi,
Does anyone have experience with a dual-processor computer and CFX-5.5 under Linux? What kind of speed-up is common compared to a single processor? Thanks a lot. Regards, Soren |
|
May 21, 2002, 12:01 |
Re: Parallel speed up
|
#2 |
Guest
Posts: n/a
|
I seem to remember that you can obtain a fairly linear relationship, assuming you solve a problem large enough to dilute the effects of the partitioning (I saw a CFX presentation), but contact your vendor, since CFX is very likely to have done the comparison.
|
|
May 21, 2002, 13:02 |
Re: Parallel speed up
|
#3 |
Guest
Posts: n/a
|
CFX-5.5 gets speedups of 1.6-1.8 on Linux, depending on the problem size. This is better on high-end workstations, where 1.9-2.1 is typical. The memory and cache architectures on Intel/AMD Linux boxes are just not good enough to get comparable speedups.
Neale |
|
May 22, 2002, 03:14 |
Re: Parallel speed up
|
#4 |
Guest
Posts: n/a
|
Hi
Thanks for the reply. I know that under Windows NT/2000/XP the parallel performance of a dual-processor computer is very poor: the speed-up is about 1.1 to 1.2. That's why I am looking at Linux. Any comments? Regards, Soren |
|
May 22, 2002, 04:41 |
Re: Parallel speed up
|
#5 |
Guest
Posts: n/a
|
Using CFX-5.5 on a Pentium IV with Windows NT, we obtained a speed-up of about 1.8-2.0. But we have only tested it on up to 4 PCs.
Astrid |
|
May 22, 2002, 05:43 |
Re: Parallel speed up
|
#6 |
Guest
Posts: n/a
|
Hi Astrid
Is the computer single or dual processor? Regards, Soren |
|
May 22, 2002, 08:44 |
Re: Parallel speed up
|
#7 |
Guest
Posts: n/a
|
I use TASCflow and CFX-5.5 on a dual-PIII PC. I've noted a speed-up of about 1.4-1.6 in CFX-5 and 1.6-1.8 in TASCflow, depending on the problem size. I have only run local parallel with two partitions.
cfd guy |
|
May 22, 2002, 16:46 |
Re: Parallel speed up
|
#8 |
Guest
Posts: n/a
|
Linux generally seems to do a better job at dynamic process management (i.e., multitasking), so you usually see slightly better speedups there. I've typically seen on the order of 1.4-1.6 on NT, and 1.6-1.8 on Linux, for CFX-5.5.
Neale. |
|
May 22, 2002, 16:49 |
Re: Parallel speed up
|
#9 |
Guest
Posts: n/a
|
Astrid,
Do you mean you ran a 4-process job on 4 PCs and only got a 1.8-2.0 speedup? What problem size were you running? For a 4-process job you would need at least 400,000-600,000 elements to see a decent speedup. Neale |
|
May 23, 2002, 03:11 |
Re: Parallel speed up
|
#10 |
Guest
Posts: n/a
|
Hi
I am curious about these speed-ups. I am running indoor-airflow and HVAC problems with mesh sizes from 400k to 2,000k cells on a Windows NT box with dual P4 processors. The speed-up I am getting is below 1.2. Are you applying something special? Thanks. Regards, Jens |
|
May 23, 2002, 12:22 |
Re: Parallel speed up
|
#11 |
Guest
Posts: n/a
|
Hi Jens,
How much RAM usage do you have? For a 2-million-node problem, I'd be surprised if you were not running into swap space. In that case, you will see the best speedup if you run on multiple systems, at least enough to get the whole problem into RAM and out of swap. Robin |
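A rough sanity check on the swap point (a sketch, using the ~180-200 MB for 120k nodes that Neale quotes below, i.e. about 1.7 kB per node):

$$
M \approx 1.7\ \text{kB/node} \times 2 \times 10^{6}\ \text{nodes} \approx 3.4\ \text{GB},
$$

which is well beyond the 1.2 GB of RAM mentioned below, so the largest of these meshes would indeed be deep into swap on a single machine.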
|
May 23, 2002, 14:43 |
Re: Parallel speed up
|
#12 |
Guest
Posts: n/a
|
Hi
I have benchmarked using an HVAC problem with 600,000 cells. The speed-up was 1.15 on a dual P4 with 1.2 GB of RAM. Any hints? Regards, Jens |
|
May 24, 2002, 11:05 |
Re: Parallel speed up
|
#13 |
Guest
Posts: n/a
|
How were you calculating the speedup? You should use the CFD start and finish times in the output file.
600,000 cells means roughly 120,000 nodes (for a tet grid, I assume), which should only take about 180-200 MB for uvwp-k-eps. So swapping probably isn't an issue. Make sure you do your performance measurements on a "clean" machine, i.e., one that isn't running or doing anything other than the CFD calculation. Neale. |
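For what it's worth, a minimal sketch of the timing arithmetic Neale describes (the timestamp strings and their format here are assumptions; substitute whatever the start and finish lines in your output file actually show):

```python
from datetime import datetime

# Hypothetical start/finish timestamps copied from the solver output file.
# The format string is an assumption; adapt it to your output file.
FMT = "%Y-%m-%d %H:%M:%S"

def wall_clock_seconds(start: str, finish: str) -> float:
    """Elapsed wall-clock time between two timestamps, in seconds."""
    return (datetime.strptime(finish, FMT) - datetime.strptime(start, FMT)).total_seconds()

t_serial = wall_clock_seconds("2002-05-23 09:00:00", "2002-05-23 15:40:00")
t_parallel = wall_clock_seconds("2002-05-23 09:00:00", "2002-05-23 12:30:00")

print(f"speedup = {t_serial / t_parallel:.2f}")  # 1.90 for these example times
```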
|
May 27, 2002, 12:56 |
Architectures Benchmark
|
#14 |
Guest
Posts: n/a
|
Hi Jens,
As this discussion is very interesting, I'd like to propose the following benchmark; it would be very interesting if users could share their speedup figures. I've built a very simple case (a rectangular channel) with approximately 960k cells (hybrid mesh with inflation), and I've run this definition file on a Sun workstation with 4 processors, running Solaris 8. Some data about the case: 3D, turbulent (k-eps), incompressible (air), steady-state flow. Number of cells: almost 948,000.
Run ------ Speedup
Serial -----> 1.00
2 proc. -----> 2.08
3 proc. -----> 3.03
4 proc. -----> 4.02
Why don't you test it on your NT machine? I could send you the journal file so that you could easily obtain the definition file. If anyone else wants the journal file, please feel free to mail me.
PS1: Make sure you're not running any other applications on your machine.
PS2: Rebuilding the journal file on my NT machine, the resulting mesh has 947,916 elements; rebuilding it on my UNIX system, it has 948,161 elements. I believe that is no problem at all for benchmarking purposes.
PS3: I think it's the simplest case you could ever imagine: a simple geometry with no bad angles and no grid interfaces (monoblock). I believe the speedup also depends on some geometric properties.
Kind regards, cfd guy |
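For reading numbers like these, the standard definitions (a worked note, not part of the original post) are speedup and parallel efficiency:

$$
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p},
$$

so the 4-processor run above gives $E_4 = 4.02/4 \approx 1.005$: essentially ideal, and slightly superlinear, which usually points to cache effects once each partition fits better into cache.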
|
May 28, 2002, 14:02 |
Re: Architectures Benchmark
|
#15 |
Guest
Posts: n/a
|
About my previous post: I've tested two coarser grids in comparison with the first one. Here are the results:
Grid 2: 468,500 elements. Speedup: 2 processes = 1.92; 3 processes = 2.74; 4 processes = 3.41.
Grid 3: 109,400 elements. Speedup: 2 processes = 1.49; 3 processes = 2.15; 4 processes = 2.80.
I'm not trying to find the optimal mesh size for this problem, but it seems that, in this case, each processor needs more than about 200k cells to obtain a linear relation between the speedup and the number of processes. cfd guy |
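Applying the efficiency definition from above to the 4-process runs makes the trend explicit (a worked note on the posted numbers):

$$
E_4 = \frac{4.02}{4} \approx 1.00\ (948\text{k cells}), \qquad \frac{3.41}{4} \approx 0.85\ (468.5\text{k}), \qquad \frac{2.80}{4} = 0.70\ (109.4\text{k}),
$$

i.e. efficiency drops as the per-partition work shrinks relative to the fixed communication overhead, consistent with the ~200k-cells-per-process threshold inferred above.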
|
May 30, 2002, 17:38 |
Re: Parallel speed up
|
#16 |
Guest
Posts: n/a
|
Soren,
We used 4 distributed-parallel PCs on a 100BaseT network. Astrid |
|
May 30, 2002, 17:44 |
Re: Parallel speed up
|
#17 |
Guest
Posts: n/a
|
Sorry, I was wrong. I didn't mean to confuse you.
We ran a job with approximately 1.5M elements on 1 standalone PC and on 2 distributed-parallel PCs; the speed-up was then 1.8-2.0. With 4 PCs, the speed-up was about 3.6. Astrid |
|
May 30, 2002, 17:47 |
Re: Architectures Benchmark
|
#18 |
Guest
Posts: n/a
|
cfd guy,
The number of nodes is more relevant to parallel efficiency. Can you post the number of nodes in your mesh rather than elements? Typically, the best efficiency is achieved when the number of nodes per partition is greater than 100k. At less than 20k per partition the trend may reverse, taking longer with added partitions (due to increased communication). Robin |
|
May 31, 2002, 13:26 |
Re: Architectures Benchmark
|
#19 |
Guest
Posts: n/a
|
Actually, it's fine to quote by elements as well; the two are related anyway (roughly 1:1 for hex grids and 5-6:1 for tet/hybrid grids). In fact, the assembly really scales with the number of elements, as the CFX-5 solver uses an element-based assembly.
I'm not surprised by the results, though, as 50,000 vertices per partition translates into roughly 200k elements on a tet/hybrid grid. This is what we see in parallel results as well. Neale. |
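Putting Neale's ratios and Robin's thresholds together, here is a minimal partition-sizing sketch (the ratios and rule-of-thumb thresholds come from the two posts above; everything else, including the example figures, is illustrative only):

```python
# Rough partition-sizing estimate. Elements-to-nodes ratios follow Neale's
# figures (~1:1 for hex, ~5-6:1 for tet/hybrid); the >100k "good" and <20k
# "may slow down" thresholds are Robin's rules of thumb. Real meshes vary.

ELEMENTS_PER_NODE = {"hex": 1.0, "tet": 5.5}  # midpoint of Neale's 5-6:1

def nodes_per_partition(n_elements: int, n_partitions: int, mesh: str = "tet") -> float:
    """Estimate the number of mesh nodes each partition receives."""
    return n_elements / ELEMENTS_PER_NODE[mesh] / n_partitions

# Example: cfd guy's ~948k-cell benchmark mesh.
for n_proc in (2, 3, 4):
    npp = nodes_per_partition(948_000, n_proc)
    verdict = "good" if npp > 100_000 else ("may slow down" if npp < 20_000 else "fair")
    print(f"{n_proc} partitions: ~{npp:,.0f} nodes each -> {verdict}")
```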
|