Parallel runs slower with MTU=9000 than MTU=1500 |
|
October 28, 2007, 23:30 |
Parallel runs slower with MTU=9000 than MTU=1500
#1 |
Guest
Posts: n/a
Hi,
I have been trying to build a small cluster with two dual-core Pentium D PCs. I've installed SUSE SLES 10, and the NIC cards are Gigabit. After two weeks of struggling with the network configuration, I'm finally able to run some benchmarks. The problem is that with the Jumbo Frames option enabled on the NIC cards (MTU=9000), my test case runs slower than with the standard option (MTU=1500). I've also noticed that the MTU=9000 option doesn't use as much CPU as the standard option. Does anyone have experience with this? Any comments would be helpful. I need to sort this out to request extra funds for my research project and build a bigger Beowulf cluster.

---- REFERENCE INFO ----
Case: 464,000 hex cells, 3D, PBNS, RNG k-e, multiphase mixture model (2 phases), multiple reference frames, unsteady.

MTU=9000
Performance Timer for 40 iterations on 4 compute nodes
  Average wall-clock time per iteration:    13.969 sec
  Global reductions per iteration:             223 ops
  Global reductions time per iteration:      0.000 sec (0.0%)
  Message count per iteration:                 854 messages
  Data transfer per iteration:              30.742 MB
  LE solves per iteration:                       7 solves
  LE wall-clock time per iteration:          5.445 sec
  LE global solves per iteration:                2 solves
  LE global wall-clock time per iteration:   0.085 sec (0.6%)
  AMG cycles per iteration:                      8 cycles
  Relaxation sweeps per iteration:             316 sweeps
  Relaxation exchanges per iteration:           76 exchanges
  Time-step updates per iteration:            0.05 updates
  Time-step wall-clock time per iteration:   0.015 sec (0.1%)
  Total wall-clock time:                   558.759 sec
  Total CPU time:                         1477.740 sec

MTU=1500
Performance Timer for 40 iterations on 4 compute nodes
  Average wall-clock time per iteration:     7.700 sec
  Global reductions per iteration:             223 ops
  Global reductions time per iteration:      0.000 sec (0.0%)
  Message count per iteration:                 854 messages
  Data transfer per iteration:              30.742 MB
  LE solves per iteration:                       7 solves
  LE wall-clock time per iteration:          0.605 sec (7.9%)
  LE global solves per iteration:                2 solves
  LE global wall-clock time per iteration:   0.003 sec (0.0%)
  AMG cycles per iteration:                      8 cycles
  Relaxation sweeps per iteration:             316 sweeps
  Relaxation exchanges per iteration:           76 exchanges
  Time-step updates per iteration:            0.05 updates
  Time-step wall-clock time per iteration:   0.016 sec (0.2%)
  Total wall-clock time:                   308.003 sec
  Total CPU time:                          949.780 sec

Cheers,
Javier
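For reference, this is roughly how jumbo frames can be enabled and verified on a Linux/SLES box. It is only a sketch: eth0, node2, and the exact ifcfg file name are assumptions, so adjust them to your own setup.

  # Temporarily raise the MTU on the cluster interface (eth0 assumed here)
  ifconfig eth0 mtu 9000

  # To make it persistent on SUSE, add MTU='9000' to the interface config,
  # e.g. /etc/sysconfig/network/ifcfg-eth-id-<mac>  (exact file name varies)

  # Verify that 9000-byte frames really pass end to end without fragmentation;
  # 8972 = 9000 - 20 (IP header) - 8 (ICMP header)
  ping -M do -s 8972 node2

If the ping fails while both hosts are set to MTU=9000, the switch (or one of the NICs) is not passing jumbo frames, and traffic ends up fragmented or dropped, which could explain why the MTU=9000 runs come out slower than plain MTU=1500.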