star 4.06 memory on linux cluster

johnmck · April 20, 2009, 12:27

I'm trying to run Star 4.06 on a Linux cluster with PBS, on a 900,000-cell model of incompressible transient flow. Each node of the cluster has two quad-core processors and 8 GB of shared memory. The model is partitioned using METIS.

Each processor is an Intel(R) Xeon(R) CPU E5430 @ 2.66GHz

uname -a gives: Linux 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 14:14:47 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

The compiler is Absoft 9.0 EP 64 bit.

My question is this:

If I use 8 processes on each of 2 nodes, i.e. 16 processes in total, each process takes 860 MB of virtual memory (mostly data and stack).
If I use 8 processes on each of 4 nodes, i.e. 32 processes in total, each process takes 820 MB.

Why doesn't each process use proportionally less memory when I use more processes? I would have expected the total memory used to stay almost constant. As it stands, I can't use many more cells before the nodes run out of memory and start to swap.

Any advice appreciated. My apologies if I've missed something obvious like an option.
Regards
John Mck.
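The arithmetic behind the question can be checked with a short sketch. It uses only the figures quoted in the post; the function names are made up for illustration, and the per-process figure is virtual memory, so resident usage per node may be lower.

```python
# Figures from the post: 8 GB per node, 8 processes per node.
NODE_RAM_MB = 8 * 1024
PROCS_PER_NODE = 8

def node_usage_mb(mb_per_process, procs_per_node=PROCS_PER_NODE):
    """Memory requested on one node if every process uses mb_per_process."""
    return mb_per_process * procs_per_node

def job_usage_mb(mb_per_process, nodes, procs_per_node=PROCS_PER_NODE):
    """Memory requested by the whole job across all nodes."""
    return node_usage_mb(mb_per_process, procs_per_node) * nodes

for nodes, mb in [(2, 860), (4, 820)]:
    per_node = node_usage_mb(mb)
    total = job_usage_mb(mb, nodes)
    print(f"{nodes} nodes: {per_node} MB of {NODE_RAM_MB} MB per node, "
          f"{total} MB for the whole job")
```

Doubling the process count from 16 to 32 nearly doubles the total for the job instead of leaving it constant, which is the behaviour being asked about.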

vishyaroon · April 20, 2009, 12:49

The memory used depends on more than the problem size. As you use more processors, the communication overhead between them increases, so memory usage per process will not decrease linearly. At some point, using more processors may even result in slower performance because of the communication overhead.

johnmck · April 20, 2009, 16:17

Yes, I'd agree that eventually there is a trade-off, when using more nodes adds more communication overhead than the computational benefit they bring.

But I didn't think I'd reached that point yet. I'm finding that I can't run 1,000,000 cells on two nodes, each with 8 GB and 8 cores. Other threads indicate that I should be able to do this on a single processor with 2 GB of memory.

Any ideas?

Regards
John

vishyaroon · April 20, 2009, 17:25

That was my initial thought too. I use similar machines: my Linux machine has the same capabilities as yours, except for a different Linux version. And I frequently run meshes of about 1 million cells on 1 processor.

f-w · April 20, 2009, 19:15

johnmck,

Just out of curiosity, have you benchmarked your quad-cores? I was advised to go with dual-cores instead of quad-cores because of the inherent performance loss when using all 4 cores (which I confirmed on my head-node with Star-CCM+). What is your "speedup" going from 7 to 8 cores on one of your nodes?

Thanks,
f-w

olesen · April 21, 2009, 03:48

Quote:

Originally Posted by f-w

I was advised to go with dual-cores instead of quad-cores because of the inherent performance loss when using all 4 cores

I don't think the issue is dual vs. quad core per se, but rather the bottleneck in accessing the memory. We have several dual-CPU, quad-core machines in our cluster and found that using a single process per CPU gave us about 30-35% better performance than using all of the cores (no swapping occurred). In _our_ test case, the memory bottleneck was worse than the network overhead incurred by spreading the job over more machines. As always, do not trust anybody's benchmark, but benchmark with your own problems.

With the changes in memory access in the Nehalem CPUs, the impact of the memory bottleneck should become less significant in the future ... it might even be better in the current generation of AMD CPUs.

johnmck · April 21, 2009, 08:58

I ran some more tests (mesh 96x99x96 = 912384 cells), and yes, we are reaching a trade-off:

Nodes x processes per node = total    Memory per process (MB)
1 x 1 = 1 (i.e. serial)               1870
1 x 2 = 2                             1340
1 x 4 = 4                             1060
1 x 8 = 8                              940
2 x 4 = 8                              930
2 x 8 = 16                             860
4 x 8 = 32                             820

For our work using 32 licences, the memory per process doesn't fall much below half the serial memory requirement. So for big jobs we'll have to only partly use nodes, in order to get enough memory on them.

The memory overhead due to parallel working seems surprisingly high, to me at least.

Many Thanks
Regards
John mck
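One way to read the figures above is to fit them to a simple model, memory per process = fixed + divisible / processes. This is only a sketch: the model form and the function name are assumptions for illustration, not anything Star reports, and the fit uses nothing but the seven measurements quoted in the post.

```python
# (total processes, MB per process) as reported in the post above
measurements = [(1, 1870), (2, 1340), (4, 1060), (8, 940),
                (8, 930), (16, 860), (32, 820)]

def fit_fixed_plus_share(data):
    """Least-squares fit of mem(p) = fixed + divisible / p.

    Ordinary linear regression of memory against x = 1/p.
    Returns (fixed, divisible), both in MB.
    """
    xs = [1.0 / p for p, _ in data]
    ys = [float(m) for _, m in data]
    n = len(data)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    divisible = sxy / sxx
    fixed = mean_y - divisible * mean_x
    return fixed, divisible

fixed, divisible = fit_fixed_plus_share(measurements)
print(f"fixed per process: {fixed:.0f} MB, divisible part: {divisible:.0f} MB")
for p, measured in measurements:
    model = fixed + divisible / p
    print(f"{p:3d} processes: measured {measured} MB, model {model:.0f} MB")
```

On these seven points the fixed part comes out close to 800 MB per process, which matches the observation that memory per process does not fall much below half the serial requirement. The fit says nothing about where that fixed part comes from.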

TMG · April 22, 2009, 14:02

Your model is too small to make your conclusion valid. By 32 cores you only have about 28,500 cells on each core (that's a very small number). At that size the overhead of all the "halo" cells (the cells that exist at boundaries between two domains) is just not going to decrease any further. If you run a much larger model (say, an order of magnitude larger), you will see the memory effect you are looking for.
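The halo argument can be illustrated with a rough geometric estimate. The sketch below assumes each partition is a cube of cells wrapped in a single layer of halo cells on all sides; real METIS partitions are irregular and the actual halo depth in Star is not stated in this thread, so the numbers are indicative only.

```python
def halo_fraction(cells_per_partition, layers=1):
    """Halo cells as a fraction of owned cells for an idealised cubic partition.

    Assumes a cube of side s = cells**(1/3) surrounded by `layers`
    layers of halo cells on every face (an assumption for illustration).
    """
    side = cells_per_partition ** (1.0 / 3.0)
    with_halo = (side + 2 * layers) ** 3
    return (with_halo - cells_per_partition) / cells_per_partition

total_cells = 912384  # mesh size quoted earlier in the thread
for processes in (8, 16, 32):
    per_part = total_cells / processes
    print(f"{processes:2d} partitions: {per_part:8.0f} cells each, "
          f"halo about {100 * halo_fraction(per_part):.0f}% of owned cells")

# The same 32-way split on a mesh ten times larger
bigger = 10 * total_cells / 32
print(f"10x mesh, 32 partitions: halo about {100 * halo_fraction(bigger):.0f}%")
```

Under these assumptions the halo share grows as partitions shrink and drops again when the mesh is made larger, which is the trend described in the post.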


