CFX distributed computing: Cluster design Q

July 10, 2006, 13:18 (#1, Joe)

4 single socket systems vs. 2 dual socket systems ...

I'm interested in others' experiences with distributed MPICH solves over multiple single socket (i.e. 1 CPU) vs. multiple dual socket (i.e. 2 CPU) systems.

Let's assume the following idealised situation: 4 identical CPUs, say Opteron 285s. Identical amounts of RAM per CPU, say 1 GB. Identical motherboards, say one of the Tyan dual socket Opteron motherboards.

"Single socket used" configuration: 4 separate boxes with: 1 CPU per motherboard (odd I know ... humour me), 1 GB RAM per motherboard. Boxes linked via GigE with a switch with decent backplane bandwidth.

"Dual sockets used" configuration: 2 separate boxes with: 2 CPUs per motherboard, 2 GB RAM per motherboard. Boxes linked via GigE with a switch with decent backplane bandwidth.

Let's say I would use the distributed MPICH CFX solver option.

Let's assume I'm modelling a long pipe whose mesh is partitioned into 4 partitions longitudinally.

Q1: Would there be any difference in solving time between these two cluster configurations?

Q2: Is CFX coded such that physically adjacent mesh partitions are placed together on unified memory machines (i.e. 2 CPUs on a 2 socket motherboard)? In other words, would the first two mesh partitions be placed together on the first 2 CPU box and the last two on the second 2 CPU box, or are the partitions randomly assigned to CPUs in a distributed cluster?

I would have guessed that the introduction of an Ethernet link anywhere in the system would negate the theoretical advantages of using two dual CPU boxes?

I ask this because typically single socket CPUs and motherboards are a good deal cheaper than dual socket motherboards / CPUs.

Any thoughts / experiences?
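
A small sketch of why Q2 matters for the long-pipe case: with a longitudinal split, each partition only touches its immediate neighbours, so one can count how many partition interfaces would have to cross the Ethernet link under each layout. The box names, the three placements, and the function `cross_host_interfaces` are illustrative assumptions, not anything CFX itself reports.

```python
def cross_host_interfaces(placement):
    """Count neighbouring-partition interfaces whose two sides sit on different hosts.

    placement[i] is the host holding partition i+1 of a pipe cut
    longitudinally, so partition i only shares an interface with
    partitions i-1 and i+1.
    """
    return sum(1 for a, b in zip(placement, placement[1:]) if a != b)


# Hypothetical layouts for the 4-partition pipe.
layouts = {
    "4 single socket boxes": ["box1", "box2", "box3", "box4"],
    "2 dual socket boxes, neighbours together": ["box1", "box1", "box2", "box2"],
    "2 dual socket boxes, interleaved": ["box1", "box2", "box1", "box2"],
}

for name, placement in layouts.items():
    print(f"{name}: {cross_host_interfaces(placement)} interface(s) over Ethernet")
# 4 single socket boxes: 3 interface(s) over Ethernet
# 2 dual socket boxes, neighbours together: 1 interface(s) over Ethernet
# 2 dual socket boxes, interleaved: 3 interface(s) over Ethernet
```

Under these assumptions the dual socket layout only reduces Ethernet traffic if neighbouring partitions really do end up on the same box, which is what Q2 asks.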

July 10, 2006, 16:15 (#2, Charles)

I get the impression that CFX is not that badly affected by the "slow" Ethernet interconnect. The pressure-based coupled solver does a lot of work per iteration, so computing time lost while exchanging information between grid partitions does not seem to be too much of an issue. By contrast, Fluent is very sensitive to the latency of the interconnect used, and I must assume that this is due to a much smaller amount of work being done per iteration with the segregated solver, and hence more frequent exchange of information. To put it differently, on a 21 x dual CPU cluster with Gb Ethernet interconnect, I've found that it is easy to get say 95% average usage per CPU for CFX, but difficult to get more than 70% usage for Fluent.
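
The argument above can be put as a toy model: if every iteration does some compute work and then waits a fixed time for the partition exchange, the share of time the CPU is busy rises with the work per iteration. This is only a sketch, assuming communication and computation do not overlap; the function name and the timings are made up for illustration and are not measurements from CFX or Fluent.

```python
def cpu_utilisation(compute_s, exchange_s):
    """Fraction of wall time spent computing, assuming each iteration does
    compute_s seconds of work and then waits exchange_s seconds on the
    interconnect (no overlap of the two)."""
    if compute_s <= 0 or exchange_s < 0:
        raise ValueError("compute_s must be positive and exchange_s non-negative")
    return compute_s / (compute_s + exchange_s)


EXCHANGE_S = 0.05  # assumed fixed wait per iteration, illustrative only

# A "heavy" iteration (lots of work between exchanges) vs a "light" one.
print(f"heavy iteration: {cpu_utilisation(1.0, EXCHANGE_S):.0%}")  # heavy iteration: 95%
print(f"light iteration: {cpu_utilisation(0.1, EXCHANGE_S):.0%}")  # light iteration: 67%
```

With the same interconnect delay, the solver that does ten times less work between exchanges loses a much larger share of its time waiting, which is the pattern described for the segregated solver.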

July 10, 2006, 17:22 (#3, Joe)

Fascinating. So on a 42 CPU cluster you're getting virtually linear scale-up? Damn, that's not to be sneezed at.

Are you using commodity GigE switches?

PS: I'm amazed that Fluent is still on the SIMPLE solver.

July 10, 2006, 17:41 (#4, Joe)

Two more Q if I may:

What OS are you using on the cluster? 64-bit Linux? Which distributed CFX solver type do you use?

July 10, 2006, 18:35 (#5, Charles)

No, not quite linear scaling, but still pretty useful CPU usage. The point is that the Fluent CPU usage would drop off much quicker, but it is strongly problem size dependent. You can fiddle with clusters forever without doing any real work, or you can just get on with it. So, instead of putting time into doing scaling graphs for various model sizes, I've found it useful to monitor CPU usage; it seems to be a good way of checking that you're not wasting resources.
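
One way such a CPU usage check could be done on a Linux node is to take two snapshots of the aggregate `cpu` line in `/proc/stat` and compare them. This is a hedged sketch and not the method used on the cluster described here; `busy_fraction` is a hypothetical helper and the two sample lines are invented values.

```python
def busy_fraction(before, after):
    """Fraction of CPU time spent non-idle between two aggregate 'cpu'
    lines taken from /proc/stat at different times."""

    def idle_and_total(line):
        # Fields after the 'cpu' label: user nice system idle [iowait irq softirq steal ...]
        # Only the first 8 are summed; later guest fields are already included in user/nice.
        fields = [int(x) for x in line.split()[1:9]]
        idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
        return idle, sum(fields)

    idle0, total0 = idle_and_total(before)
    idle1, total1 = idle_and_total(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        raise ValueError("second sample must be taken after the first")
    return 1.0 - (idle1 - idle0) / elapsed


# Invented snapshots, e.g. the first line of /proc/stat read a few seconds apart.
before = "cpu 100 0 50 850 0 0 0 0"
after = "cpu 290 0 60 860 0 0 0 0"
print(f"busy: {busy_fraction(before, after):.1%}")  # busy: 95.2%
```

Run on every node during a solve, a figure well below the others would point at a partition that spends its time waiting rather than computing.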

As I understand it, Fluent now have a pressure-based coupled solver which is claimed to need far fewer iterations, so a guess would be that it would scale better with commodity hardware.

July 11, 2006, 09:39 (#6, Mike)

I agree with Charles that an Ethernet connection doesn't really slow things down much. An exception to this is if you start getting up to 4 sockets per machine; then there's just not enough bandwidth for 4 partitions per connection.

Two dual socket machines versus four single socket machines should be very close in speed. I expect there's a small theoretical difference somewhere (reduced memory latency, HyperTransport, ... not sure to be honest), but any difference should be hardly noticeable.

I believe the partitions are assigned in the order that the hosts are given when starting the run. So if you did:

    cfx5solve -def file.def -par-dist 'host1*2,host2*2'

then partitions 1 and 2 would go to host 1 and partitions 3 and 4 would go to host 2. I don't think there's any guarantee that partition 1 is adjacent to partition 2, but if you look at "Real Partition Number" in CFX-Post for a few cases then I think this does tend to be the case.

Mike
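
The assignment rule as Mike describes it (partitions handed out in the order the hosts are listed) can be sketched as follows. This only mirrors the belief stated in the post and is not taken from the CFX solver itself; `assign_partitions` is a hypothetical helper, and treating an entry without `*count` as a single partition is an assumption of this sketch.

```python
def assign_partitions(host_list):
    """Map partition numbers (starting at 1) to hosts, filling each host
    in the order it appears in a 'name*count,name*count' string."""
    assignment = {}
    partition = 1
    for entry in host_list.split(","):
        name, _sep, count = entry.strip().partition("*")
        # Assumption of this sketch: a bare host name means one partition.
        for _ in range(int(count) if count else 1):
            assignment[partition] = name
            partition += 1
    return assignment


print(assign_partitions("host1*2,host2*2"))
# {1: 'host1', 2: 'host1', 3: 'host2', 4: 'host2'}
```

Whether partitions 1 and 2 are also physical neighbours in the mesh is a separate question that depends on the partitioner, as noted in the post.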

July 11, 2006, 09:53 (#7, Joe)

Thanks for the partition info... very interesting. I'll test that once the cluster is up and running.

What OS are you running your clusters on? Which CFX solver type?

July 11, 2006, 11:21 (#8, Mike)

Running on Red Hat Linux 9 with the 32-bit solver.

Mike

