|
August 22, 2005, 14:50 |
Hi, I just test the parallel p
|
#1 |
New Member
Ho Hsing
Join Date: Mar 2009
Posts: 13
Rep Power: 17 |
Hi, I just tested the parallel performance of the solver, icoFoam, on a cluster. In single-CPU mode it takes about 31 hours, while the 2-CPU mode takes 26 hours, so the efficiency is around 60%. Is that reasonable?
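For reference, that 60% follows from the usual definition of parallel efficiency (serial time divided by the number of processors times the parallel time):

E = \frac{T_1}{N \, T_N} = \frac{31\ \text{h}}{2 \times 26\ \text{h}} \approx 0.60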
|
|
August 22, 2005, 14:54 |
It depends on the speed of the
|
#2 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
It depends on the speed of the interconnect, the size of the case and the parallel comms settings you have specified in .OpenFoam-1.1/controlDict.
|
|
August 22, 2005, 15:11 |
The size of the case is quite
|
#3 |
New Member
Ho Hsing
Join Date: Mar 2009
Posts: 13
Rep Power: 17 |
The size of the case is quite large: there are
262392 cells in the computational domain. I have no idea how to edit the file .OpenFoam-1.1/controlDict; actually, I have not changed it at all. It takes the form:

InfoSwitches
{
    writeJobInfo 0;
    FoamXwriteComments 1;
}

OptimisationSwitches
{
    fileModificationSkew 10;
    scheduledTransfer 1;
    floatTransfer 0;
    nProcsSimpleSum 16;
} |
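For readers wondering what those comms-related switches control, here is the same OptimisationSwitches block annotated with my understanding of the OpenFOAM-1.x settings; the comments are my interpretation rather than quoted documentation:

OptimisationSwitches
{
    // allowed skew (in seconds) of file modification times when
    // checking whether runtime-modifiable dictionaries have changed
    fileModificationSkew 10;

    // 1 = use scheduled inter-processor transfers rather than the
    // default blocking transfers
    scheduledTransfer 1;

    // 1 = transfer field data as single-precision floats to halve the
    // communication volume; 0 = keep full double precision
    floatTransfer 0;

    // number of processors up to which a simple linear sum is used for
    // global reductions (a hierarchical sum is used above this)
    nProcsSimpleSum 16;
}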
|
August 22, 2005, 15:14 |
and the speed of the interconn
|
#4 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
and the speed of the interconnect?
|
|
August 22, 2005, 16:37 |
Every node of my cluster is du
|
#5 |
New Member
Ho Hsing
Join Date: Mar 2009
Posts: 13
Rep Power: 17 |
Every node of my cluster is a dual-CPU system, so the two-CPU run was actually running inside a single node. The interconnection between the machine nodes is 1 Gbyte.
|
|
August 22, 2005, 16:44 |
I assume from your results tha
|
#6 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
I assume from your results that the two CPUs are sharing the memory bus in each of your nodes and you are only getting 60% efficiency because the memory bus is saturated. Try running the case between two nodes.
|
|
August 22, 2005, 16:52 |
Thanks Henry,
I will try. And
|
#7 |
New Member
Ho Hsing
Join Date: Mar 2009
Posts: 13
Rep Power: 17 |
Thanks Henry,
I will try. And there is a typo in my previous post: the interconnection is 1 Gbit. |
|
August 24, 2005, 16:25 |
I have run the code in two CPU
|
#8 |
New Member
Ho Hsing
Join Date: Mar 2009
Posts: 13
Rep Power: 17 |
I have run the code on two CPUs, but on two different nodes. Now the efficiency seems to be higher than 100%! That means the bottleneck is the bus speed in my cluster and I'd better upgrade the motherboards?
BTW, the running time given by icoFoam is CPU time, instead of wall-clock time, right? |
|
August 24, 2005, 16:47 |
Recent multi-CPU motherboards
|
#9 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
Recent multi-CPU motherboards like the Tyan dual and quad Opteron boards (and I am guessing the recent Xeon boards as well) have a separate memory bus for each CPU. The AMD-based boards have HyperTransport buses between the CPUs as well, but I don't know if there is an equivalent for Xeon processors. This arrangement is far preferable to the old shared-memory multi-CPU machines, because CPU speeds outstrip memory access, which means that memory-access-intensive codes like CFD become memory-access limited unless each CPU has its own memory.
All the OpenFOAM applications print CPU time, but you can easily add a print for the wall-clock time using the clockTime() member function in the same way as cpuTime() is used. |
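A minimal sketch of such an addition (assuming the clockTime class and its elapsedTime() member; the exact names may differ slightly between OpenFOAM versions):

#include "clockTime.H"

// constructed once before the time loop, alongside the existing timing
clockTime wallClock;

// inside the time loop, next to the existing ExecutionTime output
Info<< "ExecutionTime = " << runTime.elapsedCpuTime() << " s"
    << "  ClockTime = " << wallClock.elapsedTime() << " s"
    << nl << endl;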
|
August 24, 2005, 16:57 |
>(and I am guessing the recent
|
#10 |
Senior Member
Michael Prinkey
Join Date: Mar 2009
Location: Pittsburgh PA
Posts: 363
Rep Power: 25 |
>(and I am guessing the recent Xeon boards as well)
I am pretty sure this is not correct. The Nocona Xeon dual CPU motherboards still use a shared memory bus. Based on our experience, these systems are not a good target platform for the current incarnation of OpenFOAM. |
|
August 24, 2005, 17:03 |
Current CPU performance far ou
|
#11 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
Current CPU performance far outstrips memory-access performance, and it doesn't look like this situation will improve anytime soon, which means that all codes that rely on rapid memory access to large amounts of data (that is, all CFD codes, not just the current incarnation of OpenFOAM) will benefit from each CPU having its own memory bus.
|
|
August 25, 2005, 10:02 |
Does this mean that the Dual-C
|
#12 |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Does this mean that the dual-core Opterons are not good for CFD computations? If I interpret the block diagrams correctly, both cores share the same memory interface (leading to a problem similar to the Xeon motherboards discussed above).
Does anyone have experience with OpenFOAM on dual-cores?
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request |
|
August 25, 2005, 10:08 |
I would expect dual-core CPUs
|
#13 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
I would expect dual-core CPUs to suffer from the same problem because they share the same memory bus.
|
|
August 29, 2005, 04:54 |
At dual-Opterons (Athlon X2) e
|
#14 |
Guest
Posts: n/a
|
On dual Opterons (Athlon X2) each CPU has its own RAM channel; the dual DDR-RAM bus is divided into a single channel for each CPU. The performance of one such CPU (Socket 939/940) decreases by approx. 8% compared to a Socket 754 CPU.
|
|
August 29, 2005, 05:25 |
I don't have such a CPU - the
|
#15 |
Guest
Posts: n/a
|
I don't have such a CPU - the 8% above is just the difference for a Socket 939 CPU with/without dual DDR-RAM! There is a crossbar switch between the CPU and the RAM, which should act in that way. Graphics, hard disk and Ethernet use one (Athlon) to three (Opteron) HT links with 3.2 GB/s each.
At Tomshardware.de the difference between two single Opterons and one dual-core Opteron is negligible. But they didn't use CFD for the comparisons! |
|
August 30, 2005, 15:30 |
Thanks for all of you guys' i
|
#16 |
New Member
Ho Hsing
Join Date: Mar 2009
Posts: 13
Rep Power: 17 |
Thanks for all of your ideas about the parallel performance.
Now I have a question about the CPU time. The CPU time provided by the elapsedCpuTime() function counts only the main node's CPU time, instead of all of the parallel nodes', right? Another question is why every machine only uses a portion of the CPU resource, as I am quite sure nobody else is using the cluster. Here is the output of my top command:

PID   USER  PRI NI SIZE  RSS SHARE STAT %CPU %MEM TIME  CPU COMMAND
9397  hsing  25  0 18996 18M  8400 R    69.9  0.4 1000m   0 hsingFlow |
|
August 30, 2005, 15:38 |
Each node calculates it's own
|
#17 |
Senior Member
Join Date: Mar 2009
Posts: 854
Rep Power: 22 |
Each node calculates its own CPU time, but only the master writes to the log via the Info statement. If you want to see the CPU time for all the nodes, replace Info with Sout or Serr.
Only a fraction of the CPU is being used because the rest of the time it is probably waiting for data communication between the nodes. |
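For example, a minimal sketch of the change (Sout writes from every processor without a prefix, so the lines from the different nodes will appear interleaved in the output):

// master-only output, as the solvers do by default
Info<< "ExecutionTime = " << runTime.elapsedCpuTime() << " s" << nl << endl;

// per-processor output: every node prints its own CPU time
Sout<< "ExecutionTime = " << runTime.elapsedCpuTime() << " s" << nl << endl;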
|
|