Scale-Up Study in Parallel Processing with OpenFoam

sahm · April 8, 2010, 17:12

Hello everyone, Whats the plan?

Recently I have done a case with different numbers of processors to check the parallel processing performance in OpenFoam, But I got a strange result.
I made a simple case and used Scotch method, which is kinda acting like Metis method. Then I used the cluseter in our lab (CMTL in UIC) to solve these cases. The cluster has 4 nodes with 8 processors on each node.
The case is a 3d Backward facing step with 748K cells and Re=500, and the solver is rhoPisoFoam.
I used 2,4,6,8,16,24,32 processors at a time to solve same Case, but after running I saw the 32-processor case was slower than 16-processor case. The attached file shows the speed of solutions ( Time step/Minute) versus number of processors
Can Any one tell me what I did wrong, or if he /she had this problem before?
Also Can anyone tell me about that Time openFoam is reporting in log file?

andersking · April 9, 2010, 01:57

Hi Sahm,

Your case is too small to stress your cluster, for 32 procs you only have 23k cells per proc. Your cluster is probably spending all its time communicating. Try a larger mesh (or faster interconnect).

Cheers
Andrew

sahm · April 14, 2010, 17:43

I changed decomposition method to Metis method, and there was an increase to my solution speeds, The attached file shows the speed-up.

Canesin · April 19, 2010, 09:46

Can you upload it in pdf or open-document format ?? Running linux, so no xlsx for me =(

sahm · April 20, 2010, 16:22

Sorry for that,
this file is for 2 cases. one is backstep flow, with Solver rhopisoFoam, and 750k and 1.5m cells, the other one, is for cavity case with 1M cells, any comment about this is appreciated.

sahm · April 22, 2010, 15:59

Is any body Going to Help? Is any body not Going to Help?

sjrees · April 23, 2010, 05:50

What is your question??

I would say you should get better scalability than this but you will have to say more about the system - cores per node, interconnect type etc.

sahm · April 23, 2010, 18:08

Ok,
The system I`m working with is a cluster with 4 processing Node, and 1 head node, each node has 8 processors, with Infiniband connection between nodes. I dont know about ram of each node. But I didn't have any problem with that.
The problem with this scale up study is that when I increased the number of Processors, the speed of calculation is decreased. I also tried this case with 3M cells, But still this problem exists.
Do you have any Idea why this is happening and how I can make it better?

wyldckat · April 23, 2010, 19:30

Greetings SAHM,

It seems you have an "isolate and conquer" problem! 4 machines, 8 cores each, means that you should first test in a ramification method.

In other words:

Do a battery of test on a single node, ranging from 1 to 8 cores. Analyse results - if this doesn't scale well, then you've got a problem on how the case is divided along the cores.
If it scaled well with a single node, do another battery of tests with only two nodes, but this time test 2 to 16 nodes, in equally distributing the scale up... in other words, it's the first test, by upping both nodes at the same time per test.
Repeat this battery of tests with the other two nodes, just to make sure if the nodes and interconnects are purely symmetrical and well balanced. If the times aren't nearly identical, then you've got a bottleneck somewhere
If the pairs are properly balanced, then do yet another battery of tests, this time using 3 nodes, ranging 3 to 24 processors, always scaling up 1 core per node per test.
Again, you should run again this battery of tests, by swapping one of the used nodes by the one left out.
And finally, a battery of tests, using all 4 nodes, doing tests from 4 to 32 cores, always scaling 1 core in all nodes.

Of course, if you don't feel like running all of these batteries of tests, then just do the fourth one!

I suggest these tests, because by what I've seen from the first graph, it feels there is somewhat of an inertia like problem with the machines! In other words, by analysing the first graph, it seems that:

1-2-4 processors - single node
6-8 -> 3-4 nodes - indicates a lag made by the interconnect, because the nodes have to be in sync
8-16-32 -> always 4 nodes - and this part of the graph is nearly linear!

Now, if this analysis reflects what really happened, then there is nothing wrong with the results! If you always want to use the 4 nodes/32 processors, then you should have compared always using 4 nodes! And/or compare:

4 cores: 1 node*4 cores/node vs 2 nodes*2 cores/node vs 4 nodes*1 cores/node
8 cores: 1 node*8 cores/node vs 2 nodes*4 cores/node vs 4 nodes*2 cores/node
16 cores: 2 nodes*8 cores/node vs 4 nodes*4 cores/node

This way you can account for the intrusive behaviour the interconnect has on your scalability!

Additionally, don't forget that your infinyband interconnect might have options to configure for jumbo packets (useful for gigantic data transfer between nodes, which would be the case with 50-100 million cells

) or very low latency (useful for constant communication between nodes).

Another thing that could hinder the scalability is how frequently do you save a case snapshot. In other words, does this case store:

every iteration of the run;
only the last iteration;
or only needs to read at the start and doesn't save anything during execution, not even at the end.

Actually, if it saves all iterations, that would explain that flat line of the 4-6-8 processors!

Sooo... to conclude... be careful what you wish for! You just might get it

... or so I would think so :P

Edit ----------------------------
I didn't see the other three graphs before I posted. But my train of thought still applies. And I should add another thing to the test list:

pick the biggest case, compare only using the 4 nodes, by upping 1 core in all nodes at same time (4-8-12-16-20-24-28-32), and always run at least 10 times each test! Do an average and deviation analysis.

For this final test, many outcomes may apply:

Scenario 1: average is linear and deviation is small.
Conclusion: what the hell Wasn't working before and now is working? This shouldn't have happened....
Scenario 2: average is linear and deviation is large.
Conclusion: you've got a serious bottleneck somewhere... could be storage, interconnect, network interference, or someone left something running in the nodes, like a file indexing daemon...
Scenario 3: average is not linear and deviation is small.
Conclusion: the problem is reproducible. This means that hunting down the real problem will be easier this way, because when you change something, it will be directly reflected on the results!

And yet another thing comes to mind: Do the CPUs in the nodes have the "turbo boost" feature that some modern CPUs have? For example, when the CPU uses 8 cores, runs at 3GHz; but with 4 cores, runs at 3.3GHz; and with 2 cores runs at 3.6GHz? And with a single core, runs at 4GHz! This would seriously affect results...

Best regards,
Bruno

sahm · April 26, 2010, 16:12

Thanks Bruno. I forgot to erase the first sheet since that case was not defined well, but the other cases have the same problem. About your comments to run them with different types of parallelism, I have a question.

Actually I don't know, how to assign a certain sub-domain (of a decomposed domain) into a certain processor. I mean to run your tests, I need to define a specific processor for a specific sub-domain, or at least I should define how many cores of a node I`m going to use. For example I have to define how to use 4 processors, 2 cores/node and 2 nodes. Can you tell me how to do this?

Since this cluster is shared between people in the lab, I need to ask for their permission when I'm going to use more than 8 cores. So running your case might take a long time. Besides, we use a software that enqueues the jobs in my cluster, its called Lava, and I should use it to define my jobs for the cluster, otherwise, other people might not see if I'm running something. Can you tell me how to define a job on certain cores on different nodes, and I'd appreciate it if you tell me how to do it with that Lava, if you know this software.

I have an idea that I would like to discuss with you. I think if I define sub-domains for specific cores, I can make the interconnections between nodes minimum. I mean If speed of connection between cores of a node is faster than the connection of nodes, assigning neighbor sub-domains into cores of a single node might help since this reduces the data transferred between nodes ( through a slower connection). I would like to know your idea about this concept.

Thanks again for your comment.

wyldckat · April 26, 2010, 18:37

Greetings SAHM,

Quote:

Originally Posted by sahm

For example I have to define how to use 4 processors, 2 cores/node and 2 nodes. Can you tell me how to do this?

Well, I know that with OpenMPI and MPICH2 you can define which machines to use and how many cores per machine. After a little searching, I found a tutorial (--> MPI Tutorial <--) that has an example of a machine file:

Code:

cat machinefile.morab 
morab001
morab001
morab002
morab002

For OpenFOAM, if I'm not mistaken, you can create a file named "machines" or "hostfile" with a list of machines (by IP or name) you wish to use; then save that file either in the case folder or in the "system" folder of the case. Then use OpenFOAM's foamJob script for launching the case. For example, to launch icoFoam in parallel and with output to screen:

Code:

foamJob -p -s icoFoam

And it will create a log file named "log".
So, for more information about the MPI application you are using, you should consult its manual

You could also tweak the foamJob script (run "which foamJob" to know where it is

) to better suit your needs!

Quote:

Originally Posted by sahm

Besides, we use a software that enqueues the jobs in my cluster, its called Lava, and I should use it to define my jobs for the cluster, otherwise, other people might not see if I'm running something. Can you tell me how to define a job on certain cores on different nodes, and I'd appreciate it if you tell me how to do it with that Lava, if you know this software.

Quick answer: check with your cluster/system's administrator or the Lava manual

Not-so-quick answer: See --> this <-- short tutorial and the links it points to. But by what I can briefly see, it seems that it can operate as a wrapper for mpirun. So, my guess is that you could do a symbolic link of mpirun to Lava's mpirun and use foamJob as it is now!

Quote:

Originally Posted by sahm

I have an idea that I would like to discuss with you. I think if I define sub-domains for specific cores, I can make the interconnections between nodes minimum. I mean If speed of connection between cores of a node is faster than the connection of nodes, assigning neighbor sub-domains into cores of a single node might help since this reduces the data transferred between nodes ( through a slower connection). I would like to know your idea about this concept.

Yep, you are thinking in the right direction

Assuming that the "machines" file will be respected, you should be able to assign auto-magically each sub-domain (each processor folder made by decomposePar) to each respective machine, as long as mpirun will follow the order in the machine list. As for defining the dimensions of each sub-domain in the decomposeParDict... I have no clue

Unfortunately I only have a little experience with running OpenFOAM in parallel, and I still have a lot to master in OpenFOAM

Best regards,
Bruno

April 9, 2010, 01:57		#2
andersking Member Andrew King Join Date: Mar 2009 Location: Perth, Western Australia, Australia Posts: 82 Rep Power: 17	Hi Sahm, Your case is too small to stress your cluster, for 32 procs you only have 23k cells per proc. Your cluster is probably spending all its time communicating. Try a larger mesh (or faster interconnect). Cheers Andrew __________________ Dr Andrew King Fluid Dynamics Research Group Curtin University

April 22, 2010, 15:59		#6
sahm Senior Member Seyyed Ali H.M. Join Date: Nov 2009 Location: Utah Posts: 107 Rep Power: 17	Is any body Going to Help? Is any body not Going to Help? Last edited by sahm; April 26, 2010 at 16:16. Reason: 2 Admin: If you can please delete this message.

April 23, 2010, 19:30		#9
wyldckat Retired Super Moderator Bruno Santos Join Date: Mar 2009 Location: Lisbon, Portugal Posts: 10,981 Blog Entries: 45 Rep Power: 128	Greetings SAHM, It seems you have an "isolate and conquer" problem! 4 machines, 8 cores each, means that you should first test in a ramification method. In other words: Do a battery of test on a single node, ranging from 1 to 8 cores. Analyse results - if this doesn't scale well, then you've got a problem on how the case is divided along the cores. If it scaled well with a single node, do another battery of tests with only two nodes, but this time test 2 to 16 nodes, in equally distributing the scale up... in other words, it's the first test, by upping both nodes at the same time per test. Repeat this battery of tests with the other two nodes, just to make sure if the nodes and interconnects are purely symmetrical and well balanced. If the times aren't nearly identical, then you've got a bottleneck somewhere If the pairs are properly balanced, then do yet another battery of tests, this time using 3 nodes, ranging 3 to 24 processors, always scaling up 1 core per node per test. Again, you should run again this battery of tests, by swapping one of the used nodes by the one left out. And finally, a battery of tests, using all 4 nodes, doing tests from 4 to 32 cores, always scaling 1 core in all nodes. Of course, if you don't feel like running all of these batteries of tests, then just do the fourth one! I suggest these tests, because by what I've seen from the first graph, it feels there is somewhat of an inertia like problem with the machines! In other words, by analysing the first graph, it seems that: 1-2-4 processors - single node 6-8 -> 3-4 nodes - indicates a lag made by the interconnect, because the nodes have to be in sync 8-16-32 -> always 4 nodes - and this part of the graph is nearly linear! Now, if this analysis reflects what really happened, then there is nothing wrong with the results! If you always want to use the 4 nodes/32 processors, then you should have compared always using 4 nodes! And/or compare: 4 cores: 1 node4 cores/node vs 2 nodes2 cores/node vs 4 nodes1 cores/node 8 cores: 1 node8 cores/node vs 2 nodes4 cores/node vs 4 nodes2 cores/node 16 cores: 2 nodes8 cores/node vs 4 nodes4 cores/node This way you can account for the intrusive behaviour the interconnect has on your scalability! Additionally, don't forget that your infinyband interconnect might have options to configure for jumbo packets (useful for gigantic data transfer between nodes, which would be the case with 50-100 million cells ) or very low latency (useful for constant communication between nodes). Another thing that could hinder the scalability is how frequently do you save a case snapshot. In other words, does this case store: every iteration of the run; only the last iteration; or only needs to read at the start and doesn't save anything during execution, not even at the end. Actually, if it saves all iterations, that would explain that flat line of the 4-6-8 processors! Sooo... to conclude... be careful what you wish for! You just might get it ... or so I would think so :P Edit ---------------------------- I didn't see the other three graphs before I posted. But my train of thought still applies. And I should add another thing to the test list: pick the biggest case, compare only using the 4 nodes, by upping 1 core in all nodes at same time (4-8-12-16-20-24-28-32), and always run at least 10 times each test! Do an average and deviation analysis. For this final test, many outcomes may apply: Scenario 1: average is linear and deviation is small. Conclusion: what the hell Wasn't working before and now is working? This shouldn't have happened.... Scenario 2: average is linear and deviation is large. Conclusion: you've got a serious bottleneck somewhere... could be storage, interconnect, network interference, or someone left something running in the nodes, like a file indexing daemon... Scenario 3: average is not linear and deviation is small. Conclusion: the problem is reproducible. This means that hunting down the real problem will be easier this way, because when you change something, it will be directly reflected on the results! And yet another thing comes to mind: Do the CPUs in the nodes have the "turbo boost" feature that some modern CPUs have? For example, when the CPU uses 8 cores, runs at 3GHz; but with 4 cores, runs at 3.3GHz; and with 2 cores runs at 3.6GHz? And with a single core, runs at 4GHz! This would seriously affect results... Best regards, Bruno arvindpj, konangsh, kera and 2 others like this. __________________ OpenFOAM: FAQ \| Getting started Forum: How to get help, to post code/output and forum guide Read this before sending me PM Last edited by wyldckat; April 23, 2010 at 19:58. Reason: added a missing thought process... and then I saw the other 3 graphs...

April 26, 2010, 16:12	Wow, Still there is a question.	#10
sahm Senior Member Seyyed Ali H.M. Join Date: Nov 2009 Location: Utah Posts: 107 Rep Power: 17	Thanks Bruno. I forgot to erase the first sheet since that case was not defined well, but the other cases have the same problem. About your comments to run them with different types of parallelism, I have a question. Actually I don't know, how to assign a certain sub-domain (of a decomposed domain) into a certain processor. I mean to run your tests, I need to define a specific processor for a specific sub-domain, or at least I should define how many cores of a node I`m going to use. For example I have to define how to use 4 processors, 2 cores/node and 2 nodes. Can you tell me how to do this? Since this cluster is shared between people in the lab, I need to ask for their permission when I'm going to use more than 8 cores. So running your case might take a long time. Besides, we use a software that enqueues the jobs in my cluster, its called Lava, and I should use it to define my jobs for the cluster, otherwise, other people might not see if I'm running something. Can you tell me how to define a job on certain cores on different nodes, and I'd appreciate it if you tell me how to do it with that Lava, if you know this software. I have an idea that I would like to discuss with you. I think if I define sub-domains for specific cores, I can make the interconnections between nodes minimum. I mean If speed of connection between cores of a node is faster than the connection of nodes, assigning neighbor sub-domains into cores of a single node might help since this reduces the data transferred between nodes ( through a slower connection). I would like to know your idea about this concept. Thanks again for your comment.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
running OpenFoam in parallel	vishwa	OpenFOAM Running, Solving & CFD	22	August 2, 2015 09:53
Parallel solution in OpenFOAM	makaveli_lcf	OpenFOAM Running, Solving & CFD	0	September 21, 2009 09:07
parallel processing with icoFoam	sudhar	OpenFOAM Running, Solving & CFD	0	June 22, 2009 23:00
Parallel Processing Set Up	virgilhowardson	FLUENT	1	May 11, 2009 01:54
parallel processing and dynamic mesh	Anand	FLUENT	0	January 9, 2009 16:52

April 19, 2010, 09:46		#4
Canesin Member Fábio César Canesin Join Date: Mar 2010 Location: Florianópolis Posts: 67 Rep Power: 16	Can you upload it in pdf or open-document format ?? Running linux, so no xlsx for me =(

April 23, 2010, 05:50		#7
sjrees New Member Simon Rees Join Date: Mar 2009 Posts: 12 Rep Power: 17	What is your question?? I would say you should get better scalability than this but you will have to say more about the system - cores per node, interconnect type etc.

April 23, 2010, 18:08		#8
sahm Senior Member Seyyed Ali H.M. Join Date: Nov 2009 Location: Utah Posts: 107 Rep Power: 17	Ok, The system I`m working with is a cluster with 4 processing Node, and 1 head node, each node has 8 processors, with Infiniband connection between nodes. I dont know about ram of each node. But I didn't have any problem with that. The problem with this scale up study is that when I increased the number of Processors, the speed of calculation is decreased. I also tried this case with 3M cells, But still this problem exists. Do you have any Idea why this is happening and how I can make it better?