January 19, 2011, 19:18
decomposed case to 2-cores (Not working)
#1
Member
Join Date: Nov 2010
Posts: 33
Rep Power: 15
I am working on the interFoam/laminar/damBreak case. The number of cells in the generated mesh is 955000. To run in parallel, the mesh is decomposed using the metis method.
When decomposed and run on 4 cores (quad-core Xeon E5620), it works perfectly fine. On changing the decomposed case to 2 cores, the system hangs after some time of running and displays the following error: "mpirun noticed that process rank 0 with PID 3758 on node exited on signal 11 (Segmentation fault)." Please suggest. Thanks
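For reference, the decomposition is driven by system/decomposeParDict; a minimal sketch of the dictionary I am using (stock header; only the method and subdomain count matter here):
Code:
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

numberOfSubdomains  2;      // 4 for the run that works
method              metis;  // the failing method; simple/hierarchical also available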
January 20, 2011, 05:37
#2
Senior Member
Roman Thiele
Join Date: Aug 2009
Location: Eindhoven, NL
Posts: 374
Rep Power: 21
Did you delete all the previous processor* folders before decomposing again?
__________________
~roman
January 20, 2011, 11:01
#3
Member
Join Date: Nov 2010
Posts: 33
Rep Power: 15
Oh yes, I deleted the processor* directories. Any other clues?
January 20, 2011, 11:22
#4
Senior Member
Santiago Marquez Damian
Join Date: Aug 2009
Location: Santa Fe, Santa Fe, Argentina
Posts: 452
Rep Power: 24
Hi, sometimes changing the decomposition method fixes the problem; try with simple.
Best.
__________________
Santiago MÁRQUEZ DAMIÁN, Ph.D. Research Scientist Research Center for Computational Methods (CIMEC) - CONICET/UNL Tel: 54-342-4511594 Int. 7032 Colectora Ruta Nac. 168 / Paraje El Pozo (3000) Santa Fe - Argentina. http://www.cimec.org.ar
January 20, 2011, 11:31
#5
Member
Join Date: Nov 2010
Posts: 33
Rep Power: 15
Thanks for your response. You are right: changing the decomposition method to simple fixes this problem. But my requirement is to reduce the number of boundary faces shared between the cores, so that I can reduce the communication cost. This is the reason I shifted from simple to metis.
Do you think the problem is related to the MPI buffer size?
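(For what it's worth, the only MPI buffer knob I know of in OpenFOAM is the MPI_BUFFER_SIZE environment variable exported by etc/settings.sh - a sketch, assuming the stock 1.6 scripts; the value is purely illustrative:
Code:
export MPI_BUFFER_SIZE=200000000
)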
January 22, 2011, 14:53
#6
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128
Greetings to all!
@pkr: Wow, you've been asking about this in a few places! As Santiago implied, the problem is related to how the cells are distributed among the sub-domains. And this isn't a new issue; it has been happening for quite a while now. One case that I accompanied for a bit led to this bug report: http://www.cfd-online.com/Forums/ope...-parallel.html - if you follow up the story on that thread and the related links, you'll learn about an issue with cyclic patches and so on. Nonetheless, if your case is just a high-resolution damBreak case, without anything in particular added - like cyclic patches, wedges and so on - then the problem should be related to a few cells that are split between processors when they should be kept together.
Also, if it's just the damBreak case, AFAIK decomposing with Metis will not minimize the interfaces any better than the simple or hierarchical methods. The proof is the face count returned by decomposePar, which consistently shows Metis producing something like 50% more faces interfacing between domains than the simple or hierarchical methods. My experience with high-resolution versions of the damBreak and cavity cases, in attempts to benchmark OpenFOAM on a multi-core machine, has led me to conclude that simple and hierarchical are more than enough - and in fact better - for situations like these, where the meshes are so simple. Metis and Scotch are for the more complex meshes, with no clear indication of the most likely and best places to split the mesh.
Now, if you still want to use Metis, then also try Scotch, which is usually available with the latest versions of OpenFOAM. It's conceptually similar to Metis, but has a far more permissive software license. It will likely distribute the cells between sub-domains in a different way; with luck, the wrong cells won't end up apart from each other. Also, if you run the following command in the tutorials and applications folders of OpenFOAM, you can find out a bit more about decomposition options from other dictionaries:
Code:
find . -name "decomposeParDict"
Bruno
PS: By the way, another conclusion was that on a single machine with multiple cores, over-scheduling the processors sometimes yields higher processing power; one such case was a modified 3D cavity using the icoFoam solver and about 1 million cells, where on my AMD 1055T (6 cores) 16 sub-domains led to a rather better run time than 4 or 6 sub-domains! But still, I have yet to achieve linear speed-up or anything near it, even from a CPU computation power point of view (i.e. 6x the power with a 6-core machine, no matter how many sub-domains).
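For anyone wanting to check the face counts themselves, a sketch (the exact log wording varies between versions, so the grep pattern may need adjusting):
Code:
# repeat after switching the method between metis, scotch, simple, hierarchical
decomposePar > log.decomposePar
grep -i "processor faces" log.decomposePar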
__________________
January 22, 2011, 17:41
#7
Senior Member
Santiago Marquez Damian
Join Date: Aug 2009
Location: Santa Fe, Santa Fe, Argentina
Posts: 452
Rep Power: 24
Hey Bruno, thx for the explanation. I have a related problem, working with interFoam and METIS too. We have a parallel facility with a server and diskless nodes, which read the OS over the network via NFS. When I use METIS and run, for example:
a) two threads, each on one core of the server: things go well.
b) the same on a node (server and nodes have 8 cores each): the problem is decomposed correctly, but only one core carries the load and the run proceeds very slowly.
c) launching from the server, but sending one thread to node1 and the other to node2: correct decomposition, balanced load, all OK.
d) launching from the server, sending two threads to the same node: same problem as in b).
It is very weird; it sounds like the nodes don't like multi-core processing with OpenFOAM. Regards.
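Schematically, the four launches look like this (hostnames and the bare solver invocation are illustrative; the real runs go through our OpenFOAM environment):
Code:
# a) two ranks on the server - OK
mpirun -np 2 interFoam -parallel
# b) two ranks on one 8-core node - only one core loaded, very slow
mpirun -np 2 -host node1 interFoam -parallel
# c) one rank on each of two nodes - balanced load, OK
mpirun -np 2 -host node1,node2 interFoam -parallel
# d) two ranks on the same node, launched from the server - same as b)
mpirun -np 2 -host node1,node1 interFoam -parallel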
__________________
Santiago MÁRQUEZ DAMIÁN, Ph.D. Research Scientist Research Center for Computational Methods (CIMEC) - CONICET/UNL Tel: 54-342-4511594 Int. 7032 Colectora Ruta Nac. 168 / Paraje El Pozo (3000) Santa Fe - Argentina. http://www.cimec.org.ar
January 22, 2011, 21:34
#8
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128
Hi Santiago,
Yes, I know, it's really weird! Here's another proof I picked up from this forum, a draft report by Gijsbert Wierink: Installation of OpenFOAM on the Rosa cluster. If you look at Figure 1 in that document, you'll see the case can't speed up unless it's unleashed onto more than one machine! I've replicated the case used, and the timings with my AMD 1055T (6 cores) are roughly the same. It was that information that led me to try over-scheduling 16 processes onto the 6 processors, which yielded a rather better performance than using only 6 processes. Basically, the timings reported in that draft indicate a lousy speed-up of almost 4 times on an 8-core machine (4 cores per socket, dual-socket machine, if I'm not mistaken), but when 16 and 32 cores (3-4 nodes) are used, the speed-ups are 10 and 20 times! Above that, it saturates because the cell/core count drops too far under the 50k cells/core estimate. With this information, along with the information in the report "OpenFOAM Performance Benchmark and Profiling" and the estimated minimum limit of 50k cells/core, my deductions are:
edit: I forgot to mention, if my memory isn't failing me, that here in the forum there is some limited information about configuring the shared memory defined by the kernel, which can play a rather important role in local runs, but I've never managed to do a proper tuning of those parameters. Best regards, Bruno
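The kernel shared-memory settings meant here are the System V shmem limits; a sketch of how one would inspect and raise them on Linux (the values are illustrative only, not a recommendation):
Code:
# current limits
cat /proc/sys/kernel/shmmax /proc/sys/kernel/shmall
# raise them for this boot (bytes and pages, respectively)
sudo sysctl -w kernel.shmmax=1073741824
sudo sysctl -w kernel.shmall=262144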
__________________
Last edited by wyldckat; January 22, 2011 at 21:44. Reason: see "edit:"
January 24, 2011, 12:57
#9
Member
Join Date: Nov 2010
Posts: 33
Rep Power: 15
Thanks Bruno. I am working on your suggestions.
I am also trying to get the parallel case running across machines. To test the parallel setup, I followed the steps mentioned in another post: http://www.cfd-online.com/Forums/ope...tml#post256927. If I run the parallel case with 2 processes on a single machine, the parallelTest utility works fine:
Quote:
On the other hand, if I split the processing across 2 machines, then the system hangs after "Create time":
Quote:
P.S. OpenFOAM version 1.6 is used on both machines.
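For reference, the two launches being compared are, schematically (machine names illustrative; the second is the one that hangs):
Code:
# single machine - works
mpirun -np 2 parallelTest -parallel
# across two machines via a hostfile - hangs after "Create time"
mpirun --hostfile machines -np 2 parallelTest -parallel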
January 24, 2011, 18:41
#10
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128
Hi pkr,
Wow, I love it when the parallelTest application does some crazy time travel and flushes the buffer in a crazy order. As for the second run:
On the other hand, if at least one of them is true, then you should check how the firewall is configured on those two machines. The other possibility is that the naming convention for the IP addresses isn't being respected on both machines - for example, when the first machine defines something in "/etc/hosts" that the second machine maps differently:
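A made-up illustration of the kind of mismatch meant here (all names and addresses invented): machine2 resolving its own name to the loopback address is a classic way to make MPI ranks hang while trying to connect to each other.
Code:
# /etc/hosts on machine1
192.168.0.1   machine1
192.168.0.2   machine2
# /etc/hosts on machine2 - note the conflicting first entry
127.0.0.1     machine2
192.168.0.1   machine1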
My usual trick to try to isolate these cases is to:
Best regards and good luck! Bruno
__________________
January 24, 2011, 21:40
#11
Senior Member
Santiago Marquez Damian
Join Date: Aug 2009
Location: Santa Fe, Santa Fe, Argentina
Posts: 452
Rep Power: 24
Bruno, some comments:
Quote:
Regards.
__________________
Santiago MÁRQUEZ DAMIÁN, Ph.D. Research Scientist Research Center for Computational Methods (CIMEC) - CONICET/UNL Tel: 54-342-4511594 Int. 7032 Colectora Ruta Nac. 168 / Paraje El Pozo (3000) Santa Fe - Argentina. http://www.cimec.org.ar
January 24, 2011, 23:08
#12
Member
Join Date: Nov 2010
Posts: 33
Rep Power: 15
Thanks for your response, Bruno. I tried your suggestions, but still no progress in solving the problem.
Apart from this, I tried a simple OpenMPI program, which works fine. The code and output are as follows:
Quote:
In summary:
mpirun -np 2 parallelTest ==> works
mpirun --hostfile machines -np 2 parallelTest ==> not working
Do you think it might be a problem with the version I am using? I am currently working with OpenFOAM 1.6. Shall I move to OpenFOAM 1.6.x? Please suggest some other things I can check. Thanks.
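For completeness, the machines hostfile is nothing fancy - one host per line (the hostnames here are the ones used later in this thread; the slots syntax is optional Open MPI notation):
Code:
fire2 slots=1
fire3 slots=1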
January 25, 2011, 02:29
#13
Member
Join Date: Nov 2010
Posts: 33
Rep Power: 15
Hi Bruno,
Another query: please comment on the process I am following to execute parallelTest across the machines.
1. machine1 is the master and machine2 the slave.
2. On machine1, change system/decomposeParDict for 2 processes.
3. Execute decomposePar on machine1, which creates two directories, processor0 and processor1.
4. Create a machines file on machine1 containing machine1 and machine2 as entries.
5. Copy the processor0 and processor1 directories from machine1 to machine2. (Directory: OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak)
6. Launch "foamJob -p -s parallelTest" on machine1.
After following these steps, the output gets stuck at "Create time", as follows:
Quote:
Please comment on whether this is the right process for executing the parallelTest application across machines.
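A sketch of step 5, assuming passwordless ssh between the machines (the destination path mirrors the case directory above):
Code:
cd ~/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak
rsync -a processor0 processor1 machine2:OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak/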
January 25, 2011, 16:10
#14
Member
Join Date: Nov 2010
Posts: 33
Rep Power: 15
Hi Bruno,
Yet another query: it seems the problem might be due to the setting of some environment variables on the slave. Please advise. The OpenFOAM project directory is visible on the slave side:
rphull@fire3:~$ echo $WM_PROJECT_DIR
/home/rphull/OpenFOAM/OpenFOAM-1.6
1. When the complete path of the executable is not specified:
Quote:
2. When the complete executable path is specified:
Quote:
From the second case, it looks like the machine tried to launch the application but failed, as it was not able to figure out the path to the shared object (libinterfaceProperties.so in this case). Any suggestions to fix this?
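A quick check along these lines could confirm the hypothesis (a sketch; note that ssh runs a non-interactive shell, so an OpenFOAM environment set up only in the interactive part of ~/.bashrc will be missing here - which is exactly the failure mode suspected):
Code:
ssh fire3 'echo $LD_LIBRARY_PATH | tr ":" "\n" | grep -i foam'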
January 25, 2011, 19:29
#15
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128
Hi pkr,
Mmm, that's a lot of testing you've been doing. OK, let's see if I don't forget anything:
Code:
mpirun -np 1 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host machine2 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
These two tests can help you isolate where the problem is, since they only launch one instance of parallelTest on the remote machine, without the need for explicit communication between processes via MPI. OK, I hope I didn't forget anything. Best regards and good luck! Bruno
__________________
January 27, 2011, 13:48
#16
Member
Join Date: Nov 2010
Posts: 33
Rep Power: 15
Hi Bruno,
Does putting "-parallel" make it run in the master-slave framework?
Quote:
Both cases work fine on machine1. When I try the same on machine2, the following case fails:
rphull@fire3:~/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak$ mpirun -np 1 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -case $PWD
exec: 128: parallelTest: not found
Is this the root cause of the problem? Any suggestions to fix this?
January 27, 2011, 14:14
#17
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128
Hi pkr,
When using NFS or sshfs, this is sort of done automatically for you, except that the real files should only reside on a single machine, instead of being physically replicated on both machines.
OK, the first thing that comes to mind is that "parallelTest" is only available on one of the machines. To confirm this, run on both machines:
Code:
which parallelTest
Now, when I try to think more deeply about this, I get the feeling that there is something else that is slightly different on one of the machines, but I can't put my finger on it... it feels like it's either the Linux distribution version that isn't identical, or something about bash not working the same exact way. Perhaps it's how "~/.bashrc" is defined on both machines - check if there are any big differences between the two files. Any changes to the variables "PATH" and "LD_LIBRARY_PATH" inside "~/.bashrc" that differ in some particular way can lead to very different working environments! The other possibility would be how ssh is configured on both machines... Best regards, Bruno
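A concrete way to run the ".bashrc" comparison suggested above (a sketch; assumes both home directories are reachable over ssh and that the interesting lines mention PATH):
Code:
ssh fire2 'grep -n "PATH" ~/.bashrc' > bashrc.fire2
ssh fire3 'grep -n "PATH" ~/.bashrc' > bashrc.fire3
diff bashrc.fire2 bashrc.fire3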
__________________
January 27, 2011, 15:05
#18
Member
Join Date: Nov 2010
Posts: 33
Rep Power: 15
Thanks Bruno.
Quote:
mpirun -np 1 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host machine2 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
Both of these work, but the system still hangs when I try to run parallelTest with the -parallel keyword across the 2 machines.
January 27, 2011, 19:38
#19
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128
Hi pkr,
Another test you can try is launching a parallel case to work solely on the remote machine:
Code:
mpirun -np 2 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel -case $PWD
mpirun -np 2 -host machine2 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel -case $PWD
It feels we are really close to getting this to work, and yet it seems so far... OK, another test for trying to isolate what on digital earth is going on - run this using the two machines, from/to either one of them, and also only locally and only remotely:
Code:
foamJob -s -p bash -c export
cat log | sort -u > log2
export | sort -u > loglocal
diff -Nur loglocal log2 > log.diff
mpirun -np 4 bash -c export > log.simple
cat log.simple | sort -u > log2.simple
diff -Nur loglocal log2.simple > log.simple.diff
This is something of a last resort; it should help verify what the environment looks like on the remote machine when mpirun launches the process remotely. The things to keep an eye out for are:
Right now I'm too tired to figure out any more tests and/or possibilities. Best regards and good luck! Bruno
__________________
January 28, 2011, 01:01
#20
Member
Join Date: Nov 2010
Posts: 33
Rep Power: 15
Thanks for your response. I am listing all the commands in this post. I still have to try the commands for checking differences in the remote machine's configuration; I will get back to you soon on that.
Commands to launch:
1. Without -parallel but with a machine file:
mpirun -hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest
====> works fine from both machines
2. With -parallel but without any machine file:
mpirun -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel
====> works fine from both machines
3. With -parallel and with a machine file:
mpirun -hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel
====> does not work from either machine
With foamJob, I tried the following:
foamJob -p -s parallelTest ==> works when the machines file is not present in the current directory; otherwise it fails.
All of the following commands work fine from both machines:
mpirun -np 1 -host fire3 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host fire3 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest
mpirun -np 1 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest
mpirun -np 2 -host fire3 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel
mpirun -np 2 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel
When running with -parallel across the machines, once in a while I see the following error message. Have you seen it before?
Quote:
I also tried debugging with gdb. Here is the call stack where the program gets stuck when running with -parallel across the machines:
Quote:
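For the record, the attach-and-backtrace procedure was along these lines (a sketch; the PID is whatever ps/pgrep reports for the hung rank):
Code:
# on the machine with the stuck process
gdb -p $(pgrep -f parallelTest | head -1)
(gdb) bt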