October 24, 2011, 13:09 |
Problems about distributed parallel runs
|
#1 |
Senior Member
Vesselin Krastev
Join Date: Jan 2010
Location: University of Tor Vergata, Rome
Posts: 368
Rep Power: 20 |
Hi all,
I'm trying to launch some distributed parallel runs on a CentOS-based cluster (server and nodes have the same OS version installed), but this is what I obtained by running the foamJob script from the server:
Code:
[krastev@epsilon morris60_SA_secondamesh]$ foamJob -s -p sonicAdaptiveFoam
Parallel processing using OPENMPI with 4 processors
Executing: mpirun -np 4 -hostfile machines /server/krastev/OpenFOAM/OpenFOAM-1.7.1/bin/foamExec sonicAdaptiveFoam -parallel | tee log
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.7.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 1.7.1-03e7e056c215
Exec : sonicAdaptiveFoam -parallel
Date : Oct 24 2011
Time : 17:46:36
Host : node64-1.sub.uniroma2.it
PID : 25824
[node64-1.sub.uniroma2.it][[39888,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.1.102 failed: No route to host (113)
Could someone please give me a hint about where the problem is? Thanks a lot V.
PS: Some additional information:
1) I have installed OF-1.7.1 locally in the home folder of my account (I'm not the server administrator and I need to have my own freely compilable version).
2) The same run works perfectly on the single nodes (they are quad-core nodes), whether I launch the foamJob command or use the mpirun -np 4 etc. syntax directly.
|
October 24, 2011, 14:53 |
|
#2 | |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Hi Vesselin,
The problem is somewhat simple: one or more of the IP addresses on each machine are not visible from everywhere in the cluster. A solution should be something like this: Quote:
Code:
/sbin/ifconfig
Best regards, Bruno
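The visibility check suggested above can be run for every host at once. What follows is only a minimal sketch, assuming the hostfile is the `machines` file from the first post, that its lines have the usual Open MPI form (a host name or IP, optionally followed by options such as `slots=4`), and that the Linux `ping` on the nodes accepts `-c` and `-W`. The function names are hypothetical helpers, not part of OpenFOAM or Open MPI.

```shell
#!/bin/bash
# List the host names in an Open MPI style hostfile, dropping comments,
# blank lines and per-host options such as "slots=4".
hosts_from_hostfile() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }'
}

# Ping each host once from the current machine and report the outcome.
# The error in the thread was raised on node64-1 while connecting to
# 10.1.1.102, so this should be repeated on every node, not only the server.
check_hosts() {
    local host
    for host in $(hosts_from_hostfile "$1"); do
        if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
            echo "$host reachable"
        else
            echo "$host NOT reachable"
        fi
    done
}

# Example: check_hosts machines
```

A host that answers ping can still reject the TCP connections that Open MPI opens, so a clean result here narrows the problem down but does not rule it out.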
October 25, 2011, 06:24 |
|
#3 |
Senior Member
Vesselin Krastev
Join Date: Jan 2010
Location: University of Tor Vergata, Rome
Posts: 368
Rep Power: 20 |
Hi Bruno, and thanks a lot for your answer!
Following your suggestions, I first tried typing (from the server):
Code:
mpirun --mca btl_tcp_if_exclude lo -hostfile machines -np 4 foamExec sonicAdaptiveFoam -parallel
but this is the result:
Code:
bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 20171) died unexpectedly with status 127 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
bash: orted: command not found
mpirun: clean termination accomplished
After that, I checked the IPs of each node with /sbin/ifconfig and all seems fine (each node reported an active eth0 connection with an assigned IP). Finally, I also checked the /etc/hosts files, and they look like the following:
Code:
10.1.1.102 node64-2.sub.uniroma2.it node64-2 # Added by NetworkManager
127.0.0.1 localhost.localdomain localhost
::1 node64-2.sub.uniroma2.it node64-2 localhost6.localdomain6 localhost6
with 10.1.1.102 being the same inet address reported by /sbin/ifconfig, except for node 1, where the hosts file looks like this:
Code:
10.1.1.101 node64-1.sub.uniroma2.it node64-1
127.0.0.1 localhost localhost.localdomain
In addition, I can tell you that each node "knows" about the others in terms of interactive ssh connections, because they share a common .ssh/known_hosts file containing all the proper IPs.
Probably I'm missing something trivial, but the fact is that I would really like to run my simulations independently of the administrator (remember that I have no root access to the server, nor to the nodes)... Thanks once again V.
|
October 25, 2011, 06:37 |
|
#4 | ||
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Hi Vesselin,
Quote:
Code:
`which mpirun` --mca btl_tcp_if_exclude lo -hostfile machines -np 4 `which foamExec` sonicAdaptiveFoam -parallel
And remember that this way, both mpirun and OpenFOAM should be visible on the same path. I assume that your home folder is shared among all nodes.
Quote:
Nonetheless, if the "machines" file for your case indicates only IPs, then it should work as intended.
Best regards and good luck! Bruno
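The "orted: command not found" message usually means that the non-interactive shell started on the remote node does not have the Open MPI `bin` directory in its PATH. According to the Open MPI documentation, calling mpirun by its absolute path (which is what the backticked `which mpirun` achieves) is treated like passing `--prefix` with the installation directory. The sketch below makes that explicit. It assumes the default ssh launcher and an installation laid out as `<prefix>/bin/mpirun`; the helper names are hypothetical.

```shell
#!/bin/bash
# Derive the Open MPI installation prefix from the full path of mpirun,
# e.g. /some/dir/bin/mpirun -> /some/dir
prefix_from_mpirun() {
    dirname "$(dirname "$1")"
}

# Ask a remote node, through a non-interactive ssh session, whether it can
# find orted. No output (and a non-zero exit status) corresponds to the
# "orted: command not found" failure.
remote_has_orted() {
    ssh "$1" 'command -v orted'
}

# Example (node64-2 is the host name used in the thread):
#   remote_has_orted node64-2
#   prefix=$(prefix_from_mpirun "$(command -v mpirun)")
#   mpirun --prefix "$prefix" -hostfile machines -np 4 \
#       "$(command -v foamExec)" sonicAdaptiveFoam -parallel
```

This only helps when the same installation path exists on every node, which is the case if the home folder holding OpenFOAM is shared, as assumed in the post above.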
October 25, 2011, 07:01 |
|
#5 | |
Senior Member
Vesselin Krastev
Join Date: Jan 2010
Location: University of Tor Vergata, Rome
Posts: 368
Rep Power: 20 |
Quote:
Code:
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.7.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 1.7.1-03e7e056c215
Exec : sonicAdaptiveFoam -parallel
Date : Oct 25 2011
Time : 11:48:36
Host : node64-1.sub.uniroma2.it
PID : 31167
[node64-1.sub.uniroma2.it][[62160,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.1.102 failed: No route to host (113)
And nothing changes if I put the IPs directly inside the machines file (e.g. 10.1.1.101 instead of node64-1). It is also the same if I use the server as the master node and any other node as the slave one. Any further ideas? Thanks V.
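Working ssh logins show that port 22 on the other node is reachable, but the failing `connect()` in the log comes from Open MPI's TCP transport, which opens its own connections on other ports. One way to reproduce the failure outside of MPI is to probe an arbitrary port directly. This is a rough sketch, assuming bash with `/dev/tcp` support and the coreutils `timeout` command on the nodes; `tcp_probe` is a hypothetical helper and port 50000 is an arbitrary choice, not a port Open MPI is known to use.

```shell
#!/bin/bash
# Try to open a TCP connection to host $1 on port $2 through bash's /dev/tcp
# pseudo-device, giving up after 3 seconds.
# Returns 0 only if the connection could be opened.
tcp_probe() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2"
}

# Example, run on node64-1:
#   tcp_probe 10.1.1.102 22    && echo "ssh port reachable"
#   tcp_probe 10.1.1.102 50000 || echo "failed, see the error text above"
```

If the second probe reports "Connection refused", the node is reachable and simply has nothing listening on that port. If it reports "No route to host", the same error seen in the log occurs without MPI being involved, which points at the network configuration between the nodes rather than at OpenFOAM.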
||
October 25, 2011, 07:16 |
|
#6 |
Senior Member
Pablo Higuera
Join Date: Jan 2011
Location: Auckland
Posts: 627
Rep Power: 19 |
I think I had a similar problem recently: the nodes were not able to communicate because mpirun by default used an interface which was not connected. This interface had been installed by default by some application related to virtual machines (I do not remember the name, nor can I check it right now).
The solution was to kill this interface (unused in our case).
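Without administrator rights, an alternative to removing the unused interface is to tell Open MPI which interface to use through the `btl_tcp_if_include` MCA parameter. The sketch below is a minimal illustration, assuming that eth0 is the interface that carries the cluster traffic (as the ifconfig output described earlier in the thread suggests) and that `/sys/class/net` is available on the nodes. The helper names are hypothetical, and the second function only prints the command so it can be inspected before being run.

```shell
#!/bin/bash
# List the network interfaces known to the Linux kernel on this node.
# An unexpected entry (for example a virtual bridge) is a candidate for the
# kind of unused interface described above.
list_interfaces() {
    ls /sys/class/net
}

# Print an mpirun command line that limits Open MPI's TCP traffic to one
# interface. $1 = interface, $2 = number of processes, $3 = hostfile,
# remaining arguments = the program and its options.
mpirun_on_interface() {
    local iface=$1 np=$2 hostfile=$3
    shift 3
    echo mpirun --mca btl_tcp_if_include "$iface" -np "$np" -hostfile "$hostfile" "$@"
}

# Example:
#   list_interfaces
#   mpirun_on_interface eth0 4 machines foamExec sonicAdaptiveFoam -parallel
```

As far as I know, `btl_tcp_if_include` and `btl_tcp_if_exclude` are not meant to be combined, so the `--mca btl_tcp_if_exclude lo` option tried earlier should be dropped when using this form.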
|
October 25, 2011, 07:17 |
|
#7 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Hi Vesselin,
I still think the problem might be the incomplete "/etc/hosts" file, but I could be wrong. Several days ago I wrote a couple of posts about trying to isolate-and-conquer issues when running in parallel: Segmentation fault in interFoam run through openMPI, posts #8 and #10. You might want to read the whole thread, just to better understand what is being discussed in those two posts. I've also been collecting more information about running OpenFOAM in parallel in this blog post of mine: Notes about running OpenFOAM in parallel. These might come in handy for you as well. Good luck! Bruno
October 25, 2011, 08:23 |
|
#8 | |
Senior Member
Vesselin Krastev
Join Date: Jan 2010
Location: University of Tor Vergata, Rome
Posts: 368
Rep Power: 20 |
Quote:
Thanks for the answer, but I need more precise information about this application before I start blindly killing something on the cluster (remember also that I have no administrator rights). V.
||
October 25, 2011, 08:32 |
|
#9 | |
Senior Member
Vesselin Krastev
Join Date: Jan 2010
Location: University of Tor Vergata, Rome
Posts: 368
Rep Power: 20 |
Quote:
Thanks once again V. Last edited by vkrastev; October 25, 2011 at 08:40. Reason: adding information |
||
November 11, 2012, 08:18 |
|
#10 |
New Member
charlse
Join Date: Mar 2011
Location: china
Posts: 6
Rep Power: 15 |
I recently ran into the same problem and have not solved it yet. Additional information: from one node I can run on a single other node (for example, I can use node14 from node16), but I cannot run on two or more nodes at once. Does anyone have any other advice?
|
|
November 11, 2012, 10:22 |
|
#11 | |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings star shower,
Quote:
Best regards, Bruno