July 31, 2009, 15:26 |
OpenORTE/mpi problem
|
#1 |
Senior Member
Tomislav Maric
Join Date: Mar 2009
Location: Darmstadt, Germany
Posts: 284
Blog Entries: 5
Rep Power: 21 |
Hello,
I'm trying to run damBreak on a LAN. When I execute mpirun --hostfile machines -np 4 interFoam -parallel, I am prompted for my password on two nodes (mario and marija), but then something goes wrong. interFoam is started from the host icarus, and the lines starting with # are my comments:

tomislav@icarus:damBreak$ mpirun --hostfile machines -np 4 interFoam -parallel
# first of all, ssh works on both nodes and I can log on and do whatever I
# want. why do both prompts for a pass at different nodes (LAN hosts)
# appear in the same line?
tomislav@mario's password: tomislav@marija's password:
# I've entered my pass above and then there's a pause before this:
bash: orted: command not found
# I've googled and found an answer here:
# http://www.open-mpi.org/community/li...07/08/3876.php
# but it didn't help at all
[icarus:15321] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[icarus:15321] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[icarus:15321] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[icarus:15321] ERROR: A daemon on node mario failed to start as expected.
[icarus:15321] ERROR: There may be more information available from
[icarus:15321] ERROR: the remote shell (see above).
[icarus:15321] ERROR: The daemon exited unexpectedly with status 127.
[icarus:15321] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[icarus:15321] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Has anyone here encountered anything similar, or can anyone point me to where to look for an answer? Both nodes are running the Slax live DVD.

Thank you,
Tomislav |
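Editor's note: exit status 127 is the shell's code for "command not found", which matches the "bash: orted: command not found" line in the log above. One common cause, offered here only as a possibility, is that the non-interactive shell started by ssh does not have the Open MPI bin directory on its PATH. The sketch below checks this for every host in a hostfile. The function names and the REMOTE_SH override hook are hypothetical and not part of Open MPI.

```shell
#!/bin/sh
# Sketch: check that a non-interactive remote shell can find orted on every
# host in a hostfile. REMOTE_SH is a hypothetical override hook (default: ssh)
# so the loop can be exercised without a network.

# Print the host names from a hostfile: the first field of every line that is
# neither blank nor a # comment (so "mario slots=2" yields "mario").
list_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Ask each host where orted is. "command -v" exits non-zero when the command
# is not on the PATH of the shell that ssh starts.
check_orted() {
    hostfile=$1
    failed=0
    for host in $(list_hosts "$hostfile"); do
        if ${REMOTE_SH:-ssh} "$host" 'command -v orted' >/dev/null 2>&1; then
            echo "$host: orted found"
        else
            echo "$host: orted NOT found on the non-interactive PATH"
            failed=1
        fi
    done
    return $failed
}
```

Calling check_orted machines from the case directory would print one line per host. A host reported as "NOT found" may also simply be unreachable, so a manual ssh to that host is still worth trying.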
|
August 1, 2009, 12:02 |
|
#2 |
New Member
Bjarne Jensen
Join Date: Mar 2009
Location: Denmark
Posts: 7
Rep Power: 17 |
The problem may be the ssh password prompt. You should set up password-less ssh login between your machines for Open MPI to work across several nodes.
Regards, Bjarne |
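Editor's note: password-less login of the kind suggested above is usually set up by generating a key pair and adding the public key to ~/.ssh/authorized_keys on each node. The sketch below shows only the second step as a small helper that appends a key once and tightens the permissions that sshd normally checks. The function name and the example paths are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: append a public key to an authorized_keys file exactly once and set
# restrictive permissions on the file and its directory.
# authorize_key is an illustrative name, not a standard tool.
authorize_key() {
    pubkey_file=$1                     # e.g. ~/.ssh/id_rsa.pub
    auth_file=$2                       # e.g. ~/.ssh/authorized_keys
    auth_dir=$(dirname "$auth_file")

    [ -r "$pubkey_file" ] || return 1
    mkdir -p "$auth_dir" && chmod 700 "$auth_dir" || return 1
    touch "$auth_file" && chmod 600 "$auth_file" || return 1

    # -F fixed string, -x whole line, -q quiet: skip keys already present.
    if ! grep -qxF -e "$(cat "$pubkey_file")" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}
```

On the node itself this could be called as authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys after copying the public key over. Where it is installed, the standard ssh-copy-id tool does the equivalent from the client side, for example ssh-copy-id tomislav@mario.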
|
August 1, 2009, 12:34 |
|
#3 | |
Senior Member
Tomislav Maric
Join Date: Mar 2009
Location: Darmstadt, Germany
Posts: 284
Blog Entries: 5
Rep Power: 21 |
The ssh-keygen -t dsa command gave me a ~/.ssh directory with public/private keys. I didn't enter a passphrase. I copied the .ssh directory to the node (mario), ran ssh-add successfully and then tried ssh mario, but I got an error message: "ssh: connect to host mario port 22: Connection refused". ping mario works fine.
Tomislav |
||
August 1, 2009, 14:14 |
|
#4 |
Senior Member
Tomislav Maric
Join Date: Mar 2009
Location: Darmstadt, Germany
Posts: 284
Blog Entries: 5
Rep Power: 21 |
OK, I tried again and weird things happened. I followed the instructions on the Open MPI site linked in my first post once more. I'm trying to run interFoam on host marija and use host mario as a slave node. This is what happens:

slax@marija:~/damBreak$ mpirun --hostfile hosts -np 2 interFoam -parallel
ssh: Could not resolve hostname marija: Name or service not known
--------------------------------------------------------------------------
A daemon (pid 8759) died unexpectedly with status 255 while attempting to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

slax@marija:~$ ssh -v 192.168.1.66
OpenSSH_5.1p1, OpenSSL 0.9.8i 15 Sep 2008
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Connecting to 192.168.1.66 [192.168.1.66] port 22.
debug1: Connection established.
debug1: identity file /home/slax/.ssh/identity type -1
debug1: identity file /home/slax/.ssh/id_rsa type -1
debug1: identity file /home/slax/.ssh/id_dsa type 2
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.1
debug1: match: OpenSSH_5.1 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.1
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-cbc hmac-md5 none
debug1: kex: client->server aes128-cbc hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
The authenticity of host '192.168.1.66 (192.168.1.66)' can't be established.
RSA key fingerprint is 8a:94:0a:55:2f:df:b2:82:7a:bc:b2:f9:6a:b7:f6:dc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.66' (RSA) to the list of known hosts.
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password,keyboard-interactive
debug1: Next authentication method: publickey
debug1: Trying private key: /home/slax/.ssh/identity
debug1: Trying private key: /home/slax/.ssh/id_rsa
debug1: Offering public key: /home/slax/.ssh/id_dsa
debug1: Authentications that can continue: publickey,password,keyboard-interactive
debug1: Next authentication method: keyboard-interactive
debug1: Authentications that can continue: publickey,password,keyboard-interactive
debug1: Next authentication method: password
slax@192.168.1.66's password:
debug1: Authentication succeeded (password).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
Last login: Sat Aug 1 18:20:50 2009
Linux 2.6.27.8.
slax@marija:~$ ssh mario
Last login: Sat Aug 1 16:41:41 2009 from mario
Linux 2.6.27.8.
slax@marija:~$ exit
logout
Connection to mario closed.
slax@marija:~$

What's weird is that in the last lines I'm exiting from the connection to host mario, but there was no visible trace of the connection in the first place. ls showed me the contents of the slax home directory on host marija, and the prompt shows that I'm user slax on host marija. What do I need to do to make ssh work without a password, besides following the instructions in the link?

Thanks in advance,
Tomislav |
|
October 27, 2010, 13:03 |
starting parallel runs in OpenFOAM
|
#6 | |
New Member
ANON
Join Date: Oct 2010
Posts: 2
Rep Power: 0 |
I have only recently started doing parallel runs with the damBreak tutorial and am having difficulties. I am using SGE on ROCKS and I can't get the parallel command to work. I hope you can help in this regard. Thanks |
||
October 28, 2010, 05:52 |
|
#7 | |
Senior Member
Tomislav Maric
Join Date: Mar 2009
Location: Darmstadt, Germany
Posts: 284
Blog Entries: 5
Rep Power: 21 |
1) Install OpenFOAM in an NFS-exported directory. The best choice is /share/apps, which the ROCKS user manual suggests for applications that are not installed via rolls or .rpm packages.
2) Run the parallel job with full pathnames so that ORTE picks up the proper paths on the nodes. For this you can use backticks ("`") for command substitution:
`which mpirun` -machinefile MACHINES -np N `which SOLVER` -parallel
where MACHINES is the full pathname of the machinefile for mpirun, N is the number of cores and SOLVER is the solver you wish to run.
Hope this helps,
Tomislav
P.S. If it doesn't work via SGE, try the manual parallel run. Does this work? |
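Editor's note: the full-pathname launch described in point 2 can also be written with command -v instead of backticks around which. The sketch below only prints the resulting command line so that it can be inspected before it is run. The function name build_parallel_cmd and its argument order are illustrative assumptions, and it does not handle paths containing spaces.

```shell
#!/bin/sh
# Sketch: build the mpirun command line with absolute paths for the launcher
# and the solver. build_parallel_cmd is an illustrative name; it prints the
# command instead of running it.
build_parallel_cmd() {
    launcher=$1      # e.g. mpirun
    machines=$2      # full pathname of the machinefile
    ncores=$3        # number of cores
    solver=$4        # e.g. interFoam

    # command -v prints the path of an executable found on PATH and fails
    # if there is none.
    launcher_path=$(command -v "$launcher") || {
        echo "error: $launcher not found on PATH" >&2; return 1; }
    solver_path=$(command -v "$solver") || {
        echo "error: $solver not found on PATH" >&2; return 1; }

    echo "$launcher_path -machinefile $machines -np $ncores $solver_path -parallel"
}
```

For example, build_parallel_cmd mpirun "$PWD/machines" 4 interFoam would print the line to run. If either program is missing from PATH, which usually means the OpenFOAM environment has not been sourced, the function reports it and returns non-zero instead of printing a broken command.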
||
October 28, 2010, 11:03 |
Could not resolve hostname
|
#8 |
New Member
rlobosco
Join Date: Nov 2009
Posts: 5
Rep Power: 16 |
I have just tried to run the damBreak tutorial in parallel, but I am having trouble with it. Maybe someone can help me.
I have two machines, alfa and beta. I can ssh between them without a password and have no problems. But when I run the command:

mpirun --hostfile machines -np 2 interFoam -parallel > log

it gives me the following error message:

ssh: Could not resolve hostname beta: Name or service not known
--------------------------------------------------------------------------
A daemon (pid 5947) died unexpectedly with status 255 while attempting to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown below. Additional manual cleanup may be required - please refer to the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
alfa - daemon did not report back when launched
beta - daemon did not report back when launched

My machines file is in the directory ~/.ssh and contains just the following lines:

alfa
beta

Can someone give me some hints?

With best regards,
rlobosco |
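Editor's note: the "Could not resolve hostname" message above comes from ssh and means that a name from the machines file could not be turned into an address on the machine that tried to connect. The sketch below lists the names in a machines file that fail to resolve. The function name and the RESOLVE_CMD override hook are hypothetical, and the default getent hosts is assumed to be available, as it is on glibc-based Linux systems.

```shell
#!/bin/sh
# Sketch: print every host name in a machines file that does not resolve.
# RESOLVE_CMD is a hypothetical override hook; the default "getent hosts"
# consults the system's configured name sources (typically /etc/hosts and DNS).
unresolved_hosts() {
    machines=$1
    # The first field of each non-blank, non-comment line is the host name.
    for host in $(awk 'NF && $1 !~ /^#/ { print $1 }' "$machines"); do
        if ! ${RESOLVE_CMD:-getent hosts} "$host" >/dev/null 2>&1; then
            echo "$host"
        fi
    done
}
```

Running unresolved_hosts machines on each node in turn shows where a name is missing. A name that is printed typically needs an entry in /etc/hosts or in DNS on that particular machine.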
|