|
October 14, 2005, 06:01 |
|
#1 |
Member
Radu Mustata
Join Date: Mar 2009
Location: Zaragoza, Spain
Posts: 99
Rep Power: 17 |
Hi,
Something changed in my system lately and I don't know what it is. I was able to boot LAM with no problems in the past, but now I get the following (rather long) message:

radu@nodo1-2:~$ lamboot -d ./machines_foam
n-1<3811> ssi:boot:open: opening
n-1<3811> ssi:boot:open: opening boot module globus
n-1<3811> ssi:boot:open: opened boot module globus
n-1<3811> ssi:boot:open: opening boot module rsh
n-1<3811> ssi:boot:open: opened boot module rsh
n-1<3811> ssi:boot:open: opening boot module slurm
n-1<3811> ssi:boot:open: opened boot module slurm
n-1<3811> ssi:boot:select: initializing boot module slurm
n-1<3811> ssi:boot:slurm: not running under SLURM
n-1<3811> ssi:boot:select: boot module not available: slurm
n-1<3811> ssi:boot:select: initializing boot module rsh
n-1<3811> ssi:boot:rsh: module initializing
n-1<3811> ssi:boot:rsh:agent: rsh
n-1<3811> ssi:boot:rsh:username: <same>
n-1<3811> ssi:boot:rsh:verbose: 1000
n-1<3811> ssi:boot:rsh:algorithm: linear
n-1<3811> ssi:boot:rsh:no_n: 0
n-1<3811> ssi:boot:rsh:no_profile: 0
n-1<3811> ssi:boot:rsh:fast: 0
n-1<3811> ssi:boot:rsh:ignore_stderr: 0
n-1<3811> ssi:boot:rsh:priority: 10
n-1<3811> ssi:boot:select: boot module available: rsh, priority: 10
n-1<3811> ssi:boot:select: initializing boot module globus
n-1<3811> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<3811> ssi:boot:select: boot module not available: globus
n-1<3811> ssi:boot:select: finalizing boot module slurm
n-1<3811> ssi:boot:slurm: finalizing
n-1<3811> ssi:boot:select: closing boot module slurm
n-1<3811> ssi:boot:select: finalizing boot module globus
n-1<3811> ssi:boot:globus: finalizing
n-1<3811> ssi:boot:select: closing boot module globus
n-1<3811> ssi:boot:select: selected boot module rsh

LAM 7.1.1 - Indiana University

n-1<3811> ssi:boot:base: looking for boot schema in following directories:
n-1<3811> ssi:boot:base: <current>
n-1<3811> ssi:boot:base: $TROLLIUSHOME/etc
n-1<3811> ssi:boot:base: $LAMHOME/etc
n-1<3811> ssi:boot:base: /home/dm2/henry/OpenFOAM/OpenFOAM-1.2/src/lam-7.1.1/platforms/linuxGcc4Opt/etc
n-1<3811> ssi:boot:base: looking for boot schema file:
n-1<3811> ssi:boot:base: ./machines_foam
n-1<3811> ssi:boot:base: found boot schema: ./machines_foam
n-1<3811> ssi:boot:rsh: found the following hosts:
n-1<3811> ssi:boot:rsh: n0 nodo1-2 (cpu=1)
n-1<3811> ssi:boot:rsh: resolved hosts:
n-1<3811> ssi:boot:rsh: n0 nodo1-2 --> 192.168.3.2 (origin)
n-1<3811> ssi:boot:rsh: starting RTE procs
n-1<3811> ssi:boot:base:linear: starting
n-1<3811> ssi:boot:base:server: opening server TCP socket
n-1<3811> ssi:boot:base:server: opened port 43936
n-1<3811> ssi:boot:base:linear: booting n0 (nodo1-2)
n-1<3811> ssi:boot:rsh: starting lamd on (nodo1-2)
n-1<3811> ssi:boot:rsh: starting on n0 (nodo1-2): hboot -t -c lam-conf.lamd -d -I -H 192.168.3.2 -P 43936 -n 0 -o 0
n-1<3811> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
mkdir: Permission denied
tkill: got killname back: /tmp/lam-radu@nodo1-2/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-radu@nodo1-2/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-radu@nodo1-2/lam-io-socket
tkill: f_kill = "/tmp/lam-radu@nodo1-2/lam-killfile"
tkill: nothing to kill: "/tmp/lam-radu@nodo1-2/lam-killfile"
hboot: booting...
hboot: fork /mnt/store1/radu/OpenFOAM/OpenFOAM-1.2/src/lam-7.1.1/platforms/linuxGcc4Opt/bin/ lamd
[1] 3814 lamd -H 192.168.3.2 -P 43936 -n 0 -o 0 -d
n-1<3811> ssi:boot:rsh: successfully launched on n0 (nodo1-2)
n-1<3811> ssi:boot:base:server: expecting connection from finite list
hboot: attempting to execute
mkdir: Permission denied
chdir failed!: No such file or directory
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

- There are network filters between the lamboot agent host and
  the remote host such that communication on random TCP ports
  is blocked
- Network routing from the remote host to the local host isn't
  properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,

   telnet <ipnumber> <portnumber>

   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n-1<3811> ssi:boot:base:server: failed to connect to remote lamd!
n-1<3811> ssi:boot:base:server: closing server socket
n-1<3811> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully

I did what it says in 2. above and it worked well, i.e.

radu@nodo1-2:~$ telnet nodo1-2 43936
Trying 192.168.3.2...
telnet: Unable to connect to remote host: Connection refused

...so I don't really know what happens...
Any ideas, please? I see that it fails in some mkdir, but I can mkdir anywhere in the list of nodes..
Thank you in advance,
Radu |
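For anyone repeating the telnet check from step 2 of the LAM message, the callback address and port can be pulled out of the "lamboot -d" output rather than read off by eye. The following is only a sketch: hboot_callback is a made-up helper name, and it assumes the hboot line has the same "-H <ip> -P <port>" layout as in the log above.

```shell
#!/bin/sh
# Sketch: extract the callback IP and port from "lamboot -d" output on stdin.
# hboot_callback is a hypothetical helper, not part of LAM.
# Assumes the hboot command line contains "-H <ip> -P <port>" as in the log above.
hboot_callback() {
    sed -n 's/.*hboot .*-H \([0-9.]*\) -P \([0-9]*\).*/\1 \2/p' | head -n 1
}

# Demo with the hboot line quoted in the post:
echo 'n-1<3811> ssi:boot:rsh: starting on n0 (nodo1-2): hboot -t -c lam-conf.lamd -d -I -H 192.168.3.2 -P 43936 -n 0 -o 0' \
    | hboot_callback
# prints: 192.168.3.2 43936
```

The two values are what you pass to telnet from the remote node. Note that in this thread the telnet check gave the expected "Connection refused", so the network was not the culprit; the "mkdir: Permission denied" lines in the log were the real hint.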
|
October 16, 2005, 10:44 |
|
#2 |
Senior Member
Mattijs Janssens
Join Date: Mar 2009
Posts: 1,419
Rep Power: 26 |
- try 'ssh' to the machines
- try 'ssh ls' to the machine
- can you write to all files needed
- can you do mkdir /tmp/lam-radu@nodo1-2 |
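The last check in the list above can be scripted so it is easy to repeat on every node. This is a minimal sketch, not anything shipped with LAM: check_session_dir is a hypothetical helper that simply tries to create and remove a scratch directory under the given base, which is what tkill/hboot need to do with /tmp/lam-radu@nodo1-2.

```shell
#!/bin/sh
# Sketch: test whether the current user can create a directory under a base dir.
# check_session_dir is a hypothetical helper name, not part of LAM.
check_session_dir() {
    base="$1"                      # directory LAM would put its session dir in
    probe="$base/lam-check-$$"     # scratch name, unique per shell process
    if mkdir "$probe" 2>/dev/null; then
        rmdir "$probe"             # clean up the probe directory again
        echo "ok: can create directories in $base"
        return 0
    fi
    echo "FAIL: cannot create directories in $base"
    return 1
}

# Check the default location used in the log above.
check_session_dir /tmp
```

Run it on each host in the boot schema, not only on the machine where lamboot is started; a one-off equivalent over ssh would be something like ssh nodo1-2 'mkdir /tmp/lam-test && rmdir /tmp/lam-test'.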
|
October 17, 2005, 05:10 |
|
#3 |
Member
Radu Mustata
Join Date: Mar 2009
Location: Zaragoza, Spain
Posts: 99
Rep Power: 17 |
Hi Mattijs,
1&2 work fine
3 -- don't know what "needed" files are
4 -- no I cannot mkdir in /tmp of any of the nodes in the list...will ask the admin....I guess that's the trouble.
Thank you,
Radu |
|
October 17, 2005, 05:19 |
#4 |
Super Moderator
Niklas Nordin
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 693
Rep Power: 29 |
You can create a tmp directory in your home directory and point the system to that location instead, using
setenv TMPHOME $HOME/tmp
or TMP_HOME or something like that...
N |
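The variable name in the post above is given from memory. As far as I know, LAM 7.x reads LAM_MPI_SESSION_PREFIX for the location of its session directory and otherwise falls back to TMPDIR and then /tmp, but treat that as an assumption and check the documentation of your LAM version. A sketch for sh/bash users follows; use_session_prefix is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: point LAM's session directory at a user-writable location.
# ASSUMPTION: this LAM build honours LAM_MPI_SESSION_PREFIX (verify in its docs).
# use_session_prefix is a hypothetical helper, not part of LAM.
use_session_prefix() {
    base="$1"
    mkdir -p "$base" || return 1          # create the directory if missing
    LAM_MPI_SESSION_PREFIX="$base"
    export LAM_MPI_SESSION_PREFIX         # make it visible to lamboot/hboot
    echo "LAM_MPI_SESSION_PREFIX=$LAM_MPI_SESSION_PREFIX"
}

# Typical use, followed by the boot command from the first post:
#   use_session_prefix "$HOME/tmp" && lamboot -d ./machines_foam
# csh/tcsh equivalent of the export:
#   setenv LAM_MPI_SESSION_PREFIX $HOME/tmp
```

The directory has to exist and be writable on every node in the boot schema, so this works most simply when the home directory is shared across nodes.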
|
October 17, 2005, 05:27 |
|
#5 |
Member
Radu Mustata
Join Date: Mar 2009
Location: Zaragoza, Spain
Posts: 99
Rep Power: 17 |
Thanks Niklas,
The problem is solved now. I got write permission on /tmp on the nodes, so now it boots all right.
Radu |
|
|
|