January 19, 2006, 12:52 |
PVM RSH parallel setup problem Windows XP
|
#1 |
Guest
Posts: n/a
|
I am trying to set up two Windows XP SP1 PCs to run PVM distributed parallel runs on ANSYS CFX 10 with Service Pack 1. I have set up my hosts.ccl file to include the slave node as follows.

    EXECUTION CONTROL:
      PARALLEL HOST LIBRARY:
        HOST DEFINITION: MASTER
          Installation Root = C:\Program Files\Ansys Inc\CFX\CFX-10.0
          Host Architecture String = intel_opteron.sse2_winnt5.1
        END # HOST DEFINITION MASTER
        HOST DEFINITION: SLAVE
          Installation Root = C:\Program Files\Ansys Inc\CFX\CFX-10.0
          Host Architecture String = intel_p4.sse2_winnt5.1
        END # HOST DEFINITION SLAVE
      END # PARALLEL HOST LIBRARY
    END # EXECUTION CONTROL

I copied the automatically created PVMhosts file from C:\Temp into the <CFXROOT>\config folder, and it is set up as follows.

    &MASTER dx=C:\Program Files\Ansys Inc\CFX\Shared\pvm3.4.4_11-1\lib\WIN32\pvmd3.exe
    &SLAVE dx=C:\Program Files\Ansys Inc\CFX\Shared\pvm3.4.4_11-1\lib\WIN32\pvmd3.exe

I had no problems adding the SLAVE under the solver definition, but I get an error from the Solver output...

    +--------------------------------------------------------------------+
    | An error has occurred in cfx5solve:                                |
    |                                                                    |
    | Unable to start the PVM daemon on host CLYDESDALE. This may        |
    | indicate that the PVM daemon is already running on that host, or   |
    | that it left files in /tmp on that host because it did not exit    |
    | cleanly last time it was run.                                      |
    +--------------------------------------------------------------------+

After checking the help files, I launched the PVM console and tried to add the SLAVE node manually (add SLAVE), but it gives me an error along the lines of...

    socket closed (?) after reading from : The system cannot find the file specified.
    socket closed (?) after reading from SLAVE: Error doing $PVM_ROOT/lib/pvmd -s -d0x0 -nSLAVE 1 etc etc etc
    phase1() rsh failed for host SLAVE
    p1_startup() got result "PVMCantStart" for SLAVE
    phase1() etc etc etc

(it retries and fails again). Then the console window gives an error which includes the statement...

    Auto-diagnosing Failed Hosts...
    SLAVE...
    Verifying Local Path to "rsh"...
    Error - File C:\WINDOWS\system32\rsh Not Found!
    Determine the path to the "rsh" command on your system, and edit PVM_ROOT\conf\WIN32.def to adjust the path for the -DRSHCOMMAND=\"\" flag. Then recompile PVM and your applications.

Keep in mind, I DO have the RSH service installed and functional, under the same login name, on the SLAVE machine. I can test it from MASTER with the Echo command.

Sorry for this long post, but I really don't know what to do now. I've looked through the old posts (ca. 2002) about RSH and PVM problems on CFX5, but I can't seem to get my system to work. Could it have something to do with the spaces in the install directory names? Any help would be appreciated. |
|
January 19, 2006, 13:47 |
Re: PVM RSH parallel setup problem Windows XP
|
#2 |
Guest
Posts: n/a
|
PVM doesn't like spaces in the path. Try specifying your Installation Root using the short-form path:
Installation Root = C:\Progra~1\AnsysI~1\CFX\CFX-%v
M |
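For anyone wanting to check their own files before launching a run, here is a minimal Python sketch that scans hosts.ccl-style text and flags any Installation Root value that still contains a space. The function name `roots_with_spaces` and the sample text are made up for illustration; the only thing taken from this thread is the `Installation Root = ...` line format.

```python
import re

# Matches lines of the form "Installation Root = <path>" (format as quoted in this thread).
ROOT_LINE = re.compile(r"^\s*Installation Root\s*=\s*(.+?)\s*$", re.MULTILINE)


def roots_with_spaces(ccl_text):
    """Return every Installation Root value in ccl_text that contains a space."""
    return [m.group(1) for m in ROOT_LINE.finditer(ccl_text) if " " in m.group(1)]


# Hypothetical sample: one long-form path (with spaces) and one short-form path.
sample = r"""HOST DEFINITION: MASTER
  Installation Root = C:\Program Files\Ansys Inc\CFX\CFX-10.0
END # HOST DEFINITION MASTER
HOST DEFINITION: SLAVE
  Installation Root = C:\Progra~1\AnsysI~1\CFX\CFX-%v
END # HOST DEFINITION SLAVE
"""

flagged = roots_with_spaces(sample)
for path in flagged:
    print("Path contains spaces:", path)
```

This only detects the problem; it does not work out the short-form (8.3) name for you, since that depends on the file system of each machine.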
|
January 19, 2006, 17:02 |
Re: PVM RSH parallel setup problem Windows XP
|
#3 |
Guest
Posts: n/a
|
Mike, Thank you very much. I changed the relevant fields in each of the "hosts.ccl" and "PVMhosts" files on both machines, and after making sure that all the "pvm..." files were deleted from the Temp folder, PVM distributed parallel is working for me.
A note: for some reason, the PVM temp files were in the C:\Documents and Settings\<USER>\Temp folder, not in c:\Temp. So for others reading this thread, be sure to check both of these locations for possible temp files. |
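A small Python sketch for checking both locations in one go. It assumes the leftover files all start with "pvm" (as described above) and that the TEMP/TMP environment variables point at the per-user temp folder; both are assumptions, and the helper names are made up. It only lists files and leaves the deleting to you.

```python
import os
from pathlib import Path


def find_pvm_leftovers(directories, pattern="pvm*"):
    """Return files matching pattern in each directory that exists."""
    found = []
    for directory in directories:
        folder = Path(directory)
        if folder.is_dir():
            found.extend(sorted(f for f in folder.glob(pattern) if f.is_file()))
    return found


def candidate_dirs():
    """C:\\Temp plus whatever TEMP/TMP point to (assumed to be the per-user temp folder)."""
    dirs = [Path(r"C:\Temp")]
    for var in ("TEMP", "TMP"):
        value = os.environ.get(var)
        if value:  # skip unset or empty variables
            dirs.append(Path(value))
    return dirs


if __name__ == "__main__":
    for leftover in find_pvm_leftovers(candidate_dirs()):
        print(leftover)
```

Run it on each machine in the cluster, since the stale files can be left behind on the slave as well as the master.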
|
January 20, 2006, 03:44 |
Re: PVM RSH parallel setup problem Windows XP
|
#4 |
Guest
Posts: n/a
|
Just a little comment to spoil your day ... Under Windows it seems as if parallel performance with PVM is nowhere near as good as with MPI. I'm not sure if you will see the difference with only two nodes, but I did notice it quite dramatically when running 6. Somebody pointed this out to me and I tried MPI, which was much quicker.
|
|
January 20, 2006, 06:19 |
Re: PVM RSH parallel setup problem Windows XP
|
#5 |
Guest
Posts: n/a
|
I'm using SGI MPI under Linux as well. PVM seems to freeze the jobs sometimes, but it works very well on SGI Unix systems.
|
|
January 20, 2006, 10:15 |
Re: PVM RSH parallel setup problem Windows XP
|
#6 |
Guest
Posts: n/a
|
Yes, under Windows MPI (MPICH) is often faster. This is not true of Linux/Unix. If you're on a Linux/Unix platform that has the HP MPI executables available in CFX-10.0, then this should be the first choice since it's usually faster than PVM.
M |
|
January 21, 2006, 15:19 |
Re: PVM RSH parallel setup problem Windows XP
|
#7 |
Guest
Posts: n/a
|
Charles, the Help files for CFX 10 say, "Do not use MPI mode for network parallel simulations. PVM mode (the default) is far superior in terms of robustness to abnormal situations for network parallel runs (eg a computer crash during a run)." They also say, "Only consider using MPI on multi-processor machines".
How do you have your MPI parallel nodes set up? Is it set up on a network? I assume it's working OK for you, but I'm wondering why the Help files recommend so strongly against MPI for network parallel. Before I got my second box, I was running in MPICH local parallel on my dual-processor Opteron workstation, but now that I have another couple of CPUs, I'm running PVM. If stability isn't an issue and configuration isn't too difficult, I'd like to run MPI. One more thing: can you quantify how much faster MPI actually is than PVM? Does it improve how the speed scales when adding nodes? Thanks! |
|
January 22, 2006, 09:05 |
Re: PVM RSH parallel setup problem Windows XP
|
#8 |
Guest
Posts: n/a
|
Abe, it seems as if the PVM performance issue is just with Windows. Linux PVM is fine, apparently. I ran distributed parallel with MPI on Windows, and it was OK in terms of reliability, but Ansys are obviously not confident of it. I don't really understand the "(eg a computer crash during a run)" comment. If a node goes down, the run dies, in PVM as well as MPI, and you have to go back sometimes and clean up dead files. I didn't quantify the PVM vs MPI difference, but it was lots, something like twice as fast on a 6 CPU cluster, and I think it gets worse as you add nodes. What we do now is to run distributed parallel on Linux with HPMPI. It works fine ... until the network switch dies, but that is an occupational hazard of distributed parallel compooting.
The networking professionals have tricks of the trade to improve network latency and redundancy, but most of us have to accept that distributed parallel computing occasionally has its practical drawbacks. I think the first option for modest parallel computing is a quad Opteron motherboard with four dual-core CPUs, which can give you pretty good 8-way parallel computing without having to dick around with networking issues. |
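On the question of quantifying the difference: the usual way is to run the same case both ways and compare wall-clock times. A minimal Python sketch of that arithmetic follows. The function names are made up, and the timings in the example are illustrative numbers only, not measurements from this thread.

```python
def speedup(t_reference, t_candidate):
    """How many times faster the candidate run is than the reference run."""
    if t_reference <= 0 or t_candidate <= 0:
        raise ValueError("wall-clock times must be positive")
    return t_reference / t_candidate


def parallel_efficiency(t_serial, t_parallel, n_processes):
    """Speedup over the serial run divided by the number of processes (1.0 is ideal)."""
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    return speedup(t_serial, t_parallel) / n_processes


# Illustrative wall-clock times in seconds (hypothetical, not measured).
pvm_time = 3600.0
mpi_time = 1800.0
print("MPI vs PVM speedup:", speedup(pvm_time, mpi_time))

# Hypothetical serial time for the same case, run on 6 processes.
print("Efficiency on 6 processes:", parallel_efficiency(6000.0, 2000.0, 6))
```

Tracking efficiency as nodes are added shows whether the gap between the two modes widens with cluster size.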
|
January 22, 2006, 16:45 |
Re: PVM RSH parallel setup problem Windows XP
|
#9 |
Guest
Posts: n/a
|
2X faster is definitely worth looking at!
I agree with your puzzlement about the "computer crash during run" comment... last night I had a router issue that apparently caused my slave node to miscommunicate - it crashed the whole PVM run and I don't really know how to salvage it. I suppose there is a way of backing up the entire run periodically? I wonder if that causes a big time hit... By the way, do you have any idea whether there would be a big speedup from using a Gigabit switch instead of a 10/100 router? Both my boxes have Gigabit-capable Ethernet adapters. Here's to an 8-way SMP Opteron box! (Now, who wants to help me pay for it??? ...and for the 2GB memory modules as well...) |
|
January 22, 2006, 17:26 |
Re: PVM RSH parallel setup problem Windows XP
|
#10 |
Guest
Posts: n/a
|
Hi,
I have done lots of benchmarking on MPI and PVM on Linux and Windows. In my experience, MPI on Windows is as fast as PVM on Linux. PVM is more robust in that when a job crashes it rarely needs the user to clean things up manually, whereas that is regularly required with MPI. The best option here is prevention: make sure your nodes and network are happy and healthy before starting the run so it does not crash!
I think the reference in the manual about not using MPI on Windows is because you should not use the standard MPI implementation but the custom Windows MPI version. Also, you cannot run heterogeneous clusters with MPI for Windows, but you can with PVM.
I have also tested gigabit networks versus 100 Mbit and found only a slight difference for most runs with fewer than 8 nodes - only a few percent speed difference. The parallel implementation in CFX is very efficient in its use of network bandwidth, so network speed is not a bottleneck. I have not tested large clusters with more than 8 nodes; I suspect the network is more significant then.
I have posted some extensive posts on this topic on the CFX-Community website. I recommend you have a look at them.
Regards, Glenn Horrocks |
|
January 23, 2006, 09:44 |
Re: PVM RSH parallel setup problem Windows XP
|
#11 |
Guest
Posts: n/a
|
I would agree with Glenn about Gigabit versus 100 Mbit networks. The one exception is if you are running a cluster of machines that each have 4 cores or more (e.g. two dual-core Opterons); then the network traffic does become a bottleneck and using a high-performance network will help. Not too many of us have clusters like that yet!
M
|
|
January 30, 2006, 06:10 |
Re: PVM RSH parallel setup problem Windows XP
|
#12 |
Guest
Posts: n/a
|
I use Remote Task Manager to keep an eye on my simulations on the slave machines as well as network traffic from the master computer. This program gives you information about network traffic and CPU load on the Windows network. You can easily see if the network is a bottleneck. http://www.protect-me.com/rtm/
|