|
Issue with running in parallel on multiple nodes |
|
August 27, 2010, 09:07 |
Issue with running in parallel on multiple nodes
|
#1 |
Senior Member
Dave
Join Date: Jul 2010
Posts: 100
Rep Power: 16 |
Hey all,
I have been struggling for weeks with trying to get my network to perform parallel processing using the openmpi implemented with OF. I have performed these runs in parallel on a single node and successfully run this case. The issue arises when I goto run on multiple machines, mpi runs but then OF cannot find "controlDict". I get the following error for node "prius" (slave node to node "insight"): Process 2506 Unable to locate the parameter file "/home/dave/Openfoam/dave-1.7.0/run/kukaseries/unsteadyCoGparallel" in the following search path: /home/dave/OpenFOAM/dave-1.7.0/run/kukaseries/unsteadyCoGparallel:/usr/share/openmpi/amca-param-sets:/home/dave/OpenFOAM/dave-1.7.0/run/kukaseries/unsteadyCoGparallel -------------------------------------------------------------------------- [insight:03276] [[17400,1],0] node[0].name insight daemon 0 arch ffca0200 [insight:03276] [[17400,1],0] node[1].name prius daemon 1 arch ffca0200 [prius:02505] procdir: /tmp/openmpi-sessions-dave@prius_0/17400/1/1 [prius:02506] procdir: /tmp/openmpi-sessions-dave@prius_0/17400/1/2 [prius:02506] jobdir: /tmp/openmpi-sessions-dave@prius_0/17400/1 [prius:02506] top: openmpi-sessions-dave@prius_0 [prius:02506] tmp: /tmp [prius:02506] [[17400,1],2] node[0].name insight daemon 0 arch ffca0200 [prius:02505] jobdir: /tmp/openmpi-sessions-dave@prius_0/17400/1 [prius:02505] top: openmpi-sessions-dave@prius_0 [prius:02505] tmp: /tmp [prius:02506] [[17400,1],2] node[1].name prius daemon 1 arch ffca0200 [prius:02505] [[17400,1],1] node[0].name insight daemon 0 arch ffca0200 [prius:02505] [[17400,1],1] node[1].name prius daemon 1 arch ffca0200 [insight:03277] procdir: /tmp/openmpi-sessions-dave@insight_0/17400/1/3 [insight:03277] jobdir: /tmp/openmpi-sessions-dave@insight_0/17400/1 [insight:03277] top: openmpi-sessions-dave@insight_0 [insight:03277] tmp: /tmp [insight:03277] [[17400,1],3] node[0].name insight daemon 0 arch ffca0200 [insight:03277] [[17400,1],3] node[1].name prius daemon 1 arch ffca0200 
[insight:03276] mca_param_files=/home/dave/.openmpi/mca-params.conf:/etc/openmpi/openmpi-mca-params.conf (default value) [insight:03276] mca_base_param_file_prefix=/home/dave/Openfoam/dave-1.7.0/run/kukaseries/unsteadyCoGparallel (file:/home/dave/.openmpi/mca-params.conf) [insight:03276] mca_base_param_file_path=/usr/share/openmpi/amca-param-sets:/home/dave/OpenFOAM/dave-1.7.0/run/kukaseries/unsteadyCoGparallel (default value) [1] --> FOAM FATAL IO ERROR: [1] cannot open file [2] [2] [2] --> FOAM FATAL IO ERROR: [2] cannot open file [2] [2] file: /home/dave/processor2/system/controlDict at line 0. [2] [2] From function regIOobject::readStream() [2] in file db/regIOobject/regIOobjectRead.C at line 61. [2] FOAM parallel run exiting [2] [1] -------------------------------------------------------------------------- MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD with errorcode 1. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. -------------------------------------------------------------------------- [1] file: /home/dave/processor1/system/controlDict at line 0. [1] [prius:02506] sess_dir_finalize: proc session dir not empty - leaving [1] From function regIOobject::readStream() [1] in file db/regIOobject/regIOobjectRead.C at line 61. [1] FOAM parallel run exiting [1] [prius:02505] sess_dir_finalize: proc session dir not empty - leaving [prius:02337] sess_dir_finalize: proc session dir not empty - leaving The problem is that this directory does not exist on prius nor do I believe it should unless it is a temporary directory created when the mpi session is created. I have tried numerous methods of overcoming this such as using the --preload-file , the --preload-file-dir option and -wdir option to try and overcome this, but none have solved the problem. 
Any help would be really appreciated; this problem has been racking my brain for weeks. -Dave |
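For context, an OpenFOAM parallel run expects each rank to find the decomposed case (its processorN directory plus system/ and constant/) at the same path on its own node. A minimal sketch of launching across two nodes without shared storage, using the case path from the error above (the "machines" hostfile contents and the interDyMFoam solver, which appears later in this thread, are assumptions):

```shell
# Hypothetical "machines" hostfile, two slots per node:
#   insight cpu=2
#   prius   cpu=2

# Without NFS, copy the whole decomposed case to the slave first so it
# exists at the SAME absolute path there:
CASE=$HOME/OpenFOAM/dave-1.7.0/run/kukaseries/unsteadyCoGparallel
scp -r "$CASE" prius:"$(dirname "$CASE")/"

# Then launch a single distributed job over all four ranks:
mpirun --hostfile machines -np 4 interDyMFoam -parallel > log
```

The key point is that mpirun does not ship case files around for you; it only starts processes, so the data has to be in place (or NFS-mounted) beforehand.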
|
August 27, 2010, 10:29 |
|
#2 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings Dave,
OK, here are a few questions, so we can better understand what's going on:
Bruno
|
|
August 30, 2010, 06:57 |
|
#3 |
Senior Member
Dave
Join Date: Jul 2010
Posts: 100
Rep Power: 16 |
Bruno,
Thank you for the insight about MPI. I had the impression that MPI would handle file/data transfer as necessary and that a file-sharing system was not needed. I will be setting up NFS today to see if it resolves the issue of getting files onto the slave computers. I almost (see below) succeeded in running a tutorial in parallel (since the files were already on both computers) with the command "mpirun --hostfile machines -np 4 <executable> -parallel > log".

I noticed a strange difference in behavior between running MPI on a single node and on 2 nodes. When I run on one node, results are written into the "processorN" directories and I have to use reconstructPar to put the pieces together. When I ran on 2 nodes, fully reconstructed time steps were being created on BOTH computers, and the processor directories remained empty apart from the 0 and constant folders initially placed there. It appears that mpirun is starting the processes on both computers, but independently: basically I am running 2 instances of the case, one on each computer, using both cores on each. I noticed this when one computer finished running while the other was still going, at a different time step.

The other odd behavior is that I am still getting the "unable to locate the parameter file" message, but the run proceeds anyway. The error I get is this (process 3499 being on the slave computer):

Process 3499 Unable to locate the parameter file
"/home/dave/Openfoam/dave-1.7.0/run/kukaseries/unsteadyCoGparallel"
in the following search path:
/home/dave/OpenFOAM/dave-1.7.1/run/tutorials/multiphase/interDyMFoam/ras/sloshingTank3D3DoF:/usr/share/openmpi/amca-param-sets:/home/dave/OpenFOAM/dave-1.7.1/run/tutorials/multiphase/interDyMFoam/ras/sloshingTank3D3DoF

I think I am going to try parallelTest if I can't get it to run in a few tries. -Dave |
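For reference, the parallelTest utility mentioned above is typically launched the same way as a solver (a sketch; the "machines" hostfile is an assumption carried over from the command quoted in this post):

```shell
# Run OpenFOAM's parallelTest utility over the same hostfile. Each
# rank reports what it sends to and receives from the other ranks,
# so two nodes running as independent jobs show up immediately.
mpirun --hostfile machines -np 4 parallelTest -parallel > log.parallelTest
```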
|
August 30, 2010, 08:29 |
|
#4 |
Senior Member
Dave
Join Date: Jul 2010
Posts: 100
Rep Power: 16 |
Whoops, I had typed the flag -parallel as >parallel, and somehow it did not spit out an error. After correcting it, the case runs in parallel correctly (although without NFS set up, the slave computer's results are written to its local "processorN" directories instead of back on the master). Goes to show that when something goes wrong, you should look at what you're typing early in the debugging process.
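The one-character mix-up above is easy to reproduce with any command. A runnable demonstration using echo as a stand-in for the solver (so no OpenFOAM is needed):

```shell
#!/bin/sh
# ">parallel" is shell output redirection: the word never reaches the
# program; the shell just creates a file named "parallel" to hold its
# stdout, and the solver runs in serial mode.
echo running serial > parallel

# "-parallel" is an ordinary argument passed to the program itself,
# with stdout redirected to "log" as intended.
echo running -parallel > log
```

After running this, a file literally named "parallel" exists alongside "log", which is exactly the silent failure mode described: no error, wrong behavior.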
-Dave |
|
August 31, 2010, 11:40 |
|
#5 |
Senior Member
Dave
Join Date: Jul 2010
Posts: 100
Rep Power: 16 |
Bruno,
I managed to write a fairly elaborate script that tars and sends the case, performs the execution, and then selectively transfers the processor directories from each slave computer back to the master, where the case is reconstructed. However, I have only done this between 32-bit machines, and have been unable to make a 64-bit machine execute alongside the 32-bit machines. This is unfortunate, since about half of our machines are 32-bit and the other half are 64-bit. Is there a way to force the 64-bit machines to read the 32-bit libraries? -Dave |
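A stripped-down, runnable sketch of that tar dispatch/collect idea (all names are hypothetical; the local "scratch" directory stands in for the slave node, where scp/ssh would be used in practice):

```shell
#!/bin/sh
set -e
# Master side: a toy decomposed case with two processor directories.
CASE=demoCase
mkdir -p "$CASE/processor0" "$CASE/processor1"
echo p0 > "$CASE/processor0/fields"
echo p1 > "$CASE/processor1/fields"

# Pack the whole decomposed case for transfer to a slave node.
tar czf case.tgz "$CASE"
# (in practice: scp case.tgz slave:~/ && ssh slave 'tar xzf case.tgz')

# Slave side (simulated locally): unpack, run its ranks, then pack
# only the processor directories holding that node's results.
mkdir -p scratch
tar xzf case.tgz -C scratch
(cd scratch && tar czf ../results.tgz "$CASE/processor1")

# Master side again: unpack the returned processor directory; after
# all slaves report back, reconstructPar merges the time steps.
tar xzf results.tgz
```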
|
August 31, 2010, 12:47 |
|
#6 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Hi Dave,
Why aren't you using NFS for sharing files/folders? I think it's simpler and more efficient (though I'm not 100% sure), although I guess that selective temporary sharing can be more efficient in the long run. Anyway, if you have openSUSE installed, it's very easy to set up NFS! I suggest NFS because it is usually the most efficient file-sharing system among Linux machines, and this way you can also share the whole OpenFOAM folder! As for 32/64-bit compatibility:
How exactly have you set up each machine? Best regards, Bruno
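For reference, a minimal NFS setup between the two machines named in this thread might look like this (the subnet is hypothetical; the paths follow the home directories seen in the logs above):

```shell
# On the master (insight): export the OpenFOAM tree. Add this line to
# /etc/exports (192.168.1.0/24 is a placeholder subnet), then re-export:
#   /home/dave/OpenFOAM 192.168.1.0/24(rw,sync,no_subtree_check)
sudo exportfs -ra

# On each slave (e.g. prius): mount the share at the SAME path, so
# case paths resolve identically on every node:
sudo mount -t nfs insight:/home/dave/OpenFOAM /home/dave/OpenFOAM
```

Mounting at the same path on every node is what lets a single mpirun launch find the decomposed case everywhere without any copying.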
|
|
August 31, 2010, 16:23 |
|
#7 |
Senior Member
Dave
Join Date: Jul 2010
Posts: 100
Rep Power: 16 |
Bruno,
The shell script was a matter of expediency: I already had a script that did file transfer (by tarring and untarring the files), so I simply modified it to decompose the case, tar/copy it to the slaves, and then selectively copy back the relevant processor directories (deleting them from the slaves to save space). I would like to set up NFS in the long term, but in the name of expediency I went with what I knew should work, since NFS is still a subject I am rather unfamiliar with.

As for the 64- vs 32-bit problem: I have 3 machines with 32-bit Ubuntu and one with 64-bit. One machine is incapable of 64-bit while the other 3 have 64-bit processors, so I am torn. I can either get the 64-bit machine to operate with the 32-bit ones by installing 32-bit Ubuntu on it (I have a boot disk that makes installing 32-bit Ubuntu incredibly easy), or upgrade the two 64-bit-capable machines that are currently running 32-bit (due to my mistake of using the 32-bit machine's Ubuntu disk, I would have to reinstall Ubuntu on them) and then modify the libraries as above, or just modify the 64-bit machine. I think in the long run I will upgrade the two machines to 64-bit Ubuntu and eventually retire the 32-bit machine when a replacement comes along, since the performance benefits of 64-bit merit a complete move to 64-bit architecture where possible.

The issue appears to be in which build of the executable is referenced: the 64-bit machine references one library path while the 32-bit machines reference another (as determined by running "which interDyMFoam" on each). As a short-term solution I have simply run a quarter of the cases on the 64-bit machine and the remainder on the 32-bit parallel machines (inelegant, I know, but working, so...).

Thank you for the insights in all of this! -Dave |
|
August 31, 2010, 18:16 |
|
#8 | |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings Dave,
Quote:
From what I can infer from the information you gave, you are probably running Ubuntu 10.04 (also known as Lucid) and installed OpenFOAM via the Debian packages provided by OpenCFD. If this is the case, then I believe some minor "hacking" can get you running all OpenFOAM versions in 32-bit, without needing to change Ubuntu versions! Or we could even get the 64-bit version to "play ball" with the 32-bit versions. You're welcome; helping others is a way of sharing experience. Best regards, Bruno
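On 64-bit Ubuntu of that era, the usual first step toward running 32-bit binaries was installing the 32-bit compatibility libraries (a hedged suggestion: the package name below is specific to Ubuntu releases around 10.04 and was later superseded by multiarch):

```shell
# Pull in the 32-bit runtime libraries so 32-bit executables (such as
# a 32-bit OpenFOAM build) can load on a 64-bit Ubuntu 10.04 install:
sudo apt-get install ia32-libs
```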
|
|
|