February 27, 2014, 19:37 |
Losing Log when running in parallel
#1 |
New Member
David H.
Join Date: Oct 2013
Posts: 25
Rep Power: 13 |
Hi everyone, I've recently been running some simulations on a cluster at my school for research.
I've been having an issue that seems to come up when I try to restart a run, or continue one using the appropriate controlDict inputs. The problem is that I lose my log, even though I can see from "top" that my process is running on the head node, and from "lsload" that the work is being distributed across the other nodes as well. By "lose my log", I mean the log will usually print the header and a few items but then stop at the next step. There also doesn't seem to be any write output while the processors are spinning their bits with reckless abandon. I'm running OpenFOAM with the following command, which I usually copy and paste:
Code:
mpirun -np 48 pimpleFoam -parallel > log &
followed by:
Code:
tail -f log
Any ideas? Here's all the log gives me now:
Code:
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.1.0                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 2.1.0-bd7367f93311
Exec   : pimpleFoam -parallel
Date   : Feb 27 2014
Time   : 21:20:21
Host   : "Cluster1"
PID    : 21310
Last edited by djh2; February 27, 2014 at 22:43.
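One way to make a launch like this less fragile (a sketch, not part of the original post; the case path is hypothetical) is to detach the job from the terminal with nohup and capture stderr as well as stdout in the log:
Code:
cd /path/to/case                                       # hypothetical case directory
nohup mpirun -np 48 pimpleFoam -parallel > log 2>&1 &  # detach from the terminal, log stdout and stderr
tail -f log                                            # follow the log as before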
February 28, 2014, 02:43 |
#2 |
Senior Member
Bernhard
Join Date: Sep 2009
Location: Delft
Posts: 790
Rep Power: 22 |
Can you show the complete submit script?
Also, some schedulers output two files, <jobname>.o<jobid> and <jobname>.e<jobid>; these might give you a hint as to what went wrong. Final question: what is the layout of your system? 48 CPUs on a single node, or are you using multiple nodes?
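For reference, the kind of submit script being asked about might look like the sketch below. It assumes a PBS/Torque-style scheduler; the job name, queue and node layout are illustrative only and are not taken from the thread:
Code:
#!/bin/bash
#PBS -N pimpleCase            # job name (hypothetical)
#PBS -l nodes=4:ppn=12        # 4 nodes x 12 cores = 48 processes (assumed layout)
#PBS -q batch                 # queue name is an assumption
#PBS -j oe                    # merge stdout/stderr into <jobname>.o<jobid>

cd $PBS_O_WORKDIR             # start in the directory the job was submitted from
mpirun -np 48 pimpleFoam -parallel > log 2>&1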
February 28, 2014, 09:54 |
#3 |
New Member
David H.
Join Date: Oct 2013
Posts: 25
Rep Power: 13 |
I'm not using a script to run the job (like Allrun), although maybe I should make one (a sketch of what such a script could look like follows at the end of this post).
In general, this is my method:
1) scp my files from my desktop onto the cluster
2) run blockMesh
3) run decomposePar
4) run the job in parallel using mpirun
5) run tail to follow the log progress
6) run reconstructPar
7) run paraFoam
As I said previously, I'm using
Code:
mpirun -np 48 pimpleFoam -parallel > log &
This is running on a cluster of five nodes with 12 processors each. There have been times when everything goes as you'd expect: tail -f log brings up the running log file and I can watch the steps go by. Other times it doesn't behave, and what you see at the end of my first post is all the log provides. For example, I ran the simulation from 0 to 0.01 with 0.001 write intervals, then reconstructed the results and viewed them (they looked okay). Then I modified the controlDict to a later endTime, since the simulation was on the right track. That's when the problem occurs. However, I am now not able to reproduce any "successful" runs; even a clean copy of my files produces no log output. Is there significance to having two spaces between "parallel" and ">"? I noticed this usage in another posted topic, and now it seems that my simulation is behaving:
Code:
mpirun -np 48 pimpleFoam -parallel  > log &
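A minimal Allrun-style wrapper for steps 2–4 above might look like this sketch; the log file names and the subdomain count are assumptions for illustration, and the count has to match numberOfSubdomains in system/decomposeParDict:
Code:
#!/bin/bash
set -e                                    # stop on the first error

blockMesh     > log.blockMesh     2>&1
decomposePar  > log.decomposePar  2>&1    # numberOfSubdomains in decomposeParDict must be 48

mpirun -np 48 pimpleFoam -parallel > log 2>&1 &
echo "pimpleFoam started in the background, PID $!"
reconstructPar and paraFoam (steps 6 and 7) would still be run by hand once the solver finishes.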
February 28, 2014, 10:20 |
#4 |
Senior Member
Hi,
Do I understand correctly that you don't have any batch system on the cluster? You just log in and run a simulation with
Code:
mpirun -np 48 pimpleFoam -parallel > log &
If that is the case, then you're trying to run all 48 of your processes on one node, because I don't see a --hostfile or --host option in your command. Since there are many possible explanations for this behaviour (you run all your processes on one node and it can't handle them, there is NFS caching so results don't appear in the log file immediately, etc.), you need to provide more information about the environment you're using before anyone can suggest anything.
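As an illustration of the --hostfile option mentioned above, a launch that spreads the processes explicitly might look like the sketch below. The node names are invented, the "slots=12" layout simply mirrors the five 12-core nodes described earlier, and the flag shown is Open MPI syntax (other MPI implementations use a different flag, e.g. -f, as seen in the next post):
Code:
# hypothetical hostfile listing the five 12-core nodes
cat > hostfile <<EOF
node1 slots=12
node2 slots=12
node3 slots=12
node4 slots=12
node5 slots=12
EOF
mpirun --hostfile hostfile -np 48 pimpleFoam -parallel > log 2>&1 &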
February 28, 2014, 10:41 |
#5 |
New Member
David H.
Join Date: Oct 2013
Posts: 25
Rep Power: 13 |
The host file seems to be managed by the cluster. When I run (the simulation is on 60 procs now, same cluster though):
Code:
mpirun -np 60 pimpleFoam -parallel > log &
the shell reports:
Code:
[6]+  Stopped    mpirun -f /shared/opt/mpihosts -np 60 pimpleFoam -parallel > log
I think this might be the issue I'm having: even though I use
Code:
> log &
the job ends up "Stopped". The strange part is that even though it was "Stopped", the load was still 100%. I think we can chalk this one up to a "Linux amateur" problem rather than a software one. Thanks for your time and input.
Last edited by djh2; February 28, 2014 at 14:10.
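If a backgrounded mpirun ends up "Stopped" like this, it can usually be inspected and resumed from the same shell. The commands below are a generic bash sketch (the job number just matches the output above), not something taken from the thread:
Code:
jobs            # list background jobs and their state ("Stopped" in this case)
bg %6           # resume job number 6 in the background
disown -h %6    # mark it so the shell does not send it SIGHUP on logout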
Tags |
parallel |
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Issues running custom code in parallel | BigBlueDart | OpenFOAM Programming & Development | 4 | October 23, 2013 07:17 |
OpenFoam Parallel running | shipman | OpenFOAM Running, Solving & CFD | 3 | August 17, 2013 11:50 |
Problem in Running OpenFoam in Parallel | himanshu28 | OpenFOAM Running, Solving & CFD | 1 | July 11, 2013 10:19 |
Running PimpleDyMFoam in parallel | paul b | OpenFOAM Running, Solving & CFD | 8 | April 20, 2011 06:21 |
running in parallel, at time t>0 | bunni | OpenFOAM | 1 | October 21, 2010 10:34 |