February 27, 2014, 19:37 |
Losing Log when running in parallel
#1 |
New Member
David H.
Join Date: Oct 2013
Posts: 25
Rep Power: 13 |
Hi everyone, I've recently been running some simulations on a cluster at my school for research.
I've been having an issue that seems to come up when I try to restart a run, or continue one using the appropriate controlDict inputs. The problem is that I lose my log, even though I can see from "top" that my process is running on the head node, and from "lsload" that the work is being distributed across the other nodes as well. By "lose my log", I mean the log will usually print the header and a few items but then stop at the next step. There also doesn't seem to be any write output while the processors are spinning their bits with reckless abandon. I'm running OpenFOAM with the following command, which I usually copy and paste:
Code:
mpirun -np 48 pimpleFoam -parallel > log &
followed by:
Code:
tail -f log
Any ideas? Here's all the log gives me now:
Code:
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  2.1.0                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 2.1.0-bd7367f93311
Exec   : pimpleFoam -parallel
Date   : Feb 27 2014
Time   : 21:20:21
Host   : "Cluster1"
PID    : 21310
Last edited by djh2; February 27, 2014 at 22:43.
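One way to make a launch like this less fragile (a sketch, not part of the original post; the case path is hypothetical) is to detach the job from the terminal with nohup and capture stderr as well as stdout in the log:
Code:
cd /path/to/case                                       # hypothetical case directory
nohup mpirun -np 48 pimpleFoam -parallel > log 2>&1 &  # detach from the terminal, log stdout and stderr
tail -f log                                            # follow the log as before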
February 28, 2014, 02:43 |
#2 |
Senior Member
Bernhard
Join Date: Sep 2009
Location: Delft
Posts: 790
Rep Power: 22 |
Can you show the complete submit script?
Also, some schedulers output two files, <jobname>.o<jobid> and <jobname>.e<jobid>; these might give you a hint as to what went wrong. Final question: what is the layout of your system? 48 CPUs on a single node, or are you using multiple nodes?
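For reference, the kind of submit script being asked about might look like the sketch below. It assumes a PBS/Torque-style scheduler; the job name, queue and node layout are illustrative only and are not taken from the thread:
Code:
#!/bin/bash
#PBS -N pimpleCase            # job name (hypothetical)
#PBS -l nodes=4:ppn=12        # 4 nodes x 12 cores = 48 processes (assumed layout)
#PBS -q batch                 # queue name is an assumption
#PBS -j oe                    # merge stdout/stderr into <jobname>.o<jobid>

cd $PBS_O_WORKDIR             # start in the directory the job was submitted from
mpirun -np 48 pimpleFoam -parallel > log 2>&1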
February 28, 2014, 09:54 |
#3 |
New Member
David H.
Join Date: Oct 2013
Posts: 25
Rep Power: 13 |
I'm not using a script to run the job (like Allrun), although maybe I should make one (a sketch of what such a script could look like follows at the end of this post).
In general, this is my method:
1) scp my files from my desktop onto the cluster
2) run blockMesh
3) run decomposePar
4) run the job in parallel using mpirun
5) run tail to follow the log progress
6) run reconstructPar
7) run paraFoam
As I said previously, I'm using
Code:
mpirun -np 48 pimpleFoam -parallel > log &
This is running on a cluster of five nodes with 12 processors each. There have been times when everything goes as you'd expect: tail -f log brings up the running log file and I can watch the steps go by. Other times it doesn't behave, and what you see at the end of my first post is all the log provides. For example, I ran the simulation from 0 to 0.01 with 0.001 write intervals, then reconstructed the results and viewed them (they looked okay). Then I modified the controlDict to a later endTime, since the simulation was on the right track. That's when the problem occurs. However, I am now not able to reproduce any "successful" runs; even a clean copy of my files produces no log output. Is there significance to having two spaces between "parallel" and ">"? I noticed this usage in another posted topic, and now it seems that my simulation is behaving:
Code:
mpirun -np 48 pimpleFoam -parallel  > log &
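A minimal Allrun-style wrapper for steps 2–4 above might look like this sketch; the log file names and the subdomain count are assumptions for illustration, and the count has to match numberOfSubdomains in system/decomposeParDict:
Code:
#!/bin/bash
set -e                                    # stop on the first error

blockMesh     > log.blockMesh     2>&1
decomposePar  > log.decomposePar  2>&1    # numberOfSubdomains in decomposeParDict must be 48

mpirun -np 48 pimpleFoam -parallel > log 2>&1 &
echo "pimpleFoam started in the background, PID $!"
reconstructPar and paraFoam (steps 6 and 7) would still be run by hand once the solver finishes.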
February 28, 2014, 10:20 |
#4 |
Senior Member
Hi,
Do I understand correctly that you don't have any batch system on the cluster? You just log in and run a simulation with
Code:
mpirun -np 48 pimpleFoam -parallel > log &
If that is the case, then you're trying to run all 48 of your processes on one node, because I don't see a --hostfile or --host option in your command. Since there are many possible explanations for this behaviour (you run all your processes on one node and it can't handle them, there is NFS caching so results don't appear in the log file immediately, etc.), you need to provide more information about the environment you're using before anyone can suggest anything.
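As an illustration of the --hostfile option mentioned above, a launch that spreads the processes explicitly might look like the sketch below. The node names are invented, the "slots=12" layout simply mirrors the five 12-core nodes described earlier, and the flag shown is Open MPI syntax (other MPI implementations use a different flag, e.g. -f, as seen in the next post):
Code:
# hypothetical hostfile listing the five 12-core nodes
cat > hostfile <<EOF
node1 slots=12
node2 slots=12
node3 slots=12
node4 slots=12
node5 slots=12
EOF
mpirun --hostfile hostfile -np 48 pimpleFoam -parallel > log 2>&1 &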
February 28, 2014, 10:41 |
#5 |
New Member
David H.
Join Date: Oct 2013
Posts: 25
Rep Power: 13 |
The host file seems to be managed by the cluster. When I run (the simulation is on 60 procs now, same cluster though):
Code:
mpirun -np 60 pimpleFoam -parallel > log &
the shell reports:
Code:
[6]+  Stopped    mpirun -f /shared/opt/mpihosts -np 60 pimpleFoam -parallel > log
I think this might be the issue I'm having: even though I use
Code:
> log &
the job ends up "Stopped". The strange part is that even though it was "Stopped", the load was still 100%. I think we can chalk this one up to a "Linux amateur" problem rather than a software one. Thanks for your time and input.
Last edited by djh2; February 28, 2014 at 14:10.
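If a backgrounded mpirun ends up "Stopped" like this, it can usually be inspected and resumed from the same shell. The commands below are a generic bash sketch (the job number just matches the output above), not something taken from the thread:
Code:
jobs            # list background jobs and their state ("Stopped" in this case)
bg %6           # resume job number 6 in the background
disown -h %6    # mark it so the shell does not send it SIGHUP on logout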
Tags |
parallel |
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Issues running custom code in parallel | BigBlueDart | OpenFOAM Programming & Development | 4 | October 23, 2013 07:17 |
OpenFoam Parallel running | shipman | OpenFOAM Running, Solving & CFD | 3 | August 17, 2013 11:50 |
Problem in Running OpenFoam in Parallel | himanshu28 | OpenFOAM Running, Solving & CFD | 1 | July 11, 2013 10:19 |
Running PimpleDyMFoam in parallel | paul b | OpenFOAM Running, Solving & CFD | 8 | April 20, 2011 06:21 |
running in parallel, at time t>0 | bunni | OpenFOAM | 1 | October 21, 2010 10:34 |