problems after decomposing for running

alessio.nz · April 18, 2011, 06:47

Hello, I had a mesh with the decomposeParDict included and I could use this file for running in parallel without problems. The mesh split well and the run was perfect (this file is set up to split my domain across more than one node of the cluster; each node has 8 cores, so for example I can run on 4 nodes = 32 cores).

I wanted to use the same file for another mesh. Splitting the domain across the 32 processors apparently went without errors:

Number of processor faces = 50892
Max number of processor patches = 8
Max number of faces between processors = 9008

Processor 0: field transfer
Processor 1: field transfer
Processor 2: field transfer
Processor 3: field transfer
Processor 4: field transfer
Processor 5: field transfer
Processor 6: field transfer
Processor 7: field transfer
Processor 8: field transfer
Processor 9: field transfer
Processor 10: field transfer
Processor 11: field transfer
Processor 12: field transfer
Processor 13: field transfer
Processor 14: field transfer
Processor 15: field transfer
Processor 16: field transfer
Processor 17: field transfer
Processor 18: field transfer
Processor 19: field transfer
Processor 20: field transfer
Processor 21: field transfer
Processor 22: field transfer
Processor 23: field transfer
Processor 24: field transfer
Processor 25: field transfer
Processor 26: field transfer
Processor 27: field transfer
Processor 28: field transfer
Processor 29: field transfer
Processor 30: field transfer
Processor 31: field transfer

End.

I then tried to run with foamJob -p simpleFoam, and it gives the following error:

Executing: mpirun -np 32 -hostfile system/machines /cvos/shared/apps/OpenFOAM/OpenFOAM-1.7.1/bin/foamExec simpleFoam -parallel > log 2>&1
[user@cluster]$ tail -f log
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

Do you know what it could be? I attach the file to the mail.

stevenvanharen · April 18, 2011, 07:33

It seems like the call to MPI generated by the foamJob script is not correct. (I am missing the file specifying the machines.)

Read section 3.4 in the user guide and try to run mpi without using the foamJob script.
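A minimal sketch of building the direct mpirun call, assuming the case has already been decomposed and that the hostfile lives at system/machines as in this thread. The helper names count_processor_dirs and build_mpirun_cmd are made up for illustration, and the script only prints the command so it can be checked before it is run.

```shell
# Count the processorN directories that decomposePar created in a case directory.
count_processor_dirs() {
    n=0
    for d in "$1"/processor[0-9]*; do
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Print (do not execute) an mpirun command whose -np matches the decomposition.
# Arguments: case directory, hostfile path, solver name.
build_mpirun_cmd() {
    case_dir=$1
    hostfile=$2
    solver=$3
    np=$(count_processor_dirs "$case_dir")
    echo "mpirun --hostfile $hostfile -np $np $solver -parallel"
}
```

Called from the case directory as `build_mpirun_cmd . system/machines simpleFoam`, it prints the command line to run; the -np value should agree with the number of subdomains that decomposePar reported.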

alessio.nz · April 18, 2011, 10:46

This is the command I ran:
mpirun --hostfile system/machines -np 32 SimpleFoam -parallel

and this is what I got:
--------------------------------------------------------------------------
Open RTE detected a parse error in the hostfile:
system/machines
It occured on line number 1 on token 1.
--------------------------------------------------------------------------
[elmo:11368] [[22308,0],0] ORTE_ERROR_LOG: Error in file base/ras_base_allocate.c at line 236
[elmo:11368] [[22308,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 72
[elmo:11368] [[22308,0],0] ORTE_ERROR_LOG: Error in file plm_rsh_module.c at line 990
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

stevenvanharen · April 18, 2011, 10:58

Quote:

Originally Posted by alessio.nz

--------------------------------------------------------------------------
Open RTE detected a parse error in the hostfile:
system/machines
It occured on line number 1 on token 1.
--------------------------------------------------------------------------

Somehow it is not happy with your machines file, are you sure you set the right names for the remote nodes in the "machines" file?
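For reference, a sketch of what an Open MPI hostfile for the 4 nodes x 8 cores setup described earlier might look like, with a small check that the slots add up to the -np value. The node names node01 to node04 are placeholders, not the real names on this cluster, and write_example_hostfile and count_slots are hypothetical helpers.

```shell
# Write an example hostfile to the given path.
# node01..node04 are placeholder host names (assumption, not from the thread).
write_example_hostfile() {
    cat > "$1" <<'EOF'
node01 slots=8
node02 slots=8
node03 slots=8
node04 slots=8
EOF
}

# Sum the explicit slots=N entries in a hostfile.
# Blank lines and comment lines are skipped; host lines without slots=
# are not counted, because the default differs between Open MPI versions.
count_slots() {
    awk '
        /^[ \t]*(#|$)/ { next }
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=[0-9]+$/) total += substr($i, 7)
        }
        END { print total + 0 }
    ' "$1"
}
```

If `count_slots system/machines` prints less than the -np value passed to mpirun, the hostfile and the decomposition do not match.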

alessio.nz · April 18, 2011, 11:11

Yes, I am sure. I was working with another mesh and it worked perfectly. The problem is that with this different one the splitting seems OK, but once I run it, it crashes with the errors I mentioned.

alessio.nz · April 20, 2011, 09:44

Hello, it finally worked; maybe there was a problem in the cluster itself. Anyway, thanks for the help. Regards

alireza2475 · December 23, 2015, 15:27

Quote:

Originally Posted by stevenvanharen

Somehow it is not happy with your machines file, are you sure you set the right names for the remote nodes in the "machines" file?

Just in case anyone else faces this problem:

There is something wrong in the hostname file, as Steven mentioned.
Sometimes, even if you copy a working file for a new run, it will not work. I suggest that you create another hostname file from scratch. I just had the same problem when running a setup that worked perfectly before. I simply wrote the machine names again and it works now.
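The thread never establishes why retyping the file helped. One explanation consistent with a parse error on line 1, token 1 is an invisible character in the file (for example Windows line endings or a byte-order mark), but that is an assumption, not something confirmed here. A minimal sketch for checking and cleaning a hostfile under that assumption; has_hidden_chars and clean_hostfile are hypothetical helper names.

```shell
# Succeed (exit status 0) if the file contains any byte other than
# printable ASCII, tab, or newline, e.g. a carriage return or a BOM.
has_hidden_chars() {
    leftover=$(LC_ALL=C tr -d '\11\12\40-\176' < "$1" | wc -c | tr -d ' ')
    [ "$leftover" -gt 0 ]
}

# Write a copy of the file to stdout, keeping only printable ASCII,
# tab, and newline. The original file is left untouched.
clean_hostfile() {
    LC_ALL=C tr -cd '\11\12\40-\176' < "$1"
}
```

If `has_hidden_chars system/machines` succeeds, `od -c system/machines | head` shows the raw bytes, and `clean_hostfile system/machines > machines.clean` writes a stripped copy to compare against.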

kashaf · March 5, 2021, 05:49

Quote:

Originally Posted by alessio.nz

Hello, it finally worked; maybe there was a problem in the cluster itself. Anyway, thanks for the help. Regards

Hey, hi. How did you resolve this issue? I am facing the same error.
