
linear solver crashing during transient run on HPC cluster

August 20, 2020, 20:30
  #1
New Member
 
Geoff
Join Date: Nov 2017
Posts: 4
I have been encountering runs that crash with a fatal overflow in the linear solver at seemingly random points in the transient solution when running on multiple compute nodes of an HPC cluster.

I doubt the model setup, time step, or mesh is to blame. The identical job appears to run fine on a local computer, on a different single-node compute server, or on a single node of the HPC cluster. I have been running similar cases without problems for a while now, but always on a single compute node.

The fatal overflow occurs well into the transient simulation, with nothing physically unusual happening at that point. When the run crashes and I restart it from a backup file, it crashes at the exact same time step. If I restart on a different single-node machine, the simulation runs past the point where it crashed. If I then stop that run and use it to initialize another 2-node run on the HPC cluster, it runs for a little while longer and then crashes again.
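In case it is useful, the crash point can be confirmed by pulling the last time step header and the overflow message out of each run's .out file. A minimal Python sketch, assuming the step header contains 'TIME STEP' and the error line contains 'overflow' (the exact wording in the CFX output varies by version, so treat the search strings as assumptions rather than the actual solver messages):

Code:
import re
import sys

# Rough sketch: scan a CFX solver .out file and report the last time step
# reached plus any lines mentioning an overflow. The search patterns are
# assumptions -- adjust them to the exact wording in your own .out files.
def crash_point(out_file):
    last_step = None
    failures = []
    with open(out_file, errors="replace") as f:
        for line in f:
            match = re.search(r"TIME STEP\s*=?\s*(\d+)", line, re.IGNORECASE)
            if match:
                last_step = int(match.group(1))
            if "overflow" in line.lower():
                failures.append(line.strip())
    return last_step, failures

if __name__ == "__main__":
    step, messages = crash_point(sys.argv[1])
    print(f"last time step seen: {step}")
    for msg in messages:
        print(msg)

Running it on the single-node and two-node output files side by side makes it easy to check that the restarts really do die on the same accumulated time step.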

I compared the output files of the single-node and multi-node jobs and noticed that the 'multi-step communication method for linear solver' was activated when I used 2 nodes but not for a single node. I disabled the multi-step communication method and tried another 2-node job. At first this appeared to solve the problem, and the run got past the point where it had previously crashed, but it has now crashed later in the simulation.
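The comparison of the two output files can be scripted in the same spirit. A rough sketch (the keyword list and file names are placeholders rather than actual CFX output wording):

Code:
# Sketch: collect partitioner / linear solver related lines from two CFX .out
# files and show which lines appear in only one of the runs. The keyword list
# and file names are placeholders -- extend them as needed.
KEYWORDS = ("partition", "communication", "linear solver", "parallel")

def settings_lines(out_file):
    with open(out_file, errors="replace") as f:
        return {line.strip() for line in f
                if any(key in line.lower() for key in KEYWORDS)}

single = settings_lines("single_node.out")  # placeholder file names
multi = settings_lines("two_node.out")

print("Only in the two-node run:")
for line in sorted(multi - single):
    print("  " + line)

print("Only in the single-node run:")
for line in sorted(single - multi):
    print("  " + line)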

I have read that the partitioning method can behave differently with different numbers of cores. Please correct me if I am wrong, but if the partitioning method were the problem, I would expect the run to crash on the first time step.

Unfortunately, because of the scheduler on the HPC cluster, it is preferable for me to run one 2-node job rather than two 1-node jobs, so I would really like to get to the root of this problem.
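For reference, a 2-node run with an explicit host distribution can be launched roughly as below. This is only a sketch: it assumes a SLURM-type scheduler and the cfx5solve -par-dist and -start-method options, and the definition file name, cores per node, and start method string are placeholders that should be checked against cfx5solve -help on your installation.

Code:
import os
import subprocess

# Sketch: build an explicit host list for a two-node CFX run from the
# scheduler's node list and pass it to cfx5solve. Assumes a SLURM scheduler;
# the cores-per-node count, definition file name and start method string are
# placeholders -- check `cfx5solve -help` for the options your version supports.
CORES_PER_NODE = 32  # placeholder: set to the cores available per node

node_names = subprocess.run(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    capture_output=True, text=True, check=True,
).stdout.split()

par_dist = ",".join(f"{name}*{CORES_PER_NODE}" for name in node_names)

subprocess.run(
    ["cfx5solve",
     "-def", "case.def",                                   # placeholder
     "-par-dist", par_dist,
     "-start-method", "Intel MPI Distributed Parallel"],   # version dependent
    check=True,
)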

August 20, 2020, 23:49
  #2
Super Moderator
 
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,852
The first thing to try in situations like this is a different partitioning algorithm. Your statement that poor partitioning would cause the run to crash on the first time step is incorrect: when a region of very high gradients (e.g. a shock wave or a free surface) lies on top of a partition boundary, it can cause convergence problems. This can happen at any point in a simulation, as shocks and free surfaces move.

So try some of the other partitioning methods and see if that fixes it.
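Something like the loop below can cycle through them from a restart and show which one gets past the crash point. It is only a sketch: the -part-mode flag and the mode names are assumptions based on the cfx5solve command line reference and may differ between versions (check cfx5solve -help), and the file names and partition count are placeholders.

Code:
import subprocess

# Sketch: re-run the same restart with different partitioning methods and see
# which one gets past the crash point. The -part-mode flag and the mode names
# are assumptions based on the cfx5solve command-line reference and may differ
# by version (check `cfx5solve -help`); file names and the partition count are
# placeholders.
CANDIDATE_MODES = ["metis-kway", "metis-rec", "rcb", "orcb"]

for mode in CANDIDATE_MODES:
    print(f"--- trying partitioning mode: {mode} ---")
    result = subprocess.run(
        ["cfx5solve",
         "-def", "case.def",          # placeholder definition file
         "-ini-file", "backup.res",   # restart from the backup near the crash
         "-part", "64",               # placeholder partition count
         "-part-mode", mode])
    if result.returncode == 0:
        print(f"partitioning mode {mode} ran to completion")
        break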

Also check that your HPC is stable - does a simulation that is known to run well (e.g. a tutorial example) run successfully in the same parallel mode?
__________________
Note: I do not answer CFD questions by PM. CFD questions should be posted on the forum.

August 21, 2020, 10:02
  #3
Senior Member
 
Join Date: Jun 2009
Posts: 1,869
Since it is a sliding mesh simulation, can you tell whether there is a relation to the relative mesh configuration when it fails?

That is, will it fail every time it passes through the same relative configuration?

August 28, 2020, 19:53
  #4
New Member
 
Geoff
Join Date: Nov 2017
Posts: 4
Thanks for the response. That is good to know about the partitioning methods. I seem to have resolved the problem by specifying the compute nodes differently. Using 'user specified partitioning direction' also seems to have improved my convergence rate slightly.
