linear solver crashing during transient run on HPC cluster |
August 20, 2020, 20:30 |
linear solver crashing during transient run on HPC cluster
|
#1 |
New Member
Geoff
Join Date: Nov 2017
Posts: 4
Rep Power: 8 |
I have been encountering problems with runs crashing due to a fatal overflow in the linear solver at seemingly random points in the transient solution, while running on multiple compute nodes of an HPC cluster.
I doubt the model setup, time step, or mesh is to blame. The identical job appears to run fine if I use a local computer, a different single-node compute server, or a single node on the HPC cluster. I had been running similar cases without problems for a while now, but on a single compute node.

The fatal overflow occurs well into the transient simulation, at a point where nothing physically unique is happening. When the run crashes and I restart it from a backup file, it crashes at the exact same time step. If I restart on a different single-node machine, the simulation runs past the point where it crashed. If I stop that run after it passes the crash point and use it to initialize another run on the HPC cluster using 2 nodes, it runs a little while longer and then crashes again.

I compared the output files of the single-node and multi-node jobs and noticed that the 'multi-step communication method for linear solver' was being activated when I used 2 nodes but not for a single node. I disabled the multi-step communication method and tried another 2-node job. At first this appeared to solve the problem: the run got past the point where it crashed previously, but it has now crashed later in the simulation.

I have read that the partitioning method can behave differently with different numbers of cores. Please correct me if I am wrong, but if the partitioning method were the problem, I would expect the run to crash on the first time step. Unfortunately, it is more desirable for me to run one 2-node job than two 1-node jobs due to the scheduler on the HPC cluster, so I would really like to get to the root of this problem.
|
August 20, 2020, 23:49 |
|
#2 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,852
Rep Power: 144 |
The first thing to try in situations like this is a different partitioning algorithm. Your statement that poor partitioning would cause the run to crash on the first time step is incorrect: when a region of very high gradients (e.g. a shock wave or a free surface) lies on top of a partition boundary, it can cause convergence problems. This can happen at any point in a simulation, since shocks and free surfaces can move.
So try some of the other partitioning methods and see if that fixes it. Also check that your HPC cluster itself is stable: does a simulation which is known to run well (e.g. a tutorial example) run successfully in the same parallel mode?
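The point about a moving high-gradient region landing on a partition boundary can be sketched with a toy model. This is purely illustrative Python, not CFX code; the 1D domain, equal-slab partitioning, shock speed, and tolerance are all made-up assumptions. It shows why a serial run (no internal partition boundaries) never hits the failure condition, while partitioned runs fail at a reproducible time step that depends on where their boundaries sit:

```python
# Toy model: a sharp gradient ("shock") moves across a unit domain.
# A step is "bad" when the shock sits on an internal partition boundary.

def internal_boundaries(n_parts):
    """Internal boundary positions for n_parts equal slabs of [0, 1)."""
    return [i / n_parts for i in range(1, n_parts)]

def first_bad_step(n_parts, n_steps=2000, speed=1e-3, eps=5e-4):
    """First time step at which the shock lies on an internal partition
    boundary (within eps), or None if it never does."""
    bounds = internal_boundaries(n_parts)
    for step in range(n_steps):
        x = (speed * step) % 1.0          # shock position at this step
        if any(abs(x - b) < eps for b in bounds):
            return step
    return None

print(first_bad_step(1))   # serial: no internal boundaries -> None
print(first_bad_step(2))   # boundary at 0.5 -> fails at step 500
print(first_bad_step(3))   # boundaries at 1/3, 2/3 -> fails at step 333
```

Changing the partitioning (here, `n_parts`) moves the boundaries, so the failure shifts to a different but still reproducible step — consistent with a run that always crashes at the same time step on one node layout, yet crashes elsewhere (or not at all) on another.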
__________________
Note: I do not answer CFD questions by PM. CFD questions should be posted on the forum. |
|
August 21, 2020, 10:02 |
|
#3 |
Senior Member
Join Date: Jun 2009
Posts: 1,869
Rep Power: 33 |
Since it is a sliding mesh simulation, can you tell if there is a relation to the relative configuration when it fails?
That is, will it fail every time it passes through the same relative configuration?
|
August 28, 2020, 19:53 |
|
#4 |
New Member
Geoff
Join Date: Nov 2017
Posts: 4
Rep Power: 8 |
Thanks for the response. That is good to know about the partitioning methods. I seem to have resolved the problem by specifying the compute nodes differently. Using 'user specified partitioning direction' also seems to have improved my convergence rate slightly.
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
How to run SU2 on HPC cluster in parallel on HPC cluster? | Samirs | Main CFD Forum | 0 | July 13, 2018 01:44 |
Problem with running customized solver parallel on cluster | shinri1217 | OpenFOAM Running, Solving & CFD | 0 | June 27, 2018 14:26 |
how to modify fvScheme to converge? | immortality | OpenFOAM Running, Solving & CFD | 15 | January 16, 2013 14:06 |
OpenCL linear solver for OpenFoam 1.7 (alpha) will come out very soon | qinmaple | OpenFOAM Announcements from Other Sources | 4 | August 10, 2012 12:00 |
convergence problem in using incompressible transient solvers. | Geon-Hong | OpenFOAM Running, Solving & CFD | 13 | November 24, 2011 06:48 |