
linear solver crashing during transient run on HPC cluster

August 20, 2020, 20:30
  #1
New Member
 
Geoff
Join Date: Nov 2017
Posts: 4
I have been encountering runs that crash with a fatal overflow in the linear solver at seemingly random points in the transient solution when running on multiple compute nodes of an HPC cluster.

I doubt the model setup, time step, or mesh is to blame. The identical job appears to run fine on a local computer, on a different single-node compute server, or on a single node of the HPC cluster. I have been running similar cases without problems for a while now, but always on a single compute node.

The fatal overflow occurs well into the transient simulation, with nothing physically unusual happening at that point. When the run crashes and I restart it from a backup file, it crashes at the exact same time step. If I restart on a different single-node machine, the simulation runs past the point where it crashed. If I then stop that run and use it to initialize another 2-node run on the HPC cluster, it runs for a little while longer and then crashes again.
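In case it is useful, the crash point can be confirmed by pulling the last time step header and the overflow message out of each run's .out file. A minimal Python sketch, assuming the step header contains 'TIME STEP' and the error line contains 'overflow' (the exact wording in the CFX output varies by version, so treat the search strings as assumptions rather than the actual solver messages):

Code:
import re
import sys

# Rough sketch: scan a CFX solver .out file and report the last time step
# reached plus any lines mentioning an overflow. The search patterns are
# assumptions -- adjust them to the exact wording in your own .out files.
def crash_point(out_file):
    last_step = None
    failures = []
    with open(out_file, errors="replace") as f:
        for line in f:
            match = re.search(r"TIME STEP\s*=?\s*(\d+)", line, re.IGNORECASE)
            if match:
                last_step = int(match.group(1))
            if "overflow" in line.lower():
                failures.append(line.strip())
    return last_step, failures

if __name__ == "__main__":
    step, messages = crash_point(sys.argv[1])
    print(f"last time step seen: {step}")
    for msg in messages:
        print(msg)

Running it on the single-node and two-node output files side by side makes it easy to check that the restarts really do die on the same accumulated time step.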

I compared the output files of the single-node and multi-node jobs and noticed that the 'multi-step communication method for linear solver' was activated when I used 2 nodes but not for a single node. I disabled the multi-step communication method and tried another 2-node job. At first this appeared to solve the problem, and the run got past the point where it had previously crashed, but it has now crashed later in the simulation.
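The comparison of the two output files can be scripted in the same spirit. A rough sketch (the keyword list and file names are placeholders rather than actual CFX output wording):

Code:
# Sketch: collect partitioner / linear solver related lines from two CFX .out
# files and show which lines appear in only one of the runs. The keyword list
# and file names are placeholders -- extend them as needed.
KEYWORDS = ("partition", "communication", "linear solver", "parallel")

def settings_lines(out_file):
    with open(out_file, errors="replace") as f:
        return {line.strip() for line in f
                if any(key in line.lower() for key in KEYWORDS)}

single = settings_lines("single_node.out")  # placeholder file names
multi = settings_lines("two_node.out")

print("Only in the two-node run:")
for line in sorted(multi - single):
    print("  " + line)

print("Only in the single-node run:")
for line in sorted(single - multi):
    print("  " + line)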

I have read that the partitioning method can behave differently with different numbers of cores. Please correct me if I am wrong, but if the partitioning method were the problem, I would expect the run to crash on the first time step.

Unfortunately, because of the scheduler on the HPC cluster, it is preferable for me to run one 2-node job rather than two 1-node jobs, so I would really like to get to the root of this problem.
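For reference, a 2-node run with an explicit host distribution can be launched roughly as below. This is only a sketch: it assumes a SLURM-type scheduler and the cfx5solve -par-dist and -start-method options, and the definition file name, cores per node, and start method string are placeholders that should be checked against cfx5solve -help on your installation.

Code:
import os
import subprocess

# Sketch: build an explicit host list for a two-node CFX run from the
# scheduler's node list and pass it to cfx5solve. Assumes a SLURM scheduler;
# the cores-per-node count, definition file name and start method string are
# placeholders -- check `cfx5solve -help` for the options your version supports.
CORES_PER_NODE = 32  # placeholder: set to the cores available per node

node_names = subprocess.run(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    capture_output=True, text=True, check=True,
).stdout.split()

par_dist = ",".join(f"{name}*{CORES_PER_NODE}" for name in node_names)

subprocess.run(
    ["cfx5solve",
     "-def", "case.def",                                   # placeholder
     "-par-dist", par_dist,
     "-start-method", "Intel MPI Distributed Parallel"],   # version dependent
    check=True,
)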

August 20, 2020, 23:49
  #2
Super Moderator
 
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,852
The first thing to try in situations like this is a different partitioning algorithm. Your statement that poor partitioning would cause the run to crash on the first time step is incorrect: when a region of very high gradients (e.g. a shock wave or a free surface) lies on top of a partition boundary, it can cause convergence problems. This can happen at any point in a simulation, as shocks and free surfaces move.

So try some of the other partitioning methods and see if that fixes it.
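Something like the loop below can cycle through them from a restart and show which one gets past the crash point. It is only a sketch: the -part-mode flag and the mode names are assumptions based on the cfx5solve command line reference and may differ between versions (check cfx5solve -help), and the file names and partition count are placeholders.

Code:
import subprocess

# Sketch: re-run the same restart with different partitioning methods and see
# which one gets past the crash point. The -part-mode flag and the mode names
# are assumptions based on the cfx5solve command-line reference and may differ
# by version (check `cfx5solve -help`); file names and the partition count are
# placeholders.
CANDIDATE_MODES = ["metis-kway", "metis-rec", "rcb", "orcb"]

for mode in CANDIDATE_MODES:
    print(f"--- trying partitioning mode: {mode} ---")
    result = subprocess.run(
        ["cfx5solve",
         "-def", "case.def",          # placeholder definition file
         "-ini-file", "backup.res",   # restart from the backup near the crash
         "-part", "64",               # placeholder partition count
         "-part-mode", mode])
    if result.returncode == 0:
        print(f"partitioning mode {mode} ran to completion")
        break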

Also check that your HPC is stable - does a simulation that is known to run well (e.g. a tutorial example) run successfully in the same parallel mode?
__________________
Note: I do not answer CFD questions by PM. CFD questions should be posted on the forum.

August 21, 2020, 10:02
  #3
Senior Member
 
Join Date: Jun 2009
Posts: 1,869
Since it is a sliding mesh simulation, can you tell whether there is a relation to the relative mesh configuration when it fails?

That is, will it fail every time it passes through the same relative configuration?

August 28, 2020, 19:53
  #4
New Member
 
Geoff
Join Date: Nov 2017
Posts: 4
Thanks for the response. That is good to know about the partitioning methods. I seem to have resolved the problem by specifying the compute nodes differently. Using 'user specified partitioning direction' also seems to have improved my convergence rate slightly.
