www.cfd-online.com

Job interrupted abruptly using remote cluster

Old   July 25, 2023, 18:11
  #1
Member
 
Santhosh
Join Date: Nov 2021
Posts: 44
Rep Power: 4
Santhosh91 is on a distinguished road
Hello everyone,


I am trying to run a simulation on a remote cluster. The simulation runs fine when the mesh is coarse. However, when I refine the mesh, I get the following error:
''srun: error: cdr2418: task 319: Killed
srun: Terminating StepId=8156913.0
slurmstepd: error: *** STEP 8156913.0 ON cdr2411 CANCELLED AT 2023-07-24T15:43:49 ***
srun: error: cdr2411: task 31: Killed
srun: error: cdr2415: task 145: Terminated
srun: error: cdr2411: task 1: Terminated
srun: error: cdr2415: tasks 144,186: Terminated
srun: error: cdr2412: task 49: Terminated
srun: error: cdr2412: task 63: Killed
srun: error: cdr2416: tasks 193,198,223-224,226: Terminated
srun: error: cdr2411: tasks 22,38: Terminated
srun: error: cdr2419: tasks 336-366,368-382: Terminated
srun: error: cdr2419: task 383: Killed
srun: error: cdr2416: tasks 195,197,199,201,203,205,207-209,211,213,215,217,219,221,225,227,229-231,233,235,237,239: Terminated
srun: error: cdr2415: tasks 146-185,187-191: Terminated
srun: error: cdr2412: tasks 48,50-62,64-95: Terminated
srun: error: cdr2411: tasks 0,2-21,23-30,32-37,39-47: Terminated
srun: error: cdr2419: task 367: Terminated
srun: error: cdr2416: tasks 192,194,196,200,202,204,206,210,212,214,216,218,220,222,228,232,234,236,238: Terminated
srun: error: cdr2417: tasks 240-287: Terminated
srun: error: cdr2414: tasks 97,99,101,105,107,109,125,133,137,141: Terminated
srun: error: cdr2414: tasks 96,98,100,102-104,106,108,110-124,126-132,134-136,138-140,142-143: Terminated
srun: error: cdr2418: tasks 288-318,320-335: Terminated
srun: Force Terminated StepId=8156913.0
''


Has anyone encountered this type of error?

Hope you can help me out.


Sincerely,
Santhosh

Old   July 27, 2023, 09:46
  #2
New Member
 
Join Date: Nov 2019
Posts: 19
Rep Power: 6
requou is on a distinguished road
Could you reproduce this error? Do you always get the same error message?

It looks to me like the job was cancelled. Maybe the HPC needed maintenance and all the jobs were terminated.
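
One way to narrow this down is to separate the tasks that srun reports as "Killed" from those reported as "Terminated". The Python sketch below does that for a few lines copied from the srun output in the first post. The function names are made up for illustration. The interpretation is an assumption, not something the log confirms: "Killed" usually corresponds to SIGKILL, which is what an out-of-memory kill sends, while "Terminated" is what the remaining tasks get when the step is torn down.

```python
import re
from collections import defaultdict

# Matches lines such as "srun: error: cdr2411: tasks 22,38: Terminated"
LINE_RE = re.compile(r"srun: error: (\S+): tasks? ([\d,\- ]+): (\w+)")


def expand_tasks(spec):
    """Expand a task list like '0,2-4' into [0, 2, 3, 4]; stray spaces are ignored."""
    ids = []
    for part in spec.replace(" ", "").split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        elif part:
            ids.append(int(part))
    return ids


def tasks_by_status(log_text):
    """Return {status: {node: [task ids]}} for every 'srun: error' line."""
    result = defaultdict(lambda: defaultdict(list))
    for match in LINE_RE.finditer(log_text):
        node, spec, status = match.groups()
        result[status][node].extend(expand_tasks(spec))
    return result


# A few lines taken from the log in the first post.
sample = """\
srun: error: cdr2418: task 319: Killed
srun: error: cdr2411: task 31: Killed
srun: error: cdr2415: task 145: Terminated
srun: error: cdr2412: task 63: Killed
srun: error: cdr2419: task 383: Killed
srun: error: cdr2411: tasks 22,38: Terminated
"""

# Print only the tasks that were killed outright, grouped by node.
killed = tasks_by_status(sample)["Killed"]
for node, ids in sorted(killed.items()):
    print(node, ids)
```

If only a handful of tasks spread over several nodes were killed and everything else was terminated afterwards, a per-node resource limit is a more likely cause than a cluster-wide maintenance shutdown, but the scheduler's accounting records would be needed to confirm that.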

Old   July 28, 2023, 12:59
  #3
Member
 
Santhosh
Join Date: Nov 2021
Posts: 44
Rep Power: 4
Santhosh91 is on a distinguished road
Hello requou,


Thanks for the input! I kept getting the same error, but I was launching my simulation on 3-4 whole nodes of the cluster, with each node having about 48 CPUs.


After checking the memory efficiency, I saw that I had requested too many CPUs, so I re-ran my simulation using only 8 CPUs on 1 node, and it is running.


I don't really understand why it didn't run on the 4 whole nodes though, even if the memory efficiency is bad.

Old   July 28, 2023, 14:03
  #4
Member
 
Join Date: Nov 2019
Posts: 95
Rep Power: 6
FliegenderZirkus is on a distinguished road
I'm not an expert, but my experience is that the more processes a parallel job uses, the more memory it requires. So you can certainly find yourself in a situation where a job runs fine on 8 cores but not on 48 cores (assuming your node has 48 cores). A second thing is that each time you add another node, you also add a corresponding amount of memory. It is possible to specify how many processes you want to use on each node (the argument tends to be called something like PPN, processes per node). That way you can run larger jobs that wouldn't fit in memory if all the CPUs were utilized. But of course it's wasteful, as some cores are left idle.
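
To make the trade-off concrete, here is a rough Python sketch of that reasoning. Everything in it is an assumption for illustration: the memory model (mesh data split evenly across nodes plus a fixed overhead per process), the function names, and all the numbers (400 GB of mesh data, 2 GB overhead per process, 180 GB usable per node). Only the 48 cores per node comes from the thread. On Slurm, the PPN setting corresponds to the `--ntasks-per-node` option.

```python
def node_memory_gb(mesh_gb, overhead_gb_per_task, nodes, tasks_per_node):
    """Estimated memory used on one node under the assumed model.

    The mesh data is assumed to be split evenly across nodes, and every
    process adds a fixed overhead (buffers, halo cells, solver copies).
    """
    return mesh_gb / nodes + tasks_per_node * overhead_gb_per_task


def max_tasks_per_node(mesh_gb, overhead_gb_per_task, nodes, node_mem_gb, cores_per_node):
    """Largest processes-per-node count that fits in memory, or 0 if none does."""
    for ppn in range(cores_per_node, 0, -1):
        if node_memory_gb(mesh_gb, overhead_gb_per_task, nodes, ppn) <= node_mem_gb:
            return ppn
    return 0


# Hypothetical numbers, chosen only to illustrate the effect.
MESH_GB = 400
OVERHEAD_GB = 2
NODE_MEM_GB = 180
CORES = 48

for nodes in (2, 4, 8):
    ppn = max_tasks_per_node(MESH_GB, OVERHEAD_GB, nodes, NODE_MEM_GB, CORES)
    if ppn == 0:
        print(f"{nodes} nodes: does not fit at any process count")
    else:
        print(f"{nodes} nodes: #SBATCH --nodes={nodes} --ntasks-per-node={ppn}")
```

With these made-up numbers, filling all 48 cores on 4 nodes would exceed the node memory, while a reduced process count per node fits. Real values would have to come from measuring the solver's memory use on the actual mesh.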

Old   July 28, 2023, 15:54
  #5
Member
 
Santhosh
Join Date: Nov 2021
Posts: 44
Rep Power: 4
Santhosh91 is on a distinguished road
Hello Fliegender,


Thanks! I see, that makes a bit more sense. Actually, I always tried to use more CPUs/nodes in order to get faster simulations, but yeah, we have to be careful with memory issues, even if that means leaving some CPUs idle. I guess that is a compromise to make.

All times are GMT -4.