Job interrupted abruptly using remote cluster |
July 25, 2023, 18:11 |
|
#1 |
Member
Santhosh
Join Date: Nov 2021
Posts: 44
Rep Power: 4 |
Hello everyone,
I am trying to run a simulation on a remote cluster. The simulation works well when the mesh is coarse. However, when I refine the mesh, I get the following error:

srun: error: cdr2418: task 319: Killed
srun: Terminating StepId=8156913.0
slurmstepd: error: *** STEP 8156913.0 ON cdr2411 CANCELLED AT 2023-07-24T15:43:49 ***
srun: error: cdr2411: task 31: Killed
srun: error: cdr2415: task 145: Terminated
srun: error: cdr2411: task 1: Terminated
srun: error: cdr2415: tasks 144,186: Terminated
srun: error: cdr2412: task 49: Terminated
srun: error: cdr2412: task 63: Killed
srun: error: cdr2416: tasks 193,198,223-224,226: Terminated
srun: error: cdr2411: tasks 22,38: Terminated
srun: error: cdr2419: tasks 336-366,368-382: Terminated
srun: error: cdr2419: task 383: Killed
srun: error: cdr2416: tasks 195,197,199,201,203,205,207-209,211,213,215,217,219,221,225,227,229-231,233,235,237,239: Terminated
srun: error: cdr2415: tasks 146-185,187-191: Terminated
srun: error: cdr2412: tasks 48,50-62,64-95: Terminated
srun: error: cdr2411: tasks 0,2-21,23-30,32-37,39-47: Terminated
srun: error: cdr2419: task 367: Terminated
srun: error: cdr2416: tasks 192,194,196,200,202,204,206,210,212,214,216,218,220,222,228,232,234,236,238: Terminated
srun: error: cdr2417: tasks 240-287: Terminated
srun: error: cdr2414: tasks 97,99,101,105,107,109,125,133,137,141: Terminated
srun: error: cdr2414: tasks 96,98,100,102-104,106,108,110-124,126-132,134-136,138-140,142-143: Terminated
srun: error: cdr2418: tasks 288-318,320-335: Terminated
srun: Force Terminated StepId=8156913.0

Has anyone encountered this type of error? I hope you can help me out.
Sincerely,
Santhosh |
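A log like the one above is easier to read once the tasks are grouped by how they ended. In the posted output only four tasks (319, 31, 63 and 383) are reported as Killed, while the rest are Terminated. "Killed" usually means the process received SIGKILL, which is often (though not always) the kernel's out-of-memory killer at work; "Terminated" typically means SIGTERM, sent to the remaining tasks when the step is torn down. Below is a minimal Python sketch that does this grouping. The function names and the shortened sample log are illustrative only; they are not part of Slurm.

```python
import re
from collections import defaultdict

# Matches lines such as "srun: error: cdr2415: tasks 144,186: Terminated".
LINE_RE = re.compile(r"srun: error: (\S+): tasks? ([\d,-]+): (Killed|Terminated)")


def expand_task_ids(spec):
    """Expand a task list like '336-366,368-382' into a list of ints."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids


def summarize_srun_errors(log_text):
    """Return {status: {node: [task ids]}} for every srun error line found."""
    summary = defaultdict(lambda: defaultdict(list))
    for node, spec, status in LINE_RE.findall(log_text):
        summary[status][node].extend(expand_task_ids(spec))
    return summary


# A few lines copied from the log in this thread (shortened for illustration).
SAMPLE_LOG = """
srun: error: cdr2418: task 319: Killed
srun: error: cdr2411: task 31: Killed
srun: error: cdr2415: tasks 144,186: Terminated
srun: error: cdr2419: tasks 336-366,368-382: Terminated
"""

summary = summarize_srun_errors(SAMPLE_LOG)
for status, nodes in summary.items():
    for node, tasks in sorted(nodes.items()):
        print(f"{status:<10} {node}: {len(tasks)} task(s)")
```

If the Killed tasks are spread over several nodes, as they are here, it is worth checking the memory usage of the job before assuming a maintenance shutdown.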
|
July 27, 2023, 09:46 |
|
#2 |
New Member
Join Date: Nov 2019
Posts: 19
Rep Power: 6 |
Could you reproduce this error? Do you always get the same error message?
It looks to me like the job was cancelled. Maybe the HPC needed maintenance and all the jobs were terminated. |
|
July 28, 2023, 12:59 |
|
#3 |
Member
Santhosh
Join Date: Nov 2021
Posts: 44
Rep Power: 4 |
Hello requou,
Thanks for the input! I kept getting the same error, but I was launching my simulation on 3-4 whole nodes of the cluster, each node having about 48 CPUs. After checking the memory efficiency, I saw that I had requested too many CPUs, so I reran my simulation using only 8 CPUs on 1 node and it is running. I don't really understand why it didn't run on the 4 whole nodes though, even if the memory efficiency is bad. |
|
July 28, 2023, 14:03 |
|
#4 |
Member
Join Date: Nov 2019
Posts: 95
Rep Power: 6 |
I'm not an expert, but my experience is that the more processes a parallel job uses, the more memory it requires. So you can certainly find yourself in a situation where a job runs fine on 8 cores but not on 48 cores (assuming your node has 48 cores). A second thing is that each time you add another node, you also add a corresponding amount of memory. It is possible to specify how many processes you want to use on each node (the argument tends to be called something like PPN, processes per node). That way you can run larger jobs that otherwise wouldn't fit in memory had all the CPUs been utilized. But of course it's wasteful, as some cores are left idle.
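To put rough numbers on this, here is a small Python sketch of the arithmetic. The 192 GB of memory per node and the 6 GB needed per process are assumptions made purely for illustration (the thread only states about 48 CPUs per node), and both function names are hypothetical.

```python
def memory_per_process_gb(node_memory_gb, processes_per_node):
    """Memory share of each process if a node's memory is split evenly."""
    if processes_per_node <= 0:
        raise ValueError("processes_per_node must be positive")
    return node_memory_gb / processes_per_node


def max_processes_per_node(node_memory_gb, cores_per_node, needed_per_process_gb):
    """Largest process count per node that fits in memory, capped by the core count."""
    if needed_per_process_gb <= 0:
        raise ValueError("needed_per_process_gb must be positive")
    fit = int(node_memory_gb // needed_per_process_gb)
    return min(cores_per_node, fit)


NODE_MEMORY_GB = 192   # assumed value, not given in the thread
CORES_PER_NODE = 48    # "about 48 CPUs" per node, from the thread

# Using every core leaves each process a small share of the node's memory.
print(memory_per_process_gb(NODE_MEMORY_GB, CORES_PER_NODE))
# Using only 8 processes per node leaves each one a much larger share.
print(memory_per_process_gb(NODE_MEMORY_GB, 8))
# If each process needed about 6 GB (assumed), this many would fit on a node.
print(max_processes_per_node(NODE_MEMORY_GB, CORES_PER_NODE, 6))
```

On a Slurm cluster the processes-per-node setting usually corresponds to the `--ntasks-per-node` option of `sbatch`/`srun`, with `--mem` or `--mem-per-cpu` controlling the memory request; the exact limits depend on how the cluster is configured.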
|
|
July 28, 2023, 15:54 |
|
#5 |
Member
Santhosh
Join Date: Nov 2021
Posts: 44
Rep Power: 4 |
Hello Fliegender,
Thanks! I see, it makes a bit more sense. Actually, I always tried to use more CPUs/nodes in order to have faster simulations, but yeah, we have to be careful with memory issues, and with leaving some CPUs idle, which could maybe lead to issues. I guess that is a compromise to make. |
|
|
|