|
November 8, 2016, 17:43 |
Trouble running SU2 in parallel on cluster
|
#1 |
New Member
Devin Gibson
Join Date: Nov 2016
Posts: 2
Rep Power: 0 |
I am not sure exactly which forum this is most appropriate for, but I figure the Hardware forum works because, as far as I can tell, the problem is with the hardware setup and not the software.
I work for one of my professors, and we are trying to run SU2 in parallel on a university-owned cluster that uses slurm as its workload manager. The problem we are running into is that when we ssh into the cluster and run the command parallel_computation.py -f SU2.cfg on a node assigned by slurm (using sbatch), the code hangs and won't run. The strange thing is that the same command works just fine on the login node. Does anyone know what the problem could be?

Some additional information:
- We talked with the IT person in charge of the cluster, and he doesn't have enough background to know what is going on.
- Some of our output files contained the escape sequence [!0134h; after we changed the terminal settings to get rid of it, the behavior was the same as described above.
- We can run the code in serial (SU2_CFD <config file>) just fine on both the login node and the compute nodes.
- We have tried running an interactive session on a node (using srun); no change in behavior.

Any thoughts would be appreciated! We really want to be able to run the code in-house instead of outsourcing it. |
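One quick way to isolate a hang like this (a sketch only; the srun options and module name are assumptions and will vary by cluster) is to test whether mpirun itself works on a compute node, independently of SU2: Code:
# request an interactive shell on one compute node
srun --nodes=1 --ntasks=2 --pty bash

# load the same MPI module used for the SU2 runs
module load openmpi
which mpirun

# run a trivial non-MPI program under mpirun; if this also
# hangs, the problem is the MPI/cluster setup rather than SU2
mpirun -n 2 hostname
If mpirun -n 2 hostname returns promptly, the launcher itself is fine and attention shifts to how parallel_computation.py builds its run command under slurm.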
|
November 10, 2016, 19:40 |
|
#2 |
New Member
California
Join Date: Nov 2016
Posts: 10
Rep Power: 10 |
I know it's only been two days since this post, but did you make any progress? I'm trying to run SU2 on a cluster with slurm as well.
I can run it fine in serial on the login node, but I'm not sure how to submit it in parallel. |
|
November 10, 2016, 19:58 |
|
#3 |
New Member
Devin Gibson
Join Date: Nov 2016
Posts: 2
Rep Power: 0 |
No progress...
I have been doing some more tests on the kingspeak cluster at the University of Utah to see whether the error is consistent, and recently I have been getting the following error: Code:
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[47035,1],0]
  Exit code:    127
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/parallel_computation.py", line 110, in <module>
    main()
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/parallel_computation.py", line 61, in main
    options.compute )
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/parallel_computation.py", line 88, in parallel_computation
    info = SU2.run.CFD(config)
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2/run/interface.py", line 110, in CFD
    run_command( the_Command )
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2/run/interface.py", line 268, in run_command
    raise exception , message
RuntimeError: Path = /uufs/chpc.utah.edu/common/home/<uNID>/SU2-Tests/Users/2118/D2602EDB-0B2F-46C7-A93C-5290D2F8DA50/,
Command = mpirun -n 2 /uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2_CFD config_CFD.cfg
SU2 process returned error '127'
/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2_CFD: symbol lookup error: /uufs/chpc.utah.edu/sys/installdir/intel/impi/5.1.1.109/intel64/lib/libmpifort.so.12: undefined symbol: MPI_UNWEIGHTED

In reference to your question about running in parallel, here is the SLURM batch script I use with the sbatch command: Code:
#!/bin/bash
#SBATCH --account=owner-guest
#SBATCH --partition=kingspeak-guest
#SBATCH --job-name=NACA-2412
#SBATCH --nodes=2
#SBATCH --ntasks=12
#SBATCH --time=02:00:00
#SBATCH -o slurmjob-%j.out
#SBATCH -e slurmjob-%j.err

module load openmpi
module load su2

parallel_computation.py -f SU2.cfg
##################################################### |
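A note on the symbol lookup error in that traceback: SU2_CFD is resolving its MPI symbols from an Intel MPI library (the intel/impi path), while the batch script loads openmpi, which points to a mismatch between the MPI that this SU2 4.0i build was compiled against and the one loaded at run time. One way to check (a sketch; the SU2_CFD path is taken from the traceback above) is to inspect which MPI libraries the binary actually picks up: Code:
# load the same modules the batch script uses
module load openmpi
module load su2

# list the shared libraries SU2_CFD will resolve at run time;
# all of the MPI entries should come from a single MPI install
ldd /uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2_CFD | grep -i mpi
If the output mixes openmpi and intel/impi paths, loading the matching Intel MPI module instead of openmpi (or asking the site admins which MPI this SU2 build expects) is the likely fix.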
|
December 13, 2016, 18:35 |
|
#4 |
Senior Member
Zach Davis
Join Date: Jan 2010
Location: Los Angeles, CA
Posts: 101
Rep Power: 16 |
In order to run SU2 in parallel under a job scheduler, the code needs a few things: the number of MPI processes passed explicitly to parallel_computation.py, and the scheduler's environment visible to the script so it can detect the batch system.
Based on the error message you're receiving, SU2 is not detecting that the SLURM_JOBID environment variable is set, so it's defaulting to a plain mpirun command. Your slurm batch script is also not passing the 12 MPI processes you want to the parallel_computation.py command. It should look like:

parallel_computation.py -n 12 -f SU2.cfg > su2.out 2>&1

The redirect of standard output and standard error to a file named su2.out isn't necessary, but it is good practice for capturing SU2's output.

It appears you're running with:

mpirun -n 2 SU2_CFD config_CFD.cfg

which isn't what you want. This suggests there may be another slurm header variable you need to add to your script to indicate how many CPU cores are available on each node. Perhaps there is a machinefile that the slurm process uses to determine this, but as I mentioned above, the SLURM_JOBID environment variable isn't set, so SU2 is bypassing slurm altogether.

I don't use slurm, but if there is an environment variable corresponding to --ntasks, you could use it in the run command of your script instead of hard-coding the value 12 as in this example. That way you wouldn't have to update the value in two places for each run. PBS has such a variable, but I'm not familiar enough with slurm to say whether it does.

Best Regards,
Zach |
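Slurm does in fact export such a variable: SLURM_NTASKS mirrors the --ntasks header. A sketch of how the script from post #3 could be revised along those lines (untested; the account, partition, and module names are carried over from that post and will differ on other clusters): Code:
#!/bin/bash
#SBATCH --account=owner-guest
#SBATCH --partition=kingspeak-guest
#SBATCH --job-name=NACA-2412
#SBATCH --nodes=2
#SBATCH --ntasks=12
#SBATCH --time=02:00:00
#SBATCH -o slurmjob-%j.out
#SBATCH -e slurmjob-%j.err

# load the MPI stack the SU2 build actually expects
# (see the ldd check after post #3), then SU2 itself
module load openmpi
module load su2

# SLURM_NTASKS is set by slurm to match --ntasks above, so the
# process count is defined in exactly one place in this script
parallel_computation.py -n $SLURM_NTASKS -f SU2.cfg > su2.out 2>&1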
|
January 4, 2017, 15:58 |
|
#5 |
New Member
Oliver V
Join Date: Dec 2015
Posts: 17
Rep Power: 10 |
Hello,
I've been having the exact same error recently: Code:
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[47035,1],0]
  Exit code:    127

Oliver |
|
Tags |
cfd, cluster, parallel, slurm, su2 |
|