
Trouble running SU2 in parallel on cluster


November 8, 2016, 17:43   #1
devinmgibson (New Member)
Devin Gibson
Join Date: Nov 2016
Posts: 2
I am not sure which forum this is most appropriate for, but I figure the Hardware forum works because, as far as I can tell, the problem is with the hardware rather than the software.

I work for one of my professors, and we are trying to run SU2 in parallel on a university-owned cluster that uses slurm as its workload manager. The problem we are running into is that when we ssh into the cluster and run the command:

parallel_computation.py -f SU2.cfg

on a node assigned by slurm (using sbatch), the code hangs and won't run. The weird thing is that if we run the same command on the login node, it works just fine. Do any of you know what could possibly be the problem?

Here is some additional information:
- We talked with the IT guy in charge of the cluster and he doesn't have enough background to know what is going on.
- In some of our output files we would see the escape sequence [!0134h; after we changed the terminal settings to get rid of it, the code behaved the same as described above.
- We can run the code in serial (SU2_CFD "config file") just fine on both the login node and the compute nodes.
- We have tried running an interactive session on a node (using srun); there was no change in behavior.

Any thoughts would be appreciated! We really want to be able to run the code in-house instead of outsourcing it.
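
For anyone debugging something similar, a minimal way to compare the batch environment with the login environment (assuming SU2 is provided through environment modules and SU2_CFD is on the PATH) is to run a few checks in both places and diff the output:

Code:
# environment check (sketch): run on the login node and again inside an
# sbatch/srun job, then compare the output of the two runs
hostname                               # which node this is actually running on
module list 2>&1                       # which modules are loaded in this shell
which mpirun                           # which MPI launcher is first on the PATH
mpirun --version                       # which MPI implementation that launcher belongs to
which SU2_CFD                          # where the SU2 binary resolves from
ldd "$(which SU2_CFD)" | grep -i mpi   # which MPI libraries the binary links against at run time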

November 10, 2016, 19:40   #2
nomad2 (New Member)
California
Join Date: Nov 2016
Posts: 10
I know it's only been two days since this post, but did you make any progress on this? I'm also trying to run SU2 on a cluster with slurm.

I can run it fine in serial on the login node, but I'm not sure how to submit it in parallel.

November 10, 2016, 19:58   #3
devinmgibson (New Member)
Devin Gibson
Join Date: Nov 2016
Posts: 2
No progress...

I have been doing some more tests on the kingspeak cluster at the University of Utah to see if the error is consistent, and recently I have been getting the following error:

Code:
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[47035,1],0]
  Exit code:    127
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/parallel_computation.py", line 110, in <module>
    main()
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/parallel_computation.py", line 61, in main
    options.compute      )
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/parallel_computation.py", line 88, in parallel_computation
    info = SU2.run.CFD(config)
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2/run/interface.py", line 110, in CFD
    run_command( the_Command )
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2/run/interface.py", line 268, in run_command
    raise exception , message
RuntimeError: Path = /uufs/chpc.utah.edu/common/home/<uNID>/SU2-Tests/Users/2118/D2602EDB-0B2F-46C7-A93C-5290D2F8DA50/,
Command = mpirun -n 2 /uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2_CFD config_CFD.cfg
SU2 process returned error '127'
/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2_CFD: symbol lookup error: /uufs/chpc.utah.edu/sys/installdir/intel/impi/5.1.1.109/intel64/lib/libmpifort.so.12: undefined symbol: MPI_UNWEIGHTED
That is from the new cluster. I've reached out to a few people and no one has been able to tell me what this means yet. I also don't know whether this error is related to the cluster at my university or not.
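
For what it's worth, a symbol lookup error of this kind usually points to a mismatch between the MPI implementation SU2_CFD was built against and the MPI libraries it finds at run time (the traceback references Intel MPI's libmpifort.so.12, while the batch script below loads the openmpi module). A rough way to check which libraries the installed binary actually resolves, assuming the su2 module puts SU2_CFD on the PATH:

Code:
# check which MPI libraries SU2_CFD resolves at run time (sketch)
module purge                             # start from a clean module environment
module load su2                          # load only the su2 module
ldd "$(which SU2_CFD)" | grep -i mpi     # MPI libraries the binary picks up
module load openmpi                      # now add openmpi, as the batch script does
ldd "$(which SU2_CFD)" | grep -i mpi     # see whether the resolved libraries change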

In reference to your question about running in parallel, here is the SLURM batch script that I am submitting with the sbatch command:

Code:
#!/bin/bash
#SBATCH --account=owner-guest
#SBATCH --partition=kingspeak-guest
#SBATCH --job-name=NACA-2412
#SBATCH --nodes=2
#SBATCH --ntasks=12
#SBATCH --time=02:00:00
#SBATCH -o slurmjob-%j.out
#SBATCH -e slurmjob-%j.err

module load openmpi
module load su2

parallel_computation.py -f SU2.cfg

#####################################################

December 13, 2016, 18:35   #4
RcktMan77 (Senior Member)
Zach Davis
Join Date: Jan 2010
Location: Los Angeles, CA
Posts: 101
In order to run SU2 in parallel, the code needs a few things:
  1. The SU2 executables need to be accessible to each node at the same location on the filesystem. (A shared volume attached to each node here would be best.)
  2. The SU2 grid and configuration file need to be accessible to each node at the same location on the filesystem. (This means you may need to have a shared volume mounted on each of the nodes which you use for your run disk.)
  3. The SU2 run command, parallel_computation.py, needs to know how many processes to launch. You tell it this with the -n flag followed by the number of MPI processes to create (e.g. parallel_computation.py -n 32 -f my_su2_config_file.cfg).
  4. The actual slurm command used by your compiled SU2 executables can be seen in $SU2_RUN/SU2/run/interface.py. In this file, search for slurm and modify the run command if needed for your environment.
  5. You need passwordless ssh set up between the nodes (i.e., you need to be able to log in to each node from the head node and vice versa without being prompted for a password); a minimal sketch follows this list.
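
For item 5, a minimal sketch of setting up passwordless ssh, assuming the home directory is shared across the nodes so a single key pair covers all of them:

Code:
# passwordless ssh between nodes (sketch; assumes a shared home directory)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa        # key pair with no passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
ssh <compute-node> hostname                     # should return with no password prompt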

Based on the error message you're receiving, SU2 is not detecting that the SLURM_JOBID environment variable is set, so it's defaulting to a plain mpirun command. Your slurm batch script is also not passing the 12 MPI processes that you want to the parallel_computation.py command. The run command should look like:

parallel_computation.py -n 12 -f SU2.cfg > su2.out 2>&1

The redirect of standard output and standard error to a file named su2.out isn't necessary, but it is good practice for capturing the output from SU2. It appears you're currently running with:

mpirun -n 2 SU2_CFD config_CFD.cfg

which isn't what you want. This suggests there may be another slurm header directive you need to add to your script to indicate how many CPU cores are available on each node. Perhaps there is a machinefile that slurm uses to determine this, but as I mentioned above, the SLURM_JOBID environment variable isn't being detected, so SU2 is bypassing slurm altogether.

I don't use slurm, but if there is an environment variable corresponding to --ntasks, you could use it in the run command of your script instead of explicitly setting the value to 12 as in this example. That way you wouldn't have to update the count in two places in the script for each run. PBS has such a variable, but I'm not sure whether slurm does.
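
(slurm does in fact export one: sbatch sets SLURM_NTASKS to the --ntasks value, and SLURM_JOB_ID to the job ID.) A sketch of the batch script from post #3 adjusted along those lines, with the module choice left open since it should match whatever MPI SU2 was built against:

Code:
#!/bin/bash
#SBATCH --account=owner-guest
#SBATCH --partition=kingspeak-guest
#SBATCH --job-name=NACA-2412
#SBATCH --nodes=2
#SBATCH --ntasks=12
#SBATCH --time=02:00:00
#SBATCH -o slurmjob-%j.out
#SBATCH -e slurmjob-%j.err

module load openmpi   # assumption: this must match the MPI SU2 was built against
module load su2

# SLURM_NTASKS mirrors --ntasks, so the rank count only has to be set once above
parallel_computation.py -n ${SLURM_NTASKS} -f SU2.cfg > su2.out 2>&1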

Best Regards,

Zach

January 4, 2017, 15:58   #5
OVS (New Member)
Oliver V
Join Date: Dec 2015
Posts: 17
Hello,

I've been having the exact same error recently:

Code:
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[47035,1],0]
  Exit code:    127
Any progress on that? Which version of SU2 are you using?

Oliver


Tags
cfd, cluster, parallel, slurm, su2

