Thanks for helping to test the Slurm job scheduling system in the RC environment. In addition to this example page, we are building a FAQ page here: https://www.rc.colorado.edu/support/examples/slurmfaqs
First, make sure you don't have Torque or Moab modules loaded:
$ module unload torque
$ module unload moab
Load the slurm module:
$ module load slurm
Many Torque commands and directives are supported under Slurm, so you may be able to submit your jobs just as you always have (though please use native Slurm commands when possible), e.g.,
$ qsub -q janus-debug run_script.sh
In many cases you'll find it's easier to use native Slurm commands, such as sbatch, squeue, and sinfo. These are introduced in the Example section below. You can also find a translation between Torque commands and their Slurm equivalents at slurm.schedmd.com/rosetta.pdf.
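For quick reference, here are a few of the most common equivalents (Torque command on the left, Slurm equivalent on the right):

qsub run_script.sh  ->  sbatch run_script.sh
qstat               ->  squeue
qdel [job id]       ->  scancel [job id]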
IMPORTANT: We are working on a job cleanup issue that will put an error in your job output file. The error is safe to ignore and it should not have any ill effects on your job. The error message is the following:
slurmstepd: task/cgroup: unable to remove step memcg : No such file or directory
Slurm QOS Parameters
With the transition to Slurm, we are simplifying the queue structure to reduce the confusion the previous queues may have caused. In Slurm, a QOS (Quality of Service) can be thought of as a queue in Torque/Moab, so when you see QOS in our documentation, think 'queue'. We have created a QOS for each class of nodes we have (himem, serial, gpu, janus), and each has certain limits on what type of jobs you can run on it.
The QOSes for the Janus nodes are the following:
normal - This is the default QOS for all jobs. Any job using this QOS will run on a Janus compute node with the following restrictions: 24 hours max walltime, 480 max nodes per job
janus-long - This QOS is for any long jobs on Janus compute nodes. There are a total of 80 nodes available for this QOS. The restrictions on this QOS are: 7 days max walltime, 40 nodes per user (this means you can have 40 single node jobs or one 40 node job or anything in between)
janus-debug - This QOS is strictly for debugging purposes and has a higher priority than the normal QOS. You should not be doing any production work in this QOS. To prevent users from taking advantage of the higher priority, we made very restrictive settings on this QOS: 1 hour max walltime, 2 running jobs per user, 4 queued jobs per user including running jobs
The QOSes for all other Research Computing resources are the following:
himem - This QOS allows a job to run on any of the HiMem nodes. The following restrictions apply: 14 day max walltime
serial - This QOS allows a job to run on any of the serial nodes. The following restrictions apply: 14 day max walltime, 10 nodes per user (this means you can have 10 single node jobs, or a single 10 node job or anything in between)
gpu - This QOS allows a job to run on any GPU node. The following restrictions apply: 4 hour max walltime, 2 running jobs per user
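To request a particular QOS, pass the '--qos' option to sbatch, either on the command line or as a directive inside your job script (janus-debug is used here purely as an illustration):

sbatch --qos=janus-debug run_script.sh

or, in the script itself:

#SBATCH --qos=janus-debug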
Slurm Test Job Example
The intent of this example is to demonstrate the process of submitting a compute job. The example job runs long enough for you to see it running and to watch the output file get created and updated. The example may seem overly simple to an experienced Linux user, but it is still useful for seeing how a job gets queued and runs, and we'll demonstrate a few commands along the way.
Before you begin, you need an RC account, a registered OTP device, an SSH program (or an OS X terminal window), and an allocation of compute time. If you do not have an allocation, your jobs will not run.
Please note: The job we are going to run does nothing but waste computer time; please resist the temptation to run it at a larger scale.
First, log in to an RC login node as described in the Getting Started guide. Once you have done so, you should see a dollar prompt at the bottom of your screen, something like:
-bash-3.2$
What we are about to do is write a script that allows your work to be scheduled and performed by one or more of the available compute resources. Rather than executing a program immediately on demand, as you would on your own computer, in this environment you request that your program be run sometime in the future, as best fits the schedule and available resources. Your job script is how you define what resources you need, for example the number of compute nodes and cores and how long you intend to use them. The script then runs your program, and can also set up notifications, move files, and so on. This example is a very simple script that prints a few lines of text.
Let's begin. When you see a command set off on a line by itself, as in the examples below, copy it and paste it after the $ prompt in your SSH session to login.rc.colorado.edu. I will leave off the prompt to make copying easier: select the text with the mouse, copy it (right-click or Ctrl-C), paste it into the SSH window (right-click or Ctrl-V), and hit Enter.
For example when you see this:
pwd
You want to highlight and copy it, then paste it into your SSH screen at the dollar prompt. Then you would see this on your SSH terminal screen:
-bash-3.2$ pwd
You would then hit Enter, and in this case you Print your current Working Directory (hence 'pwd'), which should be /home/your-username:
-bash-3.2$ pwd
/home/ralphie
Let's try one. Please highlight and copy the command shown below:
echo $SHELL
Then paste it into your SSH screen at the dollar prompt and hit enter.
When a command's output is worth noting, I will show it together with the prompt, like this:
-bash-3.2$ echo $SHELL
/bin/bash
Does that work? I hope so. We just asked what shell you are using.
Next, in your home directory, where you start when you first log in, we will make a directory for your test job and move into it. First, create a 'testjob' directory:
mkdir testjob
And move into it with 'cd' (Change Directory):
cd testjob
If you want to be sure you are in the right place, try a 'pwd' to see what directory you are in:
pwd
Do you see something like this?
-bash-3.2$ pwd
/home/ralphie/testjob
Now let's create a shell script that we will submit as a compute job. Copy and paste the whole script below into your SSH screen. The 'cat' command in the top line creates a file that will comprise your job script; this method is an alternative to writing the script with a Linux text editor, which would be harder to describe here. Make sure you paste the entire block at once, NOT one line at a time.
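A script along the following lines will do; the #SBATCH values here (node count, walltime, QOS, and output-file name) are illustrative choices, not requirements:

cat > testjob_submit.sh << EOF
#!/bin/bash
#
# Set the name of the job
#SBATCH -J test_job
#
# Request one node in the debug QOS for at most ten minutes,
# and name the output file after the job ID (%j)
#SBATCH -N 1
#SBATCH --time=00:10:00
#SBATCH --qos=janus-debug
#SBATCH -o testjob-%j.out

# Write a few lines, pausing a minute between them, so we can
# watch the output file grow while the job runs
echo The job has begun.
echo Wait one minute...
sleep 60
echo Wait a second minute...
sleep 60
echo Wait a third and final minute...
sleep 60
echo Enough waiting. Job completed.
# End of example job shell script
#
EOF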
Hit Enter when you have pasted all of this in. If something goes wrong, you may need to close the file manually by typing 'EOF' on a line by itself, then delete the file and start again.
Next, do an 'ls' to list the files in this directory, and you should see the file we just created.
ls
-bash-3.2$ ls
testjob_submit.sh
Now check the file you just created to be sure it's all there. Note that the first line, where we used 'cat' to create this file, and the last line, with the EOF, will not appear; those were commands we used to create the file, not part of it.
cat testjob_submit.sh
#!/bin/bash
#
# Set the name of the job
#SBATCH -J test_job
--snipped for brevity--
echo Enough waiting. Job completed.
# End of example job shell script
#
The file you just created is a Bash shell script that informs the job scheduler of your job's needs and then does some very basic things when executed. All our script does is write a few lines to the output file, waiting 60 seconds between them. A proper compute job will do a great deal more than this; this is intended to be a very simple example.
To submit this job, we need to add Slurm to our environment. Slurm is a resource manager that can accept and schedule 10,000 jobs or more a day in our environment. We use sbatch to ask Slurm to accept the job, and after that we will use a command or two to ask Slurm how things are going.
module load slurm
-bash-3.2$ module load slurm
-bash-3.2$
Next let's submit this script to a QOS and waste some supercomputer time. The 'sbatch' command asks Slurm to schedule the job based on the requirements we put in the #SBATCH lines of our job script. Slurm will work in the background to find resources to run it.
sbatch testjob_submit.sh
We get something like the following response:
-bash-3.2$ sbatch testjob_submit.sh
Submitted batch job 56
The number after "Submitted batch job" is the job ID number that we will use to check on the progress of our job, and our output file will also have this number in the filename.
Let's check on our job.
We'll use squeue to look at all of our jobs. I will use the '-u' flag to look at a single user, along with the $USER variable, which is already set to your username. (This is done so the command can be pasted in and will work for anyone; you may also type your username in place of $USER and you should get the same results.)
squeue -u $USER
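You should see something like the following (your job ID, elapsed time, and node name will differ; the values here are illustrative):

-bash-3.2$ squeue -u $USER
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
     56     janus test_job  ralphie   R       1:05      1 node0101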
Here we see we have one job running and how much time it has used.
Since the job is running and we had 'echo' commands in the script, we should see output in our output file. First, an 'ls' to list the files and see if there is an output file. I like 'ls -l' because it formats the output nicely and shows dates and times.
ls -l
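You should see something like this (sizes and dates are illustrative):

-bash-3.2$ ls -l
total 8
-rw-r--r-- 1 ralphie ralphie  38 Aug 20 10:15 testjob-56.out
-rw-r--r-- 1 ralphie ralphie 523 Aug 20 10:12 testjob_submit.sh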
I see the output file has appeared, so let's take a look at it. You will have to use the name of your own output file, which should be similar but with a different job number embedded in it. So cutting and pasting won't work this time, but you can copy the name from your own SSH screen or type it in carefully.
cat testjob-[type your job ID here].out
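With the sketch script above, the output at this point would look something like:

-bash-3.2$ cat testjob-56.out
The job has begun.
Wait one minute...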
The job has only run for a minute or so in my case, so I only see the first two echo statements.
You can see how the output file is built up as the job executes and adds more lines to it. When the job is finished (again assuming the sketch script above), it will look something like this:
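-bash-3.2$ cat testjob-56.out
The job has begun.
Wait one minute...
Wait a second minute...
Wait a third and final minute...
Enough waiting. Job completed.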
That's our basic job script example. Normally, rather than just printing some text and waiting, your job script would launch a program. Our other examples show both serial and parallel programs run by job scripts.
If you want to run this example again, you can perform the 'sbatch' operation a second time. You will get a different job number and a differently named output file. There are a few more things you can do to learn about running jobs:
• Use the squeue command to get information about your job while it's running, or use scontrol show job [job id] for more verbose output (see the example after this list).
• For a list of options to SBATCH and their corresponding PBS options, see the SLURM Rosetta Stone.
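For example, using the job ID from the run above:

scontrol show job 56

This prints the job's complete record, including its state, node list, and time limits.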
To get back to your home directory you can use the Change Directory (cd) command with no arguments:
cd