Queueing System - S.L.U.R.M.

Our cluster runs the S.L.U.R.M. workload manager to manage batch jobs. Prefer this system for long batch jobs, because interactive calculations are less reliable and require more manual work.

The queuing system gives you access to computers owned by LCM, LTHC, LTHI, LINX and the IC Faculty. Sharing the computational resources among as many groups as possible results in a more efficient use of the resources (including electric power): you can take advantage of many more machines for your urgent calculations and get results faster. On the other hand, since the machines you are using are not always owned by your group, try to be as fair as possible and respect the needs of other users.

We have configured the system with almost no restrictions on access and capabilities, because the queuing system can make more efficient use of the cluster if it does not have to satisfy too many constraints. We currently enforce only the following constraints:

number of CPU/cores: you must indicate the correct number of cores you're going to use;
Megabytes/Gigabytes of RAM your jobs need to use;
Time for the execution: if your job is not completed by the indicated time, it will be automatically terminated;
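
As a minimal sketch, the three constraints above map onto #SBATCH directives as shown here (the directive syntax is explained in the Scripts section below). The values (4 cores, 8 GB, 2 hours) and the `sleep` placeholder are illustrative assumptions, not site defaults.

```shell
#!/bin/bash
# number of cores the program will use
#SBATCH --cpus-per-task=4
# RAM needed by the job
#SBATCH --mem=8G
# the job is terminated if it is not finished after 2 hours
#SBATCH --time=02:00:00

start=$(date +%s)
echo "job started on $(hostname)"

# placeholder for the real computation (assumption for this sketch)
sleep 1

elapsed=$(( $(date +%s) - start ))
echo "finished after ${elapsed}s"
```

Measuring the elapsed time of a few test runs like this is a simple way to pick a realistic value for --time.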

Here we provide just a quick and dirty guide to the most basic commands and tasks that you will use in your day-to-day activities. You can find better and more complete guides on how to use the S.L.U.R.M. control commands on the internet.

Partitions (a.k.a. queues)

If you have used other cluster management systems, you already know the term "queue", which identifies the type of computers (nodes) or programs (jobs) you want to use. In S.L.U.R.M. terminology, queues are called partitions. This guide uses the two terms for the same entity, even though they are not strictly identical.

Mini User Guide

The most used/needed commands are:

squeue for checking the status of the partitions or of your running jobs
sbatch or srun for submitting your jobs
scancel for deleting a running or waiting job
sinfo to discover the availability of nodes and partitions

sinfo shows the list of partitions, the nodes associated with them and their availability. Here you can see that the default partition of the cluster is called "slurm-cluster" (the * indicates the default), the time limit imposed on each partition, the nodes associated with it and their activity status.

$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurm-cluster*    up   infinite      6   idle iscpc88,node[01-02,05,10-11]
slurm-ws          up    1:00:00      4  down* iscpc[85-87,90]
slurm-ws          up    1:00:00      2   idle iscpc[14-15]
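
If you only want to know how many nodes are free, the table can be summarised with a few lines of shell. This is a sketch under assumptions: the function name `idle_nodes` is hypothetical, it expects the default column layout shown above, and it counts only the exact state `idle`. The sample text is copied from the output above so the snippet also runs outside the cluster.

```shell
#!/bin/bash
# Sum the idle nodes of each partition from sinfo-style text read on stdin.
idle_nodes() {
    awk 'NR > 1 && $5 == "idle" {
             sub(/\*$/, "", $1)   # drop the * that marks the default partition
             idle[$1] += $4
         }
         END { for (p in idle) print p, idle[p] }' | sort
}

# Sample copied from the sinfo output above.
sample='PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurm-cluster*    up   infinite      6   idle iscpc88,node[01-02,05,10-11]
slurm-ws          up    1:00:00      4  down* iscpc[85-87,90]
slurm-ws          up    1:00:00      2   idle iscpc[14-15]'

printf '%s\n' "$sample" | idle_nodes
```

On the cluster you would feed it the live data with `sinfo | idle_nodes`.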

squeue shows the list of jobs currently submitted to all the partitions (queues). By default, the command shows all the jobs submitted to the cluster:

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               550 slurm-clu  sheepit    damir PD       0:00      1 (Resources)
               551 slurm-clu script.s rmarino1 PD       0:00      1 (Priority)
               549 slurm-clu  sheepit    damir  R      11:13      1 node05
               548 slurm-clu  sheepit    damir  R      11:25      1 iscpc88

Here you can see that the command provides the ID of the jobs, the PARTITION used to run the jobs (hence the nodes where these jobs will run), the NAME assigned to the jobs, the name of the USER that submitted the jobs, the state (ST) of the job (R=Running, PD=Pending, i.e. waiting), the execution TIME and the nodes where the jobs are actually running (or the reason why they are waiting in the queue).
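
To see at a glance why your own jobs are still waiting, you can filter the table. The sketch below is only an illustration: the function name `pending_reasons` is hypothetical, and it assumes the default column layout shown above, job names without spaces and one-word reasons. The sample is copied from the squeue output above so the snippet runs anywhere.

```shell
#!/bin/bash
# Print "JOBID REASON" for the pending (PD) jobs of one user,
# reading squeue-style text on stdin.
pending_reasons() {
    local user="$1"
    awk -v user="$user" 'NR > 1 && $4 == user && $5 == "PD" {
        gsub(/[()]/, "", $8)     # strip the parentheses around the reason
        print $1, $8
    }'
}

# Sample copied from the squeue output above.
sample='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               550 slurm-clu  sheepit    damir PD       0:00      1 (Resources)
               551 slurm-clu script.s rmarino1 PD       0:00      1 (Priority)
               549 slurm-clu  sheepit    damir  R      11:13      1 node05
               548 slurm-clu  sheepit    damir  R      11:25      1 iscpc88'

printf '%s\n' "$sample" | pending_reasons damir
```

On the cluster, `squeue -u "$USER"` restricts the listing to your own jobs and `squeue -t PENDING` to waiting jobs, so much of this filtering can also be done by squeue itself.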

sbatch is used to submit and run jobs on the cluster. Jobs are nothing more than short scripts that contain directives describing what the programs to be executed require. By default the output of the program is written to two files called xxx.out and xxx.err, for standard output (any message that would be printed on the screen) and standard error (any error message that would be printed on the screen) respectively, where xxx stands for the job ID. You can change the output file names with the directives --output= for standard output and --error= for standard error.

Once a job is submitted (and accepted by the cluster), you'll receive the ID assigned to the job:

$ sbatch sheepit.slurm 
Submitted batch job 552
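
When you submit jobs from your own scripts, it is handy to keep the job ID in a variable, for example to cancel the job later. A minimal sketch, assuming the message format shown above (the function name `job_id_from` is hypothetical):

```shell
#!/bin/bash
# Extract the numeric job ID from the "Submitted batch job N" message.
job_id_from() {
    awk '/^Submitted batch job/ { print $4 }'
}

# The message is hard-coded here so the snippet runs outside the cluster.
# On the cluster: jobid=$(sbatch sheepit.slurm | job_id_from)
jobid=$(echo "Submitted batch job 552" | job_id_from)
echo "job id: ${jobid}"
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which avoids the parsing step.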

srun is used to launch your program inside the cluster: it can provide an interactive session on a node or launch parallel programs. When srun is used inside an sbatch/slurm script, the cluster always knows which resources are allocated to it.
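
For the interactive case, a typical invocation asks srun for a pseudo-terminal running a shell. The sketch below only builds and prints the command line, so it is safe to run anywhere; the function name `interactive_cmd` and the default values are assumptions for illustration, not site settings.

```shell
#!/bin/bash
# Build and print the srun command line for an interactive shell on a node.
# Arguments: cores, memory, time limit (all optional).
interactive_cmd() {
    local cores="${1:-1}" mem="${2:-1G}" limit="${3:-1:00:00}"
    echo "srun --cpus-per-task=${cores} --mem=${mem} --time=${limit} --pty bash -i"
}

interactive_cmd 2 4G 0:30:00
```

Copy the printed line to the terminal of the login machine to actually open the session. As with batch jobs, the shell is terminated when the time limit is reached.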

scancel is used to remove your job from the queue or to kill your program when it's already running on the cluster:

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               552 slurm-clu  sheepit    damir PD       0:00      1 (Priority)
               550 slurm-clu  sheepit    damir PD       0:00      1 (Resources)
               551 slurm-clu script.s rmarino1 PD       0:00      1 (Priority)
               549 slurm-clu  sheepit    damir  R      35:54      1 node05
               548 slurm-clu  sheepit    damir  R      36:06      1 iscpc88
$ scancel 552
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               550 slurm-clu  sheepit    damir PD       0:00      1 (Resources)
               551 slurm-clu script.s rmarino1 PD       0:00      1 (Priority)
               549 slurm-clu  sheepit    damir  R      36:05      1 node05
               548 slurm-clu  sheepit    damir  R      36:17      1 iscpc88

Scripts (used with sbatch)

It is convenient to write the job script in a file, not only because the script can then be reused, but also because sbatch options can be set directly inside the script, as in the following example (which shows the content of the file sheepit.slurm):

$ cat sheepit.slurm 
#!/bin/bash

#SBATCH --job-name=sheepit
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --time=4:00:00
#SBATCH --mem=16G
#SBATCH --mail-user=damir.laurenzi@epfl.ch
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --error=sheepit/sheepit.%J.err
#SBATCH --output=sheepit/sheepit.%J.out
#SBATCH --partition slurm-cluster
#SBATCH --gres=gpu:1
#SBATCH --constraint=opteron61


echo "$(hostname) $(date)"

cd ${HOME}/sheepit
srun sleep 60
echo "$(hostname) $(date)"

<note> The file begins with the line #!/bin/bash: sbatch expects a batch script to start with an interpreter line, and identifying slurm scripts as bash scripts also lets you execute them outside the cluster, in which case the '#SBATCH' lines are interpreted as comments. </note> Inside a script, all lines that start with the '#' character are comments, but lines that start with the '#SBATCH' string are directives for the queuing system. The example above instructs the queuing system to:

#SBATCH --job-name=sheepit assign the name sheepit to the job
#SBATCH --nodes=1 require only one node to run the job
#SBATCH --cpus-per-task=8 inform the cluster that the program will need/use 8 cores to run
#SBATCH --time=4:00:00 the job will run for at most 4 hours
#SBATCH --mem=16G the job will require 16GB of RAM to be executed. Different units can be specified using the suffixes [K|M|G|T]
#SBATCH --mail-* are parameters that indicate the email address that will receive the messages from the cluster and when these messages are to be sent (when the job starts and when the job ends)
#SBATCH --error, #SBATCH --output these two directives indicate where to write the messages from the program(s) you execute
#SBATCH --partition indicates the partition that must be used to run the program
#SBATCH --gres=gpu:1 this parameter informs the cluster that the program must be run only on nodes that provide the "gpu" resource and that 1 of these resources is needed.
#SBATCH --constraint= this directive indicates that the program must be run only on nodes that provide this property. Constraints can be combined using AND (&), OR (|) or combinations of both (as in --constraint="intel&gpu" or --constraint="intel|amd"). Refer to the sbatch manual for the full range of possibilities.

At the moment we have defined these resources:

gpu on nodes that provide GPU capabilities.

and these properties/constraints:

opteron61 nodes that have the old CPU AMD Opteron 6821, that lack some newer hardware functions
matlab nodes that can run Matlab simulations
mathematica nodes that can run Mathematica simulations
tensorflow nodes that can run TensorFlow 2.x
xeon26, xeon41, xeon56 nodes that have different versions of Intel Xeon CPUs
epyc7302 nodes that provide the AMD Epyc CPUs
gpu nodes that provide GPU capabilities

<note> Please note that resources are not the same as properties, and the two must be requested with different parameters inside the scripts:

#SBATCH --gres= indicates the resource we want to use
#SBATCH --constraint= indicates that we want to restrict our program to the nodes that have the property

</note>

<note important> It is mandatory to specify at least the estimated run time of the job and the memory needed, so that the scheduler can optimize node, core and memory usage and the overall cluster throughput. If your job exceeds the limits you set, it will be automatically killed by the cluster manager.

Please keep in mind that longer jobs are less likely to be started when the cluster load is high. Therefore, do not be lazy and always ask for infinite run time, because your job will remain stuck in the queue. </note>
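
One way to choose a sensible limit is to measure a test run and add a safety margin. The helper below is a sketch of that arithmetic: the function name `slurm_time` and the default 20% margin are assumptions for illustration. It prints the result in the days-hours:minutes:seconds form accepted by --time.

```shell
#!/bin/bash
# Convert an estimated run time in seconds, plus a safety margin in percent,
# into the days-hours:minutes:seconds form accepted by --time.
slurm_time() {
    local seconds="$1" margin="${2:-20}"
    local total=$(( seconds + seconds * margin / 100 ))
    printf '%d-%02d:%02d:%02d\n' \
        $(( total / 86400 )) $(( total % 86400 / 3600 )) \
        $(( total % 3600 / 60 )) $(( total % 60 ))
}

# A 3h20m estimate (12000 s) with the default 20% margin.
slurm_time 12000
```

The printed value can be passed on the command line, for example `sbatch --time="$(slurm_time 12000)" sheepit.slurm`.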

Here you can find some useful sbatch scripts that can be used as a starting point:

Script	Execute with
Base example script contains most of the useful options	sbatch [sbatch options] base.slurm
Script example for running matlab computations	sbatch [sbatch options] matlab.slurm
Script example for running Mathematica computations	sbatch [sbatch options] mathematica.slurm
Script example for windows programs (executed under wine)	sbatch [sbatch options] wine.slurm

The shell running the sbatch script has access to various variables that might be useful; you can find a complete list here.
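
As a quick illustration, the script below prints a few of the variables that S.L.U.R.M. sets inside a job. The job name `envdemo` and the resource values are assumptions for this sketch. The `:-` defaults are used only when the script is run by hand outside the cluster, where these variables are not set.

```shell
#!/bin/bash
#SBATCH --job-name=envdemo
#SBATCH --cpus-per-task=2
#SBATCH --mem=100M
#SBATCH --time=00:05:00

# Variables set by S.L.U.R.M. inside a job, with fallbacks for manual runs.
job_id="${SLURM_JOB_ID:-none}"
job_name="${SLURM_JOB_NAME:-interactive}"
nodes="${SLURM_JOB_NODELIST:-$(hostname)}"
cores="${SLURM_CPUS_PER_TASK:-1}"
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"

summary="job ${job_id} (${job_name}) on ${nodes}, ${cores} core(s), submitted from ${submit_dir}"
echo "$summary"
```

SLURM_CPUS_PER_TASK is a convenient value to pass to programs that need to be told how many threads to start.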

See the man page for more details.

Tips and Tricks

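Delete all your running jobs

squeue -u "$(whoami)" -h -t RUNNING | awk '{print $1}' | while read -r a ; do scancel "${a}" ; done

To delete the jobs that are still waiting in the queue instead, replace RUNNING with PENDING.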