Queueing System - S.L.U.R.M.
Our cluster runs the S.L.U.R.M. workload manager to manage batch jobs. Prefer this system for long batch jobs: interactive calculations are less reliable and require more manual work.
The queuing system gives you access to computers owned by LCM, LTHC, LTHI, LINX and the IC Faculty. Sharing the computational resources among as many groups as possible results in a more efficient use of the resources (including electric power), and you can take advantage of many more machines for your urgent calculations and get results faster. On the other hand, since the machines you are using are not always owned by your group, try to be as fair as possible and respect the needs of other users.
We have configured the system with almost no restrictions on access and capabilities, because the queuing system can use the cluster more efficiently if it does not have to satisfy too many constraints. We currently enforce only a few constraints:
- Number of CPUs/cores: you must indicate the correct number of cores you are going to use;
- Megabytes/Gigabytes of RAM your job needs;
- Execution time: if your job has not completed by the indicated time, it is automatically terminated.
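As a rough sketch of how these three constraints appear in practice, the snippet below writes a minimal job script and submits it only where sbatch is available. The file name minimal.slurm, the program ./my_program and the requested values are placeholders chosen for illustration, not site defaults.

```shell
# Write a minimal job script; every value below is an illustrative placeholder.
cat > minimal.slurm <<'EOF'
#!/bin/bash
# number of cores the program will use
#SBATCH --cpus-per-task=4
# RAM needed by the job
#SBATCH --mem=8G
# the job is terminated if it runs longer than 2 hours
#SBATCH --time=02:00:00

# hypothetical program, replace with your own
srun ./my_program
EOF

# Submit only if sbatch exists, so the snippet is harmless on other machines.
if command -v sbatch >/dev/null 2>&1; then
    sbatch minimal.slurm
fi
```

The same three options can also be passed directly on the sbatch command line instead of being written inside the script.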
Here we provide just a quick and dirty guide to the most basic commands and tasks that you will use in your day-to-day activities. You can find better and more complete guides on how to use the S.L.U.R.M. control commands on the internet.
Partitions (a.k.a. queues)
If you have used other cluster management systems, you will already know the term “queue”, which identifies the type of computers (nodes) or programs (jobs) you want to use. In S.L.U.R.M. notation, queues are called partitions. On this page the two terms refer to the same entity, even though strictly speaking they are not quite the same.
Mini User Guide
The most used/needed commands are:
- squeue: for checking the status of the partitions or of your running jobs;
- sbatch or srun: for submitting your jobs;
- scancel: for deleting a running or waiting job;
- sinfo: to discover the availability of nodes and partitions.
sinfo
shows the list of partitions, the nodes and their availability. Here you can see that the default partition of the cluster is called “slurm-cluster” (the * indicates the default), the time limit imposed on each partition, the nodes associated with it and their activity status.
$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurm-cluster*    up   infinite      6   idle iscpc88,node[01-02,05,10-11]
slurm-ws          up    1:00:00      4  down* iscpc[85-87,90]
slurm-ws          up    1:00:00      2   idle iscpc[14-15]
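To pick out only the idle nodes from a listing like this one, the output can be filtered with awk. The sketch below runs on a copy of the sample output so that it works on any machine; the helper names sample_sinfo and idle_summary are made up for this example, and on the cluster the input would come from the real sinfo command.

```shell
# Sample text copied from the sinfo listing shown on this page.
sample_sinfo() {
cat <<'EOF'
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurm-cluster*    up   infinite      6   idle iscpc88,node[01-02,05,10-11]
slurm-ws          up    1:00:00      4  down* iscpc[85-87,90]
slurm-ws          up    1:00:00      2   idle iscpc[14-15]
EOF
}

# Print "partition idle-node-count" for every row whose STATE is exactly idle.
# The trailing * that marks the default partition is stripped from the name.
idle_summary() {
    awk 'NR > 1 && $5 == "idle" { sub(/\*$/, "", $1); print $1, $4 }'
}

sample_sinfo | idle_summary
```

On the cluster, `sinfo | idle_summary` would apply the same filter to live data. sinfo can also filter by itself with its `--states=idle` option.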
squeue
shows the list of jobs currently submitted to all the partitions (queues). By default, the command shows all the jobs submitted to the cluster:
$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  550 slurm-clu  sheepit    damir PD  0:00     1 (Resources)
  551 slurm-clu script.s rmarino1 PD  0:00     1 (Priority)
  549 slurm-clu  sheepit    damir  R 11:13     1 node05
  548 slurm-clu  sheepit    damir  R 11:25     1 iscpc88
Here you can see that the command provides the ID of each job, the PARTITION used to run it (hence the nodes where it will run), the NAME assigned to the job, the USER that submitted it, the state (ST) of the job (R = running, PD = pending, i.e. waiting), the execution TIME and the nodes where the job is actually running (or the reason why it is waiting in the queue).
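When the list gets long, a per-user summary is often easier to read than the raw table. The following sketch counts jobs per user and state; it works on a copy of the sample output, and the helper names sample_squeue and jobs_per_user are hypothetical.

```shell
# Sample text copied from the squeue listing shown on this page.
sample_squeue() {
cat <<'EOF'
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  550 slurm-clu  sheepit    damir PD  0:00     1 (Resources)
  551 slurm-clu script.s rmarino1 PD  0:00     1 (Priority)
  549 slurm-clu  sheepit    damir  R 11:13     1 node05
  548 slurm-clu  sheepit    damir  R 11:25     1 iscpc88
EOF
}

# Count jobs per (USER, ST) pair, skipping the header line.
# Column 4 is USER and column 5 is ST in the default squeue layout.
jobs_per_user() {
    awk 'NR > 1 { n[$4 " " $5]++ } END { for (k in n) print k, n[k] }' | LC_ALL=C sort
}

sample_squeue | jobs_per_user
```

On the cluster, `squeue | jobs_per_user` gives the live summary. To see only your own jobs, squeue accepts `-u` followed by a user name, and `-t` followed by a state such as PD or R.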
sbatch
is used to submit and run jobs on the cluster. Jobs are nothing more than short scripts that contain directives describing the requirements of the programs to be executed. By default, the output of the program is written to two files called xxx.out and xxx.err, respectively for standard output (any message that would be printed on the screen) and standard error (any error message that would be printed on the screen), where xxx stands for the job ID. You can change the output file names by setting the directives --output= for standard output and --error= for standard error.
Once a job is submitted (and accepted by the cluster), you'll receive the ID assigned to the job:
$ sbatch sheepit.slurm
Submitted batch job 552
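Scripts that submit jobs usually need the job ID for later squeue or scancel calls. A minimal sketch of how to extract it from the confirmation line follows; the helper name job_id_from is made up for this example, and the input is the sample line above rather than a real submission.

```shell
# Extract the numeric ID from the confirmation line printed by sbatch.
job_id_from() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Sample line copied from the example above; prints 552.
echo "Submitted batch job 552" | job_id_from
```

On the cluster this could be used as `jobid=$(sbatch sheepit.slurm | job_id_from)`. sbatch also offers a `--parsable` option that prints the job ID without the surrounding text.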
srun
is used to launch your program inside the cluster: it can provide an interactive session on a node, or launch parallel programs. When it is used inside sbatch/slurm scripts, the cluster always knows which resources are allocated.
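For the interactive case, a common pattern is to ask srun for a pseudo-terminal running a shell. The helper below is only a sketch: its name, its default values and the DRY_RUN switch are assumptions made for this example, and the resource limits you need may differ.

```shell
# Hypothetical helper: open an interactive shell on one node.
# $1 = cores (default 1), $2 = memory (default 2G), $3 = time limit (default 30 min)
# When DRY_RUN is set, the command is printed instead of executed.
interactive_shell() {
    ${DRY_RUN:+echo} srun --nodes=1 --cpus-per-task="${1:-1}" \
        --mem="${2:-2G}" --time="${3:-00:30:00}" --pty bash -i
}

# Dry run: shows the srun command that would be launched.
DRY_RUN=1 interactive_shell 4 8G 01:00:00
```

Calling `interactive_shell 4 8G 01:00:00` without DRY_RUN on the cluster would start the session; leaving the shell with `exit` releases the resources.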
scancel
is used to remove your job from the queue or to kill your program when it's already running on the cluster:
$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  552 slurm-clu  sheepit    damir PD  0:00     1 (Priority)
  550 slurm-clu  sheepit    damir PD  0:00     1 (Resources)
  551 slurm-clu script.s rmarino1 PD  0:00     1 (Priority)
  549 slurm-clu  sheepit    damir  R 35:54     1 node05
  548 slurm-clu  sheepit    damir  R 36:06     1 iscpc88
$ scancel 552
$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  550 slurm-clu  sheepit    damir PD  0:00     1 (Resources)
  551 slurm-clu script.s rmarino1 PD  0:00     1 (Priority)
  549 slurm-clu  sheepit    damir  R 36:05     1 node05
  548 slurm-clu  sheepit    damir  R 36:17     1 iscpc88
Scripts (used with sbatch)
It is convenient to write the job script in a file, not only because the script can then be reused, but also because sbatch options can be set directly inside the script, as in the following example (which shows the content of the file sheepit.slurm):
$ cat sheepit.slurm
#!/bin/bash
#SBATCH --job-name=sheepit
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --time=4:00:00
#SBATCH --mem=16G
#SBATCH --mail-user=damir.laurenzi@epfl.ch
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --error=sheepit/sheepit.%J.err
#SBATCH --output=sheepit/sheepit.%J.out
#SBATCH --partition slurm-cluster
#SBATCH --gres=gpu:1
#SBATCH --constraint=opteron61

echo "$(hostname) $(date)"
cd ${HOME}/sheepit
srun sleep 60
echo "$(hostname) $(date)"
<note>
The file begins with the line #!/bin/bash. sbatch expects the first line of a job script to start with #! followed by the path to an interpreter, and it is common practice to write slurm scripts as bash scripts so that they can also be executed outside the cluster. In that case the '#SBATCH' lines are interpreted as comments.
</note>
Inside a script, all lines that start with the '#' character are comments, but lines that start with the '#SBATCH' string are directives for the queuing system.
The example above instructs the queuing system as follows:
- #SBATCH --job-name=sheepit: assigns the name sheepit to the job;
- #SBATCH --nodes=1: requires only one node to run the job;
- #SBATCH --cpus-per-task=8: informs the cluster that the program will need/use 8 cores;
- #SBATCH --time=4:00:00: the job will run for at most 4 hours;
- #SBATCH --mem=16G: the job requires 16GB of RAM. Different units can be specified using the suffixes [K|M|G|T];
- #SBATCH --mail-*: these parameters indicate the email address that will receive messages from the cluster and when these messages are sent (when the job starts and when it ends);
- #SBATCH --error, #SBATCH --output: these two directives indicate where to write the messages from the program(s) you execute;
- #SBATCH --partition: indicates the partition that must be used to run the program;
- #SBATCH --gres=gpu:1: informs the cluster that the program must run only on nodes that provide the “gpu” resource and that 1 of these resources is needed;
- #SBATCH --constraint=: indicates that the program must run only on nodes that provide this property. Constraints can be combined using AND, OR or combinations of the two (as in --constraint="intel&gpu" or --constraint="intel|amd"). Refer to the sbatch manual for the full range of possibilities.
At the moment we have defined these resources:
- gpu: nodes that provide GPU capabilities.
and these properties/constraints:
- opteron61: nodes that have the old AMD Opteron 6821 CPU, which lacks some newer hardware functions;
- matlab: nodes that can run Matlab simulations;
- mathematica: nodes that can run Mathematica simulations;
- tensorflow: nodes that can run tensorflow 2.x;
- xeon26, xeon41, xeon56: nodes that have different versions of Intel Xeon CPUs;
- epyc7302: nodes that provide AMD Epyc CPUs;
- gpu: nodes that provide GPU capabilities.
<note>
Please note that resources are not the same as properties, and the two must be indicated using different parameters inside the scripts:
- #SBATCH --gres=: indicates the resource we want to use;
- #SBATCH --constraint=: restricts the run of our program to nodes that have the given property.
</note>
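To make the distinction concrete, the sketch below requests one GPU as a resource and a software property as a constraint in the same submission. The helper name submit_gpu_job and the DRY_RUN switch are assumptions made for this example. Options given on the sbatch command line take precedence over the #SBATCH lines in the script.

```shell
# Hypothetical helper: submit a script asking for one GPU (a resource)
# on a node that has the given property (a constraint).
# $1 = property, e.g. tensorflow; $2 = job script
# When DRY_RUN is set, the command is printed instead of executed.
submit_gpu_job() {
    ${DRY_RUN:+echo} sbatch --gres=gpu:1 --constraint="$1" "$2"
}

# Dry run: only prints the command that would be executed.
DRY_RUN=1 submit_gpu_job tensorflow sheepit.slurm
```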
<note important> It is mandatory to specify at least the estimated run time of the job and the memory needed, so that the scheduler can optimize node/core/memory usage and the overall cluster throughput. If your job exceeds the limits you set, it will be automatically killed by the cluster manager.
Please keep in mind that longer jobs are less likely to be scheduled when the cluster load is high. Therefore, don't be lazy and do not always ask for infinite run time, because your job will remain stuck in the queue. </note>
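If you estimate the run time in seconds (for example from a timed test run), it has to be converted to one of the formats that --time accepts, such as days-hours:minutes:seconds. The helper below is a small sketch of that conversion; the name slurm_time is made up for this example.

```shell
# Convert a duration in seconds to the days-hours:minutes:seconds
# format accepted by the --time option.
slurm_time() {
    s="$1"
    printf '%d-%02d:%02d:%02d\n' \
        $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

slurm_time 5400     # prints 0-01:30:00
slurm_time 100000   # prints 1-03:46:40
```

The result can be passed straight to sbatch, as in `sbatch --time="$(slurm_time 5400)" sheepit.slurm`.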
Here you can find some useful sbatch scripts that can be used as a starting point:
Script | Execute with |
---|---|
Base example script contains most of the useful options | sbatch [sbatch options] base.slurm |
Script example for running matlab computations | sbatch [sbatch options] matlab.slurm |
Script example for running Mathematica computations | sbatch [sbatch options] mathematica.slurm |
Script example for windows programs (executed under wine) | sbatch [sbatch options] wine.slurm |
The shell running the sbatch script has access to various variables that might be useful; you can find here a complete list.
See the man page for more details.
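As an illustration, the lines below read a few of these variables inside a job script. SLURM_JOB_ID, SLURM_CPUS_PER_TASK and SLURM_JOB_NODELIST are set by the scheduler; the fallback values are assumptions added for this sketch so that the same script still runs outside the cluster.

```shell
# Inside a job script: fall back to safe defaults when run outside the cluster.
job_id="${SLURM_JOB_ID:-none}"
cores="${SLURM_CPUS_PER_TASK:-1}"

# Let OpenMP programs use exactly the cores requested with --cpus-per-task.
export OMP_NUM_THREADS="$cores"

echo "job ${job_id}: using ${OMP_NUM_THREADS} thread(s) on ${SLURM_JOB_NODELIST:-$(uname -n)}"
```

Note that SLURM_CPUS_PER_TASK is only defined when the job actually requests --cpus-per-task, which is why the sketch defaults to 1.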
Tips and Tricks
Delete all queued jobs
squeue -u "$(whoami)" -h -t PENDING | awk '{print $1}' | while read -r a ; do scancel "${a}" ; done
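scancel can also select jobs by user and state on its own, which avoids the loop. The sketch below wraps that call in a hypothetical helper named cancel_my_pending; the DRY_RUN switch is an assumption added here so the command can be inspected before it is run for real.

```shell
# Hypothetical helper: cancel all of your own jobs that are still waiting.
# When DRY_RUN is set, the command is printed instead of executed.
cancel_my_pending() {
    ${DRY_RUN:+echo} scancel --user="$(id -un)" --state=PENDING
}

# Dry run: shows the scancel command without cancelling anything.
DRY_RUN=1 cancel_my_pending
```

Running `cancel_my_pending` without DRY_RUN on the cluster performs the cancellation; jobs that are already running are left untouched.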