!!!! WORK IN PROGRESS !!!!!

Queuing System - S.L.U.R.M.

Our cluster runs the S.L.U.R.M. workload manager for managing batch jobs. It is preferable to use this system for running long batch jobs, as interactive calculations are less reliable and require more human work.

The queuing system gives you access to computers owned by LCM, LTHC, LTHI, LINX and the IC Faculty; sharing the computational resources among as many groups as possible results in a more efficient use of the resources (including the electric power). As a user you can take advantage of many more machines for your urgent calculations and get results faster. On the other hand, since the machines you are using are not always owned by your group, try to be as fair as possible and respect the needs of other users.

We have configured the system almost without access restrictions, because the queuing system can make a more efficient use of the cluster if it does not have to satisfy too many constraints. We are currently using only three constraints:

  1. number of CPUs/cores: you must indicate the correct number of cores your job is going to use;
  2. Megabytes/Gigabytes of RAM your job needs to use;
  3. time for the execution: if your job is not completed within the indicated time, it will be automatically terminated (see the example below).
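These limits can be set inside the job script (see further down) or directly on the sbatch command line; for example (myjob.slurm is a hypothetical script name):

$ sbatch --cpus-per-task=4 --mem=8G --time=2:00:00 myjob.slurm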

You can find better and more complete guides on how to use the S.L.U.R.M. control commands on the internet; here we provide just a quick and dirty guide to the most basic commands/tasks that you're going to use for the day-to-day activities.

Partitions (a.k.a. queues)

If you have used other types of cluster management, you will already know the term “queue” to identify the type of nodes/jobs you can use when submitting jobs to the cluster. In S.L.U.R.M. notation, queues are called partitions. The two terms indicate the same entity and usage.

Mini User Guide

The four most used commands are:

  1. squeue: for checking the status of the partitions or of your running jobs
  2. sbatch or srun: for submitting your jobs
  3. scancel: for deleting a running or waiting job
  4. sinfo: to discover the availability of nodes and partitions
  • sinfo shows the list of partitions and nodes and their availability: here you can see that the default partition of the cluster is called “slurm-cluster” (the * indicates the default property), the time limit imposed on the partitions, and the nodes that are associated with them.
$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurm-cluster*    up   infinite      6   idle iscpc88,node[01-02,05,10-11]
slurm-ws          up    1:00:00      4  down* iscpc[85-87,90]
slurm-ws          up    1:00:00      2   idle iscpc[14-15]
  • squeue shows the list of jobs currently submitted to all the partitions (queues). By default, the command shows all the jobs submitted to the cluster:
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               550 slurm-clu  sheepit    damir PD       0:00      1 (Resources)
               551 slurm-clu script.s rmarino1 PD       0:00      1 (Priority)
               549 slurm-clu  sheepit    damir  R      11:13      1 node05
               548 slurm-clu  sheepit    damir  R      11:25      1 iscpc88

here you can see that the command provides the ID of each job, the PARTITION used to run it (hence the nodes where the job will run), the NAME assigned to the job, the name of the USER that submitted it, the STATUS of the job (R = running, PD = pending), the execution TIME, and the nodes where the job is actually running (or the reason why it waits in the queue).

  • sbatch is used to submit and run jobs on the cluster. Jobs are nothing more than short scripts that contain some directives about the specific requirements of the programs that need to be executed. The output of the program will be written by default to two files called xxx.out and xxx.err, respectively for standard output (any message that would be printed on the screen) and standard error (any error message that would be printed on the screen), where xxx stands for the job id. You can change the output file names by setting --output= for standard output and --error= for standard error.
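For example, a hypothetical job script could be submitted with custom output file names like this (%j is replaced by the job id):

$ sbatch --output=myjob.%j.out --error=myjob.%j.err myjob.slurm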

Once a job is submitted (and accepted by the cluster), you'll receive the ID assigned to the job:

$ sbatch sheepit.slurm 
Submitted batch job 552
  • srun is used to launch your program immediately inside the cluster (interactive mode). I do not recommend this use of the cluster but, if you really need it, see the example below.
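A minimal sketch of an interactive session (the partition name and the limits are only examples): the --pty option attaches your terminal to the task, so you get a shell on a compute node that ends when you exit or when the time limit is reached.

$ srun --partition=slurm-cluster --time=0:30:00 --mem=4G --cpus-per-task=1 --pty bash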
  • scancel is used to remove your job from the queue or to kill your program when it's already running on the cluster:
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               552 slurm-clu  sheepit    damir PD       0:00      1 (Priority)
               550 slurm-clu  sheepit    damir PD       0:00      1 (Resources)
               551 slurm-clu script.s rmarino1 PD       0:00      1 (Priority)
               549 slurm-clu  sheepit    damir  R      35:54      1 node05
               548 slurm-clu  sheepit    damir  R      36:06      1 iscpc88
$ scancel 552
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               550 slurm-clu  sheepit    damir PD       0:00      1 (Resources)
               551 slurm-clu script.s rmarino1 PD       0:00      1 (Priority)
               549 slurm-clu  sheepit    damir  R      36:05      1 node05
               548 slurm-clu  sheepit    damir  R      36:17      1 iscpc88
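scancel also accepts filters instead of job IDs; for example, to remove all of your own jobs (pending and running) at once:

$ scancel --user=$USER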

Scripts (used with **sbatch**)

It is convenient to write the job script in a file, not only because the script can then be reused, but also because it is possible to set sbatch options directly inside the script, as in the following example:

$ cat sheepit.slurm 
#!/bin/bash

#SBATCH --job-name=sheepit
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --time=4:00:00
#SBATCH --mem=16G
#SBATCH --mail-user=damir.laurenzi@epfl.ch
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --error=sheepit/sheepit.%J.err
#SBATCH --output=sheepit/sheepit.%J.out
#SBATCH --partition slurm-cluster
#SBATCH --gres=gpu:1
#SBATCH --constraint=opteron61


echo "$(hostname) $(date)"

cd ${HOME}/sheepit
srun sleep 60
echo "$(hostname) $(date)"

Inside a script, all the lines that start with the '#' character are comments, but the lines that start with the '#SBATCH' string are directives for the queuing system. The example above instructs the queuing system as follows:

  • #SBATCH --job-name=sheepit: assign the name sheepit to the job
  • #SBATCH --nodes=1: require only one node to run the job
  • #SBATCH --cpus-per-task=8: inform the cluster that the program will need/use 8 cores to run
  • #SBATCH --time=4:00:00: the job will run for at most 4 hours
  • #SBATCH --mem=16G: the job requires 16GB of RAM to be executed
  • #SBATCH --mail-*: these parameters indicate the email address that will receive the messages from the cluster and when these messages are to be sent (when the job starts and when it ends)
  • #SBATCH --error / #SBATCH --output: these two directives indicate where to write the messages from the program
  • #SBATCH --partition: indicates the partition that must be used to run the program
  • #SBATCH --gres=gpu:1: this parameter informs the cluster that the program must run only on nodes that provide the “gpu” resource, and that 1 such resource is needed
  • #SBATCH --constraint=: this directive indicates that the program must run only on those nodes that provide the given property

At the moment we have defined these resources:

  • gpu on nodes that provide GPU capabilities.

and these properties/constraints:

  • opteron61: nodes that have the old AMD Opteron 6821 CPU, which lacks some newer hardware functions
  • matlab: nodes that can run Matlab simulations
  • mathematica: nodes that can run Mathematica simulations
  • tensorflow: nodes that can run TensorFlow 2.x
  • xeon26, xeon41, xeon56: nodes that have different versions of Intel Xeon CPUs
  • gpu: nodes that provide GPU capabilities

<note> Please pay attention: resources are not the same as properties, and the two must be indicated using different parameters inside the scripts:

  • #SBATCH --gres=: indicates the resource we want to use
  • #SBATCH --constraint=: indicates that we want to limit the run of our program to the nodes that present the given property

</note>
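For example, a job that needs one GPU and must run only on the tensorflow nodes would combine the two directives (a sketch using the resources and properties listed above):

#SBATCH --gres=gpu:1
#SBATCH --constraint=tensorflow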

Example (old PBS syntax): qsub -l nodes=1:ppn=8:bit64 (the string 1: is mandatory and means: I need at least one node with the properties that follow (eight cores with 64 bit architecture)). To specify more than one property, use the colon ”:“ to separate each of them. A job that requires one 64 bit CPU and Matlab should be submitted using qsub -l nodes=1:bit64:matlab <name of the pbs script>.

<note important> It is mandatory to specify at least the estimated run time of the job and the memory it needs, so that the scheduler can optimize the machine usage and the overall cluster throughput. If your job exceeds the limits you set, it will be automatically killed by the cluster manager.

By default, if no time limit is specified, the job is sent to the short queue and killed after one hour.

Please keep in mind that longer jobs are less likely to enter the queue when the cluster load is high. Therefore, don't be lazy: do not always ask for infinite run time, because your job will remain stuck in the queue. It is also not as smart as it might seem to submit tons of very short jobs, because the start-up and shut-down overheads are intentionally quite long. </note>

  • -a date_time declares the time after which the job is eligible for execution.

The date_time argument is in the form: [[[[CC]YY]MM]DD]hhmm[.SS]. Where CC is the first two digits of the year (the century), YY is the second two digits of the year, MM is the two digits for the month, DD is the day of the month, hh is the hour, mm is the minute, and the optional SS is the seconds.
If the month, MM, is not specified, it will default to the current month if the specified day DD, is in the future. Otherwise, the month will be set to next month. Likewise, if the day, DD, is not specified, it will default to today if the time hhmm is in the future. Otherwise, the day will be set to tomorrow. For example, if you submit a job at 11:15am with a time of -a 1110, the job will be eligible to run at 11:10am tomorrow.
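For example (myjob.pbs is a hypothetical script name), submitting at 11:15am with

$ qsub -a 1110 myjob.pbs

makes the job eligible to run at 11:10am tomorrow.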
Here you can find some useful PBS scripts that can be used as a starting point:

Script                                                      Execute with
Base example script (contains most of the useful options)  qsub [qsub options] base.pbs
Script example for running Matlab computations             qsub -l nodes=1:matlab [qsub options] matlab.pbs
Script example for running Mathematica computations        qsub [qsub options] mathematica.pbs
Script example for Windows programs (executed under wine)  qsub [qsub options] wine.pbs


The shell running the PBS script will have access to various variables that might be useful:

  • PBS_O_WORKDIR : the directory where the qsub command was issued
  • PBS_QUEUE : the actual queue where the job is running
  • PBS_JOBID : the internal job identification name
  • PBS_JOBNAME : the name of the job. Unless specified with the -N option, this is usually the name of the PBS script or STDIN
  • HOSTNAME : the name of the machine where the job is running

See the man page for more details.
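A minimal sketch of how these variables are typically used at the top of a PBS script (the echo line is purely illustrative):

#!/bin/bash
# move to the directory the job was submitted from
cd $PBS_O_WORKDIR
echo "job $PBS_JOBID ($PBS_JOBNAME) running on $HOSTNAME, queue $PBS_QUEUE"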

Making your script cross platform

Presently, we have only 64 bit compute nodes. If you need to compile for 32 bit platforms: in principle, 64 bit nodes can run 32 bit code out of the box; in reality, there might be problems due to missing or incompatible libraries. An easy solution for taking advantage both of the full set of architectures and of the optimized 64 bit code on 64 bit machines is the following (suggested by Alipour Masoud):

  1. Compile two versions of your code (32 and 64 bit);
  2. name the two executables (32 and 64 bit) WHATEVER.i686 and WHATEVER.x86_64 respectively (replace WHATEVER with the name you want to assign to your program);
  3. in your PBS script use ./WHATEVER.$(arch) to select the right executable and run it: 'arch' is a system program that discovers for you the architecture (32/64 bit) of the computer (see the sketch below).
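A minimal sketch of the relevant part of such a PBS script (WHATEVER is the placeholder used above):

#!/bin/bash
cd $PBS_O_WORKDIR
# arch prints i686 on 32 bit machines and x86_64 on 64 bit ones,
# so this line picks the right binary automatically
./WHATEVER.$(arch)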

qdel

When you submit a job, you receive from the system a number that is used as a reference to the job. To delete the job, all you have to do is launch the qdel command followed by the number of the job you want to delete.

$ qdel 236

You can also indicate more than one job number:

$ qdel 236 237 241

BUG

There is a bug in PBS that sometimes appears when the server wants to stop a running job but the node where the job is running does not respond (e.g. it crashed). When this happens, the server starts to send you a lot of identical mail messages telling you that it had to kill your job because it exceeded the time limit. If you start to receive the same message over and over about the same job ID, please contact your sysadmin. Thanks.

Tips and Tricks

Delete all queued jobs

qstat -u $(whoami) -n1 | grep "Q   --" | awk '{print $1;}' | while read a ; do qdel ${a%%.*} ; done 
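On the S.L.U.R.M. side the same task is simpler, because scancel can filter by job state; a sketch (this removes only your jobs that have not started yet):

$ scancel --state=PENDING --user=$USER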

A script that runs as long as possible

Here is a short script that can be useful in those cases where you have the same calculation to run many times (e.g. for collecting statistics).

Since the machines are different and take different amounts of time to run the program, one usually allocates the time needed by the slowest machine, even if on the fastest machine the actual running time would be 1/10 of the requested one.

The following script will keep running your program as long as there is time left. It uses the time needed to run one iteration to decide whether another one can be run.

qstat=/usr/bin/qstat
jobid=${PBS_JOBID%%.*}

# check how much time is left and set the "moretime" variable accordingly 
checktime() {
  if [ -x $qstat ] ; then
    times=$(qstat -n1 $jobid | tail -n 1)
    let tend=$(echo $times | awk '{print $9}' | awk -F : '{print $1*3600+$2*60;}')
    let tnow=$(echo $times | awk '{print $11}' | awk -F : '{print $1*3600+$2*60;}')
    let trem=$tend-$tnow
    let tmin=$tnow/$niter
    if [ $trem -ge $tmin ] ; then
      moretime="yes"
    else
      moretime="no"
    fi
  else
    # cannot say => random guess
    moretime="yes"
  fi
}

# Execute a task as many times as possible. 
let niter=0;
moretime="yes"
while [ "$moretime" == "yes" ] ; do
  # run your program here
  ./my_program.x
  let niter=$niter+1
  checktime
done
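A S.L.U.R.M. version of the checktime function could query squeue directly instead of qstat; a sketch, assuming time limits below one day (%L and %M print the remaining and the elapsed walltime as [hours:]minutes:seconds):

# check how much time is left and set the "moretime" variable accordingly
checktime() {
  # remaining and elapsed walltime of this job, converted to seconds
  trem=$(squeue -h -j $SLURM_JOB_ID -o %L | awk -F: '{if (NF==3) print $1*3600+$2*60+$3; else print $1*60+$2}')
  tnow=$(squeue -h -j $SLURM_JOB_ID -o %M | awk -F: '{if (NF==3) print $1*3600+$2*60+$3; else print $1*60+$2}')
  # average time per iteration so far
  let tmin=$tnow/$niter
  if [ $trem -ge $tmin ] ; then
    moretime="yes"
  else
    moretime="no"
  fi
}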