Queuing System - Torque/Maui

Our cluster runs the Torque resource manager (a PBS variant) and the Maui scheduler for managing batch jobs. It is mandatory to use this system for running long batch jobs: interactive calculations will be less and less tolerated.

The queuing system gives you access to computers owned by ALGO, LCM, LTHC, LTHI and IC Faculty, for a total of approximately 500 cores. Sharing the computational resources among as many groups as possible results in a more efficient use of the resources (including electric power). A larger cluster not only has a better average throughput, it is also better suited to respond to peak requests.

As a user you can take advantage of many more machines for your urgent calculations and get results faster. On the other hand, since the machines you are using are not always owned by your group, try to be as fair as possible and respect the needs of other users: if you notice that the cluster is overloaded (using the commands qstat -q or showq), do not submit too many jobs and leave some space for the others.

We have configured the system with almost no access restrictions because the queuing system can use the cluster more efficiently if it does not have to satisfy too many constraints. Please don't force us to introduce further limitations such as, for example, reducing the maximum number of jobs executed per user.

In practice, user behaviour has forced us to introduce some limitations:
  1. The maximum number of jobs per user is between 130 and 150, depending on the other resources you request. This limit can be changed depending on the load of the cluster; please ask your sysadmin for such changes.
  2. It is mandatory to specify how much memory your job will need: if you don't, your job will be executed on a computer with a small amount of memory.
  3. It is mandatory to specify how much time your job will need to complete: if you don't, the job will be forcibly terminated after one hour.
  4. Jobs that need to run for more than 120 hours have lower priority than other jobs.

Mini User Guide

The 3 most used commands are:

  1. qstat: for checking the status of the queues or of your running jobs
  2. qsub: for submitting your jobs
  3. qdel: for deleting a running or waiting job

qstat

  • qstat -q shows the status of the queues. In the following example there are 5 queues (long, short, batch, algo, and default, which is an alias for short). There are 100 jobs running in the long queue and one job queued in algo. In the short queue (the default one if you don't specify how long your job is supposed to run), a job can run for at most one hour.
$ qstat -q

server: pbs

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
long               --      --       --      --  100   0 --   E R
default            --      --       --      --    0   0 --   E R
short              --   01:00:00    --      --    0   0 --   E R
batch              --      --       --      --    0   0 --   E R
algo               --   24:00:00    --      --    0   1 --   E R
                                               ----- -----
                                                 100     1
  • qstat -a gives more information about the jobs in the queue. The job status is indicated in the S column: R=running, Q=queued, etc. As an alternative, one can use qstat -n1, which also shows the name of the machine where the job is running:
$ qstat -a

licossrv4.epfl.ch: 
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
146.licossrv4.epfl.c damir    batch    STDIN        3980     1  --    --    --  R   -- 
147.licossrv4.epfl.c damir    batch    STDIN        3998     1  --    --    --  R   -- 
148.licossrv4.epfl.c damir    batch    STDIN       24367     1  --    --    --  R   -- 
149.licossrv4.epfl.c damir    batch    STDIN       24390     1  --    --    --  R   -- 
150.licossrv4.epfl.c damir    batch    STDIN         --      1  --    --    --  Q   -- 
151.licossrv4.epfl.c damir    batch    STDIN         --      1  --    --    --  Q   -- 
152.licossrv4.epfl.c damir    batch    STDIN         --      1  --    --    --  Q   -- 
153.licossrv4.epfl.c damir    batch    STDIN         --      1  --    --    --  Q   -- 
154.licossrv4.epfl.c damir    batch    STDIN         --      1  --    --    --  Q   -- 
155.licossrv4.epfl.c cangiani batch    STDIN       15006     1  --    --    --  R   -- 
156.licossrv4.epfl.c cangiani batch    STDIN       15028     1  --    --    --  R   -- 
157.licossrv4.epfl.c cangiani batch    STDIN       11036     1  --    --    --  R   -- 
158.licossrv4.epfl.c cangiani batch    STDIN       11045     1  --    --    --  R   -- 
159.licossrv4.epfl.c cangiani batch    STDIN       11080     1  --    --    --  R   -- 
160.licossrv4.epfl.c cangiani batch    STDIN       11097     1  --    --    --  R   -- 
161.licossrv4.epfl.c cangiani batch    STDIN       30704     1  --    --    --  R   -- 
162.licossrv4.epfl.c cangiani batch    STDIN       30715     1  --    --    --  R   -- 
163.licossrv4.epfl.c cangiani batch    STDIN       30733     1  --    --    --  R   -- 
164.licossrv4.epfl.c cangiani batch    STDIN       30756     1  --    --    --  R   -- 

$ qstat -n1

licossrv4.epfl.ch: 
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
165.licossrv4.epfl.c damir    batch    STDIN        4522     1  --    --    --  R 00:01   lthipc1/0
166.licossrv4.epfl.c damir    batch    STDIN        4549     1  --    --    --  R 00:01   lthipc1/1
167.licossrv4.epfl.c damir    batch    STDIN       24672     1  --    --    --  R 00:01   node02/0
168.licossrv4.epfl.c damir    batch    STDIN       24701     1  --    --    --  R 00:01   node02/1
169.licossrv4.epfl.c damir    batch    STDIN         --      1  --    --    --  Q   --     -- 
170.licossrv4.epfl.c damir    batch    STDIN         --      1  --    --    --  Q   --     -- 
171.licossrv4.epfl.c damir    batch    STDIN         --      1  --    --    --  Q   --     -- 
172.licossrv4.epfl.c damir    batch    STDIN         --      1  --    --    --  Q   --     -- 
173.licossrv4.epfl.c damir    batch    STDIN         --      1  --    --    --  Q   --     -- 
174.licossrv4.epfl.c cangiani batch    STDIN       15202     1  --    --    --  R   --    node03/0
175.licossrv4.epfl.c cangiani batch    STDIN       15225     1  --    --    --  R   --    node03/1
176.licossrv4.epfl.c cangiani batch    STDIN       11477     1  --    --    --  R   --    lthcserv7/0
177.licossrv4.epfl.c cangiani batch    STDIN       11494     1  --    --    --  R   --    lthcserv7/1
178.licossrv4.epfl.c cangiani batch    STDIN       11501     1  --    --    --  R   --    lthcserv7/2
179.licossrv4.epfl.c cangiani batch    STDIN       11508     1  --    --    --  R   --    lthcserv7/3
180.licossrv4.epfl.c cangiani batch    STDIN       30886     1  --    --    --  R   --    lcmpc1/0
181.licossrv4.epfl.c cangiani batch    STDIN       30910     1  --    --    --  R   --    lcmpc1/1
182.licossrv4.epfl.c cangiani batch    STDIN       30931     1  --    --    --  R   --    lcmpc1/2
183.licossrv4.epfl.c cangiani batch    STDIN       30952     1  --    --    --  R   --    lcmpc1/3
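
The two listings above are easy to summarise with awk. The following is only a sketch: the function names queue_totals and jobs_per_user are our own invention, and the column positions are assumed to match the sample output shown above (Run and Que in columns 6 and 7 of qstat -q, user and state in columns 2 and 10 of qstat -a).

```shell
# Print the total number of running and queued jobs from `qstat -q` output
# read on stdin. Queue lines have 10 fields with numeric Run and Que columns.
queue_totals() {
  awk 'NF == 10 && $6 ~ /^[0-9]+$/ && $7 ~ /^[0-9]+$/ { run += $6; que += $7 }
       END { printf "running=%d queued=%d\n", run, que }'
}

# Count jobs per user and state from `qstat -a` (or `qstat -n1`) output read
# on stdin. Job lines start with a numeric job id; column 10 is the state.
jobs_per_user() {
  awk '$1 ~ /^[0-9]+\./ { n[$2 " " $10]++ }
       END { for (k in n) print k, n[k] }' | sort
}

# Typical use on the cluster:
#   qstat -q | queue_totals
#   qstat -a | jobs_per_user
```

A large "queued" total is a good hint that the cluster is busy and that it is not the right moment to submit many more jobs.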

qsub

qsub is used to submit jobs. Jobs are nothing more than short scripts that launch the program to be executed. The simplest job looks like the following:

$ echo "cd myProject/bin ; ./mynumbercruncher" | qsub
188.licossrv4.epfl.ch

This changes to the directory myProject/bin, located in my home directory, and executes the program called mynumbercruncher. By default, the output of the program is written to two files called STDIN.oXXX and STDIN.eXXX, for standard output and standard error respectively, where XXX stands for the job id (188 in the above example). You can change the output file names with -o filename for standard output and -e filename for standard error, or use -j oe to merge standard error into standard output.
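
As an illustration of the -o and -j oe options, here is a small sketch of a wrapper for one-line jobs. The function name submit_oneliner and the log file name are hypothetical; only the qsub options come from the text above.

```shell
# Submit a one-line job. Standard output and standard error are merged
# (-j oe) and written to the file given as first argument (-o).
# The remaining arguments form the command line of the job.
submit_oneliner() {
  local logfile=$1
  shift
  echo "$*" | qsub -j oe -o "$logfile"
}

# Example (on the cluster):
#   submit_oneliner cruncher.log "cd myProject/bin ; ./mynumbercruncher"
```

With this, all the messages of the job end up in cruncher.log instead of the default STDIN.oXXX and STDIN.eXXX pair.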

Scripts

It is convenient to write the job script in a file, not only because the script can then be reused, but also because qsub options can be set directly inside the script, as in the following example:

$ cat myScript.sh
# lines starting with #PBS are directives for qsub
#PBS -j oe
#PBS -o myScript.out
#PBS -l nodes=1:bit64

cd bin 
./bogo

Inside a script, all lines that start with the '#' character are comments, but lines that start with the '#PBS' string are directives for the queuing system. The directives in the example above, plus one common variant of the nodes request, have the following meaning:

  • #PBS -j oe: put all the output messages (messages from the program and execution errors) in a single file.
  • #PBS -o myScript.out: save all the output generated by the program in a file named myScript.out.
  • #PBS -l nodes=1:bit64: I need at least one node with a 64 bit cpu for my program.
  • #PBS -l nodes=1:ppn=8:bit64 (variant, not used in the example above): I need at least one node with at least 8 64 bit cores for my program.
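
Putting these directives together, a job script for a multi-core program could look like the sketch below. The file name parallel_example.pbs, the job name and the program name my_parallel_program are placeholders; the -N option and the PBS_O_WORKDIR variable are standard qsub features that are not part of the example above.

```shell
# Generate a job script for a multi-core run; the file name, job name and
# program name are placeholders.
cat > parallel_example.pbs <<'EOF'
#!/bin/bash
# lines starting with #PBS are directives for qsub
#PBS -N parallel_example
#PBS -j oe
#PBS -o parallel_example.out
#PBS -l nodes=1:ppn=8:bit64

# start from the directory where qsub was called
cd "$PBS_O_WORKDIR" || exit 1
./my_parallel_program
EOF

# On the cluster, submit it with: qsub parallel_example.pbs
```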


Many options are available for the qsub command. The most important are the following:

  • -q queue_name forces the job to run in a specific queue. Presently the queue is selected automatically according to the resources you request for the job. We might add more conditions or queues if we see that they are needed.
  • -l resource_list defines the resources that are required by the job and establishes a limit to the amount of resource that can be consumed. For example, a job that needs a lot of memory is dispatched only to a compute node that can offer that amount of memory. The main resources that can be requested are:
    • cput for cpu time (example: -l cput=08:00:00),
    • pmem for physical memory (example: -l pmem=4gb),
    • ppn for the number of cores needed inside a single node (useful for parallel programs),
    • nodes for giving a list of nodes (hostnames or properties) to consider.

The properties available on the various nodes can be listed with the pbsnodes -a command.
For the moment we have defined these properties:

  • bit64 on 64 bit machines.
  • matlab for nodes that can launch matlab simulations.
  • mathematica for nodes that can launch Mathematica simulations; see How to generate Mathematica scripts if you need a hint.
  • magma for the MAGMA Computational Algebra System: because of licence restrictions this program can run only on a single node.
  • cuda for nodes with CUDA Tesla 2070 hardware and development software (Jul 2015: currently decommissioned/unavailable).
  • f20 for nodes with Linux Fedora 20 installed.

Example: qsub -l nodes=1:ppn=8:bit64. The string 1: is mandatory and means "I need at least one node with the properties that follow" (here, eight cores with 64 bit architecture). To specify more than one property, separate them with a colon ":". For example, a job that requires one 64 bit cpu and matlab should be submitted with qsub -l nodes=1:bit64:matlab <name of the pbs script>.
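
If you build the nodes request from a script, the colon-separated syntax can be assembled as in the sketch below. The helper name nodes_spec is hypothetical; it only joins its arguments and does not check that the properties exist on the cluster.

```shell
# Build a "nodes=" resource string: the first argument is the number of
# nodes, the following ones are properties (or ppn=N), joined with ":".
nodes_spec() {
  local count=$1
  shift
  local spec="nodes=${count}"
  local p
  for p in "$@" ; do
    spec="${spec}:${p}"
  done
  echo "$spec"
}

# Example (on the cluster):
#   qsub -l "$(nodes_spec 1 bit64 matlab)" matlab.pbs
```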

It is mandatory to specify at least the estimated run time of the job and the memory it needs, so that the scheduler can optimize the usage of the machines and the overall cluster throughput. If your job exceeds the limits you set, it will be automatically killed by the cluster manager.

By default, if no time limit is specified, the job is sent to the short queue and killed after one hour.
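
A minimal sketch of a submission that declares both limits, using the cput and pmem resources listed above. The wrapper name submit_with_limits and the values in the example are hypothetical; choose limits that match your own program.

```shell
# Submit a job script with an explicit cpu time limit (hh:mm:ss) and
# memory request. Several resources can be given in one -l option,
# separated by commas.
submit_with_limits() {
  local cput=$1 pmem=$2 script=$3
  qsub -l "cput=${cput},pmem=${pmem}" "$script"
}

# Example (on the cluster): 8 hours of cpu time and 4 GB of memory
#   submit_with_limits 08:00:00 4gb myScript.sh
```

The same requests can also be written inside the script as #PBS -l cput=08:00:00 and #PBS -l pmem=4gb.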

Please keep in mind that longer jobs are less likely to leave the queue and start running when the cluster load is high. Therefore, don't be lazy: do not always ask for infinite run time, because your job will remain stuck in the queue. Submitting tons of very short jobs is also not as smart as it might seem, because the start-up and shut-down overheads are intentionally quite long.

  • -a date_time declares the time after which the job is eligible for execution.

The date_time argument has the form [[[[CC]YY]MM]DD]hhmm[.SS], where CC is the first two digits of the year (the century), YY is the last two digits of the year, MM is the two-digit month, DD is the day of the month, hh is the hour, mm is the minute, and the optional SS is the seconds.
If the month, MM, is not specified, it will default to the current month if the specified day DD, is in the future. Otherwise, the month will be set to next month. Likewise, if the day, DD, is not specified, it will default to today if the time hhmm is in the future. Otherwise, the day will be set to tomorrow. For example, if you submit a job at 11:15am with a time of -a 1110, the job will be eligible to run at 11:10am tomorrow.
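
To avoid surprises with the defaulting rules, you can always pass the full CCYYMMDDhhmm form. The sketch below builds it with the date command; it assumes GNU date (for the -d option), and the function name pbs_start_time is hypothetical.

```shell
# Convert a human readable date into the CCYYMMDDhhmm form accepted by
# qsub -a. Requires GNU date for the -d option.
pbs_start_time() {
  date -d "$1" +%Y%m%d%H%M
}

# Example (on the cluster): make the job eligible tomorrow at 02:30
#   qsub -a "$(pbs_start_time 'tomorrow 02:30')" myScript.sh
```
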
Here you can find some useful pbs scripts that can be used as a starting point:

Script                                                     Execute with
---------------------------------------------------------  ------------------------------------------------
Base example script (contains most of the useful options)  qsub [qsub options] base.pbs
Script example for running matlab computations             qsub -l nodes=1:matlab [qsub options] matlab.pbs
Script example for running Mathematica computations        qsub [qsub options] mathematica.pbs
Script example for windows programs (executed under wine)  qsub [qsub options] wine.pbs


The shell running the pbs script has access to various variables that might be useful:

  • PBS_O_WORKDIR : the directory where the qsub command was issued
  • PBS_QUEUE : the actual queue where the job is running
  • PBS_JOBID : the internal job identification name
  • PBS_JOBNAME : the name of the job. Unless specified with the -N option, this is usually the name of the pbs script or STDIN
  • HOSTNAME : the name of the machine where the job is running

See the man page for more details.
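
As a sketch of how these variables can be used, the function below prints a one-line summary that is handy at the top of a job log. The name job_banner and the fallback values (used when the script runs outside the queuing system) are our own choices.

```shell
# Print a one-line description of the current job. Outside the queuing
# system the PBS_* variables are unset, so fallback values are used.
job_banner() {
  echo "job=${PBS_JOBNAME:-none} id=${PBS_JOBID:-none} queue=${PBS_QUEUE:-none} dir=${PBS_O_WORKDIR:-$PWD} host=${HOSTNAME:-$(uname -n)}"
}

# Typical use at the beginning of a pbs script:
#   cd "${PBS_O_WORKDIR:-$PWD}" || exit 1
#   job_banner
```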

Making your script cross platform

Presently, we have only 64 bit compute nodes. If you need to compile for 32 bit platforms, 64 bit nodes can in principle run 32 bit code out of the box. In reality, there might be problems due to missing or incompatible libraries. An easy way to take advantage of both the full set of architectures and the optimized 64 bit code on 64 bit machines is the following (suggested by Alipour Masoud):

  1. compile two versions of your code (32 and 64 bit);
  2. name the two executables (32 and 64 bit) WHATEVER.i686 and WHATEVER.x86_64 respectively (replace WHATEVER with the name you want to give your program);
  3. in your pbs script use ./WHATEVER.$(arch) to select the right executable and run it: arch is a system program that reports the architecture (32/64 bit) of the computer.
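
The three steps above can be sketched as a small helper that also checks that the selected executable exists. The function name pick_executable is hypothetical, and uname -m is used here because it prints the same machine name as arch.

```shell
# Print the name of the executable matching the machine architecture.
# $1 is the base name (WHATEVER); $2 optionally overrides the detected
# architecture, which is useful for testing.
pick_executable() {
  local base=$1
  local machine=${2:-$(uname -m)}
  local candidate="${base}.${machine}"
  if [ -x "$candidate" ] ; then
    echo "$candidate"
  else
    echo "no executable found for ${machine}: ${candidate}" >&2
    return 1
  fi
}

# Typical use in a pbs script:
#   exe=$(pick_executable ./WHATEVER) && "$exe"
```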

qdel

When you submit a job, the system gives you a number that is used as a reference to the job. To delete the job, all you have to do is run the qdel command followed by the number of the job you want to delete.

$ qdel 236

You can also indicate more than one job number:

$ qdel 236 237 241
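
qsub prints the full job identifier (for example 188.licossrv4.epfl.ch), so a script that wants to delete its own jobs later has to keep only the number. A minimal sketch, where the function name job_number is hypothetical:

```shell
# Strip the server name from a job identifier printed by qsub,
# e.g. 188.licossrv4.epfl.ch -> 188
job_number() {
  echo "${1%%.*}"
}

# Example (on the cluster):
#   jobid=$(echo "cd myProject/bin ; ./mynumbercruncher" | qsub)
#   qdel "$(job_number "$jobid")"
```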

BUG

There is a bug in pbs that sometimes appears when the server tries to stop a running job but the node where the job is running does not respond (e.g. because it crashed). When this happens, the server starts sending you many identical mail messages telling you that it had to kill your job because it exceeded the time limit. If you start to receive the same message over and over about the same job ID, please contact your sysadmin. Thanks.

Tips and Tricks

Delete all queued jobs

qstat -u $(whoami) -n1 | grep "Q   --" | awk '{print $1;}' | while read a ; do qdel ${a%%.*} ; done 
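
The one-liner above relies on the exact spacing of the qstat output. A slightly more tolerant variant, given here only as a sketch, selects the jobs by the state column instead; the function name queued_job_ids is hypothetical and the column position (10) is assumed to match the qstat -n1 listing shown earlier.

```shell
# Read `qstat -n1` output on stdin and print the numeric id of every
# job whose state column (column 10) is Q.
queued_job_ids() {
  awk '$1 ~ /^[0-9]+\./ && $10 == "Q" { sub(/\..*$/, "", $1); print $1 }'
}

# Example (on the cluster):
#   qstat -u "$(whoami)" -n1 | queued_job_ids | while read -r id ; do qdel "$id" ; done
```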

A script that runs as long as possible

Here is a short script that can be useful in those cases where you have the same calculation to run many times (e.g. for collecting statistics).

Since the machines are different and take different time to run the program, one usually allocates the time needed by the slowest machine even if on the fastest machine the actual running time would be 1/10 of the requested one.

The following script keeps running your program as long as there is time left. It uses the average time needed to run one iteration to decide whether another one can be run.

#!/bin/bash
qstat=/usr/bin/qstat
jobid=${PBS_JOBID%%.*}

# Check how much time is left and set the "moretime" variable accordingly.
# Columns 9 and 11 of the "qstat -n1" job line are the requested and the
# elapsed time, both in hh:mm format.
checktime() {
  if [ -x "$qstat" ] ; then
    times=$("$qstat" -n1 "$jobid" | tail -n 1)
    tend=$(echo "$times" | awk '{print $9}' | awk -F : '{print $1*3600+$2*60;}')
    tnow=$(echo "$times" | awk '{print $11}' | awk -F : '{print $1*3600+$2*60;}')
    trem=$((tend - tnow))
    # average duration of one iteration so far
    tmin=$((tnow / niter))
    if [ "$trem" -ge "$tmin" ] ; then
      moretime="yes"
    else
      moretime="no"
    fi
  else
    # qstat is not available: cannot say => optimistic guess
    moretime="yes"
  fi
}

# Execute a task as many times as possible.
niter=0
moretime="yes"
while [ "$moretime" = "yes" ] ; do
  # run your program here
  ./my_program.x
  niter=$((niter + 1))
  checktime
done
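
To make the stopping rule concrete, the sketch below isolates the arithmetic of checktime so that it can be tried without the queuing system. The function names to_seconds and can_run_again are hypothetical; times are given in the hh:mm format printed by qstat.

```shell
# Convert a time in hh:mm format to seconds.
to_seconds() {
  echo "$1" | awk -F : '{ print $1 * 3600 + $2 * 60 }'
}

# Succeed if one more iteration is expected to fit: the remaining time
# (requested - elapsed) must be at least the average time per iteration.
# Arguments: requested time, elapsed time, number of iterations done (>= 1).
can_run_again() {
  local requested elapsed niter
  requested=$(to_seconds "$1")
  elapsed=$(to_seconds "$2")
  niter=$3
  [ $((requested - elapsed)) -ge $((elapsed / niter)) ]
}

# Example: 2 hours requested, 30 minutes elapsed after 3 iterations.
# The average iteration takes 600 s and 5400 s remain, so the test succeeds.
#   can_run_again 02:00 00:30 3 && echo "run once more"
```
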
sge.txt · Last modified: 2015/11/16 10:18 (external edit)