====== Queueing System - S.L.U.R.M. ======
Our cluster runs the [[https://slurm.schedmd.com|S.L.U.R.M.]] workload manager for managing batch jobs. It is **preferable** to use this system for running long batch jobs, as interactive calculations are less reliable and require more manual work.
  
The queuing system gives you access to computers owned by LCM, LTHC, LTHI, LINX and the IC Faculty; sharing the computational resources among as many groups as possible results in a more efficient use of the resources (including the electric power), and you can take advantage of many more machines for your urgent calculations and get results faster.
On the other hand, since the machines you are using are not always owned by your group, try to be as fair as possible and respect the needs of other users.

We have configured the system with almost no restrictions on access and capabilities, because the queuing system can make more efficient use of the cluster if it does not have to satisfy too many constraints. We currently enforce only a few constraints (see the example right after this list):
  - number of CPUs/cores: you must indicate the correct number of cores you are going to use;
  - megabytes/gigabytes of RAM your jobs need;
  - time for the execution: if your job is not completed within the indicated time, it will be automatically terminated.

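A minimal sketch (the values are only placeholders) of the ''sbatch'' directives, explained in detail further down this page, that correspond to the three constraints:
<code>
# number of cores, RAM and maximum run time requested by the job
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=2:00:00
</code>
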
Here we provide just a quick and dirty guide to the most basic commands/tasks that you are going to use for your day-to-day activities; you can find better and more complete guides on how to use the S.L.U.R.M. control commands on the internet, e.g.:
  - [[https://slurm.schedmd.com/|Slurm Documentation]]
  - [[https://scitas-data.epfl.ch/confluence/display/DOC/FAQ#FAQ-BatchSystemQuestions|SCITAS Documentation]]
  - [[https://slurm.schedmd.com/quickstart.html|Quick Start]]
  
==== Partitions (a.k.a. queues) ====
If you have used other types of cluster management, you will already know the term "queue" to identify the type of computers (nodes) or programs (jobs) you want to use. In S.L.U.R.M. notation, ''queues'' are called **partitions**. The two terms are used to indicate the same entity, even if they are not quite the same.
  
===== Mini User Guide =====
  
The most used/needed commands are:
  - ''squeue'' for checking the status of the partitions or of your running jobs
  - ''sbatch'' or ''srun'' for submitting your jobs
  - ''scancel'' for deleting a running or waiting job
  - ''sinfo'' to discover the availability of nodes and partitions
  
  * ''sinfo'' shows the list of partitions, nodes and their availability. Here you can see that the default partition of the cluster is called "slurm-cluster" (the * indicates the default), the time limit imposed on each partition, the nodes associated with it and their activity status:
<code>
$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurm-cluster*    up   infinite      6   idle iscpc88,node[01-02,05,10-11]
slurm-ws          up    1:00:00      4  down* iscpc[85-87,90]
slurm-ws          up    1:00:00      2   idle iscpc[14-15]
</code>
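If you also want a node-oriented view (one line per node, including its CPUs, memory and state), ''sinfo'' can print a long listing; this is a generic invocation, not output taken from our cluster:
<code>
$ sinfo -N -l
</code>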
  
  * ''squeue'' shows the list of jobs currently submitted to all the **partitions** (queues). By default, the command shows all the jobs submitted to the cluster:
<code>
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               550 slurm-clu  sheepit    damir PD       0:00      1 (Resources)
               551 slurm-clu script.s rmarino1 PD       0:00      1 (Priority)
               549 slurm-clu  sheepit    damir  R      11:13      1 node05
               548 slurm-clu  sheepit    damir  R      11:25      1 iscpc88
</code>
Here you can see that the command provides the ID of the jobs, the PARTITION used to run the jobs (hence the nodes where these jobs will run), the NAME assigned to the jobs, the name of the USER that submitted the jobs, the state of the job in the ST column (R=running, PD=pending), the execution TIME and the nodes where the jobs are actually running (or the reason why they wait in the queue).
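Two handy variants (both are standard ''squeue'' options): show only your own jobs, or only one specific job (549 is just the id taken from the example above):
<code>
$ squeue -u $(whoami)
$ squeue -j 549
</code>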
  
  * ''sbatch'' is used to submit and run jobs on the cluster. Jobs are nothing more than short scripts that contain some directives about the specific requirements of the programs that need to be executed. The output of the program will be written by default to two files called xxx.out and xxx.err, respectively for standard output (any message that would be printed on the screen) and standard error (any error message that would be printed on the screen); ''xxx'' stands for the job id. You can change the output file names by setting the directives ''--output='' for standard output and ''--error='' for standard error (an example follows the submission below).
Once a job is submitted (and accepted by the cluster), you will receive the ID assigned to the job:
<code>
$ sbatch sheepit.slurm
Submitted batch job 552
</code>
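The ''--output='' and ''--error='' options can also be passed on the command line at submission time; the file names below are only illustrative (''%j'' is replaced by the job id):
<code>
$ sbatch --output=sheepit.%j.out --error=sheepit.%j.err sheepit.slurm
</code>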
  * ''srun'' is used to launch your program inside the cluster. It can be used to obtain an interactive session on a node, or to launch parallel programs; when it is used inside sbatch/slurm scripts, the cluster always knows what resources are allocated. See the sketch below for an interactive session.

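A minimal sketch of such an interactive session (the partition and the resource values are placeholders, adapt them to your needs):
<code>
$ srun --partition=slurm-ws --cpus-per-task=2 --mem=4G --time=1:00:00 --pty bash
</code>
When the session ends (or the time limit is reached), the allocated resources are released.
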
  * ''scancel'' is used to remove your job from the queue or to kill your program when it is already running on the cluster:
<code>
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               552 slurm-clu  sheepit    damir PD       0:00      1 (Priority)
               550 slurm-clu  sheepit    damir PD       0:00      1 (Resources)
               551 slurm-clu script.s rmarino1 PD       0:00      1 (Priority)
               549 slurm-clu  sheepit    damir  R      35:54      1 node05
               548 slurm-clu  sheepit    damir  R      36:06      1 iscpc88
$ scancel 552
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               550 slurm-clu  sheepit    damir PD       0:00      1 (Resources)
               551 slurm-clu script.s rmarino1 PD       0:00      1 (Priority)
               549 slurm-clu  sheepit    damir  R      36:05      1 node05
               548 slurm-clu  sheepit    damir  R      36:17      1 iscpc88
</code>
  
=== Scripts (used with sbatch) ===

It is convenient to write the job script in a file, not only because in this way the script can be reused, but also because it is possible to set ''sbatch'' options directly inside the script, as in the following example (which shows the content of the file sheepit.slurm):
<code>
$ cat sheepit.slurm
#!/bin/bash

#SBATCH --job-name=sheepit
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --time=4:00:00
#SBATCH --mem=16G
#SBATCH --mail-user=damir.laurenzi@epfl.ch
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --error=sheepit/sheepit.%J.err
#SBATCH --output=sheepit/sheepit.%J.out
#SBATCH --partition slurm-cluster
#SBATCH --gres=gpu:1
#SBATCH --constraint=opteron61

echo "$(hostname) $(date)"

cd ${HOME}/sheepit
srun sleep 60
echo "$(hostname) $(date)"
</code>
  
<note>
At the beginning of the file you can see the line ''#!/bin/bash'', which is not strictly necessary. It is common practice to write slurm scripts as bash scripts so that they can also be executed outside of the cluster; in that case the '#SBATCH' lines are interpreted as comments.
</note>
Inside a script, all the lines that start with the '#' character are comments, but the lines that start with the '#SBATCH' string are directives for the queuing system.
The example above instructs the queuing system to:
  * ''#SBATCH --job-name=sheepit'' assigns the name sheepit to the job
  * ''#SBATCH --nodes=1'' requires only one node to run the job
  * ''#SBATCH --cpus-per-task=8'' informs the cluster that the program will need/use 8 cores to run
  * ''#SBATCH --time=4:00:00'' the job is allowed to run for at most 4 hours
  * ''#SBATCH --mem=16G'' the job will require 16GB of RAM to be executed. Different units can be specified using the suffixes [K|M|G|T]
  * ''#SBATCH --mail-*'' these parameters indicate the email address that will receive the messages from the cluster and when these messages are to be sent (when the job starts and when it ends)
  * ''#SBATCH --error'', ''#SBATCH --output'' these two directives indicate where to write the messages from the program(s) you execute
  * ''#SBATCH --partition'' indicates the partition that must be used to run the program
  * ''#SBATCH --gres=gpu:1'' this parameter informs the cluster that the program must be run only on nodes that provide the "gpu" resource and that **1** of these resources is needed
  * ''#SBATCH --constraint='' this directive indicates that the program must be run **only** on those nodes that provide the given property. Constraints can be combined using AND, OR or combinations (as in --constraint="intel&gpu" or --constraint="intel|amd"); see the sketch right after this list, and refer to the {{https://slurm.schedmd.com/sbatch.html|sbatch manual}} to better understand the possibilities.
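For instance, a sketch of a combined constraint using two of the properties defined below (the choice of properties here is only an example):
<code>
#SBATCH --constraint="xeon41|epyc7302"
</code>
This asks for a node that provides either the ''xeon41'' or the ''epyc7302'' property.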
  
At the moment we have defined these resources:
    * ''gpu'' on nodes that provide GPU capabilities

and these properties/constraints:
    * ''opteron61'' nodes that have the old AMD Opteron 6821 CPU, which lacks some newer hardware functions
    * ''matlab'' nodes that can run Matlab simulations
    * ''mathematica'' nodes that can run Mathematica simulations
    * ''tensorflow'' nodes that can run tensorflow 2.x
    * ''xeon26'', ''xeon41'', ''xeon56'' nodes that have different versions of Intel Xeon CPUs
    * ''epyc7302'' nodes that provide the AMD Epyc CPUs
    * ''gpu'' nodes that provide GPU capabilities

<note>
Please pay attention that ''resources'' are not the same as ''properties'', and the two must be indicated using different parameters inside the scripts (see the sketch right after this note):
  * ''#SBATCH --gres='' indicates the resource we want to use
  * ''#SBATCH --constraint='' indicates that we want to limit the run of our program to the nodes that present the property
</note>
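A small sketch of the difference, using the ''gpu'' entries from the lists above (the amount requested is only an example):
<code>
# request 1 unit of the "gpu" resource: a GPU is allocated to the job
#SBATCH --gres=gpu:1
# only restrict the job to nodes that carry the "gpu" property
#SBATCH --constraint=gpu
</code>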

<note important>
It is **mandatory** to specify at least the estimated run time of the job and the memory it needs, so that the scheduler can optimize the nodes/cores/memory usage and the overall cluster throughput. If your job exceeds the limits you fixed, it will be automatically killed by the cluster manager (the limits can also be given on the command line, as shown right after this note).

Please keep in mind that longer jobs are less likely to enter the queue when the cluster load is high. Therefore, don't be lazy and do not always ask for //infinite// run time, because your job will remain stuck in the queue.
</note>
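If you prefer not to edit the script, the limits can also be given (or adjusted) at submission time; the values below are just placeholders:
<code>
$ sbatch --time=2:00:00 --mem=8G sheepit.slurm
</code>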
  
Here you can find some useful sbatch scripts that can be used as a starting point:
^Script  ^  Execute with  ^
|{{base.slurm|Base example script}} contains most of the useful options|sbatch [sbatch options] base.slurm|
|{{matlab.slurm|Script example for running matlab computations}}|sbatch [sbatch options] matlab.slurm|
|{{mathematica.slurm|Script example for running Mathematica computations}}|sbatch [sbatch options] mathematica.slurm|
|{{wine.slurm|Script example for windows programs (executed under wine)}}|sbatch [sbatch options] wine.slurm|
  
\\
  
The shell running the sbatch script has access to various environment variables that might be useful (for example the job id and the number of allocated CPUs); you can find [[https://slurm.schedmd.com/sbatch.html#lbAJ|here]] a complete list. A small example follows.\\
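A sketch of how such variables can be used inside a job script; ''SLURM_JOB_ID'' and ''SLURM_CPUS_PER_TASK'' are standard S.L.U.R.M. variables, the rest is only illustrative:
<code>
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --time=1:00:00
#SBATCH --mem=4G

# work in a per-job directory named after the job id
mkdir -p ${HOME}/runs/${SLURM_JOB_ID}
cd ${HOME}/runs/${SLURM_JOB_ID}

# report where the job runs and how many cores it was given
echo "Job ${SLURM_JOB_ID} on $(hostname) with ${SLURM_CPUS_PER_TASK} cores"
</code>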
    
See the man pages for more details.
  
===== Tips and Tricks =====
=== Delete all queued jobs ===
<code>
squeue -h -u $(whoami) -t PENDING | awk '{print $1}' | while read a ; do scancel ${a} ; done
</code>
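The same can be obtained with a single command, since ''scancel'' accepts user and state filters directly:
<code>
scancel -u $(whoami) -t PENDING
</code>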
  
  