====== Queuing System - Torque/Maui ======
Our cluster runs the [[http://www.clusterresources.com/pages/products/torque-resource-manager.php|Torque]] resource manager (a PBS variant) and the [[http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php|Maui]] scheduler for managing batch jobs. It is **mandatory** to use this system for running long batch jobs: interactive calculations will be less and less tolerated.
  
The queuing system gives you access to computers owned by ALGO, ARNI, LCM, LICOS, LAPMAL, LTHC, and LTHI, for a total of approximately 300 cores. We (the sysadmins) believe that sharing the computational resources among as many groups as possible will result in a more efficient use of the resources (including the electric power). A larger cluster not only has better average throughput, but it is also better suited to respond to peak requests.
  
As a user you can take advantage of many more machines for your urgent calculations and get results faster. On the other hand, since the machines you are using are not always owned by your group, try to be as fair as possible and respect the needs of other users: if you notice that the cluster is overloaded (check with ''qstat -q'' or ''showq''), do not submit too many jobs and leave some room for the others.
  
We have configured the system almost without access restrictions because the queuing system can make more efficient use of the cluster if it does not have to satisfy too many constraints. Please don't force us to introduce limitations such as, for example, a maximum number of jobs per user.
<note>
As user practice has shown, we have been forced to introduce some limitations:
  - The maximum number of jobs per user is between 130 and 150, depending on the other resources you request. This limit can be adjusted depending on the load of the cluster; please ask your sysadmin for such changes.
  - It is mandatory to specify how much memory your job will need.
  - It is mandatory to specify how much time your job will need to complete.
  - Jobs that need to run for more than 120 hours have lower priority than other jobs.
</note>
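Given the limits above, a minimal job script declaring both memory and run time might look like the sketch below. The job name, resource values, and program name are placeholders, not site defaults:

```shell
#!/bin/bash
# Minimal PBS job script sketch; all values below are examples only.
#PBS -N myjob                 # job name (placeholder)
#PBS -l pmem=2gb              # memory the job needs (mandatory on this cluster)
#PBS -l walltime=12:00:00     # estimated run time (mandatory on this cluster)

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"
./my_program                  # placeholder executable
```

Submit it with ''qsub myjob.pbs''. Remember that the job is killed automatically if it exceeds the declared limits, so leave a reasonable safety margin.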
  
===== Mini User Guide =====
[root@licossrv4 server_priv]# qstat -q
  
server: pbs
  
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
  * ''-q queue_name'' forces the job to run on a specific queue. Presently the queue is automatically selected according to the resources you request for the job. We might add more conditions or queues if we see that they are needed.
  * ''-l resource_list'' defines the resources that are required by the job and establishes a limit to the amount of resources that can be consumed. For example, a job that needs a lot of memory is dispatched only to a compute node that can offer that amount of memory. The main resources that can be requested are:
    * ''cput'' for cpu time (example: ''-l cput=08:00:00''),
    * ''pmem'' for physical memory (example: ''-l pmem=4gb''),
    * ''nodes'' for giving a list of nodes (hostnames or //properties//) to consider.
    * ''mathematica'' for nodes that can launch Mathematica simulations; follow [[sge:mathematica_batch|How to generate Mathematica scripts]] if you need a hint.
    * ''magma'' for the [[http://magma.maths.usyd.edu.au/|MAGMA]] Computational Algebra System.
    * ''cuda'' for nodes with CUDA Tesla 2070 hardware and development software.
    * ''f12'' for nodes with Linux Fedora 12 installed.
    * ''f14'' for nodes with Linux Fedora 14 installed.
Example: **qsub -l nodes=1:bit64** (the string ''1:'' is mandatory and means: //I need at least one node with the properties that follow//). To specify more than one property, use a colon ":" to separate the properties. A job that requires a 64-bit CPU and Matlab should be submitted using **qsub -l nodes=1:bit64:matlab <name of the pbs script>**.
  
<note important>
It is **mandatory** to specify at least the estimated run time of the job and the memory it needs, so that the scheduler can optimize machine usage and the overall cluster throughput. If your job exceeds the limits you set, it will be automatically killed by the cluster manager.
  
By default, if no time limit is specified, the job is sent to the ''short'' queue and killed after one hour.
  
  - Compile two versions of your code (32 and 64 bit);
  - name the 32 and 64 bit executables ''WHATEVER.i686'' and ''WHATEVER.x86_64'' respectively (replace ''WHATEVER'' with whatever name you like);
  - in your pbs script use ''./WHATEVER.`arch`'' to select the right executable and run it.
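The trick in step 3 relies on ''arch'' printing the machine type, e.g. ''i686'' on 32-bit and ''x86_64'' on 64-bit Linux nodes. A sketch of the selection line, with ''WHATEVER'' kept as the placeholder name:

```shell
#!/bin/bash
# Compose the executable name from this node's architecture:
# `arch` prints e.g. i686 (32-bit) or x86_64 (64-bit).
exe="./WHATEVER.$(arch)"
echo "selected executable: $exe"
```

Because the name is computed on the compute node at run time, the same pbs script can be submitted unchanged to both 32-bit and 64-bit nodes.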
  
If your workstation is a 32-bit machine, you can compile the 64-bit version of your code on ''iscsrv14''.
  
  
  
There is a bug in pbs that sometimes appears when the server wants to stop a running job but the node where the job is running does not respond (e.g. it crashed). When this happens, the server starts sending you many identical mail messages telling you that it had to kill your job because it exceeded the time limit. If you start to receive the same message over and over about the same JOB ID, please contact a sysadmin. Thanks.
===== Tips and Tricks =====
=== Delete all queued jobs ===
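A common way to do this with Torque is to combine ''qselect'' and ''qdel''; this is a sketch, so double-check the job states with ''qstat'' before running it:

```shell
# List the ids of your own jobs that are still queued (state Q)
# and delete them; running jobs are left untouched.
qselect -u "$USER" -s Q | xargs qdel
```

Dropping the ''-s Q'' filter would delete running jobs as well, so keep it unless that is really what you want.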
sge.txt · Last modified: 2015/11/16 10:18 by 127.0.0.1