sge

Differences

This shows you the differences between two versions of the page.

sge [2011/01/26 09:10]
damir
sge [2015/11/16 10:18] (current)
Line 3: Line 3:
 scheduler for managing batch jobs. It is **mandatory** to use this system for running long batch jobs: interactive calculations will be less and less tolerated.
  
-The queuing system give you access to computers owned by ALGO, ARNI, LCM, LICOS, LAPMAL, LTHC, and LTHI for a total of approximately 300 cores. We (the sysadmins) believe that sharing the computational resources among as many groups as possible will result in a more efficient use of the resources (included the electric power). A larger cluster not only have an improved average throughput, but it is also better suited to respond to peak requests.
+The queuing system gives you access to computers owned by ALGO, LCM, LTHC, LTHI and IC Faculty for a total of approximately 500 cores. Sharing the computational resources among as many groups as possible results in a more efficient use of the resources (including electric power). A larger cluster not only has a better average throughput, but is also better suited to respond to peak requests.
  
-As an user you can take advantage of many more machines for your urgent calculations and get results faster. On the other hand, since the machines your are using are not always owned by your group, try to be as fair as possible and respect the needs of other users: if you notice that the cluster is overloaded (using the commands ''qstat -q '' or ''showq ''), do not submit too many jobs and leave some space for the others.
+As a user you can take advantage of many more machines for your urgent calculations and get results faster. On the other hand, since the machines you are using are not always owned by your group, try to be as fair as possible and respect the needs of other users: if you notice that the cluster is overloaded (using the commands ''qstat -q'' or ''showq''), do not submit too many jobs and leave some room for the others.
  
-We have configured the system almost without access restriction because the queuing system can make a more efficient use of the cluster if it does not have to satisfy too many constraints. Please don't force us to introduce limitations such as, for example, maximum number of jobs per user.
+We have configured the system almost without access restrictions because the queuing system can make more efficient use of the cluster if it does not have to satisfy too many constraints. Please don't force us to introduce limitations such as, for example, reducing the maximum number of jobs executed per user.
 <note>
 As user practice showed us, we have been forced to introduce some limitations:
   - The maximum number of jobs per user is between 130 and 150, depending on the other resources you request. This limit can be varied depending on the load of the cluster; please ask your sysadmin for such changes.
-  - It's mandatory to specify how much memory your jobs will need.
+  - It's mandatory to specify how much memory your jobs will need: if you don't specify it, your job will be executed on a computer with a small amount of memory.
-  - It's mandatory to specify how much time your job will need to complete.
+  - It's mandatory to specify how much time your job will need to complete: if you don't specify the time needed, the job will be forcibly terminated after one hour.
   - Jobs that need to run for more than 120 hours have lower priority than other jobs.
 </note>
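As a minimal sketch of how the memory and time requirements in the note could be declared, a job script might carry them in its ''#PBS'' header lines. The resource names ''cput'' and ''pmem'' and the ''-o'' option are the ones listed on this page; the job name, the values and the script body are illustrative assumptions, not a recommended configuration.

```shell
#!/bin/bash
# Illustrative job script: every value below is an assumption to adapt to your job.
# Hypothetical job name:
#PBS -N myJob
# CPU time requested (hh:mm:ss):
#PBS -l cput=08:00:00
# Physical memory requested:
#PBS -l pmem=4gb
# One node carrying the bit64 property:
#PBS -l nodes=1:bit64
# File that receives the job output:
#PBS -o myScript.out

# PBS sets PBS_O_WORKDIR to the directory qsub was called from;
# fall back to the current directory when the script runs outside the queue.
cd "${PBS_O_WORKDIR:-.}" || exit 1

# Record which machine the job landed on.
JOB_HOST=$(uname -n)
echo "job running on ${JOB_HOST}"
```

Keeping the comments on their own lines, rather than after the directives, avoids any risk of the comment text being read as part of the directive.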
Line 26: Line 26:
   * ''qstat -q'' shows the status of the queues. In the following example there are 5 queues (''long, short, batch, algo'', and ''default'', which is an alias for ''short''). There are 100 jobs running on the ''long'' queue and one job queued in ''algo''. In the ''short'' queue (the default if you don't specify how long your job is supposed to run), a job can run for at most one hour.
 <code>
-[root@licossrv4 server_priv]# qstat -q
+qstat -q
 
 server: pbs
Line 42: Line 42:
   * ''qstat -a'' gives more information about the jobs in the queue. The job status is indicated in the ''S'' column: ''R''=running, ''Q''=queued, etc. As an alternative, one can use ''qstat -n1'', which also shows the name of the machine where the job is running:
 <code>
-[root@licossrv4 server_priv]# qstat -a
+qstat -a
 
 licossrv4.epfl.ch:
Line 68: Line 68:
 164.licossrv4.epfl.c cangiani batch    STDIN       30756     1  --    --    --  R   --
 
-[root@licossrv4 server_priv]# qstat -n1
+qstat -n1
 
 licossrv4.epfl.ch:
Line 123: Line 123:
   * ''#PBS -o myScript.out'': all the output generated by my program must be saved in a file named myScript.out.\\
   * ''#PBS -l nodes=1:bit64'': I need at least one node with a 64 bit cpu for my program.\\
+  * ''#PBS -l nodes=1:ppn=8:bit64'': I need at least one node with at least eight 64 bit cores for my program.\\
 \\
 Many options are available for the qsub command. The most important are the following:
Line 129: Line 130:
     * ''cput'' for cpu time (example: ''-l cput=08:00:00''),
     * ''pmem'' for physical memory (example: ''-l pmem=4gb''),
+    * ''ppn'' for the number of cores needed inside a single node (useful for parallel programs),
     * ''nodes'' for giving a list of nodes (hostnames or //properties//) to consider.
 The properties available on the various nodes can be listed with the ''pbsnodes -a'' command.\\
 For the moment we have defined these properties:
     * ''bit64'' on 64 bit machines.
-    * ''bit32'' on 32 bit machines (mainly needed because of matlab).
     * ''matlab'' for nodes that can launch matlab simulations.
     * ''mathematica'' for nodes that can launch Mathematica simulations; follow [[sge:mathematica_batch|How to generate Mathematica scripts]], if you need a hint.
-    * ''magma'' for [[http://magma.maths.usyd.edu.au/|MAGMA]] Computational Algebra System
+    * ''magma'' for [[http://magma.maths.usyd.edu.au/|MAGMA]] Computational Algebra System: because of its licence, this program is limited to running only on a single node.
-    * ''cuda'' for nodes with CUDA Hardware and development software.
+    * ''cuda'' for nodes with CUDA Tesla 2070 hardware and development software (Jul 2015: currently decommissioned/unavailable).
-    * ''f10'' for nodes with Linux Fedora 10 installed.
-    * ''f12'' for nodes with Linux Fedora 12 installed.
+    * ''f20'' for nodes with Linux Fedora 20 installed.
-Example **qsub -l nodes=1:bit64** (the string ''1:'' is mandatory and means: //I need at least one node with the property that follow//). To specify more than one property use the colon ":" to separate the properties. a job that require 64 bit cpu and matlab should be called using **qsub -l nodes=1:bit64:matlab <name of the pbs script>**.
+Example: **qsub -l nodes=1:ppn=8:bit64** (the string ''1:'' is mandatory and means: //I need at least one node with the properties that follow//, here eight cores with 64 bit architecture). To specify more than one property, use the colon ":" to separate them. A job that requires one 64 bit cpu and matlab should be submitted using **qsub -l nodes=1:bit64:matlab <name of the pbs script>**.
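The colon-separated syntax described in the example can be assembled by a small helper, sketched here under stated assumptions: the function name ''nodes_spec'' and the script name ''myScript.pbs'' are hypothetical, and the command is only printed, not submitted.

```shell
# Join a node count and any number of properties with colons,
# producing the value expected after "qsub -l" (hypothetical helper).
nodes_spec() {
    local count="$1"
    shift
    local spec="nodes=${count}"
    local prop
    for prop in "$@"; do
        spec="${spec}:${prop}"
    done
    printf '%s\n' "$spec"
}

# Print the resulting command line instead of running it.
echo qsub -l "$(nodes_spec 1 ppn=8 bit64)" myScript.pbs
```

Called as ''nodes_spec 1 bit64 matlab'', the helper yields the same string as the second example in the text.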
  
 <note important>
-It **mandatory** to specify at least the estimated run time of the job and the memory needed by so that the scheduler can optimize the machines usage and the overall cluster throughput. If your job will pass the limits you fixed, the job will be automatically killed by the cluster manager.
+It is **mandatory** to specify at least the estimated run time of the job and the memory it needs, so that the scheduler can optimize machine usage and the overall cluster throughput. If your job exceeds the limits you set, it will be automatically killed by the cluster manager.
 
 By default, if no time limit is specified, the job is sent to the ''short'' queue and killed after one hour.
Line 173: Line 173:
 
 ==== Making your script cross platform ====
-Presently, we have both 32 and 64 bit compute nodes. In principle, 64 bit nodes can run 32 bit code out of the box. In reality, there might be problems due to missing or incompatible library.
+Presently, we have only 64 bit compute nodes. If you need to compile for 32 bit platforms, in principle, 64 bit nodes can run 32 bit code out of the box. In reality, there might be problems due to missing or incompatible libraries.
-An easy solution for taking advantage both of the full set of machines and also of the optimized 64 code on 64 bit machines is the following (suggested by Masoud):
+An easy solution for taking advantage both of the full set of architectures and of the optimized 64 bit code on 64 bit machines is the following (suggested by Alipour Masoud):
 
   - Compile two versions of your code (32 and 64 bit);
-  - name the two executable 32 and 64 bit as ''WHATEVER.i686'' and ''WHATEVER.x86_64'' respectively (replace ''WHATEVER'' with what you want);
+  - name the two executables (32 and 64 bit) as ''WHATEVER.i686'' and ''WHATEVER.x86_64'' respectively (replace ''WHATEVER'' with the name you want to assign to your program);
-  - in your pbs script use ''./WHATEVER.`arch`'' to select the good executable and run it.
+  - in your pbs script use ''./WHATEVER.$(arch)'' to select the right executable and run it: ''arch'' is a system program that reports the architecture (32/64 bit) of the computer.
-
-If your workstation is 32bit machine, then you can compile the 64 bit version of your code on ''iscsrv13''
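The selection step above can be made a little more defensive. The following is a sketch, not the page's own script: ''pick_executable'' is a hypothetical helper, and it uses ''uname -m'', which reports the same machine name as ''arch'', so that an unexpected architecture produces a clear error instead of a missing-file failure.

```shell
# Map the machine architecture to the executable name used in the steps above.
# The second argument is optional and defaults to the current machine.
pick_executable() {
    local base="$1"
    local machine="${2:-$(uname -m)}"
    case "$machine" in
        x86_64) echo "./${base}.x86_64" ;;   # 64 bit build
        i?86)   echo "./${base}.i686" ;;     # 32 bit build (i386 ... i686)
        *)      echo "unsupported architecture: ${machine}" >&2
                return 1 ;;
    esac
}

# Show which executable would be chosen for an explicit architecture.
pick_executable WHATEVER x86_64
```

In a pbs script the chosen program could then be started with ''"$(pick_executable WHATEVER)"''.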
  
  
Line 188: Line 186:
 <code>
 
-damir@lthipc1:~$ qdel 236
+$ qdel 236
 
 </code>
Line 195: Line 193:
 <code>
 
-damir@lthipc1:~$ qdel 236 237 241
+$ qdel 236 237 241
 
 </code>
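When the jobs to delete have consecutive IDs, the list can be generated rather than typed. This is a small sketch, assuming the standard ''seq'' utility is available; the range 236 to 241 is only an example, and the command is printed so that it can be checked before it is actually run.

```shell
# Build a qdel command for a range of consecutive job IDs (example range).
first=236
last=241

# seq prints one ID per line; leaving $ids unquoted joins them with single spaces.
ids=$(seq "$first" "$last")
cmd=$(echo qdel $ids)

# Inspect the command first.
echo "$cmd"
# On the cluster, run it once it looks right:
# $cmd
```

IDs in the range that belong to other users or no longer exist would normally just be rejected by ''qdel'' with an error message.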
Line 201: Line 199:
 ==== BUG ====
 
-There is a bug in pbs that appears some time when the server would like to stop a running job but the node where the job is running does not respond (e.g. it did crash). When this happens, the server starts to send you a lot identical mail messages telling you that it had to kill your job because it exceeded the time limit. If you start to receive the same message over and over about the same JOB ID, please contact sys admin. Thanks.
+There is a bug in pbs that sometimes appears when the server tries to stop a running job but the node where the job is running does not respond (e.g. because it crashed). When this happens, the server starts sending you many identical mail messages telling you that it had to kill your job because it exceeded the time limit. If you start to receive the same message over and over about the same JOB ID, please contact your sysadmin. Thanks.
 
 ===== Tips and Tricks =====
Line 215: Line 213:
 Since the machines are different and take different times to run the program, one usually
 allocates the time needed by the slowest machine even if on the fastest machine the actual
-running time would be 1/10 of the requested one. As you know, the queueing system does
-not like when it is provided with wrong informations.
+running time would be 1/10 of the requested one.
 
 The following script will keep running your program as long as there is time left. It will use
Line 254: Line 251:
 done
 </code>
+
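The idea of repeating a program while the requested time lasts can also be sketched independently of the script shown on the page. In this sketch, which is an assumption-laden illustration and not the page's own script, ''run_while_time_left'' is a hypothetical helper that takes a time budget in seconds, an estimate of how long one run takes, and the command to repeat.

```shell
# Repeat a command while another run is expected to fit in the time budget.
#   $1 = total budget in seconds (e.g. the time requested from the queue)
#   $2 = estimated duration of one run in seconds
#   remaining arguments = the command to run
# Prints the number of completed runs.
run_while_time_left() {
    local budget="$1" per_run="$2"
    shift 2
    local start now runs=0
    start=$(date +%s)
    while :; do
        now=$(date +%s)
        # Stop when elapsed time plus one more run would exceed the budget.
        [ $(( now - start + per_run )) -le "$budget" ] || break
        "$@" || return 1
        runs=$(( runs + 1 ))
    done
    echo "$runs"
}

# With a zero budget no run fits, so the command is never started.
run_while_time_left 0 1 true
```

Choosing the per-run estimate for the slowest machine keeps the last run from being killed at the time limit, at the cost of some unused time on faster machines.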
sge.txt · Last modified: 2015/11/16 10:18 (external edit)