====== sge ======
The queuing system gives you access to computers owned by ALGO, LCM, LTHC, LTHI and the IC Faculty
As a user, you can take advantage of many more machines for your urgent calculations and get results faster. On the other hand, since the machines you are using are not always owned by your group, try to be as fair as possible and respect the needs of other users: if you notice that the cluster is overloaded (using the commands ''
We have configured the system almost without access restrictions because the queuing system can make more efficient use of the cluster if it does not have to satisfy too many constraints. Please don't force us to introduce limitations such as, for example,
<note>
As user practice has shown us, we have been forced to introduce some limitations:
  - The maximum number of jobs per user is between 130 and 150, depending on the other resources you request. This limit can be varied depending on the load of the cluster; please ask your sysadmin for such changes.
  - It's mandatory to specify how much memory your jobs will need: if you don't specify it, your job will be executed on a computer with a small amount of memory.
  - It's mandatory to specify how much time your job will need to complete: if you don't specify the time needed, the execution of the job will be terminated by force after one hour.
  - Jobs that need to run for more than 120 hours have lower precedence than other jobs.
</note>
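A minimal job script that satisfies these requirements might look like the sketch below. The resource values and the job name are examples only, not site defaults; adjust them to what your job really needs.

```shell
#!/bin/bash
# Hypothetical PBS job script: the memory and time values are examples.
#PBS -l mem=2gb            # mandatory: memory the job will need
#PBS -l walltime=02:00:00  # mandatory: run time (hh:mm:ss)
#PBS -N myjob              # optional: a name for the job

# the actual work goes here
msg="job started on $(hostname)"
echo "$msg"
```

Submit it with ''qsub myjob.sh''; without the two ''-l'' lines the defaults described above (a small-memory machine, a one-hour limit) apply.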
  * ''qstat -q'' shows the list of queues:
<code>
$ qstat -q

server: pbs
  * ''qstat -a'' shows the list of jobs:
<code>
$ qstat -a

licossrv4.epfl.ch:
164.licossrv4.epfl.c cangiani batch STDIN

$ qstat -n1

licossrv4.epfl.ch:
  * ''#
  * ''#
  * ''#
\\
Many options are available for the qsub command. The most important are the following:
  * ''
  * ''
  * ''
  * ''
The properties available on the various nodes can be listed with the ''
For the moment we have defined these properties:
  * ''
  * ''
  * ''
  * ''
  * ''
  * ''
Example: **qsub -l nodes=1:ppn=8:bit64** (the string ''
<note important>
It is **mandatory** to specify at least the estimated run time of the job and the memory it needs, so that the scheduler can optimize machine usage and the overall cluster throughput. If your job exceeds the limits you set, it will be terminated.
By default, if no time limit is specified, the job is sent to the ''
==== Making your script cross platform ====
Presently, we have only 64 bit compute nodes.
An easy solution for taking advantage of the full set of architectures is:
  - compile two versions of your code (32 and 64 bit);
  - name the two executables (32 and 64 bit) as ''
  - in your pbs script use ''
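The selection in the last step can be sketched as follows. The names ''myprog.32'' and ''myprog.64'' are hypothetical examples, not the page's actual naming convention.

```shell
#!/bin/bash
# Sketch: choose the executable matching the node's architecture.
# myprog.32 / myprog.64 are hypothetical names.
pick_binary() {
    case "$1" in
        x86_64) echo "./myprog.64" ;;
        i?86)   echo "./myprog.32" ;;
        *)      return 1 ;;
    esac
}

# fall back to the 64 bit build on unrecognized architectures
BIN=$(pick_binary "$(uname -m)") || BIN="./myprog.64"
echo "selected: $BIN"
# exec "$BIN"   # hand control to the chosen executable
```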
<code>
$ qdel 236
</code>
<code>
$ qdel 236 237 241
</code>
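To delete a whole range of consecutive job IDs, a small shell loop can help (the IDs below are just examples):

```shell
# Build the list of job IDs 236..241 and delete each one.
ids=$(seq 236 241)
for id in $ids; do
    echo qdel "$id"    # remove 'echo' to actually delete the jobs
done
```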
==== BUG ====
There is a bug in pbs that sometimes appears when the server wants to stop a running job but the node where the job is running does not respond (e.g. it crashed). When this happens, the server starts to send you a lot of identical mail messages telling you that it had to kill your job because it exceeded the time limit. If you start to receive the same message over and over about the same JOB ID, please contact
===== Tips and Tricks =====
Since the machines are different and take different amounts of time to run the program, one usually
allocates the time needed by the slowest machine, even if on the fastest machine the actual
running time would be 1/10 of the requested one.

The following script will keep running your program as long as there is time left. It will use
done
</code>
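The idea can be sketched as follows. ''LIMIT'' and ''run_once'' are stand-ins for the real time limit and the real program; the loop stops as soon as another run might not fit into the remaining time.

```shell
#!/bin/bash
# Sketch: re-run a program while enough of the requested wall time is left.
LIMIT=${LIMIT:-3}          # seconds granted to the job (stand-in value)
run_once() {               # stand-in for the real program
    sleep 1
    COUNT=$(( COUNT + 1 ))
}

COUNT=0
START=$(date +%s)
LONGEST=0                  # longest single run seen so far
while :; do
    LEFT=$(( LIMIT - ( $(date +%s) - START ) ))
    # stop when another run might not fit into the time left
    if [ "$LEFT" -le "$LONGEST" ]; then break; fi
    T0=$(date +%s)
    run_once
    RUN=$(( $(date +%s) - T0 ))
    if [ "$RUN" -gt "$LONGEST" ]; then LONGEST=$RUN; fi
done
echo "completed $COUNT runs"
```

Using the longest run seen so far as the stopping margin keeps the job inside its requested wall time, so it is never killed by the scheduler for exceeding the limit.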
sge.txt · Last modified: 2015/11/16 10:18 by 127.0.0.1