sge
Differences
This shows you the differences between two versions of the page.
sge [2009/05/13 13:08] – damir | sge [2011/06/21 22:52] – damir | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== | + | ====== |
Our cluster runs [[http:// | Our cluster runs [[http:// | ||
- | | + | |
- | The queueing | + | The queuing |
- | As a user you can take advantage of many more machines for your urgent calculations and get results faster. On the other hand, since the machines you are using are not always owned by your group, try to be as fair as possible and respect the needs of other users: if you notice that the cluster is overloaded ('' | + | As a user you can take advantage of many more machines for your urgent calculations and get results faster. On the other hand, since the machines you are using are not always owned by your group, try to be as fair as possible and respect the needs of other users: if you notice that the cluster is overloaded (using the commands |
- | We have configured the system almost without access restriction because the queueing | + | We have configured the system almost without access restriction because the queuing |
+ | < | ||
+ | As practice has shown, we have been forced to introduce some limitations: | ||
+ | - The maximum number of jobs per user is between 130 and 150, depending on the other resources you request. This limit can be varied depending on the load of the cluster; please ask your sysadmin for such changes. | ||
+ | - It's mandatory to specify how much memory your jobs will need. | ||
+ | - It's mandatory to specify how much time your job will need to complete. | ||
+ | - Jobs that need to run for more than 120 hours have lower priority than other jobs. | ||
+ | </ | ||
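As an illustration of the two mandatory requests above, here is a minimal sketch of a job script header. It assumes the standard PBS/Torque resource names ''mem'' and ''walltime''; the resource names and values actually expected on this cluster may differ, so treat the numbers as placeholders.

```shell
#!/bin/sh
# Hypothetical job header: the values below are placeholders, not site defaults.
#PBS -l nodes=1
#PBS -l mem=2gb            # assumed resource name for the memory request
#PBS -l walltime=24:00:00  # assumed resource name for the time request

# The actual work of the job would go here.
echo "job running on $(hostname)"
```

The same requests can also be given on the command line instead of in the script, e.g. ''qsub -l mem=2gb -l walltime=24:00:00 myScript.pbs'' (again assuming those resource names).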
===== Mini User Guide ===== | ===== Mini User Guide ===== | ||
Line 21: | Line 28: | ||
[root@licossrv4 server_priv]# | [root@licossrv4 server_priv]# | ||
- | server: | + | server: |
Queue Memory CPU Time Walltime Node Run Que Lm State | Queue Memory CPU Time Walltime Node Run Que Lm State | ||
Line 87: | Line 94: | ||
183.licossrv4.epfl.c cangiani batch STDIN | 183.licossrv4.epfl.c cangiani batch STDIN | ||
</ | </ | ||
- | |||
- | |||
- | |||
==== qsub ==== | ==== qsub ==== | ||
Line 107: | Line 111: | ||
#PBS -j oe | #PBS -j oe | ||
#PBS -o myScript.out | #PBS -o myScript.out | ||
- | #PBS -l nodes=1:64bit | + | #PBS -l nodes=1:bit64 |
cd bin | cd bin | ||
Line 118: | Line 122: | ||
* ''# | * ''# | ||
* ''# | * ''# | ||
- | * ''# | + | * ''# |
\\ | \\ | ||
Many options are available for the qsub command. The most important are the following: | Many options are available for the qsub command. The most important are the following: | ||
* '' | * '' | ||
* '' | * '' | ||
- | * '' | + | * '' |
* '' | * '' | ||
* '' | * '' | ||
The properties available on the various nodes can be listed with the '' | The properties available on the various nodes can be listed with the '' | ||
- | For the moment we have defined | + | For the moment we have defined |
* '' | * '' | ||
* '' | * '' | ||
* '' | * '' | ||
* '' | * '' | ||
- | * '' | + | * '' |
- | Example **qsub -l nodes=1: | + | * '' |
+ | * '' | ||
+ | * '' | ||
+ | Example **qsub -l nodes=1: | ||
<note important> | <note important> | ||
- | It is very **important | + | It is **mandatory** |
By default, if no time limit is specified, the job is sent to the '' | By default, if no time limit is specified, the job is sent to the '' | ||
Line 149: | Line 156: | ||
Here you can find some useful PBS scripts that can be used as a starting point | Here you can find some useful PBS scripts that can be used as a starting point | ||
^Script | ^Script | ||
- | |{{base.pbs|Script | + | |{{base2.pbs|Base example |
- | |{{nomail.pbs|base Script example}}|qsub [qsub options] nomail.pbs| | + | |
|{{matlab.pbs|Script example for running matlab computations}}|qsub -l nodes=1: | |{{matlab.pbs|Script example for running matlab computations}}|qsub -l nodes=1: | ||
|{{mathematica.pbs|Script example for running Mathematica computations}}|qsub [qsub options] mathematica.pbs| | |{{mathematica.pbs|Script example for running Mathematica computations}}|qsub [qsub options] mathematica.pbs| | ||
Line 157: | Line 163: | ||
\\ | \\ | ||
+ | The shell running the PBS script will have access to various variables that might be useful: | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | |||
See the man page for more details. | See the man page for more details. | ||
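A small sketch of how such variables might be used inside a job script. ''PBS_JOBID'' and ''PBS_O_WORKDIR'' are standard PBS/Torque variables; the helper name ''short_jobid'' and the fallback values are only illustrative, so that the snippet also runs outside the queuing system.

```shell
#!/bin/sh
# short_jobid: strip the server suffix from a job id (e.g. "183.server" -> "183").
# The function name is hypothetical, introduced for this example only.
short_jobid() {
    echo "${1%%.*}"
}

# Fall back to dummy values when the script is not started by PBS.
jobid=$(short_jobid "${PBS_JOBID:-0.localhost}")

# Move to the directory from which qsub was invoked.
cd "${PBS_O_WORKDIR:-.}" || exit 1

echo "job $jobid started in $(pwd)"
```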
+ | |||
+ | ==== Making your script cross platform ==== | ||
+ | Presently, we have both 32 and 64 bit compute nodes. In principle, 64 bit nodes can run 32 bit code out of the box. In reality, there might be problems due to missing or incompatible libraries. | ||
+ | An easy way to take advantage of both the full set of machines and the optimized 64 bit code on 64 bit machines is the following (suggested by Masoud): | ||
+ | |||
+ | - Compile two versions of your code (32 and 64 bit); | ||
+ | - name the two executables (32 and 64 bit) as '' | ||
+ | - in your pbs script use '' | ||
+ | |||
+ | If your workstation is a 32bit machine, then you can compile the 64 bit version of your code on '' | ||
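One possible way to pick the right executable at run time is sketched below. It assumes the two builds are named with ''.32'' and ''.64'' suffixes (a hypothetical naming scheme chosen for this example) and uses ''uname -m'' to detect the node architecture; this is not necessarily the exact method suggested above.

```shell
#!/bin/sh
# pick_exe: print the executable name matching the machine type.
#   $1  base name of the program (hypothetical, e.g. ./myprog)
#   $2  machine type; defaults to the output of `uname -m`
pick_exe() {
    arch=${2:-$(uname -m)}
    case "$arch" in
        x86_64|amd64) echo "$1.64" ;;  # 64 bit node: use the optimized build
        *)            echo "$1.32" ;;  # anything else: fall back to 32 bit
    esac
}
```

In a job script this could be used as ''exe=$(pick_exe ./myprog); "$exe"'', so that the same script runs unchanged on both kinds of nodes.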
+ | |||
==== qdel ==== | ==== qdel ==== | ||
Line 178: | Line 202: | ||
There is a bug in PBS that sometimes appears when the server tries to stop a running job but the node where the job is running does not respond (e.g. because it crashed). When this happens, the server starts to send you a lot of identical mail messages telling you that it had to kill your job because it exceeded the time limit. If you start to receive the same message over and over about the same JOB ID, please contact a sysadmin. Thanks. | There is a bug in PBS that sometimes appears when the server tries to stop a running job but the node where the job is running does not respond (e.g. because it crashed). When this happens, the server starts to send you a lot of identical mail messages telling you that it had to kill your job because it exceeded the time limit. If you start to receive the same message over and over about the same JOB ID, please contact a sysadmin. Thanks. | ||
+ | |||
+ | ===== Tips and Tricks ===== | ||
+ | === Delete all queued jobs === | ||
+ | < | ||
+ | qstat -u $(whoami) -n1 | grep " | ||
+ | </ | ||
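The idea behind the one-liner above can be sketched as a small filter. It assumes the usual column layout of ''qstat -u'' output in PBS/Torque, where the job state is the 10th field and the job id is the first; check the output on your system before relying on it. The function name ''queued_job_ids'' is hypothetical.

```shell
#!/bin/sh
# queued_job_ids: read qstat-style lines on stdin and print the numeric id
# of every job whose state column (assumed to be field 10) is "Q".
queued_job_ids() {
    awk '$10 == "Q" { split($1, id, "."); print id[1] }'
}
```

A possible use is ''qstat -u $(whoami) | queued_job_ids'' to review the list first, and then to pass the ids to ''qdel'' once you are sure they are the jobs you want to remove.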
+ | |||
+ | === A script that runs as long as possible === | ||
+ | Here is a short script that can be useful in those cases where you have the same | ||
+ | calculation to run many times (e.g. for collecting statistics). | ||
+ | |||
+ | Since the machines are different and take different times to run the program, one usually | ||
+ | allocates the time needed by the slowest machine even if on the fastest machine the actual | ||
+ | running time would be 1/10 of the requested one. As you know, the queuing system does | ||
+ | not like being given wrong information. | ||
+ | |||
+ | The following script will keep running your program as long as there is time left. It uses | ||
+ | the time needed to run one iteration to decide whether another one can be run. | ||
+ | |||
+ | < | ||
+ | qstat=/ | ||
+ | jobid=${PBS_JOBID%%.*} | ||
+ | |||
+ | # check how much time is left and set the " | ||
+ | checktime() { | ||
+ | if [ -x $qstat ] ; then | ||
+ | times=$(qstat -n1 $jobid | tail -n 1) | ||
+ | let tend=$(echo $times | awk ' | ||
+ | let tnow=$(echo $times | awk ' | ||
+ | let trem=$tend-$tnow | ||
+ | let tmin=$tnow/ | ||
+ | if [ $trem -ge $tmin ] ; then | ||
+ | moretime=" | ||
+ | else | ||
+ | moretime=" | ||
+ | fi | ||
+ | else | ||
+ | # cannot say => random guess | ||
+ | moretime=" | ||
+ | fi | ||
+ | } | ||
+ | |||
+ | # Execute a task as many times as possible. | ||
+ | let niter=0; | ||
+ | moretime=" | ||
+ | while [ " | ||
+ | # run your program here | ||
+ | ./ | ||
+ | let niter=$niter+1 | ||
+ | checktime | ||
+ | done | ||
+ | </ |
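The decision rule used by the script above can be isolated in two small helpers, shown here as a sketch. It assumes that the requested and elapsed times are available as ''HH:MM'' strings (as printed by ''qstat''); the function names ''to_minutes'' and ''has_time_left'' are hypothetical. The average time per iteration is rounded up to the next minute, which is a slightly more conservative choice than plain integer division.

```shell
#!/bin/sh
# to_minutes: convert an "HH:MM" string to a number of minutes.
# awk is used so that values with leading zeros (e.g. "08") are read as decimal.
to_minutes() {
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# has_time_left REQUESTED ELAPSED NITER
# Succeeds when the remaining time is at least the average time spent
# per iteration so far, i.e. when one more iteration is likely to fit.
has_time_left() {
    tend=$(to_minutes "$1")
    tnow=$(to_minutes "$2")
    niter=$3
    # No iteration completed yet: nothing to estimate from, so keep going.
    [ "$niter" -ge 1 ] || return 0
    trem=$((tend - tnow))
    tmin=$(( (tnow + niter - 1) / niter ))  # average per iteration, rounded up
    [ "$trem" -ge "$tmin" ]
}
```

For example, ''has_time_left 10:00 02:00 4'' succeeds (480 minutes left, about 30 minutes per iteration), while ''has_time_left 10:00 09:30 2'' fails (30 minutes left, about 285 minutes per iteration).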
sge.txt · Last modified: 2015/11/16 11:18 by 127.0.0.1