//slurm-dummies: created 2020/03/02 16:53 by admin; last modified 2023/10/09 13:17 by admin//
====== S.L.U.R.M. ======
===== for Dummies =====
**The 1st thing you need to know:**\\
Using a SLURM script is like typing the commands in a shell. Therefore you must include in the script all the commands that you would use in the shell before/after launching your program.
\\
Every instruction line for the queue manager starts with #SBATCH, so\\
# SBATCH...... : this is a comment\\
==== The mandatory directives ====
The directives that you must **always** include in the scripts are:
  - Your email address: the official EPFL address or another one, as long as it is valid (worldwide), so the system can notify you.
  - How much time your job may run (if the job runs over this limit the cluster manager will kill it). The minimum is 1 minute.
  - How much memory (RAM) your job will use. Remember that if your job uses more memory than the limit you set here, the cluster manager will kill it. The minimum is 512 MB; currently (as of Feb 2020) the maximum is 250 GB.
  - How many nodes (computers) you are going to use with your script.
  - How many cores/CPUs must be reserved for your job. If you don't include this parameter, only one core/CPU will be assigned to your job and you cannot run more than a single-threaded job.
  - **The name of the queue/partition** you want to use.
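Each of the mandatory items above maps onto one ''#SBATCH'' option (all of them appear later in this page). As a quick reference, a sketch, where the ''...'' values are placeholders you fill in yourself:

```shell
#SBATCH --mail-user=...      # 1. your email address
#SBATCH --time=...           # 2. maximum run time (hours:minutes:seconds)
#SBATCH --mem=...            # 3. memory limit
#SBATCH --nodes=...          # 4. number of nodes (computers)
#SBATCH --cpus-per-task=...  # 5. number of cores/CPUs
#SBATCH --partition=...      # 6. queue/partition name
```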
+ | |||
+ | ==== partitions (a.k.a. queues) ==== | ||
+ | If you used other types of cluster management, you will already known the term '' | ||
+ | The ' | ||
+ | - slurm-cluster: | ||
+ | - slurm-gpu: this includes computers that have a gpu (nvidia, mostly) that can be used for HPC. | ||
+ | - slurm-ws: this includes all the workstations that are sitting under your desks, programs that run a very shor time (1 hour top) can take advantage of the workstation cpus not used by the users. | ||
The beginning of your script will be:
<code>
# your email address
#SBATCH --mail-user=<your email address>
# how much time this process must run (hours:minutes:seconds), 4 hours in this example
#SBATCH --time=04:00:00
# how much memory does it need? 1 GB (1024 MB) for the example
#SBATCH --mem=1G
</code>
If your job is running a simulation that is multi-threaded (or parallel), you can use more than one CPU/core by indicating the number of cores you want with:
<code>
# number of cores needed by the application (8 in this example)
#SBATCH --cpus-per-task=8
# and the number of nodes (physical computers) your program is supposed to use (you need at least 1)
#SBATCH --nodes=1
</code>
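Reserving cores does not by itself make your program use them: the program has to be told how many it was given. Inside a job, SLURM exports the environment variable ''SLURM_CPUS_PER_TASK'' to match ''--cpus-per-task''. A minimal sketch (''my_program'' and its ''--threads'' flag are hypothetical placeholders for your own executable):

```shell
#!/bin/bash
# Inside a job, SLURM sets SLURM_CPUS_PER_TASK to the value of --cpus-per-task.
# Fall back to 1 when it is unset (e.g. when trying the script outside SLURM).
NCORES="${SLURM_CPUS_PER_TASK:-1}"
echo "running with $NCORES threads"
# hypothetical example: pass the count to your own multi-threaded program
# ./my_program --threads "$NCORES"
```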
After this //prolog//, you can add directives for instructing the system about the messages you want to receive:
<code>
# these lines instruct the cluster to send an email message when your job begins and when it ends
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
</code>
You can also tell SLURM where you want to put the output and error messages.\\
By default the cluster will put the output and error messages in 2 separate files (named after the job); you can override this:
<code>
# these lines instruct the cluster where to write the error and output streams
#SBATCH --error=$HOME/...
#SBATCH --output=$HOME/...
</code>
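A detail worth knowing: SLURM expands a few patterns in these file names, notably ''%j'' (the job id) and ''%x'' (the job name), which keeps successive runs from overwriting each other's logs. A sketch:

```shell
# one pair of files per run, e.g. myjob-1234.out / myjob-1234.err,
# written in the directory you submit from
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
```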
And then you might want to assign a name to your job, so you will know what the cluster is doing for you when you look at the list of running jobs (using the command ''squeue''):
<code>
# name of the job
#SBATCH --job-name=exit_coupled
</code>
Another mandatory parameter is the queue (called ''partition'' in SLURM terminology) you want to use; to start, always use the queue ''slurm-cluster'':
<code>
# queue to be used
#SBATCH --partition slurm-cluster
</code>

If you want to use a particular feature (GPU, TensorFlow, Mathematica, etc.), you have to specify it;
in this case, SLURM will launch your program only on the nodes that have the requested feature(s).
<code>
# require this feature: 1 GPU
#SBATCH --gres=gpu:1
</code>
Now you can start the shell script commands:
<code>
cd $HOME/.....

echo "..."
</code>
It's better to use the command srun to launch the executable command (just prefix srun to your normal command line), so SLURM can better manage the scheduling of the jobs.
The use of ''srun'' is not mandatory, though.
<code>
srun ./name of the program and parameters you want to launch

echo "..."
</code>
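Since the script body is ordinary shell, you can also record when the program started and whether it succeeded, which makes the log files much easier to read afterwards. A minimal sketch (here ''true'' stands in for ''srun ./your_program'' so the sketch runs anywhere):

```shell
#!/bin/bash
# record the start time, run the program, then record its exit status
echo "started: $(date)"
true   # replace with: srun ./your_program its-arguments
status=$?
echo "finished: $(date) with exit status $status"
```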
Once we attach all the lines from above, we'll have a script that will look like this:
<code>
#!/bin/bash
#SBATCH --job-name=dummy-test
#SBATCH --partition slurm-cluster
#SBATCH --mail-user=dummy.epfl@epfl.ch
#SBATCH --time=04:00:00
#SBATCH --mem=1024M
#SBATCH --cpus-per-task=8
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --gres=gpu:1

cd $HOME/.....

echo "..."
srun ./name of the program and parameters you want to launch
echo "..."
</code>
Now you just need to tell the cluster system that you want to run this job. But how do you do that? Pretty simple: you use the command ''sbatch'':
<code>
$ sbatch dummy.slurm
</code>
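After submitting, you don't have to wait blindly for the email: assuming the standard SLURM client tools are available on the login node, you can watch or cancel the job (''sbatch'' prints the job id when it accepts the script):

```shell
# show your queued and running jobs
squeue -u $USER
# cancel a job if needed (replace 1234 with the job id printed by sbatch)
scancel 1234
```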
After all this work, you just need to relax and wait until you receive the email messages from the queuing manager telling you about success or failure. At this point you return to the directory where the output files are saved and check the results.\\
If you browse the documentation we have on [[1slurm|Batch Queuing System]] you'll find examples on how to use Matlab or Mathematica and some explanation about the directives and the commands available for the queuing system.