====== Using SLURM ======

===== What is SLURM? =====

SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

{{:images:arch.gif|}}

As depicted in Figure 1, SLURM consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with an optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger, and sview. All of these commands can run anywhere in the cluster.

===== Viewing Status of Runs =====

  $ sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  debug*       up   infinite      2   idle carnot,diesel

sinfo shows the current status of the cluster: each partition's availability, the time limit being enforced, the number of nodes, and which nodes are available. The STATE of a node can also be allocated, completing, down, draining, or unknown. If a node's state is marked with an '*', the node is not responding to the controller.

  $ squeue
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
     37     debug myprogram   james  R  0:04      1 carnot

squeue shows jobs currently in the queue. The ST (state) column can be one of:

  * R - Running
  * S - Suspended
  * PD - Pending
  * F - Failed
  * CG - Completing
  * TO - Timeout
  * CA - Canceled

  $ sacct
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
  ------------ ---------- ---------- ---------- ---------- ---------- --------
  37                sleep      debug                     8  COMPLETED      0:0

sacct shows jobs that you have run in the past, how many resources they used, and their final state.
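Because squeue prints plain whitespace-separated columns, its output is easy to post-process with standard text tools. A minimal sketch, using captured sample output rather than a live cluster (the job rows here are illustrative assumptions):

```shell
# Sample squeue output, hard-coded so the sketch runs without a cluster.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
37 debug myprogram james R 0:04 1 carnot
38 debug othertask james PD 0:00 1 (Resources)'

# Print the JOBIDs of running jobs: skip the header line (NR > 1)
# and keep rows whose 5th column (ST) is "R".
printf '%s\n' "$sample" | awk 'NR > 1 && $5 == "R" {print $1}'
# prints: 37
```

On a real cluster the same filter would be applied directly to the live output, e.g. ''squeue | awk 'NR > 1 && $5 == "R" {print $1}'''.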
===== Starting a Job with srun =====

  $ srun -n4 --time=0:30 -o myjob.out ./myprogram

This runs ./myprogram with 4 tasks, a 30-second time limit, and output written to myjob.out.

===== Starting a Job with sbatch =====

Setting up all of your options every time you use srun can be repetitive. Put them in a batch file and submit that instead!

Example file 'mybatch':

  #!/bin/sh
  #SBATCH -n 4
  #SBATCH --time=0:60
  #SBATCH --output=mybatch.out
  srun ./my_program

  $ sbatch mybatch
  Submitted batch job 47

===== Stopping or Canceling Jobs =====

  $ squeue
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
     37     debug myprogram   james  R  0:04      1 carnot

  $ scancel 37

Cancel a job with scancel followed by the JOBID of the job you want to cancel. You can look up the JOBID with squeue.

===== More information =====

[[https://computing.llnl.gov/linux/slurm/quickstart.html|Slurm Quickstart]]
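A batch file like 'mybatch' above is an ordinary text file, so it can also be generated by a script before submission. A minimal sketch (the file name mybatch.generated is illustrative; actually submitting it with sbatch requires a running cluster):

```shell
# Write a batch file equivalent to the 'mybatch' example from a script.
cat > mybatch.generated <<'EOF'
#!/bin/sh
#SBATCH -n 4
#SBATCH --time=0:60
#SBATCH --output=mybatch.out
srun ./my_program
EOF

# Sanity-check that the #SBATCH directives made it into the file
# before handing it to sbatch (e.g. sbatch mybatch.generated).
grep -c '^#SBATCH' mybatch.generated
# prints: 3
```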