Basics of SLURM

SLURM (originally an acronym for Simple Linux Utility for Resource Management) is an open-source workload manager widely used on HPC clusters. It acts as both a job scheduler and a resource manager.

Think of SLURM as an air traffic controller for computational jobs. Just as air traffic control manages which planes can take off, land, and use specific runways, SLURM manages:

Resource allocation: Which jobs get access to which compute nodes
Job scheduling: When jobs run based on priority, resource availability, and fairness
Resource monitoring: Tracking CPU, memory, and GPU usage
Job lifecycle: Starting, monitoring, and terminating jobs
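
As a rough illustration, each of these responsibilities has a standard SLURM command-line counterpart. The sketch below wraps four real commands (sinfo, squeue, sacct, scancel) in small shell functions. The function names are hypothetical and chosen only for this example, and the sacct output assumes that job accounting is enabled on your cluster.

```shell
# Resource allocation: summarize partitions and the state of their nodes.
show_partitions() {
    sinfo --summarize
}

# Job scheduling: list your own pending and running jobs.
show_my_jobs() {
    squeue --user="$USER"
}

# Resource monitoring: report state, elapsed time, and peak memory for a job.
# $1 is the job ID.
show_job_usage() {
    sacct --jobs="$1" --format=JobID,JobName,State,Elapsed,MaxRSS
}

# Job lifecycle: cancel a pending or running job.
# $1 is the job ID.
cancel_job() {
    scancel "$1"
}
```

After sourcing these definitions in a login shell, show_my_jobs lists your queued and running jobs, and cancel_job 12345 would cancel the job with the (made-up) ID 12345.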

Job Submission Process

[Figure: Slurm workflow]

When you submit a job to SLURM:

Job submission: You specify resource requirements (CPUs, memory, GPUs, time limit)
Queue placement: SLURM places your job in the queue of the partition you requested (or the default partition) and assigns it a priority
Resource matching: SLURM scans for available resources matching your requirements
Allocation: When resources become available, SLURM allocates nodes to your job
Execution: Your job runs on the allocated compute nodes
Cleanup: Upon completion, resources are released for other jobs
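
The submission step above can be sketched as a batch script. This is a minimal example under stated assumptions: the partition name general, the job name, and the resource amounts are placeholders, and GPU requests via --gres depend on how your site is configured. The #SBATCH lines are read by sbatch as options, while the shell itself treats them as comments.

```shell
#!/bin/bash
# Job name shown in the queue (placeholder).
#SBATCH --job-name=example-job
# Partition to queue in; "general" is a hypothetical, site-specific name.
#SBATCH --partition=general
# One node, one task, four CPU cores for that task.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# Memory per node.
#SBATCH --mem=8G
# One GPU; remove this line if the job does not need a GPU.
#SBATCH --gres=gpu:1
# Wall-clock time limit (HH:MM:SS).
#SBATCH --time=01:00:00
# Output file: %x expands to the job name, %j to the job ID.
#SBATCH --output=%x-%j.out

# Print what SLURM allocated, using environment variables it sets for the job.
# Outside of a SLURM job the fallback values are printed instead.
describe_allocation() {
    echo "job_id=${SLURM_JOB_ID:-none} cpus_per_task=${SLURM_CPUS_PER_TASK:-unset} nodelist=${SLURM_JOB_NODELIST:-unset}"
}

describe_allocation

# Launch the actual workload here, for example (program name is hypothetical):
# srun ./my_program
```

Saved as, say, job.sh, the script would be submitted with sbatch job.sh, which prints the assigned job ID. Options given on the sbatch command line take precedence over the #SBATCH lines in the script.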

Info

See Running Jobs for more details on SLURM and job scheduling.