Common Workflows

SLURM Quick Reference

| Task | Command | Example |
|---|---|---|
| Submit batch job | sbatch | sbatch job.sh |
| Interactive job | srun | srun --pty bash |
| View your jobs | squeue -u | squeue -u $USER |
| Job details | scontrol show job | scontrol show job <job-id> |
| Cancel job | scancel | scancel <job-id> |
| View partitions | sinfo | sinfo -p <partition-name> |
| Job history | sacct | sacct -j <job-id> |
| SSH to node | ssh | ssh <node-id> |
| GPU monitoring | nvidia-smi | nvidia-smi |
| Real-time stats | sstat | sstat -j <job-id> |
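
The `job.sh` submitted in the first row is not shown on this page. As a rough sketch only, a minimal batch script might look like the following; the job name, time limit, and output file name are illustrative assumptions, and `<partition-name>` is a placeholder that must be replaced before submitting.

```shell
#!/bin/bash
# Hypothetical job.sh: every value below is a placeholder to adapt to your cluster.
# Name shown in squeue output.
#SBATCH --job-name=example
# Replace with a real partition name reported by sinfo.
#SBATCH --partition=<partition-name>
# Wall-clock limit in HH:MM:SS.
#SBATCH --time=00:10:00
# %j expands to the job ID.
#SBATCH --output=slurm-%j.out

# SLURM_JOB_ID is set by Slurm inside a job; fall back to "local" when the
# script is run directly with bash as a quick syntax check.
job_label="${SLURM_JOB_ID:-local}"
echo "Job ${job_label} running on $(uname -n)"
```

Because the `#SBATCH` lines are ordinary comments to the shell, the script can be run with plain `bash job.sh` to check the body before it is handed to `sbatch`.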

Workflow 1: Submit and Monitor a Job

# Submit job
sbatch my_job.sh
# Output: Submitted batch job 12345

# Check status
squeue -j 12345

# View detailed info
scontrol show job 12345

# Monitor resources (once running)
# If the script launches no srun steps, query the batch step: sstat -j 12345.batch
sstat -j 12345 --format=JobID,AveCPU,MaxRSS
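
The steps above copy the job ID by hand from the `sbatch` message. One possible way to script that, sketched here under assumptions, is to parse the ID and poll the queue. `extract_job_id` and `submit_and_watch` are hypothetical helper names, the 30-second default interval is arbitrary, and the final `sacct` call assumes job accounting is enabled on the cluster.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch's default message,
# e.g. "Submitted batch job 12345" -> "12345".
extract_job_id() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Hypothetical helper: submit a script, then poll squeue until the job
# no longer appears in the queue.
submit_and_watch() {
    local script="$1" interval="${2:-30}" message job_id
    message=$(sbatch "$script") || return 1
    job_id=$(extract_job_id "$message")
    [ -n "$job_id" ] || return 1
    echo "Tracking job ${job_id}"
    # -h suppresses the header, so empty output means the job left the queue.
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    # Report the final state from the accounting database.
    sacct -j "$job_id" --format=JobID,State,ExitCode
}

# Usage on a cluster: submit_and_watch my_job.sh 60
```

If parsing the message feels fragile, `sbatch --parsable` prints the job ID without the surrounding text, which may be simpler to capture.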

Workflow 2: Interactive GPU Session

# Request interactive GPU node
srun --partition=<partition-name> --qos=gpu --gres=gpu:1 --time=02:00:00 --pty bash

# Once allocated, check GPU
nvidia-smi

# Run your commands
python train_model.py

# Exit when done
exit
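
Before starting a long run inside the session, it can help to confirm that a GPU was actually allocated. The sketch below assumes the site configures Slurm to export `CUDA_VISIBLE_DEVICES` for `--gres=gpu` allocations, which is common but not guaranteed. `count_gpus` and `require_gpu` are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helper: count the entries in a comma-separated device list
# such as the value of CUDA_VISIBLE_DEVICES ("0,1" -> 2, empty -> 0).
count_gpus() {
    local devices="$1"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # Split on commas and print the number of fields.
    printf '%s\n' "$devices" | awk -F',' '{print NF}'
}

# Hypothetical helper: fail early if the session sees no GPU.
require_gpu() {
    local n
    n=$(count_gpus "${CUDA_VISIBLE_DEVICES:-}")
    if [ "$n" -lt 1 ]; then
        echo "No GPU visible in this session" >&2
        return 1
    fi
    echo "GPUs visible: ${n}"
}

# Usage inside the srun session: require_gpu && python train_model.py
```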

Workflow 3: Check Available Resources Before Submitting

# Check partition availability
sinfo -p <partition-name>

# Check partition configuration and limits
scontrol show partition <partition-name>

# View current queue load
squeue -p <partition-name>

# Submit job if resources look good
sbatch --partition=<partition-name> --qos=gpu my_gpu_job.sh
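
The "submit if resources look good" decision can be partly automated. As a sketch under assumptions, the helper below counts idle nodes from `sinfo` output and submits only when at least one is free. `sum_idle_nodes` and `submit_if_idle` are hypothetical names, and "at least one idle node" is an illustrative threshold, not a site policy.

```shell
#!/bin/bash
# Reads "<state> <node-count>" lines on stdin, the layout produced by
#   sinfo -h -p <partition-name> -o "%t %D"
# and prints the total number of nodes whose state is exactly "idle".
sum_idle_nodes() {
    awk '$1 == "idle" {total += $2} END {print total + 0}'
}

# Hypothetical wrapper: submit only when the partition has an idle node.
# Arguments after the partition name are passed through to sbatch.
submit_if_idle() {
    local partition="$1" idle
    shift
    idle=$(sinfo -h -p "$partition" -o "%t %D" | sum_idle_nodes)
    if [ "$idle" -gt 0 ]; then
        sbatch --partition="$partition" "$@"
    else
        echo "No idle nodes in ${partition}; the job would wait in the queue" >&2
        return 1
    fi
}

# Usage: submit_if_idle <partition-name> --qos=gpu my_gpu_job.sh
```

Nodes in the `mix` state may still have free cores or GPUs, so this check is conservative: a job might start promptly even when the helper reports no idle nodes.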

Workflow 4: Troubleshooting a Stuck Job

# Check job status and reason
squeue -j <job-id>

# Get detailed job information
scontrol show job <job-id>

# Check job history
sacct -j <job-id> --format=JobID,State,Reason,ExitCode

# If needed, cancel and resubmit
scancel <job-id>
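
The reason reported by `squeue` is often enough to tell whether a job is stuck or simply waiting. The sketch below maps a few common Slurm reason codes to plain-language hints. `explain_reason` and `diagnose_job` are hypothetical helpers, and the list covers only a handful of codes, with hint wording that is illustrative rather than official.

```shell
#!/bin/bash
# Hypothetical helper: translate a squeue reason code into a short hint.
explain_reason() {
    case "$1" in
        Resources) echo "Waiting for requested resources to become free" ;;
        Priority) echo "Higher-priority jobs are ahead in the queue" ;;
        Dependency) echo "Waiting on a job dependency to complete" ;;
        BeginTime) echo "Earliest start time has not been reached yet" ;;
        JobHeldUser) echo "Held by the user; release with scontrol release <job-id>" ;;
        JobHeldAdmin) echo "Held by an administrator; contact support" ;;
        PartitionTimeLimit) echo "Requested time exceeds the partition limit" ;;
        ReqNodeNotAvail*) echo "A requested node is down, drained, or reserved" ;;
        QOS*|Assoc*) echo "A QOS or association limit is blocking the job" ;;
        *) echo "Unrecognized reason: $1" ;;
    esac
}

# Hypothetical helper: print the state and reason of a job, plus a hint
# when the job is pending.
diagnose_job() {
    local job_id="$1" line state reason
    # -h drops the header; %T prints the job state and %r the reason.
    line=$(squeue -h -j "$job_id" -o "%T %r" 2>/dev/null)
    if [ -z "$line" ]; then
        echo "Job ${job_id} is not in the queue; check sacct -j ${job_id}"
        return 0
    fi
    state=${line%% *}
    reason=${line#* }
    echo "${job_id}: ${state} (${reason})"
    if [ "$state" = "PENDING" ]; then
        explain_reason "$reason"
    fi
}

# Usage on a cluster: diagnose_job <job-id>
```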