Skip to content

Monitoring and Managing Jobs#

SLURM Quick Reference#

Task Command Example
Submit batch job sbatch sbatch job.sh
Interactive job srun srun --pty bash
View your jobs squeue -u squeue -u $USER
Job details scontrol show job scontrol show job <job-id>
Cancel job scancel scancel <job-id>
View partitions sinfo sinfo -p <partition-name>
Job history sacct sacct -j <job-id>
SSH to node ssh ssh <node-id>
GPU monitoring nvidia-smi nvidia-smi
Real-time stats sstat sstat -j <job-id>

Monitoring Job Status#

squeue - View Job Queue#

  1. View all jobs:

    squeue
    
  2. View only your jobs:

    squeue -u $USER
    
  3. View jobs in a specific partition:

    squeue -p <partition-name>
    
  4. Custom output format:

    squeue -u $USER -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
    
Understanding squeue Output
  • JOBID: Unique job identifier
  • PARTITION: Partition the job is running on
  • NAME: Job name
  • USER: Job owner
  • ST: Job state (PD = Pending, R = Running, CG = Completing, CD = Completed)
  • TIME: Runtime so far
  • NODES: Number of nodes
  • NODELIST(REASON): Nodes allocated or reason for pending

scontrol show job - Detailed Job Information#

  1. View detailed information about a specific job:

    scontrol show job <jobid>
    

    Example output includes:

    • Job state and reason
    • Resource allocation details
    • Start and end times
    • Working directory
    • Output and error file paths

sacct - Job Accounting Information#

  1. View completed jobs:

    sacct
    
  2. View specific job with detailed information:

    sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,CPUTime,MaxRSS,ExitCode
    
  3. View jobs from the last 7 days:

    sacct -S $(date -d '7 days ago' +%Y-%m-%d)
    
Useful sacct Formats

Common fields to include in --format:

  • JobID: Job identifier
  • JobName: Job name
  • State: Final job state
  • Elapsed: Wall clock time
  • CPUTime: Total CPU time
  • MaxRSS: Maximum memory used
  • ExitCode: Job exit code

Checking Available Resources#

sinfo - Partition and Node Information#

  1. View all partitions:

    sinfo
    
  2. View specific partition:

    sinfo -p <partition-name>
    
  3. Detailed node information:

    sinfo -N -l
    
  4. Custom format showing available resources:

    sinfo -o "%20P %5D %14F %10m %11l %N"
    
Understanding sinfo Output
  • PARTITION: Partition name
  • AVAIL: Partition availability (up/down)
  • TIMELIMIT: Maximum job runtime for partition
  • NODES: Total nodes in partition
  • STATE: Node states (alloc = allocated, idle = available, down = offline, mix = partially allocated)
  • NODELIST: List of nodes

scontrol show partition - Partition Details#

  1. View detailed partition information:

    scontrol show partition <partition-name>
    

scontrol show node - Node Details#

  1. View specific node information:

    scontrol show node <nodename>
    
  2. View all nodes in a partition:

    scontrol show node | grep -A 20 <partition-name>
    

Managing Jobs#

scancel - Cancel Jobs#

  1. Cancel a specific job:

    scancel <jobid>
    
  2. Cancel all your jobs:

    scancel -u $USER
    
  3. Cancel all your pending jobs:

    scancel -u $USER -t PENDING
    
  4. Cancel all jobs in a specific partition:

    scancel -u $USER -p <partition-name>
    

Job Cancellation

Cancelled jobs cannot be recovered. Make sure you're cancelling the correct job(s) before confirming.

scontrol hold/release - Hold or Release Jobs#

  1. Hold a pending job (prevent it from running):

    scontrol hold <jobid>
    
  2. Release a held job:

    scontrol release <jobid>
    

scontrol update - Modify Job Parameters#

  1. Change job time limit:

    scontrol update jobid=<jobid> TimeLimit=10:00:00
    

Update Limitations

You can only modify certain job parameters, and only for jobs that are pending or already started. Some changes require administrator privileges.


Accessing Compute Nodes#

SSH into Running Jobs#

Once your job is running, you can SSH directly to the compute node:

1. Find your job's node:

squeue -u $USER -o "%.18i %.9P %.20j %.10u %.2t %.10M %.6D %R"

2. SSH to the node:

ssh <nodename>
SSH Access Rules
  • You can only SSH to nodes where you have a running job
  • Access is automatically granted when your job starts
  • Access is revoked when your job ends
  • Do not run intensive commands outside your job's allocation

Check your processes on the node:

1
2
3
top -u $USER
htop -u $USER
ps aux | grep $USER

Monitoring Resource Usage#

sstat - Real-time Job Statistics#

  1. Monitor running job resources:

    sstat -j <jobid> --format=JobID,AveCPU,MaxRSS,NTasks
    
  2. Continuous monitoring:

    watch -n 5 "sstat -j <jobid> --format=JobID,AveCPU,MaxRSS,MaxVMSize,NTasks"
    

Checking GPU Usage#

  1. On the compute node (via SSH):

    nvidia-smi
    
  2. Watch GPU usage continuously:

    watch -n 1 nvidia-smi
    
GPU Monitoring Tool

nvidia-smi: Shows GPU utilization, memory usage, and running processes


Tips and Best Practices#

Efficient Job Management

  1. Use descriptive job names: Makes it easier to identify jobs in the queue
  2. Set appropriate time limits: Too short = job killed, too long = longer wait in queue
  3. Monitor test runs: Run short test jobs to estimate resource requirements
  4. Check queue before submitting: Avoid submitting to busy partitions if possible
  5. Clean up output files: Regularly remove old log files to save space

Resource Estimation

  • Use sacct to review completed jobs and refine resource requests
  • Request slightly more time than needed, but be reasonable
  • Monitor memory usage with sstat and adjust --mem accordingly
  • For GPU jobs, ensure your code actually utilizes the GPU

Common Pitfalls

  • Forgetting to specify partition/QoS: Jobs may go to wrong resources
  • Not checking job output: Errors may go unnoticed
  • Requesting too many resources: Increases queue wait time
  • Running on login node: Always use compute nodes for intensive work

Additional Resources#

Need More Help?