# Monitoring and Managing Jobs

## SLURM Quick Reference
| Task | Command | Example |
|---|---|---|
| Submit batch job | `sbatch` | `sbatch job.sh` |
| Interactive job | `srun` | `srun --pty bash` |
| View your jobs | `squeue -u` | `squeue -u $USER` |
| Job details | `scontrol show job` | `scontrol show job <job-id>` |
| Cancel job | `scancel` | `scancel <job-id>` |
| View partitions | `sinfo` | `sinfo -p <partition-name>` |
| Job history | `sacct` | `sacct -j <job-id>` |
| SSH to node | `ssh` | `ssh <node-id>` |
| GPU monitoring | `nvidia-smi` | `nvidia-smi` |
| Real-time stats | `sstat` | `sstat -j <job-id>` |
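Most of the commands above operate on a job submitted with `sbatch`. A minimal batch script might look like the sketch below; the partition name, resource values, and `train.py` workload are placeholders to adapt to your cluster, not prescribed values.

```shell
#!/bin/bash
#SBATCH --job-name=example-job        # descriptive name shown in squeue
#SBATCH --partition=<partition-name>  # placeholder: pick a partition from sinfo
#SBATCH --time=01:00:00               # wall-clock limit (HH:MM:SS)
#SBATCH --ntasks=1                    # one task
#SBATCH --cpus-per-task=4             # CPU cores for that task
#SBATCH --mem=8G                      # memory for the job
#SBATCH --output=%x-%j.out            # log file named <job-name>-<job-id>.out

# Your workload goes here (placeholder)
srun python train.py
```

Submit it with `sbatch job.sh`; the job ID that `sbatch` prints is what you pass to `squeue`, `scontrol`, `scancel`, and `sacct`.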
## Monitoring Job Status

### squeue - View Job Queue
- View all jobs: `squeue`
- View only your jobs: `squeue -u $USER`
- View jobs in a specific partition: `squeue -p <partition-name>`
- Custom output format: `squeue -o "<format-string>"` (see `man squeue` for the format specifiers)

Understanding squeue Output

- `JOBID`: Unique job identifier
- `PARTITION`: Partition the job is running on
- `NAME`: Job name
- `USER`: Job owner
- `ST`: Job state (`PD` = Pending, `R` = Running, `CG` = Completing, `CD` = Completed)
- `TIME`: Runtime so far
- `NODES`: Number of nodes
- `NODELIST(REASON)`: Nodes allocated, or the reason the job is pending
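As a concrete illustration of those columns, a custom `squeue` format string can print exactly the fields described above; the widths chosen here are one reasonable layout, not the only one.

```shell
# Show your jobs with job ID, partition, name, user, state, runtime,
# node count, and node list / pending reason (see `man squeue` for %-codes)
squeue -u $USER -o "%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"
```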
### scontrol show job - Detailed Job Information

- View detailed information about a specific job: `scontrol show job <job-id>`
Example output includes:
- Job state and reason
- Resource allocation details
- Start and end times
- Working directory
- Output and error file paths
### sacct - Job Accounting Information

- View completed jobs: `sacct`
- View a specific job with detailed information: `sacct -j <job-id> --long`
- View jobs from the last 7 days: `sacct -S now-7days`

Useful sacct Formats

Common fields to include in `--format`:

- `JobID`: Job identifier
- `JobName`: Job name
- `State`: Final job state
- `Elapsed`: Wall clock time
- `CPUTime`: Total CPU time
- `MaxRSS`: Maximum memory used
- `ExitCode`: Job exit code
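Those fields plug directly into `--format`; for example (the job ID is a placeholder):

```shell
# Summarize a finished job using the fields listed above
sacct -j <job-id> --format=JobID,JobName,State,Elapsed,CPUTime,MaxRSS,ExitCode
```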
## Checking Available Resources

### sinfo - Partition and Node Information
- View all partitions: `sinfo`
- View a specific partition: `sinfo -p <partition-name>`
- Detailed node information: `sinfo -N -l`
- Custom format showing available resources: `sinfo -o "<format-string>"` (see `man sinfo` for the format specifiers)

Understanding sinfo Output

- `PARTITION`: Partition name
- `AVAIL`: Partition availability (up/down)
- `TIMELIMIT`: Maximum job runtime for the partition
- `NODES`: Total nodes in the partition
- `STATE`: Node states (`alloc` = allocated, `idle` = available, `down` = offline, `mix` = partially allocated)
- `NODELIST`: List of nodes
### scontrol show partition - Partition Details

- View detailed partition information: `scontrol show partition <partition-name>`
### scontrol show node - Node Details

- View specific node information: `scontrol show node <node-id>`
- View all nodes in a partition: `sinfo -N -p <partition-name>`
## Managing Jobs

### scancel - Cancel Jobs
- Cancel a specific job: `scancel <job-id>`
- Cancel all your jobs: `scancel -u $USER`
- Cancel all your pending jobs: `scancel -u $USER --state=PENDING`
- Cancel all your jobs in a specific partition: `scancel -u $USER -p <partition-name>`
Job Cancellation
Cancelled jobs cannot be recovered. Make sure you're cancelling the correct job(s) before confirming.
### scontrol hold/release - Hold or Release Jobs

- Hold a pending job (prevent it from running): `scontrol hold <job-id>`
- Release a held job: `scontrol release <job-id>`
### scontrol update - Modify Job Parameters

- Change job time limit: `scontrol update JobId=<job-id> TimeLimit=<new-time-limit>`
Update Limitations
You can only modify certain job parameters, and only while the job is pending or running. Some changes (such as extending a time limit) require administrator privileges.
## Accessing Compute Nodes

### SSH into Running Jobs
Once your job is running, you can SSH directly to the compute node:
1. Find your job's node: `squeue -u $USER` (see the `NODELIST` column)
2. SSH to the node: `ssh <node-id>`
SSH Access Rules
- You can only SSH to nodes where you have a running job
- Access is automatically granted when your job starts
- Access is revoked when your job ends
- Do not run intensive commands outside your job's allocation
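Putting those steps together, a typical spot-check on a running GPU job might look like the sketch below; the node name is whatever `squeue` reported for your job.

```shell
# 1. Find the node your job is running on (job ID, state, node list)
squeue -u $USER -o "%.10i %.2t %N"

# 2. Connect to it (only works while your job is running there)
ssh <node-id>

# 3. On the node: check your own processes and GPU activity
ps -u $USER -o pid,pcpu,pmem,comm
nvidia-smi
```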
Check your processes on the node: `ps -u $USER`
## Monitoring Resource Usage

### sstat - Real-time Job Statistics
- Monitor running job resources: `sstat -j <job-id>`
- Continuous monitoring: `watch -n 10 sstat -j <job-id>`
### Checking GPU Usage

- On the compute node (via SSH): `nvidia-smi`
- Watch GPU usage continuously: `watch -n 1 nvidia-smi`
GPU Monitoring Tool
- `nvidia-smi`: Shows GPU utilization, memory usage, and running processes
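For a compact, scriptable view, `nvidia-smi` also has a query mode; the example below uses standard query fields and refreshes every 5 seconds.

```shell
# Print per-GPU utilization and memory as CSV, looping every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
           --format=csv -l 5
```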
## Tips and Best Practices
Efficient Job Management
- Use descriptive job names: Makes it easier to identify jobs in the queue
- Set appropriate time limits: too short and your job is killed mid-run; too long and it waits longer in the queue
- Monitor test runs: Run short test jobs to estimate resource requirements
- Check queue before submitting: Avoid submitting to busy partitions if possible
- Clean up output files: Regularly remove old log files to save space
Resource Estimation
- Use `sacct` to review completed jobs and refine resource requests
- Request slightly more time than needed, but be reasonable
- Monitor memory usage with `sstat` and adjust `--mem` accordingly
- For GPU jobs, ensure your code actually utilizes the GPU
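One way to act on the first bullet is to compare what a finished job requested with what it actually used; `ReqMem`, `MaxRSS`, `Timelimit`, and `Elapsed` are standard `sacct` columns, and the job ID is a placeholder.

```shell
# Requested vs. peak memory, and requested vs. actual runtime, for one job
sacct -j <job-id> --format=JobID,ReqMem,MaxRSS,Timelimit,Elapsed,State
```

If `MaxRSS` is far below `ReqMem`, or `Elapsed` far below `Timelimit`, you can shrink the request on your next submission.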
Common Pitfalls
- Forgetting to specify partition/QoS: Jobs may go to wrong resources
- Not checking job output: Errors may go unnoticed
- Requesting too many resources: Increases queue wait time
- Running on login node: Always use compute nodes for intensive work
## Additional Resources
Need More Help?
- Slurm Official Documentation
- Slurm Command Summary
- Use `man <command>` for detailed command documentation (e.g., `man sbatch`)
- Contact cluster support (ssc-server-support@lists.uchicago.edu) for issues specific to your system