# Monitoring and Managing Jobs

## SLURM Quick Reference
| Task | Command | Example |
|---|---|---|
| Submit batch job | `sbatch` | `sbatch job.sh` |
| Interactive job | `srun` | `srun --pty bash` |
| View your jobs | `squeue -u` | `squeue -u $USER` |
| Job details | `scontrol show job` | `scontrol show job <job-id>` |
| Cancel job | `scancel` | `scancel <job-id>` |
| View partitions | `sinfo` | `sinfo -p <partition-name>` |
| Job history | `sacct` | `sacct -j <job-id>` |
| SSH to node | `ssh` | `ssh <node-id>` |
| GPU monitoring | `nvidia-smi` | `nvidia-smi` |
| Real-time stats | `sstat` | `sstat -j <job-id>` |
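Most of the commands above operate on a job submitted with `sbatch`. A minimal batch script might look like the sketch below; the partition name, resource values, and `train.py` workload are placeholders to adapt to your cluster, not prescribed values.

```shell
#!/bin/bash
#SBATCH --job-name=example-job        # descriptive name shown in squeue
#SBATCH --partition=<partition-name>  # placeholder: pick a partition from sinfo
#SBATCH --time=01:00:00               # wall-clock limit (HH:MM:SS)
#SBATCH --ntasks=1                    # one task
#SBATCH --cpus-per-task=4             # CPU cores for that task
#SBATCH --mem=8G                      # memory for the job
#SBATCH --output=%x-%j.out            # log file named <job-name>-<job-id>.out

# Your workload goes here (placeholder)
srun python train.py
```

Submit it with `sbatch job.sh`; the job ID that `sbatch` prints is what you pass to `squeue`, `scontrol`, `scancel`, and `sacct`.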
## Monitoring Job Status

### squeue - View Job Queue
- View all jobs: `squeue`
- View only your jobs: `squeue -u $USER`
- View jobs in a specific partition: `squeue -p <partition-name>`
- Custom output format: `squeue -o "<format-string>"` (see `man squeue` for the format specifiers)

Understanding squeue Output

- `JOBID`: Unique job identifier
- `PARTITION`: Partition the job is running on
- `NAME`: Job name
- `USER`: Job owner
- `ST`: Job state (`PD` = Pending, `R` = Running, `CG` = Completing, `CD` = Completed)
- `TIME`: Runtime so far
- `NODES`: Number of nodes
- `NODELIST(REASON)`: Nodes allocated, or the reason the job is pending
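As a concrete illustration of those columns, a custom `squeue` format string can print exactly the fields described above; the widths chosen here are one reasonable layout, not the only one.

```shell
# Show your jobs with job ID, partition, name, user, state, runtime,
# node count, and node list / pending reason (see `man squeue` for %-codes)
squeue -u $USER -o "%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"
```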
### scontrol show job - Detailed Job Information

- View detailed information about a specific job: `scontrol show job <job-id>`
Example output includes:
- Job state and reason
- Resource allocation details
- Start and end times
- Working directory
- Output and error file paths
### sacct - Job Accounting Information

- View completed jobs: `sacct`
- View a specific job with detailed information: `sacct -j <job-id> --long`
- View jobs from the last 7 days: `sacct -S now-7days`

Useful sacct Formats

Common fields to include in `--format`:

- `JobID`: Job identifier
- `JobName`: Job name
- `State`: Final job state
- `Elapsed`: Wall clock time
- `CPUTime`: Total CPU time
- `MaxRSS`: Maximum memory used
- `ExitCode`: Job exit code
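Those fields plug directly into `--format`; for example (the job ID is a placeholder):

```shell
# Summarize a finished job using the fields listed above
sacct -j <job-id> --format=JobID,JobName,State,Elapsed,CPUTime,MaxRSS,ExitCode
```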
## Checking Available Resources

### sinfo - Partition and Node Information
- View all partitions: `sinfo`
- View a specific partition: `sinfo -p <partition-name>`
- Detailed node information: `sinfo -N -l`
- Custom format showing available resources: `sinfo -o "<format-string>"` (see `man sinfo` for the format specifiers)

Understanding sinfo Output

- `PARTITION`: Partition name
- `AVAIL`: Partition availability (up/down)
- `TIMELIMIT`: Maximum job runtime for the partition
- `NODES`: Total nodes in the partition
- `STATE`: Node states (`alloc` = allocated, `idle` = available, `down` = offline, `mix` = partially allocated)
- `NODELIST`: List of nodes
### scontrol show partition - Partition Details

- View detailed partition information: `scontrol show partition <partition-name>`
### scontrol show node - Node Details

- View specific node information: `scontrol show node <node-id>`
- View all nodes in a partition: `sinfo -N -p <partition-name>`
## Managing Jobs

### scancel - Cancel Jobs
- Cancel a specific job: `scancel <job-id>`
- Cancel all your jobs: `scancel -u $USER`
- Cancel all your pending jobs: `scancel -u $USER --state=PENDING`
- Cancel all your jobs in a specific partition: `scancel -u $USER -p <partition-name>`
Job Cancellation
Cancelled jobs cannot be recovered. Make sure you're cancelling the correct job(s) before confirming.
### scontrol hold/release - Hold or Release Jobs

- Hold a pending job (prevent it from running): `scontrol hold <job-id>`
- Release a held job: `scontrol release <job-id>`
### scontrol update - Modify Job Parameters

- Change job time limit: `scontrol update JobId=<job-id> TimeLimit=<new-time-limit>`
Update Limitations
You can only modify certain job parameters, and only while the job is pending or running. Some changes (such as extending a time limit) require administrator privileges.
## Accessing Compute Nodes

### SSH into Running Jobs
Once your job is running, you can SSH directly to the compute node:
1. Find your job's node: `squeue -u $USER` (see the `NODELIST` column)
2. SSH to the node: `ssh <node-id>`
SSH Access Rules
- You can only SSH to nodes where you have a running job
- Access is automatically granted when your job starts
- Access is revoked when your job ends
- Do not run intensive commands outside your job's allocation
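Putting those steps together, a typical spot-check on a running GPU job might look like the sketch below; the node name is whatever `squeue` reported for your job.

```shell
# 1. Find the node your job is running on (job ID, state, node list)
squeue -u $USER -o "%.10i %.2t %N"

# 2. Connect to it (only works while your job is running there)
ssh <node-id>

# 3. On the node: check your own processes and GPU activity
ps -u $USER -o pid,pcpu,pmem,comm
nvidia-smi
```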
Check your processes on the node: `ps -u $USER`
## Monitoring Resource Usage

### sstat - Real-time Job Statistics
- Monitor running job resources: `sstat -j <job-id>`
- Continuous monitoring: `watch -n 10 sstat -j <job-id>`
### Checking GPU Usage

- On the compute node (via SSH): `nvidia-smi`
- Watch GPU usage continuously: `watch -n 1 nvidia-smi`
GPU Monitoring Tool
- `nvidia-smi`: Shows GPU utilization, memory usage, and running processes
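For a compact, scriptable view, `nvidia-smi` also has a query mode; the example below uses standard query fields and refreshes every 5 seconds.

```shell
# Print per-GPU utilization and memory as CSV, looping every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
           --format=csv -l 5
```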
## Tips and Best Practices
Efficient Job Management
- Use descriptive job names: Makes it easier to identify jobs in the queue
- Set appropriate time limits: too short and your job is killed mid-run; too long and it waits longer in the queue
- Monitor test runs: Run short test jobs to estimate resource requirements
- Check queue before submitting: Avoid submitting to busy partitions if possible
- Clean up output files: Regularly remove old log files to save space
Resource Estimation
- Use `sacct` to review completed jobs and refine resource requests
- Request slightly more time than needed, but be reasonable
- Monitor memory usage with `sstat` and adjust `--mem` accordingly
- For GPU jobs, ensure your code actually utilizes the GPU
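One way to act on the first bullet is to compare what a finished job requested with what it actually used; `ReqMem`, `MaxRSS`, `Timelimit`, and `Elapsed` are standard `sacct` columns, and the job ID is a placeholder.

```shell
# Requested vs. peak memory, and requested vs. actual runtime, for one job
sacct -j <job-id> --format=JobID,ReqMem,MaxRSS,Timelimit,Elapsed,State
```

If `MaxRSS` is far below `ReqMem`, or `Elapsed` far below `Timelimit`, you can shrink the request on your next submission.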
Common Pitfalls
- Forgetting to specify partition/QoS: Jobs may go to wrong resources
- Not checking job output: Errors may go unnoticed
- Requesting too many resources: Increases queue wait time
- Running on login node: Always use compute nodes for intensive work
## Additional Resources
Need More Help?
- Slurm Official Documentation
- Slurm Command Summary
- Use `man <command>` for detailed command documentation (e.g., `man sbatch`)
- Contact cluster support (ssc-server-support@lists.uchicago.edu) for issues specific to your system