Running Jobs

Learn how to submit, monitor, and manage jobs on REPACSS using SLURM.

Job Types

Interactive vs Batch Jobs

Interactive Jobs

interactive -c 8 -p h100 - Start an interactive session with 8 CPU cores on the h100 partition
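If the interactive wrapper is not available in your environment, a plain SLURM equivalent can be used. This is a sketch; adjust the partition and core count to your needs:

```shell
# Request an interactive shell with 8 CPU cores on the h100 partition.
# srun's --pty attaches a pseudo-terminal so you get an interactive bash prompt
# on the allocated node; the allocation is released when you exit the shell.
srun --partition=h100 --cpus-per-task=8 --pty bash
```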

Batch Jobs

sbatch job.sh - Submit to the default partition
sbatch -p zen4 job.sh - Submit to the zen4 (CPU) partition
sbatch -p h100 job.sh - Submit to the h100 (GPU) partition

Job Scripts

Script Templates

Basic Template

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load required modules (replace the placeholder with what your program needs)
module load <module_name>

# Run program
./my_program

Python Template

#!/bin/bash
#SBATCH --job-name=python_job
#SBATCH --output=python_job.out
#SBATCH --error=python_job.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Load required modules
module load gcc

# Activate conda environment
source ~/miniforge3/etc/profile.d/conda.sh
conda activate myenv

# Run Python script
python script.py
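Inside the job, SLURM exports environment variables that a script can read to size its work to the allocation. A minimal sketch (the variable names are standard SLURM; the fallback values are illustrative for runs outside a job):

```python
import os

# SLURM_CPUS_PER_TASK matches the --cpus-per-task request;
# fall back to 1 when running outside a SLURM job.
cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

# SLURM_JOB_ID identifies the current job; "none" outside SLURM.
job_id = os.environ.get("SLURM_JOB_ID", "none")

print(f"Job {job_id} running with {cpus} CPU(s)")
```

Sizing thread or process counts from these variables keeps the script in step with the #SBATCH header, so changing the resource request does not require editing the Python code.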

GPU Template

#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --output=gpu_test.out
#SBATCH --error=gpu_test.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --partition=h100

# Load modules
module load cuda

# Run program
./gpu_program
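It can help to confirm the GPU allocation at the top of the job script before launching the main program. nvidia-smi ships with the NVIDIA driver, and SLURM sets CUDA_VISIBLE_DEVICES when a --gres=gpu request is honored:

```shell
# Print the GPU indices SLURM assigned to this job
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"

# List the visible devices; this fails fast if no GPU is actually available
nvidia-smi
```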

Job Management

Submission

sbatch job.sh - Submit a batch job
sbatch --array=1-10 job.sh - Submit an array job with tasks 1 through 10
sbatch --dependency=afterok:12345 job.sh - Start only after job 12345 finishes successfully
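An array job runs the same script once per index, with SLURM_ARRAY_TASK_ID distinguishing the tasks. A sketch (the input file naming is illustrative; %A and %a in the output pattern expand to the array job ID and task index):

```shell
#!/bin/bash
#SBATCH --job-name=array_test
#SBATCH --output=array_%A_%a.out
#SBATCH --array=1-10
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

# Each task processes a different input file based on its array index
echo "Processing task ${SLURM_ARRAY_TASK_ID}"
./my_program "input_${SLURM_ARRAY_TASK_ID}.dat"
```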

Monitoring

squeue -u $USER - List your jobs
squeue -p zen4 - List jobs in the zen4 partition
squeue -p h100 - List jobs in the h100 partition

Control

scancel 12345 - Cancel a specific job by ID
scancel -u $USER - Cancel all of your jobs
scancel -p zen4 - Cancel your jobs in the zen4 partition

Resource Requests

Resource Types

  • CPU Jobs: Use --nodes, --ntasks, --cpus-per-task, and --mem
  • GPU Jobs: Add --gres=gpu:1 (or gpu:2, gpu:4 for multiple GPUs)
  • Python Jobs: See Python Environment Setup for specific configurations
  • For parallel Python processing, increase --cpus-per-task to match your worker count
  • Adjust --mem based on your data's memory footprint