SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager designed to schedule and manage jobs on large clusters. In the world of LLMs, SLURM has seen a resurgence in popularity, driven by the demand for training large models across many GPU nodes.

This guide will introduce the fundamental concepts of SLURM, common commands and script structures, and show advanced scenarios like distributed multi-node training. I’ll also share some useful tips and tricks.


Fundamental Concepts

What is SLURM?

SLURM is a job scheduler that allocates resources (like CPUs, GPUs, and memory) to users and manages the execution of jobs on a cluster. It handles the complexities of queuing, prioritization, resource allocation, and job monitoring, allowing users to focus on their computational tasks without worrying about the underlying infrastructure.

Basic Terminology

  • Job: A unit of work submitted to the scheduler.
  • Node: A single computational unit within the cluster, often equivalent to a physical machine.
  • Partition: A subset of nodes within the cluster, configured for specific purposes or resource limits.
  • Task: An individual process within a job.
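
To see how these concepts map onto your cluster, you can inspect partitions and nodes directly. A quick sketch (the node name is a placeholder for whatever your site uses):

# List partitions, their time limits, and the nodes they contain
sinfo

# Show a single node in detail: CPUs, memory, GPUs (Gres), and state
scontrol show node <node_name>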

Key SLURM Commands

  • sbatch: Submits a job script for batch execution.
  • srun: Used within a script or interactively to launch parallel tasks.
  • squeue: Displays information about jobs in the queue.
  • scancel: Cancels a pending or running job.
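
In practice, these look something like the following (the script name and job ID are illustrative):

sbatch train_job.sh        # submit a batch script; SLURM prints the assigned job ID
squeue -u $USER            # list your pending and running jobs
scancel 123456             # cancel job 123456
srun --ntasks=2 hostname   # launch a small parallel command on allocated resources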

Common SLURM Commands and Script Structure

When working with SLURM, you typically create a job script that specifies resource requirements and execution instructions. Here’s a basic example:

#!/bin/bash

#SBATCH --job-name=my_ml_job
#SBATCH --output=output_%j.txt
#SBATCH --error=error_%j.txt
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1

# Activate environments here
source .venv/bin/activate

# Run your application
srun python train.py

Explanation of Script Directives

  • #!/bin/bash: Indicates that the script should be run in the Bash shell.
  • #SBATCH: SLURM directives that set job parameters.
    • --job-name: Assigns a name to the job.
    • --output and --error: Specify files for the standard output and error streams; %j is replaced with the job ID.
    • --ntasks: Number of tasks (processes) to launch; CPU cores per task are controlled separately with --cpus-per-task.
    • --time: Maximum wall-clock time; the job is terminated once this limit is reached.
    • --gres=gpu:1: Requests one GPU.
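
A few other directives often appear alongside these in ML job scripts. This is a sketch; the partition name in particular is site-specific, so check sinfo for the names on your cluster:

#SBATCH --partition=gpu        # partition (queue) to submit to
#SBATCH --cpus-per-task=8      # CPU cores per task, e.g. for data-loading workers
#SBATCH --mem=64G              # memory per node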

Difference Between sbatch and srun

  • sbatch: Used to submit a job script to the scheduler for batch execution. The job will wait in the queue until resources are available.
  • srun: Used to launch tasks, either within a job script or interactively. Inside a script submitted with sbatch, it launches job steps on the resources already allocated to the job; run on its own, it requests an allocation and executes the command directly, as shown below.
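
For example, a quick way to get an interactive shell on a compute node for debugging (the resource values here are just an example):

srun --gres=gpu:1 --cpus-per-task=4 --time=00:30:00 --pty bash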

Advanced Scenario: Distributed Multi-node Training

For large-scale LLM training/fine-tuning, leveraging multiple nodes can significantly reduce training time. Below is an example of how to set up a distributed multi-node training job using SLURM.

sbatch.sh

#!/bin/bash

#SBATCH --job-name=llm-training
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH -D .
#SBATCH --output=output_%j.txt
#SBATCH --error=error_%j.txt
#SBATCH --gres=gpu:8

cd $SLURM_SUBMIT_DIR
srun bash srun_script.sh

srun_script.sh

#!/bin/bash

cd $SLURM_SUBMIT_DIR
GPUS_PER_NODE=8
# The first node in the allocation serves as the main (rank 0) host for rendezvous
HOST_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MAIN_PROCESS_PORT=12345
# Launch the training script with the Hugging Face Accelerate launcher (one launcher per node)
accelerate launch \
  --num_machines $SLURM_NNODES \
  --machine_rank $SLURM_NODEID \
  --main_process_ip $HOST_ADDR \
  --main_process_port $MAIN_PROCESS_PORT \
  --num_processes $(($SLURM_NNODES * $GPUS_PER_NODE)) \
  train.py

Explanation of the Multi-node Setup

sbatch.sh

  • --job-name=llm-training: Job name.
  • --nodes=2: Requests two nodes for the job.
  • --ntasks-per-node=1: Runs one task per node, so srun_script.sh is launched once on each node.
  • -D .: Sets the working directory to the current directory.
  • --output and --error: Specify output and error files; %j embeds the job ID in the filename.
  • --gres=gpu:8: Requests eight GPUs per node.
  • cd $SLURM_SUBMIT_DIR: Changes to the directory where the job was submitted.
  • srun bash srun_script.sh: Uses srun to execute srun_script.sh once on each allocated node.

srun_script.sh

  • GPUS_PER_NODE=8: Sets a shell variable with the number of GPUs available on each node.
  • HOST_ADDR: Resolves the hostname of the first node in the job’s node list, which acts as the address of the main (rank 0) process.
  • MAIN_PROCESS_PORT=12345: Sets the port used for inter-process communication and rendezvous.
  • accelerate launch: Runs the train.py script with the Hugging Face Accelerate launcher; each node starts one launcher, which spawns one worker process per GPU.
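
Two quick sanity checks, run from inside an allocation (for example, added temporarily to sbatch.sh before the final srun line); the compact node-list format in the comment is just an example:

# Expand the compact node list (e.g. node[01-02]) into one hostname per line;
# the first line is what HOST_ADDR picks up
scontrol show hostnames $SLURM_JOB_NODELIST

# Each node prints the machine rank and hostname it will report to Accelerate
srun bash -c 'echo "rank=$SLURM_NODEID host=$(hostname)"'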

Running the Job

To submit the job, run:

sbatch sbatch.sh

SLURM will queue the job and execute it when resources are available. The srun command within sbatch.sh runs srun_script.sh once on each allocated node. To monitor the job’s output, run:

tail -f output_<job_id>.txt
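
You can also check the job’s state while it is queued or running, and its final status after it finishes (sacct requires job accounting to be enabled on your cluster):

squeue -j <job_id>                                        # state while pending or running
sacct -j <job_id> --format=JobID,JobName,State,Elapsed    # status and runtime after completion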

Useful Tips and Tricks

  • Check Job Status: Use squeue -u $USER to list your jobs.
  • Job Details: Use scontrol show job <job_id> for detailed information.
  • Verbose Output: Include set -x at the top of your scripts for verbose execution logging.
  • Environment Variables: Use env | grep SLURM to list all SLURM-related environment variables.
  • Error Files: Check the error files specified in your scripts for clues.
  • Optimize Resource Requests: Only request the resources you need to reduce queue time.
  • Time Limits: Set realistic --time limits to prevent jobs from being killed prematurely.
  • Modular Scripts: Separate job submission and execution logic (as shown in the example) for better maintainability.
  • Version Control: Keep your scripts under version control to track changes.
  • Environment Setup: Use virtual environments or containers to manage dependencies consistently.
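
For the last point, a minimal virtual-environment workflow looks like this (paths and package names are illustrative): create the environment once on a login node, then activate it inside the job script as in the earlier examples.

# One-time setup on a login node
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt   # or: pip install torch accelerate ...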