SLURM Tutorial


Overview of SLURM Workload Manager

What is SLURM?

SLURM, which stands for Simple Linux Utility for Resource Management, is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of all sizes. It's designed to be easy to install and maintain, requiring no kernel modifications and being relatively self-contained. SLURM is widely used in supercomputing and high-performance computing (HPC) environments to efficiently distribute work among the available compute nodes.

Key Functions of SLURM

SLURM serves three primary functions in a computing cluster:

  1. Resource Allocation: It provides exclusive and/or non-exclusive access to resources (compute nodes) to users for a specified duration, enabling them to perform computational tasks.
  2. Job Scheduling and Execution: SLURM offers a framework for starting, executing, and monitoring work, typically parallel jobs, on the allocated set of nodes.
  3. Queue Management: It manages a queue of pending jobs, arbitrating contention for resources and ensuring that the workload is processed efficiently.

Architecture

SLURM's architecture is centered around a few key components:

  • slurmctld: This is the central management daemon that monitors resources and work. In case of failure, a backup manager can take over its responsibilities.
  • slurmd: Each compute node runs a slurmd daemon, which waits for work, executes it, returns status, and then waits for more work. Together, the slurmd daemons provide fault-tolerant, hierarchical communications (a quick health-check sketch follows this list).
  • slurmdbd: The Slurm DataBase Daemon is optional and can be used to record accounting information for multiple Slurm-managed clusters in a single database.
  • slurmrestd: Another optional component, the Slurm REST API Daemon, allows interaction with Slurm through a REST API.
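
With these daemons in place, a quick health check can confirm they are responding. The commands below assume a systemd-based installation with the standard service names:

scontrol ping                # reports whether the primary (and backup) slurmctld responds
systemctl status slurmd      # on a compute node, check the slurmd service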

User Tools and Commands

SLURM provides a variety of user tools and commands, including:

  • srun: To initiate jobs.
  • scancel: To terminate queued or running jobs.
  • sinfo: To report system status.
  • squeue: To report the status of jobs.
  • sacct: To get information about running or completed jobs and job steps.
  • sview: To graphically report system and job status, including network topology.

Administrative tools like scontrol are available for monitoring and modifying cluster configuration and state, and sacctmgr is used for database management.
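
For instance, scontrol can inspect and change node state, and sacctmgr can query the accounting database (assuming slurmdbd is running); the node name below is hypothetical:

scontrol show node node01                                          # display a node's configuration and state
scontrol update NodeName=node01 State=DRAIN Reason="maintenance"   # take the node out of service
sacctmgr show cluster                                              # list clusters recorded in the accounting database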

Configurability

SLURM is highly configurable. For each node it tracks attributes such as processor count, memory size, and temporary disk space, along with an operational state (UP, DOWN, etc.). Nodes can be grouped into partitions, which function like job queues, each with its own configuration: maximum job time limit, node count per job, access lists, priority settings, and so on. An example slurm.conf fragment appears below.
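
As an illustration, node and partition definitions live in slurm.conf; the names, sizes, and limits below are hypothetical:

# slurm.conf fragment: four 16-core nodes grouped into two partitions
NodeName=node[01-04] CPUs=16 RealMemory=64000 TmpDisk=100000 State=UNKNOWN
PartitionName=debug Nodes=node[01-04] Default=YES MaxTime=30:00 State=UP
PartitionName=batch Nodes=node[01-04] MaxTime=7-00:00:00 State=UP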


Installation

To install SLURM, you will need to follow the official installation guide provided by the SLURM developers. The installation process involves downloading the source code, satisfying the necessary dependencies, and compiling the software. It's important to ensure that all compute nodes and the management node meet the system requirements for running SLURM.

System Requirements

Before installing SLURM, make sure that your Linux cluster meets the following requirements:

  • A Linux operating system with a supported kernel version.
  • The necessary libraries and development tools installed; MUNGE is the standard mechanism for authentication between nodes (see the example following this list).
  • Network connectivity between all nodes in the cluster.
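
Package names vary by distribution; on a Debian/Ubuntu system, the core build prerequisites might be installed with something like this (a sketch, not an exhaustive list):

sudo apt-get update
sudo apt-get install build-essential munge libmunge-dev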

Downloading SLURM

SLURM can be downloaded from the official SLURM website. It's important to download the version that is compatible with your system and meets your cluster's needs.
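
For example, a release tarball can be fetched and unpacked as follows; the version number shown is illustrative, so substitute the release you actually need:

wget https://download.schedmd.com/slurm/slurm-23.11.1.tar.bz2
tar -xjf slurm-23.11.1.tar.bz2
cd slurm-23.11.1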

Compilation and Installation

Once downloaded, SLURM needs to be compiled from source. This typically involves running configure to set up the build environment, followed by make and make install to compile and install the software. Detailed instructions can be found in the installation guide.
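
A typical build from the unpacked source tree looks like this; the install prefix is only an example:

./configure --prefix=/usr/local    # adjust the prefix to suit your site
make
sudo make install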

Configuration

After installation, SLURM requires configuration to match your cluster's hardware and desired scheduling policies. This involves editing the slurm.conf file and possibly other configuration files for plugins or additional features.
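
A minimal slurm.conf might begin with cluster-wide settings like these (hostnames are hypothetical; the node and partition lines shown in the Configurability section also belong in this file):

ClusterName=mycluster
SlurmctldHost=head01             # node running slurmctld
AuthType=auth/munge              # MUNGE-based authentication
SchedulerType=sched/backfill     # backfill scheduling
SelectType=select/linear         # allocate whole nodes to jobs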


Quick Start

Once SLURM is installed and configured, you can start using it to manage jobs on your cluster. Here's a quick start guide to get you up and running.

Starting SLURM Services

To begin using SLURM, you need to start the slurmctld daemon on the management node and the slurmd daemons on each compute node. This can typically be done using the system's service management commands, such as systemctl or service.
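
On a systemd-based distribution with the standard unit files installed, that might look like this:

# on the management node
sudo systemctl enable --now slurmctld

# on every compute node
sudo systemctl enable --now slurmd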

Basic SLURM Commands

Here are some basic commands to interact with SLURM; sample invocations follow the list:

  • sinfo: Displays the state of partitions and nodes managed by SLURM.
  • squeue: Lists the jobs queued and running in the system.
  • sbatch: Submits a batch script to SLURM for execution.
  • srun: Runs a job interactively within an allocation.
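
Typical invocations look like the following; the batch script name is illustrative:

sinfo                           # partition and node overview
squeue -u $USER                 # only your own jobs
sbatch my_job.sh                # submit a batch script
srun --ntasks=1 --pty bash      # interactive shell on a compute node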

Monitoring Jobs

To monitor the jobs running on your cluster, you can use the squeue and sacct commands. squeue will show you the current state of jobs in the queue, while sacct provides detailed accounting information about completed and running jobs.
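
For example (the job ID is hypothetical):

squeue -u $USER                                        # your queued and running jobs
sacct -j 12345 --format=JobID,JobName,State,Elapsed    # accounting details for one job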


Code Examples Of Popular Commands

Here are five popular SLURM commands with code examples:

1. Submitting a Batch Job

#!/bin/bash
# Job name shown in squeue, and the file that captures stdout/stderr:
#SBATCH --job-name=test_job
#SBATCH --output=result.txt
# Resources: 4 tasks, a 10-minute wall-clock limit, 100 MB of memory per CPU:
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

# Print each task's hostname, then sleep for 60 seconds
srun hostname
srun sleep 60

Saved to a file and submitted with sbatch (for example, sbatch test_job.sh), this script requests 4 tasks for up to 10 minutes, with 100 MB of memory per CPU. It runs hostname once per task and then sleeps for 60 seconds.

2. Cancelling a Job

scancel 12345

This command cancels the job with the ID 12345.
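
scancel also accepts filters; for instance, to cancel all of your own pending jobs:

scancel --user=$USER --state=PENDING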

3. Checking Cluster Status

sinfo

This command displays the current state of partitions and nodes in the cluster.

4. Monitoring Job Queue

squeue

This command lists all the jobs that are currently queued or running in the system.

5. Interactive Job Allocation

salloc --ntasks=2 --time=30:00

This command requests an interactive job allocation with 2 tasks for a duration of 30 minutes.
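
Once the allocation is granted, salloc starts a shell in which commands run on the allocated resources via srun; typing exit releases the allocation:

srun hostname    # runs inside the allocation granted by salloc
exit             # release the allocation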


Remember, this is just a brief overview and quick start guide to SLURM. For a full understanding and to utilize all of SLURM's capabilities, you should refer to the official documentation and consider getting training or support if needed.