SLURM Tutorial
Overview of SLURM Workload Manager
What is SLURM?
SLURM, which stands for Simple Linux Utility for Resource Management, is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of all sizes. It's designed to be easy to install and maintain, requiring no kernel modifications and being relatively self-contained. SLURM is widely used in supercomputing and high-performance computing (HPC) environments to efficiently distribute work among the available compute nodes.
Key Functions of SLURM
SLURM serves three primary functions in a computing cluster:
- Resource Allocation: It provides exclusive and/or non-exclusive access to resources (compute nodes) to users for a specified duration, enabling them to perform computational tasks.
- Job Scheduling and Execution: SLURM offers a framework for starting, executing, and monitoring work, typically parallel jobs, on the allocated set of nodes.
- Queue Management: It manages a queue of pending jobs, arbitrating contention for resources and ensuring that the workload is processed efficiently.
Architecture
SLURM's architecture is centered around a few key components:
- slurmctld: This is the central management daemon that monitors resources and work. In case of failure, a backup manager can take over its responsibilities.
- slurmd: Each compute node runs a slurmd daemon, which waits for work, executes it, returns status, and then waits for more work. These daemons ensure fault-tolerant hierarchical communications.
- slurmdbd: The Slurm DataBase Daemon is optional and can be used to record accounting information for multiple Slurm-managed clusters in a single database.
- slurmrestd: Another optional component, the Slurm REST API Daemon, allows interaction with Slurm through a REST API.
User Tools and Commands
SLURM provides a variety of user tools and commands, including:
- srun: To initiate jobs.
- scancel: To terminate queued or running jobs.
- sinfo: To report system status.
- squeue: To report the status of jobs.
- sacct: To get information about running or completed jobs and job steps.
- sview: To graphically report system and job status, including network topology.
Administrative tools like scontrol are available for monitoring and modifying cluster configuration and state, and sacctmgr is used for database management.
Configurability
SLURM is highly configurable, allowing monitoring of node states such as processor count, memory size, temporary disk space, and operational state (UP, DOWN, etc.). Nodes can be grouped into partitions, which function like job queues, with their own set of configurations like maximum job time limit, node count per job, access lists, and priority settings.
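For example, partitions and the nodes they contain are defined in slurm.conf; the node names, sizes, and limits below are purely illustrative:

```
# Illustrative slurm.conf entries -- node names and limits are examples only.
NodeName=node[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=debug Nodes=node[01-02] Default=YES MaxTime=30:00 State=UP
PartitionName=batch Nodes=node[01-04] MaxTime=24:00:00 PriorityTier=10 State=UP
```

Here a short-limit debug partition is the default, while a batch partition spans all nodes with a longer time limit and a higher scheduling tier.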
Installation
To install SLURM, you will need to follow the official installation guide provided by the SLURM developers. The installation process involves downloading the source code, satisfying the necessary dependencies, and compiling the software. It's important to ensure that all compute nodes and the management node meet the system requirements for running SLURM.
System Requirements
Before installing SLURM, make sure that your Linux cluster meets the following requirements:
- A Linux operating system with a supported kernel version.
- The necessary libraries and development tools (e.g., a C compiler and GNU make) installed.
- Network connectivity between all nodes in the cluster.
Downloading SLURM
SLURM can be downloaded from the official SLURM website. It's important to download the version that is compatible with your system and meets your cluster's needs.
Compilation and Installation
Once downloaded, SLURM needs to be compiled from source. This typically involves running configure to set up the build environment, followed by make and make install to compile and install the software. Detailed instructions can be found in the installation guide.
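The steps above amount to a short shell session. The tarball name below is a placeholder, so substitute the release you actually downloaded, and adjust the configure flags to your site's conventions:

```
# Illustrative build sequence -- the tarball name is a placeholder.
tar -xjf slurm-VERSION.tar.bz2
cd slurm-VERSION/
./configure --prefix=/usr/local --sysconfdir=/usr/local/etc
make
sudo make install
```

The same sequence must produce a consistent installation on the management node and all compute nodes, so sites commonly build once and distribute packages rather than compiling on every node.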
Configuration
After installation, SLURM requires configuration to match your cluster's hardware and desired scheduling policies. This involves editing the slurm.conf file and possibly other configuration files for plugins or additional features.
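As a rough sketch, a minimal slurm.conf might look like the following; the cluster name, hostnames, node sizes, and paths are all placeholders, not recommendations:

```
# Minimal slurm.conf sketch -- names, hostnames, and paths are placeholders.
ClusterName=mycluster
SlurmctldHost=head-node
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
NodeName=node[01-02] CPUs=8 State=UNKNOWN
PartitionName=main Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP
```

The same slurm.conf must be kept consistent across the management node and all compute nodes.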
Quick Start
Once SLURM is installed and configured, you can start using it to manage jobs on your cluster. Here's a quick start guide to get you up and running.
Starting SLURM Services
To begin using SLURM, you need to start the slurmctld daemon on the management node and the slurmd daemons on each compute node. This can typically be done using the system's service management commands, such as systemctl or service.
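With systemd, that typically looks like the following, assuming the slurmctld and slurmd service units shipped with SLURM are installed:

```
# On the management node:
sudo systemctl enable --now slurmctld

# On each compute node:
sudo systemctl enable --now slurmd
```

If a slurmdbd accounting database is used, its daemon should be started before slurmctld so accounting records are not lost.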
Basic SLURM Commands
Here are some basic commands to interact with SLURM:
- sinfo: Displays the state of partitions and nodes managed by SLURM.
- squeue: Lists the jobs queued and running in the system.
- sbatch: Submits a batch script to SLURM for execution.
- srun: Runs a job interactively within an allocation.
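A few example invocations of these commands (my_job.sh is a hypothetical script name):

```
sinfo                      # show partition and node states
sbatch my_job.sh           # queue a batch script for execution
squeue -u $USER            # list only your own jobs
srun --ntasks=1 hostname   # run one task interactively
```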
Monitoring Jobs
To monitor the jobs running on your cluster, you can use the squeue and sacct commands. squeue will show you the current state of jobs in the queue, while sacct provides detailed accounting information about completed and running jobs.
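As an illustration, here is what squeue output looks like and how its columns can be filtered with standard tools. The job data below is invented sample text, so the filtering can be shown without a live cluster:

```shell
# Invented sample of `squeue` output, stored as text for illustration.
squeue_sample='JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12345     debug test_job    alice  R       0:42      1 node01
12346     debug   sleepy      bob PD       0:00      1 (Resources)'

# The ST column (field 5) holds the job state: R = running, PD = pending.
# On a real cluster, `squeue -t PD` applies the same filter server-side.
printf '%s\n' "$squeue_sample" | awk 'NR > 1 && $5 == "PD" {print $1}'
```

This prints the job ID of the one pending job in the sample.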
Code Examples Of Popular Commands
Here are five popular SLURM commands with code examples:
1. Submitting a Batch Job
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --output=result.txt
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
srun hostname
srun sleep 60
This batch script submits a job named test_job that runs 4 tasks for up to 10 minutes, using 100MB of memory per CPU. It runs the hostname command and then sleeps for 60 seconds.
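Because #SBATCH directives are ordinary shell comments, a batch script like this can be sanity-checked with standard tools before it is ever submitted. The sketch below writes the script above to a file (the filename test_job.sh is just an example) and verifies it without SLURM installed:

```shell
# Write the batch script from above to a file (filename is an example).
cat > test_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --output=result.txt
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
srun hostname
srun sleep 60
EOF

# The #SBATCH lines are comments to bash, so the script parses cleanly
# even on a machine without SLURM.
bash -n test_job.sh && echo "syntax OK"
grep -c '^#SBATCH' test_job.sh   # counts the SLURM directives
```

On a cluster, sbatch test_job.sh would then queue the job.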
2. Cancelling a Job
scancel 12345
This command cancels the job with the ID 12345.
3. Checking Cluster Status
sinfo
This command displays the current state of partitions and nodes in the cluster.
4. Monitoring Job Queue
squeue
This command lists all the jobs that are currently queued or running in the system.
5. Interactive Job Allocation
salloc --ntasks=2 --time=30:00
This command requests an interactive job allocation with 2 tasks for a duration of 30 minutes.
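salloc opens a shell inside the allocation once it is granted; commands launched there with srun run on the allocated resources:

```
salloc --ntasks=2 --time=30:00   # opens a shell when the allocation is granted
srun hostname                    # runs on the allocated tasks
exit                             # leave the shell and release the allocation
```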
Remember, this is just a brief overview and quick start guide to SLURM. For a full understanding and to utilize all of SLURM's capabilities, you should refer to the official documentation and consider getting training or support if needed.