This guide is for advanced users who want to configure and manage their own Slurm deployment on Instant Clusters. If you’re looking for a pre-configured solution, see Slurm Clusters.
This tutorial demonstrates how to configure Runpod Instant Clusters with Slurm to manage and schedule distributed workloads across multiple nodes. Slurm is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging Slurm on Runpod’s high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs. Follow the steps below to deploy a cluster and start running distributed Slurm workloads.
Use the UI to name and configure your cluster. For this walkthrough, keep Pod Count at 2 and select the option for 16x H100 SXM GPUs. Keep the Pod Template at its default setting (Runpod PyTorch).
Click Deploy Cluster. You should be redirected to the Instant Clusters page after a few seconds.
Now run the installation script on each Pod, replacing [MUNGE_SECRET_KEY] with any secure random string (like a password). The secret key is used for authentication between nodes, and must be identical across all Pods in your cluster.
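If you need a value for the secret key, one option (assuming openssl is available in your Pod’s environment, which it typically is in the Runpod PyTorch image) is to generate a random string once and use the same value on every Pod:

```bash
# Generate a random 32-byte, base64-encoded string to use as the MUNGE secret key.
openssl rand -base64 32
```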
This script automates the process of configuring a two-node Slurm cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It performs the necessary setup for both the primary (master/control) node and the secondary (compute/worker) node.
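As a rough illustration of the resource configuration the script sets up (the values below are assumptions for a two-node cluster with 8 GPUs per node; the files generated on your Pods are authoritative), the relevant parts of slurm.conf and gres.conf typically look something like this:

```
# slurm.conf (excerpt): controller host, GPU resource type, node and partition definitions
SlurmctldHost=node-0
GresTypes=gpu
NodeName=node-[0-1] Gres=gpu:8 State=UNKNOWN
PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# gres.conf (excerpt): maps the GPU devices available on each node
NodeName=node-[0-1] Name=gpu File=/dev/nvidia[0-7]
```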
If you’re not sure which Pod is the primary node, run echo $HOSTNAME in the web terminal of each Pod and look for node-0.
On the primary node (node-0), you need to run both Slurm services. First, start the controller daemon:
slurmctld -D
Then use the web interface to open a second terminal on the primary node and start the node daemon:
slurmd -D
On the secondary node (node-1), start the node daemon:
slurmd -D
After running these commands, you should see output indicating that the services have started successfully. The -D flag keeps the services running in the foreground, so each command needs its own terminal.
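Once the daemons are up, you can optionally verify that both nodes have registered with the controller. From another terminal on the primary node, the standard Slurm status commands should list node-0 and node-1 (ideally in an idle state):

```bash
sinfo                 # summary of partitions and node states
scontrol show nodes   # detailed per-node information, including GPU resources
```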
Run the following command on the primary node (node-0) to submit the test job script and confirm that your cluster is working properly:
sbatch test_batch.sh
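If you want a point of reference for what test_batch.sh might contain (the directives below are an illustrative sketch, not necessarily the exact script from the setup step), a minimal two-node test job could look like this:

```bash
#!/bin/bash
#SBATCH --job-name=test_simple
#SBATCH --output=test_simple_%j.out   # %j expands to the job ID
#SBATCH --nodes=2                     # run on both nodes in the cluster
#SBATCH --ntasks-per-node=1

# Print the hostname of every node that ran a task.
srun hostname
```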
Check the output file created by the test (test_simple_[JOBID].out) and look for the hostnames of both nodes. This confirms that the job ran successfully across the cluster.
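For example, once the job completes, you can print the output file from the primary node (the exact filename includes the job ID reported by sbatch):

```bash
cat test_simple_*.out
```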