Why use Instant Clusters?
- Scale beyond single machines. Train models too large for one GPU, or accelerate training by distributing across multiple nodes.
- High-speed networking included. Clusters include 1600-3200 Gbps networking between nodes, enabling efficient gradient synchronization and data movement.
- Zero configuration. Clusters come pre-configured with static IPs, environment variables, and framework support. Start training immediately.
- On-demand availability. Deploy clusters in minutes and pay only for what you use. Scale up for intensive jobs, then release resources.
When to use Instant Clusters
Instant Clusters offer distributed computing power beyond the capabilities of single-machine setups. Consider using Instant Clusters for:
- Multi-GPU language model training. Accelerate training of models like Llama or GPT across multiple GPUs.
- Large-scale computer vision projects. Process massive imagery datasets for autonomous vehicles or medical analysis.
- Scientific simulations. Run climate, molecular dynamics, or physics simulations that require massive parallel processing.
- Real-time AI inference. Deploy production AI models that demand multiple GPUs for low-latency responses.
- Batch processing pipelines. Create systems for large-scale data processing, including video rendering and genomics.
Get started
Choose the deployment guide that matches your preferred framework and use case:
Deploy a Slurm cluster
Set up a managed Slurm cluster for high-performance computing workloads. Slurm provides job scheduling, resource allocation, and queue management for research environments and batch processing workflows.
Deploy a PyTorch distributed training cluster
Set up multi-node PyTorch training for deep learning models. This tutorial covers distributed data parallel training, gradient synchronization, and performance optimization techniques (a minimal code sketch follows this list of guides).
Deploy an Axolotl fine-tuning cluster
Use the Axolotl framework to fine-tune large language models across multiple GPUs. This approach simplifies customizing pre-trained models like Llama or Mistral with built-in training optimizations.
Deploy an unmanaged Slurm cluster
For advanced users who need full control over Slurm configuration, this option provides a basic Slurm installation that you can customize for specialized workloads.
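To give a sense of what the PyTorch distributed training tutorial builds toward, here is a minimal sketch of distributed data parallel training using standard torch.distributed APIs. It assumes a launcher such as torchrun has already set RANK, LOCAL_RANK, and WORLD_SIZE on every node; the model, data, and hyperparameters are placeholders, not Runpod specifics.

```python
"""Minimal DistributedDataParallel sketch (placeholder model and data).

Assumes a launcher such as torchrun has set RANK, LOCAL_RANK, and
WORLD_SIZE in the environment on every node.
"""
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Join the process group; NCCL is the usual backend for multi-GPU training.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; DDP synchronizes gradients across all ranks during backward().
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)  # stand-in for a real batch
        loss = model(x).square().mean()
        loss.backward()  # gradients are all-reduced across nodes here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with torchrun on every node (for example, `torchrun --nnodes=<num_nodes> --nproc_per_node=<gpus_per_node> train.py`), gradient synchronization happens automatically inside `backward()`.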
How it works
When you deploy an Instant Cluster, Runpod provisions multiple GPU nodes within the same data center and connects them with high-speed networking. One node is designated as the primary node, and all nodes receive pre-configured environment variables for distributed communication. The high-speed network interfaces (ens1-ens8) handle inter-node communication for distributed training frameworks such as PyTorch. The eth0 interface on the primary node handles external traffic, such as downloading models or datasets.
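As an illustration of how the pre-configured environment and the high-speed interfaces fit together, the sketch below initializes a process group from environment variables and optionally points NCCL at one of the ens interfaces. MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are PyTorch's standard variable names, used here as assumptions; see the configuration reference for the exact variables an Instant Cluster exports and whether NCCL_SOCKET_IFNAME needs to be set at all.

```python
"""Sketch: wiring distributed initialization to the cluster's environment.

MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are PyTorch's standard names;
map them from whatever the cluster actually exports (see the
configuration reference). Setting NCCL_SOCKET_IFNAME is optional and
shown only to illustrate steering NCCL onto the high-speed interfaces.
"""
import os

import torch.distributed as dist

# Optionally pin NCCL's socket traffic to a high-speed NIC (ens1-ens8).
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens1")

dist.init_process_group(
    backend="nccl",
    init_method="env://",  # reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is ready")
dist.destroy_process_group()
```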
For more details on environment variables and network configuration, see the configuration reference.
Supported hardware
| GPU | Network speed | Nodes |
|---|---|---|
| B200 | 3200 Gbps | 2-8 nodes (16-64 GPUs) |
| H200 | 3200 Gbps | 2-8 nodes (16-64 GPUs) |
| H100 | 3200 Gbps | 2-8 nodes (16-64 GPUs) |
| A100 | 1600 Gbps | 2-8 nodes (16-64 GPUs) |
Pricing
Instant Cluster pricing is based on the GPU type and the number of nodes in your cluster. For current pricing, see the Instant Clusters pricing page.
All accounts have a default spending limit. To deploy a larger cluster, submit a support ticket at help@runpod.io.
Next steps
- Configuration reference: Learn about environment variables, network interfaces, and NCCL configuration.
- Deploy a Slurm cluster: Set up job scheduling for HPC workloads.
- Deploy a PyTorch cluster: Get started with distributed deep learning.