Torch Distributed Elastic
Created On: Jun 16, 2025 | Last Updated On: Jul 25, 2025
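Torch Distributed Elastic makes distributed PyTorch training fault-tolerant (failed workers are restarted) and elastic (the number of participating nodes can change while a job runs).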
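As a rough illustration of what this means for a training script, the sketch below reads the rank variables that torchrun exports to each worker process (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). The `WorkerEnv` class and `read_worker_env` helper are hypothetical names introduced only for this example and are not part of the library. Because workers may be restarted and the world size may change between restarts, the values are read at startup rather than hard-coded.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerEnv:
    """Hypothetical container for the per-worker values torchrun exports."""
    rank: int        # global rank of this worker
    local_rank: int  # rank of this worker on its own node
    world_size: int  # total number of workers in the current run


def read_worker_env(environ=os.environ) -> WorkerEnv:
    """Read RANK, LOCAL_RANK and WORLD_SIZE from the environment.

    Falls back to single-process defaults so the script also runs
    when it is started directly, without torchrun.
    """
    return WorkerEnv(
        rank=int(environ.get("RANK", 0)),
        local_rank=int(environ.get("LOCAL_RANK", 0)),
        world_size=int(environ.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    env = read_worker_env()
    print(f"worker {env.rank} of {env.world_size} (local rank {env.local_rank})")
```

Assuming the script is saved as `train.py` (a placeholder name), an elastic launch on one to four nodes might look like `torchrun --nnodes=1:4 --nproc-per-node=2 --max-restarts=3 --rdzv-backend=c10d --rdzv-endpoint=HOST:PORT train.py`, where `HOST:PORT` stands for a rendezvous endpoint you choose. The torchrun page listed under API covers the full set of options.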
Get Started
Usage
Documentation
API
- torchrun (Elastic Launch)
- Elastic Agent
- Multiprocessing
- Error Propagation
- Rendezvous
- Expiration Timers
- Metrics
- Events
- Subprocess Handling
- Control Plane
- NUMA Binding Utilities
Advanced
Plugins