2025-12-03: Release notes for checkpointless training on Amazon SageMaker HyperPod v1.0.0
Features
- Collective Communication Initialization Improvements: Offers novel initialization methods, Rootless and TCPStoreless, for NCCL and Gloo (a baseline rendezvous sketch follows this list).
- Memory-mapped (MMAP) Dataloader: Caches (persists) prefetched batches so they remain available even when a fault restarts the training job (see the cache sketch after this list).
- Checkpointless: Enables faster recovery from training faults in large-scale distributed environments through framework-level optimizations.
- Built on NVIDIA NeMo and PyTorch Lightning: Leverages these frameworks for efficient and flexible model training.
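For context, here is a minimal sketch of the conventional root-based, TCPStore-backed rendezvous that the Rootless and TCPStoreless methods improve on. The new methods' own APIs are not shown; the environment variables follow the standard torchrun convention.

```python
# Conventional rendezvous: every rank connects to a single root store
# (hosted by rank 0), so a slow or failed root stalls initialization for
# the whole job. Removing that dependency is presumably what the Rootless
# and TCPStoreless initialization paths target.
import os
import torch.distributed as dist

def init_collectives() -> None:
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )
```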
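And a minimal sketch of the idea behind the MMAP dataloader, assuming prefetched batches are persisted to a memory-mapped file that a restarted process can re-open; the cache path, shapes, and dtype here are illustrative assumptions, not the library's API.

```python
# Prefetched batches are written into a memory-mapped file and flushed to
# disk, so a process restarted after a fault can re-read them instead of
# re-fetching from the original data source.
import numpy as np

CACHE_PATH = "/tmp/batch_cache.mmap"   # hypothetical cache location
BATCHES, BATCH_SIZE, FEATURES = 128, 32, 1024

def open_cache(create: bool) -> np.memmap:
    # "w+" creates a fresh cache; "r+" re-opens an existing one after a restart.
    mode = "w+" if create else "r+"
    return np.memmap(CACHE_PATH, dtype=np.float32, mode=mode,
                     shape=(BATCHES, BATCH_SIZE, FEATURES))

def persist_batch(cache: np.memmap, index: int, batch: np.ndarray) -> None:
    cache[index] = batch
    cache.flush()  # push the pages to disk so they survive a crash
```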
Repository Structure
- src/: Contains the core implementation of checkpointless training features
- examples/: Provides getting-started examples for creating entry points and Kubernetes (k8s) job YAMLs for checkpointless training
🔧 Key Components
- dataloader: Module for wrapping a PyTorch Lightning DataModule with MMAP capabilities
- in_process: Module for handling faults and performing checkpointless in-process recovery (see the sketch after this list).
- nemo_plugins: Modules that integrate checkpointless features with the NeMo framework.
- supported models: GPT-OSS 120B (full fine-tuning and LoRA) and Llama 3 70B (pre-training and LoRA).
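Below is a minimal sketch of what in-process recovery can look like, under the assumption that a collective fault surfaces as an exception, the process group is rebuilt inside the same process, and training resumes from the model state still in memory; `init_collectives` and `train_step` are hypothetical stand-ins, not the module's real API.

```python
# On a collective failure, tear down and re-create the process group
# in-process, then retry the failed step from in-memory state, rather
# than restarting the job and reloading a checkpoint.
import torch.distributed as dist

def train(model, batches, init_collectives, train_step, max_retries: int = 3) -> None:
    for batch in batches:
        for attempt in range(max_retries):
            try:
                train_step(model, batch)
                break
            except RuntimeError:
                # NCCL errors typically surface as RuntimeError; rebuild
                # the process group without leaving the process.
                if dist.is_initialized():
                    dist.destroy_process_group()
                init_collectives()
        else:
            raise RuntimeError("exceeded in-process recovery retries")
```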
📚 Documentation
- See README.md for more detailed documentation
🤝 Contributing
We welcome contributions to expand the capabilities of sagemaker-hyperpod-checkpointless-training. Please refer to our contributing guidelines for more information.
Thank you for choosing sagemaker-hyperpod-checkpointless-training for your large-scale language model training needs!