2025-12-03: Release notes for checkpointless training on Amazon SageMaker HyperPod v1.0.0
Features
- Collective Communication Initialization Improvements: Offers novel initialization methods, Rootless and TCPStoreless, for NCCL and Gloo (a baseline rendezvous sketch follows this list).
- Memory-mapped (MMAP) Dataloader: Caches (persists) prefetched batches so they remain available even when a fault restarts the training job (see the cache sketch after this list).
- Checkpointless: Enables faster recovery from training faults in large-scale distributed environments through framework-level optimizations.
- Built on NVIDIA NeMo and PyTorch Lightning: Leverages these frameworks for efficient and flexible model training.
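For context, here is a minimal sketch of the conventional root-based, TCPStore-backed rendezvous that the Rootless and TCPStoreless methods improve on. The new methods' own APIs are not shown; the environment variables follow the standard torchrun convention.

```python
# Conventional rendezvous: every rank connects to a single root store
# (hosted by rank 0), so a slow or failed root stalls initialization for
# the whole job. Removing that dependency is presumably what the Rootless
# and TCPStoreless initialization paths target.
import os
import torch.distributed as dist

def init_collectives() -> None:
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )
```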
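And a minimal sketch of the idea behind the MMAP dataloader, assuming prefetched batches are persisted to a memory-mapped file that a restarted process can re-open; the cache path, shapes, and dtype here are illustrative assumptions, not the library's API.

```python
# Prefetched batches are written into a memory-mapped file and flushed to
# disk, so a process restarted after a fault can re-read them instead of
# re-fetching from the original data source.
import numpy as np

CACHE_PATH = "/tmp/batch_cache.mmap"   # hypothetical cache location
BATCHES, BATCH_SIZE, FEATURES = 128, 32, 1024

def open_cache(create: bool) -> np.memmap:
    # "w+" creates a fresh cache; "r+" re-opens an existing one after a restart.
    mode = "w+" if create else "r+"
    return np.memmap(CACHE_PATH, dtype=np.float32, mode=mode,
                     shape=(BATCHES, BATCH_SIZE, FEATURES))

def persist_batch(cache: np.memmap, index: int, batch: np.ndarray) -> None:
    cache[index] = batch
    cache.flush()  # push the pages to disk so they survive a crash
```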
Repository Structure
- src/: Contains the core implementation of checkpointless training features
- examples/: Provides getting-started examples for creating entry points and Kubernetes (k8s) job YAMLs for checkpointless training
🔧 Key Components
- dataloader: Module for wrapping a PyTorch Lightning DataModule with MMAP capabilities
- in_process: Module for handling faults and performing checkpointless in-process recovery (see the sketch after this list).
- nemo_plugins: Modules that integrate checkpointless features with the NeMo framework.
- supported models: GPT-OSS 120B (full fine-tuning and LoRA) and Llama 3 70B (pre-training and LoRA).
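Below is a minimal sketch of what in-process recovery can look like, under the assumption that a collective fault surfaces as an exception, the process group is rebuilt inside the same process, and training resumes from the model state still in memory; `init_collectives` and `train_step` are hypothetical stand-ins, not the module's real API.

```python
# On a collective failure, tear down and re-create the process group
# in-process, then retry the failed step from in-memory state, rather
# than restarting the job and reloading a checkpoint.
import torch.distributed as dist

def train(model, batches, init_collectives, train_step, max_retries: int = 3) -> None:
    for batch in batches:
        for attempt in range(max_retries):
            try:
                train_step(model, batch)
                break
            except RuntimeError:
                # NCCL errors typically surface as RuntimeError; rebuild
                # the process group without leaving the process.
                if dist.is_initialized():
                    dist.destroy_process_group()
                init_collectives()
        else:
            raise RuntimeError("exceeded in-process recovery retries")
```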
📚 Documentation
- See README.md for more detailed documentation
🤝 Contributing
We welcome contributions to expand the capabilities of sagemaker-hyperpod-checkpointless-training. Please refer to our contributing guidelines for more information.
Thank you for choosing sagemaker-hyperpod-checkpointless-training for your large-scale language model training needs!