Release Notes v1.0.0

@htzho released this 01 Dec 21:50

2025-12-03: Release notes for checkpointless training on Amazon SageMaker HyperPod v1.0.0

Features

  • Collective Communication Initialization Improvements: Offers novel initialization methods, Rootless and TCPStoreless, for NCCL and Gloo (for context, see the rendezvous sketch after this list).
  • Memory-mapped (MMAP) Dataloader: Caches (persists) prefetched batches so that they remain available even when a fault forces the training job to restart.
  • Checkpointless Recovery: Enables faster recovery from cluster faults in large-scale distributed training environments through framework-level optimizations.
  • Built on NVIDIA NeMo and PyTorch Lightning: Leverages these frameworks for efficient and flexible model training.
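
For context, the sketch below shows the conventional TCPStore-based rendezvous that ranks typically perform when initializing torch.distributed; the Rootless and TCPStoreless methods in this release aim to remove this dependency on a single rendezvous host. This is a minimal baseline using standard PyTorch APIs, not code from this repository.

```python
# Baseline only: conventional TCPStore rendezvous in torch.distributed.
# Rootless / TCPStoreless initialization avoids requiring every rank to
# reach a single key-value store hosted on rank 0.
import os

import torch.distributed as dist

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# All ranks connect to a store served by rank 0 (the "root").
store = dist.TCPStore(
    host_name=os.environ["MASTER_ADDR"],
    port=int(os.environ["MASTER_PORT"]),
    world_size=world_size,
    is_master=(rank == 0),
)

# NCCL (or Gloo) process group bootstrapped through that store.
dist.init_process_group(backend="nccl", store=store, rank=rank, world_size=world_size)
```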

Repository Structure

  • src/: Contains the core implementation of checkpointless training features
  • examples/: Provides getting-started examples for creating entry points and Kubernetes (k8s) job YAMLs for checkpointless training (a minimal entry-point sketch follows)
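
As a rough illustration of what an entry point looks like (the real scripts live in examples/; every name below is a placeholder, not this repository's API), a Lightning-style launcher builds a model and data, then hands them to Trainer.fit:

```python
# Placeholder entry point: a generic PyTorch Lightning launch skeleton.
# None of these class names come from sagemaker-hyperpod-checkpointless-training.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)


def main():
    # Random tensors stand in for the real dataset/DataModule.
    x, y = torch.randn(256, 16), torch.randn(256, 1)
    loader = DataLoader(TensorDataset(x, y), batch_size=32)

    trainer = pl.Trainer(max_epochs=1, accelerator="auto", devices="auto")
    trainer.fit(ToyModel(), train_dataloaders=loader)


if __name__ == "__main__":
    main()
```

In a Kubernetes setup, the accompanying job YAML would typically launch a script like this on each pod, for example via torchrun or the framework's own launcher.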

🔧 Key Components

  1. dataloader: Module for wrapping a PyTorch Lightning DataModule with MMAP caching capabilities (a conceptual sketch follows this list).
  2. in_process: Module for handling faults and performing checkpointless in-process recovery.
  3. nemo_plugins: Modules that extend the NeMo framework with checkpointless features.
  4. supported models: GPT-OSS 120B (full fine-tune and LoRA) and Llama 3 70B (pre-train and LoRA).
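
A conceptual sketch of the MMAP dataloader idea: prefetched batches are persisted in a memory-mapped file on disk, so a restarted process can reopen the cache instead of repeating the prefetch. The class and argument names below are illustrative only and are not the API exposed by the dataloader module.

```python
# Conceptual sketch: serving training data from an on-disk, memory-mapped
# cache that survives a process restart. Not this repository's implementation.
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class MMAPCachedDataset(Dataset):
    """Serves samples from a memory-mapped array persisted on disk."""

    def __init__(self, cache_path: str, shape, dtype=np.float32):
        # np.memmap keeps the data on disk; a restarted process reopens it
        # read-only instead of re-running the original prefetch.
        self.cache = np.memmap(cache_path, dtype=dtype, mode="r", shape=shape)

    def __len__(self):
        return self.cache.shape[0]

    def __getitem__(self, idx):
        return torch.from_numpy(np.array(self.cache[idx]))


class MMAPDataModule(pl.LightningDataModule):
    """Illustrative wrapper that exposes the persisted cache as training data."""

    def __init__(self, cache_path: str, shape, batch_size: int = 8):
        super().__init__()
        self.cache_path, self.shape, self.batch_size = cache_path, shape, batch_size

    def train_dataloader(self):
        return DataLoader(
            MMAPCachedDataset(self.cache_path, self.shape),
            batch_size=self.batch_size,
        )
```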

📚 Documentation

  • See README.md for more detailed documentation

🤝 Contributing

We welcome contributions to expand the capabilities of sagemaker-hyperpod-checkpointless-training. Please refer to our contributing guidelines for more information.

Thank you for choosing sagemaker-hyperpod-checkpointless-training for your large-scale language model training needs!