Slide 15
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Best practices for large-scale distributed training
Step-by-step guides to create clusters:
Recipes to customize AMIs
AWS-optimized Dockerfiles
EFA cheatsheet
Distributed training examples
• One-click VPC deployments
• Mount FSx for Lustre filesystems
• EFA-enabled clusters
Validation (NCCL tests, etc.)
Observability (Prometheus-Grafana, etc.)
Profiling (Nsight product family)
• Slurm scripts/K8s materials
• Working with Pyxis/Enroot
• NeMo (Megatron-LM, Multimodal, BioNeMo)
• MosaicML
• DDP, FSDP
• SMDP, SMMP
• TensorFlow/JAX