[AWS Summit Japan 2025] Optimizing Foundation Model Development with Amazon SageMaker HyperPod: Insights from Training the Amazon Nova Model

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Revealing the secret behind a 40% cost reduction! Best practices for large-scale model training, proven in the development of Amazon Nova
Keita Watanabe
AWS-56
Amazon Web Services Japan G.K.
Sr. World Wide Specialist Solutions Architect, Frameworks, WWSO

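Amazon Nova: state-of-the-art models that deliver outstanding performance and cost-effectiveness
Understanding models (from lower cost and latency to higher performance):
• Amazon Nova Micro: low-cost, low-latency text model; text only. Generally available.
• Amazon Nova Lite: low-cost multimodal model; handles images, audio, and video in addition to text. Generally available.
• Amazon Nova Pro: high-performance multimodal model; handles images, audio, and video in addition to text. Generally available.
• Amazon Nova Premier: the most capable multimodal model; handles complex reasoning tasks and is best suited as a teacher model for model distillation. Generally available.
Creative content generation models:
• Amazon Nova Canvas: state-of-the-art image generation model. Generally available.
• Amazon Nova Reel: state-of-the-art video generation model. Generally available.
Amazon Nova Sonic: understands and generates human-like speech in real time. Generally available.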

Building blocks: Nova training stack
Orchestration & Observability (HyperPod)
• Resource Orchestrator
• Job Scheduler
Algorithms & Software
• ML Frameworks
Infrastructure (Amazon EC2 UltraClusters)
• Network: wide-bandwidth interconnect
• Compute: fast accelerators with large device memory
• Storage: scalable distributed file storage
Tools shown: Observability (Prometheus, CloudWatch, Grafana), Amazon EKS, NeMo, JAX, PyTorch, NxD

Training the Amazon Nova models
Data Processing, Large-Scale Training, Compression, Distillation, Model Vending, Customer Use Cases

Building blocks: Nova training stack
Orchestration & Observability (HyperPod)
• Resource Orchestrator
• Job Scheduler
• Dashboards
Algorithms & Software
• ML Frameworks
Infrastructure (Amazon EC2 UltraClusters)
• Network: wide-bandwidth interconnect
• Compute: fast accelerators with large device memory
• Storage: scalable distributed file storage
Tools shown: Observability (Prometheus, CloudWatch, Grafana), Amazon EKS, NeMo, JAX, PyTorch, NxD

Training foundation models requires enormous compute resources
Petabytes of unlabeled data + millions of GPU hours = foundation models with billions of parameters
Question: What compute infrastructure is needed to train Llama-3 70B?

Training foundation models requires enormous compute resources
Petabytes of unlabeled data + millions of GPU hours = foundation models with billions of parameters
Question: What compute infrastructure is needed to train Llama-3 70B?
Answer: Llama-3 70B was trained with 6.4M H100 GPU hours [1] ≈ 256 x p5 for 132 days
Source: [1] https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
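
The conversion from GPU hours to wall-clock days can be checked with a few lines of arithmetic. This is a minimal sketch, not the speaker's calculation: it assumes each p5 instance carries 8 H100 GPUs (p5.48xlarge) and that every GPU is busy for the whole run. The function name `wall_clock_days` is made up for illustration.

```python
H100_GPU_HOURS = 6.4e6      # Llama-3 70B figure quoted on the slide (model card)
NUM_INSTANCES = 256         # p5 instances, from the slide
GPUS_PER_INSTANCE = 8       # assumption: p5.48xlarge, 8 H100 GPUs each


def wall_clock_days(gpu_hours: float, instances: int, gpus_per_instance: int) -> float:
    """Days of wall-clock time if every GPU is busy for the whole run."""
    total_gpus = instances * gpus_per_instance
    return gpu_hours / total_gpus / 24


days = wall_clock_days(H100_GPU_HOURS, NUM_INSTANCES, GPUS_PER_INSTANCE)
print(f"{days:.1f} days on {NUM_INSTANCES * GPUS_PER_INSTANCE} GPUs")
```

Under these assumptions the result is roughly 130 days, in line with the slide's rounded figure of about 132 days.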

1. Compute requirements
Memory required for training: mixed-precision training needs 18 bytes/param, so Llama3 70B → 1.2 TB~
VRAM consumption, Llama3 70B (without activations etc.):
• Parameters (FP32/BF16): 420 GB
• Gradients (FP32): 280 GB
• Adam optimizer states (FP32): 560 GB
Scaling law [1]: FLOPs ≈ 6 × Parameters × Tokens
Chinchilla rule [2]: training a model requires about 20 tokens/parameter
6 × 70B × (70B × 20) ≈ 0.6 million exaFLOPs
Note: recent models are trained on many more tokens than this.
[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
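
The memory and FLOPs figures on this slide follow from simple per-parameter accounting. The sketch below reproduces them under one assumed reading of the slide: 6 bytes/param for parameters (an FP32 master copy plus a BF16 working copy), 4 bytes/param for FP32 gradients, and 8 bytes/param for the two FP32 Adam moments. The names `memory_gb` and `training_flops` are illustrative only.

```python
PARAMS = 70e9  # Llama3 70B

# Bytes per parameter; the split is an assumed reading of the slide's 18 bytes/param.
BYTES_PER_PARAM = {
    "parameters": 4 + 2,        # FP32 master copy + BF16 working copy
    "gradients": 4,             # FP32
    "optimizer_states": 4 + 4,  # Adam first and second moments, FP32
}


def memory_gb(params: float) -> dict:
    """Memory per component in GB (1 GB = 1e9 bytes), excluding activations."""
    return {name: params * nbytes / 1e9 for name, nbytes in BYTES_PER_PARAM.items()}


def training_flops(params: float, tokens_per_param: float = 20) -> float:
    """Scaling-law estimate: FLOPs ≈ 6 × parameters × tokens."""
    tokens = params * tokens_per_param
    return 6 * params * tokens


mem = memory_gb(PARAMS)
total_gb = sum(mem.values())
exaflops = training_flops(PARAMS) / 1e18

for name, gb in mem.items():
    print(f"{name}: {gb:.0f} GB")
print(f"total: {total_gb:.0f} GB, compute: {exaflops:.3g} exaFLOPs")
```

The three components sum to 18 bytes/param, which is where the slide's "1.2 TB~" for a 70B-parameter model comes from.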

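2. Network requirements
6 × 70B × (70B × 20) ≈ 0.6 million exaFLOPs
On a single H100 (2000 TFLOPS in BF16), this amount of computation takes about 3,500 days. Finishing training in one month therefore requires more than 100 H100 GPUs, which means many instances have to work in coordination.
→ Latency and throughput between GPUs and between instances strongly affect training.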
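
The single-GPU and one-month estimates can be reproduced as follows. This is a rough sketch that takes the slide's 2000 TFLOPS figure as the sustained rate. The `utilization` parameter is an added assumption to show how the GPU count grows when the achieved rate is below peak; it does not appear on the slide.

```python
import math

TOTAL_FLOPS = 6 * 70e9 * (70e9 * 20)   # scaling-law estimate from the slide
H100_PEAK_FLOPS = 2000e12              # 2000 TFLOPS BF16, the slide's figure
SECONDS_PER_DAY = 86_400


def training_days(total_flops, num_gpus, peak_flops, utilization=1.0):
    """Wall-clock days assuming perfect scaling across num_gpus."""
    return total_flops / (num_gpus * peak_flops * utilization) / SECONDS_PER_DAY


def gpus_needed(total_flops, target_days, peak_flops, utilization=1.0):
    """Smallest GPU count that finishes within target_days."""
    per_gpu = peak_flops * utilization * target_days * SECONDS_PER_DAY
    return math.ceil(total_flops / per_gpu)


single_gpu_days = training_days(TOTAL_FLOPS, 1, H100_PEAK_FLOPS)
gpus_for_one_month = gpus_needed(TOTAL_FLOPS, 30, H100_PEAK_FLOPS)

print(f"one H100: {single_gpu_days:.0f} days")
print(f"30-day run: {gpus_for_one_month} GPUs at peak rate")
print(f"30-day run: {gpus_needed(TOTAL_FLOPS, 30, H100_PEAK_FLOPS, 0.5)} GPUs at 50% of peak")
```

Even at the peak rate the job needs more than 100 GPUs, so it spans many instances and the interconnect becomes part of the critical path.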

3. Storage requirements
Foundation model training requires a large-scale corpus:

Data                                      Tokens    Size (bytes)
Wikitext                                  100 M~    750 MB
C4.EN (Colossal Clean Crawled Corpus)     156 B     305 GB
RedPajama-Data-1T                         1 T       5 TB
RedPajama-Data-v2                         30 T      170 TB

Sources: https://arxiv.org/abs/2104.08758, https://huggingface.co/bigscience/bloom

High-bandwidth, large-capacity shared distributed storage is required.
Checkpoints store the parameters and the optimizer states. Breakdown of a Llama3 70B checkpoint:
• Parameters (FP32/BF16): 420 GB
• Adam optimizer states (FP32): 560 GB
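
To get a feel for what these checkpoint sizes mean for storage, the sketch below adds up one checkpoint and estimates how long a write takes and how much space a retention policy consumes. The 10 GB/s aggregate write throughput and the keep-last-5 policy are hypothetical values chosen for illustration, not figures from the talk.

```python
CHECKPOINT_GB = {"parameters": 420, "optimizer_states": 560}  # from the slide


def checkpoint_size_gb(parts: dict) -> float:
    """Total size of one checkpoint in GB."""
    return sum(parts.values())


def write_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to write one checkpoint at a given aggregate throughput."""
    return size_gb / throughput_gb_per_s


def retained_tb(size_gb: float, keep_last: int) -> float:
    """Storage used when the most recent keep_last checkpoints are retained."""
    return size_gb * keep_last / 1000


size = checkpoint_size_gb(CHECKPOINT_GB)
seconds = write_seconds(size, throughput_gb_per_s=10)   # hypothetical throughput
kept = retained_tb(size, keep_last=5)                   # hypothetical retention

print(f"{size} GB per checkpoint, {seconds:.0f} s per write, {kept:.1f} TB retained")
```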

Amazon EC2 UltraClusters
A supercomputer that supports high-performance computing, networking, and storage
Diagram labels: Head-node, Compute Nodes, /fsx, S3
Infrastructure: GPU / EFA / FSx for Lustre

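Distributed training: scaling foundation model training
• Single GPU: training completes on a single GPU
• Pipeline parallelism: splits the model by layer and processes the layers on different devices
• Tensor parallelism: splits the computation inside each block of the model, parallelizing the MLP and Attention blocks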
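
One common form of tensor parallelism splits a weight matrix column-wise, lets each device multiply the input by its own shard, and concatenates the partial outputs. The toy sketch below shows that idea with plain Python lists standing in for GPU tensors. It is an illustration under those assumptions rather than how any particular framework implements it, and it requires the column count to be divisible by the number of shards.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Split matrix w column-wise into `parts` equal shards."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts=2):
    """Compute x @ w with w sharded by columns across `parts` devices."""
    shards = split_columns(w, parts)
    partial = [matmul(x, shard) for shard in shards]  # each would run on its own GPU
    # "All-gather": concatenate the column blocks back into full output rows.
    return [sum((p[r] for p in partial), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]                 # activations: 2 samples, 2 features
w = [[1, 0, 2, 1], [0, 1, 1, 3]]     # weights: 2 inputs, 4 outputs

print(tensor_parallel_matmul(x, w, parts=2))
print(matmul(x, w))
```

Both calls print the same matrix: the sharded computation is numerically identical, but every layer now needs a collective operation between devices.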

Distributed training: scaling foundation model training
• Data parallelism: multiple model replicas (Replica 1, Replica 2, Replica 3, Replica 4) each process a different split of the data
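
Data parallelism can be illustrated with a one-parameter model: each replica computes the gradient on its own shard of the batch, and averaging those gradients (the role an all-reduce plays in practice) gives the same result as the full-batch gradient. This is a toy sketch in plain Python, assuming equally sized shards. All function names are illustrative.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the model y ≈ w * x, with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def shard(seq, num_replicas):
    """Split seq into num_replicas contiguous shards (length must divide evenly)."""
    size = len(seq) // num_replicas
    return [seq[i * size:(i + 1) * size] for i in range(num_replicas)]


def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per replica."""
    return sum(values) / len(values)


def data_parallel_grad(w, xs, ys, num_replicas=4):
    """Each replica computes a local gradient on its shard; results are averaged."""
    local = [grad_mse(w, sx, sy)
             for sx, sy in zip(shard(xs, num_replicas), shard(ys, num_replicas))]
    return all_reduce_mean(local)


xs = [float(i) for i in range(8)]
ys = [3.0 * x for x in xs]
w = 1.0

full = grad_mse(w, xs, ys)
parallel = data_parallel_grad(w, xs, ys, num_replicas=4)
print(full, parallel)
```

Because the replicas must exchange gradients on every step, the all-reduce is where inter-GPU and inter-instance bandwidth shows up in training throughput.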

Why is distributed training difficult?
Distributed training across Compute Nodes (diagram: nodes, each holding several accelerators labeled "A")

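Distributed training is a battle against hardware failures
• Tens of thousands of GPUs
• Thousands of hosts
• Training runs that last 3–4 months
• 10 to 20 hardware failures per day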

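Main types of hardware failure
• Bit flips: external factors such as cosmic rays flip bits in RAM
• Silent data corruption (SDC): data errors that cannot be detected
• Disconnection from the PCI bus: the GPU is no longer recognized
• XID errors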

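Distributed training best practices
Train on the assumption that hardware will fail
1. Burn-in testing
2. Monitoring
3. Frequent checkpoint saves
4. Spare hardware
Measure everything
1. Collect metrics
   1. Training
   2. Communication
   3. Host
2. Visualize the metrics
3. Set KPIs, e.g. goodput
Fail fast, recover fast
1. Fail quickly when a problem occurs
2. Shorten startup time
3. Optimize checkpoint frequency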
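
The slide does not say how checkpoint frequency is optimized. One widely used first-order rule is Young's approximation, which sets the interval to sqrt(2 × checkpoint cost × mean time between failures). The sketch below pairs it with a simple goodput model, taking goodput to mean the fraction of wall-clock time spent on useful training. That definition, the model, and all numbers (12 failures per day, 100 s per checkpoint, 600 s to restart) are assumptions for illustration, not values from the talk.

```python
import math


def young_interval(checkpoint_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2 * checkpoint_s * mtbf_s)


def goodput(interval_s: float, checkpoint_s: float, mtbf_s: float, restart_s: float) -> float:
    """First-order estimate of the fraction of time spent on useful training.

    Losses: time spent writing checkpoints, plus, per failure, the restart
    time and on average half an interval of recomputed work.
    """
    checkpoint_overhead = checkpoint_s / interval_s
    failure_loss = (interval_s / 2 + restart_s) / mtbf_s
    return 1 - checkpoint_overhead - failure_loss


MTBF_S = 24 * 3600 / 12   # hypothetical: 12 failures per day across the cluster
CHECKPOINT_S = 100        # hypothetical time to write one checkpoint
RESTART_S = 600           # hypothetical time to replace a node and restart the job

best = young_interval(CHECKPOINT_S, MTBF_S)
for interval in (best / 2, best, best * 2):
    g = goodput(interval, CHECKPOINT_S, MTBF_S, RESTART_S)
    print(f"interval {interval:6.0f} s -> goodput {g:.3f}")
```

In this model, checkpointing either twice as often or half as often as the computed interval lowers goodput, and shortening the restart time raises it at every interval, which matches the "shorten startup time" item on the slide.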

Amazon SageMaker HyperPod
• Good news! With Amazon SageMaker HyperPod, these best practices can be put into practice.
Diagram labels: HyperPod, Amazon SageMaker, Adventurous ML Teams 😎

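Resiliency: automatic replacement of failed nodes
Cycle shown on the slide: training → checkpoint saved → node failure occurs → automatic node replacement (instance recovery) → recovery from the saved checkpoint → training continues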
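
On the training-script side, this cycle relies on the job being able to resume from its latest checkpoint after it is restarted on a replacement node. The sketch below simulates that pattern with a JSON file as the checkpoint and an exception as the node failure. It is a generic illustration and does not use any HyperPod API. All function names are made up.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file and rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state when no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at_step=None):
    state = load_checkpoint(path)          # resume point after a restart
    while state["step"] < total_steps:
        if fail_at_step is not None and state["step"] == fail_at_step:
            raise RuntimeError("simulated node failure")
        state["weight"] += 0.1             # stand-in for one optimizer step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


def run_with_auto_resume(path, total_steps, checkpoint_every, fail_at_step):
    try:
        return train(path, total_steps, checkpoint_every, fail_at_step)
    except RuntimeError:
        # The job is restarted on a replacement node and picks up the last checkpoint.
        return train(path, total_steps, checkpoint_every)


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "ckpt.json")
    final = run_with_auto_resume(ckpt, total_steps=10, checkpoint_every=3, fail_at_step=7)
    last_saved = load_checkpoint(ckpt)

print(final["step"], last_saved["step"])
```

In the simulated run the failure hits after step 7, so only the work since the step-6 checkpoint is redone. The gap between the last checkpoint and the failure is the time lost to each fault.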

HyperPod observability
HyperPod cluster: Compute Nodes
• Accelerator observability: maximize accelerator utilization for specific applications
• Cluster observability: maximize cluster utilization across applications

AWS Deep Learning AMIs (DLAMI)
• Performance optimized with the latest NVIDIA drivers, CUDA libraries, Lustre drivers, and EFA software stack
• Supports both GPU and Trainium

AWS Deep Learning software stack
Provided through HyperPod and the Deep learning GPU AMI, from top to bottom:
• ML frameworks: PyTorch, JAX
• Communication libraries / SDKs: DDP, FSDP, MegatronLM, DeepSpeed, torch-neuronx, SMP, SMDDP, NCCL
• Accelerator SDK + optimized libs: Neuron, CUDA
• AWS OFI NCCL, libfabric
• Hardware / kernel space: EFA Kernel Driver, Accelerator Driver, GPU, Trainium, EFA Device

Customers are using Amazon SageMaker HyperPod to train FMs at scale

Latest case study from Japan: Llama 3.3 Swallow
https://swallow-llm.github.io/llama3.3-swallow.ja.html
https://aws.amazon.com/blogs/machine-learning/training-llama-3-3-swallow-a-japanese-sovereign-llm-on-amazon-sagemaker-hyperpod/

Call to Action
awsome-distributed-training (a collection of best practices for distributed training on AWS)
• https://github.com/aws-samples/awsome-distributed-training
• Reference architectures for AWS ParallelCluster, Amazon EKS, and Amazon SageMaker HyperPod
• Test cases for Megatron-LM, NeMo, PyTorch FSDP, MosaicML Composer, and others
• Guides to cluster testing methods such as NCCL tests
• Setup of the observability stack (Prometheus & Grafana)
Workshops
• Machine Learning on ParallelCluster: https://catalog.workshops.aws/ml-on-aws-parallelcluster/en-US
• SageMaker HyperPod Slurm Workshop: https://catalog.workshops.aws/sagemaker-hyperpod
• SageMaker HyperPod EKS Workshop: https://catalog.workshops.aws/sagemaker-hyperpod-eks

Thank you!


Keita Watanabe
