
Building foundation model on AWS

Keita Watanabe
February 25, 2025


Transcript

  1. Introduction to FM development on AWS
     Keita Watanabe, Sr. Solutions Architect, GenAI
     © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Foundation model training needs compute at scale
     Petabytes of unlabeled data + millions of GPU hours = foundation models with billions of parameters.
     Example: Llama 3 70B used 6.4M H100 GPU hours [1], roughly 256x p5 for 132 days.
     Example: Falcon-180B used 7.0M A100 GPU hours [2], roughly 512x p4de for 73 days.
     Sources: [1] https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md  [2] https://arxiv.org/pdf/2311.16867
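     The instance counts above are simple arithmetic: total GPU hours divided by the number of GPUs in the cluster, then by 24.
     A minimal sketch of that conversion (the helper name is ours; 8 accelerators per p5/p4de instance as in the instance tables later in the deck):

         def gpu_hours_to_days(gpu_hours, gpus_per_instance, num_instances):
             """Wall-clock days to burn the given accelerator-hours on a fixed-size cluster."""
             return gpu_hours / (gpus_per_instance * num_instances) / 24

         print(gpu_hours_to_days(6.4e6, 8, 256))  # Llama 3 70B: ~130 days on 256x p5
         print(gpu_hours_to_days(7.0e6, 8, 512))  # Falcon-180B: ~71 days on 512x p4de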
  3. Compute requirements
     Llama 3 70B training requires more than 1.2 TB of VRAM.
     Scaling law [1]: FLOPS ≈ 6 x Parameters x Tokens
     Chinchilla law [2]: models need to be trained on roughly 20 x (number of parameters) tokens.

     Parameters      FLOPS       Tokens
     1 Billion       1.21e+20    20.2 Billion
     10 Billion      1.23e+23    205.1 Billion
     175 Billion     3.85e+24    3.7 Trillion
     1 Trillion      1.27e+26    21.2 Trillion
     10 Trillion     1.30e+28    216.2 Trillion

     VRAM consumption, Llama 3 70B (without activations etc.):
     Parameters (FP32/BF16)          420 GB
     Gradients (FP32)                280 GB
     Adam optimizer states (FP32)    560 GB

     [1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
     [2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
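     The table rows follow from the two formulas above; a minimal sketch (the 20.2 tokens-per-parameter ratio is read off the table's first row rather than being an exact Chinchilla constant):

         def chinchilla_tokens(params, tokens_per_param=20.2):
             """Compute-optimal token budget: roughly 20x the parameter count."""
             return params * tokens_per_param

         def training_flops(params, tokens):
             """Approximate total training compute: 6 x parameters x tokens."""
             return 6 * params * tokens

         n = 1e9
         d = chinchilla_tokens(n)
         print(f"{d:.3g} tokens, {training_flops(n, d):.3g} FLOPs")  # ~2.02e+10 tokens, ~1.21e+20 FLOPs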
  4. Pretraining requires communication
     Model    Release    Size   Tokens        Hardware           AWS instance equivalent
     BLOOM    Nov-2022   175B   366 Billion   384x A100 80GB     48x P4de.24xlarge
     Pythia   Apr-2023   12B    300 Billion   256x A100 40GB     32x P4d.24xlarge
     Llama    Feb-2023   65B    1 Trillion    512x A100 40GB     64x P4d.24xlarge
     Llama2   Jul-2023   70B    2 Trillion    2000x A100 80GB    250x P4de.24xlarge
     Multi-node distributed training is indispensable.
     [1] Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z. and Du, Y., 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
  5. Distributed file storage requirements
     A large corpus is needed for FM pretraining:
     Data                                       Tokens    Size (bytes)
     Wikitext                                   ~100 M    750 MB
     C4.EN (Colossal Clean Crawled Corpus) [1]  156 B     305 GB
     RedPajama-Data-1T                          1 T       5 TB
     RedPajama-Data-v2                          30 T      170 TB

     You also need to store parameters and optimizer states.
     Llama 3 70B checkpoint breakdown: Parameters (FP32/BF16) 420 GB, Adam optimizer states (FP32) 560 GB.
     Example: BLOOM 175B checkpoints including optimizer states: 2.2 TB [2].
     Large-scale, high-speed distributed storage is required.
     [1] https://arxiv.org/abs/2104.08758
     [2] https://huggingface.co/bigscience/bloom
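     The 420 GB / 560 GB figures come from bytes per parameter; a minimal sketch assuming a BF16 copy plus an FP32 master copy (6 bytes per parameter) and two FP32 Adam moments (8 bytes per parameter), assumptions consistent with the slide rather than Llama 3's published training recipe:

         def checkpoint_gb(num_params, param_bytes=6, adam_bytes=8):
             """Rough checkpoint size split into parameter and optimizer-state portions (GB)."""
             return num_params * param_bytes / 1e9, num_params * adam_bytes / 1e9

         print(checkpoint_gb(70e9))  # (420.0, 560.0) GB for Llama 3 70B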
  6. Introduction
  7. Goal: we provide guidance on Foundation Model (FM) training on AWS
     1. Infrastructure for FM training/inference: EC2 Capacity Blocks, UltraClusters, Nitro, EFA, GPUs, Trainium, Inferentia, Neuron, SageMaker
     2. Tools to build with LLMs and other FMs: Amazon Bedrock (guardrails, agents, customization capabilities)
     3. Applications that leverage FMs: Amazon Q Business, Amazon Q Developer, Amazon Q in QuickSight, Amazon Q in Connect
  8. Prerequisites
     • Basics of machine learning (especially neural network training)
     • Definition of foundation models
     • Common GPT-like foundation models, including Llama
     • Foundation model pretraining (language modeling)
  9. What do we need to prepare to build FMs on AWS?
  10. Building blocks and AWS offerings
      Architecture & Orchestration (resource orchestrator, job scheduler): Amazon EKS, AWS ParallelCluster, Amazon SageMaker HyperPod
      Algorithms & Software: ML frameworks
      Infrastructure (Amazon EC2 UltraClusters):
      • Compute: fast accelerators with large device memory
      • Network: wide-bandwidth interconnect
      • Storage: scalable distributed file storage
  11. Infrastructure
  12. Building blocks and AWS offerings (recap of slide 10)
  13. Amazon EC2 UltraClusters
      Supercomputer-class compute, networking, and storage for high-performance computing.
      Compute: P4d(e) (NVIDIA A100 Tensor Core), P5(e) (NVIDIA H100/H200 Tensor Core), Trn1 (AWS Trainium), Trn2 (AWS Trainium2)
      Network: Elastic Fabric Adapter
      Storage: Amazon FSx for Lustre, Amazon S3
  14. Amazon EC2 instances for FM training
      AWS Trainium and NVIDIA GPU accelerators with large memory.
      Instance        Accelerator  Num. acc.  Acc. memory  vCPU  RAM        On-demand price* (USD/h)
      P4d.24xlarge    A100         8          320 GB       96    1152 GiB   32.77
      P4de.24xlarge   A100         8          640 GB       96    1152 GiB   40.96
      P5.48xlarge     H100         8          640 GB       192   2 TiB      98.32
      Trn1.32xlarge   Trainium     16         512 GB       128   512 GB     21.50
      * N. Virginia. https://aws.amazon.com/ec2/instance-types/p5/
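     Combining this pricing with the earlier Llama 3 70B example (256x p5 for roughly 132 days) gives a rough upper bound on on-demand cost; real projects typically use Capacity Blocks or other pricing, so treat this as illustrative arithmetic only:

         instances, days, usd_per_hour = 256, 132, 98.32   # p5.48xlarge on-demand list price
         cost = instances * days * 24 * usd_per_hour
         print(f"${cost:,.0f}")  # roughly $80M at list price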
  15. EFA (Elastic Fabric Adapter)
      Dedicated network interface for MPI/NCCL traffic.
      High-bandwidth, low-latency communication with SRD (Scalable Reliable Datagram).
      OS-kernel-bypass communication through the Libfabric API.
      Out-of-order delivery avoids the head-of-line blocking problem; multipath routing provides consistently low latency.
      (Diagram: flows 1-3 between endpoints A and B spread across multiple network paths.)
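     Collectives such as all-reduce generate the traffic EFA is built for. A minimal PyTorch sketch that times one all-reduce (launch with torchrun; NCCL only uses EFA when the aws-ofi-nccl plugin is installed, which this code assumes rather than configures):

         import time
         import torch
         import torch.distributed as dist

         dist.init_process_group("nccl")                 # NCCL backend; rides on EFA via aws-ofi-nccl if present
         torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

         x = torch.randn(512 * 1024 * 1024 // 4, device="cuda")   # ~512 MiB of fp32
         torch.cuda.synchronize()
         start = time.time()
         dist.all_reduce(x)                              # sum across all ranks
         torch.cuda.synchronize()
         if dist.get_rank() == 0:
             print(f"all_reduce of 512 MiB took {time.time() - start:.3f} s")
         dist.destroy_process_group()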
  16. Network performance of FM training instances
      Instance         Accelerator  Num. acc.  Total acc. memory  Acc. P2P BW  EFA
      P4d.24xlarge     A100 40GB    8          320 GB             600 GB/s     400 Gbps
      P4de.24xlarge    A100 80GB    8          640 GB             600 GB/s     400 Gbps
      P5.48xlarge      H100         8          640 GB             900 GB/s     3200 Gbps
      Trn1.32xlarge    Trainium1    16         512 GB             768 Gbps     800 Gbps
      Trn1n.32xlarge   Trainium1    16         512 GB             768 Gbps     1600 Gbps
  17. PyTorch FSDP distributed training with/without EFA
      GPT-3 175B-parameter model, 512 GPUs = 64 instances, EFA v1; chart annotations: 2.5x and 25.6x faster.
      Source: https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff (published 2022/05/16)
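     FSDP shards parameters, gradients, and optimizer state across ranks, so every step triggers NCCL collectives that EFA accelerates. A minimal sketch of wrapping a model (the single transformer layer is a stand-in for illustration, not the blog post's code; launch with torchrun or srun):

         import torch
         import torch.nn as nn
         import torch.distributed as dist
         from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

         dist.init_process_group("nccl")
         torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

         model = nn.TransformerEncoderLayer(d_model=4096, nhead=32).cuda()  # stand-in model
         model = FSDP(model)                       # shards params, grads, and optimizer state
         optim = torch.optim.AdamW(model.parameters(), lr=1e-4)  # build optimizer after wrapping

         x = torch.randn(8, 128, 4096, device="cuda")
         loss = model(x).mean()
         loss.backward()                           # gradient reduce-scatter over NCCL/EFA
         optim.step()
         dist.destroy_process_group()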
  18. Amazon FSx for Lustre
      • Fully managed Lustre, a high-performance distributed file system.
      • Usable as a POSIX-compliant file system.
      • Hierarchical storage: data can be transparently imported from and exported to Amazon S3.
      Data stored in Amazon S3 is loaded into the FSx for Lustre file system for processing, and results are persisted back to Amazon S3.
  19. Storage hierarchy in distributed training
      Instance store: checkpoints, temporary data
      FSx for Lustre: shared datasets, checkpoints, outputs
      Amazon S3: data backbone, datasets, checkpoints, outputs
  20. Architecture and Orchestration
  21. How can we leverage the infrastructure to train FMs?
  22. Building blocks and AWS offerings (recap of slide 10)
  23. Slurm vs. Kubernetes
  24. Slurm architecture
      Engineers and researchers connect through a login node, the head node schedules jobs onto the compute nodes, and FSx for Lustre holds datasets and checkpoints, all within the AWS Cloud.
  25. Slurm job submission
      Submit training jobs: a shell script declares resources and the commands to execute; submit via sbatch for long-running jobs, high control, and many jobs.

          #!/bin/bash
          #SBATCH --nodes=4
          #SBATCH --job-name=train-llama2
          #SBATCH --output=logs/%x_%j.out
          #SBATCH --ntasks-per-node=8
          #SBATCH --exclusive
          echo "Starting training job"
          srun python train.py

      Quick prototyping: book resources via salloc and run commands interactively in parallel with srun.

          salloc -N 4 --exclusive
          srun python train.py

      Submit container jobs with Enroot/Pyxis:

          docker build -f nccl-tests.Dockerfile -t nccl-tests:latest .
          enroot import -o /apps/nccl.sqsh dockerd://nccl-tests:latest
          srun --container-image /apps/nccl.sqsh all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 1 -n 100
  26. EKS for foundation model (FM) training
      Distributed training and inference jobs on thousands of AI accelerators.
      Effective orchestration for machine learning:
      • Highly scalable across thousands of instances
      • Improved utilization of compute resources
      • Broad ecosystem of open source and proprietary tools
      (Architecture: EKS control plane in an EKS VPC; p5 compute nodes with Elastic Fabric Adapter in the user VPC data plane; Amazon ECR, Amazon FSx for Lustre, and Amazon CloudWatch integrations; access via the CLI/kubectl.)
  27. TorchElastic training job
      Launch pods to run model training scripts:

          kubectl apply -f imagenet-fsx.yaml

          apiVersion: elastic.pytorch.org/v1alpha1
          kind: ElasticJob
          metadata:
            name: imagenet
          spec:
            # Use "etcd-service:2379" if you already applied etcd.yaml
            rdzvEndpoint: etcd-service:2379
            minReplicas: 1
            maxReplicas: 128
            replicaSpecs:
              Worker:
                replicas: 4
                restartPolicy: ExitCode
                template:
                  apiVersion: v1
                  kind: Pod
                  spec:
                    nodeSelector:
                      beta.kubernetes.io/instance-type: p3.8xlarge
                    containers:
                      - name: elasticjob-worker
                        image: torchelastic/examples:0.2.0
                        imagePullPolicy: Always
                        env:
                          - name: NCCL_DEBUG
                            value: INFO
                        args:
                          - "--nproc_per_node=4"
                          - "/workspace/examples/imagenet/main.py"
                          - "--arch=resnet50"
                          - "--epochs=1"
                          - "--batch-size=64"
                          - "--workers=8"
                          - "--checkpoint-file=/fsx-shared/checkpoint.pth.tar"
                          - "/fsx-shared/ILSVRC/Data/CLS-LOC/"
                        resources:
                          limits:
                            nvidia.com/gpu: 4
                        volumeMounts:
                          - name: fsx-pvc
                            mountPath: /fsx-shared
                          - name: dshm
                            mountPath: /dev/shm
                    volumes:
                      - name: fsx-pvc
                        persistentVolumeClaim:
                          claimName: fsx-claim
                      - name: dshm
                        emptyDir:
                          medium: Memory
  28. Why is resiliency so important?
      Failures increase time to train, waste compute resources and money, and waste engineering hours to isolate, fix, and resume.
      Meta's Llama 3.1 training on 16k GPUs: "During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions."
      Source: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
      (Chart: wasted training time grows with cluster size; internal benchmarking.)
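     The quoted figure works out to an interruption every few hours at that scale, which is why automated recovery matters; the arithmetic:

         interruptions, days = 466, 54                    # from the Llama 3.1 snapshot quoted above
         print(f"one interruption every {days * 24 / interruptions:.1f} hours")  # ~2.8 hours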
  29. SageMaker HyperPod
      Self-healing process:
      1. SageMaker HyperPod health-checks the CPUs, GPUs, and network of the cluster instances.
      2. When a hardware fault is detected, SageMaker HyperPod automatically replaces the faulty instance with a healthy one.
      3. Once the replacement completes, SageMaker HyperPod re-queues the workload in Slurm and resumes it from the checkpoint.
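     Auto-resume only pays off if the training script checkpoints regularly and can restart from the latest checkpoint. A minimal, generic PyTorch sketch (the /fsx path and function names are illustrative, not a HyperPod API):

         import os
         import torch

         CKPT = "/fsx/checkpoints/latest.pt"   # shared FSx for Lustre path (illustrative)

         def save_checkpoint(model, optimizer, step):
             torch.save({"model": model.state_dict(),
                         "optimizer": optimizer.state_dict(),
                         "step": step}, CKPT)

         def maybe_resume(model, optimizer):
             """Return the step to resume from (0 if no checkpoint exists yet)."""
             if not os.path.exists(CKPT):
                 return 0
             state = torch.load(CKPT, map_location="cpu")
             model.load_state_dict(state["model"])
             optimizer.load_state_dict(state["optimizer"])
             return state["step"] + 1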
  30. Accessing a SageMaker HyperPod cluster
      Engineers and researchers reach the cluster through a login node, while admins and ops use the service endpoint; compute nodes in the customer account share datasets and checkpoints on FSx for Lustre.
  31. Integrates with an existing EKS control plane
      SageMaker HyperPod compute nodes join the EKS data plane alongside EKS managed node groups, self-managed node groups, and AWS Fargate pods, and are managed with kubectl through the EKS control plane.
  32. HyperPod on EKS architecture
      (Diagram: the EKS control plane in an EKS VPC connects to HyperPod compute nodes in a HyperPod VPC over a cross-account elastic network interface; SageMaker performs health checks and instance replacements; the nodes use Elastic Fabric Adapter behind a Network Load Balancer; the user VPC integrates Amazon ECR, Amazon CloudWatch, Amazon S3, and Amazon FSx for Lustre.)
  33. Algorithms and Software
  34. Building blocks and AWS offerings (recap of slide 10)
  35. What kind of software stack is needed?
  36. Distributed training software stack (GPU)
      From top to bottom: ML frameworks, communication libraries/SDKs, hardware drivers, EC2 instance.
  37. Distributed training software stack (Neuron)
      From top to bottom: ML frameworks, communication libraries/SDKs, hardware drivers, EC2 instance.
  38. Thank you!