Slide 1

Slide 1 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building Foundation Models on AWS
Keita Watanabe, Sr. WW Solutions Architect, GenAI

Slide 2

Slide 2 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Foundation model training needs compute at scale
Petabytes of unlabeled data + Millions of GPU hours = Foundation models (billions of parameters)
Example: Llama-3 70B used 6.4M H100 GPU hours [1] ≈ 256 x p5 for 132 days
Example: Falcon-180B used 7.0M A100 GPU hours [2] ≈ 512 x p4de for 73 days
Sources: [1] https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md  [2] https://arxiv.org/pdf/2311.16867
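A quick sanity check of the arithmetic above (minimal Python sketch; the GPU-hour figures and instance counts are the ones quoted on this slide, and both p5 and p4de carry 8 accelerators per instance):

runs = {"Llama-3 70B on p5 (H100)": (6.4e6, 256), "Falcon-180B on p4de (A100)": (7.0e6, 512)}
for name, (gpu_hours, instances) in runs.items():
    days = gpu_hours / (instances * 8) / 24   # 8 GPUs per instance, 24 hours per day
    print(f"{name}: ~{days:.0f} days on {instances} instances")
# prints ~130 and ~71 days, consistent with the ~132 and ~73 days quoted above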

Slide 3

Slide 3 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
1. Compute requirements
Llama3 70B training requires more than 1.2 TB of VRAM
Scaling law [1]: FLOPs ≈ 6 x Parameters x Tokens
Chinchilla law [2]: models need to be trained on roughly 20 x (num. parameters) tokens

Parameters   | FLOPs     | Tokens
1 Billion    | 1.21e+20  | 20.2 Billion
10 Billion   | 1.23e+22  | 205.1 Billion
175 Billion  | 3.85e+24  | 3.7 Trillion
1 Trillion   | 1.27e+26  | 21.2 Trillion
10 Trillion  | 1.30e+28  | 216.2 Trillion

VRAM consumption, Llama3 70B (without activations etc.):
Parameters (FP32/BF16): 420 GB
Gradients (FP32): 280 GB
Adam optimizer states (FP32): 560 GB

[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
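The table can be reproduced approximately from the two formulas; a minimal Python sketch using the 20-tokens-per-parameter rule of thumb (the table itself uses Chinchilla's fitted estimates, so the token counts differ slightly):

def chinchilla_tokens(params):
    return 20 * params           # Chinchilla rule of thumb: ~20 tokens per parameter

def training_flops(params, tokens):
    return 6 * params * tokens   # FLOPs ≈ 6 x Parameters x Tokens

for p in (1e9, 10e9, 175e9, 1e12, 10e12):
    t = chinchilla_tokens(p)
    print(f"{p:.0e} params -> {t:.2e} tokens, {training_flops(p, t):.2e} FLOPs")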

Slide 4

Slide 4 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
2. Pretraining requires communication

Model  | Release  | Size | Tokens      | Hardware          | AWS instance equivalent
BLOOM  | Nov-2022 | 175B | 366 Billion | 384 x A100 80GB   | 48 x p4de.24xlarge
Pythia | Apr-2023 | 12B  | 300 Billion | 256 x A100 40GB   | 32 x p4d.24xlarge
Llama  | Feb-2023 | 65B  | 1 Trillion  | 512 x A100 40GB   | 64 x p4d.24xlarge
Llama2 | Jul-2023 | 70B  | 2 Trillion  | 2000 x A100 80GB  | 250 x p4de.24xlarge

Multi-node distributed training is indispensable.
[1] Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z. and Du, Y., 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
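To get a feel for why the interconnect matters, here is a rough estimate of per-step gradient traffic under plain data parallelism (a sketch: the BF16 gradient precision and the ring all-reduce cost model are illustrative assumptions, not figures from the survey):

params = 70e9                                 # a Llama2-70B-class model
grad_bytes = params * 2                       # assuming gradients are communicated in BF16 (2 bytes each)
n = 2000                                      # GPU count from the Llama2 row above
per_gpu = 2 * (n - 1) / n * grad_bytes        # ring all-reduce moves ~2*(N-1)/N of the message per GPU
print(f"~{per_gpu / 1e9:.0f} GB moved per GPU per optimizer step")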

Slide 5

Slide 5 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
3. Distributed file storage requirements
Large corpus needed for FM pretraining:

Data                                  | Tokens | Size (bytes)
Wikitext                              | 100 M+ | 750 MB
C4.EN (Colossal Clean Crawled Corpus) | 156 B  | 305 GB
RedPajama-Data-1T                     | 1 T    | 5 TB
RedPajama-Data-v2                     | 30 T   | 170 TB

You also need to store parameters and optimizer states.
Llama3 70B checkpoint breakdown: Parameters (FP32/BF16) 420 GB, Adam optimizer states (FP32) 560 GB
Ex.: BLOOM 175B checkpoints including optimizer states: 2.2 TB [2]
Large scale, high speed distributed storage is required.

[1] https://arxiv.org/abs/2104.08758
[2] https://huggingface.co/bigscience/bloom
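The parameter and optimizer-state figures above follow from simple bytes-per-parameter accounting; a minimal Python sketch of that arithmetic (assuming BF16/FP32 mixed precision with Adam, as on the previous slide):

params = 70e9
param_state = params * (4 + 2)   # FP32 master copy + BF16 working copy -> ~420 GB
adam_state  = params * (4 + 4)   # Adam first and second moments in FP32 -> ~560 GB
print(f"parameters: {param_state/1e9:.0f} GB, "
      f"optimizer states: {adam_state/1e9:.0f} GB, "
      f"total: {(param_state + adam_state)/1e9:.0f} GB")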

Slide 6

Slide 6 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Introduction

Slide 7

Slide 7 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Goal: we provide guidance on Foundation Model (FM) training on AWS
1. INFRASTRUCTURE FOR FM TRAINING/INFERENCE: EC2 Capacity Blocks | UltraClusters | EFA | Nitro | GPUs | Trainium | Inferentia | Neuron | SageMaker
2. TOOLS TO BUILD WITH LLMs AND OTHER FMs: Amazon Bedrock (Guardrails | Agents | Customization capabilities)
3. APPLICATIONS THAT LEVERAGE FMs: Amazon Q Business | Amazon Q Developer | Amazon Q in QuickSight | Amazon Q in Connect

Slide 8

Slide 8 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Prerequisites
• Basics of machine learning (especially neural network training)
• Definition of foundation models
• Common GPT-like foundation models, including Llama
• Foundation model pretraining (language modeling)

Slide 9

Slide 9 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. What do we need to prepare to build FMs on AWS?

Slide 10

Slide 10 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building blocks / AWS offering
• Architecture & Orchestration: resource orchestrator, job scheduler → Amazon EKS, AWS ParallelCluster, Amazon SageMaker HyperPod
• Algorithms & Software: ML frameworks → Frameworks
• Compute: fast accelerators with large device memory → Amazon EC2 UltraClusters (infrastructure)
• Network: wide-bandwidth interconnect → Amazon EC2 UltraClusters (infrastructure)
• Storage: scalable distributed file storage → Amazon EC2 UltraClusters (infrastructure)

Slide 11

Slide 11 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Infrastructure

Slide 12

Slide 12 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building blocks / AWS offering
• Architecture & Orchestration: resource orchestrator, job scheduler → Amazon EKS, AWS ParallelCluster, Amazon SageMaker HyperPod
• Algorithms & Software: ML frameworks → Frameworks
• Compute: fast accelerators with large device memory → Amazon EC2 UltraClusters (infrastructure)
• Network: wide-bandwidth interconnect → Amazon EC2 UltraClusters (infrastructure)
• Storage: scalable distributed file storage → Amazon EC2 UltraClusters (infrastructure)

Slide 13

Slide 13 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon EC2 UltraClusters
Supercomputer-class infrastructure for high-performance computing, networking, and storage
Compute: P5 (NVIDIA H100/H200 Tensor Core), P6 (NVIDIA B200 Tensor Core), Trn1 (AWS Trainium), Trn2 (AWS Trainium2)
Network: Elastic Fabric Adapter
Storage: FSx for Lustre, Amazon S3

Slide 14

Slide 14 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
NVIDIA GPU instances
P5: NVIDIA H100/H200 Tensor Core | P6: NVIDIA B200 Tensor Core

Instance size | ACC  | Num ACC | ACC memory           | vCPU | RAM   | Local SSDs
P5.48xlarge   | H100 | 8       | 80 GB x 8 (640 GB)   | 192  | 2 TiB | 8 x 3.84 TB NVMe SSD
P5e.48xlarge  | H200 | 8       | 141 GB x 8 (1128 GB) | 192  | 2 TiB | 8 x 3.84 TB NVMe SSD
P5en.48xlarge | H200 | 8       | 141 GB x 8 (1128 GB) | 192  | 2 TiB | 8 x 3.84 TB NVMe SSD

https://aws.amazon.com/ec2/instance-types/p5/

Slide 15

Slide 15 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Trainium instances
Trn1: AWS Trainium | Trn2: AWS Trainium2

Instance size  | ACC       | Num ACC | ACC memory            | vCPU | RAM    | Local SSDs
Trn1.32xlarge  | Trainium1 | 16      | 512 GB                | 128  | 512 GB | 8 TB
Trn1n.32xlarge | Trainium1 | 16      | 512 GB                | 128  | 512 GB | 8 TB
Trn2.48xlarge  | Trainium2 | 16      | 96 GiB x 16 (1.5 TB)  | 192  | 2 TB   | 4 x 1.92 TB NVMe SSD
Trn2u.48xlarge | Trainium2 | 16      | 96 GiB x 16 (1.5 TB)  | 192  | 2 TB   | 4 x 1.92 TB NVMe SSD

https://aws.amazon.com/ec2/instance-types/trn1
https://aws.amazon.com/ec2/instance-types/trn2
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium1.html
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium2.html

Slide 16

Slide 16 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
GPT pretraining using PyTorch FSDP with/without EFA
Benchmark highlights from the article: 2.5x / 25.6x faster; 512 GPUs = 64 instances
https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff (published 2022-05-16)

Slide 17

Slide 17 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Elastic Fabric Adapter (EFA)
• SRD protocol purpose-built for scalability in the cloud
• Kernel bypass and GPU-direct RDMA for low-latency, high-throughput communication between GPUs
• Continuing improvements in latency and completion times

Slide 18

Slide 18 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
EFA (Elastic Fabric Adapter)
• Dedicated network interface for MPI/NCCL
• High bandwidth, low latency communication with SRD (Scalable Reliable Datagram)
• OS-kernel-bypassed communication through the Libfabric API
• Out-of-order delivery avoids head-of-line blocking
• Multipath routing (endpoint A to endpoint B over Flow 1 / Flow 2 / Flow 3) delivers consistently low latency
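As a concrete picture of what runs on top of EFA, a minimal NCCL all-reduce in PyTorch (a sketch: it assumes the nodes have the EFA driver and the aws-ofi-nccl plugin installed, and that the script is launched with torchrun so the rendezvous environment variables are set):

import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")          # NCCL collectives; EFA is used via the Libfabric plugin
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # LOCAL_RANK is set by torchrun

x = torch.ones(1024, device="cuda")
dist.all_reduce(x, op=dist.ReduceOp.SUM)         # summed across all ranks, over NVLink within a node and EFA across nodes
print(f"rank {dist.get_rank()}: x[0] = {x[0].item()}")
dist.destroy_process_group()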

Slide 19

Slide 19 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Network performance of FM training instances

Model          | Accelerator | Num Acc. | Total memory | Acc. P2P BW | EFA
P5.48xlarge    | H100        | 8        | 640 GB       | 900 GB/s    | 3200 Gbps
P5e.48xlarge   | H200        | 8        | 1128 GB      | 900 GB/s    | 3200 Gbps
P5en.48xlarge  | H200        | 8        | 1128 GB      | 900 GB/s    | 3200 Gbps
Trn1.32xlarge  | Trainium1   | 16       | 512 GB       | 768 Gbps    | 800 Gbps
Trn1n.32xlarge | Trainium1   | 16       | 512 GB       | 768 Gbps    | 1600 Gbps

Slide 20

Slide 20 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon FSx for Lustre
• Fully managed Lustre file system for high performance workloads
• POSIX file system compatible
• Native integration with Amazon S3: transparent access to data on S3 through Lustre; data created on Lustre is persisted in Amazon S3

Slide 21

Slide 21 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon S3 lazy load example
/file1.txt          <->  s3://bucket/file1.txt
/file2.txt          <->  s3://bucket/file2.txt
/folder1/file3.txt  <->  s3://bucket/folder1/file3.txt
/folder2/file4.txt  <->  s3://bucket/folder2/file4.txt

Slide 22

Slide 22 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
ML training storage hierarchy
• Instance store: checkpoints, temporary data
• FSx for Lustre (Availability Zone, e.g. us-east-1a): shared datasets, checkpoints, outputs
• Amazon S3 (Region): data backbone, datasets, checkpoints, outputs
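In practice the hierarchy looks something like this (illustrative Python sketch; the /fsx path, bucket name, and key are placeholders, and the tiny linear layer stands in for the real model):

import boto3
import torch

model = torch.nn.Linear(8, 8)                          # stand-in for the real model
ckpt_path = "/fsx/checkpoints/step_001000.pt"          # shared FSx for Lustre mount (assumed path)
torch.save({"step": 1000, "model": model.state_dict()}, ckpt_path)

s3 = boto3.client("s3")                                # back up the checkpoint to the S3 data backbone
s3.upload_file(ckpt_path, "my-training-bucket", "checkpoints/step_001000.pt")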

Slide 23

Slide 23 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Architecture and Orchestration

Slide 24

Slide 24 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How can we leverage the infrastructure to train FMs?

Slide 25

Slide 25 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building blocks / AWS offering
• Architecture & Orchestration: resource orchestrator, job scheduler → Amazon EKS, AWS ParallelCluster, Amazon SageMaker HyperPod
• Algorithms & Software: ML frameworks → Frameworks
• Compute: fast accelerators with large device memory → Amazon EC2 UltraClusters (infrastructure)
• Network: wide-bandwidth interconnect → Amazon EC2 UltraClusters (infrastructure)
• Storage: scalable distributed file storage → Amazon EC2 UltraClusters (infrastructure)

Slide 26

Slide 26 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Slurm vs. Kubernetes

Slide 27

Slide 27 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Slurm architecture
Diagram components (AWS Cloud): head node, login node, compute nodes, FSx for Lustre for datasets & checkpoints; engineers & researchers access the cluster through the login node.

Slide 28

Slide 28 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
HPC Cluster - AWS ParallelCluster
Head node: 1 x c5.9xlarge, 36 vCPUs (18 physical), 72 GB of memory
Compute nodes: 100+ x p4de.24xlarge (plus C6, M6, R6), 96 vCPUs (48 physical), 1152 GB of memory, 8 x NVIDIA A100 80GB GPUs, network: 400 Gbps ENA & EFA, storage: 8 x 1 TB NVMe + EBS
Shared file systems: Amazon FSx for Lustre, 108 TB mounted on /fsx
Cluster stack: Slurm 22.05.5, CUDA 11.6
(Diagram: users reach the head node in a public subnet; the p4d compute fleet and the FSx for Lustre /fsx file system sit in a private subnet in us-east-1.)

Slide 29

Slide 29 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to create a cluster
pcluster create-cluster -f config.yaml
The config defines the head node, the compute nodes (e.g. a p4d.24xlarge compute fleet), and the shared storage (Amazon FSx for Lustre mounted on /fsx).
https://aws.amazon.com/hpc/parallelcluster/

Slide 30

Slide 30 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Slurm job submission

Submit training jobs: a shell script with resources and commands to execute, submitted via sbatch for long running jobs, high control & many jobs.
#SBATCH --nodes=4
#SBATCH --job-name=train-llama2
#SBATCH --output=logs/%x_%j.out
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
echo "Starting training job"
srun python train.py

Quick prototyping: book resources via salloc and run commands interactively in parallel with srun.
salloc -N 4 --exclusive
srun python train.py

Submit container jobs with Enroot/Pyxis:
docker build -f nccl-tests.Dockerfile -t nccl-tests:latest .
enroot import -o /apps/nccl.sqsh dockerd://nccl-tests:latest
srun --container-image nccl.sqsh all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 1 -n 100

Slide 31

Slide 31 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
EKS for foundation model (FM) training
Distributed training and inference jobs on thousands of AI accelerators
Effective orchestration for machine learning:
• Highly scalable across thousands of instances
• Improved utilization of compute resources
• Broad ecosystem of open source and proprietary tools
Architecture: the EKS control plane runs in the EKS VPC; the data plane (compute nodes such as p5 instances connected via Elastic Fabric Adapter) runs in the user VPC alongside Amazon ECR, Amazon FSx for Lustre, and Amazon CloudWatch; users interact through the CLI.

Slide 32

Slide 32 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
TorchElastic training job
• Launch pods to run the model training script: kubectl apply -f imagenet-fsx.yaml

apiVersion: elastic.pytorch.org/v1alpha1
kind: ElasticJob
metadata:
  name: imagenet
spec:
  # Use "etcd-service:2379" if you already apply etcd.yaml
  rdzvEndpoint: etcd-service:2379
  minReplicas: 1
  maxReplicas: 128
  replicaSpecs:
    Worker:
      replicas: 4
      restartPolicy: ExitCode
      template:
        apiVersion: v1
        kind: Pod
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: p3.8xlarge
          containers:
            - name: elasticjob-worker
              image: torchelastic/examples:0.2.0
              imagePullPolicy: Always
              env:
                - name: NCCL_DEBUG
                  value: INFO
              args:
                - "--nproc_per_node=4"
                - "/workspace/examples/imagenet/main.py"
                - "--arch=resnet50"
                - "--epochs=1"
                - "--batch-size=64"
                - "--workers=8"
                - "--checkpoint-file=/fsx-shared/checkpoint.pth.tar"
                - "/fsx-shared/ILSVRC/Data/CLS-LOC/"
              resources:
                limits:
                  nvidia.com/gpu: 4
              volumeMounts:
                - name: fsx-pvc
                  mountPath: /fsx-shared
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: fsx-pvc
              persistentVolumeClaim:
                claimName: fsx-claim
            - name: dshm
              emptyDir:
                medium: Memory

Slide 33

Slide 33 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
4. Why is resiliency so important?
• Increases time to train
• Wasted compute resources and $$
• Wasted engineering hours to isolate, fix, and resume
(Chart: wasted training time grows with cluster size*)
Meta's Llama 3.1 training on 16K GPUs: "During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions"
Source: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
* Internal benchmarking

Slide 34

Slide 34 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Job scheduling on SageMaker HyperPod
Multiple users share the compute nodes:
• User A: LLM training on 64 GPUs; long running job submission via "sbatch llama2_jobfile.sh"
• User B: experiments on 32 GPUs; interactive allocation via "salloc --nodes=4" (salloc: Granted job allocation 2)
• User C: NeMo training on 32 GPUs; integration through a framework or library

Slide 35

Slide 35 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Resiliency on job failure
Failure origins: software (misconfiguration, model/code issues, ...) and hardware (network, accelerator, infrastructure, ...)
Costs: loss of the entire job, idle capacity, and time spent on debugging, checkpointing, and operations
On a non-zero exit code, the user debugs (model issue?) or investigates a node issue, while admins restore/replace the instance and fix & update the allocation.

Slide 36

Slide 36 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Job auto-healing with checkpointing
Self-healing process: alarm & interruption, instance restore, then restore from checkpoints and resume.

echo "Starting training job"
srun python check_step1.py
bash run_training.sh

CHECKPOINT=$(ls -ltr $CHECK_PATH | grep '^d' | tail -1)
srun --auto-resume=1 \
  python training.py \
  --checkpoint ${CHECKPOINT}
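The training script's side of auto-healing is plain checkpoint/resume logic; a minimal Python sketch (the checkpoint directory, save interval, and tiny model are illustrative placeholders):

import glob, os
import torch

CKPT_DIR = "/fsx/checkpoints"                       # shared FSx for Lustre mount (assumed path)
model = torch.nn.Linear(8, 8)                       # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "step_*.pt")))
start_step = 0
if ckpts:                                           # resume from the latest checkpoint if one exists
    state = torch.load(ckpts[-1])
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 10_000):
    loss = model(torch.randn(4, 8)).pow(2).mean()   # dummy training step
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:                             # periodic checkpoint so a restart loses little work
        torch.save({"step": step, "model": model.state_dict(), "optimizer": opt.state_dict()},
                   os.path.join(CKPT_DIR, f"step_{step:06d}.pt"))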

Slide 37

Slide 37 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Accessing a SageMaker HyperPod cluster
Diagram components: engineers & researchers and admin & ops reach the cluster in the customer account through the service endpoint and a login node; compute nodes use FSx for Lustre for datasets & checkpoints (AWS Cloud).

Slide 38

Slide 38 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Algorithms and Software

Slide 39

Slide 39 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building blocks / AWS offering
• Architecture & Orchestration: resource orchestrator, job scheduler → Amazon EKS, AWS ParallelCluster, Amazon SageMaker HyperPod
• Algorithms & Software: ML frameworks → Frameworks
• Compute: fast accelerators with large device memory → Amazon EC2 UltraClusters (infrastructure)
• Network: wide-bandwidth interconnect → Amazon EC2 UltraClusters (infrastructure)
• Storage: scalable distributed file storage → Amazon EC2 UltraClusters (infrastructure)

Slide 40

Slide 40 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demystifying the ML software stack on AWS

Slide 41

Slide 41 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed training software stack (GPU), top to bottom: ML frameworks → communication libraries/SDKs → hardware drivers → EC2 instance

Slide 42

Slide 42 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed training software stack (Neuron), top to bottom: ML frameworks → communication libraries/SDKs → hardware drivers → EC2 instance

Slide 43

Slide 43 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demystifying the ML software stack on AWS

Slide 44

Slide 44 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Call to action
awsome-distributed-training (open-source repository for reference architectures & test cases)
• https://github.com/aws-samples/awsome-distributed-training
• Reference architectures for AWS ParallelCluster / Amazon EKS / Amazon SageMaker HyperPod
• Test cases for various distributed training frameworks, such as Megatron-Core, NeMo, and PyTorch FSDP
• Validation (NCCL tests)
• Observability (Prometheus & Grafana)
Workshops
• Machine Learning on ParallelCluster: https://catalog.workshops.aws/ml-on-aws-parallelcluster/en-US
• SageMaker HyperPod Slurm Workshop: https://catalog.workshops.aws/sagemaker-hyperpod
• SageMaker HyperPod EKS Workshop: https://catalog.workshops.aws/sagemaker-hyperpod-eks

Slide 45

Slide 45 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you!

Slide 46

Slide 46 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Basic health checks § GPU: – NVIDIA DCGM Diag command (level-2) – Check GPU status with nvidia-smi command § Trainium – Check NPU status/metrics from /sys/devices/virtual/neuron_device/* § EFA – Run EFA health checker to test network connectivity between EFAs on the instance § CPU – With Linux’s “stress” command for CPU stress testing (CPU / IO /memory allocation with many threads)
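A minimal wrapper that scripts the GPU checks listed above (a Python sketch; nvidia-smi and dcgmi are the CLIs named on this slide, but confirm the dcgmi flags for your DCGM version):

import subprocess

def check(cmd):
    # Run a health-check command and report success/failure by exit code.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

gpu_visible = check(["nvidia-smi"])            # basic GPU visibility/status
dcgm_ok = check(["dcgmi", "diag", "-r", "2"])  # NVIDIA DCGM diagnostics, run level 2
print(f"nvidia-smi ok: {gpu_visible}, DCGM level-2 ok: {dcgm_ok}")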

Slide 47

Slide 47 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Deep health checks § GPU: – Verifies GPU/NVLink counts. – NVIDIA DCGM Diag command (level-4) + memory test § Trainium – Read counters from Neuron sysfs – Runs a training workload to produce numbers § NCCL test (GPU) – Verifies the performance of collective communication operations on multiple NVIDIA GPUs § NCCOM test (Trainium) – Verifies the performance of collective communication operations on multiple Trainium nodes

Slide 48

Slide 48 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Difference Slurm ↔ EKS

Control plane ownership
  Slurm: the controller node is part of the HyperPod cluster
  EKS: the control plane is owned by EKS
Login to cluster
  Slurm: customers need to log in to the head node or login node to run Slurm commands
  EKS: customers can run kubectl from remote machines
Cluster metrics observability
  Slurm: customers can set up an observability stack at the application level, but it requires extra effort
  EKS: customers can use CloudWatch Container Insights as a first-class observability feature
When resiliency runs
  Slurm: instance replacement is triggered only when --auto-resume=1 is specified on the srun command and the hardware has failed
  EKS: health monitoring runs in the background, and instance reboot/replacement happens anytime a hardware issue is detected
Deep health checks
  Slurm: deep health checks don't exist
  EKS: customers can enable deep health checks
Instance reboot
  Slurm: customers can reboot with "sudo reboot" on the instance, but there is no way to reboot unresponsive instances
  EKS: customers can reboot instances by setting a node label, and it works even if the instance is unresponsive
Task governance
  Slurm: you can use Slurm's QoS feature
  EKS: HyperPod task governance is supported for scheduling/priority/dynamic compute resource allocation
HyperPod CLI
  Slurm: not supported
  EKS: supported

Slide 49

Slide 49 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to scale FM training.

Slide 50

Slide 50 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Distributed training strategies
• Data parallel = same model, different data: Data Parallel (DP), Fully Sharded Data Parallel (FSDP) / ZeRO
• Model parallel = same data, different model: Pipeline Parallel, Tensor Parallel
• Other parallelism: Context Parallel

Slide 51

Slide 51 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Fully Sharded Data Parallel (FSDP) / Zero Redundancy Optimizer (ZeRO)
Each accelerator (e.g. GPU 0, GPU 1) holds a model shard and processes its own slice of the data. Per step: all-gather weights, forward (local), all-gather weights, backward (local), reduce-scatter to sync gradients, update weights (local).
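A minimal PyTorch FSDP sketch showing where the communication above happens (assumes launch with torchrun on GPU nodes; the small Sequential model is a stand-in for a real transformer):

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(                      # stand-in for a transformer
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)                               # shards parameters, gradients, and optimizer states

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()                     # forward: all-gathers the needed shards
loss.backward()                                   # backward: reduce-scatters gradients
opt.step()                                        # each rank updates only its own shard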

Slide 52

Slide 52 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Tensor Parallel MLP
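A minimal sketch of a Megatron-style tensor-parallel MLP (illustrative only; it assumes torch.distributed is already initialized, and real implementations wrap the communication in custom autograd functions so the backward pass is also handled correctly):

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

class TensorParallelMLP(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        tp = dist.get_world_size()                             # tensor-parallel degree
        self.fc1 = nn.Linear(d_model, d_ff // tp)              # column-parallel: each rank owns a slice of the hidden units
        self.fc2 = nn.Linear(d_ff // tp, d_model, bias=False)  # row-parallel: produces partial outputs (bias omitted to avoid double-counting)

    def forward(self, x):
        h = F.gelu(self.fc1(x))   # local partial activations, no communication needed here
        y = self.fc2(h)           # local partial outputs
        dist.all_reduce(y)        # sum partial outputs across tensor-parallel ranks
        return y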

Slide 53

Slide 53 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Pipeline Parallel

Slide 54

Slide 54 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 55

Slide 55 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
The "computation" in building foundation models
• Forward / Backward
• Collective communication

Slide 56

Slide 56 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
What is needed for this computation?
• Forward / Backward
• Collective communication

Slide 57

Slide 57 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed training software stack (GPU), top to bottom: ML frameworks → communication libraries/SDKs → hardware drivers → EC2 instance