
Building foundation model on AWS

Keita Watanabe
February 25, 2025


Transcript

  1. Introduction to FM development on AWS
     Keita Watanabe, Sr. Solutions Architect, GenAI
     © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Foundation model training needs compute at scale
     Petabytes of unlabeled data + millions of GPU hours = foundation models with billions of parameters.
     Example: Llama 3 70B used 6.4M H100 GPU hours [1], roughly 256x p5 for 132 days.
     Example: Falcon-180B used 7.0M A100 GPU hours [2], roughly 512x p4de for 73 days.
     Sources: [1] https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md  [2] https://arxiv.org/pdf/2311.16867
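     The instance counts above are simple arithmetic: total GPU hours divided by the number of GPUs in the cluster, then by 24.
     A minimal sketch of that conversion (the helper name is ours; 8 accelerators per p5/p4de instance as in the instance tables later in the deck):

         def gpu_hours_to_days(gpu_hours, gpus_per_instance, num_instances):
             """Wall-clock days to burn the given accelerator-hours on a fixed-size cluster."""
             return gpu_hours / (gpus_per_instance * num_instances) / 24

         print(gpu_hours_to_days(6.4e6, 8, 256))  # Llama 3 70B: ~130 days on 256x p5
         print(gpu_hours_to_days(7.0e6, 8, 512))  # Falcon-180B: ~71 days on 512x p4de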
  3. Compute requirements
     Llama 3 70B training requires more than 1.2 TB of VRAM.
     Scaling law [1]: FLOPS ≈ 6 x Parameters x Tokens
     Chinchilla law [2]: models need to be trained on roughly 20 x (number of parameters) tokens.

     Parameters      FLOPS       Tokens
     1 Billion       1.21e+20    20.2 Billion
     10 Billion      1.23e+23    205.1 Billion
     175 Billion     3.85e+24    3.7 Trillion
     1 Trillion      1.27e+26    21.2 Trillion
     10 Trillion     1.30e+28    216.2 Trillion

     VRAM consumption, Llama 3 70B (without activations etc.):
     Parameters (FP32/BF16)          420 GB
     Gradients (FP32)                280 GB
     Adam optimizer states (FP32)    560 GB

     [1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
     [2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
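     The table rows follow from the two formulas above; a minimal sketch (the 20.2 tokens-per-parameter ratio is read off the table's first row rather than being an exact Chinchilla constant):

         def chinchilla_tokens(params, tokens_per_param=20.2):
             """Compute-optimal token budget: roughly 20x the parameter count."""
             return params * tokens_per_param

         def training_flops(params, tokens):
             """Approximate total training compute: 6 x parameters x tokens."""
             return 6 * params * tokens

         n = 1e9
         d = chinchilla_tokens(n)
         print(f"{d:.3g} tokens, {training_flops(n, d):.3g} FLOPs")  # ~2.02e+10 tokens, ~1.21e+20 FLOPs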
  4. Pretraining requires communication
     Model    Release    Size   Tokens        Hardware           AWS instance equivalent
     BLOOM    Nov-2022   175B   366 Billion   384x A100 80GB     48x P4de.24xlarge
     Pythia   Apr-2023   12B    300 Billion   256x A100 40GB     32x P4d.24xlarge
     Llama    Feb-2023   65B    1 Trillion    512x A100 40GB     64x P4d.24xlarge
     Llama2   Jul-2023   70B    2 Trillion    2000x A100 80GB    250x P4de.24xlarge
     Multi-node distributed training is indispensable.
     [1] Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z. and Du, Y., 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
  5. Distributed file storage requirements
     A large corpus is needed for FM pretraining:
     Data                                       Tokens    Size (bytes)
     Wikitext                                   ~100 M    750 MB
     C4.EN (Colossal Clean Crawled Corpus) [1]  156 B     305 GB
     RedPajama-Data-1T                          1 T       5 TB
     RedPajama-Data-v2                          30 T      170 TB

     You also need to store parameters and optimizer states.
     Llama 3 70B checkpoint breakdown: Parameters (FP32/BF16) 420 GB, Adam optimizer states (FP32) 560 GB.
     Example: BLOOM 175B checkpoints including optimizer states: 2.2 TB [2].
     Large-scale, high-speed distributed storage is required.
     [1] https://arxiv.org/abs/2104.08758
     [2] https://huggingface.co/bigscience/bloom
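     The 420 GB / 560 GB figures come from bytes per parameter; a minimal sketch assuming a BF16 copy plus an FP32 master copy (6 bytes per parameter) and two FP32 Adam moments (8 bytes per parameter), assumptions consistent with the slide rather than Llama 3's published training recipe:

         def checkpoint_gb(num_params, param_bytes=6, adam_bytes=8):
             """Rough checkpoint size split into parameter and optimizer-state portions (GB)."""
             return num_params * param_bytes / 1e9, num_params * adam_bytes / 1e9

         print(checkpoint_gb(70e9))  # (420.0, 560.0) GB for Llama 3 70B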
  6. Introduction
  7. Goal: we provide guidance on Foundation Model (FM) training on AWS
     1. Infrastructure for FM training/inference: EC2 Capacity Blocks, UltraClusters, Nitro, EFA, GPUs, Trainium, Inferentia, Neuron, SageMaker
     2. Tools to build with LLMs and other FMs: Amazon Bedrock (guardrails, agents, customization capabilities)
     3. Applications that leverage FMs: Amazon Q Business, Amazon Q Developer, Amazon Q in QuickSight, Amazon Q in Connect
  8. Prerequisites
     • Basics of machine learning (especially neural network training)
     • Definition of foundation models
     • Common GPT-like foundation models, including Llama
     • Foundation model pretraining (language modeling)
  9. What do we need to prepare to build FMs on AWS?
  10. Building blocks and AWS offerings
      Architecture & Orchestration (resource orchestrator, job scheduler): Amazon EKS, AWS ParallelCluster, Amazon SageMaker HyperPod
      Algorithms & Software: ML frameworks
      Infrastructure (Amazon EC2 UltraClusters):
      • Compute: fast accelerators with large device memory
      • Network: wide-bandwidth interconnect
      • Storage: scalable distributed file storage
  11. Infrastructure
  12. Building blocks and AWS offerings (recap of slide 10)
  13. Amazon EC2 UltraClusters
      Supercomputer-class compute, networking, and storage for high-performance computing.
      Compute: P4d(e) (NVIDIA A100 Tensor Core), P5(e) (NVIDIA H100/H200 Tensor Core), Trn1 (AWS Trainium), Trn2 (AWS Trainium2)
      Network: Elastic Fabric Adapter
      Storage: Amazon FSx for Lustre, Amazon S3
  14. Amazon EC2 instances for FM training
      AWS Trainium and NVIDIA GPU accelerators with large memory.
      Instance        Accelerator  Num. acc.  Acc. memory  vCPU  RAM        On-demand price* (USD/h)
      P4d.24xlarge    A100         8          320 GB       96    1152 GiB   32.77
      P4de.24xlarge   A100         8          640 GB       96    1152 GiB   40.96
      P5.48xlarge     H100         8          640 GB       192   2 TiB      98.32
      Trn1.32xlarge   Trainium     16         512 GB       128   512 GB     21.50
      * N. Virginia. https://aws.amazon.com/ec2/instance-types/p5/
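     Combining this pricing with the earlier Llama 3 70B example (256x p5 for roughly 132 days) gives a rough upper bound on on-demand cost; real projects typically use Capacity Blocks or other pricing, so treat this as illustrative arithmetic only:

         instances, days, usd_per_hour = 256, 132, 98.32   # p5.48xlarge on-demand list price
         cost = instances * days * 24 * usd_per_hour
         print(f"${cost:,.0f}")  # roughly $80M at list price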
  15. EFA (Elastic Fabric Adapter)
      Dedicated network interface for MPI/NCCL traffic.
      High-bandwidth, low-latency communication with SRD (Scalable Reliable Datagram).
      OS-kernel-bypass communication through the Libfabric API.
      Out-of-order delivery avoids the head-of-line blocking problem; multipath routing provides consistently low latency.
      (Diagram: flows 1-3 between endpoints A and B spread across multiple network paths.)
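     Collectives such as all-reduce generate the traffic EFA is built for. A minimal PyTorch sketch that times one all-reduce (launch with torchrun; NCCL only uses EFA when the aws-ofi-nccl plugin is installed, which this code assumes rather than configures):

         import time
         import torch
         import torch.distributed as dist

         dist.init_process_group("nccl")                 # NCCL backend; rides on EFA via aws-ofi-nccl if present
         torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

         x = torch.randn(512 * 1024 * 1024 // 4, device="cuda")   # ~512 MiB of fp32
         torch.cuda.synchronize()
         start = time.time()
         dist.all_reduce(x)                              # sum across all ranks
         torch.cuda.synchronize()
         if dist.get_rank() == 0:
             print(f"all_reduce of 512 MiB took {time.time() - start:.3f} s")
         dist.destroy_process_group()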
  16. Network performance of FM training instances
      Instance         Accelerator  Num. acc.  Total acc. memory  Acc. P2P BW  EFA
      P4d.24xlarge     A100 40GB    8          320 GB             600 GB/s     400 Gbps
      P4de.24xlarge    A100 80GB    8          640 GB             600 GB/s     400 Gbps
      P5.48xlarge      H100         8          640 GB             900 GB/s     3200 Gbps
      Trn1.32xlarge    Trainium1    16         512 GB             768 Gbps     800 Gbps
      Trn1n.32xlarge   Trainium1    16         512 GB             768 Gbps     1600 Gbps
  17. PyTorch FSDP distributed training with/without EFA
      GPT-3 175B-parameter model, 512 GPUs = 64 instances, EFA v1; chart annotations: 2.5x and 25.6x faster.
      Source: https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff (published 2022/05/16)
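     FSDP shards parameters, gradients, and optimizer state across ranks, so every step triggers NCCL collectives that EFA accelerates. A minimal sketch of wrapping a model (the single transformer layer is a stand-in for illustration, not the blog post's code; launch with torchrun or srun):

         import torch
         import torch.nn as nn
         import torch.distributed as dist
         from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

         dist.init_process_group("nccl")
         torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

         model = nn.TransformerEncoderLayer(d_model=4096, nhead=32).cuda()  # stand-in model
         model = FSDP(model)                       # shards params, grads, and optimizer state
         optim = torch.optim.AdamW(model.parameters(), lr=1e-4)  # build optimizer after wrapping

         x = torch.randn(8, 128, 4096, device="cuda")
         loss = model(x).mean()
         loss.backward()                           # gradient reduce-scatter over NCCL/EFA
         optim.step()
         dist.destroy_process_group()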
  18. Amazon FSx for Lustre
      • Fully managed Lustre, a high-performance distributed file system.
      • Usable as a POSIX-compliant file system.
      • Hierarchical storage: data can be transparently imported from and exported to Amazon S3.
      Data stored in Amazon S3 is loaded into the FSx for Lustre file system for processing, and results are persisted back to Amazon S3.
  19. Storage hierarchy in distributed training
      Instance store: checkpoints, temporary data
      FSx for Lustre: shared datasets, checkpoints, outputs
      Amazon S3: data backbone, datasets, checkpoints, outputs
  20. Architecture and Orchestration
  21. How can we leverage the infrastructure to train FMs?
  22. Building blocks and AWS offerings (recap of slide 10)
  23. Slurm vs. Kubernetes
  24. Slurm architecture
      Engineers and researchers connect through a login node, the head node schedules jobs onto the compute nodes, and FSx for Lustre holds datasets and checkpoints, all within the AWS Cloud.
  25. Slurm job submission
      Submit training jobs: a shell script declares resources and the commands to execute; submit via sbatch for long-running jobs, high control, and many jobs.

          #!/bin/bash
          #SBATCH --nodes=4
          #SBATCH --job-name=train-llama2
          #SBATCH --output=logs/%x_%j.out
          #SBATCH --ntasks-per-node=8
          #SBATCH --exclusive
          echo "Starting training job"
          srun python train.py

      Quick prototyping: book resources via salloc and run commands interactively in parallel with srun.

          salloc -N 4 --exclusive
          srun python train.py

      Submit container jobs with Enroot/Pyxis:

          docker build -f nccl-tests.Dockerfile -t nccl-tests:latest .
          enroot import -o /apps/nccl.sqsh dockerd://nccl-tests:latest
          srun --container-image /apps/nccl.sqsh all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 1 -n 100
  26. EKS for foundation model (FM) training
      Distributed training and inference jobs on thousands of AI accelerators.
      Effective orchestration for machine learning:
      • Highly scalable across thousands of instances
      • Improved utilization of compute resources
      • Broad ecosystem of open source and proprietary tools
      (Architecture: EKS control plane in an EKS VPC; p5 compute nodes with Elastic Fabric Adapter in the user VPC data plane; Amazon ECR, Amazon FSx for Lustre, and Amazon CloudWatch integrations; access via the CLI/kubectl.)
  27. TorchElastic training job
      Launch pods to run model training scripts:

          kubectl apply -f imagenet-fsx.yaml

          apiVersion: elastic.pytorch.org/v1alpha1
          kind: ElasticJob
          metadata:
            name: imagenet
          spec:
            # Use "etcd-service:2379" if you already applied etcd.yaml
            rdzvEndpoint: etcd-service:2379
            minReplicas: 1
            maxReplicas: 128
            replicaSpecs:
              Worker:
                replicas: 4
                restartPolicy: ExitCode
                template:
                  apiVersion: v1
                  kind: Pod
                  spec:
                    nodeSelector:
                      beta.kubernetes.io/instance-type: p3.8xlarge
                    containers:
                      - name: elasticjob-worker
                        image: torchelastic/examples:0.2.0
                        imagePullPolicy: Always
                        env:
                          - name: NCCL_DEBUG
                            value: INFO
                        args:
                          - "--nproc_per_node=4"
                          - "/workspace/examples/imagenet/main.py"
                          - "--arch=resnet50"
                          - "--epochs=1"
                          - "--batch-size=64"
                          - "--workers=8"
                          - "--checkpoint-file=/fsx-shared/checkpoint.pth.tar"
                          - "/fsx-shared/ILSVRC/Data/CLS-LOC/"
                        resources:
                          limits:
                            nvidia.com/gpu: 4
                        volumeMounts:
                          - name: fsx-pvc
                            mountPath: /fsx-shared
                          - name: dshm
                            mountPath: /dev/shm
                    volumes:
                      - name: fsx-pvc
                        persistentVolumeClaim:
                          claimName: fsx-claim
                      - name: dshm
                        emptyDir:
                          medium: Memory
  28. Why is resiliency so important?
      Failures increase time to train, waste compute resources and money, and waste engineering hours to isolate, fix, and resume.
      Meta's Llama 3.1 training on 16k GPUs: "During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions."
      Source: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
      (Chart: wasted training time grows with cluster size; internal benchmarking.)
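     The quoted figure works out to an interruption every few hours at that scale, which is why automated recovery matters; the arithmetic:

         interruptions, days = 466, 54                    # from the Llama 3.1 snapshot quoted above
         print(f"one interruption every {days * 24 / interruptions:.1f} hours")  # ~2.8 hours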
  29. SageMaker HyperPod
      Self-healing process:
      1. SageMaker HyperPod health-checks the CPUs, GPUs, and network of the cluster instances.
      2. When a hardware fault is detected, SageMaker HyperPod automatically replaces the faulty instance with a healthy one.
      3. Once the replacement completes, SageMaker HyperPod re-queues the workload in Slurm and resumes it from the checkpoint.
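     Auto-resume only pays off if the training script checkpoints regularly and can restart from the latest checkpoint. A minimal, generic PyTorch sketch (the /fsx path and function names are illustrative, not a HyperPod API):

         import os
         import torch

         CKPT = "/fsx/checkpoints/latest.pt"   # shared FSx for Lustre path (illustrative)

         def save_checkpoint(model, optimizer, step):
             torch.save({"model": model.state_dict(),
                         "optimizer": optimizer.state_dict(),
                         "step": step}, CKPT)

         def maybe_resume(model, optimizer):
             """Return the step to resume from (0 if no checkpoint exists yet)."""
             if not os.path.exists(CKPT):
                 return 0
             state = torch.load(CKPT, map_location="cpu")
             model.load_state_dict(state["model"])
             optimizer.load_state_dict(state["optimizer"])
             return state["step"] + 1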
  30. Accessing a SageMaker HyperPod cluster
      Engineers and researchers reach the cluster through a login node, while admins and ops use the service endpoint; compute nodes in the customer account share datasets and checkpoints on FSx for Lustre.
  31. Integrates with an existing EKS control plane
      SageMaker HyperPod compute nodes join the EKS data plane alongside EKS managed node groups, self-managed node groups, and AWS Fargate pods, and are managed with kubectl through the EKS control plane.
  32. HyperPod on EKS architecture
      (Diagram: the EKS control plane in an EKS VPC connects to HyperPod compute nodes in a HyperPod VPC over a cross-account elastic network interface; SageMaker performs health checks and instance replacements; the nodes use Elastic Fabric Adapter behind a Network Load Balancer; the user VPC integrates Amazon ECR, Amazon CloudWatch, Amazon S3, and Amazon FSx for Lustre.)
  33. Algorithms and Software
  34. Building blocks and AWS offerings (recap of slide 10)
  35. What kind of software stack is needed?
  36. Distributed training software stack (GPU)
      From top to bottom: ML frameworks, communication libraries/SDKs, hardware drivers, EC2 instance.
  37. Distributed training software stack (Neuron)
      From top to bottom: ML frameworks, communication libraries/SDKs, hardware drivers, EC2 instance.
  38. Thank you!