Slide 1

Slide 1 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building Foundation Models on AWS
Keita Watanabe, Sr. WW Solutions Architect, GenAI

Slide 2

Slide 2 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Foundation model training needs compute at scale
Petabytes of unlabeled data + Millions of GPU hours = Foundation models (billions of parameters)
Example: Llama-3 70B used 6.4M H100 GPU hours [1] ≈ 256 x p5 for 132 days
Example: Falcon-180B used 7.0M A100 GPU hours [2] ≈ 512 x p4de for 73 days
Sources: [1] https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md  [2] https://arxiv.org/pdf/2311.16867
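A quick sanity check of the arithmetic above (minimal Python sketch; the GPU-hour figures and instance counts are the ones quoted on this slide, and both p5 and p4de carry 8 accelerators per instance):

runs = {"Llama-3 70B on p5 (H100)": (6.4e6, 256), "Falcon-180B on p4de (A100)": (7.0e6, 512)}
for name, (gpu_hours, instances) in runs.items():
    days = gpu_hours / (instances * 8) / 24   # 8 GPUs per instance, 24 hours per day
    print(f"{name}: ~{days:.0f} days on {instances} instances")
# prints ~130 and ~71 days, consistent with the ~132 and ~73 days quoted above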

Slide 3

Slide 3 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
1. Compute requirements
Llama3 70B training requires more than 1.2 TB of VRAM
Scaling law [1]: FLOPs ≈ 6 x Parameters x Tokens
Chinchilla law [2]: models need to be trained on roughly 20 x (num. parameters) tokens

Parameters   | FLOPs     | Tokens
1 Billion    | 1.21e+20  | 20.2 Billion
10 Billion   | 1.23e+22  | 205.1 Billion
175 Billion  | 3.85e+24  | 3.7 Trillion
1 Trillion   | 1.27e+26  | 21.2 Trillion
10 Trillion  | 1.30e+28  | 216.2 Trillion

VRAM consumption, Llama3 70B (without activations etc.):
Parameters (FP32/BF16): 420 GB
Gradients (FP32): 280 GB
Adam optimizer states (FP32): 560 GB

[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
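The table can be reproduced approximately from the two formulas; a minimal Python sketch using the 20-tokens-per-parameter rule of thumb (the table itself uses Chinchilla's fitted estimates, so the token counts differ slightly):

def chinchilla_tokens(params):
    return 20 * params           # Chinchilla rule of thumb: ~20 tokens per parameter

def training_flops(params, tokens):
    return 6 * params * tokens   # FLOPs ≈ 6 x Parameters x Tokens

for p in (1e9, 10e9, 175e9, 1e12, 10e12):
    t = chinchilla_tokens(p)
    print(f"{p:.0e} params -> {t:.2e} tokens, {training_flops(p, t):.2e} FLOPs")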

Slide 4

Slide 4 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
2. Pretraining requires communication

Model  | Release  | Size | Tokens      | Hardware          | AWS instance equivalent
BLOOM  | Nov-2022 | 175B | 366 Billion | 384 x A100 80GB   | 48 x p4de.24xlarge
Pythia | Apr-2023 | 12B  | 300 Billion | 256 x A100 40GB   | 32 x p4d.24xlarge
Llama  | Feb-2023 | 65B  | 1 Trillion  | 512 x A100 40GB   | 64 x p4d.24xlarge
Llama2 | Jul-2023 | 70B  | 2 Trillion  | 2000 x A100 80GB  | 250 x p4de.24xlarge

Multi-node distributed training is indispensable.
[1] Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z. and Du, Y., 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
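To get a feel for why the interconnect matters, here is a rough estimate of per-step gradient traffic under plain data parallelism (a sketch: the BF16 gradient precision and the ring all-reduce cost model are illustrative assumptions, not figures from the survey):

params = 70e9                                 # a Llama2-70B-class model
grad_bytes = params * 2                       # assuming gradients are communicated in BF16 (2 bytes each)
n = 2000                                      # GPU count from the Llama2 row above
per_gpu = 2 * (n - 1) / n * grad_bytes        # ring all-reduce moves ~2*(N-1)/N of the message per GPU
print(f"~{per_gpu / 1e9:.0f} GB moved per GPU per optimizer step")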

Slide 5

Slide 5 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
3. Distributed file storage requirements
Large corpus needed for FM pretraining:

Data                                  | Tokens | Size (bytes)
Wikitext                              | 100 M+ | 750 MB
C4.EN (Colossal Clean Crawled Corpus) | 156 B  | 305 GB
RedPajama-Data-1T                     | 1 T    | 5 TB
RedPajama-Data-v2                     | 30 T   | 170 TB

You also need to store parameters and optimizer states.
Llama3 70B checkpoint breakdown: Parameters (FP32/BF16) 420 GB, Adam optimizer states (FP32) 560 GB
Ex.: BLOOM 175B checkpoints including optimizer states: 2.2 TB [2]
Large scale, high speed distributed storage is required.

[1] https://arxiv.org/abs/2104.08758
[2] https://huggingface.co/bigscience/bloom
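The parameter and optimizer-state figures above follow from simple bytes-per-parameter accounting; a minimal Python sketch of that arithmetic (assuming BF16/FP32 mixed precision with Adam, as on the previous slide):

params = 70e9
param_state = params * (4 + 2)   # FP32 master copy + BF16 working copy -> ~420 GB
adam_state  = params * (4 + 4)   # Adam first and second moments in FP32 -> ~560 GB
print(f"parameters: {param_state/1e9:.0f} GB, "
      f"optimizer states: {adam_state/1e9:.0f} GB, "
      f"total: {(param_state + adam_state)/1e9:.0f} GB")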

Slide 6

Slide 6 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Introduction

Slide 7

Slide 7 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Goal: we provide guidance on Foundation Model (FM) training on AWS
1. INFRASTRUCTURE FOR FM TRAINING/INFERENCE: EC2 Capacity Blocks | UltraClusters | EFA | Nitro | GPUs | Trainium | Inferentia | Neuron | SageMaker
2. TOOLS TO BUILD WITH LLMs AND OTHER FMs: Amazon Bedrock (Guardrails | Agents | Customization capabilities)
3. APPLICATIONS THAT LEVERAGE FMs: Amazon Q Business | Amazon Q Developer | Amazon Q in QuickSight | Amazon Q in Connect

Slide 8

Slide 8 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Prerequisites
• Basics of machine learning (especially neural network training)
• Definition of foundation models
• Common GPT-like foundation models, including Llama
• Foundation model pretraining (language modeling)

Slide 9

Slide 9 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. What do we need to prepare to build FMs on AWS?

Slide 10

Slide 10 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building blocks / AWS offering
• Architecture & Orchestration: resource orchestrator, job scheduler → Amazon EKS, AWS ParallelCluster, Amazon SageMaker HyperPod
• Algorithms & Software: ML frameworks → Frameworks
• Compute: fast accelerators with large device memory → Amazon EC2 UltraClusters (infrastructure)
• Network: wide-bandwidth interconnect → Amazon EC2 UltraClusters (infrastructure)
• Storage: scalable distributed file storage → Amazon EC2 UltraClusters (infrastructure)

Slide 11

Slide 11 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Infrastructure

Slide 12

Slide 12 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building blocks / AWS offering
• Architecture & Orchestration: resource orchestrator, job scheduler → Amazon EKS, AWS ParallelCluster, Amazon SageMaker HyperPod
• Algorithms & Software: ML frameworks → Frameworks
• Compute: fast accelerators with large device memory → Amazon EC2 UltraClusters (infrastructure)
• Network: wide-bandwidth interconnect → Amazon EC2 UltraClusters (infrastructure)
• Storage: scalable distributed file storage → Amazon EC2 UltraClusters (infrastructure)

Slide 13

Slide 13 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon EC2 UltraClusters
Supercomputer-class infrastructure for high-performance computing, networking, and storage
Compute: P5 (NVIDIA H100/H200 Tensor Core), P6 (NVIDIA B200 Tensor Core), Trn1 (AWS Trainium), Trn2 (AWS Trainium2)
Network: Elastic Fabric Adapter
Storage: FSx for Lustre, Amazon S3

Slide 14

Slide 14 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
NVIDIA GPU instances
P5: NVIDIA H100/H200 Tensor Core | P6: NVIDIA B200 Tensor Core

Instance size | ACC  | Num ACC | ACC memory           | vCPU | RAM   | Local SSDs
P5.48xlarge   | H100 | 8       | 80 GB x 8 (640 GB)   | 192  | 2 TiB | 8 x 3.84 TB NVMe SSD
P5e.48xlarge  | H200 | 8       | 141 GB x 8 (1128 GB) | 192  | 2 TiB | 8 x 3.84 TB NVMe SSD
P5en.48xlarge | H200 | 8       | 141 GB x 8 (1128 GB) | 192  | 2 TiB | 8 x 3.84 TB NVMe SSD

https://aws.amazon.com/ec2/instance-types/p5/

Slide 15

Slide 15 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Trainium instances
Trn1: AWS Trainium | Trn2: AWS Trainium2

Instance size  | ACC       | Num ACC | ACC memory            | vCPU | RAM    | Local SSDs
Trn1.32xlarge  | Trainium1 | 16      | 512 GB                | 128  | 512 GB | 8 TB
Trn1n.32xlarge | Trainium1 | 16      | 512 GB                | 128  | 512 GB | 8 TB
Trn2.48xlarge  | Trainium2 | 16      | 96 GiB x 16 (1.5 TB)  | 192  | 2 TB   | 4 x 1.92 TB NVMe SSD
Trn2u.48xlarge | Trainium2 | 16      | 96 GiB x 16 (1.5 TB)  | 192  | 2 TB   | 4 x 1.92 TB NVMe SSD

https://aws.amazon.com/ec2/instance-types/trn1
https://aws.amazon.com/ec2/instance-types/trn2
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium1.html
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium2.html

Slide 16

Slide 16 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
GPT pretraining using PyTorch FSDP with/without EFA
Benchmark highlights from the article: 2.5x / 25.6x faster; 512 GPUs = 64 instances
https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff (published 2022-05-16)

Slide 17

Slide 17 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Elastic Fabric Adapter (EFA)
• SRD protocol purpose-built for scalability in the cloud
• Kernel bypass and GPU-direct RDMA for low-latency, high-throughput communication between GPUs
• Continuing improvements in latency and completion times

Slide 18

Slide 18 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
EFA (Elastic Fabric Adapter)
• Dedicated network interface for MPI/NCCL
• High bandwidth, low latency communication with SRD (Scalable Reliable Datagram)
• OS-kernel-bypassed communication through the Libfabric API
• Out-of-order delivery avoids head-of-line blocking
• Multipath routing (endpoint A to endpoint B over Flow 1 / Flow 2 / Flow 3) delivers consistently low latency
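As a concrete picture of what runs on top of EFA, a minimal NCCL all-reduce in PyTorch (a sketch: it assumes the nodes have the EFA driver and the aws-ofi-nccl plugin installed, and that the script is launched with torchrun so the rendezvous environment variables are set):

import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")          # NCCL collectives; EFA is used via the Libfabric plugin
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # LOCAL_RANK is set by torchrun

x = torch.ones(1024, device="cuda")
dist.all_reduce(x, op=dist.ReduceOp.SUM)         # summed across all ranks, over NVLink within a node and EFA across nodes
print(f"rank {dist.get_rank()}: x[0] = {x[0].item()}")
dist.destroy_process_group()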

Slide 19

Slide 19 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Network performance of FM training instances

Model          | Accelerator | Num Acc. | Total memory | Acc. P2P BW | EFA
P5.48xlarge    | H100        | 8        | 640 GB       | 900 GB/s    | 3200 Gbps
P5e.48xlarge   | H200        | 8        | 1128 GB      | 900 GB/s    | 3200 Gbps
P5en.48xlarge  | H200        | 8        | 1128 GB      | 900 GB/s    | 3200 Gbps
Trn1.32xlarge  | Trainium1   | 16       | 512 GB       | 768 Gbps    | 800 Gbps
Trn1n.32xlarge | Trainium1   | 16       | 512 GB       | 768 Gbps    | 1600 Gbps

Slide 20

Slide 20 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon FSx for Lustre
• Fully managed Lustre file system for high performance workloads
• POSIX file system compatible
• Native integration with Amazon S3: transparent access to data on S3 through Lustre; data created on Lustre is persisted in Amazon S3

Slide 21

Slide 21 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon S3 lazy load example
/file1.txt          <->  s3://bucket/file1.txt
/file2.txt          <->  s3://bucket/file2.txt
/folder1/file3.txt  <->  s3://bucket/folder1/file3.txt
/folder2/file4.txt  <->  s3://bucket/folder2/file4.txt

Slide 22

Slide 22 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
ML training storage hierarchy
• Instance store: checkpoints, temporary data
• FSx for Lustre (Availability Zone, e.g. us-east-1a): shared datasets, checkpoints, outputs
• Amazon S3 (Region): data backbone, datasets, checkpoints, outputs
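In practice the hierarchy looks something like this (illustrative Python sketch; the /fsx path, bucket name, and key are placeholders, and the tiny linear layer stands in for the real model):

import boto3
import torch

model = torch.nn.Linear(8, 8)                          # stand-in for the real model
ckpt_path = "/fsx/checkpoints/step_001000.pt"          # shared FSx for Lustre mount (assumed path)
torch.save({"step": 1000, "model": model.state_dict()}, ckpt_path)

s3 = boto3.client("s3")                                # back up the checkpoint to the S3 data backbone
s3.upload_file(ckpt_path, "my-training-bucket", "checkpoints/step_001000.pt")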

Slide 23

Slide 23 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Architecture and Orchestration

Slide 24

Slide 24 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How can we leverage the infrastructure to train FMs?

Slide 25

Slide 25 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building blocks / AWS offering
• Architecture & Orchestration: resource orchestrator, job scheduler → Amazon EKS, AWS ParallelCluster, Amazon SageMaker HyperPod
• Algorithms & Software: ML frameworks → Frameworks
• Compute: fast accelerators with large device memory → Amazon EC2 UltraClusters (infrastructure)
• Network: wide-bandwidth interconnect → Amazon EC2 UltraClusters (infrastructure)
• Storage: scalable distributed file storage → Amazon EC2 UltraClusters (infrastructure)

Slide 26

Slide 26 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Slurm vs. Kubernetes

Slide 27

Slide 27 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Slurm architecture
Diagram components (AWS Cloud): head node, login node, compute nodes, FSx for Lustre for datasets & checkpoints; engineers & researchers access the cluster through the login node.

Slide 28

Slide 28 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
HPC Cluster - AWS ParallelCluster
Head node: 1 x c5.9xlarge, 36 vCPUs (18 physical), 72 GB of memory
Compute nodes: 100+ x p4de.24xlarge (plus C6, M6, R6), 96 vCPUs (48 physical), 1152 GB of memory, 8 x NVIDIA A100 80GB GPUs, network: 400 Gbps ENA & EFA, storage: 8 x 1 TB NVMe + EBS
Shared file systems: Amazon FSx for Lustre, 108 TB mounted on /fsx
Cluster stack: Slurm 22.05.5, CUDA 11.6
(Diagram: users reach the head node in a public subnet; the p4d compute fleet and the FSx for Lustre /fsx file system sit in a private subnet in us-east-1.)

Slide 29

Slide 29 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to create a cluster
pcluster create-cluster -f config.yaml
The config defines the head node, the compute nodes (e.g. a p4d.24xlarge compute fleet), and the shared storage (Amazon FSx for Lustre mounted on /fsx).
https://aws.amazon.com/hpc/parallelcluster/

Slide 30

Slide 30 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Slurm job submission

Submit training jobs: a shell script with resources and commands to execute, submitted via sbatch for long running jobs, high control & many jobs.
#SBATCH --nodes=4
#SBATCH --job-name=train-llama2
#SBATCH --output=logs/%x_%j.out
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
echo "Starting training job"
srun python train.py

Quick prototyping: book resources via salloc and run commands interactively in parallel with srun.
salloc -N 4 --exclusive
srun python train.py

Submit container jobs with Enroot/Pyxis:
docker build -f nccl-tests.Dockerfile -t nccl-tests:latest .
enroot import -o /apps/nccl.sqsh dockerd://nccl-tests:latest
srun --container-image nccl.sqsh all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 1 -n 100

Slide 31

Slide 31 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
EKS for foundation model (FM) training
Distributed training and inference jobs on thousands of AI accelerators
Effective orchestration for machine learning:
• Highly scalable across thousands of instances
• Improved utilization of compute resources
• Broad ecosystem of open source and proprietary tools
Architecture: the EKS control plane runs in the EKS VPC; the data plane (compute nodes such as p5 instances connected via Elastic Fabric Adapter) runs in the user VPC alongside Amazon ECR, Amazon FSx for Lustre, and Amazon CloudWatch; users interact through the CLI.

Slide 32

Slide 32 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
TorchElastic training job
• Launch pods to run the model training script: kubectl apply -f imagenet-fsx.yaml

apiVersion: elastic.pytorch.org/v1alpha1
kind: ElasticJob
metadata:
  name: imagenet
spec:
  # Use "etcd-service:2379" if you already apply etcd.yaml
  rdzvEndpoint: etcd-service:2379
  minReplicas: 1
  maxReplicas: 128
  replicaSpecs:
    Worker:
      replicas: 4
      restartPolicy: ExitCode
      template:
        apiVersion: v1
        kind: Pod
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: p3.8xlarge
          containers:
            - name: elasticjob-worker
              image: torchelastic/examples:0.2.0
              imagePullPolicy: Always
              env:
                - name: NCCL_DEBUG
                  value: INFO
              args:
                - "--nproc_per_node=4"
                - "/workspace/examples/imagenet/main.py"
                - "--arch=resnet50"
                - "--epochs=1"
                - "--batch-size=64"
                - "--workers=8"
                - "--checkpoint-file=/fsx-shared/checkpoint.pth.tar"
                - "/fsx-shared/ILSVRC/Data/CLS-LOC/"
              resources:
                limits:
                  nvidia.com/gpu: 4
              volumeMounts:
                - name: fsx-pvc
                  mountPath: /fsx-shared
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: fsx-pvc
              persistentVolumeClaim:
                claimName: fsx-claim
            - name: dshm
              emptyDir:
                medium: Memory

Slide 33

Slide 33 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
4. Why is resiliency so important?
• Increases time to train
• Wasted compute resources and $$
• Wasted engineering hours to isolate, fix, and resume
(Chart: wasted training time grows with cluster size*)
Meta's Llama 3.1 training on 16K GPUs: "During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions"
Source: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
* Internal benchmarking

Slide 34

Slide 34 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Job scheduling on SageMaker HyperPod
Multiple users share the compute nodes:
• User A: LLM training on 64 GPUs; long running job submission via "sbatch llama2_jobfile.sh"
• User B: experiments on 32 GPUs; interactive allocation via "salloc --nodes=4" (salloc: Granted job allocation 2)
• User C: NeMo training on 32 GPUs; integration through a framework or library

Slide 35

Slide 35 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Resiliency on job failure
Failure origins: software (misconfiguration, model/code issues, ...) and hardware (network, accelerator, infrastructure, ...)
Costs: loss of the entire job, idle capacity, and time spent on debugging, checkpointing, and operations
On a non-zero exit code, the user debugs (model issue?) or investigates a node issue, while admins restore/replace the instance and fix & update the allocation.

Slide 36

Slide 36 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Job auto-healing with checkpointing
Self-healing process: alarm & interruption, instance restore, then restore from checkpoints and resume.

echo "Starting training job"
srun python check_step1.py
bash run_training.sh

CHECKPOINT=$(ls -ltr $CHECK_PATH | grep '^d' | tail -1)
srun --auto-resume=1 \
  python training.py \
  --checkpoint ${CHECKPOINT}
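The training script's side of auto-healing is plain checkpoint/resume logic; a minimal Python sketch (the checkpoint directory, save interval, and tiny model are illustrative placeholders):

import glob, os
import torch

CKPT_DIR = "/fsx/checkpoints"                       # shared FSx for Lustre mount (assumed path)
model = torch.nn.Linear(8, 8)                       # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "step_*.pt")))
start_step = 0
if ckpts:                                           # resume from the latest checkpoint if one exists
    state = torch.load(ckpts[-1])
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 10_000):
    loss = model(torch.randn(4, 8)).pow(2).mean()   # dummy training step
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:                             # periodic checkpoint so a restart loses little work
        torch.save({"step": step, "model": model.state_dict(), "optimizer": opt.state_dict()},
                   os.path.join(CKPT_DIR, f"step_{step:06d}.pt"))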

Slide 37

Slide 37 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Accessing a SageMaker HyperPod cluster
Diagram components: engineers & researchers and admin & ops reach the cluster in the customer account through the service endpoint and a login node; compute nodes use FSx for Lustre for datasets & checkpoints (AWS Cloud).

Slide 38

Slide 38 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Algorithms and Software

Slide 39

Slide 39 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building blocks / AWS offering
• Architecture & Orchestration: resource orchestrator, job scheduler → Amazon EKS, AWS ParallelCluster, Amazon SageMaker HyperPod
• Algorithms & Software: ML frameworks → Frameworks
• Compute: fast accelerators with large device memory → Amazon EC2 UltraClusters (infrastructure)
• Network: wide-bandwidth interconnect → Amazon EC2 UltraClusters (infrastructure)
• Storage: scalable distributed file storage → Amazon EC2 UltraClusters (infrastructure)

Slide 40

Slide 40 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demystifying the ML software stack on AWS

Slide 41

Slide 41 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed training software stack (GPU), top to bottom: ML frameworks → communication libraries/SDKs → hardware drivers → EC2 instance

Slide 42

Slide 42 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed training software stack (Neuron), top to bottom: ML frameworks → communication libraries/SDKs → hardware drivers → EC2 instance

Slide 43

Slide 43 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Demystifying the ML software stack on AWS

Slide 44

Slide 44 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Call to action
awsome-distributed-training (open-source repository for reference architectures & test cases)
• https://github.com/aws-samples/awsome-distributed-training
• Reference architectures for AWS ParallelCluster / Amazon EKS / Amazon SageMaker HyperPod
• Test cases for various distributed training frameworks, such as Megatron-Core, NeMo, and PyTorch FSDP
• Validation (NCCL tests)
• Observability (Prometheus & Grafana)
Workshops
• Machine Learning on ParallelCluster: https://catalog.workshops.aws/ml-on-aws-parallelcluster/en-US
• SageMaker HyperPod Slurm Workshop: https://catalog.workshops.aws/sagemaker-hyperpod
• SageMaker HyperPod EKS Workshop: https://catalog.workshops.aws/sagemaker-hyperpod-eks

Slide 45

Slide 45 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you!

Slide 46

Slide 46 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Basic health checks § GPU: – NVIDIA DCGM Diag command (level-2) – Check GPU status with nvidia-smi command § Trainium – Check NPU status/metrics from /sys/devices/virtual/neuron_device/* § EFA – Run EFA health checker to test network connectivity between EFAs on the instance § CPU – With Linux’s “stress” command for CPU stress testing (CPU / IO /memory allocation with many threads)
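A minimal wrapper that scripts the GPU checks listed above (a Python sketch; nvidia-smi and dcgmi are the CLIs named on this slide, but confirm the dcgmi flags for your DCGM version):

import subprocess

def check(cmd):
    # Run a health-check command and report success/failure by exit code.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

gpu_visible = check(["nvidia-smi"])            # basic GPU visibility/status
dcgm_ok = check(["dcgmi", "diag", "-r", "2"])  # NVIDIA DCGM diagnostics, run level 2
print(f"nvidia-smi ok: {gpu_visible}, DCGM level-2 ok: {dcgm_ok}")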

Slide 47

Slide 47 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Deep health checks § GPU: – Verifies GPU/NVLink counts. – NVIDIA DCGM Diag command (level-4) + memory test § Trainium – Read counters from Neuron sysfs – Runs a training workload to produce numbers § NCCL test (GPU) – Verifies the performance of collective communication operations on multiple NVIDIA GPUs § NCCOM test (Trainium) – Verifies the performance of collective communication operations on multiple Trainium nodes

Slide 48

Slide 48 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Difference Slurm ↔ EKS

Control plane ownership
  Slurm: the controller node is part of the HyperPod cluster
  EKS: the control plane is owned by EKS
Login to cluster
  Slurm: customers need to log in to the head node or login node to run Slurm commands
  EKS: customers can run kubectl from remote machines
Cluster metrics observability
  Slurm: customers can set up an observability stack at the application level, but it requires extra effort
  EKS: customers can use CloudWatch Container Insights as a first-class observability feature
When resiliency runs
  Slurm: instance replacement is triggered only when --auto-resume=1 is specified on the srun command and the hardware has failed
  EKS: health monitoring runs in the background, and instance reboot/replacement happens anytime a hardware issue is detected
Deep health checks
  Slurm: deep health checks don't exist
  EKS: customers can enable deep health checks
Instance reboot
  Slurm: customers can reboot with "sudo reboot" on the instance, but there is no way to reboot unresponsive instances
  EKS: customers can reboot instances by setting a node label, and it works even if the instance is unresponsive
Task governance
  Slurm: you can use Slurm's QoS feature
  EKS: HyperPod task governance is supported for scheduling/priority/dynamic compute resource allocation
HyperPod CLI
  Slurm: not supported
  EKS: supported

Slide 49

Slide 49 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to scale FM training.

Slide 50

Slide 50 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Distributed training strategies
• Data parallel = same model, different data: Data Parallel (DP), Fully Sharded Data Parallel (FSDP) / ZeRO
• Model parallel = same data, different model: Pipeline Parallel, Tensor Parallel
• Other parallelism: Context Parallel

Slide 51

Slide 51 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Fully Sharded Data Parallel (FSDP) / Zero Redundancy Optimizer (ZeRO)
Each accelerator (e.g. GPU 0, GPU 1) holds a model shard and processes its own slice of the data. Per step: all-gather weights, forward (local), all-gather weights, backward (local), reduce-scatter to sync gradients, update weights (local).
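A minimal PyTorch FSDP sketch showing where the communication above happens (assumes launch with torchrun on GPU nodes; the small Sequential model is a stand-in for a real transformer):

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(                      # stand-in for a transformer
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)                               # shards parameters, gradients, and optimizer states

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()                     # forward: all-gathers the needed shards
loss.backward()                                   # backward: reduce-scatters gradients
opt.step()                                        # each rank updates only its own shard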

Slide 52

Slide 52 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Tensor Parallel MLP
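A minimal sketch of a Megatron-style tensor-parallel MLP (illustrative only; it assumes torch.distributed is already initialized, and real implementations wrap the communication in custom autograd functions so the backward pass is also handled correctly):

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

class TensorParallelMLP(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        tp = dist.get_world_size()                             # tensor-parallel degree
        self.fc1 = nn.Linear(d_model, d_ff // tp)              # column-parallel: each rank owns a slice of the hidden units
        self.fc2 = nn.Linear(d_ff // tp, d_model, bias=False)  # row-parallel: produces partial outputs (bias omitted to avoid double-counting)

    def forward(self, x):
        h = F.gelu(self.fc1(x))   # local partial activations, no communication needed here
        y = self.fc2(h)           # local partial outputs
        dist.all_reduce(y)        # sum partial outputs across tensor-parallel ranks
        return y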

Slide 53

Slide 53 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Pipeline Parallel

Slide 54

Slide 54 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 55

Slide 55 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
The "computation" in building foundation models
• Forward / Backward
• Collective communication

Slide 56

Slide 56 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
What is needed for this computation?
• Forward / Backward
• Collective communication

Slide 57

Slide 57 text

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed training software stack (GPU), top to bottom: ML frameworks → communication libraries/SDKs → hardware drivers → EC2 instance