Slide 1


© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. NEW YORK CITY | JULY 10, 2024

Slide 2


Demystifying the ML software stack on Amazon EC2 accelerated instances (CMP 301)
Keita Watanabe, Ph.D., Senior Specialist Solutions Architect, WWSO GenAI - Frameworks, Amazon Web Services
Ankur Srivastava, Ph.D., Senior Specialist Solutions Architect, WWSO GenAI - Frameworks, Amazon Web Services

Slide 3


Outline
I. Training and inference compute and software stacks
II. Container and AMI: Where the wild things are
III. What about the AWS Deep Learning AMIs (DLAMI) and AWS Deep Learning Containers?
IV. Diving into Amazon EKS and AWS ParallelCluster (Slurm)
V. Wrap up

Slide 4


Broad and deep accelerated computing portfolio for machine learning
Training: P4d (NVIDIA A100 Tensor Core), P4de (NVIDIA A100 Tensor Core), P5 (NVIDIA H100 Tensor Core, NEW), DL1 (Intel Habana Gaudi), Trn1n (AWS Trainium, NEW), Trn2 (AWS Trainium2, NEW)
Inference: G5 (NVIDIA A10G Tensor Core GPUs), G6 (NVIDIA L4 Tensor Core, NEW), G6e (NVIDIA L40S Tensor Core, NEW), Inf1 (AWS Inferentia), Inf2 (AWS Inferentia2), DL2q (Qualcomm AI 100, NEW)

Slide 5


Interconnect matters
Fully sharded data parallel (FSDP): faster AI workloads with fewer accelerators. Each accelerator (GPU 0, GPU 1) holds only a shard of the model and processes its own slice of the data. Per training step: all-gather weights, forward (local), all-gather weights, backward (local), reduce-scatter to sync gradients, then update weights (local).
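The communication pattern on this slide can be simulated in plain Python. This is a minimal sketch for two "accelerators" with made-up weight and gradient values, not real collective communication; it only illustrates how all-gather and reduce-scatter move data between shards.

```python
# Minimal sketch (plain Python, no GPUs): the FSDP step from the slide,
# simulated for two ranks. All names and numbers are illustrative.

WORLD_SIZE = 2

def all_gather(shards):
    """Every rank receives the concatenation of all weight shards."""
    full = [w for shard in shards for w in shard]
    return [list(full) for _ in range(WORLD_SIZE)]

def reduce_scatter(grads):
    """Sum gradients element-wise, then hand each rank only its own shard."""
    summed = [sum(g[i] for g in grads) for i in range(len(grads[0]))]
    n = len(summed) // WORLD_SIZE
    return [summed[r * n:(r + 1) * n] for r in range(WORLD_SIZE)]

# Each rank owns half of the (toy) model weights.
shards = [[1.0, 2.0], [3.0, 4.0]]

# 1. All-gather: every rank materializes the full weights for forward/backward.
full_weights = all_gather(shards)

# 2. Local backward produces a full-size gradient on every rank (toy values).
local_grads = [[0.5] * 4, [1.5] * 4]

# 3. Reduce-scatter: gradients are summed across ranks and re-sharded.
grad_shards = reduce_scatter(local_grads)

# 4. Local update: each rank updates only the shard it owns.
lr = 1.0
shards = [[w - lr * g for w, g in zip(s, gs)]
          for s, gs in zip(shards, grad_shards)]
```

Because only weight shards persist between steps, per-accelerator memory stays low, which is why FSDP lets fewer accelerators train larger models, at the cost of the extra all-gather traffic that makes the interconnect matter.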

Slide 6


Distributed training architecture and AWS services
Architecture: tightly coupled, communication heavy, and inter-node latency sensitive. Instances run in a placement group within a single Availability Zone, connected by Elastic Fabric Adapter, with Amazon FSx for Lustre mounted at /fsx.
AWS services: Amazon EKS (Kubernetes clusters), AWS ParallelCluster (Slurm HPC clusters), Amazon SageMaker HyperPod (resilient and persistent clusters).

Slide 7


High-speed networking: Elastic Fabric Adapter
Elastic Fabric Adapter (EFA) is an OS-bypass, RDMA-enabled network adapter for high-performance inter-node communication. It uses an AWS-designed transport named Scalable Reliable Datagram (SRD). How SRD works: traffic between endpoint A and endpoint B is sprayed across multiple network flows (Flow 1, Flow 2, Flow 3) rather than pinned to a single path.
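The multipath idea behind SRD can be sketched in plain Python: packets of one logical message are sprayed across several flows, may arrive out of order, and are reassembled at the receiver. The flow count, packet names, and shuffled delivery below are illustrative assumptions, not the real protocol.

```python
# Illustrative sketch (plain Python): spraying one message across flows and
# reordering at the receiver, as SRD does. Not a protocol implementation.
import random

def send_over_flows(packets, num_flows=3):
    """Spray packets round-robin across flows; delivery order is not kept."""
    flows = [[] for _ in range(num_flows)]
    for seq, payload in enumerate(packets):
        flows[seq % num_flows].append((seq, payload))
    delivered = []
    for flow in flows:
        delivered.extend(flow)
    random.shuffle(delivered)  # simulate out-of-order arrival across paths
    return delivered

def receive(delivered):
    """Reassemble the original message order using sequence numbers."""
    return [payload for _, payload in sorted(delivered)]

message = ["pkt%d" % i for i in range(9)]
reassembled = receive(send_over_flows(message))
```

Spreading one message over many flows lets SRD use many network paths at once and avoid head-of-line blocking on any single congested path.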

Slide 8


Distributed training software stack (GPU), from top to bottom:
- ML frameworks
- Communication libraries / SDKs
- Hardware drivers
- EC2 instance

Slide 9


Distributed training software stack (Neuron), from top to bottom:
- ML frameworks
- Communication libraries / SDKs
- Hardware drivers
- EC2 instance

Slide 10


What is on the AMI? (GPU)
On the EC2 instance, the AMI provides the hardware drivers and the container toolkits.

Slide 11


What is in the container? (GPU)
The container, running on top of the AMI on the EC2 instance, carries the ML frameworks and the communication libraries / SDKs.
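Since the frameworks live in the container rather than on the AMI, a quick sanity check inside a running container is to test which framework packages the image actually ships. This is a hedged sketch using only the standard library; the package names you would pass in practice (e.g. "torch", "tensorflow") are examples, and the names used in the demo call are deliberately generic.

```python
# Hedged sketch: check which Python packages are importable in this
# environment, e.g. run inside a container to verify its ML frameworks.
import importlib.util

def installed(packages):
    """Map each top-level package name to whether it can be imported here."""
    return {name: importlib.util.find_spec(name) is not None
            for name in packages}

# In a real container you might check: installed(["torch", "tensorflow"])
report = installed(["json", "definitely_not_a_real_package_xyz"])
```

`find_spec` only consults import machinery, so the check is cheap and does not actually import (and initialize) heavyweight frameworks.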

Slide 12


What is on the AMI? (Neuron)
On the EC2 instance, the AMI provides the hardware drivers, the Neuron SDK, aws-neuronx-oci-hook, and the container toolkits.

Slide 13


What is in the container? (Neuron)
The container, running on top of the AMI on the EC2 instance, carries the ML frameworks and the communication libraries / SDKs.

Slide 14


Call to action

Slide 15


Best practices for large-scale distributed training
Step-by-step guides to create clusters:
- One-click VPC deployments
- Mount FSx for Lustre file systems
- EFA-enabled clusters
Recipes to customize AMIs, AWS-optimized Dockerfiles, and an EFA cheatsheet
Validation (NCCL tests, etc.), observability (Prometheus/Grafana, etc.), and profiling (Nsight product family)
Distributed training examples:
- Slurm scripts / Kubernetes materials
- Working with Pyxis/Enroot
- NeMo (Megatron-LM, Multimodal, BioNeMo)
- MosaicML
- DDP, FSDP
- SMDP, SMMP
- TensorFlow/JAX
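For the validation step, nccl-tests binaries such as all_reduce_perf print one result row per message size, and the bus-bandwidth column is the number usually compared against the interconnect's expected throughput. A small parser for such a row might look like the sketch below; the sample row is illustrative, not captured from a real run, and column positions can vary across nccl-tests versions.

```python
# Hedged sketch: extract size and out-of-place bus bandwidth from one
# nccl-tests result row. The sample line is made up for illustration.
def parse_result_row(line):
    """Return (size_bytes, out_of_place_busbw_gbps) from one result row."""
    fields = line.split()
    # Typical columns: size count type redop root time algbw busbw #wrong,
    # followed by the in-place time/algbw/busbw/#wrong columns.
    return int(fields[0]), float(fields[7])

sample = "134217728 33554432 float sum -1 1234.5 108.7 203.8 0 1230.1 109.1 204.5 0"
size_bytes, busbw_gbps = parse_result_row(sample)
```

Tracking bus bandwidth across message sizes (and across cluster rebuilds) is a simple way to catch EFA or topology misconfiguration before launching a long training job.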

Slide 16


Thank you! Please complete the session survey in the mobile app.
Keita Watanabe, Ph.D. ([email protected])
Ankur Srivastava, Ph.D. ([email protected])

Slide 17


Anatomy of GPU Stacks

Slide 18


AWS Deep Learning AMIs (DLAMI)
Preconfigured with popular deep learning frameworks and interfaces. Optimized for performance with the latest NVIDIA driver, CUDA libraries, and Intel libraries.
Choosing a DLAMI:
- Deep Learning AMI with Conda: a dedicated Conda environment for each framework
- Deep Learning Base AMI: no frameworks

Slide 19


AWS Deep Learning Containers
Prepackaged ML framework container images, fully configured and validated. Includes AWS optimizations for TensorFlow, PyTorch, MXNet, and Hugging Face.
github.com/aws/deep-learning-containers

Slide 20


Containers structure and storage
Prefer Amazon ECR or Amazon S3 for AWS Batch:
- DockerHub throttles under load, and private registries can suffer
Fewer, even layers are better:
- Docker requests layers in parallel (1 layer = 1 request)
- Even layers provide a better load distribution
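The "fewer, even layers" advice follows from parallel pulls: since Docker fetches layers concurrently, the largest layer gates total pull time. A minimal sketch of that arithmetic, with made-up sizes and a made-up per-request bandwidth:

```python
# Hedged sketch: why even layers help. One request per layer, pulled in
# parallel, so the slowest (largest) layer determines total pull time.
# Sizes and bandwidth are illustrative, not measurements.
def pull_time_seconds(layer_sizes_gb, gb_per_second_per_request=1.0):
    """Parallel pull finishes when the largest layer finishes."""
    return max(layer_sizes_gb) / gb_per_second_per_request

uneven = [7.0, 0.5, 0.5]      # an 8 GB image with lopsided layers
even = [2.0, 2.0, 2.0, 2.0]   # the same 8 GB split evenly
```

Under these toy numbers the lopsided image takes 7 seconds per GB/s of bandwidth while the even split takes 2, which is the load-distribution point the slide makes.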

Slide 21


Using machine images and containers

Data type             | Size        | Change frequency | Lives in
Input data            | 20 GB (r+w) | 20 min per job   | Runtime
Configurations        | 3 MB        | Weekly           | Container
Application           | 1 GB        | 5 min            | Container
Application libraries | 4 GB        | Weekly           | Machine image
Core dependencies     | 5 GB        | Biweekly         | Machine image
Operating system      | 500 MB      | Monthly          | Machine image