Slide 1

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 2

De-mystifying ML software stack on Amazon EC2 accelerated instances (CMP332)

Keita Watanabe, Ph.D., Sr. Specialist Solutions Architect, WWSO GenAI - Frameworks, Amazon Web Services
Pierre-Yves Aquilanti, Ph.D., Head of Frameworks Solutions, WWSO GenAI - Frameworks, Amazon Web Services

Slide 3

Broad and deep accelerated computing portfolio

GPU-, AWS ML accelerator-, and FPGA-based EC2 instances:
• GPUs: P5 (H100), P4d/P4de (A100), P3 (V100), G5 (A10G), G4 (T4, Radeon), G5g (Graviton CPU)
• AI/ML accelerators and ASICs: Trn1 (Trainium accelerator), Inf1/Inf2 (Inferentia accelerators), DL1 (Gaudi accelerator)
• FPGAs: F1 (Xilinx FPGA), VT1 (Xilinx accelerator)

Slide 4

Fundamental architectures for training and inference

[Architecture diagrams: training uses a placement group in a single Availability Zone with Amazon FSx for Lustre mounted at /scratch; inference uses an EKS node group spanning Availability Zones 1 and 2; hardware components include GPU/Trainium and EFA]

Training: tightly coupled, communication heavy, and inter-node latency sensitive
• Instances: P5, P4d(e), Trn1, G5
• Scale: POC = 1-64 instances, PROD = 4-100s
• Care for: EFA, EC2 capacity, shared network
• Cost objective: cost-to-train ($/iteration)

Inference: loosely coupled, fast scaling in/out, and query latency sensitive
• Instances: G5, G4dn, Inf1, Inf2, CPU-based instances
• Scale: POC = 1-64 instances, PROD = 1-1000s
• Care for: scaling latency (predictive, metric, capacity)
• Cost objective: serving at scale and fast ($/inference)
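To make the placement-group point concrete, a minimal AWS CLI sketch of launching training instances into a cluster placement group (the group name, AMI ID, and key name below are placeholders):

    # Create a cluster placement group so training nodes share a low-latency network segment
    aws ec2 create-placement-group --group-name train-pg --strategy cluster

    # Launch the training fleet into it
    aws ec2 run-instances --instance-type p4d.24xlarge --count 8 \
        --image-id ami-0123456789abcdef0 --key-name my-key \
        --placement GroupName=train-pg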

Slide 5

HPC cluster - AWS ParallelCluster

[Architecture: AWS ParallelCluster head node in a public subnet, compute fleet in a private subnet (us-east-1b), Amazon FSx for Lustre mounted at /fsx]

Head node
• 1 × c5.9xlarge: 36 vCPUs (18 physical), 72 GB of memory

Compute nodes
• 100+ × p4de.24xlarge, plus C6, M6, and R6 instances
• 96 vCPUs (48 physical), 1152 GB of memory
• 8 × NVIDIA A100 80 GB GPUs
• Network: 400 Gbps ENA & EFA
• Storage: 8 × 1 TB NVMe + EBS

Shared file systems
• Amazon FSx for Lustre, 108 TB, mounted on /fsx

Cluster stack
• Slurm 22.05.5
• CUDA 11.6
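A sketch of how such a cluster might be declared with AWS ParallelCluster 3 (subnet IDs, key name, and cluster name are placeholders; consult the ParallelCluster documentation for the authoritative schema):

    cat > cluster.yaml <<'EOF'
    Region: us-east-1
    Image:
      Os: alinux2
    HeadNode:
      InstanceType: c5.9xlarge
      Networking:
        SubnetId: subnet-PUBLIC
      Ssh:
        KeyName: my-key
    Scheduling:
      Scheduler: slurm
      SlurmQueues:
        - Name: compute
          Networking:
            SubnetIds: [subnet-PRIVATE]
            PlacementGroup:
              Enabled: true
          ComputeResources:
            - Name: p4de
              InstanceType: p4de.24xlarge
              MinCount: 0
              MaxCount: 100
              Efa:
                Enabled: true
    SharedStorage:
      - MountDir: /fsx
        Name: fsx
        StorageType: FsxLustre
        FsxLustreSettings:
          StorageCapacity: 108000   # GiB; FSx for Lustre requires multiples of 2400
    EOF
    pcluster create-cluster --cluster-name ml-cluster --cluster-configuration cluster.yaml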

Slide 6

Resiliency in SageMaker HyperPod

Agents monitor cluster instances for CPU, GPU, and network health issues. Once an agent detects a hardware failure, SageMaker HyperPod automatically replaces the faulty instance with a healthy one. With the faulty instance replaced, SageMaker HyperPod then requeues the workload in Slurm and reloads the last valid checkpoint to resume processing.
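A minimal sketch of a Slurm submission that cooperates with this recovery loop; train.py and its flags are hypothetical, and the job is assumed to checkpoint periodically to the shared file system (HyperPod also documents an auto-resume option for srun):

    #!/bin/bash
    #SBATCH --job-name=llm-train
    #SBATCH --nodes=16
    #SBATCH --requeue                # let Slurm requeue the job after a node is replaced

    # On every (re)start, the training script finds the newest checkpoint and resumes from it
    srun python train.py \
        --checkpoint-dir /fsx/checkpoints \
        --resume-from-latest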

Slide 7

Deep learning open-source training stack with EFA (GPU)

[Stack diagram, bottom to top: Hardware, Driver, Communication Libraries, Frameworks & optimization libraries, with a Launcher starting the Training Framework]
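One way to see those layers from a shell on a running GPU instance, bottom-up (a sketch; train.py stands in for your entry point):

    nvidia-smi                                    # hardware + NVIDIA driver
    fi_info -p efa                                # EFA provider visible through Libfabric
    python -c "import torch; print(torch.cuda.nccl.version())"  # NCCL bundled with PyTorch
    torchrun --nproc_per_node=8 train.py          # launcher: one worker per GPU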

Slide 8

Deep learning open-source training stack with EFA (Neuron)

[Stack diagram, bottom to top: Hardware, Driver, Communication Libraries, Frameworks & optimization libraries]
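The Neuron-side equivalent of the same bottom-up check (a sketch; train.py is a placeholder):

    neuron-ls                                     # devices exposed by the Neuron driver
    fi_info -p efa                                # EFA provider used by aws-neuron-collectives
    python -c "import torch_neuronx"              # PyTorch Neuron integration is importable
    torchrun --nproc_per_node=32 train.py         # e.g. one worker per NeuronCore on trn1.32xlarge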

Slide 9

Containers & AMI - how we generally think of it

Container (pulled from a container registry with Docker): framework, Lib A, Lib B, Lib C, Python, software packages
AMI: Docker, telemetry, software packages, operating system

Slide 10

Containers & AMI - how we generally think of it (continued)

The same container/AMI split as on the previous slide, with one addition: EFA spans both layers. The EFA kernel driver lives on the AMI, while the user-space Libfabric libraries live in the container (detailed on the next slides).

Slide 11

What should be on the AMI? (GPU)

Library                 Notes
NVIDIA GPU driver       Docker run: add "--gpus all" to have nvidia-smi inside the container.
NVIDIA Fabric Manager   Required on NVSwitch-based systems such as P4d(e).
NVIDIA Docker           Service is the Docker daemon; CLI invoked by users.
EFA driver              Install with --minimal (the AMI won't even have fi_info). Docker run: add "--device /dev/infiniband/uverbs0 ...".
SSM agent
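Putting the table's flags together in one docker run invocation (the image name is illustrative; p4d-class instances expose several uverbs devices, so each one is passed through):

    docker run --rm --gpus all \
        --device /dev/infiniband/uverbs0 \
        --device /dev/infiniband/uverbs1 \
        --device /dev/infiniband/uverbs2 \
        --device /dev/infiniband/uverbs3 \
        my-training-image:latest nvidia-smi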

Slide 12

What should be on the container? (GPU)

Library         Notes
libfabric-aws   Install with "--skip-kmod --skip-limit-conf". Also provides Open MPI prebuilt by AWS.
CUDA
cuDNN           PyTorch backend.
cuBLAS          PyTorch backend.
NCCL            PyTorch backend. Can be overridden using LD_PRELOAD=.../libnccl.so
aws-ofi-nccl    Plugin for NCCL.
ML frameworks   PyTorch / TensorFlow / JAX
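A hedged Dockerfile sketch of this container layer; the base image tag is illustrative, and --skip-kmod leaves the EFA kernel module to the host AMI as the previous slide describes:

    FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04

    # Libfabric and AWS-built Open MPI from the EFA installer, user-space parts only
    RUN apt-get update && apt-get install -y curl \
     && curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
     && tar -xf aws-efa-installer-latest.tar.gz \
     && cd aws-efa-installer \
     && ./efa_installer.sh -y --skip-kmod --skip-limit-conf

    # Remaining layers (omitted here): build aws-ofi-nccl (github.com/aws/aws-ofi-nccl)
    # against this CUDA/Libfabric, then install PyTorch, whose wheels bundle NCCL and cuDNN.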

Slide 13

What should be on the AMI? (Neuron)

Library        Notes
Neuron driver  Docker run: add "--device=/dev/neuron0" to have neuron-ls inside the container.
Docker         Service is the Docker daemon; CLI invoked by users.
EFA driver     Install with --minimal (the AMI won't even have fi_info). Docker run: add "--device /dev/infiniband/uverbs0 ...".
SSM agent
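The Neuron analogue of the GPU device passthrough shown earlier; a trn1.32xlarge exposes sixteen Neuron devices (/dev/neuron0 through /dev/neuron15), each passed explicitly (image name illustrative):

    # Pass every Neuron device the job needs (repeat --device up to /dev/neuron15 on trn1.32xlarge)
    docker run --rm \
        --device=/dev/neuron0 \
        --device /dev/infiniband/uverbs0 \
        my-neuron-image:latest neuron-ls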

Slide 14

What should be on the container? (Neuron)

Library                    Notes
libfabric-aws              Install with "--skip-kmod --skip-limit-conf". Also provides Open MPI prebuilt by AWS.
aws-neuron(x)-runtime-lib
aws-neuron(x)-tools        neuron-ls / neuron-top
neuron-compiler            Run neuron-cc from within a machine learning framework.
PyTorch Neuron             torch-neuronx (Inf2 & Trn1/Trn1n) / torch-neuron (Inf1)
aws-neuron-collectives     Collective operations with the Neuron SDK.
Distributed training libs  neuronx-nemo-megatron / neuronx-distributed
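Inside that container, the PyTorch Neuron packages come from the Neuron pip repository; a sketch for Trn1/Inf2 (versions left unpinned here, though a real image should pin them):

    pip install torch-neuronx neuronx-cc neuronx-distributed \
        --extra-index-url=https://pip.repos.neuron.amazonaws.com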

Slide 15

AWS Deep Learning AMIs (DLAMI)

• Preconfigured with popular deep learning frameworks and interfaces
• Optimized for performance with the latest NVIDIA drivers, CUDA libraries, and Intel libraries

Choosing a DLAMI
• Deep Learning AMI with Conda: a dedicated Conda environment for each framework
• Deep Learning Base AMI: no frameworks preinstalled
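One way to locate the most recent Base DLAMI from the AWS CLI (the name filter is illustrative; see the DLAMI release notes for exact image names):

    aws ec2 describe-images --owners amazon \
        --filters "Name=name,Values=Deep Learning Base*Ubuntu*" \
        --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
        --output text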

Slide 16

AWS Deep Learning Containers

• Prepackaged ML framework container images, fully configured and validated
• Includes AWS optimizations for TensorFlow, PyTorch, MXNet, and Hugging Face

github.com/aws/deep-learning-containers
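Pulling a Deep Learning Container follows the standard ECR flow. The account 763104351884 hosts the DLC registry; the image tag below is illustrative, so check the repository above for current tags:

    aws ecr get-login-password --region us-east-1 \
        | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

    docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2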

Slide 17

Thank you!

Please complete the session survey in the mobile app.

Keita Watanabe, Ph.D. ([email protected])
Pierre-Yves Aquilanti, Ph.D. ([email protected])

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.