
re:Invent 2023: CMP332 De-mystifying ML software stack on Amazon EC2 accelerated instances


Amazon EC2 offers the broadest set of accelerators in the cloud for running machine learning workloads. Whether you are using Amazon EC2 instances powered by NVIDIA GPUs, AWS Trainium, or AWS Inferentia, you may wonder how to manage the software stack. What goes in the Amazon Machine Image (AMI)? What goes in the container? How does that differ on Kubernetes with Amazon EKS versus Slurm with AWS ParallelCluster? How do you make use of the DLAMI? In this chalk talk we dive into the software stack for different accelerators, services, and containerization systems, and into AMIs and techniques for building and managing your software stack, drawing on practical experience.

Keita Watanabe

February 20, 2024


Transcript

1. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
2. CMP332: De-mystifying ML software stack on Amazon EC2 accelerated instances
Keita Watanabe, Ph.D., Sr. Specialist Solutions Architect, WWSO GenAI – Frameworks, Amazon Web Services
Pierre-Yves Aquilanti, Ph.D., Head Frameworks Solutions, WWSO GenAI – Frameworks, Amazon Web Services
3. Broad and deep accelerated computing portfolio
GPU, AWS ML accelerator, and FPGA-based EC2 instances:
• GPUs: P5 (H100), P4d/P4de (A100), P3 (V100), G5 (A10G), G4 (T4 / Radeon GPU), G5g (Graviton CPU)
• AI/ML accelerators and ASICs: Trn1 (Trainium accelerator), Inf1/Inf2 (Inferentia accelerator), DL1 (Gaudi accelerator)
• FPGAs: F1 (Xilinx FPGA), VT1 (Xilinx accelerator)
4. Fundamental architectures for training and inference (hardware components: GPU/Trainium/EFA)
Training: tightly coupled, communication heavy, and inter-node latency sensitive.
• Instances: P5, P4d(e), Trn1, G5
• Topology: placement group within a single Availability Zone; Amazon FSx for Lustre mounted on /scratch
• Scale: POC = 1-64 instances, PROD = 4-100s
• Care for: EFA, EC2 capacity, shared network
• Cost objective: cost-to-train ($/iteration)
Inference: loosely coupled, fast scaling in/out, and query latency sensitive.
• Instances: G5, G4dn, Inf1, Inf2, CPU-based instances
• Topology: EKS node groups spread across Availability Zones
• Scale: POC = 1-64 instances, PROD = 1-1000s
• Care for: scaling latency (predictive, metric, capacity)
• Cost objective: serving at scale and fast, $/inference
A placement-group example follows below.
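For illustration, the cluster placement group that keeps training instances on a low-latency network segment can be created with the AWS CLI; the group name and AMI ID below are placeholders:

    # Create a cluster placement group so training instances share
    # a low-latency network segment
    aws ec2 create-placement-group \
        --group-name ml-training-pg \
        --strategy cluster

    # Launch training instances into the group (AMI ID is a placeholder)
    aws ec2 run-instances \
        --image-id ami-0123456789abcdef0 \
        --instance-type p4d.24xlarge \
        --count 2 \
        --placement GroupName=ml-training-pg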
5. HPC cluster - AWS ParallelCluster
Architecture: AWS ParallelCluster in us-east-1; head node in a public subnet, compute fleet in a private subnet (us-east-1b); users connect through the head node; Amazon FSx for Lustre mounted on /fsx.
Head node:
• 1 × c5.9xlarge: 36 vCPUs (18 physical), 72 GB of memory
Compute nodes:
• 100+ × p4de.24xlarge, plus C6, M6, R6 instances
• 96 vCPUs (48 physical), 1152 GB of memory
• 8 × NVIDIA A100 80 GB GPUs
• Network: 400 Gb/s ENA & EFA
• Storage: 8 × 1 TB NVMe + EBS
Shared file systems:
• Amazon FSx for Lustre, 108 TB, mounted on /fsx
Cluster stack:
• Slurm 22.05.5
• CUDA 11.6
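Such a cluster is defined in a YAML file and created with the ParallelCluster 3.x CLI; the cluster name and configuration file name below are placeholders:

    # Install the ParallelCluster CLI (v3.x)
    pip install aws-parallelcluster

    # Create the cluster from a YAML configuration
    # (cluster name and file name are placeholders)
    pcluster create-cluster \
        --cluster-name ml-cluster \
        --cluster-configuration cluster.yaml

    # Watch provisioning progress
    pcluster describe-cluster --cluster-name ml-cluster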
6. Resiliency in SageMaker HyperPod
Agents monitor cluster instances for CPU, GPU, and network health issues. Once an agent detects a hardware failure, SageMaker HyperPod automatically replaces the faulty instance with a healthy one. With the faulty instance replaced, SageMaker HyperPod then requeues the workload in Slurm and reloads the last valid checkpoint to resume processing.
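In plain Slurm terms, the requeue-and-resume pattern can be sketched as follows; this is a generic illustration of checkpoint-based resumption under --requeue, not HyperPod's actual implementation (the checkpoint directory and the --resume-from flag are invented for the example):

    #!/bin/bash
    # Generic Slurm job that survives requeueing by restarting from
    # the newest checkpoint (paths and flags are illustrative)
    #SBATCH --job-name=train
    #SBATCH --nodes=4
    #SBATCH --requeue

    CKPT_DIR=/fsx/checkpoints
    # After a requeue, the script reruns from the top, so pick up
    # the most recent checkpoint if one exists
    LATEST=$(ls -t "$CKPT_DIR"/*.pt 2>/dev/null | head -n 1)
    if [ -n "$LATEST" ]; then
        srun python train.py --resume-from "$LATEST"
    else
        srun python train.py
    fi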
7. Deep learning open-source training stack with EFA (GPU)
The stack, from top to bottom: Launcher → Training framework → Frameworks & optimization libraries → Communication libraries → Driver → Hardware.
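To make the layers concrete, here is a hedged sketch of a multi-node PyTorch launch over EFA; the environment variables are commonly used with Libfabric and NCCL on EC2, and HEAD_NODE_IP plus train.py are placeholders:

    # Point Libfabric at the EFA provider and enable GPUDirect RDMA
    export FI_PROVIDER=efa
    export FI_EFA_USE_DEVICE_RDMA=1
    # Print NCCL initialization details to verify EFA is picked up
    export NCCL_DEBUG=INFO

    # Launcher (torchrun) -> framework (PyTorch) -> NCCL ->
    # aws-ofi-nccl -> Libfabric/EFA -> hardware
    torchrun \
        --nnodes="$SLURM_JOB_NUM_NODES" \
        --nproc_per_node=8 \
        --rdzv_backend=c10d \
        --rdzv_endpoint="$HEAD_NODE_IP:29500" \
        train.py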
8. Deep learning open-source training stack with EFA (Neuron)
The stack, from top to bottom: Frameworks & optimization libraries → Communication libraries → Driver → Hardware.
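For comparison, a minimal single-node launch on a trn1.32xlarge through this stack might look as follows; the per-node process count assumes 32 Neuron cores, and the script name is illustrative:

    # Libfabric/EFA settings, as on the GPU stack
    export FI_PROVIDER=efa
    export FI_EFA_USE_DEVICE_RDMA=1

    # torch-neuronx trains via torchrun: one process per Neuron core
    # (trn1.32xlarge exposes 32 cores; adjust for other instance sizes)
    torchrun --nproc_per_node=32 train_neuron.py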
9. Containers & AMI – how we generally think of it
• AMI: operating system, software packages, Docker, telemetry.
• Container (pulled from a container registry): Python, the ML framework, its libraries (Lib A, Lib B, Lib C), and supporting software packages.
10. Containers & AMI – how we generally think of it (with EFA)
The same AMI/container split as the previous slide, with EFA added: as the next two slides detail, the EFA kernel driver belongs in the AMI, while the user-space libraries that talk to it ship in the container.
11. What should be on the AMI? (GPU)
• NVIDIA GPU driver: pass "--gpus all" to docker run to have nvidia-smi available inside the container.
• NVIDIA Fabric Manager
• NVIDIA Docker: the service is the Docker daemon; the CLI is what users invoke.
• EFA driver: install with --minimal (note: the AMI then won't even have fi_info). Pass "--device /dev/infiniband/uverbs0 ..." to docker run.
• SSM agent
AMI layers: operating system, software packages, Docker, telemetry.
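Put together, the flags from this table give a docker run invocation roughly like the following (the image name is a placeholder; instances with multiple EFA devices expose additional uverbs devices):

    # Expose all GPUs and the first EFA device to the container,
    # then verify GPU visibility with nvidia-smi
    docker run --rm \
        --gpus all \
        --device /dev/infiniband/uverbs0 \
        my-training-image:latest \
        nvidia-smi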
12. What should be in the container? (GPU)
• libfabric-aws: install with "--skip-kmod --skip-limit-conf". AWS also provides prebuilt OpenMPI.
• CUDA
• cuDNN: PyTorch backend
• cuBLAS: PyTorch backend
• NCCL: PyTorch backend; can be overridden using LD_PRELOAD=.../libnccl.so
• aws-ofi-nccl: plugin that lets NCCL run over Libfabric/EFA
• ML frameworks: PyTorch / TensorFlow / JAX
Container layers: Python, framework, libraries, software packages.
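The LD_PRELOAD override mentioned above swaps in a different NCCL build at run time without rebuilding the image; the library path here is illustrative:

    # Force the dynamic linker to load a specific NCCL build ahead
    # of the one baked into the container (path is a placeholder)
    LD_PRELOAD=/opt/nccl/build/lib/libnccl.so \
        python train.py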
13. What should be on the AMI? (Neuron)
• Neuron driver: pass "--device=/dev/neuron0" to docker run to have neuron-ls available inside the container.
• Docker: the service is the Docker daemon; the CLI is what users invoke.
• EFA driver: install with --minimal (note: the AMI then won't even have fi_info). Pass "--device /dev/infiniband/uverbs0 ..." to docker run.
• SSM agent
AMI layers: operating system, software packages, Docker, telemetry.
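The Neuron analogue of the GPU invocation maps Neuron devices instead of GPUs; instances with several accelerators also expose /dev/neuron1, /dev/neuron2, and so on (the image name is a placeholder):

    # Expose the first Neuron device and the first EFA device,
    # then verify accelerator visibility with neuron-ls
    docker run --rm \
        --device=/dev/neuron0 \
        --device /dev/infiniband/uverbs0 \
        my-neuron-image:latest \
        neuron-ls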
14. What should be in the container? (Neuron)
• libfabric-aws: install with "--skip-kmod --skip-limit-conf". AWS also provides prebuilt OpenMPI.
• aws-neuron(x)-runtime-lib
• aws-neuron(x)-tools: neuron-ls / neuron-top
• neuron-compiler: neuron-cc is run from within a machine learning framework
• PyTorch Neuron: torch-neuronx (Inf2 & Trn1/Trn1n) / torch-neuron (Inf1)
• aws-neuron-collectives: collective operations with the Neuron SDK
• Distributed training libraries: neuronx-nemo-megatron / neuronx-distributed
Container layers: Python, framework, libraries, software packages.
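Inside the container, the PyTorch Neuron packages typically come from AWS's dedicated pip repository; a minimal sketch, assuming a Trn1/Inf2 target:

    # Neuron packages live in AWS's dedicated pip repository
    pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

    # torch-neuronx targets Trn1/Trn1n and Inf2; use torch-neuron for Inf1
    pip install torch-neuronx neuronx-cc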
15. AWS Deep Learning AMIs (DLAMI)
• Preconfigured with popular deep learning frameworks and interfaces
• Optimized for performance with the latest NVIDIA driver, CUDA libraries, and Intel libraries
Choosing a DLAMI:
• Deep Learning AMI with Conda: a dedicated Conda environment for each framework
• Deep Learning Base AMI: no frameworks
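Current DLAMI IDs can be looked up through the EC2 API; a sketch using a name filter (the exact name pattern varies by DLAMI flavor, framework, and OS):

    # List the five most recently published Deep Learning AMIs
    # from AWS in the current region
    aws ec2 describe-images \
        --owners amazon \
        --filters "Name=name,Values=Deep Learning*" \
        --query 'sort_by(Images, &CreationDate)[-5:].[Name,ImageId]' \
        --output table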
16. AWS Deep Learning Containers
• Prepackaged ML framework container images, fully configured and validated
• Include AWS optimizations for TensorFlow, PyTorch, MXNet, and Hugging Face
github.com/aws/deep-learning-containers
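The images are pulled from AWS's public ECR registry after authenticating; 763104351884 is the account ID documented in the deep-learning-containers repository, and the image tag is left as a placeholder:

    # Authenticate Docker against the Deep Learning Containers registry
    aws ecr get-login-password --region us-east-1 | \
        docker login --username AWS --password-stdin \
        763104351884.dkr.ecr.us-east-1.amazonaws.com

    # Pull a PyTorch training image (replace <tag> with one from
    # the repository's published image list)
    docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:<tag>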
17. Thank you! Please complete the session survey in the mobile app.
Keita Watanabe, Ph.D. [email protected]
Pierre-Yves Aquilanti, Ph.D. [email protected]
© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.