[re:Invent 2024 Chalk Talk] Cost-effectively deploy PyTorch LLMs on AWS Inferentia using Amazon EKS

Unlock the full potential of large language models (LLMs) on Amazon EKS by optimizing inference performance and cost efficiency. This chalk talk provides practical guidance on deploying and scaling PyTorch-based LLMs on EKS using AWS Inferentia, Karpenter, and KServe. Learn how to leverage the specialized hardware of Inferentia to accelerate inference, reduce latency, and lower costs. Discover how Karpenter’s advanced auto scaling capabilities optimize resource utilization and handle fluctuating workloads. Master the art of efficient model deployment and management with KServe. Through real-world examples and best practices, gain the expertise to build high-performance, cost-effective LLM inference pipelines.

Keita Watanabe

December 04, 2024

Transcript

  1. Cost-effectively deploy PyTorch LLMs on AWS Inferentia using Amazon EKS (CMP330)
     Keita Watanabe, PhD, Sr. WW Solutions Architect, GenAI, Amazon Web Services
     Nathan Arnold, Sr. Solutions Architect, Amazon Web Services
  2. Motivation: Challenges in LLM deployment
     • Accelerated instance availability
     • High inference cost
     • Unpredictable demand
  3. Architecture: Karpenter + KServe + Inferentia
     Diagram: an AI application sends REST/gRPC inference requests to the KServe predictor service. The KServe controller reconciles the KServe InferenceService into a Knative Service, Knative Revision, and Deployment of KServe predictor pods (Llama 3.2 1B replicas). The Deployment is scaled by the Knative Pod Autoscaler (KPA) for pod scaling, while Karpenter handles node scaling across worker nodes such as inf2.xlarge (On-Demand) and inf2.2xlarge (Spot).
  4. Architecture: Karpenter + KServe + Inferentia (same diagram, shown again)
  5. Karpenter
     Open source Kubernetes node provisioning tool developed by AWS
     • The kube-scheduler gets the first crack at scheduling pending pods and tries to place them on existing capacity
     • Karpenter observes the aggregate resource requests of unschedulable pods (marked by the kube-scheduler) to decide which instances to launch
  6. How Karpenter provisions nodes on AWS
     Diagram: pending pods produced by pod auto scaling are handled either by the Cluster Autoscaler (CA) working through Auto Scaling groups (ASG), or by Karpenter, which uses its NodePool and EC2NodeClass resources to call the EC2 API directly.
  7. Compute flexibility
     Instance type flexibility
     • Attribute-based requirements: sizes, families, generations, CPU architectures
     • No list given: picks from all instance types in the EC2 universe, excluding metal
     • Limits cap how much capacity this NodePool can provision
     AZ flexibility
     • Provision in any AZ, or only in specified AZs

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     metadata:
       name: default
     spec:
       template:
         spec:
           requirements:
             - key: karpenter.k8s.aws/instance-family
               operator: In
               values: ["c5", "m5", "r5"]
             - key: karpenter.k8s.aws/instance-size
               operator: NotIn
               values: ["nano", "micro", "small"]
             - key: topology.kubernetes.io/zone
               operator: In
               values: ["us-west-2a", "us-west-2b"]
             - key: kubernetes.io/arch
               operator: In
               values: ["amd64", "arm64"]
             - key: karpenter.sh/capacity-type
               operator: In
               values: ["spot", "on-demand"]
       limits:
         cpu: 100
  8. Compute flexibility (continued)
     Purchase option flexibility
     • On-Demand, if nothing is specified
     • Prioritizes Spot if the NodePool is flexible to both capacity types
     CPU architecture flexibility
     • x86-64
     • Arm64
     (Same NodePool manifest as slide 7.)
  9. Inferentia NodePool (the manifest was shown on the slide; see the sketch below)
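     The slide's manifest is not captured in this transcript. The following is a rough sketch, assuming the Karpenter v1beta1 API and an existing EC2NodeClass named "default"; the NodePool name and limits are illustrative, not taken from the slide.

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     metadata:
       name: inferentia                # illustrative name
     spec:
       template:
         spec:
           nodeClassRef:
             name: default             # assumes an EC2NodeClass named "default" exists
           requirements:
             - key: karpenter.k8s.aws/instance-family
               operator: In
               values: ["inf2"]        # Inferentia2-based instances only
             - key: karpenter.sh/capacity-type
               operator: In
               values: ["spot", "on-demand"]
       limits:
         cpu: 100                      # cap the total capacity this NodePool may provision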
  10. Architecture: Karpenter + KServe + Inferentia (same diagram, shown again)
  11. Purpose-built accelerators for generative AI
     • AWS Inferentia: lowest cost per inference in the cloud for running deep learning (DL) models; up to 70% lower cost per inference than comparable Amazon EC2 instances
     • AWS Inferentia2: high performance at the lowest cost per inference for LLMs and diffusion models; up to 40% better price performance than comparable Amazon EC2 instances
     • AWS Trainium: the most cost-efficient, high-performance training of LLMs and diffusion models; up to 50% savings on training costs over comparable Amazon EC2 instances
  12. Amazon EC2 Inf2 instances powered by AWS Inferentia2: high performance at the lowest cost for generative AI models
     Support for ultra-large 100B+ parameter GenAI models; up to 3x higher compute performance and 3x larger accelerator memory; up to 4x higher throughput and 10x lower latency; 9.8 TB/s aggregated accelerator memory bandwidth

     Instance size   vCPUs  Instance memory  Inferentia2 chips  Accelerator memory  NeuronLink  Instance networking  On-Demand price
     inf2.xlarge     4      16 GB            1                  32 GB               N/A         Up to 15 Gbps        $0.76/hr
     inf2.8xlarge    32     128 GB           1                  32 GB               N/A         Up to 25 Gbps        $1.97/hr
     inf2.24xlarge   96     384 GB           6                  192 GB              Yes         50 Gbps              $6.49/hr
     inf2.48xlarge   192    768 GB           12                 384 GB              Yes         100 Gbps             $12.98/hr
  13. AWS Inferentia2 architecture
     Diagram: each Inferentia2 device contains two NeuronCore-v2 cores (each with tensor, vector, scalar, and GPSIMD engines plus on-chip SRAM memory), HBM stacks, DMA engines, NeuronLink-v2 links for collective communication, and a PCIe connection to the host.
  14. AWS Neuron SDK: easy development with AWS Trainium and AWS Inferentia
     Neuron compiler, Neuron runtime, developer tools, framework and open source community support
     github.com/aws/aws-neuron-sdk
     https://awsdocs-neuron.readthedocs-hosted.com
  15. Architecture: Karpenter + KServe + Inferentia (same diagram, shown again)
  16. KServe (https://kserve.github.io/website/master/)
     Key features
     • Scale to and from zero
     • Request-based autoscaling
     • Batching
     • Request/response logging
     • Traffic management
     • Security with AuthN/AuthZ
     • Distributed tracing
     • Out-of-the-box metrics
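     As a sketch of the scale-to-zero and request-based autoscaling features (not taken from the slides): in KServe's serverless mode, setting a predictor's minReplicas to 0 lets Knative scale it away when idle. The service name, model format, and storage location below are illustrative assumptions.

     apiVersion: serving.kserve.io/v1beta1
     kind: InferenceService
     metadata:
       name: example-llm                         # illustrative name
     spec:
       predictor:
         minReplicas: 0                          # allow scale to zero when there is no traffic
         maxReplicas: 4                          # cap request-driven scale-out
         model:
           modelFormat:
             name: huggingface                   # assumes a serving runtime for this format is installed
           storageUri: s3://my-bucket/models/llama-3-2-1b   # illustrative model location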
  17. Revision autoscaling with the Knative Pod Autoscaler (KPA)
     Diagram: when the revision is active, the Route sends traffic directly to the pods and the Autoscaler pulls metrics from them to scale the Deployment (creating and deleting pods); when the revision is inactive, the Route goes through the Activator, which pushes metrics to the Autoscaler so the Deployment can be scaled back up.
     https://knative.dev/docs/serving/istio-authorization/
     https://developer.aliyun.com/article/710828
  18. Call to action
     • [AWS Machine Learning Blog] Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM
       https://aws.amazon.com/blogs/machine-learning/deploy-meta-llama-3-1-8b-on-aws-inferentia-using-amazon-eks-and-vllm/
     • Using Neuron with Amazon EKS
       https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html
     • Announcing AWS Neuron Helm Chart
       https://aws.amazon.com/blogs/containers/announcing-aws-neuron-helm-chart/
     • [Containers] Scaling a Large Language Model with NVIDIA NIM on Amazon EKS with Karpenter
       https://aws.amazon.com/blogs/containers/scaling-a-large-language-model-with-nvidia-nim-on-amazon-eks-with-karpenter/
     • vLLM with Neuron setup guide
       https://docs.vllm.ai/en/v0.6.3/getting_started/neuron-installation.html
     • Architecture deployment guide
       https://gist.github.com/KeitaW/359ddb7ea147cc68e7029c91c6f137e5
  19. Thank you! Please complete the session survey in the mobile app.
     Keita Watanabe, Sr. WW Solutions Architect, GenAI, Amazon Web Services
     Nathan Arnold, Sr. Solutions Architect, Amazon Web Services
  20. Conclusion
  21. Motivation: Challenges in LLM deployment (recap): accelerated instance availability, high inference cost, unpredictable demand
  22. Architecture: Karpenter + KServe + Inferentia (same diagram, shown again)
  23. References
     • Setup Mountpoint for Amazon S3
       https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/01-cluster/07-s3-mountpoint
     • awsome-inference
       https://github.com/aws-samples/awsome-inference
     • awsome-distributed-training
       https://github.com/aws-samples/awsome-distributed-training/tree/main
  24. Appendix
  25. Karpenter
  26. Compute flexibility: instance type and AZ flexibility (same content and NodePool manifest as slide 7)
  27. Compute flexibility: purchase option and CPU architecture flexibility (same content as slide 8)
  28. Compute per workload scheduling requirements
     Pod scheduling constraints must fall within a provisioner's constraints. Workloads may be required to run:
     • In certain AZs
     • On certain types of processors or hardware (AWS Graviton, GPUs)
     • On Spot and On-Demand capacity
     Standard Kubernetes pod scheduling mechanisms:
     • Node selectors
     • Node affinity
     • Taints and tolerations
     • Topology spread
  29. Karpenter respects scheduling constraints
     • Karpenter adds labels such as karpenter.sh/capacity-type: spot and kubernetes.io/arch: amd64 to the nodes it provisions
     • Use nodeSelector or nodeAffinity to schedule pods onto the appropriate nodes (see the sketch below)
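     A minimal sketch (not from the slide) of a Deployment that pins its pods to Spot, x86-64 nodes using those labels; the app name and image are illustrative placeholders.

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: spot-app                            # illustrative name
     spec:
       replicas: 2
       selector:
         matchLabels:
           app: spot-app
       template:
         metadata:
           labels:
             app: spot-app
         spec:
           nodeSelector:
             karpenter.sh/capacity-type: spot    # only Spot nodes provisioned by Karpenter
             kubernetes.io/arch: amd64           # only x86-64 nodes
           containers:
             - name: app
               image: public.ecr.aws/docker/library/nginx:latest   # placeholder image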
  30. User-defined annotations, labels, and taints
     The taints, labels, and annotations in the NodePool template are added to every node it provisions; use the labels to schedule pods for different apps onto those nodes.

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     spec:
       template:
         metadata:
           annotations:
             application/name: "app-a"
           labels:
             team: team-a
         spec:
           taints:
             - key: example.com/special-taint
               value: "true"
               effect: NoSchedule

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: myapp
     spec:
       template:
         spec:
           nodeSelector:
             team: team-a
           tolerations:                          # needed because the provisioned nodes carry the taint above
             - key: example.com/special-taint
               operator: Equal
               value: "true"
               effect: NoSchedule
  31. Spot interruption handling with Karpenter
     • Spot Instances get a 2-minute interruption notice, delivered via an Amazon EventBridge event
     • Interruption handling is configured through environment variables on the Karpenter controller Deployment object
     • NodePools can be configured for a mix of On-Demand and Spot
     • Karpenter has a built-in Spot interruption handler, so the Node Termination Handler is not required
  32. Strategies for defining NodePools
     Single: a single NodePool can manage compute for multiple teams and workloads
     • Example: one NodePool covering a mix of Graviton and x86, while a pending pod requires a specific processor type
     Multiple: isolate compute for different purposes
     • Examples: expensive hardware, security isolation, team separation, different AMIs, tenant isolation due to noisy neighbors
     Weighted: define an order across your NodePools so the scheduler attempts one NodePool before another
     • Examples: prioritize RIs and Savings Plans ahead of other instance types, a default cluster-wide configuration, ratio splits (Spot/On-Demand, x86/Graviton)
  33. Inferentia NodePool (same manifest as slide 9)
  34. Weighted NodePools: Karpenter prioritizes the NodePool with the higher weight

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     metadata:
       name: compute-savings-plan-nodepool
     spec:
       weight: 60
       template:
         spec:
           requirements:
             - key: karpenter.k8s.aws/instance-category
               operator: In
               values: ["c", "r"]
             - key: karpenter.k8s.aws/instance-cpu
               operator: In
               values: ["16", "32"]
             - key: karpenter.k8s.aws/instance-hypervisor
               operator: In
               values: ["nitro"]
       limits:
         cpu: "1000"
         memory: 1000Gi

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     metadata:
       name: general-instances-nodepool
     spec:
       weight: 40
       template:
         spec:
           requirements:
             - key: karpenter.k8s.aws/instance-cpu
               operator: In
               values: ["16", "32"]
  35. Karpenter optimization: enable consolidation
     Diagram: several partially utilized m5.xlarge worker nodes before consolidation.

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     spec:
       disruption:
         consolidationPolicy: WhenEmptyOrUnderutilized
  36. Karpenter optimization
     Diagram: after consolidation, the workloads fit on fewer m5.xlarge nodes; better utilization of worker nodes, reduced cost.
  37. Karpenter optimization: enable consolidation
     Diagram: three m5.xlarge worker nodes before consolidation.
  38. Karpenter optimization: pick cheaper nodes
     Diagram: an m5.xlarge node is replaced with an m5.large; better selection of worker nodes, reduced cost.
  39. Karpenter simplifies data plane management
     Karpenter combines the roles of:
     • Cluster Autoscaler
     • Node groups
     • Node Termination Handler
     • Descheduler
  40. Packer example: AMI
     • Export your AWS credentials as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
     • Packer builds an AMI according to the template (using a t2.micro build instance)
  41. AMI for the experiment
     • Base image: EKS optimized AMI (accelerated)
     • Pull the TensorRT version of the Stable Diffusion image using containerd
     • Note: containerd needs to be used as the container runtime (Kubernetes 1.8 or later)
     • Note: you must add the -n=k8s.io flag (the containerd namespace used by Kubernetes) so the pulled image is visible to Kubernetes
     https://blog.scottlowe.org/2020/01/25/manually-loading-container-images-with-containerd/
  42. Result: 10x faster
     Chart: container start-up time (s), default AMI vs. pre-baked AMI.
     • The pre-baked AMI achieved 10x faster container start-up (vs. the GPU optimized AMI pulling the TensorRT Stable Diffusion image)
     • The pre-baked AMI already has the image pulled
  43. Conclusion
     Charts: container start-up time (s) for the default vs. pre-baked AMI, and inference latency (s) with vs. without TensorRT.
     • The TensorRT-compiled Stable Diffusion pipeline achieved 2.8x faster inference (vs. the Hugging Face pipeline version)
     • The Packer pre-baked AMI achieved 10x faster container start-up (vs. the GPU optimized AMI pulling the TensorRT Stable Diffusion image)
  44. Inferentia
  45. vLLM with Inferentia
  46. Optimize throughput: continuous batching
     Diagram: one model copy (80 GB) serves three requests ("Amazon is", "Capital of California is", "Largest mammal is") across time steps T1-T7; requests finish (END) at different times, which leaves slots IDLE unless new requests are batched in continuously, and an attention key/value cache is kept per sequence. (Based on batch size = 4 on ml.g5.12xl for a 7B model.)
  47. EKS deployment: Neuron Helm Chart
     Neuron Device Plugin
     • Exposes Neuron cores and devices to Kubernetes as resources
     • Runs as a DaemonSet in the kube-system namespace
     Neuron Scheduler Extension
     • Manages scheduling of pods that require specific Neuron core/device configurations
     • Required when a pod requests multiple Neuron devices
     • Minimizes communication latency by identifying directly connected device sets
     Neuron Node Problem Detector Plugin
     • Operates as a DaemonSet on AWS Neuron-enabled EKS worker nodes
     • Monitors the health of Neuron devices on each node
     • Initiates node replacement if an unrecoverable error is detected
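     For illustration (not from the slide): once the Neuron device plugin is running, a pod asks for Inferentia devices through the aws.amazon.com/neuron extended resource. The pod name and image are illustrative placeholders.

     apiVersion: v1
     kind: Pod
     metadata:
       name: neuron-smoke-test                   # illustrative name
     spec:
       containers:
         - name: app
           image: public.ecr.aws/neuron/pytorch-inference-neuronx:latest   # placeholder; use a Neuron DLC image/tag available in your registry
           resources:
             limits:
               aws.amazon.com/neuron: 1          # one Inferentia device, advertised by the Neuron device plugin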
  48. Example vLLM + Inferentia deployment (manifest shown on the slide; see the sketch below)
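     The manifest itself is not captured in this transcript. The following is a rough sketch of what such a deployment could look like; the image, model, flags, and resource values are illustrative assumptions, not taken from the slide.

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: vllm-llama                          # illustrative name
     spec:
       replicas: 1
       selector:
         matchLabels:
           app: vllm-llama
       template:
         metadata:
           labels:
             app: vllm-llama
         spec:
           nodeSelector:
             karpenter.k8s.aws/instance-family: inf2    # land on Inferentia2 nodes provisioned by Karpenter
           containers:
             - name: vllm
               image: <your-vllm-neuron-image>          # placeholder: a vLLM image built with the Neuron backend
               args: ["--model", "meta-llama/Llama-3.2-1B", "--port", "8000"]   # illustrative server flags
               ports:
                 - containerPort: 8000
               resources:
                 limits:
                   aws.amazon.com/neuron: 1             # one Inferentia device per replica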
  49. Distributed training software stack (Neuron)
     Layers: ML frameworks; communication libraries and SDKs; hardware drivers; EC2 instance.
  50. What is in the container? (Neuron)
     The container carries the ML frameworks and the communication libraries/SDKs, running on top of the AMI and the EC2 instance.
  51. What is on the AMI? (Neuron)
     The AMI carries the hardware drivers, container toolkits, the aws-neuronx-oci-hook, and the SDK, on top of the EC2 instance.
  52. KServe
  53. Agenda
     • KServe overview
     • KServe components: InferenceService, Predictor
     • Autoscaling with the Knative Pod Autoscaler (KPA)
     • ML inference with KServe: examples
  54. KServe (https://kserve.github.io/website/master/)
  55. KServe features (same feature list as slide 16)
  56. KServe control plane
     • Responsible for reconciling the InferenceService custom resources
     • Creates the Knative serverless deployments for the predictor and transformer, enabling autoscaling based on the incoming request workload, including scaling down to zero when no traffic is received
  57. KServe control plane (same content, shown again)
  58. Predictor (https://kserve.github.io/website/master/)
  59. Predictor components (https://kserve.github.io/website/master/)
     • Queue Proxy: measures and limits concurrency to the user's application
     • Model Server: deploys, manages, and serves machine learning models
     • Storage Initializer: retrieves and prepares machine learning models from storage backends such as Amazon S3
  60. Transformer (https://kserve.github.io/website/master/)
  61. Transformer components (https://kserve.github.io/website/master/)
     • Queue Proxy: measures and limits concurrency to the user's application
     • Model Server: preprocesses input data and postprocesses output predictions, so custom logic and data transformations integrate seamlessly with the deployed machine learning models
  62. KServe control plane
     Diagram: the KServe controller reconciles the InferenceService into either the serverless path (Knative Service, Knative Revision, Deployment) or a raw Deployment.
  63. KServe control plane (same diagram, shown again)
  64. Knative components: the same diagram, focusing on the Knative resources (https://knative.dev/docs/serving/)
  65. Knative Serving: the Knative Service, Revision, and Deployment in the serverless path are managed by Knative Serving (https://knative.dev/docs/serving/)
  66. Primary Knative Serving resources
     • Service: automatically manages the whole lifecycle of your workload
     • Route: maps a network endpoint to one or more revisions
     • Configuration: maintains the desired state for your deployment
     • Revision: a point-in-time snapshot of the code and configuration for each modification made to the workload
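     As a sketch (not from the slides), a minimal Knative Service; the Route, Configuration, Revision, and Deployment are created from it automatically. The name and image are illustrative placeholders.

     apiVersion: serving.knative.dev/v1
     kind: Service
     metadata:
       name: hello                               # illustrative name
     spec:
       template:                                 # every change to this template produces a new Revision
         spec:
           containers:
             - image: ghcr.io/knative/helloworld-go:latest   # placeholder sample image
               ports:
                 - containerPort: 8080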
  67. Revision autoscaling with the Knative Pod Autoscaler (KPA): same diagram as slide 17
  68. Scaling up and down (steady state): https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md
  69. Scaling to zero: https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md
  70. Scaling from zero: https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md
  71. Autoscale sample: ramp up traffic to maintain 10 in-flight requests
     https://github.com/dewitt/knative-docs/tree/master/serving/samples/autoscale-go
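     A sketch (not from the slides) of how that concurrency target is expressed with Knative's autoscaling annotations on the revision template; the service name and image are illustrative placeholders.

     apiVersion: serving.knative.dev/v1
     kind: Service
     metadata:
       name: autoscale-go                        # illustrative name
     spec:
       template:
         metadata:
           annotations:
             autoscaling.knative.dev/class: kpa.autoscaling.knative.dev   # use the Knative Pod Autoscaler
             autoscaling.knative.dev/metric: concurrency                  # scale on in-flight requests
             autoscaling.knative.dev/target: "10"                         # aim for ~10 in-flight requests per pod
         spec:
           containers:
             - image: ghcr.io/knative/autoscale-go:latest                 # placeholder sample image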
  72. Scaling pods from zero
     https://github.com/dewitt/knative-docs/tree/master/serving/samples/autoscale-go
  73. Difference between KPA and HPA
     Knative Pod Autoscaler (KPA)
     • Part of the Knative Serving core; enabled by default once Knative Serving is installed
     • Supports scale-to-zero
     • Does not support CPU-based autoscaling
     Horizontal Pod Autoscaler (HPA)
     • Not part of the Knative Serving core; must be enabled after Knative Serving installation
     • Does not support scale-to-zero
     • Supports CPU-based autoscaling
     https://kserve.github.io/website/0.8/modelserving/v1beta1/torchserve/#autoscaling
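     For illustration (not from the slides), KServe can switch an InferenceService from the default KPA to CPU-based HPA scaling. The fields below follow the KServe v1beta1 spec as I understand it; the name, thresholds, and model location are illustrative assumptions.

     apiVersion: serving.kserve.io/v1beta1
     kind: InferenceService
     metadata:
       name: cpu-scaled-model                    # illustrative name
       annotations:
         serving.kserve.io/autoscalerClass: hpa  # use the HPA instead of the Knative Pod Autoscaler
     spec:
       predictor:
         minReplicas: 1                          # HPA cannot scale to zero
         maxReplicas: 5
         scaleMetric: cpu                        # CPU-based autoscaling
         scaleTarget: 80                         # target 80% CPU utilization
         model:
           modelFormat:
             name: pytorch                       # assumes a matching serving runtime is installed
           storageUri: s3://my-bucket/models/example   # illustrative model location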
  74. We have covered the Knative Serving part (same control-plane diagram; https://knative.dev/docs/serving/)
  75. Up next: InferenceService
     Question: do we have to deal with the complexity inside Knative? Answer: no, all we need to write is the InferenceService.
  76. First InferenceService: apply the manifest (shown on the slide) to the cluster
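     The manifest on the slide is not captured in this transcript. The KServe getting-started guide (linked on slide 78) uses roughly the following minimal InferenceService, reproduced here as a sketch:

     apiVersion: serving.kserve.io/v1beta1
     kind: InferenceService
     metadata:
       name: sklearn-iris
     spec:
       predictor:
         model:
           modelFormat:
             name: sklearn
           storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # sample model from the KServe docs

     Applying it (for example with kubectl apply -f) has the KServe controller create the Knative Service, Revision, and Deployment described on the earlier slides.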
  77. First InferenceService
  78. First InferenceService: load test (under the hood)
     https://kserve.github.io/website/master/get_started/first_isvc/#5-perform-inference