Slide 1


Slide 2

Cost-effectively deploy PyTorch LLMs on AWS Inferentia using Amazon EKS (CMP330)

Keita Watanabe, PhD (Sr. WW Solutions Architect, GenAI, Amazon Web Services)
Nathan Arnold (Sr. Solutions Architect, Amazon Web Services)

Slide 3

Motivation: Challenges in LLM deployment

• Accelerated instance availability
• High inference cost
• Unpredictable demand

Slide 4

Architecture: Karpenter + KServe + Inferentia

An AI application sends REST/gRPC inference requests to the KServe predictor service. The KServe controller reconciles an InferenceService into a Knative Service, Knative Revision, and Deployment, and scales the Deployment with the Knative Pod Autoscaler (KPA). KServe predictor pods (Llama 3.2 1B replicas) handle pod-level scaling; Karpenter handles node-level scaling, launching worker nodes such as inf2.xlarge (On-Demand) and inf2.2xlarge (Spot).

Slide 5

(Same architecture diagram as Slide 4.)

Slide 6

Karpenter

Karpenter is an open-source Kubernetes node provisioning tool developed by AWS.

• kube-scheduler gets the first crack at scheduling pending pods and tries to place them on existing capacity.
• Karpenter observes the aggregate resource requests of unschedulable pods (as marked by kube-scheduler) to decide which instances to launch.

Slide 7

How Karpenter provisions nodes on AWS

Diagram: pod autoscaling produces pending pods; Karpenter evaluates them against its NodePool and EC2NodeClass definitions and calls the EC2 API directly, rather than going through the Cluster Autoscaler (CA) and Auto Scaling groups (ASG).

Slide 8

Compute flexibility

Instance type flexibility
• Attribute-based requirements → sizes, families, generations, CPU architectures
• No list → picks from all instance types in the EC2 universe, excluding metal
• limits cap how much capacity this NodePool can provision

AZ flexibility
• Provision in any AZ
• Provision in specified AZs

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c5", "m5", "r5"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  limits:
    cpu: 100

Slide 9

Compute flexibility

Purchase option flexibility
• On-Demand, if nothing is specified
• Prioritizes Spot if flexible to both capacity types

CPU architecture flexibility
• x86-64
• Arm64

(Same NodePool manifest as Slide 8.)

Slide 10

Inferentia NodePool
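The NodePool manifest shown on this slide was not captured in the transcript. A minimal sketch of what an Inferentia NodePool might look like, assuming the karpenter.sh/v1beta1 API used elsewhere in this deck and an EC2NodeClass named default; the Neuron taint is a common convention for keeping non-Neuron pods off these nodes, not something the slide confirms:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: inferentia
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["inf2"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
      nodeClassRef:
        name: default                  # assumes an EC2NodeClass named "default" exists
      taints:
        - key: aws.amazon.com/neuron   # keep non-Neuron workloads off Inferentia nodes
          value: "true"
          effect: NoSchedule
  limits:
    cpu: 100                           # cap the total capacity this NodePool may provision

Pods meant for these nodes then need a matching toleration and a resource request for aws.amazon.com/neuron, as covered on the Neuron device plugin slides in the appendix.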

Slide 11

Architecture: Karpenter + KServe + Inferentia (recap; same diagram as Slide 4)

Slide 12

Purpose-built accelerators for generative AI

AWS Inferentia: lowest cost per inference in the cloud for running deep learning (DL) models; up to 70% lower cost per inference than comparable Amazon EC2 instances.

AWS Inferentia2: high performance at the lowest cost per inference for LLMs and diffusion models; up to 40% better price performance than comparable Amazon EC2 instances.

AWS Trainium: the most cost-efficient, high-performance training of LLMs and diffusion models; up to 50% savings on training costs over comparable Amazon EC2 instances.

Slide 13

Amazon EC2 Inf2 instances powered by AWS Inferentia2

High performance at the lowest cost for generative AI models

• Support for ultra-large 100B+ parameter generative AI models
• Up to 3x higher compute performance, 3x larger accelerator memory
• Up to 4x higher throughput and 10x lower latency
• 9.8 TB/s aggregated accelerator memory bandwidth

Instance size   vCPUs  Instance memory  Inferentia2 chips  Accelerator memory  NeuronLink  Instance networking  On-demand price
inf2.xlarge     4      16 GB            1                  32 GB               N/A         Up to 15 Gbps        $0.76/hr
inf2.8xlarge    32     128 GB           1                  32 GB               N/A         Up to 25 Gbps        $1.97/hr
inf2.24xlarge   96     384 GB           6                  192 GB              Yes         50 Gbps              $6.49/hr
inf2.48xlarge   192    768 GB           12                 384 GB              Yes         100 Gbps             $12.98/hr

Slide 14

AWS Inferentia2 architecture

Diagram: each Inferentia2 chip contains two NeuronCore-v2 cores (each with tensor, vector, scalar, and GPSIMD engines plus on-chip SRAM memory), HBM stacks, DMA engines, collective-communication hardware, NeuronLink-v2 links, and a host PCIe interface.

Slide 15

AWS Neuron SDK

Easy development with AWS Trainium and AWS Inferentia

• Neuron compiler
• Neuron runtime
• Developer tools
• Framework and open-source community

github.com/aws/aws-neuron-sdk
https://awsdocs-neuron.readthedocs-hosted.com

Slide 16

Architecture: Karpenter + KServe + Inferentia (recap; same diagram as Slide 4)

Slide 17

KServe

Key features
• Scale to and from zero
• Request-based autoscaling
• Batching
• Request/response logging
• Traffic management
• Security with AuthN/AuthZ
• Distributed tracing
• Out-of-the-box metrics

https://kserve.github.io/website/master/

Slide 18

Revision autoscaling with Knative Pod Autoscaler (KPA)

Diagram: on the active route, the Autoscaler pulls metrics from the pods; on the inactive route, requests hit the Activator, which pushes metrics to the Autoscaler; the Autoscaler scales the Deployment, which creates and deletes pods.

https://knative.dev/docs/serving/istio-authorization/
https://developer.aliyun.com/article/710828

Slide 19

Call to action

• [AWS Machine Learning Blog] Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM
  https://aws.amazon.com/blogs/machine-learning/deploy-meta-llama-3-1-8b-on-aws-inferentia-using-amazon-eks-and-vllm/
• Using Neuron with Amazon EKS
  https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html
• Announcing AWS Neuron Helm Chart
  https://aws.amazon.com/blogs/containers/announcing-aws-neuron-helm-chart/
• [Containers] Scaling a Large Language Model with NVIDIA NIM on Amazon EKS with Karpenter
  https://aws.amazon.com/blogs/containers/scaling-a-large-language-model-with-nvidia-nim-on-amazon-eks-with-karpenter/
• vLLM with Neuron setup guide
  https://docs.vllm.ai/en/v0.6.3/getting_started/neuron-installation.html
• Architecture deployment guide
  https://gist.github.com/KeitaW/359ddb7ea147cc68e7029c91c6f137e5

Slide 20

Thank you!

Please complete the session survey in the mobile app.

Keita Watanabe (Sr. WW Solutions Architect, GenAI, Amazon Web Services)
Nathan Arnold (Sr. Solutions Architect, Amazon Web Services)

Slide 21

Conclusion

Slide 22

Motivation: Challenges in LLM deployment (recap)

• Accelerated instance availability
• High inference cost
• Unpredictable demand

Slide 23

Architecture: Karpenter + KServe + Inferentia (recap; same diagram as Slide 4)

Slide 24

References

• Setup Mountpoint for Amazon S3
  https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/01-cluster/07-s3-mountpoint
• awsome-inference
  https://github.com/aws-samples/awsome-inference
• awsome-distributed-training
  https://github.com/aws-samples/awsome-distributed-training/tree/main

Slide 25

Appendix

Slide 26

Karpenter

Slide 27

(Repeat of Slide 8: instance type and AZ flexibility, with the same NodePool manifest.)

Slide 28

(Repeat of Slide 9: purchase option and CPU architecture flexibility.)

Slide 29

Compute per workload scheduling requirements

Pod scheduling constraints must fall within a NodePool's constraints.

Workloads may be required to run:
• In certain AZs
• On certain types of processors or hardware (AWS Graviton, GPUs)
• On Spot and On-Demand capacity

Standard K8s pod scheduling mechanisms:
• Node selectors
• Node affinity
• Taints and tolerations
• Topology spread

Slide 30

Karpenter respects scheduling constraints

• Karpenter adds labels such as karpenter.sh/capacity-type: spot and kubernetes.io/arch: amd64 to the nodes it provisions
• Use nodeSelector or nodeAffinity to schedule pods onto the appropriate nodes
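As a sketch of the second bullet (the workload name and image are placeholders, not from the deck), a Deployment pinned to Spot capacity on x86 via nodeSelector:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-web                # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spot-web
  template:
    metadata:
      labels:
        app: spot-web
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot   # label added by Karpenter
        kubernetes.io/arch: amd64
      containers:
        - name: web
          image: public.ecr.aws/nginx/nginx:latest   # placeholder image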

Slide 31

User-defined annotations, labels, taints

The taints, labels, and annotations below are added to all nodes this NodePool provisions; use the labels to schedule pods for different apps.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
spec:
  template:
    metadata:
      annotations:
        application/name: "app-a"
      labels:
        team: team-a
    spec:
      taints:
        - key: example.com/special-taint
          value: "true"
          effect: NoSchedule

A Deployment then selects those nodes (note that nodeSelector belongs under the pod template, spec.template.spec):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      nodeSelector:
        team: team-a
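One gap worth noting: because the NodePool taints its nodes, the Deployment above also needs a matching toleration or its pods will stay pending. A sketch of the pod-template fragment to add:

# add under spec.template.spec of the Deployment
tolerations:
  - key: example.com/special-taint
    operator: Equal
    value: "true"
    effect: NoSchedule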

Slide 32

Spot interruption handling with Karpenter

• EC2 sends a 2-minute Spot Instance interruption notice via an Amazon EventBridge event
• The queue receiving these events is configured through environment variables on the Karpenter controller Deployment object
• NodePools can be configured for a mix of On-Demand and Spot
• Karpenter has a built-in Spot interruption handler, so the Node Termination Handler is not required
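A sketch of how the interruption handling is typically wired up via the controller's Helm chart values; the exact key name has varied across Karpenter chart versions, so treat this as an assumption to verify against your version:

# values.yaml fragment for the Karpenter controller Helm chart
settings:
  interruptionQueue: karpenter-my-cluster   # hypothetical SQS queue receiving the EventBridge events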

Slide 33

Strategies for defining NodePools

Single: a single NodePool can manage compute for multiple teams and workloads.
Example use cases:
• Single NodePool for a mix of Graviton and x86, while a pending pod has a requirement for a specific processor type

Multiple: isolate compute for different purposes.
Example use cases:
• Expensive hardware
• Security isolation
• Team separation
• Different AMIs
• Tenant isolation due to noisy neighbors

Weighted: define an order across your NodePools so that Karpenter attempts to schedule with one NodePool before another.
Example use cases:
• Prioritize Reserved Instances and Savings Plans ahead of other instance types
• Default cluster-wide configuration
• Ratio splits: Spot/On-Demand, x86/Graviton

Slide 34

(Inferentia NodePool, as on Slide 10.)

Slide 35

Weighted NodePools

Karpenter prioritizes the NodePool with the higher weight.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: compute-savings-plan-nodepool
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "r"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["16", "32"]
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values: ["nitro"]
  limits:
    cpu: "1000"
    memory: 1000Gi
  weight: 60

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-instances-nodepool
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["16", "32"]
  weight: 40

Slide 36

Karpenter optimization

Diagram: three underutilized m5.xlarge worker nodes; enabling consolidation repacks the workloads onto fewer nodes.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized

Slide 37

Karpenter optimization

Diagram: the consolidated result runs on two m5.xlarge nodes; better utilization of worker nodes means reduced cost.

Slide 38

Karpenter optimization

Diagram: three m5.xlarge worker nodes with consolidation enabled (start of the cheaper-node example).

Slide 39

Karpenter optimization: pick cheaper nodes

Diagram: consolidation replaces an m5.xlarge with a cheaper m5.large; better selection of worker nodes means reduced cost.

Slide 40

Karpenter simplifies data plane management

Karpenter combines the functionality of:
• Cluster Autoscaler
• Node groups
• Node Termination Handler
• Descheduler

Slide 41

Packer example: AMI

• Export your AWS credentials as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
• Packer builds an AMI according to the template (using a t2.micro build instance)

Slide 42

AMI for the experiment

• Base image: EKS optimized AMI (accelerated)
• Pull the TensorRT version of the Stable Diffusion image using containerd
• Note: containerd needs to be used with K8s 1.8 or later, and you must add the -n=k8s.io flag to make the image available to K8s

https://blog.scottlowe.org/2020/01/25/manually-loading-container-images-with-containerd/

Slide 43


Slide 44

Result: 10x faster start-up

Chart: start-up time (s), Default AMI vs. pre-baked AMI.

• The pre-baked AMI achieved 10x faster container start-up (vs. GPU Optimized AMI + TensorRT Stable Diffusion image)
• The pre-baked AMI already has the container image pulled

Slide 45

Conclusion

Charts: start-up time (s), Default AMI vs. pre-baked AMI; inference latency (s), with vs. without TensorRT.

• The TensorRT-compiled Stable Diffusion pipeline achieved 2.8x faster inference (vs. the HuggingFace pipeline version)
• The Packer pre-baked AMI achieved 10x faster container start-up (vs. GPU Optimized AMI + TensorRT Stable Diffusion image)

Slide 46

Inferentia

Slide 47

vLLM with Inferentia

Slide 48

Optimize throughput: continuous batching

Diagram: three requests ("Amazon is", "Capital of California is", "Largest mammal is") generate output tokens over time steps T1-T7; sequences finish at different steps, and static batching leaves those slots idle until the whole batch completes, whereas continuous batching backfills freed slots with new requests. Each model copy (80 GB) also holds the attention key/value cache.

* Based on batch size = 4 on ml.g5.12xl for a 7B model

Slide 49

EKS deployment: Neuron Helm Chart

Neuron Device Plugin
• Exposes Neuron cores and devices to Kubernetes as resources
• Runs as a DaemonSet in the kube-system namespace

Neuron Scheduler Extension
• Manages scheduling of pods that require specific Neuron core/device configurations
• Required when you deploy a pod that requires multiple Neuron devices
• Minimizes communication latency by identifying directly connected device sets

Neuron Node Problem Detector Plugin
• Operates as a DaemonSet on AWS Neuron-enabled EKS worker nodes
• Monitors the health of Neuron devices on each node
• Initiates node replacement if an unrecoverable error is detected
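With the device plugin installed, pods request Neuron hardware as an extended resource. A minimal sketch (the pod name and image are placeholders; aws.amazon.com/neuron is the resource name the device plugin advertises for whole Neuron devices):

apiVersion: v1
kind: Pod
metadata:
  name: neuron-smoke-test        # hypothetical
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/python:3.11   # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          aws.amazon.com/neuron: 1   # one Inferentia device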

Slide 50

Example vLLM + Inferentia deployment
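The manifest on this slide was not captured in the transcript. A hedged sketch of what a vLLM-on-Inferentia Deployment could look like; the image, model, and vLLM flags are assumptions, so check the vLLM Neuron setup guide linked on Slide 19 for the flags your version supports:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      tolerations:
        - key: aws.amazon.com/neuron       # matches the hypothetical NodePool taint from Slide 10
          operator: Exists
      containers:
        - name: vllm
          image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/vllm-neuron:latest   # hypothetical image
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args: ["--model", "meta-llama/Llama-3.2-1B", "--device", "neuron", "--tensor-parallel-size", "2"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              aws.amazon.com/neuron: 1     # one Inferentia2 device (two NeuronCores)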

Slide 51

Distributed training software stack (Neuron)

Stack diagram, top to bottom: ML frameworks; communication libraries and SDKs; hardware drivers; EC2 instance.

Slide 52

What is in the container? (Neuron)

Diagram: the container, running on the AMI on an EC2 instance, packages the ML frameworks and the communication libraries and SDKs.

Slide 53

What is on the AMI? (Neuron)

Diagram: the AMI, on an EC2 instance, carries the hardware drivers, aws-neuronx-oci-hook, container toolkits, and SDK.

Slide 54

KServe

Slide 55

Agenda

• KServe overview
• KServe components
• InferenceService
• Predictor
• Autoscaling with Knative Pod Autoscaler (KPA)
• ML inference with KServe: examples

Slide 56

KServe

https://kserve.github.io/website/master/

Slide 57

KServe features

• Scale to and from zero
• Request-based autoscaling
• Batching
• Request/response logging
• Traffic management
• Security with AuthN/AuthZ
• Distributed tracing
• Out-of-the-box metrics
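The first two features map directly onto InferenceService fields. A minimal sketch, not from the deck (the name, runtime, and storage URI are placeholders), that scales to zero when idle and targets ten concurrent requests per replica:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-demo                  # hypothetical
spec:
  predictor:
    minReplicas: 0                # scale to zero
    maxReplicas: 4
    scaleMetric: concurrency      # request-based autoscaling
    scaleTarget: 10               # in-flight requests per replica
    model:
      modelFormat:
        name: huggingface         # assumed serving runtime
      storageUri: s3://my-bucket/models/llm   # placeholder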

Slide 58

KServe control plane

• Responsible for reconciling the InferenceService custom resources
• Creates the Knative serverless deployment for the predictor and transformer to enable autoscaling based on the incoming request workload, including scaling down to zero when no traffic is received

Slide 59

(Repeat of Slide 58.)

Slide 60

Predictor

https://kserve.github.io/website/master/

Slide 61

Predictor

• Queue Proxy measures and limits concurrency to the user's application
• Model Server deploys, manages, and serves machine learning models
• Storage Initializer retrieves and prepares machine learning models from various storage backends, such as Amazon S3

https://kserve.github.io/website/master/

Slide 62

Transformer

https://kserve.github.io/website/master/

Slide 63

Transformer

• Queue Proxy measures and limits concurrency to the user's application
• Model Server preprocesses input data and postprocesses output predictions, so custom logic and data transformations integrate seamlessly with the deployed machine learning models

https://kserve.github.io/website/master/
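A custom transformer is declared alongside the predictor in the same InferenceService. A sketch, with the transformer image and storage URI as placeholders rather than anything from the deck:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: classifier-with-transformer   # hypothetical
spec:
  transformer:
    containers:
      - name: kserve-container
        image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/image-preprocessor:latest   # hypothetical pre/post-processing image
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://my-bucket/models/classifier   # placeholder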

Slide 64

KServe control plane

Diagram: the KServe controller reconciles the InferenceService; in Serverless mode it creates a Knative Service, which produces a Knative Revision and then a Deployment; in Raw Deployment mode it creates the Deployment directly.

Slide 65

(Repeat of Slide 64.)

Slide 66

Knative components

Diagram: the same reconcile chain as Slide 64, highlighting the pieces that come from Knative.

https://knative.dev/docs/serving/

Slide 67

Knative Serving

Diagram: within the reconcile chain, the Knative Service, Knative Revision, and Deployment belong to Knative Serving.

https://knative.dev/docs/serving/

Slide 68

Primary Knative Serving resources

• Service: automatically manages the whole lifecycle of your workload
• Route: maps a network endpoint to one or more revisions
• Configuration: maintains the desired state for your deployment
• Revision: a point-in-time snapshot of the code and configuration for each modification made to the workload

Slide 69

(Revision autoscaling with Knative Pod Autoscaler (KPA), as on Slide 18.)

Slide 70

Scaling up and down (steady state)

https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md

Slide 71

Scaling to zero

https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md

Slide 72

Scaling from zero

https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md

Slide 73

Autoscale sample

Ramp up traffic to maintain 10 in-flight requests.

https://github.com/dewitt/knative-docs/tree/master/serving/samples/autoscale-go
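In the linked sample, the 10 in-flight-request target is set with a Knative annotation on the revision template. A sketch along the lines of that sample (the image reference is from the sample repo and may have moved; pin a tag in practice):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "10"   # desired in-flight requests per pod
    spec:
      containers:
        - image: gcr.io/knative-samples/autoscale-go:0.1   # image from the sample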

Slide 74

Scaling pods from zero

https://github.com/dewitt/knative-docs/tree/master/serving/samples/autoscale-go

Slide 75

Differences between KPA and HPA

Knative Pod Autoscaler (KPA)
• Part of the Knative Serving core; enabled by default once Knative Serving is installed
• Supports scale-to-zero
• Does not support CPU-based autoscaling

Horizontal Pod Autoscaler (HPA)
• Not part of the Knative Serving core; must be enabled after Knative Serving installation
• Does not support scale-to-zero
• Supports CPU-based autoscaling

https://kserve.github.io/website/0.8/modelserving/v1beta1/torchserve/#autoscaling
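Per the linked KServe autoscaling doc, switching a component from KPA to HPA is done with Knative annotations. A sketch for CPU-based autoscaling (the service name is hypothetical; verify the model path against the doc):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torchserve-hpa               # hypothetical
  annotations:
    autoscaling.knative.dev/class: hpa.autoscaling.knative.dev   # use HPA instead of KPA
    autoscaling.knative.dev/metric: cpu
    autoscaling.knative.dev/target: "80"                         # target CPU utilization (%)
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1   # example path; verify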

Slide 76

We have covered the Knative Serving part...

Diagram: the reconcile chain again, with the Knative Serving portion highlighted.

https://knative.dev/docs/serving/

Slide 77

Up next: InferenceService

Diagram: the reconcile chain as on Slide 64.

Question: Do we have to deal with all this complexity in Knative ourselves?
Answer: No! All we need is the InferenceService.

Slide 78

First InferenceService
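The manifest shown on this slide was not captured in the transcript; the "first InferenceService" from the KServe getting-started guide (linked on Slide 80) is along these lines:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model

Applying this single resource with kubectl apply produces the Knative Service, Revision, and Deployment chain shown earlier, without any direct interaction with Knative.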

Slide 79

First InferenceService

Slide 80

First InferenceService: load test

Under the hood: https://kserve.github.io/website/master/get_started/first_isvc/#5-perform-inference