[re:Invent 2024 Chalk Talk] Cost-effectively deploy PyTorch LLMs on AWS Inferentia using Amazon EKS

Unlock the full potential of large language models (LLMs) on Amazon EKS by optimizing inference performance and cost efficiency. This chalk talk provides practical guidance on deploying and scaling PyTorch-based LLMs on EKS using AWS Inferentia, Karpenter, and KServe. Learn how to leverage the specialized hardware of Inferentia to accelerate inference, reduce latency, and lower costs. Discover how Karpenter’s advanced auto scaling capabilities optimize resource utilization and handle fluctuating workloads. Master the art of efficient model deployment and management with KServe. Through real-world examples and best practices, gain the expertise to build high-performance, cost-effective LLM inference pipelines.

Keita Watanabe

December 04, 2024

Transcript

  1. Cost-effectively deploy PyTorch LLMs on AWS Inferentia using Amazon EKS (CMP330)
     Keita Watanabe, PhD, Sr. WW Solutions Architect, GenAI, Amazon Web Services
     Nathan Arnold, Sr. Solutions Architect, Amazon Web Services
  2. Motivation: Challenges in LLM deployment
     • Accelerated instance availability
     • High inference cost
     • Unpredictable demand
  3. Architecture: Karpenter + KServe + Inferentia
     Diagram: an AI application sends REST/gRPC inference requests to the KServe predictor service. The KServe controller reconciles the KServe InferenceService into a Knative Service, Knative Revision, and Deployment of KServe predictor pods (Llama 3.2 1B replicas). The Deployment is scaled by the Knative Pod Autoscaler (KPA) for pod scaling, while Karpenter handles node scaling across worker nodes such as inf2.xlarge (On-Demand) and inf2.2xlarge (Spot).
  4. Architecture: Karpenter + KServe + Inferentia (same diagram, shown again)
  5. Karpenter
     Open source Kubernetes node provisioning tool developed by AWS
     • The kube-scheduler gets the first crack at scheduling pending pods and tries to place them on existing capacity
     • Karpenter observes the aggregate resource requests of unschedulable pods (marked by the kube-scheduler) to decide which instances to launch
  6. How Karpenter provisions nodes on AWS
     Diagram: pending pods produced by pod auto scaling are handled either by the Cluster Autoscaler (CA) working through Auto Scaling groups (ASG), or by Karpenter, which uses its NodePool and EC2NodeClass resources to call the EC2 API directly.
  7. Compute flexibility
     Instance type flexibility
     • Attribute-based requirements: sizes, families, generations, CPU architectures
     • No list given: picks from all instance types in the EC2 universe, excluding metal
     • Limits cap how much capacity this NodePool can provision
     AZ flexibility
     • Provision in any AZ, or only in specified AZs

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     metadata:
       name: default
     spec:
       template:
         spec:
           requirements:
             - key: karpenter.k8s.aws/instance-family
               operator: In
               values: ["c5", "m5", "r5"]
             - key: karpenter.k8s.aws/instance-size
               operator: NotIn
               values: ["nano", "micro", "small"]
             - key: topology.kubernetes.io/zone
               operator: In
               values: ["us-west-2a", "us-west-2b"]
             - key: kubernetes.io/arch
               operator: In
               values: ["amd64", "arm64"]
             - key: karpenter.sh/capacity-type
               operator: In
               values: ["spot", "on-demand"]
       limits:
         cpu: 100
  8. Compute flexibility (continued)
     Purchase option flexibility
     • On-Demand, if nothing is specified
     • Prioritizes Spot if the NodePool is flexible to both capacity types
     CPU architecture flexibility
     • x86-64
     • Arm64
     (Same NodePool manifest as slide 7.)
  9. Inferentia NodePool (the manifest was shown on the slide; see the sketch below)
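     The slide's manifest is not captured in this transcript. The following is a rough sketch, assuming the Karpenter v1beta1 API and an existing EC2NodeClass named "default"; the NodePool name and limits are illustrative, not taken from the slide.

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     metadata:
       name: inferentia                # illustrative name
     spec:
       template:
         spec:
           nodeClassRef:
             name: default             # assumes an EC2NodeClass named "default" exists
           requirements:
             - key: karpenter.k8s.aws/instance-family
               operator: In
               values: ["inf2"]        # Inferentia2-based instances only
             - key: karpenter.sh/capacity-type
               operator: In
               values: ["spot", "on-demand"]
       limits:
         cpu: 100                      # cap the total capacity this NodePool may provision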
  10. Architecture: Karpenter + KServe + Inferentia (same diagram, shown again)
  11. Purpose-built accelerators for generative AI
     • AWS Inferentia: lowest cost per inference in the cloud for running deep learning (DL) models; up to 70% lower cost per inference than comparable Amazon EC2 instances
     • AWS Inferentia2: high performance at the lowest cost per inference for LLMs and diffusion models; up to 40% better price performance than comparable Amazon EC2 instances
     • AWS Trainium: the most cost-efficient, high-performance training of LLMs and diffusion models; up to 50% savings on training costs over comparable Amazon EC2 instances
  12. Amazon EC2 Inf2 instances powered by AWS Inferentia2: high performance at the lowest cost for generative AI models
     Support for ultra-large 100B+ parameter GenAI models; up to 3x higher compute performance and 3x larger accelerator memory; up to 4x higher throughput and 10x lower latency; 9.8 TB/s aggregated accelerator memory bandwidth

     Instance size   vCPUs  Instance memory  Inferentia2 chips  Accelerator memory  NeuronLink  Instance networking  On-Demand price
     inf2.xlarge     4      16 GB            1                  32 GB               N/A         Up to 15 Gbps        $0.76/hr
     inf2.8xlarge    32     128 GB           1                  32 GB               N/A         Up to 25 Gbps        $1.97/hr
     inf2.24xlarge   96     384 GB           6                  192 GB              Yes         50 Gbps              $6.49/hr
     inf2.48xlarge   192    768 GB           12                 384 GB              Yes         100 Gbps             $12.98/hr
  13. AWS Inferentia2 architecture
     Diagram: each Inferentia2 device contains two NeuronCore-v2 cores (each with tensor, vector, scalar, and GPSIMD engines plus on-chip SRAM memory), HBM stacks, DMA engines, NeuronLink-v2 links for collective communication, and a PCIe connection to the host.
  14. AWS Neuron SDK: easy development with AWS Trainium and AWS Inferentia
     Neuron compiler, Neuron runtime, developer tools, framework and open source community support
     github.com/aws/aws-neuron-sdk
     https://awsdocs-neuron.readthedocs-hosted.com
  15. Architecture: Karpenter + KServe + Inferentia (same diagram, shown again)
  16. KServe (https://kserve.github.io/website/master/)
     Key features
     • Scale to and from zero
     • Request-based autoscaling
     • Batching
     • Request/response logging
     • Traffic management
     • Security with AuthN/AuthZ
     • Distributed tracing
     • Out-of-the-box metrics
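     As a sketch of the scale-to-zero and request-based autoscaling features (not taken from the slides): in KServe's serverless mode, setting a predictor's minReplicas to 0 lets Knative scale it away when idle. The service name, model format, and storage location below are illustrative assumptions.

     apiVersion: serving.kserve.io/v1beta1
     kind: InferenceService
     metadata:
       name: example-llm                         # illustrative name
     spec:
       predictor:
         minReplicas: 0                          # allow scale to zero when there is no traffic
         maxReplicas: 4                          # cap request-driven scale-out
         model:
           modelFormat:
             name: huggingface                   # assumes a serving runtime for this format is installed
           storageUri: s3://my-bucket/models/llama-3-2-1b   # illustrative model location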
  17. Revision autoscaling with the Knative Pod Autoscaler (KPA)
     Diagram: when the revision is active, the Route sends traffic directly to the pods and the Autoscaler pulls metrics from them to scale the Deployment (creating and deleting pods); when the revision is inactive, the Route goes through the Activator, which pushes metrics to the Autoscaler so the Deployment can be scaled back up.
     https://knative.dev/docs/serving/istio-authorization/
     https://developer.aliyun.com/article/710828
  18. Call to action
     • [AWS Machine Learning Blog] Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM
       https://aws.amazon.com/blogs/machine-learning/deploy-meta-llama-3-1-8b-on-aws-inferentia-using-amazon-eks-and-vllm/
     • Using Neuron with Amazon EKS
       https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html
     • Announcing AWS Neuron Helm Chart
       https://aws.amazon.com/blogs/containers/announcing-aws-neuron-helm-chart/
     • [Containers] Scaling a Large Language Model with NVIDIA NIM on Amazon EKS with Karpenter
       https://aws.amazon.com/blogs/containers/scaling-a-large-language-model-with-nvidia-nim-on-amazon-eks-with-karpenter/
     • vLLM with Neuron setup guide
       https://docs.vllm.ai/en/v0.6.3/getting_started/neuron-installation.html
     • Architecture deployment guide
       https://gist.github.com/KeitaW/359ddb7ea147cc68e7029c91c6f137e5
  19. Thank you! Please complete the session survey in the mobile app.
     Keita Watanabe, Sr. WW Solutions Architect, GenAI, Amazon Web Services
     Nathan Arnold, Sr. Solutions Architect, Amazon Web Services
  20. Conclusion
  21. Motivation: Challenges in LLM deployment (recap): accelerated instance availability, high inference cost, unpredictable demand
  22. Architecture: Karpenter + KServe + Inferentia (same diagram, shown again)
  23. References
     • Setup Mountpoint for Amazon S3
       https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/01-cluster/07-s3-mountpoint
     • awsome-inference
       https://github.com/aws-samples/awsome-inference
     • awsome-distributed-training
       https://github.com/aws-samples/awsome-distributed-training/tree/main
  24. Appendix
  25. Karpenter
  26. Compute flexibility: instance type and AZ flexibility (same content and NodePool manifest as slide 7)
  27. Compute flexibility: purchase option and CPU architecture flexibility (same content as slide 8)
  28. Compute per workload scheduling requirements
     Pod scheduling constraints must fall within a provisioner's constraints. Workloads may be required to run:
     • In certain AZs
     • On certain types of processors or hardware (AWS Graviton, GPUs)
     • On Spot and On-Demand capacity
     Standard Kubernetes pod scheduling mechanisms:
     • Node selectors
     • Node affinity
     • Taints and tolerations
     • Topology spread
  29. Karpenter respects scheduling constraints
     • Karpenter adds labels such as karpenter.sh/capacity-type: spot and kubernetes.io/arch: amd64 to the nodes it provisions
     • Use nodeSelector or nodeAffinity to schedule pods onto the appropriate nodes (see the sketch below)
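     A minimal sketch (not from the slide) of a Deployment that pins its pods to Spot, x86-64 nodes using those labels; the app name and image are illustrative placeholders.

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: spot-app                            # illustrative name
     spec:
       replicas: 2
       selector:
         matchLabels:
           app: spot-app
       template:
         metadata:
           labels:
             app: spot-app
         spec:
           nodeSelector:
             karpenter.sh/capacity-type: spot    # only Spot nodes provisioned by Karpenter
             kubernetes.io/arch: amd64           # only x86-64 nodes
           containers:
             - name: app
               image: public.ecr.aws/docker/library/nginx:latest   # placeholder image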
  30. User-defined annotations, labels, and taints
     The taints, labels, and annotations in the NodePool template are added to every node it provisions; use the labels to schedule pods for different apps onto those nodes.

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     spec:
       template:
         metadata:
           annotations:
             application/name: "app-a"
           labels:
             team: team-a
         spec:
           taints:
             - key: example.com/special-taint
               value: "true"
               effect: NoSchedule

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: myapp
     spec:
       template:
         spec:
           nodeSelector:
             team: team-a
           tolerations:                          # needed because the provisioned nodes carry the taint above
             - key: example.com/special-taint
               operator: Equal
               value: "true"
               effect: NoSchedule
  31. Spot interruption handling with Karpenter
     • Spot Instances get a 2-minute interruption notice, delivered via an Amazon EventBridge event
     • Interruption handling is configured through environment variables on the Karpenter controller Deployment object
     • NodePools can be configured for a mix of On-Demand and Spot
     • Karpenter has a built-in Spot interruption handler, so the Node Termination Handler is not required
  32. Strategies for defining NodePools
     Single: a single NodePool can manage compute for multiple teams and workloads
     • Example: one NodePool covering a mix of Graviton and x86, while a pending pod requires a specific processor type
     Multiple: isolate compute for different purposes
     • Examples: expensive hardware, security isolation, team separation, different AMIs, tenant isolation due to noisy neighbors
     Weighted: define an order across your NodePools so the scheduler attempts one NodePool before another
     • Examples: prioritize RIs and Savings Plans ahead of other instance types, a default cluster-wide configuration, ratio splits (Spot/On-Demand, x86/Graviton)
  33. Inferentia NodePool (same manifest as slide 9)
  34. Weighted NodePools: Karpenter prioritizes the NodePool with the higher weight

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     metadata:
       name: compute-savings-plan-nodepool
     spec:
       weight: 60
       template:
         spec:
           requirements:
             - key: karpenter.k8s.aws/instance-category
               operator: In
               values: ["c", "r"]
             - key: karpenter.k8s.aws/instance-cpu
               operator: In
               values: ["16", "32"]
             - key: karpenter.k8s.aws/instance-hypervisor
               operator: In
               values: ["nitro"]
       limits:
         cpu: "1000"
         memory: 1000Gi

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     metadata:
       name: general-instances-nodepool
     spec:
       weight: 40
       template:
         spec:
           requirements:
             - key: karpenter.k8s.aws/instance-cpu
               operator: In
               values: ["16", "32"]
  35. Karpenter optimization: enable consolidation
     Diagram: several partially utilized m5.xlarge worker nodes before consolidation.

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     spec:
       disruption:
         consolidationPolicy: WhenEmptyOrUnderutilized
  36. Karpenter optimization
     Diagram: after consolidation, the workloads fit on fewer m5.xlarge nodes; better utilization of worker nodes, reduced cost.
  37. Karpenter optimization: enable consolidation
     Diagram: three m5.xlarge worker nodes before consolidation.
  38. Karpenter optimization: pick cheaper nodes
     Diagram: an m5.xlarge node is replaced with an m5.large; better selection of worker nodes, reduced cost.
  39. Karpenter simplifies data plane management
     Karpenter combines the roles of:
     • Cluster Autoscaler
     • Node groups
     • Node Termination Handler
     • Descheduler
  40. Packer example: AMI
     • Export your AWS credentials as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
     • Packer builds an AMI according to the template (using a t2.micro build instance)
  41. AMI for the experiment
     • Base image: EKS optimized AMI (accelerated)
     • Pull the TensorRT version of the Stable Diffusion image using containerd
     • Note: containerd needs to be used as the container runtime (Kubernetes 1.8 or later)
     • Note: you must add the -n=k8s.io flag (the containerd namespace used by Kubernetes) so the pulled image is visible to Kubernetes
     https://blog.scottlowe.org/2020/01/25/manually-loading-container-images-with-containerd/
  42. Result: 10x faster
     Chart: container start-up time (s), default AMI vs. pre-baked AMI.
     • The pre-baked AMI achieved 10x faster container start-up (vs. the GPU optimized AMI pulling the TensorRT Stable Diffusion image)
     • The pre-baked AMI already has the image pulled
  43. Conclusion
     Charts: container start-up time (s) for the default vs. pre-baked AMI, and inference latency (s) with vs. without TensorRT.
     • The TensorRT-compiled Stable Diffusion pipeline achieved 2.8x faster inference (vs. the Hugging Face pipeline version)
     • The Packer pre-baked AMI achieved 10x faster container start-up (vs. the GPU optimized AMI pulling the TensorRT Stable Diffusion image)
  44. Inferentia
  45. vLLM with Inferentia
  46. Optimize throughput: continuous batching
     Diagram: one model copy (80 GB) serves three requests ("Amazon is", "Capital of California is", "Largest mammal is") across time steps T1-T7; requests finish (END) at different times, which leaves slots IDLE unless new requests are batched in continuously, and an attention key/value cache is kept per sequence. (Based on batch size = 4 on ml.g5.12xl for a 7B model.)
  47. EKS deployment: Neuron Helm Chart
     Neuron Device Plugin
     • Exposes Neuron cores and devices to Kubernetes as resources
     • Runs as a DaemonSet in the kube-system namespace
     Neuron Scheduler Extension
     • Manages scheduling of pods that require specific Neuron core/device configurations
     • Required when a pod requests multiple Neuron devices
     • Minimizes communication latency by identifying directly connected device sets
     Neuron Node Problem Detector Plugin
     • Operates as a DaemonSet on AWS Neuron-enabled EKS worker nodes
     • Monitors the health of Neuron devices on each node
     • Initiates node replacement if an unrecoverable error is detected
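     For illustration (not from the slide): once the Neuron device plugin is running, a pod asks for Inferentia devices through the aws.amazon.com/neuron extended resource. The pod name and image are illustrative placeholders.

     apiVersion: v1
     kind: Pod
     metadata:
       name: neuron-smoke-test                   # illustrative name
     spec:
       containers:
         - name: app
           image: public.ecr.aws/neuron/pytorch-inference-neuronx:latest   # placeholder; use a Neuron DLC image/tag available in your registry
           resources:
             limits:
               aws.amazon.com/neuron: 1          # one Inferentia device, advertised by the Neuron device plugin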
  48. Example vLLM + Inferentia deployment (manifest shown on the slide; see the sketch below)
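     The manifest itself is not captured in this transcript. The following is a rough sketch of what such a deployment could look like; the image, model, flags, and resource values are illustrative assumptions, not taken from the slide.

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: vllm-llama                          # illustrative name
     spec:
       replicas: 1
       selector:
         matchLabels:
           app: vllm-llama
       template:
         metadata:
           labels:
             app: vllm-llama
         spec:
           nodeSelector:
             karpenter.k8s.aws/instance-family: inf2    # land on Inferentia2 nodes provisioned by Karpenter
           containers:
             - name: vllm
               image: <your-vllm-neuron-image>          # placeholder: a vLLM image built with the Neuron backend
               args: ["--model", "meta-llama/Llama-3.2-1B", "--port", "8000"]   # illustrative server flags
               ports:
                 - containerPort: 8000
               resources:
                 limits:
                   aws.amazon.com/neuron: 1             # one Inferentia device per replica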
  49. Distributed training software stack (Neuron)
     Layers: ML frameworks; communication libraries and SDKs; hardware drivers; EC2 instance.
  50. What is in the container? (Neuron)
     The container carries the ML frameworks and the communication libraries/SDKs, running on top of the AMI and the EC2 instance.
  51. What is on the AMI? (Neuron)
     The AMI carries the hardware drivers, container toolkits, the aws-neuronx-oci-hook, and the SDK, on top of the EC2 instance.
  52. KServe
  53. Agenda
     • KServe overview
     • KServe components: InferenceService, Predictor
     • Autoscaling with the Knative Pod Autoscaler (KPA)
     • ML inference with KServe: examples
  54. KServe (https://kserve.github.io/website/master/)
  55. KServe features (same feature list as slide 16)
  56. KServe control plane
     • Responsible for reconciling the InferenceService custom resources
     • Creates the Knative serverless deployments for the predictor and transformer, enabling autoscaling based on the incoming request workload, including scaling down to zero when no traffic is received
  57. KServe control plane (same content, shown again)
  58. Predictor (https://kserve.github.io/website/master/)
  59. Predictor components (https://kserve.github.io/website/master/)
     • Queue Proxy: measures and limits concurrency to the user's application
     • Model Server: deploys, manages, and serves machine learning models
     • Storage Initializer: retrieves and prepares machine learning models from storage backends such as Amazon S3
  60. Transformer (https://kserve.github.io/website/master/)
  61. Transformer components (https://kserve.github.io/website/master/)
     • Queue Proxy: measures and limits concurrency to the user's application
     • Model Server: preprocesses input data and postprocesses output predictions, so custom logic and data transformations integrate seamlessly with the deployed machine learning models
  62. KServe control plane
     Diagram: the KServe controller reconciles the InferenceService into either the serverless path (Knative Service, Knative Revision, Deployment) or a raw Deployment.
  63. KServe control plane (same diagram, shown again)
  64. Knative components: the same diagram, focusing on the Knative resources (https://knative.dev/docs/serving/)
  65. Knative Serving: the Knative Service, Revision, and Deployment in the serverless path are managed by Knative Serving (https://knative.dev/docs/serving/)
  66. Primary Knative Serving resources
     • Service: automatically manages the whole lifecycle of your workload
     • Route: maps a network endpoint to one or more revisions
     • Configuration: maintains the desired state for your deployment
     • Revision: a point-in-time snapshot of the code and configuration for each modification made to the workload
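     As a sketch (not from the slides), a minimal Knative Service; the Route, Configuration, Revision, and Deployment are created from it automatically. The name and image are illustrative placeholders.

     apiVersion: serving.knative.dev/v1
     kind: Service
     metadata:
       name: hello                               # illustrative name
     spec:
       template:                                 # every change to this template produces a new Revision
         spec:
           containers:
             - image: ghcr.io/knative/helloworld-go:latest   # placeholder sample image
               ports:
                 - containerPort: 8080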
  67. Revision autoscaling with the Knative Pod Autoscaler (KPA): same diagram as slide 17
  68. Scaling up and down (steady state): https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md
  69. Scaling to zero: https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md
  70. Scaling from zero: https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md
  71. Autoscale sample: ramp up traffic to maintain 10 in-flight requests
     https://github.com/dewitt/knative-docs/tree/master/serving/samples/autoscale-go
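     A sketch (not from the slides) of how that concurrency target is expressed with Knative's autoscaling annotations on the revision template; the service name and image are illustrative placeholders.

     apiVersion: serving.knative.dev/v1
     kind: Service
     metadata:
       name: autoscale-go                        # illustrative name
     spec:
       template:
         metadata:
           annotations:
             autoscaling.knative.dev/class: kpa.autoscaling.knative.dev   # use the Knative Pod Autoscaler
             autoscaling.knative.dev/metric: concurrency                  # scale on in-flight requests
             autoscaling.knative.dev/target: "10"                         # aim for ~10 in-flight requests per pod
         spec:
           containers:
             - image: ghcr.io/knative/autoscale-go:latest                 # placeholder sample image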
  72. Scaling pods from zero
     https://github.com/dewitt/knative-docs/tree/master/serving/samples/autoscale-go
  73. Difference between KPA and HPA
     Knative Pod Autoscaler (KPA)
     • Part of the Knative Serving core; enabled by default once Knative Serving is installed
     • Supports scale-to-zero
     • Does not support CPU-based autoscaling
     Horizontal Pod Autoscaler (HPA)
     • Not part of the Knative Serving core; must be enabled after Knative Serving installation
     • Does not support scale-to-zero
     • Supports CPU-based autoscaling
     https://kserve.github.io/website/0.8/modelserving/v1beta1/torchserve/#autoscaling
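     For illustration (not from the slides), KServe can switch an InferenceService from the default KPA to CPU-based HPA scaling. The fields below follow the KServe v1beta1 spec as I understand it; the name, thresholds, and model location are illustrative assumptions.

     apiVersion: serving.kserve.io/v1beta1
     kind: InferenceService
     metadata:
       name: cpu-scaled-model                    # illustrative name
       annotations:
         serving.kserve.io/autoscalerClass: hpa  # use the HPA instead of the Knative Pod Autoscaler
     spec:
       predictor:
         minReplicas: 1                          # HPA cannot scale to zero
         maxReplicas: 5
         scaleMetric: cpu                        # CPU-based autoscaling
         scaleTarget: 80                         # target 80% CPU utilization
         model:
           modelFormat:
             name: pytorch                       # assumes a matching serving runtime is installed
           storageUri: s3://my-bucket/models/example   # illustrative model location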
  74. We have covered the Knative Serving part (same control-plane diagram; https://knative.dev/docs/serving/)
  75. Up next: InferenceService
     Question: do we have to deal with the complexity inside Knative? Answer: no, all we need to write is the InferenceService.
  76. First InferenceService: apply the manifest (shown on the slide) to the cluster
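     The manifest on the slide is not captured in this transcript. The KServe getting-started guide (linked on slide 78) uses roughly the following minimal InferenceService, reproduced here as a sketch:

     apiVersion: serving.kserve.io/v1beta1
     kind: InferenceService
     metadata:
       name: sklearn-iris
     spec:
       predictor:
         model:
           modelFormat:
             name: sklearn
           storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # sample model from the KServe docs

     Applying it (for example with kubectl apply -f) has the KServe controller create the Knative Service, Revision, and Deployment described on the earlier slides.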
  77. First InferenceService
  78. First InferenceService: load test (under the hood)
     https://kserve.github.io/website/master/get_started/first_isvc/#5-perform-inference