Slide 1

Slide 1 text

© 2022, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Confidential and Trademark.
Introduction to KServe
Keita Watanabe, Sr. Solutions Architect, AI/ML Frameworks, AWS

Slide 2

Slide 2 text

Agenda
• KServe Overview
• KServe Components
• Inference Service
• Predictor
• AutoScaling with Knative Pod Autoscaler (KPA)
• ML inference with KServe: Examples

Slide 3

Slide 3 text

KServe
https://kserve.github.io/website/master/

Slide 4

Slide 4 text

KServe Features
• Scale to and from zero
• Request-based autoscaling
• Batching
• Request/response logging
• Traffic management
• Security with AuthN/AuthZ
• Distributed tracing
• Out-of-the-box metrics

Slide 5

Slide 5 text

KServe Control Plane
• Responsible for reconciling the InferenceService custom resources.
• It creates the Knative serverless deployments for the predictor and transformer to enable autoscaling based on the incoming request load, including scaling down to zero when no traffic is received.
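A minimal sketch of the custom resource the controller reconciles, with both a predictor and a transformer (the name, model location, and image below are illustrative, not from this deck):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-service                              # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                           # illustrative model format
      storageUri: gs://my-bucket/models/my-model   # illustrative model location
  transformer:
    containers:
      - name: kserve-container
        image: my-registry/my-transformer:latest   # illustrative transformer image
```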

Slide 7

Slide 7 text

Predictor
https://kserve.github.io/website/master/

Slide 8

Slide 8 text

Predictor
https://kserve.github.io/website/master/
• Queue Proxy measures and limits concurrency to the user's application.
• Model Server deploys, manages, and serves machine learning models.
• Storage Initializer retrieves and prepares machine learning models from various storage backends, such as Amazon S3.
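The Storage Initializer is driven by the `storageUri` field. A sketch of a predictor pulling a model from Amazon S3 (bucket, path, and service-account name are illustrative; the service account must carry S3 credentials):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                                   # illustrative name
spec:
  predictor:
    serviceAccountName: s3-sa                      # illustrative SA with S3 credentials attached
    model:
      modelFormat:
        name: sklearn                              # illustrative model format
      storageUri: s3://my-bucket/models/my-model/  # Storage Initializer downloads from here
```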

Slide 9

Slide 9 text

Transformer
https://kserve.github.io/website/master/

Slide 10

Slide 10 text

Transformer
https://kserve.github.io/website/master/
• Queue Proxy measures and limits concurrency to the user's application.
• Model Server preprocesses input data and postprocesses output predictions, letting custom logic or data transformations run alongside the deployed machine learning models.
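The preprocess/postprocess flow described above can be sketched in plain Python. This is a simplified stand-in for KServe's transformer interface (real transformers subclass `kserve.Model`); the class, payload shapes, and labels here are illustrative only:

```python
# A pure-Python sketch of the transformer's preprocess/postprocess contract.
# preprocess runs before the predictor is called, postprocess runs on its output.

class IllustrativeTransformer:
    """Illustrative stand-in for a KServe transformer (not the real API)."""

    def __init__(self, labels):
        self.labels = labels

    def preprocess(self, payload):
        # Example transformation: scale raw 0-255 pixel values into the 0-1 range.
        scaled = [[v / 255.0 for v in row] for row in payload["instances"]]
        return {"instances": scaled}

    def postprocess(self, predictions):
        # Example transformation: map numeric class indices to human-readable labels.
        named = [self.labels[i] for i in predictions["predictions"]]
        return {"predictions": named}
```

In a real deployment the predictor call happens between the two methods; KServe wires that pipeline up automatically once a transformer is declared in the InferenceService.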

Slide 11

Slide 11 text

KServe Control Plane
[Diagram: the KServe Controller reconciles an Inference Service into a Knative Service → Knative Revision → Deployment (Serverless mode), or directly into a Deployment (Raw Deployment mode).]

Slide 13

Slide 13 text

Knative Components
[Diagram: the reconciliation flow from Inference Service through KServe Controller to Knative Service, Knative Revision, and Deployment.]
https://knative.dev/docs/serving/

Slide 14

Slide 14 text

Knative Serving
[Diagram: the same reconciliation flow, with the Knative Service and Knative Revision highlighted as the Knative Serving resources.]
https://knative.dev/docs/serving/

Slide 15

Slide 15 text

Primary Knative Serving Resources
• Service: automatically manages the whole lifecycle of your workload.
• Route: maps a network endpoint to one or more revisions.
• Configuration: maintains the desired state for your deployment.
• Revision: a point-in-time snapshot of the code and configuration for each modification made to the workload.
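A minimal Knative Service manifest can be sketched as follows, assuming the `helloworld-go` sample image from the Knative docs; applying it makes Knative create the Configuration, Revision, and Route on your behalf:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello                        # illustrative name
spec:
  template:
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-go   # sample image from the Knative docs
          env:
            - name: TARGET
              value: "World"
```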

Slide 16

Slide 16 text

Revision Autoscaling with Knative Pod Autoscaler (KPA)
[Diagram: on the inactive route, the Route sends requests to the Activator, which pushes metrics to the Autoscaler; the Autoscaler pulls metrics from the pods and creates/deletes them by scaling the Deployment. On the active route, traffic goes directly to the pods.]
https://knative.dev/docs/serving/istio-authorization/
https://developer.aliyun.com/article/710828

Slide 17

Slide 17 text

Scaling up and down (steady state)
https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md

Slide 18

Slide 18 text

Scaling to zero
https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md

Slide 19

Slide 19 text

Scaling from zero
https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md

Slide 20

Slide 20 text

Autoscale Sample
https://github.com/dewitt/knative-docs/tree/master/serving/samples/autoscale-go
Ramp up traffic to maintain 10 in-flight requests.
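The "maintain 10 in-flight requests" behavior comes from the KPA concurrency target annotation on the revision template. A sketch (the surrounding Service spec is elided):

```yaml
spec:
  template:
    metadata:
      annotations:
        # KPA adds/removes pods so that each pod serves ~10 concurrent requests
        autoscaling.knative.dev/target: "10"
```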

Slide 21

Slide 21 text

Scaling pods from zero
https://github.com/dewitt/knative-docs/tree/master/serving/samples/autoscale-go

Slide 22

Slide 22 text

Difference between KPA and HPA
Knative Pod Autoscaler (KPA)
• Part of the Knative Serving core and enabled by default once Knative Serving is installed.
• Supports scale-to-zero functionality.
• Does not support CPU-based autoscaling.
Horizontal Pod Autoscaler (HPA)
• Not part of the Knative Serving core, and must be enabled after Knative Serving installation.
• Does not support scale-to-zero functionality.
• Supports CPU-based autoscaling.
https://kserve.github.io/website/0.8/modelserving/v1beta1/torchserve/#autoscaling
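Switching a revision from the default KPA to the HPA is done with the autoscaling class annotations on the revision template. A sketch (target value is illustrative):

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"  # default is kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: "cpu"   # HPA enables CPU-based scaling
        autoscaling.knative.dev/target: "80"    # illustrative target: 80% CPU utilization
```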

Slide 23

Slide 23 text

We have covered the Knative Serving part…
[Diagram: the reconciliation flow, with the Knative Serving resources highlighted.]
https://knative.dev/docs/serving/

Slide 24

Slide 24 text

Up next: Inference Service
[Diagram: the reconciliation flow, with the Inference Service resource highlighted.]
Question: Do we have to deal with the complexity of Knative ourselves?
Answer: No! All we need is the Inference Service.

Slide 25

Slide 25 text

First InferenceService
Apply
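The "first InferenceService" being applied here can be sketched as KServe's getting-started scikit-learn iris example, applied with `kubectl apply -f`:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```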

Slide 26

Slide 26 text

First Inference Service

Slide 27

Slide 27 text

First Inference Service: load test
Under the hood
https://kserve.github.io/website/master/get_started/first_isvc/#5-perform-inference
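Requests in the load test follow KServe's v1 inference protocol. A small helper (function name is illustrative) shows the URL path and payload shape each request POSTs:

```python
import json

def build_predict_request(model_name, instances):
    """Return the URL path and JSON body for a KServe v1 ':predict' call."""
    path = f"/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return path, body

# Example: two iris flowers, four features each (sepal/petal measurements)
path, body = build_predict_request(
    "sklearn-iris",
    [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]],
)
```

A load-testing tool then POSTs this body to the service's ingress host with the appropriate `Host` header for the InferenceService.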

Slide 28

Slide 28 text

Thank you!