Slide 1

High-Performance & Cost-Effective Model Deployment Strategies with Amazon SageMaker
Sungmin Kim, Solutions Architect, AWS

Slide 2

What we will cover today
• SageMaker deployment strategy
• Workflow of deploying models in SageMaker
• Inference load testing
• Inference A/B testing
• Model monitoring

Slide 3

Amazon SageMaker: purpose-built tools so you can be 10x more productive
Amazon SageMaker Studio notebooks
• Access ML data – Connect to many data sources such as Amazon S3, Apache Spark, Amazon Redshift, CSV files, and more
• Prepare data – Transform data, browse data sources, explore metadata and schemas, and write queries in popular languages
• Build ML models – Optimized with 150+ popular open-source models and frameworks such as TensorFlow and PyTorch
• Train and tune ML models – Correct performance problems in real time
• Deploy and monitor results – Create, automate, and manage end-to-end ML workflows to improve model quality


Slide 5

Separation of Concerns
WHY YOU NEED AN ML INFERENCE ENDPOINT
(Diagram: the ML app calls the model through an ML app interface)

Slide 6

Why optimize model deployment
Predictions drive complexity and cost in production
Spend: 90% prediction, 10% training

Slide 7

Deploying Models in SageMaker
• Easy deployment of ML models
• Online and offline scoring
• Fully managed infrastructure

Online scoring (real-time endpoint):
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.4xlarge')
prediction = predictor.predict(x_test)

Offline scoring (batch transform):
transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge')
transformer.transform(
    test_data_s3,
    content_type='text/csv')

Slide 8

Benefits when deploying models in SageMaker
• Decouple application code from ML models
• Call models from anywhere
• Full lifecycle support
• No surprises (if it works locally, it will work on AWS)
• Train anywhere
• Self-service deployments

Slide 9

Deploy Models for Inference in SageMaker

Slide 10

SageMaker inference options
• Real-time inference – Low latency, ultra-high throughput, multi-model endpoints, A/B testing
• Asynchronous inference – Near real-time, large payloads (1 GB), long timeouts (15 min)
• Serverless inference – First purpose-built serverless ML inference in the cloud; fully managed; pay only for what you use, billed in milliseconds
• Batch transform – Process large datasets, job-based system

Slide 11

SageMaker Deployment – Real-time Inference
SageMaker Real-time Inference
• Creates a long-running microservice
• Instant response for payloads up to 6 MB
• Accessible from an external application
• Autoscaling

Slide 12

SageMaker Deployment – Real-time Inference
SageMaker Real-time Inference
• Creates a long-running microservice
• Instant response for payloads up to 6 MB
• Accessible from an external application
• Autoscaling

real_time_endpoint = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge", ...)
real_time_endpoint.predict(payload)

Slide 13

Current customer challenges with ML inference
• Increases TCO – Teams spend a lot of time provisioning and managing servers
• Challenging to provision capacity – Data scientists are challenged with selecting optimal instance types and managing autoscaling policies
• End up over-provisioning capacity – Utilization is low and costs are high regardless of the number of requests
• Workloads are intermittent – Some ML workloads have less predictable usage patterns and long periods of inactivity

Slide 14

SageMaker Deployment – Serverless Inference
SageMaker Serverless Inference
• Ideal for unpredictable prediction traffic
• For workloads that can tolerate cold starts
• Autoscaling (down to 0 instances)

Slide 15

SageMaker Deployment – Serverless Inference
SageMaker Serverless Inference
• Ideal for unpredictable prediction traffic
• For workloads that can tolerate cold starts
• Autoscaling (down to 0 instances)

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10
)
serverless_predictor = model.deploy(
    serverless_inference_config=serverless_config
)
serverless_predictor.predict(data)

Slide 16

Other challenges with ML inference
• Model sizes can be large – Deep learning models are complex and large, and may take several minutes to finish processing; existing real-time inference requests time out after 60 seconds
• Some workloads can tolerate some latency – Spinning up batch clusters takes too long, yet customers still need near "real-time" inference
• Inference payloads can be large – Customers need to process large payloads (100s of MB or GB)
• Need to control costs – Customers need an environment that can scale automatically, including down to zero

Slide 17

SageMaker Deployment – Async Inference
SageMaker Asynchronous Inference
• Ideal for large payloads up to 1 GB
• Longer processing timeout, up to 15 min
• Autoscaling (down to 0 instances)
• Suitable for CV/NLP use cases

Slide 18

SageMaker Deployment – Async Inference
SageMaker Asynchronous Inference
• Ideal for large payloads up to 1 GB
• Longer processing timeout, up to 15 min
• Autoscaling (down to 0 instances)
• Suitable for CV/NLP use cases

from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://{s3_bucket}/{bucket_prefix}/output",
    max_concurrent_invocations_per_instance=10,
    notification_config={
        "SuccessTopic": sns_success_topic_arn,
        "ErrorTopic": sns_error_topic_arn
    })
async_predictor = model.deploy(async_inference_config=async_config)
async_predictor.predict_async(input_path=input_s3_path)

Slide 19

SageMaker Deployment – Batch Inference
SageMaker Batch Transform
• Fully managed mini-batching for large data
• Pay only for what you use
• Suitable for periodic arrival of large data

Slide 20

SageMaker Deployment – Batch Inference
SageMaker Batch Transform
• Fully managed mini-batching for large data
• Pay only for what you use
• Suitable for periodic arrival of large data

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://{s3_bucket}/{bucket_prefix}/output")
transformer.transform(
    input_data_s3,
    content_type="text/csv")

Slide 21

SageMaker Model Deployment Options

Real-Time Inference (real-time)
• Low latency
• Multi-model / multi-container endpoints
• A/B testing
• Blue/green deployment guardrails
• CPU/GPU support
• Payload size < 6 MB
• Request timeout: 60 secs
Example use cases: ad serving, personalized recommendations, fraud detection

Serverless Inference (real-time)
• Automatic scaling; no need to select or manage servers
• For workloads that can tolerate cold starts
• Intermittent or unpredictable traffic
• Payload size < 4 MB
• Request timeout: 60 secs
• Limits on maximum concurrent invocations per endpoint
• CPU-only support
Example use cases: test workloads, extracting and analyzing data from documents, form processing

Asynchronous Inference (micro-batch)
• Near real-time
• Large payloads (< 1 GB)
• Long timeouts (15 min)
Example use cases: computer vision, object detection

Batch Transform (batch)
• Process large datasets¹ (max mini-batch size: 100 MB)
• Higher throughput
• Job-based system
Example use cases: data pre-processing, churn prediction, predictive maintenance

¹ Each instance has a 30 GB EBS volume. Maximum dataset size depends on the number of instances in the batch transform job and the instance type. G4dn instances come with their own local SSD storage.

Slide 22

SageMaker inference options

Slide 23

1. Real-time Inference
real_time_endpoint = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge", ...)
real_time_endpoint.predict(payload)

2. Serverless Inference
from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10
)
serverless_predictor = model.deploy(
    serverless_inference_config=serverless_config
)
serverless_predictor.predict(data)

3. Async Inference
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://{s3_bucket}/{bucket_prefix}/output",
    max_concurrent_invocations_per_instance=10,
    notification_config={
        "SuccessTopic": sns_success_topic_arn,
        "ErrorTopic": sns_error_topic_arn
    })
async_predictor = model.deploy(
    async_inference_config=async_config)
async_predictor.predict_async(
    input_path=input_s3_path)

4. Batch Inference
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://{s3_bucket}/{bucket_prefix}/output")
transformer.transform(
    input_data_s3,
    content_type="text/csv")

Slide 24

Cost-effective Model Deployment

Slide 25

Cost Considerations
HOSTING INDIVIDUAL ENDPOINTS
EndpointName='endpoint-05'

Slide 26

SageMaker Deployment – Multi-Model Endpoint
COST-SAVING OPPORTUNITY
SageMaker Multi-Model Endpoint
• Host multiple models in one container
• Direct invocation of the target model
• Improves resource utilization
• Dynamically loads models from Amazon S3

TargetModel='model-007.tar.gz'

Slide 27

SageMaker Deployment – Multi-Model Endpoint
COST-SAVING OPPORTUNITY
SageMaker Multi-Model Endpoint
• Host multiple models in one container
• Direct invocation of the target model
• Improves resource utilization
• Dynamically loads models from Amazon S3

TargetModel='model-013.tar.gz'

Slide 28

SageMaker Deployment – Multi-Model Endpoint
COST-SAVING OPPORTUNITY
Example scenario: ml.c5.xlarge at $0.238/hr, 2 instances running 24/7
• 10 separate endpoints (EP-1/Model 1 … EP-10/Model 10): $3,430/mo.
• 1 multi-model endpoint hosting Models 1–10: $343/mo.
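The slide's figures follow from simple arithmetic; a quick sketch of the calculation (assuming a 720-hour month, which is how the ~$343 figure rounds out):

# Back-of-the-envelope check of the example above (assumes a 720-hour month)
hourly_rate = 0.238            # ml.c5.xlarge, USD per hour
instances = 2                  # instances running 24/7
hours_per_month = 720

cost_per_endpoint = hourly_rate * instances * hours_per_month
print(f"1 multi-model endpoint: ${cost_per_endpoint:,.0f}/mo.")       # ~$343
print(f"10 separate endpoints:  ${10 * cost_per_endpoint:,.0f}/mo.")  # ~$3,430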

Slide 29

SageMaker Deployment – Multi-Model Endpoint
COST-SAVING OPPORTUNITY
SageMaker Multi-Model Endpoint
• Host multiple models in one container
• Direct invocation of the target model
• Improves resource utilization
• Dynamically loads models from Amazon S3

container = {
    'Image': mme-supported-image,
    'ModelDataUrl': 's3://my-bucket/folder-of-tar-gz',
    'Mode': 'MultiModel'}
sm.create_model(
    Containers=[container], ...)
sm.create_endpoint_config()
sm.create_endpoint()
smrt.invoke_endpoint(
    EndpointName=endpoint_name,
    TargetModel='model-007.tar.gz',
    Body=body, ...)

Slide 30

SageMaker Deployment – Multi-Container Endpoint
COST-SAVING OPPORTUNITY
SageMaker Multi-Container Endpoint
• Host up to 15 distinct containers
• Direct or serial invocation
• No cold start (unlike Multi-Model Endpoints)

TargetContainerHostname='Container-05'

Slide 31

SageMaker Deployment – Multi-Container Endpoint
COST-SAVING OPPORTUNITY
SageMaker Multi-Container Endpoint
• Host up to 15 distinct containers
• Direct or serial invocation
• No cold start (unlike Multi-Model Endpoints)

Slide 32

Multi-Container Endpoint: Inference Pipelines
• Reuse the data transformers developed for training models
• Low latency: all containers run on the same underlying EC2 instance
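To make the pipeline concrete, a minimal sketch using the SageMaker Python SDK's PipelineModel; it assumes `role`, `sklearn_model` (a fitted preprocessing model), and `xgb_model` already exist, and the name is illustrative:

from sagemaker.pipeline import PipelineModel

# Containers are invoked in order on the same instance: the output of the
# preprocessing container becomes the input of the prediction container
pipeline_model = PipelineModel(
    name='preprocess-then-predict',   # hypothetical name
    role=role,
    models=[sklearn_model, xgb_model])
pipeline_predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.xlarge')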

Slide 33

SageMaker Deployment – Multi-Container Endpoint
COST-SAVING OPPORTUNITY
SageMaker Multi-Container Endpoint
• Host up to 15 distinct containers
• Direct or serial invocation
• No cold start (unlike Multi-Model Endpoints)

container1 = {
    'Image': container,
    'ContainerHostname': 'firstContainer'}
...
sm.create_model(
    InferenceExecutionConfig={'Mode': 'Direct'},
    Containers=[container1, container2, ...], ...)
sm.create_endpoint_config()
sm.create_endpoint()
smrt.invoke_endpoint(
    EndpointName=endpoint_name,
    TargetContainerHostname='firstContainer',
    Body=body, ...)

Slide 34

Multi-Model vs. Multi-Container
SageMaker Multi-Model Endpoint: TargetModel='model-013.tar.gz'
SageMaker Multi-Container Endpoint: TargetContainerHostname='Container-05'

Slide 35

Multi-Model vs. Multi-Container

Multi-Model Endpoint:
container = {
    'Image': mme-supported-image,
    'ModelDataUrl': 's3://my-bucket/folder-of-tar-gz',
    'Mode': 'MultiModel'}
sm.create_model(
    Containers=[container], ...)
sm.create_endpoint_config()
sm.create_endpoint()
smrt.invoke_endpoint(
    EndpointName=endpoint_name,
    TargetModel='model-007.tar.gz',
    Body=body, ...)

Multi-Container Endpoint:
container1 = {
    'Image': container,
    'ContainerHostname': 'firstContainer'}
...
sm.create_model(
    InferenceExecutionConfig={'Mode': 'Direct'},
    Containers=[container1, container2, ...], ...)
sm.create_endpoint_config()
sm.create_endpoint()
smrt.invoke_endpoint(
    EndpointName=endpoint_name,
    TargetContainerHostname='firstContainer',
    Body=body, ...)

Slide 36

Inference Load Testing

Slide 37

SageMaker ML instance options
BALANCING BETWEEN COST AND PERFORMANCE
• GPU instances (P3, G4) – High throughput and low-latency access to CUDA
• CPU instances (C5) – Low throughput, low cost, most flexible
• Custom chip (Inf1) – High throughput, high performance, and lowest cost in the cloud

Slide 38

Load testing
KNOW YOUR ENDPOINTS
(Diagram: artificial requests flow through Elastic Load Balancing to an Amazon SageMaker endpoint backed by an auto-scaling group of ML instances across Availability Zones 1 and 2, with metrics published to Amazon CloudWatch)
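A minimal sketch of generating such artificial requests in Python, assuming an existing `endpoint_name` that accepts CSV input; it uses plain threads rather than a dedicated load-testing tool, and the payload and concurrency values are illustrative:

import time
from concurrent.futures import ThreadPoolExecutor

import boto3

smrt = boto3.client("sagemaker-runtime")
payload = "0.5,1.2,3.4"  # hypothetical CSV record

def one_request(_):
    # Time a single synchronous invocation of the endpoint
    start = time.time()
    smrt.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload)
    return time.time() - start

with ThreadPoolExecutor(max_workers=20) as pool:          # 20 concurrent clients
    latencies = list(pool.map(one_request, range(1000)))  # 1,000 requests total

latencies.sort()
print(f"p50: {latencies[len(latencies) // 2] * 1000:.0f} ms, "
      f"p99: {latencies[int(len(latencies) * 0.99)] * 1000:.0f} ms")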

Slide 39

Optimizing inference takes skills, time, and effort
• 70+ ML instance types – Selecting the right instance type based on the resource requirements of the ML model and data payloads
• Model tuning – Selecting the right instance size, container parameters, and autoscaling properties to maximize performance
• Manual benchmarking – Performance and load testing to validate that latency and throughput requirements are met and costs are within budget
• Systems for ML – Using ML frameworks with converters, compilers, and kernel libraries specific to different instance types and hardware vendors

Slide 40

SageMaker Inference Recommender
FEATURES
Designed for MLOps engineers and data scientists to reduce the time to get models into production
• Instance recommendations – Instance type recommendations for initial deployments
• Load tests – Run extensive load tests that include production requirements: throughput, latency
• Endpoint recommendations – Get endpoint configuration settings that meet your production requirements

Slide 41

Get started with Inference Recommender
Register your model in the Model Registry with:
1. Container image
2. Model artifacts and sample payload
3. Model metadata
Then Inference Recommender lets you get initial instance recommendations, specify performance requirements and instance types for a custom load test, view and compare performance and cost across different endpoint configurations, and deploy your model.
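A minimal sketch of kicking off a default (instance-recommendation) job with boto3, assuming a model version already registered in the Model Registry; the job name, `role_arn`, and `model_package_arn` are illustrative placeholders:

import boto3

sm = boto3.client('sagemaker')

# JobType='Default' returns instance recommendations; 'Advanced' runs a
# custom load test against the instance types you specify
sm.create_inference_recommendations_job(
    JobName='my-recommender-job',      # hypothetical name
    JobType='Default',
    RoleArn=role_arn,
    InputConfig={'ModelPackageVersionArn': model_package_arn})

# Fetch the ranked endpoint configurations once the job completes
results = sm.describe_inference_recommendations_job(JobName='my-recommender-job')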

Slide 42

Get an instance recommendation in minutes

Slide 43

Run custom load tests across instance types

Slide 44

Review endpoint recommendations

Slide 45

How to choose your Deployment Strategy
A DECISION TREE
• Live predictions? No (daily, hourly, weekly) → Batch Transform
• Yes → Payload > 4 MB or runtime > 60 sec? Yes → SageMaker Asynchronous Inference
• No → Can tolerate cold start? Yes → SageMaker Serverless Inference
• No → Multiple models/containers? No → SageMaker endpoint (if traffic fluctuates, enable auto-scaling; load test to right-size)
• Yes, single ML framework → SageMaker multi-model endpoint; multiple containers → SageMaker multi-container endpoint

Slide 46

Bonus – Model Monitoring and A/B Testing

Slide 47

SageMaker Model Monitor
OPTIMIZING MODEL ACCURACY
• Data drift
• Model quality drift
• Feature importance drift and data bias (via SageMaker Clarify)
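A minimal sketch of wiring this up with the SageMaker Python SDK, assuming an endpoint deployed with data capture enabled and training data in S3; bucket paths, instance type, and the hourly schedule are illustrative:

from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge')

# Profile the training data to produce baseline statistics and constraints
monitor.suggest_baseline(
    baseline_dataset='s3://my-bucket/train/train.csv',
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri='s3://my-bucket/monitoring/baseline')

# Compare captured endpoint traffic against the baseline every hour
monitor.create_monitoring_schedule(
    endpoint_input=endpoint_name,
    schedule_cron_expression=CronExpressionGenerator.hourly())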

Slide 48

Endpoint A/B testing
USING PRODUCTION VARIANTS
Traffic is split across variants according to their weights (Elastic Load Balancing):

sm.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {
            "DesiredWeight": 0.1,
            "VariantName": "new-model"
        },
        {
            "DesiredWeight": 0.9,
            "VariantName": "existing-model"
        }
    ]
)
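For the weight update above to work, the endpoint config must define both variants up front; a minimal sketch, assuming `sm` is the boto3 SageMaker client and the config and model names are illustrative:

sm.create_endpoint_config(
    EndpointConfigName='ab-test-config',          # hypothetical name
    ProductionVariants=[
        {
            'VariantName': 'existing-model',
            'ModelName': 'model-v1',              # current production model
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.c5.xlarge',
            'InitialVariantWeight': 0.9
        },
        {
            'VariantName': 'new-model',
            'ModelName': 'model-v2',              # candidate model
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.c5.xlarge',
            'InitialVariantWeight': 0.1
        }
    ])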

Slide 49

Workflow of Deploying Models in SageMaker

Slide 50

Training from SageMaker Notebooks

Slide 51

SageMaker Python SDK: End-to-End Training and Deployment
Define Estimator (object created) → fit() → deploy() (Predictor object created) → predict()
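A minimal sketch of that flow, assuming a PyTorch training script `train.py` and training data in S3; the S3 paths, instance types, and framework version are illustrative:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(                     # 1. Define Estimator
    entry_point='train.py',
    role=role,
    framework_version='1.5',
    py_version='py3',
    instance_count=1,
    instance_type='ml.m5.xlarge')

estimator.fit({'training': 's3://my-bucket/train'})  # 2. fit() trains the model

predictor = estimator.deploy(            # 3. deploy() creates an endpoint
    initial_instance_count=1,
    instance_type='ml.c5.xlarge')

prediction = predictor.predict(x_test)   # 4. predict() invokes it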

Slide 52

Workflow of Deploying Models in SageMaker
1. Creating a Model
2. Defining the Endpoint Configuration
3. Creating an Endpoint
4. Invoking an Endpoint
(The four steps map onto four API calls, as shown in the sketch below.)
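A hedged boto3 sketch of the four steps; the image URI, S3 paths, names, and payload are illustrative placeholders:

import boto3

sm = boto3.client('sagemaker')
smrt = boto3.client('sagemaker-runtime')

sm.create_model(                                   # 1. Create a Model
    ModelName='my-model',
    PrimaryContainer={
        'Image': inference_image_uri,
        'ModelDataUrl': 's3://my-bucket/model.tar.gz'},
    ExecutionRoleArn=role_arn)

sm.create_endpoint_config(                         # 2. Define the Endpoint Configuration
    EndpointConfigName='my-endpoint-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-model',
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.c5.xlarge'}])

sm.create_endpoint(                                # 3. Create an Endpoint
    EndpointName='my-endpoint',
    EndpointConfigName='my-endpoint-config')

response = smrt.invoke_endpoint(                   # 4. Invoke the Endpoint
    EndpointName='my-endpoint',
    ContentType='text/csv',
    Body='0.5,1.2,3.4')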

Slide 53

Revisit: SageMaker Inference Options

1. Real-time Inference
real_time_endpoint = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge", ...)
real_time_endpoint.predict(payload)

2. Serverless Inference
from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10
)
serverless_predictor = model.deploy(
    serverless_inference_config=serverless_config
)
serverless_predictor.predict(data)

3. Async Inference
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://{s3_bucket}/{bucket_prefix}/output",
    max_concurrent_invocations_per_instance=10,
    notification_config={
        "SuccessTopic": sns_success_topic_arn,
        "ErrorTopic": sns_error_topic_arn
    })
async_predictor = model.deploy(
    async_inference_config=async_config)
async_predictor.predict_async(
    input_path=input_s3_path)

4. Batch Inference
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://{s3_bucket}/{bucket_prefix}/output")
transformer.transform(
    input_data_s3,
    content_type="text/csv")

Slide 54

Deploy Model: How it works
Step 1 – Create Model: packages your model for deployment
• Inference container image – Path to the SageMaker-compatible inference image stored in ECR or a private Docker registry

Slide 55

Deploy Model: How it works
Step 1 – Create Model: packages your model for deployment
• Inference container image
• Model artifact – S3 path to the trained model artifacts (required for SageMaker built-in algorithms)

Slide 56

Deploy Model: How it works
Step 1 – Create Model: packages your model for deployment
• Inference container image
• Model artifact
• IAM role – The role that SageMaker assumes to access the model artifacts and the Docker image for deployment

Slide 57

Deploy Model: How it works
Step 1 – Create Model: packages your model for deployment
• Inference container image
• Model artifact
• IAM role
• Advanced configurations – Options depend on the chosen deployment option; examples include VPC configuration and multi-container and multi-model deployments

Slide 58

Deploy Model: How it works
Step 1 – Create Model: packages your model for deployment (inference container image, model artifact, IAM role, advanced configurations)
Step 2 – Configure & Deploy Model: deploy using the option that best meets the needs of your use case (Real-Time Inference, Serverless Inference, Asynchronous Inference, or Batch Transform)

Slide 59

Create Endpoint – SageMaker Python SDK
Bring your own inference script using SageMaker framework containers

# Create Model: refers to the inference container image (a SageMaker
# framework container) and the model artifacts
model = PyTorchModel(model_data=zipped_model_path,
                     role=get_execution_role(),
                     framework_version='1.5',
                     entry_point='inference.py',
                     py_version='py3',
                     predictor_cls=ImagePredictor)
# Deploy: creates the endpoint
predictor = model.deploy(
    instance_type='ml.t3.medium',
    initial_instance_count=1)
# Predict: runs prediction
predictor.predict(payload)

inference.py defines four handlers:
1. model_fn() – model load
2. input_fn() – input processing
3. predict_fn() – predictions
4. output_fn() – output processing
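A minimal inference.py skeleton for those four handlers; a sketch, assuming the model object was saved with torch.save as model.pth and JSON payloads, since the exact deserialization depends on your model:

import os
import json

import torch

def model_fn(model_dir):
    # Load the model once when the container starts
    model = torch.load(os.path.join(model_dir, 'model.pth'), map_location='cpu')
    model.eval()
    return model

def input_fn(request_body, content_type):
    # Deserialize the request payload into a tensor
    data = json.loads(request_body)
    return torch.tensor(data)

def predict_fn(input_data, model):
    # Run the forward pass without tracking gradients
    with torch.no_grad():
        return model(input_data)

def output_fn(prediction, accept):
    # Serialize the prediction back to the client
    return json.dumps(prediction.tolist())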

Slide 60

Inference with SageMaker Endpoint

Slide 61

Inference Endpoint Update

Slide 62

Updating an endpoint
UpdateEndpoint switches a live endpoint from one endpoint configuration to another:
• Endpoint Configuration 1 → Model 1; Endpoint Configuration 2 → Model 2
• Each model pairs a Docker image (ECR) with model artifacts (S3), plus variant settings: instance type, instance count, variant, ...
Model artifacts layout (model.tar.gz):
model.tar.gz
├── code
│   ├── inference.py
│   └── requirements.txt
└── model.pth
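Alongside the CLI flow on the next slide, a hedged Python SDK sketch of the same update using Predictor.update_endpoint, which creates a new endpoint configuration behind the scenes; the endpoint and model names are illustrative:

from sagemaker.predictor import Predictor

predictor = Predictor(endpoint_name='my-endpoint')
# Apply a new endpoint configuration to the running endpoint
predictor.update_endpoint(
    initial_instance_count=2,
    instance_type='ml.m4.xlarge',
    model_name='model2')   # assumes the new model has already been created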

Slide 63

Updating an endpoint using the AWS CLI
REAL-TIME INFERENCE

# New Model
aws sagemaker create-model \
    --model-name model2 \
    --primary-container '{"Image": "123.dkr.ecr.amazonaws.com/algo", "ModelDataUrl": "s3://bkt/model2.tar.gz"}' \
    --execution-role-arn arn:aws:iam::123:role/me

# New Endpoint Config
aws sagemaker create-endpoint-config \
    --endpoint-config-name model2-config \
    --production-variants '{"InitialInstanceCount": 2, "InstanceType": "ml.m4.xlarge", "InitialVariantWeight": 1, "ModelName": "model2", "VariantName": "AllTraffic"}'

# Same Endpoint
aws sagemaker update-endpoint \
    --endpoint-name my-endpoint \
    --endpoint-config-name model2-config

Slide 64

Summary

Slide 65

Benefits when deploying models in SageMaker
• Separation of concerns – application code is decoupled from ML models behind an ML app interface
• Cost-saving opportunity in production – spend is roughly 90% prediction, 10% training

Slide 66

SageMaker Model Deployment Options

Real-Time Inference
• Latency: low latency, sub-second
• Frequency: continuous
• Data size: payload < 6 MB
• Use case: fraud detection

Serverless Inference
• Latency: low latency, sub-second (tolerates cold start)
• Frequency: unpredictable
• Data size: payload < 4 MB
• Use case: form processing

Asynchronous Inference
• Latency: near real-time, long processing time (< 15 min)
• Frequency: near real-time user
• Data size: large payload (< 1 GB)
• Use case: image analysis

Batch Transform
• Latency: indefinite timeout
• Frequency: event-based / scheduled
• Data size: process large datasets
• Use case: churn prediction

Slide 67

How to choose your Deployment Strategy
A DECISION TREE
• Live predictions? No (daily, hourly, weekly) → Batch Transform
• Yes → Payload > 4 MB or runtime > 60 sec? Yes → SageMaker Asynchronous Inference
• No → Can tolerate cold start? Yes → SageMaker Serverless Inference
• No → Multiple models/containers? No → SageMaker endpoint (if traffic fluctuates, enable auto-scaling; load test to right-size)
• Yes, single ML framework → SageMaker multi-model endpoint; multiple containers → SageMaker multi-container endpoint

Slide 68

Resources and Notebooks
• Amazon SageMaker Workshop: https://sagemaker-immersionday.workshop.aws/
• Amazon SageMaker Examples: https://sagemaker-examples.readthedocs.io/en/latest/
• Amazon SageMaker Python notebook examples: https://github.com/aws/amazon-sagemaker-examples
• Amazon SageMaker Python SDK documentation: https://sagemaker.readthedocs.io/en/stable/
• Amazon SageMaker Developer Guide: https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html

Slide 69

Resources and Notebooks
https://github.com/aws/amazon-sagemaker-examples
• Real-time inference: /advanced_functionality/pytorch_deploy_pretrained_bert_model/pytorch_deploy_pretrained_bert_model.ipynb
• Serverless inference: /serverless-inference/Serverless-Inference-Walkthrough.ipynb
• Async inference: /async-inference/Async-Inference-Walkthrough-SageMaker-Python-SDK.ipynb
• Batch transform: /sagemaker_batch_transform/pytorch_mnist_batch_transform/pytorch-mnist-batch-transform_outputs.ipynb

Slide 70

Resources and Notebooks
https://github.com/aws/amazon-sagemaker-examples
• Multi-model: /advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.ipynb
• Multi-container, direct invocation: /advanced_functionality/multi-container-endpoint/direct-invocation/multi-container-direct-invocation.ipynb
• Multi-container, inference pipeline: /sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference Pipeline with Scikit-learn and Linear Learner.ipynb
• Inference Recommender: /sagemaker-inference-recommender/inference-recommender.ipynb

Slide 71

Thank you!

© 2022, Amazon Web Services, Inc. or its affiliates. All rights reserved.