What we covered today
• SageMaker deployment strategy
• Workflow of deploying models in SageMaker
• Inference load testing
• Inference A/B testing
• Model monitoring
Amazon SageMaker
Purpose-built tools so you can be 10x more productive, with Amazon SageMaker Studio notebooks:
• Access ML data: connect to many data sources such as Amazon S3, Apache Spark, Amazon Redshift, CSV files, and more
• Prepare data: transform data, browse data sources, explore metadata and schemas, and write queries in popular languages
• Build ML models: optimized with 150+ popular open-source models and frameworks such as TensorFlow and PyTorch
• Train and tune ML models: correct performance problems in real time
• Deploy and monitor results: create, automate, and manage end-to-end ML workflows to improve model quality
Benefits when deploying models in SageMaker
• Decouple application code from ML models
• Call models from anywhere
• Full lifecycle support
• No surprises: if it works locally, it will work on AWS
• Train anywhere
• Self-service deployments
SageMaker inference options
• Real-time inference: low latency, ultra-high throughput, multi-model endpoints, A/B testing
• Asynchronous inference: near real-time, large payloads (1 GB), long timeouts (15 min)
• Serverless inference: the first purpose-built serverless ML inference in the cloud; fully managed; pay only for what you use, billed in milliseconds
• Batch transform: process large datasets; job-based system
SageMaker Deployment – Real-time Inference
SageMaker Real-time Inference:
• Creates a long-running microservice
• Instant responses for payloads up to 6 MB
• Accessible from an external application
• Autoscaling
A minimal deployment sketch follows.
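A sketch with the SageMaker Python SDK, assuming image_uri, model_data, and role are placeholders for your own inference image, model artifact, and IAM role:

from sagemaker.model import Model
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

model = Model(
    image_uri=image_uri,        # SageMaker-compatible inference image in ECR
    model_data=model_data,      # s3:// path to model.tar.gz
    role=role)                  # IAM role SageMaker assumes

# Creates the model, the endpoint configuration, and a long-running endpoint
predictor = model.deploy(
    initial_instance_count=2,   # instances behind the endpoint
    instance_type="ml.m5.xlarge",
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer())

result = predictor.predict("4.5,3.2,1.8,0.4")  # payload must stay under 6 MB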
Current customer challenges with ML inference
• Increases TCO: teams spend a lot of time provisioning and managing servers
• Challenging to provision capacity: data scientists are challenged with selecting optimal instance types and managing autoscaling policies
• End up over-provisioning capacity: utilization is low and costs are high regardless of the number of requests
• Workloads are intermittent: some ML workloads have less predictable usage patterns and long periods of inactivity
Other challenges with ML inference
• Need to control costs: customers need an environment that can scale automatically, including down to zero
• Model sizes can be large: deep learning models are complex and large, and may take several minutes to finish processing; existing real-time inference requests time out after 60 seconds
• Inference payloads can be large: customers need to process payloads of hundreds of MB or even GB
• Some workloads can tolerate some latency: spinning up batch clusters takes too long, yet customers still need near-real-time inference
SageMaker Deployment – Async Inference
SageMaker Asynchronous Inference:
• Ideal for large payloads up to 1 GB
• Longer processing timeout, up to 15 min
• Autoscaling (down to 0 instances)
• Suitable for CV/NLP use cases

from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path=f"s3://{s3_bucket}/{bucket_prefix}/output",
    max_concurrent_invocations_per_instance=10,
    notification_config={
        "SuccessTopic": sns_success_topic_arn,
        "ErrorTopic": sns_error_topic_arn})

async_predictor = model.deploy(async_inference_config=async_config)
async_predictor.predict_async(input_path=input_s3_path)
SageMaker Deployment – Batch Inference
SageMaker Batch Transform:
• Fully managed mini-batching for large data
• Pay only for what you use
• Suitable for periodic arrival of large data

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{s3_bucket}/{bucket_prefix}/output")

transformer.transform(
    input_data_s3,
    content_type="text/csv")
SageMaker Model Deployment Options

Real-Time Inference (real-time)
• Low latency
• Multi-model / multi-container endpoints
• A/B testing
• Blue/green deployment guardrails
• CPU/GPU support
• Payload size < 6 MB; request timeout 60 sec
Example use cases: ad serving, personalized recommendations, fraud detection

Serverless Inference (real-time)
• Automatic scaling; no need to select or manage servers
• For workloads that can tolerate cold starts
• Intermittent or unpredictable traffic
• Payload size < 4 MB; request timeout 60 sec
• Limits on maximum concurrent invocations per endpoint
• CPU-only support
Example use cases: test workloads, extracting and analyzing data from documents, form processing

Asynchronous Inference (micro-batch)
• Near real-time
• Large payloads (< 1 GB)
• Long timeouts (15 min)
Example use cases: computer vision, object detection

Batch Transform (batch)
• Process large datasets¹ (max mini-batch size 100 MB)
• Higher throughput
• Job-based system
Example use cases: data pre-processing, churn prediction, predictive maintenance

¹Each instance has a 30 GB EBS volume; maximum dataset size depends on the number and type of instances in the batch transform job. G4dn instances come with their own local SSD storage.
SageMaker Deployment – Multi-model Endpoint (cost-saving opportunity)
SageMaker Multi-Model Endpoint:
• Host multiple models in one container
• Direct invocation of a target model (e.g., TargetModel='model-007.tar.gz')
• Improves resource utilization
• Dynamic loading of models from Amazon S3 (e.g., TargetModel='model-013.tar.gz' is fetched on first use)
SageMaker Deployment – Multi-model Endpoint (cost-saving opportunity)
Example scenario: ml.c5.xlarge at $0.238/hr, 2 instances running 24/7 (about 720 hrs/mo):
• 10 separate endpoints (EP-1 hosting Model 1 through EP-10 hosting Model 10): 10 × 2 × $0.238 × 720 ≈ $3,430/mo
• 1 multi-model endpoint hosting all 10 models: 2 × $0.238 × 720 ≈ $343/mo
SageMaker Deployment – Multi-model Endpoint (cost-saving opportunity)

container = {
    'Image': mme_supported_image,    # image must support multi-model mode
    'ModelDataUrl': 's3://my-bucket/folder-of-tar-gz',
    'Mode': 'MultiModel'}

sm.create_model(Containers=[container], ...)
sm.create_endpoint_config()
sm.create_endpoint()

smrt.invoke_endpoint(
    EndpointName=endpoint_name,
    TargetModel='model-007.tar.gz',  # which model to load and invoke
    Body=body, ...)
SageMaker Deployment – Multi-container Endpoint (cost-saving opportunity)
SageMaker Multi-container Endpoint:
• Host up to 15 distinct containers
• Direct or serial invocation (e.g., TargetContainerHostname='Container-05')
• No cold start, unlike a multi-model endpoint
Multi-container Endpoint: Inference Pipelines
• Reuse the data transformers developed for training models
• Low latency: all containers run on the same underlying EC2 instances
See the sketch below.
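One way to build such a pipeline with the SageMaker Python SDK is PipelineModel. A sketch, assuming preprocessor_model and xgb_model are Model objects you have already created (for example, a scikit-learn transformer and an XGBoost model):

from sagemaker.pipeline import PipelineModel

pipeline_model = PipelineModel(
    name="inference-pipeline",
    role=role,
    models=[preprocessor_model, xgb_model])  # containers run serially, in this order

# All containers are deployed on the same instances; a request flows
# through the transformer first, so training-time preprocessing is reused.
predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge")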
SageMaker Deployment – Multi-container Endpoint (cost-saving opportunity)

container1 = {
    'Image': container,
    'ContainerHostname': 'firstContainer'}
...
sm.create_model(
    InferenceExecutionConfig={'Mode': 'Direct'},  # direct (not serial) invocation
    Containers=[container1, container2, ...],
    ...)
sm.create_endpoint_config()
sm.create_endpoint()

smrt.invoke_endpoint(
    EndpointName=endpoint_name,
    TargetContainerHostname='firstContainer',
    Body=body, ...)
SageMaker ML instance options: balancing cost and performance
• GPU instances (P3, G4): high-throughput, low-latency access to CUDA
• CPU instances (C5): low throughput, low cost, most flexible
• Custom chip (Inf1): high throughput, high performance, and lowest cost in the cloud
Load testing: know your endpoints
Send artificial requests at an Amazon SageMaker endpoint to see how it behaves under load. [Diagram: requests flow through Elastic Load Balancing to an auto-scaling group of ML instances spread across two Availability Zones, with metrics published to Amazon CloudWatch.] A minimal sketch follows.
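A load test can be as simple as firing concurrent artificial requests and measuring latency percentiles while watching the endpoint's CloudWatch metrics (Invocations, ModelLatency). A minimal sketch using boto3; endpoint_name and the CSV payload are placeholders, and dedicated load-testing tools or Inference Recommender do this more rigorously:

import time
from concurrent.futures import ThreadPoolExecutor
import boto3

smrt = boto3.client("sagemaker-runtime")

def one_request(_):
    start = time.time()
    smrt.invoke_endpoint(
        EndpointName=endpoint_name,   # endpoint under test (assumed to exist)
        ContentType="text/csv",
        Body="4.5,3.2,1.8,0.4")
    return time.time() - start

# Fire 1,000 artificial requests from 20 concurrent callers
with ThreadPoolExecutor(max_workers=20) as pool:
    latencies = sorted(pool.map(one_request, range(1000)))

print(f"p50={latencies[500]:.3f}s  p99={latencies[990]:.3f}s")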
Optimizing inference takes skills, time, and effort
• 70+ ML instance types: selecting the right instance type based on the resource requirements of the ML model and data payloads
• Model tuning: selecting the right instance size, container parameters, and autoscaling properties to maximize performance
• Systems for ML: using ML frameworks with converters, compilers, and kernel libraries specific to different instance types and hardware vendors
• Manual benchmarking: performance and load testing to validate that latency and throughput requirements are met and costs are within budget
SageMaker Inference Recommender
Designed for MLOps engineers and data scientists to reduce the time it takes to get models into production.
• Instance recommendations: instance type recommendations for initial deployments
• Load tests: run extensive load tests against production requirements such as throughput and latency
• Endpoint recommendations: get endpoint configuration settings that meet your production requirements
Get started with Inference Recommender
1. Register your model in the model registry, supplying the container image, model artifacts with a sample payload, and model metadata
2. Get initial instance recommendations
3. Specify performance requirements and instance types for a custom load test
4. View and compare performance and cost across different endpoint configurations
5. Deploy your model
A minimal SDK sketch follows.
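A sketch of these steps with the SageMaker Python SDK; the bucket, payload, and group names are placeholders, and the exact right_size arguments may vary by SDK version:

# Register the model in the Model Registry, then ask Inference
# Recommender for instance recommendations.
model_package = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    model_package_group_name="my-model-group",
    inference_instances=["ml.c5.xlarge"],
    transform_instances=["ml.m5.xlarge"])

model_package.right_size(
    sample_payload_url="s3://my-bucket/sample-payload.tar.gz",
    supported_content_types=["text/csv"],
    supported_instance_types=["ml.c5.large", "ml.c5.xlarge", "ml.m5.xlarge"],
    framework="XGBOOST")  # framework hint used by the recommender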
How to choose your Deployment Strategy: a decision tree
• Live predictions?
  • No (daily, hourly, weekly): Batch Transform
  • Yes: can you tolerate a cold start?
    • Yes: payload > 4 MB or processing > 60 sec?
      • Yes: SageMaker async inference
      • No: SageMaker Serverless Inference
    • No: multiple models/containers?
      • No: SageMaker endpoint; for fluctuating traffic, load test to right-size and enable auto-scaling
      • Yes, single ML framework: SageMaker multi-model endpoint
      • Yes, multiple containers: SageMaker multi-container endpoint
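The same tree, written out as a small Python helper so the branching logic is explicit (thresholds taken directly from the tree above):

def choose_deployment(live, cold_start_ok, payload_mb, timeout_s,
                      multiple_models, single_framework):
    """Encodes the decision tree above; returns a deployment option."""
    if not live:                              # daily/hourly/weekly jobs
        return "Batch Transform"
    if cold_start_ok:
        if payload_mb > 4 or timeout_s > 60:
            return "Asynchronous Inference"
        return "Serverless Inference"
    if multiple_models:
        return ("Multi-Model Endpoint" if single_framework
                else "Multi-Container Endpoint")
    return "Real-Time Endpoint (load test to right-size, enable auto-scaling)"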
SageMaker Model Monitor: optimizing model accuracy
Continuously monitor production endpoints for:
• Data drift
• Model quality drift
• Feature importance drift and data bias (via SageMaker Clarify)
A wiring sketch follows.
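A sketch of wiring up Model Monitor with the SageMaker Python SDK: capture traffic at deployment, baseline the training data, then schedule hourly drift checks. Bucket paths and instance types are placeholders:

from sagemaker.model_monitor import (
    DataCaptureConfig, DefaultModelMonitor, CronExpressionGenerator)
from sagemaker.model_monitor.dataset_format import DatasetFormat

# 1. Capture request/response traffic when deploying the endpoint
capture = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/data-capture")
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    data_capture_config=capture)

# 2. Baseline the training data, then schedule hourly drift checks
monitor = DefaultModelMonitor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge")
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline")
monitor.create_monitoring_schedule(
    endpoint_input=predictor.endpoint_name,
    output_s3_uri="s3://my-bucket/monitor-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly())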
Endpoint A/B testing using production variants
Shift traffic between variants by updating their weights; requests are distributed across the variants in proportion to these weights (10% to the new model, 90% to the existing one):

sm.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {"DesiredWeight": 0.1, "VariantName": "new-model"},
        {"DesiredWeight": 0.9, "VariantName": "existing-model"}])
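The weight update above assumes the endpoint was created with two production variants. A sketch of that prerequisite using boto3; the model and endpoint names are placeholders:

import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",
    ProductionVariants=[
        {"VariantName": "existing-model", "ModelName": "model-v1",
         "InstanceType": "ml.m5.xlarge", "InitialInstanceCount": 2,
         "InitialVariantWeight": 0.9},
        {"VariantName": "new-model", "ModelName": "model-v2",
         "InstanceType": "ml.m5.xlarge", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.1}])
sm.create_endpoint(EndpointName="ab-endpoint",
                   EndpointConfigName="ab-test-config")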
Workflow of Deploying Models in SageMaker
1. Creating a model
2. Defining the endpoint configuration
3. Creating an endpoint
4. Invoking an endpoint
Deploy Model: How it works

Step 1: Create Model. Packages your model for deployment. A SageMaker Model is defined by:
• Inference container image: path to the SageMaker-compatible inference image stored in ECR or a private Docker registry
• Model artifact: S3 path to the trained model artifacts (required for SageMaker built-in algorithms)
• IAM role: the role SageMaker assumes to access the model artifacts and the Docker image for deployment
• Advanced configurations: options dependent on the chosen deployment option; examples include VPC configuration and multi-container or multi-model deployments

Step 2: Configure & Deploy Model. Deploy the model using the option that best meets the needs of your use case: Real-Time Inference, Serverless Inference, Asynchronous Inference, or Batch Transform. A minimal boto3 sketch of this workflow follows.
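A boto3 sketch of the whole workflow, assuming placeholder values for image_uri, role_arn, and the CSV payload:

import boto3

sm = boto3.client("sagemaker")
smrt = boto3.client("sagemaker-runtime")

# 1. Create Model: container image + model artifact + IAM role
sm.create_model(
    ModelName="my-model",
    PrimaryContainer={
        "Image": image_uri,  # ECR inference image
        "ModelDataUrl": "s3://my-bucket/model.tar.gz"},
    ExecutionRoleArn=role_arn)

# 2. Endpoint configuration: which model runs on what hardware
sm.create_endpoint_config(
    EndpointConfigName="my-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic", "ModelName": "my-model",
        "InstanceType": "ml.m5.xlarge", "InitialInstanceCount": 1,
        "InitialVariantWeight": 1}])

# 3. Create the endpoint (takes a few minutes to reach InService)
sm.create_endpoint(EndpointName="my-endpoint",
                   EndpointConfigName="my-config")

# 4. Invoke it
response = smrt.invoke_endpoint(
    EndpointName="my-endpoint", ContentType="text/csv",
    Body="4.5,3.2,1.8,0.4")
print(response["Body"].read())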
Updating an endpoint using the AWS CLI (real-time inference)

# New model
aws sagemaker create-model \
    --model-name model2 \
    --primary-container '{"Image": "123.dkr.ecr.amazonaws.com/algo", "ModelDataUrl": "s3://bkt/model2.tar.gz"}' \
    --execution-role-arn arn:aws:iam::123:role/me

# New endpoint config
aws sagemaker create-endpoint-config \
    --endpoint-config-name model2-config \
    --production-variants '[{"InitialInstanceCount": 2, "InstanceType": "ml.m4.xlarge", "InitialVariantWeight": 1, "ModelName": "model2", "VariantName": "AllTraffic"}]'

# Same endpoint
aws sagemaker update-endpoint \
    --endpoint-name my-endpoint \
    --endpoint-config-name model2-config
Benefits when deploying models in SageMaker
• Separation of concerns: a clean interface between the application and the ML model
• Cost-saving opportunity in production: ML spend is roughly 90% prediction and only 10% training
SageMaker Model Deployment Options

• Real-Time Inference
  • Latency: low, sub-second
  • Frequency: continuous
  • Data size: payload < 6 MB
  • Use case: fraud detection
• Serverless Inference
  • Latency: low, sub-second (tolerates cold start)
  • Frequency: unpredictable
  • Data size: payload < 4 MB
  • Use case: form processing
• Asynchronous Inference
  • Latency: near real-time; long processing time (< 15 min)
  • Frequency: near-real-time, user-driven
  • Data size: large payloads (< 1 GB)
  • Use case: image analysis
• Batch Transform
  • Latency: indefinite timeout
  • Frequency: event-based or scheduled
  • Data size: large datasets
  • Use case: churn prediction
Recap: choose your deployment strategy using the decision tree shown earlier.