What we covered today
• SageMaker deployment strategy
• Workflow of deploying models in SageMaker
• Inference load testing
• Inference A/B testing
• Model monitoring
Amazon SageMaker
Purpose-built tools so you can be 10x more productive, with Amazon SageMaker Studio notebooks:
• Access ML data: connect to many data sources such as Amazon S3, Apache Spark, Amazon Redshift, CSV files, and more
• Prepare data: transform data, browse data sources, explore metadata and schemas, and write queries in popular languages
• Build ML models: optimized with 150+ popular open-source models and frameworks such as TensorFlow and PyTorch
• Train and tune ML models: correct performance problems in real time
• Deploy and monitor results: create, automate, and manage end-to-end ML workflows to improve model quality
Benefits when deploying models in SageMaker
• Decouple application code from ML models
• Call models from anywhere
• Full lifecycle support
• No surprises: if it works locally, it will work on AWS
• Train anywhere
• Self-service deployments
SageMaker inference options
• Real-time inference: low latency, ultra-high throughput, multi-model endpoints, A/B testing
• Asynchronous inference: near real-time, large payloads (1 GB), long timeouts (15 min)
• Serverless inference: the first purpose-built serverless ML inference in the cloud; fully managed; pay only for what you use, billed in milliseconds
• Batch transform: process large datasets; job-based system
SageMaker Deployment – Real-time Inference
SageMaker Real-time Inference:
• Creates a long-running microservice
• Instant responses for payloads up to 6 MB
• Accessible from an external application
• Autoscaling
A minimal deployment sketch follows.
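A sketch with the SageMaker Python SDK, assuming image_uri, model_data, and role are placeholders for your own inference image, model artifact, and IAM role:

from sagemaker.model import Model
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

model = Model(
    image_uri=image_uri,        # SageMaker-compatible inference image in ECR
    model_data=model_data,      # s3:// path to model.tar.gz
    role=role)                  # IAM role SageMaker assumes

# Creates the model, the endpoint configuration, and a long-running endpoint
predictor = model.deploy(
    initial_instance_count=2,   # instances behind the endpoint
    instance_type="ml.m5.xlarge",
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer())

result = predictor.predict("4.5,3.2,1.8,0.4")  # payload must stay under 6 MB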
Current customer challenges with ML inference
• Increases TCO: teams spend a lot of time provisioning and managing servers
• Challenging to provision capacity: data scientists are challenged with selecting optimal instance types and managing autoscaling policies
• End up over-provisioning capacity: utilization is low and costs are high regardless of the number of requests
• Workloads are intermittent: some ML workloads have less predictable usage patterns and long periods of inactivity
Other challenges with ML inference
• Need to control costs: customers need an environment that can scale automatically, including down to zero
• Model sizes can be large: deep learning models are complex and large, and may take several minutes to finish processing; existing real-time inference requests time out after 60 seconds
• Inference payloads can be large: customers need to process payloads of hundreds of MB or even GB
• Some workloads can tolerate some latency: spinning up batch clusters takes too long, yet customers still need near-real-time inference
SageMaker Deployment – Async Inference
SageMaker Asynchronous Inference:
• Ideal for large payloads up to 1 GB
• Longer processing timeout, up to 15 min
• Autoscaling (down to 0 instances)
• Suitable for CV/NLP use cases

from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path=f"s3://{s3_bucket}/{bucket_prefix}/output",
    max_concurrent_invocations_per_instance=10,
    notification_config={
        "SuccessTopic": sns_success_topic_arn,
        "ErrorTopic": sns_error_topic_arn})

async_predictor = model.deploy(async_inference_config=async_config)
async_predictor.predict_async(input_path=input_s3_path)
SageMaker Deployment – Batch Inference
SageMaker Batch Transform:
• Fully managed mini-batching for large data
• Pay only for what you use
• Suitable for periodic arrival of large data

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{s3_bucket}/{bucket_prefix}/output")

transformer.transform(
    input_data_s3,
    content_type="text/csv")
SageMaker Model Deployment Options

Real-Time Inference (real-time)
• Low latency
• Multi-model / multi-container endpoints
• A/B testing
• Blue/green deployment guardrails
• CPU/GPU support
• Payload size < 6 MB; request timeout 60 sec
Example use cases: ad serving, personalized recommendations, fraud detection

Serverless Inference (real-time)
• Automatic scaling; no need to select or manage servers
• For workloads that can tolerate cold starts
• Intermittent or unpredictable traffic
• Payload size < 4 MB; request timeout 60 sec
• Limits on maximum concurrent invocations per endpoint
• CPU-only support
Example use cases: test workloads, extracting and analyzing data from documents, form processing

Asynchronous Inference (micro-batch)
• Near real-time
• Large payloads (< 1 GB)
• Long timeouts (15 min)
Example use cases: computer vision, object detection

Batch Transform (batch)
• Process large datasets¹ (max mini-batch size 100 MB)
• Higher throughput
• Job-based system
Example use cases: data pre-processing, churn prediction, predictive maintenance

¹Each instance has a 30 GB EBS volume; maximum dataset size depends on the number and type of instances in the batch transform job. G4dn instances come with their own local SSD storage.
SageMaker Deployment – Multi-model Endpoint (cost-saving opportunity)
SageMaker Multi-Model Endpoint:
• Host multiple models in one container
• Direct invocation of a target model (e.g., TargetModel='model-007.tar.gz')
• Improves resource utilization
• Dynamic loading of models from Amazon S3 (e.g., TargetModel='model-013.tar.gz' is fetched on first use)
SageMaker Deployment – Multi-model Endpoint (cost-saving opportunity)
Example scenario: ml.c5.xlarge at $0.238/hr, 2 instances running 24/7 (about 720 hrs/mo):
• 10 separate endpoints (EP-1 hosting Model 1 through EP-10 hosting Model 10): 10 × 2 × $0.238 × 720 ≈ $3,430/mo
• 1 multi-model endpoint hosting all 10 models: 2 × $0.238 × 720 ≈ $343/mo
SageMaker Deployment – Multi-model Endpoint (cost-saving opportunity)

container = {
    'Image': mme_supported_image,    # image must support multi-model mode
    'ModelDataUrl': 's3://my-bucket/folder-of-tar-gz',
    'Mode': 'MultiModel'}

sm.create_model(Containers=[container], ...)
sm.create_endpoint_config()
sm.create_endpoint()

smrt.invoke_endpoint(
    EndpointName=endpoint_name,
    TargetModel='model-007.tar.gz',  # which model to load and invoke
    Body=body, ...)
SageMaker Deployment – Multi-container Endpoint (cost-saving opportunity)
SageMaker Multi-container Endpoint:
• Host up to 15 distinct containers
• Direct or serial invocation (e.g., TargetContainerHostname='Container-05')
• No cold start, unlike a multi-model endpoint
Multi-container Endpoint: Inference Pipelines
• Reuse the data transformers developed for training models
• Low latency: all containers run on the same underlying EC2 instances
See the sketch below.
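One way to build such a pipeline with the SageMaker Python SDK is PipelineModel. A sketch, assuming preprocessor_model and xgb_model are Model objects you have already created (for example, a scikit-learn transformer and an XGBoost model):

from sagemaker.pipeline import PipelineModel

pipeline_model = PipelineModel(
    name="inference-pipeline",
    role=role,
    models=[preprocessor_model, xgb_model])  # containers run serially, in this order

# All containers are deployed on the same instances; a request flows
# through the transformer first, so training-time preprocessing is reused.
predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge")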
SageMaker Deployment – Multi-container Endpoint (cost-saving opportunity)

container1 = {
    'Image': container,
    'ContainerHostname': 'firstContainer'}
...
sm.create_model(
    InferenceExecutionConfig={'Mode': 'Direct'},  # direct (not serial) invocation
    Containers=[container1, container2, ...],
    ...)
sm.create_endpoint_config()
sm.create_endpoint()

smrt.invoke_endpoint(
    EndpointName=endpoint_name,
    TargetContainerHostname='firstContainer',
    Body=body, ...)
SageMaker ML instance options: balancing cost and performance
• GPU instances (P3, G4): high-throughput, low-latency access to CUDA
• CPU instances (C5): low throughput, low cost, most flexible
• Custom chip (Inf1): high throughput, high performance, and lowest cost in the cloud
Load testing: know your endpoints
Send artificial requests at an Amazon SageMaker endpoint to see how it behaves under load. [Diagram: requests flow through Elastic Load Balancing to an auto-scaling group of ML instances spread across two Availability Zones, with metrics published to Amazon CloudWatch.] A minimal sketch follows.
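A load test can be as simple as firing concurrent artificial requests and measuring latency percentiles while watching the endpoint's CloudWatch metrics (Invocations, ModelLatency). A minimal sketch using boto3; endpoint_name and the CSV payload are placeholders, and dedicated load-testing tools or Inference Recommender do this more rigorously:

import time
from concurrent.futures import ThreadPoolExecutor
import boto3

smrt = boto3.client("sagemaker-runtime")

def one_request(_):
    start = time.time()
    smrt.invoke_endpoint(
        EndpointName=endpoint_name,   # endpoint under test (assumed to exist)
        ContentType="text/csv",
        Body="4.5,3.2,1.8,0.4")
    return time.time() - start

# Fire 1,000 artificial requests from 20 concurrent callers
with ThreadPoolExecutor(max_workers=20) as pool:
    latencies = sorted(pool.map(one_request, range(1000)))

print(f"p50={latencies[500]:.3f}s  p99={latencies[990]:.3f}s")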
Optimizing inference takes skills, time, and effort
• 70+ ML instance types: selecting the right instance type based on the resource requirements of the ML model and data payloads
• Model tuning: selecting the right instance size, container parameters, and autoscaling properties to maximize performance
• Systems for ML: using ML frameworks with converters, compilers, and kernel libraries specific to different instance types and hardware vendors
• Manual benchmarking: performance and load testing to validate that latency and throughput requirements are met and costs are within budget
SageMaker Inference Recommender
Designed for MLOps engineers and data scientists to reduce the time it takes to get models into production.
• Instance recommendations: instance type recommendations for initial deployments
• Load tests: run extensive load tests against production requirements such as throughput and latency
• Endpoint recommendations: get endpoint configuration settings that meet your production requirements
Get started with Inference Recommender
1. Register your model in the model registry, supplying the container image, model artifacts with a sample payload, and model metadata
2. Get initial instance recommendations
3. Specify performance requirements and instance types for a custom load test
4. View and compare performance and cost across different endpoint configurations
5. Deploy your model
A minimal SDK sketch follows.
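A sketch of these steps with the SageMaker Python SDK; the bucket, payload, and group names are placeholders, and the exact right_size arguments may vary by SDK version:

# Register the model in the Model Registry, then ask Inference
# Recommender for instance recommendations.
model_package = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    model_package_group_name="my-model-group",
    inference_instances=["ml.c5.xlarge"],
    transform_instances=["ml.m5.xlarge"])

model_package.right_size(
    sample_payload_url="s3://my-bucket/sample-payload.tar.gz",
    supported_content_types=["text/csv"],
    supported_instance_types=["ml.c5.large", "ml.c5.xlarge", "ml.m5.xlarge"],
    framework="XGBOOST")  # framework hint used by the recommender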
How to choose your Deployment Strategy: a decision tree
• Live predictions?
  • No (daily, hourly, weekly): Batch Transform
  • Yes: can you tolerate a cold start?
    • Yes: payload > 4 MB or processing > 60 sec?
      • Yes: SageMaker async inference
      • No: SageMaker Serverless Inference
    • No: multiple models/containers?
      • No: SageMaker endpoint; for fluctuating traffic, load test to right-size and enable auto-scaling
      • Yes, single ML framework: SageMaker multi-model endpoint
      • Yes, multiple containers: SageMaker multi-container endpoint
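The same tree, written out as a small Python helper so the branching logic is explicit (thresholds taken directly from the tree above):

def choose_deployment(live, cold_start_ok, payload_mb, timeout_s,
                      multiple_models, single_framework):
    """Encodes the decision tree above; returns a deployment option."""
    if not live:                              # daily/hourly/weekly jobs
        return "Batch Transform"
    if cold_start_ok:
        if payload_mb > 4 or timeout_s > 60:
            return "Asynchronous Inference"
        return "Serverless Inference"
    if multiple_models:
        return ("Multi-Model Endpoint" if single_framework
                else "Multi-Container Endpoint")
    return "Real-Time Endpoint (load test to right-size, enable auto-scaling)"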
SageMaker Model Monitor: optimizing model accuracy
Continuously monitor production endpoints for:
• Data drift
• Model quality drift
• Feature importance drift and data bias (via SageMaker Clarify)
A wiring sketch follows.
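A sketch of wiring up Model Monitor with the SageMaker Python SDK: capture traffic at deployment, baseline the training data, then schedule hourly drift checks. Bucket paths and instance types are placeholders:

from sagemaker.model_monitor import (
    DataCaptureConfig, DefaultModelMonitor, CronExpressionGenerator)
from sagemaker.model_monitor.dataset_format import DatasetFormat

# 1. Capture request/response traffic when deploying the endpoint
capture = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/data-capture")
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    data_capture_config=capture)

# 2. Baseline the training data, then schedule hourly drift checks
monitor = DefaultModelMonitor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge")
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline")
monitor.create_monitoring_schedule(
    endpoint_input=predictor.endpoint_name,
    output_s3_uri="s3://my-bucket/monitor-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly())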
Endpoint A/B testing using production variants
Shift traffic between variants by updating their weights; requests are distributed across the variants in proportion to these weights (10% to the new model, 90% to the existing one):

sm.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {"DesiredWeight": 0.1, "VariantName": "new-model"},
        {"DesiredWeight": 0.9, "VariantName": "existing-model"}])
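The weight update above assumes the endpoint was created with two production variants. A sketch of that prerequisite using boto3; the model and endpoint names are placeholders:

import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",
    ProductionVariants=[
        {"VariantName": "existing-model", "ModelName": "model-v1",
         "InstanceType": "ml.m5.xlarge", "InitialInstanceCount": 2,
         "InitialVariantWeight": 0.9},
        {"VariantName": "new-model", "ModelName": "model-v2",
         "InstanceType": "ml.m5.xlarge", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.1}])
sm.create_endpoint(EndpointName="ab-endpoint",
                   EndpointConfigName="ab-test-config")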
Workflow of Deploying Models in SageMaker
1. Creating a model
2. Defining the endpoint configuration
3. Creating an endpoint
4. Invoking an endpoint
Deploy Model: How it works

Step 1: Create Model. Packages your model for deployment. A SageMaker Model is defined by:
• Inference container image: path to the SageMaker-compatible inference image stored in ECR or a private Docker registry
• Model artifact: S3 path to the trained model artifacts (required for SageMaker built-in algorithms)
• IAM role: the role SageMaker assumes to access the model artifacts and the Docker image for deployment
• Advanced configurations: options dependent on the chosen deployment option; examples include VPC configuration and multi-container or multi-model deployments

Step 2: Configure & Deploy Model. Deploy the model using the option that best meets the needs of your use case: Real-Time Inference, Serverless Inference, Asynchronous Inference, or Batch Transform. A minimal boto3 sketch of this workflow follows.
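A boto3 sketch of the whole workflow, assuming placeholder values for image_uri, role_arn, and the CSV payload:

import boto3

sm = boto3.client("sagemaker")
smrt = boto3.client("sagemaker-runtime")

# 1. Create Model: container image + model artifact + IAM role
sm.create_model(
    ModelName="my-model",
    PrimaryContainer={
        "Image": image_uri,  # ECR inference image
        "ModelDataUrl": "s3://my-bucket/model.tar.gz"},
    ExecutionRoleArn=role_arn)

# 2. Endpoint configuration: which model runs on what hardware
sm.create_endpoint_config(
    EndpointConfigName="my-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic", "ModelName": "my-model",
        "InstanceType": "ml.m5.xlarge", "InitialInstanceCount": 1,
        "InitialVariantWeight": 1}])

# 3. Create the endpoint (takes a few minutes to reach InService)
sm.create_endpoint(EndpointName="my-endpoint",
                   EndpointConfigName="my-config")

# 4. Invoke it
response = smrt.invoke_endpoint(
    EndpointName="my-endpoint", ContentType="text/csv",
    Body="4.5,3.2,1.8,0.4")
print(response["Body"].read())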
Updating an endpoint using the AWS CLI (real-time inference)

# New model
aws sagemaker create-model \
    --model-name model2 \
    --primary-container '{"Image": "123.dkr.ecr.amazonaws.com/algo", "ModelDataUrl": "s3://bkt/model2.tar.gz"}' \
    --execution-role-arn arn:aws:iam::123:role/me

# New endpoint config
aws sagemaker create-endpoint-config \
    --endpoint-config-name model2-config \
    --production-variants '[{"InitialInstanceCount": 2, "InstanceType": "ml.m4.xlarge", "InitialVariantWeight": 1, "ModelName": "model2", "VariantName": "AllTraffic"}]'

# Same endpoint
aws sagemaker update-endpoint \
    --endpoint-name my-endpoint \
    --endpoint-config-name model2-config
Benefits when deploying models in SageMaker
• Separation of concerns: a clean interface between the application and the ML model
• Cost-saving opportunity in production: ML spend is roughly 90% prediction and only 10% training
SageMaker Model Deployment Options

• Real-Time Inference
  • Latency: low, sub-second
  • Frequency: continuous
  • Data size: payload < 6 MB
  • Use case: fraud detection
• Serverless Inference
  • Latency: low, sub-second (tolerates cold start)
  • Frequency: unpredictable
  • Data size: payload < 4 MB
  • Use case: form processing
• Asynchronous Inference
  • Latency: near real-time; long processing time (< 15 min)
  • Frequency: near-real-time, user-driven
  • Data size: large payloads (< 1 GB)
  • Use case: image analysis
• Batch Transform
  • Latency: indefinite timeout
  • Frequency: event-based or scheduled
  • Data size: large datasets
  • Use case: churn prediction
Recap: choose your deployment strategy using the decision tree shown earlier.