AWS Startup.fm - ML Platform

Running machine learning in a containerized environment provides significant benefits for portability and reproducibility. Which tools should you choose for container orchestration when building a machine learning platform? This session introduces Amazon EKS (Kubernetes + Kubeflow) and Amazon SageMaker. We discuss what a scalable machine learning platform requires, and how to build and use one. Do not forget the total cost of ownership (TCO) perspective when choosing the best machine learning platform.

Yoshitaka Haribara

July 02, 2020

Transcript

  1. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Amazon Web Services Japan K.K. Startup ML Solutions Architect Yoshitaka Haribara, Ph.D. 2020-07-02 AWS Startup.fm: For those wondering whether to build a machine learning platform on Kubernetes or SageMaker
  2. Self-introduction
      • Yoshitaka Haribara
      • Ph.D. in Information Science and Technology
      • Startup Machine Learning Solutions Architect
      • Technical support for startups and machine learning adoption
      • My favorite AWS services are Amazon SageMaker (ML) and Amazon Braket (quantum computing)
  3. What this session covers: information to help you choose technology for your machine learning infrastructure
      • What a machine learning infrastructure requires
      • An example of building with Kubernetes (k8s) and Kubeflow on Amazon EKS
      • Amazon SageMaker, a managed service for machine learning
      • The Total Cost of Ownership (TCO) perspective for technology selection
  4. What is required for the machine learning infrastructure
  5. Machine learning workloads are iterative processes. [Figure: business problem → framing as an ML problem → data collection and acquisition → data preprocessing → data visualization and analysis → feature engineering → model training → model evaluation → evaluation against business goals (YES: production deployment; NO: iterate).] https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf
  6. What is required for the machine learning infrastructure
      • Data lake
        • Stores structured and unstructured data.
        • Highly available, durable, scalable storage.
      • Model development
        • Python and deep learning frameworks (TensorFlow, PyTorch, Apache MXNet, etc.).
        • Many data scientists prefer Jupyter Notebook or JupyterLab.
      • Training
        • Deep learning uses GPUs because it involves many matrix operations.
        • For complex models and large amounts of data, distributed training on multiple GPUs is also an option.
      • Deployment/Inference
        • You need to host your model to incorporate it into your production environment.
        • Scalability and high availability are required so that inference results are returned whenever the model is called.
  7. Challenges in building a foundation for machine learning
      • Deploying a model for inference consistently requires a unified environment.
        • Dependencies, such as the deep learning framework and its version, must be consistent across environments such as development and production.
      • Multiple data scientists, machine learning engineers, and infrastructure engineers are involved.
        • Technology selection is left to developers, but those who create machine learning models and those who manage the deployment infrastructure have different skill sets.
  8. Components that make up a machine learning application: CUDA and cuDNN; Python scripts; deep learning framework; configuration and hyperparameters
  9. It worked locally, but not in production. [Figure: local laptop, GPU server, and production environments running different versions: v11.0, v10.1, v9.0.]
  10. A solution called Docker containers. [Figure: a container bundling CUDA and cuDNN, the deep learning framework, and the training script train.py.]
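The version mismatch that containers are meant to remove can be made concrete with a small check. This is only a sketch: the slides show the versions v11.0, v10.1 and v9.0 but do not say which component or machine each belongs to, so the inventory below (a `cuda` entry per environment and a shared `framework` version) and the helper name `find_version_drift` are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical inventory: the slide shows v11.0, v10.1 and v9.0 but does not
# say which component or which machine each version belongs to.
ENVIRONMENTS: Dict[str, Dict[str, str]] = {
    "local-laptop": {"cuda": "11.0", "framework": "1.5.0"},
    "gpu-server": {"cuda": "10.1", "framework": "1.5.0"},
    "production": {"cuda": "9.0", "framework": "1.5.0"},
}


def find_version_drift(envs: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the sorted names of components whose version differs between environments."""
    components = set()
    for versions in envs.values():
        components.update(versions)
    drifting = []
    for component in sorted(components):
        # A component missing from an environment shows up as None and counts as drift.
        seen = {versions.get(component) for versions in envs.values()}
        if len(seen) > 1:
            drifting.append(component)
    return drifting


print(find_version_drift(ENVIRONMENTS))
```

When every environment runs the same container image, all entries are identical and the function returns an empty list.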
  11. Challenges in scaling your environment. [Figure: a developer managing Docker containers on EC2 instances in the AWS Cloud.]
  12. Challenges in scaling your environment. [Figure: the developer and the AWS Cloud.]
  13. What is Kubernetes (k8s)? “Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation.” https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/
  14. Why run machine learning on Kubernetes?
      • Easily manage library dependencies with containers.
        • Consistent environment, reproducibility, and traceability.
      • Declaratively control your infrastructure environment for training and deployment.
        • Infrastructure as code and reproducible infrastructure.
      • An open source ecosystem for building ML pipelines.
        • Kubeflow Pipelines, MLflow, Metaflow, ...
        • You can write your own by using a Custom Resource Definition (CRD).
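As a rough illustration of declarative control, a training run can be described as a Kubernetes `Job` manifest rather than as a sequence of commands. The sketch below builds such a manifest as a plain Python dictionary and prints it as JSON. The job name, the image URI, and the helper `training_job_manifest` are placeholders invented for this example, and the `nvidia.com/gpu` limit assumes GPU nodes with the NVIDIA device plugin installed.

```python
import json


def training_job_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Kubernetes batch/v1 Job that runs one training container."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "train",
                            "image": image,
                            "command": ["python", "train.py"],
                            # GPU request; assumes the NVIDIA device plugin is installed
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


manifest = training_job_manifest(
    "pytorch-train",
    "123456789012.dkr.ecr.us-west-2.amazonaws.com/train:latest",  # placeholder image URI
)
print(json.dumps(manifest, indent=2))
```

Saving the output to a file and passing it to `kubectl apply -f` would submit the job, since kubectl accepts JSON as well as YAML. The same file can be version-controlled, which is where the reproducibility comes from.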
  15. Amazon EKS: an example of building with Kubernetes (k8s) and Kubeflow
  16. Amazon Elastic Kubernetes Service (EKS)
  17. Amazon Elastic Kubernetes Service (EKS)
      • Upstream Kubernetes: a certified Kubernetes managed service that works with community tools.
      • Highly available: built for production environments, with high availability across multiple AZs.
      • Integrated with AWS: VPC networking, Elastic Load Balancing, IAM permissions, CloudWatch, and so on.
  18. Kubeflow: Machine Learning Toolkit for Kubernetes
  19. Sample configuration: Kubeflow on AWS (Amazon EKS Workshop > Advanced > Machine Learning using Kubeflow). [Figure: Amazon Elastic Kubernetes Service (EKS) runs model training pods and model serving/inference pods on workers in an auto-scaling node group; Docker images for training and serving are stored in Amazon Elastic Container Registry (ECR).]
  20. Kubeflow Dashboard
  21. Create a Notebook
  22. It starts up in a few minutes, and then you can connect.
  23. The Jupyter Notebook screen opens
  24. Kubeflow Fairing

          # HousingServe, DOCKER_REGISTRY, BackendClass, and BuildContext are defined elsewhere (not shown on this slide).
          from kubeflow.fairing import TrainJob
          train_job = TrainJob(HousingServe,
                               input_files=['ames_dataset/train.csv', "requirements.txt"],
                               docker_registry=DOCKER_REGISTRY,
                               backend=BackendClass(build_context_source=BuildContext))
          train_job.submit()

          from kubeflow.fairing import PredictionEndpoint
          endpoint = PredictionEndpoint(HousingServe,
                                        input_files=['trained_ames_model.dat', "requirements.txt"],
                                        docker_registry=DOCKER_REGISTRY,
                                        service_type='ClusterIP',
                                        backend=BackendClass(build_context_source=BuildContext))
          endpoint.create()
  25. Katib
      • Hyperparameter optimization and Neural Architecture Search (NAS)
  26. Kubeflow Pipelines
      • Building machine learning pipelines
  27. Operational considerations
      • You have to maintain the Kubernetes and Kubeflow environments yourself.
        • Updating Kubernetes clusters (every 3 months).
          • Reference: “Updating an Amazon EKS cluster Kubernetes version”.
        • Upgrading Kubeflow (alpha; limited support).
          • Reference: “Upgrading a Kubeflow Deployment”.
      • There is a gap between the skill sets for developing machine learning models and for operating the infrastructure and workflow engine.
        • Data scientists and machine learning engineers are not always good at infrastructure operations, so ongoing operations and maintenance by infrastructure engineers are often required.
      • What matters is being able to build and deploy models.
        • Can managed services replace the parts built with open source, such as Kubeflow?
  28. Amazon SageMaker: a fully managed service for every developer and data scientist
  29. What you can do with Amazon SageMaker
      • Web-based IDE for machine learning
      • Prepare
        • Collect and prepare training data: fully managed data processing jobs and data labeling workflows
      • Build
        • Choose or bring your own ML algorithm: one-click collaborative notebooks and built-in, high-performance algorithms and models
        • Automatically build and train models
      • Train & Tune
        • Set up and manage environments for training: one-click training
        • Train, debug, and tune models: debugging and optimization
        • Manage training runs: visually track and compare experiments
      • Deploy & Manage
        • Deploy model in production: one-click deployment and autoscaling
        • Monitor models: automatically spot concept drift
        • Add human review of predictions
        • Fully managed with auto-scaling for 75% less
  30. Development environment
      • You can use Jupyter Notebook/Lab as a managed service.
      • Simply choose an instance type and launch it.
      • DL frameworks and common libraries are preinstalled.
  31. Amazon SageMaker Studio (IDE for Machine Learning)
  32. Amazon SageMaker. [Figure: BUILD: Jupyter Notebook/Lab, with data in Amazon S3.] The Jupyter Trademark is registered with the U.S. Patent & Trademark Office.
  33. Amazon SageMaker. [Figure: BUILD: Jupyter Notebook/Lab and Amazon S3; TRAIN: Amazon EC2 P3 instances, with the container image from Amazon ECR.] Use a pre-built Docker image or bring your own container.
  34. Amazon SageMaker. [Figure: BUILD: Jupyter Notebook/Lab and Amazon S3; TRAIN: Amazon EC2 P3 instances.]
      Training job benefits:
      • Launch instances with the API; they stop automatically when training is complete.
      • High-performance instances with per-second billing.
      • Easily reduce costs with Spot Instances.
      • Launch a specified number of instances simultaneously, which makes distributed training easy.
  35. Amazon SageMaker. [Figure: BUILD: Jupyter Notebook/Lab and Amazon S3; TRAIN: Amazon EC2 P3 instances.]
  36. Amazon SageMaker. [Figure: BUILD: Jupyter Notebook/Lab and Amazon S3; INFERENCE: endpoint/batch transform, with the container image from Amazon ECR.]
  37. Amazon SageMaker can be seen as container orchestration specialized for machine learning:
      • Environments are prepared as Docker images.
      • You choose the number and type of instances, and SageMaker deploys your containers.
      • Because everything is called through an API, past training jobs are retained and experiments can be managed.
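To show what "a Docker image plus an instance count and type, submitted through an API" might look like, here is a sketch that assembles a request in the shape of the SageMaker `CreateTrainingJob` API. Nothing is sent to AWS. The helper name `training_job_request`, the job name, the image URI, the role ARN, the bucket paths, the volume size, and the runtime limit are all placeholders chosen for illustration.

```python
def training_job_request(job_name: str, image_uri: str, role_arn: str,
                         train_s3_uri: str, output_s3_uri: str,
                         instance_type: str = "ml.p3.2xlarge",
                         instance_count: int = 1) -> dict:
    """Assemble a request body in the shape of the CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        # The environment is a Docker image
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "training",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # The number and type of instances are just parameters
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 50,  # placeholder size
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # placeholder limit
    }


request = training_job_request(
    job_name="demo-training-job",
    image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/train:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://mybucket/data/train",
    output_s3_uri="s3://mybucket/output",
)
print(request["ResourceConfig"])
```

With credentials configured, a dictionary like this could be passed as `boto3.client("sagemaker").create_training_job(**request)`. Because every job is an API call with a name, the history of past jobs stays queryable afterwards.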
  38. SageMaker Python SDK (compare with the Kubeflow Fairing code)

          import sagemaker
          from sagemaker.pytorch import PyTorch  # Estimator class for each DL framework

          # Parameter names follow SageMaker Python SDK v1, as on the slide;
          # SDK v2 renamed them to instance_count and instance_type.
          estimator = PyTorch("train.py",  # initialize with training script
                              role=sagemaker.get_execution_role(),
                              train_instance_count=1,
                              train_instance_type="ml.p3.2xlarge",
                              framework_version="1.5.0")
          estimator.fit("s3://mybucket/data/train")  # call fit to train
          predictor = estimator.deploy(initial_instance_count=2,  # Multi-AZ when >= 2
                                       instance_type="ml.m5.xlarge")  # deploy to create an endpoint
  39. SageMaker Managed Spot Training: reduce the cost of training models by up to 90% compared with on-demand instances
      • Specify the maximum wait time.
      • Write checkpoints, because training may be interrupted.
      • Customer story on the AWS Blog:
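The advice to write checkpoints can be sketched as a training loop that saves its progress after every epoch and resumes from the last saved epoch when restarted. This is a toy, standard-library-only illustration: the function `train`, the `checkpoint.json` file name, and the simulated interruption are assumptions of the example. In an actual SageMaker job the script would typically write to the local checkpoint directory (by default `/opt/ml/checkpoints`), which SageMaker syncs to the S3 location given as `checkpoint_s3_uri`.

```python
import json
import os
import tempfile
from typing import Optional


def train(checkpoint_dir: str, total_epochs: int, stop_after: Optional[int] = None) -> int:
    """Run a toy training loop that resumes from the last checkpoint.

    Returns the number of epochs completed so far.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "checkpoint.json")

    # Resume from the last completed epoch if a checkpoint exists.
    start = 0
    if os.path.exists(path):
        with open(path) as f:
            start = json.load(f)["epoch"]

    for epoch in range(start, total_epochs):
        if stop_after is not None and epoch >= stop_after:
            return epoch  # simulate a Spot interruption before this epoch runs
        # ... one epoch of real training would run here ...
        with open(path, "w") as f:
            json.dump({"epoch": epoch + 1}, f)  # record the completed epoch
    return total_epochs


with tempfile.TemporaryDirectory() as ckpt_dir:
    done_first = train(ckpt_dir, total_epochs=10, stop_after=4)  # interrupted run
    done_second = train(ckpt_dir, total_epochs=10)  # restarted run resumes at epoch 4
    print(done_first, done_second)
```

The restarted run skips the four epochs that were already completed, so an interruption costs at most the work of one epoch.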
  40. SageMaker Automatic Model Tuning: hyperparameter optimization
  41. Step Functions Data Science SDK (AWS Step Functions)
      • Write workflows and automate machine learning pipelines in Python.
        • AWS Step Functions Data Science SDK
        • Supports Amazon SageMaker, AWS Lambda, and other services.
      • Managed service.
      [Figure: data scientists/developers run git push to AWS CodeCommit or a 3rd party Git repository; a webhook triggers AWS CodeBuild, which runs docker push to Amazon Elastic Container Registry (ECR). The AWS Step Functions workflow runs SageMaker Processing on raw data in Amazon S3 to produce train data and test data in Amazon S3, then an Amazon SageMaker training job/HPO that writes the trained model to Amazon S3, then deploys to Amazon SageMaker batch transform or endpoint(s).]
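Underneath, a Step Functions workflow is a JSON state machine definition. The sketch below does not use the Data Science SDK itself. It only shows, with the standard library, how a linear train-then-deploy pipeline could be expressed as chained `Task` states that point at the Step Functions service integrations for SageMaker. The helper `chain_tasks` and the state names are invented for this example, and the `Parameters` block each task would need is left out, so the result is a skeleton rather than a deployable definition.

```python
import json
from typing import List, Tuple


def chain_tasks(steps: List[Tuple[str, str]]) -> dict:
    """Chain (state name, resource ARN) pairs into a linear state machine definition."""
    if not steps:
        raise ValueError("at least one step is required")
    states = {}
    for index, (name, resource) in enumerate(steps):
        state = {"Type": "Task", "Resource": resource}
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1][0]  # hand over to the following state
        else:
            state["End"] = True  # the last state terminates the workflow
        states[name] = state
    return {"StartAt": steps[0][0], "States": states}


definition = chain_tasks([
    ("Train", "arn:aws:states:::sagemaker:createTrainingJob.sync"),
    ("CreateModel", "arn:aws:states:::sagemaker:createModel"),
    ("CreateEndpointConfig", "arn:aws:states:::sagemaker:createEndpointConfig"),
    ("CreateEndpoint", "arn:aws:states:::sagemaker:createEndpoint"),
])
print(json.dumps(definition, indent=2))
```

The `.sync` suffix on the training resource asks Step Functions to wait until the training job finishes before moving to the next state.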
  42. Support for Apache Airflow
      • A DAG (directed acyclic graph) can be written in Python to manage workflows.
      • A SageMaker operator is also available.
      • Not a managed service (EC2 + RDS required).
      • Reference: “Build end-to-end machine learning workflows with Amazon SageMaker and Apache Airflow”.
      [Figure: Airflow DAG. Raw data → filter long-tailed data (PythonOperator) → cleaned data → convert sparse data format to RecordIO protobuf (PythonOperator) → train data and test data → Amazon SageMaker training/HPO (SageMakerTrainOperator/SageMakerTuningOperator) → model artifact → Amazon SageMaker batch transform (SageMakerTransformOperator) → prediction results → analyze model performance based on test data (PythonOperator).]
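The idea of a DAG written in Python can be illustrated without Airflow. The sketch below is not Airflow code: it uses the standard library's `graphlib` (Python 3.9 or later) to declare the dependencies between the stages in the figure and derive a valid execution order, which is essentially what a workflow scheduler does before running tasks. The task names are made up for this example.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical task names
# loosely following the stages in the figure).
dag = {
    "filter_long_tail": set(),
    "to_recordio_protobuf": {"filter_long_tail"},
    "train_or_tune": {"to_recordio_protobuf"},
    "batch_transform": {"train_or_tune"},
    "analyze_performance": {"batch_transform"},
}

# static_order() yields tasks so that every dependency comes before its dependents.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

In Airflow the same dependencies would be declared between operator instances, and the scheduler would additionally handle retries, scheduling, and logging.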
  43. Get started with Amazon SageMaker (stages: Prepare, Build, Train & Tune, Deploy & Manage)
      • Amazon SageMaker Studio: integrated development environment (IDE) for machine learning
      • Amazon SageMaker Autopilot: automatically build and train models
      • Amazon SageMaker Model Monitor: automatically detect concept drift
      • Amazon SageMaker Notebooks: one-click notebooks with elastic compute
      • Amazon SageMaker Experiments: capture, organize, and compare every step
      • Amazon SageMaker Neo: train once, deploy anywhere
      • AWS Marketplace: pre-built algorithms, models, and data
      • Amazon SageMaker Debugger: debug and profile training runs
      • Automatic Model Tuning: one-click hyperparameter optimization
      • Amazon Augmented AI: add human review of model predictions
      • Amazon SageMaker Ground Truth: build and manage training datasets
      • Processing job: supports Python or Spark
      • One-click training: supports supervised, unsupervised, and RL
      • One-click deployment: supports real-time, batch, and multi-model
      • Amazon Elastic Inference: auto scaling for 75% less
  44. Total Cost of Ownership (TCO)
  45. EKS (Kubernetes) or SageMaker
      • The Total Cost of Ownership (TCO) point of view is very important.
      • It is not just the infrastructure cost (CPU/GPU used for training and inference).
        • Operational load of the ML platform itself.
        • Security, compliance, and governance.
          • Especially important for B2B startups.
          • You can learn more in the Security pillar of the AWS Well-Architected Framework ML Lens.
      In general, you have to accept a higher operational load to gain more degrees of freedom.
  46. TCO in four types of costs (capability and its considerations)
      • Instance provisioning and operations
        • Easy provisioning of instances and building of environments
        • Maintenance such as patching
        • Cost reduction by leveraging Spot Instances, etc.
      • Security and compliance management
        • Data encryption in transit and at rest
        • Permission management
        • Audit trail management
        • Compliance support for SOC, PCI, ISO, FedRAMP, HIPAA, etc.
      • Infrastructure performance optimization
        • Independent performance for each training job
        • Storage and network tuning for distributed training
        • Optimal infrastructure selection and tuning for inference
      • Highly available infrastructure management
        • Availability and durability of data and model artifacts
        • Monitoring, logging, and management of training and inference environments
        • Availability of inference environments (Multi-AZ support and Auto Scaling)
  47. Cost of managing instance provisioning
      • SageMaker
        • Launches instances for each training job and terminates them automatically on your behalf when the job is done.
      • Kubernetes
        • Because the cluster is shared, you have to terminate instances yourself.
        • With an autoscaler, you can use Amazon EC2 Auto Scaling groups to manage node groups.
      Starting and stopping instances can significantly affect costs if there are times when GPU utilization is low.
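The effect of low GPU utilization on cost is simple arithmetic, sketched below. The hourly price of 3.0 and the 20% utilization are hypothetical numbers chosen for the example, not AWS prices or measurements, and the model ignores startup time, storage, and control plane fees.

```python
HOURS_PER_MONTH = 720  # 30 days


def always_on_cost(price_per_hour: float, nodes: int = 1) -> float:
    """Monthly cost of GPU nodes that keep running whether or not jobs are active."""
    return price_per_hour * HOURS_PER_MONTH * nodes


def per_job_cost(price_per_hour: float, utilization: float, nodes: int = 1) -> float:
    """Monthly cost when instances exist only while jobs run (utilization between 0 and 1)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return always_on_cost(price_per_hour, nodes) * utilization


price = 3.0  # hypothetical USD per GPU instance hour, not a quoted AWS price
print(always_on_cost(price), per_job_cost(price, utilization=0.2))
```

Under these assumptions the always-on node costs five times as much as instances that are launched per job and terminated afterwards, which is why automatic termination, or a well-tuned autoscaler on Kubernetes, matters when utilization is low.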
  48. TCO thinking: comparison with Amazon SageMaker
      • Consider the cost of the ML infrastructure itself as well as the operational and management costs.
      • In particular, if your organization is small, it is better to use Amazon SageMaker.

      Cost reduction with Amazon SageMaker in comparison with your own Kubernetes (EKS):

      Organizational size   Data scientists   Cost reduction
      Small                 5                 -90%
      Medium                15                -85%
      Large                 50                -65%
      Giant                 250               -54%

      https://pages.awscloud.com/NAMER-ln-GC-400-machine-learning-sagemaker-tco-learn-ty.html
  49. SageMaker Operator for Kubernetes: call Amazon SageMaker from Kubernetes to train, tune, and deploy models
      • For those who find SageMaker useful but already have a Kubernetes cluster and do not want to learn the SageMaker API/SDK.
      • While managing infrastructure with Kubernetes, you can run part of the work (training, tuning, and deployment) on SageMaker.
      • SageMaker fully manages the infrastructure.
      • Features such as managed spot training and distributed training are also available.
      [Figure: Amazon SageMaker and Kubernetes.]
  50. SageMaker Components for Kubeflow Pipelines: use Kubeflow Pipelines to train, tune, and deploy models with Amazon SageMaker
      • For those who do not want to learn the SageMaker API or SDK, or who think “SageMaker itself seems useful, but I want to use Kubeflow Pipelines”.
      • While creating pipelines and workflows in Kubeflow, you can still run part of the work (training, tuning, and deployment) on SageMaker.
      • SageMaker fully manages the infrastructure.
      • Features such as managed spot training and distributed training are also available.
      [Figure: Amazon SageMaker and Kubeflow Pipelines.]
  51. Machine learning platform choices
      1. Operate Kubernetes (k8s) on Amazon EKS
         • For companies that are accustomed to Kubernetes
         • An example configuration using Kubeflow was introduced.
      2. Use Amazon SageMaker
         • Reasonable for many companies in terms of TCO
         • Environments can be managed as Docker containers
      3. Call Amazon SageMaker from Kubernetes/Kubeflow Pipelines (mix)
         • Move only the machine learning part out of an existing Kubernetes cluster
         • Called through SageMaker Operator for Kubernetes and SageMaker Components for Kubeflow Pipelines
  52. Amazon EKS case studies
      • Mercari
        • “Using Amazon EKS in Mercari Photo Search”
        • https://bit.ly/eks-mercari
      • ABEJA
        • “Challenges in multi-tenant environments where customer application code runs, and arriving at EKS”
        • https://bit.ly/eks-abeja-20
  53. Tens of thousands of customers use Amazon SageMaker
  54. Amazon SageMaker case studies
      • [Report] AI/ML@Tokyo
        • Blog posts on past events.
        • https://aws.amazon.com/jp/blogs/news/tag/ai-mltokyo/
      • [Report] Amazon SageMaker Case Festival
        • Blog posts on the 2019 case study events.
        • https://aws.amazon.com/jp/blogs/news/tag/amazon-sagemaker-fes/
  55. Summary
      • The benefits of containerization for machine learning workloads are significant.
        • Consistency, reproducibility, and traceability of the environment.
      • There are several options for container orchestration.
        • Consider TCO to choose the best one.
        • Amazon SageMaker is the first candidate for most startups.
      • Tool selection also impacts development speed.
        • If the choices are overwhelming, consult AWS Solutions Architects.
  56. Thank you! @_hariby
  57. Appendix
  58. Materials (Kubeflow)
      • EKS Workshop: Machine Learning Using Kubeflow
        • https://www.eksworkshop.com/advanced/420_kubeflow/
        • Walks through the steps to build a Kubeflow environment.
      • eks-kubeflow-cloudformation-quick-start
        • https://github.com/aws-samples/eks-kubeflow-cloudformation-quick-start
        • For when you just want to try Kubeflow on EKS quickly.
  59. Materials (Amazon SageMaker)
      • SageMaker Examples JP
        • https://github.com/aws-samples/amazon-sagemaker-examples-jp
        • Sample notebooks in Japanese
      • SageMaker Examples
        • https://github.com/awslabs/amazon-sagemaker-examples
        • Sample notebooks