
O'Reilly AI Conference - Containerized Architectures for Deep Learning


Container and cloud native technologies around Kubernetes have become the de facto standard in modern ML and AI application development. And while many data scientists and engineers tend to focus on tools, the platform that enables these tools is equally important and often overlooked.
Let’s examine some common architecture blueprints and popular technologies used to integrate AI into existing infrastructures, and learn how you can build a production-ready containerized platform for deep learning.
In particular, she explores Docker and Kubernetes, together with their associated cloud native technologies, and their use and advantages in ML/AI environments.

Antje Barth

October 16, 2019

Transcript

  1. Me • Data Enthusiast • Technical Evangelist • AI / ML / Deep Learning • Container / Kubernetes • Big Data • #CodeLikeAGirl
  2. Agenda • Motivation • ML pipeline tools and platforms • Machine Learning on Kubernetes • Deep Learning Demo • Conclusion
  3. Agenda • Motivation • ML pipeline tools and platforms • Machine Learning on Kubernetes • Deep Learning Demo • Conclusion
  4. ML – The (enterprise) reality • Wrangle large datasets • Unify disparate systems • Composability • Manage pipeline complexity • Improve training/serving consistency • Improve portability • Improve model quality • Manage versions [Diagram: building a model spans data ingestion, data analysis, data transform, data validation, data splitting, ad-hoc training, model validation, logging, roll-out, serving, monitoring, distributed training, training at scale, data versioning, HP tuning, experiment tracking, and a feature store, scattered across many disparate systems]
  5. Agenda • Motivation • ML pipeline tools and platforms • Machine Learning on Kubernetes • Deep Learning Demo • Conclusion
  6. Quick comparison • Apache Airflow is a platform to programmatically author, schedule and monitor workflows. • The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. • TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. • MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. https://airflow.apache.org/ https://www.kubeflow.org/ https://www.tensorflow.org/tfx https://mlflow.org/
  7. Kubernetes is an API and agents • The Kubernetes API provides containers with scheduling, configuration, networking, and storage • The Kubernetes runtime (its agents) manages the containers
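The slide's point, that Kubernetes is fundamentally an API over declarative objects, can be sketched with a minimal Pod manifest expressed as plain data (the name, image, and GPU limit are made-up illustrations, not taken from the deck):

```python
import json

# Hypothetical Pod manifest as plain data: you declare the desired state
# (image, resources); the Kubernetes agents (scheduler, kubelet) then
# reconcile the cluster's actual state toward it.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job"},  # made-up name
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "tensorflow/tensorflow:latest-gpu",
            # request one GPU; the scheduler places the pod on a GPU node
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }]
    },
}

manifest = json.dumps(pod, indent=2)  # what a client would submit to the API server
print(manifest)
```

The declarative shape is the key design choice: tools like Kubeflow build on it by defining their own object kinds on top of the same API machinery.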
  8. Agenda • Motivation • ML pipeline tools and platforms • Machine Learning on Kubernetes • Deep Learning Demo • Conclusion
  9. Machine Learning on Kubernetes • Kubernetes-native • Run wherever k8s runs • Move between local – dev – test – prod – cloud • Use k8s to manage ML tasks • CRDs (Custom Resource Definitions) for distributed training • Adopt k8s patterns • Microservices • Manage infrastructure declaratively • Support for multiple ML frameworks • TensorFlow, PyTorch, scikit-learn, XGBoost, etc.
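The distributed training that CRDs coordinate boils down to data-parallel workers that each compute a gradient on their shard and then average the results; a toy, stdlib-only sketch of that pattern (threads stand in for the worker pods Kubernetes would schedule, and the loss function is an invented one-parameter example):

```python
from concurrent.futures import ThreadPoolExecutor

def worker_grad(shard, w):
    # Each worker computes the gradient of the toy loss (w*x - x)^2
    # on its own data shard.
    return sum(2 * (w * x - x) * x for x in shard) / len(shard)

def train_step(shards, w, lr=0.01):
    # "All-reduce" step: gather per-worker gradients and average them,
    # which is what a distributed training operator coordinates across pods.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        grads = list(pool.map(lambda s: worker_grad(s, w), shards))
    return w - lr * sum(grads) / len(grads)

shards = [[1.0, 2.0], [3.0, 4.0]]  # two workers, two data shards
w = 0.0
for _ in range(50):
    w = train_step(shards, w)
# w converges toward the optimum w = 1
```

Real frameworks (Horovod, tf.distribute) do the same averaging over the network; the CRD's job is only to declare how many such workers exist and let Kubernetes place them.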
  10. Introducing Kubeflow • Make it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere.
  11. Composability • Build and deploy reusable, portable, scalable machine learning workflows based on Docker containers. • Use the libraries/frameworks of your choice. Example: the Kubeflow "deployer" component lets you deploy as a plain TF Serving model server: https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/deployer
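Conceptually, a Kubeflow pipeline composes independent containerized components, each consuming its predecessor's output artifact; a toy sketch with plain Python functions standing in for the container images (step names, data, and the "model" are all made up for illustration):

```python
def ingest():
    # stand-in for a data-ingestion container producing a raw dataset
    return [1.0, 2.0, 3.0, 4.0]

def transform(data):
    # stand-in for a transform container: normalize into [0, 1]
    hi = max(data)
    return [x / hi for x in data]

def train(data):
    # stand-in for a training container: the "model" is just the mean here
    return sum(data) / len(data)

def pipeline():
    # Composition: each step sees only its predecessor's output artifact,
    # which is what makes components reusable and swappable.
    return train(transform(ingest()))

model = pipeline()
```

In the real system each function would be a container image with declared inputs and outputs, so any step, such as the deployer component the slide links to, can be swapped without touching the others.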
  12. Back to our ML enterprise workflow! [Diagram: the same pipeline stages – data ingestion, analysis, transform, validation, splitting, ad-hoc training, model validation, logging, roll-out, serving, monitoring, distributed training, training at scale, data versioning, HP tuning, experiment tracking, feature store – now underpinned by shared metadata and serving layers]
  13. Portability • Containers for Deep Learning [Diagram: layered stack – infrastructure, host OS, NVIDIA drivers, container runtime, and on top a TensorFlow container image bundling Python, TensorFlow (CPU: mkl; GPU: CUDA toolkit, cuDNN, cuBLAS, NCCL), Keras, Horovod, NumPy, SciPy, scikit-learn, pandas, OpenMPI and others] Python ML environments that are:
  14. [Diagram: the same TensorFlow container image is built on a development system, pushed to a container registry, and pulled onto a training cluster; the stack (host OS, NVIDIA drivers, container runtime, container image contents) is identical on both sides]
  15. Scalability • Kubernetes – Autoscaling Jobs • Describe the job, let Kubernetes take care of the rest • CPU, RAM, accelerators • TF Jobs delete themselves when finished; the node pool will autoscale back down. Data Scientist to IT Ops: "Model works great! But I need six nodes." Credit: @aronchick
  16. Scalability • Kubernetes – Autoscaling Jobs • Describe the job, let Kubernetes take care of the rest • CPU, RAM, accelerators • TF Jobs delete themselves when finished; the node pool will autoscale back down.
      apiVersion: "kubeflow.org/v1alpha1"
      kind: "TFJob"
      spec:
        replicaSpecs:
          replicas: 6
          CPU: 1
          GPU: 1
          containers: gcr.io/myco/myjob:1.0
    Credit: @aronchick
  17. Scalability • Kubernetes – Autoscaling Jobs • Describe the job, let Kubernetes take care of the rest • CPU, RAM, accelerators • TF Jobs delete themselves when finished; the node pool will autoscale back down. [Diagram: six GPU nodes are provisioned for the job] Credit: @aronchick
  18. Scalability • Kubernetes – Autoscaling Jobs • Describe the job, let Kubernetes take care of the rest • CPU, RAM, accelerators • TF Jobs delete themselves when finished; the node pool will autoscale back down. "Job's done!" Credit: @aronchick
  19. Agenda • Motivation • ML pipeline tools and platforms • Container > Kubernetes > Kubeflow • Deep Learning Demo • Conclusion
  20. Recap: The "Kube"flow • Deploy Kubernetes & Kubeflow • Experiment in Jupyter • Build Docker Image • Train at Scale • Build Model Server • Deploy Model • Integrate Model into App • Operate [Diagram: a data scientist works in a Jupyter notebook; Dockerfiles build a training job and an inference service; training pods run across Kubernetes worker nodes; Seldon Core engines serve the Doppelganger model behind an Istio gateway that routes traffic to a REST API, called via curl]
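The serving path in the recap ends with a REST call through the Istio gateway to Seldon Core; a small sketch of building that request body, assuming Seldon's ndarray JSON payload convention (the host, path, and feature values are made up for illustration):

```python
import json

def make_request(features):
    # Seldon Core's REST protocol wraps inputs as {"data": {"ndarray": [...]}}
    # (payload shape assumed from Seldon's v1 prediction protocol).
    return json.dumps({"data": {"ndarray": [features]}})

body = make_request([0.1, 0.2, 0.3])
# Command-line equivalent (hypothetical gateway host and deployment path):
#   curl -H "Content-Type: application/json" \
#        -d '{"data":{"ndarray":[[0.1,0.2,0.3]]}}' \
#        http://<istio-gateway>/seldon/default/doppelganger/api/v1.0/predictions
parsed = json.loads(body)
```

Because the model sits behind a plain HTTP API, "Integrate Model into App" reduces to an ordinary service call from the application side.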
  21. Agenda • Motivation • ML pipeline tools and platforms • Machine Learning on Kubernetes • Deep Learning Demo • Conclusion
  22. Conclusion & Take-aways • Platform matters • Composability – Portability – Scalability • Containerized architectures • Kubernetes + Machine Learning = Kubeflow • Start building! https://github.com/antje/doppelganger
  23. More information • Kubeflow https://www.kubeflow.org/ https://github.com/kubeflow/kubeflow • TensorFlow Extended (TFX) https://www.tensorflow.org/tfx • The Definitive Guide to Machine Learning Platforms https://twimlai.com/mlplatforms-ebook/ • Amazon Elastic Kubernetes Service (Amazon EKS) https://eksworkshop.com https://github.com/aws-samples/machine-learning-using-k8s