Cape Town Machine Learning Meetup

Antje Barth
November 21, 2019

Containerized architectures for deep learning


Transcript

  1. Containerized Architectures for Deep Learning
     Antje Barth, Technical Evangelist, AI and Machine Learning (@ANBARTH)
     Cape Town Machine Learning Meetup, November 21, 2019
  2. Me
     • Data Enthusiast
     • Technical Evangelist, AI / ML / Deep Learning
     • Machine Learning on Kubernetes
     • Big Data
     • #CodeLikeAGirl
  3. Agenda
     • Motivation
     • ML pipeline tools and platforms
     • Machine Learning on Kubernetes
     • Deep Learning Demo
     • Conclusion
  5. ML – The (enterprise) reality
     • Wrangle large datasets
     • Unify disparate systems
     • Composability
     • Manage pipeline complexity
     • Improve training/serving consistency
     • Improve portability
     • Improve model quality
     • Manage versions
     [Diagram: ML workflow stages: Building a model, Data ingestion, Data analysis, Data transform, Data validation, Data splitting, Ad-hoc Training, Model validation, Logging, Roll-out, Serving, Monitoring, Distributed Training, Training at scale, Data Versioning, HP Tuning, Experiment Tracking, Feature Store. Overlay labels: System 1, System 1.5, System 2, System 3, System 3.5, System 4, System 5, System 6.]
  7. Quick comparison
     • Apache Airflow is a platform to programmatically author, schedule and monitor workflows. https://airflow.apache.org/
     • The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. https://www.kubeflow.org/
     • TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. https://www.tensorflow.org/tfx
     • MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. https://mlflow.org/
  8. Kubernetes is an API and agents
     • The Kubernetes API provides containers with scheduling, configuration, networking, and storage.
     • The Kubernetes runtime manages the containers.
  10. Machine Learning on Kubernetes
      • Kubernetes-native
      • Run wherever k8s runs
      • Move between local – dev – test – prod – cloud
      • Use k8s to manage ML tasks
      • CRDs (UDTs) for distributed training
      • Adopt k8s patterns
      • Microservices
      • Manage infrastructure declaratively
      • Support for multiple ML frameworks
      • TensorFlow, PyTorch, scikit-learn, XGBoost, etc.
  11. Introducing Kubeflow
      Make it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere.
  12. Composability
      • Build and deploy reusable, portable, scalable machine learning workflows based on Docker containers.
      • Use the libraries/frameworks of your choice.
      Example: the Kubeflow "deployer" component lets you deploy as a plain TF Serving model server: https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/deployer
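The idea on slide 12, a workflow assembled from steps that each run in their own container, can be sketched in plain Python. This is only an illustration of the composition idea, not the Kubeflow Pipelines SDK: the `Step` class, the step names, and the `example.registry/...` image references are all hypothetical, and the sketch assumes Python 3.9+ for `graphlib`.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Step:
    """One pipeline step: a container image plus the steps it runs after."""
    name: str
    image: str          # hypothetical image reference
    after: tuple = ()   # names of upstream steps


def execution_order(steps):
    """Return step names in an order that respects every dependency."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {step.name: set(step.after) for step in steps}
    return list(TopologicalSorter(graph).static_order())


# A hypothetical four-step workflow; any step's image could be swapped
# for one built with a different library or framework.
pipeline = [
    Step("ingest", "example.registry/ingest:1.0"),
    Step("transform", "example.registry/transform:1.0", after=("ingest",)),
    Step("train", "example.registry/train:1.0", after=("transform",)),
    Step("deploy", "example.registry/deployer:1.0", after=("train",)),
]

order = execution_order(pipeline)
print(order)
```

Because each step is identified only by an image and its upstream dependencies, replacing one step leaves the rest of the workflow untouched, which is the composability property the slide describes.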
  13. Back to our ML enterprise workflow!
      [Diagram: ML workflow stages: Building a model, Data ingestion, Data analysis, Data transform, Data validation, Data splitting, Ad-hoc Training, Model validation, Logging, Roll-out, Serving, Monitoring, Distributed Training, Training at scale, Data Versioning, HP Tuning, Experiment Tracking, Feature Store. Overlay labels: Metadata, Serving.]
  14. Portability
      Containers for Deep Learning
      [Diagram: stack of Infrastructure, Host OS, NVIDIA drivers, Container runtime, and a TensorFlow container image. Packages in the image: TensorFlow, Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others… CPU: mkl. GPU: cudnn, cublas, NCCL, CUDA toolkit.]
      ML environments that are:
  15. [Diagram: the same TensorFlow container image (TensorFlow, Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others…; CPU: mkl; GPU: cudnn, cublas, NCCL, CUDA toolkit) shown on two hosts, a Development System and a Training Cluster, each with its own Container runtime, NVIDIA drivers and Host OS. A Container registry sits between them, labelled push and pull.]
  16. Scalability
      • Kubernetes - Autoscaling Jobs
      • Describe the job, let Kubernetes take care of the rest
      • CPU, RAM, Accelerators
      • TF Jobs delete themselves when finished, node pool will auto scale back down
      [Illustration: Data Scientist and IT Ops. Speech bubble: "Model works great! But I need six nodes."]
      Credit: @aronchick
  17. Scalability (continued)
      [Illustration: Data Scientist and IT Ops, with a simplified job description:]
        apiVersion: "kubeflow.org/v1alpha1"
        kind: "TFJob"
        spec:
          replicaSpecs:
            replicas: 6
            CPU: 1
            GPU: 1
            containers: gcr.io/myco/myjob:1.0
      Credit: @aronchick
  18. Scalability (continued)
      [Illustration: Data Scientist and IT Ops, with six GPUs.]
      Credit: @aronchick
  19. Scalability (continued)
      [Illustration: Data Scientist and IT Ops. Speech bubble: "Job’s done!"]
      Credit: @aronchick
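The job description on slide 17 is slide shorthand rather than a manifest that can be applied as written. As a rough sketch of what "describe the job" might look like in full, the Python below builds a TFJob manifest for the same six single-GPU replicas. It assumes the later `kubeflow.org/v1` TFJob layout (`tfReplicaSpecs`, a `Worker` replica type, a container named `tensorflow`, the `nvidia.com/gpu` resource name), which differs from the `v1alpha1` version named on the slide; the job name `myjob` is hypothetical. Check the field names against the operator version installed on your cluster.

```python
import json


def tfjob_manifest(name, image, workers, gpus_per_worker=1):
    """Build a TFJob manifest as a plain dict.

    Assumption: field names follow the kubeflow.org/v1 TFJob schema,
    not the v1alpha1 shorthand shown on the slide.
    """
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "name": "tensorflow",
                                    "image": image,
                                    # One GPU per replica, as on the slide.
                                    "resources": {
                                        "limits": {"nvidia.com/gpu": gpus_per_worker}
                                    },
                                }
                            ]
                        }
                    },
                }
            }
        },
    }


# Image reference taken from the slide; the job name is made up.
manifest = tfjob_manifest("myjob", "gcr.io/myco/myjob:1.0", workers=6)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON as well as YAML.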
  20. Agenda
      • Motivation
      • ML pipeline tools and platforms
      • Container > Kubernetes > Kubeflow
      • Deep Learning Demo
      • Conclusion
  21. Recap: The “Kube”flow
      • Deploy Kubernetes & Kubeflow
      • Experiment in Jupyter
      • Build Docker Image
      • Train at Scale
      • Build Model Server
      • Deploy Model
      • Integrate Model into App
      • Operate Model
      [Diagram: Training and Model Serving on Kubernetes Worker Nodes #1, #2, #3. Labels: Data Scientist, Jupyter Notebook, Dockerfile Training Job, Pods running Train Model, Dockerfile Inference Service, Pods running Seldon Core Engine with the Doppelganger Model, Istio Gateway (Traffic Routing), {REST API}, curl…]
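The "curl…" and "{REST API}" labels on slide 21 suggest that the deployed model is called over HTTP through the Istio gateway. The sketch below builds, but does not send, such a request in Python. The URL path pattern and the `{"data": {"ndarray": ...}}` payload follow Seldon Core's v1 REST protocol as I understand it; the host, namespace, deployment name and input values are hypothetical, and the real Doppelganger model may well expect a different input (for example an encoded image).

```python
import json
import urllib.request


def build_prediction_request(host, namespace, deployment, rows):
    """Build (but do not send) a POST request for a Seldon Core REST endpoint.

    Assumption: the gateway exposes the Seldon v1 path
    /seldon/<namespace>/<deployment>/api/v1.0/predictions.
    """
    url = f"http://{host}/seldon/{namespace}/{deployment}/api/v1.0/predictions"
    body = json.dumps({"data": {"ndarray": rows}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical gateway host, namespace, deployment name and input row.
req = build_prediction_request(
    "istio-gateway.example.com", "kubeflow", "doppelganger", [[0.1, 0.2, 0.3]]
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without a cluster.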
  23. Conclusion & Take-aways
      • Platform matters
      • Composability – Portability – Scalability
      • Containerized architectures
      • Kubernetes + Machine Learning = Kubeflow
      • Start building! https://github.com/antje/doppelganger
  24. More information
      • Kubeflow: https://www.kubeflow.org/ and https://github.com/kubeflow/kubeflow
      • TensorFlow Extended (TFX): https://www.tensorflow.org/tfx
      • The Definitive Guide to Machine Learning Platforms: https://twimlai.com/mlplatforms-ebook/
      • Amazon Elastic Kubernetes Service (Amazon EKS): https://eksworkshop.com and https://github.com/aws-samples/machine-learning-using-k8s
  25. Kubeflow hands-on / workshops
      • Hands-on Learning with KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTorch + XGBoost + Airflow + MLflow + Spark + Jupyter: https://www.eventbrite.com/e/full-day-workshop-kubeflow-gpu-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-tickets-63362929227
      • https://pipeline.ai >> Workshop
      • https://community.pipeline.ai/