
O'Reilly AI Conference - Containerized Architectures for Deep Learning


Container and cloud native technologies around Kubernetes have become the de facto standard in modern ML and AI application development. And while many data scientists and engineers tend to focus on tools, the platform that enables these tools is equally important and often overlooked.
Let’s examine some common architecture blueprints and popular technologies used to integrate AI into existing infrastructures, and learn how you can build a production-ready containerized platform for deep learning.
In particular, she explores Docker and Kubernetes, together with their associated cloud native technologies, and their use and advantages in ML/AI environments.

Antje Barth

October 16, 2019

Transcript

  1. Me • Data Enthusiast • Technical Evangelist • AI / ML / Deep Learning • Container / Kubernetes • Big Data • #CodeLikeAGirl
  2. Agenda • Motivation • ML pipeline tools and platforms • Machine Learning on Kubernetes • Deep Learning Demo • Conclusion
  3. Agenda • Motivation • ML pipeline tools and platforms • Machine Learning on Kubernetes • Deep Learning Demo • Conclusion
  4. ML – The (enterprise) reality • Wrangle large datasets • Unify disparate systems • Composability • Manage pipeline complexity • Improve training/serving consistency • Improve portability • Improve model quality • Manage versions [Diagram: building a model spans data ingestion, data analysis, data transform, data validation, data splitting, ad-hoc training, model validation, logging, roll-out, serving, monitoring, distributed training, training at scale, data versioning, HP tuning, experiment tracking, and a feature store, scattered across many disparate systems]
  5. Agenda • Motivation • ML pipeline tools and platforms • Machine Learning on Kubernetes • Deep Learning Demo • Conclusion
  6. Quick comparison • Apache Airflow is a platform to programmatically author, schedule and monitor workflows. • The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. • TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. • MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. https://airflow.apache.org/ https://www.kubeflow.org/ https://www.tensorflow.org/tfx https://mlflow.org/
  7. Kubernetes is an API and agents • The Kubernetes API provides containers with scheduling, configuration, networking, and storage • The Kubernetes runtime (its agents) manages the containers
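The slide's point, that Kubernetes is fundamentally an API over declarative objects, can be sketched with a minimal Pod manifest expressed as plain data (the name, image, and GPU limit are made-up illustrations, not taken from the deck):

```python
import json

# Hypothetical Pod manifest as plain data: you declare the desired state
# (image, resources); the Kubernetes agents (scheduler, kubelet) then
# reconcile the cluster's actual state toward it.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job"},  # made-up name
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "tensorflow/tensorflow:latest-gpu",
            # request one GPU; the scheduler places the pod on a GPU node
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }]
    },
}

manifest = json.dumps(pod, indent=2)  # what a client would submit to the API server
print(manifest)
```

The declarative shape is the key design choice: tools like Kubeflow build on it by defining their own object kinds on top of the same API machinery.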
  8. Agenda • Motivation • ML pipeline tools and platforms • Machine Learning on Kubernetes • Deep Learning Demo • Conclusion
  9. Machine Learning on Kubernetes • Kubernetes-native • Run wherever k8s runs • Move between local – dev – test – prod – cloud • Use k8s to manage ML tasks • CRDs (Custom Resource Definitions) for distributed training • Adopt k8s patterns • Microservices • Manage infrastructure declaratively • Support for multiple ML frameworks • TensorFlow, PyTorch, scikit-learn, XGBoost, etc.
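The distributed training that CRDs coordinate boils down to data-parallel workers that each compute a gradient on their shard and then average the results; a toy, stdlib-only sketch of that pattern (threads stand in for the worker pods Kubernetes would schedule, and the loss function is an invented one-parameter example):

```python
from concurrent.futures import ThreadPoolExecutor

def worker_grad(shard, w):
    # Each worker computes the gradient of the toy loss (w*x - x)^2
    # on its own data shard.
    return sum(2 * (w * x - x) * x for x in shard) / len(shard)

def train_step(shards, w, lr=0.01):
    # "All-reduce" step: gather per-worker gradients and average them,
    # which is what a distributed training operator coordinates across pods.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        grads = list(pool.map(lambda s: worker_grad(s, w), shards))
    return w - lr * sum(grads) / len(grads)

shards = [[1.0, 2.0], [3.0, 4.0]]  # two workers, two data shards
w = 0.0
for _ in range(50):
    w = train_step(shards, w)
# w converges toward the optimum w = 1
```

Real frameworks (Horovod, tf.distribute) do the same averaging over the network; the CRD's job is only to declare how many such workers exist and let Kubernetes place them.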
  10. Introducing Kubeflow • Make it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere.
  11. Composability • Build and deploy reusable, portable, scalable machine learning workflows based on Docker containers. • Use the libraries/frameworks of your choice. Example: the Kubeflow "deployer" component lets you deploy as a plain TF Serving model server: https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/deployer
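Conceptually, a Kubeflow pipeline composes independent containerized components, each consuming its predecessor's output artifact; a toy sketch with plain Python functions standing in for the container images (step names, data, and the "model" are all made up for illustration):

```python
def ingest():
    # stand-in for a data-ingestion container producing a raw dataset
    return [1.0, 2.0, 3.0, 4.0]

def transform(data):
    # stand-in for a transform container: normalize into [0, 1]
    hi = max(data)
    return [x / hi for x in data]

def train(data):
    # stand-in for a training container: the "model" is just the mean here
    return sum(data) / len(data)

def pipeline():
    # Composition: each step sees only its predecessor's output artifact,
    # which is what makes components reusable and swappable.
    return train(transform(ingest()))

model = pipeline()
```

In the real system each function would be a container image with declared inputs and outputs, so any step, such as the deployer component the slide links to, can be swapped without touching the others.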
  12. Back to our ML enterprise workflow! [Diagram: the same pipeline stages – data ingestion, analysis, transform, validation, splitting, ad-hoc training, model validation, logging, roll-out, serving, monitoring, distributed training, training at scale, data versioning, HP tuning, experiment tracking, feature store – now underpinned by shared metadata and serving layers]
  13. Portability • Containers for Deep Learning [Diagram: layered stack – infrastructure, host OS, NVIDIA drivers, container runtime, and on top a TensorFlow container image bundling Python, TensorFlow (CPU: mkl; GPU: CUDA toolkit, cuDNN, cuBLAS, NCCL), Keras, Horovod, NumPy, SciPy, scikit-learn, pandas, OpenMPI and others] Python ML environments that are:
  14. [Diagram: the same TensorFlow container image is built on a development system, pushed to a container registry, and pulled onto a training cluster; the stack (host OS, NVIDIA drivers, container runtime, container image contents) is identical on both sides]
  15. Scalability • Kubernetes – Autoscaling Jobs • Describe the job, let Kubernetes take care of the rest • CPU, RAM, accelerators • TF Jobs delete themselves when finished; the node pool will autoscale back down. Data Scientist to IT Ops: "Model works great! But I need six nodes." Credit: @aronchick
  16. Scalability • Kubernetes – Autoscaling Jobs • Describe the job, let Kubernetes take care of the rest • CPU, RAM, accelerators • TF Jobs delete themselves when finished; the node pool will autoscale back down.
      apiVersion: "kubeflow.org/v1alpha1"
      kind: "TFJob"
      spec:
        replicaSpecs:
          replicas: 6
          CPU: 1
          GPU: 1
          containers: gcr.io/myco/myjob:1.0
    Credit: @aronchick
  17. Scalability • Kubernetes – Autoscaling Jobs • Describe the job, let Kubernetes take care of the rest • CPU, RAM, accelerators • TF Jobs delete themselves when finished; the node pool will autoscale back down. [Diagram: six GPU nodes are provisioned for the job] Credit: @aronchick
  18. Scalability • Kubernetes – Autoscaling Jobs • Describe the job, let Kubernetes take care of the rest • CPU, RAM, accelerators • TF Jobs delete themselves when finished; the node pool will autoscale back down. "Job's done!" Credit: @aronchick
  19. Agenda • Motivation • ML pipeline tools and platforms • Container > Kubernetes > Kubeflow • Deep Learning Demo • Conclusion
  20. Recap: The "Kube"flow • Deploy Kubernetes & Kubeflow • Experiment in Jupyter • Build Docker Image • Train at Scale • Build Model Server • Deploy Model • Integrate Model into App • Operate [Diagram: a data scientist works in a Jupyter notebook; Dockerfiles build a training job and an inference service; training pods run across Kubernetes worker nodes; Seldon Core engines serve the Doppelganger model behind an Istio gateway that routes traffic to a REST API, called via curl]
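The serving path in the recap ends with a REST call through the Istio gateway to Seldon Core; a small sketch of building that request body, assuming Seldon's ndarray JSON payload convention (the host, path, and feature values are made up for illustration):

```python
import json

def make_request(features):
    # Seldon Core's REST protocol wraps inputs as {"data": {"ndarray": [...]}}
    # (payload shape assumed from Seldon's v1 prediction protocol).
    return json.dumps({"data": {"ndarray": [features]}})

body = make_request([0.1, 0.2, 0.3])
# Command-line equivalent (hypothetical gateway host and deployment path):
#   curl -H "Content-Type: application/json" \
#        -d '{"data":{"ndarray":[[0.1,0.2,0.3]]}}' \
#        http://<istio-gateway>/seldon/default/doppelganger/api/v1.0/predictions
parsed = json.loads(body)
```

Because the model sits behind a plain HTTP API, "Integrate Model into App" reduces to an ordinary service call from the application side.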
  21. Agenda • Motivation • ML pipeline tools and platforms • Machine Learning on Kubernetes • Deep Learning Demo • Conclusion
  22. Conclusion & Take-aways • Platform matters • Composability – Portability – Scalability • Containerized architectures • Kubernetes + Machine Learning = Kubeflow • Start building! https://github.com/antje/doppelganger
  23. More information • Kubeflow https://www.kubeflow.org/ https://github.com/kubeflow/kubeflow • TensorFlow Extended (TFX) https://www.tensorflow.org/tfx • The Definitive Guide to Machine Learning Platforms https://twimlai.com/mlplatforms-ebook/ • Amazon Elastic Kubernetes Service (Amazon EKS) https://eksworkshop.com https://github.com/aws-samples/machine-learning-using-k8s