"Hot dogs or not" at Scale with Kubernetes

'Hot Dog or Not Hot Dog' at Scale - Kubernetes & Machine Learning
Vishnu Kannan (@vishh) and David Aronchick (@aronchick), SWE & PM at Google

What is Machine Learning?

First, Start with a Question

How Much Is My House Worth?

How Much Is My House Worth?
[Plot: House Price vs. Square Footage]

Now, Answer Your Question

How Much Is My House Worth?
[Plot: House Price vs. Square Footage; 2100 Sq. Ft. reads off as $339K]
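The "answer" on this slide amounts to fitting a line through past sales and reading off a price. A minimal sketch in plain Python follows. The data points are invented for illustration (they are not from the talk) and were chosen to lie on a straight line so that the prediction lands on the slide's figure; `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Invented example data: square footage and sale price in thousands of dollars.
square_feet = [1000, 1500, 2000, 2500, 3000]
prices_k = [185, 255, 325, 395, 465]

slope, intercept = fit_line(square_feet, prices_k)
estimate_k = predict(slope, intercept, 2100)
print(f"Estimated price for 2100 sq. ft.: about ${estimate_k:.0f}K")
```

Real sales data would not sit exactly on a line, which is where the next slides pick up.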

Congrats, You’re a Machine Learning Expert!

But...

Things Can Get Complicated
[Plots: House Price vs. Square Footage, in four panels: Non-Linear, Groupings, Multi-Dimensional, Changes over Time]

Machine Learning is a way of solving problems without explicitly
knowing how to create the solution

Google DC Ops

PUE == Power Usage Effectiveness
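For readers unfamiliar with the metric: PUE is conventionally defined as total facility energy divided by the energy delivered to IT equipment, so 1.0 is the ideal and lower is better. A small sketch, where the function name and the sample readings are hypothetical and not taken from the talk:

```python
def power_usage_effectiveness(total_facility_kwh, it_equipment_kwh):
    """PUE = total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical readings: 1200 kWh drawn by the whole facility,
# of which 1000 kWh reached servers, storage, and network gear.
pue = power_usage_effectiveness(1200, 1000)
print(f"PUE: {pue:.2f}")
```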

ML for Everyone!

However...

ML Needs DevOps

DevOps Needs...

Composability Portability Scalability

Composability
[Diagram: ML pipeline stages: Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Building a Model, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging]

Portability

Portability
[Diagram: the pipeline stages (Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Building a Model, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging) spread across six separate systems, System 1 through System 6]

Portability
• As a data scientist, you want to use the right HW for the job
• Every variation causes lots of pain
  ◦ GPUs/FPGAs, ASICs, NICs
  ◦ Kernel drivers, libraries, performance
• Even within an ML framework, dependencies cause chaos
  ◦ TensorFlow innovation (XLA)
  ◦ ML portability is a challenge
[Diagram labels: Container, App, Library, Drivers, Kernel, GPU, FPGA, Infiniband]

Scalability
• Machine-specific HW (GPU)
• Limited (or unlimited) compute
• Network & storage constraints
  ◦ Rack, server locality
  ◦ Bandwidth constraints
• Heterogeneous hardware
  ◦ HW & SW lifecycle management
• Scale isn’t JUST about adding new machines!
  ◦ Intern vs. researcher
  ◦ Scale to 1000s of experiments

You Know What’s Really Good at Composability, Portability, and Scalability?

Containers and Kubernetes

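[Diagram: Kubernetes as the common layer. Workloads and tools: NFS, Ceph, Cassandra, MySQL, Spark, Airflow, TensorFlow, Caffe, TF-Serving, Flask+Scikit, Jupyter. Cluster features: Namespace, Quota, RBAC, Monitoring, Logging. Operating system: Linux, Windows. Hardware: CPU, Memory, Disk, SSD, GPU, FPGA, ASIC, NIC. Environments: GCP, AWS, Azure, On-prem]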

Kubernetes for ML
• Supports accelerators in an extensible manner
  ◦ GPUs already in progress
  ◦ Support for FPGAs, high-perf NICs under discussion
• Existing controllers simplify DevOps challenges
  ◦ k8s Jobs for Training
  ◦ k8s Deployments for Serving
• Scales to 1000s of nodes
• Container packaging already a standard
  ◦ Standard base images for ML workloads
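To make "Jobs for Training, Deployments for Serving" concrete, here is a rough sketch that builds the two manifests as Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts alongside YAML. Everything specific is an assumption for illustration: the helper names, image names, port, and replica count are hypothetical, `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin, and `apps/v1` is the current Deployment API group (older clusters from the time of this talk used beta versions).

```python
import json


def training_job(name, image, gpus=0):
    """Build a batch/v1 Job that runs a training container to completion."""
    container = {"name": "trainer", "image": image}
    if gpus:
        # Assumes the NVIDIA device plugin is installed on the cluster.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,
            "template": {
                "spec": {"restartPolicy": "Never", "containers": [container]}
            },
        },
    }


def serving_deployment(name, image, replicas=2, port=8500):
    """Build an apps/v1 Deployment that keeps model servers running."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "server",
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


# Hypothetical image names, for illustration only.
job = training_job("hotdog-train", "example.com/hotdog-trainer:latest", gpus=1)
deployment = serving_deployment("hotdog-serve", "example.com/hotdog-server:latest")
print(json.dumps(job, indent=2))
print(json.dumps(deployment, indent=2))
```

The split mirrors the slide: a Job runs to completion and is retried on failure, while a Deployment keeps a fixed number of serving replicas alive.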

But Wait, There’s More!
• Kubernetes native scaling objects
  ◦ Autoscaling cluster based on workload metrics
  ◦ Priority eviction for removal of low priority jobs
  ◦ Scaled to large number of pods (experiments)
• Assumes “adequate” network bandwidth
• Also passes through cluster specs for specific needs
  ◦ Data Gravity is supported
  ◦ Node labels for Heterogeneous HW (more in the future)
  ◦ Manage SW drivers and HW health via addons
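One way to read "node labels for heterogeneous HW" is the pod spec's `nodeSelector` field, which restricts scheduling to nodes carrying the given labels. The sketch below is an assumption about how that could be used, not something shown in the talk: the helper name, the manifest, and the `accelerator` label key and value are hypothetical and would have to match labels an operator actually applied to the nodes.

```python
import copy


def pin_to_hardware(manifest, node_labels):
    """Return a copy of a Job or Deployment manifest whose pods only
    schedule onto nodes carrying all of the given labels."""
    pinned = copy.deepcopy(manifest)
    pod_spec = pinned["spec"]["template"]["spec"]
    pod_spec.setdefault("nodeSelector", {}).update(node_labels)
    return pinned


# Minimal hypothetical training Job, for illustration only.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "hotdog-train"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {"name": "trainer", "image": "example.com/hotdog-trainer:latest"}
                ],
            }
        }
    },
}

# Hypothetical label; the original manifest is left unmodified.
gpu_job = pin_to_hardware(job, {"accelerator": "nvidia-tesla-k80"})
print(gpu_job["spec"]["template"]["spec"]["nodeSelector"])
```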

But...

Oh, you want to use ML on K8s? Before that, can you become an expert in:
• Containers
• Packaging
• Kubernetes service endpoints
• Persistent volumes
• Scaling
• Immutable deployments
• GPUs, Drivers & the GPL
• Cloud APIs
• DevOps
• ...

Introducing Kubeflow

Make it Easy for Everyone to Learn, Deploy and Manage
Portable, Distributed ML on Kubernetes (Everywhere)

Kubernetes + ML = Kubeflow = Win
• Composability
  ◦ Choose from existing popular tools
  ◦ Uses YAML manifests for creation
• Portability
  ◦ Build using cloud native, portable Kubernetes APIs
  ◦ Let the K8s community solve for your deployment
• Scalability
  ◦ TF already supports CPU/GPU/distributed
  ◦ K8s scales to 5k nodes with the same stack

What’s in the Box?
• JupyterHub - for collaborative & interactive training
• A TensorFlow Training Controller
• A TensorFlow Serving Deployment
• Wiring to make it work on any Kubernetes anywhere

What’s in the Box?
[Diagram: ML pipeline stages: Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Building a Model, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging]

Using Kubeflow
my-laptop# kubectl apply -f components/ -R
kubeflow created
my-multi-gpu-box# kubectl apply -f components/ -R
kubeflow created
my-autoscaled-k8s-cluster# kubectl apply -f components/ -R
kubeflow created

Using Kubeflow
• Extend Kubeflow for your constraints
• Customizable Storage, Auth, Images, etc.
• Collection of common tools around ML

That’s It?

Yes… (For Now)

We’re Just Getting Started!
• Who’s helping?
  ◦ Red Hat, CoreOS, Weave, CaiCloud, many more
• What’s next...
  ◦ Easy-to-use accelerator integration
  ◦ Support for other popular tools like Spark ML, XGBoost, sklearn
  ◦ Autoscaled TF Serving
  ◦ tf.transform (programmatic data transforms)
  ◦ Model analysis (sliced metrics, time series graphs, visualizations embedded in notebooks)
• You tell us! (Or better yet, help!)

Kubeflow is Open
- open community
- open design
- open source
- open to ideas

https://github.com/google/kubeflow
slack: kubeflow (http://kubeflow.slack.com)
twitter: @kubeflow
@aronchick ([email protected])
@vishnukanan ([email protected])

Questions