"Hot dogs or not" at Scale with Kubernetes

Vish Kannan
December 07, 2017

This presentation describes a full-stack solution for managing machine learning with Kubernetes, along with a turnkey solution (Kubeflow) that can be deployed to most supported Kubernetes services.

Transcript

  1. 'Hot Dog or Not Hot Dog' at Scale - Kubernetes & Machine Learning
    Vishnu Kannan (@vishh) and David Aronchick (@aronchick), SWE & PM at Google
  2. Things Can Get Complicated
    [Charts: House Price vs. Square Footage]
    Chart labels: Non-Linear, Groupings, Multi-Dimensional, Changes over Time
  3. Composability
    [Diagram: ML pipeline stages] Building a Model, Logging, Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Model Validation, Training At Scale, Roll-out, Serving, Monitoring
  4. Portability
    [Diagram: ML pipeline stages alongside labels System 1 through System 6] Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Building a Model, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging
  5. Portability
    • As a data scientist, you want to use the right HW for the job
    • Every variation causes lots of pain
      ◦ GPUs/FPGAs, ASICs, NICs
      ◦ Kernel drivers, libraries, performance
    • Even within an ML framework, dependencies cause chaos
      ◦ TensorFlow innovation (XLA)
      ◦ ML portability is a challenge
    [Diagram: Container, Kernel, GPU, FPGA, InfiniBand, Drivers, Library, App, App]
  6. Scalability
    • Machine-specific HW (GPU)
    • Limited (or unlimited) compute
    • Network & storage constraints
      ◦ Rack, Server Locality
      ◦ Bandwidth constraints
    • Heterogeneous hardware
      ◦ HW & SW lifecycle management
    • Scale isn’t JUST about adding new machines!
      ◦ Intern vs Researcher
      ◦ Scale to 1000s of experiments
  7. [Diagram: Kubernetes and the stack around it]
    Kubernetes, Namespace, Quota, RBAC, Monitoring, Logging
    Workloads: NFS, Ceph, Cassandra, MySQL, Spark, Airflow, TensorFlow, Caffe, TF-Serving, Flask+Scikit, Jupyter
    Operating system (Linux, Windows)
    Hardware: CPU, Memory, Disk, SSD, GPU, FPGA, ASIC, NIC
    Environments: GCP, AWS, Azure, On-prem
  8. Kubernetes for ML
    • Supports accelerators in an extensible manner
      ◦ GPUs already in progress
      ◦ Support for FPGAs, high perf NICs under discussion
    • Existing Controllers simplify devops challenges
      ◦ k8s Jobs for Training
      ◦ k8s Deployments for Serving
    • Scales to 1000s of nodes
    • Container packaging already a standard
      ◦ Standard base images for ML workloads
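
The pairing of "Jobs for Training" and "Deployments for Serving" on slide 8 can be sketched as two manifests. This is a minimal illustration, not the deck's own code: the object names, the `example.com/...` images, the serving port 9000, and the `nvidia.com/gpu` resource name (the name exposed by the NVIDIA device plugin on current clusters) are all assumptions, and `apps/v1` is the current Deployment API group rather than necessarily the one available when the talk was given.

```python
import json


def training_job(name, image, gpus=1):
    """Build a batch/v1 Job manifest that runs one training container to completion."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry a failed training pod at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        # Assumption: GPUs are advertised under this resource name.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


def serving_deployment(name, image, replicas=2, port=9000):
    """Build an apps/v1 Deployment manifest that keeps `replicas` model servers running."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "server",
                        "image": image,
                        "ports": [{"containerPort": port}],
                    }]
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image names; kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(training_job("hotdog-train", "example.com/hotdog-trainer:latest"), indent=2))
    print(json.dumps(serving_deployment("hotdog-serve", "example.com/hotdog-server:latest"), indent=2))
```

The printed JSON could be saved to a file and passed to `kubectl apply -f`. A Job fits training because it runs to completion and retries on failure, while a Deployment fits serving because it keeps a fixed number of replicas alive.
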
  9. But Wait, There’s More!
    • Kubernetes native scaling objects
      ◦ Autoscaling cluster based on workload metrics
      ◦ Priority eviction for removal of low priority jobs
      ◦ Scaled to large number of pods (experiments)
    • Assumes “adequate” network bandwidth
    • Also passes through cluster specs for specific needs
      ◦ Data Gravity is supported
      ◦ Node labels for Heterogeneous HW (more in the future)
      ◦ Manage SW drivers and HW health via addons
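
Two of the mechanisms on slide 9, node labels for heterogeneous hardware and priority-based eviction, both surface as fields on a pod spec. The sketch below shows one way to set them. `nodeSelector` and `priorityClassName` are real PodSpec fields, but the helper name `schedule_on`, the label key and value, and the priority class name are hypothetical, and the priority class would have to exist in the cluster as a PriorityClass object.

```python
import copy


def schedule_on(pod_spec, node_labels, priority_class=None):
    """Return a copy of pod_spec pinned to nodes carrying node_labels.

    If priority_class is given, the pod also references that PriorityClass,
    so the scheduler can preempt it in favour of higher-priority work.
    """
    spec = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    spec.setdefault("nodeSelector", {}).update(node_labels)
    if priority_class is not None:
        spec["priorityClassName"] = priority_class
    return spec


if __name__ == "__main__":
    base = {"containers": [{"name": "trainer", "image": "example.com/hotdog-trainer:latest"}]}
    # Hypothetical label and priority class names, for illustration only.
    experiment = schedule_on(base, {"accelerator": "gpu"}, priority_class="low-priority-experiments")
    print(experiment)
```
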
  10. Oh, you want to use ML on K8s? Before that, can you become an expert in:
    • Containers
    • Packaging
    • Kubernetes service endpoints
    • Persistent volumes
    • Scaling
    • Immutable deployments
    • GPUs, Drivers & the GPL
    • Cloud APIs
    • DevOps
    • ...
  11. Make it Easy for Everyone to Learn, Deploy and Manage Portable, Distributed ML on Kubernetes (Everywhere)
  12. Kubernetes + ML = Kubeflow = Win
    • Composability
      ◦ Choose from existing popular tools
      ◦ Uses YAML manifests for creation
    • Portability
      ◦ Build using cloud native, portable Kubernetes APIs
      ◦ Let K8s community solve for your deployment
    • Scalability
      ◦ TF already supports CPU/GPU/distributed
      ◦ K8s scales to 5k nodes with same stack
  13. What’s in the Box?
    • JupyterHub - for collaborative & interactive training
    • A TensorFlow Training Controller
    • A TensorFlow Serving Deployment
    • Wiring to make it work on any Kubernetes anywhere
  14. What’s in the Box?
    [Diagram: ML pipeline stages] Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Building a Model, Model Validation, Trainer, Training At Scale, Serving, Roll-out, Logging, Monitoring
  15. Using Kubeflow
    my-laptop# kubectl apply -f components/ -R
    kubeflow created
    my-multi-gpu-box# kubectl apply -f components/ -R
    kubeflow created
    my-autoscaled-k8s-cluster# kubectl apply -f components/ -R
    kubeflow created
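
Slide 15 runs the same `kubectl apply` on three different machines. From a single workstation, a comparable effect could be approximated by targeting one kubeconfig context per cluster. The sketch below only builds and prints the commands. It assumes contexts named after the slide's shell prompts exist in your kubeconfig, which the deck does not state, and the helper name `kubectl_apply_command` is hypothetical.

```python
def kubectl_apply_command(context, path="components/"):
    """Build the argument list for a recursive `kubectl apply` against one context."""
    # --context, -f and -R (recursive) are standard kubectl flags.
    return ["kubectl", "--context", context, "apply", "-f", path, "-R"]


# Assumption: kubeconfig contexts named after the prompts shown on the slide.
CONTEXTS = ["my-laptop", "my-multi-gpu-box", "my-autoscaled-k8s-cluster"]

if __name__ == "__main__":
    for ctx in CONTEXTS:
        # Print rather than execute, so the sketch runs without a cluster.
        print(" ".join(kubectl_apply_command(ctx)))
```

To actually apply the manifests, each argument list could be passed to `subprocess.run(..., check=True)`.
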
  16. Using Kubeflow
    • Extend Kubeflow for your constraints
    • Customizable Storage, Auth, Images, etc.
    • Collection of common tools around ML
  17. We’re Just Getting Started!
    • Who’s helping?
      ◦ Red Hat, CoreOS, Weave, CaiCloud, many more
    • What’s next...
      ◦ Easy to use accelerator integration
      ◦ Support for other popular tools like Spark ML, XGBoost, sklearn
      ◦ Autoscaled TF Serving
      ◦ tf.transform (programmatic data transforms)
      ◦ Model analysis (sliced metrics, time series graphs, visualizations embedded in notebooks)
    • You tell us! (Or better yet, help!)