"Hot dogs or not" at Scale with Kubernetes

Vish Kannan
December 07, 2017

This presentation describes a full-stack solution for managing machine learning workloads on Kubernetes, along with a turnkey distribution (Kubeflow) that can be deployed to most supported Kubernetes services.

Transcript

  1. 1.

    'Hot Dog or Not Hot Dog' at Scale - Kubernetes & Machine Learning
    Vishnu Kannan (@vishh) and David Aronchick (@aronchick)
    SWE & PM at Google
  2. 13.
  3. 14.

    Things Can Get Complicated
    [Figure: plots of House Price against Square Footage, with panels labeled Non-Linear, Groupings, Multi-Dimensional, and Changes over Time]
  4. 23.

    Composability
    [Diagram of pipeline stages: Building a Model, Logging, Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Model Validation, Training At Scale, Roll-out, Serving, Monitoring]
  5. 25.

    Portability
    [Diagram: the pipeline stages spread across six separate systems, labeled System 1 through System 6. Stages shown: Training At Scale, Data Ingestion, Data Analysis, Data Transformation, Data Validation, Building a Model, Model Validation, Serving, Logging, Monitoring, Roll-out, Data Splitting, Trainer]
  6. 26.

    Portability
    • As a data scientist, you want to use the right HW for the job
    • Every variation causes lots of pain
      ◦ GPUs/FPGAs, ASICs, NICs
      ◦ Kernel drivers, libraries, performance
    • Even within an ML framework, dependencies cause chaos
      ◦ TensorFlow innovation (XLA)
      ◦ ML portability is a challenge
    [Diagram labels: Container, Kernel, GPU, FPGA, Infiniband, Drivers, Library, App, App]
  7. 27.

    Scalability
    • Machine specific HW (GPU)
    • Limited (or unlimited) compute
    • Network & storage constraints
      ◦ Rack, Server Locality
      ◦ Bandwidth constraints
    • Heterogeneous hardware
      ◦ HW & SW lifecycle management
    • Scale isn’t JUST about adding new machines!
      ◦ Intern vs Researcher
      ◦ Scale to 1000s of experiments
  8. 30.

    [Stack diagram labels: Kubernetes, NFS, Ceph, Cassandra, MySQL, Spark, Airflow, TensorFlow, Caffe, TF-Serving, Flask+Scikit, Operating system (Linux, Windows), CPU, Memory, Disk, SSD, GPU, FPGA, ASIC, NIC, Jupyter, Quota, RBAC, Monitoring, Logging, GCP, AWS, Azure, On-prem, Namespace]
  9. 31.

    Kubernetes for ML
    • Supports accelerators in an extensible manner
      ◦ GPUs already in progress
      ◦ Support for FPGAs, high perf NICs under discussion
    • Existing Controllers simplify devops challenges
      ◦ k8s Jobs for Training
      ◦ k8s Deployments for Serving
    • Scales to 1000s of nodes
    • Container packaging already a standard
      ◦ Standard base images for ML workloads
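The pairing of Jobs for training and Deployments for serving can be sketched as two manifests built in Python. This is a minimal illustration, not the presenters' configuration: the image names, object names, and port are hypothetical, it assumes a cluster that serves `apps/v1`, and it assumes GPUs are exposed under the resource name `nvidia.com/gpu` (the name used by the NVIDIA device plugin in later Kubernetes releases).

```python
import json


def training_job(name, image, gpus=1):
    """Build a batch/v1 Job that runs one training container to completion."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry a failed training pod at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        # Assumption: GPUs are advertised as nvidia.com/gpu.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


def serving_deployment(name, image, replicas=2, port=8500):
    """Build an apps/v1 Deployment that keeps `replicas` model servers running."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "server",
                        "image": image,
                        "ports": [{"containerPort": port}],
                    }]
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image names, used only for illustration.
    items = [
        training_job("hotdog-train", "example.com/hotdog-trainer:latest"),
        serving_deployment("hotdog-serve", "example.com/hotdog-server:latest"),
    ]
    # kubectl accepts JSON as well as YAML, so the output can be piped to it.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Saving the script under any name and piping its output to `kubectl apply -f -` would create both objects. The Job ends when training finishes, while the Deployment keeps its replicas running.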
  10. 32.

    But Wait, There’s More!
    • Kubernetes native scaling objects
      ◦ Autoscaling cluster based on workload metrics
      ◦ Priority eviction for removal of low priority jobs
      ◦ Scaled to large number of pods (experiments)
    • Assumes “adequate” network bandwidth
    • Also passes through cluster specs for specific needs
      ◦ Data Gravity is supported
      ◦ Node labels for Heterogeneous HW (more in the future)
      ◦ Manage SW drivers and HW health via addons
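Node labels and pod priority both surface as fields on a pod spec, which the following sketch sets with a small helper. The helper name, the label key and value, and the priority class name are hypothetical. `nodeSelector` and `priorityClassName` are standard PodSpec fields, but the priority class itself is assumed to have been created in the cluster beforehand.

```python
import copy


def schedule_on(pod_spec, node_labels, priority_class=None):
    """Return a copy of pod_spec pinned to nodes that carry node_labels.

    The input dict is left untouched so one base spec can be reused
    for many experiments with different hardware requirements.
    """
    spec = copy.deepcopy(pod_spec)
    spec.setdefault("nodeSelector", {}).update(node_labels)
    if priority_class is not None:
        # Lower-priority pods are the first candidates for preemption.
        spec["priorityClassName"] = priority_class
    return spec


if __name__ == "__main__":
    base = {"containers": [{"name": "trainer", "image": "example.com/trainer:latest"}]}
    # Hypothetical label and priority class, for illustration only.
    experiment = schedule_on(base, {"accelerator": "gpu-type-a"}, "low-priority-experiment")
    print(experiment)
```

Returning a copy rather than mutating the input keeps the base spec reusable when generating many experiment pods.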
  11. 33.
  12. 34.

    Oh, you want to use ML on K8s? Before that, can you become an expert in:
    • Containers
    • Packaging
    • Kubernetes service endpoints
    • Persistent volumes
    • Scaling
    • Immutable deployments
    • GPUs, Drivers & the GPL
    • Cloud APIs
    • DevOps
    • ...
  13. 36.

    Make it Easy for Everyone to Learn, Deploy and Manage Portable, Distributed ML on Kubernetes (Everywhere)
  14. 37.

    Kubernetes + ML = Kubeflow = Win
    • Composability
      ◦ Choose from existing popular tools
      ◦ Uses YAML manifests for creation
    • Portability
      ◦ Build using cloud native, portable Kubernetes APIs
      ◦ Let K8s community solve for your deployment
    • Scalability
      ◦ TF already supports CPU/GPU/distributed
      ◦ K8s scales to 5k nodes with same stack
  15. 38.

    What’s in the Box?
    • JupyterHub - for collaborative & interactive training
    • A TensorFlow Training Controller
    • A TensorFlow Serving Deployment
    • Wiring to make it work on any Kubernetes anywhere
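A controller for distributed TensorFlow training has to tell each replica who its peers are. One widely used convention for this is the `TF_CONFIG` environment variable, a JSON document naming the cluster members and the current task. The sketch below builds that document. It assumes the `TF_CONFIG` convention and uses hypothetical host names, and it is not a description of how the controller in the talk is implemented.

```python
import json


def tf_config(workers, ps, task_type, task_index):
    """Return the TF_CONFIG JSON string for one replica of a training job."""
    cluster = {"worker": list(workers), "ps": list(ps)}
    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index %d out of range for %s" % (task_index, task_type))
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


if __name__ == "__main__":
    # Hypothetical service names; a controller would derive these from the pods it creates.
    workers = ["trainer-worker-0:2222", "trainer-worker-1:2222"]
    ps = ["trainer-ps-0:2222"]
    print(tf_config(workers, ps, "worker", 1))
```

Each pod would receive its own string in its environment, differing only in the `task` entry.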
  16. 39.

    What’s in the Box?
    [Diagram of pipeline stages: Training At Scale, Data Ingestion, Data Analysis, Data Transformation, Data Validation, Logging, Monitoring, Roll-out, Data Splitting, Building a Model, Model Validation, Trainer, Serving]
  17. 40.

    Using Kubeflow
    my-laptop# kubectl apply -f components/ -R
    kubeflow created
    my-multi-gpu-box# kubectl apply -f components/ -R
    kubeflow created
    my-autoscaled-k8s-cluster# kubectl apply -f components/ -R
    kubeflow created
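The slide runs the same command on three machines. An equivalent workflow from a single workstation can be sketched by generating one `kubectl` invocation per kubeconfig context. This assumes a kubeconfig that defines one context per cluster; the context names below simply reuse the slide's host names and are hypothetical. `--context` and `-R` (recursive) are standard `kubectl` flags.

```python
import shlex


def apply_commands(contexts, directory="components/"):
    """Return one `kubectl apply` argument list per kubeconfig context."""
    return [
        ["kubectl", "--context", ctx, "apply", "-f", directory, "-R"]
        for ctx in contexts
    ]


if __name__ == "__main__":
    # Hypothetical context names mirroring the hosts on the slide.
    targets = ["my-laptop", "my-multi-gpu-box", "my-autoscaled-k8s-cluster"]
    for cmd in apply_commands(targets):
        # Print rather than execute, so the sketch runs without a cluster.
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing each list to `subprocess.run` would execute it. Printing keeps the example safe to run anywhere.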
  18. 41.

    Using Kubeflow
    • Extend Kubeflow for your constraints
    • Customizable Storage, Auth, Images, etc.
    • Collection of common tools around ML
  19. 42.
  20. 47.

    We’re Just Getting Started!
    • Who’s helping?
      ◦ Red Hat, CoreOS, Weave, CaiCloud, many more
    • What’s next...
      ◦ Easy to use accelerator integration
      ◦ Support for other popular tools like Spark ML, XGBoost, sklearn
      ◦ Autoscaled TF Serving
      ◦ tf.transform (programmatic data transforms)
      ◦ Model analysis (sliced metrics, time series graphs, visualizations embedded in notebooks)
    • You tell us! (Or better yet, help!)
  21. 48.
  22. 50.