"Hot dogs or not" at Scale with Kubernetes

Vish Kannan
December 07, 2017

This presentation describes a full-stack approach to managing machine learning on Kubernetes, along with a turnkey solution (Kubeflow) that can be deployed to most supported Kubernetes services.


  1. 'Hot Dog or Not Hot Dog' at Scale - Kubernetes & Machine Learning
    Vishnu Kannan (@vishh) and David Aronchick (@aronchick), SWE & PM at Google
  2. What is Machine Learning?

  3. First, Start with a Question

  4. How Much Is My House Worth?

  5. How Much Is My House Worth? (chart: House Price vs. Square Footage)

  6. How Much Is My House Worth? (chart: House Price vs. Square Footage)

  7. How Much Is My House Worth? (chart: House Price vs. Square Footage)

  8. How Much Is My House Worth? (chart: House Price vs. Square Footage)

  9. How Much Is My House Worth? (chart: House Price vs. Square Footage)

  10. Now, Answer Your Question

  11. How Much Is My House Worth? 2100 Sq. Ft.: $339K (chart: House Price vs. Square Footage)
  12. Congrats, You’re a Machine Learning Expert!

  13. But...

  14. Things Can Get Complicated (charts: House Price vs. Square Footage)
    • Non-Linear
    • Groupings
    • Multi-Dimensional
    • Changes over Time
  15. Machine Learning is a way of solving problems without explicitly knowing how to create the solution
  16. Google DC Ops

  17. PUE == Power Usage Effectiveness

  18. ML for Everyone!

  19. However...

  20. ML Needs DevOps

  21. DevOps Needs...

  22. Composability Portability Scalability

  23. Composability (diagram of pipeline stages): Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Building a Model, Trainer, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging
  24. Portability

  25. Portability (diagram: the pipeline stages spread across System 1 through System 6): Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Building a Model, Trainer, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging
  26. Portability
    • As a data scientist, you want to use the right HW for the job
    • Every variation causes lots of pain
      ◦ GPUs/FPGAs, ASICs, NICs
      ◦ Kernel drivers, libraries, performance
    • Even within an ML framework, dependencies cause chaos
      ◦ TensorFlow innovation (XLA)
      ◦ ML portability is a challenge
    (diagram: App, Library, Drivers, Container, Kernel, GPU, FPGA, Infiniband)
  27. Scalability
    • Machine specific HW (GPU)
    • Limited (or unlimited) compute
    • Network & storage constraints
      ◦ Rack, Server Locality
      ◦ Bandwidth constraints
    • Heterogeneous hardware
      ◦ HW & SW lifecycle management
    • Scale isn’t JUST about adding new machines!
      ◦ Intern vs Researcher
      ◦ Scale to 1000s of experiments
  28. You Know What’s Really Good at Composability, Portability, and Scalability?

  29. Containers and Kubernetes

  30. Kubernetes (stack diagram)
    • Workloads: NFS, Ceph, Cassandra, MySQL, Spark, Airflow, TensorFlow, Caffe, TF-Serving, Flask+Scikit, Jupyter
    • Cluster features: Namespace, Quota, RBAC, Monitoring, Logging
    • Operating system (Linux, Windows)
    • Hardware: CPU, Memory, Disk, SSD, GPU, FPGA, ASIC, NIC
    • Platforms: GCP, AWS, Azure, On-prem
  31. Kubernetes for ML
    • Supports accelerators in an extensible manner
      ◦ GPUs already in progress
      ◦ Support for FPGAs, high perf NICs under discussion
    • Existing Controllers simplify devops challenges
      ◦ k8s Jobs for Training
      ◦ k8s Deployments for Serving
    • Scales to 1000s of nodes
    • Container packaging already a standard
      ◦ Standard base images for ML workloads
  32. But Wait, There’s More!
    • Kubernetes native scaling objects
      ◦ Autoscaling cluster based on workload metrics
      ◦ Priority eviction for removal of low priority jobs
      ◦ Scaled to large number of pods (experiments)
    • Assumes “adequate” network bandwidth
    • Also passes through cluster specs for specific needs
      ◦ Data Gravity is supported
      ◦ Node labels for Heterogeneous HW (more in the future)
      ◦ Manage SW drivers and HW health via addons
  33. But...

  34. Oh, you want to use ML on K8s? Before that, can you become an expert in:
    • Containers
    • Packaging
    • Kubernetes service endpoints
    • Persistent volumes
    • Scaling
    • Immutable deployments
    • GPUs, Drivers & the GPL
    • Cloud APIs
    • DevOps
    • ...
  35. Introducing Kubeflow

  36. Make it Easy for Everyone to Learn, Deploy and Manage Portable, Distributed ML on Kubernetes (Everywhere)
  37. Kubernetes + ML = Kubeflow = Win
    • Composability
      ◦ Choose from existing popular tools
      ◦ Uses YAML manifests for creation
    • Portability
      ◦ Build using cloud native, portable Kubernetes APIs
      ◦ Let K8s community solve for your deployment
    • Scalability
      ◦ TF already supports CPU/GPU/distributed
      ◦ K8s scales to 5k nodes with same stack
  38. What’s in the Box?
    • JupyterHub - for collaborative & interactive training
    • A TensorFlow Training Controller
    • A TensorFlow Serving Deployment
    • Wiring to make it work on any Kubernetes anywhere
  39. What’s in the Box? (diagram of pipeline stages): Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Building a Model, Trainer, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging
  40. Using Kubeflow
    my-laptop# kubectl apply -f components/ -R
    kubeflow created
    my-multi-gpu-box# kubectl apply -f components/ -R
    kubeflow created
    my-autoscaled-k8s-cluster# kubectl apply -f components/ -R
    kubeflow created
  41. Using Kubeflow
    • Extend kubeflow for your constraints
    • Customizable Storage, Auth, Images, etc.
    • Collection of common tools around ML
  42. Demo

  43. That’s It?

  44. Yes… (For Now)

  45. Yes… (For Now)

  46. Yes… (For Now)

  47. We’re Just Getting Started!
    • Who’s helping?
      ◦ Red Hat, CoreOS, Weave, CaiCloud, many more
    • What’s next...
      ◦ Easy to use accelerator integration
      ◦ Support for other popular tools like Spark ML, XGBoost, sklearn
      ◦ Autoscaled TF Serving
      ◦ tf.transform (programmatic data transforms)
      ◦ Model analysis (sliced metrics, time series graphs, visualizations embedded in notebooks)
    • You tell us! (Or better yet, help!)
  48. Kubeflow is Open - open community - open design - open source - open to ideas
  49. https://github.com/google/kubeflow
    slack: kubeflow (http://kubeflow.slack.com)
    twitter: @kubeflow
    @aronchick ([email protected])
    @vishnukanan ([email protected])

  50. Questions
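
Slides 4 through 11 answer "How Much Is My House Worth?" by fitting a line through house price versus square footage. A minimal sketch of that idea, assuming ordinary least squares on a single feature. The five data points are hypothetical (they are not from the talk) and were chosen to lie on a line that reproduces the slide's 2100 Sq. Ft. / $339K reading; `fit_line` and `predict` are illustrative names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(square_feet, slope, intercept):
    """Read a price off the fitted line."""
    return slope * square_feet + intercept


# Hypothetical training data: (square footage, sale price in dollars).
square_footage = [1000, 1500, 2000, 2500, 3000]
house_price = [174_000, 249_000, 324_000, 399_000, 474_000]

slope, intercept = fit_line(square_footage, house_price)

if __name__ == "__main__":
    price = predict(2100, slope, intercept)
    print(f"2100 Sq. Ft. -> ${price / 1000:.0f}K")  # prints: 2100 Sq. Ft. -> $339K
```

Real housing data would not sit exactly on a line, which is the point of slide 14: once the relationship is non-linear, grouped, multi-dimensional, or drifting over time, a two-parameter fit like this stops being enough.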
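
Slide 31 maps training to k8s Jobs and serving to k8s Deployments. The sketch below builds one manifest of each kind as plain Python dictionaries and emits them as JSON, which `kubectl apply -f` accepts alongside YAML. Everything specific here is an assumption for illustration: the object names, the `example.com/...` image references, the port and the replica count are hypothetical, and the `nvidia.com/gpu` resource name assumes a cluster running the NVIDIA device plugin. These are not Kubeflow's own manifests.

```python
import json


def training_job(name, image, gpus=1):
    """Build a batch/v1 Job that runs one training container to completion."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry a failed training pod at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        # Assumption: GPUs are exposed by the NVIDIA device plugin.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


def serving_deployment(name, image, replicas=2, port=8500):
    """Build an apps/v1 Deployment that keeps model servers running."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "server",
                        "image": image,
                        "ports": [{"containerPort": port}],
                    }]
                },
            },
        },
    }


def as_kubectl_list(*manifests):
    """Wrap manifests in a v1 List so one document carries all of them."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": list(manifests)}, indent=2
    )


if __name__ == "__main__":
    # Hypothetical names and images.
    print(as_kubectl_list(
        training_job("hotdog-train", "example.com/hotdog-trainer:latest"),
        serving_deployment("hotdog-serve", "example.com/hotdog-server:latest"),
    ))
```

Saved under a hypothetical name such as `manifests.py`, the output could be piped to a cluster with `python manifests.py | kubectl apply -f -`. The same two objects work unchanged on a laptop cluster or a cloud cluster, which is the portability argument the talk makes.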