Dive into Kubeflow

Presented 5/10/2018 at the Cloud Native East Bay Meetup at Adobe

Lachlan Evenson

May 10, 2018

Transcript

  1. OpenAI
    Scaling Kubernetes to 2,500 Nodes
    https://blog.openai.com/scaling-kubernetes-to-2500-nodes/

  2. Foundations
    Containers, Kubernetes
    Why?
    1. Typical ML development workflow and some of its shortcomings
    2. How can a distributed system like Kubernetes help us improve this flow?
    Labs
    aka.ms/kubeflow-labs

  3. How?
    Set up a Kubernetes cluster (acs-engine/AKS)
    Demos
    1. Running a basic Docker container
    2. Running a TensorFlow job with GPU
    3. JupyterHub
    4. Distributed TensorFlow
    5. Hyper-parameter sweeping
    6. TensorFlow Serving
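
As a rough illustration of demo 2, the sketch below builds a Pod manifest that requests a GPU. It assumes the cluster exposes GPUs through the NVIDIA device plugin under the resource name `nvidia.com/gpu` (older clusters may use a different name); the pod name and image are hypothetical placeholders, not taken from the labs.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Return a Pod manifest (as a dict) that asks the scheduler for GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # A training job should run once and stop, not restart forever.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Assumption: GPUs are advertised as "nvidia.com/gpu".
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    manifest = gpu_pod_manifest("tf-gpu-demo", "example.invalid/tf-gpu:latest")
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed manifest could be saved to a file and applied as-is.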

  4. https://github.com/Azure/kubeflow-labs/tree/master/1-docker

  5. OpenAI - Building the Infrastructure that Powers the Future of AI

  6. Azure/acs-engine
    github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/azure
    https://docs.microsoft.com/en-us/azure/aks/gpu-cluster

  7. https://github.com/Azure/kubeflow-labs/tree/master/2-kubernetes

  8. • The Kubeflow project is dedicated to making deployments of machine learning
    workflows on Kubernetes simple, portable and scalable.
    • https://github.com/kubeflow/kubeflow

  9. https://github.com/Azure/kubeflow-labs/tree/master/4-kubeflow
    https://github.com/Azure/kubeflow-labs/tree/master/6-tfjob

  10. • JupyterHub: a multi-user Hub
    • spawns, manages, and proxies multiple instances of the single-user Jupyter
    notebook server.

  11. https://github.com/Azure/kubeflow-labs/tree/master/5-jupyterhub

  12. https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow
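
For context on the distributed TensorFlow lab: TensorFlow's higher-level APIs can read the cluster layout from a `TF_CONFIG` environment variable, a JSON document that lists the parameter servers and workers and names the role of the current process. The sketch below builds such a value by hand; the host names are hypothetical, and in a real cluster an operator would typically set this for each replica.

```python
import json


def tf_config(workers, ps, task_type, task_index):
    """Build a TF_CONFIG JSON string for one replica of a training cluster."""
    cluster = {"worker": list(workers), "ps": list(ps)}
    if task_type not in cluster:
        raise ValueError("unknown task type: {}".format(task_type))
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index out of range: {}".format(task_index))
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


if __name__ == "__main__":
    # Hypothetical service names for two workers and one parameter server.
    workers = ["trainer-worker-0:2222", "trainer-worker-1:2222"]
    ps = ["trainer-ps-0:2222"]
    # The value each replica would see in its TF_CONFIG environment variable.
    for index in range(len(workers)):
        print(tf_config(workers, ps, "worker", index))
```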

  13. Andrej Karpathy's Image painting demo

  14. https://github.com/Azure/kubeflow-labs/tree/master/8-hyperparam-sweep
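
The idea behind a hyper-parameter sweep can be sketched in a few lines: enumerate every combination in a grid and emit one training job per combination, which Kubernetes can then run in parallel. This is a minimal sketch, not the lab's implementation; the parameter names, flag format, and image are hypothetical.

```python
import itertools


def sweep(grid):
    """Yield one dict per combination of the values in `grid`."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def job_spec(index, params, image):
    """Describe one training job; the flag format is an assumption."""
    args = ["--{}={}".format(key, value) for key, value in sorted(params.items())]
    return {"name": "sweep-{}".format(index), "image": image, "args": args}


if __name__ == "__main__":
    # Hypothetical grid: 2 learning rates x 3 network depths = 6 jobs.
    grid = {"learning_rate": [0.01, 0.001], "hidden_layers": [2, 4, 8]}
    for i, params in enumerate(sweep(grid)):
        print(job_spec(i, params, "example.invalid/trainer:latest"))
```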

  15. TensorFlow Serving
    • Provides out-of-the-box integration with TensorFlow models
    • Multiple models (or multiple versions of the same model) can be served
    simultaneously
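
To make the multiple-versions point concrete, the sketch below builds request URLs and bodies for TensorFlow Serving's REST predict endpoint, where a specific model version can be addressed in the path. It assumes a TensorFlow Serving release that exposes the REST API (gRPC is the other option) on its default port 8501; the host and model names are hypothetical, and no request is actually sent.

```python
import json


def predict_url(host, model, version=None, port=8501):
    """Return the REST predict URL for a model, optionally pinned to a version."""
    base = "http://{}:{}/v1/models/{}".format(host, port, model)
    if version is not None:
        # Without a version, the server picks which version answers.
        base += "/versions/{}".format(version)
    return base + ":predict"


def predict_body(instances):
    """Return the JSON request body in the row ("instances") format."""
    return json.dumps({"instances": instances})


if __name__ == "__main__":
    # Hypothetical service and model names.
    print(predict_url("serving.example.invalid", "mnist"))
    print(predict_url("serving.example.invalid", "mnist", version=2))
    print(predict_body([[0.0, 1.0, 0.5]]))
```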

  16. https://github.com/Azure/kubeflow-labs/tree/master/9-serving

  17. (One) solution is Kubernetes:
    • Highly scalable
    • Easy to explore the hyper-parameter space
    • Easy to do distributed training
    But really, data scientists shouldn’t have to care about containers, Kubernetes and
    all that stuff.
    • Pachyderm can version datasets and trigger new trainings when changes occur
    • Distributed file systems
      • NFS
      • HDFS
      • …
    Classic DevOps solutions:
    • Containers
    • CI/CD
    • Autoscaling
    • A/B testing and canary release of models
    • Comparing production accuracy vs expected accuracy when possible
    • Rolling updates
    • …

  18. aka.ms/kubeflow-labs
