Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

OpenAI Scaling Kubernetes to 2,500 Nodes https://blog.openai.com/scaling-kubernetes-to-2500-nodes/

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Foundations
Containers, Kubernetes
Why?
1. Typical ML development workflow and some of its shortcomings
2. How can a distributed system like Kubernetes help us improve this flow?
Labs
aka.ms/kubeflow-labs

Slide 5

Slide 5 text

How?
Set up a Kubernetes cluster (acs-engine/AKS)
Demos
1. Running a basic Docker container
2. Running a TensorFlow job with GPU
3. JupyterHub
4. Distributed TensorFlow
5. Hyper-parameter sweeping
6. TensorFlow Serving

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

https://github.com/Azure/kubeflow-labs/tree/master/1-docker

Slide 10

Slide 10 text

OpenAI - Building the Infrastructure that Powers the Future of AI

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Azure/acs-engine
github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/azure
https://docs.microsoft.com/en-us/azure/aks/gpu-cluster

Slide 13

Slide 13 text

https://github.com/Azure/kubeflow-labs/tree/master/2-kubernetes

Slide 14

Slide 14 text

• The Kubeflow project is dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable.
• https://github.com/kubeflow/kubeflow

Slide 15

Slide 15 text

https://github.com/Azure/kubeflow-labs/tree/master/4-kubeflow
https://github.com/Azure/kubeflow-labs/tree/master/6-tfjob

Slide 16

Slide 16 text

• Multi-user Hub
• Spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server

Slide 17

Slide 17 text

https://github.com/Azure/kubeflow-labs/tree/master/5-jupyterhub

Slide 18

Slide 18 text

!

Slide 19

Slide 19 text

https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow
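
In distributed training, each replica has to know the cluster layout and its own role. TensorFlow conventionally reads this from a `TF_CONFIG` environment variable containing JSON. The following is a minimal sketch, assuming that convention, of how a training script might inspect it. The host names and the `read_tf_config` helper are hypothetical and are not taken from the lab.

```python
import json
import os


def read_tf_config(env=None):
    """Return (cluster, task_type, task_index) parsed from TF_CONFIG.

    Missing fields come back as an empty dict or None, so the same
    script can also run outside a cluster.
    """
    env = os.environ if env is None else env
    config = json.loads(env.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type"), task.get("index")


# Hypothetical value for illustration only; host names are made up.
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "ps": ["ps-0:2222"],
            "worker": ["worker-0:2222", "worker-1:2222"],
        },
        "task": {"type": "worker", "index": 1},
    })
}

cluster, task_type, task_index = read_tf_config(example_env)
# Look up this replica's own address in the cluster layout.
print(task_type, task_index, cluster[task_type][task_index])
```

Passing the environment in as an argument keeps the helper easy to exercise without a running cluster.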

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Andrej Karpathy's Image painting demo

Slide 22

Slide 22 text

https://github.com/Azure/kubeflow-labs/tree/master/8-hyperparam-sweep
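
A hyper-parameter sweep boils down to launching one training job per combination of values. The sketch below, with hypothetical parameter names, flags, and job names (none of them taken from the lab), shows how such a grid could be enumerated before each entry is handed to the cluster as a separate job.

```python
import itertools


def sweep_grid(space):
    """Yield one dict per combination of hyper-parameter values."""
    names = sorted(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def to_args(params):
    """Turn a parameter dict into command-line flags (hypothetical flag names)."""
    return [
        f"--{name.replace('_', '-')}={value}"
        for name, value in sorted(params.items())
    ]


# Hypothetical search space for illustration.
space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "hidden_layers": [2, 4],
}

# One job description per combination; the index keeps names unique.
jobs = [
    {"name": f"sweep-{i}", "args": to_args(params)}
    for i, params in enumerate(sweep_grid(space))
]

for job in jobs:
    print(job["name"], " ".join(job["args"]))
```

Because every combination is independent, the jobs can run in parallel as far as the cluster's capacity allows.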

Slide 23

Slide 23 text

• Provides out-of-the-box integration with TensorFlow models
• Multiple models (or multiple versions of the same model) can be served simultaneously
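
To make the "multiple versions" point concrete, here is a minimal sketch of how a client could address one specific model version. It assumes a TensorFlow Serving deployment that exposes the REST API on its default port 8501. The host name, model name, and `predict_request` helper are hypothetical, and the code only builds the request without sending it.

```python
import json


def predict_request(host, model, instances, version=None, port=8501):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call.

    When version is None the server chooses which version to use;
    otherwise the request is pinned to the given version.
    """
    path = f"/v1/models/{model}"
    if version is not None:
        path += f"/versions/{version}"
    url = f"http://{host}:{port}{path}:predict"
    body = json.dumps({"instances": instances})
    return url, body


# Hypothetical host and model names for illustration.
url, body = predict_request("serving", "mnist", [[0.0, 1.0]], version=2)
print(url)
print(body)
```

The returned URL and body can be sent as an HTTP POST with any client.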

Slide 24

Slide 24 text

https://github.com/Azure/kubeflow-labs/tree/master/9-serving

Slide 25

Slide 25 text

(One) solution is Kubernetes:
• Highly scalable
• Easy to explore the hyper-parameter space
• Easy to do distributed training
But really, data scientists shouldn’t have to care about containers, Kubernetes, and all that stuff
• Pachyderm can version datasets and trigger new trainings when changes occur
• Distributed file systems
• NFS
• HDFS
• …
Classic DevOps solutions:
• Containers
• CI/CD
• Autoscaling
• A/B testing and canary release of models
• Comparing production accuracy vs. expected accuracy when possible
• Rolling updates
• …

Slide 26

Slide 26 text

aka.ms/kubeflow-labs