Kubeflow: TensorFlow on Kubernetes

Laurent Grangeau

October 18, 2018

Transcript

  1. $ whoami? Laurent Grangeau, Cloud Solution Architect @ Sogeti.
     I love to automate things and deliver apps at scale.
     You can follow me on @laurentgrangeau
  2. TensorFlow on Kubernetes? I know Kubernetes, so why not train ML models on Kubernetes, with GPU computing and in a distributed way? Here comes Kubeflow.
  3. What is Kubeflow ? The Kubeflow project is dedicated to

    making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.
  4. What is Kubeflow? It uses native Kubernetes objects:
     • Custom Resource Definitions
     • Persistent Volume Claims
     • Pods
     • Labels (for GPU computing)
     • LoadBalancer
     • Manifests
     Don't reinvent the wheel!
  5. What is Kubeflow? It is based on multiple components deployed together:
     • JupyterHub ◦ Multi-user server for Jupyter notebooks
     • Katib ◦ Hyperparameter tuning system, a clone of Vizier (Google's hyperparameter tuning system)
     • KVC ◦ Kubernetes Volume Controller
     • CRDs for various ML frameworks ◦ tf-operator, PyTorch operator, Caffe2, etc.
  6. What is Kubeflow? It is based on multiple components deployed together:
     • SeldonIO ◦ CRD & tooling for serving and deploying models
     • Argo ◦ CRD for workflows
     • Pachyderm ◦ Deploy and manage multi-stage data pipelines while maintaining complete reproducibility and provenance
     • Tensor2Tensor ◦ Library of TensorFlow models and datasets for a variety of applications
  7. How to deploy Kubeflow? Kubeflow makes use of ksonnet.
     We want to be portable across different environments; ksonnet is used to move ML applications between environments:
     • local -> cloud
     • dev -> test -> prod
     Environments are first-class citizens in ksonnet: it generates the various manifests for Kubernetes.
  8. ksonnet
     # Create a namespace for the Kubeflow deployment
     NAMESPACE=kubeflow
     kubectl create namespace ${NAMESPACE}

     # Which version of Kubeflow to use
     # For a list of releases refer to:
     # https://github.com/kubeflow/kubeflow/releases
     VERSION=v0.2.2

     # Initialize a ksonnet app. Set the namespace for its default environment.
     APP_NAME=my-kubeflow
     ks init ${APP_NAME}
     cd ${APP_NAME}
     ks env set default --namespace ${NAMESPACE}
  9. ksonnet
     # Add a reference to Kubeflow's ksonnet manifests
     ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow

     # Install Kubeflow components
     ks pkg install kubeflow/core@${VERSION}
     ks pkg install kubeflow/tf-serving@${VERSION}

     # Create templates for core components
     ks generate kubeflow-core kubeflow-core

     # Customize Kubeflow's installation for AKS
     ks param set kubeflow-core cloud aks

     # Deploy Kubeflow
     ks apply default -c kubeflow-core
  10. ksonnet
     $ kubectl get pods -n kubeflow
     NAME                                READY  STATUS   RESTARTS  AGE
     ambassador-7789cddc5d-czf7p         2/2    Running  0         1d
     ambassador-7789cddc5d-f79zp         2/2    Running  0         1d
     ambassador-7789cddc5d-h57ms         2/2    Running  0         1d
     centraldashboard-d5bf74c6b-nn925    1/1    Running  0         1d
     tf-hub-0                            1/1    Running  0         1d
     tf-job-dashboard-8699ccb5ff-9phmv   1/1    Running  0         1d
     tf-job-operator-646bdbcb7-bc479     1/1    Running  0         1d
  11. Jupyterhub The Jupyter Notebook is an open-source web application that

    allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more
  12. Train
     Now that you have a working model, create a Docker image containing your code so it can be deployed on Kubernetes.
     Kubeflow provides a Custom Resource Definition called TFJob, so you can easily train your model with a YAML manifest.
  13. TFJob
     apiVersion: kubeflow.org/v1alpha2
     kind: TFJob
     metadata:
       name: kubeflow
     spec:
       tfReplicaSpecs:
         MASTER:
           replicas: 1
           template:
             spec:
               containers:
               - image: <DOCKER_USERNAME>/tf-mnist:gpu
                 name: tensorflow
                 resources:
                   limits:
                     nvidia.com/gpu: 1
               restartPolicy: OnFailure
  14. TFJob
     $ kubectl create -f kubeflow-tfjob.yml
     $ kubectl get tfjob
     NAME           AGE
     kubeflow-gpu   5s
     $ kubectl get pods
     NAME                           READY  STATUS   RESTARTS  AGE
     kubeflow-master-xs4b-0-6gpfn   1/1    Running  0         2m
  15. TFJob
     $ kubectl logs <pod-name>
     [...]
     INFO:tensorflow:2017-11-20 20:57:22.314198: Step 480: Cross entropy = 0.142486
     INFO:tensorflow:2017-11-20 20:57:22.370080: Step 480: Validation accuracy = 85.0% (N=100)
     INFO:tensorflow:2017-11-20 20:57:22.896383: Step 490: Train accuracy = 98.0%
     INFO:tensorflow:2017-11-20 20:57:22.896600: Step 490: Cross entropy = 0.075210
     INFO:tensorflow:2017-11-20 20:57:22.945611: Step 490: Validation accuracy = 91.0% (N=100)
     INFO:tensorflow:2017-11-20 20:57:23.407756: Step 499: Train accuracy = 94.0%
     INFO:tensorflow:2017-11-20 20:57:23.407980: Step 499: Cross entropy = 0.170348
     INFO:tensorflow:2017-11-20 20:57:23.457325: Step 499: Validation accuracy = 89.0% (N=100)
     INFO:tensorflow:Final test accuracy = 88.4% (N=353)
     [...]
  16. TFJob
     That's great and all, but how do we grab our trained model and TensorFlow's summaries?
     We can't, since the pod's filesystem is ephemeral. That's where KVC comes to the rescue!
  17. TFJob
     apiVersion: kubeflow.org/v1alpha2
     kind: TFJob
     metadata:
       name: kubeflow
     spec:
       tfReplicaSpecs:
         MASTER:
           replicas: 1
           template:
             spec:
               containers:
               - image: <DOCKER_USERNAME>/tf-mnist:gpu
                 name: tensorflow
                 resources:
                   limits:
                     nvidia.com/gpu: 1
  18. TFJob
                 # By default our classifier saves the summaries in /tmp/tensorflow,
                 # so that's where we want to mount our Azure File Share.
                 volumeMounts:
                 - name: azurefile
                   # The subPath allows us to mount a subdirectory within the Azure File Share instead of the root.
                   # This is useful so that we can save the logs for each run in a different subdirectory
                   # instead of overwriting what was done before.
                   subPath: kubeflow-gpu
                   mountPath: /tmp/tensorflow
               restartPolicy: OnFailure
               volumes:
               - name: azurefile
                 persistentVolumeClaim:
                   claimName: azurefile
  19. Distributed training
     Distributed training is hard:
     • Set up VMs (with GPU computing)
     • Set up the network (the VMs have to talk to each other)
     • Upload the TF code to every machine
     • Modify the model to add distributed training:
       cluster = tf.train.ClusterSpec({"worker": ["<IP_GPU_VM_1>:2222", "<IP_GPU_VM_2>:2222"],
                                       "ps": ["<IP_CPU_VM_1>:2222", "<IP_CPU_VM_2>:2222"]})
     • Lots of other things to do (splitting operations across devices, getting the master session, etc.)
  20. Distributed training
     Start the training:
     # On ps0:
     $ python trainer.py \
         --ps_hosts=<IP_CPU_VM_1>:2222,<IP_CPU_VM_2>:2222 \
         --worker_hosts=<IP_GPU_VM_1>:2222,<IP_GPU_VM_2>:2222 \
         --job_name=ps --task_index=0
     # On ps1:
     $ python trainer.py \
         --ps_hosts=<IP_CPU_VM_1>:2222,<IP_CPU_VM_2>:2222 \
         --worker_hosts=<IP_GPU_VM_1>:2222,<IP_GPU_VM_2>:2222 \
         --job_name=ps --task_index=1
     # On worker0 and worker1 also...
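     (Editor's sketch, not part of the deck.) For context, this is roughly what such a trainer.py entry point looks like with TensorFlow 1.x between-graph replication. The flag names match the commands above; everything else (script structure, the placeholder model section) is illustrative, not the speaker's actual code.

     # Hypothetical trainer.py skeleton for manual distributed training (TF 1.x style).
     import argparse
     import tensorflow as tf

     def main():
         parser = argparse.ArgumentParser()
         parser.add_argument("--ps_hosts", type=str, required=True)
         parser.add_argument("--worker_hosts", type=str, required=True)
         parser.add_argument("--job_name", type=str, default="worker")
         parser.add_argument("--task_index", type=int, default=0)
         args = parser.parse_args()

         # Build the cluster description from the flags passed on each VM.
         cluster = tf.train.ClusterSpec({
             "ps": args.ps_hosts.split(","),
             "worker": args.worker_hosts.split(","),
         })
         server = tf.train.Server(cluster, job_name=args.job_name, task_index=args.task_index)

         if args.job_name == "ps":
             # Parameter servers just host the variables and wait for the workers.
             server.join()
         else:
             # Workers pin variables to the parameter servers and compute locally.
             with tf.device(tf.train.replica_device_setter(
                     worker_device="/job:worker/task:%d" % args.task_index,
                     cluster=cluster)):
                 pass  # build the model graph here (omitted)
             # ...then create a session against server.target and run the training loop.

     if __name__ == "__main__":
         main()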
  21. Distributed training
     TFJob defines TFJobSpec and TFReplicaSetSpec objects:
     • TFJob defines a "spec" section of type "TFJobSpec"
     • TFJobSpec defines a "TFReplicaSet" of type "TFReplicaSetSpec"
     • TFReplicaSetSpec defines a TFReplicaType, which can be MASTER, WORKER or PS
  22. Define a new TFJob with TFReplicaSetSpec
     apiVersion: kubeflow.org/v1alpha2
     kind: TFJob
     metadata:
       name: kubeflow-gpu
     spec:
       tfReplicaSpecs:
         MASTER:
           replicas: 1
           [...]
         WORKER:
           replicas: 2
           [...]
         PS:
           replicas: 1
           [...]
  23. TFJob will inject TF_CONFIG into each pod
     {
       "cluster": {
         "master": ["distributed-mnist-master-5oz2-0:2222"],
         "ps": ["distributed-mnist-ps-5oz2-0:2222"],
         "worker": [
           "distributed-mnist-worker-5oz2-0:2222",
           "distributed-mnist-worker-5oz2-1:2222"
         ]
       },
       "task": {
         "type": "worker",
         "index": 1
       },
       "environment": "cloud"
     }
  24. Change your code
     import json
     import os
     import tensorflow as tf

     # Grab the TF_CONFIG environment variable
     tf_config_json = os.environ.get("TF_CONFIG", "{}")

     # Deserialize it to a Python object
     tf_config = json.loads(tf_config_json)

     # Grab the cluster specification from tf_config and create a new tf.train.ClusterSpec instance with it
     cluster_spec = tf_config.get("cluster", {})
     cluster_spec_object = tf.train.ClusterSpec(cluster_spec)

     # Grab the task assigned to this specific process from the config.
     # job_name might be "worker" and task_id might be 1, for example.
     task = tf_config.get("task", {})
     job_name = task["type"]
     task_id = task["index"]
  25. Change your code
     # Configure the TensorFlow server
     server_def = tf.train.ServerDef(
         cluster=cluster_spec_object.as_cluster_def(),
         protocol="grpc",
         job_name=job_name,
         task_index=task_id)
     server = tf.train.Server(server_def)

     # Check if this process is the chief (also called master). The chief has the
     # responsibility of creating the session, saving the summaries, etc.
     is_chief = (job_name == 'master')

     # Notice that we are not handling the case where job_name == 'ps'.
     # That is because TFJob will take care of the parameter servers for us by default.
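     (Editor's sketch, not part of the deck.) To complete the picture, here is a minimal sketch of how job_name, task_id, is_chief and the server created above are typically wired into the training loop in TF 1.x. The tiny variable and loss are stand-ins for the real model; the checkpoint directory is assumed to match the /tmp/tensorflow mount shown earlier.

     # Hypothetical continuation: using is_chief and server.target with a MonitoredTrainingSession.
     import tensorflow as tf

     def run_training(server, cluster_spec_object, task_id, is_chief):
         # Pin variables to the parameter servers, operations to this worker.
         with tf.device(tf.train.replica_device_setter(
                 worker_device="/job:worker/task:%d" % task_id,
                 cluster=cluster_spec_object)):
             w = tf.Variable(tf.zeros([10]), name="w")        # stand-in for the real model parameters
             global_step = tf.train.get_or_create_global_step()
             loss = tf.reduce_mean(tf.square(w - 1.0))        # stand-in for the real loss
             train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss, global_step=global_step)

         # Only the chief creates the session and saves checkpoints/summaries;
         # the other workers wait for it and then join the same training.
         with tf.train.MonitoredTrainingSession(
                 master=server.target,
                 is_chief=is_chief,
                 checkpoint_dir="/tmp/tensorflow",
                 hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
             while not sess.should_stop():
                 sess.run(train_op)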
  26. Deploy it
     $ kubectl get pods
     NAME                           READY  STATUS   RESTARTS  AGE
     kubeflow-master-m8vi-0-rdr5o   1/1    Running  0         23s
     kubeflow-ps-m8vi-0-0vhjm       1/1    Running  0         23s
     kubeflow-worker-m8vi-0-eyb6l   1/1    Running  0         23s
     kubeflow-worker-m8vi-1-bm2ue   1/1    Running  0         23s
  27. See the logs
     $ kubectl logs <master-pod-name>
     [...]
     Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
     Initialize GrpcChannelCache for job ps -> {0 -> kubeflow-ps-m8vi-0:2222}
     Initialize GrpcChannelCache for job worker -> {0 -> kubeflow-worker-m8vi-0:2222, 1 -> kubeflow-worker-m8vi-1:2222}
     2018-04-30 22:45:28.963803: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:333] Started server with target: grpc://localhost:2222
     ...
     Accuracy at step 970: 0.9784
     Accuracy at step 980: 0.9791
     Accuracy at step 990: 0.9796
     Adding run metadata for 999
  28. Hyperparameters with Helm
     image: <DOCKER_USERNAME>/tf-mnist:gpu
     useGPU: true
     hyperParamValues:
       learningRate:
         - 0.001
         - 0.01
         - 0.1
       hiddenLayers:
         - 5
         - 6
         - 7
     9 TFJobs will be deployed on the cluster, one per combination of learning rate and hidden-layer count (see the sketch below).
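     (Editor's illustration, not part of the deck.) The grid is simply the Cartesian product of the two value lists: 3 learning rates x 3 hidden-layer counts = 9 TFJobs. The job names below mirror the kubeflow-tf-mnist-{{ $i }}-{{ $j }} naming scheme used by the Helm template on the next slides.

     # Illustrative only: enumerate the hyperparameter grid defined in values.yaml.
     from itertools import product

     learning_rates = [0.001, 0.01, 0.1]
     hidden_layers = [5, 6, 7]

     for (i, lr), (j, layers) in product(enumerate(learning_rates), enumerate(hidden_layers)):
         print("kubeflow-tf-mnist-%d-%d: lr=%s, hidden layers=%d" % (i, j, lr, layers))
     # -> 9 combinations, hence 9 TFJobs deployed on the cluster.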
  29. Create a Helm chart
     # First we copy the values from values.yaml into variables to make them easier to access
     {{- $lrlist := .Values.hyperParamValues.learningRate -}}
     {{- $nblayerslist := .Values.hyperParamValues.hiddenLayers -}}

     # Then we loop over every value of $lrlist (learning rate) and $nblayerslist (hidden layer depth).
     # This will result in creating 1 TFJob for every pair of learning rate and hidden layer depth.
     {{- range $i, $lr := $lrlist }}
     {{- range $j, $nblayers := $nblayerslist }}
     apiVersion: kubeflow.org/v1alpha2
     kind: TFJob  # Each one of our trainings will be a separate TFJob
     metadata:
       name: kubeflow-tf-mnist-{{ $i }}-{{ $j }}  # We give a unique name to each training
  30. Create a Helm chart
     args:
       # Here we pass a unique learning rate and hidden layer count to each instance.
       # We also put the values between quotes to avoid potential formatting issues.
       - --learning-rate
       - {{ $lr | quote }}
       - --hidden-layers
       - {{ $nblayers | quote }}
       - --logdir
       - /tmp/tensorflow/tf-mnist-lr{{ $lr }}-d-{{ $nblayers }}  # We save the summaries in a different directory per run
  31. Create a Helm chart
     # We only want to request GPUs if we asked for it in values.yaml with useGPU
     {{ if $useGPU }}
     resources:
       limits:
         nvidia.com/gpu: 1
     {{ end }}
  32. Deploy it
     $ kubectl get pods
     NAME                                        READY  STATUS   RESTARTS  AGE
     kubeflow-tf-mnist-0-0-master-juc5-0-hw5cm   0/1    Pending  0         4s
     kubeflow-tf-mnist-0-1-master-pu49-0-jp06r   1/1    Running  0         14s
     kubeflow-tf-mnist-0-2-master-awhs-0-gfra0   0/1    Pending  0         6s
     kubeflow-tf-mnist-1-0-master-5tfm-0-dhhhv   1/1    Running  0         16s
     kubeflow-tf-mnist-1-1-master-be91-0-zw4gk   1/1    Running  0         16s
     kubeflow-tf-mnist-1-2-master-r2nd-0-zhws1   0/1    Pending  0         7s
     kubeflow-tf-mnist-2-0-master-7w37-0-ff0w9   0/1    Pending  0         13s
     kubeflow-tf-mnist-2-1-master-260j-0-l4o7r   0/1    Pending  0         10s
     kubeflow-tf-mnist-2-2-master-jtjb-0-5l84q   0/1    Pending  0         9s
  33. Deploy with ksonnet
     $ ks pkg install kubeflow/tf-serving@74629b7
     $ ks generate tf-serving ${MODEL_COMPONENT} --name=${MODEL_NAME}
     $ ks apply default -c ${MODEL_COMPONENT}
  34. Grab the host and call our client
     $ export TF_MODEL_SERVER_HOST=$(kubectl get svc ${MODEL_NAME} -n ${NAMESPACE} --template="{{range .status.loadBalancer.ingress}}{{.ip}}{{end}}")
     $ export TF_MNIST_IMAGE_PATH=data/7.png
     $ python mnist_client.py
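     (Editor's sketch, not part of the deck.) mnist_client.py itself is not shown in the slides; here is a minimal sketch of what such a TensorFlow Serving gRPC client could look like. The model name "mnist", the default port 9000 and the flat 784-float input tensor named "images" are all assumptions, not confirmed by the deck.

     # Hypothetical TF Serving gRPC client (details assumed).
     import os
     import grpc
     import numpy as np
     import tensorflow as tf
     from PIL import Image
     from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

     host = os.environ["TF_MODEL_SERVER_HOST"]
     image_path = os.environ.get("TF_MNIST_IMAGE_PATH", "data/7.png")

     # Load the 28x28 image as a flat vector of 784 floats in [0, 1] (assumed input format).
     image = np.array(Image.open(image_path).convert("L"), dtype=np.float32).reshape(1, 784) / 255.0

     channel = grpc.insecure_channel("%s:9000" % host)
     stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

     request = predict_pb2.PredictRequest()
     request.model_spec.name = "mnist"  # assumed to match ${MODEL_NAME}
     request.inputs["images"].CopyFrom(tf.contrib.util.make_tensor_proto(image, shape=[1, 784]))

     result = stub.Predict(request, timeout=10.0)  # returns the outputs shown on the next slide
     print(result)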
  35. Get the output...
     outputs {
       key: "classes"
       value {
         dtype: DT_UINT8
         tensor_shape {
           dim { size: 1 }
         }
         int_val: 7
       }
     }
     outputs {
       key: "predictions"
       value {
         dtype: DT_FLOAT
         tensor_shape {
           dim { size: 1 }
           dim { size: 10 }
         }
         float_val: 0.0
         float_val: 0.0
         [...]
  36. … and the prediction!
     ............................
     ............................
     ............................
     ............................
     ............................
     ............................
     ............................
     ..............@@@@@@........
     ..........@@@@@@@@@@........
     ........@@@@@@@@@@@@........
     ........@@@@@@@@.@@@........
     ........@@@@....@@@@........
     ................@@@@........
     ...............@@@@.........
     ...............@@@@.........
     ...............@@@..........
     ..............@@@@..........
     ..............@@@...........
     .............@@@@...........
     .............@@@............
     ............@@@@............
     ............@@@.............
     ............@@@.............
     ...........@@@..............
     ..........@@@@..............
     ..........@@@@..............
     ..........@@................
     ............................
     Your model says the above number is... 7!
  37. What about rollout, A/B testing, ... ?
     Kubeflow integrates well with Istio, which supports:
     • Auth
     • Quotas
     • Rollout
     • A/B testing
     • Metrics
  38. What about rollout, A/B testing, ... ?
     $ ks param set --env=$ENV $MODEL_COMPONENT version v2
     $ ks param set --env=$ENV $MODEL_COMPONENT firstVersion false
     $ ks apply $ENV -c $MODEL_COMPONENT
     But traffic continues to go to v1. We have to update the routing rules.
  39. What about rollout, A/B testing, ... ?
     apiVersion: config.istio.io/v1alpha2
     kind: RouteRule
     metadata:
       name: inception-rollout
       namespace: kubeflow
     spec:
       destination:
         name: inception
       precedence: 2
       route:
       - labels:
           version: v1
         weight: 95
       - labels:
           version: v2
         weight: 5
  40. Conclusion and takeaway
     • Kubeflow can ease the deployment, the training and the operation of ML models at scale
     • It is still a relatively young project and not yet mature (1.0 targeted for the end of the year)
     • Enabling Big Data and Machine Learning on Kubernetes will allow IT organizations to standardize on the same Kubernetes infrastructure, propelling adoption and reducing costs
     • Both Kubernetes and Kubeflow will enable IT organizations to focus more effort on applications rather than infrastructure