Slide 1

Slide 1 text

Kubernetes Metacontroller
CLOUD NATIVE MEETUP #20
JACK LIN

Slide 2

Slide 2 text

About me - Jack
⦁ Graduated(?) as a fourth-year master's student from the CS graduate institute at National Tsing Hua University
⦁ One-year exchange with the Distributed Systems group at TU Dresden, Germany
⦁ Half-year internship at Cloud&Heat, a German company
⦁ Half-year remote internship from Germany with InfuseAI, a Taiwanese company
⦁ A humble Reviewer on Kubeflow/tf-operator with only code review and /LGTM permissions
⦁ And…
⦁ …currently halfway through the exhausting iThome Ironman contest – "Distributed Systems: Staying Consistent in a Distributed World"

Slide 3

Slide 3 text

Outline
• What is K8s Controller & Operator Pattern
• Build Operator with client-go
• Metacontroller
  • What is it?
  • How to use it?
• Kubeflow Experience

Slide 4

Slide 4 text

How K8s Works

Slide 5

Slide 5 text

[Architecture diagram: Kubernetes – Master (api-server, scheduler, etcd, deployment controller) and worker nodes (kubelet, proxy)]

Slide 6

Slide 6 text

[Architecture diagram] 1. Create a Deployment

Slide 7

Slide 7 text

[Architecture diagram] Deployment object "add" (stored in etcd)

Slide 8

Slide 8 text

[Architecture diagram] The deployment controller watches the api-server via HTTP watch

Slide 9

Slide 9 text

[Architecture diagram: deployment controller; etcd shows pod object "add"]
1. Receives the "Add" Deployment event (via HTTP watch)
2. Processes the Deployment object:
   • How many pods does it need?
   • How many pods belonging to this object already exist in the cluster?
3. Creates/deletes Pods by calling the K8s API

Slide 10

Slide 10 text

[Architecture diagram: scheduler; etcd shows pod object "update"]
1. Watches/processes pods that are not yet bound to a node
2. Binds pods to nodes
3. Updates the pods in etcd

Slide 11

Slide 11 text

[Architecture diagram: kubelet]
1. Watches/processes the pods bound to its node
2. Creates the pods according to their spec

Slide 12

Slide 12 text

[Architecture diagram: a generic "XXX controller"]
Reconcile on events: 1. Add  2. Update  3. Delete
Compare the desired world with the real world, then call the K8s API to update it.
Objects: Pods, Namespaces, Services, Jobs, …
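The reconcile idea above can be sketched as a tiny pure function. This is a toy model: a real controller calls the K8s API to create and delete pods instead of returning action strings.

```go
package main

import "fmt"

// reconcile compares the desired world with the real world and emits
// the actions needed to converge (toy model; real controllers call
// the K8s API here).
func reconcile(desiredPods, actualPods int) []string {
	var actions []string
	for ; actualPods < desiredPods; actualPods++ {
		actions = append(actions, "create pod")
	}
	for ; actualPods > desiredPods; actualPods-- {
		actions = append(actions, "delete pod")
	}
	return actions
}

func main() {
	fmt.Println(reconcile(3, 1)) // [create pod create pod]
	fmt.Println(reconcile(1, 2)) // [delete pod]
}
```

The loop is level-triggered: it only looks at the current state, so the same function handles Add, Update, and Delete events uniformly.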

Slide 13

Slide 13 text

Operator Pattern
• What if we want to define our own objects and resources in K8s?
  Custom Resource Definition (CRD) + Custom Controller = Operator Pattern
• I know CRDs, but how do I build a Custom Controller?
  1. The old, hard way
  2. The fashionable, easy way

Slide 14

Slide 14 text

Controller management logic uses client-go to call the K8s API and operate on K8s objects:
1. Informers: watch the resource and store its status in a local cache.
2. Workqueue: on any change, the informer fires a callback that puts the object's key into the workqueue.
3. Control loop: worker goroutines handle the items in the workqueue.
4. The client calls the API server to make the actual status equal the desired status.

Slide 15

Slide 15 text

Workqueue
1. If processing succeeds, the controller calls Forget() on the key.
2. Forget() only stops the workqueue from tracking the key's failure history; to remove the item from the workqueue completely, the controller must also call Done().
3. When should the controller start workers processing the workqueue?
   • Wait until the cache is completely synchronized, so the workers see the latest state.

Slide 16

Slide 16 text

The Old and Hard Way – all by yourself!
• Language: Golang
• Libraries: client-go, code-generator, apimachinery, …
• Manage the Informer, Workqueue, and Controller yourself
• Tools: rook's operator-kit

Slide 17

Slide 17 text

Build the Operator – Go dep
• A package-management tool for Go.
• Install: $ go get -u github.com/golang/dep/cmd/dep
• Create the Go project: $ mkdir $GOPATH/src/my-operator
• Init the project: $ dep init
• List the packages we need in Gopkg.toml
• Download the packages: $ dep ensure  (downloads them into the vendor folder)
Note: getting a Go package means cloning its git repo into your $GOPATH/src.

Slide 18

Slide 18 text

Build the Operator – Type & Register (CRD)
• $ mkdir -p $GOPATH/src/my-operator/pkg/apis/student/v1
• Create 3 files in pkg/apis/student/v1:
  1. doc.go
  2. register.go
  3. types.go
• $ mkdir $GOPATH/src/my-operator/hack
• Create codegen.sh in hack
Note: just copy these files from rook/operator-kit and modify them.
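For reference, the CRD manifest that pairs with these Go types might look like this. A sketch, assuming a hypothetical Student resource in the group example.com; apiextensions v1beta1 was the current CRD API in the rook/operator-kit era:

```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: students.example.com
spec:
  group: example.com
  version: v1
  scope: Namespaced
  names:
    kind: Student
    plural: students
    singular: student
```

Once this CRD is registered, the generated clientset and informers from code-generator let the controller watch Student objects like any built-in resource.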

Slide 19

Slide 19 text

Build the Operator – code-generator
• Run the script to generate the code.
• You can now write the operator against your custom resource API – in Golang, of course…
References (almost every operator works this way):
• etcd: https://github.com/coreos/etcd-operator
• Kubeflow/tf-operator: https://github.com/kubeflow/tf-operator
• rook: https://github.com/rook/rook
• the built-in Kubernetes controllers: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

I don’t even know how to write Golang!

Slide 24

Slide 24 text

Metacontroller

Slide 25

Slide 25 text

What is Metacontroller?
⦁ An open-source project started at Google
⦁ It wraps all the details into a framework; all you need to do is:
  1. Define a CRD
  2. Define your management logic as a webhook!
⦁ When an event (add/delete/update) related to your CRD occurs:
  1. Metacontroller is triggered
  2. It sends you the current CRD object in JSON format
  3. You define the management logic depending on the object's status
  4. You send the desired state back, also in JSON format

Slide 26

Slide 26 text

Install Metacontroller
⦁ Components:
  ◦ 1 controller: metacontroller
  ◦ 3 CRDs:
    ◦ compositecontrollers
    ◦ decoratorcontrollers
    ◦ controllerrevisions

Slide 27

Slide 27 text

CompositeController (CRD)
⦁ Scenario: manage a set of child objects based on the desired state specified in a parent object.
⦁ E.g. the Deployment controller in K8s: a Deployment manages a set of Pods.
[Diagram labels: the parent is your own CRD; the children are the objects your CRD manages, like Pods]

Slide 28

Slide 28 text

CompositeController (CRD) Strategies
⦁ Child update methods:
  ⦁ OnDelete: only update child objects when they are deleted.
  ⦁ Recreate: when a child object's status differs from the desired state, delete and recreate it.
  ⦁ InPlace: update the child object without recreating it.
⦁ Child update status checks – status condition check:
  ⦁ type
  ⦁ status
  ⦁ reason
⦁ Finalize hook:
  ⦁ lets you do something before your CRD object is deleted.
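Putting the pieces together, a CompositeController object for a hypothetical students parent resource might look like this. The names and webhook URL are illustrative; the updateStrategy picks one of the child update methods above:

```yaml
apiVersion: metacontroller.k8s.io/v1alpha1
kind: CompositeController
metadata:
  name: student-controller
spec:
  generateSelector: true
  parentResource:
    apiVersion: example.com/v1
    resource: students
  childResources:
    - apiVersion: v1
      resource: pods
      updateStrategy:
        method: InPlace
  hooks:
    sync:
      webhook:
        url: http://student-controller.metacontroller/sync
```

Metacontroller watches the parent and child resources declared here and calls the sync webhook whenever either changes.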

Slide 29

Slide 29 text

DecoratorController (CRD)
• Scenario: adding new behavior to existing resources, like Pods or StatefulSets.
• The objects you want to manage are selected by label.
• E.g. you want to add a Service to every Pod.
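As a sketch, a DecoratorController for the "Service per Pod" scenario could be declared like this. It follows the service-per-pod example shipped with Metacontroller; the label and webhook URL are illustrative:

```yaml
apiVersion: metacontroller.k8s.io/v1alpha1
kind: DecoratorController
metadata:
  name: service-per-pod
spec:
  resources:
    - apiVersion: v1
      resource: pods
      labelSelector:
        matchLabels:
          service-per-pod: "true"
  attachments:
    - apiVersion: v1
      resource: services
  hooks:
    sync:
      webhook:
        url: http://service-per-pod.metacontroller/sync
```

Unlike a CompositeController, the parent here is an existing resource (Pods matching the label), and the decorator only manages the attachments it adds.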

Slide 30

Slide 30 text

Sync Function
⦁ A simple HTTP server
⦁ Exposed via a Service and called by Metacontroller
⦁ Defines your own webhook function:
  1. Metacontroller sends you the current CRD object in JSON format
  2. You define the management logic depending on the object's status
  3. You send the desired state back, also in JSON format
⦁ It can simply be mounted using a ConfigMap
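A minimal sync function in Go might look like this. It is a sketch: the request/response structs are simplified to only the fields used here, and the desired child Pod is illustrative. In a real webhook you would serve syncLogic behind an HTTP handler (e.g. http.HandleFunc("/sync", …)) that decodes the request body and encodes the response:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SyncRequest/SyncResponse model the JSON bodies exchanged with the
// sync hook (simplified; the real payload carries more fields).
type SyncRequest struct {
	Parent map[string]interface{} `json:"parent"`
}

type SyncResponse struct {
	Status   map[string]interface{}   `json:"status"`
	Children []map[string]interface{} `json:"children"`
}

// syncLogic is the management logic: observed parent in, desired
// children out. Here it desires one Pod named after the parent.
func syncLogic(req SyncRequest) SyncResponse {
	name := "unknown"
	if meta, ok := req.Parent["metadata"].(map[string]interface{}); ok {
		if n, ok := meta["name"].(string); ok {
			name = n
		}
	}
	pod := map[string]interface{}{
		"apiVersion": "v1",
		"kind":       "Pod",
		"metadata":   map[string]interface{}{"name": name + "-pod"},
		"spec": map[string]interface{}{
			"containers": []map[string]interface{}{
				{"name": "main", "image": "busybox", "command": []string{"sleep", "3600"}},
			},
		},
	}
	return SyncResponse{
		Status:   map[string]interface{}{"pods": 1},
		Children: []map[string]interface{}{pod},
	}
}

func main() {
	// Demo call with a fake parent; a real webhook would decode this
	// from the HTTP request body sent by Metacontroller.
	req := SyncRequest{Parent: map[string]interface{}{
		"metadata": map[string]interface{}{"name": "demo"},
	}}
	out, _ := json.Marshal(syncLogic(req))
	fmt.Println(string(out))
}
```

Because the hook is just JSON in, JSON out, the same logic could equally be written in Python or any other language, which is the point of Metacontroller.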

Slide 31

Slide 31 text

Demo

Slide 32

Slide 32 text

Kubeflow/tf-operator – Assorted Notes

Slide 33

Slide 33 text

Distributed TensorFlow
• Worker: does the real training and updates the model with the training data
• PS (parameter server): stores the model parameters sent from the Workers

Slide 34

Slide 34 text

tf-operator
• For each replica you define a template, which is a K8s PodTemplateSpec
• Manages the status of each pod (Worker/PS)
• Makes sure the pods/services exist
• PodGroup and gang scheduling

Slide 35

Slide 35 text

Why gang scheduling?
⦁ Gang scheduling: schedule all related pods at once (via an annotation)
⦁ Using the default Kubernetes scheduler for TensorFlow containers:
  ◦ Without gang scheduling, deadlock can exist in the system
[Diagram: the default pod-level scheduler places jobs A (A1–A3) and B (B1–B3) round-robin across two GPU nodes, so neither TensorFlow job can get all of its pods running]

Slide 36

Slide 36 text

Topology-aware gang scheduler
Gang scheduling:
◦ Gang tasks are a single scheduling unit: e.g. admitted, placed, killed, and finished together.
◦ Gang tasks are independent execution units.
◦ Adapted first-come-first-served (AFCFS)
[Diagram: job A's tasks are placed as a gang; job B is not allowed to start until resources for its whole gang are available, unlike the default Kubernetes scheduler]

Slide 37

Slide 37 text

Topology-aware gang scheduler
Topology awareness:
◦ Schedule the workers on the same nodes as the parameter server.
◦ If not all workers can be placed on the same node, the parameter server is placed on the node with the most workers.
[Diagram: parameter server PS A is co-located with the node holding most of job A's workers]

Slide 38

Slide 38 text

Auto-scaling workload manager
— Scale up: to increase resource utilization and training speed.
— Scale down: to decrease job waiting time.
— Range for the number of workers:
  ◦ For each job, the user can set a minimum and a maximum number of workers.
  ◦ The workload manager scales the job up or down within that range.
Note: this is just part of the YAML.

Slide 39

Slide 39 text

Auto-scaling workload manager
Scale up:
◦ Periodically monitor the resource utilization of the Kubernetes cluster
◦ When the resource idle time surpasses a time threshold, scale up the currently running jobs
[Diagram: the workload manager tells the controller to scale jobs up when resource utilization drops]

Slide 40

Slide 40 text

Auto-scaling workload manager
Scale down:
◦ Keep track of the waiting time of each job
◦ When the waiting time surpasses the time threshold, release resources from the running jobs for the waiting jobs
[Diagram: job B has waited past the time threshold, so the topology-aware gang scheduler scales running jobs down to free resources for it]

Slide 41

Slide 41 text

E2E testing

Slide 42

Slide 42 text

Thank you!