
K8s_Metacontroller

Jack
September 28, 2019


This deck is a bit of a casual talk:
it introduces the K8s operator pattern and how to implement a controller with Metacontroller,
and closes with a short chat about two features I built on top of Kubeflow/tf-operator.


Transcript

  1. ⦁ Finishing (?) a master's at the Institute of Computer Science, National Tsing Hua University ⦁ One-year exchange studying Distributed Systems at TU Dresden, Germany

    ⦁ Six-month internship at Cloud&Heat, a German company ⦁ Six-month remote internship from Germany with InfuseAI, a Taiwanese company ⦁ A tiny Reviewer on Kubeflow/tf-operator with only code-review and /LGTM permissions ⦁ And… ⦁ …currently halfway through the exhausting iThome Ironman blogging contest – "Distributed Systems: Staying Consistent in a Distributed World" About me - Jack
  2. Outline • What is K8s Controller & Operator Pattern •

    Build Operator with client-go • Metacontroller • What is it? • How to use it? • Kubeflow Experience
  3. [Diagram: api-server, scheduler, etcd, and the deployment controller in the control plane; kubelet and proxy on each node; a pod object "add" event lands in etcd]

    Deployment controller: 1. Receives the "Add" deployment events (via HTTP watch) 2. Processes the deployment object • How many pods does it need? • How many pods belonging to this object are already in the cluster? 3. Creates/deletes Pods by calling the K8s API
  4. [Diagram: same control plane and nodes; the scheduler writes a pod object "update" to etcd]

    Scheduler: 1. Watches/processes the pods not yet bound to a node 2. Binds pods to nodes 3. Updates the pods in etcd
  5. [Diagram: same control plane and nodes; kubelet picks up the pods bound to its node]

    Kubelet: 1. Watches/processes the pods bound to its node 2. Creates the pods according to their spec
  6. [Diagram: an XXX controller alongside the api-server, scheduler, etcd, kubelets, and proxies]

    Reconcile on 1. Add 2. Update 3. Delete events: compare the desired world with the real world, then call the K8s API to update it (sketched below). Objects: Pods, Namespaces, Services, Jobs …
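To make the reconcile idea concrete, here is a minimal Go sketch of the compare-and-fix loop; the `Deployment`, `listPods`, `createPod`, and `deletePod` names below are hypothetical stand-ins, not real client-go API:

```go
package main

import "fmt"

// Stand-in types: in a real controller these come from k8s.io/api,
// and the helpers below would be client-go calls.
type Pod struct{ Name string }
type Deployment struct {
	Name     string
	Replicas int // the desired world
}

func listPods(d Deployment) []Pod  { return nil } // real world, from cache/API (stub)
func createPod(d Deployment) error { fmt.Println("create pod for", d.Name); return nil }
func deletePod(p Pod) error        { fmt.Println("delete", p.Name); return nil }

// reconcile compares the desired world with the real world and
// calls the (hypothetical) K8s API to close the gap.
func reconcile(d Deployment) error {
	pods := listPods(d)
	for i := len(pods); i < d.Replicas; i++ { // too few: create the missing pods
		if err := createPod(d); err != nil {
			return err
		}
	}
	if len(pods) > d.Replicas { // too many: delete the extras
		for _, p := range pods[d.Replicas:] {
			if err := deletePod(p); err != nil {
				return err
			}
		}
	}
	return nil // already converged
}

func main() {
	_ = reconcile(Deployment{Name: "demo", Replicas: 3})
}
```

Every built-in controller (Deployment, Job, …) is some variation of this loop.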
  7. Operator Pattern • What if we want to define our

    own objects and resources in K8s? Custom Resource Definition (CRD) + Custom Controller = Operator Pattern • I know CRDs, but how do I build a Custom Controller? 1. The old and hard way 2. The fashionable and easy way
  8. Controller management logic uses client-go to call the K8s API =>

    operate on K8s objects 1. Informers: watch the resources and store their status in a local cache. 2. Workqueue: on any change, the informer fires a callback that puts the object's key into the workqueue. 3. Control loop: goroutines act as workers to handle the items in the workqueue. 4. Use the client to call the API server and drive the actual status toward the desired status. (A sketch of this wiring follows.)
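A hedged sketch of that wiring with client-go APIs of the era; the kubeconfig path is an assumption and error handling is trimmed:

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Build a clientset from kubeconfig (in-cluster config also works).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// 1. Informer: watches pods and mirrors them into a local cache.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	// 2. Workqueue: informer callbacks enqueue a "namespace/name" key.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
	})

	// 3./4. Workers draining the queue would go here (see the next slide).
	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	<-stop
}
```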
  9. Workqueue 1. If processing succeeds, the key can

    be removed from the workqueue by calling the Forget() function. 2. Forget() only stops the workqueue from tracking the event's retry history; to remove the event completely from the workqueue, the controller must also trigger the Done() function. 3. When should the controller start the workers processing the workqueue? Wait until the cache is completely synchronized, so they act on the latest state. (A worker-loop sketch follows.)
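And a sketch of the worker side; `processItem` is a hypothetical placeholder for your reconcile logic, and `queue`/`synced` are assumed to come from the setup above:

```go
package controller

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// processItem is a hypothetical stand-in for the reconcile logic.
func processItem(key string) error { return nil }

// runWorker drains the workqueue; it is started only after the caches sync.
func runWorker(queue workqueue.RateLimitingInterface, synced cache.InformerSynced, stop <-chan struct{}) {
	// Start only after the cache is completely synchronized, so the
	// worker sees the latest state.
	if !cache.WaitForCacheSync(stop, synced) {
		return
	}
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		if err := processItem(key.(string)); err != nil {
			queue.AddRateLimited(key) // failed: retry later with backoff
		} else {
			queue.Forget(key) // succeeded: drop the retry history
		}
		queue.Done(key) // in both cases, tell the queue this key is finished
	}
}
```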
  10. Old and Hard Way – All by yourself! • Language:

    Golang • Libraries: client-go, code-generator, apimachinery… • Manage the Informer, Workqueue, and Controller by yourself • Tools: rook's operator-kit
  11. Build the Operator – Go dep ⦁ A package-management

    tool for Go. ⦁ Install: $ go get -u github.com/golang/dep/cmd/dep ⦁ Create the Go project: $ mkdir $GOPATH/src/my-operator ⦁ Init the project: $ dep init ⦁ Put the packages we need in Gopkg.toml ⦁ Download the packages: $ dep ensure // downloads the packages into the vendor folder. Note: "go get" means cloning the git repo into your $GOPATH/src
  12. Build the Operator – Type & Register ⦁ $ mkdir

    -p $GOPATH/src/my-operator/pkg/apis/student/v1 ⦁ Create 3 files in pkg/apis/student/v1: 1. doc.go 2. register.go 3. types.go ⦁ $ mkdir $GOPATH/src/my-operator/hack ⦁ Create codegen.sh in hack. Note: just copy the files from rook/operator-kit and modify them (types.go sketched below).
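As a reference, a minimal types.go sketch modeled on the rook/operator-kit sample; the Student kind and its spec fields are illustrative:

```go
// pkg/apis/student/v1/types.go
package v1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// Student is the custom resource this operator manages.
type Student struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              StudentSpec `json:"spec"`
}

// StudentSpec is the desired state (illustrative fields).
type StudentSpec struct {
	School string `json:"school"`
	Grade  int    `json:"grade"`
}

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// StudentList is the list type client-go requires for every resource.
type StudentList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []Student `json:"items"`
}
```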
  13. Build the Operator – Code-generator • Run the script to

    generate the code. • You can now write the operator against your custom resource API, in Golang of course… References (almost every operator is built this way): • etcd: https://github.com/coreos/etcd-operator • Kubeflow/tf-operator: https://github.com/kubeflow/tf-operator • rook: https://github.com/rook/rook • all Kubernetes controllers: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go
  14. What is Metacontroller? ⦁ An open-source project started at Google ⦁

    Wraps all the details into a framework; all you need to do is 1. Define the CRD 2. Define the management logic as a webhook! ⦁ When there is an event (add/delete/update) related to your CRD: 1. Metacontroller is triggered 2. It sends the current CRD object to you in JSON 3. You define the management logic depending on the object's status 4. You send the desired state back, also in JSON
  15. Install Metacontroller ⦁ Components: ◦ 1 Controller: metacontroller ◦ 3

    CRDs: ◦ compositecontrollers ◦ decoratorcontrollers ◦ controllerrevisions
  16. CompositeController (CRD) ⦁ Scenario: manage a set of child objects based

    on the desired state specified in a parent object. ⦁ E.g. the Deployment controller in K8s: a Deployment manages a set of pods. (Example manifest below.) [Diagram callouts: "Your own CRD" marks the parent resource; "The child objects your CRD is managing, like pods" marks the children.]
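A hedged sketch of such a CompositeController manifest; the parent CRD group (example.com/v1, students) and the webhook URL are assumptions for illustration:

```yaml
apiVersion: metacontroller.k8s.io/v1alpha1
kind: CompositeController
metadata:
  name: student-controller
spec:
  generateSelector: true
  parentResource:
    apiVersion: example.com/v1      # your own CRD (illustrative group)
    resource: students
  childResources:
    - apiVersion: v1
      resource: pods                # the child objects your CRD manages
      updateStrategy:
        method: InPlace             # see the strategies on the next slide
  hooks:
    sync:
      webhook:
        url: http://student-controller.default.svc/sync   # illustrative URL
```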
  17. CompositeController (CRD) Strategies ⦁ Child Update Methods ⦁ OnDelete: only update

    the child objects when they are deleted. ⦁ Recreate: when a child object's status differs from the desired state, delete and recreate it. ⦁ InPlace: update the child object's status without recreating it. ⦁ Child Update Status Checks - Status Condition Check ⦁ type ⦁ status ⦁ reason ⦁ Finalize Hook ⦁ lets you do something before your CRD object is deleted.
  18. DecoratorController (CRD) • Scenario: adding new behavior to existing resources,

    like pods or statefulsets, selected by label. E.g. you want to add a Service to every pod. (Sketch below.) [Diagram callouts: "The objects you want to manage" and "Select by the label".]
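A hedged sketch of the service-per-pod idea as a DecoratorController; the label and webhook URL are illustrative:

```yaml
apiVersion: metacontroller.k8s.io/v1alpha1
kind: DecoratorController
metadata:
  name: service-per-pod
spec:
  resources:
    - apiVersion: v1
      resource: pods                 # the existing objects to decorate
      labelSelector:
        matchLabels:
          service-per-pod: "true"    # select by the label
  attachments:
    - apiVersion: v1
      resource: services             # the new behavior: a Service per pod
  hooks:
    sync:
      webhook:
        url: http://service-per-pod.default.svc/sync   # illustrative URL
```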
  19. Sync Function ⦁ A simple HTTP server ⦁ Exposed by a

    Service and called by Metacontroller ⦁ Defines your own webhook function: 1. Metacontroller sends the current CRD object to you in JSON 2. You define the management logic depending on the object's status 3. You send the desired state back, also in JSON ⦁ The hook script can simply be mounted via a ConfigMap (a sketch follows)
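A minimal sync-hook sketch in Go, assuming the request/response shape described above (parent object in, desired status and children out); the structs use raw maps rather than typed objects:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// SyncRequest is the JSON body Metacontroller POSTs to the hook:
// the current parent object plus the observed children.
type SyncRequest struct {
	Parent   map[string]interface{}            `json:"parent"`
	Children map[string]map[string]interface{} `json:"children"`
}

// SyncResponse is what we send back: the desired parent status
// and the full desired list of children.
type SyncResponse struct {
	Status   map[string]interface{} `json:"status"`
	Children []interface{}          `json:"children"`
}

func sync(w http.ResponseWriter, r *http.Request) {
	var req SyncRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// Management logic goes here: inspect req.Parent's spec/status and
	// compute the children you want. This placeholder desires none.
	resp := SyncResponse{
		Status:   map[string]interface{}{"observedChildren": len(req.Children)},
		Children: []interface{}{},
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(resp)
}

func main() {
	http.HandleFunc("/sync", sync)
	log.Fatal(http.ListenAndServe(":8080", nil)) // exposed by a Service
}
```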
  20. Distributed Tensorflow • Worker: does the real training and updates

    the model with the training data • PS: stores the parameters of the model sent from the Workers
  21. tf-operator • For each replica you define a template, which

    is a K8s PodTemplateSpec • Manages the status of each pod (Worker/PS) • Makes sure the pods/services exist • PodGroup and gang-scheduling (an example TFJob follows)
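For reference, a sketch of a TFJob manifest with one PS and three workers; the image name is illustrative, and the apiVersion has changed across tf-operator releases:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-mnist
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:                        # a plain K8s PodTemplateSpec
        spec:
          containers:
            - name: tensorflow
              image: dist-mnist:latest # illustrative image
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: dist-mnist:latest
```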
  22. Why gang-scheduling ⦁ Gang-scheduling: schedule all related pods at once

    (via an annotation; sketched below) ⦁ Using the Kubernetes default scheduler to schedule TensorFlow containers ◦ With no gang-scheduling, deadlock can occur in the system. [Figure: two GPU nodes; the default scheduler places pods A1–A3 and B1–B3 round-robin at the pod level, so jobs A and B each hold part of the GPUs and neither can start.]
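A hedged sketch of how the annotation-based gang-scheduling wiring looked with kube-batch at the time; the API group and annotation key varied between releases, so treat the names as assumptions:

```yaml
apiVersion: scheduling.incubator.k8s.io/v1alpha1   # kube-batch-era group (assumption)
kind: PodGroup
metadata:
  name: dist-mnist
spec:
  minMember: 4                 # 1 PS + 3 workers: place all at once or none
---
apiVersion: v1
kind: Pod
metadata:
  name: dist-mnist-worker-0
  annotations:
    scheduling.k8s.io/group-name: dist-mnist       # join the gang (assumption)
spec:
  schedulerName: kube-batch    # bypass the default scheduler
  containers:
    - name: tensorflow
      image: dist-mnist:latest
```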
  23. Topology aware gang scheduler Gang-scheduling: ◦ Gang tasks are a

    single scheduling unit, e.g. admitted, placed, killed, and finished together. ◦ Gang tasks are independent execution units. Adapted first-come-first-served (AFCFS). [Figure: the default Kubernetes scheduler would start job B piecemeal; the topology-aware gang scheduler does not allow B to start until all of its tasks fit.]
  24. Topology aware gang scheduler Topology awareness ◦ Schedule the workers

    on the same nodes as the parameter server. ◦ If not all the workers can be placed on the same node, the parameter server is placed on the node with the most workers. [Figure: same scenario as before, with the PS placed on the node holding the largest group of workers.]
  25. Auto-scaling workload manager • Scale up - to increase resource utilization and

    training speed. • Scale down - to decrease job waiting time. • Range for the number of workers ◦ For each job, the user can set a minimum and maximum number of workers. ◦ The workload manager scales the job up or down within that range. Note: this is just part of the YAML (a hypothetical fragment follows).
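The slide's actual YAML isn't reproduced in this transcript; a hypothetical fragment of what the extended worker spec could look like (the minReplicas/maxReplicas field names are invented for illustration):

```yaml
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2        # current number of workers
      minReplicas: 1     # hypothetical: never scale below this
      maxReplicas: 4     # hypothetical: never scale above this
```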
  26. Kubernetes Cluster Auto-scaling workload manager Scale up ◦ Periodically monitor

    the resource utilization ◦ When the resources' idle time surpasses a time threshold, scale up the currently running jobs. [Figure: the workload manager monitors cluster utilization and scales the running job up into idle resources.]
  27. Auto-scaling workload manager Scale down ◦ Keep track of the

    waiting time of each job ◦ When a job's waiting time surpasses the time threshold, release resources from the running jobs for the waiting jobs. [Figure: job B has waited past the threshold; the workload manager shrinks running job A so the topology-aware gang scheduler can admit B.]