Slide 1

Scaling Kubernetes controllers
Ahmet Balkan (@ahmetb)
2023-01-24

Slide 2

Intro to Controllers in Kubernetes

Examples:
1. kube-controller-manager
2. kube-scheduler ?
3. kubelet ??
4. kube-apiserver ???
…

[Diagram: a controller (1) watches the Kubernetes API and (2) does something to the external world.]

Slide 3

Informer machinery
1. LIST+WATCH to be aware of tracked Kubernetes objects.
2. Cache the encountered objects in-memory (reduce live API requests).
3. Notify the controller of new/updated objects (+ existing objects periodically).

[Diagram: the controller process embeds an informer (client-go pkgs) that watches the k8s API with a periodic resync (list), stores objects in a local cache, and notifies the handler, which gets objects from that cache.]
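A minimal sketch of this machinery with client-go (an illustration, not the talk's code; the 10-minute resync period and the Pod resource are assumptions):

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// 1. LIST+WATCH Pods; 2. cache them in memory; the resync period
	// re-delivers the cached objects to handlers every 10 minutes.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	// 3. Notify the controller of new/updated/deleted objects.
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("observed pod:", pod.Namespace, pod.Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue for reconciliation */ },
		DeleteFunc: func(obj interface{}) { /* handle deletion */ },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)                                 // starts the LIST+WATCH goroutines
	cache.WaitForCacheSync(stop, podInformer.HasSynced) // wait for the initial LIST
	<-stop
}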

Slide 4

Production-ready controllers
● tolerate faults,
● scale with the load

Slide 5

Fault tolerance
● Run N replicas.
● Elect a leader.
● Only the leader does work.
● If the leader fails, another replica takes over.
(active/passive)
e.g. kube-controller-manager, kube-scheduler, cert-manager, …

[Diagram: Pod replica1/replica2/replica3 each run the controller; a Lease object records leader: replica2.]
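A minimal leader-election sketch with client-go's leaderelection package (an illustration; the lock name, namespace, and timings are assumptions):

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, _ := rest.InClusterConfig()
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // unique identity per replica

	// The lock is backed by a coordination.k8s.io/v1 Lease object.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-controller", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the leader runs the controller loop.
			},
			OnStoppedLeading: func() {
				// Lost the lease: stop doing work so another replica can take over.
				os.Exit(0)
			},
		},
	})
}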

Slide 6

Lease API in Kubernetes
Typically used to elect leaders in Kubernetes.

apiVersion: coordination.k8s.io/v1
kind: Lease
spec:
  holderIdentity: my-replica-1
  leaseDurationSeconds: 10
  renewTime: "2022-11-30T18:04:27.912073Z"

Slide 7

Scalability
Skeptical?
● The supported limit is 5K nodes, but providers are supporting 25K+ nodes.
● The Kubernetes Job controller in 1.26 now supports 100K pods.
● The kcp project aims to push the Kubernetes API server beyond its current limits (more storage, more watches, etc.)
● CRD sprawl, multitenancy, …

Slide 8

Scalability: Throughput
If it takes t time to reconcile N objects, how long does it take to reconcile 1000*N objects?
What if only the leader is allowed to do work?
Where is it throttled? (CPU, etcd, network…)

[Diagram: Pod replica1/replica2/replica3 each run the controller; a single workqueue feeds the work.]

Slide 9

Scalability: Memory
● Active replicas all maintain a LIST+WATCH into a local cache.
● How much memory do you think it takes to store 100,000 pods?
● What about during a periodic resync (full LIST)?
● How much memory are you willing to throw at your controller?

[Diagram: Pod replica1/replica2/replica3 each run the controller with its own local cache, all doing LIST+WATCH against kube-apiserver.]

Slide 10

No content

Slide 11

What do we need for horizontal controller scalability?
● Use existing controller development libraries (e.g. client-go, controller-runtime)
● Membership and failure detection for controller replicas
● Preventing concurrent handling of an object

Slide 12

High-level Architecture
What if we exploited the fact that you can create watches with label selectors?

[Diagram: a sharder does list+watch on ALL Pods and labels them; controller replicaA/replicaB/replicaC each list+watch only the Pods with label=A/B/C against kube-apiserver.]
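For illustration, a sketch of how one replica could LIST+WATCH only its own shard with a label-selector-filtered informer (the shard label key/value are assumptions):

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, _ := rest.InClusterConfig()
	client := kubernetes.NewForConfigOrDie(cfg)

	// Only objects labeled shard=replicaA are listed, watched, and cached
	// by this replica; other shards' objects never reach its cache.
	factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "shard=replicaA"
		}))
	podInformer := factory.Core().V1().Pods().Informer()
	_ = podInformer // wire up event handlers / workqueue as usual

	stop := make(chan struct{})
	factory.Start(stop)
	<-stop
}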

Slide 13

High-level Architecture
What if we exploited the fact that you can create watches with label selectors?

[Diagram: same architecture as before, annotated with the open problems:]
● How to discover members?
● The sharder is a single point of failure and still a bottleneck.
● How to reassign the work of dead replicas?

Slide 14

Object Partitioning
Consistent hash ring with virtual nodes representing controller replicas.
● hash(apiGroup_ns_name)
● find the spot on the ring
● assign the object to the controller replica by labeling the object on the Kubernetes API:

metadata:
  labels:
    shard: controller-a73e7b

[Diagram: hash ring with virtual nodes A/A'/A'', B/B'/B'', C/C'/C''.]
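A toy sketch of such a consistent hash ring (the hash function, key format, and virtual-node count are illustrative, not the thesis implementation):

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	points []uint32          // sorted hashes of virtual nodes
	owner  map[uint32]string // virtual-node hash -> replica name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(replicas []string, virtualNodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, name := range replicas {
		for i := 0; i < virtualNodes; i++ {
			h := hashKey(fmt.Sprintf("%s-%d", name, i))
			r.points = append(r.points, h)
			r.owner[h] = name
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// assign maps an object key (e.g. "apiGroup_ns_name") to a replica by walking
// clockwise to the next virtual node on the ring.
func (r *ring) assign(objectKey string) string {
	h := hashKey(objectKey)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := newRing([]string{"controller-a73e7b", "controller-b41f09", "controller-c9d2e1"}, 3)
	fmt.Println(ring.assign("apps_default_my-deployment"))
}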

Slide 15

Scaling the sharder
Active/passive: the active sharder stores only the metadata portion of objects.

[Diagram: the active sharder replica does list+watch on ALL Pods against kube-apiserver and labels them (shard=replicaA), keeping a local cache of only object `metadata`; the other two sharder replicas are standby; a Lease (sharder-leader) records the current leader.]
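A sketch of how the sharder could watch only object metadata using client-go's metadata informer, which delivers PartialObjectMetadata instead of full objects (the Pod resource and the label-patching step are assumptions):

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/metadata/metadatainformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, _ := rest.InClusterConfig()
	metaClient := metadata.NewForConfigOrDie(cfg)

	// The metadata informer LIST+WATCHes only object metadata,
	// keeping the sharder's cache small.
	factory := metadatainformer.NewSharedInformerFactory(metaClient, 10*time.Minute)
	gvr := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
	inf := factory.ForResource(gvr).Informer()

	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			m := obj.(*metav1.PartialObjectMetadata)
			fmt.Println("sharder saw:", m.Namespace, m.Name, m.Labels["shard"])
			// ...compute the ring position and patch the `shard` label here...
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	<-stop
}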

Slide 16

Membership discovery
How do we learn which controller replicas are up or down?

[Diagram: each controller replica (replicaA/B/C) renews its own Lease (holder: itself); the sharder watches these Leases and considers a replica unhealthy if its Lease is not renewed in the past 2 x leaseDurationSeconds.]
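A sketch of the unhealthiness check the sharder could apply to each replica's Lease (2 x leaseDurationSeconds, as above); the surrounding watch wiring is omitted:

package main

import (
	"fmt"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isHealthy reports whether the replica behind the given Lease renewed it
// within the past 2 x leaseDurationSeconds.
func isHealthy(lease *coordinationv1.Lease, now time.Time) bool {
	if lease.Spec.RenewTime == nil || lease.Spec.LeaseDurationSeconds == nil {
		return false
	}
	ttl := 2 * time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second
	return now.Sub(lease.Spec.RenewTime.Time) <= ttl
}

func main() {
	dur := int32(10)
	stale := &coordinationv1.Lease{Spec: coordinationv1.LeaseSpec{
		LeaseDurationSeconds: &dur,
		RenewTime:            &metav1.MicroTime{Time: time.Now().Add(-30 * time.Second)},
	}}
	fmt.Println(isHealthy(stale, time.Now())) // false: last renewal 30s ago, TTL is 20s
}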

Slide 18

Reassignment/rebalancing
The sharder keeps the hash ring up to date (replicas die, new ones are added). Objects must be reassigned to their new destination, and we need to ensure the old replica stops reconciling the object.
● Step 1: sharder adds the label `drain: true` on the object.
● Step 2: controller sees the `drain` label and removes the `shard` label.
● Step 3: sharder sees the object now has no `shard` label.
● Step 4: sharder calculates the new replica and sets the `shard` label.

[Diagram: hash ring as before, with a new replica D added.]
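A sketch of the controller-side half of this handshake (Step 2); the label keys and the JSON merge patch shape are assumptions, not the sharder's actual implementation:

package shardedcontroller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// handleDrain acknowledges a drain request: when the sharder has set
// `drain: true`, this replica stops reconciling the object and removes the
// `shard` (and `drain`) labels so the sharder can reassign it.
func handleDrain(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) (drained bool, err error) {
	if pod.Labels["drain"] != "true" {
		return false, nil // not being drained; reconcile normally
	}
	// JSON merge patch: setting a label to null removes it.
	patch := []byte(`{"metadata":{"labels":{"shard":null,"drain":null}}}`)
	_, err = client.CoreV1().Pods(pod.Namespace).Patch(ctx, pod.Name,
		types.MergePatchType, patch, metav1.PatchOptions{})
	return true, err
}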

Slide 19

No content

Slide 20

Results?
With N=3 replicas, CPU usage is only 12% less (on the active sharder).

Slide 21

Results?
With N=3 replicas, memory usage is only 11% less (on the active sharder).
My theory: the controller-runtime shared informer cache is still carrying the entire object (not "just metadata"). Needs more debugging.
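One possible direction for this (a hedged sketch, not what the experiment used): controller-runtime's builder.OnlyMetadata option makes the shared cache watch and store PartialObjectMetadata instead of full objects:

package controllers

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// SetupPodController wires a reconciler that watches Pods in metadata-only
// form, so the shared informer cache does not hold full Pod objects.
func SetupPodController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}, builder.OnlyMetadata).
		Complete(r)
}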

Slide 22

More ideas?
What if we weren't limited to the existing controller/informer machinery? We could use various pub/sub models that assign reconciliations to controllers on the fly.

[Diagram: dispatcher1…dispatcherN watch the k8s API (watch-only, not cached); each handles 1/N of the objects and dispatches updates to connected clients via consistent hashing; controllers connect through a LB with long polling to watch object changes and do not cache locally.]

Slide 23

Further reading
https://github.com/timebertt/kubernetes-controller-sharding for code
https://github.com/timebertt/thesis-controller-sharding for thesis

Thanks. @ahmetb