Scaling Kubernetes controllers
Ahmet Balkan (@ahmetb)
2023-01-24

Intro to Controllers in Kubernetes
Examples:
1. kube-controller-manager
2. kube-scheduler ?
3. kubelet ??
4. kube-apiserver ???
…
[Diagram: controller, Kubernetes API, external world. Arrow label: "2. do something".]

Informer machinery
1. LIST+WATCH to be aware of tracked Kubernetes objects.
2. Cache the encountered objects in memory (reduces live API requests).
3. Notify the controller of new/updated objects (plus existing objects periodically).
[Diagram: controller process containing the handler and the informer (client-go pkgs) with its local cache; k8s API. Arrow labels: watch, periodic resync (list), notify, get.]
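The three informer duties above can be sketched as a toy in-memory model. This is not the client-go API: `ToyInformer` and its method names are hypothetical, the API server is replaced by plain function calls, and the resync here simply replays the cache to the handler.

```python
from typing import Callable, Dict

Handler = Callable[[str, dict], None]


class ToyInformer:
    """Hypothetical stand-in for an informer; not the client-go API."""

    def __init__(self, handler: Handler) -> None:
        self.cache: Dict[str, dict] = {}  # local cache keyed by object name
        self.handler = handler

    def list(self, objects: Dict[str, dict]) -> None:
        # Initial LIST: fill the cache and notify for every object.
        self.cache = dict(objects)
        for key, obj in self.cache.items():
            self.handler(key, obj)

    def on_watch_event(self, key: str, obj: dict) -> None:
        # WATCH event: update the cache, then notify the controller.
        self.cache[key] = obj
        self.handler(key, obj)

    def get(self, key: str) -> dict:
        # Reads are served from the cache, not from the API server.
        return self.cache[key]

    def resync(self) -> None:
        # Periodic resync: notify again for every cached object.
        for key, obj in self.cache.items():
            self.handler(key, obj)


seen = []
informer = ToyInformer(lambda key, obj: seen.append(key))
informer.list({"pod-a": {"phase": "Pending"}})
informer.on_watch_event("pod-b", {"phase": "Running"})
informer.resync()
```

After this run the handler has been called for `pod-a` and `pod-b` once each on arrival and once more on resync, while `informer.get` answers from memory.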

Production-ready controllers
● tolerate faults,
● scale with the load

Fault tolerance
● Run N replicas.
● Elect a leader.
● Only the leader does work.
● If the leader fails, another replica takes over. (active/passive)
e.g. kube-controller-manager, kube-scheduler, cert-manager, …
[Diagram: Pods replica1, replica2, replica3, each running the controller; Lease with leader: replica2]

Lease API in Kubernetes
Typically used to elect leaders in Kubernetes.

    apiVersion:
    kind: Lease
    spec:
      holder: my-replica-1
      leaseDurationSeconds: 10
      renewTime: "2022-11-30T18:04:27.912073Z"
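To make the election idea concrete, here is a minimal single-process sketch of acquiring and renewing a lease. The `Lease` dataclass and `try_acquire_or_renew` are hypothetical names that mirror the fields on the slide. A real election also needs an atomic compare-and-swap on the stored object, which this toy leaves out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    # Field names follow the slide, not a real client library.
    holder: Optional[str] = None
    lease_duration_seconds: float = 10.0
    renew_time: float = 0.0  # seconds on an arbitrary clock


def try_acquire_or_renew(lease: Lease, me: str, now: float) -> bool:
    """Return True if `me` holds the lease after this attempt."""
    expired = now > lease.renew_time + lease.lease_duration_seconds
    if lease.holder is None or lease.holder == me or expired:
        lease.holder = me
        lease.renew_time = now
        return True
    return False


lease = Lease()
first = try_acquire_or_renew(lease, "replica2", now=0.0)    # lease is free
blocked = try_acquire_or_renew(lease, "replica1", now=5.0)  # still held
taken = try_acquire_or_renew(lease, "replica1", now=11.0)   # holder expired
```

In this run `replica2` wins first, `replica1` is refused while the lease is fresh, and `replica1` takes over once the lease has gone unrenewed for longer than its duration.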

Scalability
Skeptical?
● The supported limit is 5K nodes, but some providers support 25K+ nodes.
● The Kubernetes Job controller in 1.26 now supports 100K pods.
● The kcp project aims to push the Kubernetes API server beyond its current limits (more storage, more watches, etc.).
● CRD sprawl, multitenancy, …

Scalability: Throughput
● If it takes t time to reconcile N objects, how long does it take to reconcile 1000*N objects?
● What if only the leader is allowed to do work?
● Where is it throttled? (CPU, etcd, network…)
[Diagram: Pods replica1, replica2, replica3, each running the controller; workqueue]
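A back-of-the-envelope answer to the first question, assuming reconcile cost grows linearly with object count and that work splits perfectly across workers. Both are idealizations, and the 60-second baseline is only an illustrative number.

```python
def reconcile_time(t: float, scale: int, workers: int = 1) -> float:
    """Time to reconcile scale*N objects when N objects take t on one worker.

    Assumes linear cost per object and perfect parallelism across workers.
    """
    return t * scale / workers


# Illustrative baseline: N objects take 60 seconds on a single worker.
single_leader = reconcile_time(t=60.0, scale=1000)
three_workers = reconcile_time(t=60.0, scale=1000, workers=3)
```

Under these assumptions a lone leader needs 1000*t, and only adding workers that share the load brings that down.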

Scalability: Memory
● Active replicas all maintain a LIST+WATCH that feeds a local cache.
● How much memory do you think it takes to store 100,000 pods?
● What happens during a periodic resync (full LIST)?
● How much memory are you willing to throw at your controller?
[Diagram: Pods replica1, replica2, replica3, each running the controller with its own local cache; each does LIST+WATCH against kube-apiserver]

What do we need for horizontal controller scalability?
● Use existing controller development libraries (e.g. client-go, controller-runtime)
● Membership and failure detection for controller replicas
● Preventing concurrent handling of an object

High-level Architecture
What if we exploited the fact that you can create watches with label selectors?
● sharder: list+watch ALL Pods on kube-apiserver, label them
● controller replicaA: list+watch Pods label=A
● controller replicaB: list+watch Pods label=B
● controller replicaC: list+watch Pods label=C
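The effect of per-replica label selectors can be sketched with plain dictionaries. `list_with_selector` is a hypothetical helper that applies equality-based label matching to an in-memory list. It stands in for a filtered list+watch and does not call any Kubernetes client.

```python
from typing import Dict, List


def matches(labels: Dict[str, str], selector: Dict[str, str]) -> bool:
    # Equality-based selector: every selector pair must be present.
    return all(labels.get(k) == v for k, v in selector.items())


def list_with_selector(pods: List[dict], selector: Dict[str, str]) -> List[dict]:
    # Stand-in for a filtered LIST: only matching objects are returned.
    return [p for p in pods if matches(p["labels"], selector)]


# Pods as the sharder would have labeled them.
pods = [
    {"name": "web-0", "labels": {"shard": "A"}},
    {"name": "web-1", "labels": {"shard": "B"}},
    {"name": "web-2", "labels": {"shard": "C"}},
    {"name": "web-3", "labels": {"shard": "A"}},
]

# Each controller replica sees only its own slice.
view = {
    shard: [p["name"] for p in list_with_selector(pods, {"shard": shard})]
    for shard in ("A", "B", "C")
}
```

Each replica caches and reconciles only the objects carrying its label, so the slices are disjoint and together cover every labeled Pod.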

High-level Architecture: open questions
● How to discover members?
● The sharder is a single point of failure, and still a bottleneck.
● How to reassign the work of dead replicas?

Object Partitioning
Consistent hash ring with virtual nodes representing controller replicas.
[Diagram: hash ring with points A, B’, A’’, A’, B, B’’, C’’, C’, C]
● hash(apiGroup_ns_name)
● find the spot on the ring
● assign the object to the controller replica by labeling the object on the Kubernetes API:

    metadata:
      labels:
        shard: controller-a73e7b
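A minimal sketch of the ring lookup described above, using only the standard library. The `HashRing` class, the choice of SHA-256, and three virtual nodes per replica are assumptions for illustration. The slide does not say which hash function or vnode count the sharder uses.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(key: str) -> int:
    # Assumption: any stable hash works; SHA-256 is used here for convenience.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with virtual nodes."""

    def __init__(self, replicas: Iterable[str], vnodes: int = 3) -> None:
        # Each replica appears `vnodes` times on the ring (A, A', A'', ...).
        self._points: List[Tuple[int, str]] = sorted(
            (_hash(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._points]

    def owner(self, api_group: str, namespace: str, name: str) -> str:
        # hash(apiGroup_ns_name), then walk clockwise to the next ring point.
        h = _hash(f"{api_group}_{namespace}_{name}")
        idx = bisect.bisect_right(self._keys, h) % len(self._points)
        return self._points[idx][1]


ring = HashRing(["A", "B", "C"])
owner = ring.owner("apps", "default", "web-0")
labels = {"shard": owner}  # what the sharder would write onto the object
```

The property that makes this attractive for rebalancing is that adding a replica only moves objects onto the new replica. Every other object keeps its current owner.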

Scaling the sharder
Active/passive: the active sharder stores only the metadata portion of objects.
[Diagram: sharder replica#1 (active) does list+watch on ALL Pods against kube-apiserver and labels them (shard=replicaA); its local cache holds only object `metadata`. sharder replica#2 (standby), sharder replica#3 (standby). Lease (sharder-leader) leader: sharder2]

Membership discovery
How do we learn which controller replicas are up or down?
● Each controller replica renews its own Lease (Lease replicaA holder: replicaA, Lease replicaB holder: replicaB, Lease replicaC holder: replicaC).
● The sharder watches the Leases.
● A replica is unhealthy if its Lease was not renewed in the past 2 x leaseDurationSeconds.
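The health rule above (not renewed within 2 x leaseDurationSeconds) is a simple time comparison. In this sketch `is_healthy` is a hypothetical helper, and the renew times are made-up values built around the timestamp from the Lease slide.

```python
from datetime import datetime, timedelta, timezone


def is_healthy(renew_time: datetime, lease_duration_seconds: int,
               now: datetime) -> bool:
    # Unhealthy if not renewed in the past 2 x leaseDurationSeconds.
    return now - renew_time <= timedelta(seconds=2 * lease_duration_seconds)


base = datetime(2022, 11, 30, 18, 4, 27, tzinfo=timezone.utc)
now = base + timedelta(seconds=25)

# Illustrative renew times per replica, each with a 10-second lease.
renewals = {
    "replicaA": base,                           # 25s ago: past the 20s limit
    "replicaB": base + timedelta(seconds=10),   # 15s ago: within the limit
    "replicaC": base + timedelta(seconds=24),   # 1s ago: within the limit
}

members = sorted(
    name for name, renewed in renewals.items() if is_healthy(renewed, 10, now)
)
```

Only the replicas in `members` would stay on the hash ring. `replicaA` would be dropped and its objects reassigned.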

Reassignment/rebalancing
Sharder keeps the hash ring up to date (replicas die, new ones are added). Objects must be reassigned to their destination. Need to ensure the old replica stops reconciling the object.
● Step 1: sharder adds label `drain: true` on the object
● Step 2: controller sees the `drain` label, removes `shard` label
● Step 3: sharder sees the object now has no `shard` label
● Step 4: sharder calculates the replica and sets the `shard` label.
[Diagram: hash ring with points A, B’, A’’, A’, B, B’’, C’’, C’, C, D]
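The four-step handoff can be traced on a single object's labels. `sharder_step` and `controller_step` are hypothetical functions, and the replica names are illustrative. The slide does not say who removes the `drain` label, so this sketch assumes the sharder clears it when it assigns the new shard.

```python
from typing import Dict, List

Labels = Dict[str, str]


def sharder_step(labels: Labels, desired: str) -> None:
    """One sharder pass over an object's labels (mutates in place)."""
    current = labels.get("shard")
    if current is None:
        # Steps 3-4: no shard label, so assign the computed replica.
        labels.pop("drain", None)  # assumption: the sharder clears drain here
        labels["shard"] = desired
    elif current != desired:
        # Step 1: ask the current owner to let go of the object.
        labels["drain"] = "true"


def controller_step(labels: Labels, me: str) -> None:
    # Step 2: the owning replica sees drain and removes its shard label.
    if labels.get("shard") == me and labels.get("drain") == "true":
        del labels["shard"]


labels: Labels = {"shard": "controller-a"}
history: List[Labels] = []

sharder_step(labels, desired="controller-d")   # step 1
history.append(dict(labels))
controller_step(labels, me="controller-a")     # step 2
history.append(dict(labels))
sharder_step(labels, desired="controller-d")   # steps 3 and 4
history.append(dict(labels))
```

Because the old owner is the one that removes the `shard` label, there is no moment at which both the old and the new replica consider the object theirs.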

Results?
With N=3 replicas, CPU usage is only 12% less (on the active sharder).

Results?
With N=3 replicas, memory usage is only 11% less (on the active sharder).
My theory: the controller-runtime shared informer cache is still carrying the entire object (not “just metadata”). Needs more debugging.

More ideas?
What if we weren’t limited to the existing controller/informer machinery? We could use various pub/sub models that assign reconciliations to controllers on the fly.
[Diagram: k8s API watched by dispatcher1, dispatcher2, dispatcherN; an LB; several controllers]
● Dispatchers: watch-only (not cached), each handles 1/N objects, dispatch updates to connected clients (consistent hashing).
● Controllers: establish long polling to watch object changes, do not cache locally.

Further reading for code for thesis Thanks. @ahmetb