Informer machinery
1. LIST+WATCH to be aware of tracked Kubernetes objects.
2. Cache the encountered objects in memory (reduces live API requests).
3. Notify the controller of new/updated objects (plus existing objects periodically).
[Diagram: controller process with a handler; informer (client-go packages) with a local cache; the informer watches the k8s API and does a periodic resync (list), notifies the handler, and serves gets from the local cache.]
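A minimal sketch of this machinery with client-go's shared informers (the in-cluster config, the 10-minute resync period, and the empty handler bodies are illustrative assumptions):

package main

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func main() {
    cfg, err := rest.InClusterConfig() // assumes the controller runs inside the cluster
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // One shared LIST+WATCH per resource type; cached objects are re-delivered
    // to handlers every 10 minutes (the "existing objects periodically" part).
    factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
    podInformer := factory.Core().V1().Pods().Informer()

    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(obj interface{}) { /* enqueue key for reconciliation */ },
        UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue key */ },
        DeleteFunc: func(obj interface{}) { /* enqueue key */ },
    })

    stop := make(chan struct{})
    factory.Start(stop)            // starts the LIST+WATCH goroutines
    factory.WaitForCacheSync(stop) // blocks until the local cache is primed
    <-stop                         // workers would drain a workqueue here
}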
Fault tolerance
● Run N replicas.
● Elect a leader.
● Only the leader does work.
● If the leader fails, another replica takes over. (active/passive)
e.g. kube-controller-manager, kube-scheduler, cert-manager, …
[Diagram: three Pods (replica1, replica2, replica3), each running the controller; a Lease object records leader: replica2.]
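A hedged sketch of this pattern using client-go's leaderelection package with a Lease lock (the lease name, namespace, and timings are illustrative):

package main

import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)
    id, _ := os.Hostname() // each replica needs a unique identity

    lock := &resourcelock.LeaseLock{
        LeaseMeta:  metav1.ObjectMeta{Name: "my-controller", Namespace: "kube-system"},
        Client:     client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{Identity: id},
    }

    leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
        Lock:            lock,
        LeaseDuration:   15 * time.Second, // how long a lease is valid without renewal
        RenewDeadline:   10 * time.Second, // leader gives up if it cannot renew in time
        RetryPeriod:     2 * time.Second,
        ReleaseOnCancel: true,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) { /* start informers and workers */ },
            OnStoppedLeading: func() { os.Exit(0) }, // lost the lease: stop doing work immediately
        },
    })
}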
Lease API in Kubernetes
Typically used to elect leaders in Kubernetes.
apiVersion: coordination.k8s.io/v1
kind: Lease
spec:
  holderIdentity: my-replica-1
  leaseDurationSeconds: 10
  renewTime: "2022-11-30T18:04:27.912073Z"
Scalability: Throughput
● If it takes t time to reconcile N objects, how long does it take to reconcile 1000*N objects?
● What if only the leader is allowed to do work?
● Where is it throttled? (CPU, etcd, network, …)
[Diagram: three Pods (replica1, replica2, replica3), each running the controller, with a workqueue.]
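For context, a single replica's workqueue-plus-workers loop looks roughly like the sketch below (the reconcile stub and worker count are made up); no matter how many workers you add, they all live inside the one active process:

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

// reconcile is a hypothetical per-object reconciliation function.
func reconcile(key string) error {
    fmt.Println("reconciling", key)
    return nil
}

func main() {
    // Every key funnels through one queue in the single active (leader) replica,
    // so total throughput is bounded by this one process and the API server.
    queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

    for i := 0; i < 8; i++ { // more workers help, but only within this one replica
        go func() {
            for {
                key, shutdown := queue.Get()
                if shutdown {
                    return
                }
                if err := reconcile(key.(string)); err != nil {
                    queue.AddRateLimited(key) // retry with backoff
                } else {
                    queue.Forget(key)
                }
                queue.Done(key)
            }
        }()
    }

    queue.Add("default/example-pod") // keys normally come from informer events
    time.Sleep(time.Second)          // keep the sketch alive long enough to process
}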
Scalability: Memory
● All active replicas maintain a LIST+WATCH and a local cache.
● How much memory do you think it takes to store 100,000 pods?
● What about during a periodic resync (full LIST)?
● How much memory are you willing to throw at your controller?
[Diagram: three Pods (replica1, replica2, replica3), each with a controller and a local cache, all doing LIST+WATCH against kube-apiserver.]
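A rough, hedged estimate: if a typical Pod object serializes to on the order of 10 KB, then 100,000 Pods * ~10 KB is roughly 1 GB of cached state per replica, before Go heap overhead and the temporary spike of decoding a full LIST response on top of the existing cache.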
What do we need for horizontal controller scalability?
● Use existing controller development libraries (e.g. client-go, controller-runtime)
● Membership and failure detection for controller replicas
● Preventing concurrent handling of an object
High-level Architecture
What if we exploited the fact that you can create watches with label selectors?
[Diagram: a sharder does list+watch on ALL Pods and labels them; controller replicaA, replicaB, and replicaC each list+watch only the Pods with label=A, label=B, or label=C respectively.]
High-level Architecture
What if we exploited the fact that you can create watches with label selectors?
[Diagram: same as above, annotated with the open questions: how do we discover members? how do we reassign the work of dead replicas? and the sharder is a single point of failure and still a bottleneck.]
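A hedged sketch of the replica side: each controller only LIST+WATCHes the objects carrying its own shard label (the label key/value shard=replica-a are illustrative):

package main

import (
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // This replica only sees Pods labeled shard=replica-a, so its cache and
    // watch traffic shrink to roughly 1/N of the cluster.
    factory := informers.NewFilteredSharedInformerFactory(
        client, 10*time.Minute, metav1.NamespaceAll,
        func(opts *metav1.ListOptions) {
            opts.LabelSelector = "shard=replica-a"
        })
    podInformer := factory.Core().V1().Pods().Informer()
    _ = podInformer // add event handlers and start the factory as with any informer
}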
Object Partitioning
Consistent hash ring with virtual nodes representing controller replicas.
● hash(apiGroup_ns_name)
● find the spot on the ring
● assign the object to a controller replica by labeling it on the Kubernetes API:
  metadata:
    labels:
      shard: controller-a73e7b
[Diagram: hash ring with virtual nodes A, A', A'', B, B', B'', C, C', C''.]
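A self-contained sketch of a consistent hash ring with virtual nodes (the FNV hash, the virtual-node count, and the key format are assumptions for illustration, not the sharder's exact implementation):

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

type ring struct {
    hashes []uint32          // sorted virtual-node hashes
    owner  map[uint32]string // virtual-node hash -> replica name
}

func hashKey(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}

// newRing places `virtual` tokens per replica on the ring.
func newRing(replicas []string, virtual int) *ring {
    r := &ring{owner: map[uint32]string{}}
    for _, rep := range replicas {
        for v := 0; v < virtual; v++ {
            h := hashKey(fmt.Sprintf("%s-%d", rep, v))
            r.hashes = append(r.hashes, h)
            r.owner[h] = rep
        }
    }
    sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
    return r
}

// assign finds the first virtual node clockwise from the object's hash.
func (r *ring) assign(objectKey string) string {
    h := hashKey(objectKey)
    i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
    if i == len(r.hashes) {
        i = 0 // wrap around the ring
    }
    return r.owner[r.hashes[i]]
}

func main() {
    r := newRing([]string{"controller-a73e7b", "controller-b1f209", "controller-c55d10"}, 3)
    // key format mirrors the slide: apiGroup_namespace_name
    fmt.Println(r.assign("core_default_my-pod")) // prints the owning replica
}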
Membership discovery
How do we learn which controller replicas are up or down?
● Each controller replica keeps renewing its own Lease (Lease replicaA with holder replicaA, and so on).
● The sharder watches these Leases.
● A replica is considered unhealthy if its Lease has not been renewed in the past 2 x leaseDurationSeconds.
[Diagram: the sharder watches Lease replicaA/replicaB/replicaC; each controller replica renews its own Lease.]
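A small hedged helper for the sharder's health check (the 2x factor mirrors the slide; the coordination/v1 Lease fields are real, the rest is illustrative):

package main

import (
    "fmt"
    "time"

    coordinationv1 "k8s.io/api/coordination/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isHealthy reports whether a replica's Lease was renewed recently enough.
// A replica is treated as unhealthy if its Lease has not been renewed within
// the past 2 x leaseDurationSeconds.
func isHealthy(lease *coordinationv1.Lease, now time.Time) bool {
    if lease.Spec.RenewTime == nil || lease.Spec.LeaseDurationSeconds == nil {
        return false
    }
    ttl := 2 * time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second
    return now.Sub(lease.Spec.RenewTime.Time) <= ttl
}

func main() {
    dur := int32(10)
    lease := &coordinationv1.Lease{Spec: coordinationv1.LeaseSpec{
        LeaseDurationSeconds: &dur,
        RenewTime:            &metav1.MicroTime{Time: time.Now().Add(-25 * time.Second)},
    }}
    fmt.Println(isHealthy(lease, time.Now())) // false: last renewal was 25s ago, limit is 20s
}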
Reassignment/rebalancing
The sharder keeps the hash ring up to date (replicas die, new ones are added), and objects must be reassigned to their new destination. We also need to ensure the old replica stops reconciling an object before it moves.
● Step 1: the sharder adds the label `drain: true` to the object.
● Step 2: the controller sees the `drain` label and removes the `shard` label.
● Step 3: the sharder sees the object now has no `shard` label.
● Step 4: the sharder calculates the new replica and sets the `shard` label.
[Diagram: hash ring with virtual nodes, now including a newly added replica D.]
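A hedged sketch of the old replica's side of this handshake, using a controller-runtime client (the label keys and the patch approach are illustrative, not the project's exact code):

package shardedcontroller

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// handleDrain implements the replica side of the handshake: if the sharder set
// drain=true on the object, remove the shard label (acknowledging the drain)
// and skip reconciliation so the sharder can reassign the object.
func handleDrain(ctx context.Context, c client.Client, pod *corev1.Pod) (drained bool, err error) {
    if pod.Labels["drain"] != "true" {
        return false, nil // not being drained: reconcile as usual
    }
    patch := client.MergeFrom(pod.DeepCopy())
    delete(pod.Labels, "shard")
    if err := c.Patch(ctx, pod, patch); err != nil {
        return false, err
    }
    return true, nil // caller should return early without reconciling this object
}

A reconciler would call handleDrain at the top of its reconcile function and return early whenever drained is true.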
Results?
With N=3 replicas, memory usage is only 11% less (on the active sharder).
My theory: the controller-runtime shared informer cache is still carrying the entire object (not "just metadata"). Needs more debugging.
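One hedged way to probe that theory: watch the sharded resource metadata-only, so the cache holds PartialObjectMetadata instead of full objects. A sketch assuming controller-runtime's builder.OnlyMetadata option fits the use case:

package main

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/builder"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
    if err != nil {
        panic(err)
    }

    // OnlyMetadata makes the informer cache store PartialObjectMetadata for Pods
    // instead of full Pod objects, which should shrink per-replica memory.
    err = ctrl.NewControllerManagedBy(mgr).
        For(&corev1.Pod{}, builder.OnlyMetadata).
        Complete(reconcile.Func(func(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
            // fetch the object as &metav1.PartialObjectMetadata{} here if the
            // handler only needs labels/annotations
            return reconcile.Result{}, nil
        }))
    if err != nil {
        panic(err)
    }

    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        panic(err)
    }
}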
More ideas?
What if we weren't limited to the existing controller/informer machinery? We could use various pub/sub models that assign reconciliations to controllers on the fly.
[Diagram: dispatcher1…dispatcherN sit behind a load balancer and watch the k8s API (watch-only, not cached); each dispatcher handles 1/N of the objects and dispatches updates to connected clients using consistent hashing. Controllers establish long polling connections to watch object changes and do not cache locally.]
Further reading
● https://github.com/timebertt/kubernetes-controller-sharding (code)
● https://github.com/timebertt/thesis-controller-sharding (thesis)
Thanks. @ahmetb