
Scaling Kubernetes controllers (Seattle Kubernetes Meetup)


I presented Tim Ebert's thesis on controller scalability, with some of my own intro and thoughts, at the Seattle Kubernetes Meetup on Jan 24, 2023.

Ahmet Alp Balkan

January 24, 2023

Transcript

  1. Scaling Kubernetes
    controllers
    Ahmet Balkan (@ahmetb)
    2023-01-24


  2. Intro to Controllers in Kubernetes
    Examples:
    1. kube-controller-manager
    2. kube-scheduler ?
    3. kubelet ??
    4. kube-apiserver ???

    (Diagram: a controller 1. watches the Kubernetes API, and 2. does something to the external world.)

  3. Informer machinery
    1. LIST+WATCH to be aware of tracked Kubernetes objects.
    2. Cache the encountered objects in-memory (reduce live API requests)
    3. Notify the controller of new/updated objects
    (+existing objects periodically)
    (Diagram: inside the controller process, the informer (client-go packages) runs a watch and a periodic resync (list) against the k8s API, maintains a local cache, and notifies the handler, which gets objects from that cache.)
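
    To make the informer flow above concrete, here is a minimal client-go sketch; a hedged illustration rather than code from the talk (the Pod resource, 10-minute resync, and print-only handlers are arbitrary choices):

    package main

    import (
        "fmt"
        "time"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset := kubernetes.NewForConfigOrDie(config)

        // Shared informer: LIST+WATCH once, cache objects locally,
        // re-deliver cached objects on periodic resync.
        factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
        podInformer := factory.Core().V1().Pods().Informer()

        podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) {
                pod := obj.(*corev1.Pod)
                fmt.Println("observed pod:", pod.Namespace+"/"+pod.Name)
            },
            UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue for reconciliation */ },
            DeleteFunc: func(obj interface{}) { /* handle deletion */ },
        })

        stopCh := make(chan struct{})
        defer close(stopCh)
        factory.Start(stopCh) // starts LIST+WATCH in the background
        cache.WaitForCacheSync(stopCh, podInformer.HasSynced)
        <-stopCh
    }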


  4. Production-ready controllers
    ● tolerate faults,
    ● scale with the load


  5. Fault tolerance
    ● Run N replicas.
    ● Elect a leader.
    ● Only the leader does work.
    ● If leader fails, take over.
    (active/passive)
    e.g. kube-controller-manager,
    kube-scheduler, cert-manager, …
    (Diagram: controller replicas in Pods replica1-3; a Lease records leader: replica2.)


  6. Lease API in Kubernetes
    Typically used to elect leaders in Kubernetes.
    apiVersion: coordination.k8s.io/v1
    kind: Lease
    spec:
      holderIdentity: my-replica-1
      leaseDurationSeconds: 10
      renewTime: "2022-11-30T18:04:27.912073Z"
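
    A hedged sketch of how a replica can take part in this election with client-go's leaderelection package over such a Lease (the lease name, namespace, identity, and timings are made up for illustration):

    package main

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
        "k8s.io/client-go/tools/leaderelection"
        "k8s.io/client-go/tools/leaderelection/resourcelock"
    )

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(config)

        lock := &resourcelock.LeaseLock{
            LeaseMeta:  metav1.ObjectMeta{Name: "my-controller", Namespace: "default"},
            Client:     client.CoordinationV1(),
            LockConfig: resourcelock.ResourceLockConfig{Identity: "my-replica-1"},
        }

        leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
            Lock:          lock,
            LeaseDuration: 15 * time.Second,
            RenewDeadline: 10 * time.Second,
            RetryPeriod:   2 * time.Second,
            Callbacks: leaderelection.LeaderCallbacks{
                OnStartedLeading: func(ctx context.Context) {
                    // only the elected leader runs the reconcile loops
                },
                OnStoppedLeading: func() {
                    // lost or gave up the lease: stop doing work (typically exit)
                },
            },
        })
    }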


  7. Scalability
    Skeptical?
    ● Supported limit is 5K nodes, but providers support 25K+ nodes
    ● Kubernetes Job controller in 1.26 now supports 100K pods.
    ● kcp project aims to push limits of Kubernetes API server beyond
    current limits (more storage, more watches, etc.)
    ● CRD sprawl, multitenancy, …


  8. Scalability: Throughput
    If it takes t time to reconcile N objects, how long does it take to reconcile 1000*N objects?
    What if only the leader is allowed to do work?
    Where is it throttled? (CPU, etcd, network…)
    (Diagram: controller replicas in Pods replica1-3, with a workqueue.)
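
    One throughput knob that exists even within a single (leader) replica is worker concurrency: several goroutines draining the shared workqueue. A hedged sketch with client-go's workqueue package (the key format, worker count, and the reconcile stub are illustrative):

    package main

    import (
        "fmt"
        "sync"

        "k8s.io/client-go/util/workqueue"
    )

    // reconcile is a stand-in for real controller logic.
    func reconcile(key string) error {
        fmt.Println("reconciling", key)
        return nil
    }

    func main() {
        queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

        // Informer event handlers would enqueue namespace/name keys here.
        queue.Add("default/example-pod")

        const workers = 8 // more workers raise throughput, until CPU/API/etcd limits bite
        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for {
                    item, shutdown := queue.Get()
                    if shutdown {
                        return
                    }
                    if err := reconcile(item.(string)); err != nil {
                        queue.AddRateLimited(item) // retry with backoff
                    } else {
                        queue.Forget(item)
                    }
                    queue.Done(item)
                }
            }()
        }

        // A real controller runs until SIGTERM; here we drain and stop.
        queue.ShutDownWithDrain()
        wg.Wait()
    }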


  9. Scalability: Memory
    ● Active replicas all maintain a LIST+WATCH and a local cache.
    ● How much memory do you think it takes to store 100,000 pods?
    ● What about during a periodic resync (full LIST)?
    ● How much memory are you willing to throw at your controller?
    (Diagram: controller replicas in Pods replica1-3, each with its own local cache fed by LIST+WATCH against kube-apiserver.)



  11. What do we need for
    horizontal controller scalability?
    ● Use existing controller development libraries
    (e.g. client-go, controller-runtime)
    ● Membership and failure detection for controller replicas
    ● Preventing concurrent handling of an object


  12. High-level Architecture
    What if we exploited the fact that you can
    create watches with label selectors?
    (Diagram: a sharder does list+watch on ALL Pods and labels them; controller replicas A, B, and C each list+watch only the Pods with label=A, =B, or =C.)
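
    A hedged sketch of the replica side: a LIST+WATCH filtered server-side by a label selector, so each replica caches and handles only its shard (the shard=replicaA label key and value are assumptions for illustration):

    package main

    import (
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset := kubernetes.NewForConfigOrDie(config)

        // The API server filters server-side: this replica only receives
        // (and caches) the Pods labeled for its shard.
        factory := informers.NewFilteredSharedInformerFactory(
            clientset, 10*time.Minute, metav1.NamespaceAll,
            func(opts *metav1.ListOptions) {
                opts.LabelSelector = "shard=replicaA"
            })
        podInformer := factory.Core().V1().Pods().Informer()
        _ = podInformer // register event handlers and start the factory as usual
    }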


  13. High-level Architecture
    Same label-selector architecture as the previous slide, annotated with open questions:
    ● how to discover members?
    ● how to reassign the work of dead replicas?
    ● the sharder is a single point of failure and still a bottleneck.


  14. Object Partitioning
    Consistent hash ring with virtual nodes representing controller replicas.
    (Diagram: consistent hash ring with virtual nodes A, A', A'', B, B', B'', C, C', C''.)
    ● hash(apiGroup_ns_name)
    ● find the spot on the ring
    ● assign the object to the controller replica by labeling it on the Kubernetes API:
      metadata:
        labels:
          shard: controller-a73e7b
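
    A minimal, self-contained sketch of such a ring (the hash function, virtual-node count, and replica names are illustrative; the real implementation lives in the repo linked at the end):

    package main

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    type ring struct {
        points []uint32          // sorted hashes of virtual nodes
        owner  map[uint32]string // virtual node hash -> replica name
    }

    func hashKey(s string) uint32 {
        h := fnv.New32a()
        h.Write([]byte(s))
        return h.Sum32()
    }

    func newRing(replicas []string, virtualNodes int) *ring {
        r := &ring{owner: map[uint32]string{}}
        for _, rep := range replicas {
            for i := 0; i < virtualNodes; i++ {
                p := hashKey(fmt.Sprintf("%s-%d", rep, i))
                r.points = append(r.points, p)
                r.owner[p] = rep
            }
        }
        sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
        return r
    }

    // assign returns the replica that should reconcile the object.
    func (r *ring) assign(apiGroup, namespace, name string) string {
        h := hashKey(apiGroup + "_" + namespace + "_" + name)
        i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
        if i == len(r.points) { // wrap around the ring
            i = 0
        }
        return r.owner[r.points[i]]
    }

    func main() {
        ring := newRing([]string{"controller-a", "controller-b", "controller-c"}, 3)
        // The sharder would then label the object, e.g. shard: controller-a.
        fmt.Println(ring.assign("apps", "default", "my-deployment"))
    }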


  15. Scaling the sharder
    Active/passive: the active sharder stores only the `metadata` portion of objects.
    (Diagram: three sharder replicas; one is active, elected via a sharder-leader Lease, the others are standby. The active sharder does list+watch on ALL Pods and labels them, e.g. shard=replicaA, keeping only object `metadata` in its local cache.)
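
    Watching only object `metadata` is possible with client-go's metadata client and informer; here is a hedged sketch of what the sharder's metadata-only cache could look like (the Pod GVR and resync interval are illustrative):

    package main

    import (
        "fmt"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/metadata"
        "k8s.io/client-go/metadata/metadatainformer"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        metaClient, err := metadata.NewForConfig(config)
        if err != nil {
            panic(err)
        }

        factory := metadatainformer.NewSharedInformerFactory(metaClient, 10*time.Minute)
        podGVR := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
        informer := factory.ForResource(podGVR).Informer()

        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) {
                // Only metadata (names, labels, …) reaches this process; spec/status do not.
                m := obj.(*metav1.PartialObjectMetadata)
                fmt.Println("saw", m.Namespace+"/"+m.Name, "labels:", m.Labels)
            },
        })

        stopCh := make(chan struct{})
        defer close(stopCh)
        factory.Start(stopCh)
        cache.WaitForCacheSync(stopCh, informer.HasSynced)
        <-stopCh
    }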


  16. Membership discovery
    How do we learn about which controller replicas are up or down?
    (Diagram: controller replicas A, B, and C each renew their own Lease (holder: replicaA/B/C); the sharder watches these Leases and considers a replica unhealthy if its Lease has not been renewed in the past 2 x leaseDurationSeconds.)
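
    The health check itself is a small amount of logic; a hedged sketch of the rule on this slide, using the coordination.k8s.io/v1 types:

    package sharder

    import (
        "time"

        coordinationv1 "k8s.io/api/coordination/v1"
    )

    // isHealthy reports whether a replica's Lease has been renewed within
    // 2 x leaseDurationSeconds, per the rule above.
    func isHealthy(lease *coordinationv1.Lease, now time.Time) bool {
        if lease.Spec.RenewTime == nil || lease.Spec.LeaseDurationSeconds == nil {
            return false
        }
        maxAge := 2 * time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second
        return now.Sub(lease.Spec.RenewTime.Time) <= maxAge
    }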



  18. Reassignment/rebalancing
    The sharder keeps the hash ring up to date (replicas die, new ones are added).
    Objects must be reassigned to their new destination.
    Need to ensure the old replica stops reconciling the object.
    ● Step 1: sharder adds label `drain: true` on the object
    ● Step 2: controller sees the `drain` label, removes the `shard` label (sketched below)
    ● Step 3: sharder sees the object now has no `shard` label
    ● Step 4: sharder calculates the new replica and sets the `shard` label.
    (Diagram: the hash ring with virtual nodes, now including a newly added replica D.)
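
    A hedged sketch of the controller-side half of this handover (Step 2), written against controller-runtime; the label keys and the patch-based label removal are assumptions about how the protocol could be implemented:

    package shardedcontroller

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    type reconciler struct {
        client.Client
    }

    func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        var pod corev1.Pod
        if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
            return ctrl.Result{}, client.IgnoreNotFound(err)
        }

        // Step 2: acknowledge the handover by removing the `shard` label and
        // stop working on this object; the sharder then reassigns it.
        if _, draining := pod.Labels["drain"]; draining {
            patch := client.MergeFrom(pod.DeepCopy())
            delete(pod.Labels, "shard")
            return ctrl.Result{}, r.Patch(ctx, &pod, patch)
        }

        // … normal reconciliation for objects assigned to this replica …
        return ctrl.Result{}, nil
    }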



  20. Results?
    With N=3 replicas, CPU usage is only 12% lower (on the active sharder)


  21. Results?
    With N=3 replicas, memory usage is only 11% lower (on the active sharder)
    My theory: controller-runtime shared informer cache is still carrying the
    entire object (not “just metadata”). Needs more debugging.


  22. More ideas?
    What if we weren’t limited to the existing controller/informer machinery?
    We could use various pub/sub models that assign reconciliations to controllers on the fly.
    (Diagram: dispatchers 1..N watch the k8s API (watch-only, not cached), each handling 1/N of the objects and dispatching updates to connected controllers via consistent hashing; controllers connect through a LB and establish long polling to watch object changes, without caching locally.)


  23. Further reading
    https://github.com/timebertt/kubernetes-controller-sharding for code
    https://github.com/timebertt/thesis-controller-sharding for thesis
    Thanks.
    @ahmetb
