Background

Alibaba Cloud
- Many small to medium-sized Kubernetes clusters
- Container Service for Kubernetes
- Serverless Kubernetes
- Virtual Kubelet
- Secure containers + multi-tenancy

Alibaba Group (today's focus)
- A few large Kubernetes clusters
- Sigma = Kubernetes + extensions + enhancements
Workload
- Production long-running services
- AI jobs
- CI/CD jobs
- YARN
- Co-located with Fuxi, the batch job manager, on the same hosts
Scale
- ~10 clusters
- ~10,000 nodes per cluster
- ~1,000,000 cores per cluster
- ~100,000 containers per cluster
- ~10,000 applications per cluster
- ~100 pods/second at peak per cluster
(Order-of-magnitude figures, not exact numbers)
Kubernetes Scalability
- Official scalability numbers: ~5,000 nodes

Scaling Sigma to support Alibaba's scale
- Do not modify Kubernetes when possible (our cloud offering is upstream Kubernetes)
- Upstream the effort as much as we can
Storage Scalability

Events
- Tried sending events directly to Elasticsearch
  - Would need to modify API Server code
  - Would need to support the watch operation
- Instead: dedicated etcd cluster for events (see the manifest sketch below)
  - Forward events to Elasticsearch from there
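Splitting events out does not require API Server changes: the API server can point its Event storage at a separate etcd cluster. A minimal sketch, assuming a static-pod kube-apiserver; the etcd endpoints are illustrative:

  apiVersion: v1
  kind: Pod
  metadata:
    name: kube-apiserver
    namespace: kube-system
  spec:
    containers:
    - name: kube-apiserver
      command:
      - kube-apiserver
      # Main cluster state stays in the primary etcd cluster.
      - --etcd-servers=https://etcd-main-0:2379,https://etcd-main-1:2379
      # Event objects go to a dedicated etcd cluster instead.
      - --etcd-servers-overrides=/events#https://etcd-events-0:2379
      # Bound the event volume kept in that cluster.
      - --event-ttl=1h
      # ... remaining flags omitted

A separate forwarder can then watch events through the API and ship them to Elasticsearch.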
Storage Scalability

Cluster metadata optimizations
- Eliminate unnecessary node updates
- Pagination through the API Server
- Limit CRD usage
- etcd optimizations (need to be upstreamed)
  - Btree freelist management improvements -> supports 10x data size
  - Concurrent read support -> reduces write latency 10x
Kubernetes Master Scalability

API Server
- Horizontally scalable by design
  - Not super CPU-efficient, though
- Objects are heavily cached
  - Uses a lot of memory
- Lacks custom index support
  - The code to support indexes is half-finished
  - Not hard to add custom indexes by modifying the API Server codebase
Kubernetes Master Scalability

Controller
- The default controllers are monolithic and not scalable
- But pluggable, and extensible through CRDs
- Built Alibaba controllers/operators

Scheduler
- Default scheduler performance is not great
- But pluggable (see the example below)
- Built the Alibaba scheduler
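Because the scheduler is pluggable, a custom scheduler can run alongside the default one and be selected per workload. A minimal sketch; the scheduler name sigma-scheduler and the image are illustrative, not the actual Sigma component names:

  apiVersion: v1
  kind: Pod
  metadata:
    name: demo
  spec:
    # Pods opt into the custom scheduler by name; pods that omit this field
    # keep being placed by the default kube-scheduler.
    schedulerName: sigma-scheduler
    containers:
    - name: app
      image: registry.example.com/app:1.0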
Scheduling improvements
- Node CPU topology-aware scheduling
  - We use CPUSet + CPUShare (see the sketch below)
- Storage topology-aware scheduling
- Dry run
  - User-facing capacity planning / debugging
- Pod Group
  - Pod co-location / gang scheduling
- Pre-scheduling via a global optimizer
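Upstream Kubernetes expresses a similar CPUSet/CPUShare split through the kubelet CPU Manager: with the static policy, a Guaranteed pod with integer CPU requests is pinned to dedicated CPUs, while other pods share the remaining pool via cpu.shares. A minimal sketch of that upstream analogue, not Sigma's own implementation; names and sizes are illustrative:

  # Guaranteed QoS + integer CPU request: pinned to a dedicated CPUSet
  # when the kubelet runs with --cpu-manager-policy=static.
  apiVersion: v1
  kind: Pod
  metadata:
    name: pinned-service
  spec:
    containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi
  ---
  # Burstable pod: runs on the shared CPU pool, weighted by cpu.shares
  # (CPUShare) rather than pinned.
  apiVersion: v1
  kind: Pod
  metadata:
    name: shared-batch
  spec:
    containers:
    - name: worker
      image: registry.example.com/worker:1.0
      resources:
        requests:
          cpu: "2"
        limits:
          cpu: "8"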
Deterministic

Example use case: Alibaba's Singles' Day sale
- End-to-end optimization to support at least 10x traffic with minimal additional resources
- End-to-end stress testing to ensure reliability and reduce risk

Requirements
- Application upgrades should not change container placements
- Minor resource updates should not change container placements
In-place Update of Pod Resources

1. v1:
   a. User:
      i. Sets the "in-place" annotation to "created" and updates the Pod requests/limits (see the sketch after this outline)
   b. API Server:
      i. Admission checks the annotation
   c. Scheduler:
      i. Checks whether the in-place update is possible
         1. Otherwise sets the annotation to "in-place update fail" and the user falls back
      ii. Sets the "in-place" annotation to "accepted"
   d. Kubelet:
      i. Updates container resources (via CRI), the cgroups manager, and the CPU manager, then clears the annotation
2. v2:
   a. A joint-effort KEP (kep #686) with the community on upstream
      i. Reviews are welcome!
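From the user's side, the v1 flow amounts to annotating the Pod and bumping its resources in place. A sketch of that step; the annotation key is hypothetical (the slides only name the values "created", "accepted", and "in-place update fail"), and the sizes are illustrative:

  apiVersion: v1
  kind: Pod
  metadata:
    name: web-1
    annotations:
      # Hypothetical key, modeled on Sigma's annotation naming style.
      # The user sets "created" together with the new requests/limits;
      # the scheduler flips it to "accepted" (or "in-place update fail"),
      # and the kubelet clears it after resizing the running container.
      pod.beta1.sigma.ali/inplace-update-state: "created"
  spec:
    containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests:
          cpu: "8"       # was "4" before the update
          memory: 16Gi   # was 8Gi
        limits:
          cpu: "8"
          memory: 16Gi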
IP Persistence
● A CRD named ReservedIP
● The CNI service tries to consume the CR when a Pod is re-created with the annotation (example below):
  ○ pod.beta1.sigma.ali/ip-reuse=true

apiVersion: "apiextensions.k8s.io/v1beta1"
kind: "CustomResourceDefinition"
metadata:
  name: "reservedip.extensions.sigma"
spec:
  ...
        spec:
          required: ["networkStatus", "hostIP"]
          properties:
            networkStatus:
              type: "string"
            hostIP:
              type: "string"
            ttlInSeconds:
              type: "integer"
              minimum: 30
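Put together, a Pod opting into IP reuse and the ReservedIP object the CNI service consumes might look like the sketch below. Only the annotation key and the schema fields come from the slide; the API version, names, and values are illustrative:

  apiVersion: v1
  kind: Pod
  metadata:
    name: web-1
    annotations:
      # Ask the CNI service to reuse the IP reserved for this Pod.
      pod.beta1.sigma.ali/ip-reuse: "true"
  spec:
    containers:
    - name: app
      image: registry.example.com/app:1.0
  ---
  # Illustrative ReservedIP object matching the CRD schema above.
  apiVersion: extensions.sigma/v1beta1   # group taken from the CRD name; version assumed
  kind: ReservedIP
  metadata:
    name: web-1
  spec:
    networkStatus: "10.0.3.27/24"   # the schema only requires a string; value illustrative
    hostIP: "192.168.3.15"          # illustrative
    ttlInSeconds: 600               # must be >= 30 per the schema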
Operation at scale - upgrades

1. Dev-rollout pipeline
   a. Dev-release
      i. main branch -> feature branch -> test sets -> Code Review -> main branch -> CI pipeline -> PASS
      ii. main branch -> release -> CI pipeline -> PASS
   b. Release-build
      i. release -> generate build -> test build (*.rpm) -> prod build (*.rpm)
   c. Test-build
      i. test build -> test cluster -> monitoring & dashboard
2. Rollout
   a. prod build -> service template -> rollout plan
3. Run rollout plan (illustrative config below)
   a. E.g. Cluster X with 3000+ nodes in total, batch interval 6h or 12h:
      1. Batch 1: 2, 5, 10 nodes
      2. Batch 2: 20, 50, 100 nodes
      3. Batch 3: 200, 200, 200 nodes
      4. Batch 4: 400, 400, 400 nodes
      5. Batch 5: 500, 500, 500 nodes
4. Rollback (just in case)
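Purely as an illustration (the slides do not show Sigma's actual rollout tooling or file format), the batch plan above could be captured declaratively along these lines; every field name here is invented:

  # Hypothetical rollout-plan config for Cluster X (3000+ nodes).
  cluster: cluster-x
  build: sigma-node-2.4.1.rpm        # illustrative prod build
  batchIntervalHours: [6, 12]        # wait 6h or 12h between batches
  batches:
  - [2, 5, 10]
  - [20, 50, 100]
  - [200, 200, 200]
  - [400, 400, 400]
  - [500, 500, 500]
  onFailure: rollback                # roll back, just in case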
Operation at scale - inspection

Inspector (a map-reduce style program)
- Constraint violations
  - No unexpected resource over-commit
  - No affinity/anti-affinity violations
  - ...
- State cross-checking
  - API Server pod state == pod running state queried through the Kubelet
  - Controller replica count == number of running pods queried through the Kubelet
- Warnings
  - Too many soft affinity/anti-affinity violations
  - ...