Kubernetes at Alibaba at Web Scale

Slide 1

Slide 1 text

Kubernetes at Alibaba
Cluster Mgmt @ Web Scale Meetup
Xiang Li, Alibaba Cloud
Lei Zhang, Alibaba Cloud

Slide 2

Slide 2 text

Background

Alibaba Cloud
- Many small- to medium-sized Kubernetes clusters
- Container Service for Kubernetes
- Serverless Kubernetes
- Virtual Kubelet
- Secure container + Multi-tenancy

Alibaba Group (Today's focus)
- A few large Kubernetes clusters
- Sigma = Kubernetes + extensions + enhancements

Slide 3

Slide 3 text

Workload
- Production long-running services
- AI jobs
- CI/CD jobs
- YARN
- Co-located with Fuxi, the batch job manager, on the same host

[Diagram: Sigma (Kubernetes) and Fuxi]

Slide 4

Slide 4 text

Scale
- ~10 clusters
- ~10,000 nodes per cluster
- ~1,000,000 cores per cluster
- ~100,000 containers per cluster
- ~10,000 applications per cluster
- ~100 pods/second at busy times per cluster

Note: these are approximate figures, not exact numbers.

Slide 5

Slide 5 text

Kubernetes Scalability

Official numbers: ~5000 nodes

Scaling Sigma to support Alibaba scale
- Do not modify Kubernetes when possible (our cloud offering is upstream k8s)
- Upstream the effort as much as we can

Slide 6

Slide 6 text

Potential issues skipped
- Networking
- Load balancing / service discovery
  - We use neither kube-proxy nor CoreDNS today

Slide 7

Slide 7 text

Storage Scalability

Events
- Tried to send directly to Elasticsearch
  - Need to modify API Server code
  - Need to support watch operation
- Dedicated etcd clusters
  - Forward to Elasticsearch

Slide 8

Slide 8 text

Storage Scalability

Cluster metadata
- 30k nodes, 1 million pods, 5 million resources
- 5 KB per resource, 50 GB in total
- 5,000 resource updates/second (~1000 pod creations/second)
- 50,000 resource reads/second
- MTTR within 3 minutes
- P99 < 50 ms

Slide 9

Slide 9 text

Storage Scalability

Cluster metadata optimizations
- Eliminate unnecessary node updates
- Pagination through API Server
- Limit CRD usage
- etcd optimizations need to be upstreamed
  - Btree freelist management improvements -> support 10x data size
  - Concurrent read support -> reduce write latency 10x
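As an illustration of the "Pagination through API Server" item, here is a minimal sketch of a limit/continue-token list loop. It runs against a hypothetical in-memory store; `STORE`, `list_page`, and `list_all` are names made up for this example and are not Sigma or Kubernetes code. The point is only that a client pulls bounded pages instead of one huge list response.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for the API server's key space (assumption).
STORE: Dict[str, dict] = {f"pod-{i:04d}": {"name": f"pod-{i:04d}"} for i in range(2500)}


def list_page(limit: int, continue_token: Optional[str] = None) -> Tuple[List[dict], Optional[str]]:
    """Return up to `limit` items after `continue_token`, plus the next token."""
    keys = sorted(STORE)
    start = 0
    if continue_token is not None:
        # Resume strictly after the last key returned by the previous page.
        start = next((i for i, k in enumerate(keys) if k > continue_token), len(keys))
    page = keys[start:start + limit]
    more = start + limit < len(keys)
    # The token is the last key served; None signals that the listing is done.
    next_token = page[-1] if (page and more) else None
    return [STORE[k] for k in page], next_token


def list_all(limit: int = 500) -> List[dict]:
    """Client loop: fetch pages until the server stops returning a token."""
    items: List[dict] = []
    token: Optional[str] = None
    while True:
        page, token = list_page(limit, token)
        items.extend(page)
        if token is None:
            return items
```

With a page size of 500, `list_all()` issues five bounded requests for the 2500 sample objects rather than one large one.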

Slide 10

Slide 10 text

Storage Scalability

Potential cluster metadata optimizations to scale further
- Compression
- Snapshot streaming
- Read-only replica
- Object patching instead of full update
- Delta encoding
- API Server / Controller checkpointing
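To make "object patching instead of full update" concrete, the sketch below computes a patch in the style of JSON Merge Patch (RFC 7386) between two versions of an object, so only the changed fields need to travel. The function name `merge_patch` and the sample pod are assumptions for illustration; the slides do not say which patch format Sigma would use.

```python
from typing import Any, Dict


def merge_patch(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Return the fields that must change to turn `old` into `new`."""
    patch: Dict[str, Any] = {}
    for key in old:
        if key not in new:
            # None marks a removed field, as null does in JSON Merge Patch.
            patch[key] = None
    for key, value in new.items():
        if key not in old:
            patch[key] = value
        elif isinstance(value, dict) and isinstance(old[key], dict):
            # Recurse so unchanged sibling fields are left out of the patch.
            sub = merge_patch(old[key], value)
            if sub:
                patch[key] = sub
        elif old[key] != value:
            patch[key] = value
    return patch


old_pod = {"metadata": {"name": "web-0", "labels": {"app": "web"}},
           "status": {"phase": "Pending"}}
new_pod = {"metadata": {"name": "web-0", "labels": {"app": "web"}},
           "status": {"phase": "Running"}}
status_patch = merge_patch(old_pod, new_pod)
```

Here `status_patch` carries only the changed `status.phase` field, which is the saving this optimization is after when objects are around 5 KB each.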

Slide 11

Slide 11 text

Kubernetes Master Scalability

API Server
- Horizontally scalable by design
  - Not super CPU efficient though
- Objects are heavily cached
  - Uses a lot of memory
- Lack of custom index support
  - The code to support indexes is half-baked
  - Not hard to add a custom index by modifying the API Server codebase
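The idea behind a custom index over cached objects can be sketched as follows. This is a generic secondary-index structure, not the API Server's actual cache code; `IndexedCache` and the `node` index are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


class IndexedCache:
    """Object cache with secondary indexes kept up to date on every write."""

    def __init__(self, index_funcs: Dict[str, Callable[[dict], List[str]]]):
        self._objects: Dict[str, dict] = {}
        self._index_funcs = index_funcs
        self._indices: Dict[str, Dict[str, Set[str]]] = {
            name: defaultdict(set) for name in index_funcs
        }

    def upsert(self, key: str, obj: dict) -> None:
        self.delete(key)  # drop stale index entries for the previous version
        self._objects[key] = obj
        for name, fn in self._index_funcs.items():
            for value in fn(obj):
                self._indices[name][value].add(key)

    def delete(self, key: str) -> None:
        old = self._objects.pop(key, None)
        if old is None:
            return
        for name, fn in self._index_funcs.items():
            for value in fn(old):
                self._indices[name][value].discard(key)

    def by_index(self, name: str, value: str) -> List[dict]:
        """Look up objects by indexed value without scanning the whole cache."""
        keys = self._indices[name].get(value, ())
        return [self._objects[k] for k in sorted(keys)]


cache = IndexedCache({"node": lambda pod: [pod["nodeName"]]})
cache.upsert("default/web-0", {"name": "web-0", "nodeName": "node-a"})
cache.upsert("default/web-1", {"name": "web-1", "nodeName": "node-b"})
cache.upsert("default/web-1", {"name": "web-1", "nodeName": "node-a"})  # pod moved
```

A query such as "all pods on node-a" then touches only the matching keys instead of every cached object.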

Slide 12

Slide 12 text

Kubernetes Master Scalability

Controller
- Default controller is monolithic, not scalable
- But pluggable, and extensible through CRD
- Built Alibaba controller/operators

Scheduler
- Default scheduler performance is not great
- But pluggable
- Built Alibaba scheduler

Slide 13

Slide 13 text

Scheduling improvements
- Node CPU topology-aware scheduling
  - We use CPUSet + CPUShare
- Storage topology-aware scheduling
- Dry run
  - User-facing capacity planning / debugging
- Pod Group
  - Pod colocation / gang scheduling
- Pre-scheduling via global optimizer
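Gang scheduling for a Pod Group means placing every pod in the group or none of them. The sketch below shows that all-or-nothing property with a simple first-fit-decreasing heuristic over free CPU; it is not the Alibaba scheduler, `schedule_group` is a hypothetical name, and the heuristic can reject a group that a smarter packer could place.

```python
from typing import Dict, List, Optional


def schedule_group(pod_cpus: List[int], free_cpus: Dict[str, int]) -> Optional[Dict[int, str]]:
    """Return {pod index: node} for the whole group, or None if any pod does not fit."""
    remaining = dict(free_cpus)  # work on a copy so a failed attempt reserves nothing
    placement: Dict[int, str] = {}
    # Place the largest pods first (first-fit decreasing).
    for idx in sorted(range(len(pod_cpus)), key=lambda i: -pod_cpus[i]):
        need = pod_cpus[idx]
        node = next((n for n in sorted(remaining) if remaining[n] >= need), None)
        if node is None:
            return None  # one pod cannot be placed, so the whole group is rejected
        remaining[node] -= need
        placement[idx] = node
    return placement


fits = schedule_group([4, 2, 2], {"node-a": 4, "node-b": 4})
rejected = schedule_group([4, 4, 2], {"node-a": 4, "node-b": 4})
```

The same function doubles as a dry run: because it never mutates `free_cpus`, a caller can ask whether a group would fit without committing any capacity.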

Slide 14

Slide 14 text

Deterministic

Example use case: Alibaba's Singles' Day sales
- End-to-end optimization to support at least 10x traffic with minimal additional resources
- End-to-end stress testing to ensure reliability and to reduce risk

Requirements
- Application upgrades should not change container placements
- Minor resource updates should not change container placements

Slide 15

Slide 15 text

In-place Update of Pod Resources

1. v1:
   a. User:
      i. Set "in-place" annotation to "created" & update Pod requests/limits
   b. APIServer:
      i. Admission to check annotation
   c. Scheduler:
      i. Check if in-place update is possible
         1. otherwise set "in-place update fail", user falls back
      ii. Set "in-place" annotation to "accepted"
   d. Kubelet:
      i. Update container resources (CRI), cgroups manager, cpu manager, clear annotation
2. v2:
   a. A joint-effort KEP (kep #686) with the community upstream
      i. reviews are welcome!
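The v1 handshake above can be read as a small state machine driven by one annotation. The following sketch simulates it on plain dictionaries; the annotation key `in-place` is a placeholder (the slide does not give the full key), the `applied` field and all function names are hypothetical, and the fit check is reduced to a single free-CPU number.

```python
ANNOTATION = "in-place"  # placeholder key; the full annotation name is not given in the slide


def user_request(pod: dict, cpu: int) -> None:
    """User: mark the pod and update its requested CPU."""
    pod["annotations"][ANNOTATION] = "created"
    pod["requests"]["cpu"] = cpu


def admission_ok(pod: dict) -> bool:
    """APIServer admission: only a fresh 'created' request is let through."""
    return pod["annotations"].get(ANNOTATION) == "created"


def scheduler_check(pod: dict, node_free_cpu: int) -> None:
    """Scheduler: accept if the extra CPU fits on the current node, else mark failure."""
    if pod["annotations"].get(ANNOTATION) != "created":
        return
    delta = pod["requests"]["cpu"] - pod["applied"]["cpu"]
    if delta <= node_free_cpu:
        pod["annotations"][ANNOTATION] = "accepted"
    else:
        pod["annotations"][ANNOTATION] = "in-place update fail"


def kubelet_apply(pod: dict) -> None:
    """Kubelet: apply accepted resources, then clear the annotation."""
    if pod["annotations"].get(ANNOTATION) != "accepted":
        return
    pod["applied"]["cpu"] = pod["requests"]["cpu"]
    del pod["annotations"][ANNOTATION]


pod = {"annotations": {}, "requests": {"cpu": 2}, "applied": {"cpu": 2}}
user_request(pod, 4)
assert admission_ok(pod)
scheduler_check(pod, node_free_cpu=4)
kubelet_apply(pod)
```

After this run the pod has 4 CPUs applied and no annotation left; had the node lacked room, the annotation would read "in-place update fail" and the user would fall back to re-creating the pod.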

Slide 16

Slide 16 text

IP Persistence
● A CRD named ReservedIP
● The CNI service tries to consume the CR when a Pod is re-created with the annotation:
  ○ pod.beta1.sigma.ali/ip-reuse=true

apiVersion: "apiextensions.k8s.io/v1beta1"
kind: "CustomResourceDefinition"
metadata:
  name: "reservedip.extensions.sigma"
spec:
  ...
  spec:
    required: ["networkStatus", "hostIP"]
    properties:
      networkStatus:
        type: "string"
      hostIP:
        type: "string"
      ttlInSeconds:
        type: "integer"
        minimum: 30
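The validation rules visible in the CRD schema (two required string fields and a `ttlInSeconds` integer of at least 30) can be expressed as a small check. This is only a sketch of what the schema enforces; the function name `validate_reserved_ip_spec` is hypothetical, and the sample values are made up.

```python
from typing import Any, Dict, List

MIN_TTL_SECONDS = 30  # from `minimum: 30` in the schema shown on the slide


def validate_reserved_ip_spec(spec: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the spec is valid."""
    errors: List[str] = []
    for field in ("networkStatus", "hostIP"):
        if field not in spec:
            errors.append(f"{field} is required")
        elif not isinstance(spec[field], str):
            errors.append(f"{field} must be a string")
    if "ttlInSeconds" in spec:
        ttl = spec["ttlInSeconds"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(ttl, bool) or not isinstance(ttl, int):
            errors.append("ttlInSeconds must be an integer")
        elif ttl < MIN_TTL_SECONDS:
            errors.append(f"ttlInSeconds must be >= {MIN_TTL_SECONDS}")
    return errors


ok = validate_reserved_ip_spec(
    {"networkStatus": "{}", "hostIP": "10.0.0.1", "ttlInSeconds": 600}
)
bad = validate_reserved_ip_spec({"hostIP": "10.0.0.1", "ttlInSeconds": 5})
```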

Slide 17

Slide 17 text

Image pre

Slide 18

Slide 18 text

Operation at scale - deployment

Kubernetes on Kubernetes (KoK)

Slide 19

Slide 19 text

Operation at scale - upgrades

1. Dev-rollout pipeline
   a. Dev-release
      i. main branch -> feature branch -> test sets -> Code Review -> main branch -> CI pipeline -> PASS
      ii. main branch -> release -> CI pipeline -> PASS
   b. Release-build
      i. release -> generate build -> test build (*.rpm) -> prod build (*.rpm)
   c. Test-build
      i. test build -> test cluster -> monitoring & dashboard
   a. Rollout
      i. prod build -> service template -> rollout plan
   b. Run rollout plan
      i. E.g. Cluster X, a total of 3000+ nodes, batch interval 6hr, 12hr
         1. Batch 1: 2, 5, 10
         2. Batch 2: 20, 50, 100
         3. Batch 3: 200, 200, 200
         4. Batch 4: 400, 400, 400
         5. Batch 5: 500, 500, 500
   c. Rollback (just in case)
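The batch sizes in the example rollout plan add up to 3,487 nodes, which is consistent with a cluster of "3000+" nodes. The sketch below slices a node list into those steps; the 3,200-node cluster, the node names, and the function `rollout_steps` are assumptions for illustration, and batch intervals and health gates are left out.

```python
from typing import List

# Step sizes per batch, taken from the slide's example for Cluster X.
BATCHES: List[List[int]] = [
    [2, 5, 10],
    [20, 50, 100],
    [200, 200, 200],
    [400, 400, 400],
    [500, 500, 500],
]


def rollout_steps(nodes: List[str], batches: List[List[int]]) -> List[List[str]]:
    """Split `nodes` into consecutive rollout steps of growing size."""
    steps: List[List[str]] = []
    cursor = 0
    for batch in batches:
        for size in batch:
            chunk = nodes[cursor:cursor + size]
            if not chunk:
                return steps  # every node is already covered
            steps.append(chunk)
            cursor += size
    # Nodes beyond the plan's total capacity would remain unassigned here.
    return steps


plan_capacity = sum(sum(batch) for batch in BATCHES)
cluster = [f"node-{i:04d}" for i in range(3200)]  # hypothetical cluster size
steps = rollout_steps(cluster, BATCHES)
```

Starting with steps of 2, 5, and 10 nodes keeps the blast radius small while a new build is least trusted; the last step simply takes whatever nodes remain.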

Slide 20

Slide 20 text

Operation at scale - monitoring/alerting/analysis
● Data pipeline
  ○ Prometheus + Thanos

Slide 21

Slide 21 text

Operation at scale - inspection

Inspector (a map-reduce style program)
- Constraint violations
  - No unexpected resource over-commit
  - No affinity/anti-affinity violations
  - …
- State cross-checking
  - API Server pod state == pod running state queried through Kubelet
  - Controller replica count == number of running pods queried through Kubelet
- Warnings
  - Too many soft affinity/anti-affinity violations
  - ...
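The state cross-checking item can be sketched in map-reduce form: a map step compares, per node, the pod states recorded in the API Server with those reported by the Kubelet, and a reduce step aggregates mismatches. The data shapes, the "Missing" marker, and the function names below are assumptions for this example, not the Inspector's real interface.

```python
from collections import Counter
from typing import Dict, List, Tuple

Finding = Tuple[str, str, str, str]  # (node, pod, API Server state, Kubelet state)


def map_node(node: str, api_pods: Dict[str, str], kubelet_pods: Dict[str, str]) -> List[Finding]:
    """Map step: emit one finding per pod whose two views disagree."""
    findings: List[Finding] = []
    for pod in sorted(set(api_pods) | set(kubelet_pods)):
        api_state = api_pods.get(pod, "Missing")
        kubelet_state = kubelet_pods.get(pod, "Missing")
        if api_state != kubelet_state:
            findings.append((node, pod, api_state, kubelet_state))
    return findings


def reduce_findings(per_node: List[List[Finding]]) -> Counter:
    """Reduce step: count mismatches per node."""
    return Counter(f[0] for findings in per_node for f in findings)


node_a = map_node("node-a",
                  {"web-0": "Running", "web-1": "Running"},
                  {"web-0": "Running", "web-1": "Exited"})
node_b = map_node("node-b", {"job-0": "Running"}, {"job-0": "Running", "orphan": "Running"})
summary = reduce_findings([node_a, node_b])
```

Because each node is checked independently, the map step can be spread across workers, which matters at tens of thousands of nodes per cluster.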

Slide 22

Slide 22 text

Thanks
Cluster Mgmt @ Web Scale Meetup
Xiang Li, Alibaba Cloud
Lei Zhang, Alibaba Cloud