Slide 1

Slide 1 text

Kubernetes at Alibaba
Cluster Mgmt @ Web Scale Meetup
Xiang Li, Alibaba Cloud
Lei Zhang, Alibaba Cloud

Slide 2

Slide 2 text

Background
Alibaba Cloud
- Many small to medium size Kubernetes clusters
- Container Service for Kubernetes
- Serverless Kubernetes
- Virtual Kubelet
- Secure container + Multi-tenancy
Alibaba Group (Today’s focus)
- A few large Kubernetes clusters
- Sigma = Kubernetes + extensions + enhancements

Slide 3

Slide 3 text

Workload
- Production long-running services
- AI jobs
- CI/CD jobs
- YARN
- Co-located with Fuxi, the batch job manager, on the same host
[Diagram: Sigma (Kubernetes) | Fuxi]

Slide 4

Slide 4 text

Scale
~ 10 clusters
~ 10,000 nodes per cluster
~ 1,000,000 cores per cluster
~ 100,000 containers per cluster
~ 10,000 applications per cluster
~ 100 pods / second at busy time per cluster
These are approximate numbers, not exact figures!

Slide 5

Slide 5 text

Kubernetes Scalability
Official numbers
~ 5000 nodes
Scaling Sigma to support Alibaba scale
- Do not modify Kubernetes when possible (our cloud offering is upstream k8s)
- Upstream our work as much as we can

Slide 6

Slide 6 text

Potential issues skipped
Networking
Load balancing / service discovery
- We use neither kube-proxy nor CoreDNS today

Slide 7

Slide 7 text

Storage Scalability
Events
- Tried to send directly to Elasticsearch
  - Need to modify API Server code
  - Need to support watch operation
- Dedicated etcd clusters
  - Forward to Elasticsearch

Slide 8

Slide 8 text

Storage Scalability
Cluster metadata
- 30k nodes, 1 million pods, 5 million resources
- 5 KB per resource, 50 GB in total
- 5,000 resource updates/second (~ 1,000 pod creations/second)
- 50,000 resource reads/second
- MTTR within 3 minutes
- P99 < 50 ms

Slide 9

Slide 9 text

Storage Scalability
Cluster metadata optimizations
- Eliminate unnecessary node updates
- Pagination through API Server
- Limit CRD usage
- etcd optimizations need to be upstreamed
  - Btree freelist management improvements -> support 10x data size
  - Concurrent read support -> reduce write latency 10x
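The "Pagination through API Server" item can be illustrated with a minimal sketch. The Kubernetes list API accepts `limit` and `continue` parameters; everything else here (`make_fake_list_api`, `list_all`, the integer-offset token) is a hypothetical stand-in so that the loop can run without a cluster, and is not the code used in Sigma.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]


def make_fake_list_api(items: List[str]) -> Callable[[int, Optional[str]], Page]:
    """Hypothetical stand-in for a paginated list endpoint.

    The continue token is simply the next offset encoded as a string;
    a real API server returns an opaque token instead.
    """
    def list_page(limit: int, continue_token: Optional[str]) -> Page:
        start = int(continue_token) if continue_token else 0
        end = min(start + limit, len(items))
        next_token = str(end) if end < len(items) else None
        return items[start:end], next_token

    return list_page


def list_all(list_page: Callable[[int, Optional[str]], Page], limit: int) -> Iterator[str]:
    """Fetch every item in chunks of at most `limit`, one page at a time."""
    token: Optional[str] = None
    while True:
        page, token = list_page(limit, token)
        yield from page
        if token is None:  # no continue token means this was the last page
            return
```

Each request then returns at most `limit` objects, so neither the client nor the server has to hold the whole result set in a single response.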

Slide 10

Slide 10 text

Storage Scalability
Potential cluster metadata optimizations to scale further
- Compression
- Snapshot streaming
- Read-only replica
- Object patching instead of full update
- Delta encoding
- API Server / Controller checkpointing

Slide 11

Slide 11 text

Kubernetes Master Scalability
API Server
- Horizontally scalable by design
  - Not super CPU efficient though
- Objects are heavily cached
  - A lot of memory
- Lack of custom index support
  - The code to support indexes is half-baked
  - Not hard to add a custom index by modifying the API Server codebase

Slide 12

Slide 12 text

Kubernetes Master Scalability
Controller
- Default controller is monolithic, not scalable
- But pluggable, and extensible through CRD
- Built Alibaba controller/operators
Scheduler
- Default scheduler performance is not great
- But pluggable
- Built Alibaba scheduler

Slide 13

Slide 13 text

Scheduling improvements
- Node CPU topology-aware scheduling
  - We use CPUSet + CPUShare
- Storage topology-aware scheduling
- Dry run
  - User-facing capacity planning / debugging
- Pod Group
  - Pod colocation / gang scheduling
- Pre-scheduling via global optimizer
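Gang scheduling for a Pod Group means that either every pod in the group gets a node or none does. The sketch below illustrates that all-or-nothing property with a greedy first-fit placement on CPU only. The function name `gang_schedule` and the data shapes are hypothetical, and the greedy heuristic is an assumption chosen for brevity, not the Alibaba scheduler's algorithm.

```python
from typing import Dict, Optional


def gang_schedule(
    pod_cpu_requests: Dict[str, int],
    node_free_cpu: Dict[str, int],
) -> Optional[Dict[str, str]]:
    """Place every pod of a group, or place none of them.

    Returns a pod -> node mapping, or None if any pod cannot be placed.
    The caller's node_free_cpu is never modified, so a failed attempt
    leaves no partial reservation behind.
    """
    free = dict(node_free_cpu)  # tentative bookkeeping on a copy
    placement: Dict[str, str] = {}
    # Largest requests first (first-fit decreasing, a simple heuristic).
    for pod, cpu in sorted(pod_cpu_requests.items(), key=lambda kv: -kv[1]):
        node = next((n for n in sorted(free) if free[n] >= cpu), None)
        if node is None:
            return None  # one pod does not fit: reject the whole group
        free[node] -= cpu
        placement[pod] = node
    return placement
```

Because the heuristic is greedy, it can reject a group for which some valid placement exists; a production scheduler would search more carefully or, as the slide suggests, rely on a global optimizer.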

Slide 14

Slide 14 text

Deterministic
Example use case: Alibaba’s Singles' Day sales
- End-to-end optimization to support at least 10x traffic with minimal additional resources
- End-to-end stress testing to ensure reliability and to reduce risk
Requirements
- Application upgrades should not change container placements
- Minor resource updates should not change container placements

Slide 15

Slide 15 text

In-place Update of Pod Resources
1. v1:
   a. User:
      i. Set “in-place” annotation to “created” & update Pod requests/limits
   b. APIServer:
      i. Admission to check annotation
   c. Scheduler:
      i. Check if in-place update is possible
         1. otherwise set “in-place update fail”, and the user falls back
      ii. Set “in-place” annotation to “accepted”
   d. Kubelet:
      i. Update container resources (CRI), cgroups manager, cpu manager, clear annotation
2. v2:
   a. A joint-effort KEP (KEP #686) with the community upstream
      i. reviews are welcome!
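The v1 flow above is essentially a small state machine driven by one annotation. The following sketch models it in memory, under several assumptions: the annotation key is taken literally from the slide as "in-place" (the real key is not given), only a CPU request is tracked, and the `Pod` class and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

INPLACE_KEY = "in-place"  # annotation name as written on the slide; real key unknown


@dataclass
class Pod:
    cpu_request: int                  # desired CPU (spec)
    applied_cpu: int = 0              # what the kubelet has actually configured
    annotations: Dict[str, str] = field(default_factory=dict)


def user_request_resize(pod: Pod, new_cpu: int) -> None:
    """User step: mark the pod and update its requested resources."""
    pod.annotations[INPLACE_KEY] = "created"
    pod.cpu_request = new_cpu


def admission_check(pod: Pod) -> bool:
    """API server step: users may only set the annotation to 'created'."""
    return pod.annotations.get(INPLACE_KEY) in (None, "created")


def scheduler_review(pod: Pod, node_allocatable_cpu: int, other_pods_cpu: int) -> None:
    """Scheduler step: accept if the new request still fits on the node."""
    if pod.annotations.get(INPLACE_KEY) != "created":
        return
    fits = other_pods_cpu + pod.cpu_request <= node_allocatable_cpu
    pod.annotations[INPLACE_KEY] = "accepted" if fits else "in-place update fail"


def kubelet_sync(pod: Pod) -> None:
    """Kubelet step: apply accepted changes, then clear the annotation."""
    if pod.annotations.get(INPLACE_KEY) == "accepted":
        pod.applied_cpu = pod.cpu_request  # stands in for the CRI/cgroup updates
        del pod.annotations[INPLACE_KEY]
```

When the scheduler marks the update as failed, the kubelet leaves the running container untouched, and the user can fall back to recreating the pod.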

Slide 16

Slide 16 text

IP Persistence
● A CRD named ReservedIP
● The CNI service tries to consume the CR when a Pod is re-created with the annotation:
  ○ pod.beta1.sigma.ali/ip-reuse=true

apiVersion: "apiextensions.k8s.io/v1beta1"
kind: "CustomResourceDefinition"
metadata:
  name: "reservedip.extensions.sigma"
spec:
  ...
  spec:
    required: ["networkStatus", "hostIP"]
    properties:
      networkStatus:
        type: "string"
      hostIP:
        type: "string"
      ttlInSeconds:
        type: "integer"
        minimum: 30
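One way to picture how a CNI service might consume a ReservedIP record is sketched below. The annotation key and the three field names come from the slide; everything else is an assumption made for illustration: reservations are kept in a plain dict keyed by pod name, a `created_at` timestamp is added so the TTL can be evaluated, and `consume_reserved_ip` is a hypothetical function name.

```python
from dataclasses import dataclass
from typing import Dict, Optional

IP_REUSE_ANNOTATION = "pod.beta1.sigma.ali/ip-reuse"  # from the slide


@dataclass
class ReservedIP:
    network_status: str   # slide's networkStatus, treated as an opaque string
    host_ip: str          # slide's hostIP
    ttl_in_seconds: int   # slide's ttlInSeconds (schema minimum: 30)
    created_at: float     # assumption: creation time, needed to evaluate the TTL


def consume_reserved_ip(
    pod_name: str,
    annotations: Dict[str, str],
    store: Dict[str, ReservedIP],
    now: float,
) -> Optional[ReservedIP]:
    """Return the reservation for a re-created pod, or None.

    A reservation is consumed at most once and is ignored after its TTL.
    """
    if annotations.get(IP_REUSE_ANNOTATION) != "true":
        return None                      # pod did not opt in; keep the record
    reserved = store.pop(pod_name, None)
    if reserved is None:
        return None                      # nothing was reserved for this pod
    if now - reserved.created_at > reserved.ttl_in_seconds:
        return None                      # expired; the record is dropped
    return reserved
```

The TTL bounds how long an address stays pinned to a pod that never comes back, which is presumably why the schema enforces a minimum value.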

Slide 17

Slide 17 text

Image pre

Slide 18

Slide 18 text

Operation at scale - deployment
Kubernetes on Kubernetes (KoK)

Slide 19

Slide 19 text

Operation at scale - upgrades
1. Dev-rollout pipeline
   a. Dev-release
      i. main branch -> feature branch -> test sets -> Code Review -> main branch -> CI pipeline -> PASS
      ii. main branch -> release -> CI pipeline -> PASS
   b. Release-build
      i. release -> generate build -> test build (*.rpm) -> prod build (*.rpm)
   c. Test-build
      i. test build -> test cluster -> monitoring & dashboard
   a. Rollout
      i. prod build -> service template -> rollout plan
   b. Run rollout plan
      i. E.g. Cluster X, a total of 3000+ nodes, batch interval 6hr, 12hr
         1. Batch 1: 2, 5, 10
         2. Batch 2: 20, 50, 100
         3. Batch 3: 200, 200, 200
         4. Batch 4: 400, 400, 400
         5. Batch 5: 500, 500, 500
   c. Rollback (just in case)
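The example rollout plan for Cluster X can be walked through with a few lines of code. The batch sizes are the ones on the slide; reading each batch as three consecutive steps is an assumption, the batch intervals are not modelled, and `rollout_steps` is a hypothetical helper name.

```python
from typing import Iterator, List, Tuple

# Node counts per step, grouped by batch, as listed on the slide.
ROLLOUT_PLAN: List[List[int]] = [
    [2, 5, 10],
    [20, 50, 100],
    [200, 200, 200],
    [400, 400, 400],
    [500, 500, 500],
]


def rollout_steps(plan: List[List[int]]) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (batch number, step number, step size, cumulative nodes upgraded)."""
    upgraded = 0
    for batch_no, steps in enumerate(plan, start=1):
        for step_no, size in enumerate(steps, start=1):
            upgraded += size
            yield batch_no, step_no, size, upgraded


if __name__ == "__main__":
    for batch_no, step_no, size, total in rollout_steps(ROLLOUT_PLAN):
        print(f"batch {batch_no} step {step_no}: +{size} nodes, {total} upgraded so far")
```

Summing the steps gives 3,487 nodes, which matches the slide's "3000+ nodes". The plan starts with very small steps so that a bad build affects only a handful of nodes before the rollout widens.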

Slide 20

Slide 20 text

Operation at scale - monitoring/alerting/analysis
● Data pipeline
  ○ Prometheus + Thanos

Slide 21

Slide 21 text

Operation at scale - inspection
Inspector (a map-reduce style program)
- Constraint violations
  - No unexpected resource over-commit
  - No affinity/anti-affinity violations
  - …
- State cross-checking
  - API Server pod state == pod running state queried through Kubelet
  - Controller replica count == number of running pods queried through Kubelet
- Warnings
  - Too many soft affinity/anti-affinity violations
  - ...
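The state cross-checking idea can be sketched in map-reduce style: a map step compares, per node, what the API server records against what the kubelet reports, and a reduce step aggregates the findings. All names here (`map_node`, `reduce_findings`, the "pod-state-mismatch" label) and the dict-based inputs are hypothetical; the actual Inspector is not shown in the slides.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Finding = Tuple[str, str]  # (finding kind, human-readable detail)


def map_node(node: str, api_pods: Dict[str, str], kubelet_pods: Dict[str, str]) -> List[Finding]:
    """Map step: compare API server pod states with kubelet-reported states on one node."""
    findings: List[Finding] = []
    for pod in sorted(set(api_pods) | set(kubelet_pods)):
        api_state = api_pods.get(pod, "missing")
        kubelet_state = kubelet_pods.get(pod, "missing")
        if api_state != kubelet_state:
            detail = f"{node}/{pod}: api={api_state} kubelet={kubelet_state}"
            findings.append(("pod-state-mismatch", detail))
    return findings


def reduce_findings(per_node: Iterable[List[Finding]]) -> Counter:
    """Reduce step: count findings of each kind across all nodes."""
    counts: Counter = Counter()
    for findings in per_node:
        for kind, _detail in findings:
            counts[kind] += 1
    return counts
```

Because each node is checked independently, the map step can run in parallel across the cluster, and only the small per-node finding lists need to be collected centrally.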

Slide 22

Slide 22 text

Thanks
Cluster Mgmt @ Web Scale Meetup
Xiang Li, Alibaba Cloud
Lei Zhang, Alibaba Cloud