Background

Alibaba Cloud
- Many small to medium-sized Kubernetes clusters
- Container Service for Kubernetes
- Serverless Kubernetes
- Virtual Kubelet
- Secure containers + multi-tenancy

Alibaba Group (today's focus)
- A few large Kubernetes clusters
- Sigma = Kubernetes + extensions + enhancements
Workload
- Production long-running services
- AI jobs
- CI/CD jobs
- YARN
- Co-located with Fuxi, the batch job manager, on the same hosts
Scale
- ~10 clusters
- ~10,000 nodes per cluster
- ~1,000,000 cores per cluster
- ~100,000 containers per cluster
- ~10,000 applications per cluster
- ~100 pods/second at peak per cluster
(Order-of-magnitude figures, not exact numbers)
Kubernetes Scalability
- Official scalability numbers: ~5,000 nodes

Scaling Sigma to support Alibaba's scale
- Do not modify Kubernetes when possible (our cloud offering is upstream Kubernetes)
- Upstream the effort as much as we can
Storage Scalability

Events
- Tried sending events directly to Elasticsearch
  - Would need to modify API Server code
  - Would need to support the watch operation
- Instead: dedicated etcd cluster for events (see the manifest sketch below)
  - Forward events to Elasticsearch from there
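Splitting events out does not require API Server changes: the API server can point its Event storage at a separate etcd cluster. A minimal sketch, assuming a static-pod kube-apiserver; the etcd endpoints are illustrative:

  apiVersion: v1
  kind: Pod
  metadata:
    name: kube-apiserver
    namespace: kube-system
  spec:
    containers:
    - name: kube-apiserver
      command:
      - kube-apiserver
      # Main cluster state stays in the primary etcd cluster.
      - --etcd-servers=https://etcd-main-0:2379,https://etcd-main-1:2379
      # Event objects go to a dedicated etcd cluster instead.
      - --etcd-servers-overrides=/events#https://etcd-events-0:2379
      # Bound the event volume kept in that cluster.
      - --event-ttl=1h
      # ... remaining flags omitted

A separate forwarder can then watch events through the API and ship them to Elasticsearch.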
Storage Scalability

Cluster metadata optimizations
- Eliminate unnecessary node updates
- Pagination through the API Server
- Limit CRD usage
- etcd optimizations (need to be upstreamed)
  - Btree freelist management improvements -> supports 10x data size
  - Concurrent read support -> reduces write latency 10x
Kubernetes Master Scalability

API Server
- Horizontally scalable by design
  - Not super CPU-efficient, though
- Objects are heavily cached
  - Uses a lot of memory
- Lacks custom index support
  - The code to support indexes is half-finished
  - Not hard to add custom indexes by modifying the API Server codebase
Kubernetes Master Scalability

Controller
- The default controllers are monolithic and not scalable
- But pluggable, and extensible through CRDs
- Built Alibaba controllers/operators

Scheduler
- Default scheduler performance is not great
- But pluggable (see the example below)
- Built the Alibaba scheduler
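Because the scheduler is pluggable, a custom scheduler can run alongside the default one and be selected per workload. A minimal sketch; the scheduler name sigma-scheduler and the image are illustrative, not the actual Sigma component names:

  apiVersion: v1
  kind: Pod
  metadata:
    name: demo
  spec:
    # Pods opt into the custom scheduler by name; pods that omit this field
    # keep being placed by the default kube-scheduler.
    schedulerName: sigma-scheduler
    containers:
    - name: app
      image: registry.example.com/app:1.0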
Scheduling improvements
- Node CPU topology-aware scheduling
  - We use CPUSet + CPUShare (see the sketch below)
- Storage topology-aware scheduling
- Dry run
  - User-facing capacity planning / debugging
- Pod Group
  - Pod co-location / gang scheduling
- Pre-scheduling via a global optimizer
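Upstream Kubernetes expresses a similar CPUSet/CPUShare split through the kubelet CPU Manager: with the static policy, a Guaranteed pod with integer CPU requests is pinned to dedicated CPUs, while other pods share the remaining pool via cpu.shares. A minimal sketch of that upstream analogue, not Sigma's own implementation; names and sizes are illustrative:

  # Guaranteed QoS + integer CPU request: pinned to a dedicated CPUSet
  # when the kubelet runs with --cpu-manager-policy=static.
  apiVersion: v1
  kind: Pod
  metadata:
    name: pinned-service
  spec:
    containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi
  ---
  # Burstable pod: runs on the shared CPU pool, weighted by cpu.shares
  # (CPUShare) rather than pinned.
  apiVersion: v1
  kind: Pod
  metadata:
    name: shared-batch
  spec:
    containers:
    - name: worker
      image: registry.example.com/worker:1.0
      resources:
        requests:
          cpu: "2"
        limits:
          cpu: "8"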
Deterministic

Example use case: Alibaba's Singles' Day sale
- End-to-end optimization to support at least 10x traffic with minimal additional resources
- End-to-end stress testing to ensure reliability and reduce risk

Requirements
- Application upgrades should not change container placements
- Minor resource updates should not change container placements
In-place Update of Pod Resources

1. v1:
   a. User:
      i. Sets the "in-place" annotation to "created" and updates the Pod requests/limits (see the sketch after this outline)
   b. API Server:
      i. Admission checks the annotation
   c. Scheduler:
      i. Checks whether the in-place update is possible
         1. Otherwise sets the annotation to "in-place update fail" and the user falls back
      ii. Sets the "in-place" annotation to "accepted"
   d. Kubelet:
      i. Updates container resources (via CRI), the cgroups manager, and the CPU manager, then clears the annotation
2. v2:
   a. A joint-effort KEP (kep #686) with the community on upstream
      i. Reviews are welcome!
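From the user's side, the v1 flow amounts to annotating the Pod and bumping its resources in place. A sketch of that step; the annotation key is hypothetical (the slides only name the values "created", "accepted", and "in-place update fail"), and the sizes are illustrative:

  apiVersion: v1
  kind: Pod
  metadata:
    name: web-1
    annotations:
      # Hypothetical key, modeled on Sigma's annotation naming style.
      # The user sets "created" together with the new requests/limits;
      # the scheduler flips it to "accepted" (or "in-place update fail"),
      # and the kubelet clears it after resizing the running container.
      pod.beta1.sigma.ali/inplace-update-state: "created"
  spec:
    containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests:
          cpu: "8"       # was "4" before the update
          memory: 16Gi   # was 8Gi
        limits:
          cpu: "8"
          memory: 16Gi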
IP Persistence
● A CRD named ReservedIP
● The CNI service tries to consume the CR when a Pod is re-created with the annotation (example below):
  ○ pod.beta1.sigma.ali/ip-reuse=true

apiVersion: "apiextensions.k8s.io/v1beta1"
kind: "CustomResourceDefinition"
metadata:
  name: "reservedip.extensions.sigma"
spec:
  ...
        spec:
          required: ["networkStatus", "hostIP"]
          properties:
            networkStatus:
              type: "string"
            hostIP:
              type: "string"
            ttlInSeconds:
              type: "integer"
              minimum: 30
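Put together, a Pod opting into IP reuse and the ReservedIP object the CNI service consumes might look like the sketch below. Only the annotation key and the schema fields come from the slide; the API version, names, and values are illustrative:

  apiVersion: v1
  kind: Pod
  metadata:
    name: web-1
    annotations:
      # Ask the CNI service to reuse the IP reserved for this Pod.
      pod.beta1.sigma.ali/ip-reuse: "true"
  spec:
    containers:
    - name: app
      image: registry.example.com/app:1.0
  ---
  # Illustrative ReservedIP object matching the CRD schema above.
  apiVersion: extensions.sigma/v1beta1   # group taken from the CRD name; version assumed
  kind: ReservedIP
  metadata:
    name: web-1
  spec:
    networkStatus: "10.0.3.27/24"   # the schema only requires a string; value illustrative
    hostIP: "192.168.3.15"          # illustrative
    ttlInSeconds: 600               # must be >= 30 per the schema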
Operation at scale - upgrades

1. Dev-rollout pipeline
   a. Dev-release
      i. main branch -> feature branch -> test sets -> Code Review -> main branch -> CI pipeline -> PASS
      ii. main branch -> release -> CI pipeline -> PASS
   b. Release-build
      i. release -> generate build -> test build (*.rpm) -> prod build (*.rpm)
   c. Test-build
      i. test build -> test cluster -> monitoring & dashboard
2. Rollout
   a. prod build -> service template -> rollout plan
3. Run rollout plan (illustrative config below)
   a. E.g. Cluster X with 3000+ nodes in total, batch interval 6h or 12h:
      1. Batch 1: 2, 5, 10 nodes
      2. Batch 2: 20, 50, 100 nodes
      3. Batch 3: 200, 200, 200 nodes
      4. Batch 4: 400, 400, 400 nodes
      5. Batch 5: 500, 500, 500 nodes
4. Rollback (just in case)
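Purely as an illustration (the slides do not show Sigma's actual rollout tooling or file format), the batch plan above could be captured declaratively along these lines; every field name here is invented:

  # Hypothetical rollout-plan config for Cluster X (3000+ nodes).
  cluster: cluster-x
  build: sigma-node-2.4.1.rpm        # illustrative prod build
  batchIntervalHours: [6, 12]        # wait 6h or 12h between batches
  batches:
  - [2, 5, 10]
  - [20, 50, 100]
  - [200, 200, 200]
  - [400, 400, 400]
  - [500, 500, 500]
  onFailure: rollback                # roll back, just in case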
Operation at scale - inspection

Inspector (a map-reduce style program)
- Constraint violations
  - No unexpected resource over-commit
  - No affinity/anti-affinity violations
  - ...
- State cross-checking
  - API Server pod state == pod running state queried through the Kubelet
  - Controller replica count == number of running pods queried through the Kubelet
- Warnings
  - Too many soft affinity/anti-affinity violations
  - ...