Kubernetes at Alibaba at Web Scale

How Alibaba runs Kubernetes in clusters of about 10,000 nodes.

Lei (Harry) Zhang

February 27, 2019


  1. Kubernetes at Alibaba: Cluster Mgmt @ Web Scale Meetup

    - Xiang Li, Alibaba Cloud
    - Lei Zhang, Alibaba Cloud
  2. Background

    - Alibaba Cloud
      - Many small to medium size Kubernetes clusters
      - Container Service for Kubernetes
      - Serverless Kubernetes
      - Virtual Kubelet
      - Secure container + Multi-tenancy
    - Alibaba Group (Today’s focus)
      - A few large Kubernetes clusters
      - Sigma = Kubernetes + extensions + enhancements
  3. Workload

    - Production long running services
    - AI jobs
    - CI/CD jobs
    - YARN
    - Co-located with Fuxi, the batch job manager, on the same host
    - Diagram labels: Sigma (Kubernetes), Fuxi
  4. Scale

    - ~ 10 clusters
    - ~ 10,000 nodes per cluster
    - ~ 1,000,000 cores per cluster
    - ~ 100,000 containers per cluster
    - ~ 10,000 applications per cluster
    - ~ 100 pods / second at busy time per cluster
    - Not accurate numbers!
  5. Kubernetes Scalability

    - Official numbers: ~ 5000 nodes
    - Scaling Sigma to support Alibaba scale
      - Do not modify Kubernetes when possible (our cloud offering is upstream k8s)
      - Upstream the effort as much as we can
  6. Storage Scalability: Events

    - Tried to directly send to ElasticSearch
    - Need to modify API Server code
    - Need to support watch operation
    - Dedicated etcd clusters
    - Forward to ElasticSearch
  7. Storage Scalability: Cluster metadata

    - 30k nodes, 1 million pods, 5 million resources
    - 5KB/resource, 50GB in total
    - 5,000 resource updates/second (~ 1,000 pod creations/second)
    - 50,000 resource reads/second
    - MTTR within 3 minutes
    - P99 < 50ms
  8. Storage Scalability: Cluster metadata optimizations

    - Eliminate unnecessary node updates
    - Pagination through API Server
    - Limit CRD usage
    - etcd optimizations need to be upstreamed
      - Btree freelist management improvements -> support 10x data size
      - Concurrent read support -> reduce write latency 10x
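
The "pagination through API Server" item can be illustrated with a small sketch. The Kubernetes list API accepts `limit` and `continue` query parameters so that a client fetches a large collection in chunks instead of one huge response. The code below only imitates that protocol against an in-memory list: `make_fake_lister` and `list_all` are hypothetical names for illustration, and the integer-offset token is an assumption (real continue tokens are opaque).

```python
from typing import Callable, Iterator, List, Optional, Tuple

# One page of results plus the token for the next page (None when done).
Page = Tuple[List[str], Optional[str]]


def make_fake_lister(items: List[str]) -> Callable[[int, Optional[str]], Page]:
    """Hypothetical stand-in for a paginated list endpoint."""

    def list_page(limit: int, continue_token: Optional[str]) -> Page:
        # Assumption: the token is just the next offset, encoded as a string.
        start = int(continue_token) if continue_token else 0
        end = min(start + limit, len(items))
        next_token = str(end) if end < len(items) else None
        return items[start:end], next_token

    return list_page


def list_all(list_page: Callable[[int, Optional[str]], Page], limit: int) -> Iterator[str]:
    """Walk every page, holding at most `limit` items in memory at a time."""
    token: Optional[str] = None
    while True:
        chunk, token = list_page(limit, token)
        yield from chunk
        if token is None:
            break


pods = [f"pod-{i}" for i in range(10)]
lister = make_fake_lister(pods)
all_pods = list(list_all(lister, limit=3))
```

With `limit=3` the ten items arrive in four pages, and the caller never has to materialize more than one page per request.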
  9. Storage Scalability: Potential cluster metadata optimizations to scale further

    - Compression
    - Snapshot streaming
    - Read-only replica
    - Object patching instead of full update
    - Delta encoding
    - API Server / Controller checkpointing
  10. Kubernetes Master Scalability: API Server

    - Horizontally scalable by design
    - Not super CPU efficient though
    - Objects are heavily cached
    - A lot of memory
    - Lack of custom index support
    - The code to support indexes is half-baked
    - Not hard to add a custom index by modifying the API Server codebase
  11. Kubernetes Master Scalability

    - Controller
      - Default controller is monolithic, not scalable
      - But pluggable, and extensible through CRD
      - Built Alibaba controller/operators
    - Scheduler
      - Default scheduler performance is not great
      - But pluggable
      - Built Alibaba scheduler
  12. Scheduling improvements

    - Node CPU topology awareness scheduling
      - We use CPUSet + CPUShare
    - Storage topology awareness scheduling
    - Dry run
      - User facing capacity planning / debugging
    - Pod Group
      - Pod colocation / gang scheduling
    - Pre scheduling via global optimizer
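
To make the "Pod Group / gang scheduling" item concrete, here is a minimal sketch of all-or-nothing placement: either every pod in the group gets a node, or none does. This is not Alibaba's scheduler. The function name `gang_schedule`, the CPU-only resource model, and the greedy first-fit-decreasing heuristic are all assumptions chosen for brevity, and a greedy pass can reject a group that a smarter search would fit.

```python
from typing import Dict, Optional


def gang_schedule(
    pod_cpu_requests: Dict[str, int],
    node_free_cpu: Dict[str, int],
) -> Optional[Dict[str, str]]:
    """Place every pod of a group, or return None without reserving anything."""
    # Work on a copy so a failed attempt leaves the caller's view untouched.
    free = dict(node_free_cpu)
    placement: Dict[str, str] = {}

    # Largest requests first (first-fit decreasing, a greedy heuristic).
    for pod, cpu in sorted(pod_cpu_requests.items(), key=lambda kv: -kv[1]):
        node = next((n for n in sorted(free) if free[n] >= cpu), None)
        if node is None:
            return None  # one pod does not fit, so the whole group is rejected
        free[node] -= cpu
        placement[pod] = node

    return placement


nodes = {"n1": 4, "n2": 4}
fits = gang_schedule({"a": 4, "b": 4}, nodes)
too_big = gang_schedule({"a": 4, "b": 4, "c": 4}, nodes)
```

Because the function only mutates its private copy, a rejected group consumes no capacity, which is the property that distinguishes gang scheduling from placing pods one at a time.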
  13. Deterministic

    - Example use case: Alibaba’s Singles' Day sales
      - End to end optimization to support at least 10x traffic with minimal additional resources
      - End to end stress testing to ensure reliability and to reduce risk
    - Requirements
      - Application upgrades should not change container placements
      - Minor resource updates should not change container placements
  14. In-place Update of Pod Resources

    1. v1:
       a. User:
          i. Set “in-place” annotation to “created” & update Pod requests/limits
       b. APIServer:
          i. Admission to check annotation
       c. Scheduler:
          i. Check if in-place update is possible
             1. otherwise set “in-place update fail”, user falls back
          ii. Set “in-place” annotation to “accepted”
       d. Kubelet:
          i. Update container resources (CRI), cgroups manager, cpu manager, clear annotation
    2. v2:
       a. A joint effort KEP (kep #686) with the community upstream
          i. reviews are welcome!
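
The v1 hand-off between user, API server, scheduler and kubelet can be read as a small state machine driven by one annotation. The sketch below simulates that flow in memory. The annotation values come from the slide, but everything else is an assumption for illustration: the key string `"in-place"`, the `Pod` dataclass, the CPU-only accounting, and all function names are hypothetical and do not reflect the real Sigma code.

```python
from dataclasses import dataclass, field
from typing import Dict

ANNOTATION = "in-place"  # hypothetical key; the slide only names it loosely


@dataclass
class Pod:
    cpu_request: int  # what the spec asks for
    node: str
    annotations: Dict[str, str] = field(default_factory=dict)
    applied_cpu: int = 0  # what the kubelet has actually configured


def user_request_update(pod: Pod, new_cpu: int) -> None:
    """User: mark the pod and change its requests in the same update."""
    pod.annotations[ANNOTATION] = "created"
    pod.cpu_request = new_cpu


def admit(pod: Pod) -> bool:
    """API server admission: users may only submit the 'created' state."""
    return pod.annotations.get(ANNOTATION) in (None, "created")


def scheduler_check(pod: Pod, node_free_cpu: Dict[str, int]) -> None:
    """Scheduler: accept if the current node can absorb the delta, else fail."""
    if pod.annotations.get(ANNOTATION) != "created":
        return
    extra = pod.cpu_request - pod.applied_cpu
    if extra <= node_free_cpu[pod.node]:
        node_free_cpu[pod.node] -= extra
        pod.annotations[ANNOTATION] = "accepted"
    else:
        pod.annotations[ANNOTATION] = "in-place update fail"


def kubelet_apply(pod: Pod) -> None:
    """Kubelet: apply accepted resources, then clear the annotation."""
    if pod.annotations.get(ANNOTATION) == "accepted":
        pod.applied_cpu = pod.cpu_request
        del pod.annotations[ANNOTATION]


free = {"n1": 2}
pod = Pod(cpu_request=2, node="n1", applied_cpu=2)

user_request_update(pod, 4)  # fits: 2 extra cores are free on n1
admitted = admit(pod)
scheduler_check(pod, free)
kubelet_apply(pod)
after_success = (pod.applied_cpu, dict(pod.annotations), free["n1"])

user_request_update(pod, 8)  # does not fit: n1 has no free cores left
scheduler_check(pod, free)
kubelet_apply(pod)
after_failure = (pod.applied_cpu, pod.annotations[ANNOTATION])
```

In the failing case the container keeps its old resources and the annotation stays at the failure value, which is the signal for the user to fall back (for example by recreating the pod). The fallback itself is not modeled here.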
  15. IP Persistence

    • A CRD named ReservedIP
    • The CNI Service tries to consume the CR when a Pod is re-created with the annotation:
      ◦ pod.beta1.sigma.ali/ip-reuse=true

        apiVersion: "apiextensions.k8s.io/v1beta1"
        kind: "CustomResourceDefinition"
        metadata:
          name: "reservedip.extensions.sigma"
        spec:
          ...
          spec:
            required: ["networkStatus", "hostIP"]
            properties:
              networkStatus:
                type: "string"
              hostIP:
                type: "string"
              ttlInSeconds:
                type: "integer"
                minimum: 30
  16. Operation at scale - upgrades

    1. Dev-rollout pipeline
       a. Dev-release
          i. main branch -> feature branch -> test sets -> Code Review -> main branch -> CI pipeline -> PASS
          ii. main branch -> release -> CI pipeline -> PASS
       b. Release-build
          i. release -> generate build -> test build (*.rpm) -> prod build (*.rpm)
       c. Test-build
          i. test build -> test cluster -> monitoring & dashboard

       a. Rollout
          i. prod build -> service template -> rollout plan
       b. Run rollout plan
          i. E.g. Cluster X, a total of 3000+ nodes, batch interval 6hr, 12hr
             1. Batch 1: 2, 5, 10
             2. Batch 2: 20, 50, 100
             3. Batch 3: 200, 200, 200
             4. Batch 4: 400, 400, 400
             5. Batch 5: 500, 500, 500
       c. Rollback (just in case)
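
The example rollout plan can be checked with a few lines of arithmetic. The sketch below assumes each batch consists of three consecutive steps whose sizes are node counts (one reading of "Batch 1: 2, 5, 10"). The names `ROLLOUT_PLAN`, `cumulative_progress` and `split_nodes` are hypothetical. Under that reading the plan covers 3,487 nodes in total, consistent with the "3000+ nodes" on the slide.

```python
from typing import List, Tuple

# Step sizes per batch, copied from the slide (assumed to be node counts).
ROLLOUT_PLAN: List[List[int]] = [
    [2, 5, 10],
    [20, 50, 100],
    [200, 200, 200],
    [400, 400, 400],
    [500, 500, 500],
]


def cumulative_progress(plan: List[List[int]]) -> List[Tuple[int, int, int]]:
    """Return (batch number, step size, nodes upgraded so far) for every step."""
    done = 0
    progress = []
    for batch_no, steps in enumerate(plan, start=1):
        for step in steps:
            done += step
            progress.append((batch_no, step, done))
    return progress


def split_nodes(nodes: List[str], plan: List[List[int]]) -> List[List[str]]:
    """Slice a node list into consecutive steps following the plan."""
    chunks, start = [], 0
    for steps in plan:
        for step in steps:
            chunks.append(nodes[start:start + step])
            start += step
    return chunks


progress = cumulative_progress(ROLLOUT_PLAN)
total_nodes = progress[-1][2]
chunks = split_nodes([f"node-{i}" for i in range(total_nodes)], ROLLOUT_PLAN)
```

The first batch touches only 17 nodes in total, so a bad build is caught while the blast radius is still well under 1% of the cluster.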
  17. Operation at scale - inspection

    Inspector (a map-reduce style program)

    - Constraint violations
      - No unexpected resource over-commit
      - No affinity/anti-affinity violations
      - …
    - State cross checking
      - API Server pod state == pod running state queried through Kubelet
      - Controller replica count == number of running pods queried through Kubelet
    - Warnings
      - Too many soft affinity/anti-affinity violations
      - ...
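
The "state cross checking" idea can be sketched in a map-reduce shape: a map step compares the two views of each node independently, and a reduce step aggregates the findings. The code below is an illustration only. The data layout (pod name to phase, per node), the `"Missing"` placeholder, and the names `map_node`, `reduce_findings` and `replica_mismatches` are all assumptions, not the real Inspector.

```python
from collections import Counter
from typing import Dict, List, Tuple

PodStates = Dict[str, str]  # pod name -> state, for one node


def map_node(node: str, api_view: PodStates, kubelet_view: PodStates) -> List[Tuple[str, str]]:
    """Map step: list pods whose API Server state differs from the kubelet's."""
    findings = []
    for pod in sorted(set(api_view) | set(kubelet_view)):
        api_state = api_view.get(pod, "Missing")
        kubelet_state = kubelet_view.get(pod, "Missing")
        if api_state != kubelet_state:
            findings.append((node, f"{pod}: apiserver={api_state} kubelet={kubelet_state}"))
    return findings


def reduce_findings(per_node: List[List[Tuple[str, str]]]) -> Counter:
    """Reduce step: count mismatches per node."""
    return Counter(node for findings in per_node for node, _ in findings)


def replica_mismatches(desired: Dict[str, int], running: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Controller replica count vs. running pods reported by kubelets."""
    return {
        app: (want, running.get(app, 0))
        for app, want in desired.items()
        if want != running.get(app, 0)
    }


api = {"n1": {"web-0": "Running", "web-1": "Running"}, "n2": {"db-0": "Running"}}
kubelet = {"n1": {"web-0": "Running"}, "n2": {"db-0": "Running"}}

mapped = [map_node(n, api[n], kubelet.get(n, {})) for n in sorted(api)]
summary = reduce_findings(mapped)
replicas = replica_mismatches({"web": 2, "db": 1}, {"web": 1, "db": 1})
```

Since each node is compared in isolation, the map step can be fanned out across workers, which is presumably what makes this style attractive at 10,000 nodes.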