
Kubernetes at Alibaba at Web Scale

Practices of how Alibaba runs Kubernetes in 10,000-node clusters.

Lei (Harry) Zhang

February 27, 2019

Transcript

  1. Kubernetes at Alibaba
     Cluster Mgmt @ Web Scale Meetup
     Xiang Li, Alibaba Cloud
     Lei Zhang, Alibaba Cloud
  2. Background
     Alibaba Cloud:
     - Many small to medium size Kubernetes clusters
     - Container Service for Kubernetes
     - Serverless Kubernetes
     - Virtual Kubelet
     - Secure containers + multi-tenancy
     Alibaba Group (today's focus):
     - A few large Kubernetes clusters
     - Sigma = Kubernetes + extensions + enhancements
  3. Workload
     - Production long-running services
     - AI jobs
     - CI/CD jobs
     - YARN
     - Co-located with Fuxi, the batch job manager, on the same hosts (Sigma/Kubernetes + Fuxi)
  4. Scale (rough, not exact numbers)
     - ~10 clusters
     - ~10,000 nodes per cluster
     - ~1,000,000 cores per cluster
     - ~100,000 containers per cluster
     - ~10,000 applications per cluster
     - ~100 pods/second at busy times per cluster
  5. Kubernetes Scalability
     - Official numbers: ~5,000 nodes
     - Scaling Sigma to support Alibaba scale:
       - Do not modify Kubernetes when possible (our cloud offering is upstream Kubernetes)
       - Upstream the effort as much as we can
  6. Potential issues skipped: networking, load balancing / service discovery
     - We do not use kube-proxy or CoreDNS today
  7. Storage Scalability: Events
     - Tried to send events directly to ElasticSearch:
       - Would need to modify API Server code
       - Would need to support the watch operation
     - Instead: dedicated etcd clusters for events, forwarded to ElasticSearch (sketch below)
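     One upstream way to wire up a dedicated events store is the kube-apiserver flag --etcd-servers-overrides; the deck does not say this is exactly how Sigma does it, so the fragment below is only a sketch, with illustrative endpoint names and image tag:

         # kube-apiserver static-pod fragment: core/v1 Events go to a dedicated
         # etcd cluster; everything else stays in the main cluster-state etcd.
         # Endpoint names and the image tag are illustrative, not Alibaba's topology.
         spec:
           containers:
           - name: kube-apiserver
             image: k8s.gcr.io/kube-apiserver:v1.13.4
             command:
             - kube-apiserver
             - --etcd-servers=https://etcd-main-0:2379,https://etcd-main-1:2379,https://etcd-main-2:2379
             # override format is group/resource#servers (servers semicolon-separated);
             # "/events" means the core-group Events resource
             - --etcd-servers-overrides=/events#https://etcd-events-0:2379;https://etcd-events-1:2379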
  8. Storage Scalability: Cluster metadata
     - 30k nodes, 1 million pods, 5 million resources
     - 5 KB per resource, 50 GB in total
     - 5,000 resource updates/second (~1,000 pod creations/second)
     - 50,000 resource reads/second
     - MTTR within 3 minutes
     - P99 latency < 50 ms
  9. Storage Scalability: Cluster metadata optimizations
     - Eliminate unnecessary node updates (kubelet sketch below)
     - Pagination through the API Server
     - Limit CRD usage
     - etcd optimizations (need to be upstreamed):
       - B-tree freelist management improvements -> supports 10x the data size
       - Concurrent read support -> reduces write latency 10x
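     The deck does not spell out which node updates were eliminated; one upstream knob in the same spirit is to have the kubelet write its (large) Node status object less often and cap the reported image list. A sketch with illustrative values, not Alibaba's production settings:

         # KubeletConfiguration sketch: reduce Node object churn and size in etcd.
         apiVersion: kubelet.config.k8s.io/v1beta1
         kind: KubeletConfiguration
         nodeStatusUpdateFrequency: 10s   # how often node status is computed locally
         nodeStatusReportFrequency: 5m    # how often an unchanged status is actually written to the API server
         nodeStatusMaxImages: 20          # cap the image list embedded in Node status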
  10. Storage Scalability: Potential cluster metadata optimizations to scale further
     - Compression
     - Snapshot streaming
     - Read-only replicas
     - Object patching instead of full updates
     - Delta encoding
     - API Server / controller checkpointing
  11. Kubernetes Master Scalability: API Server
     - Horizontally scalable by design
       - Though not very CPU efficient
     - Objects are heavily cached
       - Uses a lot of memory
     - Lacks custom index support
       - The code to support indexes is only half finished
       - Not hard to add custom indexes by modifying the API Server codebase
  12. Kubernetes Master Scalability: Controller and Scheduler
     Controller:
     - Default controller manager is monolithic, not scalable
     - But pluggable, and extensible through CRDs
     - Built Alibaba controllers/operators
     Scheduler:
     - Default scheduler performance is not great
     - But pluggable
     - Built the Alibaba scheduler
  13. Scheduling improvements
     - Node CPU topology-aware scheduling
       - We use CPUSet + CPUShare (sketch below)
     - Storage topology-aware scheduling
     - Dry run
       - User-facing capacity planning / debugging
     - Pod Group
       - Pod colocation / gang scheduling
     - Pre-scheduling via a global optimizer
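     The CPUSet + CPUShare split is Alibaba's own mechanism; the nearest upstream analogue is the kubelet's static CPU manager policy, where a Guaranteed pod with integer CPU requests gets exclusive pinned cores and everything else shares the remaining cores. A minimal sketch, assuming the kubelet runs with --cpu-manager-policy=static (name and image are illustrative):

         # Integer CPU with requests == limits makes the pod Guaranteed QoS,
         # so the static CPU manager pins it to exclusive cores (CPUSet-like);
         # burstable pods share the remaining cores (CPUShare-like).
         apiVersion: v1
         kind: Pod
         metadata:
           name: latency-sensitive-app
         spec:
           containers:
           - name: app
             image: registry.example.com/app:v1
             resources:
               requests:
                 cpu: "4"
                 memory: 8Gi
               limits:
                 cpu: "4"
                 memory: 8Gi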
  14. Deterministic
     Example use case: Alibaba's Singles' Day sale
     - End-to-end optimization to support at least 10x traffic with minimal additional resources
     - End-to-end stress testing to ensure reliability and reduce risk
     Requirements:
     - Application upgrades should not change container placements
     - Minor resource updates should not change container placements
  15. In-place Update of Pod Resources
     1. v1:
        a. User:
           i. Set the "in-place" annotation to "created" and update the Pod requests/limits
        b. API Server:
           i. Admission check on the annotation
        c. Scheduler:
           i. Check whether the in-place update is possible
              1. Otherwise set "in-place update failed" and the user falls back
           ii. Set the "in-place" annotation to "accepted"
        d. Kubelet:
           i. Update container resources (CRI), cgroups manager, CPU manager; clear the annotation
     2. v2:
        a. A joint-effort KEP with the upstream community (KEP #686)
           i. Reviews are welcome!
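     The deck names the annotation only as "in-place"; the sketch below uses a hypothetical annotation key to illustrate the v1 handshake: the user sets it to "created" together with the new requests/limits, the scheduler flips it to "accepted" (or a failure value), and the kubelet applies the change and clears it.

         # Hypothetical illustration of the v1 flow; the real annotation key
         # and the resource values are not given in the deck.
         apiVersion: v1
         kind: Pod
         metadata:
           name: trade-web-0
           annotations:
             # written by the user along with the new requests/limits;
             # the scheduler later changes it to "accepted", or to a
             # failure value if the node cannot fit the update
             inplace-update.sigma.ali/state: "created"
         spec:
           containers:
           - name: app
             image: registry.example.com/app:v2
             resources:
               requests:
                 cpu: "8"        # raised in place from "4"
                 memory: 16Gi
               limits:
                 cpu: "8"
                 memory: 16Gi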
  16. IP Persistence
     - A CRD named ReservedIP
     - The CNI service tries to consume the CR when a Pod is re-created with the annotation:
       pod.beta1.sigma.ali/ip-reuse=true

     apiVersion: "apiextensions.k8s.io/v1beta1"
     kind: "CustomResourceDefinition"
     metadata:
       name: "reservedip.extensions.sigma"
     spec:
       ...
           spec:
             required: ["networkStatus", "hostIP"]
             properties:
               networkStatus:
                 type: "string"
               hostIP:
                 type: "string"
               ttlInSeconds:
                 type: "integer"
                 minimum: 30
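     A usage sketch under the schema above: a re-created Pod carrying the reuse annotation from the deck, plus the ReservedIP object the CNI service would consume. Only the annotation key and the field names come from the deck; the CR's apiVersion and all values are illustrative assumptions.

         # Pod re-created with the IP-reuse annotation (key taken from the deck)
         apiVersion: v1
         kind: Pod
         metadata:
           name: trade-web-0
           annotations:
             pod.beta1.sigma.ali/ip-reuse: "true"
         spec:
           containers:
           - name: app
             image: registry.example.com/app:v1
         ---
         # ReservedIP custom resource; apiVersion and values are assumptions,
         # field names follow the CRD schema above.
         apiVersion: extensions.sigma/v1beta1
         kind: ReservedIP
         metadata:
           name: trade-web-0
         spec:
           hostIP: "10.12.34.56"
           networkStatus: "172.16.8.21"   # opaque string in the schema; format assumed
           ttlInSeconds: 300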
  17. Image pre

  18. Operation at scale - deployment: Kubernetes on Kubernetes (KoK)

  19. Operation at scale - upgrades
     1. Dev-rollout pipeline
        a. Dev-release
           i. main branch -> feature branch -> test sets -> code review -> main branch -> CI pipeline -> PASS
           ii. main branch -> release -> CI pipeline -> PASS
        b. Release-build
           i. release -> generate build -> test build (*.rpm) -> prod build (*.rpm)
        c. Test-build
           i. test build -> test cluster -> monitoring & dashboards
     2. Rollout
        a. prod build -> service template -> rollout plan (sketch below)
        b. Run the rollout plan
           i. E.g. Cluster X, 3000+ nodes in total, batch interval 6h / 12h:
              1. Batch 1: 2, 5, 10 nodes
              2. Batch 2: 20, 50, 100 nodes
              3. Batch 3: 200, 200, 200 nodes
              4. Batch 4: 400, 400, 400 nodes
              5. Batch 5: 500, 500, 500 nodes
        c. Rollback (just in case)
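     A hypothetical sketch of what the batched rollout plan for Cluster X could look like as data; only the batch sizes and the 6h/12h intervals come from the deck, the schema and the artifact name are made up for illustration:

         # Hypothetical rollout-plan format (not from the deck)
         cluster: cluster-x                 # 3000+ nodes in total
         build: prod-build.rpm              # placeholder artifact name
         batchInterval: [6h, 12h]           # wait between batches; the deck lists both values
         batches:
         - [2, 5, 10]
         - [20, 50, 100]
         - [200, 200, 200]
         - [400, 400, 400]
         - [500, 500, 500]
         rollback:
           keepPreviousBuild: true          # "just in case"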
  20. Operation at scale - monitoring/alerting/analysis
     - Data pipeline: Prometheus + Thanos (sketch below)
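     The deck only names the Prometheus + Thanos pairing; the usual wiring is a Thanos sidecar next to Prometheus in the same pod, uploading TSDB blocks to object storage for long-term, cross-cluster queries. A minimal sketch; image tags, ports and paths are illustrative:

         # Fragment of a Prometheus pod template with a Thanos sidecar.
         containers:
         - name: prometheus
           image: prom/prometheus:v2.7.1
           args:
           - --config.file=/etc/prometheus/prometheus.yml
           - --storage.tsdb.path=/prometheus
           - --storage.tsdb.min-block-duration=2h   # produce 2h blocks the sidecar can ship
           - --storage.tsdb.max-block-duration=2h
           volumeMounts:
           - name: data
             mountPath: /prometheus
         - name: thanos-sidecar
           image: improbable/thanos:v0.3.2
           args:
           - sidecar
           - --tsdb.path=/prometheus
           - --prometheus.url=http://localhost:9090
           - --objstore.config-file=/etc/thanos/objstore.yaml   # bucket config for long-term storage
           volumeMounts:
           - name: data
             mountPath: /prometheus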
  21. Operation at scale - inspection
     Inspector (a map-reduce style program):
     - Constraint violations
       - No unexpected resource over-commit
       - No affinity/anti-affinity violations
       - ...
     - State cross-checking
       - API Server pod state == pod running state queried through the Kubelet
       - Controller replicas == number of running pods queried through the Kubelet
     - Warnings
       - Too many soft affinity/anti-affinity violations
       - ...
  22. Thanks
     Cluster Mgmt @ Web Scale Meetup
     Xiang Li, Alibaba Cloud
     Lei Zhang, Alibaba Cloud