Kubernetes at Alibaba at Web Scale

Practices from how Alibaba runs Kubernetes in clusters of 10k nodes.

Lei (Harry) Zhang

February 27, 2019

Transcript

  1. Kubernetes at Alibaba
    Cluster Mgmt @ Web Scale Meetup
    Xiang Li, Alibaba Cloud
    Lei Zhang, Alibaba Cloud

  2. Background
    Alibaba Cloud
    - Many small to medium size Kubernetes clusters
    - Container Service for Kubernetes
    - Serverless Kubernetes
    - Virtual Kubelet
    - Secure container + Multi-tenancy
    Alibaba Group (Today’s focus)
    - A few large Kubernetes clusters
    - Sigma = Kubernetes + extensions + enhancements

  3. Workload
    - Production long running services
    - AI jobs
    - CI/CD jobs
    - YARN
    - Co-located with Fuxi, the batch job manager, on the same host
    Sigma(Kubernetes)
    Fuxi

  4. Scale
    ~ 10 clusters
    ~ 10,000 nodes per cluster
    ~ 1,000,000 cores per cluster
    ~ 100,000 containers per cluster
    ~ 10,000 applications per cluster
    ~ 100 pods / second at busy time per cluster
    Approximate numbers, not exact figures!

  5. Kubernetes Scalability
    Official numbers
    ~ 5000 nodes
    Scaling Sigma to support Alibaba scale
    - Do not modify Kubernetes when possible (our cloud offering is upstream k8s)
    - Upstream the effort as much as we can

  6. Potential issues skipped
    Networking
    Load balancing / service discovery
    - We use neither kube-proxy nor CoreDNS today

  7. Storage Scalability
    Events
    - Tried to directly send to ElasticSearch
    - Need to modify API Server code
    - Need to support watch operation
    - Dedicated etcd clusters
    - Forward to ElasticSearch

  8. Storage Scalability
    Cluster metadata
    - 30k nodes, 1 million pods, 5 million resources
    - 5 KB/resource, 50 GB in total
    - 5,000 resource updates/second (~1,000 pod creations/second)
    - 50,000 resource reads/second
    - MTTR within 3 minutes
    - P99 < 50ms

  9. Storage Scalability
    Cluster metadata optimizations
    - Eliminate unnecessary node updates
    - Pagination through API Server
    - Limit CRD usage
    - etcd optimizations need to be upstreamed
    - Btree freelist management improvements -> support 10x data size
    - Concurrent read support -> reduce write latency 10x
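The "Pagination through API Server" item can be illustrated with a small sketch. The Kubernetes list API accepts `limit` and `continue` parameters so that a client fetches a large collection in chunks rather than in one response. The code below is not a real client: `list_page` is a hypothetical in-memory stand-in for the API Server, and the token format (a stringified offset) is an assumption made only for illustration.

```python
from typing import Dict, List, Optional, Tuple


def list_page(store: List[Dict], limit: int,
              continue_token: Optional[str] = None) -> Tuple[List[Dict], Optional[str]]:
    """Hypothetical stand-in for a paginated list call.

    Returns one page of at most `limit` items plus a token for the next
    page, or None when the collection is exhausted.
    """
    start = int(continue_token) if continue_token else 0
    page = store[start:start + limit]
    next_start = start + len(page)
    next_token = str(next_start) if next_start < len(store) else None
    return page, next_token


def list_all(store: List[Dict], limit: int) -> List[Dict]:
    """Client-side loop: keep following the token until it is None."""
    items: List[Dict] = []
    token: Optional[str] = None
    while True:
        page, token = list_page(store, limit, token)
        items.extend(page)
        if token is None:
            return items


# Illustrative data only: 2,500 fake pod objects fetched 500 at a time.
PODS = [{"name": f"pod-{i}"} for i in range(2500)]
ALL_PODS = list_all(PODS, limit=500)
```

Each response stays bounded in size, which keeps the memory and latency of any single list call predictable as the cluster grows.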

  10. Storage Scalability
    Potential cluster metadata optimizations to scale further
    - Compression
    - Snapshot streaming
    - Read-only replica
    - Object patching instead of full update
    - Delta encoding
    - API Server / Controller checkpointing
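To make "Object patching instead of full update" concrete, here is a minimal sketch that computes a JSON merge patch (in the style of RFC 7386) between two versions of an object, so that only the changed fields are sent. This is an illustration under assumptions, not the talk's implementation: the object shape and field names are invented, and lists are treated as opaque values.

```python
from typing import Any, Dict


def make_merge_patch(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Build a merge patch that turns `old` into `new`."""
    patch: Dict[str, Any] = {}
    for key in old:
        if key not in new:
            patch[key] = None  # null means "delete this key" in a merge patch
    for key, new_value in new.items():
        if key not in old:
            patch[key] = new_value
        elif isinstance(old[key], dict) and isinstance(new_value, dict):
            sub = make_merge_patch(old[key], new_value)
            if sub:  # only include nested objects that actually changed
                patch[key] = sub
        elif old[key] != new_value:
            patch[key] = new_value
    return patch


def apply_merge_patch(target: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Apply a merge patch without mutating `target`."""
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        elif isinstance(value, dict):
            base = result.get(key) if isinstance(result.get(key), dict) else {}
            result[key] = apply_merge_patch(base, value)
        else:
            result[key] = value
    return result


# Hypothetical object: only the CPU value and one label differ.
OLD = {"metadata": {"name": "web-0", "labels": {"app": "web", "tier": "fe"}},
       "spec": {"cpu": "2", "memory": "4Gi"}}
NEW = {"metadata": {"name": "web-0", "labels": {"app": "web"}},
       "spec": {"cpu": "4", "memory": "4Gi"}}
PATCH = make_merge_patch(OLD, NEW)
```

The patch carries two small fields instead of the whole object, which is the saving the slide is after when objects average several kilobytes each.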

  11. Kubernetes Master Scalability
    API Server
    - Horizontally scalable by design
    - Not super CPU efficient though
    - Objects are heavily cached
    - A lot of memory
    - Lack of custom index support
    - The code to support indexes is half-baked
    - Not hard to add custom index by modifying the API Server codebase

  12. Kubernetes Master Scalability
    Controller
    - Default controller is monolithic, not scalable
    - But pluggable, and extensible through CRD
    - Built Alibaba controllers/operators
    Scheduler
    - Default scheduler performance is not great
    - But pluggable
    - Built Alibaba scheduler

  13. Scheduling improvements
    - Node CPU topology-aware scheduling
    - We use CPUSet + CPUShare
    - Storage topology-aware scheduling
    - Dry run
    - User facing capacity planning / debugging
    - Pod Group
    - Pod colocation / gang scheduling
    - Pre scheduling via global optimizer
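The "Pod Group" and gang scheduling items describe all-or-nothing placement: either every pod in a group gets a node or none does. The sketch below shows that idea with a simple first-fit heuristic over CPU only. It is an assumption-laden illustration, not the Alibaba scheduler: the function name, node names, and the single-resource model are all hypothetical.

```python
from typing import Dict, List, Optional


def gang_schedule(pod_cpus: List[int],
                  free_cpu: Dict[str, int]) -> Optional[Dict[int, str]]:
    """Place every pod of a group, or none of them.

    Returns a mapping of pod index -> node name and commits the capacity
    change to `free_cpu`, or returns None and leaves `free_cpu` untouched.
    """
    remaining = dict(free_cpu)  # trial copy; committed only on full success
    placement: Dict[int, str] = {}
    # Largest pods first: a simple heuristic, not an optimal packing.
    for idx in sorted(range(len(pod_cpus)), key=lambda i: -pod_cpus[i]):
        node = next((n for n in sorted(remaining)
                     if remaining[n] >= pod_cpus[idx]), None)
        if node is None:
            return None  # one pod does not fit, so the whole group is rejected
        remaining[node] -= pod_cpus[idx]
        placement[idx] = node
    free_cpu.update(remaining)
    return placement


# Three 4-core pods fit on 8 + 4 free cores.
NODES_A = {"node-a": 8, "node-b": 4}
FITS = gang_schedule([4, 4, 4], NODES_A)

# Two 6-core pods do not: the first would fit alone, but the group is rejected.
NODES_B = {"node-a": 8, "node-b": 4}
REJECTED = gang_schedule([6, 6], NODES_B)
```

Because first-fit is greedy, it can reject a group for which some other placement exists; a production scheduler would search more carefully.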

  14. Deterministic
    Example use case:
    Alibaba’s Singles day sales
    - End-to-end optimization to support at least 10x traffic with minimal additional resources
    - End to end stress testing to ensure reliability and to reduce risk
    Requirements
    - Application upgrades should not change container placements
    - Minor resource updates should not change container placements

  15. In-place Update of Pod Resources
    1. v1:
       a. User:
          i. Set the “in-place” annotation to “created” and update the Pod requests/limits
       b. APIServer:
          i. Admission checks the annotation
       c. Scheduler:
          i. Check whether an in-place update is possible
             1. Otherwise set “in-place update fail”; the user falls back
          ii. Set the “in-place” annotation to “accepted”
       d. Kubelet:
          i. Update container resources (CRI), cgroups manager, and CPU manager, then clear the annotation
    2. v2:
       a. A joint KEP (KEP #686) with the upstream community
          i. Reviews are welcome!
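The v1 flow above is effectively a small state machine driven by one annotation. The sketch below models it in plain Python under several assumptions: the annotation key `in-place`, the idea that the failure marker is stored as the annotation value, the CPU-only resource model, and every function name are hypothetical, since the slide gives only the state names.

```python
from dataclasses import dataclass, field
from typing import Dict

ANNOTATION = "in-place"  # hypothetical key; the slide only calls it the "in-place" annotation


@dataclass
class Pod:
    cpu_request: int                      # desired CPU, edited by the user
    annotations: Dict[str, str] = field(default_factory=dict)
    applied_cpu: int = 0                  # CPU currently enforced on the node


def user_request_resize(pod: Pod, new_cpu: int) -> None:
    """User: update the request and mark the Pod as 'created'."""
    pod.cpu_request = new_cpu
    pod.annotations[ANNOTATION] = "created"


def scheduler_review(pod: Pod, node_free_cpu: int) -> None:
    """Scheduler: accept if the node can absorb the extra CPU, else mark failure."""
    if pod.annotations.get(ANNOTATION) != "created":
        return
    extra = pod.cpu_request - pod.applied_cpu
    if extra <= node_free_cpu:
        pod.annotations[ANNOTATION] = "accepted"
    else:
        # Assumed representation of the slide's "in-place update fail" marker.
        pod.annotations[ANNOTATION] = "in-place update fail"


def kubelet_sync(pod: Pod) -> None:
    """Kubelet: apply an accepted resize and clear the annotation."""
    if pod.annotations.get(ANNOTATION) == "accepted":
        pod.applied_cpu = pod.cpu_request
        del pod.annotations[ANNOTATION]


# Happy path: grow a running Pod from 2 to 4 cores on a node with 4 free cores.
POD = Pod(cpu_request=2, applied_cpu=2)
user_request_resize(POD, 4)
scheduler_review(POD, node_free_cpu=4)
kubelet_sync(POD)
```

When the scheduler marks a failure, the Kubelet step does nothing and the Pod keeps its old resources, which is the point at which the user falls back to another path.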

  16. IP Persistence
    ● A CRD named ReservedIP
    ● The CNI service tries to consume the CR when a Pod is re-created with the annotation:
    ○ pod.beta1.sigma.ali/ip-reuse=true
    apiVersion: "apiextensions.k8s.io/v1beta1"
    kind: "CustomResourceDefinition"
    metadata:
      name: "reservedip.extensions.sigma"
    spec:
      ...
        spec:
          required: ["networkStatus", "hostIP"]
          properties:
            networkStatus:
              type: "string"
            hostIP:
              type: "string"
            ttlInSeconds:
              type: "integer"
              minimum: 30

  17. Operation at scale - deployment
    Kubernetes on Kubernetes (KoK)

  18. Operation at scale - upgrades
    1. Dev-rollout pipeline
       a. Dev-release
          i. main branch -> feature branch -> test sets -> Code Review -> main branch -> CI pipeline -> PASS
          ii. main branch -> release -> CI pipeline -> PASS
       b. Release-build
          i. release -> generate build -> test build (*.rpm) -> prod build (*.rpm)
       c. Test-build
          i. test build -> test cluster -> monitoring & dashboard
       d. Rollout
          i. prod build -> service template -> rollout plan
       e. Run rollout plan
          i. E.g. Cluster X, a total of 3000+ nodes, batch interval 6hr, 12hr
             1. Batch 1: 2, 5, 10
             2. Batch 2: 20, 50, 100
             3. Batch 3: 200, 200, 200
             4. Batch 4: 400, 400, 400
             5. Batch 5: 500, 500, 500
       f. Rollback (just in case)
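The example rollout plan can be checked with a few lines of arithmetic. The sketch below assumes each number in a batch is the count of nodes upgraded in one step of that batch (the slide does not say so explicitly) and tracks the cumulative number of upgraded nodes. The batch sizes come from the slide; the variable and function names are illustrative.

```python
from typing import List, Tuple

# Step sizes per batch, copied from the example plan for Cluster X.
ROLLOUT_PLAN: List[List[int]] = [
    [2, 5, 10],
    [20, 50, 100],
    [200, 200, 200],
    [400, 400, 400],
    [500, 500, 500],
]


def cumulative_progress(plan: List[List[int]]) -> List[Tuple[int, int, int]]:
    """Return (batch number, step size, nodes upgraded so far) for every step."""
    rows: List[Tuple[int, int, int]] = []
    done = 0
    for batch_no, steps in enumerate(plan, start=1):
        for step in steps:
            done += step
            rows.append((batch_no, step, done))
    return rows


PROGRESS = cumulative_progress(ROLLOUT_PLAN)
TOTAL_NODES = PROGRESS[-1][2]  # 3487, consistent with "3000+ nodes"
```

Under this reading only 17 nodes are touched in the first batch and 187 by the end of the second, so a bad build is exposed to a small share of the cluster before the large batches begin.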

  19. Operation at scale - monitoring/alerting/analysis
    ● Data pipeline
    ○ Prometheus + Thanos

  20. Operation at scale - inspection
    Inspector (A map-reduce style program)
    - Constraint violations
    - No unexpected resource over-commit
    - No affinity/anti-affinity violations
    - …
    - State cross checking
    - API Server pod state == pod running state queried through Kubelet
    - Controller replica count == number of running pods queried through Kubelet
    - Warnings
    - Too many soft affinity/anti-affinity violations
    - ...
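The state cross-checking idea can be sketched as a small map-reduce style pass: a map step compares, per node, the pod states recorded by the API Server with those reported by the Kubelet, and a reduce step groups the findings by kind. This is a hypothetical illustration only. The data shapes, the `state-mismatch` label, and the function names are assumptions; the real Inspector is not described in more detail on the slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Finding = Tuple[str, Tuple[str, str, str, str]]


def map_node(node: str, api_pods: Dict[str, str],
             kubelet_pods: Dict[str, str]) -> List[Finding]:
    """Map step: compare one node's pods as seen by the API Server and the Kubelet."""
    findings: List[Finding] = []
    for name in sorted(set(api_pods) | set(kubelet_pods)):
        api_state = api_pods.get(name, "Missing")
        kubelet_state = kubelet_pods.get(name, "Missing")
        if api_state != kubelet_state:
            findings.append(("state-mismatch", (node, name, api_state, kubelet_state)))
    return findings


def reduce_findings(findings: List[Finding]) -> Dict[str, list]:
    """Reduce step: group findings by kind for reporting."""
    grouped: Dict[str, list] = defaultdict(list)
    for kind, detail in findings:
        grouped[kind].append(detail)
    return dict(grouped)


def run_inspection(api_view: Dict[str, Dict[str, str]],
                   kubelet_view: Dict[str, Dict[str, str]]) -> Dict[str, list]:
    findings: List[Finding] = []
    for node in sorted(set(api_view) | set(kubelet_view)):
        findings.extend(map_node(node, api_view.get(node, {}),
                                 kubelet_view.get(node, {})))
    return reduce_findings(findings)


# Made-up sample data: one pod failed on the node, one is missing entirely.
API_VIEW = {"node-1": {"web-0": "Running", "web-1": "Running"},
            "node-2": {"db-0": "Running"}}
KUBELET_VIEW = {"node-1": {"web-0": "Running", "web-1": "Failed"},
                "node-2": {}}
REPORT = run_inspection(API_VIEW, KUBELET_VIEW)
```

Because the map step only needs one node's data, the per-node comparisons can run in parallel across a large cluster before the results are merged.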

  21. Thanks
    Cluster Mgmt @ Web Scale Meetup
    Xiang Li, Alibaba Cloud
    Lei Zhang, Alibaba Cloud
