
Resource Requests and Limits Under the Hood: The Journey of a Pod Spec

A talk at KubeCon + CloudNativeCon Europe 2021 Virtual

Kohei Ota

May 06, 2021


Transcript

  1. Kaslin Fields, Google
    Kohei Ota, Hewlett Packard Enterprise
    Resource Requests and Limits
    Under the Hood: The Journey of
    a Pod Spec


  2. Self introduction
    Kohei Ota
Architect at Hewlett Packard Enterprise
    CNCF Ambassador
    Owner of SIG-Docs Japanese localization
    Twitter: @inductor__
    GitHub: @inductor
    Kaslin Fields
    Developer Advocate at Google
    CNCF Ambassador
    Member of K8s SIG-ContribEx
    Comics at kaslin.rocks!
    Twitter: @kaslinfields
    GitHub: @kaslin


  3. Your App


  4. Doggy Daycare Analogy


  5. It Takes a Village


  6. Resource
    Requests & Limits


  7. Here’s an ordinary pod spec
    apiVersion: v1
    kind: Pod
    metadata:
      name: kubecon-eu-2021
    spec:
      containers:
      - name: kubecon-eu-2021
        image: kubecon:eu-2021
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
    Requests


  8. Requests for planning


  9. I want to put my app in a
    Pod!
    The Journey of a Request
    Kubernetes API
    Developer
    App
    Ok, we can help with that.


  10. My app will need 2CPU
    and 4GB of Memory to run
    properly.
    Update Cluster State
    I’ll make note of that so
    your app’s needs will be
    met.
    Record to etcd*
    *etcd is the key-value
    store component used by
    Kubernetes. It functions as
    a single source of truth
    for the state of the cluster.


  11. Oh, a new pod is coming
    that will require 2CPU &
    4GB of memory. I’ll put
    that… here.
    Kubernetes
    Scheduler
    Assign Pod to Node


  12. Scheduler
    The Scheduler is the Kubernetes
    component that evaluates nodes
    to decide where to assign a Pod.
    Resource requests are one of the
    parameters the Scheduler uses
    when filtering and ranking nodes.
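The node-evaluation step can be sketched as a simple resource-fit check. This is illustrative only: the names here are made up, and the real logic lives in kube-scheduler's NodeResourcesFit plugin, written in Go.

```python
# Rough sketch of the Scheduler's resource-fit check (illustrative names;
# not the real kube-scheduler API).

def fits(pod_requests, node_allocatable, node_requested):
    """A node is feasible only if the pod's requests fit into what is
    left of the node's allocatable resources."""
    for resource, amount in pod_requests.items():
        free = node_allocatable.get(resource, 0) - node_requested.get(resource, 0)
        if amount > free:
            return False
    return True

# Example: a 4-CPU / 8Gi node whose running pods already request 3 CPU / 4Gi.
node_alloc = {"cpu_milli": 4000, "memory_bytes": 8 * 1024**3}
node_used = {"cpu_milli": 3000, "memory_bytes": 4 * 1024**3}

print(fits({"cpu_milli": 2000, "memory_bytes": 4 * 1024**3}, node_alloc, node_used))  # False: only 1 CPU free
print(fits({"cpu_milli": 500, "memory_bytes": 1 * 1024**3}, node_alloc, node_used))   # True
```

Note that the check uses requests, never actual usage: scheduling is planning, not enforcement.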


  13. Ok, let’s get this new pod
    settled in!
    Kubelet on
    Node
    Create Pod on Node


  14. The Journey of a Request
    New Pod to
    assign
    Assign a Node to a Pod
    Detects that a Pod was
    assigned to the Node
    API Server
    (Control Plane)
    Hmm…
    Which node
    should I put
    this on…
    Here ya go,
    little buddy!
    Oh, a new
    request!
    Scheduler Kubelet
    (on node)


  15. Requests passing through
    API Server
    (Control plane)
    Scheduler
    (Control plane)
    Kubelet
    (Each node)
    New Pod to
    assign
    Assign a Node to a Pod
    Detects a Pod assigned to the Node
    Node evaluation with
    Resource Requests


  16. Requests passing through
    Detects that a Pod was
    assigned to the Node
    Hmm…
    Which node
    should I put
    this on…
    Pod Requests
    vs
    Node Allocatable
    Scheduler



  18. Requests summary
    - Requests
      - Used at Pod creation: the Scheduler selects a Node whose allocatable resources satisfy the Pod's requests
      - The CPU request determines each container's share of CPU time when the node's CPU is 100% utilized
      - When the CPU is not fully used, requests are over-committable
    - CPU
      - Usage over the request? → Throttled back toward its share under contention (CPU is compressible, so no eviction)
    - Memory
      - Usage over the request? → Candidate for eviction under node memory pressure
    QoS Class


  19. QoS Class in Kubernetes
    (Priority: lower is better, i.e. evicted last)
    1. Guaranteed: limits (and, if set, requests) are non-zero and set for all
       resources across all containers, and requests equal limits
    2. Burstable: requests (and optionally limits) are set and non-zero for one
       or more resources across one or more containers, and they are not equal
    3. BestEffort: neither requests nor limits are set for any resource, across
       all containers
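The classification above can be sketched as a small function. This is a simplified take on the kubelet's QoS computation, not its actual code; one subtlety it reflects is that when only limits are set, Kubernetes defaults the requests to the limits, so such Pods are still Guaranteed.

```python
# Simplified sketch of how a Pod's QoS class follows from its containers'
# requests and limits (not the real kubelet qos package).

def qos_class(containers):
    resources = ("cpu", "memory")
    any_set = False
    guaranteed = True
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        if req or lim:
            any_set = True
        for r in resources:
            # Guaranteed requires a limit for every resource in every
            # container, with requests absent (defaulted) or equal to limits.
            if r not in lim:
                guaranteed = False
            elif r in req and req[r] != lim[r]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "250m"}, "limits": {}}]))  # Burstable
print(qos_class([{"requests": {"cpu": "1", "memory": "1Gi"},
                  "limits":   {"cpu": "1", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{}]))                                           # BestEffort
```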


  20. apiVersion: v1
    kind: Pod
    metadata:
      name: kubecon-eu-2021
    spec:
      containers:
      - name: kubecon-eu-2021
        image: kubecon:eu-2021
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
    Now let’s talk about limits
    Limits


  21. Limits enforce.


  22. The Journey of a Pod Limit
    Kubernetes API
    Developer
    App
    I want to make sure my
    pod doesn’t consume
    more than 2CPU and
    4GB of memory
    Yes, we can limit your
    pod’s resource usage.


  23. This new pod needs to be
    limited to 2CPU and 4GB.
    I better make sure the
    caretaker knows.
    Kubernetes
    Scheduler
    Assign Pod to Node


  24. Ah, this pod is limited. I’ll
    get only the resources it
    needs from the supplier.
    Kubelet on
    Node
    Send Pod to Kubelet on Node


  25. Hey, I have a pod coming in
    that needs its resources
    limited.
    Container
    Runtime/Linux
    Kubelet to Container Runtime
    Kubelet
    Ok, I can use cgroups to
    make that happen.


  26. Requests for planning.
    Limits for enforcing.


  27. Limits summary
    - Limits
      - Used to cap a Pod's resource usage, enforced via cgroups on Linux
    - CPU
      - Usage over the limit? → CPU throttling
    - Memory
      - Usage over the limit? → OOM kill


  28. Limits by level
    API Server
    (Control plane)
    Kubelet
    (Each node)
    CRI Runtime
    (Each node)
    Detects a Pod that was
    assigned to the Node
    Convert CPU cores to
    CFS period/quota
    (milliseconds)
    Set to OCI spec
    Pass limits
    OCI Runtime
    (Each node)
    Call Cgroups
    Cgroups



  30. Container Primitives


  31. Cgroups


  32. Cgroups?
    Cgroups (control groups) allow you to allocate resources, such as CPU time (again,
    time, not cores!), system memory, network bandwidth, or combinations of these.
    CPU Requests in K8s → cpu.shares in cgroups
    CPU Limits in K8s → cpu.cfs_period_us & cpu.cfs_quota_us (us = μs) in cgroups
    Memory Limits in K8s → memory.limit_in_bytes in cgroups


  33. Cgroups?
    Example: cpu.shares = 2048 (a 2-CPU request, since 1 CPU = 1024 shares)
    cpu.shares is a relative value
    https://speakerdeck.com/daikurosawa/understanding-cpu-throttling-in-kubernetes-to-improve-application-performance-number-k8sjp
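The "relative value" point can be illustrated with the conversion the kubelet uses (1 CPU = 1024 shares) plus a sketch of proportional sharing. The split function is illustrative, not kernel code; it just shows what the CFS weighting works out to under full contention.

```python
# kubelet's CPU-request -> cpu.shares conversion (simplified), and why
# cpu.shares is relative: under contention, CPU time is divided in
# proportion to each group's shares.

def milli_cpu_to_shares(milli_cpu, min_shares=2):
    """1 CPU (1000m) = 1024 shares; the kernel minimum is 2."""
    return max(milli_cpu * 1024 // 1000, min_shares)

def contended_split(shares_by_group, total_cpu=1.0):
    """CPU each group gets when every group wants 100% of the CPU."""
    total = sum(shares_by_group.values())
    return {g: total_cpu * s / total for g, s in shares_by_group.items()}

shares = {"a": milli_cpu_to_shares(2000),  # 2048, the value on the slide
          "b": milli_cpu_to_shares(1000)}  # 1024
print(shares)
print(contended_split(shares))  # "a" gets 2/3 of the CPU, "b" gets 1/3
```

When the CPU is not contended, shares impose no cap at all, which is exactly why requests are over-committable.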


  34. CFS Quota? Period?
    CFS = “Completely Fair” Scheduler, a process scheduler in Linux.
    Container isolation is based on cgroups (a Linux kernel feature) for resource limitation.
    Cgroups use CFS to implement CPU resource restriction.
    CFS scheduling is based on processing time, not cores.
    The default scheduling period is 100ms.
    https://kccncna20.sched.com/event/ek9r/10-more-weird-ways-to-blow-up-your-kubernetes-jian-cheung-joseph-kim-airbnb


  35. CFS Quota? Period?
    The quota is how much CPU time you can use in every period:
    K8s Limits: 500m (0.5 core) → CFS_Period: 100ms, CFS_Quota: 50ms
    K8s Limits: 2000m (2 cores) → CFS_Period: 100ms, CFS_Quota: 200ms


  36. CFS Quota? Period?
    If there’s no limit: CFS_Quota: -1 (unlimited)
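The quota values on the previous slides follow directly from the conversion below, a simplified version of the kubelet's millicores-to-quota helper (the real one also enforces a minimum quota, omitted here).

```python
# K8s CPU limit (millicores) -> CFS quota in microseconds, simplified.
# -1 means unlimited, matching the "no limits" case on the slide.

def milli_cpu_to_quota(milli_cpu, period_us=100_000):
    if milli_cpu == 0:
        return -1
    return milli_cpu * period_us // 1000

print(milli_cpu_to_quota(500))    # 50000 us  = 50ms of CPU time per 100ms
print(milli_cpu_to_quota(2000))   # 200000 us = 200ms per 100ms (2 cores)
print(milli_cpu_to_quota(0))      # -1: no limit
```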


  37. CRI Runtime vs OCI Runtime
    How container runtimes work on Kubernetes
    Kubernetes (kubectl run / kubectl apply → REST API): the collection of
    Kubernetes system components. kube-apiserver receives requests from
    kubectl, and the kubelet talks to the CRI runtime.
    containerd (CRI, gRPC) = high-level runtime: the CRI runtime executes an
    OCI runtime binary with an OCI container JSON spec.
    runc (OCI) = low-level runtime: the OCI runtime spawns the container with
    the CPU/memory settings in the spec.


  38. CRI Runtime vs OCI Runtime
    CRI (high-level) runtimes run with Kubernetes.
    OCI (low-level) runtimes run with the Linux kernel.
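The handoff between the two layers is the OCI config.json. Below is a sketch of the linux.resources fragment a CRI runtime would set before invoking runc; the field names follow the OCI runtime spec, but the surrounding spec (process, mounts, etc.) is omitted and the helper itself is illustrative.

```python
import json

# Illustrative builder for the linux.resources section of an OCI config.json.
# shares come from the Pod's CPU request, quota/period and the memory limit
# come from the Pod's limits.

def oci_resources(cpu_request_milli, cpu_limit_milli, mem_limit_bytes,
                  period_us=100_000):
    return {
        "linux": {
            "resources": {
                "cpu": {
                    "shares": cpu_request_milli * 1024 // 1000,   # from requests
                    "period": period_us,
                    "quota": cpu_limit_milli * period_us // 1000,  # from limits
                },
                "memory": {
                    "limit": mem_limit_bytes,                      # from limits
                },
            }
        }
    }

# The deck's pod spec: requests cpu 250m; limits cpu 500m, memory 128Mi.
print(json.dumps(oci_resources(250, 500, 128 * 1024**2), indent=2))
```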


  39. How do I set the right requests and
    limits?


  40. Pod Autoscaling
    Horizontal Pod Autoscaler (HPA): more or fewer replicas
    Vertical Pod Autoscaler (VPA): change the size of each Pod




  41. Vertical Pod Autoscaler (VPA)
    VPA Modes: Off, Initial, Auto
    VPA Recommendations: Target, Lower Bound, Upper Bound, Uncapped Target


  42. Conclusion
    - The Pod spec is registered in etcd through kube-apiserver.
    - kube-scheduler picks up newly registered Pods through the API server and
      assigns a Node to each Pod based on its resource requests.
    - The kubelet fetches the assigned Pod specs every sync period and calculates
      the diff between running containers and the Pod spec.
    - The kubelet calls the CreateContainer gRPC method on the CRI runtime, after
      converting CPU cores into a CFS period/quota.
    - The CRI runtime executes the OCI runtime binary to create a container with an
      OCI spec JSON.
    - The OCI runtime manages the cgroups file system (create/delete/update).
    - The Vertical Pod Autoscaler (VPA) can provide recommendations for your
      requests and limits.

