
Resource Requests and Limits Under the Hood: The Journey of a Pod Spec


A talk at KubeCon + CloudNativeCon Europe 2021 Virtual

Kohei Ota

May 06, 2021

Transcript

  1. Resource Requests and Limits Under the Hood: The Journey of a Pod Spec
     Kaslin Fields, Google; Kohei Ota, Hewlett Packard Enterprise
  2. Self introduction
     Kohei Ota: Architect at Hewlett Packard Enterprise, CNCF Ambassador, owner of the SIG Docs Japanese localization. Twitter: @inductor__ GitHub: @inductor
     Kaslin Fields: Developer Advocate at Google, CNCF Ambassador, member of K8s SIG ContribEx. Comics at kaslin.rocks! Twitter: @kaslinfields GitHub: @kaslin
  3. Here’s an ordinary pod spec:

     apiVersion: v1
     kind: Pod
     metadata:
       name: kubecon-eu-2021
     spec:
       containers:
       - name: kubecon-eu-2021
         image: kubecon:eu-2021
         resources:
           requests:
             memory: "64Mi"
             cpu: "250m"
           limits:
             memory: "128Mi"
             cpu: "500m"

     (Highlighted: Requests)
  4. The Journey of a Request
     Developer: I want to put my app in a Pod!
     Kubernetes API: Ok, we can help with that.
  5. Developer: My app will need 2 CPU and 4GB of memory to run properly.
     Kubernetes API: I’ll make note of that so your app’s needs will be met. (Update cluster state: record to etcd*)
     *etcd is the key-value store component used by Kubernetes. It functions as the single source of truth for the state of the cluster.
  6. Kubernetes Scheduler: Oh, a new pod is coming that will require 2 CPU & 4GB of memory. I’ll put that… here. (Assign Pod to Node)
  7. Scheduler
     The scheduler is the Kubernetes component that evaluates nodes in order to assign a Pod to one of them. Resource requests are among the parameters the scheduler uses when ranking nodes.
  8. The Journey of a Request
     API Server (Control Plane): "Oh, a new request!" (New Pod to assign)
     Scheduler: "Hmm… Which node should I put this on…" (Assign a Node to a Pod)
     Kubelet (on node): "Here ya go, little buddy!" (Detects that a Pod was assigned to the Node)
  9. Requests passing through
     API Server (Control plane): New Pod to assign
     Scheduler (Control plane): Assign a Node to a Pod (node evaluation with resource requests)
     Kubelet (Each node): Detects a Pod that was assigned to the Node
  10. Requests passing through (zooming in on the Scheduler)
      Scheduler: "Hmm… Which node should I put this on…" Node evaluation compares Pod Requests vs Node Allocatable; the kubelet then detects that a Pod was assigned to the Node.
  11. Requests passing through
      API Server (Control plane): New Pod to assign
      Scheduler (Control plane): Assign a Node to a Pod (node evaluation with resource requests: Pod Requests vs Node Allocatable)
      Kubelet (Each node): Detects a Pod that was assigned to the Node
  12. Requests summary
      - Requests are used at Pod creation: the scheduler selects a node for the Pod that can satisfy its resource requirements.
      - The CPU request also acts as the container’s relative weight for CPU time in case the node’s CPU is 100% used.
      - When the CPU is not fully used, requests are over-committable.
      - CPU usage over the request? The container falls back to its weighted share under contention, and the Pod becomes a likelier eviction candidate under node pressure.
      - Memory usage over the request? Potential eviction when the node comes under memory pressure.
      A requests-only sketch follows below.
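      As a minimal sketch of the above (the pod name is hypothetical; the image reuses the earlier example), a requests-only spec looks like this:

      # Requests-only Pod: the scheduler reserves this much of the node's
      # allocatable capacity, but the container may use more while the node
      # has spare resources (over-commit).
      apiVersion: v1
      kind: Pod
      metadata:
        name: requests-demo        # hypothetical name
      spec:
        containers:
        - name: app
          image: kubecon:eu-2021
          resources:
            requests:
              cpu: "250m"          # becomes the relative CPU weight (cpu.shares)
              memory: "64Mi"       # counted against the node's allocatable memory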
  13. QoS Class in Kubernetes (priority: lower is better)
      - Guaranteed (priority 1): limits, and optionally requests (not equal to 0), are set for all resources across all containers, and they are equal.
      - Burstable (priority 2): requests, and optionally limits, are set (not equal to 0) for one or more resources across one or more containers, and they are not equal.
      - BestEffort (priority 3): neither requests nor limits are set for any of the resources, across all containers.
      Minimal example specs follow below.
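      As a hedged illustration (the pod names and the nginx image are placeholders, not from the talk), minimal specs for each class look roughly like this:

      # Guaranteed: requests equal limits for every resource in every container.
      apiVersion: v1
      kind: Pod
      metadata:
        name: qos-guaranteed
      spec:
        containers:
        - name: app
          image: nginx
          resources:
            requests: {cpu: "500m", memory: "128Mi"}
            limits:   {cpu: "500m", memory: "128Mi"}
      ---
      # Burstable: at least one request or limit is set, but they are not all equal.
      apiVersion: v1
      kind: Pod
      metadata:
        name: qos-burstable
      spec:
        containers:
        - name: app
          image: nginx
          resources:
            requests: {cpu: "250m", memory: "64Mi"}
            limits:   {cpu: "500m", memory: "128Mi"}
      ---
      # BestEffort: no requests or limits anywhere in the Pod.
      apiVersion: v1
      kind: Pod
      metadata:
        name: qos-besteffort
      spec:
        containers:
        - name: app
          image: nginx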
  14. Now let’s talk about limits.

      apiVersion: v1
      kind: Pod
      metadata:
        name: kubecon-eu-2021
      spec:
        containers:
        - name: kubecon-eu-2021
          image: kubecon:eu-2021
          resources:
            requests:
              memory: "64Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "500m"

      (Highlighted: Limits)
  15. The Journey of a Pod Limit
      Developer: I want to make sure my pod doesn’t consume more than 2 CPU and 4GB of memory.
      Kubernetes API: Yes, we can limit your pod’s resource usage.
  16. Kubernetes Scheduler: This new pod needs to be limited to 2 CPU and 4GB. I better make sure the caretaker knows. (Assign Pod to Node)
  17. Kubelet on Node: Ah, this pod is limited. I’ll get only the resources it needs from the supplier. (Send Pod to Kubelet on Node)
  18. Kubelet to Container Runtime
      Kubelet: Hey, I have a pod coming in that needs its resources limited.
      Container Runtime/Linux: Ok, I can use cgroups to make that happen.
  19. Limits summary
      - Limits restrict a Pod’s resource usage, enforced via cgroups on Linux.
      - CPU usage over the limit? CPU throttling.
      - Memory usage over the limit? The container is OOM-killed.
      A sketch follows below.
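      A minimal sketch of both behaviors (the pod name is hypothetical; the image reuses the earlier example):

      # CPU above the limit is throttled by CFS; memory above the limit gets the
      # container killed by the kernel OOM killer (status shows OOMKilled).
      apiVersion: v1
      kind: Pod
      metadata:
        name: limits-demo          # hypothetical name
      spec:
        containers:
        - name: app
          image: kubecon:eu-2021
          resources:
            limits:
              cpu: "500m"          # over this → CPU throttling
              memory: "64Mi"       # over this → OOM kill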
  20. Limits by level
      API Server (Control plane)
      → Kubelet (Each node): detects a Pod that needs to be assigned to the Node, converts CPU cores to a CFS period/quota (milliseconds), and sets them in the OCI spec
      → CRI Runtime (Each node): passes the limits on
      → OCI Runtime (Each node): calls cgroups
      → Cgroups
  23. Cgroups?
      Cgroups (control groups) allow you to allocate resources, such as CPU time (again, not cores!), system memory, network bandwidth, or combinations of these.
      - CPU requests in K8s → cpu.shares in cgroups
      - CPU limits in K8s → cpu.cfs_period_us & cpu.cfs_quota_us (us = μs) in cgroups
      - Memory limits in K8s → memory.limit_in_bytes in cgroups
      Note that cpu.shares is a relative value (e.g. cpu.shares = 2048 only means something compared to other groups). A mapped example follows below.
      https://speakerdeck.com/daikurosawa/understanding-cpu-throttling-in-kubernetes-to-improve-application-performance-number-k8sjp
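      Putting the mapping together for the earlier example pod (a sketch; the derived values follow the standard conversions, e.g. cpu.shares = millicores × 1024 / 1000):

      apiVersion: v1
      kind: Pod
      metadata:
        name: kubecon-eu-2021
      spec:
        containers:
        - name: kubecon-eu-2021
          image: kubecon:eu-2021
          resources:
            requests:
              cpu: "250m"          # → cpu.shares = 250 × 1024 / 1000 = 256
              memory: "64Mi"       # memory requests have no cgroup counterpart
            limits:
              cpu: "500m"          # → cpu.cfs_period_us = 100000, cpu.cfs_quota_us = 50000
              memory: "128Mi"      # → memory.limit_in_bytes = 134217728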
  26. CFS Quota? Period?
      CFS, the "Completely Fair Scheduler", is a process scheduler in Linux. Container isolation is based on resource limitation via cgroups (a Linux kernel functionality), and cgroups use CFS to implement the CPU restriction. CFS scheduling is based on processing time, not cores, and the scheduling period is 100ms.
      The quota is how much CPU time you can use in every period (quota = period × CPU limit in cores):
      - K8s limit 500m (0.5 core) → CFS period 100ms, CFS quota 50ms
      - K8s limit 2000m (2 cores) → CFS period 100ms, CFS quota 200ms
      - If there are no limits → CFS quota -1 (unlimited)
      https://kccncna20.sched.com/event/ek9r/10-more-weird-ways-to-blow-up-your-kubernetes-jian-cheung-joseph-kim-airbnb
  28. CRI Runtime vs OCI Runtime: how the container runtime works on Kubernetes
      kubectl run / kubectl apply → REST API → Kubernetes (the collection of Kubernetes system components): kube-apiserver receives the request from kubectl, and the kubelet talks to the CRI runtime over CRI (gRPC).
      containerd (high-level runtime): the CRI runtime executes an OCI runtime binary, passing it the OCI container JSON spec.
      runC (low-level runtime): the OCI runtime spawns the container with the CPU/memory settings from the spec.
      CRI (high-level) runtimes work with Kubernetes; OCI (low-level) runtimes work with the Linux kernel. An example of the spec that is handed off follows below.
  29. Vertical Pod Autoscaler (VPA)
      VPA modes: Off, Initial, Auto
      VPA recommendations: Target, Lower Bound, Upper Bound, Uncapped Target
      An example object follows below.
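      A minimal VPA object sketch (the target Deployment name "my-app" is hypothetical); updateMode takes one of the modes listed above:

      apiVersion: autoscaling.k8s.io/v1
      kind: VerticalPodAutoscaler
      metadata:
        name: my-app-vpa
      spec:
        targetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: my-app
        updatePolicy:
          updateMode: "Off"        # publish recommendations only; don't update running pods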
  30. Conclusion
      - The Pod spec is registered in etcd through kube-apiserver.
      - kube-scheduler watches for newly registered Pods through the kube-apiserver and assigns a node to each Pod, referring to its resource requests.
      - The kubelet fetches the assigned Pod specs every sync period and calculates the diff between running containers and the Pod spec.
      - The kubelet calls the CreateContainer gRPC on the CRI runtime, after converting CPU cores into CFS periods/quotas.
      - The CRI runtime executes the OCI runtime binary to create a container with the OCI spec JSON.
      - The OCI runtime manages the cgroups file system (create/delete/update).
      - The Vertical Pod Autoscaler (VPA) can provide recommendations for your requests and limits.