Resource Requests and Limits Under the Hood: The Journey of a Pod Spec

Kaslin Fields, Google
Kohei Ota, Hewlett Packard Enterprise

Self introduction

Kohei Ota
- Architect at Hewlett Packard Enterprise
- CNCF Ambassador
- Owner of SIG-Docs Japanese localization
- Twitter: @inductor__
- GitHub: @inductor

Kaslin Fields
- Developer Advocate at Google
- CNCF Ambassador
- Member of K8s SIG-ContribEx
- Comics at kaslin.rocks!
- Twitter: @kaslinfields
- GitHub: @kaslin

Your App

Doggy Daycare Analogy

It Takes a Village

Resource Requests & Limits

Here’s an ordinary pod spec

    apiVersion: v1
    kind: Pod
    metadata:
      name: kubecon-eu-2021
    spec:
      containers:
      - name: kubecon-eu-2021
        image: kubecon:eu-2021
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"

Requests

Requests for planning.
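The pod spec above writes quantities as strings such as "250m" (CPU) and "64Mi" (memory). As a rough illustration of what those strings mean, here is a minimal sketch that converts them to millicores and bytes. The function names are hypothetical, and it covers only the common suffixes, not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

# Binary (power-of-two) and decimal suffixes commonly seen in pod specs.
# This is a subset chosen for illustration, not the full quantity syntax.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "64Mi" or "4G" to bytes."""
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # a plain number is already in bytes


print(cpu_to_millicores("250m"), memory_to_bytes("64Mi"))
```

Note that "Mi" (mebibytes) and "M" (megabytes) differ: "64Mi" is 64 × 2^20 bytes, while "64M" would be 64 × 10^6 bytes.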

The Journey of a Request

Developer: I want to put my app in a Pod!
Kubernetes API: Ok, we can help with that.

Update Cluster State

Developer: My app will need 2 CPUs and 4GB of memory to run properly.
Kubernetes API: I’ll make note of that so your app’s needs will be met. (Record to etcd*)

*etcd is the key-value store component used by Kubernetes. It functions as a single source of truth for the state of the cluster.

Assign Pod to Node

Kubernetes Scheduler: Oh, a new pod is coming that will require 2 CPUs & 4GB of memory. I’ll put that… here.

Scheduler

The Scheduler is a Kubernetes component that evaluates nodes in order to assign a Pod to one of them. Resource requests are one of the parameters that the Scheduler uses when filtering and ranking nodes.

Create Pod on Node

Kubelet on Node: Ok, let’s get this new pod settled in!

The Journey of a Request

- API Server (Control Plane): New Pod to assign. "Oh, a new request!"
- Scheduler: Assign a Node to a Pod. "Hmm… Which node should I put this on…"
- Kubelet (on node): Detects that a Pod was assigned to the Node. "Here ya go, little buddy!"

Requests passing through

- API Server (Control plane): New Pod to assign
- Scheduler (Control plane): Assign a Node to a Pod (node evaluation with Resource Requests)
- Kubelet (Each node): Detects a Pod that was assigned to the Node

Requests passing through

- Scheduler: "Hmm… Which node should I put this on…" It compares Pod Requests vs Node Allocatable.
- Kubelet: Detects that a Pod was assigned to the Node
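The "Pod Requests vs Node Allocatable" comparison can be sketched roughly as follows. This is a simplified illustration, not the scheduler's actual code: the `Resources` type, the `fits` function, and the node names are all hypothetical, and real scheduling also involves scoring, affinity, taints, and more.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_millicores: int
    memory_bytes: int


def fits(pod_request: Resources, allocatable: Resources,
         already_requested: Resources) -> bool:
    """Return True if the node still has room for the Pod's requests.

    The check uses the sum of requests already placed on the node,
    not the node's live usage.
    """
    free_cpu = allocatable.cpu_millicores - already_requested.cpu_millicores
    free_mem = allocatable.memory_bytes - already_requested.memory_bytes
    return (pod_request.cpu_millicores <= free_cpu
            and pod_request.memory_bytes <= free_mem)


GiB = 2**30
pod = Resources(cpu_millicores=2000, memory_bytes=4 * GiB)

# Hypothetical nodes: (allocatable, requests of Pods already assigned).
nodes = {
    "node-a": (Resources(4000, 8 * GiB), Resources(3000, 2 * GiB)),
    "node-b": (Resources(4000, 16 * GiB), Resources(1000, 4 * GiB)),
}

candidates = [name for name, (alloc, used) in nodes.items()
              if fits(pod, alloc, used)]
print(candidates)
```

In this sketch node-a is rejected because only 1000m of CPU remains unrequested, even if its actual CPU usage happens to be low at that moment.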


Requests summary

- Requests
  - Used at Pod creation
  - The Scheduler selects a Node for a Pod that can satisfy the resource requirement
  - The CPU request decides how CPU time is divided between containers when the node's CPU is 100% used
  - When CPU is not fully used, it is over-committable: containers may use more than they requested
- CPU
  - If over request? → The container can lose CPU time to others when the node is busy (CPU usage by itself is not an eviction signal)
- Memory
  - If over request? → Potential of eviction

QoS Class

QoS Class in Kubernetes

Each class is listed with its condition and priority (lower is better).

- Guaranteed (priority 1): limits and optionally requests (not equal to 0) are set for all resources across all containers, and they are equal
- Burstable (priority 2): requests and optionally limits are set (not equal to 0) for one or more resources across one or more containers, and they are not equal
- BestEffort (priority 3): no requests or limits are set for any resource in any container
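The conditions in the QoS list can be turned into a small classifier. The sketch below is an approximation for illustration only: `qos_class` is a hypothetical helper, it looks only at CPU and memory, and it compares quantity strings literally (so "500m" and "0.5" would count as different values).

```python
def qos_class(containers):
    """Classify a Pod from its containers' resources.

    `containers` is a list of dicts shaped like the `resources` field of a
    container, e.g. {"requests": {"cpu": "250m"}, "limits": {"cpu": "500m"}}.
    """
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# The pod spec from this deck: requests are lower than limits.
deck_pod = [{
    "requests": {"memory": "64Mi", "cpu": "250m"},
    "limits": {"memory": "128Mi", "cpu": "500m"},
}]
print(qos_class(deck_pod))
```

With the deck's example spec the result is Burstable, because the requests and limits are set but not equal.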

    apiVersion: v1
    kind: Pod
    metadata:
      name: kubecon-eu-2021
    spec:
      containers:
      - name: kubecon-eu-2021
        image: kubecon:eu-2021
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"

Now let’s talk about limits

Limits

Limits enforce.

The Journey of a Pod Limit

Developer: I want to make sure my pod doesn’t consume more than 2 CPUs and 4GB of memory.
Kubernetes API: Yes, we can limit your pod’s resource usage.

Assign Pod to Node

Kubernetes Scheduler: This new pod needs to be limited to 2 CPUs and 4GB. I better make sure the caretaker knows.

Send Pod to Kubelet on Node

Kubelet on Node: Ah, this pod is limited. I’ll get only the resources it needs from the supplier.

Kubelet to Container Runtime

Kubelet: Hey, I have a pod coming in that needs its resources limited.
Container Runtime/Linux: Ok, I can use cgroups to make that happen.

Requests for planning. Limits for enforcing.

Limits summary

- Limits
  - Used to limit the resources of a Pod through cgroups on Linux
- CPU
  - If over limit? → CPU throttling
- Memory
  - If over limit? → The container is OOM-killed

Limits by level

API Server (Control plane) → Kubelet (Each node) → CRI Runtime (Each node) → OCI Runtime (Each node) → Cgroups

- Kubelet: Detects a Pod that needs to be assigned to a Node, then converts CPU cores to CFS period/quota (milliseconds)
- CRI Runtime: Sets the limits in the OCI spec and passes them on
- OCI Runtime: Calls cgroups

Container Primitives

Cgroups

Cgroups?

Cgroups (control groups) allow you to allocate resources, such as CPU time (again, not cores!), system memory, network bandwidth, or combinations of these.

- CPU Requests in K8s → cpu.shares in cgroups
- CPU Limits in K8s → cpu.cfs_period_us & cpu.cfs_quota_us (us = μs) in cgroups
- Memory Limits in K8s → memory.limit_in_bytes in cgroups

Example: cpu.shares 2048. cpu.shares is a relative value.

https://speakerdeck.com/daikurosawa/understanding-cpu-throttling-in-kubernetes-to-improve-application-performance-number-k8sjp
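To make "cpu.shares is a relative value" concrete, here is a minimal sketch. It assumes the cgroup v1 convention of 1024 shares per CPU, which as far as I know is what the kubelet uses when it converts CPU requests. The function names and the floor of 2 shares are written here as assumptions, not as the kubelet's exact code.

```python
SHARES_PER_CPU = 1024      # assumed: cgroup v1 default weight for one full CPU
MILLI_CPU_PER_CPU = 1000
MIN_SHARES = 2             # assumed floor, the smallest value cgroup v1 accepts


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to a cpu.shares value."""
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU
    return max(shares, MIN_SHARES)


def cpu_time_split(shares_by_container: dict) -> dict:
    """Fraction of CPU time each container gets when all of them are busy."""
    total = sum(shares_by_container.values())
    return {name: s / total for name, s in shares_by_container.items()}


print(milli_cpu_to_shares(2000))  # a 2-CPU request

# Two hypothetical containers competing for a fully used CPU.
split = cpu_time_split({
    "a": milli_cpu_to_shares(250),
    "b": milli_cpu_to_shares(750),
})
print(split)
```

The shares only matter under contention. If container "b" is idle, container "a" is free to use the spare CPU time, which is why requests alone never throttle a container on an idle node.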

CFS Quota? Period?

- CFS = "Completely Fair" Scheduler, a process scheduler in Linux
- Container isolation is based on the resource limitation of cgroups (a Linux kernel functionality)
- Cgroups uses CFS to implement CPU resource restriction
- CFS scheduling is based on processing time, not on cores
- The scheduling period is 100ms

https://kccncna20.sched.com/event/ek9r/10-more-weird-ways-to-blow-up-your-kubernetes-jian-cheung-joseph-kim-airbnb

How much CPU resource you can use in every period:

- K8s Limits: 500m (0.5 core) → CFS_Period: 100ms, CFS_Quota: 50ms
- K8s Limits: 2000m (2 cores) → CFS_Period: 100ms, CFS_Quota: 200ms
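Both examples above follow from one proportion: quota = limit in cores × period. A minimal sketch of that arithmetic, in the microsecond units used by cpu.cfs_quota_us and cpu.cfs_period_us, might look like this. The function name is hypothetical and the 1ms minimum quota is an assumption.

```python
CFS_PERIOD_US = 100_000   # 100ms scheduling period, in microseconds
MIN_QUOTA_US = 1_000      # assumed floor of 1ms for very small limits


def milli_cpu_to_quota_us(milli_cpu: int, period_us: int = CFS_PERIOD_US) -> int:
    """Convert a CPU limit in millicores to a CFS quota per period."""
    quota = milli_cpu * period_us // 1000
    return max(quota, MIN_QUOTA_US)


for limit in (500, 2000):
    quota = milli_cpu_to_quota_us(limit)
    print(f"{limit}m -> cfs_quota_us={quota} ({quota // 1000}ms per 100ms period)")
```

A quota larger than the period, as with the 2000m limit, is possible because the quota is CPU time summed over all cores, so two threads running in parallel can each use a full 100ms.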

If there are no limits… CFS_Quota: -1 (unlimited)
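To see what a quota means for latency, the sketch below estimates how long a burst of CPU work takes under a given quota. It is a deliberately simplified, single-threaded model with a hypothetical function name: it assumes the work starts at the beginning of a period and that nothing else competes for the CPU.

```python
def wall_clock_ms(work_ms: float, quota_ms: float, period_ms: float = 100) -> float:
    """Estimate elapsed time for `work_ms` of CPU work on one thread."""
    if quota_ms < 0:
        return float(work_ms)          # -1 means no limit, so no throttling
    elapsed = 0.0
    remaining = float(work_ms)
    while remaining > 0:
        # One thread can use at most one period of CPU time per period.
        budget = min(quota_ms, period_ms)
        used = min(budget, remaining)
        remaining -= used
        if remaining > 0:
            elapsed += period_ms       # throttled until the next period starts
        else:
            elapsed += used            # finished part-way through this period
    return elapsed


print(wall_clock_ms(200, 50))   # 500m limit
print(wall_clock_ms(200, -1))   # no limit
```

Under this model, 200ms of work with a 50ms quota runs for 50ms, waits out the rest of each period, and repeats, so it finishes well after the same work without a limit.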

CRI Runtime vs OCI Runtime

How container runtime works on Kubernetes

- kubectl (kubectl run, kubectl apply) talks to Kubernetes over the REST API
- Kubernetes: a collection of Kubernetes system components. kube-apiserver receives the requests from kubectl, and kubelet talks to the CRI runtime over CRI (gRPC)
- containerd (CRI runtime, high-level runtime): executes an OCI runtime binary with an OCI container JSON spec
- runC (OCI runtime, low-level runtime): spawns the container with the CPU/Mem settings in the spec

- CRI (high-level) runtimes work with Kubernetes
- OCI (low-level) runtimes work with the Linux kernel
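For a feel of what "CPU/Mem in the spec" looks like, the sketch below builds the resource portion of an OCI runtime config as I understand its layout (`linux.resources.cpu` and `linux.resources.memory`). The helper name is hypothetical, and the numbers are illustrative cgroup v1 style values for the deck's example (250m CPU request, 500m CPU limit, 128Mi memory limit), not output captured from a real runtime.

```python
import json


def oci_linux_resources(cpu_shares: int, cfs_quota_us: int,
                        cfs_period_us: int, memory_limit_bytes: int) -> dict:
    """Build the resources fragment of an OCI runtime config."""
    return {
        "linux": {
            "resources": {
                "cpu": {
                    "shares": cpu_shares,     # from the CPU request
                    "quota": cfs_quota_us,    # from the CPU limit
                    "period": cfs_period_us,
                },
                "memory": {
                    "limit": memory_limit_bytes,  # from the memory limit
                },
            }
        }
    }


fragment = oci_linux_resources(
    cpu_shares=256,
    cfs_quota_us=50_000,
    cfs_period_us=100_000,
    memory_limit_bytes=128 * 2**20,
)
print(json.dumps(fragment, indent=2))
```

A real config.json contains much more (process, mounts, namespaces); only the fields relevant to requests and limits are shown here.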

How do I set the right requests and limits?

Pod Autoscaling

- Horizontal Pod Autoscaler (HPA): more or fewer Pods
- Vertical Pod Autoscaler (VPA): change the size of Pods

Vertical Pod Autoscaler (VPA)

- VPA Modes: Off, Initial, Auto
- VPA Recommendations: Target, Lower Bound, Upper Bound, Uncapped Target
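As a starting point for trying VPA in recommendation-only fashion, the sketch below builds a VerticalPodAutoscaler manifest with the mode set to "Off". It assumes the VPA components are installed in the cluster and that the `autoscaling.k8s.io/v1` API is available. The helper, the object name, and the Deployment name are hypothetical.

```python
import json

VPA_MODES = ("Off", "Initial", "Auto")  # the modes listed on the slide


def vpa_manifest(name: str, deployment: str, update_mode: str = "Off") -> dict:
    """Build a VerticalPodAutoscaler manifest targeting a Deployment."""
    if update_mode not in VPA_MODES:
        raise ValueError(f"unknown VPA mode: {update_mode}")
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "updatePolicy": {"updateMode": update_mode},
        },
    }


# Hypothetical names; "Off" only records recommendations without changing Pods.
manifest = vpa_manifest("kubecon-eu-2021-vpa", "kubecon-eu-2021")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply, after which the recommendations (Target, Lower Bound, Upper Bound, Uncapped Target) appear in the object's status.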

Conclusion

- The Pod spec is registered in etcd through kube-apiserver
- kube-scheduler picks up newly registered Pods (through kube-apiserver, which stores them in etcd) and assigns a node to each Pod by referring to its resource requests
- kubelet fetches its assigned Pod specs in every sync period and calculates the diff between the running containers and the Pod spec
- kubelet calls the CreateContainer gRPC on the CRI runtime, after converting CPU cores into CFS period/quota
- The CRI runtime executes the OCI runtime binary to create a container from the OCI spec JSON
- The OCI runtime manages the cgroups file system (create/delete/update)
- Vertical Pod Autoscaler (VPA) can provide recommendations for your requests and limits
