Self introduction

Kohei Ota
- Architect at Hewlett-Packard Enterprise
- CNCF Ambassador
- Owner of SIG-Docs Japanese localization
- Twitter: @inductor__
- GitHub: @inductor

Kaslin Fields
- Developer Advocate at Google
- CNCF Ambassador
- Member of K8s SIG-ContribEx
- Comics at kaslin.rocks!
- Twitter: @kaslinfields
- GitHub: @kaslin
“My app will need 2 CPUs and 4 GB of memory to run properly.”
“I’ll make a note of that so your app’s needs will be met.”
Update cluster state: record to etcd*
*etcd is the key-value store component used by Kubernetes. It functions as the single source of truth for the state of the cluster.
Scheduler
The scheduler is the Kubernetes component that evaluates nodes and assigns each Pod to one of them. Resource requests are one of the parameters the scheduler uses when ranking nodes.
The Journey of a Request
- API Server (Control Plane): New Pod to assign
- Scheduler: Assign a Node to a Pod
- Kubelet (on node): Detects that a Pod was assigned to the Node
Speech bubbles: “Oh, a new request!” / “Hmm… Which node should I put this on…” / “Here ya go, little buddy!”
Requests passing through
- API Server (Control plane): New Pod to assign
- Scheduler (Control plane): Assign a Node to a Pod
  - Node evaluation with Resource Requests: Pod Requests vs Node Allocatable
- Kubelet (Each node): Detects a Pod that was assigned to the Node
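The "Pod Requests vs Node Allocatable" comparison can be pictured roughly as below. This is only a minimal sketch of the idea, not the real kube-scheduler code: the `Node` class, its field names, and the example numbers are all hypothetical, and the real scheduler applies many more filters and scoring steps.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of a node, for illustration only.
    name: str
    allocatable_cpu_m: int        # millicores the node offers to Pods
    allocatable_mem_mib: int      # MiB the node offers to Pods
    requested_cpu_m: int = 0      # sum of CPU requests of Pods already on the node
    requested_mem_mib: int = 0    # sum of memory requests of Pods already on the node


def fits(node: Node, cpu_m: int, mem_mib: int) -> bool:
    """Return True if the Pod's requests fit into what is left of Node Allocatable."""
    free_cpu = node.allocatable_cpu_m - node.requested_cpu_m
    free_mem = node.allocatable_mem_mib - node.requested_mem_mib
    return cpu_m <= free_cpu and mem_mib <= free_mem


def feasible_nodes(nodes, cpu_m, mem_mib):
    """Names of the nodes that could host a Pod with the given requests."""
    return [n.name for n in nodes if fits(n, cpu_m, mem_mib)]


nodes = [
    Node("node-a", 4000, 8192, requested_cpu_m=3000, requested_mem_mib=2048),
    Node("node-b", 4000, 8192, requested_cpu_m=1000, requested_mem_mib=2048),
]

# A Pod requesting 2 CPU (2000m) and 4096 MiB only fits on node-b.
print(feasible_nodes(nodes, cpu_m=2000, mem_mib=4096))  # ['node-b']
```

Note that the comparison uses the sum of requests already placed on the node, not the node's live usage.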
Requests summary
- Requests
  - Used at Pod creation
  - Scheduler selects a Node for a Pod that satisfies the resource requirement
  - The CPU request acts as a relative weight that divides CPU time between containers when the node's CPU is 100% used
  - When CPU is not fully used, it’s over-committable
- CPU
  - If over request? → Potential of throttling when the node's CPU is contended
- Memory
  - If over request? → Potential of eviction
QoS Class
QoS Class in Kubernetes
QoS Class (Priority, lower is better): Condition
- Guaranteed (1): limits and optionally requests (not equal to 0) are set for all resources across all containers, and they are equal
- Burstable (2): requests and optionally limits are set (not equal to 0) for one or more resources across one or more containers, and they are not equal
- BestEffort (3): requests and limits are not set for any of the resources, across all containers
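The conditions above can be read as a small decision procedure. The following is a simplified sketch, not the kubelet's implementation: the `qos_class` function and its input shape are hypothetical, only `cpu` and `memory` are considered, and quantities are compared as plain strings (so "500m" and "0.5" would wrongly count as different).

```python
def qos_class(containers):
    """Classify a Pod from a list of container resource dicts.

    Each container is a dict with optional 'requests' and 'limits' maps, e.g.
    {"requests": {"cpu": "500m"}, "limits": {"cpu": "500m", "memory": "1Gi"}}.
    """
    resources = ("cpu", "memory")
    any_set = False       # becomes True once any request or limit is found
    guaranteed = True     # stays True only if every limit is set and equals its request

    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in resources:
            limit = limits.get(resource)
            # When only the limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"limits": {"cpu": "2", "memory": "4Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "1"}}]))                 # Burstable
print(qos_class([{}]))                                         # BestEffort
```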
The Journey of a Pod Limit
Developer (App): “I want to make sure my pod doesn’t consume more than 2 CPUs and 4 GB of memory.”
Kubernetes API: “Yes, we can limit your pod’s resource usage.”
Kubelet to Container Runtime
Kubelet: “Hey, I have a pod coming in that needs its resources limited.”
Container Runtime/Linux: “Ok, I can use cgroups to make that happen.”
Limits summary
- Limits
  - Used to limit a Pod's resources through cgroups on Linux
- CPU
  - If over limit? → CPU throttling
- Memory
  - If over limit? → The container is OOM-killed
Limits by level
- API Server (Control plane)
- Kubelet (Each node): Detects a Pod that needs to be assigned to a Node; converts CPU cores to CFS period/quota (milliseconds)
- CRI Runtime (Each node): Set to OCI spec, pass limits
- OCI Runtime (Each node): Call Cgroups
- Cgroups
Cgroups?
Cgroups (control groups) allow you to allocate resources such as CPU time (again, not cores!), system memory, network bandwidth, or combinations of these.
- CPU Requests in K8s → cpu.shares in cgroups
- CPU Limits in K8s → cpu.cfs_period_us & cpu.cfs_quota_us (us = μs) in cgroups
- Memory Limits in K8s → memory.limit_in_bytes in cgroups
Add cpu.shares 2048: cpu.shares is a relative value
https://speakerdeck.com/daikurosawa/understanding-cpu-throttling-in-kubernetes-to-improve-application-performance-number-k8sjp
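To make "cpu.shares is a relative value" concrete, here is a small sketch of how a CPU request could be turned into cpu.shares and what the resulting weights mean under contention. It assumes cgroup v1 and the commonly cited kubelet constants (1024 shares per core, a minimum of 2); the function names are hypothetical.

```python
SHARES_PER_CPU = 1024   # assumed: one full core corresponds to 1024 shares
MIN_SHARES = 2          # assumed: smallest value used for cpu.shares


def cpu_request_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores into a cpu.shares value."""
    shares = milli_cpu * SHARES_PER_CPU // 1000
    return max(shares, MIN_SHARES)


def share_of_cpu_time(shares_by_container: dict) -> dict:
    """Fraction of CPU time each container gets when all of them are busy."""
    total = sum(shares_by_container.values())
    return {name: s / total for name, s in shares_by_container.items()}


print(cpu_request_to_shares(2000))  # 2048, the value shown on the slide
print(cpu_request_to_shares(500))   # 512

# Shares only matter relative to each other: 2048 vs 1024 is a 2:1 split.
print(share_of_cpu_time({"app": 2048, "sidecar": 1024}))
```

When the node has idle CPU, a container can use more than its share; the weights only decide the split once the CPU is fully used.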
CFS Quota? Period?
- CFS = “Completely Fair” Scheduler, a process scheduler in Linux
- Container isolation is based on cgroups (a Linux kernel functionality) resource limitation
- Cgroups uses CFS to implement CPU resource restriction
- CFS scheduling is based on processing time, not cores
- The scheduling period is 100ms
https://kccncna20.sched.com/event/ek9r/10-more-weird-ways-to-blow-up-your-kubernetes-jian-cheung-joseph-kim-airbnb
- K8s Limits: 500m (0.5 core) → CFS_Period: 100ms, CFS_Quota: 50ms
- K8s Limits: 2000m (2 cores) → CFS_Period: 100ms, CFS_Quota: 200ms
- CFS_Quota: how much CPU time you can use in every period
- If there are no limits… CFS_Quota: -1 (unlimited)
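The limit-to-quota arithmetic on this slide can be sketched as follows. It is an illustration of the calculation only, assuming the 100ms period from the slide (written in microseconds, as in cpu.cfs_period_us) and a 1ms minimum quota; the function name is hypothetical and `None` stands in for "no limit set".

```python
CFS_PERIOD_US = 100_000   # 100ms scheduling period, in microseconds
MIN_QUOTA_US = 1_000      # assumed lower bound for the quota (1ms)


def cpu_limit_to_cfs_quota(milli_cpu, period_us=CFS_PERIOD_US):
    """Convert a CPU limit in millicores into a CFS quota in microseconds.

    Returns -1 (unlimited) when no limit is set.
    """
    if milli_cpu is None:
        return -1
    quota = milli_cpu * period_us // 1000
    return max(quota, MIN_QUOTA_US)


print(cpu_limit_to_cfs_quota(500))    # 50000  -> 50ms of CPU time per 100ms
print(cpu_limit_to_cfs_quota(2000))   # 200000 -> 200ms of CPU time per 100ms
print(cpu_limit_to_cfs_quota(None))   # -1     -> no limit
```

A quota larger than the period (200ms in 100ms) is possible because the CPU time is summed across cores.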
CRI Runtime vs OCI Runtime
How container runtime works on Kubernetes
- kubectl run / kubectl apply → REST API → Kubernetes
- Kubernetes: collection of Kubernetes system components. kube-apiserver receives the requests from kubectl, and kubelet talks to the CRI runtime.
- Kubernetes → CRI (gRPC) → containerd (high-level runtime): the CRI runtime executes an OCI runtime binary with the OCI container JSON spec.
- containerd → OCI → runC (low-level runtime): the OCI runtime spawns the container with CPU/Mem in the spec.
- CRI (high-level) runtimes run with Kubernetes
- OCI (low-level) runtimes run with the Linux kernel
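As a rough picture of "CPU/Mem in the spec", the sketch below builds the resources fragment of an OCI runtime spec as a Python dict and prints it as JSON. The field names under `linux.resources` follow the OCI runtime specification; the helper function and the example values (2 CPU, 4 GiB) are assumptions for illustration, and a real spec contains many more fields.

```python
import json


def oci_linux_resources(cpu_shares, cfs_quota_us, cfs_period_us, memory_limit_bytes):
    """Build only the linux.resources part of an OCI runtime spec (illustrative)."""
    return {
        "linux": {
            "resources": {
                "cpu": {
                    "shares": cpu_shares,      # from the CPU request
                    "quota": cfs_quota_us,     # from the CPU limit
                    "period": cfs_period_us,   # CFS period
                },
                "memory": {
                    "limit": memory_limit_bytes,  # from the memory limit
                },
            }
        }
    }


# Example values: request and limit of 2 CPU, memory limit of 4 GiB.
spec_fragment = oci_linux_resources(
    cpu_shares=2048,
    cfs_quota_us=200_000,
    cfs_period_us=100_000,
    memory_limit_bytes=4 * 1024**3,
)
print(json.dumps(spec_fragment, indent=2))
```

The low-level runtime reads values like these from the spec and writes them into the cgroup files.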
Conclusion
- Pod spec is registered in etcd through kube-apiserver
- kube-scheduler picks up newly registered pods through kube-apiserver and assigns a node to each pod, referring to resource requests
- kubelet fetches the assigned pod specs in every sync period and calculates diffs between the running containers and the pod spec
- kubelet calls the CreateContainer gRPC on the CRI runtime, after converting CPU cores into CFS period and quota
- CRI runtime executes the OCI runtime binary to create a container with the OCI spec JSON
- OCI runtime manages the cgroups file system (create/delete/update)
- Vertical Pod Autoscaler (VPA) can provide recommendations for your requests and limits.