Owner of SIG-Docs Japanese localization
Twitter: @inductor__
GitHub: @inductor

Kaslin Fields
Developer Advocate at Google
CNCF Ambassador
Member of K8s SIG-ContribEx
Comics at kaslin.rocks!
Twitter: @kaslinfields
GitHub: @kaslin
run properly.
Update Cluster State
"I'll make note of that so your app's needs will be met."
Record to etcd*
*etcd is the key-value store component used by Kubernetes. It functions as a single source of truth for the state of the cluster.
a Node to a Pod Detects that a Pod was assigned to the Node API Server (Control Plane) Hmm… Which node should I put this on… Here ya go, little buddy! Oh, a new request! Scheduler Kubelet (on node)
Kubelet (each node)
New Pod to assign
Assign a Node to a Pod
Detects a Pod that was assigned to the Node
Node evaluation with Resource Requests
Pod Requests vs Node Allocatable
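Scheduler selects a Node for a Pod whose allocatable resources satisfy the Pod's resource requests
- The CPU request is used to apportion CPU time between containers when the Node's CPU is 100% used
- When the CPU is not fully used, it is over-committable: a container may use more than its request
- CPU
  - If usage goes over the request? → Potential throttling under contention (CPU is compressible, so this alone does not cause eviction)
- Memory
  - If usage goes over the request? → Potential of eviction
QoS Class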
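As a rough illustration of the node evaluation step (Pod requests vs Node allocatable), the check can be sketched as below. This is a simplified model, not the kube-scheduler's code: the `Node` class, its field names, and the `fits` / `feasible_nodes` helpers are hypothetical, and only CPU and memory are considered.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical, simplified view of a node (not a Kubernetes API type).
    name: str
    allocatable_cpu_m: int          # allocatable CPU in millicores
    allocatable_mem_bytes: int      # allocatable memory in bytes
    requested_cpu_m: int = 0        # sum of CPU requests of pods already placed
    requested_mem_bytes: int = 0    # sum of memory requests of pods already placed


def fits(node: Node, cpu_m: int, mem_bytes: int) -> bool:
    """True if the pod's requests fit into what is still unrequested on the node."""
    return (
        node.requested_cpu_m + cpu_m <= node.allocatable_cpu_m
        and node.requested_mem_bytes + mem_bytes <= node.allocatable_mem_bytes
    )


def feasible_nodes(nodes: list[Node], cpu_m: int, mem_bytes: int) -> list[str]:
    """Names of the nodes whose allocatable resources can hold the pod's requests."""
    return [n.name for n in nodes if fits(n, cpu_m, mem_bytes)]


GIB = 2**30
example_nodes = [
    Node("node-a", 2000, 4 * GIB, requested_cpu_m=1800),
    Node("node-b", 2000, 4 * GIB, requested_cpu_m=500, requested_mem_bytes=GIB),
]
```

Note that the check compares requests, not actual usage: a node can look full to the scheduler while its CPU is mostly idle, and the reverse.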
better)
Guaranteed (1): limits and optionally requests (not equal to 0) are set for all resources across all containers, and they are equal
Burstable (2): requests and optionally limits (not equal to 0) are set for one or more resources across one or more containers, and they are not equal
BestEffort (3): requests and limits are not set for any resource in any container
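The three classes above can be expressed as a small decision function. The sketch below is an approximation under stated assumptions: the container representation (plain dicts with optional `requests` and `limits` maps) and the `qos_class` helper are hypothetical, only `cpu` and `memory` are considered, and quantities are compared as written rather than parsed (so `"500m"` and `"0.5"` would count as different).

```python
RESOURCES = ("cpu", "memory")  # assumption: only these two resources decide the class


def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' requests and limits (simplified)."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in RESOURCES:
            limit = limits.get(resource)
            # A request that is omitted defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

For example, a pod whose single container sets only `limits` for both CPU and memory comes out as Guaranteed, because the missing requests are treated as equal to the limits.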
CRI Runtime (Each node) Detects a Pod that needs to be assigned to a Node Convert CPU cores to CFS period/quota (milliseconds) Set to OCI spec Pass limits OCI Runtime (Each node) Call Cgroups Cgroups
as CPU time (again, not cores!), system memory, network bandwidth, or combinations of these
CPU Requests in K8s → cpu.shares in cgroups
CPU Limits in K8s → cpu.cfs_period_us & cpu.cfs_quota_us (us = μs) in cgroups
Memory Limits in K8s → memory.limit_in_bytes in cgroups
Add cpu.shares 2048
cpu.shares is a relative value
https://speakerdeck.com/daikurosawa/understanding-cpu-throttling-in-kubernetes-to-improve-application-performance-number-k8sjp
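To make "cpu.shares is a relative value" concrete, here is a minimal sketch of how a CPU request in millicores can be turned into a cpu.shares value, and what share of CPU time that implies when every cgroup is busy. The 1024-shares-per-core ratio and the clamping bounds follow the conversion commonly described for the kubelet on cgroup v1; treat them as assumptions here. The function names are hypothetical.

```python
SHARES_PER_CPU = 1024   # assumed: 1 full core of request == 1024 shares
MIN_SHARES = 2          # assumed lower bound for cpu.shares
MAX_SHARES = 262144     # assumed upper bound for cpu.shares


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to a cpu.shares value."""
    shares = milli_cpu * SHARES_PER_CPU // 1000
    return max(MIN_SHARES, min(shares, MAX_SHARES))


def contended_fraction(own_shares: int, all_shares: list[int]) -> float:
    """Fraction of CPU time a cgroup gets when all sibling cgroups are busy.

    all_shares must include own_shares.
    """
    return own_shares / sum(all_shares)
```

A request of 2000m maps to 2048 shares. Next to two siblings with 1024 shares each, that cgroup would get half of the CPU time under full contention; when the siblings are idle, it can use more.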
scheduler in Linux
Container isolation is based on the resource limitation provided by cgroups (a Linux kernel functionality)
Cgroups use CFS to implement CPU resource restriction
CFS scheduling is based on processing time, not on cores. The scheduling period is 100ms
https://kccncna20.sched.com/event/ek9r/10-more-weird-ways-to-blow-up-your-kubernetes-jian-cheung-joseph-kim-airbnb
K8s Limits: 500m (0.5 core) → CFS_Period: 100ms, CFS_Quota: 50ms
K8s Limits: 2000m (2 cores) → CFS_Period: 100ms, CFS_Quota: 200ms
CFS_Quota: how much CPU time you can use in every period
If there are no limits… CFS_Quota: -1 (unlimited)
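The limit-to-quota arithmetic from these slides can be written out as a short sketch. It assumes the 100ms period stated above, expresses values in microseconds as the cgroup files do, and uses -1 for "no limit" as on the slide. The function name is hypothetical and this is not the kubelet's actual code.

```python
from typing import Optional

CFS_PERIOD_US = 100_000  # 100ms scheduling period, in microseconds


def cpu_limit_to_cfs_quota_us(
    limit_milli_cpu: Optional[int], period_us: int = CFS_PERIOD_US
) -> int:
    """CPU time (in microseconds) a container may use per CFS period.

    limit_milli_cpu is the Kubernetes CPU limit in millicores, or None
    when no limit is set, in which case the quota is -1 (unlimited).
    """
    if limit_milli_cpu is None:
        return -1
    return limit_milli_cpu * period_us // 1000
```

With a 500m limit the quota is 50,000μs (50ms) per 100ms period; with 2000m it is 200,000μs, which is more than one period because the time can be spent on several cores in parallel.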
Kubernetes
kubectl run / kubectl apply → REST API → Kubernetes: collection of Kubernetes system components. kube-apiserver receives the requests from kubectl, and kubelet talks to the CRI runtime
CRI (gRPC) → containerd (High level runtime): the CRI runtime executes an OCI runtime binary file with the OCI container JSON spec
OCI → runC (Low level runtime): the OCI runtime spawns the container with the CPU/Mem settings in the spec
CRI (High level) Runtimes run with Kubernetes
OCI (Low level) Runtimes run with Linux kernel
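To give a feel for what the OCI container JSON spec carries for CPU and memory, the sketch below builds only the resource-related fragment. The keys under `linux.resources` follow the OCI runtime specification; the helper name and the example values are hypothetical, and a real `config.json` contains many more fields.

```python
import json


def oci_resources_fragment(
    cpu_shares: int, cfs_quota_us: int, cfs_period_us: int, memory_limit_bytes: int
) -> dict:
    """Build the linux.resources part of an OCI runtime spec (fragment only)."""
    return {
        "linux": {
            "resources": {
                "cpu": {
                    "shares": cpu_shares,      # from the CPU request
                    "quota": cfs_quota_us,     # from the CPU limit
                    "period": cfs_period_us,   # CFS period
                },
                "memory": {
                    "limit": memory_limit_bytes,  # from the memory limit
                },
            }
        }
    }


# Example values: 500m request and limit, 256MiB memory limit.
fragment = oci_resources_fragment(512, 50_000, 100_000, 256 * 2**20)
fragment_json = json.dumps(fragment, indent=2)
```

The low-level runtime reads these values and applies them to the container's cgroup.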
fetches newly registered pods from etcd (through the API server) and assigns a node to each pod by referring to its resource requests
kubelet fetches the assigned pod spec in every sync period and calculates the diff between the running containers and the pod spec
kubelet calls the CreateContainer gRPC on the CRI runtime, after converting CPU cores into CFS period/quota
CRI runtime executes the OCI runtime binary to create a container with the OCI spec JSON
OCI runtime manages the cgroups file system (create/delete/update)
Vertical Pod Autoscaler (VPA) can provide recommendations for your requests and limits.
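The kubelet step in the summary above (calculating the diff between the running containers and the pod spec) can be pictured with a small set-based sketch. This is a deliberately simplified model: real reconciliation also considers container state, restarts, and spec changes. The function name and the use of plain container names are hypothetical.

```python
def diff_containers(desired: set[str], running: set[str]) -> tuple[list[str], list[str]]:
    """Return (containers to create, containers to remove) for one sync pass."""
    to_create = sorted(desired - running)   # in the pod spec but not running yet
    to_remove = sorted(running - desired)   # running but no longer in the pod spec
    return to_create, to_remove
```

In this model, each name in `to_create` would lead to a CreateContainer call towards the CRI runtime.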