Slide 1

Slide 1 text

Resource Requests and Limits Under the Hood: The Journey of a Pod Spec
Kaslin Fields, Google
Kohei Ota, Hewlett Packard Enterprise

Slide 2

Slide 2 text

Self introduction

Kohei Ota
- Architect at Hewlett Packard Enterprise
- CNCF Ambassador
- Owner of SIG-Docs Japanese localization
- Twitter: @inductor__ / GitHub: @inductor

Kaslin Fields
- Developer Advocate at Google
- CNCF Ambassador
- Member of K8s SIG-ContribEx
- Comics at kaslin.rocks!
- Twitter: @kaslinfields / GitHub: @kaslin

Slide 3

Slide 3 text

Your App

Slide 4

Slide 4 text

Doggy Daycare Analogy

Slide 5

Slide 5 text

It Takes a Village

Slide 6

Slide 6 text

Resource Requests & Limits

Slide 7

Slide 7 text

Here’s an ordinary pod spec

apiVersion: v1
kind: Pod
metadata:
  name: kubecon-eu-2021
spec:
  containers:
  - name: kubecon-eu-2021
    image: kubecon:eu-2021
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

Requests

Slide 8

Slide 8 text

Requests for planning

Slide 9

Slide 9 text

The Journey of a Request

Developer (App): I want to put my app in a Pod!
Kubernetes API: Ok, we can help with that.

Slide 10

Slide 10 text

Developer: My app will need 2 CPU and 4GB of memory to run properly.
Kubernetes API: I’ll make note of that so your app’s needs will be met. (Update cluster state: record to etcd*)

*etcd is the key-value store component used by Kubernetes. It functions as a single source of truth for the state of the cluster.

Slide 11

Slide 11 text

Kubernetes Scheduler: Oh, a new pod is coming that will require 2 CPU & 4GB of memory. I’ll put that… here. (Assign Pod to Node)

Slide 12

Slide 12 text

Scheduler

The scheduler is the Kubernetes component that evaluates nodes to decide where to assign a Pod. Resource requests are one of the parameters the scheduler uses when ranking nodes.
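The core of that evaluation can be sketched in a few lines of Go: a node is only feasible if the Pod's requests still fit into its allocatable resources. This is an illustrative sketch of the idea, not the actual kube-scheduler plugin code; the types and numbers are made up for the example.

// A minimal sketch of the scheduler's resource check: a node is feasible
// only if the Pod's requests fit into what is still allocatable.
// Illustrative only, not real kube-scheduler code.
package main

import "fmt"

type Resources struct {
	MilliCPU int64 // e.g. 2000 = 2 CPU
	MemoryMi int64 // e.g. 4096 = 4GB
}

func fits(podRequests, nodeAllocatable, alreadyRequested Resources) bool {
	return podRequests.MilliCPU <= nodeAllocatable.MilliCPU-alreadyRequested.MilliCPU &&
		podRequests.MemoryMi <= nodeAllocatable.MemoryMi-alreadyRequested.MemoryMi
}

func main() {
	pod := Resources{MilliCPU: 2000, MemoryMi: 4096}   // "2 CPU and 4GB of memory"
	node := Resources{MilliCPU: 4000, MemoryMi: 16384} // node allocatable
	used := Resources{MilliCPU: 2500, MemoryMi: 2048}  // requests of pods already on the node
	fmt.Println(fits(pod, node, used))                 // false: only 1500m CPU is still unrequested
}

Nodes that pass checks like this are then ranked, and requests are one of the inputs to that ranking.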

Slide 13

Slide 13 text

Kubelet on Node: Ok, let’s get this new pod settled in! (Create Pod on Node)

Slide 14

Slide 14 text

The Journey of a Request

API Server (Control Plane): Oh, a new request! (New Pod to assign)
Scheduler: Hmm… Which node should I put this on… (Assign a Node to a Pod)
Kubelet (on node): Here ya go, little buddy! (Detects that a Pod was assigned to the Node)

Slide 15

Slide 15 text

Requests passing through

API Server (Control plane): New Pod to assign
Scheduler (Control plane): Assign a Node to a Pod (node evaluation with Resource Requests)
Kubelet (Each node): Detects that a Pod was assigned to the Node

Slide 16

Slide 16 text

Requests passing through

Scheduler: Hmm… Which node should I put this on… (Pod Requests vs Node Allocatable)
Kubelet: Detects that a Pod was assigned to the Node

Slide 17

Slide 17 text

Requests passing through

API Server (Control plane): New Pod to assign
Scheduler (Control plane): Assign a Node to a Pod (node evaluation with Resource Requests: Pod Requests vs Node Allocatable)
Kubelet (Each node): Detects that a Pod was assigned to the Node

Slide 18

Slide 18 text

Requests summary

- Requests
  - Used at Pod creation: the scheduler selects a Node that can satisfy the Pod's resource requests
  - The CPU request determines each container's share of CPU time when the node's CPU is 100% used
  - When CPU is not fully used, it is over-committable
- CPU
  - If usage goes over the request? → CPU time is shared in proportion to requests; the container may be deprioritized under contention, but CPU usage alone does not trigger eviction
- Memory
  - If usage goes over the request? → Potential of eviction under node memory pressure

QoS Class

Slide 19

Slide 19 text

QoS Class in Kubernetes

QoS Class | Condition | Priority (lower is better)
Guaranteed | limits and optionally requests (not equal to 0) are set for all resources across all containers and they are equal | 1
Burstable | requests and optionally limits are set (not equal to 0) for one or more resources across one or more containers, and they are not equal | 2
BestEffort | requests and limits are not set for any of the resources, across all containers | 3
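A simplified sketch of how a Pod's QoS class falls out of these rules is below. It is illustrative only; the real logic lives in the kubelet's qos package and handles multiple containers and per-resource detail, and the struct here is a made-up stand-in.

// Simplified single-container QoS classification, mirroring the table above.
package main

import "fmt"

type ContainerResources struct {
	RequestsCPU, RequestsMemory string // "" means not set
	LimitsCPU, LimitsMemory     string
}

func qosClass(c ContainerResources) string {
	noneSet := c.RequestsCPU == "" && c.RequestsMemory == "" &&
		c.LimitsCPU == "" && c.LimitsMemory == ""
	allLimitsSet := c.LimitsCPU != "" && c.LimitsMemory != ""
	requestsEqualLimits := (c.RequestsCPU == "" || c.RequestsCPU == c.LimitsCPU) &&
		(c.RequestsMemory == "" || c.RequestsMemory == c.LimitsMemory)

	switch {
	case noneSet:
		return "BestEffort"
	case allLimitsSet && requestsEqualLimits:
		return "Guaranteed" // omitted requests default to the limits
	default:
		return "Burstable"
	}
}

func main() {
	fmt.Println(qosClass(ContainerResources{})) // BestEffort
	fmt.Println(qosClass(ContainerResources{
		RequestsCPU: "250m", RequestsMemory: "64Mi",
		LimitsCPU: "500m", LimitsMemory: "128Mi",
	})) // Burstable: the pod spec from the earlier slide (requests != limits)
	fmt.Println(qosClass(ContainerResources{
		LimitsCPU: "500m", LimitsMemory: "128Mi",
	})) // Guaranteed: requests default to the limits
}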

Slide 20

Slide 20 text

Now let’s talk about limits

apiVersion: v1
kind: Pod
metadata:
  name: kubecon-eu-2021
spec:
  containers:
  - name: kubecon-eu-2021
    image: kubecon:eu-2021
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

Limits

Slide 21

Slide 21 text

Limits enforce.

Slide 22

Slide 22 text

The Journey of a Pod Limit

Developer (App): I want to make sure my pod doesn’t consume more than 2 CPU and 4GB of memory.
Kubernetes API: Yes, we can limit your pod’s resource usage.

Slide 23

Slide 23 text

Kubernetes Scheduler: This new pod needs to be limited to 2 CPU and 4GB. I’d better make sure the caretaker knows. (Assign Pod to Node)

Slide 24

Slide 24 text

Kubelet on Node: Ah, this pod is limited. I’ll get only the resources it needs from the supplier. (Send Pod to Kubelet on Node)

Slide 25

Slide 25 text

Kubelet: Hey, I have a pod coming in that needs its resources limited.
Container Runtime/Linux: Ok, I can use cgroups to make that happen.
(Kubelet to Container Runtime)

Slide 26

Slide 26 text

Requests for planning. Limits for enforcing.

Slide 27

Slide 27 text

Limits summary

- Limits
  - Used to limit resources on a Pod by configuring cgroups on Linux
- CPU
  - If over the limit? → CPU throttling
- Memory
  - If over the limit? → OOM kill

Slide 28

Slide 28 text

Limits by level

API Server (Control plane)
Kubelet (Each node): detects a Pod that was assigned to the Node, converts CPU cores to a CFS period/quota (milliseconds)
CRI Runtime (Each node): sets them in the OCI spec and passes the limits on
OCI Runtime (Each node): calls cgroups
Cgroups

Slide 29

Slide 29 text

Limits by level

API Server (Control plane)
Kubelet (Each node): detects a Pod that was assigned to the Node, converts CPU cores to a CFS period/quota (milliseconds)
CRI Runtime (Each node): sets them in the OCI spec and passes the limits on
OCI Runtime (Each node): calls cgroups
Cgroups
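The conversion step can be sketched in a few lines of Go. This approximates the cgroup v1 formulas described on the following slides (shares derived from the CPU request, quota/period derived from the CPU limit); it is not the kubelet's actual helper code, which handles more edge cases.

// Rough sketch of converting Kubernetes CPU values into cgroup/CFS values.
package main

import "fmt"

const cfsPeriodUs = 100000 // 100ms scheduling period

// cpu.shares comes from the CPU *request* (a relative weight).
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return 2 // kernel minimum
	}
	return milliCPU * 1024 / 1000
}

// cpu.cfs_quota_us comes from the CPU *limit* (an absolute cap per period).
func milliCPUToQuota(milliCPU int64) int64 {
	if milliCPU == 0 {
		return -1 // no limit set: unlimited
	}
	return milliCPU * cfsPeriodUs / 1000
}

func main() {
	fmt.Println(milliCPUToShares(250)) // 256   (request: 250m)
	fmt.Println(milliCPUToQuota(500))  // 50000 µs = 50ms per 100ms period (limit: 500m)
	fmt.Println(milliCPUToQuota(2000)) // 200000 µs = 200ms per 100ms period (limit: 2000m)
	fmt.Println(milliCPUToQuota(0))    // -1, unlimited
}

These are the same numbers that appear in the CFS quota/period examples later in the deck.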

Slide 30

Slide 30 text

Container Primitives

Slide 31

Slide 31 text

Cgroups

Slide 32

Slide 32 text

Cgroups?

Cgroups (control groups) allow you to allocate resources, such as CPU time (again, not cores!), system memory, network bandwidth, or combinations of these.

CPU Requests in K8s → cpu.shares in cgroups
CPU Limits in K8s → cpu.cfs_period_us & cpu.cfs_quota_us (us = μs) in cgroups
Memory Limits in K8s → memory.limit_in_bytes in cgroups

Slide 33

Slide 33 text

Cgroups?

Cgroups (control groups) allow you to allocate resources, such as CPU time (again, not cores!), system memory, network bandwidth, or combinations of these.

CPU Requests in K8s → cpu.shares in cgroups
CPU Limits in K8s → cpu.cfs_period_us & cpu.cfs_quota_us (us = μs) in cgroups
Memory Limits in K8s → memory.limit_in_bytes in cgroups

Example: cpu.shares = 2048 is added; cpu.shares is a relative value.

https://speakerdeck.com/daikurosawa/understanding-cpu-throttling-in-kubernetes-to-improve-application-performance-number-k8sjp
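To make "relative value" concrete, here is a toy calculation showing how CPU time would be split under full contention between siblings at the same cgroup level. The pod names and share values are made up for the example (chosen to correspond to 2000m, 1000m, and 500m requests).

// Toy illustration: under full CPU contention, each cgroup gets CPU time
// proportional to its cpu.shares.
package main

import "fmt"

func main() {
	shares := map[string]int64{
		"pod-a": 2048, // e.g. a 2000m CPU request
		"pod-b": 1024, // e.g. a 1000m CPU request
		"pod-c": 512,  // e.g. a 500m CPU request
	}
	var total int64
	for _, s := range shares {
		total += s
	}
	for name, s := range shares {
		fmt.Printf("%s gets ~%.0f%% of CPU time under full contention\n",
			name, float64(s)/float64(total)*100)
	}
}

When the CPU is not fully used, shares do not cap anything; that is what makes requests over-committable.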

Slide 34

Slide 34 text

CFS Quota? Period?

- CFS = "Completely Fair Scheduler", a process scheduler in Linux
- Container isolation is based on cgroups (a Linux kernel feature) resource limitation
- Cgroups use CFS to implement CPU resource restriction
- CFS scheduling is based on processing time, not cores; the scheduling period is 100ms

https://kccncna20.sched.com/event/ek9r/10-more-weird-ways-to-blow-up-your-kubernetes-jian-cheung-joseph-kim-airbnb

Slide 35

Slide 35 text

CFS Quota? Period?

- CFS = "Completely Fair Scheduler", a process scheduler in Linux
- Container isolation is based on cgroups (a Linux kernel feature) resource limitation
- Cgroups use CFS to implement CPU resource restriction
- CFS scheduling is based on processing time, not cores; the scheduling period is 100ms

How much CPU you can use in every period:
- K8s Limits: 500m (0.5 core) → CFS_Period: 100ms, CFS_Quota: 50ms
- K8s Limits: 2000m (2 cores) → CFS_Period: 100ms, CFS_Quota: 200ms

https://kccncna20.sched.com/event/ek9r/10-more-weird-ways-to-blow-up-your-kubernetes-jian-cheung-joseph-kim-airbnb

Slide 36

Slide 36 text

CFS Quota? Period?

- CFS = "Completely Fair Scheduler", a process scheduler in Linux
- Container isolation is based on cgroups (a Linux kernel feature) resource limitation
- Cgroups use CFS to implement CPU resource restriction
- CFS scheduling is based on processing time, not cores; the scheduling period is 100ms

How much CPU you can use in every period:
- K8s Limits: 500m (0.5 core) → CFS_Period: 100ms, CFS_Quota: 50ms
- K8s Limits: 2000m (2 cores) → CFS_Period: 100ms, CFS_Quota: 200ms
- If there are no limits → CFS_Quota: -1 (unlimited)

https://kccncna20.sched.com/event/ek9r/10-more-weird-ways-to-blow-up-your-kubernetes-jian-cheung-joseph-kim-airbnb

Slide 37

Slide 37 text

CRI Runtime vs OCI Runtime: how the container runtime works on Kubernetes

Kubernetes (kubectl run / kubectl apply → REST API): a collection of Kubernetes system components; kube-apiserver receives the kubectl request and the kubelet talks to the CRI runtime.
containerd (CRI over gRPC, high-level runtime): the CRI runtime executes an OCI runtime binary with the OCI container JSON spec.
runC (OCI, low-level runtime): the OCI runtime spawns the container with the CPU/memory settings in the spec.

Slide 38

Slide 38 text

CRI Runtime vs OCI Runtime: how the container runtime works on Kubernetes

Kubernetes (kubectl run / kubectl apply → REST API): a collection of Kubernetes system components; kube-apiserver receives the kubectl request and the kubelet talks to the CRI runtime.
containerd (CRI over gRPC, high-level runtime): the CRI runtime executes an OCI runtime binary with the OCI container JSON spec.
runC (OCI, low-level runtime): the OCI runtime spawns the container with the CPU/memory settings in the spec.

CRI (high-level) runtimes work with Kubernetes; OCI (low-level) runtimes work with the Linux kernel.
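To make the hand-off concrete, here is a trimmed-down sketch of the kind of linux.resources section a CRI runtime writes into the OCI config.json for a container with a 250m CPU request and 500m CPU / 128Mi memory limits. The JSON keys follow the OCI runtime spec, but the Go structs are simplified stand-ins, not the real opencontainers/runtime-spec types.

// Emits a simplified linux.resources fragment of an OCI config.json.
package main

import (
	"encoding/json"
	"fmt"
)

type CPU struct {
	Shares uint64 `json:"shares"` // from the CPU request (250m → 256)
	Quota  int64  `json:"quota"`  // from the CPU limit (500m → 50000 µs)
	Period uint64 `json:"period"` // 100000 µs = 100ms
}

type Memory struct {
	Limit int64 `json:"limit"` // memory limit in bytes (128Mi)
}

type Resources struct {
	CPU    CPU    `json:"cpu"`
	Memory Memory `json:"memory"`
}

func main() {
	r := Resources{
		CPU:    CPU{Shares: 256, Quota: 50000, Period: 100000},
		Memory: Memory{Limit: 128 * 1024 * 1024},
	}
	out, _ := json.MarshalIndent(map[string]any{"linux": map[string]any{"resources": r}}, "", "  ")
	fmt.Println(string(out)) // the values the OCI runtime applies via cgroups
}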

Slide 39

Slide 39 text

How do I set the right requests and limits?

Slide 40

Slide 40 text

Pod Autoscaling

Horizontal Pod Autoscaler (HPA): scales to more or fewer Pods
Vertical Pod Autoscaler (VPA): changes the size of Pods

Slide 41

Slide 41 text

Vertical Pod Autoscaler (VPA)

VPA Modes: Off, Initial, Auto
VPA Recommendations: Target, Lower Bound, Upper Bound, Uncapped Target

Slide 42

Slide 42 text

Conclusion

- The Pod spec is registered in etcd through kube-apiserver
- kube-scheduler fetches newly registered Pods from etcd and assigns a Node to each Pod, referring to their resource requests
- kubelet fetches the assigned Pod spec every sync period and calculates the diff between running containers and the Pod spec
- kubelet calls the CreateContainer gRPC on the CRI runtime, after converting CPU cores into CFS periods/quotas
- The CRI runtime executes the OCI runtime binary to create a container with the OCI spec JSON
- The OCI runtime manages the cgroups filesystem (create/delete/update)
- The Vertical Pod Autoscaler (VPA) can provide recommendations for your requests and limits

Slide 43

Slide 43 text

No content