Slide 1

Everything You Ever Wanted To Know About Resource Scheduling... Almost
Tim Hockin, Senior Staff Software Engineer, Google
@thockin

Slide 2

Who is thockin?
Founding member of the Kubernetes team
Focused on storage, networking, node, and other “infrastructure” things
In my time at Google:
- Worked on Borg & Omega
- Machine management & monitoring
- BIOS & Linux kernel
“I reserve the right to have an opinion, regardless of how wrong I probably am.”

Slide 3

WARNING: Some of this presentation is aspirational!

Slide 4

I posit: Kubernetes is fundamentally ABOUT resource management

Slide 5

● CPU
● Memory

Slide 6

● CPU
● Memory
● Disk space
● Disk time
● Disk “spindles”

Slide 7

● CPU
● Memory
● Disk space
● Disk time
● Disk “spindles”
● Network bandwidth
● Host ports

Slide 8

● CPU
● Memory
● Disk space
● Disk time
● Disk “spindles”
● Network bandwidth
● Host ports
● Cache lines
● Memory bandwidth
● IP addresses
● Attached storage
● PIDs
● GPUs
● Power

Slide 9

● CPU
● Memory
● Disk space
● Disk time
● Disk “spindles”
● Network bandwidth
● Host ports
● Cache lines
● Memory bandwidth
● IP addresses
● Attached storage
● PIDs
● GPUs
● Power
● Arbitrary, opaque third-party resources we can’t possibly predict
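Kubernetes handles this today with extended resources: a node advertises an arbitrary, integer-countable resource name in its capacity, and Pods request it like any built-in resource. A minimal, hedged sketch (the resource name example.com/widgets and the image are made up):

apiVersion: v1
kind: Pod
metadata:
  name: widget-user
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0
    resources:
      requests:
        example.com/widgets: "2"   # opaque to Kubernetes; just counted against node capacity
      limits:
        example.com/widgets: "2"   # extended resources cannot be overcommitted, so requests must equal limits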

Slide 10

Mental model: Nodes produce capacity

apiVersion: v1
kind: Node
status:
  capacity:
    cpu: "4"
    memory: 32788388Ki
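A detail worth knowing when reading Node objects: alongside capacity, the Node status also reports allocatable, which is capacity minus resources reserved for the system; the scheduler fits Pods against allocatable. A hedged fragment (the allocatable numbers below are illustrative):

status:
  capacity:
    cpu: "4"
    memory: 32788388Ki
  allocatable:            # what is actually offered to Pods
    cpu: 3920m
    memory: 31788388Ki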

Slide 11

Mental model: Pods consume capacity

apiVersion: v1
kind: Pod
spec:
  containers:
  - resources:
      requests:
        cpu: 1500m
        memory: 3.75Gi

Slide 12

Mental model: Scheduler binds Pods to Nodes
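The binding itself is visible on the Pod: the scheduler fills in spec.nodeName once it has chosen a node. A minimal sketch (pod, node, and image names are made up):

apiVersion: v1
kind: Pod
metadata:
  name: web-1
spec:
  nodeName: node-a        # empty when the Pod is created; set by the scheduler at bind time
  containers:
  - name: web
    image: registry.example.com/web:1.0
    resources:
      requests:
        cpu: 1500m
        memory: 3.75Gi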

Slide 13

Mental model: Representing resources [diagram: Node, CPU, RAM]

Slide 14

Mental model: Representing resources [diagram: CPU, RAM]

Slide 15

Mental model: Representing resources [diagram: CPU, RAM]

Slide 16

Mental model: Representing resources [diagram: CPU, RAM]

Slide 17

Mental model: Representing resources [diagram: CPU, RAM]

Slide 18

Mental model: Representing resources [diagram: CPU, RAM, Available]

Slide 19

A more correct representation [diagram: CPU, RAM, Available]

Slide 20

Scheduling

Slide 21

Basic scheduling [diagram: pods of various CPU/RAM sizes being placed onto Node A and Node B]

Slide 22

Basic scheduling [diagram: pods of various CPU/RAM sizes being placed onto Node A and Node B]

Slide 23

Basic scheduling [diagram: pods of various CPU/RAM sizes being placed onto Node A and Node B]

Slide 24

Basic scheduling [diagram: pods of various CPU/RAM sizes being placed onto Node A and Node B]

Slide 25

Basic scheduling [diagram: pods of various CPU/RAM sizes being placed onto Node A and Node B]

Slide 26

Basic scheduling [diagram: pods of various CPU/RAM sizes being placed onto Node A and Node B]

Slide 27

Basic scheduling [diagram: pods of various CPU/RAM sizes being placed onto Node A and Node B]

Slide 28

Basic scheduling [diagram: pods of various CPU/RAM sizes being placed onto Node A and Node B]

Slide 29

Basic scheduling [diagram: pods of various CPU/RAM sizes being placed onto Node A and Node B]

Slide 30

Basic scheduling [diagram: pods of various CPU/RAM sizes being placed onto Node A and Node B]

Slide 31

Fragmentation [diagram: Node A and Node B, CPU and RAM; a Pod left Pending]

Slide 32

TODO: Optimizing rescheduler [diagram: Node A and Node B, CPU and RAM]

Slide 33

TODO: Optimizing rescheduler [diagram: Node A and Node B, CPU and RAM]

Slide 34

Stranded resources [diagram: Node A and Node B, CPU and RAM] Can’t be used: stranded!

Slide 35

Many people are still asking the wrong questions.

Slide 36

“How do I make sure my compute-intensive jobs don’t get scheduled on my database machine?” (Images by Connie Zhou)

Slide 37

“Why would I want multiple replicas on a node? I want to use ALL of the memory.” (Images by Connie Zhou)

Slide 38

“How do I save some machines for important work, and use the rest for batch?” (Images by Connie Zhou)

Slide 39

So... what should they be asking?

Slide 40

“How do I make sure my compute jobs can’t hurt my database job?”

Slide 41

Isolation

Slide 42

“How do I know how much memory and CPU my job needs?”

Slide 43

Sizing

Slide 44

“How do I safely pack more work onto fewer machines?”

Slide 45

Utilization

Slide 46

Isolation

Slide 47

Isolation
● Prevent apps from hurting each other
● Make sure you actually get what you paid for
● Kubernetes (and Docker) isolate CPU and memory
● Don’t handle things like memory bandwidth, disk time, cache, network bandwidth, ... (yet)
● Predictability at the extremes is paramount

Slide 48

When does isolation matter?
● Infinite loops
● Memory leaks
● Disk hogs
● Fork bombs
● Cache thrashing

Slide 49

Counter-measures
● Infinite loops: CPU shares and quota
● Memory leaks: OOM yourself
● Disk hogs: Quota
● Fork bombs: Process limits
● Cache thrashing: LLC jails, cache segments

Slide 50

Counter-measures: work to do
● Infinite loops: CPU shares and quota
● Memory leaks: OOM yourself
● Disk hogs: Quota
● Fork bombs: Process limits
● Cache thrashing: LLC jails, cache segments

Slide 51

Resource taxonomy
Compressible resources
● Hold no state
● Can be taken away very quickly
● “Merely” cause slowness when revoked
● e.g. CPU, disk time
Non-compressible resources
● Hold state
● Are slower to be taken away
● Can fail to be revoked
● e.g. Memory, disk space

Slide 52

Requests and limits
Request: amount of a resource allowed to be used, with a strong guarantee of availability
● CPU (seconds/second), RAM (bytes)
● Scheduler will not over-commit requests
Limit: max amount of a resource that can be used, regardless of guarantees
● Scheduler ignores limits
Repercussions:
● request < usage <= limit: resources might be available
● usage > limit: throttled or killed
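A hedged, concrete example of the request/limit split (the container name and image are made up): the scheduler reserves the requests, while the limits are the ceiling the container may burst up to.

apiVersion: v1
kind: Pod
metadata:
  name: burstable-example
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0
    resources:
      requests:          # what the scheduler reserves; guaranteed to be available
        cpu: 500m
        memory: 256Mi
      limits:            # hard ceiling; CPU is throttled above this, memory overage is OOM-killed
        cpu: "1"
        memory: 512Mi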

Slide 53

Quality of service
Guaranteed: highest protection
● limit == request
Burstable: medium protection
● request > 0 && limit > request
Best Effort: lowest protection
● request == 0
How is “protection” implemented?
● CPU: cgroup shares & quota
● Memory: OOM score + user-space evictions
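For example (a minimal sketch; names are made up), a Pod whose containers set requests equal to limits for both CPU and memory is classified as Guaranteed:

apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
  - name: db
    image: registry.example.com/db:1.0
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:            # equal to requests for every resource => QoS class "Guaranteed"
        cpu: "2"
        memory: 4Gi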

Slide 54

Requests and limits
Behavior at (or near) the limit depends on the particular resource
Compressible resources: throttle usage
● e.g. No more CPU time for you!
Non-compressible resources: reclaim
● e.g. Write back and reallocate dirty pages
● Failure means process death (OOM)
Being correct is more important than being optimal

Slide 55


Slide 56

Example: memory
1. Try to allocate, fail
2. Find some clean pages to release (consumes CPU)
3. Write back some dirty pages (consumes disk time)
4. If necessary, repeat this on another container
How long should this be allowed to take?
Really: this should be happening all the time
Coupled resources: reclaiming memory consumes CPU and disk time

Slide 57

What if I don’t specify?
● You get best-effort isolation
● You might get defaulted values
● You might get OOM killed randomly
● You might get CPU starved
● You might get no isolation at all
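Where do “defaulted values” come from? Typically a LimitRange in the namespace; a minimal sketch (the numbers are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
spec:
  limits:
  - type: Container
    defaultRequest:      # applied when a container specifies no request
      cpu: 100m
      memory: 128Mi
    default:             # applied when a container specifies no limit
      cpu: 500m
      memory: 512Mi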

Slide 58

Sizing

Slide 59

Sizing
How many replicas does my job need?
How much CPU/RAM does my job need?
Do I provision for worst-case?
● Expensive, wasteful
Do I provision for average case?
● High failure rate (e.g. OOM)
Benchmark it!

Slide 60

Benchmarks are hard.

Slide 61

Benchmarks are hard. Accurate benchmarks are VERY hard.

Slide 62

Horizontal scaling
● Add more replicas
● Easy to reason about
● Works well when combined with resource isolation
● Having >1 replica per node makes sense
● Not always applicable
● e.g. Memory use scales with cluster size
● HorizontalPodAutoscaler ...
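A hedged sketch of a HorizontalPodAutoscaler in its autoscaling/v1 form (the target Deployment name and thresholds are made up):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70   # add replicas when average CPU usage exceeds 70% of requests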

Slide 63

What can we do?
Horizontal scaling is not enough
Resource needs change over time
If only we had an “autopilot” mode...
● Collect stats & build a model
● Predict and react
● Manage Pods, Deployments, Jobs
● Try to stay ahead of the spikes

Slide 64

Autopilot in Borg
Most Borg users use autopilot
See earlier statement regarding benchmarks - even at Google
The Kubernetes API is purpose-built for this sort of use case
We need a VerticalPodAutoscaler
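A VerticalPodAutoscaler did not exist when this talk was given; one later grew up in the kubernetes/autoscaler project. A hedged sketch of roughly what that object looks like there (the target name is made up):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: Auto    # apply the recommendations, rather than only reporting them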

Slide 65

Utilization

Slide 66

Utilization
Resources cost money
Wasted resources == wasted money
You don’t just want, you NEED to use as much of your capacity as possible
Selling it is not the same as using it

Slide 67

What is YOUR average utilization?

Slide 68

How can we do better?
Utilization demands isolation
● If you want to push the limits, it has to be safe at the extremes
People are inherently cautious
● Provision for the 90%-99% case
VPA & strong isolation should give enough confidence to provision more tightly
We need to do some kernel work here

Slide 69

Some lessons from Borg
Priority
● Low-priority jobs get paused/killed in favor of high-priority jobs
Quota
● If everyone is important, nobody is important
Overcommit
● Hedge against rare events with lower QoS/SLA for some work
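Kubernetes’ closest analogue to Borg quota is the ResourceQuota object, which caps what a namespace may request in aggregate; a minimal sketch (namespace and numbers are made up):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"       # total CPU this namespace may request
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "100"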

Slide 70

Overcommit
Build a model of recent real usage per container
The delta between request and reality is idle -- resell it with a lower SLA
● First-tier apps can always get what they paid for - kill second-tier apps
Use stats to decide how aggressive to be
Let the priority system deal with the debris

Slide 71

Siren’s song: over-packing
Clusters need some room to operate
● Nodes fail or get upgraded
As you approach 100% bookings (requests), consider what happens when things go bad
● Nowhere to squeeze the toothpaste!
Plan for some idle capacity - it will save your bacon one day
● Priorities & rescheduling can make this less expensive

Slide 72

Wrapping up

Slide 73

WARNING: Some of this presentation was aspirational!

Slide 74

We still have a LONG WAY to go. Fortunately, this is a path we’ve been down before.

Slide 75

Kubernetes is Open
https://kubernetes.io
Code: github.com/kubernetes/kubernetes
Chat: slack.k8s.io
Twitter: @kubernetesio
open community, open design, open source, open to ideas