Everything You Ever Wanted To Know About Resource Scheduling... Almost

Tim Hockin
November 09, 2016

An overview of resource scheduling, and an in-depth look at some of the "you're asking the wrong question" situations in Kubernetes.

Presented at KubeCon 2016, Seattle.

Transcript

  1. 1.

    Google Cloud Platform

    Everything You Ever Wanted To Know About Resource Scheduling... Almost
    Tim Hockin <thockin@google.com>
    Senior Staff Software Engineer, Google
    @thockin
  2. 2.

    Who is thockin?

    Founding member of Kubernetes team
    Focused on storage, networking, node and other “infrastructure” things
    In my time at Google:
    - Worked on Borg & Omega
    - Machine management & monitoring
    - BIOS & Linux kernel

    “I reserve the right to have an opinion, regardless of how wrong I probably am.”
  3. 6.

    • CPU
    • Memory
    • Disk space
    • Disk time
    • Disk “spindles”
  4. 7.

    • CPU
    • Memory
    • Disk space
    • Disk time
    • Disk “spindles”
    • Network bandwidth
    • Host ports
  5. 8.

    • CPU
    • Memory
    • Disk space
    • Disk time
    • Disk “spindles”
    • Network bandwidth
    • Host ports
    • Cache lines
    • Memory bandwidth
    • IP addresses
    • Attached storage
    • PIDs
    • GPUs
    • Power
  6. 9.

    • CPU
    • Memory
    • Disk space
    • Disk time
    • Disk “spindles”
    • Network bandwidth
    • Host ports
    • Arbitrary, opaque third-party resources we can’t possibly predict
    • Cache lines
    • Memory bandwidth
    • IP addresses
    • Attached storage
    • PIDs
    • GPUs
    • Power
  7. 10.

    Mental model: Nodes produce capacity

        apiVersion: v1
        kind: Node
        status:
          capacity:
            cpu: "4"
            memory: 32788388Ki
  8. 11.

    Mental model: Pods consume capacity

        apiVersion: v1
        kind: Pod
        spec:
          containers:
          - resources:
              requests:
                cpu: 1500m
                memory: 3.75Gi
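
The "nodes produce, pods consume" model can be sketched in a few lines of Python. This is only an illustration, not how the Kubernetes scheduler is implemented: `parse_quantity` and `fits` are hypothetical helpers, and the suffix table covers a simplified subset of the quantity notation used in the two manifests above.

```python
from decimal import Decimal

# Simplified subset of quantity suffixes (assumption: not the full grammar).
_SUFFIXES = {
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"),  # milli, e.g. 1500m CPU == 1.5 cores
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity string such as '1500m' or '3.75Gi' to a plain number."""
    # Try two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def fits(capacity: dict, requests: dict) -> bool:
    """True if every requested resource is within the node's capacity."""
    return all(
        parse_quantity(amount) <= parse_quantity(capacity.get(resource, "0"))
        for resource, amount in requests.items()
    )


# Values taken from the Node and Pod manifests on the two slides above.
node_capacity = {"cpu": "4", "memory": "32788388Ki"}
pod_requests = {"cpu": "1500m", "memory": "3.75Gi"}
pod_fits = fits(node_capacity, pod_requests)
```

A real scheduler would subtract the requests of pods already placed on the node before comparing; this sketch only checks one pod against an empty node.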
  9. 22.
  10. 23.

    [Diagram: Basic scheduling. Node A and Node B, each with CPU and RAM axes. Labels: 6, 4, 3, 4.]
  11. 24.

    [Diagram: Basic scheduling. Node A and Node B, each with CPU and RAM axes. Labels: 6, 4, 3, 4.]
  12. 25.

    [Diagram: Basic scheduling. Node A and Node B, each with CPU and RAM axes. Labels: 6, 4, 3, 4, 5, 3.]
  13. 26.

    [Diagram: Basic scheduling. Node A and Node B, each with CPU and RAM axes. Labels: 5, 3, 6, 4, 3, 4, and “?”.]
  14. 27.

    [Diagram: Basic scheduling. Node A and Node B, each with CPU and RAM axes. Labels: 6, 4, 3, 4, 5, 3.]
  15. 28.

    [Diagram: Basic scheduling. Node A and Node B, each with CPU and RAM axes. Labels: 6, 4, 3, 5, 3, 4, 5, 3.]
  16. 29.

    [Diagram: Basic scheduling. Node A and Node B, each with CPU and RAM axes. Labels: 6, 4, 3, 5, 3, 5, 3, 4, and “?”.]
  17. 30.

    [Diagram: Basic scheduling. Node A and Node B, each with CPU and RAM axes. Labels: 6, 4, 3, 5, 3, 5, 3, 4, and “?”.]
  18. 31.

    [Diagram: Fragmentation. Node A and Node B, each with CPU and RAM axes. Labels: 6, 4, 3, 5, 3, 4, and “Pending”.]
  19. 32.

    [Diagram: TODO: Optimizing rescheduler. Node A and Node B, each with CPU and RAM axes. Labels: 5, 3, 6, 4, 3, 4.]
  20. 33.

    [Diagram: TODO: Optimizing rescheduler. Node A and Node B, each with CPU and RAM axes. Labels: 5, 3, 5, 3, 6, 4, 3, 4.]
  21. 34.

    [Diagram: Stranded resources. Node A and Node B, each with CPU and RAM axes. Labels: 5, 3, 5, 3, 6, 4, 3, 4. Caption: “Can’t be used: stranded!”]
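
The scheduling, fragmentation and stranded-resource slides can be reproduced with a toy first-fit scheduler. This is a sketch under stated assumptions: the node sizes (8 CPU, 8 RAM) and pod sizes are made up for illustration and are not read off the diagrams, and first-fit is a stand-in for whatever placement policy a real scheduler applies.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu: int  # allocatable CPU, arbitrary units (hypothetical)
    ram: int  # allocatable RAM, arbitrary units (hypothetical)
    pods: list = field(default_factory=list)

    def free(self):
        """Return (free CPU, free RAM) after subtracting placed pods."""
        used_cpu = sum(cpu for _, cpu, _ in self.pods)
        used_ram = sum(ram for _, _, ram in self.pods)
        return self.cpu - used_cpu, self.ram - used_ram


def schedule(pod, nodes):
    """First-fit: place pod (name, cpu, ram) on the first node with room."""
    _, cpu, ram = pod
    for node in nodes:
        free_cpu, free_ram = node.free()
        if cpu <= free_cpu and ram <= free_ram:
            node.pods.append(pod)
            return node.name
    return None  # no single node has room: the pod stays Pending


nodes = [Node("A", cpu=8, ram=8), Node("B", cpu=8, ram=8)]
pods = [("p1", 5, 3), ("p2", 5, 3), ("p3", 5, 3)]
placements = [schedule(pod, nodes) for pod in pods]

# Fragmentation: p3 is Pending although the cluster has 6 free CPU in total,
# because no single node has 5 free.
total_free_cpu = sum(node.free()[0] for node in nodes)

# Stranding: a small CPU-heavy pod uses up node A's last CPU,
# leaving RAM on A that nothing needing CPU can ever use.
schedule(("p4", 3, 1), nodes)
stranded_on_a = nodes[0].free()
```

Moving pods between nodes, as the "optimizing rescheduler" slides suggest, is what would be needed to make room for `p3`; the sketch does not attempt it.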
  22. 36.

    “How do I make sure my compute-intensive jobs don’t get scheduled on my database machine?”

    Images by Connie Zhou
  23. 37.

    “Why would I want multiple replicas on a node? I want to use ALL of the memory.”

    Images by Connie Zhou
  24. 38.

    “How do I save some machines for important work, and use the rest for batch?”

    Images by Connie Zhou
  25. 40.

    “How do I make sure my compute jobs can’t hurt my database job?”
  26. 47.

    Isolation

    Prevent apps from hurting each other
    Make sure you actually get what you paid for
    Kubernetes (and Docker) isolate CPU and memory
    Don’t handle things like memory bandwidth, disk time, cache, network bandwidth, ... (yet)
    Predictability at the extremes is paramount
  27. 48.
  28. 49.

    Counter-measures

    Infinite loops: CPU shares and quota
    Memory leaks: OOM yourself
    Disk hogs: Quota
    Fork bombs: Process limits
    Cache thrashing: LLC jails, cache segments
  29. 50.

    Counter-measures: work to do

    Infinite loops: CPU shares and quota
    Memory leaks: OOM yourself
    Disk hogs: Quota
    Fork bombs: Process limits
    Cache thrashing: LLC jails, cache segments
  30. 51.

    Resource taxonomy

    Compressible resources
      • Hold no state
      • Can be taken away very quickly
      • “Merely” cause slowness when revoked
      • e.g. CPU, disk time
    Non-compressible resources
      • Hold state
      • Are slower to be taken away
      • Can fail to be revoked
      • e.g. Memory, disk space
  31. 52.

    Requests and limits

    Request: amount of a resource allowed to be used, with a strong guarantee of availability
      • CPU (seconds/second), RAM (bytes)
      • Scheduler will not over-commit requests
    Limit: max amount of a resource that can be used, regardless of guarantees
      • scheduler ignores limits
    Repercussions:
      • request < usage <= limit: resources might be available
      • usage > limit: throttled or killed

    [Diagram labels: CPU, 1. 5, Limit]
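
The "repercussions" rules on this slide can be written as a small decision function. It is a minimal sketch: `usage_outcome` and its return strings are hypothetical names, and the choice between throttling and killing is borrowed from the compressible/non-compressible split in the resource taxonomy slide.

```python
def usage_outcome(request: float, usage: float, limit: float, compressible: bool) -> str:
    """Describe what a container can expect at a given usage level."""
    if usage <= request:
        # Within the request: strong guarantee of availability.
        return "guaranteed"
    if usage <= limit:
        # request < usage <= limit: resources might be available.
        return "might be available"
    # usage > limit: compressible resources (CPU) are throttled,
    # non-compressible ones (memory) can end in the process being killed.
    return "throttled" if compressible else "killed"


cpu_burst = usage_outcome(request=1.0, usage=1.2, limit=1.5, compressible=True)
cpu_over = usage_outcome(request=1.0, usage=2.0, limit=1.5, compressible=True)
mem_over = usage_outcome(request=2.0, usage=3.0, limit=2.5, compressible=False)
```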
  32. 53.

    Quality of service

    Guaranteed: highest protection
      • limit == request
    Burstable: medium protection
      • request > 0 && limit > request
    Best Effort: lowest protection
      • request == 0
    How is “protection” implemented?
      • CPU: cgroup shares & quota
      • Memory: OOM score + user-space evictions

    [Diagram labels: CPU, 1. 5, Limit]
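
The three conditions on this slide translate directly into a classifier. This sketch follows the slide's simplified single-resource rules only; `qos_class` is a hypothetical helper, and the real Kubernetes classification looks at every container and resource in a pod.

```python
def qos_class(request: float, limit: float) -> str:
    """Classify one resource's request/limit pair using the slide's rules."""
    if request < 0 or limit < request:
        raise ValueError("need 0 <= request <= limit")
    if request == 0:
        return "BestEffort"   # lowest protection
    if limit == request:
        return "Guaranteed"   # highest protection
    return "Burstable"        # request > 0 and limit > request


classes = [qos_class(0, 2), qos_class(1, 1), qos_class(1, 2)]
```

Checking `request == 0` first is a design choice made here so that a pair of zeros counts as Best Effort rather than Guaranteed.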
  33. 54.

    Requests and limits

    Behavior at (or near) the limit depends on the particular resource
    Compressible resources: throttle usage
      • e.g. No more CPU time for you!
    Non-compressible resources: reclaim
      • e.g. Write-back and reallocate dirty pages
      • Failure means process death (OOM)
    Being correct is more important than optimal

    [Diagram labels: CPU, 1. 5, Limit]
  34. 56.

    Example: memory

    1. Try to allocate, fail
    2. Find some clean pages to release (consumes CPU)
    3. Write-back some dirty pages (consumes disk time)
    4. If necessary, repeat this on another container

    How long should this be allowed to take?
    Really: this should be happening all the time
    Coupled resources
  35. 57.

    What if I don’t specify?

    You get best-effort isolation
    You might get defaulted values
    You might get OOM killed randomly
    You might get CPU starved
    You might get no isolation at all
  36. 59.

    Sizing

    How many replicas does my job need?
    How much CPU/RAM does my job need?
    Do I provision for worst-case?
      • Expensive, wasteful
    Do I provision for average case?
      • High failure rate (e.g. OOM)
    Benchmark it!
  37. 62.

    Horizontal scaling

    Add more replicas
    Easy to reason about
    Works well when combined with resource isolation
      • Having >1 replica per node makes sense
    Not always applicable
      • e.g. Memory use scales with cluster size
    HorizontalPodAutoscaler ...
  38. 63.

    What can we do?

    Horizontal scaling is not enough
    Resource needs change over time
    If only we had an “autopilot” mode...
      • Collect stats & build a model
      • Predict and react
      • Manage Pods, Deployments, Jobs
      • Try to stay ahead of the spikes
  39. 64.

    Autopilot in Borg

    Most Borg users use autopilot
    See earlier statement regarding benchmarks - even at Google
    Kubernetes API is purpose-built for this sort of use-case
    We need a VerticalPodAutoscaler
  40. 66.

    Utilization

    Resources cost money
    Wasted resources == wasted money
    You NEED (not just “want”) to use as much of your capacity as possible
    Selling it is not the same as using it
  41. 68.

    How can we do better?

    Utilization demands isolation
      • If you want to push the limits, it has to be safe at the extremes
    People are inherently cautious
      • Provision for 90%-99% case
    VPA & strong isolation should give enough confidence to provision more tightly
    We need to do some kernel work, here
  42. 69.

    Some lessons from Borg

    Priority
      • Low-priority jobs get paused/killed in favor of high-priority jobs
    Quota
      • If everyone is important, nobody is important
    Overcommit
      • Hedge against rare events with lower QoS/SLA for some work
  43. 70.

    Overcommit

    Build a model of recent real usage per-container
    The delta between request and reality is idle -- resell it with a lower SLA
      • First-tier apps can always get what they paid for - kill second-tier apps
    Use stats to decide how aggressive to be
    Let the priority system deal with the debris
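
One very simple "model of recent real usage" is a high percentile of recent samples; the gap between that and the request is what could be resold at a lower SLA. This is an assumption made for illustration, not a description of what Borg does: `percentile` and `resellable` are hypothetical helpers, and the sample values are invented.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def resellable(request, usage_samples, pct=99):
    """Capacity (same units as request) that looks idle and could be resold.

    A higher pct is more conservative; a lower pct is more aggressive
    and leans harder on the priority system to clean up afterwards.
    """
    predicted = percentile(usage_samples, pct)
    return max(0, request - predicted)


# Hypothetical container: requests 2000 millicores, rarely uses more than 1200.
recent_usage = [500, 600, 700, 1200]
conservative = resellable(2000, recent_usage, pct=99)
aggressive = resellable(2000, recent_usage, pct=50)
```

The `pct` parameter is where "use stats to decide how aggressive to be" shows up in this sketch.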
  44. 71.

    Siren’s song: over-packing

    Clusters need some room to operate
      • Nodes fail or get upgraded
    As you approach 100% bookings (requests), consider what happens when things go bad
      • Nowhere to squeeze the toothpaste!
    Plan for some idle capacity - it will save your bacon one day
      • Priorities & rescheduling can make this less expensive
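
A rough way to check for "room to operate" is to ask whether total requests would still fit if the largest node disappeared. This is a deliberately crude sketch: `booking_ratio` and `survives_node_loss` are hypothetical helpers, the numbers are invented, and the check ignores fragmentation, so passing it is necessary but not sufficient.

```python
def booking_ratio(node_capacities, total_requests):
    """Fraction of cluster capacity that is booked by requests."""
    return total_requests / sum(node_capacities)


def survives_node_loss(node_capacities, total_requests):
    """True if requests still fit in aggregate after losing the largest node."""
    if not node_capacities:
        return False
    remaining = sum(node_capacities) - max(node_capacities)
    return total_requests <= remaining


# Hypothetical cluster: three 16-CPU nodes.
cluster = [16, 16, 16]
roomy = survives_node_loss(cluster, total_requests=30)
over_packed = survives_node_loss(cluster, total_requests=40)
```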
  45. 74.

    We still have a LONG WAY to go.

    Fortunately, this is a path we’ve been down before.
  46. 75.

    Kubernetes is Open

    https://kubernetes.io
    Code: github.com/kubernetes/kubernetes
    Chat: slack.k8s.io
    Twitter: @kubernetesio

    open community
    open design
    open source
    open to ideas