Everything You Ever Wanted To Know About Resource Scheduling... Almost

Tim Hockin
November 09, 2016

An overview of resource scheduling, and an in-depth look at some of the "you're asking the wrong question" situations in Kubernetes.

Presented at KubeCon 2016, Seattle.

Transcript

  1. Google Cloud Platform
    Everything You Ever Wanted To Know About Resource Scheduling... Almost
    Tim Hockin
    Senior Staff Software Engineer, Google
    @thockin

  2. Google Cloud Platform
    Who is thockin?
    Founding member of Kubernetes team
    Focused on storage, networking, node and other “infrastructure” things
    In my time at Google:
    - Worked on Borg & Omega
    - Machine management & monitoring
    - BIOS & Linux kernel
    “I reserve the right to have an opinion, regardless of how wrong I probably am.”

  3. Google Cloud Platform
    WARNING: Some of this presentation is aspirational!

  4. Google Cloud Platform
    I posit:
    Kubernetes is fundamentally ABOUT resource management

  5. Google Cloud Platform
    ● CPU
    ● Memory

  6. Google Cloud Platform
    ● CPU
    ● Memory
    ● Disk space
    ● Disk time
    ● Disk “spindles”

  7. Google Cloud Platform
    ● CPU
    ● Memory
    ● Disk space
    ● Disk time
    ● Disk “spindles”
    ● Network bandwidth
    ● Host ports

  8. Google Cloud Platform
    ● CPU
    ● Memory
    ● Disk space
    ● Disk time
    ● Disk “spindles”
    ● Network bandwidth
    ● Host ports
    ● Cache lines
    ● Memory bandwidth
    ● IP addresses
    ● Attached storage
    ● PIDs
    ● GPUs
    ● Power

  9. Google Cloud Platform
    ● CPU
    ● Memory
    ● Disk space
    ● Disk time
    ● Disk “spindles”
    ● Network bandwidth
    ● Host ports
    ● Arbitrary, opaque third-party resources we can’t possibly predict
    ● Cache lines
    ● Memory bandwidth
    ● IP addresses
    ● Attached storage
    ● PIDs
    ● GPUs
    ● Power

  10. Google Cloud Platform
    Mental model: Nodes produce capacity
    apiVersion: v1
    kind: Node
    status:
      capacity:
        cpu: "4"
        memory: 32788388Ki

  11. Google Cloud Platform
    Mental model: Pods consume capacity
    apiVersion: v1
    kind: Pod
    spec:
      containers:
      - resources:
          requests:
            cpu: 1500m
            memory: 3.75Gi
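
A minimal Python sketch of the mental model so far: a Pod fits on a Node when each of its requests is no larger than what the Node has free. The parser covers only a handful of quantity suffixes, and `parse_quantity` and `fits` are illustrative names, not Kubernetes APIs. The real scheduler weighs far more than this.

```python
from decimal import Decimal

# A subset of the suffixes used in Kubernetes resource quantities.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"), "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
}


def parse_quantity(text):
    """Convert a string such as "1500m" or "3.75Gi" to a plain number."""
    # Try two-letter suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def fits(requests, free):
    """True if every requested resource is within the free amount."""
    return all(
        parse_quantity(amount) <= parse_quantity(free.get(name, "0"))
        for name, amount in requests.items()
    )


node_capacity = {"cpu": "4", "memory": "32788388Ki"}
pod_requests = {"cpu": "1500m", "memory": "3.75Gi"}
print(fits(pod_requests, node_capacity))
```

With the values from the two slides, 1.5 CPUs and 3.75Gi fit comfortably inside 4 CPUs and roughly 31Gi.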

  12. Google Cloud Platform
    Mental model: Scheduler binds Pods to Nodes

  13. Google Cloud Platform
    Mental model: Representing resources
    [Diagram: a Node with CPU and RAM axes]

  14. Google Cloud Platform
    Mental model: Representing resources
    [Diagram: CPU and RAM axes]

  15. Google Cloud Platform
    Mental model: Representing resources
    [Diagram: CPU and RAM axes]

  16. Google Cloud Platform
    Mental model: Representing resources
    [Diagram: CPU and RAM axes]

  17. Google Cloud Platform
    Mental model: Representing resources
    [Diagram: CPU and RAM axes]

  18. Google Cloud Platform
    Mental model: Representing resources
    [Diagram: CPU and RAM axes; label: Available]

  19. Google Cloud Platform
    A more correct representation
    [Diagram: CPU and RAM axes; label: Available]

  20. Google Cloud Platform
    Scheduling

  21. Google Cloud Platform
    Basic scheduling
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 6, 4]

  22. Google Cloud Platform
    Basic scheduling
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 6, 4]

  23. Google Cloud Platform
    Basic scheduling
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 6, 4, 3, 4]

  24. Google Cloud Platform
    Basic scheduling
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 6, 4, 3, 4]

  25. Google Cloud Platform
    Basic scheduling
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 6, 4, 3, 4, 5, 3]

  26. Google Cloud Platform
    Basic scheduling
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 5, 3, 6, 4, 3, 4, ?]

  27. Google Cloud Platform
    Basic scheduling
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 6, 4, 3, 4, 5, 3]

  28. Google Cloud Platform
    Basic scheduling
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 6, 4, 3, 5, 3, 4, 5, 3]

  29. Google Cloud Platform
    Basic scheduling
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 6, 4, 3, 5, 3, 5, 3, 4, ?]

  30. Google Cloud Platform
    Basic scheduling
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 6, 4, 3, 5, 3, 5, 3, 4, ?]

  31. Google Cloud Platform
    Fragmentation
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 6, 4, 3, 5, 3, 4, Pending]

  32. Google Cloud Platform
    TODO: Optimizing rescheduler
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 5, 3, 6, 4, 3, 4]

  33. Google Cloud Platform
    TODO: Optimizing rescheduler
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 5, 3, 5, 3, 6, 4, 3, 4]

  34. Google Cloud Platform
    Stranded resources
    [Diagram: Node A and Node B, each with CPU and RAM axes; labels: 5, 3, 5, 3, 6, 4, 3, 4]
    Can’t be used: stranded!
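
Fragmentation and stranding can be reproduced with a toy first-fit scheduler. This is only a sketch: the node sizes (8 CPU, 8 RAM) and the fourth pod are made up, reading the slide labels as (CPU, RAM) pairs is an assumption, and first-fit is a stand-in rather than the Kubernetes scheduler's actual policy.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu: int  # free CPU
    ram: int  # free RAM
    pods: list = field(default_factory=list)


def schedule(pods, nodes):
    """Bind each (name, cpu, ram) pod to the first node with room.

    Returns the names of pods that could not be placed.
    """
    pending = []
    for name, cpu, ram in pods:
        for node in nodes:
            if cpu <= node.cpu and ram <= node.ram:
                node.cpu -= cpu
                node.ram -= ram
                node.pods.append(name)
                break
        else:
            pending.append(name)
    return pending


nodes = [Node("A", 8, 8), Node("B", 8, 8)]
pods = [("p1", 6, 4), ("p2", 3, 4), ("p3", 5, 3), ("p4", 2, 5)]
pending = schedule(pods, nodes)

print("pending:", pending)
for node in nodes:
    print(node.name, "free cpu:", node.cpu, "free ram:", node.ram)
```

In this run the cluster as a whole still has 2 CPU and 5 RAM free, exactly what p4 asks for, but no single node has both, so p4 stays pending. Node B's last unit of RAM is stranded because the node has no CPU left to go with it.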

  35. Google Cloud Platform
    Many people are still asking the wrong questions.

  36. Google Cloud Platform
    “How do I make sure my compute-intensive jobs don’t get scheduled on my database machine?”
    Images by Connie Zhou

  37. Google Cloud Platform
    “Why would I want multiple replicas on a node? I want to use ALL of the memory.”
    Images by Connie Zhou

  38. Google Cloud Platform
    “How do I save some machines for important work, and use the rest for batch?”
    Images by Connie Zhou

  39. Google Cloud Platform
    So... what should they be asking?

  40. Google Cloud Platform
    “How do I make sure my compute jobs can’t hurt my database job?”

  41. Google Cloud Platform
    Isolation

  42. Google Cloud Platform
    “How do I know how much memory and CPU my job needs?”

  43. Google Cloud Platform
    Sizing

  44. Google Cloud Platform
    “How do I safely pack more work onto fewer machines?”

  45. Google Cloud Platform
    Utilization

  46. Google Cloud Platform
    Isolation

  47. Google Cloud Platform
    Isolation
    Prevent apps from hurting each other
    Make sure you actually get what you paid for
    Kubernetes (and Docker) isolate CPU and memory
    Don’t handle things like memory bandwidth, disk time, cache, network bandwidth, ... (yet)
    Predictability at the extremes is paramount

  48. Google Cloud Platform
    When does isolation matter?
    Infinite loops
    Memory leaks
    Disk hogs
    Fork bombs
    Cache thrashing

  49. Google Cloud Platform
    Counter-measures
    Infinite loops: CPU shares and quota
    Memory leaks: OOM yourself
    Disk hogs: Quota
    Fork bombs: Process limits
    Cache thrashing: LLC jails, cache segments

  50. Google Cloud Platform
    Counter-measures: work to do
    Infinite loops: CPU shares and quota
    Memory leaks: OOM yourself
    Disk hogs: Quota
    Fork bombs: Process limits
    Cache thrashing: LLC jails, cache segments

  51. Google Cloud Platform
    Resource taxonomy
    Compressible resources
    ● Hold no state
    ● Can be taken away very quickly
    ● “Merely” cause slowness when revoked
    ● e.g. CPU, disk time
    Non-compressible resources
    ● Hold state
    ● Are slower to be taken away
    ● Can fail to be revoked
    ● e.g. Memory, disk space

  52. Google Cloud Platform
    Requests and limits
    Request: amount of a resource allowed to be used, with a strong guarantee of availability
    ● CPU (seconds/second), RAM (bytes)
    ● Scheduler will not over-commit requests
    Limit: max amount of a resource that can be used, regardless of guarantees
    ● Scheduler ignores limits
    Repercussions:
    ● request < usage <= limit: resources might be available
    ● usage > limit: throttled or killed
    [Figure labels: CPU, 1.5, Limit]
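
The repercussions on this slide amount to a three-way comparison. A small Python sketch, assuming a single resource measured in one unit; `usage_outcome` is a hypothetical helper, not part of any Kubernetes API.

```python
def usage_outcome(usage, request, limit):
    """Map current usage to the outcome listed on the slide."""
    if usage <= request:
        # Within the request: availability is strongly guaranteed.
        return "guaranteed"
    if usage <= limit:
        # Between request and limit: only if the node has spare capacity.
        return "might be available"
    # Past the limit: throttled (e.g. CPU) or killed (e.g. memory).
    return "throttled or killed"


# Hypothetical container with a CPU request of 1.0 and a limit of 1.5.
for cpu in (0.8, 1.2, 2.0):
    print(cpu, "->", usage_outcome(cpu, request=1.0, limit=1.5))
```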

  53. Google Cloud Platform
    Quality of service
    Guaranteed: highest protection
    ● limit == request
    Burstable: medium protection
    ● request > 0 && limit > request
    Best Effort: lowest protection
    ● request == 0
    How is “protection” implemented?
    ● CPU: cgroup shares & quota
    ● Memory: OOM score + user-space evictions
    [Figure labels: CPU, 1.5, Limit]
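
The three classes can be written as a short decision function. This sketch follows the slide's simplified rules for a single request/limit pair; the names are illustrative, and Kubernetes itself derives the class from every container and resource in a Pod.

```python
def qos_class(request, limit):
    """Classify one request/limit pair using the rules on the slide."""
    if request == 0:
        return "BestEffort"   # lowest protection
    if limit == request:
        return "Guaranteed"   # highest protection
    if limit > request:
        return "Burstable"    # medium protection
    raise ValueError("limit must not be smaller than request")


print(qos_class(request=1.0, limit=1.0))
print(qos_class(request=1.0, limit=1.5))
print(qos_class(request=0, limit=1.5))
```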

  54. Google Cloud Platform
    Requests and limits
    Behavior at (or near) the limit depends on the particular resource
    Compressible resources: throttle usage
    ● e.g. No more CPU time for you!
    Non-compressible resources: reclaim
    ● e.g. Write-back and reallocate dirty pages
    ● Failure means process death (OOM)
    Being correct is more important than being optimal
    [Figure labels: CPU, 1.5, Limit]
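
As a rough illustration of how the resource type decides what happens at the limit, here is a sketch in Python. The resource names come from the taxonomy slide; the function and its return strings are invented for this example.

```python
COMPRESSIBLE = {"cpu", "disk_time"}
NON_COMPRESSIBLE = {"memory", "disk_space"}


def at_limit(resource, reclaim_succeeded=True):
    """Return the action taken when a container hits its limit."""
    if resource in COMPRESSIBLE:
        # No state is held, so usage can simply be slowed down.
        return "throttle"
    if resource in NON_COMPRESSIBLE:
        # State is held: try to reclaim, and kill if that fails.
        return "reclaim" if reclaim_succeeded else "kill"
    raise KeyError(resource)


print(at_limit("cpu"))
print(at_limit("memory"))
print(at_limit("memory", reclaim_succeeded=False))
```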

  55. Google Cloud Platform

  56. Google Cloud Platform
    Example: memory
    1. Try to allocate, fail
    2. Find some clean pages to release (consumes CPU)
    3. Write-back some dirty pages (consumes disk time)
    4. If necessary, repeat this on another container
    How long should this be allowed to take?
    Really: this should be happening all the time
    Coupled resources
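
The four steps above can be sketched as a toy simulation. It assumes pages are simple counters and that the allocating container is tried first. Real kernel reclaim is far more involved, and `Container` and `reclaim` are names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    clean_pages: int
    dirty_pages: int


def reclaim(needed, containers):
    """Free pages one container at a time, clean pages before dirty ones.

    Returns (pages_freed, log). Fewer pages freed than needed means the
    allocation still fails, which on the slide ends in an OOM kill.
    """
    freed, log = 0, []
    for c in containers:
        # Clean pages only cost CPU; dirty pages also cost disk time.
        for kind in ("clean_pages", "dirty_pages"):
            if freed >= needed:
                break
            take = min(getattr(c, kind), needed - freed)
            if take:
                setattr(c, kind, getattr(c, kind) - take)
                freed += take
                log.append((c.name, kind, take))
    return freed, log


containers = [Container("self", 3, 4), Container("other", 5, 5)]
freed, log = reclaim(10, containers)
print(freed, log)
```

The log makes the coupling visible: satisfying one memory allocation spent CPU on clean pages, disk time on dirty pages, and finally reached into another container.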

  57. Google Cloud Platform
    What if I don’t specify?
    You get best-effort isolation
    You might get defaulted values
    You might get OOM killed randomly
    You might get CPU starved
    You might get no isolation at all

  58. Google Cloud Platform
    Sizing

  59. Google Cloud Platform
    Sizing
    How many replicas does my job need?
    How much CPU/RAM does my job need?
    Do I provision for worst-case?
    ● Expensive, wasteful
    Do I provision for average case?
    ● High failure rate (e.g. OOM)
    Benchmark it!
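
To make the worst-case versus average-case trade-off concrete, here is a sketch over invented usage samples. The numbers are hypothetical and the nearest-rank percentile is just one reasonable choice.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # ceiling division
    return ordered[rank - 1]


def fraction_over(samples, provisioned):
    """Share of samples that would have exceeded the provisioned amount."""
    return sum(1 for s in samples if s > provisioned) / len(samples)


# Hypothetical memory usage samples, in MiB.
usage = [300, 350, 400, 450, 500, 550, 600, 650, 700, 1200]
average = sum(usage) / len(usage)

options = [
    ("average", average),
    ("p90", percentile(usage, 90)),
    ("worst case", max(usage)),
]
for label, provisioned in options:
    over = fraction_over(usage, provisioned)
    idle = provisioned - average
    print(label, provisioned, "exceeded:", over, "idle on average:", idle)
```

Provisioning for the average is exceeded by 40% of these samples, while provisioning for the worst case is never exceeded but leaves 630 MiB idle on average.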

  60. Google Cloud Platform
    Benchmarks are hard.

  61. Google Cloud Platform
    Benchmarks are hard.
    Accurate benchmarks are VERY hard.

  62. Google Cloud Platform
    Horizontal scaling
    Add more replicas
    Easy to reason about
    Works well when combined with resource isolation
    ● Having >1 replica per node makes sense
    Not always applicable
    ● e.g. Memory use scales with cluster size
    HorizontalPodAutoscaler
    ...
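
The core of horizontal autoscaling is a proportional rule: scale the replica count by the ratio of observed to target utilization, then round up. The sketch below follows the formula given in the Kubernetes HorizontalPodAutoscaler documentation and leaves out the controller's tolerance and stabilization behavior. The function name and the default bounds are assumptions.

```python
import math


def desired_replicas(current_replicas, current_utilization,
                     target_utilization, min_replicas=1, max_replicas=10):
    """ceil(current * observed / target), clamped to [min, max]."""
    raw = math.ceil(
        current_replicas * current_utilization / target_utilization
    )
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas running at 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60))
```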

  63. Google Cloud Platform
    What can we do?
    Horizontal scaling is not enough
    Resource needs change over time
    If only we had an “autopilot” mode...
    ● Collect stats & build a model
    ● Predict and react
    ● Manage Pods, Deployments, Jobs
    ● Try to stay ahead of the spikes
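
A very small sketch of the "collect stats, build a model, predict" loop: keep a sliding window of usage samples and recommend the window's peak plus a margin. This is an illustration only, not Borg's autopilot and not any Kubernetes component. The class name, window size and margin are all assumptions.

```python
from collections import deque


class Recommender:
    """Keeps a sliding window of usage samples and suggests a request."""

    def __init__(self, window=3, margin=0.25):
        # A bounded deque drops the oldest sample automatically.
        self.samples = deque(maxlen=window)
        self.margin = margin

    def observe(self, usage):
        self.samples.append(usage)

    def recommend(self):
        """Peak of the window plus a safety margin, or None with no data."""
        if not self.samples:
            return None
        return max(self.samples) * (1 + self.margin)


rec = Recommender()
for cpu in (1.0, 2.0, 1.5):
    rec.observe(cpu)
print(rec.recommend())
```

Because old samples age out of the window, the recommendation drifts back down once a spike has passed.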

  64. Google Cloud Platform
    Autopilot in Borg
    Most Borg users use autopilot
    See earlier statement regarding benchmarks - even at Google
    Kubernetes API is purpose-built for this sort of use-case
    We need a VerticalPodAutoscaler

  65. Google Cloud Platform
    Utilization

  66. Google Cloud Platform
    Utilization
    Resources cost money
    Wasted resources == wasted money
    You NEED (not just want) to use as much of your capacity as possible
    Selling it is not the same as using it

  67. Google Cloud Platform
    What is YOUR average utilization?

  68. Google Cloud Platform
    How can we do better?
    Utilization demands isolation
    ● If you want to push the limits, it has to be safe at the extremes
    People are inherently cautious
    ● Provision for 90%-99% case
    VPA & strong isolation should give enough confidence to provision more tightly
    We need to do some kernel work, here

  69. Google Cloud Platform
    Some lessons from Borg
    Priority
    ● Low-priority jobs get paused/killed in favor of high-priority jobs
    Quota
    ● If everyone is important, nobody is important
    Overcommit
    ● Hedge against rare events with lower QoS/SLA for some work
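
The priority lesson can be illustrated with a toy victim-selection routine: to make room for an incoming job, evict the lowest-priority work first and never touch anything of equal or higher priority. The function and job names are invented, and real preemption logic has many more constraints.

```python
def pick_victims(running, needed_cpu, incoming_priority):
    """Choose jobs to evict so that needed_cpu is freed.

    running: list of (name, priority, cpu) tuples.
    Returns the victim names, or [] if enough CPU cannot be freed from
    strictly lower-priority jobs.
    """
    victims, freed = [], 0
    for name, priority, cpu in sorted(running, key=lambda job: job[1]):
        if freed >= needed_cpu or priority >= incoming_priority:
            break
        victims.append(name)
        freed += cpu
    return victims if freed >= needed_cpu else []


running = [("web", 100, 4), ("batch-a", 10, 2), ("batch-b", 20, 3)]
print(pick_victims(running, needed_cpu=4, incoming_priority=100))
```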

  70. Google Cloud Platform
    Overcommit
    Build a model of recent real usage per-container
    The delta between request and reality is idle -- resell it with a lower SLA
    ● First-tier apps can always get what they paid for - kill second-tier apps
    Use stats to decide how aggressive to be
    Let the priority system deal with the debris
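
A back-of-the-envelope version of the overcommit idea: sum the gap between what each container requested and what it has recently used, then resell only a fraction of that gap. The `safety` knob stands in for "use stats to decide how aggressive to be". Both the knob and the numbers are assumptions for illustration.

```python
def resellable(containers, safety=0.5):
    """Capacity that could be resold at a lower SLA.

    containers: list of (request, recent_peak_usage) pairs in one unit.
    safety: fraction of the idle gap considered safe to resell.
    """
    idle = sum(max(0, request - peak) for request, peak in containers)
    return idle * safety


# Hypothetical first-tier containers: (CPU requested, recent peak CPU).
first_tier = [(4, 1), (2, 2), (8, 5)]
print(resellable(first_tier))
```

If first-tier usage later climbs back toward its requests, the second-tier work placed in that gap is what gets killed.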

  71. Google Cloud Platform
    Siren’s song: over-packing
    Clusters need some room to operate
    ● Nodes fail or get upgraded
    As you approach 100% bookings (requests), consider what happens when things go bad
    ● Nowhere to squeeze the toothpaste!
    Plan for some idle capacity - it will save your bacon one day
    ● Priorities & rescheduling can make this less expensive
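
One simple way to check for "room to operate" is to ask whether total requests would still fit if the largest node disappeared. This sketch ignores fragmentation and treats capacity as a single number per node; the function name and figures are assumptions.

```python
def survives_node_loss(node_capacities, total_requests):
    """True if requests still fit after the largest node is lost."""
    remaining = sum(node_capacities) - max(node_capacities)
    return total_requests <= remaining


# Three hypothetical 16-CPU nodes.
nodes_cpu = [16, 16, 16]
print(survives_node_loss(nodes_cpu, total_requests=30))
print(survives_node_loss(nodes_cpu, total_requests=40))
```

The second case fits today, with 40 of 48 CPUs booked, but has nowhere to go when a node fails or is drained for an upgrade.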

  72. Google Cloud Platform
    Wrapping up

  73. Google Cloud Platform
    WARNING: Some of this presentation was aspirational!

  74. Google Cloud Platform
    We still have a LONG WAY to go.
    Fortunately, this is a path we’ve been down before.

  75. Google Cloud Platform
    Kubernetes is Open
    https://kubernetes.io
    Code: github.com/kubernetes/kubernetes
    Chat: slack.k8s.io
    Twitter: @kubernetesio
    open community
    open design
    open source
    open to ideas
