The Scheduler Saga

kavya
August 28, 2018

The Go scheduler is, simply put, the orchestrator of the language runtime.
It schedules and unschedules goroutines, and also coordinates network polling and memory management.

This talk will explore the inner workings of the scheduler machinery. We will delve into the M:N multiplexing of goroutines on system threads, and the mechanisms to schedule, unschedule, and rebalance goroutines. We will touch upon how the scheduler supports the netpoller, and the memory management systems for goroutine stack resizing and heap garbage collection. Finally, we will evaluate the effectiveness and performance of the scheduler.

Transcript

  1. @kavya719
    The Scheduler Saga

  2. kavya

  3. the innards of
    the scheduler

  4. the scheduler
    the behind-the-scenes orchestrator
    of Go programs.

  5. func main() {
        // Create goroutines.
        for _, i := range images {
            go process(i)
        }
        ...
        // Wait.
        <-ch
    }

  6. func main() {
        // Create goroutines.
        for _, i := range images {
            go process(i)
        }
        ...
        // Wait.
        <-ch
    }

    func process(image) {
        // Create goroutine.
        go reportMetrics()

        complicatedAlgorithm(image)
        // Write to file.
        f, err := os.OpenFile()
        ...
    }

    runs goroutines created.
    pauses and resumes them:
    blocking channel operations,
    mutex operations.
    coordinates:
    blocking system calls, network I/O,
    runtime tasks like garbage collection.

  7. func main() {
        // Create goroutines.
        for _, i := range images {
            go process(i)
        }
        ...
        // Wait.
        <-ch
    }

    func process(image) {
        // Create goroutine.
        go reportMetrics()

        complicatedAlgorithm(image)
        // Write to file.
        f, err := os.OpenFile()
        ...
    }

    runs goroutines created.
    pauses and resumes them:
    blocking channel operations,
    mutex operations.
    coordinates:
    blocking system calls, network I/O,
    runtime tasks like garbage collection.
    …for hundreds of thousands of goroutines*
    * dependent on workload and hardware, of course.
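
The slide's program is shorthand, so here is a runnable sketch of the same shape. The bodies are assumptions made for illustration: `process` writes the image name to a temporary file (a blocking system call), and a hypothetical helper `processAll` plays the role of `main`, blocking on a channel until every goroutine is done.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// process stands in for the slide's process(image): it writes the image
// name to a temporary file, which involves blocking system calls.
func process(image string) error {
	f, err := os.CreateTemp("", "image-*")
	if err != nil {
		return err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	_, err = f.WriteString(image)
	return err
}

// processAll starts one goroutine per image and blocks on a channel until
// all of them have finished. It returns how many images were processed.
func processAll(images []string) int {
	ch := make(chan struct{})
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0
	for _, i := range images {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if err := process(name); err == nil {
				mu.Lock()
				done++
				mu.Unlock()
			}
		}(i)
	}
	go func() {
		wg.Wait()
		close(ch)
	}()
	<-ch // the calling goroutine blocks here; the scheduler runs the others
	return done
}

func main() {
	fmt.Println(processAll([]string{"a.png", "b.png", "c.png"}))
}
```

Every `go` statement, the mutex, the channel receive and the file write are points where the scheduler may get involved.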

  8. the design of the scheduler, & its scheduling decisions
    impact the performance of our programs.

  9. spec it
    the important questions.
    build!
    the big ideas & one sneaky idea.
    assess it.
    the difficult questions.

  10. spec it
    the what, when & why.

  11. why have a scheduler?

  12. why have a scheduler?
    goroutines are user-space threads.
    conceptually similar to kernel threads managed by the OS, but
    managed entirely by the Go runtime.
    lighter-weight and cheaper than kernel threads.
    multiplexed onto kernel threads by the scheduler.

  13. why have a scheduler?
    goroutines are user-space threads.
    conceptually similar to kernel threads managed by the OS, but
    managed entirely by the Go runtime.
    lighter-weight and cheaper than kernel threads:
    smaller memory footprint:
    initial goroutine stack = 2KB; default thread stack = 8KB.
    state tracking overhead.
    faster creation, destruction, context switches:
    goroutine switches = ~tens of ns; thread switches = ~a µs.
    multiplexed onto kernel threads by the scheduler.
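
To get a feel for how cheap goroutines are, here is a minimal sketch (not from the talk) that starts a large number of them and waits for all to finish. The helper name `spawn` and the count of 100,000 are illustrative choices.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawn starts n goroutines that each increment a shared counter,
// waits for all of them, and returns the final count.
func spawn(n int) int64 {
	var wg sync.WaitGroup
	var count int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&count, 1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&count)
}

func main() {
	fmt.Println(spawn(100000)) // prints 100000
}
```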

  14. why have a scheduler?
    goroutines are user-space threads.
    conceptually similar to kernel threads managed by the OS, but
    managed entirely by the Go runtime.
    lighter-weight and cheaper than kernel threads.
    multiplexed onto kernel threads by the scheduler.
    …but how are they run?
    [diagram: the OS scheduler places thread1 and thread2 on a CPU core]

  15. why have a scheduler?
    goroutines are user-space threads.
    conceptually similar to kernel threads managed by the OS, but
    managed entirely by the Go runtime.
    lighter-weight and cheaper than kernel threads.
    multiplexed onto kernel threads.
    [diagram: the Go scheduler places goroutines (g1, g2, g6) on threads; the OS scheduler places threads on CPU cores]

  16. when does it schedule?

  17. when does it schedule?
    At operations that should or would affect goroutine execution.

    func main() {
        // Create goroutines.
        for _, i := range images {
            go process(i)
        }
        ...
        // Wait.
        <-ch
    }

    func process(image) {
        // Create goroutine.
        go reportMetrics()

        complicatedAlgorithm(image)
        // Write to file.
        f, err := os.OpenFile()
        ...
    }

    goroutine creation: run new goroutines soon, continue this for now.
    goroutine blocking: pause this one immediately.
    blocking system call: the thread itself blocks too!

  18. when does it schedule?
    At operations that should or would affect goroutine execution.
    The runtime causes a switch into the scheduler under the hood,
    and the scheduler may schedule a different goroutine on this thread.
    [timeline diagram on Tmain: gmain runs; "create g1" enters the scheduler (S), which resumes gmain; "<-ch" enters the scheduler (S), which runs g1. T: threads, g: goroutines, S: scheduler]

  19. #schedgoals
    for scheduling goroutines onto kernel threads.
    use a small number of kernel threads.
    kernel threads are expensive to create.
    support high concurrency.
    Go programs should be able to run lots and lots of goroutines.
    leverage parallelism i.e. scale to N cores.
    On an N-core machine, Go programs should be able to run N
    goroutines in parallel.*
    * depending on the program structure, of course.

  20. build it!
    the big ideas & neat details.

  21. how to multiplex goroutines onto kernel threads?
    when to create threads?
    how to distribute goroutines across threads?

  22. prelude: runqueues
    Goroutines that are ready to run and need to be scheduled are tracked
    in heap-allocated FIFO runqueues.

  23. prelude: runqueues
    Goroutines that are ready to run and need to be scheduled are tracked
    in heap-allocated FIFO runqueues.
    [diagram: a runqueue of runnable goroutines in program heap memory; when gmain on Tmain creates g1, g1 is added at runq.tail; the goroutine at runq.head, the longest waiter, is removed to run]
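
As a rough illustration of the FIFO behaviour, here is a minimal slice-backed queue. The types `g` and `runq` are toy stand-ins written for this sketch, not the runtime's own structures.

```go
package main

import "fmt"

// g is a toy stand-in for a goroutine descriptor.
type g struct{ id int }

// runq is a simple FIFO queue of runnable goroutines.
type runq struct {
	gs []*g
}

// put appends a newly runnable goroutine at the tail.
func (q *runq) put(gp *g) { q.gs = append(q.gs, gp) }

// get removes and returns the goroutine at the head (the longest waiter),
// or nil if the queue is empty.
func (q *runq) get() *g {
	if len(q.gs) == 0 {
		return nil
	}
	gp := q.gs[0]
	q.gs = q.gs[1:]
	return gp
}

func main() {
	var q runq
	q.put(&g{id: 1})
	q.put(&g{id: 2})
	first := q.get()
	second := q.get()
	fmt.Println(first.id, second.id) // prints "1 2"
}
```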

  24. assume: running on a box with 2 CPU cores.

    gmain: creates g1, g2; goroutine blocks.
    func main() {
        // Create goroutines.
        for _, i := range images {
            go process(i)
        }
        ...
        // Wait.
        <-ch
    }

    g1:
    func process(image) {
        // Create goroutine.
        go reportMetrics()

        complicatedAlgorithm(image)
        // Write to file.
        f, err := os.OpenFile()
        ...
    }

  25. first, the non-ideas

  26. first, the non-ideas
    I. Multiplex all goroutines on a single thread.
    - no concurrency!
    if a goroutine blocks the thread, no other goroutines run either.
    - no parallelism possible:
    can only use a single CPU core, even if more are available.
    [diagram: on Tmain, gmain creates g1 and blocks at <-ch; g1 then runs and makes a blocking syscall, while g3 waits in the runq]

  27. first, the non-ideas
    II. Create & destroy a thread per-goroutine.
    - defeats the purpose of using goroutines:
    threads are heavyweight, and expensive to create and destroy.
    okay, here’s an idea…

  28. idea I: reuse threads
    Create threads when needed; keep them around for reuse.
    when needed: there’re goroutines to run, but all threads are busy.

  29. idea I: reuse threads
    Create threads when needed; keep them around for reuse.
    when needed: there’re goroutines to run, but all threads are busy.
    keep them around: “thread parking”, i.e. put them to sleep;
    a parked thread no longer uses a CPU core.
    track idle threads in a list (“mIdle”).

  30. idea I: reuse threads
    Create threads when needed; keep them around for reuse.
    The threads get goroutines to run from a runqueue.
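
A toy model of this idea, under loud assumptions: worker goroutines stand in for kernel threads, plain functions stand in for goroutines, and the pool is a fixed size rather than grown on demand. The names `sched`, `submit`, `worker` and `runAll` are made up for the sketch. Idle workers "park" on a condition variable and are woken when work arrives.

```go
package main

import (
	"fmt"
	"sync"
)

// sched models one shared runqueue guarded by a lock.
type sched struct {
	mu     sync.Mutex
	cond   *sync.Cond // idle workers park by waiting here
	runq   []func()
	closed bool
}

func newSched() *sched {
	s := &sched{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// submit adds a task to the runqueue and wakes one parked worker, if any.
func (s *sched) submit(task func()) {
	s.mu.Lock()
	s.runq = append(s.runq, task)
	s.mu.Unlock()
	s.cond.Signal()
}

// worker runs tasks until the scheduler is closed and the queue is drained.
func (s *sched) worker(wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		s.mu.Lock()
		for len(s.runq) == 0 && !s.closed {
			s.cond.Wait() // park: sleep until there is work
		}
		if len(s.runq) == 0 {
			s.mu.Unlock()
			return
		}
		task := s.runq[0]
		s.runq = s.runq[1:]
		s.mu.Unlock()
		task()
	}
}

// runAll runs n tasks on the given number of workers and returns how many ran.
func runAll(workers, n int) int {
	s := newSched()
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go s.worker(&wg)
	}
	var mu sync.Mutex
	ran := 0
	for i := 0; i < n; i++ {
		s.submit(func() {
			mu.Lock()
			ran++
			mu.Unlock()
		})
	}
	s.mu.Lock()
	s.closed = true
	s.mu.Unlock()
	s.cond.Broadcast()
	wg.Wait()
	return ran
}

func main() {
	fmt.Println(runAll(2, 10)) // prints 10
}
```

Every worker takes the same lock to reach the queue, so this toy has the single point of contention that a shared runqueue implies.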

  31. [diagram: program: Tmain runs gmain. scheduler: runq and idle threads list]

  32. [diagram: program: gmain, running on Tmain, creates g1. scheduler: runq and idle threads list]

  33. add g1 to runqueue.
    work to do and all threads busy: start one to run g1!
    [diagram: program: gmain, running on Tmain, creates g1. scheduler: g1 in the runq; idle threads list]

  34. [diagram: program: Tmain runs gmain, which created g1; new thread T1. scheduler: runq, idle threads list, g1]

  35. [diagram: program: T1 runs g1; Tmain runs gmain, which created g1. scheduler: runq and idle threads list]

  36. [diagram: program: g1 exits on T1; Tmain runs gmain, which created g1. scheduler: runq and idle threads list]

  37. Say g1 completes: park T1 rather than destroying it.
    [diagram: program: T1; Tmain runs gmain, which created g1. scheduler: runq and idle threads list]

  38. [diagram: program: gmain, running on Tmain, created g1 and now creates g2. scheduler: runq; idle threads list holds T1]

  39. add g2 to runqueue.
    idle thread present: don’t start a thread!
    [diagram: program: gmain, running on Tmain, creates g2. scheduler: g2 in the runq; idle threads list holds T1]

  40. a match made in (scheduling) heaven.
    [diagram: program: T1 runs g2; Tmain runs gmain, which created g1 and g2. scheduler: runq and idle threads list]

  41. sweet.
    We have a scheme that nicely reduces thread creations and
    still provides concurrency, parallelism.

    Work is naturally balanced across threads too.

  42. sweet.
    We have a scheme that nicely reduces thread creations and
    still provides concurrency, parallelism.

    Work is naturally balanced across threads too.
    …but
    - multiple threads access the same runqueue, so need a lock.
    serializes scheduling.

  43. sweet.
    We have a scheme that nicely reduces thread creations and
    still provides concurrency, parallelism.

    Work is naturally balanced across threads too.
    …but
    - multiple threads access the same runqueue, so need a lock.
    [diagram: gmain, running on Tmain, creates 10,000 long-running goroutines in quick succession]

  44. sweet.
    We have a scheme that nicely reduces thread creations and
    still provides concurrency, parallelism.
    …but
    - multiple threads access the same runqueue, so need a lock.
    - an unbounded number of threads can still be created.

  45. sweet.
    We have a scheme that nicely reduces thread creations and
    still provides concurrency, parallelism.
    …but
    - multiple threads access the same runqueue, so need a lock.
    - an unbounded number of threads can still be created.
    hella contention possible.

  46. sweet.
    We have a scheme that nicely reduces thread creations and
    still provides concurrency, parallelism.
    …but
    - multiple threads access the same runqueue, so need a lock.
    - an unbounded number of threads can still be created.
    hella not scalable.

  47. reusing threads is still a good idea.
    thread creation is expensive; reusing threads amortizes that cost.
    If the problem is that an unbounded number of threads can access the runqueue…

  48. idea II: limit threads accessing runqueue

  49. idea II: limit threads accessing runqueue
    Limit the number of threads accessing the runqueue.
    As before, keep threads around for reuse;

    get goroutines to run from the runqueue.

  50. idea II: limit threads accessing runqueue
    Limit the number of threads accessing the runqueue,
    i.e. threads that are running goroutines;
    threads in syscalls etc. won’t count towards this limit.
    As before, keep threads around for reuse;
    get goroutines to run from the runqueue.

  51. idea II: limit threads accessing runqueue
    Limit the number of threads accessing the runqueue.
    …to what?
    too many -> too much contention.
    too few -> won’t use all the CPU cores, i.e.
    will give up on parallelism.
    As before, keep threads around for reuse;
    get goroutines to run from the runq.

  52. idea II: limit threads accessing runqueue
    Limit the number of threads accessing the runqueue.
    …to what?
    too many -> too much contention.
    too few -> won’t use all the CPU cores, i.e.
    will give up on parallelism.
    To the number of CPU cores, to get all the parallelism we can!
    [picture: CORES]
    As before, keep threads around for reuse;
    get goroutines to run from the runq.
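
The limit can be pictured with a counting semaphore. This is only an analogy, with goroutines standing in for threads; the helper `runLimited` is a name made up for the sketch. At most `limit` tasks execute at once and the rest wait their turn, and the function reports the highest concurrency it observed.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// runLimited runs every task in its own goroutine but lets at most limit
// of them execute at once. It returns the peak concurrency observed.
func runLimited(limit int, tasks []func()) int64 {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var running, peak int64
	for _, task := range tasks {
		wg.Add(1)
		go func(task func()) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it on return
			n := atomic.AddInt64(&running, 1)
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			task()
			atomic.AddInt64(&running, -1)
		}(task)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	tasks := make([]func(), 100)
	for i := range tasks {
		tasks[i] = func() {}
	}
	limit := runtime.NumCPU()
	fmt.Println(runLimited(limit, tasks) <= int64(limit)) // prints true
}
```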

  53. Limit # threads accessing runqueue to number of CPU cores (N) = 2.
    Say Tmain creates g2, but gmain and g1 are still running.
    [diagram: program: T1 runs g1; gmain, running on Tmain, created g1 and now creates g2. scheduler: runq]

  54. Limit # threads accessing runqueue to number of CPU cores (N) = 2.
    Say Tmain creates g2, but gmain and g1 are still running.
    add g2 to runqueue.
    work to do, all threads busy,
    but #(runq threads) is not < N:
    don’t start any!
    [diagram: program: T1 runs g1; gmain, running on Tmain, created g1 and g2. scheduler: g2 in the runq]

  55. Limit # threads accessing runqueue to number of CPU cores (N) = 2.
    g2 will be run at a future point.
    [diagram: program: T1 runs g1; on Tmain, gmain created g1 and g2 and then reaches <-ch, after which Tmain runs g2. scheduler: runq]

  56. seems reasonable.
    We get around unbounded thread contention, without
    giving up parallelism.
    [picture: CORES]

  57. seems reasonable.
    We get around unbounded thread contention, without
    giving up parallelism.
    [picture: CORES]
    …ship it?

  59. seems reasonable.
    We get around unbounded thread contention, without
    giving up parallelism.
    …ship it?
    - This scheme does not scale with the number of CPU cores!
    As N ↑ ⟶ number of runqueue-accessing threads ↑.
    ruh-roh, we’re in hella contention land again.

  60. the experiment
    the modified Go scheduler:
    uses a global runqueue, and 

    #(goroutine-running threads) = #(CPU cores).
    everything else about the runtime is unmodified.
    the benchmark:
    CreateGoroutineParallel, in the go repo.
    creates #(CPU cores) goroutines in parallel, 

    until a total of b.N goroutines have been created.
    the machines:
    A 4-core and 16-core x86-64.

  62. the experiment
    scheduler benchmarks (CreateGoroutineParallel)
    On the 4-core:
    the modified scheduler takes about
    4x longer than the Go scheduler.
    On the 16-core:
    the modified scheduler takes about
    31x longer than the Go scheduler!

  63. seems reasonable.
    We get around unbounded thread contention, without
    giving up parallelism.
    …ship it?
    nope.
    - This scheme does not scale with the number of CPU cores!
    As N ↑ ⟶ number of runqueue-accessing threads ↑.
    ruh-roh, we’re in hella contention land again.

  64. really, the problem is the single shared runqueue.
    #(goroutine-running threads) = #(CPU cores) is still clever.
    we maximally leverage parallelism by this.

  65. idea III: distributed runqueues
    Use N runqueues on an N-core machine.

    A thread claims a runqueue to run goroutines.

  66. idea III: distributed runqueues
    Use N runqueues on an N-core machine.
    A thread claims a runqueue to run goroutines;
    it inserts and removes goroutines
    from the runqueue it is associated with.
    As before, reuse threads.

  67. Number of CPU cores (N) = number of runqueues = 2.
    add g1 to Tmain’s runq.
    work to do, all threads busy,
    #(runq threads) < N:
    start one to run g1!
    [diagram: program: gmain, running on Tmain, creates g1. scheduler: runqA, g1, runqB]

  68. Number of CPU cores (N) = number of runqueues = 2.
    [diagram: program: new thread T1; Tmain runs gmain, which created g1. scheduler: g1, runqA, runqB]

  69. Number of CPU cores (N) = number of runqueues = 2.
    uh oh! The local runq is empty.
    [diagram: program: T1; Tmain runs gmain, which created g1. scheduler: g1, runqA, runqB]

  70. so, steal!
    “work stealing”
    If the local runqueue is empty, steal work from another runqueue:
    pick another runqueue at random, steal half its work.
    It organically balances work across runqueues.

  71. so, steal!
    If the local runqueue is empty, steal work from another runqueue.
    It organically balances work across threads.
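
A minimal sketch of the steal, with toy types that are assumptions of this example: each `runq` is a plain slice of goroutine IDs, with no locking or atomics. `findWork` starts from a random queue and takes half of the first non-empty victim's work.

```go
package main

import (
	"fmt"
	"math/rand"
)

// runq is a toy per-thread runqueue holding goroutine IDs.
type runq struct{ gs []int }

// stealHalf moves half of victim's goroutines (rounded up) to thief
// and returns the number stolen.
func stealHalf(thief, victim *runq) int {
	n := (len(victim.gs) + 1) / 2
	thief.gs = append(thief.gs, victim.gs[:n]...)
	victim.gs = victim.gs[n:]
	return n
}

// findWork is called when queues[self] is empty. It tries the other queues,
// starting from a random one, and steals from the first that has work.
func findWork(queues []*runq, self int) bool {
	start := rand.Intn(len(queues))
	for i := 0; i < len(queues); i++ {
		v := (start + i) % len(queues)
		if v != self && len(queues[v].gs) > 0 {
			stealHalf(queues[self], queues[v])
			return true
		}
	}
	return false
}

func main() {
	a := &runq{}
	b := &runq{gs: []int{1, 2, 3, 4}}
	ok := findWork([]*runq{a, b}, 0)
	fmt.Println(ok, a.gs, b.gs) // prints "true [1 2] [3 4]"
}
```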

  72. Number of CPU cores (N) = number of runqueues = 2.
    [diagram: program: T1 (!); Tmain runs gmain, which created g1. scheduler: g1, runqA, runqB]

  73. Number of CPU cores (N) = number of runqueues = 2.
    the steal
    [diagram: program: T1; Tmain runs gmain, which created g1. scheduler: g1, runqA, runqB]

  74. Number of CPU cores (N) = number of runqueues = 2.
    “the end justifies the means”?
    [diagram: program: T1 runs g1; Tmain runs gmain, which created g1. scheduler: runqA, runqB]

  75. this looks promising!
    let’s continue.
    This scheme scales nicely with the number of CPU cores, and
    threads don’t contend for work.
    The work across threads is balanced with work-stealing.

    handoff prevents starvation from blocked threads.

  76. g1:
    func process(image) {
        // Create goroutine.
        go reportMetrics()  // creates g3

        complicatedAlgorithm(image)
        // Write to file.
        f, err := os.OpenFile()  // goroutine & thread block
        ...
    }

  77. Number of CPU cores (N) = number of runqueues = 2.
    add g3 to T1’s runq.
    don’t start a thread;
    #(runq-threads) is not < N.
    [diagram: program: T1 runs g1, which creates g3; Tmain runs gmain, which created g1. scheduler: g3, runqA, runqB]

  78. Number of CPU cores (N) = number of runqueues = 2.
    [diagram: program: on T1, g1 created g3 and is now in a blocking syscall; Tmain runs gmain, which created g1. scheduler: g3, runqA, runqB]

  79. Number of CPU cores (N) = number of runqueues = 2.
    The runqueue has work,
    and the thread’s blocked.
    oh no.
    [diagram: program: on T1, g1 created g3 and is now in a blocking syscall; Tmain runs gmain, which created g1. scheduler: g3, runqA, runqB]

  80. “handoff”
    Use a mechanism to transfer a blocked thread’s runqueue to another
    thread:
    a background monitor thread that detects threads blocked for a while,
    takes their runqueues, and gives them away.
    Why can’t the thread itself hand off the runqueue before it enters the
    system call?
    If it did, it could give up its runqueue unnecessarily!

  81. “handoff”
    Use a mechanism to transfer a blocked thread’s runqueue to another
    thread.
    Unpark a parked thread or start a thread if needed.
    this is okay to do!
    The thread limit (= number of CPU cores) applies to goroutine-running
    threads only.
    The original thread is blocked; so, another thread can take its place
    running goroutines.

  82. “handoff”
    Use a mechanism to transfer a blocked thread’s runqueue to another
    thread.
    Prevents goroutine starvation.
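
The handoff can be sketched with a few toy types. `proc`, `thread` and `monitor` are names made up for this example and only loosely mirror the runtime's structures. The monitor takes the runqueue from a blocked thread and gives it to a parked thread if one exists, or to a newly started one otherwise.

```go
package main

import "fmt"

// proc holds a runqueue of goroutine IDs; thread stands in for a kernel thread.
type proc struct{ runq []int }

type thread struct {
	id      int
	p       *proc
	blocked bool // true while the thread sits in a blocking system call
}

// monitor is a toy stand-in for the background monitor thread.
type monitor struct {
	parked []*thread // idle threads available for reuse
	nextID int       // ID for the next thread that gets started
}

// handoff moves the runqueue of a blocked thread to another thread and
// returns that thread, or nil if no handoff was needed.
func (mon *monitor) handoff(from *thread) *thread {
	if !from.blocked || from.p == nil {
		return nil
	}
	var to *thread
	if n := len(mon.parked); n > 0 {
		to = mon.parked[n-1] // unpark a parked thread
		mon.parked = mon.parked[:n-1]
	} else {
		to = &thread{id: mon.nextID} // start a new thread
		mon.nextID++
	}
	to.p, from.p = from.p, nil
	return to
}

func main() {
	t1 := &thread{id: 1, p: &proc{runq: []int{3}}, blocked: true}
	mon := &monitor{nextID: 2}
	t2 := mon.handoff(t1)
	fmt.Println(t2.id, t2.p.runq, t1.p == nil) // prints "2 [3] true"
}
```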

  83. Number of CPU cores (N) = number of runqueues = 2.
    [diagram: program: on T1, g1 (create gx) is in a blocking syscall; Tmain runs gmain, which created g1. scheduler: g3, runqA, runqB]

  84. Number of CPU cores (N) = number of runqueues = 2.
    There’s a runqueue with work,
    its thread is blocked, and
    no parked threads, so
    the monitor starts a thread.
    [diagram: program: on T1, g1 (create gx) is in a blocking syscall; Tmain runs gmain, which created g1. scheduler: g3, runqA, runqB]

  85. Number of CPU cores (N) = number of runqueues = 2.
    [diagram: program: on T1, g1 (create gx) is in a blocking syscall; Tmain runs gmain, which created g1; new thread T2. scheduler: g3, runqA, runqB]

  86. Number of CPU cores (N) = number of runqueues = 2.
    handoff via the monitor
    [diagram: program: on T1, g1 (create gx) is in a blocking syscall; Tmain runs gmain, which created g1; T2. scheduler: g3, runqA, runqB]

  87. Number of CPU cores (N) = number of runqueues = 2.
    [diagram: program: on T1, g1 (create gx) is in a blocking syscall; Tmain runs gmain, which created g1; g3 and T2. scheduler: runqA, runqB]

  88. we have (finally) arrived.
    this looks promising!
    This scheme scales nicely with the number of CPU cores, and
    threads don’t contend for work.
    The work across threads is balanced with work-stealing;
    handoff prevents starvation from blocked threads.

  89. the Go scheduler
    the big ideas.
    reuse threads.

  90. the Go scheduler
    the big ideas.
    reuse threads.
    limit #(goroutine-running) threads to
    number of CPU cores.
    GOMAXPROCS

  91. the Go scheduler
    the big ideas.
    reuse threads.
    limit #(goroutine-running) threads to
    number of CPU cores.
    GOMAXPROCS
    distributed runqueues with stealing and handoff.
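
The limit is visible from a program. A small sketch using the standard `runtime` package: `runtime.NumCPU` reports the logical CPUs, and `runtime.GOMAXPROCS(0)` queries the current setting without changing it. The helper `limits` is a name chosen for this example.

```go
package main

import (
	"fmt"
	"runtime"
)

// limits reports the number of logical CPUs and the current GOMAXPROCS value.
// Passing 0 to runtime.GOMAXPROCS queries the setting without changing it.
func limits() (cpus, procs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	cpus, procs := limits()
	fmt.Printf("logical CPUs: %d, GOMAXPROCS: %d\n", cpus, procs)
}
```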

  92. …and one sneaky idea.
    The scheduling points are cooperative i.e. the program calls into
    the scheduler.
    // A CPU-bound computation that runs
    // for a long, long time.
    func complicatedAlgorithm(image) {
        // Do not create goroutines, or do
        // anything that blocks at all.
        ...
    }
    ruh-roh.
    a CPU-hog can starve runqueues.

  93. To avoid this, the Go scheduler implements preemption*.
    It runs a background thread called the “sysmon”,
    to detect long-running goroutines (> 10ms; with caveats),
    and unschedule them when possible.
    * technically, cooperative preemption.
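
A program can also add a scheduling point itself. A minimal sketch, assuming a made-up CPU-bound loop: `sum` calls `runtime.Gosched` every so often to yield to other goroutines. Go releases from 1.14 on preempt goroutines asynchronously, so this is rarely needed in practice; it is shown only to make the cooperative idea concrete.

```go
package main

import (
	"fmt"
	"runtime"
)

// sum adds the integers in [0, n) and, if every > 0, yields to the scheduler
// once per `every` iterations so that other goroutines get a chance to run.
func sum(n, every int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += i
		if every > 0 && i%every == 0 {
			runtime.Gosched() // explicit scheduling point
		}
	}
	return total
}

func main() {
	fmt.Println(sum(1000, 100)) // prints 499500
}
```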

  94. To avoid this, the Go scheduler implements preemption*.
    …where would preempted goroutines be put?
    They essentially starved other goroutines from running, so
    don’t want to put them back on the per-core runqueues;
    it would not be fair.
    * technically, cooperative preemption.

  95. …on a global runqueue.
    “surprise”!
    The Go scheduler has a global runqueue in addition to the
    distributed runqueues.
    that’s right.

  96. …on a global runqueue.
    It uses this as a lower priority runqueue.
    Threads access it less frequently than their local runqueues;

    so, contention is not a real issue.
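
One way to picture the lower priority, as a sketch under stated assumptions: `next` serves the local queue first and only looks at the global queue when the local one is empty, plus once every `checkGlobalEvery` rounds so the global queue cannot starve. The interval, the names and the lack of locking are choices made for this example.

```go
package main

import "fmt"

// checkGlobalEvery is how often (in scheduling rounds) the global queue is
// polled first. The value is an assumption chosen for this sketch.
const checkGlobalEvery = 61

type sched struct {
	tick   int
	local  []int // this thread's runqueue (goroutine IDs)
	global []int // shared, lower-priority runqueue
}

// pop removes and returns the first element of a non-empty queue.
func pop(q *[]int) int {
	g := (*q)[0]
	*q = (*q)[1:]
	return g
}

// next picks the next goroutine to run: usually from the local queue,
// occasionally from the global queue. It returns -1 if both are empty.
func (s *sched) next() int {
	s.tick++
	if s.tick%checkGlobalEvery == 0 && len(s.global) > 0 {
		return pop(&s.global)
	}
	if len(s.local) > 0 {
		return pop(&s.local)
	}
	if len(s.global) > 0 {
		return pop(&s.global)
	}
	return -1
}

func main() {
	s := &sched{local: []int{1, 2}, global: []int{9}}
	a, b, c, d := s.next(), s.next(), s.next(), s.next()
	fmt.Println(a, b, c, d) // prints "1 2 9 -1"
}
```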

  97. a neat detail (or two)…
    thread spinning
    Threads without work “spin” looking for work before parking;
    they check the global runqueue, poll the network, attempt to run gc tasks,
    and work-steal.
    This burns CPU cycles, but maximally leverages available parallelism.
    Ps and runqueues
    The per-core runqueues are stored in a heap-allocated “p” struct.
    It stores other resources a thread needs to run goroutines too, like a
    memory cache.
    A thread claims a p to run goroutines, and the entire p is handed-off
    when it’s blocked.
    Fun fact: this handoff is taken care of by the sysmon too.

  98. assess it.
    the difficult questions.

  99. #schedgoals
    for scheduling goroutines onto kernel threads.
    use a small number of kernel threads.
    ideas: reuse threads & limit the number of goroutine-running threads.
    support high concurrency.
    ideas: threads use independent runqueues & keep them balanced.
    leverage parallelism i.e. scale to N cores.
    ideas: use a runqueue per core & employ thread spinning.

  100. limitations
    FIFO runqueues → no notion of goroutine priorities.
    Implement runqueues as priority queues, like the Linux scheduler.
    No strong preemption → no strong fairness or latency guarantees.
    recent proposal to fix this: Non-cooperative goroutine preemption.
    Is not aware of the system topology → no real locality.
    dated proposal to fix this: NUMA-aware scheduler
    Use LIFO, rather than FIFO, runqueues; better for cache utilization.

  101. The Go scheduler motto, in a picture.
    [picture: CORES]

  102. @kavya719
    speakerdeck.com/kavya719/the-scheduler-saga
    Special thanks to Eben Freeman and Austin Duffield for reading drafts of this,
    & also Chris Frost, Bernardo Farah, Anubhav Jain and Jeffrey Chen.

    References

    Scalable scheduler design doc
    https://github.com/golang/go/blob/master/src/runtime/runtime2.go
    https://github.com/golang/go/blob/master/src/runtime/proc.go
    Go scheduler blog post
    Scheduling Multithreaded Computations by Work Stealing
