The Scheduler Saga

kavya
August 28, 2018

The Go scheduler is, simply put, the orchestrator of the language runtime.
It schedules and unschedules goroutines, and also coordinates network polling and memory management.

This talk will explore the inner workings of the scheduler machinery. We will delve into the M:N multiplexing of goroutines on system threads, and the mechanisms to schedule, unschedule, and rebalance goroutines. We will touch upon how the scheduler supports the netpoller, and the memory management systems for goroutine stack resizing and heap garbage collection. Finally, we will evaluate the effectiveness and performance of the scheduler.

Transcript

  1. @kavya719 The Scheduler Saga

  2. kavya

  3. the innards of the scheduler

  4. the scheduler: the behind-the-scenes orchestrator of Go programs.

  5. func main() {
         // Create goroutines.
         for _, i := range images {
             go process(i)
         }
         ...
         // Wait.
         <-ch
     }
  6. func main() {
         // Create goroutines.
         for _, i := range images {
             go process(i)
         }
         ...
         // Wait.
         <-ch
     }

     func process(image) {
         // Create goroutine.
         go reportMetrics()

         complicatedAlgorithm(image)

         // Write to file.
         f, err := os.OpenFile()
         ...
     }

     the scheduler:
     - runs goroutines created.
     - pauses and resumes them: blocking channel operations, mutex operations.
     - coordinates: blocking system calls, network I/O, runtime tasks (garbage collection).
  7. func main() {
         // Create goroutines.
         for _, i := range images {
             go process(i)
         }
         ...
         // Wait.
         <-ch
     }

     func process(image) {
         // Create goroutine.
         go reportMetrics()

         complicatedAlgorithm(image)

         // Write to file.
         f, err := os.OpenFile()
         ...
     }

     the scheduler:
     - runs goroutines created.
     - pauses and resumes them: blocking channel operations, mutex operations.
     - coordinates: blocking system calls, network I/O, runtime tasks (garbage collection).
     …for hundreds of thousands of goroutines*
     * dependent on workload and hardware, of course.
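
The slide code is pseudocode. As a minimal runnable sketch of the same shape, assuming images are plain strings and that "processing" just measures the name (`processAll` and the `results` channel are hypothetical names, not from the talk):

```go
package main

import "fmt"

// process stands in for the per-image work on the slide. Here it only
// reports the length of the image name (an assumption for illustration).
func process(image string, results chan<- int) {
	results <- len(image)
}

// processAll starts one goroutine per image, then blocks on the channel
// until every goroutine has reported back.
func processAll(images []string) int {
	results := make(chan int)
	for _, img := range images {
		go process(img, results) // goroutine creation
	}
	total := 0
	for range images {
		total += <-results // blocking channel receive
	}
	return total
}

func main() {
	fmt.Println(processAll([]string{"a.png", "bb.png", "ccc.png"}))
}
```

Both the `go` statement and the blocking receive are points where the runtime may switch into the scheduler.
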
  8. the design of the scheduler, & its scheduling decisions impact the performance of our programs.
  9. spec it. the important questions.
     build it! the big ideas & one sneaky idea.
     assess it. the difficult questions.
  10. spec it. the what, when & why.

  11. why have a scheduler?

  12. why have a scheduler?
      goroutines are user-space threads.
      conceptually similar to kernel threads managed by the OS, but managed entirely by the Go runtime.
      - lighter-weight and cheaper than kernel threads.
      - multiplexed onto kernel threads by the scheduler.
  13. why have a scheduler?
      goroutines are user-space threads.
      conceptually similar to kernel threads managed by the OS, but managed entirely by the Go runtime.
      - lighter-weight and cheaper than kernel threads.
        smaller memory footprint: initial goroutine stack = 2KB; default thread stack = 8MB (typical Linux default). state tracking overhead.
        faster creation, destruction, context switches: goroutine switches = ~tens of ns; thread switches = ~a µs.
      - multiplexed onto kernel threads by the scheduler.
  14. why have a scheduler?
      goroutines are user-space threads.
      conceptually similar to kernel threads managed by the OS, but managed entirely by the Go runtime.
      - lighter-weight and cheaper than kernel threads.
      - multiplexed onto kernel threads by the scheduler.
      …but how are they run?
      [diagram: thread1 and thread2 are run on a CPU core by the OS scheduler.]
  15. why have a scheduler?
      goroutines are user-space threads.
      conceptually similar to kernel threads managed by the OS, but managed entirely by the Go runtime.
      - lighter-weight and cheaper than kernel threads.
      - multiplexed onto kernel threads.
      [diagram: g1, g6, g2 are run on thread2 by the Go scheduler; threads are run on the CPU core by the OS scheduler.]
  16. when does it schedule?

  17. when does it schedule?
      At operations that should or would affect goroutine execution.

      func main() {
          // Create goroutines.
          for _, i := range images {
              go process(i)          // goroutine creation: run new goroutines soon, continue this for now.
          }
          ...
          // Wait.
          <-ch                       // goroutine blocking: pause this one immediately.
      }

      func process(image) {
          // Create goroutine.
          go reportMetrics()

          complicatedAlgorithm(image)

          // Write to file.
          f, err := os.OpenFile()    // blocking system call: the thread itself blocks too!
          ...
      }
  18. when does it schedule?
      At operations that should or would affect goroutine execution.
      The runtime causes a switch into the scheduler under the hood, and the scheduler may schedule a different goroutine on this thread.
      [timeline diagram for Tmain: gmain, g1, gmain run in turn, with switches into the scheduler (S) at “create g1” and “<-ch”. Legend: T = threads, g = goroutines, S = scheduler.]
  19. #schedgoals for scheduling goroutines onto kernel threads.
      - use a small number of kernel threads. kernel threads are expensive to create.
      - support high concurrency. Go programs should be able to run lots and lots of goroutines.
      - leverage parallelism i.e. scale to N cores. On an N-core machine, Go programs should be able to run N goroutines in parallel.*
      * depending on the program structure, of course.
  20. build it! the big ideas & neat details.

  21. how to multiplex goroutines onto kernel threads?
      when to create threads?
      how to distribute goroutines across threads?
  22. prelude: runqueues
      Goroutines that are ready-to-run and need to be scheduled are tracked in heap-allocated FIFO runqueues.
  23. prelude: runqueues
      Goroutines that are ready-to-run and need to be scheduled are tracked in heap-allocated FIFO runqueues.
      [diagram: gmain, on Tmain, creates g1. In program heap memory, the runq holds the runnable goroutines between runq.head and runq.tail: “add g1” at the tail; “remove this g to run” takes the longest waiter from the head.]
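
To make the FIFO behaviour concrete, here is a minimal sketch of such a queue. The types `g` and `runq` are toy stand-ins chosen for illustration and are not the runtime's actual definitions:

```go
package main

import "fmt"

// g is a stand-in for a goroutine descriptor (hypothetical).
type g struct{ id int }

// runq is a FIFO queue of runnable goroutines backed by a slice.
type runq struct {
	items []*g
}

// put appends a newly runnable goroutine at the tail.
func (q *runq) put(gp *g) { q.items = append(q.items, gp) }

// get removes the longest waiter from the head, or returns nil if empty.
func (q *runq) get() *g {
	if len(q.items) == 0 {
		return nil
	}
	gp := q.items[0]
	q.items = q.items[1:]
	return gp
}

func main() {
	q := &runq{}
	q.put(&g{id: 1})
	q.put(&g{id: 2})
	first, second := q.get(), q.get()
	fmt.Println(first.id, second.id, q.get() == nil)
}
```
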
  24. assume: running on a box with 2 CPU cores.

      func main() {                    // gmain
          // Create goroutines.
          for _, i := range images {
              go process(i)            // creates g1, g2
          }
          ...
          // Wait.
          <-ch                         // goroutine blocks
      }

      func process(image) {            // g1
          // Create goroutine.
          go reportMetrics()

          complicatedAlgorithm(image)

          // Write to file.
          f, err := os.OpenFile()
          ...
      }
  25. first, the non-ideas

  26. first, the non-ideas
      I. Multiplex all goroutines on a single thread.
      - no concurrency! if a goroutine blocks the thread, no other goroutines run either.
      - no parallelism possible: can only use a single CPU core, even if more are available.
      [diagram: Tmain with gmain (<-ch) and g1 (blocking syscall); create g1, g3; runq: g3]
  27. first, the non-ideas
      II. Create & destroy a thread per-goroutine.
      - defeats the purpose of using goroutines: threads are heavyweight, and expensive to create and destroy.
      okay, here’s an idea…
  28. idea I: reuse threads
      Create threads when needed; keep them around for reuse.
      - when needed: there’re goroutines to run, but all threads are busy.
  29. idea I: reuse threads
      Create threads when needed; keep them around for reuse.
      - when needed: there’re goroutines to run, but all threads are busy.
      - keep them around: “thread parking” i.e. put them to sleep; a parked thread no longer uses a CPU core. track idle threads in a list (“mIdle”).
  30. idea I: reuse threads
      Create threads when needed; keep them around for reuse.
      The threads get goroutines to run from a runqueue.
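
The rule for idea I can be sketched with counters alone, without starting real threads. Everything here (`sched`, `ready`, `take`, `park`) is a hypothetical model for illustration and not the runtime's API:

```go
package main

import "fmt"

// sched models idea I with counters only.
type sched struct {
	runq    []int // runnable goroutine ids (one shared FIFO)
	idle    int   // parked threads available for reuse
	created int   // total threads ever created
}

// ready enqueues a goroutine. It unparks an idle thread if one exists,
// and creates a new thread only when none is parked.
func (s *sched) ready(gid int) {
	s.runq = append(s.runq, gid)
	if s.idle > 0 {
		s.idle--
		return
	}
	s.created++
}

// take is what a running thread does: pop the head of the runqueue.
func (s *sched) take() (int, bool) {
	if len(s.runq) == 0 {
		return 0, false
	}
	gid := s.runq[0]
	s.runq = s.runq[1:]
	return gid, true
}

// park puts a thread whose goroutine finished on the idle list.
func (s *sched) park() { s.idle++ }

func main() {
	s := &sched{}
	s.ready(1) // no idle thread: a thread is created for g1
	s.take()
	s.park()   // g1 completed: park the thread rather than destroying it
	s.ready(2) // the parked thread is reused for g2
	fmt.Println("threads created:", s.created, "idle:", s.idle)
}
```
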
  31. [diagram: program (Tmain running gmain) | scheduler (runq; idle threads)]
  32. [diagram: program (Tmain running gmain; create g1) | scheduler (runq; idle threads)]
  33. add g1 to runqueue. work to do and all threads busy, start one to run g1!
      [diagram: program (Tmain running gmain; create g1) | scheduler (runq: g1; idle threads)]
  34. [diagram: program (Tmain running gmain, T1; create g1) | scheduler (runq: g1; idle threads)]
  35. [diagram: program (Tmain running gmain, T1 running g1; create g1) | scheduler (runq; idle threads)]
  36. [diagram: program (Tmain running gmain, T1; g1 exits) | scheduler (runq; idle threads)]
  37. Say g1 completes, park T1 rather than destroying it.
      [diagram: program (Tmain running gmain) | scheduler (runq; idle threads: T1)]
  38. [diagram: program (Tmain running gmain; create g2) | scheduler (runq; idle threads: T1)]
  39. add g2 to runqueue. idle thread present, don’t start a thread!
      [diagram: program (Tmain running gmain; create g2) | scheduler (runq: g2; idle threads: T1)]
  40. a match made in (scheduling) heaven.
      [diagram: program (Tmain running gmain, T1 running g2) | scheduler (runq; idle threads)]
  41. sweet.
      We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism.
      Work is naturally balanced across threads too.
  42. sweet. …but
      We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism.
      Work is naturally balanced across threads too.
      - multiple threads access the same runqueue, so need a lock. serializes scheduling.
  43. sweet. …but
      We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism.
      Work is naturally balanced across threads too.
      - multiple threads access the same runqueue, so need a lock.
      [diagram: gmain, on Tmain, creates a long-running g x10000, in quick succession.]
  44. sweet. …but
      We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism.
      - multiple threads access the same runqueue, so need a lock.
      - an unbounded number of threads can still be created.
  45. sweet. …but
      We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism.
      - multiple threads access the same runqueue, so need a lock.
      - an unbounded number of threads can still be created.
      hella contention possible.
  46. sweet. …but
      We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism.
      - multiple threads access the same runqueue, so need a lock.
      - an unbounded number of threads can still be created.
      hella not scalable.
  47. reusing threads is still a good idea. thread creation is expensive; reusing threads amortizes that cost.
      If the problem is that an unbounded number of threads can access the runqueue…
  48. idea II: limit threads accessing runqueue
  49. idea II: limit threads accessing runqueue
      Limit the number of threads accessing the runqueue.
      As before, keep threads around for reuse; get goroutines to run from the runqueue.
  50. idea II: limit threads accessing runqueue
      Limit the number of threads accessing the runqueue: threads that are running goroutines; threads in syscalls etc. won’t count towards this limit.
      As before, keep threads around for reuse; get goroutines to run from the runqueue.
  51. idea II: limit threads accessing runqueue
      Limit the number of threads accessing the runqueue. …to what?
      - too many -> too much contention.
      - too few -> won’t use all the CPU cores, i.e. will give up on parallelism.
      As before, keep threads around for reuse; get goroutines to run from the runq.
  52. idea II: limit threads accessing runqueue
      Limit the number of threads accessing the runqueue. …to what?
      To the number of CPU cores, to get all the parallelism we can!
      - too many -> too much contention.
      - too few -> won’t use all the CPU cores, i.e. will give up on parallelism.
      As before, keep threads around for reuse; get goroutines to run from the runq.
      [image: CORES]
  53. Limit # threads accessing runqueue to number of CPU cores (N) = 2.
      Say Tmain creates g2, but gmain and g1 are still running.
      [diagram: program (Tmain running gmain, T1 running g1; create g1, create g2) | scheduler (runq)]
  54. Limit # threads accessing runqueue to number of CPU cores (N) = 2.
      Say Tmain creates g2, but gmain and g1 are still running.
      add g2 to runqueue. work to do, all threads busy, but #(runq threads) is not < N, don’t start any!
      [diagram: program (Tmain running gmain, T1 running g1; create g1, create g2) | scheduler (runq: g2)]
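
The decision on this slide reduces to a small predicate. A sketch, where `shouldStartThread` and its parameters are hypothetical names for illustration:

```go
package main

import "fmt"

// shouldStartThread applies the rule from the slides: start (or unpark) a
// thread only if there is work queued and fewer than n threads are
// currently running goroutines.
func shouldStartThread(runqLen, running, n int) bool {
	return runqLen > 0 && running < n
}

func main() {
	const n = 2 // number of CPU cores on the example box

	// Only Tmain is running and g1 was just queued: start T1.
	fmt.Println(shouldStartThread(1, 1, n))

	// Tmain and T1 are both busy and g2 was just queued: g2 waits.
	fmt.Println(shouldStartThread(1, 2, n))
}
```
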
  55. Limit # threads accessing runqueue to number of CPU cores (N) = 2.
      g2 will be run at a future point.
      [diagram: program (Tmain running gmain, T1 running g1; create g1, create g2, <-ch) | scheduler (runq: g2)]
  56. seems reasonable.
      We get around unbounded thread contention, without giving up parallelism.
      [image: CORES]
  57. seems reasonable. …ship it?
      We get around unbounded thread contention, without giving up parallelism.
      [image: CORES]
  58. seems reasonable. …ship it?
      We get around unbounded thread contention, without giving up parallelism.
      [image: CORES]
  59. seems reasonable. …ship it?
      We get around unbounded thread contention, without giving up parallelism.
      - This scheme does not scale with the number of CPU cores! As N ↑ ⟶ number of runqueue-accessing threads ↑.
      ruh-roh, we’re in hella contention land again.
  60. the experiment
      the modified Go scheduler: uses a global runqueue, and #(goroutine-running threads) = #(CPU cores). everything else about the runtime is unmodified.
      the benchmark: CreateGoroutineParallel, in the go repo. creates #(CPU cores) goroutines in parallel, until a total of b.N goroutines have been created.
      the machines: a 4-core and a 16-core x86-64.
  61. the experiment
      the modified Go scheduler: uses a global runqueue, and #(goroutine-running threads) = #(CPU cores). everything else about the runtime is unmodified.
      the benchmark: CreateGoroutineParallel, in the go repo. creates #(CPU cores) goroutines in parallel, until a total of b.N goroutines have been created.
      the machines: a 4-core and a 16-core x86-64.
  62. the experiment: scheduler benchmarks (CreateGoroutineParallel)
      On the 4-core: the modified scheduler takes about 4x longer than the Go scheduler.
      On the 16-core: the modified scheduler takes about 31x longer than the Go scheduler!
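
For a feel of what such a workload looks like, here is a rough sketch in which several creator goroutines spawn short-lived goroutines in parallel. It is not the benchmark from the go repo; `createParallel` and its parameters are made up for illustration, and it measures nothing by itself:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// createParallel has `creators` goroutines create short-lived goroutines
// concurrently, total/creators each, and returns how many of them ran.
func createParallel(creators, total int) int64 {
	var ran int64
	var wg sync.WaitGroup
	per := total / creators
	for c := 0; c < creators; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var inner sync.WaitGroup
			for i := 0; i < per; i++ {
				inner.Add(1)
				go func() {
					atomic.AddInt64(&ran, 1)
					inner.Done()
				}()
			}
			inner.Wait()
		}()
	}
	wg.Wait()
	return ran
}

func main() {
	fmt.Println(createParallel(4, 10000))
}
```

With a single locked runqueue, every one of these creations would contend on the same lock, which is the effect the experiment quantifies.
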
  63. seems reasonable. …ship it? nope.
      We get around unbounded thread contention, without giving up parallelism.
      - This scheme does not scale with the number of CPU cores! As N ↑ ⟶ number of runqueue-accessing threads ↑.
      ruh-roh, we’re in hella contention land again.
  64. really, the problem is the single shared runqueue.
      #(goroutine-running threads) = #(CPU cores) is still clever. we maximally leverage parallelism by this.
  65. idea III: distributed runqueues
      Use N runqueues on an N-core machine.
      A thread claims a runqueue to run goroutines.
  66. idea III: distributed runqueues
      Use N runqueues on an N-core machine.
      A thread claims a runqueue to run goroutines: it inserts and removes goroutines from the runqueue it is associated with.
      As before, reuse threads.
  67. Number of CPU cores (N) = number of runqueues = 2.
      add g1 to Tmain’s runq. work to do, all threads busy, #(runq threads) < N, start one to run g1!
      [diagram: program (Tmain running gmain; create g1) | scheduler (runqA: g1; runqB)]
  68. Number of CPU cores (N) = number of runqueues = 2.
      [diagram: program (Tmain running gmain, T1; create g1) | scheduler (runqA, runqB; g1)]
  69. Number of CPU cores (N) = number of runqueues = 2.
      uh oh! The local runq is empty.
      [diagram: program (Tmain running gmain, T1; create g1) | scheduler (runqA, runqB; g1)]
  70. so, steal! “work stealing”
      If the local runqueue is empty, steal work from another runqueue: pick another runqueue at random, steal half its work.
      It organically balances work across runqueues.
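
A toy version of the stealing step, with runqueues modelled as slices of goroutine ids. The helper names are hypothetical, and rounding the half up (so that a lone goroutine can still be stolen) is a choice made for this sketch:

```go
package main

import (
	"fmt"
	"math/rand"
)

// stealHalf moves half of the victim's goroutines (rounded up) to the thief.
func stealHalf(thief, victim *[]int) {
	n := (len(*victim) + 1) / 2
	*thief = append(*thief, (*victim)[:n]...)
	*victim = (*victim)[n:]
}

// findWork steals from a randomly chosen other runqueue, but only if the
// local runqueue (index self) is empty.
func findWork(runqs [][]int, self int) {
	if len(runqs[self]) > 0 || len(runqs) < 2 {
		return
	}
	victim := rand.Intn(len(runqs) - 1)
	if victim >= self {
		victim++ // skip over our own runqueue
	}
	stealHalf(&runqs[self], &runqs[victim])
}

func main() {
	runqs := [][]int{{}, {1, 2, 3, 4}}
	findWork(runqs, 0)
	fmt.Println(runqs[0], runqs[1])
}
```
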
  71. so, steal!
      If the local runqueue is empty, steal work from another runqueue.
      It organically balances work across threads.
  72. Number of CPU cores (N) = number of runqueues = 2.
      [diagram: program (Tmain running gmain, T1 with no work (!); create g1) | scheduler (runqA, runqB; g1)]
  73. Number of CPU cores (N) = number of runqueues = 2.
      the steal
      [diagram: program (Tmain running gmain, T1; create g1) | scheduler (runqA, runqB; g1)]
  74. Number of CPU cores (N) = number of runqueues = 2.
      “the end justifies the means”?
      [diagram: program (Tmain running gmain, T1 running g1; create g1) | scheduler (runqA, runqB)]
  75. this looks promising! let’s continue.
      This scheme scales nicely with the number of CPU cores, and threads don’t contend for work.
      The work across threads is balanced with work-stealing.
      handoff prevents starvation from blocked threads.
  76. func process(image) {            // g1
          // Create goroutine.
          go reportMetrics()           // creates g3

          complicatedAlgorithm(image)

          // Write to file.
          f, err := os.OpenFile()      // goroutine & thread block
          ...
      }
  77. Number of CPU cores (N) = number of runqueues = 2.
      add g3 to T1’s runq. don’t start a thread; #(runq-threads) is not < N.
      [diagram: program (Tmain running gmain, T1 running g1; create g1, create g3) | scheduler (runqA, runqB; g3)]
  78. Number of CPU cores (N) = number of runqueues = 2.
      [diagram: program (Tmain running gmain, T1 running g1 in a blocking syscall; create g1, create g3) | scheduler (runqA, runqB; g3)]
  79. Number of CPU cores (N) = number of runqueues = 2.
      oh no. The runqueue has work, and the thread’s blocked.
      [diagram: program (Tmain running gmain, T1 running g1 in a blocking syscall; create g1, create g3) | scheduler (runqA, runqB; g3)]
  80. “handoff”
      Use a mechanism to transfer a blocked thread’s runqueue to another thread.
      The mechanism: a background monitor thread that detects threads blocked for a while, takes their runqueues, and gives them away.
      Why can’t the thread itself hand off the runqueue before it enters the system call? If it did, it could give up its runqueue unnecessarily!
  81. “handoff”
      Use a mechanism to transfer a blocked thread’s runqueue to another thread.
      Unpark a parked thread or start a thread if needed.
      this is okay to do! The thread limit (= number of CPU cores) applies to goroutine-running threads only. The original thread is blocked; so, another thread can take its place running goroutines.
  82. “handoff”
      Use a mechanism to transfer a blocked thread’s runqueue to another thread.
      Prevents goroutine starvation.
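
The transfer itself can be pictured as moving a queue between two thread records. This is a toy model: the `thread` type and `handoff` function are hypothetical, and the monitor that decides when to call it is left out:

```go
package main

import "fmt"

// thread is a toy model of a kernel thread and the runqueue it holds.
type thread struct {
	name    string
	blocked bool  // true while the thread sits in a blocking syscall
	runq    []int // goroutine ids waiting to run on this thread
}

// handoff moves the runqueue of a blocked thread to a thread that can run
// it, and reports whether a transfer happened.
func handoff(from, to *thread) bool {
	if !from.blocked || len(from.runq) == 0 || to.blocked {
		return false
	}
	to.runq = append(to.runq, from.runq...)
	from.runq = nil
	return true
}

func main() {
	// T1 is stuck in a syscall on behalf of g1 while g3 waits in its runqueue.
	t1 := &thread{name: "T1", blocked: true, runq: []int{3}}
	// T2 was unparked, or newly started, to take over.
	t2 := &thread{name: "T2"}

	fmt.Println(handoff(t1, t2), t2.runq, len(t1.runq))
}
```
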
  83. Number of CPU cores (N) = number of runqueues = 2.
      [diagram: program (Tmain running gmain, T1 running g1 in a blocking syscall; create g1, create gx) | scheduler (runqA, runqB; g3)]
  84. Number of CPU cores (N) = number of runqueues = 2.
      There’s a runqueue with work, its thread is blocked, and no parked threads, so the monitor starts a thread.
      [diagram: program (Tmain running gmain, T1 running g1 in a blocking syscall; create g1, create gx) | scheduler (runqA, runqB; g3)]
  85. Number of CPU cores (N) = number of runqueues = 2.
      [diagram: program (Tmain running gmain, T1 running g1 in a blocking syscall, T2; create g1, create gx) | scheduler (runqA, runqB; g3)]
  86. Number of CPU cores (N) = number of runqueues = 2.
      handoff via the monitor
      [diagram: program (Tmain running gmain, T1 running g1 in a blocking syscall, T2; create g1, create gx) | scheduler (runqA, runqB; g3)]
  87. Number of CPU cores (N) = number of runqueues = 2.
      [diagram: program (Tmain running gmain, T1 running g1 in a blocking syscall, T2 with g3; create g1, create gx) | scheduler (runqA, runqB)]
  88. we have (finally) arrived. this looks promising!
      This scheme scales nicely with the number of CPU cores, and threads don’t contend for work.
      The work across threads is balanced with work-stealing; handoff prevents starvation from blocked threads.
  89. the Go scheduler: the big ideas.
      - reuse threads.
  90. the Go scheduler: the big ideas.
      - reuse threads.
      - limit #(goroutine-running) threads to number of CPU cores. (GOMAXPROCS)
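
Both numbers can be read from a running program with the standard `runtime` package. A small sketch (the helper name `limits` is made up):

```go
package main

import (
	"fmt"
	"runtime"
)

// limits reports the number of logical CPUs and the current cap on
// threads that may execute Go code simultaneously.
func limits() (cpus, maxprocs int) {
	// GOMAXPROCS(0) queries the current setting without changing it.
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	cpus, maxprocs := limits()
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", cpus, maxprocs)
}
```

Threads blocked in system calls do not count against this cap, which matches the rule on the earlier "idea II" slides.
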
  91. the Go scheduler: the big ideas.
      - reuse threads.
      - limit #(goroutine-running) threads to number of CPU cores. (GOMAXPROCS)
      - distributed runqueues with stealing and handoff.
  92. …and one sneaky idea.
      The scheduling points are cooperative, i.e. the program calls into the scheduler.

      // A CPU-bound computation that runs
      // for a long, long time.
      func complicatedAlgorithm(image) {
          // Do not create goroutines, or do
          // anything that blocks at all.
          ...
      }

      ruh-roh. a CPU-hog can starve runqueues.
  93. To avoid this, the Go scheduler implements preemption*.
      It runs a background thread called the “sysmon”, to detect long-running goroutines (> 10ms; with caveats), and unschedule them when possible.
      * technically, cooperative preemption.
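
The detection half of this can be sketched as a scan over what is currently running. This is a toy model with made-up names (`running`, `overdue`) and plain millisecond timestamps; the real check has caveats that are ignored here:

```go
package main

import "fmt"

// preemptAfterMs mirrors the ~10ms figure on the slide.
const preemptAfterMs = 10

// running is a toy record of a goroutine currently on a thread.
type running struct {
	gid     int
	sinceMs int64 // when it was scheduled, in ms on some monotonic clock
}

// overdue returns the ids of goroutines that have been running for longer
// than preemptAfterMs as of nowMs.
func overdue(rs []running, nowMs int64) []int {
	var ids []int
	for _, r := range rs {
		if nowMs-r.sinceMs > preemptAfterMs {
			ids = append(ids, r.gid)
		}
	}
	return ids
}

func main() {
	rs := []running{
		{gid: 1, sinceMs: 0},  // has been running for 25ms
		{gid: 2, sinceMs: 23}, // has been running for 2ms
	}
	fmt.Println(overdue(rs, 25))
}
```
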
  94. To avoid this, the Go scheduler implements preemption*.
      …where would preempted goroutines be put?
      They essentially starved other goroutines from running, so we don’t want to put them back on the per-core runqueues; it would not be fair.
      * technically, cooperative preemption.
  95. …on a global runqueue.
      “surprise”! that’s right. The Go scheduler has a global runqueue in addition to the distributed runqueues.
  96. …on a global runqueue.
      It uses this as a lower priority runqueue. Threads access it less frequently than their local runqueues; so, contention is not a real issue.
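
One way to picture "lower priority" is a pick-next routine that prefers the local runqueue and looks at the global one only occasionally, or when there is nothing local. The names are hypothetical and the interval is an assumption chosen for illustration:

```go
package main

import "fmt"

// globalEvery is an assumed interval: on every globalEvery-th scheduling
// tick the global runqueue is checked first so that it cannot starve.
const globalEvery = 61

// pop removes and returns the head of a non-empty queue.
func pop(q *[]int) int {
	gid := (*q)[0]
	*q = (*q)[1:]
	return gid
}

// next picks the goroutine a thread should run on scheduling tick `tick`.
func next(local, global *[]int, tick int) (int, bool) {
	if tick%globalEvery == 0 && len(*global) > 0 {
		return pop(global), true
	}
	if len(*local) > 0 {
		return pop(local), true
	}
	if len(*global) > 0 {
		return pop(global), true
	}
	return 0, false
}

func main() {
	local, global := []int{1, 2}, []int{9}
	for tick := 1; tick <= 4; tick++ {
		gid, ok := next(&local, &global, tick)
		fmt.Println(tick, gid, ok)
	}
}
```

Because most picks never touch the global queue, its lock is taken rarely, which is the point the slide makes about contention.
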
  97. a neat detail (or two)…
      thread spinning: Threads without work “spin” looking for work before parking; they check the global runqueue, poll the network, attempt to run gc tasks, and work-steal. This burns CPU cycles, but maximally leverages available parallelism.
      Ps and runqueues: The per-core runqueues are stored in a heap-allocated “p” struct. It stores other resources a thread needs to run goroutines too, like a memory cache. A thread claims a p to run goroutines, and the entire p is handed-off when it’s blocked. Fun fact: this handoff is taken care of by the sysmon too.
  98. assess it. the difficult questions.
  99. #schedgoals for scheduling goroutines onto kernel threads.
      - use a small number of kernel threads. ideas: reuse threads & limit the number of goroutine-running threads.
      - support high concurrency. ideas: threads use independent runqueues & keep them balanced.
      - leverage parallelism i.e. scale to N cores. ideas: use a runqueue per core & employ thread spinning.
  100. limitations
       - FIFO runqueues → no notion of goroutine priorities. Implement runqueues as priority queues, like the Linux scheduler.
       - No strong preemption → no strong fairness or latency guarantees. recent proposal to fix this: Non-cooperative goroutine preemption.
       - Is not aware of the system topology → no real locality. dated proposal to fix this: NUMA-aware scheduler. Use LIFO, rather than FIFO, runqueues; better for cache utilization.
  101. The Go scheduler motto, in a picture.
       [image: CORES]
  102. @kavya719 speakerdeck.com/kavya719/the-scheduler-saga
       Special thanks to Eben Freeman and Austin Duffield for reading drafts of this, & also Chris Frost, Bernardo Farah, Anubhav Jain and Jeffrey Chen.
       References
       - Scalable scheduler design doc
       - https://github.com/golang/go/blob/master/src/runtime/runtime2.go
       - https://github.com/golang/go/blob/master/src/runtime/proc.go
       - Go scheduler blog post
       - Scheduling Multithreaded Computations by Work Stealing