Slide 1

Slide 1 text

@kavya719 The Scheduler Saga

Slide 2

Slide 2 text

kavya

Slide 3

Slide 3 text

the innards of the scheduler

Slide 4

Slide 4 text

the behind-the-scenes orchestrator of Go programs. the scheduler

Slide 5

Slide 5 text

func main() {
    // Create goroutines.
    for _, i := range images {
        go process(i)
    }
    ...
    // Wait.
    <-ch
}
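The slide elides the setup; here is a minimal runnable completion. The images slice, the process body, and the done channel are placeholder assumptions, not the talk's code.

```go
package main

import "fmt"

// A runnable completion of the slide's sketch. images, process, and the
// done channel are filled in as assumed placeholders.
var images = []string{"a.png", "b.png", "c.png"}

func process(img string, done chan<- string) {
	// Pretend to process the image, then signal completion.
	done <- img
}

func main() {
	// Create goroutines.
	done := make(chan string)
	for _, i := range images {
		go process(i, done)
	}
	// Wait.
	for range images {
		<-done
	}
	fmt.Println("all processed")
}
```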

Slide 6

Slide 6 text

func main() {
    // Create goroutines.
    for _, i := range images {
        go process(i)
    }
    ...
    // Wait.
    <-ch
}

func process(image) {
    // Create goroutine.
    go reportMetrics()

    complicatedAlgorithm(image)

    // Write to file.
    f, err := os.OpenFile()
    ...
}

runs goroutines created. pauses and resumes them: blocking channel operations, mutex operations. coordinates: blocking system calls, network I/O, runtime tasks, e.g. garbage collection.

Slide 7

Slide 7 text

func main() {
    // Create goroutines.
    for _, i := range images {
        go process(i)
    }
    ...
    // Wait.
    <-ch
}

func process(image) {
    // Create goroutine.
    go reportMetrics()

    complicatedAlgorithm(image)

    // Write to file.
    f, err := os.OpenFile()
    ...
}

runs goroutines created. coordinates: blocking system calls, network I/O, runtime tasks, e.g. garbage collection. pauses and resumes them: blocking channel operations, mutex operations. …for hundreds of thousands of goroutines*
* dependent on workload and hardware, of course.

Slide 8

Slide 8 text

the design of the scheduler, & its scheduling decisions impact the performance of our programs.

Slide 9

Slide 9 text

spec it build! the big ideas & one sneaky idea. assess it. the difficult questions. the important questions.

Slide 10

Slide 10 text

spec it the what, when & why.

Slide 11

Slide 11 text

why have a scheduler?

Slide 12

Slide 12 text

conceptually similar to kernel threads managed by the OS, but
 managed entirely by the Go runtime. lighter-weight and cheaper than kernel threads.
 multiplexed onto kernel threads by the scheduler. goroutines are user-space threads. why have a scheduler?

Slide 13

Slide 13 text

conceptually similar to kernel threads managed by the OS, but
 managed entirely by the Go runtime. lighter-weight and cheaper than kernel threads.
 multiplexed onto kernel threads by the scheduler. smaller memory footprint:
 initial goroutine stack = 2KB; default thread stack = 8KB.
 state tracking overhead. faster creation, destruction, context switches:
 goroutine switches = ~tens of ns; thread switches = ~a µs. goroutines are user-space threads. why have a scheduler?
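The cost difference shows up in practice: spawning a hundred thousand goroutines is routine, where a hundred thousand kernel threads would not be. A small sketch (not from the deck):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Spawn n goroutines and wait for all of them. Routine at n = 100,000,
// which would be prohibitive with one kernel thread per task.
func spawnMany(n int) int64 {
	var wg sync.WaitGroup
	var count int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&count, 1)
		}()
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println(spawnMany(100000)) // 100000
}
```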

Slide 14

Slide 14 text

conceptually similar to kernel threads managed by the OS, but
 managed entirely by the Go runtime. lighter-weight and cheaper than kernel threads.
 multiplexed onto kernel threads by the scheduler. CPU core thread2 thread1 } OS scheduler …but how are they run? goroutines are user-space threads. why have a scheduler?

Slide 15

Slide 15 text

goroutines are user-space threads. conceptually similar to kernel threads managed by the OS, but
 managed entirely by the Go runtime. lighter-weight and cheaper than kernel threads. multiplexed onto kernel threads. CPU core g1 g6 g2 thread2 } OS scheduler CPU core } OS scheduler } Go scheduler why have a scheduler?

Slide 16

Slide 16 text

when does it schedule?

Slide 17

Slide 17 text

when does it schedule? At operations that should or would affect goroutine execution.

func main() {
    // Create goroutines.
    for _, i := range images {
        go process(i)
    }
    ...
    // Wait.
    <-ch
}

goroutine creation: run new goroutines soon, continue this one for now.

func process(image) {
    // Create goroutine.
    go reportMetrics()

    complicatedAlgorithm(image)

    // Write to file.
    f, err := os.OpenFile()
    ...
}

goroutine blocking: pause this one immediately. blocking system call: the thread itself blocks too!

Slide 18

Slide 18 text

when does it schedule? Tmain create g1 <-ch S S gmain g1 gmain T: threads g: goroutines S: scheduler time At operations that should or would affect goroutine execution. The runtime causes a switch into the scheduler under-the-hood, and the scheduler may schedule a different goroutine on this thread.
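The timeline above can be reproduced with a tiny program: gmain blocks on a channel receive, the runtime switches into the scheduler under the hood, and g1 gets to run. A minimal sketch:

```go
package main

import "fmt"

// gmain blocks on <-ch: a scheduling point. The runtime switches into
// the scheduler, which can schedule g1 on this thread.
func main() {
	ch := make(chan string)
	go func() { // g1
		ch <- "done"
	}()
	fmt.Println(<-ch) // gmain pauses here until g1 sends
}
```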

Slide 19

Slide 19 text

use a small number of kernel threads. kernel threads are expensive to create. support high concurrency. Go programs should be able to run lots and lots of goroutines. leverage parallelism i.e. scale to N cores. On an N-core machine, Go programs should be able to run N
 goroutines in parallel.* #schedgoals for scheduling goroutines onto kernel threads. * depending on the program structure, of course.

Slide 20

Slide 20 text

build it! the big ideas & neat details.

Slide 21

Slide 21 text

how to multiplex goroutines onto kernel threads? when to create threads? how to distribute goroutines across threads?

Slide 22

Slide 22 text

Goroutines that are ready to run and need to be scheduled are tracked
 in heap-allocated FIFO runqueues. prelude: runqueues

Slide 23

Slide 23 text

Tmain create g1 gmain program heap memory add g1 remove this g to run longest waiter runq.head runq.tail runnable goroutines { Goroutines that are ready to run and need to be scheduled are tracked
 in heap-allocated FIFO runqueues. prelude: runqueues
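As a sketch in Go (the names runq, put, and get are illustrative simplifications; the real runtime uses fixed-size ring buffers plus an overflow list):

```go
package main

import "fmt"

// A minimal FIFO runqueue: add at the tail, remove the longest waiter
// from the head. Names here are illustrative, not the runtime's.
type g struct {
	id int
}

type runq struct {
	gs []*g // gs[0] is the head: the longest waiter
}

func (q *runq) put(gp *g) {
	q.gs = append(q.gs, gp) // add at the tail
}

func (q *runq) get() *g {
	if len(q.gs) == 0 {
		return nil
	}
	gp := q.gs[0] // remove from the head
	q.gs = q.gs[1:]
	return gp
}

func main() {
	q := &runq{}
	q.put(&g{id: 1})
	q.put(&g{id: 2})
	fmt.Println(q.get().id) // 1: the longest waiter runs first
}
```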

Slide 24

Slide 24 text

assume: running on a box with 2 CPU cores. creates g1, g2. goroutine blocks.

func main() {
    // Create goroutines.
    for _, i := range images {
        go process(i)
    }
    ...
    // Wait.
    <-ch
}

func process(image) {
    // Create goroutine.
    go reportMetrics()

    complicatedAlgorithm(image)

    // Write to file.
    f, err := os.OpenFile()
    ...
}

gmain g1

Slide 25

Slide 25 text

first, the non-ideas

Slide 26

Slide 26 text

first, the non-ideas

I. Multiplex all goroutines on a single thread.
— no concurrency!
 if a goroutine blocks the thread, no other goroutines run either.
— no parallelism possible:
 can only use a single CPU core, even if more are available.

Tmain gmain g1 blocking syscall <-ch g3 runq create g1 g3

Slide 27

Slide 27 text

II. Create & destroy a thread per-goroutine. — defeats the purpose of using goroutines
 threads are heavyweight, and expensive to create and destroy. okay, here’s an idea… first, the non-ideas

Slide 28

Slide 28 text

Create threads when needed; keep them around for reuse. idea I: reuse threads there’re goroutines to run, but all threads are busy.

Slide 29

Slide 29 text

Create threads when needed; keep them around for reuse. idea I: reuse threads “thread parking” i.e. put them to sleep; no longer uses a CPU core. track idle threads in a list (“mIdle”). there’re goroutines to run, but all threads are busy.

Slide 30

Slide 30 text

Create threads when needed; keep them around for reuse. idea I: reuse threads The threads get goroutines to run from a runqueue.
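A loose analogy for the scheme, using goroutines to stand in for threads (illustrative only, not runtime code): a fixed set of reusable workers pull from a shared queue, and "park" by blocking on the channel receive when there is no work.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll doubles each item using a small, fixed set of reusable
// workers; idle workers block ("park") on the empty work channel.
func processAll(items []int, workers int) int {
	work := make(chan int)
	results := make(chan int, len(items))
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ { // reuse a fixed set of "threads"
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range work { // blocks when the queue is empty
				results <- item * 2
			}
		}()
	}

	for _, it := range items {
		work <- it
	}
	close(work)
	wg.Wait()
	close(results)

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(processAll([]int{1, 2, 3, 4, 5}, 2)) // 30
}
```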

Slide 31

Slide 31 text

Tmain gmain runq program scheduler idle threads

Slide 32

Slide 32 text

Tmain gmain create g1 runq program scheduler idle threads

Slide 33

Slide 33 text

add g1 to runqueue. work to do and all threads busy, start one to run g1 ! Tmain gmain create g1 g1 runq program scheduler idle threads

Slide 34

Slide 34 text

T1 Tmain gmain create g1 runq program scheduler idle threads g1

Slide 35

Slide 35 text

T1 g1 Tmain gmain create g1 runq program scheduler idle threads

Slide 36

Slide 36 text

T1 Tmain gmain g1 exits create g1 runq g1 program scheduler idle threads

Slide 37

Slide 37 text

Say g1 completes, park T1 rather than destroying it. T1 Tmain gmain create g1 runq program scheduler idle threads

Slide 38

Slide 38 text

Tmain gmain create g1 runq program scheduler create g2 idle threads T1

Slide 39

Slide 39 text

Tmain gmain create g1 runq add g2 to runqueue. idle thread present, don’t start a thread! g2 program scheduler create g2 idle threads T1

Slide 40

Slide 40 text

T1 g2 runq Tmain gmain create g1 program scheduler a match made in (scheduling) heaven. create g2 idle threads

Slide 41

Slide 41 text

sweet. We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism.
 Work is naturally balanced across threads too.

Slide 42

Slide 42 text

sweet. …but — multiple threads access the same runqueue, so need a lock. We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism.
 Work is naturally balanced across threads too. serializes scheduling.

Slide 43

Slide 43 text

sweet. Tmain gmain create long-running g x10000, in quick succession. We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism.
 Work is naturally balanced across threads too. …but — multiple threads access the same runqueue, so need a lock.

Slide 44

Slide 44 text

sweet. — an unbounded number of threads can still be created. We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism. …but — multiple threads access the same runqueue, so need a lock.

Slide 45

Slide 45 text

sweet. hella contention possible. We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism. — an unbounded number of threads can still be created. …but — multiple threads access the same runqueue, so need a lock.

Slide 46

Slide 46 text

sweet. hella not scalable. We have a scheme that nicely reduces thread creations and still provides concurrency, parallelism. — an unbounded number of threads can still be created. …but — multiple threads access the same runqueue, so need a lock.

Slide 47

Slide 47 text

reusing threads is still a good idea. If the problem is that an unbounded number of threads can access the runqueue… thread creation is expensive; reusing threads amortizes that cost.

Slide 48

Slide 48 text

idea II: limit threads accessing runqueue

Slide 49

Slide 49 text

idea II: limit threads accessing runqueue Limit the number of threads accessing the runqueue. As before, keep threads around for reuse;
 get goroutines to run from the runqueue.

Slide 50

Slide 50 text

idea II: limit threads accessing runqueue threads that are running goroutines; threads in syscalls etc. won’t count towards this limit. Limit the number of threads accessing the runqueue. As before, keep threads around for reuse;
 get goroutines to run from the runqueue.

Slide 51

Slide 51 text

idea II: limit threads accessing runqueue Limit the number of threads accessing the runqueue. As before, keep threads around for reuse;
 get goroutines to run from the runq. …to what? too many → too much contention. too few → won’t use all the CPU cores, i.e. will give up on parallelism.

Slide 52

Slide 52 text

idea II: limit threads accessing runqueue Limit the number of threads accessing the runqueue. As before, keep threads around for reuse;
 get goroutines to run from the runq. …to what? To the number of CPU cores, to get all the parallelism we can! CORES too many → too much contention. too few → won’t use all the CPU cores, i.e. will give up on parallelism.

Slide 53

Slide 53 text

T1 g1 Tmain gmain create g1 Say Tmain creates g2, but gmain and g1 are still running. Limit # threads accessing runqueue to number of CPU cores (N) = 2. runq program scheduler create g2

Slide 54

Slide 54 text

T1 g1 Tmain gmain create g1 create g2 Say Tmain creates g2, but gmain and g1 are still running. Limit # threads accessing runqueue to number of CPU cores (N) = 2. g2 runq program scheduler add g2 to runqueue. work to do, all threads busy,
 but #(runq threads) is not < N, don’t start any!

Slide 55

Slide 55 text

T1 g1 Tmain gmain g2 g2 will be run at a future point. Limit # threads accessing runqueue to number of CPU cores (N) = 2. runq program scheduler create g1 create g2 <-ch

Slide 56

Slide 56 text

CORES We get around unbounded thread contention, without 
 giving up parallelism. seems reasonable.

Slide 57

Slide 57 text

CORES We get around unbounded thread contention, without 
 giving up parallelism. …ship it? seems reasonable.

Slide 58

Slide 58 text

CORES We get around unbounded thread contention, without 
 giving up parallelism. …ship it? seems reasonable.

Slide 59

Slide 59 text

We get around unbounded thread contention, without 
 giving up parallelism. — This scheme does not scale with the number of CPU cores! As N ↑ ⟶ number of runqueue-accessing threads ↑. seems reasonable. …ship it? ruh-roh, we’re in hella contention land again.

Slide 60

Slide 60 text

the experiment the modified Go scheduler: uses a global runqueue, and 
 #(goroutine-running threads) = #(CPU cores). everything else about the runtime is unmodified. the benchmark: CreateGoroutineParallel, in the go repo. creates #(CPU cores) goroutines in parallel, 
 until a total of b.N goroutines have been created. the machines: A 4-core and 16-core x86-64.

Slide 61

Slide 61 text

the modified Go scheduler: uses a global runqueue, and 
 #(goroutine-running threads) = #(CPU cores). everything else about the runtime is unmodified. the benchmark: CreateGoroutineParallel, in the go repo. creates #(CPU cores) goroutines in parallel, 
 until a total of b.N goroutines have been created. the machines: A 4-core and 16-core x86-64. the experiment

Slide 62

Slide 62 text

scheduler benchmarks (CreateGoroutineParallel) the experiment On the 4-core: the modified scheduler takes about 4x longer than the Go scheduler. On the 16-core: the modified scheduler takes about 31x longer than the Go scheduler!

Slide 63

Slide 63 text

We get around unbounded thread contention, without 
 giving up parallelism. — This scheme does not scale with the number of CPU cores! As N ↑ ⟶ number of runqueue-accessing threads ↑. ruh-roh, we’re in hella contention land again. nope. seems reasonable. …ship it?

Slide 64

Slide 64 text

really, the problem is the single shared runqueue. #(goroutine-running threads) = #(CPU cores) is still clever; it lets us maximally leverage parallelism.

Slide 65

Slide 65 text

idea III: distributed runqueues Use N runqueues on an N-core machine. 
 A thread claims a runqueue to run goroutines.

Slide 66

Slide 66 text

it inserts and removes goroutines from the runqueue it is associated with. idea III: distributed runqueues As before, reuse threads. Use N runqueues on an N-core machine. 
 A thread claims a runqueue to run goroutines.

Slide 67

Slide 67 text

program scheduler add g1 to Tmain ’s runq work to do, all threads busy,
 #(runq threads) < N, start one to run g1 ! Tmain gmain create g1 runqA g1 runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 68

Slide 68 text

T1 Tmain gmain create g1 program scheduler g1 runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 69

Slide 69 text

T1 Tmain gmain create g1 program scheduler g1 uh oh. ! The local runq is empty. runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 70

Slide 70 text

so, steal! If the local runqueue is empty, steal work from another runqueue. It organically balances work across runqueues. “work stealing” pick another runqueue at random, steal half its work.
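The steal itself can be sketched like so (illustrative; the real runtime does this lock-free over fixed-size ring buffers, in runtime/proc.go — the thief grabs the oldest half of the victim's queue):

```go
package main

import "fmt"

// stealHalf moves the oldest half of the victim's runqueue to the
// thief's. A simplified sketch, not the runtime's implementation.
func stealHalf(local, victim []int) (newLocal, newVictim []int) {
	n := (len(victim) + 1) / 2 // half, rounding up
	newLocal = append(local, victim[:n]...)
	newVictim = victim[n:]
	return newLocal, newVictim
}

func main() {
	local, victim := stealHalf(nil, []int{1, 2, 3, 4})
	fmt.Println(local, victim) // [1 2] [3 4]
}
```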

Slide 71

Slide 71 text

so, steal! If the local runqueue is empty, steal work from another runqueue. It organically balances work across threads.

Slide 72

Slide 72 text

T1 Tmain gmain create g1 program scheduler g1 ! runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 73

Slide 73 text

T1 Tmain gmain create g1 program scheduler g1 the steal runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 74

Slide 74 text

program scheduler T1 g1 Tmain gmain create g1 “the end justifies the means”? runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 75

Slide 75 text

this looks promising! let’s continue. This scheme scales nicely with the number of CPU cores, and threads don’t contend for work. The work across threads is balanced with work-stealing.
 handoff prevents starvation from blocked threads.

Slide 76

Slide 76 text

goroutine & thread block. creates g3.

func process(image) {
    // Create goroutine.
    go reportMetrics()

    complicatedAlgorithm(image)

    // Write to file.
    f, err := os.OpenFile()
    ...
}

g1

Slide 77

Slide 77 text

program scheduler T1 g1 Tmain gmain create g1 g3 create g3 add g3 to T1 ’s runq. don’t start a thread; #(runq-threads) is not < N runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 78

Slide 78 text

program scheduler T1 g1 Tmain gmain create g1 blocking syscall g3 create g3 runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 79

Slide 79 text

program scheduler T1 g1 Tmain gmain create g1 create g3 blocking syscall g3 The runqueue has work, and the thread’s blocked. runqA runqB oh no. Number of CPU cores (N) = number of runqueues = 2.

Slide 80

Slide 80 text

Use a mechanism to transfer a blocked thread’s runqueue to another thread. “handoff” Why can’t the thread itself hand off the runqueue before it enters the system call? If it did, it could give up its runqueue unnecessarily! } a background monitor thread detects threads that have been blocked for a while, and takes their runqueues and gives them away.

Slide 81

Slide 81 text

“handoff” Use a mechanism to transfer a blocked thread’s runqueue to another thread. The thread limit (= number of CPU cores) applies to goroutine-running 
 threads only.
 The original thread is blocked; so, another thread can take its place running goroutines. this is okay to do! Unpark a parked thread or start a thread if needed.

Slide 82

Slide 82 text

Use a mechanism to transfer a blocked thread’s runqueue to another thread. “handoff” Prevents goroutine starvation.

Slide 83

Slide 83 text

program scheduler T1 g1 Tmain gmain create g1 create gx blocking syscall g3 runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 84

Slide 84 text

program scheduler T1 g1 Tmain gmain create g1 create gx blocking syscall g3 There’s a runqueue with work, its thread is blocked, and no parked threads, so the monitor starts a thread. runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 85

Slide 85 text

program scheduler T1 g1 Tmain gmain create g1 create gx blocking syscall T2 g3 runqA Number of CPU cores (N) = number of runqueues = 2. runqB

Slide 86

Slide 86 text

program scheduler T1 g1 Tmain gmain create g1 create gx blocking syscall g3 T2 handoff via the monitor runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 87

Slide 87 text

program scheduler T1 g1 Tmain gmain create g1 create gx blocking syscall g3 T2 runqA runqB Number of CPU cores (N) = number of runqueues = 2.

Slide 88

Slide 88 text

we have (finally) arrived. this looks promising! This scheme scales nicely with the number of CPU cores, and threads don’t contend for work. The work across threads is balanced with work-stealing;
 handoff prevents starvation from blocked threads.

Slide 89

Slide 89 text

the Go scheduler the big ideas. reuse threads.

Slide 90

Slide 90 text

the Go scheduler the big ideas. reuse threads. limit #(goroutine-running) threads to number of CPU cores. GOMAXPROCS

Slide 91

Slide 91 text

the Go scheduler the big ideas. distributed runqueues with stealing and handoff. limit #(goroutine-running) threads to number of CPU cores. GOMAXPROCS reuse threads.

Slide 92

Slide 92 text

…and one sneaky idea. The scheduling points are cooperative i.e. the program calls into the scheduler.

// A CPU-bound computation that runs
// for a long, long time.
func complicatedAlgorithm(image) {
    // Do not create goroutines, or do
    // anything that blocks at all.
    ...
}

ruh-roh. a CPU-hog can starve runqueues.

Slide 93

Slide 93 text

To avoid this, the Go scheduler implements preemption*. * technically, cooperative preemption. It runs a background thread called the “sysmon”, to detect long-running goroutines (> 10ms; with caveats), and unschedule them when possible.
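Besides waiting for the sysmon (or the non-cooperative preemption added later, per the limitations slide), a CPU-bound loop can also yield explicitly with runtime.Gosched, which calls into the scheduler voluntarily. A sketch:

```go
package main

import (
	"fmt"
	"runtime"
)

// A CPU-bound loop given an explicit scheduling point. runtime.Gosched
// yields the thread so other goroutines can run; without it, a tight
// loop like the slide's complicatedAlgorithm can hog its thread.
func complicatedAlgorithm(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i
		if i%1024 == 0 {
			runtime.Gosched() // voluntary call into the scheduler
		}
	}
	return sum
}

func main() {
	fmt.Println(complicatedAlgorithm(10)) // 45
}
```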

Slide 94

Slide 94 text

…where would preempted goroutines be put? They essentially starved other goroutines from running, so we don’t want to put them back on the per-core runqueues; it would not be fair. To avoid this, the Go scheduler implements preemption*. * technically, cooperative preemption.

Slide 95

Slide 95 text

…on a global runqueue. “surprise”! The Go scheduler has a global runqueue in addition to the distributed runqueues. that’s right.

Slide 96

Slide 96 text

…on a global runqueue. It uses this as a lower priority runqueue. Threads access it less frequently than their local runqueues;
 so, contention is not a real issue.

Slide 97

Slide 97 text

a neat detail (or two)…

thread spinning
Threads without work “spin” looking for work before parking; they check the global runqueue, poll the network, attempt to run gc tasks, and work-steal. This burns CPU cycles, but maximally leverages available parallelism.

Ps and runqueues
The per-core runqueues are stored in a heap-allocated “p” struct. It stores other resources a thread needs to run goroutines too, like a memory cache. A thread claims a p to run goroutines, and the entire p is handed off when it’s blocked. Fun fact: this handoff is taken care of by the sysmon too.

Slide 98

Slide 98 text

assess it. the difficult questions.

Slide 99

Slide 99 text

#schedgoals for scheduling goroutines onto kernel threads. use a small number of kernel threads. ideas: reuse threads & limit the number of goroutine-running threads. support high concurrency. ideas: threads use independent runqueues & keep them balanced. leverage parallelism i.e. scale to N cores. ideas: use a runqueue per core & employ thread spinning.

Slide 100

Slide 100 text

limitations FIFO runqueues → no notion of goroutine priorities. Implement runqueues as priority queues, like the Linux scheduler. No strong preemption → no strong fairness or latency guarantees. recent proposal to fix this: Non-cooperative goroutine preemption. Is not aware of the system topology → no real locality. dated proposal to fix this: NUMA-aware scheduler Use LIFO, rather than FIFO, runqueues; better for cache utilization.

Slide 101

Slide 101 text

CORES The Go scheduler motto, in a picture.

Slide 102

Slide 102 text

@kavya719 speakerdeck.com/kavya719/the-scheduler-saga Special thanks to Eben Freeman and Austin Duffield for reading drafts of this, & also Chris Frost, Bernardo Farah, Anubhav Jain and Jeffrey Chen.

References
- Scalable scheduler design doc
- https://github.com/golang/go/blob/master/src/runtime/runtime2.go
- https://github.com/golang/go/blob/master/src/runtime/proc.go
- Go scheduler blog post
- Scheduling Multithreaded Computations by Work Stealing
