Slide 1

Go Runtime Scheduler
Go Implementation -- Part I
12 May 2016
Gao Chao

Slide 2

Agenda
- Concepts
- Some Code
- Discussion

Slide 3

Why study the runtime
- Go is performant
- Goroutines
- How to manage goroutines

Slide 4

It explains:
- GOMAXPROCS
- the goroutine numbers in your service
- the goroutine scheduler

Slide 5

Go scheduler before 1.2

1. Single global mutex (Sched.Lock) and centralized state. The mutex protects all goroutine-related operations (creation, completion, rescheduling, etc).
2. Goroutine (G) hand-off (G.nextg). Worker threads (M's) frequently hand off runnable goroutines between each other; this may lead to increased latencies and additional overheads. Every M must be able to execute any runnable G, in particular the M that just created the G.
3. Per-M memory cache (M.mcache). Memory caches and other caches (stack alloc) are associated with all M's, while they need to be associated only with M's running Go code (an M blocked inside a syscall does not need an mcache). The ratio between M's running Go code and all M's can be as high as 1:100. This leads to excessive resource consumption (each MCache can consume up to 2 MB) and poor data locality.
4. Aggressive thread blocking/unblocking. In the presence of syscalls, worker threads are frequently blocked and unblocked. This adds a lot of overhead.

Slide 6

Basic Concepts
- G -- goroutine
- M -- OS thread
- P -- processor (an abstract concept)

Slide 7

Responsibilities
- An M must have an associated P to execute Go code; however, it can be blocked or in a syscall without an associated P.
- Gs live in a P's local queue or in the global queue.
- A G keeps the current task's status and provides its stack.

Slide 8

GOMAXPROCS -- the number of P's

// go/src/runtime/proc.go
func schedinit() {
    ...
    procs := int(ncpu)
    if n := atoi(gogetenv("GOMAXPROCS")); n > 0 {
        if n > _MaxGomaxprocs {
            n = _MaxGomaxprocs
        }
        procs = n
    }
    if procresize(int32(procs)) != nil {
        throw("unknown runnable goroutine during bootstrap")
    }
    ...
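For reference, a tiny runnable example using only the public runtime API (not the internals above) to read the effective setting without changing it:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Passing a non-positive value queries the current setting
    // without changing it.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    fmt.Println("NumCPU:", runtime.NumCPU())
}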

Slide 9

Don't call GOMAXPROCS at runtime (when possible) -- changing it stops the world:

func GOMAXPROCS(n int) int {
    if n > _MaxGomaxprocs {
        n = _MaxGomaxprocs
    }
    lock(&sched.lock)
    ret := int(gomaxprocs)
    unlock(&sched.lock)
    if n <= 0 || n == ret {
        return ret
    }
    stopTheWorld("GOMAXPROCS")
    // newprocs will be processed by startTheWorld
    newprocs = int32(n)
    startTheWorld()
    return ret
}

Slide 10

G -- goroutine
- Created in user space
- Initial 2 KB stack
- Created by:

func newproc(siz int32, fn *funcval) {
    ...
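A quick illustration of why the small initial stack matters (a minimal sketch; every go statement below triggers a newproc call in the runtime):

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    // 100,000 goroutines at ~2 KB of initial stack each is only
    // ~200 MB of stack in the worst case, which is what makes
    // spawning this many practical.
    for i := 0; i < 100000; i++ {
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            _ = n * n // trivial work
        }(i)
    }
    wg.Wait()
    fmt.Println("done")
}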

Slide 11

goroutine numbers

Why Go allows us to create goroutines so easily:

func newproc1(fn *funcval, argp *uint8, narg int32, nret int32, callerpc uintptr) *g {
    _g_ := getg() // GET current G
    ...
    _p_ := _g_.m.p.ptr()
    // GET idle G from current P's queue
    newg := gfget(_p_)
    if newg == nil {
        newg = malg(_StackMin)
        casgstatus(newg, _Gidle, _Gdead)
        // publishes with a g->status of Gdead so GC scanner
        // doesn't look at uninitialized stack
        allgadd(newg)
    }

Goroutines will be reused.

Slide 12

M -- thread

Initialization:

// go/src/runtime/proc.go
// Set max M number to 10000
sched.maxmcount = 10000
...
// Initialize stack space
stackinit()
...
// Initialize current M
mcommoninit(_g_.m)
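The 10000 cap is only a default; it is exposed through the public runtime/debug API. A minimal sketch (exceeding the cap crashes the process rather than blocking, so treat it as a guardrail):

package main

import "runtime/debug"

func main() {
    // SetMaxThreads adjusts the runtime's thread cap and returns
    // the previous setting.
    prev := debug.SetMaxThreads(20000)
    _ = prev
}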

Slide 13

P -- processor
- Max value: 1 << 8 (256)
- A P tries to put a newly created G into its local queue first; if the local queue is full, it puts the new G into the global queue (which requires a lock). A sketch of this policy follows.
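A minimal, self-contained sketch of that enqueue policy (hypothetical types; the real runtime uses a fixed-size lock-free ring for the local queue, not a channel):

package main

import (
    "fmt"
    "sync"
)

type g struct{ id int }

type p struct {
    local chan *g // buffered channel stands in for the local run queue
}

var (
    globalMu sync.Mutex
    globalQ  []*g
)

func (pp *p) put(gp *g) {
    select {
    case pp.local <- gp: // fast path: local queue has room, no lock
    default: // local queue full: take the lock and go global
        globalMu.Lock()
        globalQ = append(globalQ, gp)
        globalMu.Unlock()
    }
}

func main() {
    pp := &p{local: make(chan *g, 4)}
    for i := 0; i < 10; i++ {
        pp.put(&g{id: i})
    }
    fmt.Println("local:", len(pp.local), "global:", len(globalQ))
}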

Slide 14

Workflow

                            +-------------------- sysmon -----------------//----+
                            |                                                   |
                            |                                                   |
               +---+      +---+-------+                 +--------+          +---+---+
go func() ---> | G | ---> | P | local | <===balance===> | global | <--//--- | P | M |
               +---+      +---+-------+                 +--------+          +---+---+
                              |                                                 |
                              |      +---+                                      |
                              +----> | M | <--- findrunnable --+-- steal <--//--+
                                     +---+
                                       |
                            +--- execute <----- schedule
                            |                      |
                            |                      |
                            +--> G.fn --> goexit --+

1. go func() creates a new goroutine.
2. The newly created goroutine is put into a local or global queue.
3. An M is woken up or created to execute the goroutine.
4. The M enters the schedule loop.
5. It tries its best to find a goroutine to execute.
6. Clean up, re-enter the schedule loop.

Slide 15

Runtime Scheduler

How to efficiently distribute tasks: Work Sharing vs. Work Stealing

Slide 16

Work Sharing

Whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors, in hopes of distributing the work to underutilized processors.

Slide 17

Work Stealing

Underutilized processors take the initiative: processors needing work steal computation threads from other processors.

Slide 18

Compare
- Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing.
- When all processors have work to do, no threads are migrated by a work-stealing scheduler, while threads are always migrated by a work-sharing scheduler.

Slide 19

Work Stealing Algorithms

Slide 20

Busy-Leaves Algorithm

0. There is a global ready-thread pool.
1. At the beginning of each step, each processor either is idle or has a thread to work on.
2. Those processors that are idle begin the step by attempting to remove any ready thread from the pool.
   - 2.1 If there are sufficiently many ready threads in the pool to satisfy all of the idle processors, then every idle processor gets a ready thread to work on.
   - 2.2 Otherwise, some processors remain idle.
3. Then each processor that has a thread to work on executes the next instruction from that thread until the thread either spawns, stalls, or dies.

A minimal sketch of the shared-pool step appears below.
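This sketch models only step 2 (hypothetical setup; a channel plays the global pool, and spawning/stalling is omitted):

package main

import (
    "fmt"
    "sync"
)

func main() {
    // The global ready-thread pool; idle processors pull from it.
    pool := make(chan int, 16)
    for task := 0; task < 16; task++ {
        pool <- task
    }
    close(pool)

    var wg sync.WaitGroup
    for w := 0; w < 4; w++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // An idle processor attempts to remove a ready thread;
            // when the pool is empty it simply remains idle (exits).
            for task := range pool {
                fmt.Printf("processor %d runs task %d\n", id, task)
            }
        }(w)
    }
    wg.Wait()
}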

Slide 21

Randomized Work-Stealing Algorithm

0. The centralized thread pool of the Busy-Leaves Algorithm is distributed across the processors.
1. Each processor maintains a ready deque data structure of threads.
2. A processor obtains work by removing the thread at the bottom of its ready deque.
3. The algorithm begins work stealing when a processor's ready deque is empty.
   - 3.1 The processor becomes a thief and attempts to steal work from a victim processor chosen uniformly at random.
   - 3.2 The thief queries the ready deque of the victim, and if it is nonempty, the thief removes and begins work on the top thread.
   - 3.3 If the victim's ready deque is empty, however, the thief tries again, picking another victim at random.

A sketch of this loop follows.
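A compact, mutex-based sketch of the loop above (illustrative names; real implementations use a lock-free deque, and a real scheduler would keep retrying steals rather than give up):

package main

import (
    "fmt"
    "math/rand"
    "sync"
)

// deque is a mutex-guarded ready deque: the owner pops from the
// bottom, thieves steal from the top.
type deque struct {
    mu    sync.Mutex
    tasks []int
}

func (d *deque) pushBottom(t int) {
    d.mu.Lock()
    defer d.mu.Unlock()
    d.tasks = append(d.tasks, t)
}

func (d *deque) popBottom() (int, bool) {
    d.mu.Lock()
    defer d.mu.Unlock()
    if len(d.tasks) == 0 {
        return 0, false
    }
    t := d.tasks[len(d.tasks)-1]
    d.tasks = d.tasks[:len(d.tasks)-1]
    return t, true
}

func (d *deque) stealTop() (int, bool) {
    d.mu.Lock()
    defer d.mu.Unlock()
    if len(d.tasks) == 0 {
        return 0, false
    }
    t := d.tasks[0]
    d.tasks = d.tasks[1:]
    return t, true
}

func worker(id int, deques []*deque, wg *sync.WaitGroup) {
    defer wg.Done()
    self := deques[id]
    for {
        if t, ok := self.popBottom(); ok { // own deque first
            fmt.Printf("processor %d ran task %d\n", id, t)
            continue
        }
        // Deque empty: become a thief, pick a victim at random.
        victim := deques[rand.Intn(len(deques))]
        if t, ok := victim.stealTop(); ok {
            fmt.Printf("processor %d stole task %d\n", id, t)
            continue
        }
        return // no work found; a real scheduler would retry
    }
}

func main() {
    deques := []*deque{{}, {}}
    for t := 0; t < 8; t++ {
        deques[0].pushBottom(t) // all work starts on processor 0
    }
    var wg sync.WaitGroup
    for id := range deques {
        wg.Add(1)
        go worker(id, deques, &wg)
    }
    wg.Wait()
}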

Slide 22

Reminder -- Go Runtime Entities
- An M must have an associated P to execute Go code; however, it can be blocked or in a syscall without an associated P.
- Gs live in a P's local queue or in the global queue.
- A G keeps the current task's status and provides its stack.

Go implements ideas from both Busy-Leaves and Randomized Work-Stealing.

Slide 23

goroutine queues

type p struct {
    // Available G's (status == Gdead)
    gfree    *g
    gfreecnt int32
}

type schedt struct {
    // Global cache of dead G's.
    gflock mutex
    gfree  *g
    ngfree int32
}

Slide 24

steal goroutines from the global free list

// Get from gfree list.
// If local list is empty, grab a batch from global list.
func gfget(_p_ *p) *g {
retry:
    gp := _p_.gfree
    if gp == nil && sched.gfree != nil {
        lock(&sched.gflock)
        for _p_.gfreecnt < 32 && sched.gfree != nil {
            _p_.gfreecnt++
            gp = sched.gfree
            sched.gfree = gp.schedlink.ptr()
            sched.ngfree--
            gp.schedlink.set(_p_.gfree)
            _p_.gfree = gp
        }
        unlock(&sched.gflock)
        goto retry
    }

Slide 25

steal goroutines from other places

// Finds a runnable goroutine to execute.
// Tries to steal from other P's, get g from global queue, poll network.
func findrunnable() (gp *g, inheritTime bool) {
    ...
    // random steal from other P's
    for i := 0; i < int(4*gomaxprocs); i++ {
        if sched.gcwaiting != 0 {
            goto top
        }
        _p_ := allp[fastrand1()%uint32(gomaxprocs)]
        var gp *g
        if _p_ == _g_.m.p.ptr() {
            gp, _ = runqget(_p_)
        } else {
            stealRunNextG := i > 2*int(gomaxprocs) // first look for ready queues with more than 1 g
            gp = runqsteal(_g_.m.p.ptr(), _p_, stealRunNextG)
        }
        if gp != nil {
            return gp, false
        }
    }
    ...

Slide 26

Multithreading

Go programs are naturally multithreaded programs; all the pros and cons of multithreaded programs apply.
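One concrete con, as a small illustration: unsynchronized shared state races even in pure Go code; `go run -race` flags it, and a mutex (or an atomic) fixes it:

package main

import (
    "fmt"
    "sync"
)

func main() {
    var (
        mu      sync.Mutex
        counter int
        wg      sync.WaitGroup
    )
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            mu.Lock() // without this lock, `go run -race` reports a data race
            counter++
            mu.Unlock()
        }()
    }
    wg.Wait()
    fmt.Println("counter:", counter) // always 1000 with the lock
}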

Slide 27

Latency Numbers

Slide 28

NUMA

What every programmer should know about memory (https://www.akkadia.org/drepper/cpumemory.pdf)

Slide 29

NUMA-Aware Go Scheduler

Global resources (MHeap, global RunQ, and the pool of M's) are partitioned between NUMA nodes; netpoll and timers become distributed per-P.

Slide 30

Discussion

Slide 31

References
- Scalable Go Scheduler Design Doc (https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit#)
- Go Preemptive Scheduler Design Doc (https://docs.google.com/document/d/1ETuA2IOmnaQ4j81AtTGT40Y4_Jr6_IDASEKg0t0dBR8/edit)
- Scheduling Multithreaded Computations by Work Stealing (http://supertech.csail.mit.edu/papers/steal.pdf)
- What every programmer should know about memory (https://www.akkadia.org/drepper/cpumemory.pdf)

Slide 32

Thank you

Gao Chao -- @reterclose (http://twitter.com/reterclose)
