Go Runtime Scheduler

A brief introduction to the basic concepts of Go's runtime scheduler.

Brandon Gao

June 03, 2016
Transcript

  1. Go Runtime Scheduler
    Go Implementation -- Part I
    12 May 2016
    Gao Chao

  2. Agenda
    Concepts
    Some Code
    Discussion

  3. Why study runtime
    Go is performant
    Goroutine
    How to manage goroutines

  4. Explanations of
    GOMAXPROCS
    goroutine numbers in your service
    goroutine scheduler

  5. Go scheduler before 1.1
    1. Single global mutex (Sched.Lock) and centralized state. The mutex protects all
    goroutine-related operations (creation, completion, rescheduling, etc.).
    2. Goroutine (G) hand-off (G.nextg). Worker threads (M's) frequently hand off runnable
    goroutines between each other, which may lead to increased latencies and additional
    overheads. Every M must be able to execute any runnable G, in particular the M that
    just created the G.
    3. Per-M memory cache (M.mcache). Memory cache and other caches (stack alloc) are
    associated with all M's, while they need to be associated only with M's running Go code
    (an M blocked inside a syscall does not need an mcache). The ratio between M's running
    Go code and all M's can be as high as 1:100. This leads to excessive resource
    consumption (each MCache can take up to 2M) and poor data locality.
    4. Aggressive thread blocking/unblocking. In the presence of syscalls, worker threads
    are frequently blocked and unblocked. This adds a lot of overhead.

  6. Basic Concepts
    G -- Goroutine
    M -- OS thread
    P -- Processor (abstracted concept)

  7. Responsibility
    An M must have an associated P to execute Go code; however, it can be blocked or in a
    syscall without an associated P.
    Gs sit in a P's local queue or in the global queue
    A G keeps the current task's state and provides its stack

  8. GOMAXPROCS
    Number of P
    // go/src/runtime/proc.go
    func schedinit() {
        ...
        procs := int(ncpu)
        if n := atoi(gogetenv("GOMAXPROCS")); n > 0 {
            if n > _MaxGomaxprocs {
                n = _MaxGomaxprocs
            }
            procs = n
        }
        if procresize(int32(procs)) != nil {
            throw("unknown runnable goroutine during bootstrap")
        }
        ...
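
    The decision made in schedinit can be mirrored in ordinary user code. The sketch below is
    not the runtime's implementation: resolveProcs and maxProcs are hypothetical names, and
    the limit of 256 is assumed from the 1 << 8 figure on the "P -- processor" slide.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
)

// maxProcs is an assumed upper bound (1 << 8) used only for this sketch.
const maxProcs = 1 << 8

// resolveProcs mimics the decision shown in schedinit: start from the CPU
// count, let a positive GOMAXPROCS value override it, and clamp the result.
func resolveProcs(ncpu int, env string) int {
	procs := ncpu
	if n, err := strconv.Atoi(env); err == nil && n > 0 {
		if n > maxProcs {
			n = maxProcs
		}
		procs = n
	}
	return procs
}

func main() {
	fmt.Println(resolveProcs(8, ""))     // no override: the CPU count wins
	fmt.Println(resolveProcs(8, "4"))    // explicit override
	fmt.Println(resolveProcs(8, "1000")) // clamped to maxProcs
	fmt.Println(runtime.GOMAXPROCS(0))   // the value the real runtime chose
}
```

    The last line prints whatever the running process actually uses, which is the number of Ps.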

  9. Don't call GOMAXPROCS at run time (when possible)
    func GOMAXPROCS(n int) int {
        if n > _MaxGomaxprocs {
            n = _MaxGomaxprocs
        }
        lock(&sched.lock)
        ret := int(gomaxprocs)
        unlock(&sched.lock)
        if n <= 0 || n == ret {
            return ret
        }
        stopTheWorld("GOMAXPROCS")
        // newprocs will be processed by startTheWorld
        newprocs = int32(n)
        startTheWorld()
        return ret
    }
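
    From user code, the same function is reachable as runtime.GOMAXPROCS. A minimal sketch of
    the two ways to call it follows; currentProcs and withProcs are hypothetical helpers
    written for this example, not part of any library.

```go
package main

import (
	"fmt"
	"runtime"
)

// currentProcs only reads the setting. An argument of 0 takes the early
// return (n <= 0), so the world is not stopped.
func currentProcs() int {
	return runtime.GOMAXPROCS(0)
}

// withProcs runs f with GOMAXPROCS set to n, then restores the old value.
// Every call that really changes the value goes through stopTheWorld.
func withProcs(n int, f func()) {
	prev := runtime.GOMAXPROCS(n) // returns the previous setting
	defer runtime.GOMAXPROCS(prev)
	f()
}

func main() {
	before := currentProcs()
	withProcs(1, func() {
		fmt.Println("inside:", currentProcs())
	})
	fmt.Println("restored:", currentProcs() == before)
}
```

    Querying is cheap; changing the value in a hot path is what the slide warns against.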

  10. G -- goroutine
    Created in user-space
    Initial 2 KB stack space
    Created by
    func newproc(siz int32, fn *funcval) {
        ...
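
    A small initial stack is workable because the runtime grows a goroutine's stack on demand.
    The sketch below recurses far deeper than 2 KB would allow; depth and deepInGoroutine are
    hypothetical names, and the recursion depth is an arbitrary choice for illustration.

```go
package main

import "fmt"

// depth recurses n times; every call adds one frame to the goroutine's stack.
func depth(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + depth(n-1)
}

// deepInGoroutine runs the recursion on a fresh goroutine and returns its result.
func deepInGoroutine(n int) int {
	done := make(chan int)
	go func() { done <- depth(n) }()
	return <-done
}

func main() {
	fmt.Println(deepInGoroutine(100000)) // 100000
}
```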

  11. goroutine numbers
    Why Go allows us to create goroutines so easily
    func newproc1(fn *funcval, argp *uint8, narg int32, nret int32, callerpc uintptr) *g {
        _g_ := getg() // GET current G
        ...
        _p_ := _g_.m.p.ptr()
        // GET idle G from current P's queue
        newg := gfget(_p_)
        if newg == nil {
            newg = malg(_StackMin)
            casgstatus(newg, _Gidle, _Gdead)
            allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitializ
        }
    Goroutines will be reused
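
    To get a feel for how cheap goroutines are, the sketch below parks a large number of them
    at once and asks the runtime how many exist. parkMany is a hypothetical helper, and the
    count of 10000 is arbitrary.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkMany starts n goroutines that block until released and reports how
// many goroutines exist while they are all parked.
func parkMany(n int) int {
	release := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-release // stay alive until the channel is closed
		}()
	}
	started.Wait() // all n goroutines exist and cannot exit yet
	live := runtime.NumGoroutine()
	close(release)
	done.Wait()
	return live
}

func main() {
	fmt.Println("live goroutines:", parkMany(10000))
}
```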

  12. M -- thread
    Initialization
    // go/src/runtime/proc.go
    // Set max M number to 10000
    sched.maxmcount = 10000
    ...
    // Initialize stack space
    stackinit()
    ...
    // Initialize current M
    mcommoninit(_g_.m)
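
    The thread cap can also be adjusted from user code with debug.SetMaxThreads in the
    runtime/debug package. A minimal sketch, where raiseThreadLimit is a hypothetical wrapper
    and 20000 is an arbitrary value:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// raiseThreadLimit sets a new cap on OS threads and returns the previous cap.
func raiseThreadLimit(n int) int {
	return debug.SetMaxThreads(n)
}

func main() {
	old := raiseThreadLimit(20000)
	fmt.Println("previous limit:", old)
	raiseThreadLimit(old) // restore the original cap
}
```

    A program that hits this cap crashes, so raising it is rarely the right fix for a thread leak.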

  13. P -- processor
    Max value (?): 1 << 8
    A P tries to put a newly created G into its local queue first; if the local queue is
    full, it puts the new G into the global queue (which takes a lock)
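
    The local-first, global-fallback rule can be sketched with a toy model. Everything here
    (toyP, globalQueue, localCap) is hypothetical and much simpler than the runtime's real
    queues; it only illustrates why the local path needs no lock while the global path does.

```go
package main

import (
	"fmt"
	"sync"
)

// localCap is a deliberately tiny, made-up capacity for the demo.
const localCap = 4

// globalQueue is shared by every P, so access is guarded by a mutex.
type globalQueue struct {
	mu    sync.Mutex
	items []int
}

func (q *globalQueue) put(g int) {
	q.mu.Lock()
	q.items = append(q.items, g)
	q.mu.Unlock()
}

// toyP owns a bounded local queue; only its owner touches it in this sketch.
type toyP struct {
	local  []int
	global *globalQueue
}

// put prefers the unlocked local queue and falls back to the global one.
func (p *toyP) put(g int) {
	if len(p.local) < localCap {
		p.local = append(p.local, g)
		return
	}
	p.global.put(g)
}

func main() {
	gq := &globalQueue{}
	p := &toyP{global: gq}
	for g := 1; g <= 6; g++ {
		p.put(g)
	}
	fmt.Println(p.local, gq.items) // [1 2 3 4] [5 6]
}
```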

  14. Workflow
                     +-------------------- sysmon ---------------//----+
                     |                                                 |
                     |                                                 |
                   +---+      +---+-------+                   +--------+          +---+---+
    go func() ---> | G | ---> | P | local | <=== balance ===> | global | <--//--- | P | M |
                   +---+      +---+-------+                   +--------+          +---+---+
                     |                                 |                 |
                     |      +---+                      |                 |
                     +----> | M | <--- findrunnable ---+--- steal <--//--+
                            +---+
                              |
                              |
                              +--- execute <----- schedule
                              |                      |
                              |                      |
                              +--> G.fn --> goexit --+
    1. go creates a new goroutine
    2. The newly created goroutine is put into the local or global queue
    3. An M is woken or created to execute the goroutine
    4. Schedule loop
    5. Try its best to get a goroutine to execute
    6. Clean up, re-enter the schedule loop

  15. Runtime Scheduler
    How to efficiently distribute tasks
    Work Sharing vs. Work Stealing

  16. Work sharing
    Whenever a processor generates new threads, the scheduler attempts to migrate
    some of them to other processors, in hopes of distributing the work to
    underutilized processors.

  17. Work Stealing
    Underutilized processors take the initiative
    Processors needing work steal computational threads from other processors

  18. Compare
    Intuitively, the migration of threads occurs less frequently with work stealing than
    with work sharing
    When all processors have work to do, no threads are migrated by a work-stealing
    scheduler
    Threads are always migrated by a work-sharing scheduler

  19. Work Stealing Algorithms

  20. Busy-Leaves Algorithm
    0. There is a global pool of ready threads.
    1. At the beginning of each step, each processor either is idle or has a thread to work
    on.
    2. Those processors that are idle begin the step by attempting to remove any ready
    thread from the pool.
    - 2.1 If there are sufficiently many ready threads in the pool to satisfy all of the idle
    processors, then every idle processor gets a ready thread to work on.
    - 2.2 Otherwise, some processors remain idle.
    3. Then each processor that has a thread to work on executes the next instruction
    from that thread until the thread either spawns, stalls, or dies.
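
    Step 2 of the algorithm can be sketched as a single function. This is a toy,
    single-threaded simulation under assumed names: step is hypothetical, threads are plain
    strings, and an empty string marks an idle processor.

```go
package main

import "fmt"

// step hands ready threads from the shared pool to idle processors.
// procs[i] == "" means processor i is idle. It returns the remaining pool.
func step(procs []string, pool []string) []string {
	for i := range procs {
		if procs[i] != "" {
			continue // busy: keeps working on its current thread
		}
		if len(pool) == 0 {
			break // 2.2: not enough ready threads, the rest stay idle
		}
		procs[i], pool = pool[0], pool[1:] // 2.1: an idle processor takes a thread
	}
	return pool
}

func main() {
	procs := []string{"t0", "", "", ""}
	pool := step(procs, []string{"t1", "t2"})
	fmt.Printf("%q remaining=%d\n", procs, len(pool))
}
```

    With two ready threads and three idle processors, two processors get work and one stays
    idle, which is case 2.2.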

  21. Randomized work-stealing algorithm
    0. The centralized thread pool of Busy-Leaves Algorithm is distributed across the
    processors.
    1. Each processor maintains a ready deque data structure of threads.
    2. A processor obtains work by removing the thread at the bottom of its ready deque.
    3. The work-stealing algorithm begins work stealing when a processor's ready deque is empty.
    - 3.1 The processor becomes a thief and attempts to steal work from a victim
    processor chosen uniformly at random.
    - 3.2 The thief queries the ready deque of the victim, and if it is nonempty, the thief
    removes and begins work on the top thread.
    - 3.3 If the victim's ready deque is empty, however, the thief tries again, picking another
    victim at random.

  22. Reminder -- Go Runtime Entities
    An M must have an associated P to execute Go code; however, it can be blocked or in a
    syscall without an associated P.
    Gs sit in a P's local queue or in the global queue
    A G keeps the current task's state and provides its stack
    The scheduler implements both Busy-Leaves and randomized work stealing

  23. goroutine queues
    type p struct {
        // Available G's (status == Gdead)
        gfree    *g
        gfreecnt int32
    }
    type schedt struct {
        // Global cache of dead G's.
        gflock mutex
        gfree  *g
        ngfree int32
    }

  24. steal goroutine from global queue
    // Get from gfree list.
    // If local list is empty, grab a batch from global list.
    func gfget(_p_ *p) *g {
    retry:
        gp := _p_.gfree
        if gp == nil && sched.gfree != nil {
            lock(&sched.gflock)
            for _p_.gfreecnt < 32 && sched.gfree != nil {
                _p_.gfreecnt++
                gp = sched.gfree
                sched.gfree = gp.schedlink.ptr()
                sched.ngfree--
                gp.schedlink.set(_p_.gfree)
                _p_.gfree = gp
            }
            unlock(&sched.gflock)
            goto retry
        }
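
    The same two-level free-list idea can be tried outside the runtime. The sketch below is a
    simplified analogue, not the runtime code: node, localFree, and globalFree are
    hypothetical types, and only the batch size of 32 is taken from the excerpt.

```go
package main

import (
	"fmt"
	"sync"
)

// node stands in for a dead G that can be reused.
type node struct {
	id   int
	next *node
}

// globalFree is the shared list, protected by a lock.
type globalFree struct {
	mu   sync.Mutex
	head *node
}

// localFree is a per-owner list that needs no lock.
type localFree struct {
	head  *node
	count int
}

// batch is how many nodes are moved per refill (32 in the excerpt).
const batch = 32

// get pops from the local list, refilling it from the global list when empty.
func (l *localFree) get(g *globalFree) *node {
	if l.head == nil {
		g.mu.Lock()
		for l.count < batch && g.head != nil {
			n := g.head
			g.head = n.next
			n.next = l.head
			l.head = n
			l.count++
		}
		g.mu.Unlock()
	}
	n := l.head
	if n != nil {
		l.head = n.next
		l.count--
		n.next = nil
	}
	return n // nil means the caller must allocate a fresh node
}

func main() {
	g := &globalFree{}
	for i := 0; i < 40; i++ {
		g.head = &node{id: i, next: g.head}
	}
	l := &localFree{}
	first := l.get(g)
	fmt.Println(first != nil, l.count) // true 31
}
```

    Moving a batch under one lock acquisition means the next 31 requests are served without
    touching the global lock at all.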

  25. steal goroutine from other places
    // Finds a runnable goroutine to execute.
    // Tries to steal from other P's, get g from global queue, poll network.
    func findrunnable() (gp *g, inheritTime bool) {
        ...
        // random steal from other P's
        for i := 0; i < int(4*gomaxprocs); i++ {
            if sched.gcwaiting != 0 {
                goto top
            }
            _p_ := allp[fastrand1()%uint32(gomaxprocs)]
            var gp *g
            if _p_ == _g_.m.p.ptr() {
                gp, _ = runqget(_p_)
            } else {
                stealRunNextG := i > 2*int(gomaxprocs) // first look for ready queues with more than 1 g
                gp = runqsteal(_g_.m.p.ptr(), _p_, stealRunNextG)
            }
            if gp != nil {
                return gp, false
            }
        }
        ...

  26. Multi Threading
    Go programs are naturally multithreaded programs
    All the pros and cons of multithreaded programs apply
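
    One of the familiar cons is that shared state needs synchronization, exactly as in any
    other multithreaded program. A minimal sketch, with counter and run as hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
)

// counter is shared by many goroutines, so every update takes the mutex.
type counter struct {
	mu sync.Mutex
	n  int
}

func (c *counter) inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// run increments one counter from `workers` goroutines, `perWorker` times each.
func run(workers, perWorker int) int {
	var c counter
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.inc()
			}
		}()
	}
	wg.Wait() // all increments happen before this returns
	return c.n
}

func main() {
	fmt.Println(run(8, 1000)) // 8000
}
```

    Without the mutex the increments would race, and the total would no longer be reliable.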

  27. Latency Numbers

  28. NUMA
    What every programmer should know about memory (https://www.akkadia.org/drepper/cpumemory.pdf)

  29. NUMA Aware Go Scheduler
    Global resources (MHeap, global RunQ and pool of M's) are partitioned between
    NUMA nodes; netpoll and timers become distributed per-P.

  30. Discussion

  31. References
    Scalable Go Scheduler Design Doc (https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit#)
    Go Preemptive Scheduler Design Doc (https://docs.google.com/document/d/1ETuA2IOmnaQ4j81AtTGT40Y4_Jr6_IDASEKg0t0dBR8/edit)
    Scheduling Multithreaded Computations by Work Stealing (http://supertech.csail.mit.edu/papers/steal.pdf)
    What every programmer should know about memory (https://www.akkadia.org/drepper/cpumemory.pdf)

  32. Thank you
    Gao Chao
    @reterclose (http://twitter.com/reterclose)
