Work-stealing Tree Scheduler

Explanation of the work-stealing tree scheduler used in ScalaBlitz.

Aleksandar Prokopec

September 25, 2013

Transcript

  1. Near Optimal Work-Stealing Tree for
    Highly Irregular Data-Parallel Workloads
    Aleksandar Prokopec
    Martin Odersky

  2. Near Optimal Work-Stealing Tree for
    Highly Irregular Data-Parallel Workloads
    Aleksandar Prokopec
    Martin Odersky
    Irregular Data-Parallel

  3. Uniform workload
    (0 until 10000000).reduce(_ + _)

  4. Uniform workload
    (0 until 10000000).reduce(_ + _)
    sum = sum + x

  5. Uniform workload
    (0 until 10000000).reduce(_ + _)
    sum = sum + x
    [Figure: workload plot, axes N and cycles]

  6. Baseline workload
    for (i <- 0 until 10000000) {}
    [Figure: workload plot, axes N and cycles]

  7. Irregular workload

  8. Irregular workload
    [Figure: workload plot, axes N and cycles]

  9. Irregular workload
    for {
    x <- 0 until width
    y <- 0 until height
    } image(x, y) = compute(x, y)
    [Figure: workload plot, axes N and cycles]

  10. Irregular workload
    for {
    x <- 0 until width
    y <- 0 until height
    } image(x, y) = compute(x, y)
    [Figure: workload plot, axes N and cycles]

  11. Workload function
    workload(n) – work spent on element n
    after the data-parallel operation has completed

  12. Workload function
    workload(n) – work spent on element n
    after the data-parallel operation has completed
    Could be… runtime-value dependent
    for {
      x <- 0 until width
      y <- 0 until height
    } img(x, y) = compute(x, y)
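
A runtime-value-dependent workload can be illustrated with a hypothetical stand-in for `compute`. The sketch below assumes a Mandelbrot-style escape-time iteration, which the slides do not name; the function and its parameters are illustrative only. The returned iteration count is the work spent on one element, so it plays the role of workload(n).

```scala
object WorkloadSketch {
  // Hypothetical stand-in for compute(x, y): escape-time iteration.
  // Returns the number of iterations performed, i.e. the work spent on this element.
  def compute(cx: Double, cy: Double, maxIter: Int): Int = {
    var x = 0.0
    var y = 0.0
    var i = 0
    while (i < maxIter && x * x + y * y <= 4.0) {
      val xn = x * x - y * y + cx
      y = 2.0 * x * y + cy
      x = xn
      i += 1
    }
    i
  }
}
```

Under this assumption, neighbouring elements can differ in cost by orders of magnitude, and the cost is only known after the element has been computed.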

  13. Workload function
    workload(n) – work spent on element n
    after the data-parallel operation has completed
    Could be… execution-schedule dependent
    for (n <- nodes)
      n.neighbours += new Node

  14. Workload function
    workload(n) – work spent on element n
    after the data-parallel operation has completed
    Could be… totally random
    for ((x, y) <- img.indices)
      img(x, y) = sample(x + random(), y + random())

  15. Data-parallel scheduler
    Assign loop elements to workers
    without knowledge about the workload function.

  16. Data-parallel scheduler
    Assign loop elements to workers
    without knowledge about the workload function.
    1. Linear speedup for the baseline workload

  17. Data-parallel scheduler
    Assign loop elements to workers
    without knowledge about the workload function.
    1. Linear speedup for the baseline workload
    2. Optimal speedup for irregular workloads

  18. Static batching
    Decides on the worker-element assignment before
    the data-parallel operation begins.
    [Figure: workload plot, axes N and cycles]

  19. Static batching
    Decides on the worker-element assignment before
    the data-parallel operation begins.
    No knowledge → divide uniformly.
    Not optimal for even mildly irregular workloads.
    [Figure: workload plot, axes N and cycles]
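
Static batching with no workload knowledge can be sketched as a uniform split of the index range, computed before any element is processed. This is a minimal illustration; `partition` and its parameters are hypothetical names, not part of ScalaBlitz.

```scala
object StaticBatching {
  // Splits [0, n) into `workers` contiguous ranges whose sizes differ by at most one.
  // The assignment is fixed up front and never revisited during execution.
  def partition(n: Int, workers: Int): Vector[Range] =
    Vector.tabulate(workers) { w =>
      val from = (w.toLong * n / workers).toInt
      val end = ((w + 1).toLong * n / workers).toInt
      from until end
    }
}
```

If one range happens to contain the expensive elements, its worker finishes long after the others, which is why the slide calls this scheme not optimal for irregular workloads.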

  20. Fixed-size batching
    Workload-driven – decides during execution.
    [Figure: workload plot, axes N and cycles, with a progress marker]

  21. Fixed-size batching
    Workload-driven – decides during execution.
    [Figure: workload plot, axes N and cycles; progress = 0]

  22. Fixed-size batching
    Workload-driven – decides during execution.
    [Figure: workload plot, axes N and cycles; T0: CAS moves progress to 2; batch labeled T0]

  23. Fixed-size batching
    Workload-driven – decides during execution.
    [Figure: workload plot, axes N and cycles; T1: CAS moves progress to 4; batches labeled T0, T1]

  24. Fixed-size batching
    Workload-driven – decides during execution.
    [Figure: workload plot, axes N and cycles; T0: CAS moves progress to 6; batches labeled T0, T1]

  25. Fixed-size batching
    Workload-driven – decides during execution.
    [Figure: workload plot, axes N and cycles; T0: CAS moves progress to 8; batches labeled T0, T1]

  26. Fixed-size batching
    Workload-driven – decides during execution.
    [Figure: workload plot, axes N and cycles; T0: CAS moves progress to 10; batches labeled T0, T1]

  27. Fixed-size batching
    Workload-driven – decides during execution.
    [Figure: workload plot, axes N and cycles; T0: CAS moves progress to 12; batches labeled T0, T1]

  28. Fixed-size batching
    Workload-driven – decides during execution.
    [Figure: workload plot, axes N and cycles, with a progress marker]
    Pros: lightweight
    Cons: minimum batch size, contention
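
The CAS steps in the fixed-size batching figures can be sketched with a shared counter. The code below is a minimal illustration using plain threads and `java.util.concurrent.atomic.AtomicInteger`; `parallelFor`, `workers` and `batchSize` are hypothetical names, and this is not the ScalaBlitz implementation.

```scala
import java.util.concurrent.atomic.AtomicInteger

object FixedSizeBatching {
  // Runs body(i) for every i in [0, n) on `workers` threads.
  // Workers claim batches of `batchSize` elements by CAS on a shared progress counter.
  def parallelFor(n: Int, workers: Int, batchSize: Int)(body: Int => Unit): Unit = {
    val progress = new AtomicInteger(0)
    val threads = (0 until workers).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          var done = false
          while (!done) {
            val from = progress.get
            if (from >= n) done = true
            else {
              val end = math.min(from + batchSize, n)
              // Only one worker wins this CAS and owns [from, end); the others retry.
              if (progress.compareAndSet(from, end)) {
                var i = from
                while (i < end) { body(i); i += 1 }
              }
            }
          }
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}
```

Every batch costs one successful CAS on the same memory location. Small batches therefore mean more contention, and large batches mean coarser load balancing, which matches the cons listed on the slide.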

  29. Fixed-size batching - contention

  30. Factoring, GSS, TS
    Batch size varies.
    [Figure: workload plot, axes N and cycles, with a progress marker]
    Pros: lightweight
    Cons: contention
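
As one example of a varying batch size, the sketch below assumes the usual guided self-scheduling (GSS) rule, where each batch takes the remaining element count divided by the number of workers, rounded up. Factoring and TS use different rules and are not shown. `batchSizes` is a hypothetical helper that only lists the sizes; a real scheduler would still claim each batch through a shared counter.

```scala
import scala.annotation.tailrec

object GuidedSelfScheduling {
  // Batch sizes under the GSS rule: each batch takes ceil(remaining / workers) elements.
  def batchSizes(n: Int, workers: Int): List[Int] = {
    @tailrec
    def loop(remaining: Int, acc: List[Int]): List[Int] =
      if (remaining <= 0) acc.reverse
      else {
        val size = (remaining + workers - 1) / workers
        loop(remaining - size, size :: acc)
      }
    loop(n, Nil)
  }
}
```

For example, `GuidedSelfScheduling.batchSizes(16, 2)` yields `List(8, 4, 2, 1, 1)`: large batches first, small ones near the end where load balancing matters most.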

  31. Task-based work-stealing
    [Figure: workload plot, axes N and cycles, split into tasks 0..2, 2..4, 4..8, 8..16]

  32. Task-based work-stealing
    [Figure: workload plot, axes N and cycles, split into tasks 0..2, 2..4, 4..8, 8..16]
    [Figure: workers T0 and T1 with tasks 2..4, 4..8, 8..16 and 0..2]

  33. Task-based work-stealing
    [Figure: workload plot, axes N and cycles, split into tasks 0..2, 2..4, 4..8, 8..16]
    [Figure: workers T0 and T1 with tasks 2..4, 4..8, 8..16 and 0..2]
    steal – a rare event

  34. Task-based work-stealing
    [Figure: workload plot, axes N and cycles, split into tasks 0..2, 2..4, 4..8, 8..16]
    [Figure: workers T0 and T1 with tasks 2..4, 4..8, 8..16, 0..2 and new tasks 8..10, 10..12, 12..16]

  35. Task-based work-stealing
    Pros: can be adaptive - uses stealing information
    Cons: heavyweight - minimum batch size much larger
    [Figure: workload plot, axes N and cycles, split into tasks 0..2, 2..4, 4..8, 8..16]
    [Figure: workers T0 and T1 with tasks 2..4, 4..8, 8..16, 0..2 and new tasks 8..10, 10..12, 12..16]

  36. Task-based work-stealing
    [Figure: workload plot, axes N and cycles, split into tasks 0..2, 2..4, 4..8, 8..16]
    Cannot be stolen after T0 starts processing it
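
The task ranges in the figures above are consistent with recursive halving, where a worker keeps the left half and pushes the right half as a stealable task. The sketch below assumes that scheme and a threshold of 2; both are inferred from the ranges shown, not stated on the slides, and `split` is a hypothetical helper.

```scala
object TaskSplitting {
  // Repeatedly halves [from, until), pushing the right half as a stealable task,
  // until the part kept by the worker is no larger than `threshold`.
  // Returns the pushed tasks (oldest first) and the range the worker processes itself.
  def split(from: Int, until: Int, threshold: Int): (List[(Int, Int)], (Int, Int)) = {
    var pushed = List.empty[(Int, Int)]
    var end = until
    while (end - from > threshold) {
      val mid = from + (end - from) / 2
      pushed = (mid, end) :: pushed // prepend, so the newest task is first
      end = mid
    }
    (pushed.reverse, (from, end))
  }
}
```

Splitting 0..16 leaves tasks 8..16, 4..8 and 2..4 available and keeps 0..2. A worker that steals the oldest task, 8..16, splits it again into 12..16, 10..12 and 8..10. Once a worker starts processing the range it kept, that range is no longer visible to other workers, which is the limitation noted on the last slide of this group.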

  37. Work-stealing tree
    [Figure: node (0, 0, T0, N), state: owned]

  38. Work-stealing tree
    [Figure: node state diagram.
     owned (0, 0, T0, N) → T0: CAS → owned (0, 50, T0, N)]

  39. Work-stealing tree
    [Figure: node state diagram.
     owned (0, 0, T0, N) → T0: CAS → owned (0, 50, T0, N) → T0: CAS → completed (0, N, T0, N)]
    What about stealing?

  40. Work-stealing tree
    [Figure: node state diagram.
     owned (0, 0, T0, N) → T0: CAS → owned (0, 50, T0, N) → T0: CAS → completed (0, N, T0, N);
     owned (0, 50, T0, N) → T1: CAS → stolen (0, -51, T0, N)]

  41. Work-stealing tree
    [Figure: node state diagram.
     owned (0, 0, T0, N) → T0: CAS → owned (0, 50, T0, N) → T0: CAS → completed (0, N, T0, N);
     owned (0, 50, T0, N) → T1: CAS → stolen (0, -51, T0, N)]

  42. Work-stealing tree
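    [Figure: node state diagram.
     owned (0, 0, T0, N) → T0: CAS → owned (0, 50, T0, N) → T0: CAS → completed (0, N, T0, N);
     owned (0, 50, T0, N) → T1: CAS → stolen (0, -51, T0, N);
     stolen (0, -51, T0, N) → expanded (0, -51, T0, N) with children (50, 50, T0, M) and (M, M, T1, N)]
    M = (50 + N) / 2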

  43. Work-stealing tree
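    [Figure: node state diagram.
     owned (0, 0, T0, N) → T0: CAS → owned (0, 50, T0, N) → T0: CAS → completed (0, N, T0, N);
     owned (0, 50, T0, N) → T1: CAS → stolen (0, -51, T0, N);
     stolen (0, -51, T0, N) → T0 or T1: CAS → expanded (0, -51, T0, N) with children (50, 50, T0, M) and (M, M, T1, N)]
    M = (50 + N) / 2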

  44. Work-stealing tree
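    [Figure: node state diagram.
     owned (0, 0, T0, N) → T0: CAS → owned (0, 50, T0, N) → T0: CAS → completed (0, N, T0, N);
     owned (0, 50, T0, N) → T1: CAS → stolen (0, -51, T0, N);
     stolen (0, -51, T0, N) → T0 or T1: CAS → expanded (0, -51, T0, N) with children (50, 50, T0, M) and (M, M, T1, N)]
    M = (50 + N) / 2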
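
The node transitions in the figures (owned, completed, stolen, expanded) can be sketched as three CAS-based operations. The code below is a minimal illustration under stated assumptions: the `Node` layout, the encoding of a steal at position p as -p - 1 (matching 50 becoming -51 in the figure), and the explicit `stealer` argument to `expand` are all simplifications chosen here, not the ScalaBlitz source.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}
import scala.annotation.tailrec

object WorkStealingNode {
  // Hypothetical node layout: progress >= 0 means owned (completed once it reaches `until`);
  // a negative progress -p - 1 records that the node was stolen at position p.
  final class Node(val start: Int, val until: Int, val owner: String) {
    val progress = new AtomicInteger(start)
    val children = new AtomicReference[Option[(Node, Node)]](None)
  }

  // Owner: claims up to `step` elements with a CAS.
  // Returns the claimed range, or None if the node is completed or stolen.
  @tailrec
  def advance(node: Node, step: Int): Option[(Int, Int)] = {
    val p = node.progress.get
    if (p < 0 || p >= node.until) None
    else {
      val next = math.min(p + step, node.until)
      if (node.progress.compareAndSet(p, next)) Some((p, next))
      else advance(node, step)
    }
  }

  // Stealer: marks the node stolen with a CAS. Returns true if the node is now stolen.
  @tailrec
  def steal(node: Node): Boolean = {
    val p = node.progress.get
    if (p < 0) true                  // already stolen
    else if (p >= node.until) false  // completed, nothing left to steal
    else if (node.progress.compareAndSet(p, -p - 1)) true
    else steal(node)
  }

  // Owner or stealer: splits the remaining range [p, until) of a stolen node at
  // M = (p + until) / 2. The left child keeps the old owner, the right goes to the stealer.
  def expand(node: Node, stealer: String): Option[(Node, Node)] = {
    val encoded = node.progress.get
    if (encoded >= 0) None // not stolen, so it must not be expanded
    else {
      val p = -encoded - 1
      val m = (p + node.until) / 2
      val kids = (new Node(p, m, node.owner), new Node(m, node.until, stealer))
      node.children.compareAndSet(None, Some(kids)) // only the first expansion wins
      node.children.get
    }
  }
}
```

Unlike a task that has already been started, an owned node can still be stolen while its owner is working on it: the steal CAS makes the owner's next `advance` fail, and both workers continue in the children.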

  45. Work-stealing tree - contention

  46. Work-stealing tree scheduling
    1) find a non-expanded, non-completed node
    2) if not found, terminate
    3) if not owned, steal and/or expand, and descend
    4) advance until node is completed or stolen
    5) go to 1)

  47. Work-stealing tree scheduling
    1) find a non-expanded, non-completed node
    2) if not found, terminate
    3) if not owned, steal and/or expand, and descend
    4) advance until node is completed or stolen
    5) go to 1)

  48. Choosing the node to steal
    Find first, in-order traversal
    [Figure: tree with node labels 2, 9, 5, 3]

  49. Choosing the node to steal
    Find first, in-order traversal: catastrophic – a lot of stealing, huge trees
    [Figure: tree with node labels 2, 9, 5, 3]

  50. Choosing the node to steal
    Find first, in-order traversal: catastrophic – a lot of stealing, huge trees
    Find first, random order traversal
    [Figure: one tree per strategy, node labels 2, 9, 5, 3]

  51. Choosing the node to steal
    Find first, in-order traversal: catastrophic – a lot of stealing, huge trees
    Find first, random order traversal: works reasonably well.
    [Figure: one tree per strategy, node labels 2, 9, 5, 3]

  52. Choosing the node to steal
    Find first, in-order traversal: catastrophic – a lot of stealing, huge trees
    Find first, random order traversal: works reasonably well.
    Find most elements: generates the fewest nodes. Seems to be best.
    [Figure: one tree per strategy, node labels 2, 9, 5, 3]
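
Two of the strategies above can be contrasted on a small immutable tree. The sketch below is illustrative only: the `Tree` type, the leaf names, and the tree shape are hypothetical, and the counts 2, 9, 5, 3 are borrowed from the figure labels without knowing the figure's actual structure.

```scala
object StealTarget {
  sealed trait Tree
  final case class Leaf(name: String, remaining: Int) extends Tree
  final case class Inner(left: Tree, right: Tree) extends Tree

  // "Find first, in-order traversal": the leftmost leaf that still has work.
  def firstInOrder(t: Tree): Option[Leaf] = t match {
    case l: Leaf     => if (l.remaining > 0) Some(l) else None
    case Inner(l, r) => firstInOrder(l).orElse(firstInOrder(r))
  }

  // "Find most elements": the leaf with the most remaining elements, if any has work left.
  def mostElements(t: Tree): Option[Leaf] = t match {
    case l: Leaf => if (l.remaining > 0) Some(l) else None
    case Inner(l, r) =>
      (mostElements(l), mostElements(r)) match {
        case (Some(a), Some(b)) => Some(if (a.remaining >= b.remaining) a else b)
        case (a, b)             => a.orElse(b)
      }
  }
}
```

On a tree with leaves holding 2, 9, 5 and 3 elements, `firstInOrder` picks the leaf with 2 elements, while `mostElements` picks the leaf with 9. Stealing the larger node leaves more work per steal, which fits the slide's observation that this strategy generates the fewest nodes.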

  53. Comparison with fixed-size batching

  54. Comparison with fixed-size batching

  55. Comparison with task work-stealing

  56. Thank you!
    Questions?

  57. Finding work

  58. Other workloads