Aleksandar Prokopec
September 25, 2013

# Work-stealing Tree Scheduler

Explanation of the work-stealing tree scheduler used in ScalaBlitz.

## Transcript

2. Near Optimal Work-Stealing Tree for Irregular Data-Parallel
Aleksandar Prokopec, Martin Odersky

(0 until 10000000) reduce (_ + _)
sum = sum + x
[Figure: workload plot; axes: N, cycles]

for (i <- 0 until 10000000) {}
[Figure: workload plot; axes: N, cycles]

[Figure: workload plot; axes: N, cycles]

for {
  x <- 0 until width
  y <- 0 until height
} image(x, y) = compute(x, y)
[Figure: workload plot for image(x, y) = compute(x, y); axes: N, cycles]

workload(n): the work spent on element n, measured after the data-parallel operation has completed

workload(n) could be runtime-value dependent:
for {
  x <- 0 until width
  y <- 0 until height
} img(x, y) = compute(x, y)

workload(n) could be execution-schedule dependent:
for (n <- nodes)
  n.neighbours += new Node

workload(n) could be totally random:
for ((x, y) <- img.indices)
  img(x, y) = sample(
    x + random(),
    y + random()
  )

17. Data-parallel scheduler
Assign loop elements to workers
1. Linear speedup for the baseline workload
2. Optimal speedup for irregular workloads

19. Static batching
Decides on the worker-element assignment before the data-parallel operation begins.
No knowledge → divide uniformly.
Not optimal for even mildly irregular workloads.
[Figure: workload plot; axes: N, cycles]
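
To make the weakness of static batching concrete, here is a minimal sketch in Scala. It is not ScalaBlitz code: the names `StaticBatching`, `partition` and `makespan` are hypothetical, and the cost model (completion time equals the load of the busiest worker, with at least one worker) is an assumption made for illustration.

```scala
object StaticBatching {
  // Split [0, n) into `workers` contiguous, nearly equal ranges, fixed up front.
  // Assumes workers >= 1.
  def partition(n: Int, workers: Int): Seq[Range] =
    (0 until workers).map { w =>
      val start = (w.toLong * n / workers).toInt
      val end = ((w + 1).toLong * n / workers).toInt
      start until end
    }

  // Assumed cost model: the operation finishes when the most loaded worker does.
  def makespan(n: Int, workers: Int, workload: Int => Long): Long =
    partition(n, workers).map(r => r.iterator.map(workload).sum).max
}
```

With 16 elements and 4 workers, a uniform workload of 1 unit per element gives a makespan of 4. If the last four elements cost 10 units each, the last worker needs 40 units, although the total work of 52 units would ideally take 13 units per worker.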

20. Fixed-size batching
[Figure: a shared progress counter over the workload plot (axes: N, cycles). Starting at 0, workers T0 and T1 each claim a batch of 2 elements by advancing the counter with a CAS: 2 (T0), 4 (T1), 6 (T0), 8 (T0), 10 (T0), 12 (T0).]

28. Fixed-size batching
Pros: lightweight
Cons: minimum batch size, contention
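
The CAS-on-a-shared-counter scheme from the fixed-size batching slides can be sketched as follows. This is an illustrative sketch, not the scheduler's actual code: `FixedSizeBatching`, `claim` and `parallelSum` are hypothetical names, and it assumes `n + batch` does not overflow an `Int`.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}

object FixedSizeBatching {
  // Claim the next batch of at most `batch` elements of [0, n) by advancing
  // the shared progress counter with a CAS. Returns None once nothing is left.
  @annotation.tailrec
  def claim(progress: AtomicInteger, n: Int, batch: Int): Option[Range] = {
    val p = progress.get
    if (p >= n) None
    else {
      val next = math.min(p + batch, n)
      if (progress.compareAndSet(p, next)) Some(p until next)
      else claim(progress, n, batch) // another worker won the race; retry
    }
  }

  // Sum 0 until n with `workers` threads that all contend on one counter.
  def parallelSum(n: Int, batch: Int, workers: Int): Long = {
    val progress = new AtomicInteger(0)
    val total = new AtomicLong(0L)
    val threads = (0 until workers).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          var local = 0L
          var r = claim(progress, n, batch)
          while (r.nonEmpty) {
            r.get.foreach(i => local += i)
            r = claim(progress, n, batch)
          }
          total.addAndGet(local)
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    total.get
  }
}
```

Every batch costs one successful CAS on the same memory location, which is why small batches lead to contention and why the batch size cannot be made arbitrarily small.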

29. Fixed-size batching - contention

30. Factoring, GSS (guided self-scheduling), TS (trapezoid self-scheduling)
Batch size varies.
Pros: lightweight
Cons: contention
[Figure: progress counter over the workload plot; axes: N, cycles]
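
The slide only names these techniques. As one illustration of a varying batch size, the sketch below uses the usual guided self-scheduling rule, where each claim takes the remaining elements divided by the number of workers, rounded up. The object and method names are hypothetical, and factoring and trapezoid self-scheduling use different rules that are not shown.

```scala
object GuidedSelfScheduling {
  // Successive batch sizes when each claim takes ceil(remaining / workers).
  // Assumes workers >= 1.
  def batchSizes(n: Int, workers: Int): List[Int] = {
    @annotation.tailrec
    def loop(remaining: Int, acc: List[Int]): List[Int] =
      if (remaining <= 0) acc.reverse
      else {
        val b = (remaining + workers - 1) / workers // ceiling division
        loop(remaining - b, b :: acc)
      }
    loop(n, Nil)
  }
}
```

For 16 elements and 2 workers this yields batches of 8, 4, 2, 1 and 1 elements: large batches early keep synchronization cheap, small batches late even out the finish. All batches are still claimed through one shared counter, which matches the contention listed under Cons.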

[Figure: the range 0..16 over the workload plot (axes: N, cycles), split into the batches 0..2, 2..4, 4..8 and 8..16. T0 holds 0..2, 2..4, 4..8 and 8..16; after a steal, T1 holds 8..10, 10..12 and 12..16.]
steal – a rare event
Pros: can be adaptive - uses stealing information
Cons: heavyweight - minimum batch size much larger
A batch cannot be stolen after T0 starts processing it.

37. Work-stealing tree
[Figure: node state diagram. Each node is drawn with four fields, read here as start, progress, owner and end.]
owned: start 0, progress 0, owner T0, end N.
owned (after T0: CAS): start 0, progress 50, owner T0, end N.
completed (after T0: CAS): start 0, progress N, owner T0, end N.
stolen (after T1: CAS on the node at progress 50): start 0, progress -51, owner T0, end N.
expanded (after T0 or T1: CAS on the stolen node): two child nodes, one with start 50, progress 50, owner T0, end M, and one with start M, progress M, owner T1, end N.
M = (50 + N) / 2
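
The node states on these slides can be sketched as a small Scala class. This is a reading of the diagram, not the ScalaBlitz implementation: the class `Node`, its method names, the `String` owner and the `Option` of children are hypothetical, and the encoding of a stolen node as `-progress - 1` is inferred from the 50 to -51 transition shown on the slides.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}

// Hypothetical node layout: range [start, end), an owner name, a progress field.
final class Node(val start: Int, val end: Int, val owner: String) {
  // progress >= 0: index of the next unprocessed element.
  // progress <  0: node was stolen when progress was -progress - 1 (50 -> -51).
  val progress = new AtomicInteger(start)
  val children = new AtomicReference[Option[(Node, Node)]](None)

  // Owner: claim up to `step` elements, or None if completed or stolen.
  @annotation.tailrec
  final def advance(step: Int): Option[Range] = {
    val p = progress.get
    if (p < 0 || p >= end) None
    else {
      val next = math.min(p + step, end)
      if (progress.compareAndSet(p, next)) Some(p until next) else advance(step)
    }
  }

  // Thief: mark the node stolen. Returns false if it was already completed.
  @annotation.tailrec
  final def steal(): Boolean = {
    val p = progress.get
    if (p < 0) true
    else if (p >= end) false
    else if (progress.compareAndSet(p, -p - 1)) true
    else steal()
  }

  // Owner or thief: split the unprocessed rest of a stolen node between them.
  // Only the first CAS installs children; later callers see the same pair.
  def expand(thief: String): Option[(Node, Node)] = {
    val p = progress.get
    if (p < 0) {
      val reached = -p - 1
      val mid = (reached + end) / 2 // M = (50 + N) / 2 on the slides
      val kids = (new Node(reached, mid, owner), new Node(mid, end, thief))
      children.compareAndSet(None, Some(kids))
    }
    children.get
  }
}
```

With N = 100, a node advanced to 50 and then stolen has progress -51, and expanding it gives the owner the range 50 until 75 and the thief the range 75 until 100. Unlike a queued batch, the node can still be stolen after its owner has started processing it, because owner and thief synchronize on the same progress field.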

45. Work-stealing tree - contention

46. Work-stealing tree scheduling
1) find either a non-expanded, non-completed node
3) if not owned, steal and/or expand, and descend
4) advance until node is completed or stolen
5) go to 1)

52. Choosing the node to steal
Find first, in-order traversal: catastrophic, a lot of stealing and huge trees.
Find first, random order traversal: works reasonably well.
Find most elements: generates the fewest nodes, seems to be best.
[Figure: the same example tree, with nodes labeled 2, 9, 5 and 3, shown once per strategy]
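
The three strategies can be compared on a toy tree. The sketch below is a simplification made for illustration: only leaves carry a count of remaining elements, the `Tree`, `Leaf`, `Inner` and `StealPolicy` names are hypothetical, and the shape of the example tree in the usage note is assumed, since the slide transcript preserves only the labels 2, 9, 5 and 3.

```scala
import scala.util.Random

// Hypothetical simplified tree: only leaves carry a count of remaining elements.
sealed trait Tree
final case class Leaf(remaining: Int) extends Tree
final case class Inner(left: Tree, right: Tree) extends Tree

object StealPolicy {
  private def nonEmpty(l: Leaf): Option[Leaf] =
    if (l.remaining > 0) Some(l) else None

  // First leaf with work left, always visiting the left subtree first.
  def findFirstInOrder(t: Tree): Option[Leaf] = t match {
    case l: Leaf     => nonEmpty(l)
    case Inner(a, b) => findFirstInOrder(a).orElse(findFirstInOrder(b))
  }

  // First leaf with work left, visiting the two subtrees in random order.
  def findFirstRandom(t: Tree, rnd: Random): Option[Leaf] = t match {
    case l: Leaf => nonEmpty(l)
    case Inner(a, b) =>
      val (x, y) = if (rnd.nextBoolean()) (a, b) else (b, a)
      findFirstRandom(x, rnd).orElse(findFirstRandom(y, rnd))
  }

  // Leaf with the most remaining elements in the whole tree.
  def findMostElements(t: Tree): Option[Leaf] = t match {
    case l: Leaf => nonEmpty(l)
    case Inner(a, b) =>
      (findMostElements(a).toList ++ findMostElements(b).toList)
        .reduceOption((p, q) => if (q.remaining > p.remaining) q else p)
  }
}
```

On `Inner(Inner(Leaf(2), Leaf(9)), Inner(Leaf(5), Leaf(3)))`, the in-order strategy always picks the leaf with 2 elements, so every thief lands on the same small node, while the find-most strategy picks the leaf with 9 elements.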

53. Comparison with fixed-size batching