Slide 1

Slide 1 text

Near Optimal Work-Stealing Tree for Highly Irregular Data-Parallel Workloads
Aleksandar Prokopec, Martin Odersky

Slide 2

Slide 2 text

Near Optimal Work-Stealing Tree for Highly Irregular Data-Parallel Workloads
Aleksandar Prokopec, Martin Odersky
Irregular Data-Parallel

Slide 3

Slide 3 text

Uniform workload
(0 until 10000000) reduce (_ + _)

Slide 4

Slide 4 text

Uniform workload
(0 until 10000000) reduce (_ + _)
sum = sum + x

Slide 5

Slide 5 text

Uniform workload
(0 until 10000000) reduce (_ + _)
sum = sum + x … N cycles

Slide 6

Slide 6 text

Baseline workload
for (i <- 0 until 10000000) {} … N cycles

Slide 7

Slide 7 text

Irregular workload

Slide 8

Slide 8 text

Irregular workload
N cycles

Slide 9

Slide 9 text

Irregular workload
for {
  x <- 0 until width
  y <- 0 until height
} image(x, y) = compute(x, y)
N cycles

Slide 10

Slide 10 text

Irregular workload
for {
  x <- 0 until width
  y <- 0 until height
} image(x, y) = compute(x, y)
N cycles
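To make the irregularity concrete, here is a minimal sketch in which the per-pixel cost depends on the pixel's coordinates. The Mandelbrot-style kernel, the extra `width`, `height` and `maxIter` parameters, and the `Array[Array[Int]]` image are assumptions for illustration only; the slides do not say what `compute` does.

```scala
object IrregularWorkload {
  // Hypothetical kernel: the returned iteration count is also the amount of work done.
  def compute(x: Int, y: Int, width: Int, height: Int, maxIter: Int): Int = {
    val cr = -2.0 + 3.0 * x / width
    val ci = -1.0 + 2.0 * y / height
    var zr = 0.0
    var zi = 0.0
    var i = 0
    while (i < maxIter && zr * zr + zi * zi <= 4.0) {
      val t = zr * zr - zi * zi + cr
      zi = 2.0 * zr * zi + ci
      zr = t
      i += 1
    }
    i
  }

  // Same loop shape as on the slide, with the image stored as a 2D array.
  def render(width: Int, height: Int, maxIter: Int): Array[Array[Int]] = {
    val image = Array.ofDim[Int](width, height)
    for {
      x <- 0 until width
      y <- 0 until height
    } image(x)(y) = compute(x, y, width, height, maxIter)
    image
  }
}
```

Pixels near the corner of the image finish after one iteration, while pixels inside the set run for all `maxIter` iterations, so neighbouring loop elements can differ in cost by orders of magnitude.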

Slide 11

Slide 11 text

Workload function
workload(n): work spent on element n after the data-parallel operation completed

Slide 12

Slide 12 text

Workload function
workload(n): work spent on element n after the data-parallel operation completed
Could be… runtime-value dependent:
for {
  x <- 0 until width
  y <- 0 until height
} img(x, y) = compute(x, y)

Slide 13

Slide 13 text

Workload function
workload(n): work spent on element n after the data-parallel operation completed
Could be… execution-schedule dependent:
for (n <- nodes) n.neighbours += new Node

Slide 14

Slide 14 text

Workload function
workload(n): work spent on element n after the data-parallel operation completed
Could be… totally random:
for ((x, y) <- img.indices)
  img(x, y) = sample(x + random(), y + random())

Slide 15

Slide 15 text

Data-parallel scheduler
Assign loop elements to workers without knowledge about the workload function.

Slide 16

Slide 16 text

Data-parallel scheduler
Assign loop elements to workers without knowledge about the workload function.
1. Linear speedup for the baseline workload

Slide 17

Slide 17 text

Data-parallel scheduler
Assign loop elements to workers without knowledge about the workload function.
1. Linear speedup for the baseline workload
2. Optimal speedup for irregular workloads

Slide 18

Slide 18 text

Static batching
Decides on the worker-element assignment before the data-parallel operation begins.
N cycles

Slide 19

Slide 19 text

Static batching
Decides on the worker-element assignment before the data-parallel operation begins.
No knowledge → divide uniformly.
Not optimal for even mildly irregular workloads.
N cycles
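The uniform division described above can be sketched as follows. The names `batches` and `makespan` are made up for this example, and the workload function used in the usage note is invented to show the effect.

```scala
object StaticBatching {
  // Split [0, n) into `workers` contiguous batches whose sizes differ by at most one.
  def batches(n: Int, workers: Int): Vector[Range] =
    Vector.tabulate(workers) { w =>
      val from = (w.toLong * n / workers).toInt
      val end = ((w + 1).toLong * n / workers).toInt
      from until end
    }

  // Completion time of a schedule: the total work of the most loaded worker.
  def makespan(workload: Int => Long, bs: Seq[Range]): Long =
    bs.map(r => r.map(workload).sum).max
}
```

With 10 elements, 4 workers and a workload of 50 cycles for elements 0 and 1 and 1 cycle for the rest, the first batch costs 100 cycles while the other three cost 3, 2 and 3, so three workers sit idle for almost the entire run.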

Slide 20

Slide 20 text

Fixed-size batching
Workload-driven: decides during execution.
[Diagram labels: N cycles; progress]

Slide 21

Slide 21 text

Fixed-size batching
Workload-driven: decides during execution.
[Diagram labels: N cycles; progress = 0]

Slide 22

Slide 22 text

Fixed-size batching
Workload-driven: decides during execution.
[Diagram labels: N cycles; progress = 2; T0: CAS; T0]

Slide 23

Slide 23 text

Fixed-size batching
Workload-driven: decides during execution.
[Diagram labels: N cycles; progress = 4; T1: CAS; T0, T1]

Slide 24

Slide 24 text

Fixed-size batching
Workload-driven: decides during execution.
[Diagram labels: N cycles; progress = 6; T0: CAS; T0, T1]

Slide 25

Slide 25 text

Fixed-size batching
Workload-driven: decides during execution.
[Diagram labels: N cycles; progress = 8; T0: CAS; T0, T1]

Slide 26

Slide 26 text

Fixed-size batching
Workload-driven: decides during execution.
[Diagram labels: N cycles; progress = 10; T0: CAS; T0, T1]

Slide 27

Slide 27 text

Fixed-size batching
Workload-driven: decides during execution.
[Diagram labels: N cycles; progress = 12; T0: CAS; T0, T1]

Slide 28

Slide 28 text

Fixed-size batching
Workload-driven: decides during execution.
[Diagram labels: N cycles; progress]
Pros: lightweight
Cons: minimum batch size, contention
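The CAS-on-a-shared-counter scheme from the preceding slides can be sketched like this. It is a minimal illustration, not the authors' implementation: the `parallelSum` name, the summing workload and the use of plain threads are assumptions made for the example.

```scala
import java.util.concurrent.atomic.AtomicInteger

object FixedSizeBatching {
  // Sums f(i) for i in [0, n); workers claim batches of `batchSize` elements via CAS.
  def parallelSum(n: Int, batchSize: Int, workers: Int)(f: Int => Long): Long = {
    val progress = new AtomicInteger(0)      // shared counter: first unclaimed element
    val partial = new Array[Long](workers)   // one result slot per worker
    val threads = (0 until workers).map { w =>
      new Thread(() => {
        var done = false
        while (!done) {
          val start = progress.get
          if (start >= n) done = true
          else {
            val end = math.min(start + batchSize, n)
            // Only one worker wins the CAS for a given batch; losers re-read and retry.
            if (progress.compareAndSet(start, end)) {
              var i = start
              var acc = 0L
              while (i < end) { acc += f(i); i += 1 }
              partial(w) += acc
            }
          }
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    partial.sum
  }
}
```

Every batch costs one successful CAS on the same memory location, which is where both listed drawbacks come from: a small `batchSize` increases contention on `progress`, while a large one limits how finely irregular work can be balanced.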

Slide 29

Slide 29 text

Fixed-size batching - contention

Slide 30

Slide 30 text

Factoring, GSS, TS
Batch size varies.
[Diagram labels: N cycles; progress]
Pros: lightweight
Cons: contention
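As an illustration of a varying batch size, the sketch below lists the batch sizes produced by guided self-scheduling (GSS), where each batch takes the remaining elements divided by the number of workers, rounded up. Only GSS is shown; factoring and trapezoid self-scheduling use different rules that are not covered here. The function name is made up for the example.

```scala
object GuidedSelfScheduling {
  // Batch sizes handed out in order: each is ceil(remaining / workers).
  def batchSizes(n: Int, workers: Int): List[Int] = {
    @annotation.tailrec
    def loop(remaining: Int, acc: List[Int]): List[Int] =
      if (remaining <= 0) acc.reverse
      else {
        val size = (remaining + workers - 1) / workers
        loop(remaining - size, size :: acc)
      }
    loop(n, Nil)
  }
}
```

For 16 elements and 2 workers this gives batches of 8, 4, 2, 1 and 1 elements: large batches early keep the number of synchronization points low, small batches at the end even out the finish times.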

Slide 31

Slide 31 text

Task-based work-stealing
[Diagram labels: N cycles; tasks 0..2, 2..4, 4..8, 8..16]

Slide 32

Slide 32 text

Task-based work-stealing
[Diagram labels: N cycles; tasks 0..2, 2..4, 4..8, 8..16; queued 2..4, 4..8, 8..16; T0, T1; 0..2]

Slide 33

Slide 33 text

Task-based work-stealing
[Diagram labels: N cycles; tasks 0..2, 2..4, 4..8, 8..16; queued 2..4, 4..8, 8..16; T0, T1; 0..2]
steal: a rare event

Slide 34

Slide 34 text

Task-based work-stealing
[Diagram labels: N cycles; tasks 0..2, 2..4, 4..8, 8..16; queued 2..4, 4..8, 8..16; T0, T1; 0..2; 8..10, 10..12, 12..16]

Slide 35

Slide 35 text

Task-based work-stealing
[Diagram labels: N cycles; tasks 0..2, 2..4, 4..8, 8..16; queued 2..4, 4..8, 8..16; T0, T1; 0..2; 8..10, 10..12, 12..16]
Pros: can be adaptive - uses stealing information
Cons: heavyweight - minimum batch size much larger

Slide 36

Slide 36 text

Task-based work-stealing
[Diagram labels: N cycles; tasks 0..2, 2..4, 4..8, 8..16]
Cannot be stolen after T0 starts processing it
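The task ranges on these slides (0..2, 2..4, 4..8, 8..16) are what repeated halving of 0..16 produces when the worker keeps the left half and makes the right half available for stealing. A minimal sketch of that splitting rule, assuming a hypothetical `threshold` below which a range is no longer split:

```scala
object TaskSplitting {
  // Returns the task the worker processes first, followed by the stealable tasks
  // in the order the worker itself would reach them.
  def split(from: Int, until: Int, threshold: Int): List[(Int, Int)] = {
    @annotation.tailrec
    def loop(lo: Int, hi: Int, pushed: List[(Int, Int)]): List[(Int, Int)] =
      if (hi - lo <= threshold) (lo, hi) :: pushed
      else {
        val mid = lo + (hi - lo) / 2
        loop(lo, mid, (mid, hi) :: pushed)   // keep the left half, publish the right half
      }
    loop(from, until, Nil)
  }
}
```

Once a worker starts on a leaf task such as 0..2, that task is an indivisible unit, which is why the minimum task size bounds how well irregular work can be balanced.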

Slide 37

Slide 37 text

Work-stealing tree
Node fields, read from the diagram as (start, progress, owner, end):
(0, 0, T0, N) owned

Slide 38

Slide 38 text

Work-stealing tree
(0, 0, T0, N) owned
(0, 50, T0, N) owned
CAS labels: T0: CAS

Slide 39

Slide 39 text

Work-stealing tree
(0, 0, T0, N) owned
(0, 50, T0, N) owned
… (0, N, T0, N) completed
CAS labels: T0: CAS, T0: CAS
What about stealing?

Slide 40

Slide 40 text

Work-stealing tree
(0, 0, T0, N) owned
(0, 50, T0, N) owned
… (0, N, T0, N) completed
(0, -51, T0, N) stolen
CAS labels: T0: CAS, T0: CAS, T1: CAS

Slide 41

Slide 41 text

Work-stealing tree
(0, 0, T0, N) owned
(0, 50, T0, N) owned
… (0, N, T0, N) completed
(0, -51, T0, N) stolen
CAS labels: T0: CAS, T0: CAS, T1: CAS

Slide 42

Slide 42 text

Work-stealing tree
(0, 0, T0, N) owned
(0, 50, T0, N) owned
… (0, N, T0, N) completed
(0, -51, T0, N) stolen
(0, -51, T0, N) expanded, with children (50, 50, T0, M) and (M, M, T1, N)
M = (50 + N) / 2
CAS labels: T0: CAS, T0: CAS

Slide 43

Slide 43 text

Work-stealing tree
(0, 0, T0, N) owned
(0, 50, T0, N) owned
… (0, N, T0, N) completed
(0, -51, T0, N) stolen
(0, -51, T0, N) expanded, with children (50, 50, T0, M) and (M, M, T1, N)
M = (50 + N) / 2
CAS labels: T0: CAS, T0: CAS, T0 or T1: CAS

Slide 44

Slide 44 text

Work-stealing tree
(0, 0, T0, N) owned
(0, 50, T0, N) owned
… (0, N, T0, N) completed
(0, -51, T0, N) stolen
(0, -51, T0, N) expanded, with children (50, 50, T0, M) and (M, M, T1, N)
M = (50 + N) / 2
CAS labels: T0: CAS, T0: CAS, T0 or T1: CAS
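The node states on these slides can be sketched with a single atomic `progress` field. This is an illustrative reading of the diagrams, not the authors' code: the class layout, the method names and the `String` owner are assumptions. The encoding follows the slides, where stealing a node whose progress is 50 stores -51 (that is, `-p - 1`), and expansion splits the remainder at M = (50 + N) / 2.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}

object WorkStealingNode {
  final class Node(val start: Int, val until: Int, val owner: String, initial: Int) {
    val progress = new AtomicInteger(initial)
    val children = new AtomicReference[Option[(Node, Node)]](None)

    // Owner: claim the next `step` elements. None means completed or stolen.
    @annotation.tailrec
    def advance(step: Int): Option[(Int, Int)] = {
      val p = progress.get
      if (p < 0 || p >= until) None
      else {
        val np = math.min(p + step, until)
        if (progress.compareAndSet(p, np)) Some((p, np)) else advance(step)
      }
    }

    // Thief: mark the node stolen by replacing progress p with -p - 1.
    // Returns false if the node was already completed.
    @annotation.tailrec
    def steal(): Boolean = {
      val p = progress.get
      if (p < 0) true
      else if (p >= until) false
      else if (progress.compareAndSet(p, -p - 1)) true
      else steal()
    }

    // Owner or thief: split the unprocessed remainder of a stolen node in two.
    def expand(thief: String): Option[(Node, Node)] = {
      val p = progress.get
      if (p >= 0) None
      else {
        val reached = -p - 1
        val mid = (reached + until) / 2
        val left = new Node(reached, mid, owner, reached)
        val right = new Node(mid, until, thief, mid)
        children.compareAndSet(None, Some((left, right)))  // the first expansion wins
        children.get
      }
    }
  }
}
```

With N = 100, advancing the root to 50 and then stealing it leaves `progress` at -51, and expansion yields the children (50, 50, T0, 75) and (75, 75, T1, 100). Because the owner's advance and the thief's steal are CAS operations on the same field, exactly one of them succeeds for any given progress value, so no element is processed twice or skipped.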

Slide 45

Slide 45 text

Work-stealing tree - contention

Slide 46

Slide 46 text

Work-stealing tree scheduling
1) find a non-expanded, non-completed node
2) if not found, terminate
3) if not owned, steal and/or expand, and descend
4) advance until node is completed or stolen
5) go to 1)

Slide 47

Slide 47 text

Work-stealing tree scheduling
1) find a non-expanded, non-completed node
2) if not found, terminate
3) if not owned, steal and/or expand, and descend
4) advance until node is completed or stolen
5) go to 1)
Highlighted: 1) find a non-expanded, non-completed node
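The five steps above can be sketched as a worker loop. This is a simplified illustration under several assumptions, not the authors' scheduler: workers are identified by integers, a worker prefers its own unfinished nodes and otherwise takes the first unfinished leaf in in-order traversal, only non-owners expand a stolen node (the owner simply retries until the expansion appears), and the workload is a sum. All names are made up for the example.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong, AtomicReference}

object TreeScheduling {
  final class Node(val start: Int, val until: Int, val owner: Int) {
    val progress = new AtomicInteger(start)
    val children = new AtomicReference[Option[(Node, Node)]](None)
  }

  // Step 1: first non-expanded, non-completed node accepted by `ok`, in in-order traversal.
  def find(n: Node, ok: Node => Boolean): Option[Node] = n.children.get match {
    case Some((l, r)) => find(l, ok).orElse(find(r, ok))
    case None => if (n.progress.get < n.until && ok(n)) Some(n) else None
  }

  // Step 3: mark the node stolen (p becomes -p - 1), then split what is left.
  def stealAndExpand(n: Node, thief: Int): Unit = {
    var p = n.progress.get
    while (p >= 0 && p < n.until && !n.progress.compareAndSet(p, -p - 1)) p = n.progress.get
    p = n.progress.get
    if (p < 0) {
      val reached = -p - 1
      val mid = (reached + n.until) / 2
      val kids = (new Node(reached, mid, n.owner), new Node(mid, n.until, thief))
      n.children.compareAndSet(None, Some(kids))
    }
  }

  // Step 4: the owner claims batches until the node is completed or stolen.
  def advance(n: Node, step: Int, f: Int => Long, total: AtomicLong): Unit = {
    var p = n.progress.get
    while (p >= 0 && p < n.until) {
      val np = math.min(p + step, n.until)
      if (n.progress.compareAndSet(p, np)) {
        var i = p
        var acc = 0L
        while (i < np) { acc += f(i); i += 1 }
        total.addAndGet(acc)
      }
      p = n.progress.get
    }
  }

  def worker(root: Node, me: Int, step: Int, f: Int => Long, total: AtomicLong): Unit = {
    def pick(): Option[Node] = find(root, _.owner == me).orElse(find(root, _ => true))
    var next = pick()
    while (next.nonEmpty) {                      // step 2: terminate when nothing is found
      val node = next.get
      if (node.owner != me) stealAndExpand(node, me) else advance(node, step, f, total)
      next = pick()                              // step 5
    }
  }

  def parallelSum(n: Int, workers: Int, step: Int)(f: Int => Long): Long = {
    val root = new Node(0, n, 0)                 // worker 0 owns the root
    val total = new AtomicLong(0L)
    val ts = (0 until workers).map(w => new Thread(() => worker(root, w, step, f, total)))
    ts.foreach(_.start())
    ts.foreach(_.join())
    total.get
  }
}
```

A call such as `TreeScheduling.parallelSum(10000, 4, 16)(_.toLong)` sums the range with four workers. The in-order fallback is the simplest choice for a sketch; the deck's later slides compare it unfavourably with other ways of choosing the node to steal.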

Slide 48

Slide 48 text

Choosing the node to steal
Find first, in-order traversal
[Tree diagram: nodes labelled 2, 9, 5, 3]

Slide 49

Slide 49 text

Choosing the node to steal
Find first, in-order traversal: catastrophic, a lot of stealing, huge trees
[Tree diagram: nodes labelled 2, 9, 5, 3]

Slide 50

Slide 50 text

Choosing the node to steal
Find first, in-order traversal: catastrophic, a lot of stealing, huge trees
Find first, random order traversal
[Tree diagrams, one per strategy: nodes labelled 2, 9, 5, 3]

Slide 51

Slide 51 text

Choosing the node to steal
Find first, in-order traversal: catastrophic, a lot of stealing, huge trees
Find first, random order traversal: works reasonably well
[Tree diagrams, one per strategy: nodes labelled 2, 9, 5, 3]

Slide 52

Slide 52 text

Choosing the node to steal
Find first, in-order traversal: catastrophic, a lot of stealing, huge trees
Find first, random order traversal: works reasonably well
Find most elements: generates the fewest nodes, seems to be best
[Tree diagrams, one per strategy: nodes labelled 2, 9, 5, 3]
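The difference between "find first" and "find most elements" can be sketched on a small tree. The tree shape below and the reading of 2, 9, 5 and 3 as remaining element counts are assumptions; the slides show only the four numbers.

```scala
object StealTarget {
  sealed trait Tree
  final case class Leaf(remaining: Int) extends Tree
  final case class Inner(left: Tree, right: Tree) extends Tree

  // Leaves in in-order (left to right) traversal.
  def leaves(t: Tree): List[Leaf] = t match {
    case l: Leaf => List(l)
    case Inner(a, b) => leaves(a) ::: leaves(b)
  }

  // Find first: the first leaf that still has work, in in-order traversal.
  def findFirst(t: Tree): Option[Leaf] =
    leaves(t).find(_.remaining > 0)

  // Find most elements: the leaf with the largest amount of remaining work.
  def findMost(t: Tree): Option[Leaf] =
    leaves(t).filter(_.remaining > 0)
      .reduceOption((a, b) => if (b.remaining > a.remaining) b else a)
}
```

On a tree with leaves 2, 9, 5, 3, find-first picks the leaf with 2 elements, which will need another steal almost immediately, while find-most picks the leaf with 9, so fewer steals and fewer new nodes are needed overall.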

Slide 53

Slide 53 text

Comparison with fixed-size batching

Slide 54

Slide 54 text

Comparison with fixed-size batching

Slide 55

Slide 55 text

Comparison with task work-stealing

Slide 56

Slide 56 text

Thank you! Questions?

Slide 57

Slide 57 text

Finding work

Slide 58

Slide 58 text

Other workloads