Slide 1

Slide 1 text

mchakravarty TacticalGrace Data Parallelism in Haskell Manuel M T Chakravarty University of New South Wales Jointly with Gabriele Keller, Sean Lee, Roman Leshchinskiy, Ben Lippmeier, Trevor L McDonell, Simon Peyton Jones 1 » I'd like to give you an overview of our research projects on data-parallel programming in Haskell, which are joint work with… 45 minutes + 10min [10min Why DP?; 5min Design space; 18min 1st class; 12min 2nd class]

Slide 2

Slide 2 text

Part 1 Why data parallelism? 2 » Why do we focus on data parallelism?

Slide 3

Slide 3 text

“It's a great abstraction!” 3 » Read! * In what way?

Slide 4

Slide 4 text

Inside a multi-threaded program 4 * Explicit multi-threading: hard to get right * DP: single-threaded * Abstracts over concurrency control

Slide 5

Slide 5 text

Inside a data parallel program 4 * Explicit multi-threading: hard to get right * DP: single-threaded * Abstracts over concurrency control

Slide 6

Slide 6 text

ΔxΔp ~ h 5 * Indeterminism: hard to test; Heisenbugs * DP: deterministic — result independent of run and realised parallelism * Abstracts over indeterminism » Related to that is…

Slide 7

Slide 7 text

x = y ⇒ f x = f y 5 * Indeterminism: hard to test; Heisenbugs * DP: deterministic — result independent of run and realised parallelism * Abstracts over indeterminism » Related to that is…

Slide 8

Slide 8 text

6 » …that reasoning about explicit concurrency is hard * DP: compositional * Informal and formal reasoning as for sequential programs * Abstracts over interference/effects

Slide 9

Slide 9 text

map f · map g = map (f · g) 6 » …that reasoning about explicit concurrency is hard * DP: compositional * Informal and formal reasoning as for sequential programs * Abstracts over interference/effects

Slide 10

Slide 10 text

“Hardware likes SIMD!” 7 * Not only a great abstraction for developers, * also the right model for a lot of hardware.

Slide 11

Slide 11 text

8 * More and more SIMD support in CPUs (e.g., AVX)

Slide 12

Slide 12 text

9 * GPUs deliver exceptional FLOPS

Slide 13

Slide 13 text

10 * Optimise FLOPS/Watt * Less control logic * Locality

Slide 14

Slide 14 text

Energy efficiency 10 * Optimise FLOPS/Watt * Less control logic * Locality

Slide 15

Slide 15 text

“It goes a long way!” 11 * DP is a great abstraction with increasingly friendly hardware and * it is more flexible than it might first appear.

Slide 16

Slide 16 text

Flat data parallelism 12 » Concerning DP, we usually imagine rectangular structures…

Slide 17

Slide 17 text

Flat data parallelism 12 » Concerning DP, we usually imagine rectangular structures…

Slide 18

Slide 18 text

Nested data parallelism 13 » …but nested irregular arrays let us handle tree structures. * NESL

Slide 19

Slide 19 text

Nested data parallelism 13 » …but nested irregular arrays let us handle tree structures. * NESL

Slide 20

Slide 20 text

Amorphous data parallelism 14 * Sets, e.g., to represent graphs, processed with internally indeterministic operations * Blelloch's problem based benchmark suite (PBBS) » Given these varying levels of expressiveness, we have got a taxonomy…

Slide 21

Slide 21 text

Amorphous data parallelism 14 * Sets, e.g., to represent graphs, processed with internally indeterministic operations * Blelloch's problem based benchmark suite (PBBS) » Given these varying levels of expressiveness, we have got a taxonomy…

Slide 22

Slide 22 text

Part 2 The design space 15 » …which informs the design of DP language extensions and implementations.

Slide 23

Slide 23 text

Flat Nested Amorphous Embedded (2nd class) Full (1st class) 16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism

Slide 24

Slide 24 text

Flat Nested Amorphous Repa Embedded (2nd class) Full (1st class) 16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism

Slide 25

Slide 25 text

Flat Nested Amorphous Repa Data Parallel Haskell Embedded (2nd class) Full (1st class) 16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism

Slide 26

Slide 26 text

Flat Nested Amorphous Repa Data Parallel Haskell Embedded (2nd class) Full (1st class) 16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism

Slide 27

Slide 27 text

Flat Nested Amorphous Repa Accelerate Data Parallel Haskell Embedded (2nd class) Full (1st class) 16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism

Slide 28

Slide 28 text

Flat Nested Amorphous Repa Accelerate Data Parallel Haskell Embedded (2nd class) Full (1st class) 16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism

Slide 29

Slide 29 text

Embedded (2nd class) Full (1st class) Flat Nested Amorphous Repa Data Parallel Haskell Accelerate 17 * Additional hardware axis: (z) general or specialised architecture * We are a long way from fully parallelising Haskell with amorphous DP on GPUs! » We'll visit those points in the design space that our group explored while looking at applications such as the following…

Slide 30

Slide 30 text

CPU Embedded (2nd class) Full (1st class) Flat Nested Amorphous Repa Data Parallel Haskell GPU Accelerate 17 * Additional hardware axis: (z) general or specialised architecture * We are a long way from fully parallelising Haskell with amorphous DP on GPUs! » We'll visit those points in the design space that our group explored while looking at applications such as the following…

Slide 31

Slide 31 text

Some case studies Numerically intensive applications 18 * Smoothlife: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/smoothlife * FFT

Slide 32

Slide 32 text

19 * N-body: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/n-body * Hierarchical solutions benefit from nested parallelism

Slide 33

Slide 33 text

20 * Fluid flow: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/fluid * Convolution » Let's first look at our experience with parallelising full Haskell, before we move to embedded languages…

Slide 34

Slide 34 text

Part 3 First-class data parallelism: Parallelising Haskell 21

Slide 35

Slide 35 text

Data Parallel Haskell Our ambitious first take 22 » Inspired by NESL, we began by integrating NDP into Haskell…

Slide 36

Slide 36 text

Flat Nested Amorphous Repa Accelerate Data Parallel Haskell Embedded (2nd class) Full (1st class) 23 * Maximise expressiveness and convenience for the programmer * Let the compiler do the heavy lifting * Types to distinguish between parallel and sequential computations

Slide 37

Slide 37 text

sdotp :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] 24

Slide 38

Slide 38 text

type Vector = [:Double:] type SparseVector = [:(Int, Double):] sdotp :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] parallel array 24

Slide 39

Slide 39 text

type Vector = [:Double:] type SparseVector = [:(Int, Double):] sdotp :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] index value parallel array 24

Slide 40

Slide 40 text

type Vector = [:Double:] type SparseVector = [:(Int, Double):] sdotp :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] [:0, 5, 0, 0, 3:] index value parallel array 24

Slide 41

Slide 41 text

type Vector = [:Double:] type SparseVector = [:(Int, Double):] sdotp :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] [:0, 5, 0, 0, 3:] [:(1, 5), (4, 3):] index value parallel array 24

Slide 42

Slide 42 text

type Vector = [:Double:] type SparseVector = [:(Int, Double):] sdotp :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] parallel computation (no nesting) [:0, 5, 0, 0, 3:] [:(1, 5), (4, 3):] index value parallel array 24

Slide 43

Slide 43 text

sdotp :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] type SparseMatrix = [:SparseVector:] smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sdotp row v | row <- sm :] parallel computation (no nesting) 25

Slide 44

Slide 44 text

sdotp :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] type SparseMatrix = [:SparseVector:] smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sdotp row v | row <- sm :] parallel computation (no nesting) parallel computation (nested, irregular) 25
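To pin down what these two definitions mean without the DPH tool chain, here is a minimal sequential sketch that models the parallel arrays [: … :] with ordinary Haskell lists — it captures only the semantics, not the parallel evaluation (and list indexing with !! is of course linear time):

type Vector       = [Double]
type SparseVector = [(Int, Double)]   -- (index, value) pairs of the non-zero elements
type SparseMatrix = [SparseVector]

sdotp :: SparseVector -> Vector -> Double
sdotp sv v = sum [ x * (v !! i) | (i, x) <- sv ]

smvm :: SparseMatrix -> Vector -> Vector
smvm sm v = [ sdotp row v | row <- sm ]

-- > smvm [[(1, 5), (4, 3)], [(0, 2)]] [1, 2, 3, 4, 5]
-- [25.0,2.0]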

Slide 45

Slide 45 text

[: f x | x <- xs :] 26 * Two main components of DPH: (1) vectoriser & (2) backend array library * Implement as much as possible as a library (minimally invasive to GHC: type families, rewrite rules) * Based on GHC, including its highly optimising backend * For better or worse, we aim to compete with C!

Slide 46

Slide 46 text

[: f x | x <- xs :] Vectoriser map (lam $ \x -> f `app` x) xs Generalisation of NESL's flattening 26 * Two main components of DPH: (1) vectoriser & (2) backend array library * Implement as much as possible as a library (minimally invasive to GHC: type families, rewrite rules) * Based on GHC, including its highly optimising backend * For better or worse, we aim to compete with C!

Slide 47

Slide 47 text

[: f x | x <- xs :] Vectoriser map (lam $ \x -> f `app` x) xs map :: (..) => (a :-> b) -> Array a -> Array b Import Generalisation of NESL's flattening Backend library of flat, segmented parallel arrays 26 * Two main components of DPH: (1) vectoriser & (2) backend array library * Implement as much as possible as a library (minimally invasive to GHC: type families, rewrite rules) * Based on GHC, including its highly optimising backend * For better or worse, we aim to compete with C!

Slide 48

Slide 48 text

[: f x | x <- xs :] Vectoriser map (lam $ \x -> f `app` x) xs Optimise map :: (..) => (a :-> b) -> Array a -> Array b Import Generalisation of NESL's flattening Backend library of flat, segmented parallel arrays 26 * Two main components of DPH: (1) vectoriser & (2) backend array library * Implement as much as possible as a library (minimally invasive to GHC: type families, rewrite rules) * Based on GHC, including its highly optimising backend * For better or worse, we aim to compete with C!

Slide 49

Slide 49 text

[: f x | x <- xs :] Vectoriser map (lam $ \x -> f `app` x) xs Optimise Code generation Start: %cond = icmp eq i32 %a, %b br i1 %cond, label %IfEqual, label %IfUnequal map :: (..) => (a :-> b) -> Array a -> Array b Import Generalisation of NESL's flattening Backend library of flat, segmented parallel arrays 26 * Two main components of DPH: (1) vectoriser & (2) backend array library * Implement as much as possible as a library (minimally invasive to GHC: type families, rewrite rules) * Based on GHC, including its highly optimising backend * For better or worse, we aim to compete with C!
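To give a feel for what flattening does to smvm — this is a hand-written, sequential sketch using plain Data.Vector, and the names are illustrative rather than the DPH backend API — the nested array of rows becomes one flat vector of (column, value) cells plus a segment descriptor of row lengths, and the nested computation becomes flat bulk operations followed by a segmented sum:

import qualified Data.Vector as V

type SegmentedMatrix =
  ( V.Vector Int            -- row lengths (the segment descriptor)
  , V.Vector (Int, Double)  -- all (column index, value) cells, rows concatenated
  )

smvmFlat :: SegmentedMatrix -> V.Vector Double -> V.Vector Double
smvmFlat (segd, cells) v = segmentedSum segd products
  where
    -- one flat map over all cells of all rows
    products = V.map (\(i, x) -> x * (v V.! i)) cells

-- sum each segment of a flat vector, as described by the segment descriptor
segmentedSum :: V.Vector Int -> V.Vector Double -> V.Vector Double
segmentedSum segd xs = V.zipWith sumSeg starts segd
  where
    starts       = V.prescanl' (+) 0 segd   -- starting index of each segment
    sumSeg s len = V.sum (V.slice s len xs)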

Slide 50

Slide 50 text

Repa Regular parallel arrays also: turnip Dialing it down 27 * Two classes of inefficiencies with DPH: (1) due to vectorisation and (2) due to implementation of flat, parallel arrays in the backend library * Repa: focus on solving one problem (without worrying about the other)

Slide 51

Slide 51 text

Flat Nested Amorphous Repa Accelerate Data Parallel Haskell Embedded (2nd class) Full (1st class) 28 * End user API for flat DPH backend library * Restricted expressiveness, programmer guided array representations * Also, rectangular arrays are better suited for matrix algorithms and similar * Shape polymorphism & track dimensionality with types
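A minimal sketch of the shape polymorphism mentioned in the notes, assuming the repa package: because the shape is a type parameter, a single definition covers vectors (DIM1), matrices (DIM2), and higher-rank arrays, and the dimensionality is tracked in the type.

import Data.Array.Repa as R

addOne :: (Shape sh, Monad m) => Array U sh Double -> m (Array U sh Double)
addOne = computeUnboxedP . R.map (+ 1)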

Slide 52

Slide 52 text

mmultP :: Monad m => Array U Double -> Array U Double -> m (Array U Double) 29

Slide 53

Slide 53 text

mmultP arr brr = let (Z :. rs :. _ ) = extent arr (Z :. _ :. cs) = extent brr in do trr <- computeP (transpose brr) :: Array U DIM2 Double computeP (fromFunction (Z :. rs :. cs) (dotProduct trr)) mmultP :: Monad m => Array U Double -> Array U Double -> m (Array U Double) 29

Slide 54

Slide 54 text

mmultP arr brr = let (Z :. rs :. _ ) = extent arr (Z :. _ :. cs) = extent brr in do trr <- computeP (transpose brr) :: Array U DIM2 Double computeP (fromFunction (Z :. rs :. cs) (dotProduct trr)) mmultP :: Monad m => Array U Double -> Array U Double -> m (Array U Double) force to be manifest 29

Slide 55

Slide 55 text

mmultP arr brr = let (Z :. rs :. _ ) = extent arr (Z :. _ :. cs) = extent brr in do trr <- computeP (transpose brr) :: Array U DIM2 Double computeP (fromFunction (Z :. rs :. cs) (dotProduct trr)) mmultP :: Monad m => Array U Double -> Array U Double -> m (Array U Double) force to be manifest second stage 29

Slide 56

Slide 56 text

mmultP arr brr = let (Z :. rs :. _ ) = extent arr (Z :. _ :. cs) = extent brr in do trr <- computeP (transpose brr) :: Array U DIM2 Double computeP (fromFunction (Z :. rs :. cs) (dotProduct trr)) dotProduct (Z :. i :. j) = sumAllS (zipWith (*) a_row b_row) where a_row = slice arr (row i) b_row = slice trr (row j) 29
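For reference, a self-contained rendering of this matrix multiplication in the style of the Repa examples — a sketch assuming the repa package, where computeUnboxedP replaces the slide's annotated computeP calls and the slide's row helper is written out as a slice specification:

import Data.Array.Repa as R

mmultP :: Monad m
       => Array U DIM2 Double -> Array U DIM2 Double -> m (Array U DIM2 Double)
mmultP arr brr = do
  let (Z :. rs :. _ ) = extent arr
      (Z :. _  :. cs) = extent brr
  -- first stage: force the transposed brr to be manifest
  trr <- computeUnboxedP (R.transpose brr)
  -- second stage: every element is a dot product of a row of arr and a row of trr
  computeUnboxedP $ fromFunction (Z :. rs :. cs) $ \(Z :. i :. j) ->
    sumAllS $ R.zipWith (*)
      (slice arr (Any :. i :. All))   -- row i of arr
      (slice trr (Any :. j :. All))   -- row j of trr, i.e. column j of brr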

Slide 57

Slide 57 text

[: f x | x <- xs :] Vectoriser Generalisation of NESL's flattening map (\x -> f x) xs Optimise Code generation Start: %cond = icmp eq i32 %a, %b br i1 %cond, label %IfEqual, label %IfUnequal map :: (..) => (a -> b) -> Array s a -> Array s b Import Backend library of multi-dimensional parallel arrays 30

Slide 58

Slide 58 text

[: f x | x <- xs :] Vectoriser Generalisation of NESL's flattening map (\x -> f x) xs Optimise Code generation Start: %cond = icmp eq i32 %a, %b br i1 %cond, label %IfEqual, label %IfUnequal map :: (..) => (a -> b) -> Array s a -> Array s b Import Backend library of multi-dimensional parallel arrays 30

Slide 59

Slide 59 text

“Parallelising all of Haskell with C-like performance is a challenge.” 31

Slide 60

Slide 60 text

Backend code generation DPH Repa 32

Slide 61

Slide 61 text

Backend code generation DPH Repa Native code generation: 32

Slide 62

Slide 62 text

Backend code generation DPH Repa Carefully crafted code in the library (e.g., reloads for stencils) We wrote an LLVM backend (still aliasing issues) Native code generation: [Haskell 2010] [Haskell 2011] 32

Slide 63

Slide 63 text

Interaction with GHC optimisations DPH Repa 33

Slide 64

Slide 64 text

Interaction with GHC optimisations DPH Repa GHC has a complex pipeline: 33

Slide 65

Slide 65 text

Interaction with GHC optimisations DPH Repa Firing of rewrite rules and other transformations Code explosion due to inlining and code duplication GHC has a complex pipeline: 33

Slide 66

Slide 66 text

DPH Repa Array fusion 34

Slide 67

Slide 67 text

DPH Repa Challenging research program on its own: Array fusion 34

Slide 68

Slide 68 text

DPH Repa Invented equational array fusion and stream fusion Now looking at series expressions Challenging research program on its own: Array fusion [Haskell 2013: Ben's talk on Tuesday] 34

Slide 69

Slide 69 text

DPH Repa Array representations 35 DPH: flattening of arrays of structured types with type families

Slide 70

Slide 70 text

DPH Repa Invented Haskell type families to express type-dependent representations Repa type indices: declarative control of representations Array representations [Haskell 2012a] [POPL 2005, ICFP 2005, ICFP 2008] 35 DPH: flattening of arrays of structured types with type families
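The idea behind representation indices can be sketched with an associated data family — a toy, one-dimensional analogue of (not identical to) Repa's Source class: the index type selects the concrete array representation, so the choice of representation is visible in, and guided by, the types.

{-# LANGUAGE TypeFamilies, MultiParamTypeClasses, FlexibleInstances #-}
import qualified Data.Vector.Unboxed as VU

data D   -- delayed:  the array is a function from index to element
data U   -- manifest: the array is an unboxed vector

class Source r e where
  data Array r e
  index :: Array r e -> Int -> e

instance VU.Unbox e => Source U e where
  data Array U e = AUnboxed (VU.Vector e)
  index (AUnboxed v) i = v VU.! i

instance Source D e where
  data Array D e = ADelayed Int (Int -> e)   -- length and index function
  index (ADelayed _ f) i = f i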

Slide 71

Slide 71 text

Figure 12. Runtimes for Fluid Flow Solver — relative runtime vs. matrix width (64–1024) for C Gauss-Seidel, C Jacobi, and Repa with -N1, -N2, -N8 36

Slide 72

Slide 72 text

Overhead of vectorised code DPH 37

Slide 73

Slide 73 text

Overhead of vectorised code DPH GHC's simplifier removes explicit closures etc. We invented vectorisation avoidance [Haskell 2012b] 37

Slide 74

Slide 74 text

Work and space complexity of vectorisation DPH 38

Slide 75

Slide 75 text

Work and space complexity of vectorisation DPH Fusion helps in some cases (but fragile) Work complexity: invented virtual segment descriptors [ICFP 2012] 38

Slide 76

Slide 76 text

Sparse Matrix-Vector Multiplication (1% non-zero elements) — run time (100µs–1s) vs. non-zero elements (100–10M) for Baseline DPH (out of memory at 150k elements), New DPH, and Data.Vector 39

Slide 77

Slide 77 text

Barnes-Hut, 1 step, 1 thread — run time (1ms–100s) vs. bodies (10–100k) for Baseline DPH (out of memory at 5000 bodies), New DPH, and Data.Vector 40

Slide 78

Slide 78 text

“We are still not out of the woods.” 41

Slide 79

Slide 79 text

Series expression work not integrated Backend library lacks optimisations Vectorisation avoidance needs to be extended DPH DPH 42

Slide 80

Slide 80 text

Part 4 Second-class data parallelism: Embedded languages 43 * Many of the discussed problems exist or are made worse by… * (1) parallelising a very expressive language (2) implemented in a very sophisticated compiler » We surely need more modest goals if we target more challenging hardware

Slide 81

Slide 81 text

Data.Array.Accelerate Pivoting 44 » Accelerate is an embedded language…

Slide 82

Slide 82 text

Flat Nested Amorphous Repa Data Parallel Haskell Embedded (2nd class) Full (1st class) Accelerate 45 » …aimed at general-purpose GPU programming.

Slide 83

Slide 83 text

Flat Nested Amorphous Repa Data Parallel Haskell Embedded (2nd class) Full (1st class) CPU GPU Accelerate 45 » …aimed at general-purpose GPU programming.

Slide 84

Slide 84 text

dotp :: [:Float:] -> [:Float:] -> Float dotp xs ys = sumP (zipWithP (*) xs ys ) Data Parallel Haskell 46 * Dot product in Data Parallel Haskell

Slide 85

Slide 85 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) 47 * Same code embedded * 'fold' is parallel tree-reduction » The main difference are the types — let's look at them in more detail…

Slide 86

Slide 86 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) Embedded Accelerate arrays 47 * Same code embedded * 'fold' is parallel tree-reduction » The main difference are the types — let's look at them in more detail…

Slide 87

Slide 87 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) Acc marks embedded array computations Embedded Accelerate arrays 47 * Same code embedded * 'fold' is parallel tree-reduction » The main difference are the types — let's look at them in more detail…

Slide 88

Slide 88 text

data Array sh e type Vector e = Array DIM1 e type Scalar e = Array DIM0 e 48 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey

Slide 89

Slide 89 text

data Array sh e type Vector e = Array DIM1 e type Scalar e = Array DIM0 e Regular, shape-polymorphic arrays 48 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey

Slide 90

Slide 90 text

data Array sh e type Vector e = Array DIM1 e type Scalar e = Array DIM0 e Regular, shape-polymorphic arrays zipWith :: (a -> b -> c) -> [a] -> [b] -> [c] zipWith :: (…) => (Exp a -> Exp b -> Exp c) -> Acc (Array sh a) -> Acc (Array sh b) -> Acc (Array sh c) Embedded scalar expressions 48 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey

Slide 91

Slide 91 text

Acc marks embedded array computations Embedded Accelerate arrays dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) 49 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory

Slide 92

Slide 92 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) 49 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory

Slide 93

Slide 93 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) 49 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory

Slide 94

Slide 94 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) 49 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory

Slide 95

Slide 95 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys) use embeds values: let xs' = use xs; ys' = use ys in sum (zipWith (*) xs' ys') 49 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory
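Putting the pieces together, a minimal complete program might look as follows — a sketch assuming the accelerate package; it is written with fold, of which the slide's sum is the special case mentioned in the notes, and run is taken from the reference interpreter here, while a GPU backend such as Data.Array.Accelerate.CUDA exposes a run of the same shape:

import Data.Array.Accelerate             as A
import Data.Array.Accelerate.Interpreter (run)

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = A.fromList (Z :. 5) [1 .. 5]  :: Vector Float   -- host data
      ys = A.fromList (Z :. 5) [6 .. 10] :: Vector Float
  -- use embeds the host arrays; run compiles and executes the embedded program
  print $ run (dotp (use xs) (use ys))   -- a scalar array holding 130.0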

Slide 96

Slide 96 text

50 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code

Slide 97

Slide 97 text

Embedded languages …are restricted languages 50 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code

Slide 98

Slide 98 text

Embedded languages …are restricted languages The embedding partly compensates for restrictions: 50 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code

Slide 99

Slide 99 text

Embedded languages …are restricted languages Seamless integration into host language and… …host language can generate embedded code. The embedding partly compensates for restrictions: 50 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code
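One concrete reading of "host language can generate embedded code" — a sketch assuming the accelerate package, where horner is an illustrative name rather than an Accelerate function: an ordinary Haskell foldr over a host-level list of coefficients unfolds, at embedding time, into a single embedded expression that is then compiled as one array computation.

import Data.Array.Accelerate as A

-- evaluate the polynomial with the given coefficients at every array element
horner :: [Float] -> Acc (Vector Float) -> Acc (Vector Float)
horner cs xs = A.map eval xs
  where
    -- Horner's rule, unrolled by the host-level foldr over the coefficient list
    eval x = foldr (\c acc -> constant c + x * acc) 0 cs

-- horner [1, 2, 3] builds the embedded body 1 + x * (2 + x * (3 + x * 0)) once,
-- before any array data is touched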

Slide 100

Slide 100 text

map (\x -> x + 1) arr [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

Slide 101

Slide 101 text

map (\x -> x + 1) arr Reify AST Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

Slide 102

Slide 102 text

map (\x -> x + 1) arr Reify AST Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

Slide 103

Slide 103 text

map (\x -> x + 1) arr Reify AST Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

Slide 104

Slide 104 text

map (\x -> x + 1) arr Reify AST Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... CUDA compiler [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

Slide 105

Slide 105 text

map (\x -> x + 1) arr Reify AST Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... CUDA compiler Call [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
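The "Reify AST" step can be illustrated outside Accelerate with a tiny deep embedding — a toy AST, not Accelerate's actual one: overloading Num for an expression type means the host-language lambda (\x -> x + 1) is turned into a data structure simply by applying it to a variable node, much like the Map (Lam (Add `PrimApp` …)) term above.

data Exp = Var | Const Int | Add Exp Exp | Mul Exp Exp
  deriving Show

instance Num Exp where
  fromInteger = Const . fromInteger
  (+)         = Add
  (*)         = Mul
  negate e    = Mul (Const (-1)) e
  abs         = error "abs: not needed for this sketch"
  signum      = error "signum: not needed for this sketch"

data Arr = UseHost [Int]   -- embedded host data
         | Map Exp Arr     -- Map body arr, where Var refers to Map's argument
  deriving Show

-- reify an embedded map: apply the host-language function to the variable node
mapE :: (Exp -> Exp) -> Arr -> Arr
mapE f arr = Map (f Var) arr

-- > mapE (\x -> x + 1) (UseHost [1, 2, 3])
-- Map (Add Var (Const 1)) (UseHost [1,2,3])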

Slide 106

Slide 106 text

mkMap dev aenv fun arr = return $ CUTranslSkel "map" [cunit| $esc:("#include ") extern "C" __global__ void map ($params:argIn, $params:argOut) { const int shapeSize = size(shOut); const int gridSize = $exp:(gridSize dev); int ix; for ( ix = $exp:(threadIdx dev) ; ix < shapeSize ; ix += gridSize ) { $items:(dce x .=. get ix) $items:(setOut "ix" .=. f x) } } |] where ... 52 * Combinators as skeletons (code templates with holes) * Quasi-quoter for CUDA [Mainland] * Yellow anti-quotes are the holes (parameters) of the template

Slide 107

Slide 107 text

[Embed 2013] 53 Essence of skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries etc

Slide 108

Slide 108 text

Skeletons are templates …encapsulating efficient code idioms [Embed 2013] 53 Essence of skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries etc

Slide 109

Slide 109 text

Skeletons are templates …encapsulating efficient code idioms Code generation as template meta programming Foreign function interface as an escape hatch [Embed 2013] 53 Essence of skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries etc

Slide 110

Slide 110 text

Fusion is crucial [diagram: data-flow graph of producers p1–p7 and consumers c1, c2] Fusion in 2 phases, exploiting skeleton instantiation 54 * Fusion output must be in skeleton form again * Fused skeletons share GPU kernels

Slide 111

Slide 111 text

Fusion is crucial [diagram: data-flow graph of producers p1–p7 and consumers c1, c2] Fusion in 2 phases, exploiting skeleton instantiation [ICFP 2013: Trevor's talk on Wednesday] 54 * Fusion output must be in skeleton form again * Fused skeletons share GPU kernels

Slide 112

Slide 112 text

Dot product — run time (ms) vs. elements (millions, 2–20) for Data.Vector, Repa -N8, NDP2GPU, Accelerate without fusion, Accelerate with fusion, and CUBLAS 55 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

Slide 113

Slide 113 text

Dot Product — run time (ms) vs. elements (millions, 2–20) for Data.Vector, Repa -N8, NDP2GPU, Accelerate without fusion, Accelerate with fusion, and CUBLAS 55 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

Slide 114

Slide 114 text

56 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

Slide 115

Slide 115 text

Jos Stam's Fluid Flow Solver 56 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

Slide 116

Slide 116 text

“Works pretty well. Let's do more!” 57

Slide 117

Slide 117 text

Incrementally adding nested data parallelism How about amorphous data parallelism? Need a wider range of applications 58

Slide 118

Slide 118 text

Summary Data parallelism is a convenient & effective abstraction Parallelising full Haskell is possible, but a lot of hard work Embedded languages are an attractive compromise types >< state languages 59

Slide 119

Slide 119 text

References [EuroPar 2001] Nepal -- Nested Data-Parallelism in Haskell. Chakravarty, Keller, Lechtchinsky & Pfannenstiel. In "Euro-Par 2001: Parallel Processing, 7th Intl. Euro-Par Conference", 2001. [POPL 2005] Associated Types with Class. Chakravarty, Keller, Peyton Jones & Marlow. In Proceedings "The 32nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL'05)", ACM Press, 2005. [ICFP 2005] Associated Type Synonyms. Chakravarty, Keller & Peyton Jones. In Proceedings of "The Tenth ACM SIGPLAN International Conference on Functional Programming", ACM Press, 2005. [TLDI 2007] System F with Type Equality Coercions. Sulzmann, Chakravarty, Peyton Jones & Donnelly. In Proceedings of "The Third ACM SIGPLAN Workshop on Types in Language Design and Implementation", ACM Press, 2007. [ICFP 2008] Type Checking with Open Type Functions. Schrijvers, Peyton Jones, Chakravarty & Sulzmann. In Proceedings of "ICFP 2008 : The 13th ACM SIGPLAN International Conference on Functional Programming", ACM Press, 2008. 60

Slide 120

Slide 120 text

[FSTTCS 2008] Harnessing the Multicores: Nested Data Parallelism in Haskell. Peyton Jones, Leshchinskiy, Keller & Chakravarty. In "IARCS Annual Conf. on Foundations of Software Technology & Theoretical Computer Science", 2008. [ICFP 2010] Regular, shape-polymorphic, parallel arrays in Haskell. Keller, Chakravarty, Leshchinskiy, Peyton Jones & Lippmeier. In Proceedings of "ICFP 2010: The 15th ACM SIGPLAN Intl. Conf. on Functional Programming", 2010. [Haskell 2010] An LLVM Backend for GHC. Terei & Chakravarty. In Proceedings of "ACM SIGPLAN Haskell Symposium 2010", ACM Press, 2010. [DAMP 2011] Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", ACM Press, 2011. [Haskell 2011] Efficient Parallel Stencil Convolution in Haskell. Lippmeier & Keller. In Proceedings of "ACM SIGPLAN Haskell Symposium 2011", ACM Press, 2011. [ICFP 2012] Work Efficient Higher-Order Vectorisation. Lippmeier, Chakravarty, Keller, Leshchinskiy & Peyton Jones. ACM Press, 2012. 61

Slide 121

Slide 121 text

[Haskell 2012a] Guiding Parallel Array Fusion with Indexed Types. Lippmeier, Chakravarty, Keller & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell Symposium 2012", ACM Press, 2012. [Haskell 2012b] Vectorisation Avoidance. Keller, Chakravarty, Lippmeier, Leshchinskiy & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell Symposium 2012", ACM Press, 2012. [ICFP 2013] Optimising Purely Functional GPU Programs. McDonell, Chakravarty, Keller & Lippmeier. In "ACM SIGPLAN International Conference on Functional Programming", ACM Press, 2013. [Haskell 2013] Data Flow Fusion with Series Expressions in Haskell. Lippmeier, Chakravarty, Keller & Robinson. In Proceedings of "ACM SIGPLAN Haskell Symposium 2013", ACM Press, 2013. [Embed 2013] Embedding Foreign Code. Clifton-Everest, McDonell, Chakravarty & Keller. Submitted for publication, 2013. http://www.cse.unsw.edu.au/~chak/papers/CMCK14.html 62

Slide 122

Slide 122 text

Images from http://wikimedia.org http://openclipart.org 63