Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data Parallelism in Haskell

Data Parallelism in Haskell

The implicit data parallelism in collective operations on aggregate data structures constitutes an attractive parallel programming model for functional languages. Beginning with our work on integrating nested data parallelism into Haskell, we explored a variety of different approaches to array-centric data parallel programming in Haskell, experimented with a range of code generation and optimisation strategies, and targeted both multicore CPUs and GPUs. In addition to practical tools for parallel programming, the outcomes of this research programme include more widely applicable concepts, such as Haskell’s type families and stream fusion. In this talk, I will contrast the different approaches to data parallel programming that we explored. I will discuss their strengths and weaknesses and review what we have learnt in the course of exploring the various options. This includes our experience of implementing these approaches in the Glasgow Haskell Compiler as well as the experimental results that we have gathered so far. Finally, I will outline the remaining open challenges and our plans for the future.

2296a6bdc7779fe4017a23d268c8a79d?s=128

Manuel Chakravarty
PRO

September 23, 2013
Tweet

Transcript

  1. mchakravarty α TacticalGrace TacticalGrace Data Parallelism in Haskell Manuel M

    T Chakravarty University of New South Wales Jointly with Gabriele Keller Sean Lee Roman Leshchinskiy Ben Lippmeier Trevor L McDonell Simon Peyton Jones 1 » I like to give you an overview over our research projects on DPP in Haskell, which are joint work with… 45 minutes + 10min [10min Why DP?; 5min Design space; 18min 1st class; 12min 2nd class]
  2. Part 1 Why data parallelism? 2 » Why do we

    focus on data parallelism?
  3. “It's a great abstraction!” 3 » Read! * In what

    way?
  4. Inside a multi-threaded program 4 * Explicit multi-threading: hard to

    get right * DP: single-threaded * Abstracts over concurrency control
  5. Inside a data parallel program 4 * Explicit multi-threading: hard

    to get right * DP: single-threaded * Abstracts over concurrency control
  6. ΔxΔp ~ h 5 * Indeterminism: hard to test; Heisenbugs

    * DP: deterministic — result independent of run and realised parallelism * Abstracts over indeterminism » Related to that is…
  7. x = y ⇒ f x = f y 5

    * Indeterminism: hard to test; Heisenbugs * DP: deterministic — result independent of run and realised parallelism * Abstracts over indeterminism » Related to that is…
  8. 6 » …that reasoning about explicit concurrency is hard *

    DP: compositional * Informal and formal reasoning as for sequential programs * Abstracts over interference/effects
  9. map f · map g = map (f · g)

    6 » …that reasoning about explicit concurrency is hard * DP: compositional * Informal and formal reasoning as for sequential programs * Abstracts over interference/effects
  10. “Hardware likes SIMD!” 7 * Not only a great abstraction

    for developers, * also the right model for a lot of hardware.
  11. 8 * More and more SIMD support in CPUs (e.g.,

    AVX)
  12. 9 * GPUs deliver exceptional FLOPS

  13. 10 * Optimse FLOPS/Watt * Less control logic * Locality

  14. Energy efficiency 10 * Optimse FLOPS/Watt * Less control logic

    * Locality
  15. “It goes a long way!” 11 * DP is a

    great abstraction with increasingly friendly hardware and * it is more flexible than it might first appear.
  16. Flat data parallelism 12 » Concerning DP, we usually imagine

    rectangular structures…
  17. Flat data parallelism 12 » Concerning DP, we usually imagine

    rectangular structures…
  18. Nested data parallelism 13 » …but nested irregular arrays let

    us handle tree structures. * NESL
  19. Nested data parallelism 13 » …but nested irregular arrays let

    us handle tree structures. * NESL
  20. Amorphous data parallelism 14 * Sets, e.g., to represent graphs,

    processed with internally indeterministic operations * Blelloch's problem based benchmark suite (PBBS) » Given these varying levels of expressiveness, we have got a taxonomy…
  21. Amorphous data parallelism 14 * Sets, e.g., to represent graphs,

    processed with internally indeterministic operations * Blelloch's problem based benchmark suite (PBBS) » Given these varying levels of expressiveness, we have got a taxonomy…
  22. Part 2 The design space 15 » …which informs the

    design of DP language extensions and implementations.
  23. Flat Nested Amorphous Embedded (2nd class) Full (1st class) 16

    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism
  24. Flat Nested Amorphous Repa Embedded (2nd class) Full (1st class)

    16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism
  25. Flat Nested Amorphous Repa Data Parallel Haskell Embedded (2nd class)

    Full (1st class) 16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism
  26. Flat Nested Amorphous Repa Data Parallel Haskell Embedded (2nd class)

    Full (1st class) 16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism
  27. Flat Nested Amorphous Repa Accelerate Data Parallel Haskell Embedded (2nd

    class) Full (1st class) 16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism
  28. Flat Nested Amorphous Repa Accelerate Data Parallel Haskell Embedded (2nd

    class) Full (1st class) 16 * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of parallelism
  29. Embedded (2nd class) Full (1st class) Flat Nested Amorphous Repa

    Data Parallel Haskell Accelerate 17 * Additional hardware axis: (z) general or specialised architecture * We are a long way from fully parallelising Haskell with amorphous DP on GPUs! » We'll visit those points in the design space that our group explored while looking at applications such as the following…
  30. CPU Embedded (2nd class) Full (1st class) Flat Nested Amorphous

    Repa Data Parallel Haskell GPU Accelerate 17 * Additional hardware axis: (z) general or specialised architecture * We are a long way from fully parallelising Haskell with amorphous DP on GPUs! » We'll visit those points in the design space that our group explored while looking at applications such as the following…
  31. Some case studies Numerically intensive applications 18 * Smoothlife: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/

    smoothlife * FFT
  32. 19 * N-body: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/n- body * Hierarchical solutions benefit from

    nested parallelism
  33. 20 * Fluid flow: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/ fluid * Convolution » Let's

    first look at our experience with parallelising full Haskell, before we move to embedded languages…
  34. Part 3 First-class data parallelism: Parallelising Haskell 21

  35. Data Parallel Haskell Our ambitious first take 22 » Inspired

    by NESL, we began by integrating NDP into Haskell…
  36. Flat Nested Amorphous Repa Accelerate Data Parallel Haskell Embedded (2nd

    class) Full (1st class) 23 * Maximise expressivenes and convenience for the programmer * Let the compiler do the heavy lifting * Types to distinguish between parallel and sequential computations
  37. sdotp :: SparseVector -> Vector -> Double sdotp sv v

    = sumP [: x * (v !: i) | (i, x) <- sv :] 24
  38. type Vector = [:Double:] type SparseVector = [:(Int, Double):] sdotp

    :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] parallel array 24
  39. type Vector = [:Double:] type SparseVector = [:(Int, Double):] sdotp

    :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] index value parallel array 24
  40. type Vector = [:Double:] type SparseVector = [:(Int, Double):] sdotp

    :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] [:0, 5, 0, 0, 3:] index value parallel array 24
  41. type Vector = [:Double:] type SparseVector = [:(Int, Double):] sdotp

    :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] [:0, 5, 0, 0, 3:] [:(1, 5), (4, 3):] index value parallel array 24
  42. type Vector = [:Double:] type SparseVector = [:(Int, Double):] sdotp

    :: SparseVector -> Vector -> Double sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :] parallel computation (no nesting) [:0, 5, 0, 0, 3:] [:(1, 5), (4, 3):] index value parallel array 24
  43. sdotp :: SparseVector -> Vector -> Double sdotp sv v

    = sumP [: x * (v !: i) | (i, x) <- sv :] type SparseMatrix = [:SparseVector:] smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sdotp row v | row <- sm :] parallel computation (no nesting) 25
  44. sdotp :: SparseVector -> Vector -> Double sdotp sv v

    = sumP [: x * (v !: i) | (i, x) <- sv :] type SparseMatrix = [:SparseVector:] smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sdotp row v | row <- sm :] parallel computation (no nesting) parallel computation (nested, irregular) 25
  45. [: f x | x <- xs :] 26 *

    Two main components of DPH: (1) vectoriser & (2) backend array library * Implement as much as possible as a library (minimally invasive to GHC: type families, rewrite rules) * Based on GHC, including its highly optimising backend * For better or worse, we aim to compete with C!
  46. [: f x | x <- xs :] Vectoriser map

    (lam $ \x -> f `app` x) xs Generalisation of NESL's flattening 26 * Two main components of DPH: (1) vectoriser & (2) backend array library * Implement as much as possible as a library (minimally invasive to GHC: type families, rewrite rules) * Based on GHC, including its highly optimising backend * For better or worse, we aim to compete with C!
  47. [: f x | x <- xs :] Vectoriser map

    (lam $ \x -> f `app` x) xs map :: (..) => (a :-> b) -> Array a -> Array b Import Generalisation of NESL's flattening Backend library of flat, segmented parallel arrays 26 * Two main components of DPH: (1) vectoriser & (2) backend array library * Implement as much as possible as a library (minimally invasive to GHC: type families, rewrite rules) * Based on GHC, including its highly optimising backend * For better or worse, we aim to compete with C!
  48. [: f x | x <- xs :] Vectoriser map

    (lam $ \x -> f `app` x) xs Optimise map :: (..) => (a :-> b) -> Array a -> Array b Import Generalisation of NESL's flattening Backend library of flat, segmented parallel arrays 26 * Two main components of DPH: (1) vectoriser & (2) backend array library * Implement as much as possible as a library (minimally invasive to GHC: type families, rewrite rules) * Based on GHC, including its highly optimising backend * For better or worse, we aim to compete with C!
  49. [: f x | x <- xs :] Vectoriser map

    (lam $ \x -> f `app` x) xs Optimise Code generation Start: %cond = icmp eq i32 %a, %b br i1 %cond, label %IfEqual, label %IfUnequal map :: (..) => (a :-> b) -> Array a -> Array b Import Generalisation of NESL's flattening Backend library of flat, segmented parallel arrays 26 * Two main components of DPH: (1) vectoriser & (2) backend array library * Implement as much as possible as a library (minimally invasive to GHC: type families, rewrite rules) * Based on GHC, including its highly optimising backend * For better or worse, we aim to compete with C!
  50. Repa Regular parallel arrays also: turnip Dialing it down 27

    * Two classes of inefficiencies with DPH: (1) due to vectorisation and (2) due to implementation of flat, parallel arrays in the backend library * Repa: focus on solving one problem (without worrying about the other)
  51. Flat Nested Amorphous Repa Accelerate Data Parallel Haskell Embedded (2nd

    class) Full (1st class) 28 * End user API for flat DPH backend library * Restricted expressiveness, programmer guided array representations * Also, rectangular arrays are better suited for matrix algorithms and similar * Shape polymorphism & track dimensionality with types
  52. mmultP :: Monad m => Array U Double -> Array

    U Double -> m (Array U Double) 29
  53. mmultP arr brr = let (Z :. rs :. _

    ) = extent arr (Z :. _ :. cs) = extent brr in do trr <- computeP (transpose brr) :: Array U DIM2 Double computeP (fromFunction (Z :. rs :. cs) (dotProduct trr)) mmultP :: Monad m => Array U Double -> Array U Double -> m (Array U Double) 29
  54. mmultP arr brr = let (Z :. rs :. _

    ) = extent arr (Z :. _ :. cs) = extent brr in do trr <- computeP (transpose brr) :: Array U DIM2 Double computeP (fromFunction (Z :. rs :. cs) (dotProduct trr)) mmultP :: Monad m => Array U Double -> Array U Double -> m (Array U Double) force to be manifest 29
  55. mmultP arr brr = let (Z :. rs :. _

    ) = extent arr (Z :. _ :. cs) = extent brr in do trr <- computeP (transpose brr) :: Array U DIM2 Double computeP (fromFunction (Z :. rs :. cs) (dotProduct trr)) mmultP :: Monad m => Array U Double -> Array U Double -> m (Array U Double) force to be manifest second stage 29
  56. mmultP arr brr = let (Z :. rs :. _

    ) = extent arr (Z :. _ :. cs) = extent brr in do trr <- computeP (transpose brr) :: Array U DIM2 Double computeP (fromFunction (Z :. rs :. cs) (dotProduct trr)) dotProduct (Z :. i :. j) = sumAllS (zipWith (*) a_row b_row) where a_row = slice arr (row i) t_row = slice trr (row j) 29
  57. [: f x | x <- xs :] Vectoriser Generalisation

    of NESL's flattening map (\x -> f x) xs Optimise Code generation Start: %cond = icmp eq i32 %a, %b br i1 %cond, label %IfEqual, label %IfUnequal map :: (..) => (a -> b) -> Array s a -> Array s b Import Backend library of multi-dimensional parallel arrays 30 *
  58. [: f x | x <- xs :] Vectoriser Generalisation

    of NESL's flattening map (\x -> f x) xs Optimise Code generation Start: %cond = icmp eq i32 %a, %b br i1 %cond, label %IfEqual, label %IfUnequal map :: (..) => (a -> b) -> Array s a -> Array s b Import Backend library of multi-dimensional parallel arrays 30 *
  59. “Parallelising all of Haskell with C-like performance is a challenge.”

    31
  60. Backend code generation DPH Repa 32

  61. Backend code generation DPH Repa Native code generation: 32

  62. Backend code generation DPH Repa Carefully crafted code in the

    library (e.g., reloads for stencils) We wrote an LLVM backend (still aliasing issues) Native code generation: [Haskell 2010] [Haskell 2011] 32
  63. Interaction with GHC optimisations DPH Repa 33

  64. Interaction with GHC optimisations DPH Repa GHC has a complex

    pipeline: 33
  65. Interaction with GHC optimisations DPH Repa Firing of rewrite rules

    and other transformations Code explosion due to inlining and code duplication GHC has a complex pipeline: 33
  66. DPH Repa Array fusion 34

  67. DPH Repa Challenging research program on its own: Array fusion

    34
  68. DPH Repa Invented equational array fusion and stream fusion Now

    looking at series expressions Challenging research program on its own: Array fusion [Haskell 2013: Ben's talk on Tuesday] 34
  69. DPH Repa Array representations 35 DPH: flattening of arrays of

    structured types with type families
  70. DPH Repa Invented Haskell type families to express type-dependent representations

    Repa type indices: declarative control of representations Array representations [Haskell 2012a] [POPL 2005, ICFP 2005, ICFP 2008] 35 DPH: flattening of arrays of structured types with type families
  71. 0 0.5 1 1.5 2 2.5 64 96 128 192

    256 384 512 768 1024 relative runtime matrix width C Gauss-Seidel C Jacobi Repa -N1 Repa -N2 Repa -N8 Figure 12. Runtimes for Fluid Flow Solver at the start of the program, then swaps these buffers after every relaxation iteration. Doing this improves memory locality, reducing the data cache-miss rate. In contrast, the Repa version allocates 36
  72. Overhead of vectorised code DPH 37

  73. Overhead of vectorised code DPH GHC's simplifier removes explicit closures

    etc. We invented vectorisation avoidance [Haskell 2012b] 37
  74. Work and space complexity of vectorisation DPH 38

  75. Work and space complexity of vectorisation DPH Fusion helps in

    some cases (but fragile) Work complexity: invented virtual segment descriptors [ICFP 2012] 38
  76. 100us 1ms 10ms 100ms 1s 100 1k 10k 100k 1M

    10M Run Time Non-Zero Elements Sparse Matrix-Vector Multiplication (1% non-zero elements) Out Of Memory (150k elements) Baseline DPH New DPH Using Data.Vector 39
  77. 1ms 10ms 100ms 1s 10s 100s 10 100 1k 10k

    100k Run Time Bodies Barnes-Hut, 1 step, 1 thread Out Of Memory (5000 bodies) Baseline DPH New DPH With Data.Vector 40
  78. “We are still not out of the woods.” 41

  79. Series expression work not integrated Backend library lacks optimisations Vectorisation

    avoidance needs to be extended DPH DPH 42
  80. Part 4 Second-class data parallelism: Embedded languages 43 * Many

    of the discussed problems exist or are made worse by… * (1) parallelising a very expressive language (2) implemented in a very sophisticated compiler » We surely need more modest goals if we target more challenging hardware
  81. Data.Array.Accelerate Pivoting 44 » Accelerate is an embedded language…

  82. Flat Nested Amorphous Repa Data Parallel Haskell Embedded (2nd class)

    Full (1st class) Accelerate 45 » …aimed at general-purpose GPU programming.
  83. Flat Nested Amorphous Repa Data Parallel Haskell Embedded (2nd class)

    Full (1st class) CPU GPU Accelerate 45 » …aimed at general-purpose GPU programming.
  84. dotp :: [:Float:] -> [:Float:] -> Float dotp xs ys

    = sumP (zipWithP (*) xs ys ) Data Parallel Haskell 46 * Dot product in Data Parallel Haskell
  85. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) 47 * Same code embedded * 'fold' is parallel tree-reduction » The main difference are the types — let's look at them in more detail…
  86. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) Embedded Accelerate arrays 47 * Same code embedded * 'fold' is parallel tree-reduction » The main difference are the types — let's look at them in more detail…
  87. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) Acc marks embedded array computations Embedded Accelerate arrays 47 * Same code embedded * 'fold' is parallel tree-reduction » The main difference are the types — let's look at them in more detail…
  88. data Array sh e type Vector e = Array DIM1

    e type Scalar e = Array DIM0 e 48 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey
  89. data Array sh e type Vector e = Array DIM1

    e type Scalar e = Array DIM0 e Regular, shape-polymorphic arrays 48 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey
  90. data Array sh e type Vector e = Array DIM1

    e type Scalar e = Array DIM0 e Regular, shape-polymorphic arrays zipWith :: (a -> b -> c) -> [a] -> [b] -> [c] zipWith :: (…) => (Exp a -> Exp b -> Exp c) -> Acc (Array sh a) -> Acc (Array sh b) -> Acc (Array sh c) Embedded scalar expressions 48 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey
  91. Acc marks embedded array computations Embedded Accelerate arrays dotp ::

    Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) 49 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory
  92. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) 49 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory
  93. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) 49 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory
  94. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) 49 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory
  95. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = sum (zipWith (*) xs ys ) use embeds values let xs' = use xs ys' = use ys in ' ' 49 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory
  96. 50 Essence of embedded languages (transcending Accelerate) * High-level for

    expressiveness; restricted for efficiency * Further abstractions: generate embedded code
  97. Embedded languages …are restricted languages 50 Essence of embedded languages

    (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code
  98. Embedded languages …are restricted languages The embedding partly compensates for

    restrictions: 50 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code
  99. Embedded languages …are restricted languages Seamless integration into host language

    and… …host language can generate embedded code. The embedding partly compensates for restrictions: 50 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code
  100. map (\x -> x + 1) arr [DAMP 2011] 51

    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  101. map (\x -> x + 1) arr Reify AST Map

    (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  102. map (\x -> x + 1) arr Reify AST Map

    (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  103. map (\x -> x + 1) arr Reify AST Map

    (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  104. map (\x -> x + 1) arr Reify AST Map

    (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... CUDA compiler [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  105. map (\x -> x + 1) arr Reify AST Map

    (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... CUDA compiler Call [DAMP 2011] 51 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  106. mkMap dev aenv fun arr = return $ CUTranslSkel "map"

    [cunit| $esc:("#include <accelerate_cuda.h>") extern "C" __global__ void map ($params:argIn, $params:argOut) { const int shapeSize = size(shOut); const int gridSize = $exp:(gridSize dev); int ix; for ( ix = $exp:(threadIdx dev) ; ix < shapeSize ; ix += gridSize ) { $items:(dce x .=. get ix) $items:(setOut "ix" .=. f x) } } |] where ... 52 * Combinators as skeletons (code templates with holes) * Quasi-quoter for CUDA [Mainland] * Yellow anti-quotes are the holes (parameters) of the template
  107. [Embed 2013] 53 Essence of skeleton-based generative programming (transcending Accelerate)

    * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries etc
  108. Skeletons are templates …encapsulating efficient code idioms [Embed 2013] 53

    Essence of skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries etc
  109. Skeletons are templates …encapsulating efficient code idioms Code generation as

    template meta programming Foreign function interface as an escape hatch [Embed 2013] 53 Essence of skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries etc
  110. Fusion is crucial c2 p5 p1 c1 p6 p7 p3

    p2 p4 Fusion in 2 phases, exploiting skeleton instantiation 54 * Fusion output must be in skeleton form again * Fused skeletons share GPU kernels
  111. Fusion is crucial c2 p5 p1 c1 p6 p7 p3

    p2 p4 Fusion in 2 phases, exploiting skeleton instantiation [ICFP 2013: Trevor's talk on Wednesday] 54 * Fusion output must be in skeleton form again * Fused skeletons share GPU kernels
  112. 0.1 1 10 100 2 4 6 8 10 12

    14 16 18 20 Run Time (ms) Elements (millions) Dot product Data.Vector Repa -N8 NDP2GPU Accelerate -fusion ... +fusion CUBLAS 55 * C (red) is on one CPU core (Xenon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xenon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)
  113. 0.1 1 10 100 2 4 6 8 10 12

    14 16 18 20 Run Time (ms) Elements (millions) Dot product Data.Vector Repa -N8 NDP2GPU Accelerate -fusion ... +fusion CUBLAS Dot Product 55 * C (red) is on one CPU core (Xenon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xenon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)
  114. 56 * C (red) is on one CPU core (Xenon

    E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xenon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)
  115. Jos Stam's Fluid Flow Solver 56 * C (red) is

    on one CPU core (Xenon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xenon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)
  116. “Works pretty well. Let's do more!” 57

  117. Incrementally adding nested data parallelism How about amorphous data parallelism?

    Need a wider range of applications 58
  118. Summary Data parallelism is a convenient & effective abstraction Parallelising

    full Haskell is possible, but a lot of hard work Embedded languages are an attractive compromise types >< state languages 59
  119. References [EuroPar 2001] Nepal -- Nested Data-Parallelism in Haskell. Chakravarty,

    Keller, Lechtchinsky & Pfannenstiel. In "Euro-Par 2001: Parallel Processing, 7th Intl. Euro-Par Conference", 2001. [POPL 2005] Associated Types with Class. Chakravarty, Keller, Peyton Jones & Marlow. In Proceedings "The 32nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL'05)", ACM Press, 2005. [ICFP 2005] Associated Type Synonyms. Chakravarty, Keller & Peyton Jones. In Proceedings of "The Tenth ACM SIGPLAN International Conference on Functional Programming", ACM Press, 2005. [TLDI 2007] System F with Type Equality Coercions. Sulzmann, Chakravarty, Peyton Jones & Donnelly. In Proceedings of "The Third ACM SIGPLAN Workshop on Types in Language Design and Implementation", ACM Press, 2007. [ICFP 2008] Type Checking with Open Type Functions. Schrijvers, Peyton Jones, Chakravarty & Sulzmann. In Proceedings of "ICFP 2008 : The 13th ACM SIGPLAN International Conference on Functional Programming", ACM Press, 2008. 60
  120. [FSTTCS 2008] Harnessing the Multicores: Nested Data Parallelism in Haskell.

    Peyton Jones, Leshchinskiy, Keller & Chakravarty. In "IARCS Annual Conf. on Foundations of Software Technology & Theoretical Computer Science",2008. [ICFP 2010] Regular, shape-polymorphic, parallel arrays in Haskell. Keller, Chakravarty, Leshchinskiy, Peyton Jones & Lippmeier. In Proceedings of "ICFP 2010 : The 15th ACM SIGPLAN Intl. Conf. on Functional Programming", 2010. [Haskell 2010] An LLVM Backend for GHC. Terei & Chakravarty. In Proceedings of "ACM SIGPLAN Haskell Symposium 2010", ACM Press, 2010. [DAMP 2011] Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", ACM Press, 2011. [Haskell 2011] Efficient Parallel Stencil Convolution in Haskell. Lippmeier & Keller. In Proceedings of "ACM SIGPLAN Haskell Symposium 2011", ACM Press, 2011. [ICFP 2012] Work Efficient Higher-Order Vectorisation. Lippmeier, Chakravarty, Keller, Leshchinskiy & Peyton Jones. ACM Press, 2012. 61
  121. [Haskell 2012a] Guiding Parallel Array Fusion with Indexed Types. Lippmeier,

    Chakravarty, Keller & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell Symposium 2012", ACM Press, 2012. [Haskell 2012b] Vectorisation Avoidance. Keller, Chakravarty, Lippmeier, Leshchinskiy & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell Symposium 2012", ACM Press, 2012. [ICFP 2013] Optimising Purely Functional GPU Programs. McDonell, Chakravarty, Keller & Lippmeier. In "ACM SIGPLAN International Conference on Functional Programming", ACM Press, 2013. [Haskell 2013] Data Flow Fusion with Series Expressions in Haskell. Lippmeier, Chakravarty, Keller & Robinson. In Proceedings of "ACM SIGPLAN Haskell Symposium 2013", ACM Press, 2013. [Embed 2013] Embedding Foreign Code. Clifton-Everest, McDonell, Chakravarty & Keller. Submitted for publication, 2013. http://www.cse.unsw.edu.au/~chak/papers/ CMCK14.html 62
  122. Images from http://wikimedia.org http://openclipart.org 63