
Data Parallelism in Haskell

The implicit data parallelism in collective operations on aggregate data structures constitutes an attractive parallel programming model for functional languages. Beginning with our work on integrating nested data parallelism into Haskell, we explored a variety of different approaches to array-centric data parallel programming in Haskell, experimented with a range of code generation and optimisation strategies, and targeted both multicore CPUs and GPUs. In addition to practical tools for parallel programming, the outcomes of this research programme include more widely applicable concepts, such as Haskell’s type families and stream fusion. In this talk, I will contrast the different approaches to data parallel programming that we explored. I will discuss their strengths and weaknesses and review what we have learnt in the course of exploring the various options. This includes our experience of implementing these approaches in the Glasgow Haskell Compiler as well as the experimental results that we have gathered so far. Finally, I will outline the remaining open challenges and our plans for the future.

Manuel Chakravarty

September 23, 2013



Transcript

  1. mchakravarty
     TacticalGrace
    Data Parallelism in Haskell
    Manuel M T Chakravarty
    University of New South Wales
    Jointly with
    Gabriele Keller
    Sean Lee
    Roman Leshchinskiy
    Ben Lippmeier
    Trevor L McDonell
    Simon Peyton Jones
    1
     » I'd like to give you an overview of our research projects on data parallel programming (DPP) in Haskell, which are joint work with…
    45 minutes + 10min
    [10min Why DP?; 5min Design space; 18min 1st class; 12min 2nd class]


  2. Part 1
    Why data parallelism?
    2
    » Why do we focus on data parallelism?


  3. “It's a great abstraction!”
    3
    » Read!
    * In what way?


  4. Inside a multi-threaded program
    4
    * Explicit multi-threading: hard to get right
    * DP: single-threaded
    * Abstracts over concurrency control


  5. Inside a data parallel program
    4
    * Explicit multi-threading: hard to get right
    * DP: single-threaded
    * Abstracts over concurrency control


  6. ΔxΔp ~ h
    5
    * Indeterminism: hard to test; Heisenbugs
    * DP: deterministic — result independent of run and realised parallelism
    * Abstracts over indeterminism
    » Related to that is…


  7. x = y ⇒ f x = f y
    5
    * Indeterminism: hard to test; Heisenbugs
    * DP: deterministic — result independent of run and realised parallelism
    * Abstracts over indeterminism
    » Related to that is…


  8. 6
    » …that reasoning about explicit concurrency is hard
    * DP: compositional
    * Informal and formal reasoning as for sequential programs
    * Abstracts over interference/effects


  9. map f · map g = map (f · g)
    6
    » …that reasoning about explicit concurrency is hard
    * DP: compositional
    * Informal and formal reasoning as for sequential programs
    * Abstracts over interference/effects


  10. “Hardware likes SIMD!”
    7
    * Not only a great abstraction for developers,
    * also the right model for a lot of hardware.


  11. 8
    * More and more SIMD support in CPUs (e.g., AVX)


  12. 9
    * GPUs deliver exceptional FLOPS


  13. 10
     * Optimise FLOPS/Watt
    * Less control logic
    * Locality


  14. Energy efficiency
    10
     * Optimise FLOPS/Watt
    * Less control logic
    * Locality


  15. “It goes a long way!”
    11
    * DP is a great abstraction with increasingly friendly hardware and
    * it is more flexible than it might first appear.


  16. Flat data parallelism
    12
    » Concerning DP, we usually imagine rectangular structures…


  17. Flat data parallelism
    12
    » Concerning DP, we usually imagine rectangular structures…


  18. Nested data parallelism
    13
    » …but nested irregular arrays let us handle tree structures.
    * NESL


  19. Nested data parallelism
    13
    » …but nested irregular arrays let us handle tree structures.
    * NESL


  20. Amorphous data parallelism
    14
    * Sets, e.g., to represent graphs, processed with internally indeterministic operations
     * Blelloch's problem-based benchmark suite (PBBS)
    » Given these varying levels of expressiveness, we have got a taxonomy…


  21. Amorphous data parallelism
    14
    * Sets, e.g., to represent graphs, processed with internally indeterministic operations
     * Blelloch's problem-based benchmark suite (PBBS)
    » Given these varying levels of expressiveness, we have got a taxonomy…


  22. Part 2
    The design space
    15
    » …which informs the design of DP language extensions and implementations.


  23. Flat
    Nested
    Amorphous
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  24. Flat
    Nested
    Amorphous
    Repa
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  25. Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  26. Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  27. Flat
    Nested
    Amorphous
    Repa
    Accelerate
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  28. Flat
    Nested
    Amorphous
    Repa
    Accelerate
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  29. Embedded
    (2nd class)
    Full
    (1st class)
    Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    Accelerate
    17
    * Additional hardware axis: (z) general or specialised architecture
    * We are a long way from fully parallelising Haskell with amorphous DP on GPUs!
    » We'll visit those points in the design space that our group explored while looking at
    applications such as the following…


  30. CPU
    Embedded
    (2nd class)
    Full
    (1st class)
    Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    GPU
    Accelerate
    17
    * Additional hardware axis: (z) general or specialised architecture
    * We are a long way from fully parallelising Haskell with amorphous DP on GPUs!
    » We'll visit those points in the design space that our group explored while looking at
    applications such as the following…


  31. Some case studies
    Numerically intensive applications
    18
     * Smoothlife: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/smoothlife
    * FFT


  32. 19
     * N-body: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/n-body
    * Hierarchical solutions benefit from nested parallelism


  33. 20
     * Fluid flow: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/fluid
    * Convolution
    » Let's first look at our experience with parallelising full Haskell, before we move to
    embedded languages…


  34. Part 3
    First-class data parallelism:
    Parallelising Haskell
    21


  35. Data Parallel Haskell
    Our ambitious first take
    22
    » Inspired by NESL, we began by integrating NDP into Haskell…


  36. Flat
    Nested
    Amorphous
    Repa
    Accelerate
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    23
     * Maximise expressiveness and convenience for the programmer
    * Let the compiler do the heavy lifting
    * Types to distinguish between parallel and sequential computations


  37. sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    24


  38. type Vector = [:Double:]
    type SparseVector = [:(Int, Double):]
    sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    parallel array
    24


  39. type Vector = [:Double:]
    type SparseVector = [:(Int, Double):]
    sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    index value
    parallel array
    24


  40. type Vector = [:Double:]
    type SparseVector = [:(Int, Double):]
    sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    [:0, 5, 0, 0, 3:]
    index value
    parallel array
    24


  41. type Vector = [:Double:]
    type SparseVector = [:(Int, Double):]
    sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    [:0, 5, 0, 0, 3:] [:(1, 5), (4, 3):]
    index value
    parallel array
    24


  42. type Vector = [:Double:]
    type SparseVector = [:(Int, Double):]
    sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    parallel computation (no nesting)
    [:0, 5, 0, 0, 3:] [:(1, 5), (4, 3):]
    index value
    parallel array
    24


  43. sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    type SparseMatrix = [:SparseVector:]
    smvm :: SparseMatrix -> Vector
    -> Vector
    smvm sm v
    = [: sdotp row v | row <- sm :]
    parallel computation (no nesting)
    25


  44. sdotp :: SparseVector -> Vector -> Double
     sdotp sv v =
       sumP [: x * (v !: i) | (i, x) <- sv :]   -- parallel computation (no nesting)

     type SparseMatrix = [:SparseVector:]

     smvm :: SparseMatrix -> Vector -> Vector
     smvm sm v
       = [: sdotp row v | row <- sm :]          -- parallel computation (nested, irregular)
     25
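     For readers unfamiliar with the [: ... :] syntax, the following sketch restates the two functions with ordinary Haskell lists. It only clarifies what the parallel-array comprehensions compute; the DPH versions above are evaluated in parallel. The primed names are hypothetical, not part of DPH.

       type Vector'       = [Double]
       type SparseVector' = [(Int, Double)]
       type SparseMatrix' = [SparseVector']

       sdotp' :: SparseVector' -> Vector' -> Double
       sdotp' sv v = sum [ x * (v !! i) | (i, x) <- sv ]   -- sequential analogue of sumP and (!:)

       smvm' :: SparseMatrix' -> Vector' -> Vector'
       smvm' sm v = [ sdotp' row v | row <- sm ]           -- one dot product per matrix row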


  45. [: f x | x <- xs :]
    26
    * Two main components of DPH: (1) vectoriser & (2) backend array library
    * Implement as much as possible as a library (minimally invasive to GHC: type families,
    rewrite rules)
    * Based on GHC, including its highly optimising backend
    * For better or worse, we aim to compete with C!


  46. [: f x | x <- xs :]
    Vectoriser
    map (lam $ \x -> f `app` x) xs
    Generalisation of
    NESL's flattening
    26
    * Two main components of DPH: (1) vectoriser & (2) backend array library
    * Implement as much as possible as a library (minimally invasive to GHC: type families,
    rewrite rules)
    * Based on GHC, including its highly optimising backend
    * For better or worse, we aim to compete with C!


  47. [: f x | x <- xs :]
    Vectoriser
    map (lam $ \x -> f `app` x) xs
    map :: (..) => (a :-> b)
    -> Array a -> Array b
    Import
    Generalisation of
    NESL's flattening
    Backend library of
    flat, segmented
    parallel arrays
    26
    * Two main components of DPH: (1) vectoriser & (2) backend array library
    * Implement as much as possible as a library (minimally invasive to GHC: type families,
    rewrite rules)
    * Based on GHC, including its highly optimising backend
    * For better or worse, we aim to compete with C!


  48. [: f x | x <- xs :]
    Vectoriser
    map (lam $ \x -> f `app` x) xs
    Optimise
    map :: (..) => (a :-> b)
    -> Array a -> Array b
    Import
    Generalisation of
    NESL's flattening
    Backend library of
    flat, segmented
    parallel arrays
    26
    * Two main components of DPH: (1) vectoriser & (2) backend array library
    * Implement as much as possible as a library (minimally invasive to GHC: type families,
    rewrite rules)
    * Based on GHC, including its highly optimising backend
    * For better or worse, we aim to compete with C!


  49. [: f x | x <- xs :]
    Vectoriser
    map (lam $ \x -> f `app` x) xs
    Optimise
    Code generation
    Start:
    %cond = icmp eq i32 %a, %b
    br i1 %cond, label %IfEqual, label %IfUnequal
    map :: (..) => (a :-> b)
    -> Array a -> Array b
    Import
    Generalisation of
    NESL's flattening
    Backend library of
    flat, segmented
    parallel arrays
    26
    * Two main components of DPH: (1) vectoriser & (2) backend array library
    * Implement as much as possible as a library (minimally invasive to GHC: type families,
    rewrite rules)
    * Based on GHC, including its highly optimising backend
    * For better or worse, we aim to compete with C!
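     As a rough illustration of what the backend library of flat, segmented parallel arrays stores after vectorisation, here is a minimal sketch assuming the vector package. The names Segmented, segLens, segData and sumSegmented are illustrative only; they are not the actual DPH representation or API.

       import qualified Data.Vector.Unboxed as U

       -- A nested array is kept as flat data plus a segment descriptor
       -- (the length of each nested subarray).
       data Segmented a = Segmented
         { segLens :: U.Vector Int   -- length of each segment
         , segData :: U.Vector a     -- all elements, concatenated
         }

       -- Segmented sum: one result per segment. In DPH the corresponding
       -- primitive traverses the flat data in parallel.
       sumSegmented :: Segmented Double -> U.Vector Double
       sumSegmented (Segmented lens xs) = U.fromList (go 0 (U.toList lens))
         where
           go _ []       = []
           go i (l : ls) = U.sum (U.slice i l xs) : go (i + l) ls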


  50. Repa
    Regular parallel arrays
    also: turnip
    Dialing it down
    27
    * Two classes of inefficiencies with DPH: (1) due to vectorisation and (2) due to
    implementation of flat, parallel arrays in the backend library
    * Repa: focus on solving one problem (without worrying about the other)


  51. Flat
    Nested
    Amorphous
    Repa
    Accelerate
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    28
    * End user API for flat DPH backend library
    * Restricted expressiveness, programmer guided array representations
    * Also, rectangular arrays are better suited for matrix algorithms and similar
    * Shape polymorphism & track dimensionality with types
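     To illustrate the shape polymorphism mentioned above, here is a minimal sketch assuming the repa package: the same reduction works for any rank because the shape variable sh is tracked in the type. The function name sumInner is made up for the example.

       import Data.Array.Repa as R

       -- Fold away the innermost dimension of an array of any rank:
       -- for a DIM2 array this sums each row, for DIM3 each innermost vector.
       sumInner :: (Shape sh, Monad m)
                => Array U (sh :. Int) Double -> m (Array U sh Double)
       sumInner = foldP (+) 0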


  52. mmultP :: Monad m
     => Array U DIM2 Double -> Array U DIM2 Double
     -> m (Array U DIM2 Double)
    29


  53. mmultP arr brr =
    let (Z :. rs :. _ ) = extent arr
    (Z :. _ :. cs) = extent brr
    in do
    trr <- computeP (transpose brr)
    :: Array U DIM2 Double
    computeP
    (fromFunction (Z :. rs :. cs)
    (dotProduct trr))
     mmultP :: Monad m
     => Array U DIM2 Double -> Array U DIM2 Double
     -> m (Array U DIM2 Double)
    29


  54. mmultP arr brr =
    let (Z :. rs :. _ ) = extent arr
    (Z :. _ :. cs) = extent brr
    in do
    trr <- computeP (transpose brr)
    :: Array U DIM2 Double
    computeP
    (fromFunction (Z :. rs :. cs)
    (dotProduct trr))
     mmultP :: Monad m
     => Array U DIM2 Double -> Array U DIM2 Double
     -> m (Array U DIM2 Double)
    force to be manifest
    29


  55. mmultP arr brr =
    let (Z :. rs :. _ ) = extent arr
    (Z :. _ :. cs) = extent brr
    in do
    trr <- computeP (transpose brr)
    :: Array U DIM2 Double
    computeP
    (fromFunction (Z :. rs :. cs)
    (dotProduct trr))
     mmultP :: Monad m
     => Array U DIM2 Double -> Array U DIM2 Double
     -> m (Array U DIM2 Double)
    force to be manifest
    second stage
    29


  56. mmultP arr brr =
       let (Z :. rs :. _ ) = extent arr
           (Z :. _ :. cs) = extent brr
       in do
         trr <- computeP (transpose brr)
                  :: Array U DIM2 Double
         computeP
           (fromFunction (Z :. rs :. cs)
                         (dotProduct trr))

     dotProduct (Z :. i :. j) =
       sumAllS (zipWith (*) a_row b_row)
       where
         a_row = slice arr (row i)
         b_row = slice trr (row j)
     29
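     Since the slide shows only a fragment (it relies on helpers such as row that are defined elsewhere), here is a small, self-contained Repa sketch in the same style, assuming the repa package: a delayed computation forced in parallel with computeP, with the representation (U) and dimensionality (DIM2) tracked in the types. The names step and the 3x4 example matrix are made up for illustration.

       import Data.Array.Repa as R

       -- Double every element of a 3x4 matrix in parallel.
       step :: Monad m => Array U DIM2 Double -> m (Array U DIM2 Double)
       step arr = computeP (R.map (* 2) arr)

       main :: IO ()
       main = do
         let arr = fromListUnboxed (Z :. 3 :. 4) [1 .. 12] :: Array U DIM2 Double
         brr <- step arr     -- computeP runs in a monad to pin down evaluation order
         print (toList brr)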


  57. [: f x | x <- xs :]
    Vectoriser
    Generalisation of
    NESL's flattening
    map (\x -> f x) xs
    Optimise
    Code generation
    Start:
    %cond = icmp eq i32 %a, %b
    br i1 %cond, label %IfEqual, label %IfUnequal
    map :: (..) => (a -> b)
    -> Array s a -> Array s b
    Import
    Backend library of
    multi-dimensional
    parallel arrays
    30


  58. [: f x | x <- xs :]
    Vectoriser
    Generalisation of
    NESL's flattening
    map (\x -> f x) xs
    Optimise
    Code generation
    Start:
    %cond = icmp eq i32 %a, %b
    br i1 %cond, label %IfEqual, label %IfUnequal
    map :: (..) => (a -> b)
    -> Array s a -> Array s b
    Import
    Backend library of
    multi-dimensional
    parallel arrays
    30


  59. “Parallelising all of Haskell
    with C-like performance
    is a challenge.”
    31


  60. Backend code generation
    DPH
    Repa
    32


  61. Backend code generation
    DPH
    Repa
    Native code generation:
    32


  62. Backend code generation
    DPH
    Repa
    Carefully crafted code in the library (e.g., reloads for
    stencils)
    We wrote an LLVM backend (still aliasing issues)
    Native code generation:
    [Haskell 2010]
    [Haskell 2011]
    32


  63. Interaction with
    GHC optimisations
    DPH
    Repa
    33


  64. Interaction with
    GHC optimisations
    DPH
    Repa
    GHC has a complex pipeline:
    33


  65. Interaction with
    GHC optimisations
    DPH
    Repa
    Firing of rewrite rules and other transformations
    Code explosion due to inlining and code duplication
    GHC has a complex pipeline:
    33


  66. DPH
    Repa Array fusion
    34


  67. DPH
    Repa
    Challenging research program on its own:
    Array fusion
    34


  68. DPH
    Repa
    Invented equational array fusion and stream fusion
    Now looking at series expressions
    Challenging research program on its own:
    Array fusion
    [Haskell 2013: Ben's talk on Tuesday]
    34
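     As a reminder of what stream fusion buys, here is a minimal sketch of the idea, illustrative only and not the actual DPH or vector implementation: producers and consumers communicate through a non-recursive Stream type, so a composition such as sumS (mapS f (streamList xs)) compiles to a single loop with no intermediate structure once GHC inlines the definitions.

       {-# LANGUAGE ExistentialQuantification, BangPatterns #-}

       data Step s a = Done | Yield a s

       data Stream a = forall s. Stream (s -> Step s a) s

       streamList :: [a] -> Stream a
       streamList = Stream next
         where
           next []       = Done
           next (x : xs) = Yield x xs

       mapS :: (a -> b) -> Stream a -> Stream b
       mapS f (Stream next s0) = Stream next' s0
         where
           next' s = case next s of
                       Done       -> Done
                       Yield x s' -> Yield (f x) s'

       sumS :: Num a => Stream a -> a
       sumS (Stream next s0) = go 0 s0
         where
           go !acc s = case next s of
                         Done       -> acc
                         Yield x s' -> go (acc + x) s'

       -- e.g. sumS (mapS (* 2) (streamList [1 .. 10])) == 110, in one pass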


  69. DPH
    Repa Array representations
    35
    DPH: flattening of arrays of structured types with type families


  70. DPH
    Repa
    Invented Haskell type families to express type-dependent
    representations
    Repa type indices: declarative control of representations
    Array representations
    [Haskell 2012a]
    [POPL 2005, ICFP 2005, ICFP 2008]
    35
    DPH: flattening of arrays of structured types with type families
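     A minimal sketch of the kind of type-dependent representation that data families make possible, in the spirit of the [POPL 2005, ICFP 2005] work: the names PArr and PArray are illustrative and do not match the actual DPH library, and the sketch assumes the vector package.

       {-# LANGUAGE TypeFamilies #-}

       import qualified Data.Vector.Unboxed as U

       -- The array representation is chosen by the element type.
       data family PArr a

       newtype instance PArr Double = PDouble (U.Vector Double)   -- unboxed doubles
       data    instance PArr (a, b) = PPair (PArr a) (PArr b)     -- arrays of pairs as pairs of arrays

       -- Nested arrays: a concrete wrapper plus segment lengths over flat data.
       newtype PArray a = PArray (PArr a)

       data instance PArr (PArray a) = PNested (U.Vector Int) (PArr a)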


  71. [Figure 12, "Runtimes for Fluid Flow Solver": relative runtime vs. matrix width (64-1024) for C Gauss-Seidel, C Jacobi, and Repa with 1, 2, and 8 threads; the accompanying caption discusses how the C version swaps its two buffers after every relaxation iteration to improve memory locality and reduce the data cache-miss rate.]
     36


  72. Overhead of vectorised code
    DPH
    37


  73. Overhead of vectorised code
    DPH
    GHC's simplifier removes explicit closures etc.
    We invented vectorisation avoidance
    [Haskell 2012b]
    37


  74. Work and space complexity
    of vectorisation
    DPH
    38


  75. Work and space complexity
    of vectorisation
    DPH
    Fusion helps in some cases (but fragile)
    Work complexity: invented virtual segment descriptors
    [ICFP 2012]
    38


  76. [Benchmark plot: sparse matrix-vector multiplication (1% non-zero elements), run time (100µs-1s, log scale) vs. number of non-zero elements (100-10M), comparing baseline DPH (out of memory at 150k elements), new DPH, and Data.Vector.]
     39


  77. [Benchmark plot: Barnes-Hut, 1 step, 1 thread, run time (1ms-100s, log scale) vs. number of bodies (10-100k), comparing baseline DPH (out of memory at 5000 bodies), new DPH, and Data.Vector.]
     40


  78. “We are still not
    out of the woods.”
    41


  79. Series expression work not integrated
    Backend library lacks optimisations
    Vectorisation avoidance needs to be extended
    DPH
    DPH
    42


  80. Part 4
    Second-class data parallelism:
    Embedded languages
    43
    * Many of the discussed problems exist or are made worse by…
    * (1) parallelising a very expressive language (2) implemented in a very sophisticated
    compiler
    » We surely need more modest goals if we target more challenging hardware


  81. Data.Array.Accelerate
    Pivoting
    44
    » Accelerate is an embedded language…


  82. Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    Accelerate
    45
    » …aimed at general-purpose GPU programming.


  83. Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    CPU
    GPU
    Accelerate
    45
    » …aimed at general-purpose GPU programming.


  84. dotp :: [:Float:] -> [:Float:] -> Float
     dotp xs ys = sumP (zipWithP (*) xs ys)
    Data Parallel Haskell
    46
    * Dot product in Data Parallel Haskell


  85. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    47
    * Same code embedded
    * 'fold' is parallel tree-reduction
    » The main difference are the types — let's look at them in more detail…


  86. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    Embedded Accelerate arrays
    47
    * Same code embedded
    * 'fold' is parallel tree-reduction
    » The main difference are the types — let's look at them in more detail…


  87. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    Acc marks embedded array computations
    Embedded Accelerate arrays
    47
    * Same code embedded
    * 'fold' is parallel tree-reduction
    » The main difference are the types — let's look at them in more detail…


  88. data Array sh e
    type Vector e = Array DIM1 e
    type Scalar e = Array DIM0 e
    48
    * Acc computations always produce arrays — hence, Scalar
    * Let's ignore the type class context for the moment
    * For comparison, the list version of zipWith on top in grey


  89. data Array sh e
    type Vector e = Array DIM1 e
    type Scalar e = Array DIM0 e
    Regular, shape-polymorphic arrays
    48
    * Acc computations always produce arrays — hence, Scalar
    * Let's ignore the type class context for the moment
    * For comparison, the list version of zipWith on top in grey


  90. data Array sh e
    type Vector e = Array DIM1 e
    type Scalar e = Array DIM0 e
    Regular, shape-polymorphic arrays
    zipWith :: (a -> b -> c)
    -> [a] -> [b] -> [c]
    zipWith :: (…)
    => (Exp a -> Exp b -> Exp c)
    -> Acc (Array sh a)
    -> Acc (Array sh b)
    -> Acc (Array sh c)
    Embedded scalar expressions
    48
    * Acc computations always produce arrays — hence, Scalar
    * Let's ignore the type class context for the moment
    * For comparison, the list version of zipWith on top in grey


  91. Acc marks embedded array computations
    Embedded Accelerate arrays
    dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    49
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory


  92. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    49
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory


  93. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    49
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory


  94. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    =
    sum (zipWith (*) xs ys )
    49
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory


  95. dotp :: Vector Float
          -> Vector Float
          -> Acc (Scalar Float)
     dotp xs ys
       = let xs' = use xs       -- use embeds host arrays
             ys' = use ys
         in
         sum (zipWith (*) xs' ys')
     49
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory
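     Putting the pieces together, here is a hedged, self-contained version of the example, assuming the accelerate package. It uses the reference interpreter backend (on a GPU one would import run from the CUDA backend instead) and spells the reduction as fold (+) 0, the parallel tree reduction mentioned in the notes; the example sizes and values are made up.

       import Data.Array.Accelerate             as A
       import Data.Array.Accelerate.Interpreter (run)   -- swap for the GPU backend's run on CUDA hardware

       dotp :: Vector Float -> Vector Float -> Scalar Float
       dotp xs ys = run $ A.fold (+) 0
                        $ A.zipWith (*) (use xs) (use ys)

       main :: IO ()
       main = do
         let n  = 10
             xs = A.fromList (Z :. n) [1 ..]     :: Vector Float
             ys = A.fromList (Z :. n) (repeat 2) :: Vector Float
         print (dotp xs ys)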


  96. 50
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code


  97. Embedded languages
    …are restricted languages
    50
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code


  98. Embedded languages
    …are restricted languages
    The embedding partly compensates for restrictions:
    50
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code


  99. Embedded languages
    …are restricted languages
    Seamless integration into host language and…
    …host language can generate embedded code.
    The embedding partly compensates for restrictions:
    50
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code


  100. map (\x -> x + 1) arr
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…


  101. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…


  102. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…


  103. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    Skeleton instantiation
    __global__ void kernel (float *arr, int n)
    {...
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…


  104. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    Skeleton instantiation
    __global__ void kernel (float *arr, int n)
    {...
    CUDA compiler
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…


  105. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    Skeleton instantiation
    __global__ void kernel (float *arr, int n)
    {...
    CUDA compiler Call
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…
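     To make "reify AST" concrete, here is a minimal sketch of a deep embedding, purely illustrative and much simpler than Accelerate's actual AST: overloading Num means that writing ordinary-looking expressions builds a data structure, which the library can then optimise and feed to a code generator.

       -- A tiny embedded expression language whose programs are data.
       data Exp
         = Var String
         | IntLit Int
         | Add Exp Exp
         | Mul Exp Exp
         deriving Show

       instance Num Exp where
         (+)         = Add
         (*)         = Mul
         fromInteger = IntLit . fromInteger
         negate      = Mul (IntLit (-1))
         abs         = error "abs not needed for this sketch"
         signum      = error "signum not needed for this sketch"

       -- The body of \x -> x + 1 reifies to an AST instead of being evaluated:
       example :: Exp
       example = Var "x" + 1   -- Add (Var "x") (IntLit 1)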


  106. mkMap dev aenv fun arr = return $
    CUTranslSkel "map" [cunit|
    $esc:("#include ")
    extern "C" __global__ void
    map ($params:argIn, $params:argOut) {
    const int shapeSize = size(shOut);
    const int gridSize = $exp:(gridSize dev);
    int ix;
    for ( ix = $exp:(threadIdx dev)
    ; ix < shapeSize
    ; ix += gridSize ) {
    $items:(dce x .=. get ix)
    $items:(setOut "ix" .=. f x)
    }
    } |]
    where ...
    52
    * Combinators as skeletons (code templates with holes)
    * Quasi-quoter for CUDA [Mainland]
    * Yellow anti-quotes are the holes (parameters) of the template


  107. [Embed 2013]
    53
    Essence of skeleton-based generative programming (transcending Accelerate)
    * Hand tuned skeleton code
    * Meta programming simplifies code generation
    * FFI to use native libraries etc


  108. Skeletons are templates
    …encapsulating efficient code idioms
    [Embed 2013]
    53
    Essence of skeleton-based generative programming (transcending Accelerate)
    * Hand tuned skeleton code
    * Meta programming simplifies code generation
    * FFI to use native libraries etc


  109. Skeletons are templates
    …encapsulating efficient code idioms
    Code generation as template meta programming
    Foreign function interface as an escape hatch
    [Embed 2013]
    53
    Essence of skeleton-based generative programming (transcending Accelerate)
    * Hand tuned skeleton code
    * Meta programming simplifies code generation
    * FFI to use native libraries etc


  110. Fusion is crucial
     [diagram: a data-flow graph of producer nodes p1-p7 feeding consumer nodes c1 and c2]
     Fusion in 2 phases, exploiting skeleton instantiation
     54
    * Fusion output must be in skeleton form again
    * Fused skeletons share GPU kernels


  111. Fusion is crucial
     [diagram: a data-flow graph of producer nodes p1-p7 feeding consumer nodes c1 and c2]
     Fusion in 2 phases, exploiting skeleton instantiation
     [ICFP 2013: Trevor's talk on Wednesday]
     54
    * Fusion output must be in skeleton form again
    * Fused skeletons share GPU kernels
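     A small Accelerate-level illustration of the producer/producer case, assuming the accelerate package: without fusion the pipeline below would materialise an intermediate array and launch two kernels, while the fused form, which is just the map f . map g = map (f . g) law quoted earlier, performs a single traversal and ends up in one GPU kernel after skeleton instantiation.

       import Data.Array.Accelerate as A

       -- Two producers in a row ...
       unfused :: Acc (Vector Float) -> Acc (Vector Float)
       unfused = A.map (+ 1) . A.map (* 2)

       -- ... fuse into a single one.
       fused :: Acc (Vector Float) -> Acc (Vector Float)
       fused = A.map (\x -> x * 2 + 1)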


  112. [Benchmark plot: dot product, run time (0.1-100 ms, log scale) vs. number of elements (2-20 million), comparing Data.Vector, Repa -N8, NDP2GPU, Accelerate without and with fusion, and CUBLAS.]
     55
     * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
     * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
     * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)


  113. Dot Product
     [Benchmark plot: dot product, run time (0.1-100 ms, log scale) vs. number of elements (2-20 million), comparing Data.Vector, Repa -N8, NDP2GPU, Accelerate without and with fusion, and CUBLAS.]
     55
     * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
     * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
     * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)


  114. 56
     * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
     * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
     * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)


  115. Jos Stam's Fluid Flow Solver
    56
     * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
     * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
     * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)


  116. “Works pretty well.
    Let's do more!”
    57


  117. Incrementally adding nested data parallelism
    How about amorphous data parallelism?
    Need a wider range of applications
    58


  118. Summary
    Data parallelism is a convenient & effective abstraction
    Parallelising full Haskell is possible, but a lot of hard work
    Embedded languages are an attractive compromise
    types >< state languages
    59


  119. References
    [EuroPar 2001] Nepal -- Nested Data-Parallelism in Haskell. Chakravarty, Keller,
    Lechtchinsky & Pfannenstiel. In "Euro-Par 2001: Parallel Processing, 7th Intl.
    Euro-Par Conference", 2001.
    [POPL 2005] Associated Types with Class. Chakravarty, Keller, Peyton Jones &
    Marlow. In Proceedings "The 32nd Annual ACM SIGPLAN-SIGACT Symposium
    on Principles of Programming Languages (POPL'05)", ACM Press, 2005.
    [ICFP 2005] Associated Type Synonyms. Chakravarty, Keller & Peyton Jones. In
    Proceedings of "The Tenth ACM SIGPLAN International Conference on Functional
    Programming", ACM Press, 2005.
    [TLDI 2007] System F with Type Equality Coercions. Sulzmann, Chakravarty,
    Peyton Jones & Donnelly. In Proceedings of "The Third ACM SIGPLAN Workshop
    on Types in Language Design and Implementation", ACM Press, 2007.
    [ICFP 2008] Type Checking with Open Type Functions. Schrijvers, Peyton Jones,
    Chakravarty & Sulzmann. In Proceedings of "ICFP 2008 : The 13th ACM SIGPLAN
    International Conference on Functional Programming", ACM Press, 2008.
    60


  120. [FSTTCS 2008] Harnessing the Multicores: Nested Data Parallelism in Haskell.
    Peyton Jones, Leshchinskiy, Keller & Chakravarty. In "IARCS Annual Conf. on
     Foundations of Software Technology & Theoretical Computer Science", 2008.
    [ICFP 2010] Regular, shape-polymorphic, parallel arrays in Haskell. Keller,
    Chakravarty, Leshchinskiy, Peyton Jones & Lippmeier. In Proceedings of "ICFP
    2010 : The 15th ACM SIGPLAN Intl. Conf. on Functional Programming", 2010.
    [Haskell 2010] An LLVM Backend for GHC. Terei & Chakravarty. In Proceedings of
    "ACM SIGPLAN Haskell Symposium 2010", ACM Press, 2010.
    [DAMP 2011] Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty,
    Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore
    Programming", ACM Press, 2011.
    [Haskell 2011] Efficient Parallel Stencil Convolution in Haskell. Lippmeier & Keller.
    In Proceedings of "ACM SIGPLAN Haskell Symposium 2011", ACM Press, 2011.
    [ICFP 2012] Work Efficient Higher-Order Vectorisation. Lippmeier, Chakravarty,
    Keller, Leshchinskiy & Peyton Jones. ACM Press, 2012.
    61


  121. [Haskell 2012a] Guiding Parallel Array Fusion with Indexed Types. Lippmeier,
    Chakravarty, Keller & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell
    Symposium 2012", ACM Press, 2012.
    [Haskell 2012b] Vectorisation Avoidance. Keller, Chakravarty, Lippmeier,
    Leshchinskiy & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell
    Symposium 2012", ACM Press, 2012.
    [ICFP 2013] Optimising Purely Functional GPU Programs. McDonell, Chakravarty,
    Keller & Lippmeier. In "ACM SIGPLAN International Conference on Functional
    Programming", ACM Press, 2013.
    [Haskell 2013] Data Flow Fusion with Series Expressions in Haskell. Lippmeier,
    Chakravarty, Keller & Robinson. In Proceedings of "ACM SIGPLAN Haskell
    Symposium 2013", ACM Press, 2013.
    [Embed 2013] Embedding Foreign Code. Clifton-Everest, McDonell, Chakravarty
     & Keller. Submitted for publication, 2013. http://www.cse.unsw.edu.au/~chak/papers/CMCK14.html
    62


  122. Images from
    http://wikimedia.org
    http://openclipart.org
    63
