Embedded Languages for High-Performance Computing in Haskell

Embedded languages are a convenient and expressive method to capture patterns of high-performance code in functional languages. These patterns can be turned into efficient low-level code by template instantiation of code skeletons, where code fusion combines individual skeleton instances to minimise the abstraction penalty.

In this talk, I will illustrate these concepts as used in Accelerate, an embedded language for general-purpose GPU computing in Haskell that delivers competitive performance with a fraction of the effort required in low-level GPGPU frameworks, such as CUDA or OpenCL.

Manuel Chakravarty

August 06, 2013

Transcript

  1. mchakravarty
    Embedded Languages for
    High-Performance Computing in Haskell
    Manuel M T Chakravarty
    University of New South Wales
    Jointly with
    Gabriele Keller
    Sean Lee
    Trevor L. McDonell
    1
    » Jointly with; then straight to next slide
    25 minute time slot: 20min talking + 5min
    [5min Embedded; 5min Skeletons; 5min Fusion; 5min Demo & benchmarks]

  2. High Level Languages
    2
    * Advanced features of HLLs increase our productivity & safety
    » …but are they always a benefit…

  3. High Level Languages
    Boxed values
    Polymorphism
    &
    generics
    Composite
    data structures
    Immutable
    structures
    Higher-order
    functions
    &
    closures
    2
    * Advanced features of HLLs increase our productivity & safety
    » …but are they always a benefit…

  4. High Level Languages
    Boxed values
    Polymorphism
    &
    generics
    Composite
    data structures
    Immutable
    structures
    Higher-order
    functions
    &
    closures
    2
    * Advanced features of HLLs increase our productivity & safety
    » …but are they always a benefit…

  5. Data and compute
    intensive applications
    3
    * Special hardware, parallel hardware, concurrent hardware, etc.
    * Requires: high-throughput hardware & optimised code

  6. Data and compute
    intensive applications
    GPUs
    Cluster
    multicore
    CPU
    3
    * Special hardware, parallel hardware, concurrent hardware, etc.
    * Requires: high-throughput hardware & optimised code

  7. High Level Languages
    GPUs
    Cluster
    multicore
    CPU
    4
    * Many features are already difficult to efficiently map to conventional sequential
    architectures
    * Optimising them to high-throughput hardware is harder
    * Fragility and lack of predictability of optimisations

  8. High Level Languages
    GPUs
    Cluster
    multicore
    CPU
    Efficient code?
    4
    * Many features are already difficult to efficiently map to conventional sequential
    architectures
    * Optimising them to high-throughput hardware is harder
    * Fragility and lack of predictability of optimisations

  9. High Level Languages
    GPUs
    Cluster
    multicore
    CPU
    Efficient code?
    Function pointers
    Control flow
    Memory access patterns
    Data distribution
    Decomposition
    4
    * Many features are already difficult to efficiently map to conventional sequential
    architectures
    * Optimising them to high-throughput hardware is harder
    * Fragility and lack of predictability of optimisations

  10. High Level Languages
    GPUs
    Cluster
    multicore
    CPU
    5
    * Although we love high-level languages, this breaks our heart!

  11. High Level Languages
    GPUs
    Cluster
    multicore
    CPU
    5
    * Although we love high-level languages, this breaks our heart!

  12. “How about embedded
    languages with specialised code
    generation?”
    6
    * Instead of writing high-performance code, we write code that writes high-performance
    code.

  13. Accelerate
    an embedded language for GPU programming
    7
    * We tried that for GPGPU programming.
    * The approach is more general.
    * Smoothlife: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/smoothlife

  14. 8
    N-body: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/n-body

  15. 9
    Fluid flow: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/fluid

  16. Ingredient 1
    Embedded languages
    10
    » A simple example…

  17. dotp :: [Float]
    -> [Float]
    -> Float
    dotp xs ys
    = foldl (+) 0 (zipWith (*) xs ys )
    Plain Haskell (lists)
    11
    * Dot product in plain Haskell using lists
    * Dot product is map-reduce! ;)

  18. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = fold (+) 0 (zipWith (*) xs ys )
    12
    * Same code embedded
    * 'fold' is parallel tree-reduction
    » The main difference is in the types — let's look at them in more detail…

  19. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = fold (+) 0 (zipWith (*) xs ys )
    Embedded Accelerate arrays
    12
    * Same code embedded
    * 'fold' is parallel tree-reduction
    » The main difference is in the types — let's look at them in more detail…

  20. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = fold (+) 0 (zipWith (*) xs ys )
    Acc marks embedded array computations
    Embedded Accelerate arrays
    12
    * Same code embedded
    * 'fold' is parallel tree-reduction
    » The main difference is in the types — let's look at them in more detail…

  21. data Array sh e
    type Vector e = Array DIM1 e
    type Scalar e = Array DIM0 e
    13
    * Acc computations always produce arrays — hence, Scalar
    * Let's ignore the type class context for the moment
    * For comparison, the list version of zipWith on top in grey
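
    For concreteness, a small sketch of how these shape-indexed arrays are built on the host
    side (assuming the standard accelerate API; the array contents are made up): shapes are
    snoc lists assembled from Z and (:.).

        import Data.Array.Accelerate (Array, DIM1, DIM2, Z(..), (:.)(..), fromList)

        vec :: Array DIM1 Float            -- i.e. Vector Float
        vec = fromList (Z :. 5) [0, 1, 2, 3, 4]

        mat :: Array DIM2 Float            -- a 3 x 4 matrix, filled row-major
        mat = fromList (Z :. 3 :. 4) [0 ..]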

  22. data Array sh e
    type Vector e = Array DIM1 e
    type Scalar e = Array DIM0 e
    Regular, shape-polymorphic arrays
    13
    * Acc computations always produce arrays — hence, Scalar
    * Let's ignore the type class context for the moment
    * For comparison, the list version of zipWith on top in grey

  23. data Array sh e
    type Vector e = Array DIM1 e
    type Scalar e = Array DIM0 e
    Regular, shape-polymorphic arrays
    zipWith :: (a -> b -> c)
    -> [a] -> [b] -> [c]
    zipWith :: (…)
    => (Exp a -> Exp b -> Exp c)
    -> Acc (Array sh a)
    -> Acc (Array sh b)
    -> Acc (Array sh c)
    Embedded scalar expressions
    13
    * Acc computations always produce arrays — hence, Scalar
    * Let's ignore the type class context for the moment
    * For comparison, the list version of zipWith on top in grey
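
    As a usage sketch (saxpy is an illustrative name, assuming the accelerate API above): the
    function passed to the embedded zipWith is itself an embedded scalar function over Exp
    values, and the combinator is shape-polymorphic, so the same definition works for vectors
    and matrices alike.

        import           Data.Array.Accelerate (Acc, Exp, Vector)
        import qualified Data.Array.Accelerate as A

        -- alpha * x + y, element-wise, entirely inside the embedded language
        saxpy :: Exp Float -> Acc (Vector Float) -> Acc (Vector Float) -> Acc (Vector Float)
        saxpy alpha xs ys = A.zipWith (\x y -> alpha * x + y) xs ys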

  24. Acc marks embedded array computations
    Embedded Accelerate arrays
    dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = fold (+) 0 (zipWith (*) xs ys )
    14
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory

  25. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = fold (+) 0 (zipWith (*) xs ys )
    14
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory

  26. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = fold (+) 0 (zipWith (*) xs ys )
    14
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory

  27. dotp :: Acc (Vector Float)
        -> Acc (Vector Float)
        -> Acc (Scalar Float)
    dotp xs ys
      = fold (+) 0 (zipWith (*) xs ys)
    14
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory

  28. dotp :: Acc (Vector Float)
        -> Acc (Vector Float)
        -> Acc (Scalar Float)
    dotp xs ys
      = let
          xs' = use xs
          ys' = use ys
        in
        fold (+) 0 (zipWith (*) xs' ys')
    use embeds values
    14
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory
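
    Putting the pieces together as a runnable sketch (assuming the accelerate and
    accelerate-cuda packages; the test data in main is made up): use embeds the host arrays,
    and the backend's run function compiles and executes the whole embedded computation,
    including the host-to-GPU transfers.

        import           Data.Array.Accelerate      (Scalar, Vector, Z(..), (:.)(..))
        import qualified Data.Array.Accelerate      as A
        import qualified Data.Array.Accelerate.CUDA as CUDA   -- or the interpreter, for testing

        dotp :: Vector Float -> Vector Float -> Scalar Float
        dotp xs ys = CUDA.run $ A.fold (+) 0 (A.zipWith (*) (A.use xs) (A.use ys))

        main :: IO ()
        main = do
          let n  = 1000
              xs = A.fromList (Z :. n) [0 ..]     :: Vector Float
              ys = A.fromList (Z :. n) (repeat 2) :: Vector Float
          print (dotp xs ys)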

  29. 15
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code

  30. Embedded languages
    …are restricted languages
    15
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code

  31. Embedded languages
    …are restricted languages
    The embedding partly compensates for restrictions:
    15
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code

  32. Embedded languages
    …are restricted languages
    Seamless integration into host language and…
    …host language can generate embedded code.
    The embedding partly compensates for restrictions:
    15
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code
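
    One way to picture "the host language can generate embedded code", as a hedged sketch
    (iterateN is an illustrative helper, not part of the Accelerate API): plain Haskell
    recursion assembles an embedded pipeline, so the loop is unrolled at code-generation time
    rather than executed step by step on the GPU.

        import           Data.Array.Accelerate (Acc, Vector)
        import qualified Data.Array.Accelerate as A

        -- unroll n applications of an embedded array transformation
        iterateN :: Int -> (Acc a -> Acc a) -> Acc a -> Acc a
        iterateN n f acc
          | n <= 0    = acc
          | otherwise = iterateN (n - 1) f (f acc)

        -- e.g. ten fused increments over a vector
        tenSteps :: Acc (Vector Float) -> Acc (Vector Float)
        tenSteps = iterateN 10 (A.map (+ 1))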

  33. Ingredient 2
    Skeletons
    16
    » How to implement such an embedded language…

  34. map (\x -> x + 1) arr
    17
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…

  35. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    17
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…
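
    A much simplified sketch of what reification produces (the types and constructors below
    are illustrative stand-ins, not Accelerate's real internal AST, which uses typed de Bruijn
    indices): the surface combinators build a data structure that the code generator can
    inspect.

        {-# LANGUAGE GADTs #-}

        data PrimFun = Add | Mul

        -- embedded scalar expressions in de Bruijn form
        data Exp where
          ZeroIdx :: Exp                          -- the innermost lambda-bound variable
          Const   :: Float -> Exp
          PrimApp :: PrimFun -> (Exp, Exp) -> Exp

        newtype Fun = Lam Exp                     -- a one-argument embedded function

        -- embedded array computations: each combinator becomes a constructor
        data Acc a where
          Use :: a   -> Acc a
          Map :: Fun -> Acc a -> Acc a

        -- map (\x -> x + 1) arr reifies, roughly, to:
        example :: Acc [Float] -> Acc [Float]
        example arr = Map (Lam (PrimApp Add (ZeroIdx, Const 1))) arr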

  36. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    17
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…

  37. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    Skeleton instantiation
    __global__ void kernel (float *arr, int n)
    {...
    17
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…

  38. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    Skeleton instantiation
    __global__ void kernel (float *arr, int n)
    {...
    CUDA compiler
    17
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…

  39. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    Skeleton instantiation
    __global__ void kernel (float *arr, int n)
    {...
    CUDA compiler Call
    17
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…

  40. mkMap dev aenv fun arr = return $
      CUTranslSkel "map" [cunit|
        $esc:("#include ")

        extern "C" __global__ void
        map ($params:argIn, $params:argOut) {
            const int shapeSize = size(shOut);
            const int gridSize  = $exp:(gridSize dev);
            int ix;

            for ( ix =  $exp:(threadIdx dev)
                ; ix <  shapeSize
                ; ix += gridSize ) {
                $items:(dce x       .=. get ix)
                $items:(setOut "ix" .=. f x)
            }
        } |]
      where ...
    18
    * Combinators as skeletons (code templates with holes)
    * Quasi-quoter for CUDA [Mainland]
    * Yellow anti-quotes are the holes (parameters) of the template
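
    A stripped-down, stand-alone sketch of the same quasi-quotation mechanism (using
    Mainland's language-c-quote package, from which the [cunit| … |] quoter above comes; the
    inc kernel and the mkInc helper are made up for illustration): the $exp: anti-quote is a
    hole that is filled with a generated C expression.

        {-# LANGUAGE QuasiQuotes #-}
        import           Language.C.Quote.CUDA
        import qualified Language.C.Syntax     as C

        -- instantiate a one-kernel template with a scalar body expression
        mkInc :: C.Exp -> [C.Definition]
        mkInc body = [cunit|
            extern "C" __global__ void inc (float *xs, float *ys, int n)
            {
                int ix = blockIdx.x * blockDim.x + threadIdx.x;
                if (ix < n)
                    ys[ix] = $exp:body;
            } |]

        -- e.g. mkInc [cexp| xs[ix] + 1.0f |] yields the CUDA kernel for map (+1)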

  41. 19
    Essence of skeleton-based generative programming (transcending Accelerate)
    * Hand tuned skeleton code
    * Meta programming simplifies code generation
    * FFI to use native libraries etc

  42. Skeletons are templates
    …encapsulating efficient code idioms
    19
    Essence of skeleton-based generative programming (transcending Accelerate)
    * Hand tuned skeleton code
    * Meta programming simplifies code generation
    * FFI to use native libraries etc

  43. Skeletons are templates
    …encapsulating efficient code idioms
    Code generation as template meta programming
    Foreign function interface as an escape hatch
    19
    Essence of skeleton-based generative programming (transcending Accelerate)
    * Hand tuned skeleton code
    * Meta programming simplifies code generation
    * FFI to use native libraries etc

  44. Ingredient 3
    Composing skeletons
    20
    » Let us zoom out and look at the big picture again…

  45. dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    21
    * Dot product: two skeletons executed in sequence
    * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that
    intermediate array
    * Goal: one traversal of the input without an intermediate array => fuse two skeletons into one

  46. dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    Skeleton #1 Skeleton #2
    21
    * Dot product: two skeletons executed in sequence
    * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that
    intermediate array
    * Goal: one traversal of the input without an intermediate array => fuse two skeletons into one

  47. dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    Skeleton #1 Skeleton #2
    Intermediate array
    Extra traversal
    21
    * Dot product: two skeletons executed in sequence
    * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that
    intermediate array
    * Goal: one traversal of the input without an intermediate array => fuse two skeletons into one

  48. dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    Combined skeleton
    21
    * Dot product: two skeletons executed in sequence
    * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that
    intermediate array
    * Goal: one traversal of the input without an intermediate array => fuse two skeletons into one (see the sketch below)
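
    In list terms, the goal can be sketched as follows (a plain-Haskell illustration of what
    fusion buys, not how Accelerate implements it; Accelerate fuses at the level of GPU
    kernels): the composition on the slide behaves like a single accumulating traversal with
    no intermediate structure.

        {-# LANGUAGE BangPatterns #-}

        -- fold (+) 0 . zipWith (*), written as one loop over both inputs
        dotpFused :: [Float] -> [Float] -> Float
        dotpFused = go 0
          where
            go !acc (x:xs) (y:ys) = go (acc + x * y) xs ys
            go !acc _      _      = acc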

  49. Fusing networks of skeletons
    p1
    p1
    p2 p3
    p4
    p5 p6 p7
    c1
    c2
    22
    * Networks consist of producers (eg, generate, map) and consumers (eg, fold)
    » First, fuse producers

  50. Fusing networks of skeletons
    c2
    p5
    p1
    c1
    p6 p7
    p3
    p2
    p4
    Phase 1: producer/producer fusion
    23
    * producer/producer fusion: combine successive producers (e.g., map family)
    * No fusion if intermediate results are shared (risk of duplication of work)
    » Second, fuse consumers with producers

  51. Fusing networks of skeletons
    c2
    p5
    p1
    c1
    p6 p7
    p3
    p2
    p4
    Phase 2: consumer/producer fusion
    24
    * Consumer/producer fusion: instantiate the consumer skeleton with the producer code (e.g., folds)
    * Accelerate: fused skeletons share GPU kernels

  52. data DelayedAcc a where
        Done  :: Acc a
              -> DelayedAcc a
        Yield :: (Shape sh, Elt e)
              => Exp sh
              -> Fun (sh -> e)
              -> DelayedAcc (Array sh e)
        Step  :: …
    25
    (1) Fusion friendly representation of array skeletons
    (2) Producer skeletons as smart constructors of that representation
    (3) Generate code from that representation, instantiating consumers with producers

  53. data DelayedAcc a where
        Done  :: Acc a
              -> DelayedAcc a
        Yield :: (Shape sh, Elt e)
              => Exp sh
              -> Fun (sh -> e)
              -> DelayedAcc (Array sh e)
        Step  :: …
    Fusion friendly
    25
    (1) Fusion friendly representation of array skeletons
    (2) Producer skeletons as smart constructors of that representation
    (3) Generate code from that representation, instantiating consumers with producers

  54. data DelayedAcc a where
        Done  :: Acc a
              -> DelayedAcc a
        Yield :: (Shape sh, Elt e)
              => Exp sh
              -> Fun (sh -> e)
              -> DelayedAcc (Array sh e)
        Step  :: …
    Fusion friendly
    mapD f (Yield sh g) = Yield sh (f . g)
    25
    (1) Fusion friendly representation of array skeletons
    (2) Producer skeletons as smart constructors of that representation
    (3) Generate code from that representation, instantiating consumers with producers

  55. data DelayedAcc a where
        Done  :: Acc a
              -> DelayedAcc a
        Yield :: (Shape sh, Elt e)
              => Exp sh
              -> Fun (sh -> e)
              -> DelayedAcc (Array sh e)
        Step  :: …
    Fusion friendly
    mapD f (Yield sh g) = Yield sh (f . g)
    codeGenAcc … (Fold f z arr)
      = mkFold … (codeGenFun f) (codeGenExp z)
                 (codeGenEmbeddedAcc arr)
    25
    (1) Fusion friendly representation of array skeletons
    (2) Producer skeletons as smart constructors of that representation
    (3) Generate code from that representation, instantiating consumers with producers
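
    A self-contained sketch of the delayed-array idea (with deliberately simplified types;
    Accelerate's real DelayedAcc ranges over embedded Exp and Fun as shown above): a producer
    is a shape paired with an index function, so composing producers composes functions, and
    a consumer is instantiated directly with the producer's code.

        -- no array is materialised: just a shape and an element-at-index function
        data Delayed sh e = Delayed sh (sh -> e)

        mapD :: (a -> b) -> Delayed sh a -> Delayed sh b
        mapD f (Delayed sh g) = Delayed sh (f . g)                 -- producer/producer

        zipWithD :: (a -> b -> c) -> Delayed sh a -> Delayed sh b -> Delayed sh c
        zipWithD f (Delayed sh g) (Delayed _ h) =
          Delayed sh (\ix -> f (g ix) (h ix))                      -- still a single producer

        -- consumer/producer: the fold inlines the producer's index function
        foldD :: (e -> e -> e) -> e -> Delayed Int e -> e
        foldD f z (Delayed n get) = foldl (\acc i -> f acc (get i)) z [0 .. n - 1]

        -- e.g. dot product in one traversal:
        -- dotpD xs ys = foldD (+) 0 (zipWithD (*) xs ys)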

  56. 26
    Essence of fusing skeletons (transcending Accelerate)
    * Efficient code often requires a coarse granularity
    * Rewriting of skeleton composition and instantiation

  57. Fusion of skeletons
    …reduces the abstraction penalty
    26
    Essence of fusing skeletons (transcending Accelerate)
    * Efficient code often requires a coarse granularity
    * Rewriting of skeleton composition and instantiation

  58. Fusion of skeletons
    …reduces the abstraction penalty
    Code generation idioms vary from high-level combinators
    Smart constructors combine producers
    Instantiate consumer skeletons with producer code
    26
    Essence of fusing skeletons (transcending Accelerate)
    * Efficient code often requires a coarse granularity
    * Rewriting of skeleton composition and instantiation

  59. “How fast are we going?”
    27

  60. [Chart: Dot product. Run time (ms) vs. elements (millions). Series: Data.Vector, Repa -N8, NDP2GPU, Accelerate -fusion, ... +fusion, CUBLAS]
    28
    * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
    * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
    * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

  61. Dot Product
    [Chart: run time (ms) vs. elements (millions). Series: Data.Vector, Repa -N8, NDP2GPU, Accelerate -fusion, ... +fusion, CUBLAS]
    28
    * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
    * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
    * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

  62. 29
    * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
    * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
    * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

  63. Jos Stam's Fluid Flow Solver
    29
    * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
    * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
    * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

  64. [Chart: Canny Edge Detection. Run time (ms) vs. image size (total pixels). Series: Accelerate (whole program), Accelerate (just GPU kernels), OpenCV (CPU), OpenCV (GPU)]
    30
    * C: OpenCV CPU (blue) is one CPU core with SSE (Xeon E5405 CPU @ 2 GHz, 64-bit)
    * Accelerate (green) and OpenCV GPU (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz)
    * Accelerate performs the last computation step CPU-side: the light green includes the CPU
    post processing

  65. Canny Edge Detection
    [Chart: run time (ms) vs. image size (total pixels). Series: Accelerate (whole program), Accelerate (just GPU kernels), OpenCV (CPU), OpenCV (GPU)]
    30
    * C: OpenCV CPU (blue) is one CPU core with SSE (Xeon E5405 CPU @ 2 GHz, 64-bit)
    * Accelerate (green) and OpenCV GPU (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz)
    * Accelerate performs the last computation step CPU-side: the light green includes the CPU
    post processing

  66. [Chart: N-Body. Run time (ms) vs. bodies. Series: Accelerate -fusion -sharing, ... -fusion +sharing, ... +fusion +sharing, CUDA]
    31
    * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @
    1.3 GHz)
    * GPU performance depends on use of shared memory. Accelerate is slower as we do not yet
    optimise for that.

  67. N-Body
    [Chart: run time (ms) vs. bodies. Series: Accelerate -fusion -sharing, ... -fusion +sharing, ... +fusion +sharing, CUDA]
    31
    * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @
    1.3 GHz)
    * GPU performance depends on use of shared memory. Accelerate is slower as we do not yet
    optimise for that.

  68. N-Body
    [Chart: run time (ms) vs. bodies. Series: Accelerate -fusion -sharing, ... -fusion +sharing, ... +fusion +sharing, CUDA]
    Missing shared memory optimisation
    31
    * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @
    1.3 GHz)
    * GPU performance depends on use of shared memory. Accelerate is slower as we do not yet
    optimise for that.

  69. Summary
    Embedded languages are restricted languages
    Skeletons encapsulate efficient code idioms
    Fusion reduces the abstraction penalty
    types >< state languages
    32

  70. Images from
    http://wikipedia.org
    http://openclipart.org

    Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", ACM Press, 2011.

    Optimising Purely Functional GPU Programs. McDonell, Chakravarty, Keller & Lippmeier. In "ACM SIGPLAN International Conference on Functional Programming", ACM Press, 2013.

    References
    33
