

Embedded Languages for High-Performance Computing in Haskell

Embedded languages are a convenient and expressive method to capture patterns of high-performance code in functional languages. These patterns can be turned into efficient low-level code by template instantiation of code skeletons, where code fusion combines individual skeleton instances to minimise the abstraction penalty.

In this talk, I will illustrate these concepts as used in Accelerate, an embedded language for general-purpose GPU computing in Haskell that delivers competitive performance with a fraction of the effort required in low-level GPGPU frameworks, such as CUDA or OpenCL.

Manuel Chakravarty

August 06, 2013



Transcript

  1. mchakravarty Embedded Languages for High-Performance Computing in Haskell Manuel M

    T Chakravarty University of New South Wales Jointly with Gabriele Keller, Sean Lee, Trevor L. McDonell 1 » Jointly with; then straight to next slide 25 minute time slot: 20min talking + 5min [5min Embedded; 5min Skeletons; 5min Fusion; 5min Demo & benchmarks]
  2. High Level Languages 2 * Advanced features of HLLs increase

    our productivity & safety » …but are they always a benefit…
  3. High Level Languages Boxed values Polymorphism & generics Composite data

    structures Immutable structures Higher-order functions & closures 2 * Advanced features of HLLs increase our productivity & safety » …but are they always a benefit…
  5. Data and compute intensive applications 3 * Special hardware, parallel

    hardware, concurrent hardware, etc. * Requires: high-throughput hardware & optimised code
  6. Data and compute intensive applications GPUs Cluster multicore CPU 3

    * Special hardware, parallel hardware, concurrent hardware, etc. * Requires: high-throughput hardware & optimised code
  7. High Level Languages GPUs Cluster multicore CPU 4 * Many

    features are already difficult to map efficiently to conventional sequential architectures * Optimising them for high-throughput hardware is harder still * Fragility and lack of predictability of optimisations
  8. High Level Languages GPUs Cluster multicore CPU Efficient code? 4

    * Many features are already difficult to map efficiently to conventional sequential architectures * Optimising them for high-throughput hardware is harder still * Fragility and lack of predictability of optimisations
  9. High Level Languages GPUs Cluster multicore CPU Efficient code? Function

    pointers Control flow Memory access patterns Data distribution Decomposition 4 * Many features are already difficult to map efficiently to conventional sequential architectures * Optimising them for high-throughput hardware is harder still * Fragility and lack of predictability of optimisations
  10. High Level Languages GPUs Cluster multicore CPU 5 * Although

    we love high-level languages, this breaks our heart!
  12. “How about embedded languages with specialised code generation?” 6 *

    Instead of writing high-performance code, we write code that writes high-performance code.
  13. Accelerate an embedded language for GPU programming 7 * We

    tried that for GPGPU programming. * The approach is more general. * Smoothlife: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/smoothlife
  14. dotp :: [Float] -> [Float] -> Float dotp xs ys

    = foldl (+) 0 (zipWith (*) xs ys ) Plain Haskell (lists) 11 * Dot product in plain Haskell using lists * Dot product is map-reduce! ;)
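For reference, here is the slide's list-based dot product as a small, complete program (only the main function is added, for demonstration):

    dotp :: [Float] -> [Float] -> Float
    dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)

    main :: IO ()
    main = print (dotp [1,2,3] [4,5,6])   -- prints 32.0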
  15. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys) 12 * Same code embedded * 'fold' is parallel tree-reduction » The main difference is in the types — let's look at them in more detail…
  16. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys) Embedded Accelerate arrays 12 * Same code embedded * 'fold' is parallel tree-reduction » The main difference is in the types — let's look at them in more detail…
  17. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys) Acc marks embedded array computations Embedded Accelerate arrays 12 * Same code embedded * 'fold' is parallel tree-reduction » The main difference is in the types — let's look at them in more detail…
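As a complete source-file version of this slide, assuming the accelerate package (the qualified names avoid clashes with the Prelude's fold and zipWith, while (+) and (*) work unchanged thanks to the Num instance for Exp Float):

    import Data.Array.Accelerate as A

    dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
    dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)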
  18. data Array sh e type Vector e = Array DIM1

    e type Scalar e = Array DIM0 e 13 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey
  19. data Array sh e type Vector e = Array DIM1

    e type Scalar e = Array DIM0 e Regular, shape-polymorphic arrays 13 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey
  20. data Array sh e type Vector e = Array DIM1

    e type Scalar e = Array DIM0 e Regular, shape-polymorphic arrays zipWith :: (a -> b -> c) -> [a] -> [b] -> [c] zipWith :: (…) => (Exp a -> Exp b -> Exp c) -> Acc (Array sh a) -> Acc (Array sh b) -> Acc (Array sh c) Embedded scalar expressions 13 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey
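To make the shapes concrete, host-side arrays are built by giving the shape explicitly; a small sketch assuming the accelerate API, where Z is the rank-zero shape and (:.) adds one dimension:

    import Data.Array.Accelerate as A

    vec :: Vector Float                  -- Array DIM1 Float
    vec = fromList (Z :. 3) [1, 2, 3]

    mat :: Array DIM2 Float              -- a 2-by-3 matrix
    mat = fromList (Z :. 2 :. 3) [1 .. 6]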
  21. Acc marks embedded array computations Embedded Accelerate arrays dotp ::

    Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys ) 14 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory
  22. dotp :: Acc (Vector Float) -> Acc (Vector Float) ->

    Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys ) 14 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory
  25. dotp :: Vector Float -> Vector Float ->

    Acc (Scalar Float) dotp xs ys = let xs' = use xs; ys' = use ys in fold (+) 0 (zipWith (*) xs' ys') use embeds values 14 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory
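Putting it together, a complete round trip might look like this (a sketch assuming the accelerate package; run comes from a backend, here the reference interpreter, and with the CUDA backend use is the point where the host-to-GPU transfer happens):

    import Data.Array.Accelerate as A
    import Data.Array.Accelerate.Interpreter (run)
    -- or, for the GPU: import Data.Array.Accelerate.CUDA (run)

    dotp :: Vector Float -> Vector Float -> Scalar Float
    dotp xs ys = run $ A.fold (+) 0 (A.zipWith (*) (use xs) (use ys))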
  26. 15 Essence of embedded languages (transcending Accelerate) * High-level for

    expressiveness; restricted for efficiency * Further abstractions: generate embedded code
  27. Embedded languages …are restricted languages 15 Essence of embedded languages

    (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code
  28. Embedded languages …are restricted languages The embedding partly compensates for

    restrictions: 15 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code
  29. Embedded languages …are restricted languages Seamless integration into host language

    and… …host language can generate embedded code. The embedding partly compensates for restrictions: 15 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code
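One payoff of the seamless embedding: ordinary Haskell functions generate embedded code. For example, a plain Haskell combinator (a sketch, not part of the Accelerate API) unrolls at host time into a single embedded pipeline that the backend is then free to optimise as a whole:

    import Data.Array.Accelerate (Acc)

    -- The recursion runs in Haskell while the program is being built, so
    -- the result is one unrolled embedded computation, not a loop on the GPU.
    iterateN :: Int -> (Acc a -> Acc a) -> Acc a -> Acc a
    iterateN n f x
      | n <= 0    = x
      | otherwise = iterateN (n - 1) f (f x)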
  30. map (\x -> x + 1) arr 17 (1) Reify

    AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  31. map (\x -> x + 1) arr Reify AST Map

    (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  32. map (\x -> x + 1) arr Reify AST Map

    (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  33. map (\x -> x + 1) arr Reify AST Map

    (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  34. map (\x -> x + 1) arr Reify AST Map

    (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... CUDA compiler 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
  35. map (\x -> x + 1) arr Reify AST Map

    (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... CUDA compiler Call 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
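To illustrate what reification produces, here is a heavily simplified stand-in for the datatypes involved (not Accelerate's real AST, which is typed; it uses de Bruijn indices in the same spirit, with ZeroIdx naming the innermost lambda-bound variable):

    data Acc    = Map Fun Acc | Use Arr        -- embedded array computations
    newtype Fun = Lam Exp                      -- a one-argument function body
    data Exp    = ZeroIdx                      -- the lambda-bound variable
                | Const Int
                | PrimApp Prim (Exp, Exp)
    data Prim   = Add
    data Arr    = Arr                          -- placeholder for array data

    -- 'map (\x -> x + 1) arr' reifies to:
    increment :: Acc -> Acc
    increment arr = Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr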
  36. mkMap dev aenv fun arr = return $ CUTranslSkel "map"

    [cunit|
      $esc:("#include <accelerate_cuda.h>")
      extern "C" __global__ void map ($params:argIn, $params:argOut)
      {
          const int shapeSize = size(shOut);
          const int gridSize  = $exp:(gridSize dev);
          int ix;
          for ( ix = $exp:(threadIdx dev)
              ; ix < shapeSize
              ; ix += gridSize )
          {
              $items:(dce x .=. get ix)
              $items:(setOut "ix" .=. f x)
          }
      }
    |]
    where ... 18 * Combinators as skeletons (code templates with holes) * Quasi-quoter for CUDA [Mainland] * Yellow anti-quotes are the holes (parameters) of the template
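The quasi-quoter used above comes from Geoffrey Mainland's language-c-quote package. As a rough, self-contained illustration of the same mechanism (assuming that package only; this is not Accelerate's actual skeleton code), antiquotes such as $int: splice Haskell values into the quoted CUDA template:

    {-# LANGUAGE QuasiQuotes #-}
    import Language.C.Quote.CUDA (cunit)
    import qualified Language.C.Syntax as C

    -- Generate a CUDA kernel that scales an array by a fixed constant.
    scaleKernel :: Integer -> [C.Definition]
    scaleKernel k = [cunit|
        extern "C" __global__ void scale (float *xs, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) xs[i] = xs[i] * $int:k;
        }
      |]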
  37. 19 Essence of skeleton-based generative programming (transcending Accelerate) * Hand

    tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries, etc.
  38. Skeletons are templates …encapsulating efficient code idioms 19 Essence of

    skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries, etc.
  39. Skeletons are templates …encapsulating efficient code idioms Code generation as

    template meta programming Foreign function interface as an escape hatch 19 Essence of skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries, etc.
  40. Ingredient 3 Composing skeletons 20 » Let us zoom out

    and look at the big picture again…
  41. dotp xs ys = fold (+) 0 (zipWith (*) xs

    ys) 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse the two skeletons into one
  42. dotp xs ys = fold (+) 0 (zipWith (*) xs

    ys) Skeleton #1 Skeleton #2 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse the two skeletons into one
  43. dotp xs ys = fold (+) 0 (zipWith (*) xs

    ys) Skeleton #1 Skeleton #2 Intermediate array Extra traversal 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse the two skeletons into one
  44. dotp xs ys = fold (+) 0 (zipWith (*) xs

    ys) Combined skeleton 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse the two skeletons into one
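To see what the combined skeleton buys, consider the list-level analogue (an illustrative sketch, not Accelerate code): a single fused loop consumes both inputs at once and never allocates the intermediate list that zipWith would otherwise produce.

    {-# LANGUAGE BangPatterns #-}

    dotpFused :: [Float] -> [Float] -> Float
    dotpFused = go 0
      where
        -- one traversal, accumulating as we go; no intermediate structure
        go !acc (x:xs) (y:ys) = go (acc + x * y) xs ys
        go !acc _      _      = acc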
  45. Fusing networks of skeletons [diagram: producers p1–p7 feeding consumers c1 and c2] 22 * Networks consist of producers (e.g., generate, map) and consumers (e.g., fold) » First, fuse producers
  46. Fusing networks of skeletons, Phase 1: producer/producer fusion [diagram: successive producers combined] 23 * Producer/producer fusion: combine successive producers (e.g., the map family) * No fusion if intermediate results are shared (risk of duplicating work) » Second, fuse consumers with producers
  47. Fusing networks of skeletons, Phase 2: consumer/producer fusion [diagram: consumers instantiated with producer code] 24 * Consumer/producer fusion: instantiate the consumer skeleton with the producer code (e.g., folds) * In Accelerate, fused skeletons share GPU kernels
  48. data DelayedAcc a where Done :: Acc a -> DelayedAcc

    a Yield :: (Shape sh, Elt e) => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e) Step :: … 25 (1) Fusion-friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representation, instantiating consumers with producers
  49. data DelayedAcc a where Done :: Acc a -> DelayedAcc

    a Yield :: (Shape sh, Elt e) => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e) Step :: … Fusion friendly 25 (1) Fusion-friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representation, instantiating consumers with producers
  50. data DelayedAcc a where Done :: Acc a -> DelayedAcc

    a Yield :: (Shape sh, Elt e) => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e) Step :: … Fusion friendly mapD f (Yield sh g) = Yield sh (f . g) 25 (1) Fusion-friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representation, instantiating consumers with producers
  51. data DelayedAcc a where

    Done  :: Acc a -> DelayedAcc a
    Yield :: (Shape sh, Elt e)
          => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e)
    Step  :: …
    Fusion friendly
    mapD f (Yield sh g) = Yield sh (f . g)
    codeGenAcc … (Fold f z arr)
      = mkFold … (codeGenFun f) (codeGenExp z) (codeGenEmbeddedAcc arr)
    25 (1) Fusion-friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representation, instantiating consumers with producers
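The mechanism in miniature (an illustrative sketch with simplified types, not Accelerate's actual definitions): a delayed array is a shape together with an index function, so the smart constructors for producers compose functions instead of materialising arrays; only a consumer finally runs the loop.

    -- Simplified delayed representation: a shape plus an index function.
    data Delayed sh e = Delayed sh (sh -> e)

    mapD :: (a -> b) -> Delayed sh a -> Delayed sh b
    mapD f (Delayed sh g) = Delayed sh (f . g)

    zipWithD :: (a -> b -> c) -> Delayed sh a -> Delayed sh b -> Delayed sh c
    zipWithD f (Delayed sh g) (Delayed _ h) =
      Delayed sh (\ix -> f (g ix) (h ix))

    -- A consumer, e.g. a fold, would be instantiated with the composed
    -- index function and so perform the only actual traversal.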
  52. 26 Essence of fusing skeletons (transcending Accelerate) * Efficient code

    often requires a coarse granularity * Rewriting of skeleton composition and instantiation
  53. Fusion of skeletons …reduces the abstraction penalty 26 Essence of

    fusing skeletons (transcending Accelerate) * Efficient code often requires a coarse granularity * Rewriting of skeleton composition and instantiation
  54. Fusion of skeletons …reduces the abstraction penalty Code generation idioms

    vary from high-level combinators Smart constructors combine producers Instantiate consumer skeletons with producer code 26 Essence of fusing skeletons (transcending Accelerate) * Efficient code often requires a coarse granularity * Rewriting of skeleton composition and instantiation
  55. [Benchmark plot: Dot product; run time (ms) vs elements (millions); series: Data.Vector, Repa -N8, NDP2GPU, Accelerate -fusion, Accelerate +fusion, CUBLAS] 28 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)
  56. Dot Product [the same benchmark plot, now titled] 28 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)
  57. 29 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)
  58. Jos Stam's Fluid Flow Solver 29 * C (red) is

    on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)
  59. [Benchmark plot: Canny Edge Detection; run time (ms) vs image size (total pixels); series: Accelerate (whole program), Accelerate (just GPU kernels), OpenCV (CPU), OpenCV (GPU)] 30 * OpenCV CPU (blue) is one CPU core with SSE (Xeon E5405 CPU @ 2 GHz, 64-bit) * Accelerate (green) and OpenCV GPU (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * Accelerate performs the last computation step CPU-side: the light green includes the CPU post-processing
  60. Canny Edge Detection [the same benchmark plot, now titled] 30 * OpenCV CPU (blue) is one CPU core with SSE (Xeon E5405 CPU @ 2 GHz, 64-bit) * Accelerate (green) and OpenCV GPU (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * Accelerate performs the last computation step CPU-side: the light green includes the CPU post-processing
  61. [Benchmark plot: N-Body; run time (ms) vs bodies (1k–32k); series: Accelerate -fusion -sharing, Accelerate -fusion +sharing, Accelerate +fusion +sharing, CUDA] 31 * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * GPU performance depends on the use of shared memory; Accelerate is slower, as we do not yet optimise for that.
  62. N-Body [the same benchmark plot, now titled] 31 * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * GPU performance depends on the use of shared memory; Accelerate is slower, as we do not yet optimise for that.
  63. N-Body [the same plot, annotated: Missing shared memory optimisation] 31 * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * GPU performance depends on the use of shared memory; Accelerate is slower, as we do not yet optimise for that.
  64. Summary Embedded languages are restricted languages Skeletons encapsulate efficient code

    idioms Fusion reduces the abstraction penalty 32
  65. References: Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", ACM Press, 2011. Optimising Purely Functional GPU Programs. McDonell, Chakravarty, Keller & Lippmeier. In "ACM SIGPLAN International Conference on Functional Programming", ACM Press, 2013. Images from http://wikipedia.org and http://openclipart.org 33