Slide 1

Slide 1 text

mchakravarty: Embedded Languages for High-Performance Computing in Haskell. Manuel M T Chakravarty, University of New South Wales. Jointly with Gabriele Keller, Sean Lee, Trevor L. McDonell. 1 » Jointly with; then straight to next slide. 25 minute time slot: 20min talking + 5min [5min Embedded; 5min Skeletons; 5min Fusion; 5min Demo & benchmarks]

Slide 2

Slide 2 text

High Level Languages 2 * Advanced features of HLLs increase our productivity & safety » …but are they always a benefit…

Slide 3

Slide 3 text

High Level Languages Boxed values Polymorphism & generics Composite data structures Immutable structures Higher-order functions & closures 2 * Advanced features of HLLs increase our productivity & safety » …but are they always a benefit…

Slide 5

Slide 5 text

Data and compute intensive applications 3 * Special hardware, parallel hardware, concurrent hardware, etc. * Requires: high-throughput hardware & optimised code

Slide 6

Slide 6 text

Data and compute intensive applications GPUs Cluster multicore CPU 3 * Special hardware, parallel hardware, concurrent hardware, etc. * Requires: high-throughput hardware & optimised code

Slide 7

Slide 7 text

High Level Languages GPUs Cluster multicore CPU 4 * Many features are already difficult to map efficiently to conventional sequential architectures * Optimising them for high-throughput hardware is harder still * Fragility and lack of predictability of optimisations

Slide 8

Slide 8 text

High Level Languages GPUs Cluster multicore CPU Efficient code? 4 * Many features are already difficult to map efficiently to conventional sequential architectures * Optimising them for high-throughput hardware is harder still * Fragility and lack of predictability of optimisations

Slide 9

Slide 9 text

High Level Languages GPUs Cluster multicore CPU Efficient code? Function pointers, control flow, memory access patterns, data distribution, decomposition 4 * Many features are already difficult to map efficiently to conventional sequential architectures * Optimising them for high-throughput hardware is harder still * Fragility and lack of predictability of optimisations

Slide 10

Slide 10 text

High Level Languages GPUs Cluster multicore CPU 5 * Although we love high-level languages, this breaks our heart!

Slide 12

Slide 12 text

“How about embedded languages with specialised code generation?” 6 * Instead of writing high-performance code, we write code that writes high-performance code.

Slide 13

Slide 13 text

Accelerate, an embedded language for GPU programming 7 * We tried that for GPGPU programming. * The approach is more general. * Smoothlife: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/smoothlife

Slide 14

Slide 14 text

8 N-body: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/n-body

Slide 15

Slide 15 text

9 Fluid flow: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/fluid

Slide 16

Slide 16 text

Ingredient 1 Embedded languages 10 » A simple example…

Slide 17

Slide 17 text

dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)

Plain Haskell (lists) 11 * Dot product in plain Haskell using lists * Dot product is map-reduce! ;)

Slide 18

Slide 18 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

12 * Same code embedded * 'fold' is parallel tree-reduction » The main difference is the types — let's look at them in more detail…

Slide 19

Slide 19 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

Embedded Accelerate arrays

12 * Same code embedded * 'fold' is parallel tree-reduction » The main difference is the types — let's look at them in more detail…

Slide 20

Slide 20 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

Acc marks embedded array computations

Embedded Accelerate arrays

12 * Same code embedded * 'fold' is parallel tree-reduction » The main difference is the types — let's look at them in more detail…

Slide 21

Slide 21 text

data Array sh e
type Vector e = Array DIM1 e
type Scalar e = Array DIM0 e

13 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey

Slide 22

Slide 22 text

data Array sh e
type Vector e = Array DIM1 e
type Scalar e = Array DIM0 e

Regular, shape-polymorphic arrays

13 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey

Slide 23

Slide 23 text

data Array sh e
type Vector e = Array DIM1 e
type Scalar e = Array DIM0 e

Regular, shape-polymorphic arrays

zipWith :: (a -> b -> c) -> [a] -> [b] -> [c]

zipWith :: (…)
        => (Exp a -> Exp b -> Exp c)
        -> Acc (Array sh a) -> Acc (Array sh b) -> Acc (Array sh c)

Embedded scalar expressions

13 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey
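Not on the slide, but useful context for the DIM0/DIM1 synonyms above: Accelerate's shapes are snoc lists of Int dimensions. A sketch of the library's shape definitions (the real ones add strictness annotations and instances):

{-# LANGUAGE TypeOperators #-}

-- 'Z' is the rank-0 shape; '(:.)' adds one innermost dimension.
data Z = Z
data tail :. head = tail :. head

type DIM0 = Z
type DIM1 = DIM0 :. Int
type DIM2 = DIM1 :. Int

-- Hence 'Scalar e = Array Z e' holds exactly one element, and
-- 'Vector e = Array (Z :. Int) e' is a one-dimensional array.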

Slide 24

Slide 24 text

Acc marks embedded array computations

Embedded Accelerate arrays

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

14 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory

Slide 25

Slide 25 text

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

14 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory

Slide 28

Slide 28 text

dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in  fold (+) 0 (zipWith (*) xs' ys')

use embeds values

14 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory
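To make this concrete beyond the slide, a minimal complete program; a sketch assuming the reference interpreter backend (Data.Array.Accelerate.Interpreter) — the CUDA backend exposes a run of the same shape:

{-# LANGUAGE TypeOperators #-}
import Data.Array.Accelerate (Acc, Vector, Scalar, Z(..), (:.)(..))
import qualified Data.Array.Accelerate             as A
import qualified Data.Array.Accelerate.Interpreter as I

dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = A.use xs            -- embed host arrays
                 ys' = A.use ys
             in  A.fold (+) 0 (A.zipWith (*) xs' ys')

main :: IO ()
main = do
  let xs = A.fromList (Z :. 10) [1 .. 10]  -- host-side input data
      ys = A.fromList (Z :. 10) [1 .. 10]
  print (I.run (dotp xs ys))               -- execute the embedded program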

Slide 29

Slide 29 text

15 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code

Slide 30

Slide 30 text

Embedded languages …are restricted languages 15 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code

Slide 31

Slide 31 text

Embedded languages …are restricted languages The embedding partly compensates for restrictions: 15 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code

Slide 32

Slide 32 text

Embedded languages …are restricted languages Seamless integration into host language and… …host language can generate embedded code. The embedding partly compensates for restrictions: 15 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code
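A standard instance of that last point (my example, not the slide's): an ordinary Haskell function can drive code generation, so recursion over a host-level Int unrolls completely before any embedded code exists. Assuming Accelerate's Exp as the embedded scalar type:

-- The Int is consumed at code-generation time; only the fully
-- unrolled product ever reaches the embedded language.
power :: Int -> Exp Float -> Exp Float
power 0 _ = 1                    -- literal lifted via fromInteger
power n x = x * power (n - 1) x

-- power 3 x denotes the single expression x * (x * (x * 1)):
-- no loop and no recursion remain in the generated GPU code.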

Slide 33

Slide 33 text

Ingredient 2 Skeletons 16 » How to implement such an embedded language…

Slide 34

Slide 34 text

map (\x -> x + 1) arr 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

Slide 35

Slide 35 text

map (\x -> x + 1) arr Reify AST Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…
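How does a host-level lambda turn into that AST? Overloading. A toy sketch of the trick (my simplification; Accelerate's real AST is typed and, as ZeroIdx hints, uses de Bruijn indices):

data Exp                          -- a tiny untyped AST
  = Var Int                       -- de Bruijn index
  | Const Float
  | Add Exp Exp
  | Mul Exp Exp
  deriving Show

instance Num Exp where            -- overload (+), (*) and literals
  fromInteger = Const . fromInteger
  (+)         = Add
  (*)         = Mul
  negate      = Mul (Const (-1))
  abs         = error "abs: unsupported in this toy EDSL"
  signum      = error "signum: unsupported in this toy EDSL"

-- Applying the host function to a variable reifies its body:
-- reify (\x -> x + 1)  ==>  Add (Var 0) (Const 1.0)
reify :: (Exp -> Exp) -> Exp
reify f = f (Var 0)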

Slide 36

Slide 36 text

map (\x -> x + 1) arr Reify AST Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

Slide 37

Slide 37 text

map (\x -> x + 1) arr Reify AST Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

Slide 38

Slide 38 text

map (\x -> x + 1) arr Reify AST Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... CUDA compiler 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

Slide 39

Slide 39 text

map (\x -> x + 1) arr Reify AST Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... CUDA compiler Call 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

Slide 40

Slide 40 text

mkMap dev aenv fun arr = return $ CUTranslSkel "map" [cunit|
  $esc:("#include <accelerate_cuda.h>")

  extern "C" __global__ void
  map ($params:argIn, $params:argOut)
  {
      const int shapeSize = size(shOut);
      const int gridSize  = $exp:(gridSize dev);
      int ix;

      for ( ix = $exp:(threadIdx dev)
          ; ix < shapeSize
          ; ix += gridSize )
      {
          $items:(dce x .=. get ix)
          $items:(setOut "ix" .=. f x)
      }
  }
|]
  where ...

18 * Combinators as skeletons (code templates with holes) * Quasi-quoter for CUDA [Mainland] * Yellow anti-quotes are the holes (parameters) of the template
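The slide shows Accelerate's real map skeleton; below is a cut-down, self-contained version of the same idea, a sketch assuming Mainland's language-c-quote package (the quasi-quoter cited in the notes). The scalar body f is the template's only hole:

{-# LANGUAGE QuasiQuotes #-}
import Language.C.Quote.CUDA (cunit)
import qualified Language.C.Syntax as C

-- Instantiate a map skeleton: splice the C expression 'f'
-- (expected to read from xs[ix]) into the kernel body.
mkMapSkel :: C.Exp -> [C.Definition]
mkMapSkel f = [cunit|
  extern "C" __global__ void map (const float *xs, float *out, const int n)
  {
      for ( int ix = blockDim.x * blockIdx.x + threadIdx.x
          ; ix < n
          ; ix += blockDim.x * gridDim.x )
      {
          out[ix] = $exp:f;
      }
  }
|]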

Slide 41

Slide 41 text

19 Essence of skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries etc

Slide 42

Slide 42 text

Skeletons are templates …encapsulating efficient code idioms 19 Essence of skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries etc

Slide 43

Slide 43 text

Skeletons are templates …encapsulating efficient code idioms Code generation as template meta programming Foreign function interface as an escape hatch 19 Essence of skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simplifies code generation * FFI to use native libraries etc

Slide 44

Slide 44 text

Ingredient 3 Composing skeletons 20 » Let us zoom out and look at the big picture again…

Slide 45

Slide 45 text

dotp xs ys = fold (+) 0 (zipWith (*) xs ys) 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse two skeletons into one

Slide 46

Slide 46 text

dotp xs ys = fold (+) 0 (zipWith (*) xs ys) Skeleton #1 Skeleton #2 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse two skeletons into one

Slide 47

Slide 47 text

dotp xs ys = fold (+) 0 (zipWith (*) xs ys) Skeleton #1 Skeleton #2 Intermediate array Extra traversal 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse two skeletons into one

Slide 48

Slide 48 text

dotp xs ys = fold (+) 0 (zipWith (*) xs ys) Combined skeleton 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superfluous intermediate array; (2) superfluous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse two skeletons into one
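In miniature, what the combined skeleton buys (my sketch, plain Haskell rather than generated CUDA): one traversal, one accumulator, no intermediate array.

{-# LANGUAGE BangPatterns #-}

-- Unfused: zipWith builds an intermediate list that fold re-traverses.
dotpUnfused :: [Float] -> [Float] -> Float
dotpUnfused xs ys = foldl (+) 0 (zipWith (*) xs ys)

-- Fused: multiply and accumulate in a single pass.
dotpFused :: [Float] -> [Float] -> Float
dotpFused = go 0
  where
    go !acc (x:xs) (y:ys) = go (acc + x * y) xs ys
    go !acc _      _      = acc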

Slide 49

Slide 49 text

Fusing networks of skeletons [diagram: a network of producers p1–p7 feeding consumers c1 and c2] 22 * Networks consist of producers (eg, generate, map) and consumers (eg, fold) » First, fuse producers

Slide 50

Slide 50 text

Fusing networks of skeletons [diagram: producers p1–p7, consumers c1 and c2] Phase 1: producer/producer fusion 23 * Producer/producer fusion: combine successive producers (e.g., map family) * No fusion if intermediate results are shared (risk of duplication of work) » Second, fuse consumers with producers

Slide 51

Slide 51 text

Fusing networks of skeletons [diagram: producers p1–p7, consumers c1 and c2] Phase 2: consumer/producer fusion 24 * Consumer/producer fusion: instantiate the consumer skeleton with the producer code (e.g., folds) * Accelerate: fused skeletons share GPU kernels

Slide 52

Slide 52 text

data DelayedAcc a where
  Done  :: Acc a -> DelayedAcc a
  Yield :: (Shape sh, Elt e)
        => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e)
  Step  :: …

25 (1) Fusion-friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representation, instantiating consumers with producers

Slide 53

Slide 53 text

data DelayedAcc a where
  Done  :: Acc a -> DelayedAcc a
  Yield :: (Shape sh, Elt e)
        => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e)
  Step  :: …

Fusion friendly

25 (1) Fusion-friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representation, instantiating consumers with producers

Slide 54

Slide 54 text

data DelayedAcc a where
  Done  :: Acc a -> DelayedAcc a
  Yield :: (Shape sh, Elt e)
        => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e)
  Step  :: …

Fusion friendly

mapD f (Yield sh g) = Yield sh (f . g)

25 (1) Fusion-friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representation, instantiating consumers with producers

Slide 55

Slide 55 text

data DelayedAcc a where
  Done  :: Acc a -> DelayedAcc a
  Yield :: (Shape sh, Elt e)
        => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e)
  Step  :: …

Fusion friendly

mapD f (Yield sh g) = Yield sh (f . g)

codeGenAcc … (Fold f z arr)
  = mkFold … (codeGenFun f) (codeGenExp z) (codeGenEmbeddedAcc arr)

25 (1) Fusion-friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representation, instantiating consumers with producers
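The same three steps in a self-contained miniature (simplified types of my own; the real DelayedAcc is indexed by Accelerate's typed AST):

-- (1) Fusion-friendly representation: a shape plus an index function.
data Delayed sh e = Delayed sh (sh -> e)

-- (2) Producers as smart constructors: composing producers is just
--     function composition, so producer/producer fusion is free.
mapD :: (a -> b) -> Delayed sh a -> Delayed sh b
mapD f (Delayed sh g) = Delayed sh (f . g)

zipWithD :: (a -> b -> c) -> Delayed sh a -> Delayed sh b -> Delayed sh c
zipWithD f (Delayed sh g) (Delayed _ h) = Delayed sh (\ix -> f (g ix) (h ix))

-- (3) A consumer instantiated with producer code: one loop,
--     no intermediate array.
foldD :: (e -> e -> e) -> e -> Delayed Int e -> e
foldD f z (Delayed n g) = foldl (\acc ix -> f acc (g ix)) z [0 .. n - 1]

dotpD :: Delayed Int Float -> Delayed Int Float -> Float
dotpD xs ys = foldD (+) 0 (zipWithD (*) xs ys)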

Slide 56

Slide 56 text

26 Essence of fusing skeletons (transcending Accelerate) * Efficient code often requires a coarse granularity * Rewriting of skeleton composition and instantiation

Slide 57

Slide 57 text

Fusion of skeletons …reduces the abstraction penalty 26 Essence of fusing skeletons (transcending Accelerate) * Efficient code often requires a coarse granularity * Rewriting of skeleton composition and instantiation

Slide 58

Slide 58 text

Fusion of skeletons …reduces the abstraction penalty Code generation idioms differ from the high-level combinators Smart constructors combine producers Instantiate consumer skeletons with producer code 26 * Efficient code often requires a coarse granularity * Rewriting of skeleton composition and instantiation

Slide 59

Slide 59 text

“How fast are we going?” 27

Slide 60

Slide 60 text

[Benchmark chart: Dot product. Run time (ms, log scale) vs elements (millions, 2–20). Series: Data.Vector, Repa -N8, NDP2GPU, Accelerate -fusion, Accelerate +fusion, CUBLAS] 28 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

Slide 61

Slide 61 text

Dot Product [Benchmark chart: run time (ms, log scale) vs elements (millions, 2–20). Series: Data.Vector, Repa -N8, NDP2GPU, Accelerate -fusion, Accelerate +fusion, CUBLAS] 28 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

Slide 62

Slide 62 text

29 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

Slide 63

Slide 63 text

Jos Stam's Fluid Flow Solver 29 * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

Slide 64

Slide 64 text

[Benchmark chart: Canny Edge Detection. Run time (ms, log scale) vs image size (64k–16M total pixels). Series: Accelerate (whole program), Accelerate (just GPU kernels), OpenCV (CPU), OpenCV (GPU)] 30 * C: OpenCV CPU (blue) is one CPU core with SSE (Xeon E5405 CPU @ 2 GHz, 64-bit) * Accelerate (green) and OpenCV GPU (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * Accelerate performs the last computation step CPU-side: the light green includes the CPU post-processing

Slide 65

Slide 65 text

Canny Edge Detection [Benchmark chart: run time (ms, log scale) vs image size (64k–16M total pixels). Series: Accelerate (whole program), Accelerate (just GPU kernels), OpenCV (CPU), OpenCV (GPU)] 30 * C: OpenCV CPU (blue) is one CPU core with SSE (Xeon E5405 CPU @ 2 GHz, 64-bit) * Accelerate (green) and OpenCV GPU (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * Accelerate performs the last computation step CPU-side: the light green includes the CPU post-processing

Slide 66

Slide 66 text

[Benchmark chart: N-Body. Run time (ms, log scale) vs bodies (1k–32k). Series: Accelerate -fusion -sharing, Accelerate -fusion +sharing, Accelerate +fusion +sharing, CUDA] 31 * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * GPU performance depends on use of shared memory. Accelerate is slower, as we do not yet optimise for that.

Slide 67

Slide 67 text

N-Body [Benchmark chart: run time (ms, log scale) vs bodies (1k–32k). Series: Accelerate -fusion -sharing, Accelerate -fusion +sharing, Accelerate +fusion +sharing, CUDA] 31 * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * GPU performance depends on use of shared memory. Accelerate is slower, as we do not yet optimise for that.

Slide 68

Slide 68 text

N-Body [Benchmark chart: run time (ms, log scale) vs bodies (1k–32k). Series: Accelerate -fusion -sharing, Accelerate -fusion +sharing, Accelerate +fusion +sharing, CUDA] Missing shared memory optimisation 31 * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * GPU performance depends on use of shared memory. Accelerate is slower, as we do not yet optimise for that.

Slide 69

Slide 69 text

Summary: Embedded languages are restricted languages. Skeletons encapsulate efficient code idioms. Fusion reduces the abstraction penalty. 32

Slide 70

Slide 70 text

References

Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", ACM Press, 2011.

Optimising Purely Functional GPU Programs. McDonell, Chakravarty, Keller & Lippmeier. In "ACM SIGPLAN International Conference on Functional Programming", ACM Press, 2013.

Images from http://wikipedia.org and http://openclipart.org 33