Embedded Languages for High-Performance Computing in Haskell

mchakravarty Embedded Languages for High-Performance Computing in Haskell Manuel M
T Chakravarty University of New South Wales Jointly with Gabriele Keller Sean Lee Trevor L. McDonell 1 » Jointly with; then straight to next slide 25 minute time slot: 20min talking + 5min [5min Embedded; 5min Skeletons; 5min Fusion; 5min Demo & benchmarks]

High Level Languages 2 * Advanced features of HLLs increase
our productivity & safety » …but are they always a beneﬁt…

High Level Languages Boxed values Polymorphism & generics Composite data
structures Immutable structures Higher-order functions & closures 2 * Advanced features of HLLs increase our productivity & safety » …but are they always a beneﬁt…

Data and compute intensive applications 3 * Special hardware, parallel
hardware, concurrent hardware, etc. * Requires: high-throughput hardware & optimised code

Data and compute intensive applications GPUs Cluster multicore CPU 3
* Special hardware, parallel hardware, concurrent hardware, etc. * Requires: high-throughput hardware & optimised code

High Level Languages GPUs Cluster multicore CPU 4 * Many
features are already difficult to efficiently map to conventional sequential architectures * Optimising them to high-throughput hardware is harder * Fragility and lack of predicatbility of optimisations

High Level Languages GPUs Cluster multicore CPU Efﬁcient code? 4
* Many features are already difficult to efficiently map to conventional sequential architectures * Optimising them to high-throughput hardware is harder * Fragility and lack of predicatbility of optimisations

High Level Languages GPUs Cluster multicore CPU Efﬁcient code? Function
pointers Control ﬂow Memory access patterns Data distribution Decomposition 4 * Many features are already difficult to efficiently map to conventional sequential architectures * Optimising them to high-throughput hardware is harder * Fragility and lack of predicatbility of optimisations

High Level Languages GPUs Cluster multicore CPU 5 * Although
we love high-level languages, this breaks our heart!

“How about embedded languages with specialised code generation?” 6 *
Instead of writing high-performance code, we write code that writes high-performance code.

Accelerate an embedded language for GPU programming 7 * We
tried that for GPGPU programming. * The approach is more general. * Smoothlife: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/ smoothlife

8 N-body: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/n- body

9 Fluid ﬂow: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/ ﬂuid

Ingredient 1 Embedded languages 10 » A simple example…

dotp :: [Float] -> [Float] -> Float dotp xs ys
= foldl (+) 0 (zipWith (*) xs ys ) Plain Haskell (lists) 11 * Dot product in plain Haskell using lists * Dot product is map-reduce! ;)

dotp :: Acc (Vector Float) -> Acc (Vector Float) ->
Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys ) 12 * Same code embedded * 'fold' is parallel tree-reduction » The main difference are the types — let's look at them in more detail…

Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys ) Embedded Accelerate arrays 12 * Same code embedded * 'fold' is parallel tree-reduction » The main difference are the types — let's look at them in more detail…

Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys ) Acc marks embedded array computations Embedded Accelerate arrays 12 * Same code embedded * 'fold' is parallel tree-reduction » The main difference are the types — let's look at them in more detail…

data Array sh e type Vector e = Array DIM1
e type Scalar e = Array DIM0 e 13 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey

e type Scalar e = Array DIM0 e Regular, shape-polymorphic arrays 13 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey

e type Scalar e = Array DIM0 e Regular, shape-polymorphic arrays zipWith :: (a -> b -> c) -> [a] -> [b] -> [c] zipWith :: (…) => (Exp a -> Exp b -> Exp c) -> Acc (Array sh a) -> Acc (Array sh b) -> Acc (Array sh c) Embedded scalar expressions 13 * Acc computations always produce arrays — hence, Scalar * Let's ignore the type class context for the moment * For comparison, the list version of zipWith on top in grey

Acc marks embedded array computations Embedded Accelerate arrays dotp ::
Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys ) 14 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory

Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys ) 14 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory

Acc (Scalar Float) dotp xs ys = fold (+) 0 (zipWith (*) xs ys ) let xs' = use xs ys' = use ys in ' ' use embeds values 14 * What if we want to process host data? * Using data from the host language in the embedded language * GPUs: explicit memory transfer from host to GPU memory

15 Essence of embedded languages (transcending Accelerate) * High-level for
expressiveness; restricted for efficiency * Further abstractions: generate embedded code

Embedded languages …are restricted languages 15 Essence of embedded languages
(transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code

Embedded languages …are restricted languages The embedding partly compensates for
restrictions: 15 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code

Embedded languages …are restricted languages Seamless integration into host language
and… …host language can generate embedded code. The embedding partly compensates for restrictions: 15 Essence of embedded languages (transcending Accelerate) * High-level for expressiveness; restricted for efficiency * Further abstractions: generate embedded code

Ingredient 2 Skeletons 16 » How to implement such an
embedded language…

map (\x -> x + 1) arr 17 (1) Reify
AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

map (\x -> x + 1) arr Reify AST Map
(Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

(Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

(Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

(Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... CUDA compiler 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

(Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr Optimise Skeleton instantiation __global__ void kernel (float *arr, int n) {... CUDA compiler Call 17 (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons (a) Compile with the CUDA compiler; (b) Invoke from host code » How to implement the skeletons…

mkMap dev aenv fun arr = return $ CUTranslSkel "map"
[cunit| $esc:("#include <accelerate_cuda.h>") extern "C" __global__ void map ($params:argIn, $params:argOut) { const int shapeSize = size(shOut); const int gridSize = $exp:(gridSize dev); int ix; for ( ix = $exp:(threadIdx dev) ; ix < shapeSize ; ix += gridSize ) { $items:(dce x .=. get ix) $items:(setOut "ix" .=. f x) } } |] where ... 18 * Combinators as skeletons (code templates with holes) * Quasi-quoter for CUDA [Mainland] * Yellow anti-quotes are the holes (parameters) of the template

19 Essence of skeleton-based generative programming (transcending Accelerate) * Hand
tuned skeleton code * Meta programming simpliﬁes code generation * FFI to use native libraries etc

Skeletons are templates …encapsulating efﬁcient code idioms 19 Essence of
skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simpliﬁes code generation * FFI to use native libraries etc

Skeletons are templates …encapsulating efﬁcient code idioms Code generation as
template meta programming Foreign function interface as an escape hatch 19 Essence of skeleton-based generative programming (transcending Accelerate) * Hand tuned skeleton code * Meta programming simpliﬁes code generation * FFI to use native libraries etc

Ingredient 3 Composing skeletons 20 » Let us zoom out
and look at the big picture again…

dotp xs ys = fold (+) 0 (zipWith (*) xs
ys) 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superﬂous intermediate array; (2) superﬂous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse two skeletons into

ys) Skeleton #1 Skeleton #2 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superﬂous intermediate array; (2) superﬂous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse two skeletons into

ys) Skeleton #1 Skeleton #2 Intermediate array Extra traversal 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superﬂous intermediate array; (2) superﬂous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse two skeletons into

ys) Combined skeleton 21 * Dot product: two skeletons executed in sequence * Inefficient: (1) superﬂous intermediate array; (2) superﬂous traversal of that intermediate array * Goal: one traversal of the input without an intermediate array => fuse two skeletons into

Fusing networks of skeletons p1 p1 p2 p3 p4 p5
p6 p7 c1 c2 22 * Networks consist of producers (eg, generate, map) and consumers (eg, fold) » First, fuse producers

Fusing networks of skeletons c2 p5 p1 c1 p6 p7
p3 p2 p4 Phase 1: producer/producer fusion 23 * producer/producer fusion: combine successive producers (e.g., map family) * No fusion if intermediate results are shared (risk of duplication of work) » Second, fuse consumers with producers

Fusing networks of skeletons c2 p5 p1 c1 p6 p7
p3 p2 p4 Phase 2: consumer/producer fusion 24 * Consumer/producer fusion: consumer skeleton with the producer code (e.g., folds) * Accelerate: fused skeletons share GPU kernels

data DelayedAcc a where Done :: Acc a -> DelayedAcc
a Yield :: (Shape sh, Elt e) => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e) Step :: … 25 (1) Fusion friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representations, instantiating consumers with producers

a Yield :: (Shape sh, Elt e) => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e) Step :: … Fusion friendly 25 (1) Fusion friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representations, instantiating consumers with producers

a Yield :: (Shape sh, Elt e) => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e) Step :: … Fusion friendly mapD f (Yield sh g) = Yield sh (f . g) 25 (1) Fusion friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representations, instantiating consumers with producers

a Yield :: (Shape sh, Elt e) => Exp sh -> Fun (sh -> e) -> DelayedAcc (Array sh e) Step :: … Fusion friendly mapD f (Yield sh g) = Yield sh (f . g) codeGenAcc … (Fold f z arr) = mkFold … (codeGenFun f) (codeGenExp z) (codeGenEmbeddedAcc arr) 25 (1) Fusion friendly representation of array skeletons (2) Producer skeletons as smart constructors of that representation (3) Generate code from that representations, instantiating consumers with producers

26 Essence of fusing skeletons (transcending Accelerate) * Efficient code
often requires a coarse granularity * Rewriting of skeleton composition and instantiation

Fusion of skeletons …reduces the abstraction penalty 26 Essence of
fusing skeletons (transcending Accelerate) * Efficient code often requires a coarse granularity * Rewriting of skeleton composition and instantiation

Fusion of skeletons …reduces the abstraction penalty Code generation idioms
vary from high-level combinators Smart constructors combine producers Instantiate consumer skeletons with producer code 26 Essence of fusing skeletons (transcending Accelerate) * Efficient code often requires a coarse granularity * Rewriting of skeleton composition and instantiation

“How fast are we going?” 27

0.1 1 10 100 2 4 6 8 10 12
14 16 18 20 Run Time (ms) Elements (millions) Dot product Data.Vector Repa -N8 NDP2GPU Accelerate -fusion ... +fusion CUBLAS 28 * C (red) is on one CPU core (Xenon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xenon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

0.1 1 10 100 2 4 6 8 10 12
14 16 18 20 Run Time (ms) Elements (millions) Dot product Data.Vector Repa -N8 NDP2GPU Accelerate -fusion ... +fusion CUBLAS Dot Product 28 * C (red) is on one CPU core (Xenon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xenon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

29 * C (red) is on one CPU core (Xenon
E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xenon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

Jos Stam's Fluid Flow Solver 29 * C (red) is
on one CPU core (Xenon E5405 CPU @ 2 GHz, 64-bit) * Repa (blue) is on 7 CPU cores (two quad-core Xenon E5405 CPUs @ 2 GHz, 64-bit) * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)

0.1 1 10 100 1000 64k 256k 1M 4M 16M
Run Time (ms) Image Size (total pixels) Canny Edge Detection Accelerate (whole program) Accelerate (just GPU kernels) OpenCV (CPU) OpenCV (GPU) 30 * C: OpenCV CPU (blue) is one CPU core with SSE (Xenon E5405 CPU @ 2 GHz, 64-bit) * Accelerate (green) and OpenCV GPU (red) is on a Tesla T10 processor (240 cores @ 1.3 GHz) * Accelerate performs the last computation step CPU-side: the light green includes the CPU post processing

0.1 1 10 100 1000 64k 256k 1M 4M 16M
Run Time (ms) Image Size (total pixels) Canny Edge Detection Accelerate (whole program) Accelerate (just GPU kernels) OpenCV (CPU) OpenCV (GPU) Canny Edge Detection 30 * C: OpenCV CPU (blue) is one CPU core with SSE (Xenon E5405 CPU @ 2 GHz, 64-bit) * Accelerate (green) and OpenCV GPU (red) is on a Tesla T10 processor (240 cores @ 1.3 GHz) * Accelerate performs the last computation step CPU-side: the light green includes the CPU post processing

0.1 1 10 100 1000 1k 2k 4k 8k 16k
32k Run Time (ms) Bodies N-Body Accelerate -fusion -sharing ... -fusion +sharing ... +fusion +sharing CUDA 31 * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * GPU performance depends on use of shared memory. Acceleret is slower as wedo do not yet optimise for that.

N-Body 0.1 1 10 100 1000 1k 2k 4k 8k
16k 32k Run Time (ms) Bodies N-Body Accelerate -fusion -sharing ... -fusion +sharing ... +fusion +sharing CUDA 31 * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * GPU performance depends on use of shared memory. Acceleret is slower as wedo do not yet optimise for that.

N-Body 0.1 1 10 100 1000 1k 2k 4k 8k
16k 32k Run Time (ms) Bodies N-Body Accelerate -fusion -sharing ... -fusion +sharing ... +fusion +sharing CUDA Missing shared memory optimisation 31 * Accelerate (green) and handwritten CUDA (red) are on a Tesla T10 processor (240 cores @ 1.3 GHz) * GPU performance depends on use of shared memory. Acceleret is slower as wedo do not yet optimise for that.

Summary Embedded languages are restricted languages Skeletons encapsulate efﬁcient code
idioms Fusion reduces the abstraction penalty types >< state languages 32

Images from http://wikipedia.org http://openclipart.org Accelerating Haskell Array Codes with Multicore
GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", ACM Press, 2011. Optimising Purely Functional GPU Programs. McDonell, Chakravarty, Keller & Lippmeier. In "ACM SIGPLAN International Conference on Functional Programming", ACM Press, 2013. References 33

Embedded Languages for High-Performance Computi...

Embedded Languages for High-Performance Computing in Haskell

More Decks by Manuel Chakravarty

Other Decks in Research

Featured

Transcript