
Data Parallelism in Haskell

The implicit data parallelism in collective operations on aggregate data structures constitutes an attractive parallel programming model for functional languages. Beginning with our work on integrating nested data parallelism into Haskell, we explored a variety of different approaches to array-centric data parallel programming in Haskell, experimented with a range of code generation and optimisation strategies, and targeted both multicore CPUs and GPUs. In addition to practical tools for parallel programming, the outcomes of this research programme include more widely applicable concepts, such as Haskell’s type families and stream fusion. In this talk, I will contrast the different approaches to data parallel programming that we explored. I will discuss their strengths and weaknesses and review what we have learnt in the course of exploring the various options. This includes our experience of implementing these approaches in the Glasgow Haskell Compiler as well as the experimental results that we have gathered so far. Finally, I will outline the remaining open challenges and our plans for the future.

Manuel Chakravarty

September 23, 2013



Transcript

  1. mchakravarty
     TacticalGrace
    Data Parallelism in Haskell
    Manuel M T Chakravarty
    University of New South Wales
    Jointly with
    Gabriele Keller
    Sean Lee
    Roman Leshchinskiy
    Ben Lippmeier
    Trevor L McDonell
    Simon Peyton Jones
    1
     » I'd like to give you an overview of our research projects on data parallel programming (DPP) in Haskell, which are joint work with…
    45 minutes + 10min
    [10min Why DP?; 5min Design space; 18min 1st class; 12min 2nd class]


  2. Part 1
    Why data parallelism?
    2
    » Why do we focus on data parallelism?


  3. “It's a great abstraction!”
    3
    » Read!
    * In what way?


  4. Inside a multi-threaded program
    4
    * Explicit multi-threading: hard to get right
    * DP: single-threaded
    * Abstracts over concurrency control


  5. Inside a data parallel program
    4
    * Explicit multi-threading: hard to get right
    * DP: single-threaded
    * Abstracts over concurrency control


  6. ΔxΔp ~ h
    5
    * Indeterminism: hard to test; Heisenbugs
    * DP: deterministic — result independent of run and realised parallelism
    * Abstracts over indeterminism
    » Related to that is…


  7. x = y ⇒ f x = f y
    5
    * Indeterminism: hard to test; Heisenbugs
    * DP: deterministic — result independent of run and realised parallelism
    * Abstracts over indeterminism
    » Related to that is…


  8. 6
    » …that reasoning about explicit concurrency is hard
    * DP: compositional
    * Informal and formal reasoning as for sequential programs
    * Abstracts over interference/effects


  9. map f · map g = map (f · g)
    6
    » …that reasoning about explicit concurrency is hard
    * DP: compositional
    * Informal and formal reasoning as for sequential programs
    * Abstracts over interference/effects


  10. “Hardware likes SIMD!”
    7
    * Not only a great abstraction for developers,
    * also the right model for a lot of hardware.


  11. 8
    * More and more SIMD support in CPUs (e.g., AVX)


  12. 9
    * GPUs deliver exceptional FLOPS


  13. 10
     * Optimise FLOPS/Watt
    * Less control logic
    * Locality


  14. Energy efficiency
    10
     * Optimise FLOPS/Watt
    * Less control logic
    * Locality


  15. “It goes a long way!”
    11
    * DP is a great abstraction with increasingly friendly hardware and
    * it is more flexible than it might first appear.


  16. Flat data parallelism
    12
    » Concerning DP, we usually imagine rectangular structures…


  17. Flat data parallelism
    12
    » Concerning DP, we usually imagine rectangular structures…


  18. Nested data parallelism
    13
    » …but nested irregular arrays let us handle tree structures.
    * NESL


  19. Nested data parallelism
    13
    » …but nested irregular arrays let us handle tree structures.
    * NESL


  20. Amorphous data parallelism
    14
    * Sets, e.g., to represent graphs, processed with internally indeterministic operations
     * Blelloch's problem-based benchmark suite (PBBS)
    » Given these varying levels of expressiveness, we have got a taxonomy…


  21. Amorphous data parallelism
    14
    * Sets, e.g., to represent graphs, processed with internally indeterministic operations
     * Blelloch's problem-based benchmark suite (PBBS)
    » Given these varying levels of expressiveness, we have got a taxonomy…


  22. Part 2
    The design space
    15
    » …which informs the design of DP language extensions and implementations.


  23. Flat
    Nested
    Amorphous
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  24. Flat
    Nested
    Amorphous
    Repa
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  25. Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  26. Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  27. Flat
    Nested
    Amorphous
    Repa
    Accelerate
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  28. Flat
    Nested
    Amorphous
    Repa
    Accelerate
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    16
    * Two independent language axes: (x) expressiveness of computations; (y) expressiveness of
    parallelism


  29. Embedded
    (2nd class)
    Full
    (1st class)
    Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    Accelerate
    17
    * Additional hardware axis: (z) general or specialised architecture
    * We are a long way from fully parallelising Haskell with amorphous DP on GPUs!
    » We'll visit those points in the design space that our group explored while looking at
    applications such as the following…


  30. CPU
    Embedded
    (2nd class)
    Full
    (1st class)
    Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    GPU
    Accelerate
    17
    * Additional hardware axis: (z) general or specialised architecture
    * We are a long way from fully parallelising Haskell with amorphous DP on GPUs!
    » We'll visit those points in the design space that our group explored while looking at
    applications such as the following…


  31. Some case studies
    Numerically intensive applications
    18
     * Smoothlife: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/smoothlife
    * FFT


  32. 19
     * N-body: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/n-body
    * Hierarchical solutions benefit from nested parallelism


  33. 20
     * Fluid flow: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/fluid
    * Convolution
    » Let's first look at our experience with parallelising full Haskell, before we move to
    embedded languages…


  34. Part 3
    First-class data parallelism:
    Parallelising Haskell
    21


  35. Data Parallel Haskell
    Our ambitious first take
    22
    » Inspired by NESL, we began by integrating NDP into Haskell…


  36. Flat
    Nested
    Amorphous
    Repa
    Accelerate
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    23
     * Maximise expressiveness and convenience for the programmer
    * Let the compiler do the heavy lifting
    * Types to distinguish between parallel and sequential computations


  37. sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    24


  38. type Vector = [:Double:]
    type SparseVector = [:(Int, Double):]
    sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    parallel array
    24


  39. type Vector = [:Double:]
    type SparseVector = [:(Int, Double):]
    sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    index value
    parallel array
    24


  40. type Vector = [:Double:]
    type SparseVector = [:(Int, Double):]
    sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    [:0, 5, 0, 0, 3:]
    index value
    parallel array
    24


  41. type Vector = [:Double:]
    type SparseVector = [:(Int, Double):]
    sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    [:0, 5, 0, 0, 3:] [:(1, 5), (4, 3):]
    index value
    parallel array
    24


  42. type Vector = [:Double:]
    type SparseVector = [:(Int, Double):]
    sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    parallel computation (no nesting)
    [:0, 5, 0, 0, 3:] [:(1, 5), (4, 3):]
    index value
    parallel array
    24


  43. sdotp :: SparseVector -> Vector
    -> Double
    sdotp sv v =
    sumP
    [: x * (v !: i) | (i, x) <- sv :]
    type SparseMatrix = [:SparseVector:]
    smvm :: SparseMatrix -> Vector
    -> Vector
    smvm sm v
    = [: sdotp row v | row <- sm :]
    parallel computation (no nesting)
    25


  44. sdotp :: SparseVector -> Vector -> Double
     sdotp sv v =
       sumP [: x * (v !: i) | (i, x) <- sv :]   -- parallel computation (no nesting)

     type SparseMatrix = [:SparseVector:]

     smvm :: SparseMatrix -> Vector -> Vector
     smvm sm v
       = [: sdotp row v | row <- sm :]          -- parallel computation (nested, irregular)
     25
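     For readers unfamiliar with the [: ... :] syntax, the following sketch restates the two functions with ordinary Haskell lists. It only clarifies what the parallel-array comprehensions compute; the DPH versions above are evaluated in parallel. The primed names are hypothetical, not part of DPH.

       type Vector'       = [Double]
       type SparseVector' = [(Int, Double)]
       type SparseMatrix' = [SparseVector']

       sdotp' :: SparseVector' -> Vector' -> Double
       sdotp' sv v = sum [ x * (v !! i) | (i, x) <- sv ]   -- sequential analogue of sumP and (!:)

       smvm' :: SparseMatrix' -> Vector' -> Vector'
       smvm' sm v = [ sdotp' row v | row <- sm ]           -- one dot product per matrix row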


  45. [: f x | x <- xs :]
    26
    * Two main components of DPH: (1) vectoriser & (2) backend array library
    * Implement as much as possible as a library (minimally invasive to GHC: type families,
    rewrite rules)
    * Based on GHC, including its highly optimising backend
    * For better or worse, we aim to compete with C!


  46. [: f x | x <- xs :]
    Vectoriser
    map (lam $ \x -> f `app` x) xs
    Generalisation of
    NESL's flattening
    26
    * Two main components of DPH: (1) vectoriser & (2) backend array library
    * Implement as much as possible as a library (minimally invasive to GHC: type families,
    rewrite rules)
    * Based on GHC, including its highly optimising backend
    * For better or worse, we aim to compete with C!


  47. [: f x | x <- xs :]
    Vectoriser
    map (lam $ \x -> f `app` x) xs
    map :: (..) => (a :-> b)
    -> Array a -> Array b
    Import
    Generalisation of
    NESL's flattening
    Backend library of
    flat, segmented
    parallel arrays
    26
    * Two main components of DPH: (1) vectoriser & (2) backend array library
    * Implement as much as possible as a library (minimally invasive to GHC: type families,
    rewrite rules)
    * Based on GHC, including its highly optimising backend
    * For better or worse, we aim to compete with C!


  48. [: f x | x <- xs :]
    Vectoriser
    map (lam $ \x -> f `app` x) xs
    Optimise
    map :: (..) => (a :-> b)
    -> Array a -> Array b
    Import
    Generalisation of
    NESL's flattening
    Backend library of
    flat, segmented
    parallel arrays
    26
    * Two main components of DPH: (1) vectoriser & (2) backend array library
    * Implement as much as possible as a library (minimally invasive to GHC: type families,
    rewrite rules)
    * Based on GHC, including its highly optimising backend
    * For better or worse, we aim to compete with C!


  49. [: f x | x <- xs :]
    Vectoriser
    map (lam $ \x -> f `app` x) xs
    Optimise
    Code generation
    Start:
    %cond = icmp eq i32 %a, %b
    br i1 %cond, label %IfEqual, label %IfUnequal
    map :: (..) => (a :-> b)
    -> Array a -> Array b
    Import
    Generalisation of
    NESL's flattening
    Backend library of
    flat, segmented
    parallel arrays
    26
    * Two main components of DPH: (1) vectoriser & (2) backend array library
    * Implement as much as possible as a library (minimally invasive to GHC: type families,
    rewrite rules)
    * Based on GHC, including its highly optimising backend
    * For better or worse, we aim to compete with C!
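     As a rough illustration of what the backend library of flat, segmented parallel arrays stores after vectorisation, here is a minimal sketch assuming the vector package. The names Segmented, segLens, segData and sumSegmented are illustrative only; they are not the actual DPH representation or API.

       import qualified Data.Vector.Unboxed as U

       -- A nested array is kept as flat data plus a segment descriptor
       -- (the length of each nested subarray).
       data Segmented a = Segmented
         { segLens :: U.Vector Int   -- length of each segment
         , segData :: U.Vector a     -- all elements, concatenated
         }

       -- Segmented sum: one result per segment. In DPH the corresponding
       -- primitive traverses the flat data in parallel.
       sumSegmented :: Segmented Double -> U.Vector Double
       sumSegmented (Segmented lens xs) = U.fromList (go 0 (U.toList lens))
         where
           go _ []       = []
           go i (l : ls) = U.sum (U.slice i l xs) : go (i + l) ls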


  50. Repa
    Regular parallel arrays
    also: turnip
    Dialing it down
    27
    * Two classes of inefficiencies with DPH: (1) due to vectorisation and (2) due to
    implementation of flat, parallel arrays in the backend library
    * Repa: focus on solving one problem (without worrying about the other)


  51. Flat
    Nested
    Amorphous
    Repa
    Accelerate
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    28
    * End user API for flat DPH backend library
    * Restricted expressiveness, programmer guided array representations
    * Also, rectangular arrays are better suited for matrix algorithms and similar
    * Shape polymorphism & track dimensionality with types
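     To illustrate the shape polymorphism mentioned above, here is a minimal sketch assuming the repa package: the same reduction works for any rank because the shape variable sh is tracked in the type. The function name sumInner is made up for the example.

       import Data.Array.Repa as R

       -- Fold away the innermost dimension of an array of any rank:
       -- for a DIM2 array this sums each row, for DIM3 each innermost vector.
       sumInner :: (Shape sh, Monad m)
                => Array U (sh :. Int) Double -> m (Array U sh Double)
       sumInner = foldP (+) 0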


  52. mmultP :: Monad m
     => Array U DIM2 Double -> Array U DIM2 Double
     -> m (Array U DIM2 Double)
    29


  53. mmultP arr brr =
    let (Z :. rs :. _ ) = extent arr
    (Z :. _ :. cs) = extent brr
    in do
    trr <- computeP (transpose brr)
    :: Array U DIM2 Double
    computeP
    (fromFunction (Z :. rs :. cs)
    (dotProduct trr))
     mmultP :: Monad m
     => Array U DIM2 Double -> Array U DIM2 Double
     -> m (Array U DIM2 Double)
    29


  54. mmultP arr brr =
    let (Z :. rs :. _ ) = extent arr
    (Z :. _ :. cs) = extent brr
    in do
    trr <- computeP (transpose brr)
    :: Array U DIM2 Double
    computeP
    (fromFunction (Z :. rs :. cs)
    (dotProduct trr))
     mmultP :: Monad m
     => Array U DIM2 Double -> Array U DIM2 Double
     -> m (Array U DIM2 Double)
    force to be manifest
    29


  55. mmultP arr brr =
    let (Z :. rs :. _ ) = extent arr
    (Z :. _ :. cs) = extent brr
    in do
    trr <- computeP (transpose brr)
    :: Array U DIM2 Double
    computeP
    (fromFunction (Z :. rs :. cs)
    (dotProduct trr))
     mmultP :: Monad m
     => Array U DIM2 Double -> Array U DIM2 Double
     -> m (Array U DIM2 Double)
    force to be manifest
    second stage
    29


  56. mmultP arr brr =
       let (Z :. rs :. _ ) = extent arr
           (Z :. _ :. cs) = extent brr
       in do
         trr <- computeP (transpose brr)
                  :: Array U DIM2 Double
         computeP
           (fromFunction (Z :. rs :. cs)
                         (dotProduct trr))

     dotProduct (Z :. i :. j) =
       sumAllS (zipWith (*) a_row b_row)
       where
         a_row = slice arr (row i)
         b_row = slice trr (row j)
     29
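     Since the slide shows only a fragment (it relies on helpers such as row that are defined elsewhere), here is a small, self-contained Repa sketch in the same style, assuming the repa package: a delayed computation forced in parallel with computeP, with the representation (U) and dimensionality (DIM2) tracked in the types. The names step and the 3x4 example matrix are made up for illustration.

       import Data.Array.Repa as R

       -- Double every element of a 3x4 matrix in parallel.
       step :: Monad m => Array U DIM2 Double -> m (Array U DIM2 Double)
       step arr = computeP (R.map (* 2) arr)

       main :: IO ()
       main = do
         let arr = fromListUnboxed (Z :. 3 :. 4) [1 .. 12] :: Array U DIM2 Double
         brr <- step arr     -- computeP runs in a monad to pin down evaluation order
         print (toList brr)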


  57. [: f x | x <- xs :]
    Vectoriser
    Generalisation of
    NESL's flattening
    map (\x -> f x) xs
    Optimise
    Code generation
    Start:
    %cond = icmp eq i32 %a, %b
    br i1 %cond, label %IfEqual, label %IfUnequal
    map :: (..) => (a -> b)
    -> Array s a -> Array s b
    Import
    Backend library of
    multi-dimensional
    parallel arrays
    30


  58. [: f x | x <- xs :]
    Vectoriser
    Generalisation of
    NESL's flattening
    map (\x -> f x) xs
    Optimise
    Code generation
    Start:
    %cond = icmp eq i32 %a, %b
    br i1 %cond, label %IfEqual, label %IfUnequal
    map :: (..) => (a -> b)
    -> Array s a -> Array s b
    Import
    Backend library of
    multi-dimensional
    parallel arrays
    30


  59. “Parallelising all of Haskell
    with C-like performance
    is a challenge.”
    31


  60. Backend code generation
    DPH
    Repa
    32


  61. Backend code generation
    DPH
    Repa
    Native code generation:
    32


  62. Backend code generation
    DPH
    Repa
    Carefully crafted code in the library (e.g., reloads for
    stencils)
    We wrote an LLVM backend (still aliasing issues)
    Native code generation:
    [Haskell 2010]
    [Haskell 2011]
    32


  63. Interaction with
    GHC optimisations
    DPH
    Repa
    33


  64. Interaction with
    GHC optimisations
    DPH
    Repa
    GHC has a complex pipeline:
    33


  65. Interaction with
    GHC optimisations
    DPH
    Repa
    Firing of rewrite rules and other transformations
    Code explosion due to inlining and code duplication
    GHC has a complex pipeline:
    33


  66. DPH
    Repa Array fusion
    34


  67. DPH
    Repa
    Challenging research program on its own:
    Array fusion
    34


  68. DPH
    Repa
    Invented equational array fusion and stream fusion
    Now looking at series expressions
    Challenging research program on its own:
    Array fusion
    [Haskell 2013: Ben's talk on Tuesday]
    34
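     As a reminder of what stream fusion buys, here is a minimal sketch of the idea, illustrative only and not the actual DPH or vector implementation: producers and consumers communicate through a non-recursive Stream type, so a composition such as sumS (mapS f (streamList xs)) compiles to a single loop with no intermediate structure once GHC inlines the definitions.

       {-# LANGUAGE ExistentialQuantification, BangPatterns #-}

       data Step s a = Done | Yield a s

       data Stream a = forall s. Stream (s -> Step s a) s

       streamList :: [a] -> Stream a
       streamList = Stream next
         where
           next []       = Done
           next (x : xs) = Yield x xs

       mapS :: (a -> b) -> Stream a -> Stream b
       mapS f (Stream next s0) = Stream next' s0
         where
           next' s = case next s of
                       Done       -> Done
                       Yield x s' -> Yield (f x) s'

       sumS :: Num a => Stream a -> a
       sumS (Stream next s0) = go 0 s0
         where
           go !acc s = case next s of
                         Done       -> acc
                         Yield x s' -> go (acc + x) s'

       -- e.g. sumS (mapS (* 2) (streamList [1 .. 10])) == 110, in one pass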


  69. DPH
    Repa Array representations
    35
    DPH: flattening of arrays of structured types with type families


  70. DPH
    Repa
    Invented Haskell type families to express type-dependent
    representations
    Repa type indices: declarative control of representations
    Array representations
    [Haskell 2012a]
    [POPL 2005, ICFP 2005, ICFP 2008]
    35
    DPH: flattening of arrays of structured types with type families
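     A minimal sketch of the kind of type-dependent representation that data families make possible, in the spirit of the [POPL 2005, ICFP 2005] work: the names PArr and PArray are illustrative and do not match the actual DPH library, and the sketch assumes the vector package.

       {-# LANGUAGE TypeFamilies #-}

       import qualified Data.Vector.Unboxed as U

       -- The array representation is chosen by the element type.
       data family PArr a

       newtype instance PArr Double = PDouble (U.Vector Double)   -- unboxed doubles
       data    instance PArr (a, b) = PPair (PArr a) (PArr b)     -- arrays of pairs as pairs of arrays

       -- Nested arrays: a concrete wrapper plus segment lengths over flat data.
       newtype PArray a = PArray (PArr a)

       data instance PArr (PArray a) = PNested (U.Vector Int) (PArr a)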


  71. [Figure 12, "Runtimes for Fluid Flow Solver": relative runtime vs. matrix width (64-1024) for C Gauss-Seidel, C Jacobi, and Repa with 1, 2, and 8 threads; the accompanying caption discusses how the C version swaps its two buffers after every relaxation iteration to improve memory locality and reduce the data cache-miss rate.]
     36


  72. Overhead of vectorised code
    DPH
    37


  73. Overhead of vectorised code
    DPH
    GHC's simplifier removes explicit closures etc.
    We invented vectorisation avoidance
    [Haskell 2012b]
    37


  74. Work and space complexity
    of vectorisation
    DPH
    38


  75. Work and space complexity
    of vectorisation
    DPH
    Fusion helps in some cases (but fragile)
    Work complexity: invented virtual segment descriptors
    [ICFP 2012]
    38


  76. [Benchmark plot: sparse matrix-vector multiplication (1% non-zero elements), run time (100µs-1s, log scale) vs. number of non-zero elements (100-10M), comparing baseline DPH (out of memory at 150k elements), new DPH, and Data.Vector.]
     39


  77. [Benchmark plot: Barnes-Hut, 1 step, 1 thread, run time (1ms-100s, log scale) vs. number of bodies (10-100k), comparing baseline DPH (out of memory at 5000 bodies), new DPH, and Data.Vector.]
     40


  78. “We are still not
    out of the woods.”
    41


  79. Series expression work not integrated
    Backend library lacks optimisations
    Vectorisation avoidance needs to be extended
    DPH
    DPH
    42


  80. Part 4
    Second-class data parallelism:
    Embedded languages
    43
    * Many of the discussed problems exist or are made worse by…
    * (1) parallelising a very expressive language (2) implemented in a very sophisticated
    compiler
    » We surely need more modest goals if we target more challenging hardware


  81. Data.Array.Accelerate
    Pivoting
    44
    » Accelerate is an embedded language…


  82. Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    Accelerate
    45
    » …aimed at general-purpose GPU programming.


  83. Flat
    Nested
    Amorphous
    Repa
    Data Parallel
    Haskell
    Embedded
    (2nd class)
    Full
    (1st class)
    CPU
    GPU
    Accelerate
    45
    » …aimed at general-purpose GPU programming.


  84. dotp :: [:Float:] -> [:Float:] -> Float
     dotp xs ys = sumP (zipWithP (*) xs ys)
    Data Parallel Haskell
    46
    * Dot product in Data Parallel Haskell


  85. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    47
    * Same code embedded
    * 'fold' is parallel tree-reduction
    » The main difference are the types — let's look at them in more detail…


  86. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    Embedded Accelerate arrays
    47
    * Same code embedded
    * 'fold' is parallel tree-reduction
    » The main difference are the types — let's look at them in more detail…


  87. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    Acc marks embedded array computations
    Embedded Accelerate arrays
    47
    * Same code embedded
    * 'fold' is parallel tree-reduction
    » The main difference are the types — let's look at them in more detail…


  88. data Array sh e
    type Vector e = Array DIM1 e
    type Scalar e = Array DIM0 e
    48
    * Acc computations always produce arrays — hence, Scalar
    * Let's ignore the type class context for the moment
    * For comparison, the list version of zipWith on top in grey


  89. data Array sh e
    type Vector e = Array DIM1 e
    type Scalar e = Array DIM0 e
    Regular, shape-polymorphic arrays
    48
    * Acc computations always produce arrays — hence, Scalar
    * Let's ignore the type class context for the moment
    * For comparison, the list version of zipWith on top in grey


  90. data Array sh e
    type Vector e = Array DIM1 e
    type Scalar e = Array DIM0 e
    Regular, shape-polymorphic arrays
    zipWith :: (a -> b -> c)
    -> [a] -> [b] -> [c]
    zipWith :: (…)
    => (Exp a -> Exp b -> Exp c)
    -> Acc (Array sh a)
    -> Acc (Array sh b)
    -> Acc (Array sh c)
    Embedded scalar expressions
    48
    * Acc computations always produce arrays — hence, Scalar
    * Let's ignore the type class context for the moment
    * For comparison, the list version of zipWith on top in grey


  91. Acc marks embedded array computations
    Embedded Accelerate arrays
    dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    49
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory


  92. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    49
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory


  93. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    = sum (zipWith (*) xs ys )
    49
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory


  94. dotp :: Acc (Vector Float)
    -> Acc (Vector Float)
    -> Acc (Scalar Float)
    dotp xs ys
    =
    sum (zipWith (*) xs ys )
    49
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory


  95. dotp :: Vector Float
          -> Vector Float
          -> Acc (Scalar Float)
     dotp xs ys
       = let xs' = use xs       -- use embeds host arrays
             ys' = use ys
         in
         sum (zipWith (*) xs' ys')
     49
    * What if we want to process host data?
    * Using data from the host language in the embedded language
    * GPUs: explicit memory transfer from host to GPU memory
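     Putting the pieces together, here is a hedged, self-contained version of the example, assuming the accelerate package. It uses the reference interpreter backend (on a GPU one would import run from the CUDA backend instead) and spells the reduction as fold (+) 0, the parallel tree reduction mentioned in the notes; the example sizes and values are made up.

       import Data.Array.Accelerate             as A
       import Data.Array.Accelerate.Interpreter (run)   -- swap for the GPU backend's run on CUDA hardware

       dotp :: Vector Float -> Vector Float -> Scalar Float
       dotp xs ys = run $ A.fold (+) 0
                        $ A.zipWith (*) (use xs) (use ys)

       main :: IO ()
       main = do
         let n  = 10
             xs = A.fromList (Z :. n) [1 ..]     :: Vector Float
             ys = A.fromList (Z :. n) (repeat 2) :: Vector Float
         print (dotp xs ys)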


  96. 50
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code


  97. Embedded languages
    …are restricted languages
    50
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code


  98. Embedded languages
    …are restricted languages
    The embedding partly compensates for restrictions:
    50
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code


  99. Embedded languages
    …are restricted languages
    Seamless integration into host language and…
    …host language can generate embedded code.
    The embedding partly compensates for restrictions:
    50
    Essence of embedded languages (transcending Accelerate)
    * High-level for expressiveness; restricted for efficiency
    * Further abstractions: generate embedded code


  100. map (\x -> x + 1) arr
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…


  101. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…


  102. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…


  103. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    Skeleton instantiation
    __global__ void kernel (float *arr, int n)
    {...
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…


  104. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    Skeleton instantiation
    __global__ void kernel (float *arr, int n)
    {...
    CUDA compiler
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…


  105. map (\x -> x + 1) arr
    Reify AST
    Map (Lam (Add `PrimApp`
    (ZeroIdx, Const 1))) arr
    Optimise
    Skeleton instantiation
    __global__ void kernel (float *arr, int n)
    {...
    CUDA compiler Call
    [DAMP 2011]
    51
    (1) Reify AST; (2) Optimise code (fusion); (3) Generate CUDA code using skeletons
    (a) Compile with the CUDA compiler; (b) Invoke from host code
    » How to implement the skeletons…
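     To make "reify AST" concrete, here is a minimal sketch of a deep embedding, purely illustrative and much simpler than Accelerate's actual AST: overloading Num means that writing ordinary-looking expressions builds a data structure, which the library can then optimise and feed to a code generator.

       -- A tiny embedded expression language whose programs are data.
       data Exp
         = Var String
         | IntLit Int
         | Add Exp Exp
         | Mul Exp Exp
         deriving Show

       instance Num Exp where
         (+)         = Add
         (*)         = Mul
         fromInteger = IntLit . fromInteger
         negate      = Mul (IntLit (-1))
         abs         = error "abs not needed for this sketch"
         signum      = error "signum not needed for this sketch"

       -- The body of \x -> x + 1 reifies to an AST instead of being evaluated:
       example :: Exp
       example = Var "x" + 1   -- Add (Var "x") (IntLit 1)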


  106. mkMap dev aenv fun arr = return $
    CUTranslSkel "map" [cunit|
    $esc:("#include ")
    extern "C" __global__ void
    map ($params:argIn, $params:argOut) {
    const int shapeSize = size(shOut);
    const int gridSize = $exp:(gridSize dev);
    int ix;
    for ( ix = $exp:(threadIdx dev)
    ; ix < shapeSize
    ; ix += gridSize ) {
    $items:(dce x .=. get ix)
    $items:(setOut "ix" .=. f x)
    }
    } |]
    where ...
    52
    * Combinators as skeletons (code templates with holes)
    * Quasi-quoter for CUDA [Mainland]
    * Yellow anti-quotes are the holes (parameters) of the template


  107. [Embed 2013]
    53
    Essence of skeleton-based generative programming (transcending Accelerate)
    * Hand tuned skeleton code
    * Meta programming simplifies code generation
    * FFI to use native libraries etc


  108. Skeletons are templates
    …encapsulating efficient code idioms
    [Embed 2013]
    53
    Essence of skeleton-based generative programming (transcending Accelerate)
    * Hand tuned skeleton code
    * Meta programming simplifies code generation
    * FFI to use native libraries etc


  109. Skeletons are templates
    …encapsulating efficient code idioms
    Code generation as template meta programming
    Foreign function interface as an escape hatch
    [Embed 2013]
    53
    Essence of skeleton-based generative programming (transcending Accelerate)
    * Hand tuned skeleton code
    * Meta programming simplifies code generation
    * FFI to use native libraries etc


  110. Fusion is crucial
     [diagram: a data-flow graph of producer nodes p1-p7 feeding consumer nodes c1 and c2]
     Fusion in 2 phases, exploiting skeleton instantiation
     54
    * Fusion output must be in skeleton form again
    * Fused skeletons share GPU kernels


  111. Fusion is crucial
     [diagram: a data-flow graph of producer nodes p1-p7 feeding consumer nodes c1 and c2]
     Fusion in 2 phases, exploiting skeleton instantiation
     [ICFP 2013: Trevor's talk on Wednesday]
     54
    * Fusion output must be in skeleton form again
    * Fused skeletons share GPU kernels
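     A small Accelerate-level illustration of the producer/producer case, assuming the accelerate package: without fusion the pipeline below would materialise an intermediate array and launch two kernels, while the fused form, which is just the map f . map g = map (f . g) law quoted earlier, performs a single traversal and ends up in one GPU kernel after skeleton instantiation.

       import Data.Array.Accelerate as A

       -- Two producers in a row ...
       unfused :: Acc (Vector Float) -> Acc (Vector Float)
       unfused = A.map (+ 1) . A.map (* 2)

       -- ... fuse into a single one.
       fused :: Acc (Vector Float) -> Acc (Vector Float)
       fused = A.map (\x -> x * 2 + 1)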


  112. [Benchmark plot: dot product, run time (0.1-100 ms, log scale) vs. number of elements (2-20 million), comparing Data.Vector, Repa -N8, NDP2GPU, Accelerate without and with fusion, and CUBLAS.]
     55
     * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
     * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
     * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)


  113. Dot Product
     [Benchmark plot: dot product, run time (0.1-100 ms, log scale) vs. number of elements (2-20 million), comparing Data.Vector, Repa -N8, NDP2GPU, Accelerate without and with fusion, and CUBLAS.]
     55
     * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
     * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
     * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)


  114. 56
     * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
     * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
     * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)


  115. Jos Stam's Fluid Flow Solver
    56
     * C (red) is on one CPU core (Xeon E5405 CPU @ 2 GHz, 64-bit)
     * Repa (blue) is on 7 CPU cores (two quad-core Xeon E5405 CPUs @ 2 GHz, 64-bit)
     * Accelerate (green) is on a Tesla T10 processor (240 cores @ 1.3 GHz)


  116. “Works pretty well.
    Let's do more!”
    57


  117. Incrementally adding nested data parallelism
    How about amorphous data parallelism?
    Need a wider range of applications
    58


  118. Summary
    Data parallelism is a convenient & effective abstraction
    Parallelising full Haskell is possible, but a lot of hard work
    Embedded languages are an attractive compromise
    types >< state languages
    59


  119. References
    [EuroPar 2001] Nepal -- Nested Data-Parallelism in Haskell. Chakravarty, Keller,
    Lechtchinsky & Pfannenstiel. In "Euro-Par 2001: Parallel Processing, 7th Intl.
    Euro-Par Conference", 2001.
    [POPL 2005] Associated Types with Class. Chakravarty, Keller, Peyton Jones &
    Marlow. In Proceedings "The 32nd Annual ACM SIGPLAN-SIGACT Symposium
    on Principles of Programming Languages (POPL'05)", ACM Press, 2005.
    [ICFP 2005] Associated Type Synonyms. Chakravarty, Keller & Peyton Jones. In
    Proceedings of "The Tenth ACM SIGPLAN International Conference on Functional
    Programming", ACM Press, 2005.
    [TLDI 2007] System F with Type Equality Coercions. Sulzmann, Chakravarty,
    Peyton Jones & Donnelly. In Proceedings of "The Third ACM SIGPLAN Workshop
    on Types in Language Design and Implementation", ACM Press, 2007.
    [ICFP 2008] Type Checking with Open Type Functions. Schrijvers, Peyton Jones,
    Chakravarty & Sulzmann. In Proceedings of "ICFP 2008 : The 13th ACM SIGPLAN
    International Conference on Functional Programming", ACM Press, 2008.
    60


  120. [FSTTCS 2008] Harnessing the Multicores: Nested Data Parallelism in Haskell.
    Peyton Jones, Leshchinskiy, Keller & Chakravarty. In "IARCS Annual Conf. on
     Foundations of Software Technology & Theoretical Computer Science", 2008.
    [ICFP 2010] Regular, shape-polymorphic, parallel arrays in Haskell. Keller,
    Chakravarty, Leshchinskiy, Peyton Jones & Lippmeier. In Proceedings of "ICFP
    2010 : The 15th ACM SIGPLAN Intl. Conf. on Functional Programming", 2010.
    [Haskell 2010] An LLVM Backend for GHC. Terei & Chakravarty. In Proceedings of
    "ACM SIGPLAN Haskell Symposium 2010", ACM Press, 2010.
    [DAMP 2011] Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty,
    Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore
    Programming", ACM Press, 2011.
    [Haskell 2011] Efficient Parallel Stencil Convolution in Haskell. Lippmeier & Keller.
    In Proceedings of "ACM SIGPLAN Haskell Symposium 2011", ACM Press, 2011.
    [ICFP 2012] Work Efficient Higher-Order Vectorisation. Lippmeier, Chakravarty,
    Keller, Leshchinskiy & Peyton Jones. ACM Press, 2012.
    61


  121. [Haskell 2012a] Guiding Parallel Array Fusion with Indexed Types. Lippmeier,
    Chakravarty, Keller & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell
    Symposium 2012", ACM Press, 2012.
    [Haskell 2012b] Vectorisation Avoidance. Keller, Chakravarty, Lippmeier,
    Leshchinskiy & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell
    Symposium 2012", ACM Press, 2012.
    [ICFP 2013] Optimising Purely Functional GPU Programs. McDonell, Chakravarty,
    Keller & Lippmeier. In "ACM SIGPLAN International Conference on Functional
    Programming", ACM Press, 2013.
    [Haskell 2013] Data Flow Fusion with Series Expressions in Haskell. Lippmeier,
    Chakravarty, Keller & Robinson. In Proceedings of "ACM SIGPLAN Haskell
    Symposium 2013", ACM Press, 2013.
    [Embed 2013] Embedding Foreign Code. Clifton-Everest, McDonell, Chakravarty
     & Keller. Submitted for publication, 2013. http://www.cse.unsw.edu.au/~chak/papers/CMCK14.html
    62


  122. Images from
    http://wikimedia.org
    http://openclipart.org
    63
