Slide 1

DATA PARALLEL PROGRAMMING IN HASKELL: An Overview. Manuel M T Chakravarty, University of New South Wales. Includes joint work with Gabriele Keller, Sean Lee, Roman Leshchinskiy, Ben Lippmeier, Trevor McDonell, and Simon Peyton Jones.

Slide 2

Ubiquitous Parallelism

Slide 3

It's parallelism (image: a venerable supercomputer, http://en.wikipedia.org/wiki/File:Cray_Y-MP_GSFC.jpg)

Slide 4

It's parallelism ...but not as we know it! (images: a venerable supercomputer, http://en.wikipedia.org/wiki/File:Cray_Y-MP_GSFC.jpg; a multicore GPU, http://en.wikipedia.org/wiki/File:GeForce_GT_545_DDR3.jpg; a multicore CPU, http://en.wikipedia.org/wiki/File:Intel_CPU_Core_i7_2600K_Sandy_Bridge_top.jpg)

Slide 5

Our goals

Slide 6

Our goals Exploit parallelism of commodity hardware easily: ‣ Performance is important, but… ‣ …productivity is more important.

Slide 7

Our goals Exploit parallelism of commodity hardware easily: ‣ Performance is important, but… ‣ …productivity is more important. Semi-automatic parallelism ‣ Programmer supplies a parallel algorithm ‣ No explicit concurrency (no concurrency control, no races, no deadlocks)

Slide 8

Graphics [Haskell 2012a] Ray tracing

Slide 9

Computer Vision [Haskell 2011] Edge detection

Slide 10

[Chart: runtime (ms) vs. threads (on 8 PEs), Canny on 2xQuad-core 2.0GHz Intel Harpertown, for 1024x1024, 768x768, and 512x512 images; Safe Unrolled Stencil vs. single-threaded OpenCV; from Figure 15, Sobel and Canny runtimes, 100 iterations] Runtimes for Canny edge detection (100 iterations). OpenCV uses SIMD instructions, but only one thread.

Slide 11

Canny edge detection: CPU versus GPU parallelism. GPU version performs post-processing on the CPU. [Chart: Canny (512x512), time (ms) vs. number of CPU threads (1-8), CPU (as before) vs. GPU (NVIDIA Tesla T10)]

Slide 12

Physical Simulation [Haskell 2012a] Fluid flow

Slide 13

[Chart: relative runtime vs. matrix width (64-1024) for C Gauss-Seidel, C Jacobi, Repa -N1, Repa -N2, and Repa -N8; from Figure 12, Runtimes for Fluid Flow Solver] Runtimes for Jos Stam's Fluid Flow Solver. We can beat C!

Slide 14

Jos Stam's Fluid Flow Solver: CPU versus GPU performance. GPU beats the CPU (includes all transfer times). [Chart: fluid flow (1024x1024), time (ms) vs. number of CPU threads (1, 2, 4, 8), CPU vs. GPU]

Slide 15

Medical Imaging [Haskell 2012a] Interpolation of a slice through a 256 × 256 × 109 × 16-bit data volume of an MRI image

Slide 16

Functional Parallelism

Slide 17

Our ingredients Control effects, not concurrency Types guide data representation and behaviour Bulk-parallel aggregate operations

Slide 18

Our ingredients Control effects, not concurrency Types guide data representation and behaviour Bulk-parallel aggregate operations Haskell is by default pure

Slide 19

Our ingredients Control effects, not concurrency Types guide data representation and behaviour Bulk-parallel aggregate operations Haskell is by default pure Declarative control of operational pragmatics

Slide 20

Our ingredients Control effects, not concurrency Types guide data representation and behaviour Bulk-parallel aggregate operations Haskell is by default pure Declarative control of operational pragmatics Data parallelism

Slide 21

Purity & Types

Slide 22

Purity and parallelism processList :: [Int] -> ([Int], Int) processList list = (sort list, maximum list)

Slide 23

Purity and parallelism processList :: [Int] -> ([Int], Int) processList list = (sort list, maximum list) function argument function body

Slide 24

Purity and parallelism processList :: [Int] -> ([Int], Int) processList list = (sort list, maximum list) function argument function body argument type result type

Slide 25

Purity and parallelism processList :: [Int] -> ([Int], Int) processList list = (sort list, maximum list) function argument function body argument type result type Purity: function result depends only on arguments

Slide 26

Purity and parallelism processList :: [Int] -> ([Int], Int) processList list = (sort list, maximum list) function argument function body argument type result type Purity: function result depends only on arguments Parallelism: execution order only constrained by explicit data dependencies
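For concreteness, the slide's example as a complete, runnable module; the Data.List import and the main function are additions for illustration:

import Data.List (sort)

-- Pure: the result depends only on the argument, so the two components
-- of the pair could safely be computed in parallel.
processList :: [Int] -> ([Int], Int)
processList list = (sort list, maximum list)

main :: IO ()
main = print (processList [3, 1, 4, 1, 5, 9, 2, 6])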

Slide 27

By default pure := Types track purity

Slide 28

By default pure := Types track purity Pure = no effects Impure = may have effects Int IO Int processList :: [Int] -> ([Int], Int) readFile :: FilePath -> IO String (sort list, maximum list) copyFile fn1 fn2 = do data <- readFile fn1 writeFile fn2 data

Slide 29

By default pure := Types track purity Pure = no effects Impure = may have effects Int IO Int processList :: [Int] -> ([Int], Int) readFile :: FilePath -> IO String (sort list, maximum list) copyFile fn1 fn2 = do data <- readFile fn1 writeFile fn2 data

Slide 30

By default pure := Types track purity Pure = no effects Impure = may have effects Int IO Int processList :: [Int] -> ([Int], Int) readFile :: FilePath -> IO String (sort list, maximum list) copyFile fn1 fn2 = do data <- readFile fn1 writeFile fn2 data

Slide 31

By default pure := Types track purity Pure = no effects Impure = may have effects Int IO Int processList :: [Int] -> ([Int], Int) readFile :: FilePath -> IO String (sort list, maximum list) copyFile fn1 fn2 = do data <- readFile fn1 writeFile fn2 data
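A hedged aside: the copyFile fragment above uses `data` as a variable name, which is a reserved word in Haskell, so it does not compile verbatim. A compiling sketch with the variable renamed (the file names in main are hypothetical):

-- Effects are visible in the type: readFile :: FilePath -> IO String,
-- so copyFile necessarily lives in IO as well.
copyFile :: FilePath -> FilePath -> IO ()
copyFile fn1 fn2 = do
  contents <- readFile fn1   -- `data` from the slide renamed, since it is a keyword
  writeFile fn2 contents

main :: IO ()
main = copyFile "input.txt" "output.txt"   -- hypothetical file names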

Slide 32

Types guide execution

Slide 33

Datatypes for parallelism For bulk-parallel, aggregate operations, we introduce a new datatype: Array r sh e

Slide 34

Datatypes for parallelism For bulk-parallel, aggregate operations, we introduce a new datatype: Array r sh e Representation

Slide 35

Datatypes for parallelism For bulk-parallel, aggregate operations, we introduce a new datatype: Array r sh e Representation Shape

Slide 36

Datatypes for parallelism For bulk-parallel, aggregate operations, we introduce a new datatype: Array r sh e Representation Shape Element type

Slide 37

Array r sh e

Slide 38

Representation: determined by a type index; e.g., ‣ D — delayed array (represented as a function) ‣ U — unboxed array (manifest C-style array) Array r sh e r

Slide 39

Representation: determined by a type index; e.g., ‣ D — delayed array (represented as a function) ‣ U — unboxed array (manifest C-style array) Array r sh e Shape: dimensionality of the array ‣ DIM0, DIM1, DIM2, and so on sh

Slide 40

Representation: determined by a type index; e.g., ‣ D — delayed array (represented as a function) ‣ U — unboxed array (manifest C-style array) Array r sh e Shape: dimensionality of the array ‣ DIM0, DIM1, DIM2, and so on Element type: stored in the array ‣ Primitive types (Int, Float, etc.) and tuples e
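A minimal sketch of how these type indices look in practice, assuming the repa package's Data.Array.Repa API: fromListUnboxed builds a manifest U array, map produces a delayed D array, and forcing turns it back into a manifest one.

import Data.Array.Repa as R

-- A manifest, unboxed (U), one-dimensional array of Ints.
xs :: Array U DIM1 Int
xs = R.fromListUnboxed (Z :. 10) [0 .. 9]

-- map yields a delayed (D) array: conceptually a function from index to value.
ys :: Array D DIM1 Int
ys = R.map (* 2) xs

-- Forcing produces a manifest array again (computeS is sequential;
-- computeP is the parallel variant).
main :: IO ()
main = print (R.toList (R.computeS ys :: Array U DIM1 Int))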

Slide 41

zipWith :: (Shape sh, Source r1 a, Source r2 b) => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c

Slide 42

zipWith :: (Shape sh, Source r1 a, Source r2 b) => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c Pure function to be used in parallel

Slide 43

zipWith :: (Shape sh, Source r1 a, Source r2 b) => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c Pure function to be used in parallel Processed arrays

Slide 44

zipWith :: (Shape sh, Source r1 a, Source r2 b) => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c Pure function to be used in parallel Processed arrays Delayed result

Slide 45

zipWith :: (Shape sh, Source r1 a, Source r2 b) => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c Pure function to be used in parallel Processed arrays Delayed result type PC5 = P C (P (S D)(P (S D)(P (S D)(P (S D) X)))) mapStencil2 :: Source r a => Boundary a -> Stencil DIM2 a -> Array r DIM2 a -> Array PC5 DIM2 a

Slide 46

zipWith :: (Shape sh, Source r1 a, Source r2 b) => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c Pure function to be used in parallel Processed arrays Delayed result type PC5 = P C (P (S D)(P (S D)(P (S D)(P (S D) X)))) mapStencil2 :: Source r a => Boundary a -> Stencil DIM2 a -> Array r DIM2 a -> Array PC5 DIM2 a Partitioned result
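To connect the annotations: the delayed result of zipWith only turns into actual work when it is forced, at which point Repa evaluates the elements in parallel. A small hedged sketch; the name addV and the choice of computeP are illustrative, not from the slides:

import Data.Array.Repa as R

-- zipWith returns a delayed (D) array; computeP forces it, evaluating the
-- elements in parallel and yielding a manifest, unboxed (U) array.
addV :: Monad m
     => Array U DIM1 Double -> Array U DIM1 Double -> m (Array U DIM1 Double)
addV xs ys = R.computeP (R.zipWith (+) xs ys)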

Slide 47

A simple example — dot product

Slide 48

A simple example — dot product dotp v w = sumAll (zipWith (*) v w)

Slide 49

A simple example — dot product dotp v w = sumAll (zipWith (*) v w) type Vector r e = Array r DIM1 e dotp :: (Num e, Source r1 e, Source r2 e) => Vector r1 e -> Vector r2 e -> e

Slide 50

A simple example — dot product dotp v w = sumAll (zipWith (*) v w) type Vector r e = Array r DIM1 e dotp :: (Num e, Source r1 e, Source r2 e) => Vector r1 e -> Vector r2 e -> e Elements are any type of numbers…

Slide 51

A simple example — dot product dotp v w = sumAll (zipWith (*) v w) type Vector r e = Array r DIM1 e dotp :: (Num e, Source r1 e, Source r2 e) => Vector r1 e -> Vector r2 e -> e Elements are any type of numbers… …suitable to be read from an array
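Putting the pieces together, a hedged, runnable rendering of the dot product: the slide's sumAll corresponds to sumAllS (sequential) or sumAllP (parallel) in current repa releases, which also require the Unbox and Elt constraints shown here.

import Data.Array.Repa      as R
import Data.Array.Repa.Eval (Elt)
import Data.Vector.Unboxed  (Unbox)

type Vector r e = Array r DIM1 e

dotp :: (Num e, Unbox e, Elt e, Source r1 e, Source r2 e)
     => Vector r1 e -> Vector r2 e -> e
dotp v w = R.sumAllS (R.zipWith (*) v w)

main :: IO ()
main = print (dotp xs ys)
  where
    xs, ys :: Vector U Double
    xs = R.fromListUnboxed (Z :. 3) [1, 2, 3]
    ys = R.fromListUnboxed (Z :. 3) [4, 5, 6]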

Slide 52

Concurrency Parallelism Data Parallelism ABSTRACTION

Slide 53

Concurrency Parallelism Data Parallelism ABSTRACTION Parallelism is safe for pure functions (i.e., functions without external effects)

Slide 54

Concurrency Parallelism Data Parallelism ABSTRACTION Parallelism is safe for pure functions (i.e., functions without external effects) Collective operations have a single conceptual thread of control

Slide 55

Types & Embedded Computations

Slide 56

GPUs require careful program tuning COARSE-GRAINED VERSUS FINE-GRAINED PARALLELISM Core i7 970 CPU NVIDIA GF100 GPU 12 THREADS 24,576 THREADS

Slide 57

GPUs require careful program tuning COARSE-GRAINED VERSUS FINE-GRAINED PARALLELISM Core i7 970 CPU NVIDIA GF100 GPU 12 THREADS 24,576 THREADS ✴SIMD: groups of threads executing in lock step (warps) ✴Need to be careful about control flow

Slide 58

GPUs require careful program tuning COARSE-GRAINED VERSUS FINE-GRAINED PARALLELISM Core i7 970 CPU NVIDIA GF100 GPU 12 THREADS 24,576 THREADS ✴Latency hiding: optimised for regular memory access patterns ✴Optimise memory access ✴SIMD: groups of threads executing in lock step (warps) ✴Need to be careful about control flow

Slide 59

Code generation for embedded code Embedded DSL ‣ Restricted control flow ‣ First-order GPU code Generative approach based on combinator templates

Slide 60

Code generation for embedded code Embedded DSL ‣ Restricted control flow ‣ First-order GPU code Generative approach based on combinator templates ✓ limited control structures

Slide 61

Code generation for embedded code Embedded DSL ‣ Restricted control flow ‣ First-order GPU code Generative approach based on combinator templates ✓ limited control structures ✓ hand-tuned access patterns [DAMP 2011]

Slide 62

Embedded GPU computations Dot product dotp :: Vector Float -> Vector Float -> Acc (Scalar Float) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 (zipWith (*) xs' ys')

Slide 63

Embedded GPU computations Dot product dotp :: Vector Float -> Vector Float -> Acc (Scalar Float) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 (zipWith (*) xs' ys') Haskell array

Slide 64

Embedded GPU computations Dot product dotp :: Vector Float -> Vector Float -> Acc (Scalar Float) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 (zipWith (*) xs' ys') Haskell array Embedded array = desc. of array comps

Slide 65

Embedded GPU computations Dot product dotp :: Vector Float -> Vector Float -> Acc (Scalar Float) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 (zipWith (*) xs' ys') Haskell array Embedded array = desc. of array comps Lift Haskell arrays into EDSL — may trigger host➙device transfer

Slide 66

Embedded GPU computations Dot product dotp :: Vector Float -> Vector Float -> Acc (Scalar Float) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 (zipWith (*) xs' ys') Haskell array Embedded array = desc. of array comps Lift Haskell arrays into EDSL — may trigger host➙device transfer Embedded array computations
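For completeness, the same embedded dot product as a complete program, run here with Accelerate's reference interpreter backend; a GPU backend (such as the CUDA backend from the talk) would be substituted for the interpreter, and exact module names may differ between Accelerate versions.

import Data.Array.Accelerate             as A
import Data.Array.Accelerate.Interpreter (run)   -- reference backend

dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys =
  let xs' = use xs    -- lift Haskell arrays into the EDSL
      ys' = use ys
  in  A.fold (+) 0 (A.zipWith (*) xs' ys')

main :: IO ()
main = do
  let xs = A.fromList (Z :. 3) [1, 2, 3] :: Vector Float
      ys = A.fromList (Z :. 3) [4, 5, 6] :: Vector Float
  print (run (dotp xs ys))   -- executes the embedded computation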

Slide 67

Nested Data Parallelism

Slide 68

Modularity The standard model (Fortran, CUDA, etc.) is flat, regular parallelism The same holds for our libraries for functional data parallelism for multicore CPUs (Repa) and GPUs (Accelerate) But we want more…

Slide 69

smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sumP (dotp sv v) | sv <- sm :]

Slide 70

smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension

Slide 71

smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension Parallel reduction/fold

Slide 72

smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension Parallel reduction/fold Nested parallelism!

Slide 73

smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension Parallel reduction/fold Nested parallelism! Defined in a library: Internally parallel?

Slide 74

smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension Parallel reduction/fold Nested parallelism! Defined in a library: Internally parallel? Nested parallelism?

Slide 75

smvm :: SparseMatrix -> Vector -> Vector smvm sm v = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension Parallel reduction/fold Nested parallelism! Defined in a library: Internally parallel? Nested parallelism? Regular, flat data parallelism is not sufficient!
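To make the nesting concrete, here is a sequential, list-based rendering of smvm; this is an illustrative sketch rather than the DPH code itself: the outer comprehension corresponds to the parallel comprehension over rows, the inner sum to the parallel reduction within each row, and the slide's dotp is spelled out as an index-and-multiply.

-- Each sparse row is a list of (column index, value) pairs.
type SparseMatrix = [[(Int, Double)]]
type Vector       = [Double]

smvm :: SparseMatrix -> Vector -> Vector
smvm sm v = [ sum [ x * (v !! i) | (i, x) <- row ] | row <- sm ]

main :: IO ()
main = print (smvm [[(0, 2), (2, 3)], [(1, 1)]] [10, 20, 30])
-- prints [110.0,20.0]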

Slide 76

Nested parallelism Modular Irregular, nested data structures ‣ Sparse structures, tree structures ‣ Hierarchical decomposition Nesting to arbitrary, dynamic depth: divide & conquer Lots of compiler work: still very experimental! [FSTTCS 2008, ICFP 2012, Haskell 2012b]
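The divide-and-conquer point is usually illustrated with quicksort in the nested data parallelism literature; a list-based sketch of the shape of that computation (in the nested data-parallel setting, the two filters and the two recursive calls would all run in parallel, nesting to dynamically determined depth):

qsort :: [Double] -> [Double]
qsort []       = []
qsort (p : xs) =
  qsort [ x | x <- xs, x < p ]
    ++ [p]
    ++ qsort [ x | x <- xs, x >= p ]

main :: IO ()
main = print (qsort [3, 1, 4, 1, 5, 9, 2, 6])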

Slide 77

Nested Data Parallel Haskell Accelerate Repa FLAT NESTED EMBEDDED FULL

Slide 78

Nested Data Parallel Haskell Accelerate Repa More expressive: harder to implement FLAT NESTED EMBEDDED FULL

Slide 79

Nested Data Parallel Haskell Accelerate Repa More expressive: harder to implement FLAT NESTED EMBEDDED FULL More restrictive: special hardware

Slide 80

Types are at the centre of everything we are doing

Slide 81

Types are at the centre of everything we are doing Types separate pure from effectful code

Slide 82

Types are at the centre of everything we are doing Types separate pure from effectful code Types guide operational behaviour (data representation, use of parallelism, etc.)

Slide 83

Types are at the centre of everything we are doing Types separate pure from effectful code Types guide operational behaviour (data representation, use of parallelism, etc.) Types identify restricted code for specialised hardware, such as GPUs

Slide 84

Types are at the centre of everything we are doing Types separate pure from effectful code Types guide operational behaviour (data representation, use of parallelism, etc.) Types identify restricted code for specialised hardware, such as GPUs Types guide parallelising program transformations

Slide 85

Summary Core ingredients ‣ Control purity, not concurrency ‣ Types guide representations and behaviours ‣ Bulk-parallel operations Get it ‣ Latest Glasgow Haskell Compiler (GHC) ‣ Repa, Accelerate & DPH packages from Hackage (Haskell library repository) Blog: http://justtesting.org/ Twitter: @TacticalGrace

Slide 86

Thank you! This research has in part been funded by the Australian Research Council and by Microsoft Corporation.

Slide 87

[EuroPar 2001] Nepal -- Nested Data-Parallelism in Haskell. Chakravarty, Keller, Lechtchinsky & Pfannenstiel. In "Euro-Par 2001: Parallel Processing, 7th Intl. Euro-Par Conference", 2001.
[FSTTCS 2008] Harnessing the Multicores: Nested Data Parallelism in Haskell. Peyton Jones, Leshchinskiy, Keller & Chakravarty. In "IARCS Annual Conf. on Foundations of Software Technology & Theoretical Computer Science", 2008.
[ICFP 2010] Regular, Shape-Polymorphic, Parallel Arrays in Haskell. Keller, Chakravarty, Leshchinskiy, Peyton Jones & Lippmeier. In Proceedings of "ICFP 2010: The 15th ACM SIGPLAN Intl. Conf. on Functional Programming", 2010.
[DAMP 2011] Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", 2011.

Slide 88

[Haskell 2011] Efficient Parallel Stencil Convolution in Haskell. Lippmeier & Keller. In Proceedings of "ACM SIGPLAN Haskell Symposium 2011", ACM Press, 2011.
[ICFP 2012] Work Efficient Higher-Order Vectorisation. Lippmeier, Chakravarty, Keller, Leshchinskiy & Peyton Jones. In Proceedings of "ICFP 2012: The 17th ACM SIGPLAN Intl. Conf. on Functional Programming", 2012. Forthcoming.
[Haskell 2012a] Guiding Parallel Array Fusion with Indexed Types. Lippmeier, Chakravarty, Keller & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell Symposium 2012", ACM Press, 2012. Forthcoming.
[Haskell 2012b] Vectorisation Avoidance. Keller, Chakravarty, Lippmeier, Leshchinskiy & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell Symposium 2012", ACM Press, 2012. Forthcoming.