
Data Parallel Programming in Haskell — An Overview

This talk presents a high-level overview of our work on data parallel programming in Haskell. I presented it at the MSR Faculty Summit 2012. A video of my talk (preceded by a talk by Don Syme and followed by one by Antonio Cisternino) is at https://research.microsoft.com/apps/video/dl.aspx?id=169976


Manuel Chakravarty

July 17, 2012

Transcript

  1. DATA PARALLEL PROGRAMMING IN HASKELL Manuel M T Chakravarty University

    of New South Wales INCLUDES JOINT WORK WITH Gabriele Keller Sean Lee Roman Leshchinskiy Ben Lippmeier Trevor McDonell Simon Peyton Jones An Overview
  2. Ubiquitous Parallelism

  3. It's parallelism http://en.wikipedia.org/wiki/File:Cray_Y-MP_GSFC.jpg venerable supercomputer

  4. It's parallelism ...but not as we know it! http://en.wikipedia.org/wiki/File:Cray_Y-MP_GSFC.jpg http://en.wikipedia.org/wiki/File:GeForce_GT_545_DDR3.jpg

    http://en.wikipedia.org/wiki/File:Intel_CPU_Core_i7_2600K_Sandy_Bridge_top.jpg venerable supercomputer multicore GPU multicore CPU
  5. Our goals

  6. Our goals Exploit parallelism of commodity hardware easily: ‣ Performance

    is important, but… ‣ …productivity is more important.
  7. Our goals Exploit parallelism of commodity hardware easily: ‣ Performance

    is important, but… ‣ …productivity is more important. Semi-automatic parallelism ‣ Programmer supplies a parallel algorithm ‣ No explicit concurrency (no concurrency control, no races, no deadlocks)
  8. Graphics [Haskell 2012a] Ray tracing

  9. Computer Vision [Haskell 2011] Edge detection

  10. [Figure 15: Sobel and Canny runtimes, 100 iterations. Axes: runtime (ms)

    vs. threads (on 8 PEs); Canny on a 2x quad-core 2.0 GHz Intel Harpertown for 1024x1024, 768x768 and 512x512 images; Safe, Unrolled and Stencil variants against single-threaded OpenCV] Runtimes for Canny edge detection (100 iterations). OpenCV uses SIMD instructions, but only one thread.
  11. Canny edge detection: CPU versus GPU parallelism. The GPU version

    performs post-processing on the CPU. [Chart: Canny (512x512), time (ms) vs. number of CPU threads; CPU (as before) vs. GPU (NVIDIA Tesla T10); CPU times fall from 36.88 ms on 1 thread to 12.62 ms on 7 threads, rising to 15.23 ms on 8]
  12. Physical Simulation [Haskell 2012a] Fluid flow

  13. [Figure 12: Runtimes for Fluid Flow Solver. Axes: relative runtime

    vs. matrix width (64 to 1024); series: C Gauss-Seidel, C Jacobi, Repa -N1, Repa -N2, Repa -N8] Runtimes for Jos Stam's Fluid Flow Solver. We can beat C!
  14. Jos Stam's Fluid Flow Solver: CPU versus GPU performance. The GPU

    beats the CPU (includes all transfer times). [Chart: fluid flow (1024x1024), time (ms) vs. number of CPU threads (1, 2, 4, 8); CPU vs. GPU]
  15. Medical Imaging [Haskell 2012a] Interpolation of a slice through a

    256 × 256 × 109 × 16-bit data volume of an MRI image
  16. Functional Parallelism

  17. Our ingredients Control effects, not concurrency Types guide data representation

    and behaviour Bulk-parallel aggregate operations
  18. Our ingredients Control effects, not concurrency Types guide data representation

    and behaviour Bulk-parallel aggregate operations Haskell is by default pure
  19. Our ingredients Control effects, not concurrency Types guide data representation

    and behaviour Bulk-parallel aggregate operations Haskell is by default pure Declarative control of operational pragmatics
  20. Our ingredients Control effects, not concurrency Types guide data representation

    and behaviour Bulk-parallel aggregate operations Haskell is by default pure Declarative control of operational pragmatics Data parallelism
  21. Purity & Types

  22. Purity and parallelism processList :: [Int] -> ([Int], Int) processList

    list = (sort list, maximum list)
  23. Purity and parallelism processList :: [Int] -> ([Int], Int) processList

    list = (sort list, maximum list) function argument function body
  24. Purity and parallelism processList :: [Int] -> ([Int], Int) processList

    list = (sort list, maximum list) function argument function body argument type result type
  25. Purity and parallelism processList :: [Int] -> ([Int], Int) processList

    list = (sort list, maximum list) function argument function body argument type result type Purity: function result depends only on arguments
  26. Purity and parallelism processList :: [Int] -> ([Int], Int) processList

    list = (sort list, maximum list) function argument function body argument type result type Purity: function result depends only on arguments Parallelism: execution order only constrained by explicit data dependencies
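As a concrete, runnable version of the slide's example (using only the standard library), the function can be written directly; because it is pure, the two components of the result pair have no ordering constraint between them and could safely be evaluated in parallel:

```haskell
import Data.List (sort)

-- Pure function: the result depends only on the argument, so the two
-- components of the pair may be computed in any order, or in parallel.
processList :: [Int] -> ([Int], Int)
processList list = (sort list, maximum list)

main :: IO ()
main = print (processList [3, 1, 2])  -- prints ([1,2,3],3)
```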
  27. By default pure := Types track purity

  28. By default pure := Types track purity Pure = no

    effects Impure = may have effects Int IO Int processList :: [Int] -> ([Int], Int) readFile :: FilePath -> IO String (sort list, maximum list) copyFile fn1 fn2 = do contents <- readFile fn1 writeFile fn2 contents
  29. By default pure := Types track purity Pure = no

    effects Impure = may have effects Int IO Int processList :: [Int] -> ([Int], Int) readFile :: FilePath -> IO String (sort list, maximum list) copyFile fn1 fn2 = do contents <- readFile fn1 writeFile fn2 contents
  30. By default pure := Types track purity Pure = no

    effects Impure = may have effects Int IO Int processList :: [Int] -> ([Int], Int) readFile :: FilePath -> IO String (sort list, maximum list) copyFile fn1 fn2 = do contents <- readFile fn1 writeFile fn2 contents
  31. By default pure := Types track purity Pure = no

    effects Impure = may have effects Int IO Int processList :: [Int] -> ([Int], Int) readFile :: FilePath -> IO String (sort list, maximum list) copyFile fn1 fn2 = do contents <- readFile fn1 writeFile fn2 contents
  32. Types guide execution

  33. Datatypes for parallelism For bulk-parallel, aggregate operations, we introduce a

    new datatype: Array r sh e
  34. Datatypes for parallelism For bulk-parallel, aggregate operations, we introduce a

    new datatype: Array r sh e Representation
  35. Datatypes for parallelism For bulk-parallel, aggregate operations, we introduce a

    new datatype: Array r sh e Representation Shape
  36. Datatypes for parallelism For bulk-parallel, aggregate operations, we introduce a

    new datatype: Array r sh e Representation Shape Element type
  37. Array r sh e

  38. Representation: determined by a type index; e.g., ‣ D —

    delayed array (represented as a function) ‣ U — unboxed array (manifest C-style array) Array r sh e r
  39. Representation: determined by a type index; e.g., ‣ D —

    delayed array (represented as a function) ‣ U — unboxed array (manifest C-style array) Array r sh e Shape: dimensionality of the array ‣ DIM0, DIM1, DIM2, and so on sh
  40. Representation: determined by a type index; e.g., ‣ D —

    delayed array (represented as a function) ‣ U — unboxed array (manifest C-style array) Array r sh e Shape: dimensionality of the array ‣ DIM0, DIM1, DIM2, and so on Element type: stored in the array ‣ Primitive types (Int, Float, etc.) and tuples e
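The delayed-versus-manifest distinction can be sketched in a few lines of plain Haskell; the types `Delayed` and `Manifest` below are hypothetical stand-ins for Repa's `D` and `U` representation indices, not Repa's actual API:

```haskell
-- Hypothetical miniature of Repa's representation index:
-- a delayed array is a size plus an indexing function;
-- a manifest array actually stores its elements.
data Delayed e  = Delayed  Int (Int -> e)   -- like Repa's 'D'
data Manifest e = Manifest [e]              -- stand-in for Repa's unboxed 'U'

-- Mapping over a delayed array is free: it only composes functions,
-- so chains of operations fuse with no intermediate arrays.
mapD :: (a -> b) -> Delayed a -> Delayed b
mapD f (Delayed n ix) = Delayed n (f . ix)

-- A 'compute'-style step forces a delayed array into a manifest one;
-- this is where Repa would do the (parallel) work.
compute :: Delayed e -> Manifest e
compute (Delayed n ix) = Manifest (map ix [0 .. n - 1])

main :: IO ()
main =
  let Manifest xs = compute (mapD (* 2) (Delayed 4 (+ 1)))
  in print xs  -- prints [2,4,6,8]
```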
  41. zipWith :: (Shape sh, Source r1 a, Source r2 b)

    => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c
  42. zipWith :: (Shape sh, Source r1 a, Source r2 b)

    => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c Pure function to be used in parallel
  43. zipWith :: (Shape sh, Source r1 a, Source r2 b)

    => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c Pure function to be used in parallel Processed arrays
  44. zipWith :: (Shape sh, Source r1 a, Source r2 b)

    => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c Pure function to be used in parallel Processed arrays Delayed result
  45. zipWith :: (Shape sh, Source r1 a, Source r2 b)

    => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c Pure function to be used in parallel Processed arrays Delayed result type PC5 = P C (P (S D)(P (S D)(P (S D)(P (S D) X)))) mapStencil2 :: Source r a => Boundary a -> Stencil DIM2 a -> Array r DIM2 a -> Array PC5 DIM2 a
  46. zipWith :: (Shape sh, Source r1 a, Source r2 b)

    => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c Pure function to be used in parallel Processed arrays Delayed result type PC5 = P C (P (S D)(P (S D)(P (S D)(P (S D) X)))) mapStencil2 :: Source r a => Boundary a -> Stencil DIM2 a -> Array r DIM2 a -> Array PC5 DIM2 a Partitioned result
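To make the stencil idea concrete without the Repa dependency, here is a 1-D analogue of what `mapStencil2` does in 2-D; `stencil3` and its constant-zero boundary handling are illustrative inventions, not Repa functions:

```haskell
-- Hypothetical 1-D analogue of Repa's mapStencil2: slide a 3-point
-- stencil over a list, treating out-of-bounds neighbours as 0
-- (a constant-0 boundary condition).
stencil3 :: Num a => (a, a, a) -> [a] -> [a]
stencil3 (l, c, r) xs =
  [ l * get (i - 1) + c * get i + r * get (i + 1)
  | i <- [0 .. length xs - 1] ]
  where
    get i | i < 0 || i >= length xs = 0   -- constant-0 boundary
          | otherwise               = xs !! i

main :: IO ()
main = print (stencil3 (-1, 0, 1) [1, 2, 3, 4])  -- central differences
```

In Repa the interior of the result can use an unrolled, bounds-check-free loop while only the borders pay for boundary handling, which is exactly what the partitioned `PC5` result type records.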
  47. A simple example — dot product

  48. A simple example — dot product dotp v w =

    sumAll (zipWith (*) v w)
  49. A simple example — dot product dotp v w =

    sumAll (zipWith (*) v w) type Vector r e = Array r DIM1 e dotp :: (Num e, Source r1 e, Source r2 e) => Vector r1 e -> Vector r2 e -> e
  50. A simple example — dot product dotp v w =

    sumAll (zipWith (*) v w) type Vector r e = Array r DIM1 e dotp :: (Num e, Source r1 e, Source r2 e) => Vector r1 e -> Vector r2 e -> e Elements are any type of numbers…
  51. A simple example — dot product dotp v w =

    sumAll (zipWith (*) v w) type Vector r e = Array r DIM1 e dotp :: (Num e, Source r1 e, Source r2 e) => Vector r1 e -> Vector r2 e -> e Elements are any type of numbers… …suitable to be read from an array
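Repa itself is a separate package; as a dependency-free sketch, the same dotp can be written over plain lists, with Repa's `sumAll` and `zipWith` played by the Prelude's `sum` and `zipWith`:

```haskell
-- List-based sketch of the slide's dotp. In Repa the zipWith produces
-- a delayed array and sumAll reduces it in parallel; here both steps
-- are the ordinary sequential Prelude functions.
dotp :: Num e => [e] -> [e] -> e
dotp v w = sum (zipWith (*) v w)

main :: IO ()
main = print (dotp [1, 2, 3] [4, 5, 6])  -- prints 32
```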
  52. Concurrency Parallelism Data Parallelism ABSTRACTION

  53. Concurrency Parallelism Data Parallelism ABSTRACTION Parallelism is safe for pure

    functions (i.e., functions without external effects)
  54. Concurrency Parallelism Data Parallelism ABSTRACTION Parallelism is safe for pure

    functions (i.e., functions without external effects) Collective operations have a single conceptual thread of control
  55. Types & Embedded Computations

  56. GPUs require careful program tuning COARSE-GRAINED VERSUS FINE-GRAINED PARALLELISM Core

    i7 970 CPU NVIDIA GF100 GPU 12 THREADS 24,576 THREADS
  57. GPUs require careful program tuning COARSE-GRAINED VERSUS FINE-GRAINED PARALLELISM Core

    i7 970 CPU NVIDIA GF100 GPU 12 THREADS 24,576 THREADS ✴SIMD: groups of threads executing in lock step (warps) ✴Need to be careful about control flow
  58. GPUs require careful program tuning COARSE-GRAINED VERSUS FINE-GRAINED PARALLELISM Core

    i7 970 CPU NVIDIA GF100 GPU 12 THREADS 24,576 THREADS ✴Latency hiding: optimised for regular memory access patterns ✴Optimise memory access ✴SIMD: groups of threads executing in lock step (warps) ✴Need to be careful about control flow
  59. Code generation for embedded code Embedded DSL ‣ Restricted control

    flow ‣ First-order GPU code Generative approach based on combinator templates
  60. Code generation for embedded code Embedded DSL ‣ Restricted control

    flow ‣ First-order GPU code Generative approach based on combinator templates ✓ limited control structures
  61. Code generation for embedded code Embedded DSL ‣ Restricted control

    flow ‣ First-order GPU code Generative approach based on combinator templates ✓ limited control structures ✓ hand-tuned access patterns [DAMP 2011]
  62. Embedded GPU computations Dot product dotp :: Vector Float ->

    Vector Float -> Acc (Scalar Float) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 (zipWith (*) xs' ys')
  63. Embedded GPU computations Dot product dotp :: Vector Float ->

    Vector Float -> Acc (Scalar Float) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 (zipWith (*) xs' ys') Haskell array
  64. Embedded GPU computations Dot product dotp :: Vector Float ->

    Vector Float -> Acc (Scalar Float) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 (zipWith (*) xs' ys') Haskell array Embedded array = desc. of array comps
  65. Embedded GPU computations Dot product dotp :: Vector Float ->

    Vector Float -> Acc (Scalar Float) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 (zipWith (*) xs' ys') Haskell array Embedded array = desc. of array comps Lift Haskell arrays into EDSL — may trigger host➙device transfer
  66. Embedded GPU computations Dot product dotp :: Vector Float ->

    Vector Float -> Acc (Scalar Float) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 (zipWith (*) xs' ys') Haskell array Embedded array = desc. of array comps Lift Haskell arrays into EDSL — may trigger host➙device transfer Embedded array computations
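The key point, that an `Acc` value is a description of an array computation rather than the computation itself, can be sketched as a tiny deep embedding; the `Acc` constructors and `run` below are a hypothetical miniature, not Accelerate's real (much richer) AST:

```haskell
{-# LANGUAGE GADTs #-}

-- Hypothetical miniature of the deep embedding behind Accelerate:
-- array computations are first built as an AST and only then executed
-- by a backend (here a list interpreter; in Accelerate, CUDA codegen).
data Acc a where
  Use     :: [Double] -> Acc [Double]                 -- lift a host array
  ZipWith :: (Double -> Double -> Double)
          -> Acc [Double] -> Acc [Double] -> Acc [Double]
  Fold    :: (Double -> Double -> Double) -> Double
          -> Acc [Double] -> Acc Double

-- A backend "runs" the description; a GPU backend would instead
-- generate device code from the AST and copy arrays to the device.
run :: Acc a -> a
run (Use xs)          = xs
run (ZipWith f xs ys) = zipWith f (run xs) (run ys)
run (Fold f z xs)     = foldr f z (run xs)

dotp :: [Double] -> [Double] -> Double
dotp xs ys = run (Fold (+) 0 (ZipWith (*) (Use xs) (Use ys)))

main :: IO ()
main = print (dotp [1, 2, 3] [4, 5, 6])  -- prints 32.0
```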
  67. Nested Data Parallelism

  68. Modularity Standard (Fortran, CUDA, etc.) is flat, regular parallelism Same

    for our libraries for functional data parallelism for multicore CPUs (Repa) and GPUs (Accelerate) But we want more…
  69. smvm :: SparseMatrix -> Vector -> Vector smvm sm v

    = [: sumP (dotp sv v) | sv <- sm :]
  70. smvm :: SparseMatrix -> Vector -> Vector smvm sm v

    = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension
  71. smvm :: SparseMatrix -> Vector -> Vector smvm sm v

    = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension Parallel reduction/fold
  72. smvm :: SparseMatrix -> Vector -> Vector smvm sm v

    = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension Parallel reduction/fold Nested parallelism!
  73. smvm :: SparseMatrix -> Vector -> Vector smvm sm v

    = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension Parallel reduction/fold Nested parallelism! Defined in a library: Internally parallel?
  74. smvm :: SparseMatrix -> Vector -> Vector smvm sm v

    = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension Parallel reduction/fold Nested parallelism! Defined in a library: Internally parallel? Nested parallelism?
  75. smvm :: SparseMatrix -> Vector -> Vector smvm sm v

    = [: sumP (dotp sv v) | sv <- sm :] Parallel array comprehension Parallel reduction/fold Nested parallelism! Defined in a library: Internally parallel? Nested parallelism? Regular, flat data parallelism is not sufficient!
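The nesting is easy to see in a plain-Haskell sketch: a sparse row is a list of (column, value) pairs, the outer comprehension maps over rows, and each row's reduction is itself a traversal that DPH would run in parallel. The list types below are sequential stand-ins for DPH's parallel arrays:

```haskell
type SparseRow    = [(Int, Double)]  -- (column index, value) pairs
type SparseMatrix = [SparseRow]      -- rows of varying length: irregular!
type Vector       = [Double]

-- Sequential sketch of the slide's smvm: in DPH both the outer
-- comprehension and the inner sum are data parallel, i.e. parallelism
-- nested inside parallelism.
smvm :: SparseMatrix -> Vector -> Vector
smvm sm v = [ sum [ x * (v !! i) | (i, x) <- row ] | row <- sm ]

main :: IO ()
main = print (smvm [[(0, 2), (2, 3)], [(1, 1)]] [1, 1, 1])  -- prints [5.0,1.0]
```

Because the rows have different lengths, a flat, regular array framework cannot express this directly; the vectorising compiler flattens the nested structure instead.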
  76. Nested parallelism Modular Irregular, nested data structures ‣ Sparse structures,

    tree structures ‣ Hierarchical decomposition Nesting to arbitrary, dynamic depth: divide & conquer Lots of compiler work: still very experimental! [FSTTCS 2008, ICFP 2012, Haskell 2012b]
  77. Nested Data Parallel Haskell Accelerate Repa FLAT NESTED EMBEDDED FULL

  78. Nested Data Parallel Haskell Accelerate Repa More expressive: harder to

    implement FLAT NESTED EMBEDDED FULL
  79. Nested Data Parallel Haskell Accelerate Repa More expressive: harder to

    implement FLAT NESTED EMBEDDED FULL More restrictive: special hardware
  80. Types are at the centre of everything we are doing

  81. Types are at the centre of everything we are doing

    Types separate pure from effectful code
  82. Types are at the centre of everything we are doing

    Types separate pure from effectful code Types guide operational behaviour (data representation, use of parallelism, etc.)
  83. Types are at the centre of everything we are doing

    Types separate pure from effectful code Types guide operational behaviour (data representation, use of parallelism, etc.) Types identify restricted code for specialised hardware, such as GPUs
  84. Types are at the centre of everything we are doing

    Types separate pure from effectful code Types guide operational behaviour (data representation, use of parallelism, etc.) Types identify restricted code for specialised hardware, such as GPUs Types guide parallelising program transformations
  85. Summary Core ingredients ‣ Control purity, not concurrency ‣ Types

    guide representations and behaviours ‣ Bulk-parallel operations Get it ‣ Latest Glasgow Haskell Compiler (GHC) ‣ Repa, Accelerate & DPH packages from Hackage (Haskell library repository) Blog: http://justtesting.org/ Twitter: @TacticalGrace
  86. Thank you! This research has in part been funded by

    the Australian Research Council and by Microsoft Corporation.
  87. [EuroPar 2001] Nepal -- Nested Data-Parallelism in Haskell. Chakravarty, Keller,

    Lechtchinsky & Pfannenstiel. In "Euro-Par 2001: Parallel Processing, 7th Intl. Euro-Par Conference", 2001. [FSTTCS 2008] Harnessing the Multicores: Nested Data Parallelism in Haskell. Peyton Jones, Leshchinskiy, Keller & Chakravarty. In "IARCS Annual Conf. on Foundations of Software Technology & Theoretical Computer Science", 2008. [ICFP 2010] Regular, shape-polymorphic, parallel arrays in Haskell. Keller, Chakravarty, Leshchinskiy, Peyton Jones & Lippmeier. In Proceedings of "ICFP 2010 : The 15th ACM SIGPLAN Intl. Conf. on Functional Programming", 2010. [DAMP 2011] Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", 2011.
  88. [Haskell 2011] Efficient Parallel Stencil Convolution in Haskell. Lippmeier &

    Keller. In Proceedings of "ACM SIGPLAN Haskell Symposium 2011", ACM Press, 2011. [ICFP 2012] Work Efficient Higher-Order Vectorisation. Lippmeier, Chakravarty, Keller, Leshchinskiy & Peyton Jones. In Proceedings of "ICFP 2012 : The 17th ACM SIGPLAN Intl. Conf. on Functional Programming", 2012. Forthcoming. [Haskell 2012a] Guiding Parallel Array Fusion with Indexed Types. Lippmeier, Chakravarty, Keller & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell Symposium 2012", ACM Press, 2012. Forthcoming. [Haskell 2012b] Vectorisation Avoidance. Keller, Chakravarty, Lippmeier, Leshchinskiy & Peyton Jones. In Proceedings of "ACM SIGPLAN Haskell Symposium 2012", ACM Press, 2012. Forthcoming.