
Functional High-Performance Computing

Keynote talk at the inaugural workshop on Functional High-Performance and Numerical Computing 2019: https://icfp19.sigplan.org/home/FHPNC-2019

Video: TBD


Trevor L. McDonell

August 18, 2019


  1. Functional High-Performance Computing Trevor L. McDonell Utrecht University AccelerateHS acceleratehs.org

  2. All modern processors have multiple cores
    Ryzen 7 1800X (8 core): 4.8B transistors
    A12 Bionic (2+4 core): 6.9B transistors
    GTX 1080 (2560 core*): 7.2B transistors
  3. GPU (graphics processing unit)
    GTX 1080 (2560 cores*)
    medical imaging, data science, weather & climate, bioinformatics, computational chemistry, machine learning
    software programmable caches, data distribution, thread synchronisation, memory access patterns, control flow divergence
  4. Performance Effort

  5. Performance Effort expected

  6. Performance Effort expected actual

  7. Performance Effort expected actual desired

  8. Performance Effort expected actual desired
    “After expressing available parallelism, I often find that the code has slowed down.”
    Jeff Larkin, NVIDIA Developer Technology
    https://devblogs.nvidia.com/getting-started-openacc/
  9. Can we have parallel programming with less effort?

  10. Part 1: Thinking in parallel

  11. for (int i = 0; i < length; ++i) {
          // do something (in parallel)
      }
  12. Theory Practice

  13. Why is this difficult?
    Concurrency
    Multiple interleaved threads of control
    All threads have effects on the world
    Non-determinism and concurrency control
  14. Data parallelism
    Instead of unrestricted concurrency, let’s simplify
    The same operation is applied to different data
    abstracts over concurrency control
    abstracts over indeterminism
    great for developers and hardware
  15. Energy efficiency

  16. Flat data-parallelism

  17. Nested data-parallelism

  18. Amorphous data-parallelism

  19. Expressiveness of … (Flat, Nested, Amorphous)
    Expressiveness of computation (Embedded, Native)
    Repa, Futhark, Lift, Data-Parallel Haskell, Nessie, Accelerate
  20. Accelerate
    An embedded language for data-parallel arrays
    Haskell/Accelerate program
    Reify and optimise Accelerate program
    Target code
    Compile and run on the CPU/GPU
    Copy result back to Haskell
  21. Example: vector dot product dotp xs ys =

  22. Example: vector dot product
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    [Diagram: elements 1 2 3 4 ⋮ and 5 6 7 8 ⋮ combined pairwise with *]
  23. Example: vector dot product
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
    [Diagram: elements 6 8 10 12 … reduced with +, starting from 0]
  24. Example: vector dot product
    import Prelude
    dotp :: Num a => [a] -> [a] -> a
    dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)
  25. Example: vector dot product
    import Prelude hiding (foldl, zipWith)
    import Data.Vector.Unboxed
    dotp :: (Num a, Unbox a) => Vector a -> Vector a -> a
    dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)
  26. Example: vector dot product
    import Data.Array.Accelerate
    dotp :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
    dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
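
One way to see why the fold in dotp can run in parallel: because (+) is associative, the products can be combined pairwise in a tree instead of strictly left to right. The sketch below is plain list code for illustration only, not how Accelerate implements fold. `treeFold` and `pairUp` are hypothetical names, and the operator is assumed to be associative with the starting value as its identity.

```haskell
-- Illustrative only: a pairwise (tree-shaped) reduction over lists.
-- Assumes f is associative and z is an identity for f.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold f z = go
  where
    go []  = z
    go [x] = x
    go xs  = go (pairUp xs)

    -- Combine neighbours; every combination within one pass is
    -- independent of the others, so a pass could run in parallel.
    pairUp (a:b:rest) = f a b : pairUp rest
    pairUp rest       = rest

-- Dot product written against the tree-shaped fold.
dotp :: Num a => [a] -> [a] -> a
dotp xs ys = treeFold (+) 0 (zipWith (*) xs ys)
```

For example, `dotp [1,2,3] [4,5,6]` evaluates to 32, the same answer a left-to-right fold gives.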
  27. Thinking in parallel
    Computers are good at operating on bulk data, not on single elements
    Restrictions can guide the programmer into writing an efficient parallel program
    Parallel programming and functional programming are a natural fit
  28. Part 2: Make it work

  29. @jasper.samoyed

  30. Show, don’t tell https://github.com/AccelerateHS/accelerate-examples

  31. LULESH Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics https://github.com/tmcdonell/lulesh-accelerate

  32. LULESH
    Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics
    i7-6700K @ 3.4GHz / GTX 1080 Ti

                        Lines of Code   Runtime @ 64³ (s)
    C (OpenMP)          2400            64
    CUDA                3000            5.2
    Accelerate (CPU)    1200            38
    Accelerate (GPU)    ±1              4.1
  33. Salt marsh creek formation https://github.com/tmcdonell/spatial-ecology-accelerate

  34. Salt marsh creek formation
    [Plot: elapsed time (s), 1 to 1000, against grid size, 512 to 8192, for Python+OpenCL and Accelerate on a GTX 1080 Ti]
    7x faster, 2x fewer lines of code
  35. use real examples as a working laboratory

  36. Motivating examples
    validate and test your ideas
    improve performance
    What is good about this, what is bad about it?
    as a basis for future work
    Can we take what’s good as a seed and make it into something that is better?
  37. Part 3: What’s next?

  38. The tower of abstraction

  39. Sequence of abstraction
    Machine language
    Assembly language
    Fortran / C / C++
    C# / Haskell / JavaScript
    “I am working at a higher level; being smart, saving effort!”
  40. mean :: [Double] -> Double
    mean xs = sum xs / fromIntegral (length xs)
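
The definition of mean above walks the list twice, once for sum and once for length, so the whole list has to stay in memory between the two passes. A minimal sketch of a single-pass alternative follows, using strict fields in the accumulator in the style of the next slide. `meanOnePass` and `SumCount` are names made up for this example and do not come from the talk.

```haskell
import Data.List (foldl')

-- Running total and element count; the strict fields keep both
-- evaluated at every step instead of building up thunks.
data SumCount = SumCount !Double !Int

-- Single traversal: accumulate sum and count together.
-- Like the two-pass version, an empty list yields NaN (0 / 0).
meanOnePass :: [Double] -> Double
meanOnePass xs = s / fromIntegral n
  where
    SumCount s n = foldl' step (SumCount 0 0) xs
    step (SumCount acc k) x = SumCount (acc + x) (k + 1)
```

For example, `meanOnePass [1, 2, 3, 4]` evaluates to 2.5.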
  41. data T = MkT Int Int
    [Heap layout: MkT pointing to two boxed values, each I# Int#]
    data T = MkT !Int !Int
    [Heap layout: MkT Int# Int#]
  42. data Word8x4 = Word8x4 !Word8 !Word8 !Word8 !Word8
    [Heap layout: Word8x4 Word8# Word8# Word8# Word8#]
    size in memory = ?
                   = header + 4 * 8 bytes
    ∴ in Haskell: 4 * 8 = 256 . . .
  43. data T a = MkT a
    [Heap layout: MkT a]
    what if I need to know whether ‘a’ has been computed?
    data T a = MkT (IORef (Maybe a))
    [Heap layout: MkT, IORef, Nothing or Just a]
  44. Loss of capability
    Can no longer program in assembly
    Don’t know how values are stored in memory
    Don’t know what the CPU is doing
    The rhetoric is “I shouldn’t have to”
    but the flip side is the loss of ability to
  45. Functional high-performance computing Performance Effort expected actual desired

  46. Functional high-performance computing Performance Effort ? expected actual desired

  47. Functional high-performance computing
    Real-world examples as a laboratory to develop: new features, test performance, …
    Working at a high level is good
    but this also entails a loss of capability
    Reality exists at the low level
  48. Functional high-performance computing
    Functional programming languages provide the right set of abstractions
    Instead of: climbing the tower of abstraction
    Better idea?: feet on the ground; reach for the heavens
  49. acceleratehs.org
    https://github.com/AccelerateHS/
    Trevor L. McDonell, Robert Clifton-Everest, Manuel M. T. Chakravarty, Josh Meredith, Gabriele Keller, Ben Lippmeier
  50. Image attribution
    Logo designed by Tina Lam http://instagram.com/tinabarbarina
    https://en.wikipedia.org/wiki/Waterman_butterfly_projection
    https://www.instagram.com/p/BbTjiebnaw1
    http://book.realworldhaskell.org/read/profiling-and-optimization.html
    https://researchinprogress.tumblr.com/post/34088637501/fast-vs-exact-solutions
    https://researchinprogress.tumblr.com/post/34627563943/when-somebody-mixes-up-causality-and-correlation
    https://researchinprogress.tumblr.com/post/32886698944/how-is-your-research-useful
    https://en.wikipedia.org/wiki/Tower_of_Babel
    https://www.art.com/products/p46922818644-sa-i10543606/paul-souders-polar-bear-swimming-past-melting-iceberg-near-harbor-islands-canada.htm
    http://unseasonably.blogspot.com/2014/03/a-tangled-ball-of-yarn.html
    https://www.reddit.com/r/aww/comments/2oagj8/multithreaded_programming_theory_and_practice/