Slide 1


Functional High-Performance Computing
Trevor L. McDonell
Utrecht University
AccelerateHS: acceleratehs.org

Slide 2


All modern processors have multiple cores:
Ryzen 7 1800X (8 cores): 4.8B transistors
A12 Bionic (2+4 cores): 6.9B transistors
GTX 1080 (2560 cores*): 7.2B transistors

Slide 3


GPU (graphics processing unit)
Application areas: medical imaging, data science, weather & climate, bioinformatics, computational chemistry, machine learning
GTX 1080 (2560 cores*)
Challenges: software-programmable caches, data distribution, thread synchronisation, memory access patterns, control-flow divergence

Slide 4


[Chart: Performance vs. Effort]

Slide 5


[Chart: Performance vs. Effort, with the "expected" curve]

Slide 6


[Chart: Performance vs. Effort, with "expected" and "actual" curves]

Slide 7


[Chart: Performance vs. Effort, with "expected", "actual", and "desired" curves]

Slide 8


[Chart: Performance vs. Effort, with "expected", "actual", and "desired" curves]
"After expressing available parallelism, I often find that the code has slowed down."
— Jeff Larkin, NVIDIA Developer Technology
https://devblogs.nvidia.com/getting-started-openacc/

Slide 9


Can we have parallel programming with less effort?

Slide 10


Part 1: Thinking in parallel

Slide 11


for (int i = 0; i < length; ++i) {
    // do something (in parallel)
}

Slide 12


Theory Practice

Slide 13


Why is this difficult? Concurrency:
Multiple interleaved threads of control
All threads have effects on the world
Non-determinism and concurrency control

Slide 14


Data parallelism: instead of unrestricted concurrency, let's simplify.
The same operation is applied to different data:
abstracts over concurrency control
abstracts over non-determinism
great for developers and hardware
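As a minimal illustration of this idea in Haskell, `parMap` from the `parallel` package (an assumption here, not something named on the slide) applies the same pure function to every element, with scheduling abstracted away; `squares` is a name invented for this sketch:

```haskell
import Control.Parallel.Strategies (parMap, rseq)  -- from the 'parallel' package

-- The same operation applied to different data: no locks, no shared state,
-- and the result is deterministic regardless of how work is scheduled.
squares :: [Int] -> [Int]
squares = parMap rseq (\x -> x * x)

main :: IO ()
main = print (squares [1 .. 10])  -- [1,4,9,16,25,36,49,64,81,100]
```

Because the operation is pure and the elements are independent, the runtime is free to evaluate them in any order, on any number of cores, without changing the answer.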

Slide 15


Energy efficiency

Slide 16


Flat data-parallelism

Slide 17


Nested data-parallelism

Slide 18


Amorphous data-parallelism

Slide 19


[Chart: expressiveness of parallelism (flat, nested, amorphous) vs. expressiveness of computation (embedded, native), positioning Repa, Futhark, Lift, Data Parallel Haskell, Nessie, and Accelerate]

Slide 20


Accelerate: an embedded language for data-parallel arrays
Haskell/Accelerate program → reify and optimise the Accelerate program → target code → compile and run on the CPU/GPU → copy the result back to Haskell

Slide 21


Example: vector dot product

dotp xs ys =

Slide 22


Example: vector dot product

dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

[Diagram: zipWith (*) pairs the elements 1 2 3 4 ⋮ with 5 6 7 8 ⋮ and multiplies them]

Slide 23


Example: vector dot product

dotp xs ys = fold (+) 0 (zipWith (*) xs ys)

[Diagram: fold (+) 0 combines the intermediate values 6 8 10 12 … into a single result]

Slide 24


Example: vector dot product

import Prelude

dotp :: Num a => [a] -> [a] -> a
dotp xs ys = foldr (+) 0 (zipWith (*) xs ys)

Slide 25


Example: vector dot product

import Data.Vector.Unboxed

dotp :: (Num a, Unbox a) => Vector a -> Vector a -> a
dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)

Slide 26


Example: vector dot product

import Data.Array.Accelerate

dotp :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotp xs ys = fold (+) 0 (zipWith (*) xs ys)
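To actually execute `dotp`, the embedded program is handed to a backend's `run` function. A minimal sketch, assuming the `accelerate` package together with the `accelerate-llvm-native` CPU backend (the specific backend is an assumption; the slides mention both CPU and GPU targets):

```haskell
import Data.Array.Accelerate             as A
import Data.Array.Accelerate.LLVM.Native as CPU  -- assumed CPU backend package

dotp :: (A.Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = A.fromList (Z :. 4) [1, 2, 3, 4] :: Vector Float
      ys = A.fromList (Z :. 4) [5, 6, 7, 8] :: Vector Float
  -- 'use' embeds host arrays into the Accelerate program; 'run' compiles
  -- and executes it, then copies the scalar result back to Haskell
  print (CPU.run (dotp (A.use xs) (A.use ys)))
```

Swapping in a GPU backend changes only the `run` import, not the `dotp` definition itself.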

Slide 27


Thinking in parallel:
Computers are good at operating on bulk data, not on single elements
Restrictions can guide the programmer into writing an efficient parallel program
Parallel programming and functional programming are a natural fit

Slide 28


Part 2: Make it work

Slide 29


@jasper.samoyed

Slide 30


Show, don’t tell https://github.com/AccelerateHS/accelerate-examples

Slide 31


LULESH Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics https://github.com/tmcdonell/lulesh-accelerate

Slide 32


LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics)
Lines of code and runtime @ 64³ (s), measured on an i7-6700K @ 3.4GHz / GTX 1080 Ti:
C (OpenMP): 2400 lines, 64 s
CUDA: 3000 lines, 5.2 s
Accelerate (CPU): 1200 lines, 38 s
Accelerate (GPU): ±1 line vs. the CPU version, 4.1 s

Slide 33


Salt marsh creek formation https://github.com/tmcdonell/spatial-ecology-accelerate

Slide 34


Salt marsh creek formation
[Chart: elapsed time (s, 1–1000) vs. grid size (512–8192), comparing Python+OpenCL and Accelerate on a GTX 1080 Ti]
7x faster, with 2x fewer lines of code

Slide 35


Use real examples as a working laboratory

Slide 36


Motivating examples:
validate and test your ideas
improve performance
what is good about this, what is bad about it?
as a basis for future work
Can we take what’s good as a seed and make it into something that is better?

Slide 37


Part 3: What’s next?

Slide 38


The tower of abstraction

Slide 39


Sequence of abstraction:
Machine language
Assembly language
Fortran / C / C++
C# / Haskell / JavaScript
“I am working at a higher level; being smart, saving effort!”

Slide 40


mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)
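This `mean` is the classic space-leak example from the profiling chapter of Real World Haskell (cited in the attributions): `sum` and `length` each traverse the list, so the whole list is retained between the two passes. One common fix, sketched here with illustrative names, is a single strict pass:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- One traversal with a strict accumulator: the list can be garbage
-- collected as it is consumed, so memory use stays constant.
mean :: [Double] -> Double
mean xs = total / fromIntegral count
  where
    (total, count)   = foldl' step (0, 0 :: Int) xs
    step (!s, !n) x  = (s + x, n + 1)
```

The bang patterns force the running sum and count at every step, which is exactly the kind of low-level knowledge the next slides argue the abstraction has hidden.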

Slide 41


data T = MkT Int Int
[Diagram: MkT points to two boxed integers, each an I# Int# heap object]
data T = MkT !Int !Int
[Diagram: with strict, unpacked fields the representation is MkT Int# Int#]

Slide 42


data Word8x4 = Word8x4 !Word8 !Word8 !Word8 !Word8
[Diagram: Word8x4 Word8# Word8# Word8# Word8#]
size in memory = ? = header + 4 * 8 bytes
∴ in Haskell, 4 bytes of data occupy 4 * 8 bytes = 256 bits . . .

Slide 43


data T a = MkT a
[Diagram: MkT points to a]
What if I need to know whether ‘a’ has been computed?
data T a = MkT (IORef (Maybe a))
[Diagram: MkT points to an IORef containing Nothing or Just a]
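The second type on the slide can be sketched as follows; `newT`, `fill`, and `isComputed` are hypothetical names invented for this illustration:

```haskell
import Data.IORef
import Data.Maybe (isJust)

-- Observable evaluation state: the IORef starts as Nothing and is
-- filled in explicitly once the value has been computed.
data T a = MkT (IORef (Maybe a))

newT :: IO (T a)
newT = MkT <$> newIORef Nothing

fill :: T a -> a -> IO ()
fill (MkT ref) x = writeIORef ref (Just x)

isComputed :: T a -> IO Bool
isComputed (MkT ref) = isJust <$> readIORef ref
```

Compare the first version: with a plain `MkT a` field, the value is silently either a thunk or a result, and the program has no way to ask which.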

Slide 44


Loss of capability:
Can no longer program in assembly
Don’t know how values are stored in memory
Don’t know what the CPU is doing
The rhetoric is “I shouldn’t have to”, but the flip side is losing the ability to do so.

Slide 45


Functional high-performance computing
[Chart: Performance vs. Effort, with "expected", "actual", and "desired" curves]

Slide 46


Functional high-performance computing
[Chart: Performance vs. Effort, with "expected", "actual", and "desired" curves, and a "?" marking a new path]

Slide 47


Functional high-performance computing
Real-world examples as a laboratory to develop new features, test performance, …
Working at a high level is good, but it also entails a loss of capability
Reality exists at the low level

Slide 48


Functional high-performance computing
Functional programming languages provide the right set of abstractions
Instead of climbing the tower of abstraction
A better idea? Feet on the ground; reach for the heavens

Slide 49


acceleratehs.org
https://github.com/AccelerateHS/
Trevor L. McDonell, Robert Clifton-Everest, Manuel M. T. Chakravarty, Josh Meredith, Gabriele Keller, Ben Lippmeier

Slide 50


Image attribution
Logo designed by Tina Lam: http://instagram.com/tinabarbarina
https://en.wikipedia.org/wiki/Waterman_butterfly_projection
https://www.instagram.com/p/BbTjiebnaw1
http://book.realworldhaskell.org/read/profiling-and-optimization.html
https://researchinprogress.tumblr.com/post/34088637501/fast-vs-exact-solutions
https://researchinprogress.tumblr.com/post/34627563943/when-somebody-mixes-up-causality-and-correlation
https://researchinprogress.tumblr.com/post/32886698944/how-is-your-research-useful
https://en.wikipedia.org/wiki/Tower_of_Babel
https://www.art.com/products/p46922818644-sa-i10543606/paul-souders-polar-bear-swimming-past-melting-iceberg-near-harbor-islands-canada.htm
http://unseasonably.blogspot.com/2014/03/a-tangled-ball-of-yarn.html
https://www.reddit.com/r/aww/comments/2oagj8/multithreaded_programming_theory_and_practice/