Slide 1

Slide 1 text

Optimising Purely Functional GPU Programs. Trevor L. McDonell, University of New South Wales. Jointly with Manuel M. T. Chakravarty, Gabriele Keller, and Ben Lippmeier. tmcdonell / tlmcdonell

Slide 2

Slide 2 text

High Level Languages

Slide 3

Slide 3 text

High Level Languages: boxed values, polymorphism & generics, composite data structures, immutable structures, higher-order functions & closures

Slide 4

Slide 4 text

High Level Languages: boxed values, polymorphism & generics, composite data structures, immutable structures, higher-order functions & closures

Slide 5

Slide 5 text

High Level Languages: multicore CPU, GPUs, Cluster

Slide 6

Slide 6 text

High Level Languages: multicore CPU, GPUs, Cluster. Function pointers, control flow, memory access patterns, data distribution, decomposition

Slide 7

Slide 7 text

High Level Languages: multicore CPU, GPUs, Cluster. Function pointers, control flow, memory access patterns, data distribution, decomposition. Efficient code?

Slide 8

Slide 8 text

scanl :: (a -> b -> a) -> a -> [b] -> [a]
scanl f q ls = q : (case ls of
                      []   -> []
                      x:xs -> scanl f (f q x) xs)
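The list scanl on the slide fixes the semantics that any parallel GPU implementation must match: each output element is the accumulation of everything before it, starting from the seed. A standalone copy (renamed scanlList here only so it does not clash with the Prelude's scanl) behaves as follows:

```haskell
-- List semantics of scanl, as on the slide: the output is the sequence
-- of running accumulations, beginning with the seed q itself.
scanlList :: (a -> b -> a) -> a -> [b] -> [a]
scanlList f q ls = q : case ls of
                         []     -> []
                         x : xs -> scanlList f (f q x) xs
```

For example, scanlList (+) 0 [1,2,3,4] yields [0,1,3,6,10]: the running sums, including the seed. A GPU scan computes the same result, but with a tree-shaped parallel algorithm rather than this left-to-right recursion.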

Slide 9

Slide 9 text

[Screenshot of paper front page] Fast Scan Algorithms on Graphics Processors. Yuri Dotsenko, Naga K. Govindaraju, Peter-Pike Sloan, Charles Boyd, John Manferdelli (Microsoft Corporation). International Conference on Supercomputing, 2008. The paper presents fast, work-efficient scan and segmented-scan algorithms for GPUs, using novel data representations and shared memory to eliminate bank conflicts, with results of up to 10x higher performance than prior GPU-based algorithms on inputs with millions of elements.

Slide 10

Slide 10 text

[Screenshots of two papers] Fast Scan Algorithms on Graphics Processors. Yuri Dotsenko, Naga K. Govindaraju, Peter-Pike Sloan, Charles Boyd, John Manferdelli (Microsoft Corporation), International Conference on Supercomputing, 2008. Optimising Parallel Prefix Operations for the Fermi Architecture. Mark Harris, Michael Garland (NVIDIA Corporation), published October 2012.

Slide 11

Slide 11 text

[Same two paper screenshots as Slide 10]

Slide 12

Slide 12 text

[Same two paper screenshots as Slide 10]

Slide 13

Slide 13 text

http://jacohaasbroek.com

Slide 14

Slide 14 text

How about embedded languages with specialised code generation?

Slide 15

Slide 15 text

Accelerate An embedded language for GPU programming

Slide 16

Slide 16 text

Accelerate An embedded language for GPU programming

Slide 17

Slide 17 text

Accelerate An embedded language for GPU programming

Slide 18

Slide 18 text

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )
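In list terms the same one-liner reads as below. This is an illustrative plain-Haskell rendering, not the Accelerate API; in Accelerate the combinators are overloaded over embedded arrays, so the analogous definition is what gets compiled for the GPU.

```haskell
import Data.List (foldl')

-- List-level rendering of the slide's dot product: multiply pointwise,
-- then reduce. Accelerate's version has the same shape but works over
-- embedded arrays, e.g. (in the paper's types)
--   dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp :: Num a => [a] -> [a] -> a
dotp xs ys = foldl' (+) 0 (zipWith (*) xs ys)
```

For example, dotp [1,2,3] [4,5,6] multiplies pointwise to 4, 10, 18 and sums them to 32.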

Slide 19

Slide 19 text

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )   [Embedded language arrays]

Slide 20

Slide 20 text

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )   [Embedded language arrays]

Slide 21

Slide 21 text

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )   [Embedded language arrays]

Slide 22

Slide 22 text

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )   [Embedded language arrays; fold and zipWith are from the Accelerate library]

Slide 23

Slide 23 text

#include
typedef DIM1 DimOut;
extern "C" __global__ void zipWith
(
    const DIM1 shIn0,
    const Int64* __restrict__ arrIn0_a0,
    const DIM1 shIn1,
    const Int64* __restrict__ arrIn1_a0,
    const DIM1 shOut,
    Int64* __restrict__ arrOut_a0
)
{
    const int shapeSize = size(shOut);
    const int gridSize  = blockDim.x * gridDim.x;
    int ix;

    for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize) {
        const DimOut sh = fromIndex(shOut, ix);
        const int v0 = toIndex(shIn0, shape(sh));
        const int v1 = toIndex(shIn1, shape(sh));

        arrOut_a0[ix] = arrIn0_a0[v0] * arrIn1_a0[v1];
    }
}

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )   [Embedded language arrays; From Accelerate library]

Slide 24

Slide 24 text

#include
typedef DIM1 DimOut;
extern "C" __global__ void zipWith
(
    const DIM1 shIn0,
    const Int64* __restrict__ arrIn0_a0,
    const DIM1 shIn1,
    const Int64* __restrict__ arrIn1_a0,
    const DIM1 shOut,
    Int64* __restrict__ arrOut_a0
)
{
    const int shapeSize = size(shOut);
    const int gridSize  = blockDim.x * gridDim.x;
    int ix;

    for (ix = blockDim.x * blockIdx.x + threadIdx.x; ix < shapeSize; ix += gridSize) {
        const DimOut sh = fromIndex(shOut, ix);
        const int v0 = toIndex(shIn0, shape(sh));
        const int v1 = toIndex(shIn1, shape(sh));

        arrOut_a0[ix] = arrIn0_a0[v0] * arrIn1_a0[v1];
    }
}

sdata0[threadIdx.x] = y0;
__syncthreads();
ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x);
if (threadIdx.x + 512 < ix) {
    x0 = sdata0[threadIdx.x + 512];
    y0 = y0 + x0;
    sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 256 < ix) {
    x0 = sdata0[threadIdx.x + 256];
    y0 = y0 + x0;
    sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 128 < ix) {
    x0 = sdata0[threadIdx.x + 128];
    y0 = y0 + x0;
    sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x + 64 < ix) {
    x0 = sdata0[threadIdx.x + 64];
    y0 = y0 + x0;
    sdata0[threadIdx.x] = y0;
}
__syncthreads();
if (threadIdx.x < 32) {
    if (threadIdx.x + 32 < ix) {
        x0 = sdata0[threadIdx.x + 32];
        y0 = y0 + x0;
        sdata0[threadIdx.x] = y0;
    }
    if (threadIdx.x + 16 < ix) {
        x0 = sdata0[threadIdx.x + 16];
        y0 = y0 + x0;
        sdata0[threadIdx.x] = y0;
    }
    if (threadIdx.x + 8 < ix) {
        x0 = sdata0[threadIdx.x + 8];
        y0 = y0 + x0;
        sdata0[threadIdx.x] = y0;
    }
    if (threadIdx.x + 4 < ix) {
        x0 = sdata0[threadIdx.x + 4];
        y0 = y0 + x0;

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )   [Embedded language arrays; From Accelerate library]

Slide 25

Slide 25 text

Problem #1: Fusion

Slide 26

Slide 26 text

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )

Slide 27

Slide 27 text

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )   [Skeleton #1, Skeleton #2]

Slide 28

Slide 28 text

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )   [Skeleton #1, Skeleton #2: intermediate array, extra traversal]

Slide 29

Slide 29 text

dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )   [Combined skeleton]
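What the combined skeleton buys can be seen even at list level: a hand-fused dot product traverses both inputs once and allocates no intermediate list. This is only a sketch of the effect, not the code Accelerate generates:

```haskell
-- Hand-fused equivalent of  fold (+) 0 (zipWith (*) xs ys):
-- one pass over both inputs, accumulator forced at each step,
-- no intermediate list of products.
dotpFused :: Num a => [a] -> [a] -> a
dotpFused = go 0
  where
    go acc (x:xs) (y:ys) = let acc' = acc + x * y
                           in  acc' `seq` go acc' xs ys
    go acc _      _      = acc   -- stop at the shorter input
```

On the GPU the saving is larger still: the fused skeleton avoids a whole extra kernel launch and a round trip of the intermediate array through device memory.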

Slide 30

Slide 30 text

Stream fusion? Data.Vector (D. Coutts et al., ICFP '07)

Slide 31

Slide 31 text

Stream fusion? Data.Vector (D. Coutts et al., ICFP '07)

Slide 32

Slide 32 text

Retain the skeleton-based structure of the program

Slide 33

Slide 33 text

map, zipWith, backpermute, generate

Slide 34

Slide 34 text

Producers: map, zipWith, backpermute, generate

Slide 35

Slide 35 text

Producers: map, zipWith, backpermute, generate. Consumers: fold, scanl, permute

Slide 36

Slide 36 text

Producers: map, zipWith, backpermute, generate. Consumers: fold, scanl, permute. Internal representation: fusion friendly
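One way to make the internal representation "fusion friendly" is to keep producers in a delayed form: an extent plus an index function, so that composing producers is just composing functions. The toy version below illustrates the idea; the names (generateD, mapD, and so on) are illustrative stand-ins, not Accelerate's internal constructors.

```haskell
-- A toy "delayed array": an extent plus a function from index to element.
data Delayed a = Delayed { extent :: Int, index :: Int -> a }

generateD :: Int -> (Int -> a) -> Delayed a
generateD = Delayed

-- Producer/producer fusion is function composition: no array is built.
mapD :: (a -> b) -> Delayed a -> Delayed b
mapD f (Delayed n g) = Delayed n (f . g)

backpermuteD :: Int -> (Int -> Int) -> Delayed a -> Delayed a
backpermuteD n p (Delayed _ g) = Delayed n (g . p)

zipWithD :: (a -> b -> c) -> Delayed a -> Delayed b -> Delayed c
zipWithD f (Delayed n g) (Delayed m h) =
  Delayed (min n m) (\i -> f (g i) (h i))

-- Forcing a delayed array materialises it, once, at the very end.
force :: Delayed a -> [a]
force (Delayed n g) = map g [0 .. n - 1]
```

With this representation, mapD f (mapD g xs) collapses to a single index function f . g . index, so arbitrarily long producer pipelines materialise nothing until force.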

Slide 37

Slide 37 text

Producers: map, zipWith, backpermute, generate. Consumers: fold, scanl, permute. Internal representation: fusion friendly. Embed producer into consumer skeleton

Slide 38

Slide 38 text

Producers: map, zipWith, backpermute, generate. Consumers: fold, scanl, permute. Internal representation: fusion friendly. Embed producer into consumer skeleton

Slide 39

Slide 39 text

Fusing networks of skeletons [diagram: a network of producers p1–p7 feeding consumers c1 and c2]

Slide 40

Slide 40 text

Fusing networks of skeletons [diagram: producers p1–p7 and consumers c1, c2]. Phase 1: producer/producer fusion

Slide 41

Slide 41 text

Fusing networks of skeletons [diagram: producers p1–p7 and consumers c1, c2]. Phase 2: consumer/producer fusion

Slide 42

Slide 42 text

Fusing networks of skeletons [diagram: producers p1–p7 and consumers c1, c2]. Phase 2: consumer/producer fusion. Single fused skeleton
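Embedding a producer into a consumer skeleton then amounts to the consumer reading its elements straight from the producer's index function, so the intermediate array never exists. A self-contained toy sketch (again with illustrative names; list indexing stands in for device-array reads):

```haskell
-- Toy delayed producer: extent plus index function (illustrative only).
data Delayed a = Delayed Int (Int -> a)

-- Consumer/producer fusion: the fold pulls elements directly from the
-- producer's index function; nothing is materialised in between.
foldD :: (b -> a -> b) -> b -> Delayed a -> b
foldD f z (Delayed n g) = go z 0
  where
    go acc i
      | i < n     = let acc' = f acc (g i) in acc' `seq` go acc' (i + 1)
      | otherwise = acc

-- dotp = fold (+) 0 (zipWith (*) xs ys), as a single fused traversal.
dotpD :: Num a => [a] -> [a] -> a
dotpD xs ys =
  let n = min (length xs) (length ys)
  in  foldD (+) 0 (Delayed n (\i -> (xs !! i) * (ys !! i)))
```

In the real implementation the same move happens at code-generation time: the producer's body is spliced into the consumer's CUDA skeleton, yielding the single fused kernel on the slide.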

Slide 43

Slide 43 text

Fusing networks of skeletons [diagram: producers p1–p7 and consumers c1, c2]. Phase 2: consumer/producer fusion. Single fused skeleton (see paper for details)

Slide 44

Slide 44 text

let inc = (+) 1
in  let nine = let three = inc 2
               in  (*) three three
    in
    (-) (inc nine) nine

[diagram: abstract syntax DAG of this term, with @ nodes for applications]

Slide 45

Slide 45 text

[build of Slide 44: the same term and AST diagram]

Slide 46

Slide 46 text

[build of Slide 44: the same term and AST diagram, progressively highlighted]

Slide 47

Slide 47 text

[build of Slide 44: the same term and AST diagram, progressively highlighted]

Slide 48

Slide 48 text

[build of Slide 44: the same term and AST diagram, progressively highlighted]

Slide 49

Slide 49 text

[build of Slide 44: the same term and AST diagram, progressively highlighted]

Slide 50

Slide 50 text

Problem #2: Sharing

Slide 51

Slide 51 text

[same term and AST diagram as Slide 44]

Slide 52

Slide 52 text

let inc = (+) 1
in  let nine = let three = inc 2
               in  (*) three three
    in
    (-) (inc nine) nine

[diagram: abstract syntax DAG of this term] Without sharing this is evaluated 4 times
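The blow-up is easy to observe in a miniature deep embedding: the host language's let is invisible to the embedded AST, so every use of a shared binding copies the subtree. All names below are illustrative, not Accelerate's:

```haskell
-- A tiny deep embedding with no sharing construct.
data Exp = Lit Int | Add Exp Exp | Mul Exp Exp | Sub Exp Exp
  deriving (Eq, Show)

eval :: Exp -> Int
eval (Lit n)   = n
eval (Add a b) = eval a + eval b
eval (Mul a b) = eval a * eval b
eval (Sub a b) = eval a - eval b

-- How many copies of subterm t occur in e?
count :: Exp -> Exp -> Int
count t e
  | t == e    = 1
  | otherwise = case e of
      Lit _   -> 0
      Add a b -> count t a + count t b
      Mul a b -> count t a + count t b
      Sub a b -> count t a + count t b

-- The slide's term: Haskell's let shares nothing in the embedded AST.
example :: Exp
example =
  let inc x = Add x (Lit 1)     -- inc   = (+) 1
      three = inc (Lit 2)       -- three = inc 2
      nine  = Mul three three   -- nine  = three * three
  in  Sub (inc nine) nine       -- (-) (inc nine) nine
```

Here count reports four copies of the three subterm, matching the slide: two inside each copy of nine, and nine itself appears twice.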

Slide 53

Slide 53 text

Syntactic? (E. Axelsson, ICFP '12)

Slide 54

Slide 54 text

Syntactic? (E. Axelsson, ICFP '12)

Slide 55

Slide 55 text

Type-preserving sharing recovery: preserve the tree structure of the program

Slide 56

Slide 56 text

Sharing Recovery

let inc = (+) 1
in  let nine = let three = inc 2
               in  (*) three three
    in
    (-) (inc nine) nine

[diagram: abstract syntax DAG of this term]

Slide 57

Slide 57 text

Sharing Recovery [diagram: abstract syntax DAG of the example term]

Slide 58

Slide 58 text

Sharing Recovery. Phase 1: prune shared subtrees [diagram: shared subtrees replaced by stable-name placeholders]

Slide 59

Slide 59 text

Sharing Recovery. Phase 1: prune shared subtrees [diagram: shared subtrees replaced by stable-name placeholders]

Slide 60

Slide 60 text

Sharing Recovery. Phase 1: prune shared subtrees [diagram: shared subtrees replaced by stable-name placeholders]
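Phase 1 can be sketched with explicit node labels standing in for GHC stable names (the real implementation traverses the HOAS term; this toy works on an already-labelled DAG, and all names are illustrative):

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

-- Nodes carry integer labels standing in for stable names; NApp is a
-- generic binary application node.
data Node  = NLit Int | NApp Label Label
type Label = Int
type Graph = Map Label Node   -- the term DAG, keyed by label

-- Count occurrences per label, pruning: after the first visit of a
-- node we record another occurrence but do not descend again.
occurrences :: Graph -> Label -> Map Label Int
occurrences g root = go root Map.empty
  where
    go l om =
      let om' = Map.insertWith (+) l 1 om
      in  if Map.member l om
            then om'                       -- shared: prune here
            else case g Map.! l of
                   NLit _   -> om'
                   NApp f x -> go x (go f om')
```

On the slide's example DAG, the shared nodes (three and nine) end with a count greater than one; that count is exactly what the floating phase uses to decide where each shared subterm should eventually be let-bound.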

Slide 61

Slide 61 text

Sharing Recovery. Phase 2: float shared terms [diagram: shared subterms floating upwards through the DAG]

Slide 62

Slide 62 text

Sharing Recovery. Phase 2: float shared terms [diagram: shared subterms floating upwards through the DAG]

Slide 63

Slide 63 text

Sharing Recovery. Phase 2: float shared terms [diagram: each shared subterm floated to just above the lowest node dominating all of its uses]

Slide 64

Slide 64 text

Sharing Recovery. Phase 3: introduce binders [diagram: floated subterms let-bound above the nodes they floated to]

Slide 65

Slide 65 text

Sharing Recovery. Phase 3: introduce binders [diagram: floated subterms let-bound above the nodes they floated to, with de Bruijn indices]

Slide 66

Slide 66 text

[Screenshot of the paper, Section 3 "Sharing Recovery"; two page columns, partially cut off. The readable content, summarised:]

The main idea is illustrated on the source term

let inc = (+) 1
in  let nine = let three = inc 2
               in  (*) three three
    in
    (-) (inc nine) nine

whose abstract syntax DAG (leftmost diagram in Figure 2 of the paper) uses @ nodes for applications. Typed HOAS conversion with sharing recovery proceeds in three stages:

1. Prune shared subterms: a depth-first traversal annotates each node with its unique stable name and builds an occurrence map recording how many times each node has been visited. A previously visited node is a shared subterm and is replaced by a placeholder containing its stable name.

2. Float shared terms: every shared subterm floats upwards to just above the lowest node that dominates all edges to its original position. If a node collects more than one shared subterm, the one whose origin is deeper in the original term goes on top; nested sharing floats up inside other floated subterms.

3. Binder introduction: each floated subterm is let-bound right above the node it floated to, introducing de Bruijn indices for let- and lambda-bound variables at the same time.

The occurrence map must record actual counts, not merely a set of names: a shared subterm is let-bound at the node where the collected count matches its total number of occurrences. Occurrence counting and pruning are interleaved in a single pass, since shared subtrees must not be traversed twice. Replacing HOAS binders by levels loses some type information — the inevitable loss described by Atkey et al. — which is made up for by a dynamic check in an environment lookup.
" T ⌧2 :: # T ⌧1!⌧2 " T ⌧1!⌧2 1 @ " T ⌧1 2 :: # T ⌧2 A term in " T comprises a sequence of floated-out subterms labelled by their stable name as well as a body term from # T from which the floated subterms where extracted. Moreover, the levels ` that replaced lambda binders in T get replaced by the stable name of their term node. This simplifies a uniform introduction of de Bruijn indices for let and lambda bound variables. We write ⌫ : " T for a possibly empty sequence of items: ⌫1 : " T1, . . . , ⌫n : " Tn , where • denotes an empty sequence. The floating function float maintains an auxiliary structure of floating terms and levels, defined as follows: und names in the figure, we introduce me time as introducing the lets. s shared subtrees, producing a pruned cond diagram in Figure 2): ⌧ -- binder conversion level ⌧ -- pruned subtree (name) ⌧ ⌧1!⌧2 ⌧2 ype Name ) associates a unique name so that two terms with the same stable presented by the same data structure in stable name of a term as a superscript .g., 1⌫ is a constant with stable name ird diagram in Figure 2. Name 7! Int , is a finite map that currences of a Name that we encoun- expression ⌦ ⌫ yields the number of and we have ⌫ 2 ⌦ ⌘ (⌦ ⌫ > 0). To rite ⌫ B⌦. We will see in the next sub- ify ⌦ to be merely a set of occurring occurrence count to determine where -bound. ning of shared subtrees is formalised rating on closed terms from T ⌧ : ! Int ) ! T ⌧ ! (( Name 7! Int ) , T ⌧ ) = ( ⌫ B ⌦, ⌫ ) = enter ( ⌫ B ⌦ ) e ⌦, c ) t ( ⌦ 0 , e 0) = prune ( ` + 1) ⌦ ([ `/x ] e ) n ⌦ 0 , `.e 0) t ( ⌦1, e 0 1 ) = prune ` ⌦ e1 ( ⌦2, e 0 2 ) = prune ` ⌦1 e2 n ⌦2, e 0 1 @ e 0 2 ) by a tag ⌫ containing its stable name — these are the dotted lines in the second diagram in Figure 2. To interleave sharing recovery with the conversion from HOAS to typed de Bruijn indices, prune tracks the nesting Level of lambdas. Moreover, the lambda case of enter replaces the HOAS binder x by the level ` at the binding and usage sites. 
Why don’t we separate computing occurrences from tree prun- ing? When computing occurrences, we must not traverse shared subtrees multiple times, so we can as well prune at the same time. Moreover, in the first line of prune , we cannot simply return e in- stead of ⌫ — e is of the wrong form as it has type T and not T ! As far as type-preservation is concerned, we do lose information due to replacing variables by levels ` . This is the inevitable loss described by Atkey et al. [1], which we make up for by a dynamic check in an environment lookup, as already discussed. 3.3 Float shared subterms Second, we float all shared subtrees out to where they should be let-bound, represented by (see third diagram in Figure 2) " T ⌧ ! ⌫ : " T ⌧0 # T ⌧ # T ⌧ where ⌫ ⌧ :: # T ⌧ C ⌧ :: # T ⌧ ⌫. " T ⌧2 :: # T ⌧1!⌧2 " T ⌧1!⌧2 1 @ " T ⌧1 2 :: # T ⌧2 A term in " T comprises a sequence of floated-out subterms labelled by their stable name as well as a body term from # T from which the floated subterms where extracted. Moreover, the levels ` that replaced lambda binders in T get replaced by the stable name of their term node. This simplifies a uniform introduction of de Bruijn indices for let and lambda bound variables. We write ⌫ : " T for a possibly empty sequence of items: ⌫1 : " T1, . . . , ⌫n : " Tn , where • denotes an empty sequence. The floating function float maintains an auxiliary structure of floating terms and levels, defined as follows: ! ⌫ i : " T ⌧ | ⌫ i : · | ⌫ i : ` These are floated subtrees named ⌫ of which we have collected i occurrences. The occurrence count indicates where a shared sub- term gets let bound: namely at the node where it matches ⌦⌫ . This is why prune needed to collect the number of occurrences in ⌦ . When the occurrence count matches ⌦⌫ , we call the floated term saturated. The following function determines saturated floated terms, which ought to be let bound: bind :: ( Name 7! Int ) ! ! 
9 ⌧.⌫ : " T ⌧ bind ⌦ • = • bind ⌦ ( ⌫ i : e, ) | ⌦⌫ == i = ⌫ : e, bind ⌦ bind ⌦ ( ⌫ i : , ) = bind ⌦ Note that does not keep track of the type ⌧ of a floated term " T ⌧ ; hence, floated terms from bind come in an existential package. This does not introduce additional loss of type safety as we already lost the type of lambda bound variables in ⌫ i : ` . It merely means that let bound, just like lambda bound, variables require the dynamically checked environment look up we already discussed. When floating the first occurrence of a shared tree (not pruned by prune ), we use ⌫ i : " T ⌧ . When floating subsequent occurrences (which were pruned), we use ⌫ i : ·. Finally, when floating a level, to replace it by a stable name, we use ⌫ i : ` . We define a partial ordering on floated terms: ⌫1 i : x < ⌫2 j : y iff the direct path from ⌫1 to the root of the AST is shorter than that of ⌫2 . We keep sequences of floated terms in descending order — so that the deepest subterm comes first. We write 1 ] 2 to merge two sequences of floated terms. Merging respects the partial order, and it combines floated trees with the same stable name by adding their occurrence counts. To combine the first occurrence and a subsequent occurrence of a shared tree, we preserve the term of the first occurrence. We write \ ⌫ to delete elements of that term saturated. The following function determines saturated floated terms, which ought to be let bound: bind :: ( Name 7! Int ) ! ! 9 ⌧.⌫ : " T ⌧ bind ⌦ • = • bind ⌦ ( ⌫ i : e, ) | ⌦⌫ == i = ⌫ : e, bind ⌦ bind ⌦ ( ⌫ i : , ) = bind ⌦ Note that does not keep track of the type ⌧ of a floated term " T ⌧ ; hence, floated terms from bind come in an existential package. This does not introduce additional loss of type safety as we already lost the type of lambda bound variables in ⌫ i : ` . It merely means that let bound, just like lambda bound, variables require the dynamically checked environment look up we already discussed. 
When floating the first occurrence of a shared tree (not pruned by prune ), we use ⌫ i : " T ⌧ . When floating subsequent occurrences (which were pruned), we use ⌫ i : ·. Finally, when floating a level, to replace it by a stable name, we use ⌫ i : ` . We define a partial ordering on floated terms: ⌫1 i : x < ⌫2 j : y iff the direct path from ⌫1 to the root of the AST is shorter than that of ⌫2 . We keep sequences of floated terms in descending order — so that the deepest subterm comes first. We write 1 ] 2 to merge two sequences of floated terms. Merging respects the partial order, and it combines floated trees with the same stable name by adding their occurrence counts. To combine the first occurrence and a subsequent occurrence of a shared tree, we preserve the term of the first occurrence. We write \ ⌫ to delete elements of that are tagged with a name that appears in the sequence ⌫ . We can now formalise the floating process as follows: float :: ( Name 7! Int ) ! T ⌧ ! ( , " T ⌧ ) float ⌦ ` ⌫ = ( ⌫ 1 : `, ⌫ ) float ⌦ ⌫ = ( ⌫ 1 : · , ⌫ ) float ⌦ e ⌫ = let ( , e 0) = descend e ⌫b : eb = bind ⌦ d = ⌫b : eb e 0 in if ⌦⌫ == 1 then ( \ ⌫b, d ) else ( \ ⌫b ] { ⌫ : d } , ⌫ ) where descend :: T ⌧ ! ( , # T ⌧ ) descend c = (• , c ) descend ( `.e ) = let ( , e 0) = float ⌦ e in if 9 ⌫ 0 i. ( ⌫ 0 i : ` ) 2 then ( \ { ⌫ 0} , ⌫ 0 .e 0) else ( , .e 0) descend ( e1 @ e2) = let ( 1, e 0 1 ) = float ⌦ e1 ( 2, e 0 2 ) = float ⌦ e2 in ( 1 ] 2, e 0 1 @ e 0 2 ) Regardless of whether a term gets floated, all saturated float terms, ⌫b : eb , must prefix the result, e 0, and be removed from When descend ing into a term, the only interesting case is lambdas. For a lambda at level ` , we look for a floated level of t form ⌫ 0 : ` . If that is available, ⌫ 0 replaces ` as a binder and remove ⌫ 0 : ` from . However, if ⌫ 0 : ` is not in , the bind introduced by the lambda doesn’t get used in e . 
In this case, pick an arbitrary new name; here symbolised by an underscore ” 3.4 Binder introduction Thirdly, we introduce typed de Bruijn indices to represent lamb and let binding structure (rightmost diagram in Figure 2): env T ⌧ where C ⌧ :: env T ⌧ env ◆ ⌧ :: env T ⌧ (⌧1, env) T ⌧2 :: env T ⌧1!⌧2 env T ⌧1!⌧2 1 @ env T ⌧1 2 :: env T ⌧2 let env T ⌧1 1 in (⌧1, env) T ⌧2 2 :: env T ⌧2 With this type of terms, e :: env T ⌧ means that e is a term rep senting a computation producing a value of type ⌧ under the ty environment env . Type environments are nested pair types, pos bly terminated by a unit type (). For example, ((() , ⌧1) , ⌧0) i type environment, where de Bruijn index 0 represents a variable type ⌧0 and de Bruijn index 1 represents a variable of type ⌧1 . We abbreviate let e1 in · · · let en in eb as let e in Both and let use de Bruijn indices ◆ instead of introduci explicit binders. To replace the names of pruned subtrees and of lambda bou variables by de Bruijn indices, we need to construct a suitab type environment as well as an association of environment entri their de Bruijn indices, and the stable names that they replace. W maintain the type environment with associated de Bruijn indices the following environment layout structure: env env0 where :: env () env env0 ; env ◆ ⌧ :: env (env0, t) Together with a layout, we use a sequence of names ⌫ of the sam size as the layout, where corresponding entries represent the sam variable. As this association between typed layout and untyp sequence of names is not validated by types, the lookup functi lyt # i getting the i th index of layout lyt makes use of a dynam type check. It’s signature is (#) :: N ! env env0 ! env ◆ ⌧ . Now we can introduces de Bruijn indices to body expression body :: env env ! ⌫ ! # T ⌧ ! env T ⌧ body lyt ( ⌫⇢,0, . . . 
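The prune phase above can be sketched in a few lines of plain Haskell. This is a simplified, untyped model: explicit Int labels stand in for GHC stable names, and lambdas and levels are elided; the names Exp, prune, and Tag are illustrative, not the paper's actual implementation.

```haskell
import qualified Data.Map.Strict as M

type Name = Int
type Occ  = M.Map Name Int

data Exp = Con Name Int       -- constant, tagged with its node name
         | Add Name Exp Exp   -- addition node, tagged with its name
         | Tag Name           -- pruned subtree: only its name remains
         deriving (Eq, Show)

nameOf :: Exp -> Name
nameOf (Con n _)   = n
nameOf (Add n _ _) = n
nameOf (Tag n)     = n

-- Count occurrences while pruning: a node seen before is replaced
-- by a Tag carrying its name; otherwise we descend into its children.
prune :: Occ -> Exp -> (Occ, Exp)
prune om e
  | M.member n om = (bump, Tag n)
  | otherwise     = case e of
      Add _ e1 e2 -> let (om1, e1') = prune bump e1
                         (om2, e2') = prune om1 e2
                     in (om2, Add n e1' e2')
      _           -> (bump, e)
  where
    n    = nameOf e
    bump = M.insertWith (+) n 1 om

-- (1 + 2) shared twice, as in `let x = 1 + 2 in x + x`
shared, example :: Exp
shared  = Add 0 (Con 1 1) (Con 2 2)
example = Add 3 shared shared

main :: IO ()
main = do
  let (om, e') = prune M.empty example
  print e'              -- Add 3 (Add 0 (Con 1 1) (Con 2 2)) (Tag 0)
  print (M.lookup 0 om) -- Just 2: the shared node occurred twice
```

The occurrence count 2 for node 0 is exactly what the float phase later uses to decide where the shared subterm becomes saturated and gets let-bound.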
3.4 Binder introduction

Thirdly, we introduce typed de Bruijn indices to represent lambda and let binding structure (rightmost diagram in Figure 2):

    env ⊢T τ where
      C^τ                               :: env ⊢T τ
      env ⊢ι τ                          :: env ⊢T τ
      λ (τ₁, env) ⊢T τ₂                 :: env ⊢T τ₁→τ₂
      env ⊢T τ₁→τ₂ @ env ⊢T τ₁          :: env ⊢T τ₂
      let env ⊢T τ₁ in (τ₁, env) ⊢T τ₂  :: env ⊢T τ₂

With this type of terms, e :: env ⊢T τ means that e is a term representing a computation producing a value of type τ under the type environment env. Type environments are nested pair types, possibly terminated by a unit type (). For example, (((), τ₁), τ₀) is a type environment, where de Bruijn index 0 represents a variable of type τ₀ and de Bruijn index 1 represents a variable of type τ₁. We abbreviate let e₁ in ⋯ let eₙ in eb as let ē in eb. Both λ and let use de Bruijn indices ι instead of introducing explicit binders.

To replace the names of pruned subtrees and of lambda bound variables by de Bruijn indices, we need to construct a suitable type environment as well as an association of environment entries, their de Bruijn indices, and the stable names that they replace. We maintain the type environment with associated de Bruijn indices in the following environment layout structure:

    env ⊨ env′ where
      ∅                      :: env ⊨ ()
      env ⊨ env′ ; env ⊢ι τ  :: env ⊨ (env′, τ)

Together with a layout, we use a sequence of names ν̄ of the same size as the layout, where corresponding entries represent the same variable. As this association between the typed layout and the untyped sequence of names is not validated by types, the lookup function lyt # i, getting the i-th index of layout lyt, makes use of a dynamic type check. Its signature is (#) :: ℕ → env ⊨ env′ → env ⊢ι τ.

Now we can introduce de Bruijn indices into body expressions:

    body :: env ⊨ env → ν̄ → ↓T τ → env ⊢T τ
    body lyt (νρ,0, …, νρ,n) ν | ν == νρ,i = lyt # i
    body lyt ν̄ρ c         = c
    body lyt ν̄ρ (λν.e)    = λ (binders lyt⁺ (ν, ν̄ρ) e)
    body lyt ν̄ρ (e₁ @ e₂) = (binders lyt ν̄ρ e₁) @ (binders lyt ν̄ρ e₂)

The first equation performs a lookup in the environment layout at the same index where the stable name ν occurs in the name environment ν̄. The lookup is the same for lambda and let bound variables. It is the only place where we need a dynamic type check, and that is already needed for lambda bound variables alone.

In the case of a lambda, we add a new binder by extending the layout, denoted lyt⁺, with a new zeroth de Bruijn index and shifting all others one up. Keeping the name environment in sync, we add the stable name ν, which ↓T used as a binder.
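The dynamically checked environment lookup the text appeals to can be modelled with the standard library's Data.Dynamic. This is a deliberately simplified stand-in (the paper pairs a typed layout with the environment; here the environment itself is untyped and the type check happens wholesale at lookup):

```haskell
import Data.Dynamic (Dynamic, toDyn, fromDynamic)
import Data.Typeable (Typeable)

type Env = [Dynamic]

-- Look up de Bruijn index i at an expected type; the type check
-- happens at run time via fromDynamic.
lookupEnv :: Typeable a => Int -> Env -> Maybe a
lookupEnv i env
  | i >= 0 && i < length env = fromDynamic (env !! i)
  | otherwise                = Nothing

main :: IO ()
main = do
  let env = [toDyn (42 :: Int), toDyn "shape"]
  print (lookupEnv 0 env :: Maybe Int)    -- Just 42
  print (lookupEnv 0 env :: Maybe Bool)   -- Nothing: wrong type, check fails
```

The Nothing case is exactly the "inevitable loss" a failed dynamic check would surface; in the real converter it cannot occur for well-formed input terms.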
In the same vein, we bind n floated terms ν̄ : ē with let bindings in a body expression eb, by extending the type environment n times (map applies a function to each element of a sequence):

    binders :: env ⊨ env → ν̄ → ↑T τ → env ⊢T τ
    binders lyt ν̄ρ (ν̄ : ē ▹ eb) = let map (binders lyt ν̄ρ) ē
                                   in body lyt⁺ⁿ (ν̄, ν̄ρ) eb
      where n = length (ν̄ : ē)

We tie the three stages together to convert from HOAS with sharing recovery, producing let bindings and typed de Bruijn indices:

    hoasSharing :: T τ → () ⊢T τ
    hoasSharing e = let (Ω, e′) = prune 0 • e
                        (•, e″) = float Ω e′
                    in binders ∅ • e″

[Figure 3. Producer/producer and consumer/producer fusion: before fusion; after producer/producer fusion; after consumer/producer fusion.]

4. Array fusion

Fusion in a massively data-parallel, embedded language for GPUs, such as Accelerate, requires a few uncommon considerations.

Parallelism. While fusing parallel collective operations, we must be careful not to lose information essential to parallel execution. For example, foldr/build fusion [15] is not applicable, because it produces sequential tail-recursive loops rather than massively parallel GPU kernels. Similarly, the split/join approach used in Data Parallel Haskell (DPH) [16] is not helpful: although fused operations are split into sequential and parallel subcomputations, they assume an explicit parallel scheduler, which in DPH is written directly in Haskell. Accelerate compiles massively parallel array combinators to CUDA code via template skeleton instantiation, so any fusion system must preserve the combinator representation of the intermediate code.

Sharing. Existing fusion transforms rely on inlining to move producer and consumer expressions next to each other, which allows producer/consumer pairs to be detected. However, when let-bound variables are used multiple times in the body of an expression, unrestrained inlining can lead to duplication of work. Compilers such as GHC handle this situation by only inlining the definitions of let-bound variables that have a single use site, or by relying on some heuristic about the size of the resulting code to decide what to inline [26]. However, in typical Accelerate programs, each array is used at least twice: once to access the shape information and once to access the array data; so, we must handle at least this case separately.

Filtering. General array fusion transforms must deal with filter-like operations, for which the size of the result structure depends on the value of the input structure, as well as its size. Accelerate does not encode filtering as a primitive operation, so we do not need to consider it further.¹

Fusion at run-time. As the Accelerate language is embedded in Haskell, compilation of the Accelerate program happens at Haskell runtime rather than when compiling the Haskell program. For this reason, optimisations applied to an Accelerate program contribute to its overall runtime, so we must be mindful of the cost of analysis and code transformation. On the flip-side, runtime optimisations can make use of information that is only available at runtime.

Fusion on typed de Bruijn terms. We fuse Accelerate programs by rewriting typed de Bruijn terms in a type preserving manner. However, maintaining type information adds complexity to the definitions and rules, which amounts to a partial proof of correctness checked by the type checker, but is not particularly exciting for the present exposition. Hence, in this section, we elide the steps necessary to maintain type information during fusion.

4.1 The Main Idea

All collective operations in Accelerate are array-to-array transformations. Reductions, such as fold, which reduce an array to a single element, yield a singleton array rather than a scalar expression. Hence, we can partition array operations into two categories:

1. Operations where each element of the result array depends on at most one element of each input array. Multiple elements of the output array may depend on a single input array element, but all output elements can be computed independently. We refer to these operations as producers.

2. Operations where each element of the result array depends on multiple elements of the input array. We call these operations consumers, in spite of the fact that they also produce an array.

Table 1 summarises the collective operations in Accelerate. In a parallel context, producers are pleasant to deal with, because independent element-wise operations have an obvious mapping to the GPU. Consumers are harder: we need to know exactly how the computation depends on its input to implement them efficiently. For example, a fold (with an associative operator) can be implemented as a parallel tree reduction, but a parallel scan requires two distinct phases. Unfortunately, this sort of information is not available with general fusion techniques. To support the different properties of producers and consumers, our fusion transform proceeds in two phases:

• Producer/producer: fuse a sequence of producers into a single producer. This is implemented as a source-to-source transformation on the AST.

• Consumer/producer: fuse a producer followed by a consumer into the consumer. This happens during code generation, where we specialise the consumer skeleton with the producer code.

¹ filter is easily implemented in Accelerate and is provided as part of the library.
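The producer/producer phase can be illustrated with a small "delayed array" sketch. The names here (Delayed, mapD, zipWithD, foldD) are illustrative, not Accelerate's internals: producers stay in a delayed form, an extent plus an index function, so composing producers composes functions and no intermediate array is written. Only a consumer finally walks the indices.

```haskell
data Delayed a = Delayed Int (Int -> a)   -- size and index function

fromList :: [a] -> Delayed a
fromList xs = Delayed (length xs) (xs !!)

-- Producers compose without traversing anything yet.
mapD :: (a -> b) -> Delayed a -> Delayed b
mapD f (Delayed n ix) = Delayed n (f . ix)

zipWithD :: (a -> b -> c) -> Delayed a -> Delayed b -> Delayed c
zipWithD f (Delayed n ix) (Delayed m iy) =
  Delayed (min n m) (\i -> f (ix i) (iy i))

-- A consumer forces the whole fused pipeline in a single traversal.
foldD :: (b -> a -> b) -> b -> Delayed a -> b
foldD f z (Delayed n ix) = go 0 z
  where
    go i acc | i == n    = acc
             | otherwise = go (i + 1) (f acc (ix i))

main :: IO ()
main = print (foldD (+) 0
               (mapD (+1)
                 (zipWithD (*) (fromList [1,2,3]) (fromList [4,5,6 :: Int]))))
-- (1*4+1) + (2*5+1) + (3*6+1) = 35
```

The consumer/producer phase corresponds to inlining the final index function into the consumer's skeleton at code-generation time.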

Slide 67

Slide 67 text

How fast are we going?

Slide 68

Slide 68 text

Dot Product chart. Run Time (ms) vs. Elements (millions). Series: Data.Vector; Accelerate -fusion; ... +fusion; Hand optimised GPU

Slide 69

Slide 69 text

Dot Product chart. Run Time (ms) vs. Elements (millions). Series: Data.Vector; Accelerate -fusion; ... +fusion; Hand optimised GPU. Annotation: 2x

Slide 70

Slide 70 text

Dot Product chart. Run Time (ms) vs. Elements (millions). Series: Data.Vector; Accelerate -fusion; ... +fusion; Hand optimised GPU. Annotations: 2x, 1.2x
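The benchmarked program here is the dot product, which in Accelerate is a fold of a zipWith over vectors. As a sketch of what the fusion transform buys, the two list-based stand-ins below (plain Haskell, not Accelerate's Acc (Vector Float) API) show the unfused form, which materialises the intermediate list of products, and the single-traversal loop that fusion morally produces:

```haskell
{-# LANGUAGE BangPatterns #-}

-- Unfused: zipWith builds an intermediate structure that sum consumes.
dotpUnfused :: Num a => [a] -> [a] -> a
dotpUnfused xs ys = sum (zipWith (*) xs ys)

-- Fused: one traversal, one accumulator, no intermediate structure.
dotpFused :: Num a => [a] -> [a] -> a
dotpFused = go 0
  where
    go !acc (x:xs) (y:ys) = go (acc + x * y) xs ys
    go  acc _      _      = acc

main :: IO ()
main = print (dotpFused [1,2,3] [4,5,6 :: Int])   -- 32, same as dotpUnfused
```

On the GPU the intermediate array additionally costs a full kernel launch and a round trip through device memory, which is why eliminating it matters more than it does on the CPU.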

Slide 71

Slide 71 text

Black-Scholes options pricing chart. Run Time (ms) vs. Options (millions). Series: Accelerate -sharing; ... +sharing; Hand optimised GPU

Slide 72

Slide 72 text

Black-Scholes options pricing chart. Run Time (ms) vs. Options (millions). Series: Accelerate -sharing; ... +sharing; Hand optimised GPU. Annotation: 17x

Slide 73

Slide 73 text

Black-Scholes options pricing chart. Run Time (ms) vs. Options (millions). Series: Accelerate -sharing; ... +sharing; Hand optimised GPU. Annotations: 17x, 0.9x
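Black-Scholes is dominated by reused scalar subexpressions, which is why sharing recovery makes such a difference on this benchmark: a sharing-oblivious translation of the embedded AST redoes the work behind every reuse. The toy AST below (hypothetical, not Accelerate's) shows how the duplicated work grows exponentially with the nesting depth of shared bindings:

```haskell
-- A Haskell `let` shares a node in the embedded term graph, but a
-- sharing-oblivious traversal revisits that node once per reference.
data E = Lit Double | AddE E E

eval :: E -> Double
eval (Lit x)    = x
eval (AddE a b) = eval a + eval b

-- Nodes visited by a traversal that ignores sharing.
size :: E -> Int
size (Lit _)    = 1
size (AddE a b) = 1 + size a + size b

-- double x = x + x, iterated: each level doubles the apparent work.
tower :: Int -> E
tower n = iterate (\x -> AddE x x) (Lit 1) !! n

main :: IO ()
main = do
  print (eval (tower 20))   -- 1048576.0, i.e. 2^20
  print (size (tower 20))   -- 2097151 nodes visited: 2^(20+1) - 1
```

Sharing recovery turns this complete binary tree back into a chain of 20 let bindings, so the generated GPU code evaluates each subexpression once.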

Slide 74

Slide 74 text

N-Body chart. Run Time (ms) vs. Bodies. Series: Accelerate -fusion -sharing; ... -fusion +sharing; ... +fusion +sharing; Hand optimised GPU

Slide 75

Slide 75 text

N-Body chart. Run Time (ms) vs. Bodies. Series: Accelerate -fusion -sharing; ... -fusion +sharing; ... +fusion +sharing; Hand optimised GPU. Annotation: 11x

Slide 76

Slide 76 text

N-Body chart ("Missing optimisation"). Run Time (ms) vs. Bodies. Series: Accelerate -fusion -sharing; ... -fusion +sharing; ... +fusion +sharing; Hand optimised GPU. Annotation: 11x

Slide 77

Slide 77 text

N-Body, two charts: "Missing optimisation" and "Current version". Run Time (ms) vs. Bodies. Series (both): Accelerate -fusion -sharing; ... -fusion +sharing; ... +fusion +sharing; Hand optimised GPU. Annotations: 11x, 2x

Slide 78

Slide 78 text

Summary: Embedded languages are restricted languages. Skeletons encapsulate efficient code idioms. Fusion and sharing reduce the abstraction penalty. https://github.com/AccelerateHS/