Slide 1

Slide 1 text

1 ASV: Accelerated Stereo Vision System Yu Feng with Paul Whatmough (Arm Research) and Yuhao Zhu Department of Computer Science University of Rochester http://horizon-lab.org

Slide 6

Slide 6 text

2 Distance: 1.0 inch Heart rate: 200↑ ❤ Eve ❤ Right Distance (Depth) is Important!

Slide 11

Slide 11 text

Applications Need Depth Information 3 3D Reconstruction Drone Navigation Augmented Reality Domestic Robot

Slide 13

Slide 13 text

Techniques to Extract Depth Information 4 Passive Sensing Active Sensing

Slide 17

Slide 17 text

Triangulation: Binocular Depth Sensing 5 Physical Point

Slide 27

Slide 27 text

Triangulation: Binocular Depth Sensing 5 Physical Point, Left Camera, Right Camera; Left Plane, Right Plane; baseline B, focal length f, depth D; projections XL (X'L) and XR; disparity Z = XR - XL. Using similar triangles: D / (D + f) = B / (B + Z), which gives D = Bf / Z.
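The similar-triangles derivation on this slide reduces to a one-liner. A minimal sketch (not from the talk; depth_from_disparity is a hypothetical helper, and the baseline, focal length, and pixel coordinates must be in consistent units):

```python
def depth_from_disparity(x_l, x_r, baseline, focal_length):
    """Triangulation: D / (D + f) = B / (B + Z) rearranges to D = B * f / Z,
    with disparity Z = XR - XL."""
    z = x_r - x_l
    if z == 0:
        return float("inf")  # zero disparity: the point is at infinity
    return baseline * focal_length / z
```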

Slide 28

Slide 28 text

Continuous Stereo Vision 6 L R Inputs Output Disparity Map

Slide 32

Slide 32 text

Continuous Stereo Vision 6 L R Inputs → Output Disparity Map: Z = |XL - XR| at each pixel, then disparity Z → Depth
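Applied per pixel, the same relation turns a disparity map into a depth map. A hedged NumPy sketch (the eps guard for unmatched, zero-disparity pixels is an assumption, not part of the talk):

```python
import numpy as np

def disparity_map_to_depth(disparity, baseline, focal_length, eps=1e-6):
    # Element-wise D = B * f / Z; eps avoids division by zero where no match exists.
    return baseline * focal_length / np.maximum(disparity, eps)
```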

Slide 35

Slide 35 text

Continuous Stereo Vision 7 L R Inputs + Stereo Matching Algorithms { DNN-based, non-DNN-based } → Output Disparity Map

Slide 40

Slide 40 text

Accuracy vs. Speed Trade-off 8 [chart: Error Rate (%) vs. FPS for non-DNN (CPU), DNN (GPU), and DNN (Accelerator), with the 30FPS real-time line marked; ASV targets low error at or above 30FPS]

Slide 41

Slide 41 text

ASV: Accelerated Stereo Vision System 9

Slide 44

Slide 44 text

ASV: Accelerated Stereo Vision System 9 ‣Algorithm: Invariant-based Stereo Matching Algorithm + ‣Compiler: Deconvolution Transformation and Dataflow Optimization + ‣Hardware: Principled and Minimal Hardware Modifications +

Slide 47

Slide 47 text

ISM: Invariant-based Stereo Matching Algorithm 11

Slide 55

Slide 55 text

ISM: Invariant-based Stereo Matching Algorithm 11 t = t0: L R, Find Correspondences (= DNN inference). t = t0+1: L R, Propagate Correspondences (motion estimation). Invariant: two corresponding pixels always correspond to the same physical point across frames.

Slide 59

Slide 59 text

ISM: Invariant-based Stereo Matching Algorithm 11 t = t0: L R, Find Correspondences (= DNN inference). t = t0+1: L R, Propagate Correspondences (motion estimation, Optical Flow Algorithm), then Refine Correspondences (Block Matching).
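In code, one non-key-frame ISM step might look like the sketch below, assuming OpenCV's Farneback optical flow as the motion estimator and omitting the block-matching refinement; the released implementation at https://github.com/horizon-research/ism-algorithm is the authoritative version.

```python
import cv2
import numpy as np

def propagate_disparity(left_prev, left_now, disp_prev):
    """Propagate the key frame's correspondences to the current frame.

    The invariant: corresponding pixels keep imaging the same physical
    point, so the key frame's disparity can follow the estimated motion.
    Frames are assumed to be single-channel (grayscale) uint8 images.
    """
    # Backward flow (current -> previous) lets us sample disp_prev directly.
    flow = cv2.calcOpticalFlowFarneback(left_now, left_prev, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = disp_prev.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    map_x, map_y = xs + flow[..., 0], ys + flow[..., 1]
    # A full ISM step would refine the warped result with block matching.
    return cv2.remap(disp_prev.astype(np.float32), map_x, map_y,
                     cv2.INTER_LINEAR)
```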

Slide 62

Slide 62 text

ISM: Invariant-based Stereo Matching Algorithm 12 Time: t = t0, t0+1, t0+2, t0+3 (L R at each frame) Method: DNN Inference, DNN Inference, DNN Inference, DNN Inference Performance: SLOW, SLOW, SLOW, SLOW Accuracy: GOOD, GOOD, GOOD, GOOD

Slide 64

Slide 64 text

ISM: Invariant-based Stereo Matching Algorithm 12 Time: t = t0, t0+1, t0+2, t0+3 (L R at each frame) Method: DNN Inference, ISM Algorithm, ISM Algorithm, DNN Inference Performance: SLOW, FAST, FAST, SLOW Accuracy: GOOD, GOOD, GOOD, GOOD https://github.com/horizon-research/ism-algorithm
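The cadence in this table amounts to a small scheduler. A sketch under the evaluation's setting (key-frame DNN inference every 4 frames); dnn_stereo is a hypothetical stand-in for any stereo DNN, and propagate_disparity is the ISM step sketched above:

```python
KEY_INTERVAL = 4  # DNN inference on every 4th frame, ISM steps in between

def stereo_stream(frames, dnn_stereo):
    left_prev = disp = None
    for t, (left, right) in enumerate(frames):
        if t % KEY_INTERVAL == 0:
            disp = dnn_stereo(left, right)                     # SLOW, GOOD
        else:
            disp = propagate_disparity(left_prev, left, disp)  # FAST, GOOD
        left_prev = left
        yield disp
```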

Slide 65

Slide 65 text

ASV: Accelerated Stereo Vision System 13 ‣Algorithm: Invariant-based Stereo Matching Algorithm + ‣Compiler: Deconvolution Transformation and Dataflow Optimization + ‣Hardware: Principled and Minimal Hardware Modifications +

Slide 67

Slide 67 text

Deconv. is the Major Operation in Stereo DNN 14

Slide 72

Slide 72 text

Deconv. is the Major Operation in Stereo DNN 14 Downsampling (CONV.): Extract and Combine High-level Features Upsampling (DECONV.): Restore and Refine Disparity Resolution

Slide 73

Slide 73 text

Deconv. is the Major Operation in Stereo DNN 14 [chart: Deconvolution Comp. Cost (%), 0-100, for FlowNetC, DispNet, GC-Net, PSMNet] Downsampling (CONV.): Extract and Combine High-level Features Upsampling (DECONV.): Restore and Refine Disparity Resolution

Slide 75

Slide 75 text

Deconvolution Transformation 15 ifmap [A B; C D]

Slide 76

Slide 76 text

Deconvolution Transformation 15 ifmap [A B; C D] → Upsampled ifmap (zeros inserted between elements)

Slide 77

Slide 77 text

Deconvolution Transformation 15 Upsampled ifmap ✽ Original kernel [a b c; d e f; g h i]

Slide 79

Slide 79 text

Deconvolution Transformation 15 Upsampled ifmap ✽ Original kernel, sub-kernel [e]: (1, 1) = A * e, (1, 3) = B * e, (3, 1) = C * e, (3, 3) = D * e

Slide 81

Slide 81 text

Deconvolution Transformation 16 Upsampled ifmap ✽ Original kernel, sub-kernel [e]: (1, 1) = A * e, (1, 3) = B * e, (3, 1) = C * e, (3, 3) = D * e; sub-kernel [d f]: (1, 2) = A * d + B * f, (3, 2) = C * d + D * f

Slide 83

Slide 83 text

Deconvolution Transformation 17 Upsampled ifmap ✽ Original kernel, sub-kernel [e]: (1, 1) = A * e, (1, 3) = B * e, (3, 1) = C * e, (3, 3) = D * e; sub-kernel [d f]: (1, 2) = A * d + B * f, (3, 2) = C * d + D * f; sub-kernel [b; h]: (2, 1) = A * b + C * h, (2, 3) = B * b + D * h

Slide 84

Slide 84 text

Deconvolution Transformation 18 Upsampled ifmap ✽ Original kernel, sub-kernel [e]: (1, 1) = A * e, (1, 3) = B * e, (3, 1) = C * e, (3, 3) = D * e; sub-kernel [d f]: (1, 2) = A * d + B * f, (3, 2) = C * d + D * f; sub-kernel [b; h]: (2, 1) = A * b + C * h, (2, 3) = B * b + D * h; sub-kernel [a c; g i]: (2, 2) = A * a + B * c + C * g + D * i

Slide 90

Slide 90 text

Deconvolution Transformation 19 The four computation patterns above are four convolutions of the original ifmap [A B; C D] with the sub-kernels [e], [d f], [b; h], and [a c; g i]

Slide 96

Slide 96 text

Deconvolution Transformation 20 ▸Compile a deconvolution layer into 4 convolution layers: sub-kernel [e]: (1, 1) = A * e, (1, 3) = B * e, (3, 1) = C * e, (3, 3) = D * e; sub-kernel [d f]: (1, 2) = A * d + B * f, (3, 2) = C * d + D * f; sub-kernel [b; h]: (2, 1) = A * b + C * h, (2, 3) = B * b + D * h; sub-kernel [a c; g i]: (2, 2) = A * a + B * c + C * g + D * i

Slide 97

Slide 97 text

Deconvolution Transformation 20 ▸Compile a deconvolution layer into 4 convolution layers: [a b c; d e f; g h i] deconv I = G([e] ✽ I, [d f] ✽ I, [b; h] ✽ I, [a c; g i] ✽ I), where deconv denotes deconvolution, ✽ denotes standard convolution, I denotes the original input feature map (ifmap), and G denotes the gather operation that assembles the ofmap from the results of the four convolutions; G can be simply implemented as a set of load operations to the scratchpad memory (on-chip buffer).

Slide 102

Slide 102 text

Deconvolution Transformation 20 ▸Compile a deconvolution layer into 4 convolution layers ▸Naively transforming and computing increases memory traffic ▹4 sub-kernels + 4 ifmaps!

Slide 103

Slide 103 text

Deconvolution Transformation 20 ▸Compile a deconvolution layer into 4 convolution layers ▸Naively transforming and computing increases memory traffic ▹4 sub-kernels + 4 ifmaps! ▸Key observation: the sub-convolutions share the same ifmap. New data reuse opportunity.

Slide 104

Slide 104 text

Deconvolution Transformation 20 ▸Compile a deconvolution layer into 4 convolution layers ▸Naively transforming and computing increases memory traffic ▹4 sub-kernels + 4 ifmaps! ▸Key observation: the sub-convolutions share the same ifmap. New data reuse opportunity: Inter-Layer Activation Reuse (ILAR).
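The transformation is easy to check numerically. A single-channel sketch of the 2x-upsampling case above (mirroring the slide's zero-stuffing formulation; framework transposed-convolution ops differ in padding and kernel-flip conventions, so this is illustrative rather than a drop-in replacement):

```python
import numpy as np
from scipy.signal import correlate2d

def deconv_reference(ifmap, k):
    """Baseline: zero-stuff the ifmap (2x upsampling), then apply the full 3x3 kernel."""
    H, W = ifmap.shape
    up = np.zeros((2 * H - 1, 2 * W - 1))
    up[::2, ::2] = ifmap
    return correlate2d(up, k, mode="same")

def deconv_transformed(ifmap, k):
    """Same ofmap, gathered from four sub-convolutions of the original ifmap,
    so the zero-stuffed map is never materialized."""
    (a, b, c), (d, e, f), (g, h, i) = k
    H, W = ifmap.shape
    out = np.zeros((2 * H - 1, 2 * W - 1))    # the gather G: four strided writes
    out[::2, ::2] = e * ifmap                                   # sub-kernel [e]
    out[::2, 1::2] = d * ifmap[:, :-1] + f * ifmap[:, 1:]       # sub-kernel [d f]
    out[1::2, ::2] = b * ifmap[:-1, :] + h * ifmap[1:, :]       # sub-kernel [b; h]
    out[1::2, 1::2] = (a * ifmap[:-1, :-1] + c * ifmap[:-1, 1:] +
                       g * ifmap[1:, :-1] + i * ifmap[1:, 1:])  # sub-kernel [a c; g i]
    return out

k = np.arange(1.0, 10.0).reshape(3, 3)    # [[a b c], [d e f], [g h i]]
x = np.array([[1.0, 2.0], [3.0, 4.0]])    # [[A B], [C D]]
assert np.allclose(deconv_reference(x, k), deconv_transformed(x, k))
```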

Slide 105

Slide 105 text

Deconvolution Optimization: Problem Setup 21

Slide 111

Slide 111 text

Deconvolution Optimization: Problem Setup 21 Goal: minimize the latency and/or memory traffic in deconvolution. ▸Hardware Assumption ▹A system-on-chip connected to DRAM ▹On-chip buffer for ifmap, weights and ofmap ▹Systolic array, output stationary ▹Double-buffering (a working buffer and a filling buffer)
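A structural sketch of the double-buffering assumption (illustrative only: in Python the load and compute run back-to-back, whereas the hardware overlaps loading the next tile with computing the current one):

```python
def run_layer(tiles, load_tile, compute_tile):
    # The working half serves round r while the filling half receives round r+1.
    filling = load_tile(tiles[0])
    for r in range(len(tiles)):
        working = filling
        filling = load_tile(tiles[r + 1]) if r + 1 < len(tiles) else None
        compute_tile(working)  # on hardware this overlaps with the load above
```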

Slide 112

Slide 112 text

Deconvolution Optimization: Complexity 22 Goal: minimize the latency and/or memory traffic in deconvolution. [figure: the ifmap convolved tile-by-tile with the four sub-kernels ([e], [d f], [b; h], [a c; g i]) under two different schedules, Schedule 1 and Schedule 2; which schedule is best?]

Slide 115

Slide 115 text

Deconvolution Optimization: Formulation 23

Slide 125

Slide 125 text

Deconvolution Optimization: Formulation 23 ▸Dataflow optimization → Constrained optimization Objective: Min. L(Θ, ϕ) Θ: Hardware configuration ϕ: Tiling schedule ▸Hardware configuration, Θ = {A, BW, Buf} ▹Systolic Array Capability: A ≤ A* ▹Memory Bandwidth: BW ≤ BW* ▹On-chip Buffer Size: Buf ≤ Buf* (*: hardware capacity) ▸Variables, ϕ = {Tile, Ksub} ▹Tile: tile size in every round ▹Ksub: the number of different sub-kernels in every round Non-linear Constraint Optimization: solved with Sequential Least Squares Programming

Slide 126

Slide 126 text

Deconvolution Optimization: Formulation 23 ▸Dataflow optimization → Constrained optimization Objective: Min. L(Θ, ϕ) Θ: Hardware configuration ϕ: Tiling schedule ▸Hardware configuration, Θ = {A, BW, Buf} ▹Systolic Array Capability: A ≤ A* ▹Memory Bandwidth: BW ≤ BW* ▹On-chip Buffer Size: Buf ≤ Buf* ▸Variables, ϕ = {Tile, Ksub} ▹Tile: tile size in every round ▹Ksub: the number of different sub-kernels in every round https://github.com/horizon-research/systolic-array-dataflow-optimizer
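A minimal sketch of how the formulation maps onto an off-the-shelf solver; the latency model, buffer constraint, and layer shape below are toy assumptions (the real cost model lives in the repository above), and scipy's SLSQP method implements the Sequential Least Squares Programming named on the slide:

```python
import numpy as np
from scipy.optimize import minimize

H, W, C = 128, 128, 32        # hypothetical layer shape
PES = 24 * 24                 # systolic array capability A*
BUF = 1.5 * 2**20 / 4         # on-chip buffer size Buf*, in 4-byte words
OVERHEAD = 4096.0             # hypothetical per-round fill/drain cycles

def latency(phi):             # L(Theta, phi) with the hardware Theta fixed
    tile_w, tile_h, k_sub = phi
    rounds = (W / tile_w) * (H / tile_h) * (4.0 / k_sub)
    return rounds * (tile_w * tile_h * C / PES + OVERHEAD)

def buffer_slack(phi):        # Buf constraint: ifmap tile + k_sub ofmap tiles
    tile_w, tile_h, k_sub = phi
    return BUF - (1 + k_sub) * tile_w * tile_h * C

res = minimize(latency, x0=[16, 16, 2], method="SLSQP",
               bounds=[(1, W), (1, H), (1, 4)],
               constraints=[{"type": "ineq", "fun": buffer_slack}])
print(res.x)  # continuous optimum; round down to a feasible integer schedule
```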

Slide 127

Slide 127 text

ASV: Accelerated Stereo Vision System 24 ‣Algorithm: Invariant-based Stereo Matching Algorithm + ‣Compiler: Deconvolution Transformation and Dataflow Optimization + ‣Hardware: Principled and Minimal Hardware Modifications +

Slide 136

Slide 136 text

Hardware Implementation 25 Modified Systolic Array: ▹ Convolutions in DNN (Conv.) ▹ Block Matching (BM) in Refine Correspondences Modified Scalar Unit: ▹ ReLU, Pooling in DNN ▹ Operations in Optical Flow (OF) The overall area overhead introduced by ASV is below 0.5%.

Slide 142

Slide 142 text

Experimental Setup 26 Hardware implementation: ▹ Systolic array: 24x24 PEs at 1 GHz ▹ Scalar units: 8, run in parallel at 250 MHz ▹ SRAM: 1.5 MB on-chip buffer ▹ DRAM: 4 channels of Micron 16 Gb LPDDR3-1600 Stereo DNNs: ▹ FlowNet, DispNet, GC-Net, PSMNet Datasets: ▹ SceneFlow and KITTI

Slide 147

Slide 147 text

Evaluation 27 Variants: ▹ ISM: ISM algorithm without deconv. optimizations (DNN inference every 4 frames: DNN, non-DNN, non-DNN, non-DNN). ▹ DCO: Deconv. optimizations without ISM algorithm. ▹ ISM + DCO: both optimizations combined.

Slide 151

Slide 151 text

Evaluation 28 DNN inference every 4 frames. [chart: Error rate (%), 0-5, on DispNet, FlowNetC, PSMNet, GC-Net, and AVG., comparing DNN vs. ISM; AVG. error rate: DNN 3.89%, ISM 3.95%]

Slide 160

Slide 160 text

Evaluation 29 [chart: Speedup, 0-8, on DispNet, FlowNetC, PSMNet, GC-Net, and AVG. for DCO, ISM, DCO+ISM; AVG. speedup: DCO 1.5x, ISM 3.3x, DCO+ISM 5.0x] [chart: Energy Reduction (%), 0-100; AVG.: DCO 42%, ISM 75%, DCO+ISM 85%]

Slide 161

Slide 161 text

Evaluation 30 [charts: comparison with GANNX; AVG. speedup: ASV 5.0x vs. GANNX 3.6x; AVG. energy reduction: ASV 4.2x vs. GANNX 3.2x]

Slide 165

Slide 165 text

Conclusion 31 ‣"Depth from stereo" is critical to emerging intelligent applications deployed in energy- and performance-constrained devices. ‣ASV simultaneously improves performance and energy efficiency while maintaining high accuracy via HW & SW co-design. ‣Careful design choices let these optimizations integrate into existing DNN accelerators with minor hardware extensions.