ASV: Accelerated Stereo Vision System

Yu Feng, with Paul Whatmough (Arm Research) and Yuhao Zhu
Department of Computer Science, University of Rochester
http://horizon-lab.org

Eve ❤ Distance: 1.0 inch. Heart rate: 200↑ ❤
Right Distance (Depth) is Important!

Applications Need Depth Information
3D Reconstruction, Augmented Reality, Drone Navigation, Domestic Robot

Techniques to Extract Depth Information
Passive Sensing, Active Sensing

Triangulation: Binocular Depth Sensing
[Figure labels: Physical Point, Left Camera, Right Camera, Left Plate, Right Plate, XL, XR, X'L, B, B + Z, f]
D: Depth
Z: Disparity, $Z = X_R - X_L$
Using similar triangles: $\frac{D}{D + f} = \frac{B}{B + Z}$
Using similar triangles: $D = \frac{Bf}{Z}$
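The step from the similar-triangle relation to the depth formula can be checked numerically. Cross-multiplying D / (D + f) = B / (B + Z) gives D(B + Z) = B(D + f), so DZ = Bf and D = Bf / Z. The sketch below is only an illustration: the function name and the example values (all in the same length unit) are assumptions, not taken from the slides.

```python
def depth_from_disparity(b, f, z):
    """Return depth D = B * f / Z, using the symbols from the triangulation slide.

    b, f and z are assumed to be expressed in the same length unit.
    """
    if z <= 0:
        raise ValueError("disparity must be positive")
    return b * f / z


# Hypothetical example values: B = 60, f = 4, Z = 2 (same unit throughout).
B, F, Z = 60.0, 4.0, 2.0
D = depth_from_disparity(B, F, Z)

# The result satisfies the similar-triangle relation D / (D + f) = B / (B + Z).
assert abs(D / (D + F) - B / (B + Z)) < 1e-12
print(D)
```

Because D is inversely proportional to Z, a small disparity error on a distant point (small Z) translates into a large depth error, which is one reason the accuracy of the disparity map matters.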

Continuous Stereo Vision
Inputs: L, R
Output: Disparity Map, $Z = |X_R - X_L|$
Disparity Map → Depth

Continuous Stereo Vision
Inputs: L, R
Stereo Matching Algorithms: { DNN-based, non-DNN-based }
Output: Disparity Map

Accuracy vs. Speed Trade-off
[Figure: error rate (%) versus FPS for non-DNN (CPU), DNN (GPU), DNN (Accelerator), and ASV, with a 30 FPS reference line.]

ASV: Accelerated Stereo Vision System
‣ Algorithm: Invariant-based Stereo Matching Algorithm
‣ Compiler: Deconvolution Transformation and Dataflow Optimization
‣ Hardware: Principled and Minimal Hardware Modifications

ISM: Invariant-based Stereo Matching Algorithm
t = t0: L, R → DNN inference
t = t0+1: L, R → ???
Invariant: two corresponding pixels always correspond to the same physical point across frames over time.
Steps:
▹ Find Correspondences: DNN inference (at t = t0)
▹ Propagate Correspondences (motion estimation): Optical Flow Algorithm
▹ Refine Correspondences: Block Matching
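The propagate step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes integer disparities and integer, purely horizontal motion vectors, and it assumes the convention that a left-image pixel at column x matches the right-image pixel at column x - d. The function name and the list-of-lists representation are hypothetical, and the block-matching refinement step is omitted.

```python
def propagate_disparity(disp, flow_left_x, flow_right_x):
    """Propagate a disparity map from frame t0 to frame t0+1.

    disp[y][x]         : disparity of left-image pixel (y, x) at t0
    flow_left_x[y][x]  : horizontal motion of left-image pixel (y, x)
    flow_right_x[y][x] : horizontal motion of right-image pixel (y, x)
    Pixels that receive no propagated value stay None.
    """
    height, width = len(disp), len(disp[0])
    new_disp = [[None] * width for _ in range(height)]
    for y in range(height):
        for x_left in range(width):
            # Matching pixel in the right image at t0 (assumed convention).
            x_right = x_left - disp[y][x_left]
            if not 0 <= x_right < width:
                continue
            # The invariant: both pixels keep tracking the same physical
            # point, so each one is moved by its own motion vector.
            new_x_left = x_left + flow_left_x[y][x_left]
            new_x_right = x_right + flow_right_x[y][x_right]
            if 0 <= new_x_left < width:
                new_disp[y][new_x_left] = new_x_left - new_x_right
    return new_disp


# One image row: every pixel has disparity 2; the left view shifts by 1 pixel.
row = propagate_disparity([[2] * 6], [[1] * 6], [[0] * 6])
print(row)
```

A real pipeline would follow this with a local search (block matching) around each propagated match, since motion estimates are noisy.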

ISM: Invariant-based Stereo Matching Algorithm
DNN inference on every frame:
Time:        t = t0         t = t0+1       t = t0+2       t = t0+3
Method:      DNN Inference  DNN Inference  DNN Inference  DNN Inference
Performance: SLOW           SLOW           SLOW           SLOW
Accuracy:    GOOD           GOOD           GOOD           GOOD
With ISM:
Time:        t = t0         t = t0+1       t = t0+2       t = t0+3
Method:      DNN Inference  ISM Algorithm  ISM Algorithm  DNN Inference
Performance: SLOW           FAST           FAST           SLOW
Accuracy:    GOOD           GOOD           GOOD           GOOD
https://github.com/horizon-research/ism-algorithm

Deconv. is the Major Operation in Stereo DNN
Downsampling: Extract and Combine High-level Features (CONV.)
Upsampling: Restore and Refine Disparity Resolution (DECONV.)
[Figure: Deconvolution Comp. Cost (%) for FlowNetC, DispNet, GC-Net, and PSMNet.]

Deconvolution Transformation

ifmap (2x2):
A B
C D

Original kernel (3x3):
a b c
d e f
g h i

Upsampled ifmap ✽ original kernel gives the 3x3 ofmap, with elements indexed (1, 1) to (3, 3). Only the non-zero elements of the upsampled ifmap contribute, so each ofmap element uses only part of the kernel:

Sub-kernel [e]:
(1, 1) = A * e
(1, 3) = B * e
(3, 1) = C * e
(3, 3) = D * e

Sub-kernel [d f]:
(1, 2) = A * d + B * f
(3, 2) = C * d + D * f

Sub-kernel [b h] (a column):
(2, 1) = A * b + C * h
(2, 3) = B * b + D * h

Sub-kernel [a c; g i]:
(2, 2) = A * a + B * c + C * g + D * i

▸ Compile a deconvolution layer into 4 convolution layers

[Paper excerpt shown on the slide] The key is to recognize that the four computation patterns are essentially four different convolutions, each convolving the original ifmap with a distinct kernel that is part of the original kernel. For instance, (2, 2), (2, 4), (4, 2), and (4, 4) are generated by convolving [a c; g i] with ifmap. More generally, the deconvolution in Fig. 6 can be calculated as:

$$
\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \,\hat{\circledast}\, I
= G\left( [e] \circledast I,\; \begin{bmatrix} d & f \end{bmatrix} \circledast I,\; \begin{bmatrix} b \\ h \end{bmatrix} \circledast I,\; \begin{bmatrix} a & c \\ g & i \end{bmatrix} \circledast I \right)
$$

where $\hat{\circledast}$ denotes deconvolution, $\circledast$ denotes standard convolution, I denotes the ifmap (the original input feature map), and G denotes the gather operation that assembles the ofmap from the results of the four convolutions. G can be simply implemented as a set of load operations to the scratchpad memory (on-chip buffer).

Slide annotations on the equation: Deconvolution (left-hand side), Convolution (each term inside G), Gather (stores to scratchpad).

▸ Naive transformation and compute increase memory traffic
▹ 4 sub-kernels + 4 ifmaps!
▸ Key observation: sub-convolutions share the same ifmap. New data reuse opportunity.
▹ Inter-Layer Activation Reuse (ILAR)
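The decomposition can be checked against the upsample-then-convolve definition. The sketch below is an illustration under stated assumptions, not the authors' compiler: it assumes a 3x3 kernel, stride-2 zero insertion with one ring of zero padding, and convolution without kernel flipping (which is what the per-element formulas on the slide imply). All function names are hypothetical.

```python
def correlate_valid(x, k):
    """2D convolution without kernel flip, 'valid' output size."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + u][j + v] * k[u][v] for u in range(kh) for v in range(kw))
             for j in range(ow)] for i in range(oh)]


def deconv_reference(ifmap, kernel):
    """Deconvolution as on the slide: upsample the ifmap, then convolve."""
    h, w = len(ifmap), len(ifmap[0])
    # Zero insertion between elements plus one ring of zero padding.
    up = [[0] * (2 * w + 1) for _ in range(2 * h + 1)]
    for i in range(h):
        for j in range(w):
            up[2 * i + 1][2 * j + 1] = ifmap[i][j]
    return correlate_valid(up, kernel)


def deconv_decomposed(ifmap, kernel):
    """Same result from four small convolutions on the original ifmap."""
    h, w = len(ifmap), len(ifmap[0])
    k = kernel
    # (row offset, column offset) in the ofmap -> sub-kernel.
    sub_kernels = {
        (0, 0): [[k[1][1]]],                            # [e]
        (0, 1): [[k[1][0], k[1][2]]],                   # [d f]
        (1, 0): [[k[0][1]], [k[2][1]]],                 # [b h] column
        (1, 1): [[k[0][0], k[0][2]], [k[2][0], k[2][2]]],  # [a c; g i]
    }
    ofmap = [[0] * (2 * w - 1) for _ in range(2 * h - 1)]
    for (r, c), sub in sub_kernels.items():
        part = correlate_valid(ifmap, sub)   # every sub-conv reads the same ifmap
        for i, row in enumerate(part):       # gather: interleave into the ofmap
            for j, value in enumerate(row):
                ofmap[2 * i + r][2 * j + c] = value
    return ofmap


ifmap = [[1, 2], [3, 4]]                       # A, B, C, D
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]     # a .. i
assert deconv_decomposed(ifmap, kernel) == deconv_reference(ifmap, kernel)
print(deconv_decomposed(ifmap, kernel))
```

The reference path multiplies mostly zeros (9 multiplications per output element on a sparse input), while the decomposed path touches only real ifmap values. All four sub-convolutions read the same `ifmap`, which is the reuse opportunity the slide calls ILAR.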

Deconvolution Optimization: Problem Setup
Goal: minimize the latency and/or memory traffic in deconvolution.
▸ Hardware Assumption
▹ A system-on-chip connected with DRAM
▹ On-chip buffer for ifmap, weights and ofmap
▹ Systolic array, output stationary
▹ Uses double-buffering: working buffer and filling buffer

Deconvolution Optimization: Complexity
Goal: minimize the latency and/or memory traffic in deconvolution.
[Figure: the four sub-kernels ([e], [d f], [b h], [a c; g i]) convolved with a 4x4 ifmap (A to P) under two candidate tiling schedules. Schedule 1 or Schedule 2?]

Deconvolution Optimization: Formulation
▸ Dataflow optimization → Constrained optimization
Objective: Min. L(Θ, ϕ)
Θ: Hardware configuration
ϕ: Tiling schedule
▸ Hardware configuration, Θ = {A, BW, Buf}
▹ Systolic Array Capability: A ≤ A*
▹ Memory Bandwidth: BW ≤ BW*
▹ On-chip Buffer Size: Buf ≤ Buf*
*: hardware capacity
▸ Variables, ϕ = {Tile, Ksub}
▹ Tile: tile size in every round
▹ Ksub: the number of different sub-kernels in every round
Non-linear Constrained Optimization: "Sequential Least Squares Programming"
https://github.com/horizon-research/systolic-array-dataflow-optimizer
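To make the shape of this formulation concrete, here is a toy sketch. It is not the paper's cost model and it does not use Sequential Least Squares Programming: the latency and buffer formulas are invented for illustration, and the solver is swapped for a plain exhaustive search over ϕ = (Tile, Ksub). Only the 576 PEs (24x24) come from the deck; the bandwidth and buffer numbers are hypothetical.

```python
import math

NUM_SUB_KERNELS = 4   # [e], [d f], [b h], [a c; g i]
WEIGHTS_TOTAL = 9     # weights in the original 3x3 kernel


def buffer_need(tile, ksub):
    """Toy on-chip footprint per round, doubled for double-buffering."""
    ifmap_tile = tile * tile                         # shared by all ksub sub-kernels
    weights = WEIGHTS_TOTAL * ksub / NUM_SUB_KERNELS
    ofmap_tile = tile * tile * ksub
    return 2 * (ifmap_tile + weights + ofmap_tile)


def latency(tile, ksub, ifmap_size, pe_count, bandwidth):
    """Toy cycle count; compute and memory transfer overlap within a round."""
    rounds = math.ceil(ifmap_size / tile) ** 2 * math.ceil(NUM_SUB_KERNELS / ksub)
    macs = tile * tile * WEIGHTS_TOTAL * ksub / NUM_SUB_KERNELS
    traffic = tile * tile + WEIGHTS_TOTAL * ksub / NUM_SUB_KERNELS + tile * tile * ksub
    return rounds * max(macs / pe_count, traffic / bandwidth)


def best_schedule(ifmap_size, pe_count, bandwidth, buffer_size):
    """Search phi = (Tile, Ksub) subject to the buffer constraint Buf <= Buf*."""
    best = None
    for tile in range(1, ifmap_size + 1):
        for ksub in (1, 2, 4):
            if buffer_need(tile, ksub) > buffer_size:
                continue  # infeasible: does not fit on chip
            cost = latency(tile, ksub, ifmap_size, pe_count, bandwidth)
            if best is None or cost < best[0]:
                best = (cost, tile, ksub)
    return best  # (latency, Tile, Ksub), or None if nothing fits


# Hypothetical numbers: 64x64 ifmap, 16 elements/cycle, 4096-element buffer.
print(best_schedule(ifmap_size=64, pe_count=576, bandwidth=16, buffer_size=4096))
```

Even in this toy model the trade-off is visible: a larger Ksub amortizes one ifmap tile load over more sub-kernels but leaves less buffer space for the tile itself. Exhaustive search is only practical because the toy search space is tiny; a continuous solver is the natural choice once the model has more variables.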

Hardware Implementation
ISM Algorithm operations: Conv., BM, OF
Baseline Systolic Array:
▹ Convolutions in DNN
Baseline Scalar Unit:
▹ ReLU, Pooling in DNN
Modified Systolic Array:
▹ Convolutions in DNN
▹ Block Matching in Refine Correspondences
Modified Scalar Unit:
▹ ReLU, Pooling in DNN
▹ Operations in Optical Flow
The overall area overhead introduced by ASV is below 0.5%.

Experimental Setup
Hardware implementation:
▹ Systolic array: 24x24 PEs at 1 GHz
▹ Scalar units: 8, running in parallel at 250 MHz
▹ SRAM: 1.5 MB on-chip buffer
▹ DRAM: 4 Micron 16 Gb LPDDR3-1600 channels
Stereo DNNs:
▹ FlowNet, DispNet, GC-Net, PSMNet
Datasets:
▹ SceneFlow and KITTI

Evaluation
Variants:
▹ ISM: ISM algorithm without deconv. optimizations.
▹ DCO: Deconv. optimizations without ISM algorithm.
▹ ISM + DCO: both optimizations combined.
DNN inference for every 4 frames: DNN Inference, Non-DNN Inference, Non-DNN Inference, Non-DNN Inference

Evaluation
DNN inference for every 4 frames: DNN Inference, Non-DNN Inference, Non-DNN Inference, Non-DNN Inference
[Figure: error rate (%) of DNN vs. ISM for DispNet, FlowNetC, PSMNet, GC-Net, and AVG. Labeled values: 3.89, 3.95.]

Evaluation
[Figure: speedup of DCO, ISM, and DCO+ISM for DispNet, FlowNetC, PSMNet, GC-Net, and AVG. Labeled values: 1.5x, 3.3x, 5.0x.]
[Figure: energy reduction (%) of DCO, ISM, and DCO+ISM for DispNet, FlowNetC, PSMNet, GC-Net, and AVG. Labeled values: 42%, 75%, 85%.]

Evaluation
[Figure: average speedup, ASV vs. GANNX. Labeled values: 3.6x, 5.0x.]
[Figure: average energy reduction, ASV vs. GANNX. Labeled values: 3.2x, 4.2x.]

Conclusion
‣ "Depth from stereo" is critical to emerging intelligent applications deployed in energy- and performance-constrained devices.
‣ ASV simultaneously improves performance and energy-efficiency, while maintaining high accuracy via a HW & SW co-design.
‣ Careful design choices can let these optimizations be integrated with existing DNN accelerators with minor hardware extensions.
