ASV: Accelerated Stereo Vision System

Yu Feng
October 15, 2019

Estimating depth from stereo vision cameras, i.e., "depth from stereo", is critical to emerging intelligent applications deployed in energy- and performance-constrained devices, such as augmented reality headsets and mobile autonomous robots. While existing stereo vision systems make trade-offs among accuracy, performance, and energy-efficiency, we describe ASV, an accelerated stereo vision system that simultaneously improves both performance and energy-efficiency while achieving high accuracy.
The key to ASV is to exploit characteristics unique to stereo vision and apply stereo-specific optimizations, both algorithmic and computational. We make two contributions. First, we propose a new stereo algorithm, invariant-based stereo matching (ISM), that achieves significant speedup while retaining high accuracy. The algorithm combines classic "hand-crafted" stereo algorithms with recent developments in Deep Neural Networks (DNNs) by leveraging the correspondence invariant unique to stereo vision systems. Second, we observe that the bottleneck of the ISM algorithm is DNN inference, and in particular the deconvolution operations, which introduce massive compute inefficiencies. We propose a set of software optimizations that mitigate these inefficiencies. We show that with less than 0.5% hardware area overhead, these algorithmic and computational optimizations can be effectively integrated within a conventional DNN accelerator. Overall, ASV achieves a 5× speedup and 85% energy saving with 0.02% accuracy loss compared to today's DNN-based stereo vision systems.

Transcript

  1. ASV: Accelerated Stereo Vision System. Yu Feng, with Paul Whatmough (Arm Research) and Yuhao Zhu. Department of Computer Science, University of Rochester. http://horizon-lab.org
  4. The Right Distance (Depth) is Important! (Figure caption: Distance: 1.0 inch. Heart rate: 200↑ ❤ Eve ❤)
  5. Triangulation: Binocular Depth Sensing. (Figure: a physical point seen by the left and right cameras; baseline B, focal length f, projections XL and XR on the left and right image plates.)
  6. Using similar triangles: D / (D + f) = B / (B + Z), where D is the depth.
  9. The disparity is Z = XR − XL.
  10. Solving gives D = B·f / Z.
  11. Continuous Stereo Vision: the inputs are left (L) and right (R) frames; the output is the disparity map Z = |XR − XL|, from which depth is recovered.
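To make the triangulation concrete, here is a minimal Python sketch of the D = B·f / Z relation derived above. The baseline, focal length, and disparity values are hypothetical, chosen only for illustration.

```python
import numpy as np

def depth_from_disparity(disparity_px, baseline_m, focal_px):
    """D = B * f / Z, the similar-triangles result from the slides.

    disparity_px: disparity Z = XR - XL, in pixels
    baseline_m:   camera baseline B, in meters
    focal_px:     focal length f, in pixels
    """
    z = np.asarray(disparity_px, dtype=np.float64)
    # Zero disparity means the point is at infinity; avoid dividing by zero.
    return np.where(z > 0, baseline_m * focal_px / np.where(z > 0, z, 1), np.inf)

# Hypothetical rig: 12 cm baseline, 700-pixel focal length.
print(depth_from_disparity([[35.0, 70.0], [7.0, 0.0]], 0.12, 700.0))
# 35 px -> 2.4 m, 70 px -> 1.2 m, 7 px -> 12 m, 0 px -> inf
```

Note the inverse relation: nearby objects produce large disparities, so depth resolution degrades as objects get farther away.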
  12. Accuracy vs. Speed Trade-off. (Chart: error rate (%) vs. frame rate (FPS) for non-DNN (CPU), DNN (GPU), and DNN (Accelerator) stereo systems.)
  13. The 30 FPS real-time target is marked on the chart; existing systems trade accuracy against reaching it.
  16. ASV reaches the desirable region: real-time frame rates at a low error rate.
  17. ASV: Accelerated Stereo Vision System = ‣ Algorithm: Invariant-based Stereo Matching Algorithm + ‣ Compiler: Deconvolution Transformation and Dataflow Optimization + ‣ Hardware: Principled and Minimal Hardware Modifications
  21. ISM: Invariant-based Stereo Matching Algorithm. At t = t0, find correspondences between the L and R frames via DNN inference; at t = t0+1, the correspondences are initially unknown.
  23. Invariant: two corresponding pixels always correspond to the same physical point across frames over time.
  25. So instead of re-running the DNN at t = t0+1: propagate the correspondences (motion estimation) with an optical flow algorithm, then refine the correspondences with block matching.
  28. Across frames t0 … t0+3, running DNN inference on every frame keeps accuracy GOOD on every frame, but performance is SLOW on every frame.
  30. With ISM, DNN inference runs only on key frames and the ISM algorithm covers the frames in between: accuracy stays GOOD while the in-between frames become FAST.
  31. Code: https://github.com/horizon-research/ism-algorithm (sketched below)
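The key-frame structure can be sketched as follows. This is a schematic outline, not the actual API of the ism-algorithm repository: `ism_stereo`, `warp`, and the `stereo_dnn`/`optical_flow`/`block_matching` callables are placeholders, and the 4-frame key-frame interval follows the evaluation setup later in the talk.

```python
import numpy as np

KEY_FRAME_INTERVAL = 4  # DNN inference once every 4 frames, as in the evaluation

def warp(disparity, flow):
    """Carry each disparity value along the estimated motion (nearest neighbor)."""
    h, w = disparity.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ty = np.clip(ys + np.rint(flow[..., 0]).astype(int), 0, h - 1)
    tx = np.clip(xs + np.rint(flow[..., 1]).astype(int), 0, w - 1)
    out = disparity.copy()            # unmapped pixels keep their old estimate
    out[ty, tx] = disparity[ys, xs]
    return out

def ism_stereo(frames, stereo_dnn, optical_flow, block_matching):
    """Schematic ISM loop over (left, right) frame pairs, yielding disparity maps."""
    disparity = prev_left = None
    for t, (left, right) in enumerate(frames):
        if t % KEY_FRAME_INTERVAL == 0:
            disparity = stereo_dnn(left, right)             # find correspondences
        else:
            flow = optical_flow(prev_left, left)            # propagate correspondences
            disparity = warp(disparity, flow)
            disparity = block_matching(left, right, disparity)  # refine them
        prev_left = left
        yield disparity

# Toy run with stand-in components (a real system plugs in a stereo DNN,
# an optical-flow routine, and a block matcher here).
rng = np.random.default_rng(0)
frames = [(rng.random((4, 4)), rng.random((4, 4))) for _ in range(6)]
dnn = lambda l, r: np.abs(l - r)               # stand-in "DNN inference"
of = lambda a, b: np.zeros(a.shape + (2,))     # stand-in flow: no motion
bm = lambda l, r, init: init                   # stand-in refinement: identity
for d in ism_stereo(frames, dnn, of, bm):
    print(round(float(d.mean()), 3))
```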
  34. Deconv. is the Major Operation in Stereo DNN. (Figure: a stereo DNN with downsampling and upsampling layers.)
  36. Downsampling (CONV.) layers extract and combine high-level features.
  37. Upsampling (DECONV.) layers restore and refine disparity resolution.
  39. (Chart: deconvolution's share of compute cost (%) in FlowNetC, DispNet, GC-Net, and PSMNet.)
  41. Deconvolution Transformation. A stride-2 deconvolution zero-upsamples the 2×2 ifmap [A B; C D], then convolves the result with the original 3×3 kernel [a b c; d e f; g h i].
  43. Because the upsampled ifmap is mostly zeros, each output position uses only part of the kernel, yielding four computation patterns:
  44. Pattern 1 (sub-kernel [e]): (1,1) = A·e, (1,3) = B·e, (3,1) = C·e, (3,3) = D·e.
  45. Pattern 2 (sub-kernel [d f]): (1,2) = A·d + B·f, (3,2) = C·d + D·f.
  47. Pattern 3 (sub-kernel [b; h]): (2,1) = A·b + C·h, (2,3) = B·b + D·h.
  48. Pattern 4 (sub-kernel [a c; g i]): (2,2) = A·a + B·c + C·g + D·i.
  54. Each pattern is a standard convolution of a sub-kernel with the original, non-upsampled ifmap; interleaving the four partial ofmaps reconstructs the full deconvolution output.
  60. Deconvolution Transformation: compile one deconvolution layer into 4 convolution layers.
  61. Formally, the deconvolution above can be calculated as
    [a b c; d e f; g h i] ⊛ I = G( [e] ∗ I, [d f] ∗ I, [b; h] ∗ I, [a c; g i] ∗ I )
    where ⊛ denotes deconvolution, ∗ denotes standard convolution, I denotes the ifmap, and G denotes the gather operation that assembles the ofmap from the results of the four convolutions. G can be implemented simply as loads/stores to the scratchpad memory (on-chip buffer).
  66. A naive transformation increases memory traffic: 4 sub-kernels + 4 reads of the ifmap!
  67. Key observation: the sub-convolutions share the same ifmap. This is a new data-reuse opportunity: Inter-Layer Activation Reuse (ILAR). (A runnable sketch of the transformation follows.)
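A minimal NumPy/SciPy sketch of the transformation, assuming a stride-2 deconvolution with a 3×3 kernel as on the slides. The function names are hypothetical and kernel-orientation conventions may flip which sub-kernel element lands where, but the four sub-kernel groupings are exactly those above; this is an illustration, not ASV's compiler.

```python
import numpy as np
from scipy.signal import convolve2d

def deconv_direct(ifmap, kernel):
    """Reference stride-2 deconvolution: scatter-add the kernel at each input."""
    H, W = ifmap.shape
    out = np.zeros((2 * H + 1, 2 * W + 1))
    for r in range(H):
        for c in range(W):
            out[2 * r:2 * r + 3, 2 * c:2 * c + 3] += ifmap[r, c] * kernel
    return out

def deconv_as_4_convs(ifmap, kernel):
    """The same result as 4 standard convolutions plus a gather (interleave)."""
    (a, b, c), (d, e, f), (g, h, i) = kernel   # rows of [a b c; d e f; g h i]
    H, W = ifmap.shape
    out = np.zeros((2 * H + 1, 2 * W + 1))
    # Each output-position parity class is one sub-kernel convolution; the
    # strided writes below play the role of the gather operation G.
    out[1::2, 1::2] = convolve2d(ifmap, [[e]])             # kernel center
    out[1::2, 0::2] = convolve2d(ifmap, [[d, f]])          # horizontal pair
    out[0::2, 1::2] = convolve2d(ifmap, [[b], [h]])        # vertical pair
    out[0::2, 0::2] = convolve2d(ifmap, [[a, c], [g, i]])  # four corners
    return out

rng = np.random.default_rng(0)
I = rng.random((2, 2))   # the slides' 2x2 ifmap [A B; C D]
K = rng.random((3, 3))   # the slides' 3x3 kernel
assert np.allclose(deconv_direct(I, K), deconv_as_4_convs(I, K))
print("deconv == gather(4 sub-convolutions)")
```

Note that all four sub-convolutions read the same `ifmap` array, which is the ILAR reuse opportunity the next slides exploit.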
  69. Deconvolution Optimization: Problem Setup. ▸ Hardware assumptions: ▹ a system-on-chip connected to DRAM ▹ an on-chip buffer for ifmap, weights, and ofmap ▹ a systolic array with an output-stationary dataflow ▹ double-buffering (a working buffer and a filling buffer). Goal: minimize the latency and/or memory traffic in deconvolution.
  73. Deconvolution Optimization: Complexity. (Figure: the four sub-kernels [e], [d f], [b; h], [a c; g i] convolved with tiles of a 4×4 ifmap A…P; different schedules interleave tiles and sub-kernels differently.)
  74. Many schedules exist (e.g., Schedule 1 vs. Schedule 2): which one best minimizes latency and/or memory traffic?
  76. Deconvolution Optimization: Formulation. ▸ Dataflow optimization → constrained optimization. Objective: min L(Θ, ϕ), where Θ is the hardware configuration and ϕ is the tiling schedule.
  77. ▸ Hardware configuration Θ = {A, BW, Buf}: ▹ systolic array capability: A ≤ A* ▹ memory bandwidth: BW ≤ BW* ▹ on-chip buffer size: Buf ≤ Buf* (*: hardware capacity).
  79. ▸ Variables ϕ = {Tile, Ksub}: ▹ Tile: the tile size in every round ▹ Ksub: the number of different sub-kernels in every round.
  82. This is a non-linear constrained optimization problem, solved with Sequential Least Squares Programming (SLSQP); a toy sketch follows below.
  83. https://github.com/horizon-research/systolic-array-dataflow-optimizer
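A toy sketch of the formulation using SciPy's SLSQP solver. The latency and buffer models below are illustrative stand-ins, not the cost model of the systolic-array-dataflow-optimizer repository, and the constants (A_STAR, BW_STAR, BUF_STAR, N) are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative constants (hypothetical, not ASV's measured parameters).
A_STAR = 24 * 24        # systolic array capability A* (MACs per cycle)
BW_STAR = 16.0          # memory bandwidth BW* (bytes per cycle)
BUF_STAR = 1.5 * 2**20  # on-chip buffer size Buf* (bytes)
N = 512 * 256           # ifmap elements in one deconvolution layer
K_ELEMS = 9             # elements in the original 3x3 kernel

def latency(x):
    """Toy latency model L(theta, phi) for phi = (Tile, Ksub)."""
    tile, k_sub = x
    rounds = (N / tile) * (4.0 / k_sub)  # 4/Ksub passes cover all 4 sub-kernels
    compute = tile * k_sub * K_ELEMS / A_STAR        # array cycles per round
    traffic = tile + k_sub * K_ELEMS + tile * k_sub  # ifmap + kernels + ofmap
    return rounds * max(compute, traffic / BW_STAR)  # double-buffered overlap

constraints = [{
    "type": "ineq",  # working + filling copies must fit in the on-chip buffer
    "fun": lambda x: BUF_STAR - 2 * (x[0] + x[1] * K_ELEMS + x[0] * x[1]),
}]
result = minimize(latency, x0=[4096, 2], method="SLSQP",
                  bounds=[(256, N), (1, 4)], constraints=constraints)
tile, k_sub = result.x
print(f"Tile ~ {tile:.0f} elements/round, Ksub ~ {k_sub:.1f} sub-kernels/round")
```

In this toy model a larger Ksub amortizes ifmap reloads across sub-kernels (the ILAR reuse), while the buffer constraint caps how much tile and Ksub can grow together; the real optimizer explores that same trade-off with a faithful cost model.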
  86. Hardware Implementation. Baseline systolic array: ▹ convolutions in the DNN. Baseline scalar unit: ▹ ReLU and pooling in the DNN.
  87. The ISM algorithm additionally needs optical flow (OF) and block matching (BM).
  88. Modified systolic array: ▹ convolutions in the DNN ▹ block matching in Refine Correspondences.
  89. Modified scalar unit: ▹ ReLU and pooling in the DNN ▹ the operations in optical flow.
  90. The overall area overhead introduced by ASV is below 0.5%.
  91. Experimental Setup. Hardware implementation: ▹ systolic array: 24×24 PEs at 1 GHz ▹ 8 scalar units running in parallel at 250 MHz ▹ SRAM: 1.5 MB on-chip buffer ▹ DRAM: 4 Micron 16 Gb LPDDR3-1600 channels.
  93. Stereo DNNs: ▹ FlowNetC, DispNet, GC-Net, PSMNet.
  94. Datasets: ▹ SceneFlow and KITTI.
  95. Evaluation. All ISM variants run DNN inference once every 4 frames (one DNN-inference frame followed by three non-DNN frames). Variants: ▹ ISM: the ISM algorithm without deconv. optimizations ▹ DCO: deconv. optimizations without the ISM algorithm ▹ ISM + DCO: both optimizations combined.
  98. Evaluation: accuracy. (Chart: error rate (%) on DispNet, FlowNetC, PSMNet, and GC-Net, plus the average, for pure DNN vs. ISM.) On average, ISM's error rate is 3.95%, vs. 3.89% for per-frame DNN inference.
  102. Evaluation: performance. (Chart: speedup on DispNet, FlowNetC, PSMNet, and GC-Net, plus the average, for DCO, ISM, and DCO+ISM.) Average speedup: 1.5× (DCO), 3.3× (ISM), 5.0× (DCO+ISM).
  106. Evaluation: energy. (Chart: energy reduction (%) for the same networks and configurations.) Average energy reduction: 42% (DCO), 75% (ISM), 85% (DCO+ISM).
  110. Evaluation: comparison with GANNX. (Charts: average speedup and energy reduction.) ASV achieves 5.0× speedup vs. GANNX's 3.6×, and 4.2× energy reduction vs. GANNX's 3.2×.
  111. Conclusion. ‣ "Depth from stereo" is critical to emerging intelligent applications deployed in energy- and performance-constrained devices.
  112. ‣ ASV simultaneously improves performance and energy-efficiency while maintaining high accuracy, via hardware-software co-design.
  113. ‣ Careful design choices let these optimizations integrate into existing DNN accelerators with minor hardware extensions.