ASV: Accelerated Stereo Vision System

HorizonLab
October 14, 2019

MICRO 2019. Presented by Yu Feng

Transcript

  1. 1 ASV: Accelerated Stereo Vision System Yu Feng with Paul

    Whatmough (Arm Research) and Yuhao Zhu Department of Computer Science University of Rochester http://horizon-lab.org
  6. 2 Eve ❤ — Distance: 1.0 inch, Heart rate: 200↑.

    Right Distance (Depth) is Important!
  11. Applications Need Depth Information 3 — 3D Reconstruction, Augmented

    Reality, Drone Navigation, Domestic Robot
  16. Techniques to Extract Depth Information 4 — Passive Sensing vs. Active Sensing

  27. Triangulation: Binocular Depth Sensing 5 — A physical point projects onto

    the left and right image plates at XL and XR; the two cameras sit a baseline B apart with focal length f. Using similar triangles: D / (D + f) = B / (B + Z), where Z = XR − XL is the disparity and D the depth. Solving gives D = Bf / Z.
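The similar-triangles result above fits in a few lines of code. This is a minimal sketch; the baseline, focal length, and example coordinates below are illustrative numbers, not values from the talk.

```python
def depth_from_disparity(x_left, x_right, baseline_m, focal_px):
    """Triangulation: depth D = B * f / Z, with disparity Z = |XL - XR|."""
    disparity = abs(x_left - x_right)   # in pixels
    if disparity == 0:
        return float("inf")             # zero disparity: point at infinity
    return baseline_m * focal_px / disparity

# Illustrative numbers: 0.1 m baseline, 700 px focal length, 20 px disparity
print(depth_from_disparity(350.0, 330.0, 0.1, 700.0))  # → 3.5 (meters)
```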
  32. Continuous Stereo Vision 6 — Inputs: L and R images; output: disparity map.

    For each pixel, the disparity Z = | XL − XR |; the per-pixel disparities form the disparity map, which converts to depth via triangulation.
  35. Continuous Stereo Vision 7 — The L and R inputs feed a stereo matching

    algorithm { DNN-based, non-DNN-based }, which produces the output disparity map.
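As a concrete instance of the non-DNN branch above, here is a naive sum-of-absolute-differences block-matching sketch over rectified images. The window size and search range are illustrative choices, not anything specified in the talk.

```python
import numpy as np

def block_matching_disparity(left, right, max_disp=16, block=5):
    """For each left-image block, scan the same row of the right image and
    keep the horizontal shift (disparity) with the smallest SAD score."""
    H, W = left.shape
    half = block // 2
    disp = np.zeros((H, W), dtype=np.int32)
    for y in range(half, H - half):
        for x in range(half, W - half):
            ref = left[y-half:y+half+1, x-half:x+half+1].astype(np.int32)
            best, best_d = np.inf, 0
            for d in range(0, min(max_disp, x - half) + 1):
                cand = right[y-half:y+half+1, x-d-half:x-d+half+1].astype(np.int32)
                sad = np.abs(ref - cand).sum()
                if sad < best:
                    best, best_d = sad, d
            disp[y, x] = best_d
    return disp
```

Real implementations add subpixel refinement, left-right consistency checks, and cost aggregation; this sketch only shows the core matching loop.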
  40. Accuracy vs. Speed Trade-off 8 — [Chart: Error Rate (%) vs. FPS for

    non-DNN (CPU), DNN (GPU), and DNN (Accelerator) stereo matching, against a 30 FPS real-time target. ASV sits in the low-error, high-frame-rate region.]
  44. ASV: Accelerated Stereo Vision System 9

    ‣ Algorithm: Invariant-based Stereo Matching Algorithm
    ‣ Compiler: Deconvolution Transformation and Dataflow Optimization
    ‣ Hardware: Principled and Minimal Hardware Modifications
  55. ISM: Invariant-based Stereo Matching Algorithm 11 — At t = t0, run full DNN

    inference on the (L, R) pair to find correspondences. At t = t0+1, propagate those correspondences via motion estimation instead of re-running the DNN. Invariant: two corresponding pixels always correspond to the same physical point across frames over time.
  59. ISM: Invariant-based Stereo Matching Algorithm 11 — The ISM steps on a

    non-key frame: Find Correspondences; Propagate Correspondences with an Optical Flow Algorithm (motion estimation); Refine Correspondences with Block Matching.
  64. ISM: Invariant-based Stereo Matching Algorithm 12 — Over frames t = t0 …

    t0+3: running DNN inference on every frame is GOOD in accuracy but SLOW. ISM instead alternates DNN inference (key frames) with the ISM algorithm on the frames in between: accuracy stays GOOD while the in-between frames become FAST. Code: https://github.com/horizon-research/ism-algorithm
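The alternation above can be sketched as a scheduling loop. Here `dnn_stereo`, `estimate_motion`, `refine`, and `key_every` are hypothetical placeholders for the real components (the stereo DNN, an optical-flow algorithm, block-matching refinement, and the key-frame interval); only the control flow reflects the slide.

```python
def ism_pipeline(frames, dnn_stereo, estimate_motion, refine, key_every=4):
    """Run the expensive stereo DNN only on key frames; propagate + refine
    correspondences on the frames in between (the ISM invariant)."""
    disparities = []
    prev_left, prev_disp = None, None
    for t, (left, right) in enumerate(frames):
        if t % key_every == 0:
            disp = dnn_stereo(left, right)               # SLOW, accurate
        else:
            flow = estimate_motion(prev_left, left)      # propagate
            disp = refine(prev_disp, flow, left, right)  # FAST, refine locally
        disparities.append(disp)
        prev_left, prev_disp = left, disp
    return disparities
```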
  67. Deconv. is the Major Operation in Stereo DNN 14

  68. … … … … … … … … Deconv. is

    the Major Operation in Stereo DNN 14
  69. … … … … … … … … Deconv. is

    the Major Operation in Stereo DNN 14
  70. … … … … … … … … Deconv. is

    the Major Operation in Stereo DNN 14 Downsampling: Extract and Combine High-level Features
  71. … … … … … … … … Deconv. is

    the Major Operation in Stereo DNN 14 Downsampling: Extract and Combine High-level Features Upsampling: Restore and Refine Disparity Resolution
  72. … … … … … … … … Deconv. is

    the Major Operation in Stereo DNN 14 Downsampling: Extract and Combine High-level Features Upsampling: Restore and Refine Disparity Resolution CONV. DECONV.
  73. … … … … … … … … Deconv. is

    the Major Operation in Stereo DNN 14 Deconvolution Comp. Cost (%) 0 25 50 75 100 Flow N etC DispN et G C -N et PSM N et Downsampling: Extract and Combine High-level Features Upsampling: Restore and Refine Disparity Resolution CONV. DECONV.
  74. Deconv. is the Major Operation in Stereo DNN 14 — Stereo DNNs pair

    downsampling (CONV: extract and combine high-level features) with upsampling (DECONV: restore and refine disparity resolution). [Chart: deconvolution compute cost (%) in FlowNetC, DispNet, GC-Net, and PSMNet — deconvolution accounts for a large share of each network's compute.]
  75. Deconvolution Transformation 15 — Worked example: a 2×2 ifmap [A B; C D] is

    zero-upsampled and convolved (✽) with the original 3×3 kernel [a b c; d e f; g h i]. The output elements fall into four computation patterns, each touching a different subset of the kernel:
    ▹ (1,1) = A·e, (1,3) = B·e, (3,1) = C·e, (3,3) = D·e — sub-kernel [e]
    ▹ (1,2) = A·d + B·f, (3,2) = C·d + D·f — sub-kernel [d f]
    ▹ (2,1) = A·b + C·h, (2,3) = B·b + D·h — sub-kernel [b; h]
    ▹ (2,2) = A·a + B·c + C·g + D·i — sub-kernel [a c; g i]
    Each pattern is a standard convolution of the original (non-upsampled) ifmap with a distinct sub-kernel drawn from the original kernel.
  96. Deconvolution Transformation 20

    ▸ Compile a deconvolution layer into 4 convolution layers:
    [a b c; d e f; g h i] ⊛̂ I = G([e] ⊛ I, [d f] ⊛ I, [b; h] ⊛ I, [a c; g i] ⊛ I)
    where ⊛̂ denotes deconvolution, ⊛ standard convolution, I the ifmap, and G the gather operation (a set of stores to the on-chip scratchpad) that assembles the ofmap from the four convolution results.
    ▸ Naive transformation increases memory traffic: 4 sub-kernels + 4 ifmaps!
    ▸ Key observation: the sub-convolutions share the same ifmap — a new data reuse opportunity: Inter-Layer Activation Reuse (ILAR).
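A minimal NumPy check of the transformation above: a stride-2 deconvolution computed directly (zero-insertion, then convolution) matches the four sub-convolutions plus gather. The stride-2, 3×3, 'same'-padding setup mirrors the slide's example; the helper names are my own.

```python
import numpy as np

def corr2d_valid(img, ker):
    """Plain 'valid' 2-D cross-correlation (DNN-style convolution)."""
    H, W = img.shape; kh, kw = ker.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * ker)
    return out

def deconv_direct(I, K):
    """Reference: zero-insert upsampling (stride 2), then 'same' convolution."""
    H, W = I.shape
    U = np.zeros((2*H - 1, 2*W - 1))
    U[::2, ::2] = I                      # upsampled ifmap
    return corr2d_valid(np.pad(U, 1), K)

def deconv_as_four_convs(I, K):
    """ASV-style transformation: four sub-convolutions + gather G.
    K is indexed [[a b c], [d e f], [g h i]] as in the slides."""
    H, W = I.shape
    out = np.zeros((2*H - 1, 2*W - 1))
    out[::2, ::2]   = corr2d_valid(I, K[1:2, 1:2])   # sub-kernel [e]
    out[::2, 1::2]  = corr2d_valid(I, K[1:2, 0::2])  # sub-kernel [d f]
    out[1::2, ::2]  = corr2d_valid(I, K[0::2, 1:2])  # sub-kernel [b; h]
    out[1::2, 1::2] = corr2d_valid(I, K[0::2, 0::2]) # sub-kernel [a c; g i]
    return out                                       # gather = strided stores

I = np.array([[1., 2.], [3., 4.]])        # ifmap [A B; C D]
K = np.arange(1., 10.).reshape(3, 3)      # kernel [a..i]
print(np.allclose(deconv_direct(I, K), deconv_as_four_convs(I, K)))  # → True
```

Note that the decomposed version never materializes the zero-padded upsampled ifmap, which is where the compute savings come from.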
  111. Deconvolution Optimization: Problem Setup 21 — Goal: minimize the latency

    and/or memory traffic of deconvolution.
    ▸ Hardware assumptions:
    ▹ A system-on-chip connected to DRAM
    ▹ On-chip buffer for ifmap, weights, and ofmap
    ▹ Systolic array, output-stationary dataflow
    ▹ Double buffering (working buffer / filling buffer)
  112. Deconvolution Optimization: Complexity 22 — Goal: minimize the latency

    and/or memory traffic of deconvolution. [Figure: the four sub-kernels convolved with ifmap tiles under two candidate schedules (Schedule 1, Schedule 2) — many tiling and sub-kernel schedules are possible; which one is best?]
  124. Deconvolution Optimization: Formulation 23 ▸Dataflow optimization → Constrained optimization Objective:

    Min. L(Θ, ϕ) Θ : Hardware configuration ϕ : Tiling schedule ▸Hardware configuration, ={A, BW, Buf} A ≤ A* ▹Systolic Array Capability: BW ≤ BW* ▹Memory Bandwidth: Buf ≤ Buf* ▹On-chip Buffer Size: ▸Variables, = {Tile, Ksub } ϕ ▹Tile: Tile Size in every round ▹Ksub : The number of different sub-kernels in every round
  125. Deconvolution Optimization: Formulation 23 ▸Dataflow optimization → Constrained optimization Objective:

    Min. L(Θ, ϕ) Θ : Hardware configuration ϕ : Tiling schedule ▸Hardware configuration, ={A, BW, Buf} A ≤ A* ▹Systolic Array Capability: BW ≤ BW* ▹Memory Bandwidth: Buf ≤ Buf* ▹On-chip Buffer Size: ▸Variables, = {Tile, Ksub } ϕ ▹Tile: Tile Size in every round ▹Ksub : The number of different sub-kernels in every round Non-linear Constraint Optimization “Sequential Least Squares Programming”
  126. Deconvolution Optimization: Formulation 23

    ▸Dataflow optimization → Constrained optimization
    Objective: Min. L(Θ, ϕ), where Θ is the hardware configuration and ϕ is the tiling schedule.
    ▸Hardware configuration, Θ = {A, BW, Buf} (*: hardware capacity)
    ▹Systolic array capability: A ≤ A*
    ▹Memory bandwidth: BW ≤ BW*
    ▹On-chip buffer size: Buf ≤ Buf*
    ▸Variables, ϕ = {Tile, Ksub}
    ▹Tile: tile size in every round
    ▹Ksub: the number of different sub-kernels in every round
    This is a non-linear constrained optimization, solved with Sequential Least Squares Programming (SLSQP).
    https://github.com/horizon-research/systolic-array-dataflow-optimizer
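The formulation above can be prototyped with an off-the-shelf SLSQP solver. The cost model below is a deliberately simplified, hypothetical stand-in (a per-round fixed cost C0 and a toy latency L); the real model lives in the linked repository:

```python
from scipy.optimize import minimize

# Hypothetical hardware capacities (Theta) -- example values only.
A_star   = 24 * 24        # systolic-array MACs
Buf_star = 32 * 1024      # on-chip buffer bytes available to this layer
C0       = 1000.0         # assumed fixed per-round cost (weight reload, etc.)

H, W, S = 128, 128, 4     # ifmap size and total number of sub-kernels

def latency(v):
    """Toy latency L(phi): rounds x (fixed overhead + compute per round).
    v = (tile, k_sub): tile edge length and sub-kernels per round."""
    tile, k_sub = v
    rounds = (H * W * S) / (tile * tile * k_sub)
    return rounds * (C0 + (tile * tile * k_sub) / A_star)

# Buffer constraint: tile pixels and partial sums (2 bytes each) fit on chip.
cons = [{"type": "ineq", "fun": lambda v: Buf_star - 2 * v[0] * v[0] * v[1]}]

res = minimize(latency, x0=[8.0, 1.0], method="SLSQP",
               bounds=[(4, 128), (1, S)], constraints=cons)
tile, k_sub = res.x
```

The solver pushes the tile size and sub-kernel count up until the buffer constraint binds, mirroring the intuition that larger rounds amortize fixed costs until on-chip storage runs out.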
  127. ASV: Accelerated Stereo Vision System 24

    ‣Algorithm: Invariant-based Stereo Matching
    ‣Compiler: Deconvolution Transformation and Dataflow Optimization
    ‣Hardware: Principled and Minimal Hardware Modifications
  136. Hardware Implementation 25

    Baseline Systolic Array → Modified Systolic Array:
    ▹Convolutions in DNN
    ▹Block Matching in Refine Correspondences
    Baseline Scalar Unit → Modified Scalar Unit:
    ▹ReLU, Pooling in DNN
    ▹Operations in Optical Flow
    The overall area overhead introduced by ASV is below 0.5%.
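Block matching, the correspondence-refinement kernel the modified systolic array absorbs, is classically a sum-of-absolute-differences (SAD) search over candidate disparities. The sketch below is a generic, illustrative SAD matcher for a single pixel, not ASV's actual systolic-array mapping:

```python
import numpy as np

def block_match(left, right, y, x, block=3, max_disp=8):
    """Return the disparity d (0..max_disp-1) whose right-image patch
    at (y, x-d) best matches the left-image patch at (y, x), by SAD."""
    h = block // 2
    ref = left[y - h:y + h + 1, x - h:x + h + 1].astype(np.int32)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        if x - d - h < 0:           # candidate patch falls off the image
            break
        cand = right[y - h:y + h + 1, x - d - h:x - d + h + 1].astype(np.int32)
        cost = int(np.abs(ref - cand).sum())
        if cost < best_cost:        # keep the lowest-SAD disparity
            best_d, best_cost = d, cost
    return best_d
```

Each candidate disparity is an independent patch comparison, which is why the operation tiles naturally onto the same MAC array that serves convolutions.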
  142. Experimental Setup 26

    Hardware implementation:
    ▹Systolic array: 24×24 PEs at 1 GHz
    ▹Scalar units: 8 units running in parallel at 250 MHz
    ▹SRAM: 1.5 MB on-chip buffer
    ▹DRAM: 4 Micron 16 Gb LPDDR3-1600 channels
    Stereo DNNs: FlowNet, DispNet, GC-Net, PSMNet
    Datasets: SceneFlow and KITTI
  147. Evaluation 27

    Variants:
    ▹ISM: the ISM algorithm without deconv. optimizations.
    ▹DCO: deconv. optimizations without the ISM algorithm.
    ▹ISM + DCO: both optimizations combined.
    [Figure: ISM timeline — full DNN inference every 4 frames, with non-DNN inference on the frames in between.]
  151. Evaluation 28

    [Chart: error rate (%) on DispNet, FlowNetC, PSMNet, GC-Net, and AVG., comparing the baseline DNN against ISM (DNN inference every 4 frames, non-DNN inference in between). On average, the DNN's 3.89% error rate grows only to 3.95% with ISM.]
  160. Evaluation 29

    [Charts: speedup and energy reduction on DispNet, FlowNetC, PSMNet, GC-Net, and AVG., under DCO, ISM, and DCO+ISM. On average: DCO alone achieves 1.5× speedup and 42% energy reduction; ISM alone, 3.3× and 75%; DCO+ISM, 5.0× and 85%.]
  161. Evaluation 30

    [Charts: average speedup and energy reduction of ASV vs. GANNX. Speedup: ASV 5.0× vs. GANNX 3.6×; energy reduction: ASV 4.2× vs. GANNX 3.2×.]
  165. Conclusion 31

    ‣“Depth from stereo” is critical to emerging intelligent applications deployed on energy- and performance-constrained devices.
    ‣ASV simultaneously improves performance and energy efficiency while maintaining high accuracy, via HW/SW co-design.
    ‣Careful design choices allow these optimizations to be integrated into existing DNN accelerators with minor hardware extensions.