
ASV: Accelerated Stereo Vision System

HorizonLab
October 14, 2019

MICRO 2019. Presented by Yu Feng

Transcript

  1. 1
    ASV: Accelerated Stereo Vision System
    Yu Feng
    with Paul Whatmough (Arm Research) and Yuhao Zhu
    Department of Computer Science

    University of Rochester

    http://horizon-lab.org


  6. 2
    Distance: 1.0 inch
    Heart rate: 200↑ ❤
    Eve

    Right Distance (Depth) is Important!


  11. Applications Need Depth Information
    3
    3D Reconstruction
    Drone Navigation
    Augmented Reality
    Domestic Robot


  13. Techniques to Extract Depth Information
    4
    Passive Sensing / Active Sensing



  27. Triangulation: Binocular Depth Sensing
    5
    [Figure: a physical point imaged by the left and right cameras; left/right image plates with coordinates XL, X’L, XR; focal length f; depth D; baseline lengths B and B + Z]
    Using similar triangles: D / (D + f) = B / (B + Z)
    Z = XR − XL (disparity)
    ⇒ D = Bf / Z (depth)
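The similar-triangles relation on the slide collapses to a one-line computation. A minimal sketch (the numbers below are illustrative, not from the talk):

```python
# Binocular triangulation from the slide's derivation:
# D / (D + f) = B / (B + Z)  =>  D = B * f / Z,
# with baseline B, focal length f, and disparity Z = XR - XL.

def depth_from_disparity(baseline_m, focal_px, disparity_px):
    """Depth (meters) of a point from its disparity in a rectified pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return baseline_m * focal_px / disparity_px

# Example: 10 cm baseline, 700-pixel focal length, 35-pixel disparity -> 2 m.
d = depth_from_disparity(0.10, 700.0, 35.0)
assert d == 2.0
```

Note the inverse relationship: halving the disparity doubles the estimated depth, which is why distant points are harder to resolve.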


  32. Continuous Stereo Vision
    6
    Inputs (L, R) → Output (Disparity Map): for each pixel, | XR − XL | = Z, the disparity, which gives Depth.


  35. Continuous Stereo Vision
    7
    Inputs (L, R) + Stereo Matching Algorithms { DNN-based, non-DNN-based } → Output (Disparity Map)


  40. Accuracy vs. Speed Trade-off
    8
    [Chart: FPS (1–100) vs. Error Rate (0–16%) for non-DNN (CPU), DNN (GPU), and DNN (Accelerator) stereo methods; a 30 FPS real-time line; an ASV design point at high FPS and low error]


  44. ASV: Accelerated Stereo Vision System
    9
    ‣ Algorithm: Invariant-based Stereo Matching Algorithm
    ‣ Compiler: Deconvolution Transformation and Dataflow Optimization
    ‣ Hardware: Principled and Minimal Hardware Modifications



  55. ISM: Invariant-based Stereo Matching Algorithm
    11
    [Diagram: at t = t0, DNN inference finds correspondences between the pair (L, R); at t = t0+1, the correspondences are propagated (motion estimation) rather than recomputed]
    Invariant: two corresponding pixels always correspond to the same physical point across frames over time.


  59. ISM: Invariant-based Stereo Matching Algorithm
    11
    [Diagram: at t = t0, DNN inference finds correspondences between (L, R); at t = t0+1, an optical flow algorithm propagates the correspondences (motion estimation), and block matching refines them]
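The refinement step above uses block matching. This is an illustrative sketch of that primitive (not the released ISM code): for a block in the left image, search along the same scanline of the right image and pick the shift with the lowest sum of absolute differences (SAD). The rows, block size, and search range are made-up toy values.

```python
# 1-D block matching on a rectified pair, following the slides'
# convention Z = XR - XL (the right-image match sits at x + disparity).

def sad(a, b):
    """Sum of absolute differences between two equal-length blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_disparity(left_row, right_row, x, block=3, max_disp=4):
    """Disparity of the block centered at x in left_row."""
    half = block // 2
    ref = left_row[x - half: x + half + 1]
    best_d, best_cost = 0, float("inf")
    for dsp in range(max_disp + 1):
        xr = x + dsp                      # candidate match position
        if xr + half >= len(right_row):   # stay inside the scanline
            break
        cost = sad(ref, right_row[xr - half: xr + half + 1])
        if cost < best_cost:
            best_cost, best_d = cost, dsp
    return best_d

# Toy scanlines: the feature [9, 7, 5] is shifted by 2 pixels.
left  = [0, 0, 9, 7, 5, 0, 0, 0]
right = [0, 0, 0, 0, 9, 7, 5, 0]
assert match_disparity(left, right, x=3) == 2
```

Real implementations match 2-D blocks and add tie-breaking and sub-pixel refinement, but the cost-minimizing search is the same.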


  64. ISM: Invariant-based Stereo Matching Algorithm
    12
    Time:        t = t0        | t = t0+1      | t = t0+2      | t = t0+3
    Method:      DNN Inference | ISM Algorithm | ISM Algorithm | DNN Inference
    Performance: SLOW          | FAST          | FAST          | SLOW
    Accuracy:    GOOD          | GOOD          | GOOD          | GOOD
    https://github.com/horizon-research/ism-algorithm
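The schedule in the table interleaves the two paths: the full DNN runs only on key frames, and the cheap ISM propagation/refinement handles the frames in between. A hedged sketch of that control loop; `dnn_inference`, `ism_step`, and the key-frame interval are illustrative placeholders, not the talk's actual settings:

```python
# Key-frame scheduling: slow-but-accurate DNN on key frames,
# fast ISM steps (which reuse the invariant) on the rest.

def stereo_stream(frames, dnn_inference, ism_step, key_interval=3):
    disparity = None
    out = []
    for t, (left, right) in enumerate(frames):
        if t % key_interval == 0:
            disparity = dnn_inference(left, right)        # slow, accurate
        else:
            disparity = ism_step(left, right, disparity)  # fast, propagates
        out.append(disparity)
    return out

# Toy stand-ins: each "disparity" just records which path produced it.
frames = [(None, None)] * 6
tags = stereo_stream(frames,
                     dnn_inference=lambda l, r: "dnn",
                     ism_step=lambda l, r, prev: "ism")
assert tags == ["dnn", "ism", "ism", "dnn", "ism", "ism"]
```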

  65. ASV: Accelerated Stereo Vision System
    13
    ‣ Algorithm: Invariant-based Stereo Matching Algorithm
    ‣ Compiler: Deconvolution Transformation and Dataflow Optimization
    ‣ Hardware: Principled and Minimal Hardware Modifications


  74. Deconv. is the Major Operation in Stereo DNN
    14
    Downsampling (CONV.): extract and combine high-level features.
    Upsampling (DECONV.): restore and refine disparity resolution.
    [Chart: deconvolution's share of total compute cost (%) in FlowNetC, DispNet, GC-Net, and PSMNet]

  77. Deconvolution Transformation
    15
    ifmap (2×2): [A B; C D]
    Upsampled ifmap (zero-insertion): [A 0 B; 0 0 0; C 0 D]
    Original kernel (3×3): [a b c; d e f; g h i]
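The zero-insertion upsampling on this slide can be sketched directly (labels A–D follow the slides). Most entries of the upsampled map are zeros, which is why convolving it naively wastes multiply-accumulates; that waste is what the transformation removes.

```python
# Stride-2 zero-insertion upsampling of a 2x2 ifmap.

def zero_insert(ifmap, stride=2):
    """Insert (stride - 1) zeros between neighboring ifmap elements."""
    H, W = len(ifmap), len(ifmap[0])
    out_h, out_w = stride * (H - 1) + 1, stride * (W - 1) + 1
    up = [[0] * out_w for _ in range(out_h)]
    for r in range(H):
        for c in range(W):
            up[stride * r][stride * c] = ifmap[r][c]
    return up

A, B, C, D = 1, 2, 3, 4
up = zero_insert([[A, B], [C, D]])
assert up == [[A, 0, B], [0, 0, 0], [C, 0, D]]

# 5 of the 9 upsampled entries are zeros.
zeros = sum(v == 0 for row in up for v in row)
assert zeros == 5
```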

  95. Deconvolution Transformation
    19
    Sliding the original 3×3 kernel over the upsampled ifmap produces four distinct computation patterns:
    (1, 1) = A · e   (1, 3) = B · e   (3, 1) = C · e   (3, 3) = D · e   — sub-kernel [e]
    (1, 2) = A · d + B · f   (3, 2) = C · d + D · f   — sub-kernel [d f]
    (2, 1) = A · b + C · h   (2, 3) = B · b + D · h   — sub-kernel [b; h]
    (2, 2) = A · a + B · c + C · g + D · i   — sub-kernel [a c; g i]
    Each pattern is a standard convolution of the original ifmap with a distinct sub-kernel of the original kernel.

  96. Deconvolution Transformation
    20
    (1, 1) = A * e
    (1, 3) = B * e
    (3, 1) = C * e
    (3, 3) = D * e
    e
    (1, 2) = A * d + B * f
    (3, 2) = C * d + D * f
    f
    d
    (1, 2) = A * d + B * f
    (3, 2) = C * d + D * f
    b
    h
    ▸Compile a deconvolution layer into 4 convolution layers
    c
    a
    i
    g
    (2, 2) = A * a + B * c + C * g + D * i

    View Slide

  97. Deconvolution Transformation
    20
    (1, 1) = A * e
    (1, 3) = B * e
    (3, 1) = C * e
    (3, 3) = D * e
    e
    (1, 2) = A * d + B * f
    (3, 2) = C * d + D * f
    f
    d
    (1, 2) = A * d + B * f
    (3, 2) = C * d + D * f
    b
    h
    ▸Compile a deconvolution layer into 4 convolution layers
    I Original input feature map
    e ofmap elements generated in this round are also stored
    he buffer, and are too shaded.
    terns. The key is to recognize that the four computation
    erns are essentially four different convolutions, each con-
    ving the original ifmap with a distinct kernel that is part
    he original kernel. For instance, (2, 2), (2, 4), (4, 2), and
    4) are generated by convolving

    a c
    g i

    with ifmap. More
    erally, the deconvolution in Fig. 6 can be calculated as:
    b c
    e f
    h i
    #
    b
    ~ I = G ([e]~I,[d f]~I,

    b
    h
    ~I,

    a c
    g i
    ~I)
    ere b
    ~ denotes deconvolution, ~ denotes standard convolu-
    n, I denotes the ifmap, and G denotes the gather operation
    t assembles the ofmap from the results of the four con-
    utions. G can be simply implemented as a set of load
    rations to the scratchpad memory (on-chip buffer).
    Essentially, our algorithm decomposes the original 3⇥3
    cient for convolutions.
    also be extended to supp
    which have more relaxe
    We assume that the ac
    (scratchpad memory) th
    as output elements. The
    hold all the data for a lay
    in multiple rounds. Onl
    loaded into the buffer ea
    into the buffer in each ro
    and is determined by th
    The buffer is evenly s
    buffer to support doub
    computing the current ro
    data needed for the next
    The next round does no
    This design choice guara
    Deconvolution
    c
    a
    i
    g
    (2, 2) = A * a + B * c + C * g + D * i

    View Slide

  98. Deconvolution Transformation
    20
    (1, 1) = A * e
    (1, 3) = B * e
    (3, 1) = C * e
    (3, 3) = D * e
    e
    (1, 2) = A * d + B * f
    (3, 2) = C * d + D * f
    f
    d
    (1, 2) = A * d + B * f
    (3, 2) = C * d + D * f
    b
    h
    ▸Compile a deconvolution layer into 4 convolution layers
    I Original input feature map
    e ofmap elements generated in this round are also stored
    he buffer, and are too shaded.
    terns. The key is to recognize that the four computation
    erns are essentially four different convolutions, each con-
    ving the original ifmap with a distinct kernel that is part
    he original kernel. For instance, (2, 2), (2, 4), (4, 2), and
    4) are generated by convolving

    a c
    g i

    with ifmap. More
    erally, the deconvolution in Fig. 6 can be calculated as:
    b c
    e f
    h i
    #
    b
    ~ I = G ([e]~I,[d f]~I,

    b
    h
    ~I,

    a c
    g i
    ~I)
    ere b
    ~ denotes deconvolution, ~ denotes standard convolu-
    n, I denotes the ifmap, and G denotes the gather operation
    t assembles the ofmap from the results of the four con-
    utions. G can be simply implemented as a set of load
    rations to the scratchpad memory (on-chip buffer).
    Essentially, our algorithm decomposes the original 3⇥3
    cient for convolutions.
    also be extended to supp
    which have more relaxe
    We assume that the ac
    (scratchpad memory) th
    as output elements. The
    hold all the data for a lay
    in multiple rounds. Onl
    loaded into the buffer ea
    into the buffer in each ro
    and is determined by th
    The buffer is evenly s
    buffer to support doub
    computing the current ro
    data needed for the next
    The next round does no
    This design choice guara
    Deconvolution
    c
    a
    i
    g
    (2, 2) = A * a + B * c + C * g + D * i

    View Slide

  99. Deconvolution Transformation
    20
    (1, 1) = A * e
    (1, 3) = B * e
    (3, 1) = C * e
    (3, 3) = D * e
    e
    (1, 2) = A * d + B * f
    (3, 2) = C * d + D * f
    f
    d
    (1, 2) = A * d + B * f
    (3, 2) = C * d + D * f
    b
    h
    ▸Compile a deconvolution layer into 4 convolution layers
    I Original input feature map
    e ofmap elements generated in this round are also stored
    he buffer, and are too shaded.
    terns. The key is to recognize that the four computation
    erns are essentially four different convolutions, each con-
    ving the original ifmap with a distinct kernel that is part
    he original kernel. For instance, (2, 2), (2, 4), (4, 2), and
    4) are generated by convolving

    a c
    g i

    with ifmap. More
    erally, the deconvolution in Fig. 6 can be calculated as:
    b c
    e f
    h i
    #
    b
    ~ I = G ([e]~I,[d f]~I,

    b
    h
    ~I,

    a c
    g i
    ~I)
    ere b
    ~ denotes deconvolution, ~ denotes standard convolu-
    n, I denotes the ifmap, and G denotes the gather operation
    t assembles the ofmap from the results of the four con-
    utions. G can be simply implemented as a set of load
    rations to the scratchpad memory (on-chip buffer).
    Essentially, our algorithm decomposes the original 3⇥3
    cient for convolutions.
    also be extended to supp
    which have more relaxe
    We assume that the ac
    (scratchpad memory) th
    as output elements. The
    hold all the data for a lay
    in multiple rounds. Onl
    loaded into the buffer ea
    into the buffer in each ro
    and is determined by th
    The buffer is evenly s
    buffer to support doub
    computing the current ro
    data needed for the next
    The next round does no
    This design choice guara
    Deconvolution
    ents generated in this round are also stored
    d are too shaded.
    y is to recognize that the four computation
    ntially four different convolutions, each con-
    nal ifmap with a distinct kernel that is part
    ernel. For instance, (2, 2), (2, 4), (4, 2), and
    ted by convolving

    a c
    g i

    with ifmap. More
    convolution in Fig. 6 can be calculated as:
    = G ([e]~I,[d f]~I,

    b
    h
    ~I,

    a c
    g i
    ~I)
    deconvolution, ~ denotes standard convolu-
    e ifmap, and G denotes the gather operation
    he ofmap from the results of the four con-
    n be simply implemented as a set of load
    cient for convolutions. Alte
    also be extended to support
    which have more relaxed co
    We assume that the accele
    (scratchpad memory) that h
    as output elements. The bu
    hold all the data for a layer. T
    in multiple rounds. Only pa
    loaded into the buffer each r
    into the buffer in each round
    and is determined by the lo
    The buffer is evenly split
    buffer to support double-b
    computing the current round
    data needed for the next rou
    Convolution
    c
    a
    i
    g
    (2, 2) = A * a + B * c + C * g + D * i
    ( )

    View Slide

  100. Deconvolution Transformation
    ▸Compile a deconvolution layer into 4 convolution layers
    ▹A 3×3 deconvolution kernel [a b c; d e f; g h i] splits into four sub-kernels,
    and the deconvolution becomes four standard convolutions over the same ifmap I:
    [a b c; d e f; g h i] ⊛ I = G([e] ∗ I, [d f] ∗ I, [b; h] ∗ I, [a c; g i] ∗ I)
    where ⊛ denotes deconvolution, ∗ denotes standard convolution, and G denotes the
    gather operation that assembles the ofmap from the results of the four convolutions.
    G can be simply implemented as a set of load/store operations to the scratchpad
    memory (on-chip buffer).
    ▹For a 2×2 ifmap [A B; C D], each ofmap position is produced by exactly one
    sub-kernel, e.g. ofmap(1,1) = A·e; ofmap(1,2) = A·d + B·f; ofmap(2,2) = A·a + B·c + C·g + D·i
    ▸Naive transformation and compute increase memory traffic
    ▹4 sub-kernels + 4 ifmaps!
    ▸Key observation
    ▹Sub-convolutions share the same ifmap. New data reuse opportunity:
    Inter-Layer Activation Reuse (ILAR)
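The transformation can be sketched end-to-end in a few lines of numpy. This is an illustrative reconstruction rather than ASV's compiler pass: it uses the standard transposed-convolution (scatter-add) orientation, so the pairing of sub-kernels to output positions may be mirrored relative to the slide's figure, but the grouping into the four sub-kernels [e], [d f], [b; h], and [a c; g i] is the same.

```python
import numpy as np

def deconv2d_stride2(ifmap, k):
    """Reference transposed convolution (stride 2, 3x3 kernel) via scatter-add."""
    H, W = ifmap.shape
    out = np.zeros((2 * H + 1, 2 * W + 1))
    for p in range(H):
        for q in range(W):
            out[2 * p:2 * p + 3, 2 * q:2 * q + 3] += ifmap[p, q] * k
    return out

def conv2d_full(ifmap, sub):
    """'Full' 2D convolution with a small sub-kernel (scatter-add form)."""
    H, W = ifmap.shape
    kh, kw = sub.shape
    out = np.zeros((H + kh - 1, W + kw - 1))
    for p in range(H):
        for q in range(W):
            out[p:p + kh, q:q + kw] += ifmap[p, q] * sub
    return out

def deconv_as_4_convs(ifmap, k):
    """Decompose the stride-2 deconvolution into 4 sub-convolutions + a gather."""
    a, b, c, d, e, f, g, h, i = k.ravel()
    H, W = ifmap.shape
    out = np.zeros((2 * H + 1, 2 * W + 1))
    # Each output-position class is one standard convolution over the SAME ifmap:
    out[1::2, 1::2] = e * ifmap                                       # sub-kernel [e]
    out[1::2, 0::2] = conv2d_full(ifmap, np.array([[d, f]]))          # sub-kernel [d f]
    out[0::2, 1::2] = conv2d_full(ifmap, np.array([[b], [h]]))        # sub-kernel [b; h]
    out[0::2, 0::2] = conv2d_full(ifmap, np.array([[a, c], [g, i]]))  # sub-kernel [a c; g i]
    return out
```

The four strided stores into `out` are exactly the gather step G: each sub-convolution writes one interleaved class of ofmap positions, which maps to plain load/store operations on the scratchpad.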

  105. Deconvolution Optimization: Problem Setup
    Goal: minimize the latency and/or memory traffic in deconvolution.
    ▸Hardware Assumption
    ▹A system-on-chip connected with DRAM
    ▹On-chip buffer for ifmap, weights and ofmap
    ▹Systolic array, output stationary
    ▹Double-buffering: a working buffer and a filling buffer
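The double-buffering assumption can be made concrete with a small latency model (illustrative numbers, not from the paper): loading the next round's tile into the filling buffer overlaps with computing on the working buffer, so per round only the slower of the two is exposed.

```python
def serial_latency(loads, computes):
    """No overlap: every round loads its tile, then computes on it."""
    return sum(loads) + sum(computes)

def double_buffered_latency(loads, computes):
    """The fill of round i+1 (into the filling buffer) overlaps the compute of
    round i (on the working buffer); only the first load is fully exposed."""
    total = loads[0]
    for i in range(len(loads) - 1):
        total += max(computes[i], loads[i + 1])
    return total + computes[-1]

# Example: 3 rounds, 10 cycles to load a tile, 8 cycles to compute on it.
loads, computes = [10, 10, 10], [8, 8, 8]
```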

  112. Deconvolution Optimization: Complexity
    Goal: minimize the latency and/or memory traffic in deconvolution.
    [Figure: the four sub-kernels ([e], [d f], [b; h], [a c; g i]) are scheduled across
    tiles of the ifmap; Schedule 1 and Schedule 2 interleave the sub-convolutions
    differently. Which schedule minimizes latency and traffic?]

  115. Deconvolution Optimization: Formulation
    ▸Dataflow optimization → Constrained optimization
    Objective: Min. L(Θ, ϕ)
    Θ : Hardware configuration
    ϕ : Tiling schedule
    ▸Hardware configuration, Θ = {A, BW, Buf}
    ▹Systolic Array Capability: A ≤ A*
    ▹Memory Bandwidth: BW ≤ BW*
    ▹On-chip Buffer Size: Buf ≤ Buf*
    (*: hardware capacity)
    ▸Variables, ϕ = {Tile, Ksub}
    ▹Tile: tile size in every round
    ▹Ksub: the number of different sub-kernels in every round
    ▸Non-linear constrained optimization, solved via "Sequential Least Squares Programming"
    https://github.com/horizon-research/systolic-array-dataflow-optimizer
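As a toy illustration of the formulation, the sketch below relaxes a tiling schedule to continuous variables and solves it with SciPy's SLSQP implementation of Sequential Least Squares Programming. The cost model, layer dimensions, and buffer numbers are invented for illustration; ASV's actual objective and constraints live in the dataflow optimizer repository linked above.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative stand-ins, not ASV's actual model:
W, H, C = 64, 64, 32      # layer width, height, channels
BUF = 16 * 1024           # on-chip buffer budget Buf* (elements)
BW = 4.0                  # DRAM bandwidth (elements / cycle)
STARTUP = 100.0           # fixed per-round cost (drain/refill, control)

def latency(x):
    tw, th = x
    rounds = (W / tw) * (H / th)           # smooth relaxation of ceil(W/tw) * ceil(H/th)
    per_round = STARTUP + tw * th * C / BW # fixed cost + tile load time
    return rounds * per_round              # memory-bound proxy for L(Theta, phi)

# Buf constraint: a double-buffered ifmap tile must fit on chip.
cons = [{"type": "ineq", "fun": lambda x: BUF - 2 * x[0] * x[1] * C}]
res = minimize(latency, x0=[8.0, 8.0], method="SLSQP",
               bounds=[(1, W), (1, H)], constraints=cons)
tw, th = res.x  # the optimizer pushes the tile size to the buffer limit
```

With this model the per-round startup cost rewards large tiles, so the optimum sits on the buffer boundary (2 · tw · th · C = Buf), mirroring why the tile size chosen each round is critical.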

  127. ASV: Accelerated Stereo Vision System
    ‣Algorithm: Invariant-based Stereo Matching Algorithm
    ‣Compiler: Deconvolution Transformation and Dataflow Optimization
    ‣Hardware: Principled and Minimal Hardware Modifications

  129. Hardware Implementation
    Baseline Systolic Array:
    ▹ Convolutions in DNN
    Baseline Scalar Unit:
    ▹ ReLU, Pooling in DNN
    Modified Systolic Array:
    ▹ + Block Matching in the Refine Correspondences stage of the ISM algorithm
    Modified Scalar Unit:
    ▹ + Operations in Optical Flow
    The overall area overhead introduced by ASV is below 0.5%.
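For reference, the block matching that the modified systolic array accelerates finds, per pixel, the lowest-cost disparity along the same image row. A minimal scalar SAD sketch (block size and disparity range are illustrative, not ASV's configuration):

```python
import numpy as np

def block_match(left, right, y, x, block=3, max_disp=8):
    """Slide a left-image block along the same row of the right image and
    return the disparity with the lowest sum of absolute differences (SAD)."""
    h = block // 2
    ref = left[y - h:y + h + 1, x - h:x + h + 1]
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        if x - d - h < 0:                  # candidate block would leave the image
            break
        cand = right[y - h:y + h + 1, x - d - h:x - d + h + 1]
        cost = np.abs(ref - cand).sum()    # SAD matching cost
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

The inner cost accumulation is MAC-like, which is why it maps onto the systolic array with only minor modifications.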

  137. Experimental Setup
    Hardware implementation:
    ▹ Systolic array: 24x24 PEs at 1 GHz
    ▹ Scalar units: 8, running in parallel at 250 MHz
    ▹ SRAM: 1.5 MB on-chip buffer
    ▹ DRAM: 4 Micron 16 Gb LPDDR3-1600 channels
    Stereo DNNs:
    ▹ FlowNet, DispNet, GC-Net, PSMNet
    Datasets:
    ▹ SceneFlow and KITTI

  143. Evaluation
    Variants:
    ▹ ISM: ISM algorithm without deconv. optimizations.
    DNN inference runs on one key frame in every 4; the other frames use non-DNN inference.
    ▹ DCO: Deconv. optimizations without ISM algorithm.
    ▹ ISM + DCO: both optimizations combined.
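The ISM frame schedule above can be sketched as a simple loop. `dnn_depth` and `propagate` are hypothetical stand-ins for the full stereo DNN and for the cheap non-DNN propagation/refinement path, respectively:

```python
def ism_depth_stream(frames, dnn_depth, propagate, key_interval=4):
    """Run the expensive stereo DNN only on key frames (1 in every
    `key_interval`); propagate and refine depth on the frames in between."""
    depths = []
    depth = None
    for t, frame in enumerate(frames):
        if t % key_interval == 0:
            depth = dnn_depth(frame)                        # DNN inference (key frame)
        else:
            depth = propagate(depth, frames[t - 1], frame)  # non-DNN inference
        depths.append(depth)
    return depths
```

With `key_interval=4`, only a quarter of the frames pay the full DNN cost, which is where the ISM speedup comes from.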

  148. Evaluation
    [Chart: error rate (%) of DNN vs. ISM on DispNet, FlowNetC, PSMNet, GC-Net, and AVG.;
    for ISM, full DNN inference runs on one key frame in every 4 frames]
    Average error rate: DNN 3.89%, ISM 3.95%

  152. Evaluation
    [Charts: speedup and energy reduction of DCO, ISM, and DCO+ISM on DispNet,
    FlowNetC, PSMNet, GC-Net, and AVG.]
    Average speedup: DCO 1.5x, ISM 3.3x, DCO+ISM 5.0x
    Average energy reduction: DCO 42%, ISM 75%, DCO+ISM 85%

  161. Evaluation
    [Charts: average speedup and energy reduction, ASV vs. GANNX]
    Speedup (AVG.): ASV 5.0x vs. GANNX 3.6x
    Energy reduction (AVG.): ASV 4.2x vs. GANNX 3.2x

  162. Conclusion
    ‣"Depth from stereo" is critical to emerging intelligent applications deployed in
    energy- and performance-constrained devices.
    ‣ASV simultaneously improves performance and energy efficiency, while maintaining
    high accuracy, via HW & SW co-design.
    ‣Careful design choices let these optimizations be integrated into existing DNN
    accelerators with minor hardware extensions.