FADEC: FPGA-based Acceleration of Video Depth Estimation by HW/SW Co-design (FPT 2022)

Nobuho Hashimoto
December 08, 2022

Transcript

  1. FADEC: FPGA-based Acceleration of
    Video Depth Estimation by HW/SW Co-design
    Nobuho Hashimoto, Shinya Takamaeda-Yamazaki
    The University of Tokyo
    FPT 2022
    Dec. 8th, 2022
    Session: Tools & Design

  2. Outline
    1. Backgrounds
    2. FADEC
    3. Evaluation
    4. Conclusion

  3. Outline
    1. Backgrounds
        1.1. Depth Estimation
        1.2. DNN-based Depth Estimation
        1.3. DeepVideoMVS
        1.4. Direction
    2. FADEC
    3. Evaluation
    4. Conclusion

  4. 1.1. Depth Estimation
    ❖ Estimate distance between camera and target objects
    Ø Wide range of applications, including autonomous driving and AR
    Ø Complex algorithm that combines traditional image/video processing algorithms
    and DNNs
    Flow of depth estimation
    1. Input: 2+ (similar but different) images of the target object (cf. eyes)
    2. Stereo matching to estimate disparities
    3. Triangulation to calculate distance between camera and each point
    4. Output: depth map
    Disparity: difference in position between each point in one image
    and its corresponding point in another image
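
The triangulation step can be illustrated with the textbook relation for a rectified stereo pair, Z = f * B / d, where f is the focal length in pixels, B the baseline between the two cameras, and d the disparity. This is only a sketch of the general principle: the function name and the camera parameters are made up for illustration, and DeepVideoMVS itself works on monocular video with per-frame camera poses rather than a fixed-baseline stereo rig.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 500 px focal length, 0.1 m baseline.
near = depth_from_disparity(20.0, focal_px=500.0, baseline_m=0.1)
far = depth_from_disparity(5.0, focal_px=500.0, baseline_m=0.1)
```

A larger disparity means a closer point: with these assumed parameters, a 20 px disparity corresponds to 2.5 m and a 5 px disparity to 10 m.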

  5. 1.2. DNN-based Depth Estimation
    Finding accurate corresponding points in 2+ images
    based on degree of similarity is difficult
    → DNN-based algorithms
    Ø DeepV2D [Teed et al., 2020], DeepVideoMVS [Duzceker et al., 2020],
    HITNet [Tankovich et al., 2021], Open4D [Bansal et al., 2020],
    NeRF [Mildenhall et al., 2020], NSFF [Li et al., 2021]
    ❖ We chose DeepVideoMVS
    1. Simple inputs: video from monocular camera and camera poses of each frame
    2. Can be applied to a wide range of situations without retraining
    3. Likely to operate at near real-time speeds in low-power embedded environments

  6. 1.3. DeepVideoMVS
    Diagram of DeepVideoMVS
    Inputs: Input Frame, Input Pose
    Example input pose (4 x 4 matrix):
         0.3 -0.4  0.9 -2.0
        -0.0 -0.9 -0.4  0.1
         1.0  0.1 -0.3  4.7
         0    0    0    1
    Output: Depth Map
    Blocks: pre-process, Feature Extractor (FE), Feature Shrinker (FS),
    Cost Volume Fusion (CVF), Cost Volume Encoder (CVE), ConvLSTM (CL),
    Cost Volume Decoder (CVD), post-process, hidden state correction
    Keyframe Buffer (KB): KB.add, KB.get (select frames with close poses)
    States: cell state, hidden state
    Legend: traditional image/video processing; DNN-based processes
    (encoder part, decoder part, ConvLSTM); data dependencies from current frame;
    data dependencies from previous frame

  7. 1.3. DeepVideoMVS
    Diagram of DeepVideoMVS (same diagram, with annotations)
    ❖ Extract image features
    ❖ Arrange cost volume (degree of correspondence between points)
    ❖ RNN enables use of time-series information

  8. 1.3. DeepVideoMVS
    Diagram of DeepVideoMVS (same diagram, with annotations)
    ❖ Reuse past frames with similar pose to input pose
    ❖ Apply viewpoint changes by using grid sampling (explained later)
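
Reusing past frames with poses close to the input pose can be sketched as a small buffer that ranks stored frames by a pose distance. Everything here is a hypothetical illustration: the class, the `make_pose` helper, and in particular the distance measure (camera translation only, rotation ignored) are assumptions, not the measure DeepVideoMVS or FADEC actually uses.

```python
import math


def pose_distance(pose_a, pose_b):
    """Distance between camera positions (last column of a 4x4 pose matrix).

    Assumption for this sketch: translation only; rotation is ignored.
    """
    pos_a = [row[3] for row in pose_a[:3]]
    pos_b = [row[3] for row in pose_b[:3]]
    return math.dist(pos_a, pos_b)


class KeyframeBuffer:
    """Hypothetical stand-in for the keyframe buffer (KB)."""

    def __init__(self):
        self.frames = []  # list of (pose, features) pairs

    def add(self, pose, features):
        self.frames.append((pose, features))

    def get(self, pose, count=2):
        """Return up to `count` stored frames whose poses are closest to `pose`."""
        ranked = sorted(self.frames, key=lambda f: pose_distance(f[0], pose))
        return ranked[:count]


def make_pose(x, y, z):
    """Identity rotation with translation (x, y, z); illustration only."""
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]


kb = KeyframeBuffer()
kb.add(make_pose(0.0, 0.0, 0.0), "frame0")
kb.add(make_pose(1.0, 0.0, 0.0), "frame1")
kb.add(make_pose(5.0, 0.0, 0.0), "frame2")
selected = [features for _, features in kb.get(make_pose(1.2, 0.0, 0.0))]
```

With these made-up poses, the two frames nearest to the query position are returned, closest first.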

  9. Outline
    1. Backgrounds
    2. FADEC
        2.1. FADEC Overview
        2.2. HW/SW Co-design
        2.3. HW/SW Scheduling
            2.3.1. Communication Mechanism
            2.3.2. Task-level Parallelization
    3. Evaluation
    4. Conclusion

  10. 2.1. FADEC Overview
    Design the accelerator in the following flow
    1. HW/SW Co-design
    Ø Determine SW-friendly processes by taking advantage of
    characteristics of HW and SW
    2. HW Design
    Ø Design custom circuits for HW-friendly processes on
    programmable logic (PL) using open-source high-level
    synthesis (HLS) tool called NNgen [https://github.com/NNgen/nngen]
    3. SW Design
    Ø Design optimized programs for SW-friendly processes on CPU
    4. HW/SW Scheduling
    Ø Hide execution latencies of HW and SW implementations
    by executing them in parallel on PL and CPU
    HW Architecture (only important operations)
    Memory and transfer: DRAM, AXI Bus, DMA Controller, BRAMs (data), BRAMs (params)
    Operations: Conv (1, 1), Conv (3, 1), Conv (3, 2), Conv (5, 2), Conv (5, 1),
    ReLU, sigmoid, upsampling, add / rshift / clip, lshift, rshift, concat, slice,
    skip connection
    Legend (where each operation is used): encoder part, decoder part,
    encoder/decoder part, ConvLSTM, every part

  11. 2.2. HW/SW Co-design
    Determine operations to be implemented in SW by considering number of
    executions, characteristics, and memory access pattern of each process
    → Hide execution latencies by executing them in parallel with HW
    Number of executions in each process
    Operation \ Process     FE   FS  CVF  CVE   CL  CVD
    Conv (1, 1)             33    5    0    0    0    0
    Conv (3, 1)              6    4    0    9    1   14
    Conv (3, 2)              2    0    0    3    0    0
    Conv (5, 1)              7    0    0    3    0    5
    Conv (5, 2)              3    0    0    1    0    0
    Activation (ReLU)       34    0    0   16    0   14
    Activation (sigmoid)     0    0    0    0    3    5
    Activation (ELU)         0    0    0    0    2    0
    Addition                10    4  128    0    1    0
    Multiplication           0    0   64    0    3    0
    Concatenation            0    0    0    4    1    5
    Slice                    0    0    0    0    4    0
    Layer Normalization      0    0    0    0    2    9
    Upsampling (nearest)     0    4    0    0    0    0
    Upsampling (bilinear)    0    0    0    0    0    9
    Grid Sampling            0    0  128    0    0    0
    Legend: operations to be implemented in SW (highlighted on the slide)
    Diagram of DeepVideoMVS (reposted)

  12. 2.2. HW/SW Co-design
    Grid sampling
    ❖ Bilinear interpolation
    ❖ Irregular memory access
    ❖ Largest latency among
    SW-friendly operations
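
The properties listed above can be seen in a minimal pure-Python sketch of grid sampling with bilinear interpolation. The function names, the use of pixel coordinates (rather than normalized ones), and the zero padding outside the image are assumptions made for this illustration, not details taken from FADEC.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a 2-D list `image` at fractional pixel (x, y).

    Taps that fall outside the image read as 0 (assumed padding rule).
    """
    x0, y0 = math.floor(x), math.floor(y)
    wx, wy = x - x0, y - y0
    height, width = len(image), len(image[0])

    def tap(ix, iy):
        if 0 <= ix < width and 0 <= iy < height:
            return image[iy][ix]
        return 0.0

    # Blend the four neighbors: first along x, then along y.
    top = (1 - wx) * tap(x0, y0) + wx * tap(x0 + 1, y0)
    bottom = (1 - wx) * tap(x0, y0 + 1) + wx * tap(x0 + 1, y0 + 1)
    return (1 - wy) * top + wy * bottom


def grid_sample(image, grid):
    """For each output pixel, read `image` at the (x, y) location stored in `grid`."""
    return [[bilinear_sample(image, x, y) for (x, y) in row] for row in grid]


image = [[0.0, 10.0],
         [20.0, 30.0]]
grid = [[(0.5, 0.5), (1.0, 0.0)]]
warped = grid_sample(image, grid)
```

Because the read locations come from the contents of `grid`, the memory addresses touched are data-dependent, which is the irregular access pattern that makes the operation a better fit for the CPU than for a fixed dataflow circuit.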

  13. 2.3. HW/SW Scheduling
    ❖ The following are required for PL and CPU to work cooperatively and in parallel
    to hide execution latencies
    Ø Communication mechanism between HW and SW to signal the end of each process and
    exchange data
    Ø Task-level parallelization

  14. 2.3.1. Communication Mechanism
    Use contiguous memory allocator (CMA) and interrupt handling mechanism
    CMA
    ❖ Allocates a contiguous physical memory area
    ❖ Enables HW and SW to share a memory space
    Ø HW can only handle physical memory space
    Ø SW can handle virtual memory space
    Interrupt handling mechanism (figure; labels: PL, CPU, memory, register, polling)
    1. process
    2. write data
    3. write opcode
    4. read opcode
    5. read data
    6. process
    7. write data
    8. write end flag
    9. read end flag
    10. read data
    11. resume process
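
The eleven-step handshake can be mimicked in software with two threads and a shared object. This is a rough simulation under stated assumptions: steps 1 to 3 and 9 to 11 are taken to run on the PL and steps 4 to 8 on the CPU, which is one reading of the figure; the `Shared` class, the opcode values, and the doubling "process" are invented for illustration, and no real CMA buffer or register is involved.

```python
import threading
import time

OP_NONE, OP_DOUBLE = 0, 1  # hypothetical opcodes


class Shared:
    """Stand-in for the shared memory area plus the polled register."""

    def __init__(self):
        self.data = None
        self.opcode = OP_NONE
        self.end_flag = False


def pl_side(shared, log):
    values = [1, 2, 3]                 # 1. process
    shared.data = values               # 2. write data
    shared.opcode = OP_DOUBLE          # 3. write opcode
    while not shared.end_flag:         # 9. read end flag
        time.sleep(0.001)
    log.append(sum(shared.data))       # 10. read data, 11. resume process


def cpu_side(shared):
    while shared.opcode == OP_NONE:    # 4. read opcode (polling)
        time.sleep(0.001)
    values = shared.data               # 5. read data
    result = [2 * v for v in values]   # 6. process
    shared.data = result               # 7. write data
    shared.end_flag = True             # 8. write end flag


shared, log = Shared(), []
threads = [
    threading.Thread(target=pl_side, args=(shared, log)),
    threading.Thread(target=cpu_side, args=(shared,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The ordering matters in both directions: data is written before the opcode, and the result is written before the end flag, so whichever side sees the flag change can safely read the data.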

  15. 2.3.2. Task-level Parallelization
    ❖ Increase parallelism to hide as much execution latency as possible
    ❖ Hide 93% of the total latency of CVF, which includes grid sampling
    Ø Grid sampling has no data dependency on the preceding process (FS)
    FADEC pipeline chart (rows: SW (CPU), HW (PL); horizontal axis: time)
    Inputs: frame, pose; output: depth map
    Tasks shown: pre-process, FE + FS, CVF (preparation), KB.get, CVF, KB.add, CVE,
    CL, correction, CVD, layer normalization, upsampling (bilinear), post-process
    Diagram of DeepVideoMVS (reposted)
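
Because grid sampling does not need the output of FS, the SW-side preparation for CVF can be started while FE and FS are still running. The sketch below shows that dependency structure with a thread pool; the three functions are toy stand-ins with invented bodies, not the actual FADEC tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def fe_fs(frame):
    """Toy stand-in for the HW-side feature extraction and shrinking."""
    return [p * 2 for p in frame]


def cvf_preparation(keyframes):
    """Toy stand-in for the SW-side grid sampling of buffered keyframes."""
    return [sum(k) for k in keyframes]


def cvf(features, warped):
    """Toy stand-in for fusing current features with the warped keyframes."""
    return [f + w for f in features for w in warped]


frame = [1, 2]
keyframes = [[1, 1], [2, 2]]
with ThreadPoolExecutor(max_workers=2) as pool:
    # Neither task needs the other's output, so both can be in flight at once.
    hw_future = pool.submit(fe_fs, frame)
    sw_future = pool.submit(cvf_preparation, keyframes)
    # The fusion step is the first point where both results are required.
    cost_volume = cvf(hw_future.result(), sw_future.result())
```

In Python the two workers share one interpreter, so this only illustrates the scheduling idea; in FADEC the two sides are physically separate (PL and CPU), which is what allows the latency to be hidden.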

  16. Outline
    1. Backgrounds
    2. FADEC
    3. Evaluation
        3.1. Evaluation Environment
        3.2. Execution Time / HW Resources
        3.3. Accuracy
    4. Conclusion

  17. 3.1. Evaluation Environment
    Implement FADEC on FPGA and compare it with C++ implementation on CPU
    Input image size                96 × 64
    Model                           Pre-trained model using TUM RGB-D [Sturm et al., 2012]
    FPGA                            Xilinx ZCU104 board
    HW implementation               Written in Python, compiled using NNgen v1.3.3,
                                    and converted to bitstream using Vivado 2021.2
    SW implementation               Written and compiled using Cython v0.29
    Execution                       PYNQ v2.6
    Evaluation dataset              7-Scenes [Shotton et al., 2013]
    Implementation for comparison   Compiled using g++ 7.3.0 with -O3 option,
                                    and executed on the same FPGA board

  18. 3.2. Execution Time / HW Resources
    Clock Frequency is 187.512 MHz
    60.2 times faster than CPU-only execution
    Take full advantage of HW resources
    Comparison of execution time per frame
    Platform             median [s]   std [s]   frequency [MHz]
    CPU-only             16.744       0.049     N/A
    CPU-only (w/ PTQ)    13.248       0.035     N/A
    PL + CPU (ours)       0.278       0.118     187.52
    HW resource utilization of FADEC
    Name    Used      Available   Utilization [%]
    Slice    28256     28800      98.1
    LUT     176377    230400      76.6
    FF      143072    460800      31.0
    DSP        128      1728       7.41
    BRAM       309       312      99.0
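
The headline speedup follows directly from the median times in the table. The short calculation below reproduces it; the PTQ comparison, the frames-per-second figure, and the BRAM percentage are derived here from the same table values as a sanity check, not numbers quoted on the slide.

```python
cpu_only_s = 16.744   # median time per frame, CPU-only
cpu_ptq_s = 13.248    # median time per frame, CPU-only (w/ PTQ)
fadec_s = 0.278       # median time per frame, PL + CPU (ours)

speedup_vs_cpu = cpu_only_s / fadec_s   # about 60.2
speedup_vs_ptq = cpu_ptq_s / fadec_s    # about 47.7
frames_per_second = 1.0 / fadec_s       # about 3.6

# Utilization is used / available; BRAM is the scarcest resource.
bram_utilization = 100.0 * 309 / 312    # about 99.0 %
```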

  19. 3.3. Accuracy
    Degradation is too small to be visually distinguishable
    MSE is slightly worse, but degradation remains below 10% in most cases
    Results of qualitative evaluation
    Panels: (a) Input, (b) Ground truth, (c) Output of C++ impl,
    (d) Output of C++ impl w/ PTQ, (e) Output of the proposed accelerator
    Results of processing the frame number 000139 in the fire-seq-01 scene.
    The MSEs between the outputs and ground truth are (c) 0.091, (d) 0.073, (e) 0.089, and (f) 0.084, respectively.
    Results of processing the frame number 000268 in the redkitchen-seq-07 scene.
    The MSEs between the outputs and ground truth are (c) 0.808, (d) 0.880, (e) 1.099, and (f) 1.050, respectively.
    Chart: scene-by-scene comparison of MSE between output and ground truth

  20. Outline
    1. Backgrounds
    2. FADEC
    3. Evaluation
    4. Conclusion

  21. 4. Conclusion
    ❖ Accelerate complex depth estimation algorithm that combines traditional
    image/video processing algorithms and DNNs
    ❖ Propose and implement FPGA-based accelerator for DeepVideoMVS
    using HW/SW co-design
    ❖ Demonstrate that FADEC operates 60.2 times faster than CPU-only
    execution on Xilinx ZCU104 board with minimal accuracy degradation
    ❖ See https://github.com/casys-utokyo/fadec/