FADEC: FPGA-based Acceleration of Video Depth Estimation by HW/SW Co-design (FPT 2022)

Nobuho Hashimoto, Shinya Takamaeda-Yamazaki
The University of Tokyo
FPT 2022, Dec. 8th, 2022
Session: Tools & Design

Outline
1. Backgrounds
2. FADEC
3. Evaluation
4. Conclusion

Outline
1. Backgrounds
   1. Depth Estimation
   2. DNN-based Depth Estimation
   3. DeepVideoMVS
   4. Direction
2. FADEC
3. Evaluation
4. Conclusion

1.1. Depth Estimation
❖ Estimate distance between camera and target objects
  ➢ Wide range of applications, including autonomous driving and AR
  ➢ Complex algorithm that combines traditional image/video processing algorithms and DNNs
Flow of depth estimation:
  1. 2+ images (similar but different; cf. eyes)
  2. Stereo matching to estimate disparities
  3. Triangulation to calculate distance between camera and each point
  4. Depth map
Disparity: difference in position between each point in one image and its corresponding point in another image
[Figure: target object and its corresponding point in two images]
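The triangulation step in the flow above can be illustrated with the textbook relation for a rectified stereo pair, Z = f · B / d. This is a minimal sketch under that assumption only; the focal length, baseline, and disparity values are hypothetical, and DeepVideoMVS itself works from a monocular video with camera poses rather than a fixed stereo rig.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in metres of one point in a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera: 400 px focal length, 0.5 m baseline, 8 px disparity.
print(depth_from_disparity(8.0, 400.0, 0.5))  # 25.0
```

A larger disparity means the point is closer to the camera, which is why accurate corresponding points matter so much for the final depth map.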

1.2. DNN-based Depth Estimation
Finding accurate corresponding points in 2+ images based on degree of similarity is difficult → DNN-based algorithms
  ➢ DeepV2D [Teed et al., 2020], DeepVideoMVS [Duzceker et al., 2020], HITNet [Tankovich et al., 2021], Open4D [Bansal et al., 2020], NeRF [Mildenhall et al., 2020], NSFF [Li et al., 2021]
❖ We chose DeepVideoMVS
  1. Simple inputs: video from monocular camera and camera poses of each frame
  2. Can be applied to wide range of situations without training again
  3. Likely to operate at near real-time speeds in low-power embedded environments

1.3. DeepVideoMVS
[Figure: Diagram of DeepVideoMVS]
  Inputs and output: Input Frame; Input Pose (4×4 matrix, e.g. rows [0.3 -0.4 0.9 -2.0], [-0.0 -0.9 -0.4 0.1], [1.0 0.1 -0.3 4.7], [0 0 0 1]); Output Depth Map
  Blocks: pre-process; Feature Extractor (FE); Feature Shrinker (FS); Keyframe Buffer (KB) with KB.get and KB.add ("Select frames with close poses."); Cost Volume Fusion (CVF); Cost Volume Encoder (CVE); Conv LSTM (CL) with cell state, hidden state, and hidden state correction; Cost Volume Decoder (CVD); post-process
  Legend: traditional image/video processing; DNN-based processes; encoder part; decoder part; ConvLSTM; data dependencies from current frame; data dependencies from previous frame

1.3. DeepVideoMVS (continued)
[Figure: Diagram of DeepVideoMVS, annotated]
❖ Extract image features
❖ Arrange cost volume (degree of correspondence between points)
❖ RNN makes it possible to use time-series information

1.3. DeepVideoMVS (continued)
[Figure: Diagram of DeepVideoMVS, annotated]
❖ Reuse past frames with similar pose to input pose
❖ Apply viewpoint changes by using grid sampling (explained later)

Outline
1. Backgrounds
2. FADEC
   1. FADEC Overview
   2. HW/SW Co-design
   3. HW/SW Scheduling
      1. Communication Mechanism
      2. Task-level Parallelization
3. Evaluation
4. Conclusion

2.1. FADEC Overview
Design accelerator in following flow
1. HW/SW Co-design
  ➢ Determine SW-friendly processes by taking advantage of characteristics of HW and SW
2. HW Design
  ➢ Design custom circuits for HW-friendly processes on programmable logic (PL) using open-source high-level synthesis (HLS) tool called NNgen [https://github.com/NNgen/nngen]
3. SW Design
  ➢ Design optimized programs for SW-friendly processes on CPU
4. HW/SW Scheduling
  ➢ Hide execution latencies of HW and SW implementations by executing them in parallel on PL and CPU
[Figure: HW Architecture (only important operations)]
  Memory and bus: DRAM; AXI Bus; DMA Controller; BRAMs (data); BRAMs (params)
  Operations: Conv (1, 1); Conv (3, 1); Conv (3, 2); Conv (5, 1); Conv (5, 2); ReLU; sigmoid; upsampling; add; rshift; lshift; clip; concat; slice; skip connection
  Legend: encoder part; decoder part; encoder/decoder part; every part; ConvLSTM

2.2. HW/SW Co-design
Determine operations to be implemented in SW by considering number of executions, characteristics, and memory access pattern of each process
→ Hide execution latencies by executing them in parallel with HW

Number of executions in each process
Operation                FE   FS  CVF  CVE   CL  CVD
Conv (1, 1)              33    5    0    0    0    0
Conv (3, 1)               6    4    0    9    1   14
Conv (3, 2)               2    0    0    3    0    0
Conv (5, 1)               7    0    0    3    0    5
Conv (5, 2)               3    0    0    1    0    0
Activation (ReLU)        34    0    0   16    0   14
Activation (sigmoid)      0    0    0    0    3    5
Activation (ELU)          0    0    0    0    2    0
Addition                 10    4  128    0    1    0
Multiplication            0    0   64    0    3    0
Concatenation             0    0    0    4    1    5
Slice                     0    0    0    0    4    0
Layer Normalization       0    0    0    0    2    9
Upsampling (nearest)      0    4    0    0    0    0
Upsampling (bilinear)     0    0    0    0    0    9
Grid Sampling             0    0  128    0    0    0
Legend: highlighted entries are operations to be implemented in SW
[Figure: Diagram of DeepVideoMVS (reposted)]

2.2. HW/SW Co-design (continued)
Determine operations to be implemented in SW by considering number of executions, characteristics, and memory access pattern of each process
→ Hide execution latencies by executing them in parallel with HW
Grid sampling
❖ Bilinear interpolation
❖ Irregular memory access
❖ Largest latency among SW-friendly operations
[Table: Number of executions in each process (same as on the previous slide)]
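To make the grid sampling bullets concrete, here is a minimal pure-Python sketch of bilinear grid sampling on a single-channel image. It is not the FADEC implementation (which is written in Cython); the function names and the clamp-to-border handling of out-of-range coordinates are assumptions of this sketch.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a 2-D list `image` at fractional pixel coordinates (x, y)."""
    h, w = len(image), len(image[0])
    # Clamp to the image border (assumed behaviour for out-of-range coordinates).
    x = min(max(x, 0.0), w - 1)
    y = min(max(y, 0.0), h - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Four reads at data-dependent addresses: the irregular memory access.
    top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
    bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def grid_sample(image, grid):
    """Apply bilinear_sample to every (x, y) pair in a 2-D grid of coordinates."""
    return [[bilinear_sample(image, x, y) for (x, y) in row] for row in grid]


image = [[0.0, 1.0],
         [2.0, 3.0]]
print(grid_sample(image, [[(0.5, 0.5), (1.0, 0.0)]]))  # [[1.5, 1.0]]
```

Because the sampled addresses depend on the coordinate values rather than on a fixed stride, the access pattern is hard to stream through on-chip buffers, which is consistent with the slide's choice to keep this operation in SW.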

2.3. HW/SW Scheduling
❖ The following are required for PL and CPU to work in parallel and cooperatively to hide execution latencies
  ➢ Communication mechanism between HW and SW to notify end of each process and exchange data
  ➢ Task-level parallelization

2.3.1. Communication Mechanism
Use contiguous memory allocator (CMA) and interrupt handling mechanism
CMA
❖ Allocate contiguous physical memory area
❖ Enable sharing of memory space between HW and SW
  ➢ HW can only handle physical memory space
  ➢ SW can handle virtual memory space
Interrupt handling mechanism
[Figure: PL and CPU exchanging data through memory and a polling register]
  1. process
  2. write data
  3. write opcode
  4. read opcode
  5. read data
  6. process
  7. write data
  8. write end flag
  9. read end flag
  10. read data
  11. resume process
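The eleven-step exchange can be mimicked in software to see how the opcode and end flag order the two sides. The sketch below is a simulation with two Python threads and a plain object standing in for the shared memory and registers. It assumes that the PL issues the request (steps 1 to 3 and 9 to 11) and the CPU serves it (steps 4 to 8), which is one reading of the diagram; the opcode value and all names are hypothetical, and both sides poll here for simplicity.

```python
import threading
import time

OP_DOUBLE = 1  # hypothetical opcode meaning "double every value"


class SharedRegion:
    """Stand-in for the shared buffer plus the opcode and end-flag registers."""

    def __init__(self):
        self.data = None
        self.opcode = 0        # 0 means no request is pending
        self.end_flag = False


def cpu_side(shared):
    while shared.opcode == 0:              # 4. read opcode (polling)
        time.sleep(0.001)
    values = shared.data                   # 5. read data
    if shared.opcode != OP_DOUBLE:
        raise ValueError("unknown opcode")
    result = [2 * v for v in values]       # 6. process
    shared.data = result                   # 7. write data
    shared.opcode = 0
    shared.end_flag = True                 # 8. write end flag


def pl_side(shared, values):
    shared.data = list(values)             # 2. write data
    shared.opcode = OP_DOUBLE              # 3. write opcode
    while not shared.end_flag:             # 9. read end flag (polling)
        time.sleep(0.001)
    return shared.data                     # 10. read data, 11. resume process


def run_handshake(values):
    shared = SharedRegion()
    cpu = threading.Thread(target=cpu_side, args=(shared,))
    cpu.start()
    result = pl_side(shared, values)
    cpu.join()
    return result


print(run_handshake([1, 2, 3]))  # [2, 4, 6]
```

Writing the data before the opcode, and the result before the end flag, is what keeps the reader from ever seeing a half-written buffer.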

2.3.2. Task-level Parallelization
❖ Increase parallelism to hide as much execution latency as possible
❖ Hide 93% of total latencies required for CVF, which includes grid sampling
  ➢ Grid sampling does not have data dependencies on previous process (FS)
[Figure: FADEC pipeline chart]
  Rows: SW (CPU) and HW (PL), plotted over time
  Tasks shown: pre-process; FE + FS; KB.get; CVF (preparation); CVF; CVE; CL; correction; CVD; layer normalization; upsampling (bilinear); KB.add; post-process
  Inputs and output: frame; pose; depth map
[Figure: Diagram of DeepVideoMVS (reposted)]
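The idea of overlapping independent tasks can be sketched with two worker threads: one stands in for the PL running FE + FS, the other for the CPU running the CVF preparation, and CVF itself waits for both. All three functions are placeholders with made-up arithmetic, not the real operators; Python threads are used only to express the dependency structure, whereas in FADEC the PL and the CPU are physically separate and run truly in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def fe_fs(frame):
    """Placeholder for FE + FS on the PL."""
    return [2 * p for p in frame]


def cvf_preparation(pose):
    """Placeholder for the grid sampling part of CVF on the CPU."""
    return sum(pose)


def cvf(features, warped):
    """Placeholder for the rest of CVF, which needs both results."""
    return [f + warped for f in features]


def process_frame(frame, pose):
    with ThreadPoolExecutor(max_workers=2) as pool:
        hw = pool.submit(fe_fs, frame)             # "HW" task
        sw = pool.submit(cvf_preparation, pose)    # independent "SW" task
        # Join point: CVF starts only once both tasks have finished.
        return cvf(hw.result(), sw.result())


print(process_frame([1, 2, 3], [0.5, 0.5]))  # [3.0, 5.0, 7.0]
```

The SW latency is hidden whenever `cvf_preparation` finishes before `fe_fs` does, since the join point then waits on the HW task alone.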

Outline
1. Backgrounds
2. FADEC
3. Evaluation
   1. Evaluation Environment
   2. Execution Time / HW Resources
   3. Accuracy
4. Conclusion

3.1. Evaluation Environment
Implement FADEC on FPGA and compare it with C++ implementation on CPU

Input image size:               96 × 64
Model:                          Pre-trained model using TUM RGB-D [Sturm et al., 2012]
FPGA:                           Xilinx ZCU104 board
HW implementation:              Written in Python, compiled using NNgen v1.3.3, and converted to bitstream using Vivado 2021.2
SW implementation:              Written and compiled using Cython v0.29
Execution:                      PYNQ v2.6
Evaluation dataset:             7-Scenes [Shotton et al., 2013]
Implementation for comparison:  Compiled using g++ 7.3.0 with -O3 option, and executed on the same FPGA board
[Figure: Xilinx ZCU104 board]

3.2. Execution Time / HW Resources
Clock frequency is 187.512 MHz
60.2 times faster than CPU-only execution
Take full advantage of HW resources

Comparison of execution time per frame
Platform             median [s]   std [s]   frequency [MHz]
CPU-only                 16.744     0.049   N/A
CPU-only (w/ PTQ)        13.248     0.035   N/A
PL + CPU (ours)           0.278     0.118   187.52

HW resource utilization of FADEC
Name       Used   Available   Utilization [%]
Slice     28256       28800              98.1
LUT      176377      230400              76.6
FF       143072      460800              31.0
DSP         128        1728              7.41
BRAM        309         312              99.0
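The headline figures follow directly from the two tables. As a quick check, the snippet below recomputes the speedup from the median times per frame and a few utilization percentages from the used and available counts; only values shown on the slide are used, and the helper names are chosen for this example.

```python
def speedup(baseline_s, accelerated_s):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_s / accelerated_s


def utilization_pct(used, available):
    """Share of a resource in use, in percent."""
    return 100.0 * used / available


cpu_only_s = 16.744   # CPU-only, median seconds per frame
cpu_ptq_s = 13.248    # CPU-only (w/ PTQ)
fadec_s = 0.278       # PL + CPU (ours)

print(round(speedup(cpu_only_s, fadec_s), 1))      # 60.2
print(round(speedup(cpu_ptq_s, fadec_s), 1))       # 47.7
print(round(utilization_pct(28256, 28800), 1))     # 98.1
print(round(utilization_pct(309, 312), 1))         # 99.0
print(round(utilization_pct(128, 1728), 2))        # 7.41
```

The contrast between about 98% slice and 99% BRAM use and about 7% DSP use suggests the design is bounded by logic and on-chip memory rather than by multipliers.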

3.3. Accuracy
Degradation is not large enough to be visually distinguishable
Accuracy measured by MSE is slightly lower, but degradation remains below 10% in most cases

Results of qualitative evaluation
[Figure: (a) Input (b) Ground truth (c) Output of C++ impl (d) Output of C++ impl w/ PTQ (e) Output of the proposed accelerator]
Results of processing the frame number 000139 in the fire-seq-01 scene. The MSEs between the outputs and ground truth are (c) 0.091, (d) 0.073, (e) 0.089, and (f) 0.084, respectively.
[Figure: (a) Input (b) Ground truth (c) Output of C++ impl (d) Output of C++ impl w/ PTQ (e) Output of the proposed accelerator]
Results of processing the frame number 000268 in the redkitchen-seq-07 scene. The MSEs between the outputs and ground truth are (c) 0.808, (d) 0.880, (e) 1.099, and (f) 1.050, respectively.
[Figure: Scene-by-scene comparison of MSE between output and ground truth]
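For reference, the MSE between a predicted depth map and the ground truth, and the relative change between two MSE values, can be computed as in the sketch below. The depth maps and the two MSE values in the example are made up for illustration and are not taken from the evaluation; how the slides define "degradation" exactly is not stated, so the relative change used here is an assumption.

```python
def mse(pred, truth):
    """Mean squared error between two equally sized 2-D depth maps."""
    diffs = [(p - t) ** 2
             for pred_row, truth_row in zip(pred, truth)
             for p, t in zip(pred_row, truth_row)]
    return sum(diffs) / len(diffs)


def relative_change(reference, value):
    """Relative change of `value` with respect to `reference` (0.1 means +10%)."""
    return (value - reference) / reference


# Hypothetical 2×2 depth maps in metres.
truth = [[1.0, 2.0], [3.0, 4.0]]
pred = [[1.0, 2.0], [3.0, 6.0]]
print(mse(pred, truth))                    # 1.0

# Hypothetical MSE values for a reference and an accelerated implementation.
print(relative_change(2.0, 2.5))           # 0.25
```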


4. Conclusion
❖ Accelerate complex depth estimation algorithm that combines traditional image/video processing algorithms and DNNs
❖ Propose and implement FPGA-based accelerator for DeepVideoMVS using HW/SW co-design
❖ Demonstrate that FADEC operates 60.2 times faster than CPU-only execution on Xilinx ZCU104 board with minimal accuracy degradation
❖ See https://github.com/casys-utokyo/fadec/
