An FPGA-Based Fully Pipelined Bilateral Grid for Real-Time Image Denoising (FPL 2021)

This presentation was given in "Session 3A: Application Acceleration" at
FPL 2021.
Program and Abstract: https://whova.com/embedded/subsession/fpl_202108/1837315/1837318/
Paper: https://arxiv.org/abs/2110.07186
Movie: https://youtu.be/q5lxi7N-uX8
Profile: https://n-hassy.info

Nobuho Hashimoto

August 15, 2021
Transcript

  1. An FPGA-Based Fully Pipelined Bilateral Grid for Real-Time Image Denoising

    Nobuho Hashimoto, Shinya Takamaeda-Yamazaki
    The University of Tokyo
    FPL 2021 (Sep. 2nd, 2021), Session 3A: Application Acceleration
  2. Outline

    ❖ Bilateral Filter (BF)
    ❖ Our Approach
      ➢ Algorithm-Level Contribution
        • Enhanced Bilateral Grid (BG) for BF
      ➢ Hardware-Level Contribution
        • Fully Pipelined Design
        • Memory Access Optimization
    ❖ Experiment on Actual FPGA
      ➢ Qualitative Evaluation
      ➢ Quantitative Evaluation
    ❖ Conclusion
  3. What is BF?

    ❖ Edge-preserving smoother
    ❖ Wide variety of applications
      ➢ Denoising
      ➢ Tone mapping
      ➢ Stylization
      ➢ Upsampling
      ➢ Optical-flow estimation
    [Figure: horse image before and after filtering]
  4. Definition of BF

    ❖ Pixels close to the pixel of interest in both space and range (intensity) receive larger weights
      ➜ Edge-preserving characteristics
    [Figure labels: the pixel of interest; pixel-wise product]
    C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” ICCV, 1998
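
The weighting on slide 4 can be illustrated with a brute-force sketch. It assumes Gaussian kernels for both the spatial and the range weight, which is the usual formulation following Tomasi and Manduchi; the function name and the parameters `sigma_s` and `sigma_r` are illustrative and do not come from the slides.

```python
import math

def bilateral_filter(img, radius, sigma_s, sigma_r):
    """Brute-force bilateral filter on a 2D grayscale image (list of lists)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            center = img[y][x]  # the pixel of interest
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue  # skip window positions outside the image
                    diff = img[yy][xx] - center
                    # Assumed Gaussian weights: one on distance, one on intensity.
                    w_space = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
                    w_range = math.exp(-(diff * diff) / (2 * sigma_r ** 2))
                    weight = w_space * w_range  # product of the two kernels
                    num += weight * img[yy][xx]
                    den += weight
            out[y][x] = num / den  # den > 0 because the center has weight 1
    return out
```

On a step edge, pixels on the far side of the edge get a range weight close to zero, so each side is smoothed without blending into the other.
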
  5. Computational Complexity

    ❖ The amount of computation grows with the window radius r
      ➢ O(r^2) per pixel
    ❖ Real-time processing of large-scale, high-resolution images is difficult
      ➢ Large number of pixels and large window radius
    [Figure: high-resolution vs. low-resolution image]
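
To make the O(r^2) cost on slide 5 concrete, the short sketch below counts the weights a brute-force BF evaluates per frame, assuming a square window of side 2r + 1. The 1920 x 1080 frame size in the usage note is an assumption for illustration; it is not stated on this slide.

```python
def bf_weights_per_frame(width, height, radius):
    """Number of window weights a brute-force BF evaluates for one frame."""
    window = (2 * radius + 1) ** 2  # pixels in a square window of radius r
    return width * height * window
```

For an assumed 1920 x 1080 frame and r = 16, this gives 2,073,600 pixels times 1,089 weights, or about 2.26 billion weight evaluations per frame.
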
  6. Our Approach

    ❖ Goal
      ➢ Real-time processing of large-scale, high-resolution images on small-scale hardware
    ❖ Method
      ➢ Suppress resource usage when the window radius is large ➜ Enhanced BG
      ➢ High throughput and low latency ➜ II = 1 pipeline and sequential processing
    [Figure: pipeline timing of one pixel and the next pixel against the clock; II = Initiation Interval]
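
The benefit of II = 1 on slide 6 can be seen from the standard cycle count of a pipeline: the first item takes the full pipeline depth, and each later item follows II cycles after the previous one. The sketch below is a generic model; the depth of 20 stages in the usage note is a hypothetical value, not a figure from the slides.

```python
def pipeline_cycles(num_items, depth, ii=1):
    """Cycles to push num_items through a pipeline with the given depth and II."""
    # First result appears after `depth` cycles; each further item adds `ii` cycles.
    return depth + (num_items - 1) * ii
```

With a hypothetical depth of 20, `pipeline_cycles(1000, 20, ii=1)` returns 1019 while `pipeline_cycles(1000, 20, ii=2)` returns 2018, so for long pixel streams the II dominates the total time.
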
  7. Bilateral Grid

    1. Grid Creation
      ➢ Store the input on a “grid” by discretization in the space and range directions
    2. Gaussian Filter
      ➢ Blur the grid using only a spatial kernel
    3. Trilinear Interpolation
      ➢ Calculate the output
    J. Chen, S. Paris, and F. Durand, “Real-time edge-aware image processing with the bilateral grid,” ACM Trans. Graph., 2007
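
The three steps on slide 7 can be sketched in software as follows. This is a minimal illustration, not the authors' hardware design: the sampling rates `s_s` and `s_r`, the nearest-cell accumulation, and the 3-tap [1, 2, 1] / 4 blur standing in for the Gaussian are all assumptions made for brevity.

```python
def bilateral_grid_filter(img, s_s, s_r):
    """Bilateral grid on a 2D grayscale image (list of lists, values >= 0).

    s_s: spatial sampling rate (pixels per cell), s_r: range sampling rate.
    """
    h, w = len(img), len(img[0])
    vmax = max(max(row) for row in img)
    # +2 leaves one spare cell so interpolation can always read index i + 1.
    gh, gw = int((h - 1) / s_s) + 2, int((w - 1) / s_s) + 2
    gd = int(vmax / s_r) + 2

    def new_grid():
        return [[[0.0] * gd for _ in range(gw)] for _ in range(gh)]

    # 1. Grid creation: accumulate intensity sum and pixel count per cell.
    total, count = new_grid(), new_grid()
    for y in range(h):
        for x in range(w):
            gy, gx, gz = round(y / s_s), round(x / s_s), round(img[y][x] / s_r)
            total[gy][gx][gz] += img[y][x]
            count[gy][gx][gz] += 1.0

    # 2. Gaussian filter: assumed 3-tap [1, 2, 1] / 4 blur along each grid axis.
    def blur(grid):
        dims = (gh, gw, gd)
        for axis in range(3):
            out = new_grid()
            for i in range(gh):
                for j in range(gw):
                    for k in range(gd):
                        acc = 2.0 * grid[i][j][k]
                        for step in (-1, 1):
                            n = [i, j, k]
                            n[axis] += step
                            if 0 <= n[axis] < dims[axis]:
                                acc += grid[n[0]][n[1]][n[2]]
                        out[i][j][k] = acc / 4.0
            grid = out
        return grid

    total, count = blur(total), blur(count)

    # 3. Trilinear interpolation: sample both grids at each pixel's position.
    def sample(grid, fy, fx, fz):
        y0, x0, z0 = int(fy), int(fx), int(fz)
        ty, tx, tz = fy - y0, fx - x0, fz - z0
        acc = 0.0
        for dy, wy in ((0, 1 - ty), (1, ty)):
            for dx, wx in ((0, 1 - tx), (1, tx)):
                for dz, wz in ((0, 1 - tz), (1, tz)):
                    acc += wy * wx * wz * grid[y0 + dy][x0 + dx][z0 + dz]
        return acc

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            pos = (y / s_s, x / s_s, img[y][x] / s_r)
            out[y][x] = sample(total, *pos) / sample(count, *pos)
    return out
```

Because pixels on opposite sides of a strong edge land in distant cells along the range axis, the blur does not mix them, which is how the grid keeps the edge-preserving behavior of the BF.
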
  8. Enhanced BG

    ❖ In the original BG, the window radius on the grid is variable
      ➢ The increase in resources is larger in the 3D grid than in the 2D input
      ➜ Fix the radius on the grid at 1 and change the radius on the input
      ➢ The radius on the input does not greatly affect resource usage

                    Radius on input   Radius on grid
      Original BG   Not considered    Variable
      Enhanced BG   Variable          1 (Fixed)
  9. Proposed Algorithm

    1. Grid Creation: Project the input image onto the grid
      ➢ Per image pixel
    2. Gaussian Filter: Blur the grid using a Gaussian filter
      ➢ Per grid element
    3. Trilinear Interpolation: Interpolate values using the input image
      ➢ Per image pixel
  10. Fully Pipelined Design

    ❖ Macro Pipeline
      ➢ Pipeline between colored areas
    ❖ Micro Pipeline
      ➢ Pipeline within colored areas
  11. Memory Access Optimization

    ❖ Grid Creation performs a Read-Modify-Write operation
      ➢ The blue area is projected onto the same grid element
      ➢ II = 1 is basically impossible
    ❖ The r sequential accesses in the y direction (red area) are exploited
    ❖ 1.5 to 2 times faster
    [Figure: input image and accesses to each element]
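
One way to picture the idea on slide 11 is to merge consecutive updates that target the same grid element, so that memory is touched once per run instead of once per pixel. The sketch below is a software analogy under that assumption; it does not reproduce the authors' hardware, and both function names are hypothetical.

```python
def accumulate_naive(cell_ids, values, num_cells):
    """Read-modify-write the grid for every input value."""
    grid = [0.0] * num_cells
    accesses = 0
    for c, v in zip(cell_ids, values):
        grid[c] = grid[c] + v  # one read and one write per input
        accesses += 2
    return grid, accesses

def accumulate_coalesced(cell_ids, values, num_cells):
    """Sum runs of inputs that hit the same cell in a local register first."""
    grid = [0.0] * num_cells
    accesses = 0
    current, acc = None, 0.0
    for c, v in zip(cell_ids, values):
        if c != current:
            if current is not None:
                grid[current] += acc  # flush the finished run: one read, one write
                accesses += 2
            current, acc = c, 0.0
        acc += v  # stays in the register, no memory access
    if current is not None:
        grid[current] += acc  # flush the last run
        accesses += 2
    return grid, accesses
```

For eight pixels in a column with r = 4, the cell indices are `[0, 0, 0, 0, 1, 1, 1, 1]`; both functions produce the same grid, but the naive version counts 16 memory accesses and the coalesced version counts 4.
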
  12. Experiment

    ❖ Implementation on a ZCU104 FPGA board
    ❖ Tools
      ➢ Vivado HLS 2019.2
        • Generate Verilog code (high-level synthesis)
      ➢ Vivado 2019.2
        • Generate the bitstream
      ➢ PYNQ v2.6
        • Exchange data with the board
    [Figure: ZCU104 FPGA board]
  13. Denoising Quality

    [Figure: original image; image with Gaussian noise; image processed by BF; image processed by BG]
  14. Comparison with Different Radius

    ❖ No metric changes greatly when the window radius is enlarged

    Window radius           4                8                12               16
    Clock frequency (MHz)   214              214              214              214
    Frame rate (fps)        95.15            100.13           99.24            98.36
    Slice                   1955 (6.79 %)    2214 (7.69 %)    1611 (5.59 %)    1986 (6.90 %)
    LUT                     10449 (4.54 %)   11490 (4.99 %)   9013 (3.91 %)    9877 (4.29 %)
    FF                      8682 (1.88 %)    7654 (1.66 %)    7438 (1.61 %)    6923 (1.50 %)
    DSP                     19 (1.10 %)      15 (0.87 %)      15 (0.87 %)      15 (0.87 %)
    BRAM                    22 (7.05 %)      23 (7.37 %)      26.5 (8.49 %)    28 (8.97 %)

    Table: Comparison of the speed and resources of our design for different window radii
  15. Comparison with Other Designs

    ❖ High-speed processing of large images with a large window radius while suppressing resources
      ➢ Faster than the GPU (A100 PCIe) implementation
    [Table: comparison of speed and resources between our design, a GPU implementation of the BF, and other existing designs]
    (2) A. Gabiger-Rose, M. Kube, R. Weigel, and R. Rose, “An FPGA-based fully synchronized design of a bilateral filter for real-time image denoising,” Transactions on Industrial Electronics, 2014
    (3) S. D. Dabhade, G. N. Rathna, and K. N. Chaudhury, “A reconfigurable and scalable FPGA architecture for bilateral filtering,” Transactions on Industrial Electronics, 2018
  16. Conclusion

    ❖ Enhance the Bilateral Grid (BG) so that the window size can be varied
      ➢ BG is used to accelerate the Bilateral Filter (BF)
    ❖ Propose a fully pipelined FPGA implementation of the BG
    ❖ Verify on an actual FPGA that our design outperforms others in speed and resources