
Quantization and FPGA Implementation of Neural Networks

The Department of Information Science, Faculty of Science, the University of Tokyo (commonly called 理情) offers a course named Information Science Exercises III (commonly called 演習3), in which students visit a total of three laboratories and work on a short-term project at each. These slides were used for my final presentation when I visited the Takamaeda Laboratory, where I worked on quantization and FPGA implementation of neural networks.

These slides present my work on the quantization and FPGA implementation of neural networks.

Masayuki Usui

November 08, 2023

Transcript

  1. Neural network quantization
     Quantization is especially important for the hardware implementation of neural networks. For example, using (u)int8 arithmetic instead of float32 improves both area and performance. Personally, I am strongly convinced that floating-point operations are costly in hardware, since I actually implemented fadd last semester. (A small quantization sketch follows below.)
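     As a minimal illustration of the kind of (u)int8 quantization mentioned above, here is a NumPy sketch of affine (scale and zero-point) quantization of a float32 tensor to uint8. The function names and the asymmetric scheme are my own choices for this example, not taken from the slides.

     import numpy as np

     def quantize_uint8(x: np.ndarray):
         """Affine (asymmetric) quantization of a float32 tensor to uint8."""
         qmin, qmax = 0, 255
         x_min, x_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)  # keep 0 exactly representable
         scale = (x_max - x_min) / (qmax - qmin)
         zero_point = int(round(qmin - x_min / scale))
         q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
         return q, scale, zero_point

     def dequantize(q, scale, zero_point):
         """Map uint8 codes back to approximate float32 values."""
         return scale * (q.astype(np.float32) - zero_point)

     x = np.random.randn(64).astype(np.float32)
     q, s, z = quantize_uint8(x)
     print(np.abs(dequantize(q, s, z) - x).max())  # rounding error is bounded by about scale / 2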
  2. Neural network quantization (cont’d)
     As a rule of thumb, 8-bit quantization is easy and 4-bit quantization is difficult. However, it is known that per-tensor 8-bit quantization significantly degrades accuracy for some models, notably MobileNetV2. Nagel et al. addressed this by proposing cross-layer equalization (CLE). I implemented CLE and conducted quantization experiments on MobileNetV2 using the ImageNet dataset. (A sketch of the CLE scaling follows below.)
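     The following is a minimal sketch of the per-channel scaling at the heart of cross-layer equalization as described by Nagel et al.: output channel i of the first layer is divided by s_i and the matching input channel of the second layer is multiplied by s_i, with s_i chosen to equalize the per-channel weight ranges. The function name and the simplification to bias-free fully-connected layers are my own; the real implementation must also rescale the first layer's bias and handle convolution weight shapes.

     import torch

     def cross_layer_equalize(w1: torch.Tensor, w2: torch.Tensor):
         """Equalize weight ranges of two consecutive layers (Nagel et al., 2019).

         w1: weights of layer 1, shape (out1, in1)
         w2: weights of layer 2, shape (out2, out1)
         Assumes a positively scale-equivariant activation (e.g. ReLU) in between.
         """
         r1 = w1.abs().amax(dim=1)        # range of each output channel of layer 1
         r2 = w2.abs().amax(dim=0)        # range of each input channel of layer 2
         s = torch.sqrt(r1 * r2) / r2     # s_i = sqrt(r1_i * r2_i) / r2_i
         w1_eq = w1 / s.unsqueeze(1)      # scale output channels of layer 1 by 1 / s_i
         w2_eq = w2 * s.unsqueeze(0)      # scale input channels of layer 2 by s_i
         return w1_eq, w2_eq, s

     w1, w2, x = torch.randn(16, 8), torch.randn(4, 16), torch.randn(8)
     w1_eq, w2_eq, s = cross_layer_equalize(w1, w2)
     # The composed function is unchanged because s_i > 0 commutes with ReLU:
     print(torch.allclose(w2 @ torch.relu(w1 @ x), w2_eq @ torch.relu(w1_eq @ x), atol=1e-5))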
  3. Experimental settings
     I quantized pretrained models provided by PyTorch. Although PyTorch has built-in functions for batch normalization folding and quantization, I implemented these steps myself. (A batch normalization folding sketch follows below.)
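     As a minimal sketch of batch normalization folding, the function below absorbs a BatchNorm2d layer that follows a convolution into the convolution's weights and bias. The function name is mine and, for brevity, the sketch covers only the plain Conv2d + BatchNorm2d case (no groups or dilation handling).

     import torch
     import torch.nn as nn

     @torch.no_grad()
     def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
         """Fold y = BN(conv(x)) into a single convolution using the BN running statistics."""
         scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)      # gamma / sqrt(var + eps)
         fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                           stride=conv.stride, padding=conv.padding, bias=True)
         fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
         bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
         fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
         return fused

     conv, bn = nn.Conv2d(3, 8, 3, padding=1, bias=False), nn.BatchNorm2d(8).eval()
     x = torch.randn(1, 3, 32, 32)
     print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))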
  4. Experimental results

     granularity   weight bits  activation bits  CLE       accuracy (%)
     per-tensor    8            8                disabled  68.808
     per-tensor    8            8                enabled   69.628
     per-channel   8            8                disabled  70.464
     per-channel   8            8                enabled   70.304
     per-tensor    6            8                disabled   9.542
     per-tensor    6            8                enabled   58.992
     per-channel   6            8                disabled  64.320
     per-channel   6            8                enabled   64.338
  5. Discussion
     Per-channel quantization performs best, and CLE improves accuracy for per-tensor quantization; both observations coincide with the results in the literature. However, 8-bit per-tensor quantization without CLE performs much worse in the literature, and this difference is beyond my understanding. I wish I could have implemented the quantized network on an FPGA, but MobileNetV2 was too big to implement by hand. (A sketch contrasting per-tensor and per-channel weight scales follows below.)
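     To make the granularity column above concrete, here is a small PyTorch sketch contrasting per-tensor and per-channel symmetric weight scales. The helper names are mine, and the inflated channel is synthetic; it only illustrates why per-channel scales tend to preserve accuracy better when channel ranges differ.

     import torch

     def weight_scales(w: torch.Tensor, n_bits: int = 8, per_channel: bool = True):
         """Symmetric weight-quantization scales, per tensor or per output channel."""
         qmax = 2 ** (n_bits - 1) - 1                     # e.g. 127 for int8
         if per_channel:
             max_abs = w.abs().flatten(1).amax(dim=1)     # one range per output channel
         else:
             max_abs = w.abs().amax().expand(w.shape[0])  # a single range shared by all channels
         return max_abs / qmax

     def fake_quantize(w, scales):
         q = torch.clamp(torch.round(w / scales.reshape(-1, 1, 1, 1)), -128, 127)
         return q * scales.reshape(-1, 1, 1, 1)           # dequantized ("fake-quantized") weights

     w = torch.randn(32, 16, 3, 3)
     w[0] *= 10                                           # one outlier channel inflates the per-tensor scale
     err_pt = (fake_quantize(w, weight_scales(w, per_channel=False)) - w).abs().mean()
     err_pc = (fake_quantize(w, weight_scales(w, per_channel=True)) - w).abs().mean()
     print(err_pt > err_pc)  # per-channel scales give lower quantization error here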
  6. FPGA implementation
     A simple CNN with two convolution layers and one fully-connected layer was trained, tested, and quantized on the MNIST dataset (I implemented the quantization myself). A simple CNN with four convolution layers and one fully-connected layer was trained, tested, and quantized on the CIFAR-10 dataset (again with my own quantization). For MNIST, a separate computation unit is used for each layer; for CIFAR-10, a common computation unit is shared by all layers. (A sketch of the MNIST network follows below.)
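     The slides do not give the exact layer configuration, so the PyTorch sketch below only illustrates the stated structure of the MNIST model (two convolution layers plus one fully-connected layer); the channel counts, kernel sizes, and pooling are my own assumptions.

     import torch
     import torch.nn as nn

     class SmallMnistCnn(nn.Module):
         """Two convolution layers + one fully-connected layer, as described in the slides.
         Channel counts and kernel sizes are illustrative assumptions."""
         def __init__(self):
             super().__init__()
             self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
             self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
             self.pool = nn.MaxPool2d(2)
             self.fc = nn.Linear(16 * 7 * 7, 10)

         def forward(self, x):
             x = self.pool(torch.relu(self.conv1(x)))   # 28x28 -> 14x14
             x = self.pool(torch.relu(self.conv2(x)))   # 14x14 -> 7x7
             return self.fc(x.flatten(1))

     logits = SmallMnistCnn()(torch.randn(1, 1, 28, 28))
     print(logits.shape)  # torch.Size([1, 10])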
  7. FPGA implementation details
     Technology: Zynq UltraScale+, PYNQ, and Vitis HLS
     Board: Ultra96-V2
     (Figure: board image cited from Avnet)
  8. FPGA implementation (MNIST)
     All parameters are loaded from PS main memory to PL BRAM at once, and all feature maps are placed in PL BRAM. Two computation units are created, one for each of the two convolution layers. This scheme scales poorly but is simple to implement and optimize. By parallelizing each computation unit along the output-channel direction, a roughly 40x overall speedup was achieved (measured by the latency estimated by Vitis HLS co-simulation). (A software sketch of the arithmetic a computation unit performs follows below.)
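     This is not the HLS code itself; it is a NumPy model of what a computation unit computes: an integer convolution followed by requantization to uint8, with the work for all output channels of one output pixel expressed as a single vectorized operation, which is the software analogue of unrolling the output-channel loop in hardware. The function name and the simplifications (valid convolution, input zero point of 0) are mine.

     import numpy as np

     def int_conv2d(x_q, w_q, m, zero_point):
         """Integer convolution (no padding) followed by requantization to uint8.

         x_q: uint8 input feature map, shape (C_in, H, W); its zero point is assumed to be 0 here
         w_q: int8 weights, shape (C_out, C_in, kH, kW)
         m:   requantization multiplier (s_x * s_w / s_y); zero_point: output zero point
         """
         c_out, c_in, kh, kw = w_q.shape
         _, h, w = x_q.shape
         out_h, out_w = h - kh + 1, w - kw + 1
         acc = np.zeros((c_out, out_h, out_w), dtype=np.int32)
         for i in range(out_h):
             for j in range(out_w):
                 window = x_q[:, i:i + kh, j:j + kw].astype(np.int32)   # (C_in, kH, kW)
                 # All output channels of this pixel are computed together: the analogue
                 # of parallelizing the computation unit along the output-channel direction.
                 acc[:, i, j] = np.tensordot(w_q.astype(np.int32), window, axes=3)
         return np.clip(np.round(acc * m) + zero_point, 0, 255).astype(np.uint8)

     x_q = np.random.randint(0, 256, size=(8, 16, 16), dtype=np.uint8)
     w_q = np.random.randint(-128, 128, size=(16, 8, 3, 3), dtype=np.int8)
     print(int_conv2d(x_q, w_q, m=2**-10, zero_point=128).shape)  # (16, 14, 14)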
  9. FPGA implementation (CIFAR-10)
     Parameters are loaded from PS main memory to PL BRAM one layer at a time. Feature maps for two layers are placed in PL BRAM (double buffering). Only one computation unit is created, and it is reused to compute all four convolution layers. I'm afraid I didn't have enough time to optimize the design for CIFAR-10. (A sketch of the double-buffered schedule follows below.)
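     A minimal Python sketch of the double-buffered (ping-pong) schedule described above: the current layer's input and output feature maps live in two buffers that swap roles after each layer, and a single compute routine stands in for the shared computation unit. The function and buffer names are illustrative, not taken from the HLS design.

     def run_network(input_fmap, layers, load_params, compute_layer):
         """Run all layers through one shared compute routine with two feature-map buffers.

         load_params(layer)       -> weights of that layer (PS DRAM -> PL BRAM in hardware)
         compute_layer(w, in_buf) -> output feature map (the shared computation unit)
         """
         buf_a, buf_b = input_fmap, None        # ping-pong feature-map buffers
         for layer in layers:
             weights = load_params(layer)       # only one layer's parameters reside on chip
             buf_b = compute_layer(weights, buf_a)
             buf_a, buf_b = buf_b, buf_a        # swap: the output becomes the next input
         return buf_a

     # Illustrative use with plain Python lists standing in for BRAM buffers:
     out = run_network([1, 2, 3], layers=range(4),
                       load_params=lambda l: l + 1,
                       compute_layer=lambda w, x: [w * v for v in x])
     print(out)  # [24, 48, 72]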
  10. Conclusion
      I went through the entire process of the hardware implementation of neural networks: training, testing, quantization, and implementation on an FPGA. I gained perspectives on both software (including machine learning and quantization) and hardware (including SoC FPGAs and HLS). In the future, I want to further explore optimizing the architecture and implementation for neural networks and to scale up the size of the neural network models.
  11. References
      M. Nagel, M. van Baalen, T. Blankevoort and M. Welling, "Data-Free Quantization Through Weight Equalization and Bias Correction," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1325-1334, doi: 10.1109/ICCV.2019.00141.