
Quantization and FPGA Implementation of Neural Networks

The Department of Information Science, Faculty of Science, the University of Tokyo (commonly called 理情) offers a course named Information Science Exercises III (commonly called 演習3), in which students visit a total of three laboratories and work on a short-term project at each. These slides were used for my final presentation when I visited the Takamaeda Laboratory, where I worked on quantization and FPGA implementation of neural networks.

These slides present my work on the quantization and FPGA implementation of neural networks.

Masayuki Usui

November 08, 2023

Transcript

  1. Neural network quantization
     Quantization is especially important for the hardware implementation of neural networks. For example, using (u)int8 arithmetic instead of float32 improves both area and performance. Personally, I am strongly convinced that floating-point operations are costly in hardware, since I actually implemented fadd last semester. (A small quantization sketch follows below.)
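     As a minimal illustration of the kind of (u)int8 quantization mentioned above, here is a NumPy sketch of affine (scale and zero-point) quantization of a float32 tensor to uint8. The function names and the asymmetric scheme are my own choices for this example, not taken from the slides.

     import numpy as np

     def quantize_uint8(x: np.ndarray):
         """Affine (asymmetric) quantization of a float32 tensor to uint8."""
         qmin, qmax = 0, 255
         x_min, x_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)  # keep 0 exactly representable
         scale = (x_max - x_min) / (qmax - qmin)
         zero_point = int(round(qmin - x_min / scale))
         q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
         return q, scale, zero_point

     def dequantize(q, scale, zero_point):
         """Map uint8 codes back to approximate float32 values."""
         return scale * (q.astype(np.float32) - zero_point)

     x = np.random.randn(64).astype(np.float32)
     q, s, z = quantize_uint8(x)
     print(np.abs(dequantize(q, s, z) - x).max())  # rounding error is bounded by about scale / 2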
  2. Neural network quantization (cont’d)
     As a rule of thumb, 8-bit quantization is easy and 4-bit quantization is difficult. However, it is known that per-tensor 8-bit quantization significantly degrades accuracy for some models, notably MobileNetV2. Nagel et al. addressed this by proposing cross-layer equalization (CLE). I implemented CLE and conducted quantization experiments on MobileNetV2 using the ImageNet dataset. (A sketch of the CLE scaling follows below.)
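     The following is a minimal sketch of the per-channel scaling at the heart of cross-layer equalization as described by Nagel et al.: output channel i of the first layer is divided by s_i and the matching input channel of the second layer is multiplied by s_i, with s_i chosen to equalize the per-channel weight ranges. The function name and the simplification to bias-free fully-connected layers are my own; the real implementation must also rescale the first layer's bias and handle convolution weight shapes.

     import torch

     def cross_layer_equalize(w1: torch.Tensor, w2: torch.Tensor):
         """Equalize weight ranges of two consecutive layers (Nagel et al., 2019).

         w1: weights of layer 1, shape (out1, in1)
         w2: weights of layer 2, shape (out2, out1)
         Assumes a positively scale-equivariant activation (e.g. ReLU) in between.
         """
         r1 = w1.abs().amax(dim=1)        # range of each output channel of layer 1
         r2 = w2.abs().amax(dim=0)        # range of each input channel of layer 2
         s = torch.sqrt(r1 * r2) / r2     # s_i = sqrt(r1_i * r2_i) / r2_i
         w1_eq = w1 / s.unsqueeze(1)      # scale output channels of layer 1 by 1 / s_i
         w2_eq = w2 * s.unsqueeze(0)      # scale input channels of layer 2 by s_i
         return w1_eq, w2_eq, s

     w1, w2, x = torch.randn(16, 8), torch.randn(4, 16), torch.randn(8)
     w1_eq, w2_eq, s = cross_layer_equalize(w1, w2)
     # The composed function is unchanged because s_i > 0 commutes with ReLU:
     print(torch.allclose(w2 @ torch.relu(w1 @ x), w2_eq @ torch.relu(w1_eq @ x), atol=1e-5))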
  3. Experimental settings
     I quantized pretrained models provided by PyTorch. Although PyTorch has built-in functions for batch normalization folding and quantization, I implemented these steps myself. (A batch normalization folding sketch follows below.)
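     As a minimal sketch of batch normalization folding, the function below absorbs a BatchNorm2d layer that follows a convolution into the convolution's weights and bias. The function name is mine and, for brevity, the sketch covers only the plain Conv2d + BatchNorm2d case (no groups or dilation handling).

     import torch
     import torch.nn as nn

     @torch.no_grad()
     def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
         """Fold y = BN(conv(x)) into a single convolution using the BN running statistics."""
         scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)      # gamma / sqrt(var + eps)
         fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                           stride=conv.stride, padding=conv.padding, bias=True)
         fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
         bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
         fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
         return fused

     conv, bn = nn.Conv2d(3, 8, 3, padding=1, bias=False), nn.BatchNorm2d(8).eval()
     x = torch.randn(1, 3, 32, 32)
     print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))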
  4. Experimental results

     granularity   weight bits  activation bits  CLE       accuracy (%)
     per-tensor    8            8                disabled  68.808
     per-tensor    8            8                enabled   69.628
     per-channel   8            8                disabled  70.464
     per-channel   8            8                enabled   70.304
     per-tensor    6            8                disabled   9.542
     per-tensor    6            8                enabled   58.992
     per-channel   6            8                disabled  64.320
     per-channel   6            8                enabled   64.338
  5. Discussion
     Per-channel quantization performs best, and CLE improves accuracy for per-tensor quantization; both observations coincide with the results in the literature. However, 8-bit per-tensor quantization without CLE performs much worse in the literature, and this difference is beyond my understanding. I wish I could have implemented the quantized network on an FPGA, but MobileNetV2 was too big to implement by hand. (A sketch contrasting per-tensor and per-channel weight scales follows below.)
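     To make the granularity column above concrete, here is a small PyTorch sketch contrasting per-tensor and per-channel symmetric weight scales. The helper names are mine, and the inflated channel is synthetic; it only illustrates why per-channel scales tend to preserve accuracy better when channel ranges differ.

     import torch

     def weight_scales(w: torch.Tensor, n_bits: int = 8, per_channel: bool = True):
         """Symmetric weight-quantization scales, per tensor or per output channel."""
         qmax = 2 ** (n_bits - 1) - 1                     # e.g. 127 for int8
         if per_channel:
             max_abs = w.abs().flatten(1).amax(dim=1)     # one range per output channel
         else:
             max_abs = w.abs().amax().expand(w.shape[0])  # a single range shared by all channels
         return max_abs / qmax

     def fake_quantize(w, scales):
         q = torch.clamp(torch.round(w / scales.reshape(-1, 1, 1, 1)), -128, 127)
         return q * scales.reshape(-1, 1, 1, 1)           # dequantized ("fake-quantized") weights

     w = torch.randn(32, 16, 3, 3)
     w[0] *= 10                                           # one outlier channel inflates the per-tensor scale
     err_pt = (fake_quantize(w, weight_scales(w, per_channel=False)) - w).abs().mean()
     err_pc = (fake_quantize(w, weight_scales(w, per_channel=True)) - w).abs().mean()
     print(err_pt > err_pc)  # per-channel scales give lower quantization error here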
  6. FPGA implementation
     A simple CNN with two convolution layers and one fully-connected layer was trained, tested, and quantized on the MNIST dataset (I implemented the quantization myself). A simple CNN with four convolution layers and one fully-connected layer was trained, tested, and quantized on the CIFAR-10 dataset (again with my own quantization). For MNIST, a separate computation unit is used for each layer; for CIFAR-10, a common computation unit is shared by all layers. (A sketch of the MNIST network follows below.)
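     The slides do not give the exact layer configuration, so the PyTorch sketch below only illustrates the stated structure of the MNIST model (two convolution layers plus one fully-connected layer); the channel counts, kernel sizes, and pooling are my own assumptions.

     import torch
     import torch.nn as nn

     class SmallMnistCnn(nn.Module):
         """Two convolution layers + one fully-connected layer, as described in the slides.
         Channel counts and kernel sizes are illustrative assumptions."""
         def __init__(self):
             super().__init__()
             self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
             self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
             self.pool = nn.MaxPool2d(2)
             self.fc = nn.Linear(16 * 7 * 7, 10)

         def forward(self, x):
             x = self.pool(torch.relu(self.conv1(x)))   # 28x28 -> 14x14
             x = self.pool(torch.relu(self.conv2(x)))   # 14x14 -> 7x7
             return self.fc(x.flatten(1))

     logits = SmallMnistCnn()(torch.randn(1, 1, 28, 28))
     print(logits.shape)  # torch.Size([1, 10])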
  7. FPGA implementation details
     Technology: Zynq UltraScale+, PYNQ, and Vitis HLS
     Board: Ultra96-V2
     (Figure: board image cited from Avnet)
  8. FPGA implementation (MNIST)
     All parameters are loaded from PS main memory to PL BRAM at once, and all feature maps are placed in PL BRAM. Two computation units are created, one for each of the two convolution layers. This scheme scales poorly but is simple to implement and optimize. By parallelizing each computation unit along the output-channel direction, a roughly 40x overall speedup was achieved (measured by the latency estimated by Vitis HLS co-simulation). (A software sketch of the arithmetic a computation unit performs follows below.)
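     This is not the HLS code itself; it is a NumPy model of what a computation unit computes: an integer convolution followed by requantization to uint8, with the work for all output channels of one output pixel expressed as a single vectorized operation, which is the software analogue of unrolling the output-channel loop in hardware. The function name and the simplifications (valid convolution, input zero point of 0) are mine.

     import numpy as np

     def int_conv2d(x_q, w_q, m, zero_point):
         """Integer convolution (no padding) followed by requantization to uint8.

         x_q: uint8 input feature map, shape (C_in, H, W); its zero point is assumed to be 0 here
         w_q: int8 weights, shape (C_out, C_in, kH, kW)
         m:   requantization multiplier (s_x * s_w / s_y); zero_point: output zero point
         """
         c_out, c_in, kh, kw = w_q.shape
         _, h, w = x_q.shape
         out_h, out_w = h - kh + 1, w - kw + 1
         acc = np.zeros((c_out, out_h, out_w), dtype=np.int32)
         for i in range(out_h):
             for j in range(out_w):
                 window = x_q[:, i:i + kh, j:j + kw].astype(np.int32)   # (C_in, kH, kW)
                 # All output channels of this pixel are computed together: the analogue
                 # of parallelizing the computation unit along the output-channel direction.
                 acc[:, i, j] = np.tensordot(w_q.astype(np.int32), window, axes=3)
         return np.clip(np.round(acc * m) + zero_point, 0, 255).astype(np.uint8)

     x_q = np.random.randint(0, 256, size=(8, 16, 16), dtype=np.uint8)
     w_q = np.random.randint(-128, 128, size=(16, 8, 3, 3), dtype=np.int8)
     print(int_conv2d(x_q, w_q, m=2**-10, zero_point=128).shape)  # (16, 14, 14)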
  9. FPGA implementation (CIFAR-10)
     Parameters are loaded from PS main memory to PL BRAM one layer at a time. Feature maps for two layers are placed in PL BRAM (double buffering). Only one computation unit is created, and it is reused to compute all four convolution layers. I'm afraid I didn't have enough time to optimize the design for CIFAR-10. (A sketch of the double-buffered schedule follows below.)
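     A minimal Python sketch of the double-buffered (ping-pong) schedule described above: the current layer's input and output feature maps live in two buffers that swap roles after each layer, and a single compute routine stands in for the shared computation unit. The function and buffer names are illustrative, not taken from the HLS design.

     def run_network(input_fmap, layers, load_params, compute_layer):
         """Run all layers through one shared compute routine with two feature-map buffers.

         load_params(layer)       -> weights of that layer (PS DRAM -> PL BRAM in hardware)
         compute_layer(w, in_buf) -> output feature map (the shared computation unit)
         """
         buf_a, buf_b = input_fmap, None        # ping-pong feature-map buffers
         for layer in layers:
             weights = load_params(layer)       # only one layer's parameters reside on chip
             buf_b = compute_layer(weights, buf_a)
             buf_a, buf_b = buf_b, buf_a        # swap: the output becomes the next input
         return buf_a

     # Illustrative use with plain Python lists standing in for BRAM buffers:
     out = run_network([1, 2, 3], layers=range(4),
                       load_params=lambda l: l + 1,
                       compute_layer=lambda w, x: [w * v for v in x])
     print(out)  # [24, 48, 72]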
  10. Conclusion
      I went through the entire process of the hardware implementation of neural networks: training, testing, quantization, and implementation on an FPGA. I gained perspectives on both software (including machine learning and quantization) and hardware (including SoC FPGAs and HLS). In the future, I want to further explore optimizing the architecture and implementation for neural networks and to scale up the size of the neural network models.
  11. References
      M. Nagel, M. van Baalen, T. Blankevoort and M. Welling, "Data-Free Quantization Through Weight Equalization and Bias Correction," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1325-1334, doi: 10.1109/ICCV.2019.00141.