TMPA-2021: Analysis of Hardware-Implemented U-Net–like Convolutional Neural Networks

Ivan Zoev, Nikolay Markov, Konstantin Maslov and Evgeniy Mytsko

TMPA is an annual International Conference on Software Testing, Machine Learning and Complex Process Analysis. The conference will focus on the application of modern methods of data science to the analysis of software quality.

To learn more about Exactpro, visit our website https://exactpro.com/

Follow us on
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Twitter https://twitter.com/exactpro

Exactpro

November 25, 2021

Transcript

  1. 25-27 November: Software Testing, Machine Learning and Complex Process Analysis

    Analysis of Hardware-Implemented U-Net–like Convolutional Neural Networks. Ivan Zoev, Konstantin Maslov, Nikolay Markov and Evgeniy Mytsko
  2. Relevance of the Topic

    Designing mobile monitoring systems with an intelligent computer vision system (CVS) based on unmanned aerial vehicles (UAVs) is relevant for monitoring hazardous technological objects as well as forests damaged by pests. Balancing the speed of the computing unit (CU), the object recognition quality, and the mass and power consumption is difficult, so it requires studies of software and hardware implementations of convolutional neural networks (CNNs).
  3. Purpose and Objectives of Research

    The purpose is to study the effectiveness of U-Net CNN models implemented in hardware on the field-programmable gate arrays (FPGAs) of modern systems on a chip (SoCs).

    Objectives:
    • design U-Net models
    • implement the models in software
    • fit the models using a previously prepared dataset
    • transfer the fitted models to hardware
    • evaluate the performance, computational costs and power consumption of the software- and hardware-implemented CNNs
  4. Developed U-Net Model

    The model was developed on the basis of the original U-Net model. Differences from the original U-Net model:
    • the input image of the network is represented by a 256x256x3 tensor, which corresponds to an ordinary RGB image
    • convolutions do not reduce the size of the feature maps, and cropping is not used
    • batch normalization is applied after each activation function: LeakyReLU (or ELU, ReLU)
    • the output tensor is calculated using C convolutions with 1x1 kernels, which makes it possible to classify each pixel into one of C classes
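The statement that convolutions do not reduce the feature map size can be checked with the standard convolution output-size formula. The sketch below is only an illustration: the function name and the choice of padding values are assumptions, since the slides do not state how padding is configured.

```python
def conv_output_size(size, kernel, padding=0, stride=1, dilation=1):
    """Spatial output size of a convolution along one axis."""
    # A dilated kernel covers dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


# Unpadded 3x3 convolution (as in the original U-Net): each layer loses 2 pixels.
unpadded = conv_output_size(256, kernel=3, padding=0)

# 3x3 convolution with one pixel of zero padding (assumed here): size is preserved.
padded = conv_output_size(256, kernel=3, padding=1)

print(unpadded, padded)
```

With size-preserving convolutions the encoder and decoder feature maps already match, which is why no cropping is needed before concatenation.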
  5. U-Net Model with Dilated Convolutions

    Differences between dilated and ordinary convolutions:
    • zero coefficients are inserted into the convolution kernel
    • the «dilation rate» parameter is applied

    The U-Net model with dilated convolutions is obtained by replacing every two consecutive convolutions with 3x3 kernels with one dilated convolution with 5x5 kernels (dilation rate is 2).
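Inserting zero coefficients into a kernel can be sketched in a few lines. This example assumes that the 5x5 kernel mentioned on the slide is a 3x3 kernel dilated with rate 2; that reading is not confirmed by the slides, and both function names are hypothetical.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring coefficients of a square kernel."""
    k = len(kernel)
    size = rate * (k - 1) + 1
    dilated = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            dilated[i * rate][j * rate] = kernel[i][j]
    return dilated


def stacked_receptive_field(kernel_sizes):
    """Receptive field of consecutive stride-1, undilated convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


kernel_3x3 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
for row in dilate_kernel(kernel_3x3, rate=2):
    print(row)

# Two stacked 3x3 convolutions see the same 5x5 window as the dilated kernel.
print(stacked_receptive_field([3, 3]))
```

Under this assumption the dilated layer covers the same window as the two layers it replaces while keeping only nine non-zero coefficients per kernel.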
  6. Dataset Preparation

    • pictures of Abies sibirica trees damaged by Polygraphus proximus
    • five classes: four classes of tree condition (living, dying, recently dead, long dead) and background
    • image patches of 256x256x3 are fed to the CNNs
    • number of samples (labeled patches) for training: 2004
    • number of samples (labeled patches) for validation: 673

    Figure: one 256x256x3 patch and its ground truth (made by experts). Legend: green = living, yellow = dying, red = recently dead, white = long dead.
  7. Software Implementation of the U-Net Models

    CNN models are implemented on:
    • Intel Core i7-8700 CPU + NVIDIA GeForce GTX 1080 Ti GPU (Python + Keras)
    • Intel Core i3-9100F CPU, 3600 MHz (Python, C/C++)
  8. SoC for Hardware Implementation of the CNN Models

    • modern SoCs with high-performance FPGAs
    • low power consumption, no more than 10-12 W
    • SoC Zynq-7000 (Kintex FPGA) by Xilinx
    • Tools:
      - SystemVerilog HDL
      - Vivado
  9. High-Level Functional Diagram of the CU

    • the FPGA-based CNN models have direct access to the external memory
    • universal unified computational blocks are used for convolution and subsampling
    • the convolution / subsampling blocks are unified by extracting the block parameters and placing them in the configuration memory area («CONFIG space»)
    • the block «Hardware implementation of CNN» contains a neural computing unit (NCU) made up of universal unified computing units
  10. Computing Unit Architecture

    • the NCU contains 64 universal computing units that work simultaneously on each layer of the CNN
    • each output feature map is generated by a separate universal computing unit, in parallel with the others
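One possible reading of this architecture is that a layer with more than 64 output feature maps is processed in several passes, 64 maps at a time. The sketch below illustrates only that assumption; the scheduling policy and the function names are hypothetical and are not taken from the slides.

```python
from math import ceil

N_UNITS = 64  # number of universal computing units stated on the slide


def schedule_feature_maps(n_maps, n_units=N_UNITS):
    """Group output feature map indices into passes of at most n_units maps (assumed policy)."""
    if n_maps <= 0:
        raise ValueError("n_maps must be positive")
    return [list(range(start, min(start + n_units, n_maps)))
            for start in range(0, n_maps, n_units)]


def utilization(n_maps, n_units=N_UNITS):
    """Fraction of unit-passes that do useful work under the assumed policy."""
    if n_maps <= 0:
        raise ValueError("n_maps must be positive")
    passes = ceil(n_maps / n_units)
    return n_maps / (passes * n_units)


for n_maps in (5, 64, 128):
    passes = schedule_feature_maps(n_maps)
    print(n_maps, len(passes), round(utilization(n_maps), 3))
```

Under this assumption, a layer with few output maps, such as a final 1x1 layer with five class maps, would leave most of the 64 units idle.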
  11. Results of the CNN Implementation in Software (1)

    • Intel Core i7-8700 CPU + NVIDIA GeForce GTX 1080 Ti GPU (Python + Keras)

    Table 1. Quality of segmentation for U-Net and U-Net with dilated convolutions (U-Net*)

    Model    IoU                                                  mIoU
             Back.   Living   Dying   Recently dead   Long dead
    U-Net    0.87    0.77     0.39    0.75            0.62        0.68
    U-Net*   0.86    0.76     0.37    0.75            0.62        0.67

    Table 2. Inference time for U-Net and U-Net with dilated convolutions (U-Net*) on CPU+GPU

    Model    One patch inference time, ms
             Median   MAD
    U-Net    30.78    0.16
    U-Net*   27.67    0.16

    Figure panels: test area; ground truth; output of the U-Net model; output of the U-Net model with dilated convolutions.

    IoU: Intersection over Union; mIoU: mean Intersection over Union; MAD: median absolute deviation.
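The metrics named on this slide can be computed with the Python standard library. The sketch below uses flat lists of class labels as a stand-in for segmentation maps; the function names and the toy data are illustrative and are not the authors' evaluation code.

```python
from statistics import mean, median


def class_iou(truth, pred, cls):
    """Intersection over Union for one class over flat label sequences."""
    pairs = list(zip(truth, pred))
    intersection = sum(1 for t, p in pairs if t == cls and p == cls)
    union = sum(1 for t, p in pairs if t == cls or p == cls)
    return intersection / union if union else float("nan")


def median_abs_deviation(values):
    """Median of absolute deviations from the median (MAD)."""
    values = list(values)
    centre = median(values)
    return median([abs(v - centre) for v in values])


# Toy labels: 0 = background, 1 = living (hypothetical data).
truth = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
print(class_iou(truth, pred, 0), class_iou(truth, pred, 1))

# mIoU is the mean of the per-class IoU values; U-Net row of Table 1.
unet_iou = [0.87, 0.77, 0.39, 0.75, 0.62]
print(round(mean(unet_iou), 2))
```

The median and MAD are robust to occasional slow runs, which makes them a reasonable way to summarize repeated inference-time measurements.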
  12. Results of the CNN Implementation in Software (2)

    • Intel Core i3-9100F CPU, 3600 MHz (Python, C/C++)
    • 16-bit (IEEE float16) and 32-bit (float32) floating point numbers

    Table 3. Inference time for U-Net and U-Net with dilated convolutions (U-Net*) on x86 CPU

    CNN model   One patch inference time, sec
                float32            float16
                Median   MAD       Median   MAD
    U-Net       13.61    0.04      45.86    0.03
    U-Net*      18.05    0.07      61.38    0.04

    Table 4. RAM size for software CNN implementation on x86 CPU

    CNN model   RAM size, min. – max., MB
                float32        float16
    U-Net       1205 – 1603    934 – 1124
    U-Net*      1022 – 1902    839 – 1278

    Figure panels: one patch 256x256x3; CNN output.
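The difference between the two number formats can be illustrated with the `struct` module, whose `e` format code is IEEE 754 half precision. This is only a sketch of the storage formats; it does not reproduce the authors' C/C++ implementation, and `to_float16` is a hypothetical helper.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


weight = 0.1  # hypothetical network weight
print("bytes per value:", struct.calcsize("<e"), struct.calcsize("<f"))
print("float16 rounding error:", abs(to_float16(weight) - weight))
print("float32 rounding error:", abs(to_float32(weight) - weight))
```

A float16 value takes half the storage of a float32 value but carries a larger rounding error per value.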
  13. Results of the CNN Implementation in Hardware

    • 16-bit (IEEE float16) and 32-bit (float32) floating point numbers
    • activation functions: ELU, ReLU and LeakyReLU

    Table 5. Quality of segmentation for U-Net and U-Net with dilated convolutions (U-Net*)

    Activation function   IoU                                                  mIoU
                          Back.   Living   Dying   Recently dead   Long dead
    float16
    ELU                   0.84    0.72     0.39    0.74            0.63        0.66
    ReLU                  0.82    0.68     0.28    0.76            0.66        0.64
    LeakyReLU             0.84    0.72     0.36    0.77            0.65        0.67
    U-Net* (LeakyReLU)    0.83    0.69     0.33    0.78            0.62        0.65
    float32
    ELU                   0.84    0.72     0.39    0.74            0.63        0.66
    ReLU                  0.83    0.68     0.29    0.76            0.66        0.64
    LeakyReLU             0.84    0.72     0.36    0.77            0.65        0.67
    U-Net* (LeakyReLU)    0.83    0.69     0.33    0.78            0.62        0.65

    Table 6. Inference time for U-Net and U-Net with dilated convolutions (U-Net*) on FPGA with 100 MHz frequency

    CNN model   One patch inference time, sec
                float32            float16
                Median   MAD       Median   MAD
    U-Net       31.12    0.07      38.03    0.12
    U-Net*      23.47    0.03      27.58    0.05

    Table 7. SoC power consumption and FPGA resources

    Type      Power, W   LUT, count   LUTRAM, count   BRAM, count
    float16   4.32       138917       2074            225.5
    float32   5.33       165278       871             328.5
  14. Discussion of the Results and Conclusion (1)

    The use of dilated convolutions in the CNN model leads to a slight loss of segmentation accuracy (by 0.02 for the mIoU metric). The segmentation quality obtained with float16 numbers in FPGA computations is practically the same as with float32: the mIoU values are identical. The computation speed of software-implemented CNN models on an Intel Core i7-8700 CPU with an NVIDIA GeForce GTX 1080 Ti GPU is almost a thousand times higher than the computation speed of these CNN models on FPGAs. The software implementation of the U-Net model on an Intel Core i3-9100F CPU with a frequency of 3600 MHz performs a patch segmentation about 2.4 times faster than on an FPGA with a frequency of 100 MHz. The U-Net model with dilated convolutions makes it possible to analyze a 256x256x3 image patch on the FPGA more than 25% faster than the original U-Net model. The power consumption of SoCs with FPGAs is slightly more than 5 W, which is 50 times less than that of the NVIDIA GeForce GTX 1080 Ti graphics accelerator (250 W) and 8 times less than that of the Intel Core i3-9100F CPU (43 W).
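The ratios quoted on this slide can be recomputed from the median inference times and power figures given in the tables of this deck. The sketch below hard-codes those tabulated values; the dictionary and function names are illustrative, and the recomputed values can differ slightly from the rounded figures quoted on the slide.

```python
# Median one-patch inference times taken from the tables in this deck.
GPU_MS = {"U-Net": 30.78, "U-Net*": 27.67}        # Table 2, milliseconds
CPU_F32_S = {"U-Net": 13.61, "U-Net*": 18.05}     # Table 3, seconds
FPGA_F32_S = {"U-Net": 31.12, "U-Net*": 23.47}    # Table 6, seconds
FPGA_F16_S = {"U-Net": 38.03, "U-Net*": 27.58}    # Table 6, seconds
FPGA_POWER_W = {"float16": 4.32, "float32": 5.33}  # Table 7
GPU_POWER_W = 250.0  # stated on the slide
CPU_POWER_W = 43.0   # stated on the slide


def ratio(slow, fast):
    """How many times larger `slow` is than `fast`."""
    return slow / fast


def time_reduction(before, after):
    """Relative reduction in time, as a fraction of `before`."""
    return 1.0 - after / before


gpu_vs_fpga = ratio(FPGA_F32_S["U-Net"], GPU_MS["U-Net"] / 1000.0)
cpu_vs_fpga = ratio(FPGA_F32_S["U-Net"], CPU_F32_S["U-Net"])
dilated_f32 = time_reduction(FPGA_F32_S["U-Net"], FPGA_F32_S["U-Net*"])
dilated_f16 = time_reduction(FPGA_F16_S["U-Net"], FPGA_F16_S["U-Net*"])
gpu_power = ratio(GPU_POWER_W, FPGA_POWER_W["float32"])
cpu_power = ratio(CPU_POWER_W, FPGA_POWER_W["float32"])

print(f"GPU vs FPGA speed: {gpu_vs_fpga:.0f}x")
print(f"CPU vs FPGA speed: {cpu_vs_fpga:.2f}x")
print(f"Dilated U-Net time reduction on FPGA: {dilated_f32:.1%} (float32), {dilated_f16:.1%} (float16)")
print(f"Power ratio: {gpu_power:.1f}x (GPU), {cpu_power:.1f}x (CPU)")
```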
  15. Discussion of the Results and Conclusion (2)

    At the moment, the 64 universal computing units in the NCU are not used optimally, so further studies and modifications of the CU are required to improve the performance of the hardware implementation on the FPGA. Further research is also required on the use of parallel computing in the CNN layers. The results of these comprehensive studies of hardware-implemented U-Net models are of scientific importance for developers of intelligent CVS used in UAV-based mobile systems for monitoring objects on the earth's surface. Using the presented results will make it possible to take better-informed design decisions when creating intelligent CVS for various applied monitoring problems.
  16. Thank You!

    Follow TMPA on Facebook: TMPA-2021 Conference