
TMPA-2021: Open-Source Tools for Neural Network Inference on FPGAs

Exactpro
November 25, 2021

Mikhail Lebedev and Pavel Belecky

TMPA is an annual international conference on software testing, machine learning and complex process analysis. The conference focuses on the application of modern data-science methods to the analysis of software quality.

To learn more about Exactpro, visit our website https://exactpro.com/

Follow us on
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Twitter https://twitter.com/exactpro


Transcript

  1. 1 25-27 NOVEMBER: SOFTWARE TESTING, MACHINE LEARNING AND COMPLEX PROCESS ANALYSIS

    Open-Source Tools for Neural Network Inference on FPGAs
    Mikhail Lebedev (1,2), Pavel Belecky (1)
    (1) Ivannikov Institute for System Programming of RAS
    (2) Plekhanov Russian University of Economics
  2. 2 Introduction

    • Hardware development demands much effort
    • Synthesis can be a solution
      ◦ High-level
      ◦ Middle-level
      ◦ Hardware construction
    [Diagram: behavioral model → synthesis → hardware model → ASIC / FPGA]
  3. 3 Hardware Construction

    • System & low-level architecture
    • Many hardware details
    • Very close to RTL
    • Input languages
      ◦ Chisel
      ◦ Bluespec
      ◦ DSLs
  4. 4 Middle-Level Synthesis

    • System architecture
    • Some hardware details
    • Input languages
      ◦ MaxJ (MaxCompiler)
      ◦ DSLX (XLS)
  5. 5 High-Level Synthesis

    • No hardware details
    • No pre-defined architecture
    • Input languages
      ◦ C/C++, C#, Java
      ◦ MATLAB
      ◦ Python
      ◦ DSLs → neural networks
      ◦ And many more…
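To make the "no hardware details" point concrete, the sketch below shows the kind of behavioral model a high-level synthesis flow starts from: a plain loop-level description of a dot product that says only *what* is computed, leaving pipelining, unrolling and memory ports to the tool. This is a generic Python sketch, not the input format of any specific tool named on the slide.

```python
# Behavioral model of a dot product: loop-level code with no notion of
# clocks, registers or ports. An HLS tool would map the loop to a
# multiply-accumulate datapath plus a control FSM, and could pipeline
# or unroll it depending on constraints.

def dot_product(a, b):
    """Multiply-accumulate over two equal-length sequences."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc

print(dot_product([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```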
  6. 6 Background

    • Our team's high-level synthesis research, 2020-2021
      ◦ 480 articles reviewed
        ▪ 59 consider neural networks
      ◦ 278 patents
      ◦ > 50 open-source tools
        ▪ 12 selected for evaluation
    • Evaluation tasks
      ◦ Neural networks (pre-trained)
      ◦ JPEG algorithm
      ◦ DPLL algorithm
      ◦ IDCT algorithm
    • Question: can we use open-source tools for synthesis?
  7. 7 Neural Network Formats

    Format      License     Developer    Active since
    ONNX        Apache 2.0  Community    2017
    TensorFlow  Apache 2.0  Google       2015
    Keras       Apache 2.0  Google       2015
    PyTorch     BSD         Facebook     2016
    Caffe       BSDv2       UC Berkeley  2014
    Caffe2      BSD         Facebook     2017
    CNTK        MIT         Microsoft    2015
    CoreML      BSDv3       Apple        2017
    MXNet       Apache 2.0  Apache       2016
  9. 9 Open-Source Tools (1)

    Tool      License     Developer                       Active since
    OpenVINO  Apache 2.0  Intel                           2018
    PlaidML   Apache 2.0  Intel                           2017
    MACE      Apache 2.0  Xiaomi                          2017
    TVM       Apache 2.0  University of Washington        2017
    XLA       Apache 2.0  Google                          2017
    LeFlow    BSD         University of British Columbia  2018
    Glow      Apache 2.0  Facebook                        2017
    hls4ml    Apache 2.0  CERN                            2017
    NNgen     Apache 2.0  Shinya Takamaeda-Yamazaki       2017
    ONNC      BSDv3       Skymizer                        2018
    Vitis AI  Apache 2.0  Xilinx                          2020
    FINN      BSDv3       Xilinx                          2018
  10. 10 Open-Source Tools (2)

    • Classification
      ◦ Fixed target architecture tools
        ▪ OpenVINO, PlaidML, MACE, XLA, Glow, TVM
      ◦ Specialized co-processor core tools
        ▪ TVM (Versatile Tensor Accelerator core, VTA), Vitis AI (Deep Learning Processing Unit cores, DPUs)
      ◦ RTL synthesis tools
        ▪ LeFlow (using LegUp), hls4ml, FINN, ONNC (using Vivado), NNgen (direct Verilog)
    • Goal
      ◦ Implement a neural network on an FPGA
  12. 12 Evaluation Models

    • MNIST-FC
      ◦ 1-layer fully-connected
      ◦ Input image: 28x28
      ◦ Activation function: Softmax
    • MNIST-CNN
      ◦ 4-layer convolutional
      ◦ Input image: 28x28
      ◦ Activation function: ReLU
    • Synthetic matrix multiplication (LeFlow only)
      ◦ MULT-0: vector-by-matrix multiplication
      ◦ MULT-10-R: 10 fully-connected 100x100 layers, output activation: ReLU
      ◦ MULT-10-S: 10 fully-connected 100x100 layers, output activation: Sigmoid
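As a reference for how small MNIST-FC is, the numpy sketch below reproduces its structure: one fully-connected layer over a flattened 28x28 image followed by a softmax. The weights here are random placeholders purely for illustration; the models evaluated in the talk were pre-trained.

```python
import numpy as np

# MNIST-FC structure: flatten 28x28 -> one dense layer -> softmax.
# Random weights stand in for the pre-trained parameters.
rng = np.random.default_rng(0)
W = rng.standard_normal((784, 10))  # 28*28 inputs -> 10 digit classes
b = np.zeros(10)

def softmax(z):
    e = np.exp(z - z.max())         # subtract max for numerical stability
    return e / e.sum()

image = rng.random((28, 28))
logits = image.reshape(784) @ W + b
probs = softmax(logits)
print(probs.argmax())               # predicted digit class
```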
  13. 13 Evaluation Environment

    • Development boards
      ◦ Terasic DE10-Standard (Cyclone V)
      ◦ Terasic DE1-SoC (Cyclone V)
      ◦ Zybo Z7-20 (Zynq-7000)
    • CPU
      ◦ Intel Core i7-6700 3.4 GHz, 32 GB RAM
      ◦ Ubuntu 20.04
    • GPU
      ◦ NVIDIA GTX-770
    • Bitstream synthesis
      ◦ Quartus 18.1/20.1
      ◦ Vivado 2020.1
    • Default settings used
  14. 14 Results

    • Failure
      ◦ ONNC: closed implementation
      ◦ hls4ml: Vivado error during RTL synthesis
      ◦ Vitis AI, FINN: commercial Vivado license needed
      ◦ NNgen: errors during model compilation
    • Success
      ◦ TVM+VTA
      ◦ LeFlow (except MNIST-CNN)
  15. 15 LeFlow

    • XLA: TensorFlow-to-LLVM translation
    • LeFlow: LLVM IR transformations
    • LegUp: RTL synthesis
    https://github.com/danielholanda/LeFlow
  16. 16 Results: LeFlow (1)

    • FPGA resource usage

      Parameter                   MNIST-FC      MULT-0         MULT-10-R        MULT-10-S
      Board                       DE1-SoC       DE10-Standard  DE10-Standard    DE10-Standard
      RTL model size, code lines  5 240         1 252          6 011            11 490
      ALM usage, units            5 331 (17%)   917 (3%)       4 598 (11%)      11 356 (27%)
      DSP usage, units            1 (1%)        4 (5%)         31 (28%)         32 (29%)
      Register usage, units       7 311         1 554          6 842            16 277
      Block memory usage, bits    278 986 (7%)  326 820 (8%)   3 213 220 (57%)  3 216 138 (57%)
      Max frequency, MHz          127.39        122.55         96.08            93.63
  17. 17 Results: LeFlow (2)

    • Performance

      Parameter                     MNIST-FC  MULT-0         MULT-10-R      MULT-10-S
      Board                         DE1-SoC   DE10-Standard  DE10-Standard  DE10-Standard
      Simulation time, clock ticks  223 590   280 303        2 808 023      2 812 513
      CPU time, ms                  4         1              5              5
      FPGA time (calculation), ms   1.76      2.29           29.23          30.03
      Acceleration                  2.28      0.44           0.17           0.17
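The acceleration row is simply the CPU-to-FPGA time ratio (values below 1.0 mean the FPGA implementation is slower than the CPU). The quick check below recomputes it from the table values; small rounding differences against the slide are expected, since the CPU times are given only to whole milliseconds.

```python
# Acceleration = CPU time / FPGA time, using the values from the table above.
cpu_ms  = {"MNIST-FC": 4,    "MULT-0": 1,    "MULT-10-R": 5,     "MULT-10-S": 5}
fpga_ms = {"MNIST-FC": 1.76, "MULT-0": 2.29, "MULT-10-R": 29.23, "MULT-10-S": 30.03}

for model in cpu_ms:
    print(model, round(cpu_ms[model] / fpga_ms[model], 2))
```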
  18. 18 TVM+VTA

    • tf2onnx: TensorFlow model → ONNX (https://github.com/onnx/tensorflow-onnx)
    • TVM (Apache): ONNX → Relay IR → computational graph: operations (LLVM) + weights (https://tvm.apache.org)
    • VTA runtime: CPU code + VTA instructions
    • FPGA: Versatile Tensor Accelerator core (Chisel)
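An accelerator core like VTA executes tensor operations on quantized integer data rather than floats. The numpy sketch below illustrates the general pattern of such a quantized matrix multiply: int8 operands, int32 accumulation, then requantization back to int8. The tile size and the shift amount are illustrative assumptions, not VTA's actual parameters.

```python
import numpy as np

# Illustrative quantized GEMM: int8 inputs, int32 accumulation,
# right-shift requantization back to int8. Sizes and the shift of 8
# are arbitrary examples, not taken from the VTA design.
rng = np.random.default_rng(1)
A = rng.integers(-128, 127, size=(16, 16), dtype=np.int8)
B = rng.integers(-128, 127, size=(16, 16), dtype=np.int8)

acc = A.astype(np.int32) @ B.astype(np.int32)        # accumulate in int32
out = np.clip(acc >> 8, -128, 127).astype(np.int8)   # requantize to int8
print(out.shape)
```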
  19. 20 Results: TVM+VTA (1)

    • VTA core resource usage

      Parameter                 Value
      Board                     DE10-Standard
      ALM usage, units          20 311 (48%)
      DSP usage, units          0 (0%)
      Register usage, units     19 226
      Block memory usage, bits  3 905 016 (69%)
      Max frequency, MHz        137.85
  20. 21 Results: TVM+VTA (2)

    • Performance

      Parameter                  MNIST-FC  MNIST-CNN  MNIST-CNN (quantized)
      CPU time, ms (Keras)       0.38      2.55       n/a
      CPU time, ms (TVM)         0.0037    0.8        0.44
      GPU time, ms (TVM)         0.02      0.17       CUDA error
      VTA time, ms               0.10      105.0      33.3
      Acceleration vs Keras      3.8       0.024      0.08
      Acceleration vs TVM (CPU)  0.037     0.008      0.014
      Acceleration vs TVM (GPU)  0.020     0.0016     n/a
  21. 22 Conclusion (1)

    + Neural network inference on FPGAs is possible
    + Acceleration of simple neural networks is possible
    + Suitable for low-power applications
  22. 23 Conclusion (2)

    − Complex neural networks are significantly slower on small FPGAs
    − Bigger FPGAs are needed for real-life applications
    − GPUs remain the fastest option