
TMPA-2021: Open-Source Tools for Neural Network Inference on FPGAs

Exactpro
November 25, 2021

Mikhail Lebedev and Pavel Belecky

TMPA is an annual International Conference on Software Testing, Machine Learning and Complex Process Analysis. The conference will focus on the application of modern methods of data science to the analysis of software quality.

To learn more about Exactpro, visit our website https://exactpro.com/

Follow us on
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Twitter https://twitter.com/exactpro


Transcript

  1. 1
    25-27 NOVEMBER
    SOFTWARE TESTING, MACHINE LEARNING
    AND COMPLEX PROCESS ANALYSIS
    Open-Source Tools for
    Neural Network Inference
    on FPGAs
    Mikhail Lebedev¹,², Pavel Belecky¹
    ¹Ivannikov Institute for System Programming of RAS, ²Plekhanov Russian University of Economics


  2. 2
    Introduction
    ● Hardware development demands much effort
    ● Synthesis can be a solution
    ○ High-level
    ○ Middle-level
    ○ Hardware construction
    (Diagram: behavioral model → synthesis → hardware model → ASIC / FPGA)


  3. 3
    Hardware Construction
    ● System & low-level architecture
    ● Many hardware details
    ● Very close to RTL
    ● Input languages
    ○ Chisel
    ○ Bluespec
    ○ DSLs


  4. 4
    Middle-Level Synthesis
    ● System architecture
    ● Some hardware details
    ● Input languages
    ○ MaxJ (MaxCompiler)
    ○ DSLX (XLS)


  5. 5
    High-Level Synthesis
    ● No hardware details
    ● No pre-defined architecture
    ● Input languages
    ○ C/C++, C#, Java
    ○ MATLAB
    ○ Python
    ○ DSLs → neural networks
    ○ And many more…


  6. 6
    Background
    ● Our team's high-level synthesis research, 2020-2021
    ○ 480 articles
    ■ 59 consider neural networks
    ○ 278 patents
    ○ > 50 open-source tools
    ■ 12 selected for evaluation
    ● Evaluation tasks
    ○ Neural networks (pre-trained)
    ○ JPEG algorithm
    ○ DPLL algorithm
    ○ IDCT algorithm
    ● Can we use open-source tools for synthesis?


  7. 7
    Neural Network Formats
    Format     | License    | Developer   | Active since
    ONNX       | Apache 2.0 | Community   | 2017
    TensorFlow | Apache 2.0 | Google      | 2015
    Keras      | Apache 2.0 | Google      | 2015
    PyTorch    | BSD        | Facebook    | 2016
    Caffe      | BSDv2      | UC Berkeley | 2014
    Caffe2     | BSD        | Facebook    | 2017
    CNTK       | MIT        | Microsoft   | 2015
    CoreML     | BSDv3      | Apple       | 2017
    MXNet      | Apache 2.0 | Apache      | 2016



  9. 9
    Open-Source Tools (1)
    Tool     | License    | Developer                      | Active since
    OpenVINO | Apache 2.0 | Intel                          | 2018
    PlaidML  | Apache 2.0 | Intel                          | 2017
    MACE     | Apache 2.0 | Xiaomi                         | 2017
    TVM      | Apache 2.0 | University of Washington       | 2017
    XLA      | Apache 2.0 | Google                         | 2017
    LeFlow   | BSD        | University of British Columbia | 2018
    Glow     | Apache 2.0 | Facebook                       | 2017
    hls4ml   | Apache 2.0 | CERN                           | 2017
    NNgen    | Apache 2.0 | Shinya Takamaeda-Yamazaki      | 2017
    ONNC     | BSDv3      | Skymizer                       | 2018
    Vitis AI | Apache 2.0 | Xilinx                         | 2020
    FINN     | BSDv3      | Xilinx                         | 2018


  10. 10
    Open-Source Tools (2)
    ● Classification
    ○ Fixed target architecture tools
    ■ OpenVINO, PlaidML, MACE, XLA, Glow, TVM
    ○ Specialized co-processor core tools
    ■ TVM (Versatile Tensor Accelerator core, VTA), Vitis AI (Deep Learning Processing Unit cores, DPUs)
    ○ RTL synthesis tools
    ■ LeFlow (using LegUp), hls4ml, FINN, ONNC (using Vivado), NNgen (direct Verilog)
    ● Goal
    ○ Implement a neural network on an FPGA



  12. 12
    Evaluation Models
    ● MNIST-FC
    ○ 1-layer fully-connected
    ○ Input image: 28×28
    ○ Activation function: Softmax
    ● MNIST-CNN
    ○ 4-layer convolutional
    ○ Input image: 28×28
    ○ Activation function: ReLU
    ● Synthetic matrix multiplication (LeFlow only)
    ○ MULT-0: vector-by-matrix multiplication
    ○ MULT-10-R: 10 fully-connected layers 100×100, output activation: ReLU
    ○ MULT-10-S: 10 fully-connected layers 100×100, output activation: Sigmoid
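To make the MNIST-FC workload concrete, here is a minimal NumPy sketch of a 1-layer fully-connected classifier with Softmax. The shapes follow the slide; the random weights and the code itself are our illustration, not the evaluated model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max())
    return e / e.sum()

# A 28x28 input image flattened to a 784-vector; 10 output classes.
image = rng.random((28, 28))
weights = rng.standard_normal((784, 10))  # placeholder, not trained weights
bias = np.zeros(10)

logits = image.reshape(-1) @ weights + bias
probs = softmax(logits)  # class probabilities summing to 1
```

The MULT-* benchmarks exercise the same core operation: the vector-by-matrix product in the `logits` line, repeated over 10 layers.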


  13. 13
    Evaluation Environment
    ● Development boards
    ○ Terasic DE10-Standard (Cyclone V)
    ○ Terasic DE1-SoC (Cyclone V)
    ○ Zybo Z7-20 (Zynq-7000)
    ● CPU
    ○ Intel Core i7-6700 3.4 GHz, 32 GB RAM
    ○ Ubuntu 20.04
    ● GPU
    ○ NVIDIA GTX-770
    ● Bitstream synthesis
    ○ Quartus 18.1/20.1
    ○ Vivado 2020.1
    ● Default settings used


  14. 14
    Results
    ● Failure
    ○ ONNC – closed implementation
    ○ hls4ml – Vivado error during RTL synthesis
    ○ Vitis AI, FINN – Vivado commercial license needed
    ○ NNgen – errors during model compilation
    ● Success
    ○ TVM+VTA
    ○ LeFlow (except MNIST-CNN)


  15. 15
    LeFlow
    ● XLA – TensorFlow-to-LLVM translation
    ● LeFlow – LLVM transformations
    ● LegUp – RTL synthesis
    https://github.com/danielholanda/LeFlow


  16. 16
    Results: LeFlow (1)
    ● FPGA resource usage

    Parameter                  | MNIST-FC     | MULT-0        | MULT-10-R       | MULT-10-S
    Board                      | DE1-SoC      | DE10-Standard | DE10-Standard   | DE10-Standard
    RTL model size, code lines | 5 240        | 1 252         | 6 011           | 11 490
    ALM usage, units           | 5 331 (17%)  | 917 (3%)      | 4 598 (11%)     | 11 356 (27%)
    DSP usage, units           | 1 (1%)       | 4 (5%)        | 31 (28%)        | 32 (29%)
    Register usage, units      | 7 311        | 1 554         | 6 842           | 16 277
    Block memory usage, bits   | 278 986 (7%) | 326 820 (8%)  | 3 213 220 (57%) | 3 216 138 (57%)
    Max frequency, MHz         | 127.39       | 122.55        | 96.08           | 93.63


  17. 17
    Results: LeFlow (2)
    ● Performance

    Parameter                    | MNIST-FC | MULT-0        | MULT-10-R     | MULT-10-S
    Board                        | DE1-SoC  | DE10-Standard | DE10-Standard | DE10-Standard
    Simulation time, clock ticks | 223 590  | 280 303       | 2 808 023     | 2 812 513
    CPU time, ms                 | 4        | 1             | 5             | 5
    FPGA time (calculation), ms  | 1.76     | 2.29          | 29.23         | 30.03
    Acceleration                 | 2.28     | 0.44          | 0.17          | 0.17
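Acceleration here is the ratio of CPU time to FPGA time, so values below 1 mean the FPGA run is slower. A quick check of that arithmetic (the reported millisecond values are rounded, which explains the small discrepancy for MNIST-FC):

```python
# Acceleration = CPU time / FPGA time; > 1 means the FPGA is faster.
# Times (ms) are the reported LeFlow benchmark values.
cpu_ms = {"MNIST-FC": 4, "MULT-0": 1, "MULT-10-R": 5, "MULT-10-S": 5}
fpga_ms = {"MNIST-FC": 1.76, "MULT-0": 2.29, "MULT-10-R": 29.23, "MULT-10-S": 30.03}

acceleration = {m: cpu_ms[m] / fpga_ms[m] for m in cpu_ms}
for model, a in acceleration.items():
    print(f"{model}: {a:.2f}")
```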


  18. 18
    TVM+VTA
    ● TensorFlow model → ONNX via tf2onnx (https://github.com/onnx/tensorflow-onnx)
    ● TVM (Apache, https://tvm.apache.org) imports ONNX into Relay IR: a computational graph with operations (LLVM) and weights
    ● The graph is compiled into CPU code and VTA instructions; the VTA runtime drives the Versatile Tensor Accelerator core (Chisel) on the FPGA
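The first step of the flow, converting a TensorFlow model to ONNX, can be sketched as a single tf2onnx command (a hedged example: the path is a placeholder, and the exact options depend on how the model was saved; see the tf2onnx README):

```shell
# Convert a TensorFlow SavedModel to ONNX (pip install tf2onnx).
# ./saved_model_dir is a placeholder path.
python -m tf2onnx.convert --saved-model ./saved_model_dir --output model.onnx
```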


  19. 19
    VTA core (block diagram)


  20. 20
    Results: TVM+VTA (1)
    ● VTA core resource usage

    Parameter                | Value
    Board                    | DE10-Standard
    ALM usage, units         | 20 311 (48%)
    DSP usage, units         | 0 (0%)
    Register usage, units    | 19 226
    Block memory usage, bits | 3 905 016 (69%)
    Max frequency, MHz       | 137.85


  21. 21
    Results: TVM+VTA (2)
    ● Performance

    Parameter    |             | MNIST-FC | MNIST-CNN | MNIST-CNN (quantized)
    CPU time, ms | Keras       | 0.38     | 2.55      | n/a
    CPU time, ms | TVM         | 0.0037   | 0.8       | 0.44
    GPU time, ms | TVM         | 0.02     | 0.17      | CUDA error
    VTA time, ms |             | 0.10     | 105.0     | 33.3
    Acceleration | vs Keras    | 3.8      | 0.024     | 0.08
    Acceleration | vs TVM, CPU | 0.037    | 0.008     | 0.014
    Acceleration | vs TVM, GPU | 0.020    | 0.0016    | n/a
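The quantized MNIST-CNN column exists because VTA's compute core operates on low-precision integers. As a rough illustration of the idea (our own sketch, not TVM's actual quantization pass), symmetric 8-bit weight quantization looks like:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: one scale maps floats to int8.
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
# Rounding bounds the reconstruction error by half a quantization step.
err = float(np.abs(dequantize(q, scale) - w).max())
```

The real TVM/VTA flow also quantizes activations and calibrates scales; this sketch only shows the core trade-off between precision and integer-only arithmetic.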


  22. 22
    Conclusion (1)
    + Neural network inference on FPGAs is possible
    + Acceleration of simple neural networks is possible
    + Suitable for low-power applications


  23. 23
    Conclusion (2)
    − Complex neural networks are significantly slower on small FPGAs
    − Bigger FPGAs are needed for real-life applications
    − GPUs are the fastest


  24. 24
    Thank You!
    Follow TMPA on Facebook
    TMPA-2021 Conference
