
TMPA-2021: Open-Source Tools for Neural Network Inference on FPGAs

Exactpro
November 25, 2021

Mikhail Lebedev and Pavel Belecky

TMPA is an annual international conference on software testing, machine learning and complex process analysis. The conference focuses on the application of modern data-science methods to the analysis of software quality.

To learn more about Exactpro, visit our website https://exactpro.com/

Follow us on
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Twitter https://twitter.com/exactpro


Transcript

  1. 1 25-27 NOVEMBER: SOFTWARE TESTING, MACHINE LEARNING AND COMPLEX PROCESS ANALYSIS

    Open-Source Tools for Neural Network Inference on FPGAs
    Mikhail Lebedev (1,2), Pavel Belecky (1)
    (1) Ivannikov Institute for System Programming of RAS
    (2) Plekhanov Russian University of Economics
  2. 2 Introduction

    • Hardware development demands much effort
    • Synthesis can be a solution
      ◦ High-level
      ◦ Middle-level
      ◦ Hardware construction
    [Diagram: behavioral model → synthesis → hardware model → ASIC / FPGA]
  3. 3 Hardware Construction

    • System & low-level architecture
    • Many hardware details
    • Very close to RTL
    • Input languages
      ◦ Chisel
      ◦ Bluespec
      ◦ DSLs
  4. 4 Middle-Level Synthesis

    • System architecture
    • Some hardware details
    • Input languages
      ◦ MaxJ (MaxCompiler)
      ◦ DSLX (XLS)
  5. 5 High-Level Synthesis

    • No hardware details
    • No pre-defined architecture
    • Input languages
      ◦ C/C++, C#, Java
      ◦ MATLAB
      ◦ Python
      ◦ DSLs → neural networks
      ◦ And many more…
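To make the "no hardware details" point concrete, the sketch below shows the kind of behavioral model a high-level synthesis flow starts from: a plain loop-level description of a dot product that says only *what* is computed, leaving pipelining, unrolling and memory ports to the tool. This is a generic Python sketch, not the input format of any specific tool named on the slide.

```python
# Behavioral model of a dot product: loop-level code with no notion of
# clocks, registers or ports. An HLS tool would map the loop to a
# multiply-accumulate datapath plus a control FSM, and could pipeline
# or unroll it depending on constraints.

def dot_product(a, b):
    """Multiply-accumulate over two equal-length sequences."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc

print(dot_product([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```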
  6. 6 Background

    • Our team's high-level synthesis research, 2020-2021
      ◦ 480 articles reviewed
        ▪ 59 consider neural networks
      ◦ 278 patents
      ◦ > 50 open-source tools
        ▪ 12 selected for evaluation
    • Evaluation tasks
      ◦ Neural networks (pre-trained)
      ◦ JPEG algorithm
      ◦ DPLL algorithm
      ◦ IDCT algorithm
    • Question: can we use open-source tools for synthesis?
  7. 7 Neural Network Formats

    Format      License     Developer    Active since
    ONNX        Apache 2.0  Community    2017
    TensorFlow  Apache 2.0  Google       2015
    Keras       Apache 2.0  Google       2015
    PyTorch     BSD         Facebook     2016
    Caffe       BSDv2       UC Berkeley  2014
    Caffe2      BSD         Facebook     2017
    CNTK        MIT         Microsoft    2015
    CoreML      BSDv3       Apple        2017
    MXNet       Apache 2.0  Apache       2016
  9. 9 Open-Source Tools (1)

    Tool      License     Developer                       Active since
    OpenVINO  Apache 2.0  Intel                           2018
    PlaidML   Apache 2.0  Intel                           2017
    MACE      Apache 2.0  Xiaomi                          2017
    TVM       Apache 2.0  University of Washington        2017
    XLA       Apache 2.0  Google                          2017
    LeFlow    BSD         University of British Columbia  2018
    Glow      Apache 2.0  Facebook                        2017
    hls4ml    Apache 2.0  CERN                            2017
    NNgen     Apache 2.0  Shinya Takamaeda-Yamazaki       2017
    ONNC      BSDv3       Skymizer                        2018
    Vitis AI  Apache 2.0  Xilinx                          2020
    FINN      BSDv3       Xilinx                          2018
  10. 10 Open-Source Tools (2)

    • Classification
      ◦ Fixed target architecture tools
        ▪ OpenVINO, PlaidML, MACE, XLA, Glow, TVM
      ◦ Specialized co-processor core tools
        ▪ TVM (Versatile Tensor Accelerator core, VTA), Vitis AI (Deep Learning Processing Unit cores, DPUs)
      ◦ RTL synthesis tools
        ▪ LeFlow (using LegUp), hls4ml, FINN, ONNC (using Vivado), NNgen (direct Verilog)
    • Goal
      ◦ Implement a neural network on an FPGA
  12. 12 Evaluation Models

    • MNIST-FC
      ◦ 1-layer fully-connected
      ◦ Input image: 28x28
      ◦ Activation function: Softmax
    • MNIST-CNN
      ◦ 4-layer convolutional
      ◦ Input image: 28x28
      ◦ Activation function: ReLU
    • Synthetic matrix multiplication (LeFlow only)
      ◦ MULT-0: vector-by-matrix multiplication
      ◦ MULT-10-R: 10 fully-connected 100x100 layers, output activation: ReLU
      ◦ MULT-10-S: 10 fully-connected 100x100 layers, output activation: Sigmoid
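As a reference for how small MNIST-FC is, the numpy sketch below reproduces its structure: one fully-connected layer over a flattened 28x28 image followed by a softmax. The weights here are random placeholders purely for illustration; the models evaluated in the talk were pre-trained.

```python
import numpy as np

# MNIST-FC structure: flatten 28x28 -> one dense layer -> softmax.
# Random weights stand in for the pre-trained parameters.
rng = np.random.default_rng(0)
W = rng.standard_normal((784, 10))  # 28*28 inputs -> 10 digit classes
b = np.zeros(10)

def softmax(z):
    e = np.exp(z - z.max())         # subtract max for numerical stability
    return e / e.sum()

image = rng.random((28, 28))
logits = image.reshape(784) @ W + b
probs = softmax(logits)
print(probs.argmax())               # predicted digit class
```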
  13. 13 Evaluation Environment

    • Development boards
      ◦ Terasic DE10-Standard (Cyclone V)
      ◦ Terasic DE1-SoC (Cyclone V)
      ◦ Zybo Z7-20 (Zynq-7000)
    • CPU
      ◦ Intel Core i7-6700 3.4 GHz, 32 GB RAM
      ◦ Ubuntu 20.04
    • GPU
      ◦ NVIDIA GTX-770
    • Bitstream synthesis
      ◦ Quartus 18.1/20.1
      ◦ Vivado 2020.1
    • Default settings used
  14. 14 Results

    • Failure
      ◦ ONNC: closed implementation
      ◦ hls4ml: Vivado error during RTL synthesis
      ◦ Vitis AI, FINN: commercial Vivado license needed
      ◦ NNgen: errors during model compilation
    • Success
      ◦ TVM+VTA
      ◦ LeFlow (except MNIST-CNN)
  15. 15 LeFlow

    • XLA: TensorFlow-to-LLVM translation
    • LeFlow: LLVM IR transformations
    • LegUp: RTL synthesis
    https://github.com/danielholanda/LeFlow
  16. 16 Results: LeFlow (1)

    • FPGA resource usage

      Parameter                   MNIST-FC      MULT-0         MULT-10-R        MULT-10-S
      Board                       DE1-SoC       DE10-Standard  DE10-Standard    DE10-Standard
      RTL model size, code lines  5 240         1 252          6 011            11 490
      ALM usage, units            5 331 (17%)   917 (3%)       4 598 (11%)      11 356 (27%)
      DSP usage, units            1 (1%)        4 (5%)         31 (28%)         32 (29%)
      Register usage, units       7 311         1 554          6 842            16 277
      Block memory usage, bits    278 986 (7%)  326 820 (8%)   3 213 220 (57%)  3 216 138 (57%)
      Max frequency, MHz          127.39        122.55         96.08            93.63
  17. 17 Results: LeFlow (2)

    • Performance

      Parameter                     MNIST-FC  MULT-0         MULT-10-R      MULT-10-S
      Board                         DE1-SoC   DE10-Standard  DE10-Standard  DE10-Standard
      Simulation time, clock ticks  223 590   280 303        2 808 023      2 812 513
      CPU time, ms                  4         1              5              5
      FPGA time (calculation), ms   1.76      2.29           29.23          30.03
      Acceleration                  2.28      0.44           0.17           0.17
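The acceleration row is simply the CPU-to-FPGA time ratio (values below 1.0 mean the FPGA implementation is slower than the CPU). The quick check below recomputes it from the table values; small rounding differences against the slide are expected, since the CPU times are given only to whole milliseconds.

```python
# Acceleration = CPU time / FPGA time, using the values from the table above.
cpu_ms  = {"MNIST-FC": 4,    "MULT-0": 1,    "MULT-10-R": 5,     "MULT-10-S": 5}
fpga_ms = {"MNIST-FC": 1.76, "MULT-0": 2.29, "MULT-10-R": 29.23, "MULT-10-S": 30.03}

for model in cpu_ms:
    print(model, round(cpu_ms[model] / fpga_ms[model], 2))
```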
  18. 18 TVM+VTA

    • tf2onnx: TensorFlow model → ONNX (https://github.com/onnx/tensorflow-onnx)
    • TVM (Apache): ONNX → Relay IR → computational graph: operations (LLVM) + weights (https://tvm.apache.org)
    • VTA runtime: CPU code + VTA instructions
    • FPGA: Versatile Tensor Accelerator core (Chisel)
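An accelerator core like VTA executes tensor operations on quantized integer data rather than floats. The numpy sketch below illustrates the general pattern of such a quantized matrix multiply: int8 operands, int32 accumulation, then requantization back to int8. The tile size and the shift amount are illustrative assumptions, not VTA's actual parameters.

```python
import numpy as np

# Illustrative quantized GEMM: int8 inputs, int32 accumulation,
# right-shift requantization back to int8. Sizes and the shift of 8
# are arbitrary examples, not taken from the VTA design.
rng = np.random.default_rng(1)
A = rng.integers(-128, 127, size=(16, 16), dtype=np.int8)
B = rng.integers(-128, 127, size=(16, 16), dtype=np.int8)

acc = A.astype(np.int32) @ B.astype(np.int32)        # accumulate in int32
out = np.clip(acc >> 8, -128, 127).astype(np.int8)   # requantize to int8
print(out.shape)
```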
  19. 20 Results: TVM+VTA (1)

    • VTA core resource usage

      Parameter                 Value
      Board                     DE10-Standard
      ALM usage, units          20 311 (48%)
      DSP usage, units          0 (0%)
      Register usage, units     19 226
      Block memory usage, bits  3 905 016 (69%)
      Max frequency, MHz        137.85
  20. 21 Results: TVM+VTA (2)

    • Performance

      Parameter                  MNIST-FC  MNIST-CNN  MNIST-CNN (quantized)
      CPU time, ms (Keras)       0.38      2.55       n/a
      CPU time, ms (TVM)         0.0037    0.8        0.44
      GPU time, ms (TVM)         0.02      0.17       CUDA error
      VTA time, ms               0.10      105.0      33.3
      Acceleration vs Keras      3.8       0.024      0.08
      Acceleration vs TVM (CPU)  0.037     0.008      0.014
      Acceleration vs TVM (GPU)  0.020     0.0016     n/a
  21. 22 Conclusion (1)

    + Neural network inference on FPGAs is possible
    + Acceleration of simple neural networks is possible
    + Suitable for low-power applications
  22. 23 Conclusion (2)

    − Complex neural networks are significantly slower on small FPGAs
    − Bigger FPGAs are needed for real-life applications
    − GPUs remain the fastest option