
Sparse models are fast models: Improving DNN inference performance by over 10X


Presented at LA DATA CON 2021
Sparse DNNs provide a powerful technique for delivering significant performance benefits and power savings without impacting the accuracy of your models. We present an introduction to sparse models, details on how to create and run sparse models, and guidance on how to ensure their robustness and fairness.


Lawrence Spracklen

September 22, 2021

Transcript

  1. Sparse models are fast models: Improving DNN inference performance by over 10X
     Dr. Lawrence Spracklen
  2. Numenta: Developing machine intelligence through neocortical theory
     • Understand how the brain works
     • Apply neocortical principles to AI
     Developed the “Thousand Brains” theory of how the neocortex works
  3. Artificial Neural Networks (ANNs)
     [Diagram: Input, Layer 1, Layer 2, Layer 3, ..., Layer N, Output]
     • Dense, fully connected, and computationally expensive
     • A Deep Neural Network (DNN) is an ANN with many layers
  4. Traditional approach to ANNs: Perform matrix multiplications very fast
     • GPUs have become AI workhorses
       • 500+ trillion arithmetic operations per second per card
     • Hardware performance doubles every few years
       • Hardware cannot keep pace with growth in model size
     • Exploding AI costs
       • 2018: BERT cost $6K+ to train
       • 2020: GPT-3 cost $10M+ to train
     [Figure: ~17,000X increase in training compute over 3 years]
  5. Can neuroscience help? Examine the neocortex
     • Neuron interconnections are sparse
     • Neuron activations are sparse
     • Neurons are significantly more complex than AI’s point-neuron abstraction
     • Humans can learn from very limited examples
     [Figure: Numenta’s roadmap]
  6. Sparse Networks
     • Sparsity can be applied to today’s DNNs
     • Two main approaches (see the sketch after this item):
       1. Weight sparsity: limit interconnections between neurons
       2. Activation sparsity: limit the number of neurons that can be activated
     • Delivers significant computational savings
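
The two flavors can be illustrated in a few lines of PyTorch. This is a minimal sketch, not the presenter's code: SparseLinear applies a fixed random mask to a linear layer's weights (weight sparsity), and kwinners keeps only the k largest activations per sample (activation sparsity); the layer sizes, 90% sparsity, and k=8 are assumed example values.

```python
import torch
import torch.nn as nn


class SparseLinear(nn.Module):
    """Weight sparsity: a linear layer whose weights are multiplied by a fixed binary mask."""

    def __init__(self, in_features, out_features, weight_sparsity=0.9):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Randomly drop `weight_sparsity` of the connections and freeze that pattern.
        mask = (torch.rand(out_features, in_features) > weight_sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.linear(x, self.linear.weight * self.mask, self.linear.bias)


def kwinners(x, k):
    """Activation sparsity: keep only the k largest activations per sample, zero the rest."""
    topk = torch.topk(x, k, dim=-1)
    return torch.zeros_like(x).scatter(-1, topk.indices, topk.values)


layer = SparseLinear(128, 64, weight_sparsity=0.9)   # ~90% of connections removed
x = torch.randn(8, 128)
y = kwinners(layer(x), k=8)                          # 8 of 64 units active per sample
```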
  7. Why Sparsity?
     [Chart: potential speedup from sparsity, relative to dense (up to ~500X), as a function of % weight sparsity and % activation sparsity]
     • Massive opportunity for performance by simultaneously exploiting weight and activation sparsity
  8. Creating Sparse Networks
     • Two main approaches:
       1. Train a sparse network from scratch
       2. Create a sparse network from an existing dense network
     • Approaches are typically iterative (see the sketch after this item):
       I. Determine the least important weights via some metric
       II. Zero N weights (i.e., remove connections)
       III. Train the network for a few iterations to recover accuracy
       IV. Repeat until the sparsity goal is achieved
     • Importance is often determined by the magnitude of the weights
     • Some techniques regrow connections, e.g., Google’s RigL
     • Activation-sparse networks are uncommon
       • Select the top-k activations from each layer, or use a thresholded ReLU
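
A hedged sketch of the iterative prune/fine-tune loop described above, using PyTorch's built-in magnitude-pruning utilities. The train_a_few_epochs callback, the 10 pruning steps, and the 90% target are placeholders, not values from the talk:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune


def iterative_magnitude_prune(model, train_a_few_epochs, steps=10, final_sparsity=0.9):
    """Repeatedly zero the smallest-magnitude weights, then fine-tune to recover accuracy."""
    # Prune the same fraction of the *remaining* weights each round so the
    # cumulative sparsity reaches roughly `final_sparsity` after `steps` rounds.
    per_step = 1.0 - (1.0 - final_sparsity) ** (1.0 / steps)
    prunable = [(m, "weight") for m in model.modules()
                if isinstance(m, (nn.Linear, nn.Conv2d))]
    for _ in range(steps):
        prune.global_unstructured(prunable,
                                  pruning_method=prune.L1Unstructured,
                                  amount=per_step)
        train_a_few_epochs(model)      # recover accuracy between pruning events
    for module, name in prunable:
        prune.remove(module, name)     # bake the zeros into the weight tensors
    return model
```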
  9. GMP Pruning Example
     • GMP (Gradual Magnitude Pruning) is a commonly used iterative pruning technique
     • Starts with a dense network and gradually removes neuron interconnections based on the magnitude of the weights
     • N pruning events, with training/fine-tuning occurring between each pruning event
     • The rate of sparsification can be varied during pruning (a common schedule is sketched below)
     [Figure: example GMP schedule for a tiny BERT]
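
For concreteness, here is a sketch of the cubic sparsity ramp commonly used with GMP (following Zhu and Gupta's gradual pruning schedule); the step counts and 90% final sparsity are assumed example values, not the settings used for the tiny BERT example:

```python
def gmp_sparsity(step, start_step=0, end_step=10_000,
                 initial_sparsity=0.0, final_sparsity=0.9):
    """Cubic ramp: sparsity rises quickly at first, then levels off near the target."""
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


# Prune to gmp_sparsity(step) at each pruning event, fine-tuning in between.
print([round(gmp_sparsity(s), 3) for s in range(0, 10_001, 2_000)])
# [0.0, 0.439, 0.706, 0.842, 0.893, 0.9]
```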
  10. How Sparse?
     • Sparse networks can provide equivalent accuracy to standard DNNs
     • Achievable sparsity is network dependent
       • Also depends on the sophistication of the sparsification techniques
     • Trade a small accuracy impact for increased performance?
     [Figure: accuracy vs. sparsity for an example CNN network]
  11. Training Results on Different Networks
     Type        | Dataset  | Network           | Sparsity | Parameters | Accuracy
     CNN         | Speech   | Dense             | 0%       | 1.7M       | 97.05%
     CNN         | Speech   | Sparse            | 90.6%    | 160,952    | 97.03%
     ResNet50    | ImageNet | MLPerf baseline   | 0%       | 25.5M      | 76.7%
     ResNet50    | ImageNet | Sparse            | 75%*     | 6.5M       | 77.1%
     Transformer | GLUE     | Dense BERT base   | 0%       | 110M       | 76.8%
     Transformer | GLUE     | Sparse BERT base  | 85%      | 16.5M      | 78.4%
     Transformer | GLUE     | Sparse BERT base  | 90%      | 11M        | 76.3%
     • Sparsity can be applied to different DNN architectures
     • Highly sparse models are possible without sacrificing accuracy
  12. Sparsity Challenges
     • Hardware (HW) thrives on dense, regular structure and predictable data access patterns
       • Allows use of data prefetchers and vector units
     • Unconstrained sparsity results in ‘random’ patterns of non-zero elements
       • Model accuracy is the only consideration when pruning weights
       • Irregular sparsity patterns are poorly suited to fast execution on HW
       • Frequently faster to just multiply the zeros!
     • Sparsity can be constrained to better suit HW requirements
       • Hardware-sympathetic sparsity patterns exist
  13. Structured Sparsity
     • Structured sparsity aligns patterns of non-zero elements with HW requirements
     • Three forms of structured sparsity bring coherence to memory access patterns and computational intensity:
       • Block sparsity: non-zero elements are constrained to appear in small regular blocks
       • Partitioned sparsity: the weight matrix is subdivided into partitions, with sparsity requirements applied to each partition
       • Complementary sparsity (Numenta): compresses sparse matrices into dense ones
     • Block, partitioned, and complementary sparsity can be utilized independently or collectively
     • Possible to train with these constraints with minimal loss in accuracy
  14. Block Structured Sparsity
     • Large blocks matching the full width of SIMD vector units can have negative impacts on accuracy
     • Smaller blocks can mitigate the accuracy loss and still provide meaningful benefit to HW (see the sketch after this item)
     • Complementary sparsity further extends parallel computation to match the full width of SIMD vector units
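
An illustrative numpy sketch of block pruning: score each small tile by its L1 norm and keep only the highest-scoring tiles. The 4x4 block size and 75% sparsity target are example values, not recommendations from the talk:

```python
import numpy as np


def block_prune(w, block=4, sparsity=0.75):
    """Zero whole block x block tiles, keeping the tiles with the largest L1 norm."""
    rows, cols = w.shape
    tiles = w.reshape(rows // block, block, cols // block, block)
    scores = np.abs(tiles).sum(axis=(1, 3))         # one L1 score per tile
    cutoff = np.quantile(scores, sparsity)          # drop the lowest-scoring 75% of tiles
    mask = (scores > cutoff)[:, None, :, None]      # broadcast the tile mask back
    return (tiles * mask).reshape(rows, cols)


w = np.random.randn(64, 64)
w_blocked = block_prune(w)
print("sparsity:", 1.0 - np.count_nonzero(w_blocked) / w.size)   # ~0.75, in 4x4 tiles
```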
  15. Sparsity Formats
     • A variety of different methods exist for representing sparse matrices (see the sketch after this item)
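
A short sketch of common in-memory sparse representations using SciPy; which format is best depends on the access pattern (COO is easy to build, CSR favors row-oriented matrix-vector products, BSR suits block-structured weights). The matrix size and ~95% sparsity are arbitrary example values:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.standard_normal((256, 256)) * (rng.random((256, 256)) > 0.95)  # ~95% zeros

coo = sparse.coo_matrix(dense)                    # (row, col, value) triplets: easy to build
csr = sparse.csr_matrix(dense)                    # compressed rows: fast row slicing and matvecs
bsr = sparse.bsr_matrix(dense, blocksize=(4, 4))  # block CSR: suits block-structured sparsity

x = rng.standard_normal(256)
np.testing.assert_allclose(csr @ x, dense @ x, rtol=1e-6)
print(csr.nnz, "non-zeros out of", dense.size)
```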
  16. Partitioned Sparsity
     • Partitioned sparsity has two HW-friendly attributes:
       • Distributes the sparsity more evenly (illustrated in the sketch after this item)
       • Reduces the scope of memory access patterns
     [Figure: potential structured sparsity patterns for a 75% sparse matrix: Unstructured, Block-X, Block-Y, Block-XY, Partitioned-X, Partitioned-Y, Partitioned-XY, and Block/Partitioned combinations]
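
A sketch of the "distribute the sparsity evenly" idea: prune each column partition independently so every partition keeps the same number of non-zeros per row. The 16-column partitions and 75% sparsity are assumed example values:

```python
import numpy as np


def partitioned_prune(w, part_cols=16, sparsity=0.75):
    """Prune each column partition independently so non-zeros are evenly distributed."""
    out = w.copy()
    keep = int(round(part_cols * (1.0 - sparsity)))
    for start in range(0, w.shape[1], part_cols):
        part = out[:, start:start + part_cols]        # a view into `out`
        # Per row, keep only the `keep` largest-magnitude weights inside this partition.
        thresh = np.sort(np.abs(part), axis=1)[:, -keep][:, None]
        part[np.abs(part) < thresh] = 0.0
    return out


w_part = partitioned_prune(np.random.randn(32, 64))
# Every row now holds (at least) 4 non-zeros in each 16-column partition.
```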
  17. Sparsity Today
     • Sparse networks are becoming more common
     • Significant ongoing research into sparse networks
       • Attention almost exclusively focused on just weight sparsity
       • Activation sparsity is harder to leverage because the non-zero activations are input dependent
     • Very little discussion of leveraging both weight and activation sparsity
       • We are trying to evangelize this approach!
     • Meaningful speedups are possible on CPUs, GPUs, and FPGAs
     • New HW architectures with improved support for sparsity are emerging
  18. Sparsity Software Resources
     Creating sparse networks:
       • Microsoft Neural Network Intelligence (NNI)
       • Intel Neural Network Compression Framework (NNCF)
       • Google RigL
     Running sparse networks:
       • Neural Magic DeepSparse Engine
       • Apache TVM
     Model zoos:
       • An increasing variety of ready-to-use sparse models is being made available for a range of vision and NLP tasks
  19. Sparse Networks on CPUs
     • Investigate how common compute platforms handle sparsity
       • Create a 95% weight-sparse convolutional DNN
       • Investigate different CPU model inference engines (a timing sketch follows this item)
     • Microsoft and Intel runtimes don’t (yet?) leverage sparsity
     • Improves DNN inference performance by up to 3X
     [Chart: speedup on CPUs from 95% sparsity, by CPU execution engine (OpenVino, OnnxRuntime, TVM, DeepSparse)]
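
As a baseline for this kind of comparison, a minimal timing sketch with ONNX Runtime is shown below; the model path and input shape are placeholders, and a sparsity-aware engine such as DeepSparse would be swapped in to realize the speedups shown on the slide:

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical 95%-sparse CNN exported to ONNX; the path and input shape are placeholders.
sess = ort.InferenceSession("sparse_cnn.onnx", providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0]
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

for _ in range(10):                     # warm up
    sess.run(None, {inp.name: x})
start = time.perf_counter()
for _ in range(100):
    sess.run(None, {inp.name: x})
print("ms/inference:", (time.perf_counter() - start) * 10)
```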
  20. CPU Sparse Performance
     • Performance improvement is sparsity dependent
     • Inference engines leverage large batch sizes to improve the speedup from sparsity
       • Significantly simplifies computation for weight-sparse matrices
     [Chart: ResNet50 on DeepSparse: speedup vs. dense as a function of batch size (1 to 32) for dense, sparse-50, sparse-75, and sparse-98 models]
  21. CPU Performance – Sparse BERT
     • Over 3X speedup from sparsity on BERT on CPUs
     • The degree of sparsity dictates both performance and accuracy
     • Small decreases in accuracy can translate into meaningful performance increases
     [Chart: Neural Magic’s Sparse BERT on DeepSparse: speedup relative to dense vs. SQuAD accuracy relative to dense]
  22. GPU Sparse Performance
     • NVIDIA Ampere GPUs have dedicated HW support for sparsity
     • Requires 50% sparsity within 4-element partitions (the 2:4 pattern; see the sketch after this item)
     • Achieves 1.5X speedups for BERT and 1.3X for ResNet
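
A small numpy sketch of that pattern, often written as 2:4 sparsity: within every group of four consecutive weights, keep the two largest magnitudes and zero the rest. This is illustrative only, not NVIDIA's tooling:

```python
import numpy as np


def two_four_prune(w):
    """Within every group of 4 consecutive weights, zero the 2 smallest magnitudes."""
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]      # indices of the 2 smallest |w|
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)


w = np.random.randn(8, 16).astype(np.float32)
w_24 = two_four_prune(w)
assert (w_24.reshape(-1, 4) != 0).sum(axis=1).max() <= 2  # at most 2 non-zeros per group of 4
```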
  23. Complementary Sparsity
     • Numenta technique for compressing sparse entities into a single dense structure for efficient processing (a toy illustration follows this item)
     • Speedup scales linearly with the degree of sparsity
     • Applicable to both convolutional and linear layers
     [Figure: 80% sparse convolutional kernels]
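
As described on the slide, the idea is to compress several sparse entities into one dense structure. The toy numpy illustration below captures that idea under the assumption that the kernels' non-zero positions do not overlap; it is a conceptual sketch, not Numenta's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_kernels, size, nnz = 4, 16, 4                  # four 75%-sparse kernels, 4 non-zeros each

# Build sparse kernels whose non-zero positions are disjoint ("complementary").
positions = rng.permutation(size).reshape(n_kernels, nnz)
kernels = np.zeros((n_kernels, size))
for k, pos in enumerate(positions):
    kernels[k, pos] = rng.standard_normal(nnz)

# Pack all kernels into ONE dense vector, remembering which kernel owns each position.
packed = kernels.sum(axis=0)                     # no overlap, so summing just overlays them
owner = np.full(size, -1)
for k, pos in enumerate(positions):
    owner[pos] = k

x = rng.standard_normal(size)
partial = packed * x                             # a single dense elementwise multiply
outputs = np.array([partial[owner == k].sum() for k in range(n_kernels)])

assert np.allclose(outputs, kernels @ x)         # matches four separate sparse dot products
```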
  24. FPGA Sparse Performance
     • Sparsity performance demonstration
       • Leverage both weight sparsity and activation sparsity
       • Leverage complementary kernels
     • Flexibility of FPGAs enables the full performance potential of sparsity to be realized
     [Chart: speedup from sparsity on FPGA (relative to dense, Xilinx U250), sparse-dense vs. sparse-sparse, one network vs. full chip; CNN network (2.1M parameters), 95% weight sparsity, 88% activation sparsity]
  25. FPGA vs CPU
     • Compare sparse performance on CPUs and FPGAs
       • Using a 24-core (48 hardware thread) Intel Xeon 8275CL (AWS C5.12xlarge)
     • Extending complementary sparsity to CPUs and GPUs
     [Chart: samples/second for the CNN network (2.1M parameters): CPU-OpenVino, CPU-OnnxRuntime, CPU-TVM, CPU-DeepSparse vs. Numenta-SD and Numenta-SS, single-thread and full-chip]
  26. The Blessing of Dimensionality
     • Advantageous to maximize sparsity to boost performance
       • Achievable accuracy decreases at extreme sparsity
     • Increasing the width of the network while holding the parameter count constant increases achievable accuracy (see the sketch after this item)
       • Also improves robustness to noise
     [Figure: accuracy on noisy MNIST (50% noise)]
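
The "holding the parameter count constant" trade-off can be made concrete with a little arithmetic. Assuming a square layer whose fan-in and fan-out both scale with the width, the required density falls with the square of the width increase; the widths and densities below are illustrative, not figures from the talk:

```python
def sparsity_for_constant_params(base_width, base_density, new_width):
    """Keep width_in * width_out * density constant as a square layer gets wider."""
    return 1.0 - base_density * (base_width / new_width) ** 2


# Doubling the width of a 50%-dense 256x256 layer with the same parameter budget
# requires 87.5% sparsity; quadrupling it requires ~96.9% sparsity.
print(sparsity_for_constant_params(256, 0.5, 512))    # 0.875
print(sparsity_for_constant_params(256, 0.5, 1024))   # 0.96875
```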
  27. Cautionary Notes
     • A single high-level accuracy score should never be the only metric of model goodness
       • Performance across different sub-populations can vary significantly
       • Understand model performance across sub-populations (a helper sketch follows this item)
       • Prevents unfairness and bias problems from going undetected
     • Sparsity creates networks with far fewer weights
       • What were those neuron interconnections contributing?
     • Important to understand the fine-grain changes in model accuracy resulting from sparsity
       • Is my model as fair and unbiased as the original dense baseline?
     • Research indicates that hard-to-classify corner cases can be sacrificed by sparsity
       • Significant model capacity can be dedicated to delineating corner cases
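
A minimal helper for that kind of sub-population check is sketched below; the grouping variable and the dense-versus-pruned comparison are assumptions about how one might apply it, not the presenter's tooling:

```python
import numpy as np


def accuracy_by_group(y_true, y_pred, groups):
    """Report accuracy separately for each sub-population, not just one overall score."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    report = {"overall": float((y_true == y_pred).mean())}
    for g in np.unique(groups):
        m = groups == g
        report[str(g)] = float((y_true[m] == y_pred[m]).mean())
    return report


# Compare the dense baseline against the pruned model, slice by slice:
#   accuracy_by_group(labels, dense_preds, subgroup)
#   accuracy_by_group(labels, sparse_preds, subgroup)
```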
  28. Conclusions
     • Sparse DNNs have reduced neuron interconnections and/or neuron activations compared with contemporary models
       • Currently, only weight sparsity is widely exploited
     • Sparse DNNs represent a significant opportunity to reduce AI costs
       • Sparse models are already running up to 4X faster than equivalent dense networks on current CPUs and GPUs
     • A variety of open-source libraries are available for creating sparse models
     • Important to verify that model fairness is not adversely impacted by sparsity
     • Current benefits from sparsity are just the tip of the iceberg!
       • Weight- and activation-sparse models can run up to 100X faster
       • Stay tuned!
  29. THANK YOU Questions? lspracklen@numenta.com https://numenta.com