
"How can we be so slow?" Realizing the performance benefits of Sparse networks

Interest in sparse neural networks has never been greater. With the exponential growth in model size, sparsity represents a powerful technique for reducing both training and inference costs. Sparsity can be applied to both weights and activations, and sparsities of 95% or more are achievable before model accuracy degrades irreparably. Implemented correctly, the benefits of weight and activation sparsity are multiplicative: a 10X reduction in both weights and activations translates into a 100X reduction in the computational cost of a forward pass. Unfortunately, despite the clear potential for sparse models to deliver significant performance improvements, the benefits observed to date on GPUs and CPUs have been extremely limited. Many model runtimes fail to exploit sparsity at all, and those that do typically achieve only 2-3X inference speedups from weight sparsity. CPUs and GPUs are optimized for dense, regular computation, and efficiently exploiting the irregular patterns of non-zero weights and activations in sparse networks has proved challenging. In this presentation we describe novel FPGA-based sparse CNN models that concurrently leverage both activation and weight sparsity to run 100X faster than their dense counterparts and outperform sparse networks on a 24-core CPU by over 12X. We present the techniques developed to achieve this speedup from sparsity and discuss how many of the learnings could be applied to develop fast sparse networks on CPUs.
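The multiplicative claim is easy to sanity-check by counting multiply-accumulates. The NumPy sketch below is purely illustrative (the layer size, 90% sparsity levels, and the `sparsify` helper are assumptions, not the authors' code): it applies random sparsity to both the weights of a linear layer and its input activations, then counts how many products actually involve two non-zero operands.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsify(a, sparsity):
    """Zero out a random `sparsity` fraction of the entries."""
    mask = rng.random(a.shape) >= sparsity
    return a * mask

n = 1000
W = sparsify(rng.standard_normal((n, n)), 0.90)  # 90% weight sparsity
x = sparsify(rng.standard_normal(n), 0.90)       # 90% activation sparsity

# Dense forward pass: every weight meets every activation.
dense_macs = n * n

# A MAC is only useful where BOTH the weight and its activation are non-zero:
# restrict to columns with non-zero activations, then count non-zero weights.
useful_macs = int(np.count_nonzero(W[:, x != 0]))

print(f"dense MACs:  {dense_macs}")
print(f"useful MACs: {useful_macs}")
print(f"reduction:   {dense_macs / useful_macs:.0f}X")
```

Because a product survives only when both operands are non-zero, 10X fewer weights and 10X fewer activations leave roughly 1% of the original work, i.e. a ~100X reduction in expectation.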


Lawrence Spracklen

July 08, 2021


  1. How Can We Be So Slow? Realizing The Performance Benefits of Sparse Networks
     Lawrence Spracklen, Kevin Hunter & Subutai Ahmad
     [lspracklen,khunter,sahmad]@numenta.com

     Sparse Networks Should be Fast Networks
     • Weight sparsity can reduce non-zero weights by 5-20X
     • Activation sparsity can reduce activations by 5-10X
     • Simultaneously exploiting both has a multiplicative effect
     • Up to 200X reduction in non-zero computations
     An incredible opportunity for performance

     The Current Reality
     • Performance on CPUs and GPUs is lackluster: speedups of less than 3X are typical
     • Weight and activation sparsity not simultaneously exploited

     The Sparsity Problem?
     • Hardware thrives on dense, regular structure and predictable data access patterns
     • Enables effective use of data prefetchers and wide SIMD engines
     • Selectively handling 'randomly' positioned non-zero elements is inefficient
     • Often faster to just execute the zero-valued multiplications
     • Block sparsity is a commonly imposed constraint
     • Large blocks reduce accuracy; runtime uncertainty remains in memory access patterns, impacting performance
     • Developing accommodating hardware is complex, expensive and slow
     • Need to develop sparsity patterns that are compatible with today's processors

     Are Hardware-Friendly Sparsity Patterns Feasible?
     • Flexibility of the HW architecture dictates the severity of sparsity pattern constraints
     • Depends on the cost and flexibility of supported permute operations
     • The reconfigurability of FPGAs makes them an ideal architecture for sparse networks
     • Simultaneously use weight and activation sparsity with minimal constraints

     Hardware Friendly Fine-Grain Sparsity
     • Leverage sparsity patterns that allow computation to be framed as a dense operation
     • Applicable to both linear and convolutional layers
     • Light-weight constraint doesn't impact achievable accuracy
     • Enforce non-overlapping patterns across multiple sets of kernels or linear-layer weights
     • Allows multiple kernels/weights to be elegantly combined into a single dense entity
     • Speedup scales linearly with degree of sparsity
     • Use sparse static mask training to provide control over weight placement
     • Accurate networks while exploiting extreme sparsity

     Experimental setup
     • CNN for speech recognition: two CNN layers and two FC layers, 2.5M parameters
     1) Sparse-dense version
        • 95% weight sparsity (varies by layer)
     2) Sparse-sparse version
        • Also leverages 88% activation sparsity via K-winners selection
     • Sparse network accuracy not compromised

     112X Speedup on FPGAs
     • Sparse-weights, dense-activations outperforms dense by over 50X
     • Also leveraging activation sparsity doubles performance to 112X
     • FPGA implementation derives performance from:
       1) Faster throughput per network
       2) Reduced resources per network, allowing multiple networks to be placed
     • Decreased resource utilization allows placement on small FPGAs
     • Extreme power efficiency for AI at the Edge

     Speedup on CPUs [Early results]
     • Compared a 95% weight-sparse MatMul with Intel's MKL
     • Outperforms Intel's dense by 18X and Intel's best sparse by 3X
     • Doesn't rely on large batch sizes to derive performance

     Sparse Networks Can be Fast Networks
     • Hardware friendly fine-grain sparsity patterns exist
     • Demonstrated 100X+ speedup from sparsity on FPGAs
     • Developing a library for generalized build-your-own sparse networks
     • Achieved linear speedup with sparsity on CPUs for MatMul primitives
     • High-performance transformer under development

     Next Steps
     • Generalize FPGA support: creating a general-purpose library for high-performance sparse-sparse networks on FPGAs
     • Deepen CPU & GPU support: efficiently support sparse-sparse networks; demonstrate the potential with a fast sparse transformer
     • Accelerate sparse network training

     Further Reading
     1) Can CPUs leverage sparsity? https://tinyurl.com/sparsednncpu
     2) 100X FPGA Whitepaper: https://tinyurl.com/fastsparsefgpa
     3) How can we be so dense? https://tinyurl.com/whysodense
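The "K-winners selection" mentioned for the sparse-sparse version refers to keeping only the k largest activations in a layer and zeroing the rest, which fixes the activation sparsity level at runtime. A minimal sketch (illustrative only, not Numenta's implementation; the function name and inputs are assumptions):

```python
import numpy as np

def k_winners(x, k):
    """Keep the k largest activations, zero the rest.

    With k set to ~12% of the layer width, this yields the ~88%
    activation sparsity quoted on the poster.
    """
    out = np.zeros_like(x)
    top = np.argpartition(x, -k)[-k:]  # indices of the k largest values
    out[top] = x[top]
    return out

x = np.array([0.1, 2.0, -0.5, 1.5, 0.3, 0.9, -1.2, 0.0])
print(k_winners(x, 2))  # only 2.0 and 1.5 survive
```

A useful side effect for hardware is that, unlike ReLU, the number of surviving activations is a compile-time constant, so downstream sparse kernels can be sized for exactly k non-zeros.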
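The "framed as a dense operation" idea can be illustrated for a linear layer. If k sparse weight rows are constrained to have non-overlapping supports (each input position belongs to exactly one row), the k rows can be packed into a single dense vector; one dense elementwise multiply plus a segment-sum then produces all k outputs using n MACs instead of k*n. The NumPy sketch below is our reading of the poster's technique, not the actual library; all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 16, 4  # n inputs; k weight rows, each 75% sparse

# Disjoint supports: assign each input position to exactly one of the k rows.
owner = rng.permutation(np.repeat(np.arange(k), n // k))

# The k sparse rows, stored as ONE dense vector plus the owner index.
packed = rng.standard_normal(n)

# Reference: materialize the k sparse rows explicitly (for checking only).
W = np.zeros((k, n))
W[owner, np.arange(n)] = packed

x = rng.standard_normal(n)

# Packed path: one dense elementwise multiply over n values,
# then route each product to its owning row's accumulator.
prod = packed * x
y_packed = np.bincount(owner, weights=prod, minlength=k)

# Dense path for comparison: k * n MACs.
y_ref = W @ x
assert np.allclose(y_packed, y_ref)
```

Because the dense multiply is fully regular, it keeps prefetchers and SIMD units busy, and the speedup scales linearly with the sparsity (here 4 rows at 75% sparsity collapse into one dense pass), matching the poster's claim.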