
Neural Network Pruning

Presentation for Ridge-i's internal workshop
(English Version)

ryoherisson

March 26, 2021
Transcript

  1. TL;DR
     • An overview of pruning, an approach to model compression in neural networks.
     • An introduction to the main differences between pruning methods.
     • A summary of well-known and recent papers on pruning.
  2. What is pruning?
     Pruning is a method of reducing the number of parameters and the computational complexity of a network by setting some of its weights to zero. (A minimal code illustration follows below.)
     Figure from Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015. [1]
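To make the definition concrete, here is a minimal illustration (not from the original slides) of what "setting weights to zero" means in code; the tensor and threshold are purely illustrative.

```python
import torch

weights = torch.randn(4, 4)                               # a small weight matrix
mask = (weights.abs() > weights.abs().median()).float()   # keep only the larger-magnitude half
pruned_weights = weights * mask                           # roughly 50% of entries are now exactly zero
```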
  3. Plenty of research on pruning
     Since pruning was first proposed in the late 1980s, the number of papers on pruning has been increasing every year.
     Figure from Mirkes, E. M. [2]: number of published papers with the keywords "neural" AND "network" AND "pruning" (dotted line in the top graph), with the keywords "neural" AND "network" (solid line in the top graph), and their ratio.
     [2] Mirkes, E. M. (2020, July). Artificial Neural Network Pruning to Extract Knowledge. In 2020 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE.
  4. Pruning method
     Many neural network pruning methods are derived from the method of Han et al., 2015 [1] (sketched in code below):
     1. Train the network.
     2. Prune the network based on the scores of its structural elements (e.g. weights).
     3. Re-train the network after pruning.
     Figure from Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015. [1]
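A minimal sketch of this three-step pipeline in PyTorch. The `train()` helper is hypothetical, and the magnitude-based scoring is one possible choice, not the authors' exact implementation.

```python
import torch

def magnitude_prune(model, sparsity=0.5):
    """Step 2: zero out the smallest-magnitude weights of each weight tensor."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                          # skip biases and norm parameters
            continue
        k = max(1, int(sparsity * param.numel()))
        threshold = param.detach().abs().flatten().kthvalue(k).values
        mask = (param.detach().abs() > threshold).float()
        param.data.mul_(mask)                        # prune: set masked weights to zero
        masks[name] = mask
    return masks

def apply_masks(model, masks):
    """Keep pruned weights at zero while re-training (call after each optimizer step)."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

# 1. train(model)                                            # train the dense network (hypothetical helper)
# 2. masks = magnitude_prune(model, sparsity=0.9)            # prune by weight magnitude
# 3. train(model, post_step=lambda: apply_masks(model, masks))  # re-train with the masks re-applied
```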
  5. Expected effects of pruning
     While maintaining or improving the accuracy of the original network, we want to achieve the following:
     • Reduced memory usage
     • Reduced computational effort
     • Reduced computational energy
     In general, however, there is a trade-off between computational efficiency and accuracy.
     ⇨ Various methods have been studied to improve this trade-off.
  6. Main differences between pruning methods
     There are four main differences (Blalock et al., 2020 [3]):
     • Structure
     • Scoring
     • Scheduling
     • Fine-tuning
     [3] Blalock, D., Ortiz, J. J. G., Frankle, J., and Guttag, J. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033 (2020).
  7. Structure
     Unstructured pruning is irregular pruning performed on a per-weight basis, while structured pruning is regular pruning performed on a per-layer or per-filter basis. The choice of structure trades computational efficiency against accuracy (illustrated in code below):
     • Structured pruning: higher computational efficiency, lower accuracy; pruning units are coarse (layers, filters).
     • Unstructured pruning: lower computational efficiency, higher accuracy; pruning units are fine (kernels, individual weights).
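A small illustration (not from the slides) of the two granularities on a convolution weight tensor of shape (out_channels, in_channels, kH, kW); the tensor sizes and 50% pruning ratio are arbitrary.

```python
import torch

w = torch.randn(64, 32, 3, 3)                        # conv weight: (out_channels, in_channels, kH, kW)

# Unstructured: mask individual weights below a magnitude threshold (irregular sparsity).
threshold = w.abs().flatten().quantile(0.5)
unstructured_mask = (w.abs() > threshold).float()

# Structured: remove whole filters (slices along dim 0) with the smallest L1 norms (regular sparsity).
filter_scores = w.abs().sum(dim=(1, 2, 3))           # one score per output filter
keep = filter_scores >= filter_scores.median()       # keep the higher-scoring half of the filters
structured_mask = keep.float().view(-1, 1, 1, 1).expand_as(w)
```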
  8. Scoring
     There are several scoring methods for deciding which elements to prune, such as pruning the elements with the lowest absolute weights (or the lowest sum of absolute weights), or pruning elements based on a learned importance factor, the gradient, etc. (A code sketch of local vs. global scoring follows below.)
     • Compare scores locally (within each layer) and prune the parameters with the lowest scores [1].
     • Compare scores globally (across the whole network) and prune the parameters with the lowest scores [4, 5].
     [4] Lee, N., Ajanthan, T., and Torr, P. H. S. SNIP: single-shot network pruning based on connection sensitivity. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
     [5] Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
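A sketch of the difference between local and global magnitude scoring, under the same assumptions as the pipeline sketch above; only where the threshold is computed changes.

```python
import torch

def local_thresholds(model, sparsity=0.8):
    """Local scoring: each layer keeps its own top (1 - sparsity) fraction of weights."""
    return {name: p.detach().abs().flatten().quantile(sparsity)
            for name, p in model.named_parameters() if p.dim() >= 2}

def global_threshold(model, sparsity=0.8):
    """Global scoring: a single threshold over all prunable weights at once,
    so some layers may end up much sparser than others."""
    scores = torch.cat([p.detach().abs().flatten()
                        for p in model.parameters() if p.dim() >= 2])
    return scores.quantile(sparsity)
```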
  9. Scheduling
     The timing and frequency of pruning during training (see the code sketch below):
     • One-shot or recursive (iterative) pruning
     • Changing the amount of weights pruned over the course of training
     Diagram: one-shot is a single Training → Pruning → Fine-tuning pass; recursive repeats Training → Pruning → Fine-tuning cycles.
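A sketch of the two schedules, reusing the hypothetical `train()`, `magnitude_prune()` and `apply_masks()` helpers from the pipeline sketch above; the sparsity values are illustrative.

```python
def one_shot(model, final_sparsity=0.9):
    """Prune once to the final sparsity, then fine-tune."""
    train(model)
    masks = magnitude_prune(model, final_sparsity)
    train(model, post_step=lambda: apply_masks(model, masks))   # fine-tuning

def recursive(model, sparsity_steps=(0.5, 0.7, 0.8, 0.9)):
    """Alternate small pruning steps with fine-tuning, gradually raising sparsity."""
    train(model)
    for s in sparsity_steps:
        masks = magnitude_prune(model, s)
        train(model, post_step=lambda: apply_masks(model, masks))
```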
  10. Fine-tuning
     Which weights to use when fine-tuning after pruning (sketched in code below):
     • Keep the trained weights and retrain after pruning (Han et al., 2015 [1])
     • After pruning, rewind the remaining weights to their values from before training (Frankle & Carbin, 2019 [5])
     • After pruning, reinitialize the weights to random values (Liu et al., 2019 [6])
     [6] Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
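A sketch of the three options; `initial_state` is assumed to be a copy of `model.state_dict()` saved before training, and none of this is the cited papers' original code.

```python
import copy

# Saved once, before any training, for the rewind option below:
# initial_state = copy.deepcopy(model.state_dict())

def keep_trained_weights(model):
    """Han et al., 2015 [1]: fine-tune the surviving trained weights as they are."""
    return model

def rewind_to_init(model, initial_state):
    """Frankle & Carbin, 2019 [5]: restore the pre-training weights, keep only the pruning mask."""
    model.load_state_dict(initial_state)
    return model

def reinitialize(model):
    """Liu et al., 2019 [6]: discard the weights and randomly re-initialize the pruned architecture."""
    for m in model.modules():
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()
    return model
```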
  11. Themes
     • Unstructured pruning
     • Structured pruning
     • AutoML for model compression
     • Lottery Ticket Hypothesis
  12. Unstructured pruning
     An irregular pruning approach that operates on individual weights (pruning unit: weight). Examples:
     • Magnitude pruning (Han et al., 2015 [1])
     • Variational dropout (Molchanov et al., 2017 [9])
     • L0-norm regularization (Louizos et al., 2018 [10])
     Tips (Zhu et al., 2018 [7]; Gale et al., 2019 [8]):
     - It is better to prune the network layer by layer.
     - The number of weights pruned at each training step should be gradually reduced as the learning rate decreases, to prevent accuracy degradation.
     [7] Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. ICLR Workshop, 2018.
     [8] Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574 (2019).
     [9] Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2498–2507. JMLR.org, 2017.
     [10] Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization. ICLR, 2018.
  13. Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding (Han et al., 2016 [11])
     After training the network, it is pruned by weight magnitude and re-trained. Quantization is then applied, the network is retrained again, and finally Huffman coding is applied. The required storage space is reduced by a factor of 35 to 49, while maintaining the same accuracy as the original network.
     [11] Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016.
  14. Research on GPU acceleration of sparse neural networks
     There is an active area of research on GPU kernels for accelerating operations on pruned networks. For unstructured pruning, GPU kernels that accelerate training and inference are still under development.
     • GPU Kernels for Block-Sparse Weights (Gray et al., 2017 [12])
     • Fast Sparse ConvNets (Elsen et al., 2020 [13])
     • Sparse GPU Kernels for Deep Learning (Gale et al., 2020 [14])
     • SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference (Wang, 2020 [15])
     [12] Gray, S., Radford, A., and Kingma, D. P. GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224 (2017).
     [13] Elsen, E., Dukhan, M., Gale, T., and Simonyan, K. Fast sparse convnets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14629-14638, 2020.
     [14] Gale, T., Zaharia, M., Young, C., and Elsen, E. Sparse GPU kernels for deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC20), 2020.
     [15] Wang, Z. SparseRT: Accelerating unstructured sparsity on GPUs for deep learning inference. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 31-42, 2020.
  15. Structured pruning
     A method of regular pruning at coarser granularity than unstructured pruning, such as whole filters or layers. The matrix operations of a network sparsified by structured pruning can be accelerated by existing GPU libraries.
  16. Pruning filters for efficient convnets (Li et al., 2016 [16])
     A per-filter pruning method. In contrast to per-weight pruning, it does not require support from sparse convolution libraries and works with existing dense libraries. It reduces inference cost by about 30%.
     Method (a code sketch of step 1 follows below):
     1. Sum the absolute weights of each filter; filters with the smallest sums are ranked first for pruning.
     2. Determine the number of filters to prune in each layer according to that layer's sensitivity to pruning.
     3. Prune one-shot, then retrain the model until it recovers the original accuracy.
     [16] Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016).
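A sketch of step 1 for a single `nn.Conv2d` layer. The keep ratio here is an illustrative placeholder (in the paper the per-layer amount comes from the sensitivity analysis of step 2), and the following layer's input channels would also need to be shrunk to match.

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.7) -> nn.Conv2d:
    """Rank filters by the sum of their absolute weights and keep only the top ones."""
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))    # one L1 score per output filter
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    keep_idx = scores.argsort(descending=True)[:n_keep]       # highest-scoring filters survive
    new_conv = nn.Conv2d(conv.in_channels, n_keep,
                         kernel_size=conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep_idx].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep_idx].clone()
    return new_conv
```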
  17. Neuron Merging: Compensating for Pruned Neurons (Kim et al., 2020 [17])
     A neuron merging method that mitigates accuracy loss by transferring the information of a pruned filter into a remaining filter. It preserves the information of the original model better than plain pruning.
     [17] Kim, W., Kim, S., Park, M., and Jeon, G. Neuron Merging: Compensating for pruned neurons. Advances in Neural Information Processing Systems, 33, 2020.
  18. AutoML
     Automatic model compression, including pruning.
     • It is too much work for a human to manually tune a neural network's weight-reduction (compression) procedure.
     ⇨ Research into methods that automate pruning is ongoing.
  19. AMC: AutoML for Model Compression and Acceleration on Mobile Devices (He et al., 2018 [18])
     A reinforcement-learning-based AutoML method for model compression. It is cited by many subsequent automatic model compression methods.
     [18] He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  20. AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference (Luo et al., 2020 [19])
     End-to-end trainable filter-level pruning within a single model: filter selection and fine-tuning are performed jointly in one trainable framework.
     [19] Luo, J.-H. and Wu, J. AutoPruner: An end-to-end trainable filter pruning method for efficient deep model inference. Pattern Recognition 107: 107461, 2020.
  21. The Lottery Ticket Hypothesis
     The lottery ticket hypothesis: Finding sparse, trainable neural networks (Frankle & Carbin, 2019 [5])
     The Lottery Ticket Hypothesis: a randomly initialized dense network contains sub-networks (winning tickets) that, trained in isolation, reach the same accuracy as the original network in at most the same number of iterations.
     • During training, iteratively perform "prune ⇒ reset the remaining weights to their initial values" and check whether the resulting sub-networks are as accurate as or better than the original network (see the code sketch below).
     • The authors were able to find sub-networks that were 10-20% of the size of the original network (or smaller), learned faster than the original network, and achieved higher accuracy.
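A compact sketch of the iterative "train → prune → rewind to the initial weights" loop used to search for winning tickets, reusing the hypothetical `train()`, `magnitude_prune()` and `apply_masks()` helpers from the earlier pipeline sketch; the round count and pruning rate are illustrative.

```python
import copy

def find_winning_ticket(model, rounds=5, prune_rate=0.2):
    initial_state = copy.deepcopy(model.state_dict())      # weights at initialization
    masks = None
    for r in range(1, rounds + 1):
        train(model, post_step=lambda: apply_masks(model, masks) if masks is not None else None)
        sparsity = 1 - (1 - prune_rate) ** r                # prune ~20% of the remaining weights per round
        masks = magnitude_prune(model, sparsity)            # already-zeroed weights stay pruned
        model.load_state_dict(initial_state)                # rewind survivors to their initial values
        apply_masks(model, masks)                           # keep pruned weights at zero
    return model, masks
```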
  22. To prune, or not to prune: exploring the efficacy of pruning for model compression (Zhu et al., 2018 [7])
     A comparison of a large, pruned (sparse) network with a small, dense model. The results show that the large sparse model is more accurate than the small dense model. An automated, gradual pruning method is also proposed.
     The automated gradual pruning algorithm:
     1. Sort the weights of each layer by absolute value and mask the smallest-magnitude weights to zero until the target sparsity level s_f is reached.
     2. Starting from training step t_0, the sparsity level is increased from an initial sparsity s_i to a final sparsity s_f over n pruning steps performed every Δt steps (see the sketch below).
     [7] Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. ICLR Workshop, 2018.
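The gradual sparsity schedule described above, written out as a small function using the slide's notation; the cubic ramp prunes rapidly at first and more slowly as training converges. The example values at the bottom are illustrative only.

```python
def sparsity_at(t, s_i, s_f, t_0, n, delta_t):
    """Target sparsity s_t for t_0 <= t <= t_0 + n * delta_t:
       s_t = s_f + (s_i - s_f) * (1 - (t - t_0) / (n * delta_t)) ** 3
    """
    progress = (t - t_0) / (n * delta_t)
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3

# Example: ramp from 0% to 90% sparsity over n=10 pruning steps, every delta_t=1000 steps.
# sparsity_at(t=5000, s_i=0.0, s_f=0.9, t_0=0, n=10, delta_t=1000)  -> about 0.79
```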
  23. Rethinking the value of network pruning (Liu et al., 2019 [6])
     Shows, through comparisons across multiple models, that pruned architectures trained from scratch with randomly initialized weights are as accurate as, or more accurate than, fine-tuned pruned models.
     [6] Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
  24. The State of Sparsity in Deep Neural Networks (Gale et al., 2019 [8])
     For large-scale tasks on Transformer and ResNet-50, this paper reports that magnitude pruning is more useful than variational dropout and L0 regularization, which had been shown to be useful at smaller scale. On large-scale tasks, training an unstructured-sparse network from scratch does not reach the accuracy of a pruned model retrained from the inherited learned weights.
     [8] Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574 (2019).
  25. What is the State of Neural Network Pruning? (Blalock et al., 2020 [3])
     Based on an exhaustive analysis of the results of 81 papers, this paper proposes the following experimental framework to standardize the evaluation of pruning methods:
     • Specify the exact combination of architecture, dataset and evaluation method.
     • Compare on at least three dataset/architecture combinations.
     • Clarify the definitions of compression ratio and speedup.
     • Report Top-1 and Top-5 accuracy for ImageNet and other multi-class datasets.
     • For each dataset/architecture pair, draw accuracy vs. compression-ratio trade-off curves, together with comparison methods.
     • When plotting the trade-off curve, include at least five operating points (e.g. compression ratios 2, 4, 8, 16, 32) spanning the range of compression ratios.
     [3] Blalock, D., Ortiz, J. J. G., Frankle, J., and Guttag, J. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033 (2020).
  26. Rise of the Transformer
     Transformer performance continues to improve as dataset size, number of parameters, and computational resources are scaled up (Kaplan et al., 2020 [20]).
     Figure from Kaplan, Jared, et al. [20].
     [20] Kaplan, J., et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
  27. The possibility of pruning Transformers
     Transformers are bottlenecked by computational resources, which means that economic power also determines model performance.
     ⇨ Pruning could alleviate this bottleneck if models could be made lighter while still following the Transformer scaling laws. Several works have already applied pruning to Transformers [21, 22, 23].
     [21] Li, B., et al. Efficient Transformer-based large scale language representations using hardware-friendly block structured pruning. arXiv preprint arXiv:2009.08065 (2020).
     [22] Brix, C., Bahar, P., and Ney, H. Successfully applying the stabilized lottery ticket hypothesis to the transformer architecture. arXiv preprint arXiv:2005.03454 (2020).
     [23] Lin, Z., et al. Pruning redundant mappings in Transformer models via spectral-normalized identity prior. arXiv preprint arXiv:2010.01791 (2020).
  28. Summary
     • Pruning is a method of reducing the number of parameters and the computational complexity of a network by setting some of its weights to zero.
     • There are four main differences between pruning methods: structure, scoring, scheduling, and fine-tuning.
     • Pruning may become even more essential for the democratization of deep learning technologies.
  29. References
     [1] Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
     [2] Mirkes, E. M. (2020, July). Artificial Neural Network Pruning to Extract Knowledge. In 2020 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE.
     [3] Blalock, D., Ortiz, J. J. G., Frankle, J., and Guttag, J. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033 (2020).
     [4] Lee, N., Ajanthan, T., and Torr, P. H. S. SNIP: single-shot network pruning based on connection sensitivity. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
     [5] Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
  30. References (continued)
     [6] Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
     [7] Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. ICLR Workshop, 2018.
     [8] Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574 (2019).
     [9] Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2498–2507. JMLR.org, 2017.
     [10] Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization. ICLR, 2018.
  31. References (continued)
     [11] Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016.
     [12] Gray, S., Radford, A., and Kingma, D. P. GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224 (2017).
     [13] Elsen, E., Dukhan, M., Gale, T., and Simonyan, K. Fast sparse convnets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14629-14638, 2020.
     [14] Gale, T., Zaharia, M., Young, C., and Elsen, E. Sparse GPU kernels for deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC20), 2020.
     [15] Wang, Z. SparseRT: Accelerating unstructured sparsity on GPUs for deep learning inference. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 31-42, 2020.
  32. References (continued)
     [16] Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016).
     [17] Kim, W., Kim, S., Park, M., and Jeon, G. Neuron Merging: Compensating for pruned neurons. Advances in Neural Information Processing Systems, 33, 2020.
     [18] He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
     [19] Luo, J.-H. and Wu, J. AutoPruner: An end-to-end trainable filter pruning method for efficient deep model inference. Pattern Recognition 107: 107461, 2020.
     [20] Kaplan, J., et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
  33. References (continued)
     [21] Li, B., et al. Efficient Transformer-based large scale language representations using hardware-friendly block structured pruning. arXiv preprint arXiv:2009.08065 (2020).
     [22] Brix, C., Bahar, P., and Ney, H. Successfully applying the stabilized lottery ticket hypothesis to the transformer architecture. arXiv preprint arXiv:2005.03454 (2020).
     [23] Lin, Z., et al. Pruning redundant mappings in Transformer models via spectral-normalized identity prior. arXiv preprint arXiv:2010.01791 (2020).