KYOTO UNIVERSITY
We'll see how to speed up deep learning models
◼ In this tutorial, I introduce various methods to speed up deep learning models.
◼ As models get bigger and bigger nowadays, it is important to speed up models to reduce cost (both money and time).

We focus on speeding up inference
◼ There are two types of speedups.
  ⚫ Speeding up training
  ⚫ Speeding up inference ★
◼ We focus on inference in this tutorial.
◼ Note that some of the techniques are also used for speeding up training.

We consider both CPU and GPU
◼ There are two types of devices.
  ⚫ Inference on GPU ★
  ⚫ Inference on CPU ★
◼ We consider both cases.
◼ The approach can be quite different for these cases.
◼ Note that inference is common on both CPU and GPU, whereas training is mainly carried out on GPU.

Quantization speeds up inference in many ways
◼ Why does quantization speed up inference? Note that the number of operations is the same.
  ⚫ It reduces memory consumption. → We can increase the batch size. → Throughput increases.
  ⚫ We can exploit SIMD (on CPU).
  ⚫ We can exploit Tensor Cores (on GPU).
  ⚫ Communication costs are reduced.
  ⚫ Cache efficiency improves.

High batch size increases throughput
◼ Batch of one: x → Model on GPU → y takes 10.0 msec.
◼ Batch of two: (x, x) → Model on GPU → (y, y) takes 10.1 msec.
◼ The time is almost the same as long as parallelization is not saturated. → Throughput doubles.
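The arithmetic behind this slide can be sketched as follows. The helper name `throughput` is mine, and the latencies are the illustrative ones quoted above, not measurements.

```python
def throughput(batch_size: int, latency_msec: float) -> float:
    """Samples processed per second for one forward pass."""
    return batch_size / (latency_msec / 1000.0)

single = throughput(1, 10.0)   # one sample in 10.0 msec
double = throughput(2, 10.1)   # two samples in 10.1 msec
ratio = double / single        # close to 2 while the GPU is not saturated
```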

Quantization loosens the memory limit → higher batch size
◼ We can't increase the batch size arbitrarily due to the memory limit.
◼ Quantization loosens the memory limit, so we can use larger batch sizes.

High batch size speeds up training as well
◼ This is basically the case in training as well.
◼ The accuracy of a model is mostly determined by the number of samples it saw during training. It depends on many other aspects, though. See, e.g., the tuning playbook: https://github.com/google-research/tuning_playbook
◼ If we double the batch size, we can finish training in about half the time.
◼ It is common to use FP16 (instead of FP32) or mixed precision (FP16 + FP32) during training.

Standard addition of 64 bit ints
◼ A 64 bit Arithmetic Logic Unit (ALU) adds two int64 values and produces one int64 value.
    0000101110100010100000111111000011101100001111100111110000010011  (int64)
  + 0000010011001001010101111010000000010101011111010000000010010100  (int64)
  = 0001000001101011110110111001000100000001101110110111110010100111  (int64)

Eight int8 additions can be done simultaneously
◼ The same 64 bit ALU adds eight int8 lanes at once if carries across lane boundaries are turned off.
    00001011 10100010 10000011 11110000 11101100 00111110 01111100 00010011  (8 x int8)
  + 00000100 11001001 01010111 10100000 00010101 01111101 00000000 10010100  (8 x int8)
  = 00001111 01101011 11011010 10010000 00000001 10111011 01111100 10100111  (8 x int8)
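The idea of turning off carries between lanes can be emulated in software. The following is a minimal sketch in plain Python, not how a real ALU is wired; the function names and masks are my own, and the two operands are the bit strings from the slide.

```python
MASK64 = (1 << 64) - 1
LOW7 = 0x7F7F7F7F7F7F7F7F    # lower 7 bits of every byte
TOP1 = 0x8080808080808080    # top bit of every byte

def add_int64(a: int, b: int) -> int:
    """Ordinary 64 bit addition: carries propagate across all 64 bits."""
    return (a + b) & MASK64

def add_8x_int8(a: int, b: int) -> int:
    """Eight independent 8 bit additions packed in one 64 bit word."""
    low = (a & LOW7) + (b & LOW7)    # per-byte sums never cross a byte boundary
    return low ^ ((a ^ b) & TOP1)    # fix the top bit of each byte, dropping its carry-out

def lanes(x: int) -> list:
    """Split a 64 bit word into eight bytes, least significant first."""
    return [(x >> (8 * i)) & 0xFF for i in range(8)]

a = int("0000101110100010100000111111000011101100001111100111110000010011", 2)
b = int("0000010011001001010101111010000000010101011111010000000010010100", 2)
full = add_int64(a, b)       # one int64 result
packed = add_8x_int8(a, b)   # eight int8 results, each taken modulo 256
```

Each lane of `packed` equals the sum of the corresponding input bytes modulo 256, which is what a SIMD int8 addition returns.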

SIMD parallelizes low-precision arithmetic
◼ Modern CPUs have 256 bit or 512 bit wide vector ALUs.
◼ They natively support simultaneous low-precision operations through instruction sets such as AVX2.
◼ This kind of operation is called SIMD (Single Instruction, Multiple Data).
◼ AVX2 carries out 32 int8 operations simultaneously (256 bits / 8 bits = 32 lanes).

Tensor Cores carry out tensor multiplications
◼ An NVIDIA GPU consists of CUDA Cores and Tensor Cores.
◼ Tensor Cores carry out matrix multiplications.
▲ Figure from the A100 GPU whitepaper

Tensor Cores process more int8 values than floats at a time
◼ Tensor Cores can handle int8 more efficiently than float, for basically the same reason as CPU SIMD.
▲ Figure from the A100 GPU whitepaper

We have two approaches: QAT and postprocessing
◼ Two types of quantization (when to quantize)
  ⚫ Quantization-aware training (QAT)
  ⚫ Post-hoc quantization
◼ QAT is more effective than postprocessing.
◼ We sometimes have models that are already trained. In this case, we need to resort to postprocessing.

Speedup-aware training vs postprocessing
◼ This dichotomy, speedup-aware training vs postprocessing speedup, recurs in many of the following speedup techniques.
◼ Some methods cannot be applied as postprocessing, though.

STE is a basic approach for QAT
◼ Basic approach for QAT: Straight-Through Estimator (STE) [Bengio+ 2013]
◼ Forward: w = 2.14 (parameter to hold, in float) → quantize → q = 2; multiply by x = 1.4 → h = 2.8
◼ Backward: ∂L/∂h = 2.1 → through mult → ∂L/∂q = 1.5. The quantizer has no gradient, so STE copies it: ∂L/∂w = ∂L/∂q = 1.5

STE allows us to continuously optimize parameters
◼ Intuitively, if we should increase q (the parameter after quantization), we should increase w.
◼ We can continuously update parameters by holding float parameters and using STE.
◼ Example: w = 2.14, q = 2, x = 1.4, h = 2.8, ∂L/∂h = 2.1, ∂L/∂q = 1.5, and STE copies ∂L/∂w = 1.5.
  ⚫ This suggests decreasing q by (lr * 1.5). But (lr * 1.5) < 0.5 is fractional, so q itself cannot take this step.
  ⚫ Decrease w by this much instead. This may not change q.
  ⚫ If the gradient tends to be positive over iterations, the changes accumulate and q will eventually change.
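The accumulation effect can be seen in a toy loop. This is a minimal sketch, assuming a round-to-nearest quantizer, a constant upstream gradient `grad_h = 1.0`, and `lr = 0.1`; these values and the function names are my own choices for illustration, not the numbers on the slide.

```python
def quantize(w: float) -> int:
    """Round the float parameter to the nearest integer."""
    return round(w)

def ste_step(w: float, x: float, grad_h: float, lr: float):
    q = quantize(w)          # forward: quantized weight
    h = q * x                # forward: h = q * x
    grad_q = grad_h * x      # backward through the multiplication
    grad_w = grad_q          # STE: copy the gradient through the quantizer
    return w - lr * grad_w, h

w, x = 2.14, 1.4
history = []                 # (float weight, quantized weight) after each update
for _ in range(10):
    w, h = ste_step(w, x, grad_h=1.0, lr=0.1)
    history.append((round(w, 2), quantize(w)))
```

Each update moves w by only 0.14, so q stays at 2 for the first few iterations and flips to 1 once the float parameter has drifted below 1.5.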

Weight-only or weight-and-activation quantization
◼ Two types of quantization (where to quantize)
  ⚫ Weight-only quantization
  ⚫ Weight-and-activation quantization
◼ Weight-only quantization reduces the model size, but computations are in float. → It speeds up loading the model onto the GPU, but the computations themselves are not sped up much.
◼ Some models are too big to fit in small devices. Small model size is important in edge inference.

Power-of-two quantization speeds up even more
◼ Some quantization methods quantize weights and batchnorm coefficients to powers of two (…, 1/4, 1/2, 1, 2, 4, …).
◼ In this case, the multiplication of weights and inputs can be done with a bit shift.
◼ This further speeds up computation.
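A minimal sketch of the bit shift trick, assuming integer activations and nonzero weights rounded to the nearest power of two in log scale. The function names are hypothetical and not taken from any particular paper.

```python
import math

def quantize_pow2(w: float):
    """Return (sign, exponent) so that sign * 2**exponent approximates the nonzero weight w."""
    sign = -1 if w < 0 else 1
    exponent = round(math.log2(abs(w)))
    return sign, exponent

def mul_by_shift(x: int, sign: int, exponent: int) -> int:
    """Multiply an integer activation by sign * 2**exponent using only shifts."""
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return sign * shifted

sign, exponent = quantize_pow2(3.7)     # nearest power of two is 4 = 2**2
y = mul_by_shift(12, sign, exponent)    # 12 * 4 computed as 12 << 2
```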

Many options for how much we quantize
◼ Precision level (how much we quantize)
  ⚫ FP16, BF16 (16 bit float formats)
  ⚫ Int8
  ⚫ Binary
◼ The H100 GPU also supports FP8.
◼ I will review the above three choices in the following slides.

FP16 is easy to use. Good for the first try.
◼ FP16, BF16 (16 bit float): It is often okay to naively cast 32 bit models to FP16 or BF16. This is the first recommended option. In PyTorch, just write model = model.half().
◼ Approximately 1.5x speedup is expected.
◼ Tensor Cores support FP16 from the Volta architecture (e.g., V100) and BF16 from the Ampere architecture (e.g., A100). More speedup is expected when Tensor Cores are exploited.
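To get a feeling for what a naive cast does, here is a minimal sketch that uses NumPy arrays as a stand-in for model weights (an assumption for illustration; on a real PyTorch model the cast is the `model.half()` call above). The matrix sizes and the random data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
w32 = rng.standard_normal((256, 256)).astype(np.float32)   # stand-in for FP32 weights
x32 = rng.standard_normal(256).astype(np.float32)          # stand-in for an input

w16 = w32.astype(np.float16)   # naive cast to FP16
x16 = x32.astype(np.float16)

y32 = w32 @ x32
y16 = (w16 @ x16).astype(np.float32)

memory_ratio = w16.nbytes / w32.nbytes                       # half the memory
rel_error = np.linalg.norm(y16 - y32) / np.linalg.norm(y32)  # small for well-scaled values
```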

Int8 is efficient. Good for precise tuning.
◼ Int8: Converting float32 models to int8 requires some care. The degradation of accuracy is often negligible [Jacob+ CVPR 2018].
◼ 2 ~ 4x speedup is expected.
◼ Tensor Cores support int8 operations from the Turing architecture (e.g., T4).
[Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.
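One piece of the care involved is choosing a scale and a zero point. The following is a minimal sketch of per-tensor affine quantization in the spirit of [Jacob+ CVPR 2018], not the exact scheme of the paper or of any library. It assumes the tensor is not constant zero, and the function names are my own.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine quantization: x is approximated by scale * (q - zero_point)."""
    lo = min(float(x.min()), 0.0)   # the representable range must contain 0
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return scale * (q.astype(np.float32) - zero_point)

x = np.linspace(-1.0, 3.0, 9, dtype=np.float32)
q, scale, zero_point = quantize_int8(x)
x_hat = dequantize(q, scale, zero_point)
max_error = float(np.abs(x - x_hat).max())   # roughly bounded by scale / 2
```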

Binary is extremely efficient, but not precise.
◼ Binary (1 bit): It degrades the accuracy even with sophisticated techniques.
◼ 30 ~ 60x speedup is expected on CPU [Rastegari+ ECCV 2016].
◼ 2 ~ 7x speedup is also expected on GPU [Courbariaux+ NeurIPS 2016].
◼ This option is recommended only when speed is crucial.
[Rastegari+ ECCV 2016] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.
[Courbariaux+ NeurIPS 2016] Binarized Neural Networks.
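Much of the speedup of binary networks comes from replacing multiply-accumulate with XNOR and popcount. Here is a minimal sketch for a single dot product of {-1, +1} vectors, using Python integers as bit masks. The helper names are hypothetical, and real implementations work on machine words with hardware popcount.

```python
def pack_signs(values) -> int:
    """Pack a list of +1/-1 values into an integer bit mask (bit = 1 means +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1, +1} vectors of length n via XNOR and popcount."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)   # 1 where the signs agree
    agreements = bin(xnor).count("1")            # popcount
    return 2 * agreements - n                    # agreements minus disagreements

a = [+1, -1, +1, +1, -1, -1, +1, -1]
b = [+1, +1, -1, +1, -1, +1, +1, -1]
result = binary_dot(pack_signs(a), pack_signs(b), len(a))
```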

Reference
◼ Famous papers:
◼ [Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. PyTorch and TensorFlow quantization are based on this paper.

Reference
◼ [Courbariaux+ NeurIPS 2015] BinaryConnect: Training Deep Neural Networks with binary weights during propagations. This paper proposes weight-only binary quantization.
◼ [Courbariaux+ NeurIPS 2016] Binarized Neural Networks. This paper proposes weight-and-activation binary quantization on GPU.
◼ [Rastegari+ ECCV 2016] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. This paper proposes weight-and-activation binary quantization on CPU.

Distillation has two steps
◼ Two steps of knowledge distillation:
  1. Train the large model.
  2. Train the small model using the output of the large model as the target.
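The two steps can be sketched with linear softmax classifiers in NumPy. This is only a toy illustration under my own assumptions: the "teacher" is a random linear map standing in for a trained large model, the "student" is another linear map, and the loss is the cross-entropy against the teacher's output distribution.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits: np.ndarray, teacher_probs: np.ndarray) -> float:
    """Cross-entropy between the teacher's output distribution and the student's."""
    log_p = np.log(softmax(student_logits) + 1e-12)
    return float(-(teacher_probs * log_p).sum(axis=1).mean())

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 5))
teacher_w = rng.standard_normal((5, 3))     # step 1: stand-in for a trained large model
teacher_probs = softmax(x @ teacher_w)      # teacher output, used as the target

student_w = np.zeros((5, 3))                # stand-in for the small model
loss_before = distillation_loss(x @ student_w, teacher_probs)
for _ in range(200):                        # step 2: fit the teacher's output
    grad = x.T @ (softmax(x @ student_w) - teacher_probs) / len(x)
    student_w -= 0.5 * grad
loss_after = distillation_loss(x @ student_w, teacher_probs)
```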

Distillation is indirect but necessary due to noise
◼ Knowledge distillation looks roundabout. Why don't we train only the small model?
  ⚫ Training data (stochastic): Small models struggle to fit noisy signals.
  ⚫ Output of the teacher model (deterministic): It is easier for the small model to fit the deterministic signal.
[Ba+ NeurIPS 2014] Do Deep Nets Really Need to be Deep?

Pruning has three steps. Finetuning is important.
◼ Three steps for pruning:
  1. Train the model.
  2. Prune the weights.
  3. Finetune the model. During finetuning, the pruned weights are fixed to zero.
◼ Finetuning is important for accuracy.

Pruning based on magnitude is standard
◼ Various criteria for pruning have been proposed.
◼ The most basic one is magnitude pruning: prune the weights that have small absolute values, i.e., prune w if |w| < ε.
◼ It is also popular to prune weights dynamically during training with Lasso-like L1 regularization.
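Magnitude pruning itself is a one-liner. Below is a minimal NumPy sketch; the function name, the example weights, and the threshold are mine.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, eps: float):
    """Zero out weights with |w| < eps; return the pruned weights and the keep-mask."""
    mask = np.abs(w) >= eps
    return w * mask, mask

w = np.array([[0.80, -0.03, 0.10],
              [-0.02, 0.50, -0.60]])
pruned, mask = magnitude_prune(w, eps=0.05)
sparsity = 1.0 - mask.mean()    # fraction of weights removed

# During finetuning, re-apply the mask after every update so pruned weights stay zero:
#   w = (w - lr * grad) * mask
```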

Convolutional layer has F x C x 3 x 3 parameters
◼ 3x3 convolutional layer: input (C x H x W) ∗ K (F x C x 3 x 3) → output (F x H x W), where ∗ denotes convolution.
  ⚫ C: number of channels
  ⚫ F: number of filters
  ⚫ Each filter has weights for each channel.

Filter pruning uses K[f, :, :, :]
◼ Filter pruning imposes group sparsity with group K[f, :, :, :] for each f.
◼ Filter pruning: K (F x C x 3 x 3) → K⁻ (F⁻ x C x 3 x 3)
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.

Channel pruning uses K[:, c, :, :]
◼ Channel pruning imposes group sparsity with group K[:, c, :, :] for each c.
◼ Channel pruning: K (F x C x 3 x 3) → K⁻ (F x C⁻ x 3 x 3)
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.

Shape pruning uses K[:, c, h, w]
◼ Shape pruning imposes group sparsity with group K[:, c, h, w] for each (c, h, w).
◼ Shape pruning: K (F x (C x 3 x 3)) → K⁻ (F x D)
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.
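The three kinds of groups differ only in which axes are pooled. The following NumPy sketch computes one L2 norm per group and drops the weakest filter and channel; picking the smallest norm is my simplification for illustration, whereas [Wen+ NeurIPS 2016] learns the sparsity with a group regularizer during training.

```python
import numpy as np

rng = np.random.default_rng(0)
F, C = 4, 3
K = rng.standard_normal((F, C, 3, 3))

# One L2 norm per group; groups with small norms are candidates for pruning.
filter_norms = np.sqrt((K ** 2).sum(axis=(1, 2, 3)))    # group K[f, :, :, :], shape (F,)
channel_norms = np.sqrt((K ** 2).sum(axis=(0, 2, 3)))   # group K[:, c, :, :], shape (C,)
shape_norms = np.sqrt((K ** 2).sum(axis=0))             # group K[:, c, h, w], shape (C, 3, 3)

# Filter pruning: drop the filter with the smallest norm -> (F-1) x C x 3 x 3.
keep_f = np.sort(np.argsort(filter_norms)[1:])
K_filter_pruned = K[keep_f]

# Channel pruning: drop the channel with the smallest norm -> F x (C-1) x 3 x 3.
keep_c = np.sort(np.argsort(channel_norms)[1:])
K_channel_pruned = K[:, keep_c]
```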

Can shape pruning exploit structured sparsity?
◼ Why shape pruning?
◼ At first glance, it looks like non-structured pruning and does not seem to exploit any structure.

Convolution is implemented by matrix product
◼ Detailed flow of convolution:
  1. Extract a patch for each position: X (C x H x W) → X' ((H x W) x (C x 3 x 3) = HW x 9C)
  2. Flatten: K (F x C x 3 x 3) → K' (F x (C x 3 x 3) = F x 9C)
  3. Matrix product of K' and X': Y' (F x HW)
  4. Unflatten: Y' → Y (F x H x W)

Shape pruning reduces the number of columns
◼ We can reduce the number of columns in the matrix product by shape pruning.
  1. Extract a patch for each position: X (C x H x W) → X' ((H x W) x (C x 3 x 3) = HW x 9C)
  2. Delete columns: X' → X'' (HW x D)
  3. Matrix product of K⁻ (F x D) and X'': Y' (F x HW)
  4. Unflatten: Y' → Y (F x H x W)
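The flow above can be checked numerically. This is a minimal NumPy sketch, assuming stride 1 and zero padding of 1 so that the output keeps the H x W size; `im2col` and `conv3x3` are my own helper names, and the two pruned groups are chosen arbitrarily.

```python
import numpy as np

def im2col(x: np.ndarray) -> np.ndarray:
    """Extract a 3x3 patch for each position: (C, H, W) -> (H*W, C*9)."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    rows = [xp[:, i:i + 3, j:j + 3].reshape(-1) for i in range(H) for j in range(W)]
    return np.stack(rows)

def conv3x3(x: np.ndarray, K: np.ndarray, keep_cols=None) -> np.ndarray:
    """3x3 convolution as a matrix product; keep_cols are the columns left after shape pruning."""
    F = K.shape[0]
    _, H, W = x.shape
    X2 = im2col(x)                  # X': HW x 9C
    K2 = K.reshape(F, -1)           # K': F x 9C
    if keep_cols is not None:       # delete the pruned columns from both matrices
        X2, K2 = X2[:, keep_cols], K2[:, keep_cols]
    return (K2 @ X2.T).reshape(F, H, W)   # Y': F x HW, unflattened to F x H x W

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 5))
K = rng.standard_normal((4, 2, 3, 3))
K[:, 0, 0, 0] = 0.0                 # shape pruning: zero the group K[:, c=0, h=0, w=0]
K[:, 1, 2, 1] = 0.0                 # and the group K[:, c=1, h=2, w=1]
keep_cols = np.flatnonzero(np.abs(K.reshape(4, -1)).sum(axis=0) > 0)

y_full = conv3x3(x, K)                 # F x 9C times 9C x HW
y_pruned = conv3x3(x, K, keep_cols)    # F x D times D x HW, same output
```

Because a whole column of K' is zero for each pruned (c, h, w), removing it shrinks a dense matrix product, which is why shape pruning counts as structured.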

Non-structured pruning is mainly for CPU
◼ Non-structured pruning
  ⚫ Pro: It can reach higher sparsity while keeping accuracy, thanks to its fine-grained resolution.
  ⚫ Con: It is not suitable for GPUs. It is mainly for CPU inference.

Structured pruning may not be effective
◼ Structured pruning
  ⚫ Pro: It effectively exploits GPU parallelization.
  ⚫ Con: It can degrade the accuracy, or we can prune only a few filters.

Sparse operations are available after A100
◼ Update: Fine-grained pruning can now be used for GPU inference as well.
◼ Sparse matrix multiplication is supported from the Ampere architecture (e.g., A100), in the form of 2:4 fine-grained structured sparsity (two nonzero values in every group of four).
▲ Figure from the A100 GPU whitepaper

Pruning is indirect but necessary due to overparameterization
◼ Pruning looks roundabout. Why don't we use small and dense models?
◼ DNNs are overparameterized.
◼ Overparameterization eases optimization and generalization (cf. Lottery Ticket Hypothesis, double descent).
◼ We can prune unnecessary weights after optimization.

Structured pruning is sometimes meaningless
◼ Criticism of structured pruning (Zhuang Liu): For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights [Liu+ ICLR 2019].
◼ In this case, pruning is just roundabout and meaningless.
[Liu+ ICLR 2019] Rethinking the Value of Network Pruning.

CNN filters have low-rank structures
◼ Filters of the first layer of CNNs are continuous and not so diverse (cf. Gabor filters).
▲ Figure from the AlexNet paper, ImageNet Classification with Deep Convolutional Neural Networks.
◼ We can guess masked values. → Parameters are redundant.
◼ We can effectively approximate them with fairly low-rank matrices [Denton+ NeurIPS 2014, Denil+ NeurIPS 2013].
[Denil+ NeurIPS 2013] Predicting Parameters in Deep Learning.
[Denton+ NeurIPS 2014] Exploiting Linear Structure within Convolutional Networks for Efficient Evaluation.
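A low-rank approximation can be sketched with a truncated SVD. This is a generic illustration, not the specific factorization used in the cited papers; the weight matrix below is synthetic and constructed to be nearly rank 4.

```python
import numpy as np

def low_rank_factors(W: np.ndarray, rank: int):
    """Truncated SVD: W (m x n) is approximated by A (m x rank) @ B (rank x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]
    B = Vt[:rank]
    return A, B

rng = np.random.default_rng(0)
# Hypothetical weight matrix that is nearly rank 4, plus a little noise.
W = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
W += 0.01 * rng.standard_normal(W.shape)

A, B = low_rank_factors(W, rank=4)
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
params_before = W.size              # 64 * 48 = 3072
params_after = A.size + B.size      # 64 * 4 + 4 * 48 = 448
```

Replacing the layer `x @ W` by `(x @ A) @ B` reduces both the parameters and the multiplications by the same ratio.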

Computation of A is time consuming
◼ Next, let's consider attention: $Y = \mathrm{softmax}(QK^\top)V = AV$, where $A = \mathrm{softmax}(QK^\top)$.
◼ Computing the attention matrix A (which is n x n) is time-consuming.

A behaves like a Gram matrix
◼ In $Y = \mathrm{softmax}(QK^\top)V = AV$, the entry $A_{ij}$ measures the similarity of query $Q_i$ and key $K_j$. It is similar to a Gram matrix.
◼ Let's assume A is the Gram matrix of the Gaussian kernel for a moment.

Gram matrix can be effectively approximated
◼ A Gram matrix can be approximated by, e.g., random features [Rahimi+ NeurIPS 2008] and the Nyström approximation.
◼ $Y = \mathrm{softmax}(QK^\top)V \approx Q'K'^\top V$, where $Q'$ is the random feature of Q and $K'$ is the random feature of K.
[Rahimi+ NeurIPS 2008] Random Features for Large-Scale Kernel Machines.

Random features can also be applied to attention
◼ An attention matrix is not a Gram matrix of the Gaussian kernel, though.
◼ Fortunately, almost the same methods can be applied to the attention matrix.
  ⚫ FAVOR+ and Performers [Choromanski+ ICLR 2021]
  ⚫ Random Feature Attention [Peng+ ICLR 2021]
[Choromanski+ ICLR 2021] Rethinking Attention with Performers.
[Peng+ ICLR 2021] Random Feature Attention.

Attention with random features runs in linear time
◼ Approximation of Y = Attention(Q, K, V):
  1. Compute random features: $Q' = \psi(Q) \in \mathbb{R}^{n \times d'}$, $K' = \psi(K) \in \mathbb{R}^{n \times d'}$
  2. Multiply slim matrices: $Y \approx Q'(K'^\top V)$
◼ Time complexity: O(n d' (d + d')) → linear w.r.t. n

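Derivation of random features for attention
◼ Let $\phi(x) \in \mathbb{R}^{d'}$ be the random features for the Gaussian kernel: $\phi(q)^\top \phi(k) \approx \exp(-\|q - k\|^2 / 2)$
◼ Let $\psi(x) = \exp(\|x\|^2 / 2)\,\phi(x)$
◼ Then $\psi(q)^\top \psi(k) \approx \exp(\|q\|^2/2 + \|k\|^2/2 - \|q - k\|^2/2) = \exp(q^\top k)$, which is the unnormalized attention.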

Normalization term can be computed in linear time
◼ The normalization constant is $Z = \sum_i \exp(q^\top k_i) \approx \sum_i \psi(q)^\top \psi(k_i) = \psi(q)^\top \sum_i \psi(k_i)$, where ≈ is the random feature approximation.
◼ The sum $\sum_i \psi(k_i)$ takes linear time, but can be reused for other q's. → Linear time in total.
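Putting the last few slides together gives the following NumPy sketch. It assumes random Fourier features (cosine features with Gaussian frequencies) for φ and the kernel $\exp(-\|q - k\|^2 / 2)$; this is a simple illustration, not the FAVOR+ features of Performers. The sizes and the 0.2 scaling of Q and K are arbitrary, and the number of features is set large only so that the toy approximation is accurate.

```python
import numpy as np

def random_features(X, W, b):
    """psi(x) = exp(|x|^2 / 2) * phi(x), with phi(q)^T phi(k) ~ exp(-|q - k|^2 / 2)."""
    m = W.shape[0]
    phi = np.sqrt(2.0 / m) * np.cos(X @ W.T + b)
    return np.exp(0.5 * (X ** 2).sum(axis=1, keepdims=True)) * phi

def exact_attention(Q, K, V):
    A = np.exp(Q @ K.T)                          # n x n, quadratic in n
    return (A / A.sum(axis=1, keepdims=True)) @ V

def linear_attention(Q, K, V, W, b):
    Qf = random_features(Q, W, b)                # Q': n x d'
    Kf = random_features(K, W, b)                # K': n x d'
    numerator = Qf @ (Kf.T @ V)                  # never forms the n x n matrix
    Z = Qf @ Kf.sum(axis=0)                      # psi(q)^T sum_i psi(k_i), sum reused for all q
    return numerator / Z[:, None]

rng = np.random.default_rng(0)
n, d, d_feat = 50, 8, 4096
Q = 0.2 * rng.standard_normal((n, d))
K = 0.2 * rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
W = rng.standard_normal((d_feat, d))             # frequencies w ~ N(0, I)
b = rng.uniform(0.0, 2.0 * np.pi, size=d_feat)   # random phases

Y_exact = exact_attention(Q, K, V)
Y_approx = linear_attention(Q, K, V, W, b)
mean_abs_error = float(np.abs(Y_exact - Y_approx).mean())
```

Cosine features can produce negative kernel estimates when the queries and keys have large norms, which is one reason practical methods use other feature maps.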

Traditional filter has weights for each channel
◼ Traditional 3x3 convolutional layer: input (C x H x W) ∗ K (F x C x 3 x 3) → output (F x H x W)
  ⚫ C: number of channels
  ⚫ F: number of filters
  ⚫ Each filter has weights for each channel.

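MobileNets use spatial and interchannel interactions
◼ MobileNets
  1. Depthwise convolution (spatial interactions): input (C x H x W) ∗ K1 (C x 3 x 3) → (C x H x W). 1/F reduction compared with the traditional layer.
  2. Standard 1x1 convolution (interchannel interactions): (C x H x W) ∗ K2 (F x C x 1 x 1) → output (F x H x W). 1/9 reduction compared with the traditional layer.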
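The two reduction factors follow from counting parameters. A small sketch of the arithmetic, where the channel counts C = 64 and F = 128 are arbitrary example values:

```python
def standard_conv_params(C: int, F: int, k: int = 3) -> int:
    """F filters, each with C x k x k weights."""
    return F * C * k * k

def separable_conv_params(C: int, F: int, k: int = 3) -> int:
    """Depthwise (C x k x k) followed by pointwise 1x1 (F x C x 1 x 1)."""
    return C * k * k + F * C

C, F = 64, 128
standard = standard_conv_params(C, F)      # 128 * 64 * 9 = 73728
separable = separable_conv_params(C, F)    # 576 + 8192 = 8768
ratio = separable / standard               # equals 1/F + 1/9
```

The same ratio holds for the number of multiplications, since both layers are applied at every one of the H x W positions.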

FLOPs is not a good evaluation method
◼ Efficient architectures are sometimes evaluated by the number of FLOPs.
◼ Criticism (Irwan Bello): FLOPs is not a good indicator of latency on modern hardware. [Bello+ NeurIPS 2021]
[Bello+ NeurIPS 2021] Revisiting ResNets: Improved Training and Scaling Strategies.

Use wall clock time not FLOPs
◼ Tensor Cores run dense matrix multiplications much faster than other operations.
◼ Complicated approximation methods may be slower than straightforward computation with dense matrix products.
◼ Evaluate speed by wall clock time, not by FLOPs.
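A minimal timing helper could look like the following. The function name and the default warmup and repeat counts are my own choices, and the workload is a placeholder for a real forward pass.

```python
import time

def measure_latency(fn, warmup: int = 3, repeats: int = 20) -> float:
    """Median wall clock time of fn() in seconds."""
    for _ in range(warmup):          # warm up caches, lazy initialization, etc.
        fn()
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

latency = measure_latency(lambda: sum(i * i for i in range(10_000)))
```

On a GPU, kernels run asynchronously, so the device should be synchronized before reading the clock (e.g., with torch.cuda.synchronize() in PyTorch).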

Different types of techniques can be combined.
◼ Speedup techniques can be combined.
◼ Deep Compression [Han+ ICLR 2016] combines pruning with quantization.
◼ The effects may diminish, though. E.g., ResNet is redundant, so we can achieve sizable compression easily. MobileNets are already efficient, so it is challenging to compress them further [Jacob+ CVPR 2018].
[Han+ ICLR 2016] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.
[Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.

Choose speedup methods based on your goal
◼ There are various speedup techniques.
◼ FP16 is the easiest. Try it first.
◼ If you haven't trained your model yet, efficient architectures (such as MobileNets, Performers, MoE, etc.) are worth trying.
◼ It is important to set a goal before tuning.
◼ It is also important to measure speed by wall clock time, not FLOPs.