learning models ◼ In this tutorial, I introduce various methods to speed up deep learning models. ◼ As models get bigger and bigger nowadays, it is important to speed up models to reduce cost (both money and time).

There are two types of speedups. ⚫ Speeding up training ⚫ Speeding up inference ★ ◼ We focus on inference in this tutorial. ◼ Note some of the techniques are also used for speeding up training.

There are two types of devices ⚫ Inference on GPU ★ ⚫ Inference on CPU ★ ◼ We consider both cases. ◼ The approach can be quite different for the two cases. ◼ Note that inference is commonly run on both CPUs and GPUs, whereas training is mainly carried out on GPUs.

◼ Why does quantization speed up inference? Note that the number of operations is the same. ⚫ It reduces memory consumption → We can increase the batch size. → Throughput increases. ⚫ We can exploit SIMD (on CPU) ⚫ We can exploit Tensor Cores (on GPU) ⚫ Communication costs decrease. ⚫ Cache efficiency improves.


batchsize ◼ We can’t increase batchsize arbitrarily due to the memory limit. ◼ Quantization loosens the memory limit, and we can use higher batch sizes.

◼ This is basically the case in training as well. ◼ The accuracy of a model is mostly determined by the number of samples it sees during training, although it depends on many other aspects as well. See, e.g., the tuning playbook https://github.com/google-research/tuning_playbook ◼ If we double the batch size, we can finish training in half the time. ◼ It is common to use FP16 (instead of FP32) or mixed precision (FP16 + FP32) during training.


CPUs have 256-bit or 512-bit ALUs. ◼ They natively support simultaneous low-precision operations, e.g., via AVX2. ◼ This kind of operation is called SIMD (Single Instruction, Multiple Data). ◼ AVX2 carries out 32 int8 operations simultaneously (256 bits / 8 bits = 32).


at a time ◼ Tensor Cores can handle int8 more efficiently than float, for basically the same reason as CPU SIMD. (Figure: from the A100 GPU whitepaper.)

◼ Two types of quantization (When to quantize) ⚫ Quantization aware training (QAT) ⚫ Post-hoc (post-training) quantization ◼ QAT is more effective than post-processing. ◼ We sometimes have models that are already trained. In this case, we need to resort to post-processing.

The distinction between speedup-aware training and post-processing speedup is common to many of the following speedup techniques. ◼ Some methods cannot be applied as post-processing, though.


◼ Basic Approach for QAT: Straight-Through Estimator (STE) [Bengio+ 2013]
Forward: w = 2.14 (parameter to hold, in float) → quantize → q = 2; then mult with x = 1.4 gives h = 2.8.
Backward: ∂L/∂h = 2.1 → mult → ∂L/∂q = 1.5 → STE (copy) → ∂L/∂w = 1.5. The quantize step itself has no gradient, so the STE copies the gradient of q to w.


◼ Intuitively, if we should increase q (the parameter after quantization), we should increase w.
◼ By holding float parameters and using the STE, we can update parameters continuously.
Example: w = 2.14, q = 2, x = 1.4, h = 2.8, ∂L/∂h = 2.1, ∂L/∂q = 1.5, and the STE copies ∂L/∂w = 1.5.
The gradient suggests decreasing q by (lr * 1.5). But (lr * 1.5) < 0.5 is fractional, so q itself cannot take this step. Instead, decrease w by this much. This may not change q, but if the gradient tends to be positive over many iterations, the changes accumulate in w and q will eventually change.
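The accumulation effect described above can be illustrated with a minimal sketch. It assumes round-to-nearest-integer quantization, a constant upstream gradient, and a hypothetical learning rate; none of these values come from the slides, and the function names are made up for illustration.

```python
def quantize(w: float) -> int:
    """Round the float weight to the nearest integer (the quantized value q)."""
    return round(w)


def ste_step(w: float, x: float, grad_h: float, lr: float) -> float:
    """One SGD step on the float weight w using the straight-through estimator."""
    q = quantize(w)        # forward pass uses the quantized weight
    _h = q * x             # h = q * x (kept only to mirror the forward pass)
    grad_q = grad_h * x    # chain rule through the multiplication
    grad_w = grad_q        # STE: copy the gradient of q to w
    return w - lr * grad_w


# Hypothetical setting: constant upstream gradient, small learning rate.
w = 2.14
for _ in range(9):
    w = ste_step(w, x=1.4, grad_h=1.0, lr=0.05)
q_after_9 = quantize(w)    # w has moved, but q is still the same integer

w = ste_step(w, x=1.4, grad_h=1.0, lr=0.05)
q_after_10 = quantize(w)   # the accumulated change finally flips q
```

Each step moves w by only 0.07, which is too small to change q on its own, yet the float copy of w remembers the accumulated change.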

◼ Two types of quantization (Where to quantize) ⚫ Weight-only Quantization ⚫ Weight and Activation Quantization ◼ Weight-only Quantization reduces the model size, but computations are in float. → It speeds up loading the model onto the GPU, but the computations themselves are not accelerated much. ◼ Some models are too big to fit on small devices. A small model size is important in edge inference.

Some quantization methods quantize weights and batchnorm coefficients to powers of two (…, 1/4, 1/2, 1, 2, 4, …). ◼ In this case, the multiplication of weights and inputs can be done with bit shifts. ◼ This further speeds up computation.
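As a rough illustration of the bit-shift idea, the sketch below rounds a positive weight to the nearest power of two in log scale and multiplies an integer input by it using shifts only. The helper names are hypothetical, and sign handling is omitted for brevity.

```python
import math


def nearest_power_of_two_exponent(w: float) -> int:
    """Return e such that 2**e is the power of two nearest to |w| in log scale."""
    return round(math.log2(abs(w)))


def shift_multiply(x: int, exponent: int) -> int:
    """Multiply integer x by 2**exponent using only bit shifts.

    A negative exponent becomes a right shift, which floors the result.
    """
    if exponent >= 0:
        return x << exponent
    return x >> -exponent


e_large = nearest_power_of_two_exponent(3.7)    # log2(3.7) is about 1.89
e_small = nearest_power_of_two_exponent(0.27)   # log2(0.27) is about -1.89
y_large = shift_multiply(12, e_large)           # 12 * 4
y_small = shift_multiply(12, e_small)           # 12 / 4
```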

Precision level (How much we quantize) ⚫ FP16, BF16 (16-bit float formats) ⚫ Int8 ⚫ Binary ◼ The H100 GPU also supports FP8. ◼ I will review the above three choices in the following slides.

the first try. ◼ FP16, BF16 (16-bit float) It is often okay to naively cast 32-bit models to FP16 or BF16. This is the first recommended option. ◼ An approximately 1.5x speedup is expected. ◼ Tensor Cores support FP16 from the Volta architecture (e.g., V100) and BF16 from the Ampere architecture (e.g., A100). More speedup is expected when Tensor Cores are exploited. ◼ In PyTorch, this is just model = model.half().
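To get a feel for what a naive cast to FP16 does to individual values, here is a small sketch using only the Python standard library (the `struct` module's `e` format is IEEE 754 half precision). It is an illustration of the number format, not of how PyTorch casts a model, and the sample weights are made up.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_in_fp16(x: float) -> bool:
    """Return False if x is too large to be represented in half precision."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


weights = [0.1, 1.0, 3.14159, 65504.0]           # hypothetical FP32 weights
halved = [to_fp16(w) for w in weights]           # small rounding errors appear
bytes_fp32 = len(weights) * struct.calcsize("<f")
bytes_fp16 = len(weights) * struct.calcsize("<e")  # half the memory
```

The largest finite FP16 value is 65504, so weights or activations beyond that range overflow. This is one reason why a naive cast is usually fine but not always.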

◼ Int8 Converting float32 models to int8 requires some care, but the degradation of accuracy is often negligible [Jacob+ CVPR 2018]. ◼ A 2 ~ 4x speedup is expected. ◼ Tensor Cores support Int8 operations from the Turing architecture (e.g., T4). [Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.
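The following is a minimal sketch of affine (scale and zero-point) quantization to int8, in the spirit of the scheme r = S(q − Z) from [Jacob+ CVPR 2018]. The function names, the choice of a signed int8 range, and the sample values are assumptions for illustration. Real frameworks add calibration, per-channel scales, and integer-only arithmetic on top of this.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point so that [min, max] (including 0) maps to int8."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to (approximate) floats: r = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


values = [-1.0, -0.4, 0.0, 0.25, 2.0]            # hypothetical activations
scale, zero_point = choose_qparams(values)
q_values = quantize(values, scale, zero_point)
restored = dequantize(q_values, scale, zero_point)
max_error = max(abs(v - r) for v, r in zip(values, restored))
```

Including 0 in the range makes zero exactly representable, which matters for zero padding. The round-trip error of in-range values is at most about half the scale.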

◼ Binary (1-bit) It degrades accuracy even with sophisticated techniques. ◼ A 30 ~ 60x speedup is expected on CPU [Rastegari+ ECCV 2016]. ◼ A 2 ~ 7x speedup is also expected on GPU [Courbariaux+ NeurIPS 2016]. ◼ This option is recommended only when speed is crucial. [Rastegari+ ECCV 2016] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. [Courbariaux+ NeurIPS 2016] Binarized Neural Networks.
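One reason binary networks can be so fast is that a dot product of ±1 vectors reduces to bitwise operations and a population count. The sketch below illustrates this under the assumption that +1 is encoded as bit 1 and −1 as bit 0. The helper names are hypothetical, and this is not the actual XNOR-Net implementation.

```python
def pack_signs(values):
    """Pack a vector of +1/-1 values into an integer bit mask (+1 -> bit 1)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two +1/-1 vectors of length n from their bit masks.

    Matching positions contribute +1 and mismatching ones -1,
    so the result is n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
fast = binary_dot(pack_signs(a), pack_signs(b), len(a))
slow = sum(x * y for x, y in zip(a, b))   # reference: ordinary dot product
```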

[Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. PyTorch and TensorFlow quantization is based on this paper.

Deep Neural Networks with binary weights during propagations. This paper proposes weight-only binary quantization. ◼ [Courbariaux+ NeurIPS 2016] Binarized Neural Networks. This paper proposes weight-and-activation binary quantization on GPU. ◼ [Rastegari+ ECCV 2016] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. Weight-and-activation binary quantization on CPU.

noise
◼ Knowledge distillation looks roundabout. Why don’t we train the small model only?
◼ Training data is stochastic. Small models struggle to fit noisy signals.
◼ The output of the teacher model is deterministic. It is easier for the small model to fit the deterministic signal.
[Ba+ NeurIPS 2014] Do Deep Nets Really Need to be Deep?

◼ Three steps for pruning: 1. Train the models 2. Prune the weights 3. Finetune the models During finetuning, the pruned weights are fixed to zero. ◼ Finetuning is important for accuracy.

Various criteria for pruning have been proposed. ◼ The most basic one is magnitude pruning: prune the weights that have small absolute values, i.e., prune w if |w| < ε. ◼ It is also popular to dynamically prune weights during training with Lasso-like L1 regularization.
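The prune-then-finetune flow with magnitude pruning can be sketched as follows. The weights, gradients, threshold, and learning rate are made-up values, and the helper names are hypothetical. The key point is that the mask keeps pruned weights fixed to zero during finetuning.

```python
def magnitude_prune(weights, eps):
    """Zero out weights with |w| < eps and return the pruned weights and the mask."""
    mask = [0.0 if abs(w) < eps else 1.0 for w in weights]
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask


def masked_sgd_step(weights, grads, mask, lr):
    """One finetuning step in which pruned weights stay fixed to zero."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


trained = [0.8, -0.02, 0.05, -1.3]                # step 1: trained weights (hypothetical)
pruned, mask = magnitude_prune(trained, eps=0.1)  # step 2: prune
finetuned = masked_sgd_step(pruned, grads=[1.0, 1.0, 1.0, 1.0], mask=mask, lr=0.1)  # step 3
```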

3 x 3 parameters
◼ 3x3 Convolutional Layer
Input X: C x H x W (C: number of channels)
Kernel K: F x C x 3 x 3 (F: number of filters; each filter has 3 x 3 weights for each channel)
Output (convolution X ∗ K): F x H x W

◼ Filter pruning imposes group sparsity with group K[f, :, :, :] for each f.
Filter pruning: K (F x C x 3 x 3) → K⁻ (F⁻ x C x 3 x 3), where F⁻ is the number of remaining filters.
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.

◼ Channel pruning imposes group sparsity with group K[:, c, :, :] for each c.
Channel pruning: K (F x C x 3 x 3) → K⁻ (F x C⁻ x 3 x 3), where C⁻ is the number of remaining channels.
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.

◼ Shape pruning imposes group sparsity with group K[:, c, h, w] for each (c, h, w).
Shape pruning: K viewed as an F x (C x 3 x 3) matrix → K⁻ (F x D), where D is the number of remaining columns.
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.

Detailed flow of convolution:
1. Input X: C x H x W. Kernel K: F x C x 3 x 3.
2. Extract a patch for each position and flatten: X’: (H x W) x (C x 3 x 3) = HW x 9C.
3. Flatten the kernel: K’: F x (C x 3 x 3) = F x 9C.
4. Matrix product of K’ and X’ gives Y’: F x HW.
5. Unflatten Y’ into Y: F x H x W.

◼ We can reduce the number of columns in the matrix product by shape pruning.
1. Input X: C x H x W.
2. Extract a patch for each position: X’: (H x W) x (C x 3 x 3) = HW x 9C.
3. Delete the pruned columns: X’’: HW x D.
4. Matrix product of the pruned kernel K⁻ (F x D) and X’’ gives Y’: F x HW.
5. Unflatten Y’ into Y: F x H x W.
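The column-deletion step can be checked on a toy example. The sketch below assumes the patches have already been extracted and flattened into a small matrix; the matrices and helper names are made up. Columns where the flattened kernel is zero for every filter are dropped from both operands, and the product is unchanged.

```python
def matmul_kt(K, X):
    """Return Y' = K X^T, where K is F x D and X is N x D, so Y' is F x N."""
    return [[sum(k * x for k, x in zip(k_row, x_row)) for x_row in X] for k_row in K]


def shape_prune(K, X):
    """Delete every column j where K[:, j] is all zero, from both K and X."""
    keep = [j for j in range(len(K[0])) if any(row[j] != 0 for row in K)]
    K_small = [[row[j] for j in keep] for row in K]
    X_small = [[row[j] for j in keep] for row in X]
    return K_small, X_small, keep


# Toy flattened kernel (F = 2 filters, 4 columns); columns 1 and 3 were pruned.
K_flat = [[1, 0, 2, 0],
          [3, 0, 0, 0]]
# Toy flattened patches (3 positions, 4 columns).
X_flat = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12]]

Y_full = matmul_kt(K_flat, X_flat)
K_small, X_small, keep = shape_prune(K_flat, X_flat)
Y_small = matmul_kt(K_small, X_small)   # same result with half the columns
```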

Non-structured pruning can reach higher sparsity while keeping accuracy, thanks to its fine-grained resolution. ◼ It is not suitable for GPUs. It is mainly for CPU inference.

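Update: Now we can use fine-grained pruning for GPU inference ◼ Sparse multiplication is supported from the Ampere architecture (e.g., A100). Note that the supported pattern is 2:4 fine-grained sparsity (two nonzero values in every group of four), not arbitrary sparsity. (Figure: from the A100 GPU whitepaper.)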

overparam. ◼ Pruning looks roundabout. Why don’t we use small and dense models? ◼ DNNs are overparameterized. ◼ Overparameterization eases optimization and generalization (cf. Lottery Ticket Hypothesis, double descent). ◼ We can prune unnecessary weights after optimization.

on structured pruning:
◼ "For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights." (Zhuang Liu et al. [Liu+ ICLR 2019])
◼ In this case, pruning is just roundabout and meaningless.
[Liu+ ICLR 2019] Rethinking the Value of Network Pruning.

of the first layer of CNNs are continuous and not so diverse (cf. Gabor filter)
◼ We can guess masked values. → The parameters are redundant.
◼ We can effectively approximate them with fairly low-rank matrices [Denton+ NeurIPS 2014, Denil+ NeurIPS 2013].
(Figure: from the AlexNet paper, ImageNet Classification with Deep Convolutional Neural Networks.)
[Denil+ NeurIPS 2013] Predicting Parameters in Deep Learning.
[Denton+ NeurIPS 2014] Exploiting Linear Structure within Convolutional Networks for Efficient Evaluation.

Next, let’s consider attention
$Y = \mathrm{softmax}(Q K^\top)\, V = A V$, where $A = \mathrm{softmax}(Q K^\top)$.
◼ $A_{ij}$ measures the similarity of $Q_i$ and $K_j$. It’s similar to a Gram matrix. Let’s assume for a moment that $A$ is the Gram matrix of the Gaussian kernel.

Next, let’s consider attention
◼ A Gram matrix can be approximated by, e.g., random features [Rahimi+ NeurIPS 2008] or the Nyström approximation.
$Y = \mathrm{softmax}(Q K^\top)\, V \approx Q' K'^\top V$, where $Q'$ is the random feature of $Q$ and $K'$ is the random feature of $K$.
[Rahimi+ NeurIPS 2008] Random Features for Large-Scale Kernel Machines.

attention ◼ An attention matrix is not a Gram matrix of the Gaussian kernel, though. ◼ Fortunately, almost the same methods can be applied to the attention matrix. ⚫ FAVOR+ and Performers [Choromanski+ ICLR 2021] ⚫ Random Feature Attention [Peng+ ICLR 2021] [Choromanski+ ICLR 2021] Rethinking Attention with Performers. [Peng+ ICLR 2021] Random Feature Attention.

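Let $\phi(x) \in \mathbb{R}^d$ be the random features for the Gaussian kernel, i.e., $\phi(q)^\top \phi(k) \approx \exp\left(-\|q - k\|^2 / 2\right)$.
Let $\psi(x) = \exp\left(\|x\|^2 / 2\right) \phi(x)$.
Then $\psi(q)^\top \psi(k) \approx \exp\left(\|q\|^2/2 + \|k\|^2/2 - \|q - k\|^2/2\right) = \exp(q^\top k)$, which is the unnormalized attention.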

time
◼ The normalization constant is $Z = \sum_i \exp(q^\top k_i) \approx \sum_i \psi(q)^\top \psi(k_i) = \psi(q)^\top \sum_i \psi(k_i)$ (random feature approximation).
◼ The sum $\sum_i \psi(k_i)$ takes linear time, but can be reused for other q’s. → linear time in total.
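The reuse of the key sums can be sketched in pure Python. This assumes random Fourier features (cosine features with Gaussian projections and uniform phases) for the Gaussian kernel $\exp(-\|q-k\|^2/2)$. It is an illustration of the idea, not the FAVOR+ algorithm, which uses positive random features to keep the estimates positive. All names and inputs below are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_projections(dim, num_features, rng):
    """Sample Gaussian directions and uniform phases for random Fourier features."""
    ws = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return ws, bs


def psi(x, ws, bs):
    """psi(x) = exp(|x|^2 / 2) * phi(x), so psi(q).psi(k) approximates exp(q.k)."""
    scale = math.sqrt(2.0 / len(ws)) * math.exp(dot(x, x) / 2.0)
    return [scale * math.cos(dot(w, x) + b) for w, b in zip(ws, bs)]


def linear_attention(queries, keys, values, ws, bs):
    """Approximate softmax attention; the sums over keys are computed only once."""
    psi_keys = [psi(k, ws, bs) for k in keys]
    m, dv = len(ws), len(values[0])
    key_sum = [sum(pk[j] for pk in psi_keys) for j in range(m)]          # sum_i psi(k_i)
    kv_sum = [[sum(pk[j] * v[c] for pk, v in zip(psi_keys, values))      # sum_i psi(k_i) v_i^T
               for j in range(m)] for c in range(dv)]
    outputs = []
    for q in queries:
        pq = psi(q, ws, bs)
        z = dot(pq, key_sum)                                             # approximates Z
        outputs.append([dot(pq, kv_sum[c]) / z for c in range(dv)])
    return outputs


def softmax_attention(queries, keys, values):
    """Exact attention for reference (quadratic in the sequence length)."""
    outputs = []
    for q in queries:
        weights = [math.exp(dot(q, k)) for k in keys]
        z = sum(weights)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values)) / z
                        for c in range(len(values[0]))])
    return outputs


rng = random.Random(0)
ws, bs = make_projections(dim=2, num_features=10000, rng=rng)
keys = [[0.5, -0.2], [-0.3, 0.4], [0.1, 0.6], [-0.5, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
queries = [[0.3, 0.1], [-0.4, 0.5]]
approx = linear_attention(queries, keys, values, ws, bs)
exact = softmax_attention(queries, keys, values)
max_gap = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
```

With many features and small-norm inputs, the two outputs should agree closely. The approximation degrades as the norms of q and k grow, because the factor exp(‖x‖²/2) amplifies the estimation error.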

◼ Efficient architectures are sometimes evaluated by the number of FLOPs.
◼ Criticism: "FLOPs is not a good indicator of latency on modern hardware." (Irwan Bello et al. [Bello+ NeurIPS 2021])
[Bello+ NeurIPS 2021] Revisiting ResNets: Improved Training and Scaling Strategies.

Tensor cores run dense matrix multiplications much faster than other operations. ◼ Complicated approximation methods may be slower than straightforward computation with dense matrix products. ◼ Evaluate speed by wall clock time, not by FLOPs.
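A minimal sketch of measuring wall-clock latency is shown below. The helper name, the warm-up and repeat counts, and the dummy workload are assumptions. On a GPU, one would additionally need to synchronize the device before reading the clock (e.g., with `torch.cuda.synchronize()` in PyTorch), since kernels are launched asynchronously.

```python
import time


def measure_latency(fn, warmup=3, repeats=10):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):          # warm-up runs are not timed (caches, lazy init)
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]


calls = []


def dummy_model():
    """Stand-in for a model's forward pass (hypothetical workload)."""
    calls.append(1)
    return sum(i * i for i in range(10000))


latency = measure_latency(dummy_model)
```

The median is used rather than the mean so that occasional slow runs do not dominate the result.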

◼ Speedup techniques can be combined. ◼ Deep Compression [Han+ ICLR 2016] combines pruning with quantization. ◼ The effects may diminish, though. E.g., ResNet is redundant, so we can achieve sizable compression easily. MobileNets are already efficient, so it is challenging to compress them further [Jacob+ CVPR 2018]. [Han+ ICLR 2016] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. [Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.

◼ There are various speedup techniques. ◼ FP16 is the easiest. Try it first. ◼ If you haven’t trained your model yet, efficient architectures (such as MobileNets, Performers, MoE, etc.) are worth trying. ◼ It is important to set a goal before tuning. ◼ It is also important to measure speed by wall clock time, not FLOPs.