Slide 1

How to Speed up Deep Learning Models
Ryoma Sato

Slide 2

We'll see how to speed up deep learning models
◼ In this tutorial, I introduce various methods for speeding up deep learning models.
◼ As models keep getting bigger, speeding them up is important to reduce cost (both money and time).

Slide 3

We focus on speeding up inference
◼ There are two types of speedups:
⚫ Speeding up training
⚫ Speeding up inference ★
◼ We focus on inference in this tutorial.
◼ Note that some of the techniques are also used for speeding up training.

Slide 4

We consider both CPU and GPU
◼ There are two types of devices:
⚫ Inference on GPU ★
⚫ Inference on CPU ★
◼ We consider both cases.
◼ The approach can be quite different for the two cases.
◼ Note that inference on CPU is common, and inference on GPU is also common, whereas training is mainly carried out on GPU.

Slide 5

Quantization / Low-precision Inference

Slide 6

Quantization converts float numbers to int numbers
◼ [3.25 1.98 −2.11; 2.15 0.11 3.01; −0.25 1.55 1.33] (float32) → quantization → [3 2 −2; 2 0 3; 0 2 1] (int8)

Slide 7

Quantization often uses a coefficient (one float + many ints)
◼ [0.12 0.11 0.31; 0.25 −0.14 0.19; −0.13 0.21 0.09] (float32) → quantization → 0.1 (float32) × [1 1 3; 3 −1 2; −1 2 1] (int8)
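To make this concrete, here is a minimal PyTorch sketch of symmetric scale quantization: one float32 coefficient plus many int8 values. The max-abs choice of scale and the helper names (`quantize_per_tensor`, `dequantize`) are illustrative assumptions, not the exact scheme on the slide (which uses 0.1 as the scale).

```python
import torch

def quantize_per_tensor(w: torch.Tensor, num_bits: int = 8):
    """Symmetric quantization: one float32 scale + int8 values (illustrative max-abs scale)."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    scale = w.abs().max() / qmax                        # the single float coefficient
    q = torch.clamp(torch.round(w / scale), min=-qmax - 1, max=qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return scale * q.to(torch.float32)                  # w ≈ scale × q

w = torch.tensor([[0.12, 0.11, 0.31], [0.25, -0.14, 0.19], [-0.13, 0.21, 0.09]])
q, scale = quantize_per_tensor(w)
print(q)                          # int8 matrix
print(dequantize(q, scale) - w)   # small quantization error
```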

Slide 8

Why does quantization speed up inference?
◼ Why does quantization speed up inference? Note that the number of operations is the same.

Slide 9

Quantization speeds up inference in many ways
◼ Why does quantization speed up inference? Note that the number of operations is the same.
⚫ It reduces memory consumption → we can increase the batch size → throughput increases.
⚫ We can exploit SIMD (on CPU).
⚫ We can exploit Tensor Cores (on GPU).
⚫ Communication costs decrease.
⚫ Cache efficiency improves.

Slide 10

Quantization speeds up inference in many ways
◼ Why does quantization speed up inference? Note that the number of operations is the same.
⚫ It reduces memory consumption → we can increase the batch size → throughput increases.
⚫ We can exploit SIMD (on CPU).
⚫ We can exploit Tensor Cores (on GPU).
⚫ Communication costs decrease.
⚫ Cache efficiency improves.

Slide 11

High batch size increases throughput
◼ A model on GPU maps one input x to an output y in 10.0 ms; a batch of two inputs takes 10.1 ms.
◼ Almost the same time, as long as the parallelization is not saturated → throughput doubles.

Slide 12

Quantization loosens the memory limit → higher batch size
◼ We can't increase the batch size arbitrarily due to the memory limit.
◼ Quantization loosens the memory limit, so we can use larger batch sizes.

Slide 13

A high batch size speeds up training as well
◼ This is basically the case in training as well.
◼ The accuracy of a model is mostly determined by the number of samples it saw during training. (It depends on many other aspects, though; see, e.g., the tuning playbook: https://github.com/google-research/tuning_playbook)
◼ If we double the batch size, we can finish training in half the time.
◼ It is common to use FP16 (instead of FP32) or mixed precision (FP16 + FP32) during training.
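As a concrete example, a minimal mixed-precision training loop with PyTorch's `torch.cuda.amp` might look like the sketch below; the toy model, optimizer settings, and `loader` are placeholder assumptions.

```python
import torch
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()                      # rescales the loss to avoid FP16 underflow

for x, y in loader:                        # `loader` is assumed to yield CUDA tensors
    optimizer.zero_grad()
    with autocast():                       # forward pass runs in FP16 where it is safe
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()          # backward pass on the scaled loss
    scaler.step(optimizer)                 # unscales gradients, then optimizer step
    scaler.update()
```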

Slide 14

Quantization speeds up inference in many ways
◼ Why does quantization speed up inference? Note that the number of operations is the same.
⚫ It reduces memory consumption → we can increase the batch size → throughput increases.
⚫ We can exploit SIMD (on CPU).
⚫ We can exploit Tensor Cores (on GPU).
⚫ Communication costs decrease.
⚫ Cache efficiency improves.

Slide 15

Standard addition of 64-bit ints
◼ Two int64 operands go through a 64-bit Arithmetic Logic Unit (ALU) and produce one int64 sum, i.e., one addition per instruction.

Slide 16

Eight int8 additions can be done simultaneously
◼ The same 64-bit ALU can treat its two operands as eight packed int8 values each; with the carries between the 8-bit lanes turned off, it performs eight int8 additions at once.

Slide 17

SIMD parallelizes low-precision arithmetic
◼ Modern CPUs have 256-bit or 512-bit ALUs (vector units).
◼ They natively support simultaneous low-precision operations, e.g., AVX2.
◼ This kind of operation is called SIMD (Single Instruction, Multiple Data).
◼ AVX2 carries out 32 int8 operations simultaneously.

Slide 18

Quantization speeds up inference in many ways
◼ Why does quantization speed up inference? Note that the number of operations is the same.
⚫ It reduces memory consumption → we can increase the batch size → throughput increases.
⚫ We can exploit SIMD (on CPU).
⚫ We can exploit Tensor Cores (on GPU).
⚫ Communication costs decrease.
⚫ Cache efficiency improves.

Slide 19

Tensor Cores carry out tensor multiplications
◼ NVIDIA GPUs consist of CUDA Cores and Tensor Cores.
◼ Tensor Cores carry out matrix multiplications.
▲ from the A100 GPU whitepaper

Slide 20

Tensor Cores process more int8 values than floats at a time
◼ Tensor Cores can handle int8 more efficiently than float, for basically the same reason as CPU SIMD.
▲ from the A100 GPU whitepaper

Slide 21

We have two approaches: QAT and post-hoc quantization
◼ Two types of quantization (when to quantize):
⚫ Quantization-aware training (QAT)
⚫ Post-hoc quantization
◼ QAT is more effective than post-hoc quantization.
◼ We sometimes have models that are already trained; in this case, we need to resort to post-hoc quantization.

Slide 22

Speedup-aware training vs. postprocessing
◼ This picture, speedup-aware training vs. post-processing, is common to many of the speedup techniques that follow.
◼ Some methods cannot be applied in a post-processing way, though.

Slide 23

STE is a basic approach for QAT
◼ Basic approach for QAT: Straight-Through Estimator (STE) [Bengio+ 2013]
◼ Forward: hold the parameter in float, w = 2.14; quantize it to q = 2; multiply by the input x = 1.4 to get h = q · x = 2.8.

Slide 24

STE is a basic approach for QAT
◼ Basic approach for QAT: Straight-Through Estimator (STE) [Bengio+ 2013]
◼ Forward: the float parameter w = 2.14 is quantized to q = 2 and multiplied by x = 1.4 to give h = 2.8.
◼ Backward: the quantization step has no useful gradient. Given ∂L/∂h = 2.1 and ∂L/∂q = 1.5, STE simply copies the gradient through the quantizer: the estimate of ∂L/∂w is set to ∂L/∂q = 1.5.

Slide 25

STE allows us to continuously optimize parameters
◼ Intuitively, if we should increase q (the parameter after quantization), we should increase w.
◼ We can update parameters continuously by holding float parameters and using STE.
◼ In the example, ∂L/∂q = 1.5 suggests decreasing q by lr × 1.5, but lr × 1.5 < 0.5 is fractional and cannot be applied to the integer q directly.

Slide 26

STE allows us to continuously optimize parameters
◼ Intuitively, if we should increase q (the parameter after quantization), we should increase w.
◼ We can update parameters continuously by holding float parameters and using STE.
◼ Instead, we decrease the float parameter w by lr × 1.5. This may not change q immediately, but if the gradient tends to point in the same direction over iterations, the changes accumulate and q eventually changes.
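A minimal PyTorch sketch of the straight-through estimator: a custom autograd function that rounds in the forward pass and copies the gradient in the backward pass. The concrete numbers below are only illustrative.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round to the nearest integer in forward; pass the gradient straight through in backward."""
    @staticmethod
    def forward(ctx, w):
        return torch.round(w)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                 # STE: pretend dq/dw = 1

w = torch.tensor([2.14], requires_grad=True)   # float parameter we keep and update
x = torch.tensor([1.4])
h = RoundSTE.apply(w) * x                      # forward uses q = round(w) = 2, so h = 2.8
h.backward(torch.tensor([2.1]))                # suppose dL/dh = 2.1
print(w.grad)                                  # 2.1 * 1.4 = 2.94 flows back to the float w

# An equivalent common one-liner: w_q = w + (torch.round(w) - w).detach()
```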

Slide 27

Weight-only or weight-and-activation quantization
◼ Two types of quantization (where to quantize):
⚫ Weight-only quantization
⚫ Weight-and-activation quantization
◼ Weight-only quantization reduces the model size, but the computations stay in float.
→ It speeds up loading the model onto the GPU, but the computations themselves are not sped up much.
◼ Some models are too big to fit in small devices; a small model size is important for edge inference.

Slide 28

Power-of-two quantization speeds things up even more
◼ Some quantization methods quantize weights and batchnorm coefficients to powers of two (…, 1/4, 1/2, 1, 2, 4, …).
◼ In this case, multiplying weights by inputs can be done with bit shifts.
◼ This speeds up computation further.

Slide 29

Many options for how much to quantize
◼ Precision level (how much we quantize):
⚫ FP16, BF16 (16-bit float formats)
⚫ Int8
⚫ Binary
◼ The H100 GPU also supports FP8.
◼ I will review the above three choices in the following slides.

Slide 30

FP16 is easy to use. Good for the first try.
◼ FP16, BF16 (16-bit float): it is often okay to naively cast 32-bit models to FP16 or BF16. This is the first recommended option.
◼ Approximately 1.5× speedup is expected.
◼ Tensor Cores have supported FP16 since the Volta architecture (e.g., V100), and BF16 since Ampere. More speedup is expected when Tensor Cores are exploited.
◼ In PyTorch this is just model = model.half().
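For instance, a minimal sketch with a toy model (the layer sizes and batch size are arbitrary assumptions):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
).cuda().eval()
model = model.half()                              # cast all parameters to FP16

x = torch.randn(32, 512, device="cuda").half()    # inputs must be cast to FP16 as well
with torch.no_grad():
    y = model(x)
print(y.dtype)                                    # torch.float16
```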

Slide 31

Int8 is efficient. Good for precise tuning.
◼ Int8: it requires some sophisticated care to convert float32 models to int8, but the accuracy degradation is often negligible [Jacob+ CVPR 2018].
◼ A 2–4× speedup is expected.
◼ Tensor Cores support int8 operations from the Turing architecture onward (e.g., T4).
[Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.
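As one readily available option (post-hoc dynamic quantization, not the QAT scheme of [Jacob+ CVPR 2018]), PyTorch can quantize Linear layers to int8 for CPU inference; a minimal sketch with a toy model:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
).eval()

# Post-hoc (dynamic) quantization: Linear weights become int8, activations are
# quantized on the fly. Intended for CPU inference.
qmodel = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    y = qmodel(x)
```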

Slide 32

Binary is extremely efficient, but not precise.
◼ Binary (1-bit): it degrades accuracy even with sophisticated techniques.
◼ A 30–60× speedup is expected on CPU [Rastegari+ ECCV 2016].
◼ A 2–7× speedup is also reported on GPU [Courbariaux+ NeurIPS 2016].
◼ This option is recommended only when speed is crucial.
[Rastegari+ ECCV 2016] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.
[Courbariaux+ NeurIPS 2016] Binarized Neural Networks.

Slide 33

Reference
◼ Famous papers:
◼ [Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. PyTorch and TensorFlow quantization are based on this paper.

Slide 34

Reference
◼ [Courbariaux+ NeurIPS 2015] BinaryConnect: Training Deep Neural Networks with binary weights during propagations. This paper proposes weight-only binary quantization.
◼ [Courbariaux+ NeurIPS 2016] Binarized Neural Networks. This paper proposes weight-and-activation binary quantization on GPU.
◼ [Rastegari+ ECCV 2016] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. Weight-and-activation binary quantization on CPU.

Slide 35

Knowledge Distillation

Slide 36

Distillation turns large models into small ones
◼ Knowledge distillation: large model → small model.

Slide 37

Distillation has two steps
◼ Two steps of knowledge distillation:
1. Train the large model.
2. Train the small model using the output of the large model as the target.
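Step 2 is usually implemented with a soft-target loss. The sketch below is a common Hinton-style formulation; the temperature T and mixing weight alpha are assumed hyperparameters, not values from these slides.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft targets from the teacher (KL at temperature T) mixed with the usual hard loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale so the gradient magnitude matches
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# In the training loop (teacher = trained large model, kept frozen):
#   with torch.no_grad():
#       teacher_logits = teacher(x)
#   loss = distillation_loss(student(x), teacher_logits, y)
```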

Slide 38

Distillation is indirect
◼ Knowledge distillation looks roundabout. Why don't we just train the small model directly?

Slide 39

Distillation is indirect but necessary due to noise
◼ Knowledge distillation looks roundabout. Why don't we just train the small model directly?
◼ The training data are stochastic (noisy), while the output of the teacher model is deterministic. Small models struggle to fit noisy signals; it is easier for a small model to fit the deterministic signal.
[Ba+ NeurIPS 2014] Do Deep Nets Really Need to be Deep?

Slide 40

Pruning

Slide 41

Pruning sparsifies the weights
◼ [3.25 0.19 −2.11; 2.15 0.11 3.01; −0.25 1.55 0.12] → pruning → [3.25 0 −2.11; 2.15 0 3.01; 0 1.55 0]

Slide 42

Pruning has three steps. Finetuning is important.
◼ Three steps of pruning:
1. Train the model.
2. Prune the weights.
3. Finetune the model.
◼ During finetuning, the pruned weights are kept fixed at zero.
◼ Finetuning is important for accuracy.

Slide 43

Pruning based on magnitude is standard
◼ Various criteria for pruning have been proposed.
◼ The most basic one is magnitude pruning: prune the weights with small absolute values, i.e., prune w if |w| < ε.
◼ It is also popular to prune weights dynamically during training with Lasso-like L1 regularization.
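A minimal magnitude-pruning sketch with PyTorch's built-in pruning utilities; the 50% sparsity level and the toy model are assumptions.

```python
import torch
import torch.nn.utils.prune as prune

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
)

# Magnitude pruning: zero out the 50% of weights with the smallest |w| in each Linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# The pruning mask keeps pruned weights at zero during finetuning;
# prune.remove(module, "weight") afterwards makes the sparsity permanent.
```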

Slide 44

Structured pruning prunes channel/filter-wise
◼ Two types of pruning:
⚫ Non-structured pruning prunes dimension-wise (individual weights).
⚫ Structured pruning prunes channel/filter-wise.

Slide 45

A convolutional layer has F × C × 3 × 3 parameters
◼ A 3×3 convolutional layer convolves the C × H × W input with a kernel K of size F × C × 3 × 3 (F filters, C input channels) to produce an F × H × W output.
◼ Each filter has weights for each input channel.

Slide 46

Filter pruning uses K[f, :, :, :]
◼ Filter pruning imposes group sparsity with group K[f, :, :, :] for each f: the kernel K of size F × C × 3 × 3 shrinks to F⁻ × C × 3 × 3.
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.
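During training, this kind of group sparsity is typically induced by adding a group-lasso penalty over each filter to the task loss; a minimal sketch (the penalty weight 1e-4 is an assumed hyperparameter):

```python
import torch

def filter_group_lasso(K: torch.Tensor) -> torch.Tensor:
    """Group-lasso penalty sum_f ||K[f, :, :, :]||_2; drives whole filters toward zero."""
    return K.flatten(1).norm(dim=1).sum()

conv = torch.nn.Conv2d(16, 8, kernel_size=3)        # weight shape F x C x 3 x 3 = 8 x 16 x 3 x 3
penalty = 1e-4 * filter_group_lasso(conv.weight)    # add this term to the task loss
print(penalty)
```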

Slide 47

Channel pruning uses K[:, c, :, :]
◼ Channel pruning imposes group sparsity with group K[:, c, :, :] for each c: the kernel K of size F × C × 3 × 3 shrinks to F × C⁻ × 3 × 3.
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.

Slide 48

Shape pruning uses K[:, c, h, w]
◼ Shape pruning imposes group sparsity with group K[:, c, h, w] for each (c, h, w): the flattened kernel K of size F × (C × 3 × 3) shrinks to F × D.
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.

Slide 49

Can shape pruning exploit structured sparsity?
◼ Why shape pruning?
◼ It looks like non-structured pruning, and at first glance it seems not to exploit any structure.

Slide 50

Convolution is implemented as a matrix product
◼ Detailed flow of convolution:
1. Extract a patch for each position: the input X (C × H × W) becomes X' of size (H × W) × (C × 3 × 3) = HW × 9C.
2. Flatten the kernel: K (F × C × 3 × 3) becomes K' of size F × (C × 3 × 3) = F × 9C.
3. Matrix product: Y' = X' K'ᵀ, of size HW × F.
4. Unflatten Y' back to Y of size F × H × W.
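This flow can be checked directly in PyTorch with `F.unfold` (im2col); the sizes below are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 32, 32)                 # C = 16, H = W = 32
weight = torch.randn(8, 16, 3, 3)              # F = 8 filters

y_ref = F.conv2d(x, weight, padding=1)         # ordinary convolution

patches = F.unfold(x, kernel_size=3, padding=1)    # X' (transposed): (1, 9C, HW) = (1, 144, 1024)
k = weight.view(8, -1)                             # K': (F, 9C) = (8, 144)
y = (k @ patches).view(1, 8, 32, 32)               # matrix product, then unflatten to F x H x W

print(torch.allclose(y, y_ref, atol=1e-4))         # True: convolution == matrix product
```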

Slide 51

Shape pruning reduces the number of columns
◼ Shape pruning lets us reduce the number of columns in the matrix product: delete the pruned columns of X' (HW × 9C) to get X'' (HW × D), multiply by the pruned kernel K⁻ (F × D), and unflatten to Y (F × H × W).

Slide 52

Non-structured pruning is mainly for CPU
◼ Non-structured pruning can reach higher sparsity while keeping accuracy, thanks to its fine-grained resolution, but it is not suitable for GPUs. It is mainly for CPU inference.

Slide 53

Structured pruning may not be effective
◼ Structured pruning effectively exploits GPU parallelization, but it can degrade accuracy, or only a few filters can be pruned.

Slide 54

Sparse operations are available from A100 onward
◼ Update: we can now use non-structured pruning for GPU inference.
◼ Sparse matrix multiplication is supported from the Ampere architecture (e.g., A100) onward.
▲ from the A100 GPU whitepaper

Slide 55

Pruning is indirect
◼ Pruning looks roundabout. Why don't we use small and dense models from the start?

Slide 56

Pruning is indirect but necessary due to overparameterization
◼ Pruning looks roundabout. Why don't we use small and dense models from the start?
◼ DNNs are overparameterized.
◼ Overparameterization eases optimization and generalization (cf. the Lottery Ticket Hypothesis, double descent).
◼ We can prune the unnecessary weights after optimization.

Slide 57

Structured pruning is sometimes meaningless
◼ Criticism of structured pruning: "For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights." [Liu+ ICLR 2019]
◼ In this case, pruning is just roundabout and meaningless.
[Liu+ ICLR 2019] Rethinking the Value of Network Pruning.

Slide 58

Low-rank approximation

Slide 59

W is approximated by two slim matrices A and B
◼ Low-rank approximation: W ≈ A × B, where A and B are slim (tall-and-thin / short-and-wide) matrices.
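A minimal sketch of replacing a linear layer with a rank-r factorization obtained by truncated SVD; the layer size and rank are assumptions.

```python
import torch

linear = torch.nn.Linear(1024, 1024, bias=False)
W = linear.weight.data                             # (1024, 1024)

rank = 64
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]                         # (1024, rank)
B = Vh[:rank, :]                                   # (rank, 1024)

# Two slim layers compute x -> x B^T A^T ≈ x W^T.
low_rank = torch.nn.Sequential(
    torch.nn.Linear(1024, rank, bias=False),       # weight = B
    torch.nn.Linear(rank, 1024, bias=False),       # weight = A
)
low_rank[0].weight.data = B
low_rank[1].weight.data = A
# Parameters: 1024 * 1024 ≈ 1.05M  ->  2 * 1024 * 64 ≈ 0.13M
```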

Slide 60

CNN filters have low-rank structure
◼ The filters of the first layer of a CNN are continuous and not very diverse (cf. Gabor filters); we can even guess masked values, i.e., the parameters are redundant (see the filter visualization in the AlexNet paper, ImageNet Classification with Deep Convolutional Neural Networks).
◼ We can effectively approximate them with fairly low-rank matrices [Denton+ NeurIPS 2014, Denil+ NeurIPS 2013].
[Denil+ NeurIPS 2013] Predicting Parameters in Deep Learning.
[Denton+ NeurIPS 2014] Exploiting Linear Structure within Convolutional Networks for Efficient Evaluation.

Slide 61

Reducing the computation of attention
◼ Next, let's consider attention: Y = Softmax(QKᵀ)V.

Slide 62

Computing A is time-consuming
◼ Next, let's consider attention: Y = Softmax(QKᵀ)V = AV.
◼ Computing the attention matrix A = Softmax(QKᵀ), which is n × n, is time-consuming.

Slide 63

A behaves like a Gram matrix
◼ A_ij measures the similarity of Q_i and K_j; A is similar to a Gram matrix.
◼ Let's assume A is the Gram matrix of the Gaussian kernel for a moment.

Slide 64

A Gram matrix can be approximated effectively
◼ A Gram matrix can be approximated by, e.g., random features [Rahimi+ NeurIPS 2008] or the Nyström approximation.
◼ Y = Softmax(QKᵀ)V ≈ Q′K′ᵀV, where Q′ and K′ are the random features of Q and K.
[Rahimi+ NeurIPS 2008] Random Features for Large-Scale Kernel Machines.

Slide 65

Random features can also be applied to attention
◼ The attention matrix is not the Gram matrix of the Gaussian kernel, though.
◼ Fortunately, almost the same methods can be applied to the attention matrix:
⚫ FAVOR+ and Performers [Choromanski+ ICLR 2021]
⚫ Random Feature Attention [Peng+ ICLR 2021]
[Choromanski+ ICLR 2021] Rethinking Attention with Performers.
[Peng+ ICLR 2021] Random Feature Attention.

Slide 66

Attention with random features runs in linear time
◼ Approximation of Y = Attention(Q, K, V):
1. Compute random features: Q′ = ψ(Q) ∈ ℝ^{n×d′}, K′ = ψ(K) ∈ ℝ^{n×d′}.
2. Multiply the slim matrices: Ŷ = Q′K′ᵀV.
◼ Time complexity: O(n d′(d + d′)) → linear w.r.t. n.

Slide 67

Derivation of random features for attention
◼ Let φ(x) ∈ ℝ^{d′} be the random features for the Gaussian kernel: φ(q)ᵀφ(k) ≈ exp(−‖q − k‖²/2).
◼ Let ψ(x) = exp(‖x‖²/2) φ(x). Then ψ(q)ᵀψ(k) ≈ exp(qᵀk), the unnormalized attention weight.

Slide 68

The normalization term can be computed in linear time
◼ The normalization constant is Z = Σ_i exp(qᵀk_i) ≈ Σ_i ψ(q)ᵀψ(k_i) = ψ(q)ᵀ Σ_i ψ(k_i) (random feature approximation).
◼ The sum Σ_i ψ(k_i) takes linear time, but it can be reused for all other q's → linear time in total.
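Putting the pieces together, a simplified sketch of random-feature attention in PyTorch. The feature map below is a plain positive random feature (ψ(x) = exp(wᵀx − ‖x‖²/2)); actual Performers add orthogonal projections and numerical stabilization, and all sizes here are assumptions.

```python
import torch

def psi(X, proj):
    """Positive random features: psi(q)^T psi(k) ≈ exp(q^T k) in expectation."""
    return torch.exp(X @ proj - X.pow(2).sum(-1, keepdim=True) / 2) / proj.shape[1] ** 0.5

def linear_attention(Q, K, V, proj):
    Qp, Kp = psi(Q, proj), psi(K, proj)       # Q', K': (n, d')
    context = Kp.T @ V                        # (d', d): the n x n matrix A is never formed
    Z = Qp @ Kp.sum(dim=0)                    # (n,): Z_i ≈ sum_j exp(q_i^T k_j), the reusable sum
    return (Qp @ context) / Z.unsqueeze(-1)

n, d, d_prime = 1000, 32, 128
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
proj = torch.randn(d, d_prime)                # shared random projection for Q and K
Y = linear_attention(Q, K, V, proj)           # O(n d'(d + d')) rather than O(n^2 d)
```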

Slide 69

Efficient Architectures

Slide 70

A traditional filter has weights for each channel
◼ A traditional 3×3 convolutional layer: the C × H × W input is convolved with a kernel of size F × C × 3 × 3 (F filters, C channels) to produce an F × H × W output; each filter has weights for every input channel.

Slide 71

MobileNets separate spatial and inter-channel interactions
◼ MobileNets replace a standard convolution with:
⚫ a depthwise convolution K1 of size C × 3 × 3 (each channel convolved independently; spatial interaction), followed by
⚫ a 1×1 (pointwise) convolution K2 of size F × C × 1 × 1 (inter-channel interaction).
◼ Compared with a standard convolution, the depthwise part is roughly a 1/F reduction and the pointwise part roughly a 1/9 reduction in computation.
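A sketch of the two variants in PyTorch, comparing parameter counts; the channel sizes are assumed for illustration.

```python
import torch.nn as nn

C, F = 64, 128

standard = nn.Conv2d(C, F, kernel_size=3, padding=1)          # F x C x 3 x 3 weights

depthwise_separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C),      # depthwise: spatial interaction only
    nn.Conv2d(C, F, kernel_size=1),                           # pointwise 1x1: inter-channel interaction
)

n_std = sum(p.numel() for p in standard.parameters())
n_sep = sum(p.numel() for p in depthwise_separable.parameters())
print(n_std, n_sep)     # ~73.9k vs ~9.0k parameters: roughly an 8x reduction here
```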

Slide 72

FLOPs is not a good evaluation metric
◼ Efficient architectures are sometimes evaluated by the number of FLOPs.
◼ Criticism: "FLOPs is not a good indicator of latency on modern hardware." [Bello+ NeurIPS 2021]
[Bello+ NeurIPS 2021] Revisiting ResNets: Improved Training and Scaling Strategies.

Slide 73

Use wall-clock time, not FLOPs
◼ Tensor Cores run dense matrix multiplications much faster than other operations.
◼ Complicated approximation methods may be slower than straightforward computation with dense matrix products.
◼ Evaluate speed by wall-clock time, not by FLOPs.
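A minimal sketch of measuring wall-clock latency on GPU. Since CUDA kernels launch asynchronously, warm-up and synchronization around the timed region are essential; the warm-up and repeat counts are assumptions.

```python
import torch

def measure_latency_ms(model, x, n_warmup=10, n_runs=100):
    """Average wall-clock time per forward pass, in milliseconds."""
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):                    # warm up caches, allocator, autotuners
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_runs):
            model(x)
        end.record()
        torch.cuda.synchronize()                     # wait until all kernels have finished
    return start.elapsed_time(end) / n_runs
```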

Slide 74

Combination

Slide 75

Different types of techniques can be combined
◼ Speedup techniques can be combined.
◼ Deep Compression [Han+ ICLR 2016] combines pruning with quantization.
◼ The effects may diminish, though. E.g., ResNet is redundant, so we can achieve sizable compression easily; MobileNets are already efficient, so it is challenging to compress them further [Jacob+ CVPR 2018].
[Han+ ICLR 2016] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.
[Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.

Slide 76

Summary

Slide 77

Quantization converts float numbers to int numbers
◼ [0.12 0.11 0.31; 0.25 −0.14 0.19; −0.13 0.21 0.09] (float32) → quantization → 0.1 (float32) × [1 1 3; 3 −1 2; −1 2 1] (int8)

Slide 78

Distillation turns large models into small ones
◼ Knowledge distillation: large model → small model.

Slide 79

Pruning turns off some weights
◼ Pruning zeroes out (turns off) some of the weights.

Slide 80

Low-rank approximation uses slim matrices
◼ Low-rank approximation: W ≈ A × B, where A and B are slim matrices.

Slide 81

Choose speedup methods based on your goal
◼ There are various speedup techniques.
◼ FP16 is the easiest. Try it first.
◼ If you haven't trained your model yet, efficient architectures (such as MobileNets, Performers, MoE, etc.) are worth trying.
◼ It is important to set a goal before tuning.
◼ It is also important to measure speed by wall-clock time, not FLOPs.