
# How to Speed up Deep Learning Models

I introduce various techniques to speed up deep learning models. July 12, 2023

## Transcript

1. 1 KYOTO UNIVERSITY
KYOTO UNIVERSITY
How to Speed up Deep Learning Models
Ryoma Sato

2. 2 KYOTO UNIVERSITY
We’ll see how to speed up deep learning models

In this tutorial, I introduce various methods to
speed up deep learning models.

As models get bigger and bigger nowadays,
it is important to speed up models to reduce cost
(both money and time).

3. 3 KYOTO UNIVERSITY
We focus on speeding up inference

There are two types of speedups.

Speeding up training

Speeding up inference ★

We focus on inference in this tutorial.

Note some of the techniques are also used for
speeding up training.

4. 4 KYOTO UNIVERSITY
We consider both CPU and GPU

There are two types of devices

Inference on GPU ★

Inference on CPU ★

We consider both cases.

Approaches can be quite different in these two cases.

Note that inference on CPU is common, and inference
on GPU is also common,
whereas training is mainly carried out on GPU.

5. 5 KYOTO UNIVERSITY
Quantization / Low-precision Inference

6. 6 KYOTO UNIVERSITY
Quantization converts float numbers to int numbers
[Figure: a float32 matrix
  [[3.25, 1.98, −2.11], [2.15, 0.11, 3.01], [−0.25, 1.55, 1.33]]
is quantized into the int8 matrix
  [[3, 2, −2], [2, 0, 3], [0, 2, 1]].]

7. 7 KYOTO UNIVERSITY
Quant. often uses a coefficient (one float + many int)
[Figure: a float32 matrix
  [[0.12, 0.11, 0.31], [0.25, −0.14, 0.19], [−0.13, 0.21, 0.09]]
is quantized into a single float32 coefficient 0.1 times the int8 matrix
  [[1, 1, 3], [3, −1, 2], [−1, 2, 1]].]
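
A minimal sketch of this scale-based scheme in PyTorch (symmetric, per-tensor quantization; the helper names are illustrative):

```python
import torch

def quantize(w: torch.Tensor):
    """Symmetric per-tensor quantization: one float32 scale + int8 values."""
    scale = w.abs().max() / 127                    # the single float coefficient
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale                       # approximate reconstruction of w

w = torch.tensor([[0.12, 0.11, 0.31],
                  [0.25, -0.14, 0.19],
                  [-0.13, 0.21, 0.09]])
q, scale = quantize(w)                             # int8 matrix + one float scale
```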

8. 8 KYOTO UNIVERSITY
Why does quantization speed up inference?

Why does quantization speed up inference?
Note the number of operations is the same.

9. 9 KYOTO UNIVERSITY
Quantization speeds up inference in many ways

Why does quantization speed up inference?
Note the number of operations is the same.

It reduces memory consumption
→ We can increase the batchsize.
→ Throughput increases.

We can exploit SIMD (on CPU)

We can exploit Tensor Cores (on GPU)

Communication costs are reduced.

Cache efficiency improves.

10. 10 KYOTO UNIVERSITY
Quantization speeds up inference in many ways

Why does quantization speed up inference?
Note the number of operations is the same.

It reduces memory consumption
→ We can increase the batchsize.
→ Throughput increases.

We can exploit SIMD (on CPU)

We can exploit Tensor Cores (on GPU)

Communication costs are reduced.

Cache efficiency improves.

11. 11 KYOTO UNIVERSITY
High batchsize increases throughput
[Figure: passing one input x through the model on GPU takes 10.0 msec;
passing a batch of two inputs takes 10.1 msec.]
Almost the same time, given that
parallelization is not saturated.
→ Throughput doubles.

12. 12 KYOTO UNIVERSITY
Quant. loosens the memory limit -> higher batchsize

We can’t increase batchsize arbitrarily due to the
memory limit.

Quantization loosens the memory limit,
and we can use higher batch sizes.

13. 13 KYOTO UNIVERSITY
High batchsize speeds up training as well

This is basically the case in training as well.

The accuracy of a model is mostly determined by
the number of samples it saw during training.
It depends on many other aspects, though.
See, e.g., the Tuning Playbook.

If we double the batchsize, we can finish training
in half the time.

It is common to use FP16 (instead of FP32) or
mixed precision (FP16 + FP32) during training.

14. 14 KYOTO UNIVERSITY
Quantization speeds up inference in many ways

Why does quantization speed up inference?
Note the number of operations is the same.

It reduces memory consumption
→ We can increase the batchsize.
→ Throughput increases.

We can exploit SIMD (on CPU)

We can exploit Tensor Cores (on GPU)

Communication costs are reduced.

Cache efficiency improves.

15. 15 KYOTO UNIVERSITY
Standard addition of 64 bit ints
[Figure: two int64 operands are added on a 64-bit Arithmetic Logic Unit
(ALU), producing one int64 result.]

16. 16 KYOTO UNIVERSITY
Eight int8 additions can be done simultaneously
[Figure: the same 64-bit ALU adds eight packed int8 values at once when
carry propagation between the int8 lanes is turned off.]

17. 17 KYOTO UNIVERSITY
SIMD parallelizes low-precision arithmetic.

Modern CPUs have 256 bit or 512 bit ALUs.

They support simultaneous low-precision
operations natively, e.g., AVX2.

This kind of operation is called SIMD
(Single Instruction, Multiple Data).

AVX2 carries out 32 int8 operations
simultaneously.

18. 18 KYOTO UNIVERSITY
Quantization speeds up inference in many ways

Why does quantization speed up inference?
Note the number of operations is the same.

It reduces memory consumption
→ We can increase the batchsize.
→ Throughput increases.

We can exploit SIMD (on CPU)

We can exploit Tensor Cores (on GPU)

Communication costs are reduced.

Cache efficiency improves.

19. 19 KYOTO UNIVERSITY
Tensor cores carry out matrix multiplications

NVIDIA GPUs consist of CUDA cores and Tensor Cores.

Tensor Cores carry out matrix multiplications.
▲ from the whitepaper of A100 GPU

20. 20 KYOTO UNIVERSITY
Tensor cores process more int8 values than floats at a time

Tensor Cores can handle int8 more efficiently
than float.
This is due to basically the same reason as in
CPU SIMD.
▲ from the whitepaper of A100 GPU

21. 21 KYOTO UNIVERSITY
We have two approaches: QAT and postprocessing

Two types of quantization (When to quantize)

Quantization aware training (QAT)

Post-hoc quantization

QAT is more effective than postprocessing.

We sometimes have models that are already
trained. In this case, we have to resort to
post-hoc quantization.

22. 22 KYOTO UNIVERSITY
Speedup-aware training vs postprocessing

This contrast, speedup-aware training vs
post-hoc speedup, recurs in many of
the following speedup techniques.

Some methods cannot be applied in a post-processing way, though.

23. 23 KYOTO UNIVERSITY
STE is a basic approach for QAT

Basic Approach for QAT:
Straight-Through Estimator (STE) [Bengio+ 2013]
Forward: the parameter we hold (in float) is w = 2.14.
It is quantized to q = 2 and multiplied with the input x = 1.4,
giving h = 2.8.

24. 24 KYOTO UNIVERSITY
STE is a basic approach for QAT

Basic Approach for QAT:
Straight-Through Estimator (STE) [Bengio+ 2013]
Forward: w = 2.14 → (quantize) → q = 2 → (mult with x = 1.4) → h = 2.8.

Backward: ∂L/∂h = 2.1 is backpropagated through the multiplication
to ∂L/∂q = 1.5; the STE simply copies it past the quantization step,
so ∂L/∂w = ∂L/∂q = 1.5.

25. 25 KYOTO UNIVERSITY
STE allows us to continuously optimize parameters

Intuitively, if we should increase q (parameter
after quantization), we should increase w.

We can continuously update parameters by
holding float parameters and with STE.
[Figure: the same computation graph as before, with ∂L/∂h = 2.1,
∂L/∂q = 1.5, and ∂L/∂w = 1.5 (copied by the STE).]
This suggests decreasing q by (lr × 1.5).
But (lr × 1.5) < 0.5 is fractional and cannot be applied to the integer q directly.

26. 26 KYOTO UNIVERSITY
STE allows us to continuously optimize parameters

Intuitively, if we should increase q (parameter
after quantization), we should increase w.

We can continuously update parameters by
holding float parameters and with STE.
[Figure: the same computation graph, with ∂L/∂q = 1.5 copied to ∂L/∂w = 1.5 by the STE.]
This suggests decreasing q by (lr × 1.5).
But (lr × 1.5) < 0.5 is fractional.
Instead, we decrease the float parameter w by this much.
A single update may not change q, but if the gradient tends to be
positive over iterations, the changes accumulate in w and q will
eventually change.
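
A minimal PyTorch sketch of an STE round function (illustrative; real QAT also handles scales and clipping):

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Quantize (round) in the forward pass, pass the gradient straight through."""

    @staticmethod
    def forward(ctx, w):
        return torch.round(w)          # e.g., w = 2.14 -> q = 2

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output             # STE: copy dL/dq into dL/dw unchanged

w = torch.tensor([2.14], requires_grad=True)   # float parameter we keep holding
x = torch.tensor([1.4])
h = RoundSTE.apply(w) * x                      # forward: h = 2 * 1.4 = 2.8
h.backward(torch.tensor([2.1]))                # pretend dL/dh = 2.1
print(w.grad)                                  # dL/dw = dL/dq, so w keeps moving continuously
```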

27. 27 KYOTO UNIVERSITY
Weight only or weight and activation quantization

Two types of quantization (Where to quantize)

Weight-only Quantization

Weight and Activation Quantization

Weight-only quantization reduces the model size,
but the computations are still carried out in float,
so the computations themselves are not sped up much.

Some models are too big to fit in small devices.
Small model size is important in edge inference.

28. 28 KYOTO UNIVERSITY
Power-of-two quantization speeds up even more.

Some quantization methods quantize weights and
coefficients of batchnorm into powers of two
(…, 1/4, 1/2, 1, 2, 4, …).

In this case, the multiplication of weights and
inputs can be done with bit shift.

It further speeds up computation.
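
As a toy illustration of why power-of-two weights help (the integer values are hypothetical):

```python
x = 13                       # an integer activation
k = 3                        # weight quantized to 2 ** k = 8
y_mult = x * (2 ** k)        # ordinary multiplication: 104
y_shift = x << k             # the same product via a left bit shift: 104
assert y_mult == y_shift
```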

29. 29 KYOTO UNIVERSITY
Many options for how much we quantize

Precision level (How much we quantize)

FP16, BF16 (16-bit float formats)

Int8

Binary

H100 GPU also supports FP8.

I will review the above three choices in the
following slides.

30. 30 KYOTO UNIVERSITY
FP16 is easy to use. Good for the first try.

FP16, BF16 (16-bit float)
It is often okay to naively cast 32-bit models to
FP16 or BF16.
This is the first recommended option.

Approx. 1.5x speedup is expected.

Tensor Cores have supported FP16 since the Volta
architecture (e.g., V100); BF16 is supported from Ampere (e.g., A100).
More speedup is expected when Tensor Cores are
exploited.
Just model = model.half() in PyTorch.
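
A minimal sketch of FP16 inference in PyTorch (the toy architecture is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model = model.cuda().eval().half()               # cast all parameters to FP16

x = torch.randn(8, 512, device="cuda").half()    # inputs must be FP16 as well
with torch.no_grad():
    y = model(x)                                 # runs in half precision
```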

31. 31 KYOTO UNIVERSITY
Int8 is efficient. Good for precise tuning.

Int8
It requires some sophisticated care to convert
float32 models to int8.
The degradation of accuracy is often negligible.
[Jacob+ CVPR 2018]

2 ~ 4x speedup is expected.

Tensor Cores support Int8 operations from the
Turing architecture (e.g., T4).
[Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.
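
One post-hoc option in PyTorch is dynamic int8 quantization of linear layers (a sketch; QAT in the style of [Jacob+ CVPR 2018] needs a more involved setup):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Post-hoc (dynamic) quantization: weights are stored as int8 and
# activations are quantized on the fly at inference time (CPU backend).
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 512)
with torch.no_grad():
    y = qmodel(x)
```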

32. 32 KYOTO UNIVERSITY
Binary is extremely efficient, but not precise.

Binary (1bit)
It degrades the accuracy even with sophisticated
techniques.

30 ~ 60x speedup is expected on CPU [Rastegari+
ECCV 2016].

A 2~7x speedup is also reported on GPU [Courbariaux+ NeurIPS 2016].

This option is recommended only when the speed
is crucial.
[Rastegari+ ECCV 2016] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.
[Courbariaux+ NeurIPS 2016] Binarized Neural Networks.

33. 33 KYOTO UNIVERSITY
Reference

Famous papers:

[Jacob+ CVPR 2018] Quantization and Training of Neural
Networks for Efficient Integer-Arithmetic-Only Inference.
PyTorch and TensorFlow quantization is based on this paper.

34. 34 KYOTO UNIVERSITY
Reference

[Courbariaux+ NeurIPS 2015] BinaryConnect: Training
Deep Neural Networks with binary weights during
propagations.
This paper proposes weight-only binary quantization.

[Courbariaux+ NeurIPS 2016] Binarized Neural Networks.
This paper proposes weight-and-activation binary
quantization on GPU.

[Rastegari+ ECCV 2016] XNOR-Net: ImageNet
Classification Using Binary Convolutional Neural
Networks.
Weight-and-activation binary quantization on CPU.

35. 35 KYOTO UNIVERSITY
Knowledge Distillation

36. 36 KYOTO UNIVERSITY
Distillation turns large models into small ones.
Knowledge Distillation
[Figure: a large model is distilled into a small model.]

37. 37 KYOTO UNIVERSITY
Distillation has two steps

Two steps of knowledge distillation:
1. Train the large model.
2. Train the small model using the output of the
   large model as the target (see the sketch below).
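
A minimal sketch of the loss used in step 2 (the temperature T and mixing weight alpha are hypothetical choices):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft target: match the teacher's softened output distribution.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard target: the usual cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```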

38. 38 KYOTO UNIVERSITY
Distillation is indirect

Why don’t we train only the small model?

39. 39 KYOTO UNIVERSITY
Distillation is indirect but necessary due to noise

Why don’t we train only the small model?
[Ba+ NeurIPS 2014] Do Deep Nets Really Need to be Deep?
[Figure: the training data is stochastic (noisy), whereas the output of the
teacher model is deterministic.]
Small models struggle to fit noisy signals.
It is easier for the small model to fit the deterministic signal.

40. 40 KYOTO UNIVERSITY
Pruning

41. 41 KYOTO UNIVERSITY
Pruning sparsifies the weights
[Figure: the weight matrix
  [[3.25, 0.19, −2.11], [2.15, 0.11, 3.01], [−0.25, 1.55, 0.12]]
is pruned into
  [[3.25, 0, −2.11], [2.15, 0, 3.01], [0, 1.55, 0]].]

42. 42 KYOTO UNIVERSITY
Pruning has three steps. Finetuning is important.

Three steps for pruning:
1. Train the model.
2. Prune the weights.
3. Finetune the model.
   During finetuning, the pruned weights are fixed to zero.

Finetuning is important for accuracy.

43. 43 KYOTO UNIVERSITY
Pruning based on magnitude is standard

Various criteria for pruning have been proposed.

The most basic one is magnitude pruning:
prune the weights that have small absolute values,
i.e., prune w if |w| < ε.

It is also popular to dynamically prune weights
during training with Lasso-like L1 regularization.
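
A sketch of magnitude pruning with PyTorch's pruning utility (the 50% sparsity level is a hypothetical choice):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Magnitude (L1) pruning: zero out the 50% of weights with the smallest
# absolute values; the pruning mask is kept and re-applied afterwards.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# During finetuning the mask keeps the pruned entries at zero,
# so only the surviving weights are effectively updated.
print((layer.weight == 0).float().mean())   # ≈ 0.5
```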

44. 44 KYOTO UNIVERSITY
Structured pruning prunes channel/filter-wise

Two types of pruning

Non-structured pruning prunes dimension-wise

Structured pruning prunes channel/filter-wise

45. 45 KYOTO UNIVERSITY
Convolutional layer has F x C x 3 x 3 parameters

3x3 Convolutional Layer
[Figure: the input has shape C × H × W (C = number of channels); the
kernel K has shape F × C × 3 × 3 (F = number of filters; each filter has
weights for each channel); the convolution produces an output of shape
F × H × W.]

46. 46 KYOTO UNIVERSITY
Filter pruning uses K[f, :, :, :]

Filter pruning imposes group sparsity with group
K[f, :, :, :] for each f.
[Figure: filter pruning shrinks K of shape F × C × 3 × 3
to K⁻ of shape F⁻ × C × 3 × 3.]
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.

47. 47 KYOTO UNIVERSITY
Channel pruning uses K[:, c, :, :]

Channel pruning imposes group sparsity with
group K[:, c, :, :] for each c.
[Figure: channel pruning shrinks K of shape F × C × 3 × 3
to K⁻ of shape F × C⁻ × 3 × 3.]
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.

48. 48 KYOTO UNIVERSITY
Shape pruning uses K[:, c, h, w]

Shape pruning imposes group sparsity with group
K[:, c, h, w] for each (c, h, w).
[Figure: shape pruning shrinks K of shape F × (C × 3 × 3)
to K⁻ of shape F × D.]
[Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.

49. 49 KYOTO UNIVERSITY
Can shape pruning exploit structured sparsity?

Why shape pruning?

At first glance, it looks like non-structured pruning
and does not seem to exploit any structure.

50. 50 KYOTO UNIVERSITY
Convolution is implemented by matrix product

Detailed flow of convolution:
[Figure: the input X of shape C × H × W is turned into a patch matrix X'
of shape (H × W) × (C × 3 × 3) = HW × 9C by extracting a patch for each
position; the kernel K is flattened into K' of shape F × (C × 3 × 3) = F × 9C;
the matrix product of K' and X'ᵀ gives Y' of shape F × HW, which is
unflattened into the output Y of shape F × H × W.]
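
A sketch of this flow with torch.nn.functional.unfold (shapes follow the figure; padding=1 keeps the spatial size H × W):

```python
import torch
import torch.nn.functional as F

C, H, W, F_out = 3, 8, 8, 16
x = torch.randn(1, C, H, W)
K = torch.randn(F_out, C, 3, 3)

# Extract a 3x3 patch for each position: X' has shape (1, 9C, HW).
patches = F.unfold(x, kernel_size=3, padding=1)

# Flatten the kernel into K' of shape (F, 9C), then one matrix product.
K_flat = K.reshape(F_out, -1)
y_flat = K_flat @ patches[0]                          # (F, HW)

# Unflatten back into (F, H, W); this matches F.conv2d(x, K, padding=1).
y = y_flat.reshape(F_out, H, W)
```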

51. 51 KYOTO UNIVERSITY
Shape pruning reduces the number of columns

We can reduce the number of columns in matrix
product by shape pruning.
[Figure: as before, patches X' of shape HW × 9C are extracted from the
input X; shape pruning deletes columns to give X'' of shape HW × D, which
is multiplied with K⁻ of shape F × D to give Y' and, after unflattening,
the output Y of shape F × H × W.]

52. 52 KYOTO UNIVERSITY
Non-structured pruning is mainly for CPU

Non-structured pruning
can reach higher sparsity while keeping accuracy,
thanks to its fine-grained resolution,
but it is not suitable for GPUs.
It is mainly for CPU inference.

53. 53 KYOTO UNIVERSITY
Structured pruning may not be effective

Structured pruning
effectively exploits GPU parallelization,
but often only a few filters can be pruned while keeping accuracy.

54. 54 KYOTO UNIVERSITY
Sparse operations are available from A100 onward

Update: Now we can use non-structured pruning
for GPU inference

Sparse matrix multiplication is supported from the
Ampere architecture (e.g., A100).
▲ from the whitepaper of A100 GPU

55. 55 KYOTO UNIVERSITY
Pruning is indirect

Why don’t we use small and dense models?

56. 56 KYOTO UNIVERSITY
Pruning is indirect but necessary due to overparam.

Why don’t we use small and dense models?

DNN is overparameterized.

Overparametrization eases optimization &
generalization
(cf. Lottery Ticket Hypothesis, double descent).

We can prune unnecessary weights after
optimization.

57. 57 KYOTO UNIVERSITY
Structured pruning is sometimes meaningless

Criticism on structured pruning:

“For all state-of-the-art structured pruning
algorithms we examined, fine-tuning a pruned
model only gives comparable or worse
performance than training that model with
randomly initialized weights.” — Zhuang Liu [Liu+ ICLR 2019]

In this case, pruning is just roundabout and
meaningless.
[Liu+ ICLR 2019] Rethinking the Value of Network Pruning.

58. 58 KYOTO UNIVERSITY
Low-rank approximation

59. 59 KYOTO UNIVERSITY
W is approximated by two slim matrices A and B
[Figure: W is approximated by the product A × B of two slim matrices.]
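
A sketch of factorizing a dense layer via truncated SVD (the rank r is a hypothetical choice):

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512, bias=False)
W = layer.weight.data                       # (512, 512)

r = 64                                      # target rank
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]                        # (512, r) slim matrix
B = Vh[:r, :]                               # (r, 512) slim matrix, W ≈ A @ B

# Replace the dense layer by two slim ones (~4x fewer parameters here).
low_rank = nn.Sequential(nn.Linear(512, r, bias=False),
                         nn.Linear(r, 512, bias=False))
low_rank[0].weight.data = B
low_rank[1].weight.data = A
```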

60. 60 KYOTO UNIVERSITY
CNN filters have low-rank structures

Filters of the first layer of CNNs are continuous
and not so diverse (cf. Gabor filter)

We can effectively approximate them with fairly
low rank matrices [Denton+ NeurIPS 2014, Denil+ NeurIPS 2013]
[Denil+ NeurIPS 2013] Predicting Parameters in Deep Learning.
[Denton+ NeurIPS 2014] Exploiting Linear Structure within Convolutional Networks for
Efficient Evaluation.
▲ from the AlexNet paper, ImageNet Classification with Deep Convolutional Neural Networks.
[Figure annotation: neighboring filter weights can be guessed from one
another → the parameters are redundant.]

61. 61 KYOTO UNIVERSITY
Reducing computation of attention

Next, let’s consider attention
Y = Softmax(QKᵀ) V

62. 62 KYOTO UNIVERSITY
Computation of A is time consuming

Next, let’s consider attention

Computing the attention matrix A (which is n × n)
is time-consuming.
Y = Softmax(QKᵀ) V = A V

63. 63 KYOTO UNIVERSITY
A behaves like a Gram matrix

Next, let’s consider attention

A_ij measures the similarity of Q_i and K_j.
It’s similar to a Gram matrix.
Let’s assume A is the Gram matrix of the
Gaussian kernel for a moment.
Y = Softmax(QKᵀ) V = A V
64. 64 KYOTO UNIVERSITY
Gram matrix can be effectively approximated

Next, let’s consider attention

A Gram matrix can be approximated by, e.g.,
random features [Rahimi+ NeurIPS 2008] and
Nystrom approximation.
Y = Softmax(QKᵀ) V ≈ Q′ K′ᵀ V,
where Q′ and K′ are the random features of Q and K.
[Rahimi+ NeurIPS 2008] Random Features for Large-Scale Kernel Machines.
65. 65 KYOTO UNIVERSITY
Random features can also be applied to attention

An attention matrix is not a Gram matrix of the
Gaussian kernel, though.

Fortunately, almost the same methods can be
applied to the attention matrix.

FAVOR+ and Performers [Choromanski+ ICLR 2021]

Random Feature Attention [Peng+ ICLR 2021]
[Choromanski+ ICLR 2021] Rethinking Attention with Performers.
[Peng+ ICLR 2021] Random Feature Attention.

66. 66 KYOTO UNIVERSITY
Attention with random features runs in linear time

Approximation of Y = Attention(Q, K, V):
1. Compute random features: Q′ = ψ(Q) ∈ ℝ^{n×d′}, K′ = ψ(K) ∈ ℝ^{n×d′}
2. Multiply the slim matrices: Y ≈ Q′ K′ᵀ V
Time complexity: O(n d′ (d + d′)) → linear w.r.t. n
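
A shape-level sketch of this two-step procedure (psi stands for the feature map ψ derived on the following slides; the construction used by FAVOR+ is more careful):

```python
import torch

def linear_attention(Q, K, V, psi):
    """Approximate softmax(Q Kᵀ) V in O(n d'(d + d')) time without forming the n x n matrix."""
    Qp = psi(Q)                          # Q' : (n, d')
    Kp = psi(K)                          # K' : (n, d')
    KV = Kp.T @ V                        # (d', d)  -- multiply the slim matrices first
    Z = Qp @ Kp.sum(dim=0)               # (n,) normalization, reused for every query
    return (Qp @ KV) / Z.unsqueeze(-1)   # (n, d)
```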

67. 67 KYOTO UNIVERSITY
Derivation of random features for attention

Let φ(x) ∈ ℝ^{d′} be the random features for the Gaussian kernel,
i.e., φ(q)ᵀ φ(k) ≈ exp(−‖q − k‖² / 2).
Let ψ(x) = exp(‖x‖² / 2) φ(x).
Then ψ(q)ᵀ ψ(k) ≈ exp(qᵀk), the unnormalized attention weight.

68. 68 KYOTO UNIVERSITY
Normalization term can be computed in linear time

The normalization constant is
Z = Σᵢ exp(qᵀkᵢ) ≈ Σᵢ ψ(q)ᵀ ψ(kᵢ) = ψ(q)ᵀ Σᵢ ψ(kᵢ)
(random feature approximation).
The sum Σᵢ ψ(kᵢ) takes linear time, but can be reused for other q’s
→ linear time in total.

69. 69 KYOTO UNIVERSITY
Efficient Architectures

70. 70 KYOTO UNIVERSITY
Traditional filter has weights for each channel

[Figure: a standard 3×3 convolution. Input: C × H × W (C = number of
channels); kernel K: F × C × 3 × 3 (F = number of filters; each filter has
weights for each channel); output: F × H × W.]

71. 71 KYOTO UNIVERSITY
MobileNets separate spatial and interchannel interactions

MobileNets
[Figure: the standard convolution (kernel F × C × 3 × 3, mapping
C × H × W to F × H × W) is replaced by a depthwise convolution K1 of
shape C × 3 × 3 (mapping C × H × W to C × H × W; a 1/F reduction in
weights) followed by a pointwise convolution K2 of shape F × C × 1 × 1
(mapping C × H × W to F × H × W; a 1/9 reduction in weights).]
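
A sketch of the depthwise + pointwise pair in PyTorch (channel counts are illustrative):

```python
import torch
import torch.nn as nn

C, F = 64, 128

standard = nn.Conv2d(C, F, kernel_size=3, padding=1)        # F x C x 3 x 3 weights

depthwise_separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C),    # K1: C x 3 x 3 (depthwise)
    nn.Conv2d(C, F, kernel_size=1),                          # K2: F x C x 1 x 1 (pointwise)
)

x = torch.randn(1, C, 32, 32)
print(standard(x).shape, depthwise_separable(x).shape)       # both (1, F, 32, 32)
```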

72. 72 KYOTO UNIVERSITY
FLOPs is not a good evaluation method
Efficient architectures are sometimes evaluated
by the number of FLOPs.

Criticism:
“FLOPs is not a good indicator of latency
on modern hardware.” — Irwan Bello [Bello+ NeurIPS 2021]
[Bello+ NeurIPS 2021] Revisiting ResNets: Improved Training and Scaling Strategies.

73. 73 KYOTO UNIVERSITY
Use wall clock time not FLOPs

Tensor cores run dense matrix multiplications
much faster than other operations.

Complicated approximation methods may be
slower than straightforward computation with
dense matrix products.

Evaluate speed by wall clock time, not by FLOPs.
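
A sketch of wall-clock measurement on GPU (the warmup and iteration counts are arbitrary; synchronization is needed because CUDA kernels launch asynchronously):

```python
import time
import torch

def measure_latency(model, x, n_iters=100):
    with torch.no_grad():
        for _ in range(10):                 # warmup
            model(x)
        torch.cuda.synchronize()            # wait for all pending kernels
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters   # seconds per forward pass
```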

74. 74 KYOTO UNIVERSITY
Combination

75. 75 KYOTO UNIVERSITY
Different types of techniques can be combined.

Speedup techniques can be combined.

Deep Compression [Han+ ICLR 2016] combines
pruning with quantization.

The effects may diminish, though.
E.g., ResNet is redundant, so we can achieve
sizable compression easily.
MobileNets are already efficient, so it is
challenging to compress them further [Jacob+ CVPR 2018].
[Han+ ICLR 2016] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and
Huffman Coding.
[Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.

76. 76 KYOTO UNIVERSITY
Summary

77. 77 KYOTO UNIVERSITY
Quantization converts float numbers to int numbers
[Figure: a float32 matrix
  [[0.12, 0.11, 0.31], [0.25, −0.14, 0.19], [−0.13, 0.21, 0.09]]
is quantized into a single float32 coefficient 0.1 times the int8 matrix
  [[1, 1, 3], [3, −1, 2], [−1, 2, 1]].]

78. 78 KYOTO UNIVERSITY
Distillation turns large models into small ones.
Knowledge Distillation
[Figure: a large model is distilled into a small model.]

79. 79 KYOTO UNIVERSITY
Pruning turns off some weights
[Figure: some weights are removed (set to zero) by pruning.]

80. 80 KYOTO UNIVERSITY
Low rank approximation uses slim matrices
[Figure: W is approximated by the product A × B of two slim matrices.]

81. 81 KYOTO UNIVERSITY
Choose speedup methods based on your goal

There are various speedup techniques.

FP16 is the easiest. Try it first.

If you haven’t trained your model yet, efficient
architectures (such as MobileNets, Performers,
MoE, etc.) are worth trying.

It is important to set a goal before tuning.

It is also important to measure speed by wall
clock time, not FLOPs.