How to Speed up Deep Learning Models

joisino
July 12, 2023

I introduce various techniques to speed up deep learning models.

Contact: @joisino_en (Twitter) / https://joisino.net/en/


Transcript

  1. 1 KYOTO UNIVERSITY
    KYOTO UNIVERSITY
    How to Speed up Deep Learning Models
    Ryoma Sato


  2. 2 KYOTO UNIVERSITY
    We’ll see how to speed up deep learning models

    In this tutorial, I introduce various methods to
    speed up deep learning models.

    As models get bigger and bigger nowadays,
    it is important to speed up models to reduce cost
    (both money and time).


  3. 3 KYOTO UNIVERSITY
    We focus on speeding up inference

    There are two types of speedups.

    Speeding up training

    Speeding up inference ★

    We focus on inference in this tutorial.

    Note some of the techniques are also used for
    speeding up training.


  4. 4 KYOTO UNIVERSITY
    We consider both CPU and GPU

    There are two types of devices

    Inference on GPU ★

    Inference on CPU ★

    We consider both cases.

    The approaches can be quite different in these two cases.

    Note that inference is commonly run on CPU as well as on GPU,
    whereas training is mainly carried out on GPU.


  5. 5 KYOTO UNIVERSITY
    Quantization / Low-precision Inference


  6. 6 KYOTO UNIVERSITY
    Quantization converts float numbers to int numbers
    float32:
      3.25  1.98 −2.11
      2.15  0.11  3.01
     −0.25  1.55  1.33

    → quantization →

    int8:
      3  2 −2
      2  0  3
      0  2  1


  7. 7 KYOTO UNIVERSITY
    Quantization often uses a scale coefficient (one float + many ints)

    float32:
      0.12  0.11  0.31
      0.25 −0.14  0.19
     −0.13  0.21  0.09

    → quantization → one float32 scale (0.1) × an int8 matrix:
      1  1  3
      3 −1  2
     −1  2  1
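
    A minimal sketch of this scheme (one float scale + many int8 values), assuming symmetric, per-tensor quantization in PyTorch:

    ```python
    import torch

    def quantize(w):
        scale = (w.abs().max() / 127.0).item()                 # the single float coefficient
        q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
        return q, scale                                        # many ints + one float

    def dequantize(q, scale):
        return q.float() * scale

    w = torch.tensor([[0.12, 0.11, 0.31],
                      [0.25, -0.14, 0.19],
                      [-0.13, 0.21, 0.09]])
    q, scale = quantize(w)
    print(q)                       # int8 matrix
    print(dequantize(q, scale))    # approximate reconstruction of w
    ```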


  8. 8 KYOTO UNIVERSITY
    Why does quantization speed up inference?

    Why does quantization speed up inference?
    Note the number of operations is the same.


  9. 9 KYOTO UNIVERSITY
    Quantization speeds up inference in many ways

    Why does quantization speed up inference?
    Note that the number of operations stays the same.

    It reduces memory consumption
    → We can increase the batch size.
    → Throughput increases.

    We can exploit SIMD (on CPU).

    We can exploit Tensor Cores (on GPU).

    Communication costs decrease.

    Cache efficiency improves.


  10. 10 KYOTO UNIVERSITY
    Quantization speeds up inference in many ways

    Why does quantization speed up inference?
    Note that the number of operations stays the same.

    It reduces memory consumption
    → We can increase the batch size.
    → Throughput increases.

    We can exploit SIMD (on CPU).

    We can exploit Tensor Cores (on GPU).

    Communication costs decrease.

    Cache efficiency improves.


  11. 11 KYOTO UNIVERSITY
    A high batch size increases throughput

    [Figure: a model on GPU processes one input x in 10.0 msec;
    processing two inputs in one batch takes 10.1 msec.]

    Almost the same time, given that parallelization is not saturated.
    → Throughput doubles.


  12. 12 KYOTO UNIVERSITY
    Quantization loosens the memory limit → higher batch size

    We can’t increase the batch size arbitrarily due to the
    memory limit.

    Quantization loosens the memory limit,
    so we can use higher batch sizes.


  13. 13 KYOTO UNIVERSITY
    A high batch size speeds up training as well

    This is basically the case in training as well.

    The accuracy of a model is mostly determined by
    the number of samples it saw during training.
    (It depends on many other aspects, though.
    See, e.g., the tuning playbook:
    https://github.com/google-research/tuning_playbook)

    If we double the batch size, we can finish training
    in half the time.

    It is common to use FP16 (instead of FP32) or
    mixed precision (FP16 + FP32) during training.
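
    A minimal mixed-precision training sketch with PyTorch AMP, assuming `model`, `loader`, `criterion`, and `optimizer` are defined elsewhere:

    ```python
    import torch

    # model, loader, criterion, optimizer: assumed to be defined already
    scaler = torch.cuda.amp.GradScaler()

    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():      # forward pass runs in FP16/FP32 mixed precision
            loss = criterion(model(x), y)
        scaler.scale(loss).backward()        # scale the loss to avoid FP16 gradient underflow
        scaler.step(optimizer)
        scaler.update()
    ```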


  14. 14 KYOTO UNIVERSITY
    Quantization speeds up inference in many ways

    Why does quantization speed up inference?
    Note that the number of operations stays the same.

    It reduces memory consumption
    → We can increase the batch size.
    → Throughput increases.

    We can exploit SIMD (on CPU).

    We can exploit Tensor Cores (on GPU).

    Communication costs decrease.

    Cache efficiency improves.


  15. 15 KYOTO UNIVERSITY
    Standard addition of 64 bit ints
    0000101110100010100000111111000011101100001111100111110000010011
    0000010011001001010101111010000000010101011111010000000010010100
    0001000001101011110110111001000100000001101110110111110010100111
    +
    64 bit Arithmetic Logic Unit (ALU)
    int64
    int64
    int64


  16. 16 KYOTO UNIVERSITY
    Eight int8 additions can be done simultaneously
    0000101110100010100000111111000011101100001111100111110000010011
    0000010011001001010101111010000000010101011111010000000010010100
    0001000001101011110110101001000000000001101110110111110010100111
    +
    64 bit Arithmetic Logic Unit (ALU)
    int8 int8 int8 int8 int8 int8 int8 int8
    Carry propagation across the 8-bit boundaries is turned off.


  17. 17 KYOTO UNIVERSITY
    SIMD parallelizes low-precision arithmetic.

    Modern CPUs have 256 bit or 512 bit ALUs.

    They support simultaneous low-precision
    operations natively, e.g., AVX2.

    This kind of operation is called SIMD
    (Single Instruction, Multiple Data).

    AVX2 carries out 32 int8 operations
    simultaneously.


  18. 18 KYOTO UNIVERSITY
    Quantization speeds up inference in many ways

    Why does quantization speed up inference?
    Note that the number of operations stays the same.

    It reduces memory consumption
    → We can increase the batch size.
    → Throughput increases.

    We can exploit SIMD (on CPU).

    We can exploit Tensor Cores (on GPU).

    Communication costs decrease.

    Cache efficiency improves.


  19. 19 KYOTO UNIVERSITY
    Tensor Cores carry out matrix multiplications

    NVIDIA GPUs consist of CUDA cores and Tensor Cores.

    Tensor Cores carry out matrix multiplications.
    ▲ from the whitepaper of A100 GPU


  20. 20 KYOTO UNIVERSITY
    Tensor Cores process more int8 values than floats at a time

    Tensor Cores can handle int8 more efficiently
    than float.
    This is due to basically the same reason as in
    CPU SIMD.
    ▲ from the whitepaper of A100 GPU


  21. 21 KYOTO UNIVERSITY
    We have two approaches: QAT and postprocessing

    Two types of quantization (When to quantize)

    Quantization aware training (QAT)

    Post-hoc quantization

    QAT is more effective than post-hoc quantization.

    We sometimes have models that are already
    trained. In that case, we need to resort to
    post-hoc quantization.


  22. 22 KYOTO UNIVERSITY
    Speedup-aware training vs postprocessing

    This dichotomy, speedup-aware training vs.
    post-processing speedup, is common to many of
    the speedup techniques that follow.

    Some methods cannot be applied in a
    post-processing way, though.


  23. 23 KYOTO UNIVERSITY
    STE is a basic approach for QAT

    Basic approach for QAT:
    Straight-Through Estimator (STE) [Bengio+ 2013]

    Forward:
    [Figure: the float parameter we hold, w = 2.14, is quantized to q = 2;
    the input x = 1.4 is multiplied by q, giving h = 2.8.]

  24. 24 KYOTO UNIVERSITY
    STE is a basic approach for QAT

    Basic approach for QAT:
    Straight-Through Estimator (STE) [Bengio+ 2013]

    Forward: the float parameter we hold, w = 2.14, is quantized to q = 2;
    h = q · x = 2.8 with x = 1.4.

    Backward: ∂L/∂h = 2.1 flows back to ∂L/∂q = 1.5.
    The quantize step has no gradient, so STE simply copies the gradient:
    ∂L/∂w := ∂L/∂q = 1.5.


  25. 25 KYOTO UNIVERSITY
    STE allows us to continuously optimize parameters

    Intuitively, if we should increase q (the parameter
    after quantization), we should increase w.

    We can continuously update parameters by
    holding float parameters and using STE.

    [Figure: w = 2.14 → quantize → q = 2; x = 1.4; h = 2.8;
    ∂L/∂h = 2.1, ∂L/∂q = 1.5, copied by STE to ∂L/∂w = 1.5.]

    This suggests decreasing q by (lr × 1.5).
    But (lr × 1.5) < 0.5 is fractional.


  26. 26 KYOTO UNIVERSITY
    STE allows us to continuously optimize parameters

    Intuitively, if we should increase q (the parameter
    after quantization), we should increase w.

    We can continuously update parameters by
    holding float parameters and using STE.

    [Figure: w = 2.14 → quantize → q = 2; x = 1.4; h = 2.8;
    ∂L/∂h = 2.1, ∂L/∂q = 1.5, copied by STE to ∂L/∂w = 1.5.]

    This suggests decreasing q by (lr × 1.5).
    But (lr × 1.5) < 0.5 is fractional.
    Instead, we decrease w by this much. This may not change q immediately,
    but if the gradient tends to be positive over many iterations,
    the changes accumulate and q eventually changes.

  27. 27 KYOTO UNIVERSITY
    Weight only or weight and activation quantization

    Two types of quantization (Where to quantize)

    Weight-only Quantization

    Weight and Activation Quantization

    Weight-only Quantization reduces the model size,
    but computations are in float.
    → It speeds up loading the model on GPU, but
    the computations are not so sped up.

    Some models are too big to fit in small devices.
    Small model size is important in edge inference.


  28. 28 KYOTO UNIVERSITY
    Power-of-two quantization speeds up even more.

    Some quantization methods quantize weights and
    batchnorm coefficients to powers of two
    (…, 1/4, 1/2, 1, 2, 4, …).

    In this case, multiplying weights and inputs
    can be done with bit shifts.

    This speeds up computation even further.
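
    A tiny illustration (plain Python integers) of why power-of-two weights are cheap: multiplying or dividing by 2^k is a single bit shift.

    ```python
    x = 13
    assert x << 2 == x * 4      # multiplying by 2**2 is a left shift
    assert x >> 1 == x // 2     # dividing by 2**1 is a right shift
    ```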


  29. 29 KYOTO UNIVERSITY
    Many options for how much we quantize

    Precision level (how much we quantize):

    FP16, BF16 (16-bit float formats)

    Int8

    Binary

    The H100 GPU also supports FP8.

    I will review the above three choices in the
    following slides.


  30. 30 KYOTO UNIVERSITY
    FP16 is easy to use. Good for the first try.

    FP16, BF16 (16-bit float):
    It is often okay to naively cast 32-bit models to
    FP16 or BF16.
    This is the first recommended option.

    Approx. 1.5× speedup is expected.

    Tensor Cores support FP16 from the Volta architecture (e.g., V100)
    and BF16 from the Ampere architecture (e.g., A100).
    More speedup is expected when Tensor Cores are exploited.

    It’s just model = model.half() in PyTorch.
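
    A minimal FP16 inference sketch, assuming `model` is a trained float32 PyTorch model and `x` is an input batch:

    ```python
    import torch

    # model, x: assumed to be defined already
    model = model.half().eval().cuda()   # cast all weights to FP16
    x = x.cuda().half()                  # inputs must be FP16 as well

    with torch.no_grad():
        y = model(x)
    ```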


  31. 31 KYOTO UNIVERSITY
    Int8 is efficient. Good when you can afford careful tuning.

    Int8:
    It requires some sophisticated care to convert
    float32 models to int8.
    The degradation of accuracy is often negligible
    [Jacob+ CVPR 2018].

    2–4× speedup is expected.

    Tensor Cores support int8 operations from the
    Turing architecture (e.g., T4).
    [Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.
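
    A minimal post-hoc sketch with PyTorch dynamic quantization, which converts the weights of the selected layer types to int8 and quantizes activations on the fly; `MyModel` and `x` are placeholders.

    ```python
    import torch
    import torch.nn as nn

    model = MyModel().eval()                       # trained float32 model (placeholder)
    qmodel = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8      # quantize Linear weights to int8
    )
    with torch.no_grad():
        y = qmodel(x)                              # x: an example input batch (placeholder)
    ```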


  32. 32 KYOTO UNIVERSITY
    Binary is extremely efficient, but not precise.

    Binary (1-bit):
    It degrades accuracy even with sophisticated
    techniques.

    30–60× speedup is expected on CPU [Rastegari+
    ECCV 2016].

    2–7× speedup on GPU as well [Courbariaux+ NeurIPS 2016].

    This option is recommended only when speed
    is crucial.
    [Rastegari+ ECCV 2016] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.
    [Courbariaux+ NeurIPS 2016] Binarized Neural Networks.


  33. 33 KYOTO UNIVERSITY
    Reference

    Famous papers:

    [Jacob+ CVPR 2018] Quantization and Training of Neural
    Networks for Efficient Integer-Arithmetic-Only Inference.
    PyTorch and TensorFlow quantization is based on this paper.


  34. 34 KYOTO UNIVERSITY
    Reference

    [Courbariaux+ NeurIPS 2015] BinaryConnect: Training
    Deep Neural Networks with binary weights during
    propagations.
    This paper proposes weight-only binary quantization.

    [Courbariaux+ NeurIPS 2016] Binarized Neural Networks.
    This paper proposes weight-and-activation binary
    quantization on GPU.

    [Rastegari+ ECCV 2016] XNOR-Net: ImageNet
    Classification Using Binary Convolutional Neural
    Networks.
    Weight-and-activation binary quantization on CPU.


  35. 35 KYOTO UNIVERSITY
    Knowledge Distillation


  36. 36 KYOTO UNIVERSITY
    Distillation turns large models into small ones.
    [Figure: knowledge distillation transfers from a large model to a small model.]


  37. 37 KYOTO UNIVERSITY
    Distillation has two steps

    Two steps of knowledge distillation:
    1. Train the large (teacher) model.
    2. Train the small (student) model using the output of the
       large model as the target (see the sketch below).
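
    A minimal sketch of step 2, assuming `teacher`, `student`, and a data `loader` are given; the temperature-softened KL loss used here is one common choice, not necessarily the exact loss of the cited papers.

    ```python
    import torch
    import torch.nn.functional as F

    # teacher, student, loader: assumed to be defined already
    T = 2.0                                         # assumed temperature hyperparameter
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    teacher.eval()

    for x in loader:                                # unlabeled inputs are enough here
        with torch.no_grad():
            t_logits = teacher(x)                   # deterministic targets
        s_logits = student(x)
        # soften both distributions and match them with KL divergence
        loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                        F.softmax(t_logits / T, dim=-1),
                        reduction="batchmean") * T * T
        opt.zero_grad()
        loss.backward()
        opt.step()
    ```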


  38. 38 KYOTO UNIVERSITY
    Distillation is indirect

    Knowledge distillation looks roundabout.
    Why don’t we train the small model only?


  39. 39 KYOTO UNIVERSITY
    Distillation is indirect but necessary due to noise

    Knowledge distillation looks roundabout.
    Why don’t we train the small model only?
    [Ba+ NeurIPS 2014] Do Deep Nets Really Need to be Deep?
    [Figure: training data (stochastic) vs. output of the teacher model (deterministic).]
    Small models struggle to fit noisy signals.
    It is easier for the small model to fit the
    deterministic signal of the teacher.


  40. 40 KYOTO UNIVERSITY
    Pruning


  41. 41 KYOTO UNIVERSITY
    Pruning sparsifies the weights
     3.25  0.19 −2.11
     2.15  0.11  3.01
    −0.25  1.55  0.12

    → pruning →

     3.25  0    −2.11
     2.15  0     3.01
     0     1.55  0


  42. 42 KYOTO UNIVERSITY
    Pruning has three steps. Finetuning is important.

    Three steps for pruning:
    1. Train the model.
    2. Prune the weights.
    3. Finetune the model.
       During finetuning, the pruned weights are
       fixed to zero.

    Finetuning is important for accuracy.


  43. 43 KYOTO UNIVERSITY
    Pruning based on magnitude is standard

    Various criteria for pruning have been proposed.

    The most basic one is magnitude pruning:
    prune the weights that have small absolute
    values, i.e., prune w if |w| < ε
    (see the sketch below).

    It is also popular to dynamically prune weights
    during training with Lasso-like L1 regularization.
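
    A minimal sketch of magnitude pruning with a fixed mask, assuming `model` is a trained PyTorch model and ε = 1e-2; real pipelines often use torch.nn.utils.prune instead.

    ```python
    import torch

    # model: assumed to be a trained PyTorch model
    eps = 1e-2                                   # assumed pruning threshold ε
    masks = {}

    # Step 2: zero out small-magnitude weights and remember the masks
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() > 1:                      # skip biases and norm parameters
                masks[name] = (p.abs() >= eps).float()
                p.mul_(masks[name])

    # Step 3: during finetuning, re-apply the masks after each optimizer step
    # so that the pruned weights stay exactly zero
    def apply_masks(model, masks):
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])
    ```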


  44. 44 KYOTO UNIVERSITY
    Structured pruning prunes channel/filter-wise

    Two types of pruning

    Non-structured pruning prunes dimension-wise

    Structured pruning prunes channel/filter-wise


  45. 45 KYOTO UNIVERSITY
    Convolutional layer has F x C x 3 x 3 parameters

    3×3 convolutional layer:
    input: C × H × W  (C = number of channels)
    kernel K: F × C × 3 × 3  (F = number of filters;
    each filter has weights for each channel)
    output (convolution): F × H × W


  46. 46 KYOTO UNIVERSITY
    Filter pruning uses K[f, :, :, :]

    Filter pruning imposes group sparsity with group
    K[f, :, :, :] for each f.
    [Figure: K of shape F × C × 3 × 3 → filter pruning → K⁻ of shape F⁻ × C × 3 × 3.]
    [Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.
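
    A minimal sketch of the group-lasso regularizer behind filter pruning, assuming `conv` is an nn.Conv2d and λ is a hyperparameter; adding this term to the training loss pushes whole filters K[f, :, :, :] toward zero.

    ```python
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

    lam = 1e-4                                                  # assumed regularization strength
    # group lasso over filters: sum over f of || K[f, :, :, :] ||_2
    group_l1 = conv.weight.flatten(start_dim=1).norm(dim=1).sum()
    # total_loss = task_loss + lam * group_l1
    ```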


  47. 47 KYOTO UNIVERSITY
    Channel pruning uses K[:, c, :, :]

    Channel pruning imposes group sparsity with
    group K[:, c, :, :] for each c.
    [Figure: K of shape F × C × 3 × 3 → channel pruning → K⁻ of shape F × C⁻ × 3 × 3.]
    [Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.


  48. 48 KYOTO UNIVERSITY
    Shape pruning uses K[:, c, h, w]

    Shape pruning imposes group sparsity with group
    K[:, c, h, w] for each (c, h, w).
    [Figure: K reshaped to F × (C × 3 × 3) → shape pruning → K⁻ of shape F × D.]
    [Wen+ NeurIPS 2016] Learning Structured Sparsity in Deep Neural Networks.


  49. 49 KYOTO UNIVERSITY
    Can shape pruning exploit structured sparsity?

    Why shape pruning?

    At first glance, it looks like non-structured pruning
    and does not seem to exploit any structure.


  50. 50 KYOTO UNIVERSITY
    Convolution is implemented by matrix product

    Detailed flow of convolution (im2col):
    1. Input X: C × H × W.
    2. Extract a 3×3 patch for each position and flatten:
       X′ of shape (H × W) × (C × 3 × 3) = HW × 9C.
    3. Flatten the kernel K (F × C × 3 × 3) into K′ of shape F × (C × 3 × 3) = F × 9C.
    4. Matrix product Y′ = K′ X′ᵀ of shape F × HW; unflatten into Y: F × H × W
       (see the sketch below).
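
    A minimal PyTorch sketch of this flow, checking that the im2col (unfold) + matrix-product path matches F.conv2d:

    ```python
    import torch
    import torch.nn.functional as F

    N, C, H, W, Fout = 1, 4, 8, 8, 6
    x = torch.randn(N, C, H, W)
    k = torch.randn(Fout, C, 3, 3)

    # X': extract a 3x3 patch around every position -> (N, C*3*3, H*W)
    patches = F.unfold(x, kernel_size=3, padding=1)

    # K': flatten the kernel -> (Fout, C*3*3), then one big matrix product
    y = k.view(Fout, -1) @ patches            # (N, Fout, H*W)
    y = y.view(N, Fout, H, W)                 # unflatten

    assert torch.allclose(y, F.conv2d(x, k, padding=1), atol=1e-4)
    ```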


  51. 51 KYOTO UNIVERSITY
    Shape pruning reduces the number of columns

    We can reduce the number of columns in the matrix
    product by shape pruning.
    [Figure: X (C × H × W) → extract a patch for each position → X′ (HW × 9C)
    → delete the pruned columns → X′′ (HW × D);
    the matrix product with K⁻ (F × D) gives Y′, which is unflattened into Y (F × H × W).]


  52. 52 KYOTO UNIVERSITY
    Non-structured pruning is mainly for CPU

    Non-structured pruning
    can reach higher sparsity while keeping accuracy,
    thanks to its fine-grained resolution,
    but it is not suitable for GPUs.
    It is mainly for CPU inference.


  53. 53 KYOTO UNIVERSITY
    Structured pruning may not be effective

    Structured pruning
    effectively exploits GPU parallelization,
    but it can degrade the accuracy,
    or only a few filters can be pruned.


  54. 54 KYOTO UNIVERSITY
    Sparse operations are available from the A100 onward

    Update: now we can use non-structured pruning
    for GPU inference.

    Sparse multiplication is supported from the
    Ampere GPUs (e.g., A100).
    ▲ from the whitepaper of A100 GPU


  55. 55 KYOTO UNIVERSITY
    Pruning is indirect

    Pruning looks roundabout
    Why don’t we use small and dense models?


  56. 56 KYOTO UNIVERSITY
    Pruning is indirect but necessary due to overparam.

    Pruning looks roundabout
    Why don’t we use small and dense models?

    DNNs are overparameterized.

    Overparametrization eases optimization and
    generalization
    (cf. the Lottery Ticket Hypothesis, double descent).

    We can prune unnecessary weights after
    optimization.


  57. 57 KYOTO UNIVERSITY
    Structured pruning is sometimes meaningless

    Criticism of structured pruning:

    “For all state-of-the-art structured pruning
    algorithms we examined, fine-tuning a pruned
    model only gives comparable or worse
    performance than training that model with
    randomly initialized weights.” [Liu+ ICLR 2019]

    In this case, pruning is just roundabout and
    meaningless.
    [Liu+ ICLR 2019] Rethinking the Value of Network Pruning.


  58. 58 KYOTO UNIVERSITY
    Low-rank approximation


  59. 59 KYOTO UNIVERSITY
    W is approximated by two slim matrices A and B
    [Figure: W ≈ A × B, where A and B are slim (low-rank) matrices.]
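
    A minimal sketch of replacing a linear layer’s weight W with a rank-r factorization A × B obtained by truncated SVD; the layer sizes and the target rank are assumed for illustration.

    ```python
    import torch
    import torch.nn as nn

    layer = nn.Linear(512, 512)                # original layer, W: 512 x 512
    rank = 64                                  # assumed target rank

    with torch.no_grad():
        U, S, Vh = torch.linalg.svd(layer.weight, full_matrices=False)
        A = U[:, :rank] * S[:rank]             # 512 x r
        B = Vh[:rank, :]                       # r x 512

        # two slim layers: x -> Bx -> A(Bx) + bias
        low_rank = nn.Sequential(nn.Linear(512, rank, bias=False),
                                 nn.Linear(rank, 512))
        low_rank[0].weight.copy_(B)
        low_rank[1].weight.copy_(A)
        low_rank[1].bias.copy_(layer.bias)
    ```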


  60. 60 KYOTO UNIVERSITY
    CNN filters have low-rank structures

    Filters of the first layer of CNNs are continuous
    and not so diverse (cf. Gabor filter)

    We can effectively approximate them with fairly
    low rank matrices [Denton+ NeurIPS 2014, Denil+ NeurIPS 2013]
    [Denil+ NeurIPS 2013] Predicting Parameters in Deep Learning.
    [Denton+ NeurIPS 2014] Exploiting Linear Structure within Convolutional Networks for
    Efficient Evaluation.
    ▲ from AlexNet paper ImageNet Classification with Deep Convolutional Neural Networks.
    We can guess the masked values
    → the parameters are redundant.


  61. 61 KYOTO UNIVERSITY
    Reducing computation of attention

    Next, let’s consider attention:
    Y = Softmax(Q Kᵀ) V


  62. 62 KYOTO UNIVERSITY
    Computation of A is time consuming

    Next, let’s consider attention:
    Y = Softmax(Q Kᵀ) V = A V,  where A = Softmax(Q Kᵀ).

    Computing the attention matrix A (which is n × n)
    is time-consuming.


  63. 63 KYOTO UNIVERSITY
    A behaves like a Gram matrix

    Next, let’s consider attention:
    Y = Softmax(Q Kᵀ) V = A V

    A_ij measures the similarity of Q_i and K_j.
    It’s similar to a Gram matrix.
    Let’s assume A is the Gram matrix of the
    Gaussian kernel for a moment.


  64. 64 KYOTO UNIVERSITY
    Gram matrix can be effectively approximated

    Next, let’s consider attention.

    A Gram matrix can be approximated by, e.g.,
    random features [Rahimi+ NeurIPS 2008] or the
    Nyström approximation.

    Y = Softmax(Q Kᵀ) V ≈ Q′ K′ᵀ V,
    where Q′ and K′ are the random features of Q and K.
    [Rahimi+ NeurIPS 2008] Random Features for Large-Scale Kernel Machines.


  65. 65 KYOTO UNIVERSITY
    Random features can also be applied to attention

    An attention matrix is not a Gram matrix of the
    Gaussian kernel, though.

    Fortunately, almost the same methods can be
    applied to the attention matrix.

    FAVOR+ and Performers [Choromanski+ ICLR 2021]

    Random Feature Attention [Peng+ ICLR 2021]
    [Choromanski+ ICLR 2021] Rethinking Attention with Performers.
    [Peng+ ICLR 2021] Random Feature Attention.


  66. 66 KYOTO UNIVERSITY
    Attention with random features runs in linear time

    Approximation of Y = Attention(Q, K, V):
    1. Compute random features: Q′ = ψ(Q) ∈ ℝ^(n×d′), K′ = ψ(K) ∈ ℝ^(n×d′).
    2. Multiply the slim matrices: Y ≈ Q′ (K′ᵀ V).

    Time complexity: O(n d′ (d + d′)) → linear w.r.t. n.


  67. 67 KYOTO UNIVERSITY
    Derivation of random features for attention

    Let φ(x) ∈ ℝ^d′ be the random features for the Gaussian kernel:
    φ(q)ᵀ φ(k) ≈ exp(−‖q − k‖² / 2).

    Let ψ(x) = exp(‖x‖² / 2) φ(x).

    Then ψ(q)ᵀ ψ(k) ≈ exp(qᵀ k)   (unnormalized attention).


  68. 68 KYOTO UNIVERSITY
    The normalization term can be computed in linear time

    The normalization constant is
    Z = Σᵢ exp(qᵀ kᵢ)
      ≈ Σᵢ ψ(q)ᵀ ψ(kᵢ)     (random feature approximation)
      = ψ(q)ᵀ Σᵢ ψ(kᵢ).

    The sum Σᵢ ψ(kᵢ) takes linear time, but it can be
    reused for other q’s → linear time in total.
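
    A minimal PyTorch sketch of this recipe with positive random features; the feature map follows the derivation above and is a simplification of FAVOR+ [Choromanski+ ICLR 2021], with sizes chosen for illustration.

    ```python
    import torch

    def random_feature_attention(Q, K, V, num_features=256):
        # Q, K: (n, d), V: (n, d_v); approximates softmax(Q K^T / sqrt(d)) V
        n, d = Q.shape
        Q = Q / d ** 0.25                     # fold the 1/sqrt(d) scaling into Q and K
        K = K / d ** 0.25
        W = torch.randn(d, num_features)      # random projections w_r ~ N(0, I)

        def psi(X):
            # psi(x)_r = exp(w_r^T x - ||x||^2 / 2) / sqrt(m), so E[psi(q)^T psi(k)] = exp(q^T k)
            return torch.exp(X @ W - (X ** 2).sum(-1, keepdim=True) / 2) / num_features ** 0.5

        Qp, Kp = psi(Q), psi(K)               # (n, m) each
        num = Qp @ (Kp.T @ V)                 # ≈ exp(Q K^T) V, computed in O(n m d)
        den = Qp @ Kp.sum(0)                  # ≈ the normalization constants Z_i
        return num / den.unsqueeze(-1)

    # rough comparison against exact softmax attention on a toy example
    Q, K, V = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 4)
    exact = torch.softmax(Q @ K.T / 16 ** 0.5, dim=-1) @ V
    approx = random_feature_attention(Q, K, V, num_features=4096)
    ```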


  69. 69 KYOTO UNIVERSITY
    Efficient Architectures


  70. 70 KYOTO UNIVERSITY
    Traditional filter has weights for each channel

    Traditional 3×3 convolutional layer:
    input: C × H × W  (C = number of channels)
    kernel K: F × C × 3 × 3  (F = number of filters;
    each filter has weights for each channel)
    output: F × H × W


  71. 71 KYOTO UNIVERSITY
    MobileNets handle spatial and inter-channel interactions separately

    MobileNets (depthwise separable convolution, see the sketch below):
    1. Depthwise convolution K1 (C × 3 × 3), one 3×3 filter per channel:
       C × H × W → C × H × W   (≈ 1/F of the standard cost).
    2. Standard 1×1 (pointwise) convolution K2 (F × C × 1 × 1):
       C × H × W → F × H × W   (≈ 1/9 of the standard cost).
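
    A minimal PyTorch sketch of this factorization; the channel counts C = 32 and F = 64 are assumed for illustration.

    ```python
    import torch.nn as nn

    C, F = 32, 64   # example numbers of channels / filters (assumed)

    standard = nn.Conv2d(C, F, kernel_size=3, padding=1)          # weights: F x C x 3 x 3

    depthwise_separable = nn.Sequential(
        nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C),      # K1: depthwise, C x 1 x 3 x 3
        nn.Conv2d(C, F, kernel_size=1),                           # K2: pointwise, F x C x 1 x 1
    )
    ```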


  72. 72 KYOTO UNIVERSITY
    FLOPs is not a good evaluation metric

    Efficient architectures are sometimes evaluated
    by the number of FLOPs.

    Criticism:
    “FLOPs is not a good indicator of latency
    on modern hardware.” [Bello+ NeurIPS 2021]

    [Bello+ NeurIPS 2021] Revisiting ResNets: Improved Training and Scaling Strategies.


  73. 73 KYOTO UNIVERSITY
    Use wall-clock time, not FLOPs

    Tensor cores run dense matrix multiplications
    much faster than other operations.

    Complicated approximation methods may be
    slower than straightforward computation with
    dense matrix products.

    Evaluate speed by wall clock time, not by FLOPs.


  74. 74 KYOTO UNIVERSITY
    Combination


  75. 75 KYOTO UNIVERSITY
    Different types of techniques can be combined.

    Speedup techniques can be combined.

    Deep Compression [Han+ ICLR 2016] combines
    pruning with quantization.

    The effects may diminish, though.
    E.g., ResNet is redundant, so we can achieve
    sizable compression easily.
    MobileNets are already efficient, so it is
    challenging to compress them further [Jacob+ CVPR 2018].
    [Han+ ICLR 2016] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and
    Huffman Coding.
    [Jacob+ CVPR 2018] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.


  76. 76 KYOTO UNIVERSITY
    Summary


  77. 77 KYOTO UNIVERSITY
    Quantization converts float numbers to int numbers
    float32:
      0.12  0.11  0.31
      0.25 −0.14  0.19
     −0.13  0.21  0.09

    → quantization → one float32 scale (0.1) × an int8 matrix:
      1  1  3
      3 −1  2
     −1  2  1


  78. 78 KYOTO UNIVERSITY
    Distillation turns large models into small ones.
    [Figure: knowledge distillation transfers from a large model to a small model.]


  79. 79 KYOTO UNIVERSITY
    Pruning turns off some of the weights
    [Figure: a dense weight matrix → pruning → a sparse weight matrix.]


  80. 80 KYOTO UNIVERSITY
    Low rank approximation uses slim matrices
    [Figure: W ≈ A × B, where A and B are slim matrices.]


  81. 81 KYOTO UNIVERSITY
    Choose speedup methods based on your goal

    There are various speedup techniques.

    FP16 is the easiest. Try it first.

    If you haven’t trained your model yet, efficient
    architectures (such as MobileNets, Performers,
    MoE, etc.) are worth trying.

    It is important to set a goal before tuning.

    It is also important to measure speed by wall
    clock time, not FLOPs.
