
Linformer: paper reading

himkt
September 05, 2020


Slides from an in-house paper reading session.


Transcript

  1. Linformer:
    Self-Attention with Linear Complexity
    2020/09/04 Makoto Hiramatsu <@himkt> * Figures come from the original paper


  3. Related works

    Technique                   | Speed | Memory | Performance
    Mixed Precision             |  ✅   |   ✅   |     ✅
    Knowledge Distillation      |  ✅   |   ✅   |     ❌
    Sparse Attention            |  ✅   |   ❌   |     ❌
    LSH Attention               |  ❌   |   ✅   |
    Optimizer-level techniques  |  ❌   |   ✅   |     ❌
    Linformer                   |  ✅   |   ✅   |     ✅


  4. Complexity and sequential operation
    • RNN: O(n) sequential operations, which is undesirable for parallelization
    • Transformer: O(1) sequential operations but O(n^2) complexity
    • Sparse Transformer: lower complexity while keeping O(1) sequential operations
    • Reformer: O(n log n) complexity, O(log n) sequential operations


  5. Complexity and sequential operation
    • RNN: O(n) sequential operations, which is undesirable for parallelization
    • Transformer: O(1) sequential operations but O(n^2) complexity
    • Sparse Transformer: lower complexity while keeping O(1) sequential operations
    • Reformer: O(n log n) complexity, O(log n) sequential operations
    Linformer: linear complexity and O(1) sequential operations!


  6. Preliminary: Self Attention
    (Figure: the attention matrix between Query and Key has N^2 entries)

  7. Rewriting P using A and D
    P = \exp(A) \, D_A^{-1}, where
    A = \frac{Q W_i^Q (K W_i^K)^\top}{\sqrt{d}}, \qquad (D_A)_{ii} = \sum_{j=1}^{N} \exp(A_{ji})


  8. Saturation of Cumulative Eigenvalue

  9. What insight do we get from this observation?
    Self-Attention is Low Rank
    • We may be able to discard most of the self-attention matrix
      while keeping most of its information
    • The first few eigenvectors are dominant
    => Low rank approximation!


  10. Low rank approximation for P
    Pros: reduces the parameters to be learned
    Cons: introduces additional complexity (for the SVD)
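    As a sketch of what this would look like (my own illustration, not from the
    slides), a rank-k approximation of a given attention matrix P via truncated
    SVD; the SVD itself is the extra cost mentioned in the cons:

    import torch

    def low_rank_approx(P, k):
        # P: (n, n) attention matrix; keep only the top-k singular triplets
        U, S, Vh = torch.linalg.svd(P)                # the added cost: a full SVD
        return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k]  # P_tilde = sum_{i<=k} sigma_i u_i v_i^T

    P = torch.softmax(torch.randn(512, 512), dim=-1)
    P_tilde = low_rank_approx(P, k=128)
    print(torch.linalg.matrix_norm(P - P_tilde))      # approximation error (Frobenius norm)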


  11. Low rank approximation for P
    • Question
      • The given parameters are Q, W_i^Q, K, and W_i^K; P itself is not available (as I understand it)
      • Is it possible to obtain u_i, v_i without computing P?


  12. Solution: Just Project!
    # Transformer
    1. Compute scaled dot-product attention
    # Linformer
    1. Simply project V (value) and K (key)
    2. Compute scaled dot-product attention
    https://nlp.seas.harvard.edu/2018/04/03/attention.html
    (Code screenshots: Transformer vs. Linformer)


  13. Linear self-attention (Linformer)
    E and F are introduced for projecting
    KW (key vectors) and VW (value vectors)
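    A minimal sketch of this idea (my simplification: a single head, no batch
    dimension; the official implementation linked later differs in the details).
    E and F are learned projections over the sequence-length dimension, so the
    attention matrix becomes n x k instead of n x n:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LinearSelfAttention(nn.Module):
        """Single-head Linformer-style attention (illustrative sketch)."""

        def __init__(self, n, d, k):
            super().__init__()
            self.w_q = nn.Linear(d, d, bias=False)
            self.w_k = nn.Linear(d, d, bias=False)
            self.w_v = nn.Linear(d, d, bias=False)
            self.E = nn.Linear(n, k, bias=False)  # projects the n key vectors down to k
            self.F = nn.Linear(n, k, bias=False)  # projects the n value vectors down to k
            self.d = d

        def forward(self, x):                                     # x: (n, d)
            q, k_, v = self.w_q(x), self.w_k(x), self.w_v(x)
            k_proj = self.E(k_.transpose(0, 1)).transpose(0, 1)   # (k, d)
            v_proj = self.F(v.transpose(0, 1)).transpose(0, 1)    # (k, d)
            scores = q @ k_proj.transpose(0, 1) / self.d ** 0.5   # (n, k), not (n, n)
            return F.softmax(scores, dim=-1) @ v_proj             # (n, d)

    attn = LinearSelfAttention(n=512, d=64, k=128)
    out = attn(torch.randn(512, 64))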


  14. Transformer and Linformer
    \tilde{P} = \sum_{i=1}^{k} \sigma_i u_i v_i^\top
    P_{\mathrm{Transformer}} = \mathrm{softmax}\left( \frac{Q W_i^Q (K W_i^K)^\top}{\sqrt{d}} \right) \in \mathbb{R}^{n \times n}
    P_{\mathrm{Linformer}} = \mathrm{softmax}\left( \frac{Q W_i^Q (E_i K W_i^K)^\top}{\sqrt{d_k}} \right) \in \mathbb{R}^{n \times k}
    The self-attention matrix of Linformer is smaller than that of the Transformer


  15. Parameter sharing?

  16. Official Implementation
    https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py
    (Code screenshot: the projection layer)


  17. Creating projection layer (nn.Linear)
    https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py
    The projection is an nn.Linear (which implies it is learned)


  18. https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py
    Shared across layers
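    As an illustration of the sharing (a hedged sketch of my own, not the actual
    PyText code): because the projection is just an nn.Linear over the sequence
    length, the same module object can be handed to every layer, so all layers
    train one set of projection weights:

    import torch.nn as nn

    class SketchLayer(nn.Module):
        """Hypothetical layer that accepts an externally created projection."""

        def __init__(self, proj: nn.Linear):
            super().__init__()
            self.proj = proj  # not a copy: the same parameters as in every other layer

    # One learned (seq_len -> k) projection, created once and reused everywhere
    shared_proj = nn.Linear(512, 128, bias=False)
    layers = nn.ModuleList(SketchLayer(shared_proj) for _ in range(12))

    # Every layer points at the same weight tensor, so it is learned jointly
    assert all(layer.proj.weight is shared_proj.weight for layer in layers)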


  19. (Code screenshot: the two projection calls are highlighted)


  20. Cumulative eigenvalues are larger for higher layers
    The dimensionality of E and F could be decreased for higher layers
    (But then the parameters can't be shared across layers)


  21. Performance comparison
    Linformer stays efficient for longer sequences,
    while the Transformer gets slower as the sequence length grows
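    To make the gap concrete, a back-of-the-envelope count of attention-matrix
    entries per head (my own arithmetic, assuming a fixed projected dimension
    k = 128): n^2 grows quadratically while n*k grows only linearly.

    k = 128
    for n in (512, 1024, 4096, 16384):
        full, linear = n * n, n * k
        print(f"n={n:>6}: n^2={full:>12,}  n*k={linear:>10,}  ratio={full // linear}x")
    # ratios: 4x, 8x, 32x, 128x -- the gap widens as the sequence length grows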


  22. Training convergence

  23. Results for n = 512, k = 128 and for n = 1024, k = 256:
    "Linformer's performance is nearly on par with the Transformer"


  24. Sharing parameters reduces memory consumption
    without much detriment to performance


  25. Perplexities after convergence remain about the same
    even though the projected dimension k is kept fixed
    as the sequence length grows

    "This further empirically supports our assertion that
    the Linformer is linear-time"



  28. Conclusion
    • Linformer, an efficient variant of the Transformer
    • Linformer projects keys and values into a low dimension k,
    which decreases the per-query attention cost from n to k
    • When k is a constant much smaller than n, the per-query cost is O(1),
    i.e., the overall self-attention complexity is linear in n
    • Experiments show that Linformer performs well even when k = 128
    (smaller than 512, the default sequence length of BERT)


  29. Good Pointers
    • The Transformer Family
      • https://lilianweng.github.io
      • Sparse Transformer and Reformer explained
    • FAIR official PyText implementation
      • https://github.com/facebookresearch/pytext
