
Linformer: paper reading

himkt
September 05, 2020


Slides from an internal paper reading session.


Transcript

  1. Linformer: Self-Attention with Linear Complexity 2020/09/04 Makoto Hiramatsu <@himkt> *

    Figures come from the original paper
  2. (figure-only slide, no text)
  3. Related works 3 / 29

     Technique                       Speed  Memory  Performance
     Mixed Precision                   ✅      ✅       ✅
     Knowledge Distillation            ✅      ✅       ❌
     Sparse Attention                  ✅      ❌       ❌
     LSH Attention                     ❌      ✅
     Some techniques for Optimizer     ❌      ✅       ❌
     Linformer                         ✅      ✅       ✅
  4. Complexity and sequential operations • RNN: O(n) sequential operations, which

    is not desirable for parallelization • Transformer: O(1) sequential operations but O(n^2) complexity • Sparse Transformer: lower complexity while keeping O(1) sequential operations • Reformer: O(n log n) complexity, O(log n) sequential operations 4 / 29
  5. Complexity and sequential operations • RNN: O(n) sequential operations, which

    is not desirable for parallelization • Transformer: O(1) sequential operations but O(n^2) complexity • Sparse Transformer: lower complexity while keeping O(1) sequential operations • Reformer: O(n log n) complexity, O(log n) sequential operations 5 / 29 Linear complexity and O(1) sequential operations!
  6. Preliminary: Self-Attention 6 / 29 (Figure: Query × Key produces an N^2 attention matrix)
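
For reference, a minimal PyTorch sketch of standard scaled dot-product self-attention (illustrative tensor names and sizes, not the paper's code), showing where the n x n matrix appears:

```python
import math
import torch

n, d = 512, 64                      # sequence length, per-head dimension (illustrative)
Q = torch.randn(n, d)               # queries  (Q W_i^Q)
K = torch.randn(n, d)               # keys     (K W_i^K)
V = torch.randn(n, d)               # values   (V W_i^V)

scores = Q @ K.T / math.sqrt(d)     # n x n scores: the O(n^2) bottleneck
P = torch.softmax(scores, dim=-1)   # n x n attention matrix
out = P @ V                         # n x d context vectors

print(P.shape)                      # torch.Size([512, 512])
```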

  7. Rewriting P using A and D_A:  P = \exp(A) \, D_A^{-1}, where

    A = \frac{QW_i^Q (KW_i^K)^T}{\sqrt{d}} \quad\text{and}\quad (D_A)_{ii} = \sum_{j=1}^{N} \exp(A_{ji}) 7 / 29
  8. Saturation of Cumulative Eigenvalues 8 / 29

  9. What insight does this observation give? Self-Attention is Low Rank •

    We may be able to discard much of the self-attention matrix while keeping most of its information • The first few eigenvectors are dominant => low-rank approximation! 9 / 29
  10. Low-rank approximation for P 10 / 29 Pros: reduces the

    number of parameters to be learned. Cons: introduces additional complexity (for the SVD).
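
For illustration only (Linformer itself avoids the explicit SVD, as the following slides show), a sketch of the spectrum check from slide 8 and the rank-k approximation from this slide, on a randomly generated attention matrix:

```python
import math
import torch

n, d, k = 512, 64, 128
Q, K = torch.randn(n, d), torch.randn(n, d)
P = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)   # n x n attention matrix

# Spectrum check (slide 8): how much of the spectrum do the first k directions carry?
U, S, Vh = torch.linalg.svd(P)
cumulative = torch.cumsum(S, dim=0) / S.sum()
print(f"first {k} singular values cover {cumulative[k - 1].item():.1%} of the spectrum")

# Rank-k approximation (this slide): P ~ sum_{i<=k} sigma_i u_i v_i^T
P_lowrank = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]
print("approximation error:", torch.linalg.norm(P - P_lowrank).item())
```
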
  11. Low-rank approximation for P • Question • The given parameters

    are Q, W_i^Q, K, W_i^K, and P is not available (as I understand it) • Is it possible to obtain u_i, v_i without computing P? 11 / 29
  12. Solution: Just Project! # Transformer 1. Compute scaled dot-product attention

    # Linformer 1. Simply project K (key) and V (value) 2. Compute scaled dot-product attention 12 / 29 https://nlp.seas.harvard.edu/2018/04/03/attention.html (Figure: Transformer vs. Linformer attention, h heads)
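
A minimal sketch of the "just project" recipe (random E and F stand in for the learned projections; shapes are illustrative, not the paper's code):

```python
import math
import torch

n, d, k = 512, 64, 128
Q, K, V = (torch.randn(n, d) for _ in range(3))   # Q W^Q, K W^K, V W^V for one head

# Project the *sequence-length* dimension of K and V from n down to k.
# E and F are random here; in the model they are learned (and may be shared).
E = torch.randn(k, n) / math.sqrt(n)
F = torch.randn(k, n) / math.sqrt(n)
K_proj = E @ K                                    # k x d
V_proj = F @ V                                    # k x d

scores = Q @ K_proj.T / math.sqrt(d)              # n x k  (instead of n x n)
P_bar = torch.softmax(scores, dim=-1)             # n x k attention matrix
out = P_bar @ V_proj                              # n x d, same output shape as before

print(P_bar.shape)                                # torch.Size([512, 128])
```
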
  13. Linear self-attention (Linformer) 13 / 29 E and F are

    introduced to project KW^K (key vectors) and VW^V (value vectors)
  14. Transformer and Linformer 14 / 29

    \tilde{P} = \sum_{i=1}^{k} \sigma_i u_i v_i^T
    P_{\mathrm{Transformer}} = \mathrm{softmax}\left(\frac{QW_i^Q (KW_i^K)^T}{\sqrt{d}}\right) \in \mathbb{R}^{n \times n}
    P_{\mathrm{Linformer}} = \mathrm{softmax}\left(\frac{QW_i^Q (E_i KW_i^K)^T}{\sqrt{d_k}}\right) \in \mathbb{R}^{n \times k}
    The self-attention matrix of Linformer is smaller than the Transformer's.
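
Tracing the shapes makes the size difference concrete (a short derivation following the paper's notation, with d_k the per-head dimension):

```latex
QW_i^Q \in \mathbb{R}^{n \times d_k}, \quad
E_i \in \mathbb{R}^{k \times n} \;\Rightarrow\; E_i K W_i^K \in \mathbb{R}^{k \times d_k}
\;\Rightarrow\; QW_i^Q \, (E_i K W_i^K)^T \in \mathbb{R}^{n \times k}
```

So the attention matrix shrinks from n × n to n × k; with k fixed it grows only linearly in n.
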
  15. Parameter sharing? 15 / 29

  16. Official Implementation 16 / 29 https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py (screenshot annotation: projection layer)

  17. Creating the projection layer (nn.Linear) 17 / 29 https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py nn.Linear (which implies

    the projection is learned)
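
A hedged sketch of how such a projection layer could be created with nn.Linear; max_seq_len, k, and the variable names are illustrative and not the linked PyText code:

```python
import torch
from torch import nn

max_seq_len, k, d = 512, 128, 64   # illustrative sizes

# nn.Linear(max_seq_len, k) holds a learnable (k x max_seq_len) weight matrix,
# i.e. the projection E; using nn.Linear is what implies it is learned.
proj_k = nn.Linear(max_seq_len, k, bias=False)

K = torch.randn(max_seq_len, d)            # key vectors K W^K for one head
# Apply the projection along the sequence axis: equivalent to E @ K.
K_proj = proj_k(K.transpose(0, 1)).transpose(0, 1)
print(K_proj.shape)                        # torch.Size([128, 64])
```
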
  18. 18 / 29 https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py Shared across layers
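In the same spirit, the sharing in the screenshot could look like the sketch below: one projection module instance handed to every layer (the key names compress_k / compress_v are only illustrative):

```python
from torch import nn

num_layers, max_seq_len, k = 12, 512, 128

# One projection module instance ...
shared_proj = nn.Linear(max_seq_len, k, bias=False)

# ... handed to every layer, so all layers train the same weights.
layers = nn.ModuleList(
    nn.ModuleDict({"compress_k": shared_proj, "compress_v": shared_proj})
    for _ in range(num_layers)
)

# Every entry points at the same parameter tensor.
print(all(m["compress_k"].weight is shared_proj.weight for m in layers))  # True
```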

  19. 19 / 29 (code screenshot, annotated: projection)

  20. 20 / 29 Cumulative eigenvalues saturate faster in higher layers

    The dimensionality of E and F could therefore be decreased for higher layers (but then the parameters can't be shared across layers)
  21. Performance comparison 21 / 29 Linformer stays efficient for longer

    sequences, while the Transformer gets slower as the sequence length grows
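
A back-of-the-envelope way to see why (not the paper's benchmark): count attention-matrix entries as n grows while k stays fixed.

```python
k = 128  # projected dimension, kept fixed
for n in (512, 1024, 4096, 16384):
    transformer = n * n   # entries in the n x n attention matrix
    linformer = n * k     # entries in the n x k attention matrix
    print(f"n={n:>6}: transformer {transformer:>12,}  linformer {linformer:>10,}"
          f"  ratio {transformer / linformer:.0f}x")
```
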
  22. Training convergence 22 / 29

  23. 23 / 29 For n = 512, k = 128; for n

    = 1024, k = 256. “Linformer’s performance is nearly on par with the Transformer”
  24. 24 / 29 Sharing parameters reduces memory consumption without much

    detriment to performance
  25. 25 / 29 Perplexities after convergence remain about the same

    even though the projected dimension is kept fixed while the sequence length grows. “This further empirically supports our assertion that the Linformer is linear-time”
  26. 26 / 29

  27. 27 / 29

  28. Conclusion • Linformer, an efficient variant of the Transformer • Linformer

    projects keys and values into a low dimension k, which decreases the self-attention complexity from O(n^2) to O(nk) • When k is a constant much smaller than n, the complexity is linear in n • Experiments show that Linformer performs well even when k is 128 (much smaller than 512, the default sequence length of BERT) 28 / 29
  29. Good Pointers • The Transformer Family (https://lilianweng.github.io) • Sparse

    Transformer and Reformer explained • FAIR official PyText implementation (https://github.com/facebookresearch/pytext) 29 / 29