
Linformer: paper reading

himkt
September 05, 2020

Slides from an internal paper-reading session.


Transcript

  1. Related works

     Technique                      | Speed | Memory | Performance
     Mixed Precision                | ✅    | ✅     | ✅
     Knowledge Distillation         | ✅    | ✅     | ❌
     Sparse Attention               | ✅    | ❌     | ❌
     LSH Attention                  | ❌    | ✅     |
     Some techniques for Optimizer  | ❌    | ✅     | ❌
     Linformer                      | ✅    | ✅     | ✅
  2. Complexity and sequential operations

     • RNN: O(n) sequential operations, which is not desirable for parallelization
     • Transformer: O(1) sequential operations but O(n^2) complexity
     • Sparse Transformer: lower complexity while keeping O(1) sequential operations
     • Reformer: O(n log n) complexity, O(log n) sequential operations
  3. Complexity and sequential operations

     • RNN: O(n) sequential operations, which is not desirable for parallelization
     • Transformer: O(1) sequential operations but O(n^2) complexity
     • Sparse Transformer: lower complexity while keeping O(1) sequential operations
     • Reformer: O(n log n) complexity, O(log n) sequential operations

     Linformer: linear complexity and O(1) sequential operations!
  4. Rewriting P using A and D_A, where

     A = \frac{QW_i^Q (KW_i^K)^T}{\sqrt{d}}, \qquad (D_A)_{ii} = \sum_{j=1}^{N} \exp(A_{ji})
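As a concrete reference, here is a minimal NumPy sketch of the quantities on this slide; the toy sizes, the random inputs, and the row-wise softmax convention of standard attention are my own assumptions, not part of the slide.

```python
import numpy as np

# Toy sizes and random (already projected) queries/keys -- illustrative assumptions only.
n, d = 8, 4
rng = np.random.default_rng(0)
QWq = rng.normal(size=(n, d))    # Q W_i^Q
KWk = rng.normal(size=(n, d))    # K W_i^K

A = QWq @ KWk.T / np.sqrt(d)                             # A = Q W_i^Q (K W_i^K)^T / sqrt(d)
P = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)    # P = softmax(A), applied row-wise

assert np.allclose(P.sum(axis=-1), 1.0)                  # each row of P sums to 1
```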
  5. What insight do we get from this observation? Self-attention is low rank

     • We may be able to discard much of the self-attention matrix while keeping most of its information
     • The first few eigenvalues dominate the spectrum => low-rank approximation! (a quick spectrum check follows below)
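One way to probe this claim is to see how much of the singular-value spectrum of a self-attention matrix is covered by its first few singular values. The sketch below does this on a toy matrix built from random low-dimensional queries and keys; the paper instead inspects attention matrices of pretrained Transformer models, so the numbers here are only illustrative.

```python
import numpy as np

def cumulative_spectrum(P: np.ndarray) -> np.ndarray:
    """Fraction of the spectrum covered by the top-i singular values of P."""
    s = np.linalg.svd(P, compute_uv=False)   # singular values in descending order
    return np.cumsum(s) / s.sum()

# Toy attention matrix built from random rank-d logits (illustrative assumption).
n, d = 128, 16
rng = np.random.default_rng(0)
A = rng.normal(size=(n, d)) @ rng.normal(size=(d, n)) / np.sqrt(d)
P = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)

spec = cumulative_spectrum(P)
print(f"top-16 of {n} singular values cover {spec[15]:.1%} of the spectrum")
```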
  6. Low-rank approximation for P

     Pros: reduces the number of parameters to be learned
     Cons: introduces additional computational cost (for the SVD; see the sketch below)
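For reference, a minimal sketch of the rank-k truncation obtained from an explicit SVD, i.e. keeping only the top-k singular triplets; the toy matrix and the choice k = 8 are my own example, not from the slide.

```python
import numpy as np

def low_rank_approx(P: np.ndarray, k: int) -> np.ndarray:
    """Rank-k approximation of P from its top-k singular triplets: sum_i sigma_i u_i v_i^T."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Toy attention-like matrix (same construction as the previous sketch).
n, d = 128, 16
rng = np.random.default_rng(0)
A = rng.normal(size=(n, d)) @ rng.normal(size=(d, n)) / np.sqrt(d)
P = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)

P_tilde = low_rank_approx(P, k=8)
print("relative Frobenius error:", np.linalg.norm(P - P_tilde) / np.linalg.norm(P))
```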
  7. Low-rank approximation for P

     • Question
       • The given parameters are Q, W_i^Q, K, W_i^K, and P itself is not available (as I understand it)
       • Is it possible to obtain u_i, v_i without computing P?
  8. Solution: Just project!

     # Transformer
     1. Compute scaled dot-product attention

     # Linformer
     1. Simply project V (value) and K (key)
     2. Compute scaled dot-product attention

     (Figure: Transformer vs. Linformer multi-head attention; https://nlp.seas.harvard.edu/2018/04/03/attention.html)
  9. Linear self-attention (Linformer)

     E and F are introduced to project KW_i^K (the key vectors) and VW_i^V (the value vectors)
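Below is a minimal single-head sketch of this linear self-attention in PyTorch; the class and variable names, the single-head setup, and the example sizes are my own illustration, not the official fairseq/PyText implementation.

```python
import math
import torch
import torch.nn as nn

class LinearSelfAttention(nn.Module):
    """Single-head Linformer-style attention: project keys/values from length n down to k."""

    def __init__(self, d_model: int, seq_len: int, k: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.E = nn.Linear(seq_len, k, bias=False)  # projects K W^K along the length axis
        self.F = nn.Linear(seq_len, k, bias=False)  # projects V W^V along the length axis
        self.d_model = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, n, d_model)
        q, k_, v = self.w_q(x), self.w_k(x), self.w_v(x)
        k_ = self.E(k_.transpose(1, 2)).transpose(1, 2)    # (batch, k, d_model)
        v = self.F(v.transpose(1, 2)).transpose(1, 2)      # (batch, k, d_model)
        scores = q @ k_.transpose(1, 2) / math.sqrt(self.d_model)  # (batch, n, k)
        return torch.softmax(scores, dim=-1) @ v           # (batch, n, d_model)

x = torch.randn(2, 512, 64)
out = LinearSelfAttention(d_model=64, seq_len=512, k=128)(x)
print(out.shape)  # torch.Size([2, 512, 64])
```

Note that E and F project along the sequence-length axis (n -> k), not the feature axis, which is why they are applied to the transposed key/value tensors.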
  10. Transformer and Linformer

      \tilde{P} = \sum_{i=1}^{k} \sigma_i u_i v_i^T

      P_{\mathrm{Transformer}} = \mathrm{softmax}\left(\frac{QW_i^Q (KW_i^K)^T}{\sqrt{d}}\right)        (n \times n)

      P_{\mathrm{Linformer}} = \mathrm{softmax}\left(\frac{QW_i^Q (E_i KW_i^K)^T}{\sqrt{d_k}}\right)    (n \times k)

      The self-attention matrix of the Linformer is smaller than that of the Transformer.
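To make the size difference concrete, here is a back-of-the-envelope memory comparison for a single attention head; the numbers n = 4096, k = 256 and fp32 storage are my example choices, not from the slide.

```python
n, k = 4096, 256           # example sequence length and projected dimension
bytes_per_float = 4        # fp32

transformer_attn = n * n * bytes_per_float   # n x n attention matrix
linformer_attn = n * k * bytes_per_float     # n x k attention matrix

print(f"Transformer: {transformer_attn / 2**20:.1f} MiB per head")   # 64.0 MiB
print(f"Linformer:   {linformer_attn / 2**20:.1f} MiB per head")     #  4.0 MiB
```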
  11. Cumulative eigenvalues are larger for higher layers

      => The dimensionality of E and F could be decreased in higher layers (but then their parameters can't be shared across layers)
  12. Performance comparison

      Linformer stays efficient for longer sequences, while the Transformer gets slow as the sequence length grows
  13. For n = 512, k = 128; for n = 1024, k = 256:
      "Linformer's performance is nearly on par with the Transformer"
  14. Perplexities after convergence remain about the same even though the projected dimension k is kept fixed as the sequence length grows
      "This further empirically supports our assertion that the Linformer is linear-time"
  15. Conclusion

      • Linformer is an efficient variant of the Transformer
      • Linformer projects the keys and values into a lower dimension, which reduces the self-attention complexity from O(n^2) to O(nk)
      • When k is a constant much smaller than n, the complexity is effectively linear in n (see the rough calculation after this list)
      • Experiments show that Linformer performs well even with k = 128 (smaller than 512, the default sequence length of BERT)
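As a rough illustration of the complexity bullets above (my own toy numbers, ignoring constant factors), the size of the attention score matrix grows quadratically for the Transformer but only linearly for the Linformer when k is fixed:

```python
k = 128  # fixed projected dimension
for n in (512, 1024, 2048, 4096):
    quadratic = n * n    # Transformer attention scores: O(n^2)
    linear = n * k       # Linformer attention scores with fixed k: O(nk) = O(n)
    print(f"n={n:5d}  n*n={quadratic:>10,}  n*k={linear:>8,}  ratio={quadratic // linear}x")
```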
  16. Good Pointers

      • The Transformer Family: https://lilianweng.github.io (Sparse Transformer and Reformer explained)
      • FAIR's official PyText implementation: https://github.com/facebookresearch/pytext