
Linformer: paper reading

himkt
September 05, 2020


Slides from an in-house paper reading session.


Transcript

  1. Linformer:
    Self-Attention with Linear Complexity
    2020/09/04 Makoto Hiramatsu <@himkt> * Figures come from the original paper


  3. Related works

    Technique                   | Speed | Memory | Performance
    Mixed Precision             |  ✅   |   ✅   |     ✅
    Knowledge Distillation      |  ✅   |   ✅   |     ❌
    Sparse Attention            |  ✅   |   ❌   |     ❌
    LSH Attention               |  ❌   |   ✅   |
    Optimizer-level techniques  |  ❌   |   ✅   |     ❌
    Linformer                   |  ✅   |   ✅   |     ✅


  4. Complexity and sequential operation
    • RNN: O(n) sequential operations, which is undesirable for parallelization
    • Transformer: O(1) sequential operations but O(n^2) complexity
    • Sparse Transformer: lower complexity while keeping O(1) sequential operations
    • Reformer: O(n log n) complexity, O(log n) sequential operations


  5. Complexity and sequential operation
    • RNN: O(n) sequential operations, which is undesirable for parallelization
    • Transformer: O(1) sequential operations but O(n^2) complexity
    • Sparse Transformer: lower complexity while keeping O(1) sequential operations
    • Reformer: O(n log n) complexity, O(log n) sequential operations
    Linformer: linear complexity and O(1) sequential operations!


  6. Preliminary: Self Attention
    (Figure: the attention matrix between Query and Key has N^2 entries)

  7. Rewriting P using A and D
    P = \exp(A) \, D_A^{-1}, where
    A = \frac{Q W_i^Q (K W_i^K)^\top}{\sqrt{d}}, \qquad (D_A)_{ii} = \sum_{j=1}^{N} \exp(A_{ji})


  8. Saturation of Cumulative Eigenvalue

  9. What insight do we get from this observation?
    Self-Attention is Low Rank
    • We may be able to discard most of the self-attention matrix
      while keeping most of its information
    • The first few eigenvectors are dominant
    => Low rank approximation!


  10. Low rank approximation for P
    Pros: reduces the parameters to be learned
    Cons: introduces additional complexity (for the SVD)
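    As a sketch of what this would look like (my own illustration, not from the
    slides), a rank-k approximation of a given attention matrix P via truncated
    SVD; the SVD itself is the extra cost mentioned in the cons:

    import torch

    def low_rank_approx(P, k):
        # P: (n, n) attention matrix; keep only the top-k singular triplets
        U, S, Vh = torch.linalg.svd(P)                # the added cost: a full SVD
        return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k]  # P_tilde = sum_{i<=k} sigma_i u_i v_i^T

    P = torch.softmax(torch.randn(512, 512), dim=-1)
    P_tilde = low_rank_approx(P, k=128)
    print(torch.linalg.matrix_norm(P - P_tilde))      # approximation error (Frobenius norm)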


  11. Low rank approximation for P
    • Question
      • The given parameters are Q, W_i^Q, K, and W_i^K; P itself is not available (as I understand it)
      • Is it possible to obtain u_i, v_i without computing P?


  12. Solution: Just Project!
    # Transformer
    1. Compute scaled dot-product attention
    # Linformer
    1. Simply project V (value) and K (key)
    2. Compute scaled dot-product attention
    https://nlp.seas.harvard.edu/2018/04/03/attention.html
    (Code screenshots: Transformer vs. Linformer)


  13. Linear self-attention (Linformer)
    E and F are introduced for projecting
    KW (key vectors) and VW (value vectors)
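    A minimal sketch of this idea (my simplification: a single head, no batch
    dimension; the official implementation linked later differs in the details).
    E and F are learned projections over the sequence-length dimension, so the
    attention matrix becomes n x k instead of n x n:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LinearSelfAttention(nn.Module):
        """Single-head Linformer-style attention (illustrative sketch)."""

        def __init__(self, n, d, k):
            super().__init__()
            self.w_q = nn.Linear(d, d, bias=False)
            self.w_k = nn.Linear(d, d, bias=False)
            self.w_v = nn.Linear(d, d, bias=False)
            self.E = nn.Linear(n, k, bias=False)  # projects the n key vectors down to k
            self.F = nn.Linear(n, k, bias=False)  # projects the n value vectors down to k
            self.d = d

        def forward(self, x):                                     # x: (n, d)
            q, k_, v = self.w_q(x), self.w_k(x), self.w_v(x)
            k_proj = self.E(k_.transpose(0, 1)).transpose(0, 1)   # (k, d)
            v_proj = self.F(v.transpose(0, 1)).transpose(0, 1)    # (k, d)
            scores = q @ k_proj.transpose(0, 1) / self.d ** 0.5   # (n, k), not (n, n)
            return F.softmax(scores, dim=-1) @ v_proj             # (n, d)

    attn = LinearSelfAttention(n=512, d=64, k=128)
    out = attn(torch.randn(512, 64))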


  14. Transformer and Linformer
    \tilde{P} = \sum_{i=1}^{k} \sigma_i u_i v_i^\top
    P_{\mathrm{Transformer}} = \mathrm{softmax}\left( \frac{Q W_i^Q (K W_i^K)^\top}{\sqrt{d}} \right) \in \mathbb{R}^{n \times n}
    P_{\mathrm{Linformer}} = \mathrm{softmax}\left( \frac{Q W_i^Q (E_i K W_i^K)^\top}{\sqrt{d_k}} \right) \in \mathbb{R}^{n \times k}
    The self-attention matrix of Linformer is smaller than that of the Transformer


  15. Parameter sharing?

  16. Official Implementation
    https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py
    (Code screenshot: the projection layer)


  17. Creating projection layer (nn.Linear)
    https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py
    The projection is an nn.Linear (which implies it is learned)


  18. https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py
    Shared across layers
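    As an illustration of the sharing (a hedged sketch of my own, not the actual
    PyText code): because the projection is just an nn.Linear over the sequence
    length, the same module object can be handed to every layer, so all layers
    train one set of projection weights:

    import torch.nn as nn

    class SketchLayer(nn.Module):
        """Hypothetical layer that accepts an externally created projection."""

        def __init__(self, proj: nn.Linear):
            super().__init__()
            self.proj = proj  # not a copy: the same parameters as in every other layer

    # One learned (seq_len -> k) projection, created once and reused everywhere
    shared_proj = nn.Linear(512, 128, bias=False)
    layers = nn.ModuleList(SketchLayer(shared_proj) for _ in range(12))

    # Every layer points at the same weight tensor, so it is learned jointly
    assert all(layer.proj.weight is shared_proj.weight for layer in layers)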


  19. (Code screenshot: the two projection calls are highlighted)


  20. Cumulative eigenvalues are larger for higher layers
    The dimensionality of E and F could be decreased for higher layers
    (But then the parameters can't be shared across layers)


  21. Performance comparison
    Linformer stays efficient for longer sequences,
    while the Transformer gets slower as the sequence length grows
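    To make the gap concrete, a back-of-the-envelope count of attention-matrix
    entries per head (my own arithmetic, assuming a fixed projected dimension
    k = 128): n^2 grows quadratically while n*k grows only linearly.

    k = 128
    for n in (512, 1024, 4096, 16384):
        full, linear = n * n, n * k
        print(f"n={n:>6}: n^2={full:>12,}  n*k={linear:>10,}  ratio={full // linear}x")
    # ratios: 4x, 8x, 32x, 128x -- the gap widens as the sequence length grows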


  22. Training convergence

  23. Results for n = 512, k = 128 and for n = 1024, k = 256:
    "Linformer's performance is nearly on par with the Transformer"


  24. Sharing parameters reduces memory consumption
    without much detriment to performance


  25. Perplexities after convergence remain about the same
    even though the projected dimension k is kept fixed
    as the sequence length grows

    "This further empirically supports our assertion that
    the Linformer is linear-time"



  28. Conclusion
    • Linformer, an efficient variant of the Transformer
    • Linformer projects keys and values into a low dimension k,
    which decreases the per-query attention cost from n to k
    • When k is a constant much smaller than n, the per-query cost is O(1),
    i.e., the overall self-attention complexity is linear in n
    • Experiments show that Linformer performs well even when k = 128
    (smaller than 512, the default sequence length of BERT)


  29. Good Pointers
    • The Transformer Family
      • https://lilianweng.github.io
      • Sparse Transformer and Reformer explained
    • FAIR official PyText implementation
      • https://github.com/facebookresearch/pytext
