Slide 1

Linformer: Self-Attention with Linear Complexity
2020/09/04 Makoto Hiramatsu <@himkt>
* Figures come from the original paper

Slide 2

No content

Slide 3

Related works

                               Speed   Memory   Performance
Mixed Precision                  ✅       ✅         ✅
Knowledge Distillation           ✅       ✅         ❌
Sparse Attention                 ✅       ❌         ❌
LSH Attention                    ❌       ✅
Some techniques for Optimizer    ❌       ✅         ❌
Linformer                        ✅       ✅         ✅

Slide 4

Complexity and sequential operations
• RNN: O(n) sequential operations, which is not desirable for parallelization
• Transformer: O(1) sequential operations but O(n^2) complexity
• Sparse Transformer: lower complexity while keeping O(1) sequential operations
• Reformer: O(n log n) complexity, logarithmic sequential operations

Slide 5

Complexity and sequential operations
• RNN: O(n) sequential operations, which is not desirable for parallelization
• Transformer: O(1) sequential operations but O(n^2) complexity
• Sparse Transformer: lower complexity while keeping O(1) sequential operations
• Reformer: O(n log n) complexity, logarithmic sequential operations
Linear complexity and O(1) sequential operations!

Slide 6

Preliminary: Self-Attention
[Figure: the attention matrix computed from Query and Key has N^2 entries]
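For reference, here is a minimal sketch (mine, not from the slides) of standard scaled dot-product attention in PyTorch; the explicit n × n score matrix is the quadratic bottleneck Linformer targets.

```python
import torch
from torch.nn.functional import softmax

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (n, d) for a single head; the score matrix below is n x n.
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (n, n): the O(n^2) bottleneck
    P = softmax(scores, dim=-1)                   # row-wise softmax
    return P @ V                                  # (n, d)

n, d = 512, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                                  # torch.Size([512, 64])
```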

Slide 7

Rewriting P using A and D_A, where

$$A = \frac{Q W_i^Q (K W_i^K)^\top}{\sqrt{d}}, \qquad (D_A)_{ii} = \sum_{j=1}^{N} \exp(A_{ji})$$
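For completeness (this relation comes from the paper's derivation rather than the slide text), with these definitions the softmax attention matrix can be written as a normalized matrix exponential, which is the form the low-rank argument works with:

$$P = \operatorname{softmax}\!\left(\frac{Q W_i^Q (K W_i^K)^\top}{\sqrt{d}}\right) = \exp(A)\, D_A^{-1}$$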

Slide 8

Saturation of Cumulative Eigenvalue

Slide 9

What insight from the observation? Self-Attention is Low Rank
• We may be able to discard much of the self-attention matrix while keeping most of its information
• The first few eigenvectors are dominant => low-rank approximation!
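A minimal sketch of the kind of spectrum check behind this claim (a toy random attention matrix stands in for the pretrained-model matrices the paper analyzes, so the saturation here is only illustrative):

```python
import torch
from torch.nn.functional import softmax

torch.manual_seed(0)
n, d = 512, 64                                # toy sizes
Q, K = torch.randn(n, d), torch.randn(n, d)
P = softmax(Q @ K.T / d ** 0.5, dim=-1)       # (n, n) self-attention matrix

# Cumulative normalized singular values: early saturation means P is close to low rank.
s = torch.linalg.svdvals(P)
cumulative = torch.cumsum(s, dim=0) / s.sum()
print(cumulative[127].item())                 # spectral mass captured by the top 128 components
```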

Slide 10

Low rank approximation for P
Pros: reduces the number of parameters to be learned
Cons: introduces additional complexity (for computing the SVD)
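A sketch of what an explicit SVD-based approximation would look like (illustrative only; this is the costly route the "Cons" above refers to, not what Linformer ends up doing):

```python
import torch
from torch.nn.functional import softmax

def rank_k_approx(P, k):
    # Best rank-k approximation (Eckart-Young): keep only the top-k singular triplets.
    U, S, Vh = torch.linalg.svd(P, full_matrices=False)
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

n, d, k = 512, 64, 128
Q, K = torch.randn(n, d), torch.randn(n, d)
P = softmax(Q @ K.T / d ** 0.5, dim=-1)
P_tilde = rank_k_approx(P, k)
# Relative Frobenius error of the rank-k reconstruction:
print((torch.linalg.matrix_norm(P - P_tilde) / torch.linalg.matrix_norm(P)).item())
```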

Slide 11

Low rank approximation for P
• Question
  • The given parameters are Q, W_i^Q, K, W_i^K, and P is not available (as I understand it)
  • Is it possible to get u_i, v_i without P?

Slide 12

Solution: Just Project!

# Transformer
1. Compute scaled dot-product attention

# Linformer
1. Simply project V (value) and K (key)
2. Compute scaled dot-product attention

https://nlp.seas.harvard.edu/2018/04/03/attention.html
[Figure: Transformer vs. Linformer attention for each head h]

Slide 13

Linear self-attention (Linformer)
E and F are introduced for projecting KW^K (key vectors) and VW^V (value vectors)
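A minimal single-head sketch of this idea in PyTorch (my own illustration, not the official implementation; the class name LinformerSelfAttention and its arguments are hypothetical): E and F are learned linear maps over the sequence dimension that compress the n keys and values down to k before ordinary scaled dot-product attention.

```python
import torch
import torch.nn as nn
from torch.nn.functional import softmax

class LinformerSelfAttention(nn.Module):
    """Single-head sketch of Linformer's linear self-attention: project K and V from n to k."""

    def __init__(self, d_model, seq_len, k):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        # E and F compress the sequence (length) dimension: n -> k.
        self.E = nn.Linear(seq_len, k, bias=False)
        self.F = nn.Linear(seq_len, k, bias=False)

    def forward(self, x):                                          # x: (batch, n, d_model)
        d = x.size(-1)
        q = self.wq(x)                                             # (batch, n, d)
        k_ = self.E(self.wk(x).transpose(1, 2)).transpose(1, 2)    # (batch, k, d)
        v_ = self.F(self.wv(x).transpose(1, 2)).transpose(1, 2)    # (batch, k, d)
        scores = q @ k_.transpose(1, 2) / d ** 0.5                 # (batch, n, k), not (n, n)
        return softmax(scores, dim=-1) @ v_                        # (batch, n, d_model)

x = torch.randn(2, 512, 64)
attn = LinformerSelfAttention(d_model=64, seq_len=512, k=128)
print(attn(x).shape)                                               # torch.Size([2, 512, 64])
```

The score matrix is n × k instead of n × n, which is where the linear complexity comes from.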

Slide 14

Transformer and Linformer

$$\tilde{P} = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$$

$$P_{\text{Transformer}} = \operatorname{softmax}\!\left(\frac{Q W_i^Q (K W_i^K)^\top}{\sqrt{d}}\right) \quad (n \times n)$$

$$P_{\text{Linformer}} = \operatorname{softmax}\!\left(\frac{Q W_i^Q (E_i K W_i^K)^\top}{\sqrt{d_k}}\right) \quad (n \times k)$$

The self-attention matrix of Linformer is smaller than the Transformer's.
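A rough cost comparison these shapes imply (my summary, consistent with the paper's linear-complexity claim):

$$\text{Transformer: } O(n^2)\ \text{time and memory per head}, \qquad \text{Linformer: } O(nk),\ \text{i.e. } O(n)\ \text{for a fixed } k \ll n$$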

Slide 15

Parameter sharing?

Slide 16

Official Implementation
https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py
[Code screenshot: the projection layer]

Slide 17

Creating the projection layer (nn.Linear)
https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py
[Code screenshot: the projection is an nn.Linear, which implies it is learned]

Slide 18

https://github.com/facebookresearch/pytext/blob/master/pytext/models/representations/transformer/multihead_linear_attention.py
[Code screenshot: the projection layers are shared across layers]
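A hedged sketch of what layerwise sharing could look like (illustrative only, not the PyText code; all names here are hypothetical): create one E/F pair up front and hand the same module objects to every layer.

```python
import torch.nn as nn

# One shared pair of projections, reused by every attention layer (layerwise sharing).
seq_len, k, num_layers, d_model = 512, 128, 12, 64
shared_E = nn.Linear(seq_len, k, bias=False)
shared_F = nn.Linear(seq_len, k, bias=False)

class SharedProjectionAttention(nn.Module):
    """Attention layer that receives externally created E/F projection modules."""
    def __init__(self, d_model, E, F_proj):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(d_model, d_model) for _ in range(3))
        self.E, self.F_proj = E, F_proj  # same module objects => shared parameters

layers = nn.ModuleList(
    SharedProjectionAttention(d_model, shared_E, shared_F) for _ in range(num_layers)
)
assert all(layer.E is shared_E for layer in layers)  # every layer shares the same E
```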

Slide 19


Slide 20

Cumulative eigenvalues are larger for higher layers
The dimensionalities of E and F could therefore be decreased for higher layers (but then the parameters can't be shared across layers)
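For illustration (a hypothetical configuration, not from the paper or the slides): giving higher layers a smaller projected dimension means each layer needs its own E/F, which is exactly why this option rules out layerwise sharing.

```python
import torch.nn as nn

seq_len = 512
# Hypothetical schedule: higher layers use a smaller projected dimension k.
ks = [256, 256, 128, 128, 64, 64]

# One (E, F) pair per layer; the sizes differ, so they cannot be shared across layers.
projections = nn.ModuleList(
    nn.ModuleDict({
        "E": nn.Linear(seq_len, k, bias=False),
        "F": nn.Linear(seq_len, k, bias=False),
    })
    for k in ks
)
print([p["E"].out_features for p in projections])  # [256, 256, 128, 128, 64, 64]
```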

Slide 21

Performance comparison
Linformer stays efficient for longer sequences; the Transformer gets slow as the sequence length grows

Slide 22

Training convergence

Slide 23

For n = 512, k = 128; for n = 1024, k = 256
“Linformer’s performance is nearly on par with the Transformer”

Slide 24

Sharing parameters reduces memory consumption without much detriment to performance

Slide 25

Perplexities after convergence remain about the same even though the projected dimension k is kept fixed as sequences get longer
“This further empirically supports our assertion that the Linformer is linear-time”

Slide 26


Slide 27


Slide 28

Conclusion
• Linformer, an efficient variant of the Transformer
• Linformer projects the keys and values into a lower dimension, which decreases the computational complexity from O(N^2) to O(Nk)
• When k is a constant much smaller than N, the complexity becomes linear in N
• Experiments show that Linformer performs well even when k is 128 (smaller than 512, the default sequence length of BERT)

Slide 29

Good Pointers
• The Transformer Family
  • https://lilianweng.github.io
  • Sparse Transformer and Reformer explained
• FAIR official PyText implementation
  • https://github.com/facebookresearch/pytext