ui vT i ˜ P softmax( QWQ i (KWK i )T d ) PTransformer softmax( QWQ i (Ei KWK i )T dk ) PLinformer n × n n × d Self-attention matrix of Linformer is smaller than matrices
project key and value into low dimension, which decreases computational complexity from N to k • When k is much smaller than N, complexity would be O(1) • Experiments show that Linformer performs well even if k is 128, (smaller than 512, the default sequence length of BERT) 28 / 29