Slide 48
Slide 48 text
Formalization of QKV attention
Construct $\hat{\mathbf{Q}} = (\hat{\mathbf{q}}_1, \ldots, \hat{\mathbf{q}}_T)$ as sums of $\mathbf{V} = (\mathbf{v}_1, \ldots, \mathbf{v}_S)$ weighted by the dot products between $\mathbf{Q} = (\mathbf{q}_1, \ldots, \mathbf{q}_T)$ and $\mathbf{K} = (\mathbf{k}_1, \ldots, \mathbf{k}_S)$.
$\hat{\mathbf{Q}} = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathbf{V}\,\mathrm{softmax}(c\,\mathbf{K}^\top \mathbf{Q})$
where $\mathbf{Q} = (\mathbf{q}_1, \ldots, \mathbf{q}_T) \in \mathbb{R}^{d \times T}$, $\hat{\mathbf{Q}} = (\hat{\mathbf{q}}_1, \ldots, \hat{\mathbf{q}}_T) \in \mathbb{R}^{d \times T}$, $\mathbf{K} = (\mathbf{k}_1, \ldots, \mathbf{k}_S) \in \mathbb{R}^{d \times S}$, $\mathbf{V} = (\mathbf{v}_1, \ldots, \mathbf{v}_S) \in \mathbb{R}^{d \times S}$
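A minimal NumPy sketch of this formula, using the column-vector convention above; the scaling constant $c$ is assumed here to be $1/\sqrt{d}$, as in standard scaled dot-product attention (the slide only names it $c$):

```python
import numpy as np

def attention(Q, K, V):
    """QKV attention: Q_hat = V softmax(c K^T Q).

    Q: (d, T) queries, K: (d, S) keys, V: (d, S) values, one vector per column.
    Returns Q_hat of shape (d, T).
    """
    d = Q.shape[0]
    c = 1.0 / np.sqrt(d)                         # assumed scaling c = 1/sqrt(d)
    scores = c * (K.T @ Q)                       # (S, T) dot products k_s . q_t
    scores -= scores.max(axis=0, keepdims=True)  # shift for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=0, keepdims=True)            # softmax over the S key positions
    return V @ A                                 # (d, T) weighted sums of value vectors
```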
Self-attention in encoders (reconstruct $\mathbf{H}$ by attending to $\mathbf{H}$):
$\mathbf{Q} = \mathbf{W}_Q \mathbf{H}$, $\mathbf{K} = \mathbf{W}_K \mathbf{H}$, $\mathbf{V} = \mathbf{W}_V \mathbf{H}$, with $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ ($S = T = I$)
Self-attention in decoders (reconstruct $\mathbf{Z}$ by attending to $\mathbf{Z}$):
$\mathbf{Q} = \mathbf{W}_Q \mathbf{Z}$, $\mathbf{K} = \mathbf{W}_K \mathbf{Z}$, $\mathbf{V} = \mathbf{W}_V \mathbf{Z}$, with $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ ($S = T = J$)
Cross attention (reconstruct $\mathbf{H}$ by attending to $\mathbf{H}$ from $\mathbf{Z}$):
$\mathbf{Q} = \mathbf{W}_Q \mathbf{Z}$, $\mathbf{K} = \mathbf{W}_K \mathbf{H}$, $\mathbf{V} = \mathbf{W}_V \mathbf{H}$, with $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ ($S = I$, $T = J$)
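The three variants differ only in which sequence the queries, keys, and values are projected from. A short continuation of the sketch above (reusing the `attention` function defined there; the $d \times d$ projection matrices `Wq`, `Wk`, `Wv` are hypothetical NumPy arrays):

```python
def self_attention(X, Wq, Wk, Wv):
    """Self-attention: Q, K, V all come from the same sequence
    (X = H in the encoder, X = Z in the decoder), so S = T."""
    return attention(Wq @ X, Wk @ X, Wv @ X)

def cross_attention(Z, H, Wq, Wk, Wv):
    """Cross attention: queries from decoder states Z (T = J),
    keys and values from encoder states H (S = I)."""
    return attention(Wq @ Z, Wk @ H, Wv @ H)

# Example shapes (illustrative values only)
d, I, J = 64, 10, 7
H = np.random.randn(d, I)                  # encoder states, I columns
Z = np.random.randn(d, J)                  # decoder states, J columns
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = cross_attention(Z, H, Wq, Wk, Wv)    # shape (d, J): one output per decoder position
```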
1. Multi-head attention