Slide 13
Transformer encoder based on ORT [Herdade+ NeurIPS19]
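• $\omega_G^{mn}$: Positional features between regions $m$ and $n$
$$\omega_G^{mn} \leftarrow \left( \log\frac{|x_m - x_n|}{w_m},\ \log\frac{|y_m - y_n|}{h_m},\ \log\frac{w_n}{w_m},\ \log\frac{h_n}{h_m} \right)$$
$$\omega^{mn} = \frac{\omega_G^{mn}\,\exp(\omega_A^{mn})}{\sum_{l=1}^{N} \omega_G^{ml}\,\exp(\omega_A^{ml})}$$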
• Box multi-head attention (MHA)
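[Figure: box multi-head attention between Region $m$ and Region $n$, with labels $h_m$, $w_n$, $\omega_G^{mn}$, $\omega_A^{mn}$ and $\omega^{mn}$. Concatenate output of attention head $h_{sa}$.]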
$\omega_A^{mn}$: the visual-based weight
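As a rough illustration of the two formulas on this slide, the sketch below computes the four relative-geometry terms for a pair of boxes and then combines geometric and visual-based weights into normalised attention weights. It is only a minimal sketch under stated assumptions, not the ORT implementation: boxes are assumed to be (x, y, w, h) tuples with (x, y) the box centre, the function names `geometric_features` and `combine_weights` are hypothetical, and the small `eps` clamp is added here to avoid log(0) when two centres coincide. The slide does not spell out how the 4-d vector becomes a scalar weight, so the second function simply takes the geometric weights as given non-negative numbers.

```python
import math


def geometric_features(box_m, box_n, eps=1e-3):
    """Relative-geometry terms between two boxes given as (x, y, w, h).

    Assumption: (x, y) is the box centre. `eps` is a hypothetical clamp
    that keeps the logarithm finite when the centres coincide.
    """
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return (
        math.log(max(abs(xm - xn), eps) / wm),
        math.log(max(abs(ym - yn), eps) / hm),
        math.log(wn / wm),
        math.log(hn / hm),
    )


def combine_weights(omega_g, omega_a):
    """Row-wise omega_G * exp(omega_A) / sum_l(omega_G * exp(omega_A)).

    omega_g: N x N list of non-negative geometric weights (taken as given).
    omega_a: N x N list of visual-based weights (unnormalised scores).
    Assumes every row has at least one positive geometric weight.
    """
    combined = []
    for g_row, a_row in zip(omega_g, omega_a):
        # Subtracting the row maximum does not change the ratio but
        # keeps exp() from overflowing.
        a_max = max(a_row)
        unnorm = [g * math.exp(a - a_max) for g, a in zip(g_row, a_row)]
        total = sum(unnorm)
        combined.append([u / total for u in unnorm])
    return combined


# Two example boxes (made-up values for illustration only).
feats = geometric_features((10.0, 10.0, 4.0, 4.0), (14.0, 12.0, 8.0, 2.0))

# With equal visual-based weights, attention follows the geometric weights.
weights = combine_weights([[1.0, 2.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]])
```

In this example each row of `weights` sums to 1, and with all visual-based weights equal the result is just the geometric weights renormalised per row. Setting every geometric weight to 1 would reduce the formula to an ordinary softmax over the visual-based weights.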