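Positional features between regions $m$ and $n$:

$$\boldsymbol{\omega}^{G}_{mn} \leftarrow \left( \log\frac{|x_m - x_n|}{w_m},\ \log\frac{|y_m - y_n|}{h_m},\ \log\frac{w_n}{w_m},\ \log\frac{h_n}{h_m} \right)$$

$\omega^{A}_{mn}$: the visual-based weight

$$\boldsymbol{\omega}_{mn} = \frac{\boldsymbol{\omega}^{G}_{mn}\exp\left(\boldsymbol{\omega}^{A}_{mn}\right)}{\sum_{l=1}^{N}\boldsymbol{\omega}^{G}_{ml}\exp\left(\boldsymbol{\omega}^{A}_{ml}\right)}$$

• Box multi-head attention (MHA): concatenate the outputs of the attention heads $\boldsymbol{h}_{sa}$

[Figure: Region m and Region n with box dimensions $h_m$ and $w_n$; each attention head $\boldsymbol{h}_{sa}$ uses the weights $\boldsymbol{\omega}_{mn}$, and the head outputs are concatenated into $\boldsymbol{h}_{mh}$.]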
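The positional features and the normalized weight on this slide can be sketched in NumPy as follows. This is a minimal illustration, not the original implementation. The function names, the `(x, y, w, h)` box layout, and the `eps` clamp (which keeps the logarithm finite when two boxes share a coordinate) are assumptions. The sketch also assumes the 4-dimensional feature has already been reduced to one non-negative scalar per region pair before it enters the normalization; that reduction is not shown on the slide.

```python
import numpy as np

def box_relative_features(boxes, eps=1e-3):
    """Pairwise positional features; result[m, n] describes region n relative to m.

    boxes: (N, 4) array of (x, y, w, h). This layout is an assumption.
    eps: hypothetical lower clamp so that log(0) never occurs (e.g. when m == n).
    """
    boxes = np.asarray(boxes, dtype=float)
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    # Rows index region m, columns index region n.
    dx = np.log(np.maximum(np.abs(x[:, None] - x[None, :]) / w[:, None], eps))
    dy = np.log(np.maximum(np.abs(y[:, None] - y[None, :]) / h[:, None], eps))
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)  # shape (N, N, 4)

def combined_attention_weights(w_geo, w_app):
    """omega[m, n] = w_geo[m, n] * exp(w_app[m, n]) / sum_l w_geo[m, l] * exp(w_app[m, l]).

    w_geo: (N, N) non-negative geometric weights (assumed already scalar per pair,
           with at least one positive entry in every row).
    w_app: (N, N) visual-based weights.
    """
    w_geo = np.asarray(w_geo, dtype=float)
    w_app = np.asarray(w_app, dtype=float)
    # Subtracting the row maximum cancels in the ratio and avoids overflow in exp.
    scores = w_geo * np.exp(w_app - w_app.max(axis=1, keepdims=True))
    return scores / scores.sum(axis=1, keepdims=True)

# Usage with two made-up boxes.
boxes = [[10.0, 10.0, 4.0, 2.0], [14.0, 11.0, 8.0, 4.0]]
features = box_relative_features(boxes)
weights = combined_attention_weights(np.ones((2, 2)), np.zeros((2, 2)))
```

Each row of `weights` sums to 1, so the combined weight acts like a softmax over regions in which the geometric term rescales every exponentiated visual score.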