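• Box multi-head attention (MHA)

$\omega_G^{mn}$: positional features between regions $m$ and $n$

$$\omega_G^{mn} \leftarrow \left( \log\frac{|x_m - x_n|}{w_m},\ \log\frac{|y_m - y_n|}{h_m},\ \log\frac{w_n}{w_m},\ \log\frac{h_n}{h_m} \right)$$

$\omega_A^{mn}$: the visual-based weight

$$\omega^{mn} = \frac{\omega_G^{mn}\exp(\omega_A^{mn})}{\sum_{l=1}^{N} \omega_G^{ml}\exp(\omega_A^{ml})}$$

[Figure: box multi-head attention between Region m and Region n, with box width $w_m$ and height $h_m$; the outputs of the attention heads are concatenated.]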
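The weighting above can be illustrated with a minimal NumPy sketch for a single attention head. The function names, the `(x_center, y_center, width, height)` box layout, and the small clamp `eps` (which keeps the logarithm finite when two box centers coincide) are assumptions made for illustration and are not taken from the slide. How the four-dimensional positional feature is turned into the scalar geometric weight is not shown on the slide, so the sketch takes a non-negative matrix `w_g` as an input.

```python
import numpy as np


def box_geometry_features(boxes, eps=1e-3):
    """Pairwise positional features between regions.

    boxes: (N, 4) array of (x_center, y_center, width, height); this layout
    is an assumption. Returns an (N, N, 4) array whose entry [m, n] is
    (log|x_m-x_n|/w_m, log|y_m-y_n|/h_m, log w_n/w_m, log h_n/h_m).
    """
    x, y, w, h = (boxes[:, i] for i in range(4))
    # Clamp the ratios at eps so log(0) never occurs (assumed safeguard).
    dx = np.log(np.maximum(np.abs(x[:, None] - x[None, :]) / w[:, None], eps))
    dy = np.log(np.maximum(np.abs(y[:, None] - y[None, :]) / h[:, None], eps))
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)


def combined_attention(w_g, w_a):
    """Combine geometric weights w_g with visual logits w_a.

    w_g: (N, N) non-negative geometric weights, each row having at least
    one positive entry. w_a: (N, N) visual-based attention logits.
    Returns w[m, n] = w_g[m, n] * exp(w_a[m, n]) / sum_l w_g[m, l] * exp(w_a[m, l]).
    """
    # Subtracting the row maximum cancels in the ratio and avoids overflow.
    scores = w_g * np.exp(w_a - w_a.max(axis=1, keepdims=True))
    return scores / scores.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    boxes = np.array([[10.0, 10.0, 4.0, 4.0],
                      [20.0, 10.0, 4.0, 8.0],
                      [10.0, 30.0, 2.0, 4.0]])
    feats = box_geometry_features(boxes)
    print(feats.shape)  # (3, 3, 4)

    rng = np.random.default_rng(0)
    w_a = rng.normal(size=(3, 3))            # stand-in visual logits
    w_g = rng.uniform(0.1, 1.0, size=(3, 3))  # stand-in geometric weights
    print(combined_attention(w_g, w_a).sum(axis=1))  # each row sums to 1
```

When every geometric weight is equal, the result reduces to an ordinary softmax over the visual logits, which makes the role of the geometric term as a multiplicative prior easy to see.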