Slide 118
Slide 118 text
Visual Encoder
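• The visual features $\{v_i\}_{i=1}^{C}$ and the bbox coordinates $\{b_i\}_{i=1}^{C}$ are fed into the fully connected layers $FC_1$ and $FC_2$, respectively
• $f = F_{\mathrm{vis}}(\mathrm{Concat}(FC_1(v_1), FC_2(b_1), \dots, FC_1(v_C), FC_2(b_C))) \in \mathbb{R}^{(2C) \times Q}$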
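The interleaving of projected region features and projected box coordinates can be sketched as follows. This is a minimal illustration in plain Python, not the original implementation: the helper names (`linear`, `init_linear`, `build_visual_tokens`), the random initialisation, the toy sizes, and the 4-value box format are all assumptions. It only builds the sequence of 2C tokens of width Q that would be passed to the visual encoder; the encoder itself is omitted.

```python
import random


def linear(x, weight, bias):
    """Fully connected layer: weight is (out_dim x in_dim), x has in_dim values."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Hypothetical random initialisation, for illustration only."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def build_visual_tokens(features, boxes, fc1, fc2):
    """Interleave FC_1(v_i) and FC_2(b_i) into one sequence of 2C tokens."""
    tokens = []
    for v, b in zip(features, boxes):
        tokens.append(linear(v, *fc1))  # projected region feature
        tokens.append(linear(b, *fc2))  # projected box coordinates
    return tokens


rng = random.Random(0)
C, feat_dim, Q = 3, 8, 4            # toy sizes (assumed)
fc1 = init_linear(feat_dim, Q, rng)
fc2 = init_linear(4, Q, rng)        # boxes as (x1, y1, x2, y2), an assumption
features = [[rng.random() for _ in range(feat_dim)] for _ in range(C)]
boxes = [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 1.0, 1.0], [0.3, 0.3, 0.7, 0.9]]

tokens = build_visual_tokens(features, boxes, fc1, fc2)
print(len(tokens), len(tokens[0]))  # 6 4
```

Using two separate projections lets features and coordinates, which have different input widths, land in the same Q-dimensional space before they are concatenated along the sequence axis.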
Encoder-Decoder module
Context-aware Question Encoder
• The input question is replaced with $X$ = “context: {caption} + {tags}. question: {question}” and fed into the Transformer encoder $F_{\mathrm{enc}}$
• $q = F_{\mathrm{enc}}(X)$
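A small sketch of how the context-aware prompt might be assembled from the template on this slide. The function name `build_context_prompt` and the choice to join tags with commas are assumptions, and the " + " separator is taken literally from the template, which may differ from the original implementation.

```python
def build_context_prompt(question, caption, tags):
    """Fill the template 'context: {caption} + {tags}. question: {question}'."""
    tag_text = ", ".join(tags)  # assumption: tags are joined with commas
    return f"context: {caption} + {tag_text}. question: {question}"


prompt = build_context_prompt(
    question="What sport is this?",
    caption="a man holding a tennis racket",
    tags=["tennis", "racket"],
)
print(prompt)
# context: a man holding a tennis racket + tennis, racket. question: What sport is this?
```

The resulting string would then be tokenized and passed to the Transformer encoder to obtain the question representation.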
Generative Decoder
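• The concatenation of $\alpha_i$, $\beta_j$, $f$, and $q$ is fed into the Transformer decoder $F_{\mathrm{dec}}$
• $y = F_{\mathrm{dec}}(\mathrm{Concat}(\alpha_1, \dots, \alpha_O, \beta_1, \dots, \beta_N, f, q))$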
• $\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t})$
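The training objective is the standard autoregressive negative log-likelihood of the answer tokens. The sketch below computes it from per-step probability distributions in plain Python; `sequence_nll` and the toy numbers are hypothetical. In practice the distributions would come from the decoder's softmax output under teacher forcing.

```python
import math


def sequence_nll(step_probs, target_ids):
    """Return -sum_t log p(y_t | y_<t).

    step_probs[t] is the predicted distribution over the vocabulary at step t
    (already conditioned on the previous target tokens);
    target_ids[t] is the index of the ground-truth token y_t.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution is required per target token")
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))


# Toy example: vocabulary of 3 tokens, answer of length 3.
step_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
target_ids = [0, 1, 2]

loss = sequence_nll(step_probs, target_ids)
print(f"{loss:.4f}")  # 1.2730, i.e. -log(0.7 * 0.8 * 0.5)
```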