[Journal club] ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning

ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning
Jingyu Li (1), Zhendong Mao (1), Shancheng Fang (1), Hao Li (2)
(1) University of Science and Technology of China, (2) Huazhong University of Science and Technology
IJCAI 2022
Komei Sugiura Laboratory, Motonari Kambara
Li, J., Mao, Z., Fang, S., & Li, H. (2022). ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning. IJCAI.

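Relations between objects matter in image captioning
"There are a parasol and a chair."
• Appropriate in wording and grammar
• Insufficient as a detailed description of the scene
"A chair is outside the parasol." / "A chair is under the parasol."
Generating expressions about the relations between objects yields more appropriate descriptions [Herdade+ NeurIPS19]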

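Positional information alone is not enough to generate appropriate captions
"There is a man in front of the cat."
• Takes the positional information between objects into account
• Not wrong, but insufficient as a description of the scene
"A man is feeding the cat."
• Takes semantic relations into account in addition to the positional information between objects
• Can generate expressions that positional information alone cannot capture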

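Related work: few methods use both positional and semantic information

Method            Overview
[Luo+ AAAI21]     • Uses both grid-level and region-level features
                  • Mixes the features with a cross-attention mechanism
[Huang+ ICCV19]   • Adds an Attention on Attention module that computes self-correlation
[Yao+ ECCV18]     • Builds one graph for positional information and one for semantic information
                  • Keeps the relations between regions as edge features

Figures: [Yao+ ECCV18], [Luo+ AAAI21]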

Proposed method: Enhanced-Adaptive Relation Self-Attention Network (ER-SAN)
• Builds graphs from positional information and semantic information
• Uses a transformer to learn the relations captured in each graph

Feature Extraction Module: extracts each type of feature

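Node features: extracted with Faster R-CNN
Nodes
Region features are extracted with Faster R-CNN: $o^v = [o^v_1, o^v_2, \dots, o^v_n]$
They are combined with the class-label embedding $o^c$ to obtain the node feature $H_{o_i}$:
$H_{o_i} = \phi_o(o^v_i; o^c_i) + o^v_i$, where $\phi_o$ is a feed-forward network
The resulting node features are used in both the semantic graph and the geometric graph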
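The node-feature formula above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the feature sizes, the random stand-ins for the Faster R-CNN features and class-label embeddings, the two-layer ReLU form of $\phi_o$, and reading ";" as concatenation are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_v, d_c = 4, 8, 3  # hypothetical: 4 regions, feature sizes 8 and 3

o_v = rng.normal(size=(n, d_v))  # stand-in for Faster R-CNN region features
o_c = rng.normal(size=(n, d_c))  # stand-in for class-label embeddings

# Hypothetical two-layer feed-forward network phi_o (weights are random here)
W1 = rng.normal(size=(d_v + d_c, 16)) * 0.1
b1 = np.zeros(16)
W2 = rng.normal(size=(16, d_v)) * 0.1
b2 = np.zeros(d_v)

def phi_o(x):
    """Feed-forward network: linear -> ReLU -> linear."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def node_features(o_v, o_c):
    """H_{o_i} = phi_o(o_i^v; o_i^c) + o_i^v, with ';' read as concatenation."""
    fused = np.concatenate([o_v, o_c], axis=-1)
    return phi_o(fused) + o_v  # residual connection back to the region feature

H = node_features(o_v, o_c)
print(H.shape)
```

The residual term means $\phi_o$ must map back to the dimension of $o^v_i$, which is why `W2` projects to `d_v` in this sketch.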

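Semantic graph: builds a graph from the semantic relations between objects
Semantic relation triples are extracted in the same way as [Shi+ 20], e.g. <woman, hold, pale>
1. Detect objects with Faster R-CNN
2. Extract semantic relation triples from the captions
3. Train the model to select the appropriate pairs of regions
4. At inference time, predict the semantic relation triples themselves
The relation between regions i and j is used as an embedded representation, the semantic vector $S_{ij}$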
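How a set of relation triples might become the per-pair semantic vectors $S_{ij}$ can be sketched as a simple embedding lookup. Everything concrete here is hypothetical: the detected labels, the relation vocabulary, the "<none>" index for unrelated pairs, and the random embedding table. The triple-prediction model itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

labels = ["woman", "umbrella", "street"]  # hypothetical detected objects
relations = ["<none>", "holding", "on"]   # hypothetical vocabulary; index 0 = no relation
triples = [("woman", "holding", "umbrella"), ("woman", "on", "street")]

d_s = 4
rel_embedding = rng.normal(size=(len(relations), d_s))  # stand-in for a learned table

def semantic_vectors(labels, triples):
    """Return the relation-id matrix and the embedded vectors S of shape (n, n, d_s)."""
    idx = {name: k for k, name in enumerate(labels)}
    rel_id = np.zeros((len(labels), len(labels)), dtype=int)
    for subj, pred, obj in triples:
        # Directed edge: subject region i -> object region j
        rel_id[idx[subj], idx[obj]] = relations.index(pred)
    return rel_id, rel_embedding[rel_id]

rel_id, S = semantic_vectors(labels, triples)
print(rel_id)
print(S.shape)
```

Because the edges are directed, `rel_id[0, 1]` (woman to umbrella) is set while `rel_id[1, 0]` stays at the "<none>" index.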

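Geometric graph: builds a graph from the relative positions of objects
Region i: position $(x_i, y_i)$, width $w_i$, height $h_i$
Region j: position $(x_j, y_j)$, width $w_j$, height $h_j$
The geometric feature $g_{ij}$ is obtained as follows:
$g_{ij} = \left( \log\frac{|x_i - x_j|}{w_i},\ \log\frac{|y_i - y_j|}{h_i},\ \log\frac{w_i}{w_j},\ \log\frac{h_i}{h_j} \right)$
Embedding $g_{ij}$ gives the edge feature $G_{ij}$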
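The four-component geometric feature can be computed for every pair of regions at once. The sketch below assumes boxes are given as (x, y, w, h) rows and takes the absolute value of the coordinate differences. The `eps` clamp, which avoids log(0) when two regions share a coordinate (including i = j), is an assumption added for this example and is not stated on the slide.

```python
import numpy as np

def geometric_features(boxes, eps=1e-3):
    """boxes: (n, 4) array of (x, y, w, h). Returns g with shape (n, n, 4)."""
    x, y, w, h = (boxes[:, k] for k in range(4))
    dx = np.abs(x[:, None] - x[None, :]) / w[:, None]  # |x_i - x_j| / w_i
    dy = np.abs(y[:, None] - y[None, :]) / h[:, None]  # |y_i - y_j| / h_i
    return np.stack(
        [
            np.log(np.maximum(dx, eps)),      # clamp is an assumption (avoids log 0)
            np.log(np.maximum(dy, eps)),
            np.log(w[:, None] / w[None, :]),  # log(w_i / w_j)
            np.log(h[:, None] / h[None, :]),  # log(h_i / h_j)
        ],
        axis=-1,
    )

# Two hypothetical boxes
boxes = np.array([[10.0, 20.0, 4.0, 8.0],
                  [14.0, 28.0, 2.0, 4.0]])
g = geometric_features(boxes)
print(g[0, 1])
```

Note that $g_{ij}$ is not symmetric: the first two components are normalized by the size of region i, and the last two flip sign when i and j are swapped.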

Enhanced-Adaptive Relation Attention: models the two kinds of relations with a transformer

Enhanced-Adaptive Relation Attention: three modules
①: Direction-Sensitive Semantic-Enhanced Attention: self-correlation over the semantic relation features
②: Geometric-Enhanced Attention: self-correlation over the positional features
③: Adaptive Re-weight Relation: weights the two kinds of features
Figure labels: positional information is important / semantic information is important

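Semantic/Geometric-Enhanced Attention
①: Direction-Sensitive Semantic-Enhanced Attention
• Considers attention in two directions
• Uses $Q_c$ and $K_c$, obtained by linear transformations of the node features, together with the semantic vector $S$
$\phi(Q_c, K_c, S) = \underbrace{Q_c K_s^{\top}}_{\text{c2s}} + \underbrace{Q_s K_c^{\top}}_{\text{s2c}}$, where $Q_s = S W_{qs}$ and $K_s = S W_{ks}$
Example: <woman, holding, umbrella> gives c2s: <woman, holding> and s2c: <holding, umbrella>
②: Geometric-Enhanced Attention
• Attention is computed in the same way from the positional features
$\varphi(Q_c, G) = Q_c K_g^{\top}$, where $K_g = G W_{kg}$
Finally, the following is normalized and used as the attention weight:
$\tilde{A} = Q_c K_c^{\top} + \phi(Q_c, K_c, S) + \varphi(Q_c, G)$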
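The combined attention score can be sketched in NumPy as below. Several points are assumptions rather than details confirmed by the slides: $S$ and $G$ are treated as per-pair tensors of shape (n, n, d), so each product is evaluated pair by pair with `einsum`; the normalization is taken to be a row-wise softmax; and all weights and inputs are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 4  # hypothetical: 3 regions, feature size 4

H = rng.normal(size=(n, d))     # node features
S = rng.normal(size=(n, n, d))  # semantic vectors S_ij (assumed per-pair)
G = rng.normal(size=(n, n, d))  # geometric edge features G_ij (assumed per-pair)

# Random stand-ins for the learned projection matrices
W_q, W_k, W_qs, W_ks, W_kg = (rng.normal(size=(d, d)) * 0.5 for _ in range(5))

Q_c, K_c = H @ W_q, H @ W_k
Q_s, K_s = S @ W_qs, S @ W_ks
K_g = G @ W_kg

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

content = Q_c @ K_c.T                    # Q_c K_c^T
c2s = np.einsum("id,ijd->ij", Q_c, K_s)  # content-to-semantic term
s2c = np.einsum("ijd,jd->ij", Q_s, K_c)  # semantic-to-content term
geo = np.einsum("id,ijd->ij", Q_c, K_g)  # geometric term

A_tilde = content + c2s + s2c + geo
A = softmax(A_tilde, axis=-1)            # assumed normalization
print(A.sum(axis=-1))
```

Each row of `A` sums to 1, so row i gives the attention that region i pays to every region j after the semantic and geometric terms are added to the plain content score.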

Quantitative results: outperforms the baseline methods on all metrics

Method                             BLEU4  METEOR  ROUGE  CIDEr  SPICE
TCIC (CE) [Fan+ 21]                38.3   28.5    58.0   121.0  21.6
TCIC (RL) [Fan+ 21]                39.7   29.2    58.6   132.9  22.4
DRT (RL) [Song+ ACM Multimedia21]  40.4   29.5    59.3   133.2  23.3
Ours (CE)                          38.8   29.2    58.5   122.9  22.2
Ours (RL)                          41.7   30.1    60.3   135.3  23.8

Results on the image captioning task on the MSCOCO dataset [Lin+ ECCV14]
(CE): parameters updated with the cross-entropy loss; (RL): reinforcement learning with the CIDEr score as the reward
Regardless of how the parameters are updated, the proposed method outperforms the baseline methods on every metric
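The margins behind this claim can be read off the table with a few lines of Python. The scores are copied from the slide; the helper name `margins` and the pairing of each variant with baselines trained the same way (CE against CE, RL against RL) are choices made for this example.

```python
METRICS = ("BLEU4", "METEOR", "ROUGE", "CIDEr", "SPICE")

# Scores copied from the results table (MSCOCO)
scores = {
    "TCIC(CE)": (38.3, 28.5, 58.0, 121.0, 21.6),
    "TCIC(RL)": (39.7, 29.2, 58.6, 132.9, 22.4),
    "DRT(RL)": (40.4, 29.5, 59.3, 133.2, 23.3),
    "Ours (CE)": (38.8, 29.2, 58.5, 122.9, 22.2),
    "Ours (RL)": (41.7, 30.1, 60.3, 135.3, 23.8),
}

def margins(ours, baseline):
    """Per-metric difference (ours minus baseline), rounded to one decimal."""
    return {
        m: round(a - b, 1)
        for m, a, b in zip(METRICS, scores[ours], scores[baseline])
    }

ce_gain = margins("Ours (CE)", "TCIC(CE)")
rl_gain_tcic = margins("Ours (RL)", "TCIC(RL)")
rl_gain_drt = margins("Ours (RL)", "DRT(RL)")

for name, gain in [("CE vs TCIC", ce_gain),
                   ("RL vs TCIC", rl_gain_tcic),
                   ("RL vs DRT", rl_gain_drt)]:
    print(name, gain)
```

Within each training regime every margin is positive. The comparison is made per regime because the CE variant is not expected to match RL-trained baselines (for example, its BLEU4 of 38.8 is below the 39.7 of TCIC (RL)).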

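Qualitative results: the generated captions reflect the semantic relations between objects
[Figure: two example images with generated captions]
• ☺ Reflects the semantic relations / Reflects only the distance-based relations
• ☺ Reflects the semantic relations / Reflects only the distance-based relations, so the meaning is questionable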

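Summary
Background: semantic relations between objects are important, yet few methods exploit them
Proposed method: ER-SAN, which builds graphs from the relative positions of objects and from their semantic relations, and incorporates them into a transformer
Experimental results: outperforms existing methods on all evaluation metrics. Qualitatively, it also generated sentences that reflect the semantic relations between objects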

Semantic Machine Intelligence Lab., Keio Univ.