Bridging_by_Word__Image-Grounded_Vocabulary_Construction_for_Visual_Captioning.pdf

Bridging by Word: Image-Grounded Vocabulary Construction for Visual Captioning
ACL Comprehensive Survey Report Meeting
Saturday, November 2, 2019 @hrs1985

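About me
twitter : @hrs1985 https://qiita.com/hrs1985 https://kiyo.qrunch.io/
I work as a machine learning engineer. I recently changed jobs and have been working in Tokyo since July. I was originally an experimental biologist.
I am interested in:
• Deep generative models
• Reinforcement learning
• Image processing
• Applications of machine learning to biology and chemistry
I have also started studying natural language processing.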

Paper overview
Title: Bridging by Word: Image-Grounded Vocabulary Construction for Visual Captioning (https://www.aclweb.org/anthology/P19-1652/)
Authors: Zhihao Fan, Zhongyu Wei, Siyuan Wang, Xuanjing Huang
Content:
・Introduces an Image-Grounded Vocabulary, built from image features, into image captioning.
・Proposes two-stage training: Image-Grounded Vocabulary Construction → text generation.

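Image Captioning
The task of inferring (generating) a sentence that describes the content of an image.
⇨ Extract features with a CNN
⇨ Generate text with an RNN based on the extracted features
"Two mantises are spreading their arms on a branch"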

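Expression bias seen in image captioning
Even though the people in the images are sitting on the ground or standing, they are described as "a woman sitting at a table".
This is thought to happen because the RNN does not properly understand the semantics of the image and is instead pulled toward statistics of the dataset such as n-gram frequencies.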

Proposed method

Proposed method: CNN-RNN

Proposed method: CNN-RNN + vocabulary constructor

Two types of constraint
Hard Constraint: constrains the CNN-RNN so that it does not output words that are not in the image-grounded vocabulary.
Soft Constraint: applies weights derived from the image-grounded vocabulary during the RNN's text generation.

Overall Performance
NIC: baseline model
WC: Hard Constraint
WA: Soft Constraint
RL: reinforcement learning
WC(GT): Ground-truth Vocabulary

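Optimal vocabulary size
A vocabulary size of around 48-64 appears to be optimal (left figure). In addition, using the Image-Grounded Vocabulary gives consistently better results even as the number of training iterations increases.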
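The vocabulary size above is the number of words kept per image. As a minimal sketch of what such a size means in practice, assume the vocabulary constructor produces one relevance score per word and the top k words are kept. The slide does not say how words are actually selected, so the function `build_image_vocabulary`, the scores, and the top-k rule are all hypothetical.

```python
def build_image_vocabulary(word_scores, k):
    """Keep the k highest-scoring words for one image.

    word_scores: dict mapping word -> relevance score (assumed to come
    from some vocabulary constructor; the values below are made up).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Sort words by score, highest first, and keep the first k.
    ranked = sorted(word_scores.items(), key=lambda kv: kv[1], reverse=True)
    return {word for word, _ in ranked[:k]}


# Toy scores for an image of a woman sitting on grass (illustrative only).
scores = {"woman": 0.92, "grass": 0.81, "sitting": 0.77, "table": 0.10, "bench": 0.05}
image_vocab = build_image_vocabulary(scores, k=3)
```

With a real dictionary of several thousand words, k would be set to something in the 48-64 range reported on the slide.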

Novel Caption Ratio
The rate at which the model generates captions that do not appear in the dataset is also high.
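A rough sketch of how such a ratio could be computed is shown here. It assumes that "novel" means the generated caption does not match any training caption exactly after lowercasing and whitespace normalization; the slide does not give the paper's exact definition, and `novel_caption_ratio` is a hypothetical helper.

```python
def normalize(caption):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(caption.lower().split())


def novel_caption_ratio(generated, training_captions):
    """Fraction of generated captions that never occur in the training set."""
    if not generated:
        raise ValueError("generated must not be empty")
    seen = {normalize(c) for c in training_captions}
    novel = sum(1 for c in generated if normalize(c) not in seen)
    return novel / len(generated)


training = ["a woman sitting at a table", "a dog on the grass"]
generated = [
    "A woman sitting at a table",     # seen in training (after normalization)
    "a woman standing on the grass",  # novel
    "a dog on a bench",               # novel
    "a dog on the grass",             # seen in training
]
ratio = novel_caption_ratio(generated, training)  # 2 novel out of 4
```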

Example
Elements that do not appear in the image no longer leak into the text! (e.g. "bench" in the bottom-left example)

Related Works
The idea of giving text generation auxiliary information from the vocabulary side has apparently also been proposed in the following papers, which I would like to read next:
・Wu, Yu, et al. "Neural response generation with dynamic vocabularies." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
・Yao, Ting, et al. "Incorporating copying mechanism in image captioning for learning novel objects." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Hard Constraint
A mask is applied so that any word w_j not included in the Image-Grounded Vocabulary W_i is never selected during text generation.
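The masking idea can be sketched as follows: before the softmax, the logit of every word outside W_i is set to negative infinity, so its probability becomes exactly zero. This is a minimal illustration in plain Python, not the authors' implementation; `masked_softmax`, the toy vocabulary, and the logits are assumptions made for the example.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over logits where indices outside `allowed` get probability 0."""
    if not allowed:
        raise ValueError("allowed must contain at least one index")
    # Words outside the image-grounded vocabulary are masked with -inf.
    masked = [x if j in allowed else -math.inf for j, x in enumerate(logits)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["a", "woman", "table", "bench", "grass"]
logits = [1.0, 0.5, 2.0, 1.5, 0.2]  # "table" has the highest raw score
image_vocab = {0, 1, 4}             # indices of words in W_i (toy values)
probs = masked_softmax(logits, image_vocab)
best = vocab[probs.index(max(probs))]
```

Here "table" and "bench" can no longer be chosen, no matter how strongly the language model prefers them.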

Soft Constraint (rough idea)
Original LSTM equations
LSTM equations with the Soft Constraint
A weight S that depends on the Image-Grounded Vocabulary is inserted into the original equations.
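To give a feel for how a soft weighting differs from a hard mask, here is a simplified sketch. Note that it is a swapped-in technique: it reweights the output word distribution by a per-word weight, whereas the slide describes S being inserted into the LSTM equations themselves, whose exact form is not reproduced here. The function `soft_weighted_softmax` and all numbers are hypothetical.

```python
import math


def soft_weighted_softmax(logits, weights):
    """Distribution with p_j proportional to exp(logit_j) * weight_j.

    weights: non-negative per-word weights, e.g. how strongly each word
    is supported by the image (assumed values, not the paper's S).
    """
    m = max(logits)  # subtract the max for numerical stability
    scores = [math.exp(x - m) * w for x, w in zip(logits, weights)]
    total = sum(scores)
    if total == 0.0:
        raise ValueError("at least one weight must be positive")
    return [s / total for s in scores]


logits = [1.0, 0.5, 2.0, 1.5, 0.2]
weights = [0.9, 0.9, 0.1, 0.05, 0.8]  # words 2 and 3 are weakly supported
soft_probs = soft_weighted_softmax(logits, weights)
plain_probs = soft_weighted_softmax(logits, [1.0] * len(logits))
```

Unlike the hard mask, weakly supported words keep a small non-zero probability, so the generator can still use them when the context strongly calls for it.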

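Training procedure
A four-step training procedure (broadly two stages):
1. Train the Vocabulary Constructor
2. Train the Text Generator under the Soft Constraint, using cross-entropy as the loss function
3. Reinforcement learning (1)
4. Reinforcement learning with the Vocabulary Constraint (2)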

Reinforcement Learning
Reinforcement learning (2) under the Hard Constraint uses the algorithm shown on the slide.
