Generating Object Manipulation Instructions That Include Referring Expressions for the Target Object and Destination, Based on the Case Relation Transformer / Case Relation Transformer

Keio University: Motonari Kambara, Komei Sugiura

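Background: Instruction understanding for domestic service robots requires a large-scale multimodal corpus
• Domestic service robots become more convenient as their ability to understand natural-language instructions improves
• Expectations for domestic service robots are rising as the number of people who need care increases
• Training on a multimodal corpus that pairs images with instructions is essential for instruction understanding
• A larger corpus can lead to a more capable model
• Example instruction: "Give me the green can from the table."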

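Scaling up multimodal corpora can lead to more capable models
• Large corpora enable effective training
  • English Wikipedia = 2500M words
  • Most multimodal corpora are 1% of that size or less
• Large corpora are used for large-scale pre-trained models
• Annotating a large number of sentences is costly, and writing clear instructions is laborious
  • Example: "Pick the white packet which is near the numbered stickers and put it into the lower left box"
• Augmentation makes such data available at low cost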

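Problem setting: Object manipulation instruction generation task
• Target task: Fetching instruction generation (FIG), that is, generating an object manipulation instruction about a target object and a destination
• Input: an image containing the target object and the destination
• Output: an object manipulation instruction
• Example output: "Pick the white packet which is near the numbered stickers and put it into the lower left box"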

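FIG task: Generating clear referring expressions is not easy
• "move the cola can to the top right box"
• "grab the cola can near to the white gloves and put it in the upper right box"
• The target object and the destination should be identified accurately and unambiguously, so generating referring expressions is important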

Related work: Existing models generate low-quality sentences

Task | Method | Overview
Image captioning | Object Relation Transformer (ORT) [Herdade+ NeurIPS19] | Models geometric referring expressions between regions
Image captioning | VL-BERT [Su+ ICLR20] | BERT-based model using geometric features
Video captioning | VideoBERT [Sun+ ICCV19] | Single-stream video captioning
Change captioning | DUDA [Park+ ICCV19] | RNN-based change captioning model
FIG | ABEN [Ogura+ IROS20] | Uses ABN [Fukui+ CVPR19]


Proposed method: Case Relation Transformer (CRT)
Novelty:
• Case Relation Block (CRB)
  • Two Transformer layers
  • Transforms and concatenates the inputs
• Transformer encoder-decoder
  • Models geometric relationships

Architecture: Case Relation Transformer (CRT) [architecture diagram]

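Regions other than the target object and the destination are detected automatically and used as input
• Inputs: target object, destination, and context information
• Context regions are detected with Up-Down Attention [Anderson+ CVPR18]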

Case Relation Block: Embeds three types of input features
Case Relation Block (CRB)
• Two Transformer layers that transform and concatenate the inputs
• Layer-1 inputs: $X^{\langle targ \rangle}$ (target object), $x^{\langle dest \rangle}$ (destination), and $X^{\langle cont \rangle}$ (context information)
• $X^{\langle targ \rangle}$ is a set of three feature components extracted for the target object
• Embedding with Transformer layers lets the model process the target object and destination information explicitly
• The layer-2 output $h_V$ is the input to the encoder
• Several structures were tried and the best one was selected

Encoder: Transformer encoder using geometric features
• Models geometric relationships between regions
• Positional feature $\omega_G^{mn}$ between regions $m$ and $n$, where $(x, y)$ is a region's center and $(w, h)$ its width and height:

$$\omega_G^{mn} \leftarrow \left(\log\frac{|x_m - x_n|}{w_m},\ \log\frac{|y_m - y_n|}{h_m},\ \log\frac{w_n}{w_m},\ \log\frac{h_n}{h_m}\right)$$

• Box multi-head attention (Box MHA) combines $\omega_G^{mn}$ with the attention weight $\omega_A^{mn}$:

$$\omega^{mn} = \frac{\omega_G^{mn}\exp(\omega_A^{mn})}{\sum_{l=1}^{N}\omega_G^{ml}\exp(\omega_A^{ml})}$$

• The outputs of the individual attention heads are concatenated
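The two encoder formulas above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `(x, y, w, h)` box layout with center coordinates, the small `eps` clamp that keeps the logarithm finite when two centers coincide, and the premise that the geometric weight has already been reduced to one positive scalar per region pair are all choices made for this example.

```python
import numpy as np

def geometry_features(boxes, eps=1e-3):
    """Pairwise positional features for N boxes given as (x, y, w, h).

    Returns an (N, N, 4) array whose entry [m, n] is
    (log(|x_m - x_n| / w_m), log(|y_m - y_n| / h_m),
     log(w_n / w_m), log(h_n / h_m)).
    eps is an assumed clamp so that identical centers do not give log(0).
    """
    boxes = np.asarray(boxes, dtype=float)
    x, y, w, h = boxes.T
    dx = np.log(np.maximum(np.abs(x[:, None] - x[None, :]), eps) / w[:, None])
    dy = np.log(np.maximum(np.abs(y[:, None] - y[None, :]), eps) / h[:, None])
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def box_attention_weights(omega_g, omega_a):
    """Combine geometric and attention weights, normalizing over each row.

    omega_g: (N, N) positive scalar geometric weights (assumed precomputed).
    omega_a: (N, N) attention logits.
    """
    # Subtracting the row maximum cancels in the ratio and avoids overflow.
    shifted = omega_a - omega_a.max(axis=1, keepdims=True)
    scores = omega_g * np.exp(shifted)
    return scores / scores.sum(axis=1, keepdims=True)
```

Each row of the returned weight matrix sums to 1, so region $m$ distributes its attention over all $N$ regions, with pairs that have a larger geometric weight receiving proportionally more.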

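Decoder: Predicts words autoregressively
• Transformer decoder that predicts words autoregressively
• Masked multi-head attention (Masked MHA): self-attention is computed with the $j$-th and later words masked, taking $\hat{y}_{1:j-1}$ as input, where $f_{\mathrm{mask}}$ denotes the masking function:

$$Q = \hat{y}W_q,\quad K = \hat{y}W_k,\quad V = \hat{y}W_v$$

$$\Omega = f_{\mathrm{mask}}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

• The Generator predicts the $j$-th word from the decoder output $\Omega$
• Loss function, where $I$ is the number of samples and $J$ is the number of words in each sentence:

$$L = -\frac{1}{I}\sum_{i=1}^{I}\sum_{j=1}^{J}\log p(\hat{y}_{ij})$$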
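The masked attention and the loss above can be illustrated with a single-head NumPy sketch. It is not the authors' code: the function names are hypothetical, the masking function is assumed to be the usual causal mask followed by a softmax, and the loss assumes the per-word probabilities of the reference words are already available as an `(I, J)` array.

```python
import numpy as np

def masked_self_attention(y, w_q, w_k, w_v):
    """Single-head masked self-attention over a (J, d) sequence y."""
    q, k, v = y @ w_q, y @ w_k, y @ w_v
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # Assumed causal mask: position j may not attend to later positions.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def sequence_nll(probs):
    """L = -(1/I) * sum_i sum_j log p(y_ij) for an (I, J) probability array."""
    probs = np.asarray(probs, dtype=float)
    return -np.log(probs).sum(axis=1).mean()
```

Because every masked score becomes zero after the softmax, the weight matrix is lower triangular, which is what allows training on whole sentences while still predicting each word only from the words before it.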

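PFN-PIC dataset: Pairs of images and object manipulation instructions
• Each sample consists of:
  • Image
  • Coordinates of the target object region
  • Destination
  • Object manipulation instruction
• Example instruction: "move the blue and white tissue box to the top right bin"
• Dataset size:

Set | Images | Target objects | Instructions
train | 1160 | 25517 | 90692
valid | 20 | 352 | 898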
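The sample composition listed above could be represented as a small record type. The sketch below is only illustrative: the class name, the field names, the `(x_min, y_min, x_max, y_max)` box convention, and the encoding of the destination as a text label are assumptions, not the actual PFN-PIC file format.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]

@dataclass(frozen=True)
class FigSample:
    """One hypothetical FIG training sample."""
    image_path: str    # path to the scene image
    target_bbox: BBox  # coordinates of the target object region
    destination: str   # assumed label for the destination, e.g. "top right"
    instruction: str   # the object manipulation instruction

sample = FigSample(
    image_path="scene_0001.jpg",  # hypothetical file name
    target_bbox=(120.0, 80.0, 180.0, 140.0),
    destination="top right",
    instruction="move the blue and white tissue box to the top right bin",
)
```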

Quantitative results: Outperforms the baselines on all metrics
The scores use standard image captioning metrics; ABEN and ORT are the baselines.

Method | BLEU4↑ | METEOR↑ | ROUGE-L↑ | CIDEr-D↑ | SPICE↑
ABEN [Ogura+ 20] | 15.2 ± 0.8 | 21.2 ± 0.8 | 46.8 ± 1.0 | 18.2 ± 1.8 | 23.4 ± 2.1
ORT [Herdade+ 19] | 8.0 ± 1.2 | 17.3 ± 0.7 | 39.4 ± 0.7 | 27.9 ± 2.8 | 26.4 ± 1.3
CRT (paper version) | 14.9 ± 1.1 | 23.1 ± 0.7 | 49.7 ± 1.0 | 96.6 ± 12.0 | 44.0 ± 2.3
CRT (improved version) | 17.3 ± 0.8 | 25.0 ± 0.4 | 51.9 ± 0.7 | 117.3 ± 5.0 | 46.1 ± 1.4

• Compared with the best baseline score on each metric, CRT (improved version) gains +2.1 BLEU4, +3.8 METEOR, +5.1 ROUGE-L, +89.4 CIDEr-D, and +19.7 SPICE: higher quality than the other methods
• Compared with CRT (paper version), CRT (improved version) gains +2.4 BLEU4, +1.9 METEOR, +2.2 ROUGE-L, +20.7 CIDEr-D, and +2.1 SPICE: introducing the Transformer into the CRB improved quality
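The gains quoted on these slides can be reproduced from the mean scores in the table. The short script below is only a worked check of that arithmetic; the dictionary keys and the helper name `gains` are made up for the example, and the ± ranges are ignored.

```python
METRICS = ("BLEU4", "METEOR", "ROUGE-L", "CIDEr-D", "SPICE")

# Mean scores copied from the results table.
SCORES = {
    "ABEN": (15.2, 21.2, 46.8, 18.2, 23.4),
    "ORT": (8.0, 17.3, 39.4, 27.9, 26.4),
    "CRT (paper)": (14.9, 23.1, 49.7, 96.6, 44.0),
    "CRT (improved)": (17.3, 25.0, 51.9, 117.3, 46.1),
}

def gains(method, references):
    """Per-metric gain of `method` over the best score among `references`."""
    best = [max(SCORES[r][k] for r in references) for k in range(len(METRICS))]
    return [round(s - b, 1) for s, b in zip(SCORES[method], best)]

over_baselines = gains("CRT (improved)", ["ABEN", "ORT"])
over_paper_version = gains("CRT (improved)", ["CRT (paper)"])
```

The first list compares against whichever baseline is stronger on each metric, which is why the BLEU4 gain is measured against ABEN while the CIDEr-D and SPICE gains are measured against ORT.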

Qualitative results: Generates concise instructions that use referring expressions
Example 1
• Ground truth: "grab the cola can near to the white gloves and put it in the upper right box"
• Generated: "move the red can from the bottom right box to the top right box"
• The cola can is identified by a referring expression
Example 2
• Ground truth: "move the rectangular black thing from the box with an empty drink bottle in it to the box with a coke can in it"
• Generated: "move the black mug to the lower left box"
• The target object is described clearly as "black mug"

Qualitative results: Describing the target object remains a challenge
Example 1
• Ground truth: "grab the rectangular green with red writing and place it in the lower left box"
• Generated: "move the red can to the bottom left box"
• The description is pulled toward the red bottle
Example 2
• Ground truth: "grab the round container in the right gand [sic] corner of the bottom right corner and place it in the top left corner"
• Generated: "move the white tin to the left upper box"
• A "silver can" is described as a "white can"

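Error analysis: Description errors about the target object are frequent and remain future work
• Of 100 generated sentences, 34 failed on the target object (79% of all errors)
• Making the description of the target object more accurate is future work
• Example of a "serious description error about the target object":
  • Ground truth: "take blue cup and put it in the left upper box"
  • Generated: "move the white bottle to the left upper box"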

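Effectiveness was confirmed in a subject experiment
• Five-point scale: 1 = very bad, 2 = bad, 3 = fair, 4 = good, 5 = very good
• Procedure: five subjects each rated 50 sentences on the clarity of the instruction

Method | MOS↑ (Mean Opinion Score)
Ground truth (upper bound) | 3.81 ± 0.16
ABEN [Ogura+ IROS20] | 1.15 ± 0.05
ORT [Herdade+ NeurIPS19] | 1.34 ± 0.07
CRT (paper version) | 2.59 ± 0.18
CRT (improved version) | 3.45 ± 0.18

• The proposed method achieves higher quality than the baselines
• Note: once the difference from the upper bound is no longer significant, the generated sentences can be said to match human quality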
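A Mean Opinion Score of this kind is the average of the 1 to 5 ratings. The sketch below shows one way to compute it with the Python standard library. The slide does not say what the ± values denote, so reporting the standard error of the mean here is an assumption, and the function name and the example ratings are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def mean_opinion_score(ratings):
    """Return (MOS, standard error) for a list of ratings on a 1-5 scale.

    Using the standard error as the spread is an assumption; the slide
    does not specify how its ± values were computed.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return mean(ratings), stdev(ratings) / sqrt(len(ratings))

# Invented ratings for illustration only, not data from the experiment.
mos, stderr = mean_opinion_score([3, 4, 5, 4, 4])
```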

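Summary
• Background: augmentation of multimodal datasets
• Proposed method: Case Relation Transformer, a multimodal language generation model that takes the target object, the destination, and context information as input
• Results: outperforms the baselines on every metric and also generates qualitatively appropriate instructions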

Semantic Machine Intelligence Lab., Keio Univ.