case-relation-transformer-a-crossmodal-language-generation-model-for-fetching-instructions

Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions
Motonari Kambara and Komei Sugiura
Keio University, Japan

Our target: Improving communication skills of domestic service robots
• Social background: Domestic service robots that communicate with users are a promising solution for disabled and elderly people
• Problem: Multimodal language understanding models require large multimodal corpora
• Our previous studies: [Magassouba+ RAL&ICRA20], [Magassouba+ IROS18, RoboCup Best Paper Award]

Motivation: Alleviating the burden of labeling
• Annotating many images with sentences is labor-intensive
• Multimodal language generation can reduce this cost
• Example instruction: “Pick the white packet which is near the numbered stickers and put it into the lower left box”

Related works: ABEN takes a long time for training

Field | Methods | Outline
Image captioning | Object Relation Transformer (ORT) [Herdade+ NeurIPS19] | Modeling spatial reference expressions between regions
Video captioning | VideoBERT [Sun+ ICCV19] | Representative video captioning method with one stream
Change captioning | DUDA [Park+ ICCV19] | Representative change captioning method using RNN
Fetching instruction generation | ABEN [Ogura+ IROS20] | Fetching instruction generation using LSTM

[Figures: ORT [Herdade+ 19], ABEN [Ogura+ 20]]

Problem statement: Fetching instruction generation task
• Target task: Fetching instruction generation (FIG) task. Generate an instruction that includes the target object and the destination
• Input: An image containing the target object and the destination
• Output: A fetching instruction
• Example: “Pick the white packet which is near the numbered stickers and put it into the lower left box”

Giving unambiguous instruction sentences is challenging
• Ambiguous: “Move the coke can to the top right box”
• ☺ Unambiguous: “Grab the coke can near to the white gloves and put it in the upper right box”
• Spatial referring expressions are important

Proposed method: Case Relation Transformer
Contributions:
• Case Relation Block (CRB): two Transformer blocks that embed and concatenate the inputs
• Transformer encoder-decoder: models geometric relationships

Network inputs: Features of target object, destination and context information
[Figure labels: Target object, Destination, Context information]
• Detected by Up-Down Attention [Anderson+ CVPR18]

Case Relation Block: Embedding input features
Case Relation Block (CRB)
• Two Transformer blocks
• Embed and combine inputs
(Trm.: Transformer)

Case Relation Block (CRB): Embedding input features
• Multi-layer Transformer
• Calculate attention between input features
• Input: $X^{<targ>}_{conv2}$, $X^{<targ>}_{conv3}$, $X^{<targ>}_{conv4}$
• Stack: Input → Transformer layer → Transformer layer → … → Transformer layer → Output
• Blocks shown in each Transformer layer: multi-head attention, layer normalization & dropout, FC & dropout, layer normalization & dropout, FC & dropout

Case Relation Block (CRB): Embedding input features
• Multi-layer Transformer
• Calculate attention between input features
• Input: $X^{<targ>}_{conv2}$, $X^{<targ>}_{conv3}$, $X^{<targ>}_{conv4}$
• Stack: Input → Transformer layer → Transformer layer → … → Transformer layer → Output
• Blocks shown in each Transformer layer: multi-head attention, layer normalization & dropout, FC & dropout, layer normalization & dropout, FC & dropout
• Input to encoder: $h_V$, combined from $h^{<targ>}$, $x^{<dest>}$ and $X^{<cont>}$
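
The attention step inside one Transformer layer can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes a single head, identity projections (Q = K = V), a residual connection before layer normalization, and it omits dropout and the FC sublayers. The three toy 4-d vectors stand in for the conv2, conv3 and conv4 features of the target object, and every function name is hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(feats):
    """Single-head self-attention with identity projections (assumption)."""
    d = len(feats[0])
    out = []
    for q in feats:
        # Scaled dot-product score between this feature and every feature.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in feats]
        w = softmax(scores)
        out.append([sum(wi * v[c] for wi, v in zip(w, feats)) for c in range(d)])
    return out

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (almost) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def transformer_layer(feats):
    """Attention, residual connection (assumption), then layer normalization."""
    attended = self_attention(feats)
    return [layer_norm([x + a for x, a in zip(f, att)])
            for f, att in zip(feats, attended)]

# Toy stand-ins for the conv2 / conv3 / conv4 target-object features.
features = [[1.0, 0.0, 0.0, 0.5],
            [0.0, 1.0, 0.0, 0.5],
            [0.0, 0.0, 1.0, 0.5]]
encoded = transformer_layer(features)
```

Stacking several such layers, each fed the previous layer's output, gives the multi-layer structure named on the slide.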

Transformer encoder based on ORT [Herdade+ NeurIPS19]
Transformer encoder
• Models geometric relationships

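Transformer encoder based on ORT [Herdade+ NeurIPS19]
• $\omega^G_{mn}$: Positional features between regions $m$ and $n$
  $\omega^G_{mn} \leftarrow \left( \log\frac{|x_m - x_n|}{w_m},\ \log\frac{|y_m - y_n|}{h_m},\ \log\frac{w_n}{w_m},\ \log\frac{h_n}{h_m} \right)$
• $\omega^A_{mn}$: the visual-based weight
• Box multi-head attention (MHA)
  $\omega_{mn} = \frac{\omega^G_{mn} \exp(\omega^A_{mn})}{\sum_{l=1}^{N} \omega^G_{ml} \exp(\omega^A_{ml})}$
• Concatenate the outputs of the attention heads $h_{sa}$
[Figure: region $m$ and region $n$ with box sizes $h_m$ and $w_n$]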

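Transformer encoder based on ORT [Herdade+ NeurIPS19]
• $\omega^G_{mn}$: Positional features between regions $m$ and $n$
  $\omega^G_{mn} \leftarrow \left( \log\frac{|x_m - x_n|}{w_m},\ \log\frac{|y_m - y_n|}{h_m},\ \log\frac{w_n}{w_m},\ \log\frac{h_n}{h_m} \right)$
• $\omega^A_{mn}$: the visual-based weight
• Box multi-head attention (MHA)
  $\omega_{mn} = \frac{\omega^G_{mn} \exp(\omega^A_{mn})}{\sum_{l=1}^{N} \omega^G_{ml} \exp(\omega^A_{ml})}$
• Concatenate the outputs of the attention heads $h_{sa}$ (labeled $h_{mh}$ in the figure)
[Figure: region $m$ and region $n$ with box sizes $h_m$ and $w_n$]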
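
The two formulas on this slide can be traced with a small Python sketch. It rests on stated assumptions rather than on the authors' code: boxes are given as hypothetical `(x, y, w, h)` tuples, a small `eps` clamp guards against `log(0)` when two centers coincide, and the geometric weights passed to `combined_weights` are toy non-negative scalars, since the learned mapping from the 4-d feature to a scalar weight is not reproduced here.

```python
import math

def geometry_feature(box_m, box_n, eps=1e-3):
    """4-d relative geometry between two boxes given as (x, y, w, h).

    The eps clamp is an assumption added to keep the logarithm finite.
    """
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return (
        math.log(max(abs(xm - xn), eps) / wm),
        math.log(max(abs(ym - yn), eps) / hm),
        math.log(wn / wm),
        math.log(hn / hm),
    )

def combined_weights(w_geo, w_app):
    """For a fixed region m, combine weights over all regions l = 1..N.

    omega_mn = w_geo[n] * exp(w_app[n]) / sum_l w_geo[l] * exp(w_app[l])
    """
    numer = [g * math.exp(a) for g, a in zip(w_geo, w_app)]
    total = sum(numer)
    return [x / total for x in numer]

# Toy example: region m versus a region n that is wider and flatter.
feature = geometry_feature((10.0, 10.0, 4.0, 4.0), (14.0, 10.0, 8.0, 2.0))

# With equal visual-based weights, the geometric weights alone set the split.
weights = combined_weights([1.0, 1.0, 2.0], [0.0, 0.0, 0.0])
```

The normalization makes the combined weights for region m sum to one, so a region with a larger geometric weight takes a larger share of attention even when the visual-based weights are equal.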

Decoder: autoregressive word prediction
Transformer decoder
• Autoregressive word prediction

Decoder: autoregressive word prediction
• Masked multi-head attention (MHA): calculates attention score between predicted word tokens
  $Q = W_q \hat{y}_{1:j}$, $K = W_k \hat{y}_{1:j}$, $V = W_v \hat{y}_{1:j}$
  $\Omega = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$
  $\hat{y}_{1:j}$: e.g. “move the red bottle into”
• Generator predicts the $j$-th word from $h_j^{<dec>}$
• Loss function: $p(\hat{y})$ is maximized in training
[Figure labels: $\Omega$, $\hat{y}_{1:j-1}$]
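
The masking idea can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the model's code: it uses one head, takes `q`, `k` and `v` as already-projected toy vectors (that is, the products with $W_q$, $W_k$, $W_v$ are assumed done), and the function names are hypothetical. Each position attends only to itself and earlier positions, which is what lets the decoder predict words one at a time.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_self_attention(q, k, v):
    """Causal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    outputs, weights = [], []
    for j, qj in enumerate(q):
        # Position j may only look at positions 0..j (the causal mask).
        scores = [sum(a * b for a, b in zip(qj, k[i])) / math.sqrt(d_k)
                  for i in range(j + 1)]
        # Future positions receive exactly zero weight.
        w = softmax(scores) + [0.0] * (len(k) - j - 1)
        weights.append(w)
        outputs.append([sum(wi * vi[c] for wi, vi in zip(w, v))
                        for c in range(len(v[0]))])
    return outputs, weights

# Three toy token embeddings standing in for a partial instruction.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
omega, attn = masked_self_attention(tokens, tokens, tokens)
```

Because the first position can only see itself, its output equals its own value vector; later positions mix in earlier tokens but never later ones.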

PFN-PIC dataset [Hatori+ ICRA18]: Set of images and fetching instructions
• Sample configuration
  • Image
  • Coordinates of region of target object
  • Destination
  • Fetching instruction
• Size

Set | #Image | #Target object | #Instruction
train | 1044 | 22014 | 81087
valid | 116 | 2503 | 8774
test | 20 | 352 | 898

• Example: “Move the blue and white tissue box to the top right bin”
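
One way to picture a sample with the configuration listed above is a small Python record. This is only a sketch: the class name, field names, box layout and file name are hypothetical and do not describe the dataset's actual file format. The split sizes are copied from the table, and the helper shows how many instructions each target object has on average.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical box layout; the slide only says "coordinates of region".
Box = Tuple[float, float, float, float]

@dataclass
class FetchingSample:
    """One sample: image, target region, destination region, instruction."""
    image_path: str
    target_box: Box
    destination_box: Box
    instruction: str

# (#Image, #Target object, #Instruction) per split, from the slide's table.
SPLITS: Dict[str, Tuple[int, int, int]] = {
    "train": (1044, 22014, 81087),
    "valid": (116, 2503, 8774),
    "test": (20, 352, 898),
}

def instructions_per_target(split: str) -> float:
    """Average number of instructions written for each target object."""
    _, targets, instructions = SPLITS[split]
    return instructions / targets

# Illustrative sample; the path and box values are made up.
sample = FetchingSample(
    image_path="example.jpg",
    target_box=(120.0, 80.0, 40.0, 30.0),
    destination_box=(300.0, 20.0, 150.0, 150.0),
    instruction="Move the blue and white tissue box to the top right bin",
)
```

Each target object is paired with several instructions, so the instruction count exceeds the target-object count in every split.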

Quantitative results: CRT outperformed baselines on all metrics
[Table: quantitative comparison with baselines; labels: Primary metrics, Ours]

Quantitative results: CRT outperformed baselines on all metrics
• CIDEr-D and SPICE are drastically improved

Qualitative results: Our method generated concise instructions using referring expressions
• Ground truth: “Move the black rectangle from the lower left box, to the upper right box”
• [Ogura+ RAL20]: “Grab the the the red and and put it in the lower left box”
• Ours: ☺ “Move the black object in the lower left box to the upper right box”
• Our method specifies the target by a referring expression

Qualitative results: Our method generated concise instructions using referring expressions
• Ground truth: “Move the rectangular black thing from the box with an empty drink bottle in it to the box with a coke can in it”
• [Ogura+ RAL20]: “Move the red bottle to the right upper box”
• Ours: ☺ “Move the black mug to the lower left box”
• Our method clearly expresses the target as "black mug"

Subject experiment: CRT outperformed baselines as well
• 5-point scale mean opinion score (MOS) on clarity of instructions: 1: Very bad, 2: Bad, 3: Normal, 4: Good, 5: Very good
• 5 subjects evaluated 50 sentences each
[Figure: MOS for ORT [Herdade+ NeurIPS19], ABEN [Ogura+ RAL20], Ours and Human; annotation: significantly improved]
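
A mean opinion score is the average of all ratings on the 1 to 5 scale. The sketch below shows the arithmetic only; the function name is hypothetical and the ratings are made-up toy values, not results from the experiment.

```python
from typing import List

def mean_opinion_score(ratings: List[List[int]]) -> float:
    """Average all ratings, given one list of 1-5 scores per subject."""
    flat = [score for subject in ratings for score in subject]
    if not flat:
        raise ValueError("no ratings given")
    if any(score not in (1, 2, 3, 4, 5) for score in flat):
        raise ValueError("ratings must be integers on the 1-5 scale")
    return sum(flat) / len(flat)

# Toy ratings from two subjects for two sentences each (made up).
toy_ratings = [[5, 4], [3, 4]]
toy_mos = mean_opinion_score(toy_ratings)
```

In the setting described on the slide, the input would hold 5 lists of 50 ratings each for every method being compared.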

Error analysis: Most errors related to target objects
• Introducing auxiliary reconstruction loss can reduce errors
• Example
  • GT: ☺ “Take blue cup and put it in the left upper box”
  • Ours: “Move the white bottle to the left upper box”

Error type | Ratio
Target object error | 79% (Major 51%, Minor 28%)
Spatial referring expression error | 16%
Others | 5%

Conclusion
• Motivation: Multimodal language understanding requires large multimodal corpora; however, annotating many images is labor-intensive
• Proposed method: Case Relation Transformer, a crossmodal fetching instruction generation model
• Experimental results: CRT outperformed baselines on all metrics and generated concise instructions using referring expressions

Semantic Machine Intelligence Lab., Keio Univ.
