Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions

Transcript

  1. Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions
    Motonari Kambara and Komei Sugiura, Keio University, Japan
  2. Our target: Improving communication skills of domestic service robots
    • Social background: Domestic service robots that communicate with users are a promising solution for disabled and elderly people
    • Problem: Multimodal language understanding models require large multimodal corpora
    • Our previous studies: [Magassouba+ RAL&ICRA20], [Magassouba+ IROS18 RoboCup Best Paper Award]
  3. Motivation: Alleviating the burden of labeling
    • Annotating many images with sentences is labor-intensive
    • Multimodal language generation can reduce this cost
    Example instruction: “Pick the white packet which is near the numbered stickers and put it into the lower left box”
  4. Related works: ABEN takes a long time for training

    Field                            Methods                                                 Outline
    Image captioning                 Object Relation Transformer (ORT) [Herdade+ NeurIPS19]  Modeling spatial reference expressions between regions
    Video captioning                 VideoBERT [Sun+ ICCV19]                                 Representative video captioning method with one stream
    Change captioning                DUDA [Park+ ICCV19]                                     Representative change captioning method using RNN
    Fetching instruction generation  ABEN [Ogura+ IROS20]                                    Fetching instruction generation using LSTM

    (Figures: ORT [Herdade+ 19], ABEN [Ogura+ 20])
  5. Problem statement: Fetching instruction generation task
    • Target task: Fetching instruction generation (FIG) task
      Generate an instruction that includes the target object and the destination
    • Input: An image containing the target object and the destination
    • Output: A fetching instruction
    Example instruction: “Pick the white packet which is near the numbered stickers and put it into the lower left box”
  6. Giving unambiguous instruction sentences is challenging
    “Move the coke can to the top right box”
    “Grab the coke can near to the white gloves and put it in the upper right box” ☺
    Spatial referring expressions are important
  7. Proposed method: Case Relation Transformer
    Contributions
    Case Relation Block (CRB)
    • Two Transformer blocks
    • Embedding and concatenating inputs
    Transformer encoder-decoder
    • Modeling geometric relationships
  8. Network inputs: Features of target object, destination and context information
    Target object
    Destination
    Context information
    • Detected by Up-Down Attention [Anderson+ CVPR18]
  9. Case Relation Block: Embedding input features
    Case Relation Block (CRB)
    • Two Transformer blocks
    • Embed and combine inputs
    (Trm.: Transformer)
  10. Case Relation Block (CRB): Embedding input features
    Input: $X^{<targ>}_{conv2}$, $X^{<targ>}_{conv3}$, $X^{<targ>}_{conv4}$
    • Multi-layer Transformer
    • Calculate attention between input features
    (Figure: Input, Transformer layer, Transformer layer, Transformer layer, …, Output. Labels inside a Transformer layer: Multi-head attention, Layer normalization & Dropout, FC & Dropout, Layer normalization & Dropout, FC & Dropout)
  11. Case Relation Block (CRB): Embedding input features
    Input: $X^{<targ>}_{conv2}$, $X^{<targ>}_{conv3}$, $X^{<targ>}_{conv4}$
    Output: $h^{<targ>}$
    Input to encoder: $h_V$, formed from $h^{<targ>}$, $x^{<dest>}$ and $X^{<cont>}$
    • Multi-layer Transformer
    • Calculate attention between input features
    (Figure: Input, Transformer layer, Transformer layer, Transformer layer, …, Output. Labels inside a Transformer layer: Multi-head attention, Layer normalization & Dropout, FC & Dropout, Layer normalization & Dropout, FC & Dropout)
  12. Transformer encoder based on ORT [Herdade+ NeurIPS19]
    Transformer encoder
    • Model geometric relationships
  13. Transformer encoder based on ORT [Herdade+ NeurIPS19]
    • $\omega_G^{mn}$: positional features between regions $m$ and $n$
      $$\omega_G^{mn} \leftarrow \left( \log\frac{|x_m - x_n|}{w_m},\ \log\frac{|y_m - y_n|}{h_m},\ \log\frac{w_n}{w_m},\ \log\frac{h_n}{h_m} \right)$$
    • $\omega_A^{mn}$: the visual-based weight
    • Box multi-head attention (MHA)
      $$\omega^{mn} = \frac{\omega_G^{mn}\exp(\omega_A^{mn})}{\sum_{l=1}^{N}\omega_G^{ml}\exp(\omega_A^{ml})}$$
      Concatenate the outputs $h_{sa}$ of the attention heads
    (Figure: regions $m$ and $n$, with $h_m$ and $w_n$ labeled)
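  14. Transformer encoder based on ORT [Herdade+ NeurIPS19]
    • $\omega_G^{mn}$: positional features between regions $m$ and $n$
      $$\omega_G^{mn} \leftarrow \left( \log\frac{|x_m - x_n|}{w_m},\ \log\frac{|y_m - y_n|}{h_m},\ \log\frac{w_n}{w_m},\ \log\frac{h_n}{h_m} \right)$$
    • $\omega_A^{mn}$: the visual-based weight
    • Box multi-head attention (MHA)
      $$\omega^{mn} = \frac{\omega_G^{mn}\exp(\omega_A^{mn})}{\sum_{l=1}^{N}\omega_G^{ml}\exp(\omega_A^{ml})}$$
      Concatenate the outputs $h_{sa}$ of the attention heads into $h_{mh}$
    (Figure: regions $m$ and $n$, with $h_m$ and $w_n$ labeled)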
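The box attention weighting on slides 13 and 14 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the (x, y, w, h) box layout, and the `eps` guard are assumptions of this sketch. It also assumes each geometric weight has already been reduced to a positive scalar (the slides do not show that step), so the hypothetical `omega_g_row` values below are made up for illustration.

```python
import math


def geometric_features(box_m, box_n, eps=1e-3):
    """Relative geometry of region n with respect to region m.

    Each box is (x_center, y_center, width, height); this layout is an
    assumption of the sketch. eps avoids log(0) when centers coincide.
    """
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return (
        math.log(max(abs(xm - xn), eps) / wm),
        math.log(max(abs(ym - yn), eps) / hm),
        math.log(wn / wm),
        math.log(hn / hm),
    )


def box_attention_weights(omega_g_row, omega_a_row):
    """Combine geometric and visual weights for one region m.

    omega_g_row[l] is a positive scalar geometric weight for the pair (m, l);
    omega_a_row[l] is the visual-based attention score for the same pair.
    Returns omega_g * exp(omega_a), normalized over l.
    """
    a_max = max(omega_a_row)  # shift for numerical stability; the ratio is unchanged
    scores = [g * math.exp(a - a_max) for g, a in zip(omega_g_row, omega_a_row)]
    total = sum(scores)
    return [s / total for s in scores]


# Two same-sized boxes whose offsets equal the width and height of box m.
features = geometric_features((10, 10, 4, 2), (14, 12, 4, 2))  # (0.0, 0.0, 0.0, 0.0)

# Hypothetical scalar geometric weights with equal visual scores.
weights = box_attention_weights([2.0, 1.0], [0.0, 0.0])  # roughly [0.667, 0.333]
```

With equal geometric weights the function reduces to an ordinary softmax over the visual scores, which is one way to see that the geometric term only rescales the usual attention.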
  15. Decoder: autoregressive word prediction
    Transformer decoder
    • Autoregressive word prediction
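Autoregressive word prediction means that each new word is predicted from the words generated so far and then fed back as input. The loop below is a minimal greedy-decoding sketch. The `next_word` callable, the `<bos>`/`<eos>` tokens, and the canned sentence are hypothetical stand-ins for a trained decoder and generator, which the slides do not specify at this level.

```python
def greedy_decode(next_word, max_len=20, bos="<bos>", eos="<eos>"):
    """Generate words one at a time, feeding the growing prefix back in."""
    words = [bos]
    while len(words) < max_len + 1:
        word = next_word(words)  # predict the next word from the prefix
        if word == eos:
            break
        words.append(word)
    return words[1:]  # drop the start token


# Hypothetical stand-in for the decoder: replays a fixed sentence.
CANNED = "move the red bottle into the lower left box".split()


def toy_next_word(prefix):
    j = len(prefix) - 1  # number of words generated so far
    return CANNED[j] if j < len(CANNED) else "<eos>"


sentence = greedy_decode(toy_next_word)
```

A real model would replace `toy_next_word` with a function that runs the decoder on the prefix and picks the most probable word.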
  16. Decoder: autoregressive word prediction
    • Masked multi-head attention (MHA): calculates attention score between predicted word tokens
      $$Q = W_q \hat{y}_{1:j},\quad K = W_k \hat{y}_{1:j},\quad V = W_v \hat{y}_{1:j}$$
      $$\Omega = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
      $\hat{y}_{1:j}$: e.g. “move the red bottle into”
    • Generator predicts the $j$-th word from $h_j^{<dec>}$
    • Loss function: $p(\hat{y})$ is maximized in training
    (Figure: decoder input $\hat{y}_{1:j-1}$, attention outputs $\Omega$)
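The masked attention above can be sketched for a single head in dependency-free Python. This is an illustration under stated assumptions, not the authors' code: token embeddings are stored as rows (so the projections are written `y W` rather than `W y`), the helper names are hypothetical, and the causal mask simply prevents position i from attending to later positions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Softmax over one row; masked entries of -inf become exactly 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(y, w_q, w_k, w_v):
    """Single-head causal self-attention.

    y: j rows of token embeddings (j x d_model).
    w_q, w_k, w_v: projection matrices (d_model x d_k).
    Returns (outputs, attention_weights).
    """
    q, k, v = matmul(y, w_q), matmul(y, w_k), matmul(y, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = []
    for i, row in enumerate(scores):
        # Position i may only attend to positions t <= i.
        masked = [s / math.sqrt(d_k) if t <= i else -math.inf
                  for t, s in enumerate(row)]
        weights.append(softmax(masked))
    return matmul(weights, v), weights


# Toy example: three 2-d "token embeddings" and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attn = masked_self_attention(tokens, identity, identity, identity)
```

In the toy example the first row of `attn` is `[1.0, 0.0, 0.0]`, because the first token can only attend to itself.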
  17. PFN-PIC dataset [Hatori+ ICRA18]: Set of images and fetching instructions
    • Sample configuration
      • Image
      • Coordinates of region of target object
      • Destination
      • Fetching instruction, e.g. “Move the blue and white tissue box to the top right bin”
    • Size

      Set    #Image  #Target object  #Instruction
      train    1044           22014         81087
      valid     116            2503          8774
      test       20             352           898
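To make the sample configuration and split sizes concrete, here is a small sketch. The `FetchingSample` class, its field names and types, the file name, and the box coordinates are hypothetical; the slide lists the fields but not how they are stored. Only the counts in `SPLITS` and the example instruction are taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class FetchingSample:
    """One dataset sample; the field layout is assumed for illustration."""
    image_path: str
    target_box: tuple   # assumed (x_min, y_min, x_max, y_max) in pixels
    destination: str    # assumed free-form label for the destination
    instruction: str


# (#Image, #Target object, #Instruction) per split, as listed on the slide.
SPLITS = {
    "train": (1044, 22014, 81087),
    "valid": (116, 2503, 8774),
    "test": (20, 352, 898),
}


def instructions_per_target(split):
    """Average number of instructions per target object in a split."""
    _, targets, instructions = SPLITS[split]
    return instructions / targets


total_images = sum(images for images, _, _ in SPLITS.values())

# Placeholder values except for the instruction, which is the slide's example.
example = FetchingSample(
    image_path="example.jpg",
    target_box=(0, 0, 10, 10),
    destination="top right bin",
    instruction="Move the blue and white tissue box to the top right bin",
)
```

From these counts, the training split has about 3.7 instructions per target object and the three splits together contain 1180 images.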
  18. Quantitative results: CRT outperformed baselines on all metrics
    (Results table; labels: Primary metrics, Ours)
  19. Quantitative results: CRT outperformed baselines on all metrics
    CIDEr-D and SPICE are drastically improved
  20. Qualitative results: Our method generated concise instructions using referring expressions
    Ground truth: “Move the black rectangle from the lower left box, to the upper right box”
    [Ogura+ RAL20]: “Grab the the the red and and put it in the lower left box”
    Ours ☺: “Move the black object in the lower left box to the upper right box”
    The target is specified by a referring expression
  21. Qualitative results: Our method generated concise instructions using referring expressions
    Ground truth: “Move the rectangular black thing from the box with an empty drink bottle in it to the box with a coke can in it”
    [Ogura+ RAL20]: “Move the red bottle to the right upper box”
    Ours ☺: “Move the black mug to the lower left box”
    The target is clearly expressed as “black mug”
  22. Subjects experiment: CRT outperformed baselines as well
    • Mean opinion score (MOS) on clarity of instructions, on a 5-point scale
      1: Very bad, 2: Bad, 3: Normal, 4: Good, 5: Very good
    • 5 subjects evaluated 50 sentences each
    (Chart: Ours, ABEN [Ogura+ RAL20], ORT [Herdade+ NeurIPS19], Human; annotation: “significantly improved”)
  23. Error analysis: Most errors are related to target objects
    GT ☺: “Take blue cup and put it in the left upper box”
    Ours: “Move the white bottle to the left upper box”
    • Target object error: 79% (Major 51%, Minor 28%)
    • Spatial referring expression error: 16%
    • Others: 5%
    Introducing an auxiliary reconstruction loss can reduce errors
  24. Conclusion
    • Motivation: Multimodal language understanding requires large multimodal corpora; however, annotating many images is labor-intensive
    • Proposed method: Case Relation Transformer, a crossmodal fetching instruction generation model
    • Experimental results: CRT outperformed baselines on all metrics and generated concise instructions using referring expressions