Transcript

  1. Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions
     Motonari Kambara and Komei Sugiura, Keio University, Japan
  2. Our target: Improving the communication skills of domestic service robots
     • Social background: Domestic service robots that communicate with users are a promising solution for disabled and elderly people
     • Problem: Multimodal language understanding models require large multimodal corpora
     • Our previous studies: [Magassouba+ RAL&ICRA20], [Magassouba+ IROS18, RoboCup Best Paper Award]
  3. Motivation: Alleviating the burden of labeling
     • Example instruction: "Pick the white packet which is near the numbered stickers and put it into the lower left box"
     • ✗ Labor-intensive to annotate many images with sentences
     • Multimodal language generation can reduce this cost
  4. Related work: ABEN takes a long time to train
     • Image captioning: Object Relation Transformer (ORT) [Herdade+ NeurIPS19] models spatial reference expressions between regions
     • Video captioning: VideoBERT [Sun+ ICCV19] is a representative one-stream video captioning method
     • Change captioning: DUDA [Park+ ICCV19] is a representative RNN-based change captioning method
     • Fetching instruction generation: ABEN [Ogura+ IROS20] generates fetching instructions with an LSTM
  5. Problem statement: Fetching instruction generation (FIG) task
     • Target task: generate an instruction that includes the target object and the destination,
       e.g. "Pick the white packet which is near the numbered stickers and put it into the lower left box"
     • Input: an image containing the target object and the destination
     • Output: a fetching instruction
  6. Giving unambiguous instruction sentences is challenging
     • ✗ "Move the coke can to the top right box"
     • ☺ "Grab the coke can near to the white gloves and put it in the upper right box"
     • Spatial referring expressions are important
  7. Proposed method: Case Relation Transformer (CRT)
     Contributions:
     • Case Relation Block (CRB): two Transformer blocks that embed and concatenate the inputs
     • Transformer encoder-decoder that models geometric relationships between regions
  8. Network inputs: Features of the target object, the destination, and context information
     • Target object, destination, and context regions are detected by Up-Down Attention [Anderson+ CVPR18] (a minimal sketch of the input layout follows below)
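The slide names the three input groups but not their layout; the snippet below is a minimal sketch of how they might be arranged, assuming 2048-dimensional region features of the kind Up-Down-Attention-style detectors typically output. The dimensions, the number of context regions, and the variable names are assumptions for illustration, not the authors' code.

```python
import torch

# Assumed layout of the three input groups (all sizes are illustrative):
# one target-object feature, one destination feature, and several context-region
# features, each a 2048-d region embedding as typically produced by an
# Up-Down-Attention-style detector.
num_context_regions = 8
feat_dim = 2048

x_target = torch.randn(1, feat_dim)                     # target object region
x_dest = torch.randn(1, feat_dim)                       # destination region
x_context = torch.randn(num_context_regions, feat_dim)  # surrounding context regions

# Bounding-box geometry per region, (x, y, w, h); the encoder's box attention
# (slides 12-13) uses these coordinates.
boxes = torch.rand(1 + 1 + num_context_regions, 4)
```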
  9. Case Relation Block: Embedding input features
     • Case Relation Block (CRB): two Transformer blocks (Trm.: Transformer)
     • Embeds and combines the inputs
  10. Case Relation Block (CRB): Embedding input features
     • Inputs: X^{<targ>}_{conv2}, X^{<targ>}_{conv3}, X^{<targ>}_{conv4}
     • Multi-layer Transformer: each layer consists of multi-head attention, layer normalization & dropout, and FC & dropout
     • Calculates attention between the input features
  11. Case Relation Block (CRB): Embedding input features (cont.)
     • Inputs: X^{<targ>}_{conv2}, X^{<targ>}_{conv3}, X^{<targ>}_{conv4}, h^{<targ>}, x^{<dest>}, X^{<cont>}
     • The multi-layer Transformer calculates attention between the input features
     • Its output h_V is the input to the encoder (a rough sketch follows below)
  12. Transformer encoder based on ORT [Herdade+ NeurIPS19]
     • ω^G_mn: positional features between regions m and n
       ω^G_mn ← (log(|x_m - x_n| / w_m), log(|y_m - y_n| / h_m), log(w_n / w_m), log(h_n / h_m))
     • Box multi-head attention (MHA) combines the geometric weight with the visual-based weight ω^A_mn:
       ω_mn = ω^G_mn exp(ω^A_mn) / Σ_{l=1}^{N} ω^G_ml exp(ω^A_ml)
  13. Transformer encoder based on ORT [Herdade+ NeurIPS19] (cont.)
     • Each attention head of the box multi-head attention (MHA) applies the combined weights ω_mn to its values, yielding h_sa
     • The outputs of the attention heads are concatenated into h_mh (see the sketch below)
  14. Decoder: autoregressive word prediction
     • Masked multi-head attention (MHA) calculates attention scores between the predicted word tokens ŷ_{1:j-1}
       (ŷ_{1:j}: e.g. "move the red bottle into")
       Q = W_q ŷ_{1:j},  K = W_k ŷ_{1:j},  V = W_v ŷ_{1:j}
       Ω = softmax(Q K^T / √d_k) V
     • The generator predicts the j-th word from h_j^{<dec>} (a single-head sketch follows below)
     • Loss function: p(ŷ) is maximized during training
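A minimal sketch of the masked attention step on this slide, assuming a single attention head and already-embedded prefix tokens; the helper name `masked_self_attention` and the toy dimensions are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def masked_self_attention(y_emb, W_q, W_k, W_v):
    """Q = W_q y, K = W_k y, V = W_v y, Omega = softmax(Q K^T / sqrt(d_k)) V,
    with a causal mask so position j attends only to positions <= j."""
    Q, K, V = y_emb @ W_q, y_emb @ W_k, y_emb @ W_v        # each (j, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (j, j)
    causal = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ V                   # Omega: (j, d_k)

# toy usage: a 5-token prefix such as "move the red bottle into"
d_model, d_k, j = 32, 32, 5
y_emb = torch.randn(j, d_model)                            # embedded prefix tokens
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
omega = masked_self_attention(y_emb, W_q, W_k, W_v)        # feeds the generator
```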
  15. PFN-PIC dataset [Hatori+ ICRA18]: a set of images and fetching instructions
     • Sample configuration: an image, the coordinates of the target object and destination regions, and a fetching instruction,
       e.g. "Move the blue and white tissue box to the top right bin"
     • Size:
       Set    #Images  #Target objects  #Instructions
       train  1044     22014            81087
       valid  116      2503             8774
       test   20       352              898
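For illustration only, one PFN-PIC-style sample and the split sizes from the slide could be laid out as below; the dictionary field names are assumptions, not the dataset's actual schema.

```python
# One sample: an image, the target object and destination regions, and an
# instruction (field names are illustrative, not the real PFN-PIC schema).
sample = {
    "image": "image_0001.png",
    "target_bbox": [320, 140, 80, 60],        # (x, y, w, h) of the target object
    "destination_bbox": [520, 40, 160, 160],  # (x, y, w, h) of the destination box
    "instruction": "Move the blue and white tissue box to the top right bin",
}

# Split sizes as reported on the slide.
splits = {
    "train": {"images": 1044, "target_objects": 22014, "instructions": 81087},
    "valid": {"images": 116,  "target_objects": 2503,  "instructions": 8774},
    "test":  {"images": 20,   "target_objects": 352,   "instructions": 898},
}
```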
  16. Qualitative results: Our method generated concise instructions using referring expressions
     • Ground truth: "Move the black rectangle from the lower left box, to the upper right box"
     • [Ogura+ RAL20] ✗ "Grab the the the red and and put it in the lower left box"
     • Ours ☺ "Move the black object in the lower left box to the upper right box"
     • Our method specifies the target with a referring expression
  17. Qualitative results: Our method generated concise instructions using referring expressions (cont.)
     • Ground truth: "Move the rectangular black thing from the box with an empty drink bottle in it to the box with a coke can in it"
     • [Ogura+ RAL20] ✗ "Move the red bottle to the right upper box"
     • Ours ☺ "Move the black mug to the lower left box"
     • Our method clearly expresses the target as "black mug"
  18. Subject experiment: CRT also outperformed the baselines
     • Mean opinion score (MOS) on the clarity of instructions, rated on a 5-point scale
       (1: Very bad, 2: Bad, 3: Normal, 4: Good, 5: Very good)
     • 5 subjects evaluated 50 sentences each
     • Ours significantly improved over ABEN [Ogura+ RAL20] and ORT [Herdade+ NeurIPS19]; human-written instructions were also rated for reference
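The MOS reported here is simply the average of the 5-point clarity ratings; the snippet below shows that computation on toy data and is not the authors' evaluation script.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average of 5-point clarity ratings (1: Very bad ... 5: Very good)."""
    return mean(ratings)

# toy usage: ratings pooled over subjects for one system's generated sentences
ratings_for_ours = [4, 5, 3, 4, 4]
print(f"MOS = {mean_opinion_score(ratings_for_ours):.2f}")
```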
  19. Error analysis: Most errors related to target objects
     • Error breakdown: target object errors 79% (major 51%, minor 28%), spatial referring expression errors 16%, others 5%
     • Example: GT ☺ "Take blue cup and put it in the left upper box"  Ours ✗ "Move the white bottle to the left upper box"
     • Introducing an auxiliary reconstruction loss could reduce these errors
  20. Conclusion
     • Motivation: Multimodal language understanding requires large multimodal corpora, but it is labor-intensive to annotate many images with sentences
     • Proposed method: Case Relation Transformer (CRT), a crossmodal fetching instruction generation model
     • Experimental results: CRT outperformed the baselines on all metrics and generated concise instructions using referring expressions