Slide 1

Slide 1 text

Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions
Motonari Kambara and Komei Sugiura, Keio University, Japan

Slide 2

Slide 2 text

Our target: Improving the communication skills of domestic service robots
• Social background: Domestic service robots that communicate with users are a promising solution for disabled and elderly people
• Problem: Multimodal language understanding models require large multimodal corpora
• Our previous studies: [Magassouba+ RAL&ICRA20], [Magassouba+ IROS18 RoboCup Best Paper Award]

Slide 3

Slide 3 text

Motivation: Alleviating the burden of labeling
• Annotating many images with sentences is labor-intensive
• Multimodal language generation can reduce this cost
Example instruction: "Pick the white packet which is near the numbered stickers and put it into the lower left box"

Slide 4

Slide 4 text

Related work: ABEN takes a long time to train
[Figure: comparison of ORT [Herdade+ 19] and ABEN [Ogura+ 20]]

Field | Method | Outline
Image captioning | Object Relation Transformer (ORT) [Herdade+ NeurIPS19] | Models spatial referring expressions between regions
Video captioning | VideoBERT [Sun+ ICCV19] | Representative video captioning method with one stream
Change captioning | DUDA [Park+ ICCV19] | Representative change captioning method using RNN
Fetching instruction generation | ABEN [Ogura+ IROS20] | Fetching instruction generation using LSTM

Slide 5

Slide 5 text

Problem statement: Fetching instruction generation task
• Target task: Fetching instruction generation (FIG) task, i.e., generating an instruction that includes the target object and the destination
• Input: An image containing the target object and the destination
• Output: Fetching instruction
Example: "Pick the white packet which is near the numbered stickers and put it into the lower left box"
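To make the input/output concrete, here is a minimal Python sketch of the FIG task interface; the class, field names, and `generate_instruction` signature are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class FIGSample:
    """One fetching instruction generation (FIG) example (illustrative fields)."""
    image: np.ndarray            # image containing the target object and destination
    target_bbox: List[float]     # region of the target object, (x1, y1, x2, y2)
    dest_bbox: List[float]       # region of the destination box


def generate_instruction(model, sample: FIGSample) -> str:
    """The FIG task: map the image (with target/destination regions) to a fetching
    instruction, e.g. "Pick the white packet ... and put it into the lower left box".
    `model` stands in for a trained generator; this is only an interface placeholder."""
    return model(sample)
```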

Slide 6

Slide 6 text

Giving unambiguous instruction sentences is challenging
• Ambiguous: "Move the coke can to the top right box"
• ☺ Unambiguous: "Grab the coke can near to the white gloves and put it in the upper right box"
→ Spatial referring expressions are important

Slide 7

Slide 7 text

Proposed method: Case Relation Transformer (CRT)
Contributions:
• Case Relation Block (CRB): two Transformer blocks that embed and concatenate the inputs
• Transformer encoder-decoder: models geometric relationships between regions
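A minimal PyTorch sketch of how these pieces might be wired together; the module and argument names below follow the slide's terminology but are assumptions, not the released implementation.

```python
import torch.nn as nn


class CaseRelationTransformer(nn.Module):
    """High-level wiring of CRT as described on this slide (illustrative only)."""
    def __init__(self, crb, encoder, decoder):
        super().__init__()
        self.crb = crb            # Case Relation Block: embeds and concatenates the inputs
        self.encoder = encoder    # Transformer encoder: models geometric relationships
        self.decoder = decoder    # Transformer decoder: autoregressive word prediction

    def forward(self, x_targ, x_dest, x_cont, y_prev):
        h_v = self.crb(x_targ, x_dest, x_cont)   # fused visual features
        h_enc = self.encoder(h_v)                # region-aware encoding
        return self.decoder(y_prev, h_enc)       # next-word logits
```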

Slide 8

Slide 8 text

Network inputs: Features of the target object, destination, and context information
• Target object
• Destination
• Context information
• Regions detected by Up-Down Attention [Anderson+ CVPR18]
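A hedged sketch of how the three inputs might be assembled in Python; `detector` stands in for an Up-Down Attention style region detector, and both its API and the IoU-based matching below are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)


def build_inputs(image, target_bbox, dest_bbox, detector, num_context=8):
    """Assemble target, destination, and context features for the network.
    `detector(image)` is assumed to return (features, boxes) for detected regions."""
    feats, boxes = detector(image)                       # (N, d) features, (N, 4) boxes

    def pick(ref):
        best = max(range(len(boxes)), key=lambda i: iou(boxes[i], ref))
        return feats[best]

    x_targ = pick(target_bbox)     # feature of the region matching the target box
    x_dest = pick(dest_bbox)       # feature of the region matching the destination box
    x_cont = feats[:num_context]   # surrounding regions as context information
    return x_targ, x_dest, x_cont
```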

Slide 9

Slide 9 text

Case Relation Block (CRB): Embedding input features
• Two Transformer blocks
• Embed and combine the inputs
(Trm.: Transformer)

Slide 10

Slide 10 text

Case Relation Block (CRB): Embedding input features
• Inputs: target features $X_{\mathrm{conv2}}^{<\mathrm{targ}>}$, $X_{\mathrm{conv3}}^{<\mathrm{targ}>}$, $X_{\mathrm{conv4}}^{<\mathrm{targ}>}$
• Multi-layer Transformer: calculates attention between the input features
[Figure: stacked Transformer layers, each consisting of multi-head attention, layer normalization & dropout, and FC & dropout]

Slide 11

Slide 11 text

Case Relation Block (CRB): Embedding input features
• Inputs: $X_{\mathrm{conv2}}^{<\mathrm{targ}>}$, $X_{\mathrm{conv3}}^{<\mathrm{targ}>}$, $X_{\mathrm{conv4}}^{<\mathrm{targ}>}$, $x^{<\mathrm{dest}>}$, $X^{<\mathrm{cont}>}$
• Multi-layer Transformer: calculates attention between the input features
• Output: the target embedding $h^{<\mathrm{targ}>}$ is combined with $x^{<\mathrm{dest}>}$ and $X^{<\mathrm{cont}>}$ into $h_V$, the input to the encoder
[Figure: stacked Transformer layers with multi-head attention, layer normalization & dropout, and FC & dropout]
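A minimal PyTorch sketch of a CRB-like block with the structure shown on these slides (Transformer layers over the embedded inputs, then concatenation into $h_V$); the split into two streams, the layer sizes, and the pooling are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class CaseRelationBlock(nn.Module):
    """Embeds target / destination / context features and combines them (sketch)."""
    def __init__(self, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Two Transformer blocks, as stated on slide 7; assigning one per stream is an assumption
        self.targ_trm = nn.TransformerEncoder(layer, n_layers)
        self.rest_trm = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x_targ, x_dest, x_cont):
        # x_targ: (B, 3, d) conv2-4 target features; x_dest: (B, 1, d); x_cont: (B, M, d)
        h_targ = self.targ_trm(x_targ).mean(dim=1, keepdim=True)   # pooled target embedding
        h_rest = self.rest_trm(torch.cat([x_dest, x_cont], dim=1))
        return torch.cat([h_targ, h_rest], dim=1)                  # h_V, input to the encoder
```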

Slide 12

Slide 12 text

Transformer encoder based on ORT [Herdade+ NeurIPS19]
• Models geometric relationships between regions

Slide 13

Slide 13 text

Transformer encoder based on ORT [Herdade+ NeurIPS19]

Box multi-head attention (MHA):
• $\omega_G^{mn}$: positional features between regions $m$ and $n$
$$\omega_G^{mn} \leftarrow \left( \log\frac{|x_m - x_n|}{w_m},\ \log\frac{|y_m - y_n|}{h_m},\ \log\frac{w_n}{w_m},\ \log\frac{h_n}{h_m} \right)$$
• $\omega_A^{mn}$: the visual-based attention weight
• Combined attention weight:
$$\omega^{mn} = \frac{\omega_G^{mn}\exp(\omega_A^{mn})}{\sum_{l=1}^{N} \omega_G^{ml}\exp(\omega_A^{ml})}$$
[Figure: regions $m$ and $n$ with widths $w$ and heights $h$; the attention-head outputs are concatenated into $h_{sa}$]

Slide 14

Slide 14 text

Transformer encoder based on ORT [Herdade+ NeurIPS19]

Box multi-head attention (MHA), continued:
• $\omega_G^{mn}$ (positional features) and $\omega_A^{mn}$ (visual-based weight) are combined into $\omega^{mn}$ as on the previous slide
• Each attention head outputs $h_{sa}$; the head outputs are concatenated into $h_{mh}$
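The geometric weighting above can be sketched directly in PyTorch; the box format and the scalar reduction of $\omega_G^{mn}$ (ORT uses a learned embedding with ReLU) are simplifying assumptions.

```python
import torch


def box_relation_weights(boxes, omega_A, eps=1e-3):
    """Combine positional and visual attention weights, in the spirit of ORT's box MHA.

    boxes:   (N, 4) tensor of regions as (x_center, y_center, w, h)  -- assumed format
    omega_A: (N, N) tensor of visual-based attention logits
    Returns: (N, N) tensor of combined weights omega^{mn}
    """
    x, y, w, h = boxes.unbind(-1)

    # Positional features omega_G^{mn} between regions m and n
    dx = torch.log((x[:, None] - x[None, :]).abs().clamp(min=eps) / w[:, None])
    dy = torch.log((y[:, None] - y[None, :]).abs().clamp(min=eps) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    geom = torch.stack([dx, dy, dw, dh], dim=-1)           # (N, N, 4)

    # Stand-in for ORT's learned projection of the 4-d features to a positive scalar
    omega_G = geom.abs().mean(dim=-1).clamp(min=eps)       # (N, N)

    # omega^{mn} = omega_G^{mn} * exp(omega_A^{mn}) / sum_l omega_G^{ml} * exp(omega_A^{ml})
    numer = omega_G * omega_A.exp()
    return numer / numer.sum(dim=-1, keepdim=True)
```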

Slide 15

Slide 15 text

Decoder: Autoregressive word prediction
• Transformer decoder predicting words autoregressively

Slide 16

Slide 16 text

Decoder: Autoregressive word prediction
• Masked multi-head attention (MHA): calculates attention scores $\Omega$ between the predicted word tokens $\hat{y}_{1:j}$ (e.g., "move the red bottle into")
$$Q = W_q \hat{y}_{1:j},\quad K = W_k \hat{y}_{1:j},\quad V = W_v \hat{y}_{1:j}$$
$$\Omega = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right) V$$
• Generator: predicts the $j$-th word from $h_j^{<\mathrm{dec}>}$
• Loss function: $p(\hat{y})$ is maximized during training
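The masked MHA step can be sketched as follows (PyTorch); the unbatched shapes and explicit projection matrices are simplifications for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def masked_self_attention(y_emb, W_q, W_k, W_v):
    """Masked self-attention over the already-predicted tokens y_hat_{1:j} (sketch).

    y_emb: (j, d) embeddings of the generated prefix
    W_q, W_k, W_v: (d, d_k) projection matrices
    Returns: (j, d_k) attended representation Omega
    """
    Q, K, V = y_emb @ W_q, y_emb @ W_k, y_emb @ W_v
    d_k = K.size(-1)
    scores = Q @ K.transpose(0, 1) / d_k ** 0.5            # (j, j) scaled dot products

    # Causal mask: token i may only attend to tokens at positions <= i
    j = scores.size(0)
    mask = torch.triu(torch.ones(j, j, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ V
```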

Slide 17

Slide 17 text

PFN-PIC dataset [Hatori+ ICRA18]: Set of images and fetching instructions
• Sample configuration: image, coordinates of the target object and destination regions, and a fetching instruction (e.g., "Move the blue and white tissue box to the top right bin")
• Size:
Set | #Images | #Target objects | #Instructions
train | 1044 | 22014 | 81087
valid | 116 | 2503 | 8774
test | 20 | 352 | 898
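For illustration, one sample might be represented in code as below; the field names and box format are hypothetical, not the dataset's actual schema.

```python
# Hypothetical layout of one PFN-PIC sample; field names and box format are assumptions.
sample = {
    "image": "scene_0001.jpg",                  # top-view image of the boxes
    "target_bbox": [412, 155, 498, 230],        # target object region (x1, y1, x2, y2)
    "dest_bbox": [640, 0, 960, 320],            # destination box region
    "instruction": "Move the blue and white tissue box to the top right bin",
}
```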

Slide 18

Slide 18 text

Quantitative results: CRT outperformed the baselines on all metrics
[Table: comparison of CRT (ours) with the baselines on the primary metrics]

Slide 19

Slide 19 text

Quantitative results: CRT outperformed the baselines on all metrics
• CIDEr-D and SPICE are drastically improved

Slide 20

Slide 20 text

Qualitative results: Our method generated concise instructions using referring expressions
• Ground truth: "Move the black rectangle from the lower left box, to the upper right box"
• [Ogura+ RAL20]: "Grab the the the red and and put it in the lower left box"
• Ours: ☺ "Move the black object in the lower left box to the upper right box"
→ The target is specified by a referring expression

Slide 21

Slide 21 text

Qualitative results: Our method generated concise instructions using referring expressions
• Ground truth: "Move the rectangular black thing from the box with an empty drink bottle in it to the box with a coke can in it"
• [Ogura+ RAL20]: "Move the red bottle to the right upper box"
• Ours: ☺ "Move the black mug to the lower left box"
→ The target is clearly expressed as "black mug"

Slide 22

Slide 22 text

Subject experiment: CRT outperformed the baselines as well
• 5-point mean opinion score (MOS) on the clarity of instructions (1: Very bad, 2: Bad, 3: Normal, 4: Good, 5: Very good)
• 5 subjects evaluated 50 sentences each
• CRT significantly improved over the baselines
[Figure: MOS for Ours (CRT), ABEN [Ogura+ RAL20], ORT [Herdade+ NeurIPS19], and Human]

Slide 23

Slide 23 text

Error analysis: Most errors are related to target objects
• Error breakdown: target object errors 79% (major 51%, minor 28%), spatial referring expression errors 16%, others 5%
• Example: GT: ☺ "Take blue cup and put it in the left upper box"; Ours: "Move the white bottle to the left upper box"
• Introducing an auxiliary reconstruction loss can reduce these errors

Slide 24

Slide 24 text

Conclusion
• Motivation: Multimodal language understanding requires large multimodal corpora; however, annotating many images is labor-intensive
• Proposed method: Case Relation Transformer (CRT), a crossmodal fetching instruction generation model
• Experimental results: CRT outperformed the baselines on all metrics and generated concise instructions using referring expressions