Slide 1

Slide 1 text

Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots
Shintaro Ishikawa and Komei Sugiura, Keio University

Slide 2

Slide 2 text

Overview
✓ We propose Target-dependent UNITER, which comprehends object-fetching instructions from visual information
✓ Target-dependent UNITER models relationships between the instruction and the objects in the image
✓ Our model outperformed the baseline in terms of classification accuracy on two standard datasets

Slide 3

Slide 3 text

Motivation
Target: Building robots that interact through language
• Increasing need for daily care and support in aging society
• Interacting naturally through language is important for robots
  e.g., "Go get the bottle on the table"
• Domestic Service Robots (DSRs)
  - Capable of physically assisting handicapped people
  - Expected to overcome shortage of home care workers
[https://global.toyota/jp/download/8725271]

Slide 4

Slide 4 text

Challenge: Identifying the target object is often difficult
• In some cases, it is difficult for robots to identify the target object
• Robots can select the target object correctly if they comprehend referring expressions
• Requisite: understanding of relationships between objects in the image
[Figure: "Take the tumbler on the table" (x4)]

Slide 5

Slide 5 text

Related work: Existing methods show insufficient performance
Vision and language:
• ViLBERT [Lu+ 19]: Handles image and text inputs in two separate Transformers
• UNITER [Chen+ 20]: Fuses image and text inputs in a single Transformer
• VILLA [Gan+ 20]: Learns V&L representations through adversarial training
MLU-FI:
• [Hatori+ 18]: Method for an object-picking task
• MTCM, MTCM-AB [Magassouba+ 19, 20]: Identify the target object from the instruction and the whole image

Slide 6

Slide 6 text

Problem statement
Target task: Multimodal Language Understanding for Fetching Instructions (MLU-FI)
-> Identify the target object from an instruction and an image
• Input: instruction, candidate region, detected regions
• Output: predicted probability that the candidate region is the correct target
Example instruction: "Pick up the empty bottle on the shelf"
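As a rough illustration of this input/output interface, the sketch below shows how one MLU-FI sample could be represented in Python; the type names (Region, MLUFISample) are ours and not part of the original work.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Region:
        """One detected bounding box with its visual feature."""
        box: List[float]        # [x_min, y_min, x_max, y_max] in pixels
        feature: List[float]    # appearance feature (e.g., from an object detector)

    @dataclass
    class MLUFISample:
        """One MLU-FI input: is `candidate` the object the instruction refers to?"""
        instruction: str        # e.g., "Pick up the empty bottle on the shelf"
        candidate: Region       # candidate target region
        detected: List[Region]  # all detected regions (context)

    # The model maps one MLUFISample to a probability in [0, 1] that the
    # candidate region is the correct target.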

Slide 7

Slide 7 text

Proposed method: Target-dependent UNITER
Target-dependent UNITER: Extension of the UNITER [Chen+ 20] framework to the MLU-FI task
✓ Novelty: Introduction of a new architecture for handling the candidate object
-> Capable of judging whether the candidate is the target object or not

Slide 8

Slide 8 text

Module: Text Embedder
Text Embedder: Embeds the instruction
• $x_{\mathrm{inst}}$: One-hot vector set representing each token of the instruction
  We tokenize the instruction using WordPiece [Wu+ 16] and convert it into a token sequence
  e.g., "Pick up the empty bottle on the shelf" -> ["Pick", "up", "the", "empty", "bottle", "on", "the", "shelf", "."]
• $x_{\mathrm{pos}}$: One-hot vector set representing the position of each token in the instruction
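As a minimal sketch of this tokenization step, assuming the Hugging Face transformers BERT tokenizer as a stand-in for WordPiece (not necessarily the authors' exact pipeline):

    from transformers import BertTokenizer  # WordPiece-based tokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    instruction = "Pick up the empty bottle on the shelf."
    tokens = tokenizer.tokenize(instruction)
    # ['pick', 'up', 'the', 'empty', 'bottle', 'on', 'the', 'shelf', '.']
    # (this uncased tokenizer lowercases; a cased vocabulary would keep "Pick")

    token_ids = tokenizer.convert_tokens_to_ids(tokens)  # token identities (x_inst)
    position_ids = list(range(len(tokens)))              # token positions (x_pos)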

Slide 9

Slide 9 text

Module: Image Embedder
Image Embedder: Embeds the candidate region and the detected regions
• $x_{\mathrm{det}}$: Feature vectors of all detected regions, extracted from the image
• $x_{\mathrm{cand}}$: The feature vector selected from $x_{\mathrm{det}}$ for the candidate region
  We extract feature vectors using Faster R-CNN [Ren+ 16]
• $x_{\mathrm{det\_loc}}$: Location feature vectors of all detected regions
• $x_{\mathrm{cand\_loc}}$: The location feature vector selected from $x_{\mathrm{det\_loc}}$ for the candidate region
  We use a seven-dimensional vector as the location feature (normalized left/top/right/bottom coordinates, width, height, and area)
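The seven-dimensional location feature can be computed as in the sketch below (our own helper, assuming boxes are given as pixel corner coordinates):

    import numpy as np

    def location_feature(box, img_w, img_h):
        """Normalized left/top/right/bottom, width, height, and area of one region."""
        x1, y1, x2, y2 = box
        left, top = x1 / img_w, y1 / img_h
        right, bottom = x2 / img_w, y2 / img_h
        w, h = right - left, bottom - top
        return np.array([left, top, right, bottom, w, h, w * h], dtype=np.float32)

    # Example: a 100x150-pixel box inside a 640x480 image
    print(location_feature((50, 40, 150, 190), 640, 480))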

Slide 10

Slide 10 text

Module: Multi-Layer Transformer
Multi-Layer Transformer:
- Learns relationships between instruction and image
- Consists of $L$ Transformer blocks
For the $l$-th block:
$Q^{(l)} = W_Q^{(l)} h_{\mathrm{trans}}^{(l)}, \quad K^{(l)} = W_K^{(l)} h_{\mathrm{trans}}^{(l)}, \quad V^{(l)} = W_V^{(l)} h_{\mathrm{trans}}^{(l)}$
$f_{\mathrm{attn}}^{(l)} = V^{(l)} \, \mathrm{softmax}\!\left( \frac{Q^{(l)\top} K^{(l)}}{\sqrt{d_K}} \right)$
$S_{\mathrm{attn}} = \left( f_{\mathrm{attn}}^{(1)}, \ldots, f_{\mathrm{attn}}^{(L)} \right)$
[Figure: Attention visualization]
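The attention inside each block is standard scaled dot-product attention; the following single-head NumPy sketch uses the usual row-wise convention (variable names are ours, not the authors'):

    import numpy as np

    def scaled_dot_product_attention(h, W_q, W_k, W_v):
        """Single-head self-attention over hidden states h (n_tokens x d_model)."""
        Q, K, V = h @ W_q, h @ W_k, h @ W_v             # queries, keys, values
        scores = Q @ K.T / np.sqrt(K.shape[-1])         # similarity of every token pair
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V                              # attended features (f_attn)

    # Toy example: 5 tokens/regions, hidden width 8
    rng = np.random.default_rng(0)
    h = rng.normal(size=(5, 8))
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    print(scaled_dot_product_attention(h, W_q, W_k, W_v).shape)  # (5, 8)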

Slide 11

Slide 11 text

Experiment: We evaluated our method on two datasets
✓ Datasets (each consists of images and a set of instructions):

  Name                   Images   Instructions   Vocabulary size   Avg. sentence length
  PFN-PIC [Hatori+ 18]     1180          90759              4682                   14.2
  WRS-UniALT                570           1246               167                    7.1

Example instructions: "Pick up the white box next to the red bottle and put it in the lower left box"; "Pick up the empty bottle on the shelf"

Slide 12

Slide 12 text

Quantitative results: Our method outperformed the baselines

  Method                          PFN-PIC (Top-1 accuracy)   PFN-PIC (Binary accuracy)   WRS-UniALT (Binary accuracy)
  [Hatori+, ICRA18 Best Paper]    88.0                       -                           -
  MTCM [Magassouba+, IROS19]      88.8 ± 0.43                90.1 ± 0.93                 91.8 ± 0.36
  Ours (All regions)              -                          96.9 ± 0.34                 96.4 ± 0.24
  Ours (Proximal regions only)    -                          97.2 ± 0.29                 96.5 ± 0.19

• All regions: we input all the detected regions
• Proximal regions only: we select the half of the detected regions nearest the target region (sketched below)
Limiting the number of regions is effective
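One way to read "Proximal regions only" is to keep the half of the detected boxes whose centers lie closest to the target/candidate box; the helper below is a hypothetical sketch of such a selection, not the authors' exact criterion:

    import numpy as np

    def proximal_regions(boxes, target_box, keep_ratio=0.5):
        """Keep the keep_ratio fraction of boxes whose centers are nearest
        to the center of target_box. Boxes are [x_min, y_min, x_max, y_max]."""
        boxes = np.asarray(boxes, dtype=np.float32)
        target_box = np.asarray(target_box, dtype=np.float32)
        centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
        target_center = (target_box[:2] + target_box[2:]) / 2.0
        dists = np.linalg.norm(centers - target_center, axis=1)
        k = max(1, int(round(len(boxes) * keep_ratio)))
        return boxes[np.argsort(dists)[:k]]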

Slide 13

Slide 13 text

Qualitative results: Correct predictions
• "Pick up the black cup in the bottom right section of the box and move it to the bottom left section of the box": $\hat{p}(y) = 0.999$
  -> The candidate object is correctly identified as the target object
• "Take the can juice on the white shelf": $\hat{p}(y) = 8.19 \times 10^{-13}$
  -> The candidate object is correctly identified as a different object from the target object

Slide 14

Slide 14 text

Qualitative results: Attention visualization
• Regions in proximity to the target object are attended
• "gray container," "next to," and "bottle" are attended

Slide 15

Slide 15 text

Qualitative results: Incorrect predictions
• "Move the green rectangle with white on the side from the upper left box to the lower left box": $\hat{p}(y) = 0.978$
  -> Prediction fails because the region contains many pixels unrelated to the candidate object
• "Take the white cup on the corner of the table.": $\hat{p}(y) = 0.999$
  -> Prediction fails because the region is too small

Slide 16

Slide 16 text

Ablation studies

  Method                            PFN-PIC (Binary accuracy)   WRS-UniALT (Binary accuracy)
  Ours (w/o FRCNN fine-tuning)      91.5 ± 0.69                 94.0 ± 1.49
  Ours (Late fusion)                96.0 ± 0.08                 96.0 ± 0.24
  Ours (Few regions)                96.6 ± 0.36                 95.8 ± 0.71
  Ours (w/o UNITER pretraining)     96.8 ± 0.34                 95.4 ± 0.19
  Ours                              97.2 ± 0.29                 96.5 ± 0.19

Each component contributes to performance
[Figures: Late fusion, Few regions]

Slide 17

Slide 17 text

Conclusion
✓ We propose Target-dependent UNITER, which comprehends object-fetching instructions from visual information
✓ Target-dependent UNITER models relationships between the instruction and the objects in the image
✓ Our model outperformed the baseline in terms of classification accuracy on two standard datasets