Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots

Shintaro Ishikawa and Komei Sugiura
Keio University

Overview
• We propose Target-dependent UNITER, which comprehends object-fetching instructions using visual information.
• Target-dependent UNITER models the relationships between the instruction and the objects in the image.
• Our model outperformed the baseline in classification accuracy on two standard datasets.

Target: Building robots that interact through language
Motivation
• An aging society has an increasing need for daily care and support.
• Domestic Service Robots (DSRs)
  - Capable of physically assisting handicapped people
  - Expected to overcome the shortage of home care workers
• Interacting naturally through language is important for robots, e.g., “Go get the bottle on the table”.
[https://global.toyota/jp/download/8725271]

Challenge: Identifying the target object is often difficult
• In some cases, it is difficult for robots to identify the target object, e.g., “Take the tumbler on the table”.
• Robots can select the target object correctly if they comprehend referring expressions.
• Requisite: understanding the relationships between the objects in the image.

Related works: Existing methods show insufficient performance

Field: Vision and language
• ViLBERT [Lu+ 19]: Handles image and text inputs in two separate Transformers
• UNITER [Chen+ 20]: Fuses image and text inputs in a single Transformer
• VILLA [Gan+ 20]: Learns V&L representations through adversarial training

Field: MLU-FI
• [Hatori+ 18]: Method for an object-picking task
• MTCM, MTCM-AB [Magassouba+ 19, 20]: Identify the target object from the instruction and the whole image

Problem statement
Target task: Multimodal Language Understanding for Fetching Instructions (MLU-FI) -> Identify the target object from an instruction and an image
• Input: Instruction, candidate region, detected regions
• Output: Predicted probability that the candidate region is correct
Example instruction: “Pick up the empty bottle on the shelf”
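Because the output is a probability for a single candidate region, each (instruction, candidate) pair can be scored as a binary decision. The sketch below shows one way such a binary accuracy could be computed. The 0.5 threshold, the function name, and the sample values are assumptions made for illustration and are not taken from the slides.

```python
from typing import Sequence


def binary_accuracy(probs: Sequence[float], labels: Sequence[int],
                    threshold: float = 0.5) -> float:
    """Fraction of (instruction, candidate) pairs classified correctly.

    probs:  predicted probability that the candidate region is the target.
    labels: 1 if the candidate really is the target, 0 otherwise.
    threshold: assumed decision threshold (not specified in the slides).
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one sample is required")
    correct = sum(int(p >= threshold) == y for p, y in zip(probs, labels))
    return correct / len(probs)


# Hypothetical predictions and labels, for illustration only.
probs = [0.999, 0.02, 0.61, 0.40]
labels = [1, 0, 0, 1]
accuracy = binary_accuracy(probs, labels)
print(accuracy)
```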

Proposed method: Target-dependent UNITER
• Target-dependent UNITER: Extension of the UNITER [Chen+ 20] framework to the MLU-FI task
• Novelty: Introduction of a new architecture for handling the candidate object -> Capable of judging whether the candidate is the target object or not

Module: Text Embedder
Text Embedder: Embeds the instruction
• $x_{inst}$: Set of one-hot vectors representing each token of the instruction
  We tokenize the instruction using WordPiece [Wu+ 16] and convert it into a token sequence,
  e.g., “Pick up the empty bottle on the shelf” -> [“Pick”, “up”, “the”, “empty”, “bottle”, “on”, “the”, “shelf”, “.”]
• $x_{pos}$: Set of one-hot vectors representing the position of each token in the instruction
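The tokenization step can be sketched as a greedy longest-match-first split, which is the usual inference-time form of WordPiece. This is a toy illustration: the vocabulary `VOCAB`, the helper names, and the whitespace pre-splitting are all hypothetical, and a real system would use a trained vocabulary. The token and position indices it produces are the indices that the one-hot vectors would encode.

```python
from typing import Dict, List, Tuple


def wordpiece(word: str, vocab: Dict[str, int], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry a "##" prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical), just large enough for the example.
VOCAB = {tok: i for i, tok in enumerate(
    ["[UNK]", "Pick", "up", "the", "empty", "bottle", "on", "shelf", ".",
     "tumble", "##r"])}


def embed_inputs(instruction: str) -> Tuple[List[str], List[int], List[int]]:
    """Return tokens, token indices, and position indices for an instruction."""
    words = instruction.replace(".", " .").split()
    tokens = [p for w in words for p in wordpiece(w, VOCAB)]
    token_ids = [VOCAB[t] for t in tokens]     # index of the 1 in each one-hot token vector
    position_ids = list(range(len(tokens)))    # index of the 1 in each one-hot position vector
    return tokens, token_ids, position_ids


tokens, token_ids, position_ids = embed_inputs("Pick up the empty bottle on the shelf.")
print(tokens)
print(position_ids)
```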

Module: Image Embedder
Image Embedder: Embeds the candidate region and the detected regions
• $x_{det}$: Feature vectors of all detected regions extracted from the image
• $x_{cand}$: A feature vector selected from $x_{det}$
  We extract the feature vectors using Faster R-CNN [Ren+ 16].
• $x_{detloc}$: Location feature vectors of all detected regions
• $x_{candloc}$: A location feature vector selected from $x_{detloc}$
  We use a seven-dimensional vector as the location feature (normalized left/top/right/bottom coordinates, width, height, and area).
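The seven-dimensional location feature can be sketched as follows. The slide states that the coordinates are normalized. That width, height, and area are also normalized by the image size is an assumption here, as are the function name and the pixel-coordinate box format.

```python
from typing import List, Tuple


def location_feature(box: Tuple[float, float, float, float],
                     image_w: float, image_h: float) -> List[float]:
    """Build [left, top, right, bottom, width, height, area] for one region.

    box is (left, top, right, bottom) in pixels. Every entry is divided by
    the image size (assumed), so all seven values lie in [0, 1].
    """
    left, top, right, bottom = box
    width = (right - left) / image_w
    height = (bottom - top) / image_h
    return [left / image_w, top / image_h, right / image_w, bottom / image_h,
            width, height, width * height]


# Hypothetical 640x480 image with one detected region.
feat = location_feature((160, 120, 480, 360), image_w=640, image_h=480)
print(feat)
```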

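Module: Multi-Layer Transformer
Multi-Layer Transformer:
- Learns the relationships between the instruction and the image
- Consists of $L$ Transformer blocks

$$Q^{(i)} = W_q^{(i)} h_{trans}^{(i)}, \quad K^{(i)} = W_k^{(i)} h_{trans}^{(i)}, \quad V^{(i)} = W_v^{(i)} h_{trans}^{(i)}$$

$$f_{attn}^{(i)} = V^{(i)} \, \mathrm{softmax}\!\left( \frac{Q^{(i)} K^{(i)\top}}{\sqrt{d_k}} \right)$$

$$S_{attn} = \left\{ f_{attn}^{(1)}, \ldots, f_{attn}^{(A)} \right\}$$

[Figure: Attention visualization]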
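A single attention head from the equations above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it stores tokens as rows, so the output is computed as softmax(QK^T / sqrt(d_k)) V, which is the transposed layout of the column form written on the slide. The helper names and the toy weights are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(h: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Scaled dot-product attention for one head.

    h is (tokens x d_model); each weight matrix is (d_model x d_k).
    """
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v)


# Toy input: two tokens, identity projections (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
h_trans = [[1.0, 0.0], [0.0, 1.0]]
f_attn = attention_head(h_trans, identity, identity, identity)
s_attn = [f_attn]  # with several heads, one entry per head
print(f_attn)
```

With identity projections each token attends most strongly to itself, which makes the output easy to check by hand.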

Experiment: We evaluated our method on two datasets
Datasets (each consists of images and a set of instructions):

Name                   Images   Instructions   Vocabulary size   Average sentence length
PFN-PIC [Hatori+ 18]   1180     90759          4682              14.2
WRS-UniALT             570      1246           167               7.1

Example instructions:
• “Pick up the white box next to the red bottle and put it in the lower left box”
• “Pick up the empty bottle on the shelf”

Quantitative result: Our method outperformed the baseline

Method                         PFN-PIC (Top-1 accuracy)   PFN-PIC (Binary accuracy)   WRS-UniALT (Binary accuracy)
[Hatori+, ICRA18 Best Paper]   88.0                       -                           -
MTCM [Magassouba+, IROS19]     88.8 ± 0.43                90.1 ± 0.93                 91.8 ± 0.36
Ours (All regions)             -                          96.9 ± 0.34                 96.4 ± 0.24
Ours (Proximal regions only)   -                          97.2 ± 0.29                 96.5 ± 0.19

Variants of the proposed method:
• All regions: We input all the detected regions.
• Proximal regions only: We select the half of the detected regions nearest to the target region.
Limiting the number of regions is effective.
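The "Proximal regions only" variant could be sketched as keeping the half of the detected regions whose centers are closest to a reference region. The slides do not say how proximity is measured, so the Euclidean distance between box centers, the function names, and the sample boxes below are all assumptions.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def center(box: Box) -> Tuple[float, float]:
    """Center point of a bounding box."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def proximal_regions(reference: Box, detected: List[Box]) -> List[Box]:
    """Keep the half of the detected regions nearest to the reference region.

    Proximity is the distance between box centers (an assumed metric).
    At least one region is always kept.
    """
    ref_x, ref_y = center(reference)

    def distance(box: Box) -> float:
        x, y = center(box)
        return math.hypot(x - ref_x, y - ref_y)

    keep = max(1, len(detected) // 2)
    return sorted(detected, key=distance)[:keep]


# Hypothetical detections; the first argument is the reference region.
detected = [(100, 100, 110, 110), (0, 0, 10, 10), (20, 0, 30, 10), (200, 0, 210, 10)]
selected = proximal_regions((0, 0, 10, 10), detected)
print(selected)
```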

Qualitative result: Correct predictions
• “Pick up the black cup in the bottom right section of the box and move it to the bottom left section of the box”: $p(\hat{y}) = 0.999$. The model correctly identifies the candidate object as the target object.
• “Take the can juice on the white shelf”: $p(\hat{y}) = 8.19 \times 10^{-\ldots}$. The model correctly identifies the candidate object as a different object from the target object.

Qualitative result: Attention visualization
• Regions in proximity to the target object are attended.
• The phrases “gray container,” “next to,” and “bottle” are attended.

Qualitative result: Incorrect predictions
• “Move the green rectangle with white on the side from the upper left box, to the lower left box”: $p(\hat{y}) = 0.978$. The prediction fails because the region contains many pixels unrelated to the candidate object.
• “Take the white cup on the corner of the table.”: $p(\hat{y}) = 0.999$. The prediction fails because the region is too small.

Ablation studies

Method                          PFN-PIC       WRS-UniALT
Ours (W/o FRCNN fine-tuning)    91.5 ± 0.69   94.0 ± 1.49
Ours (Late fusion)              96.0 ± 0.08   96.0 ± 0.24
Ours (Few regions)              96.6 ± 0.36   95.8 ± 0.71
Ours (W/o UNITER pretraining)   96.8 ± 0.34   95.4 ± 0.19
Ours                            97.2 ± 0.29   96.5 ± 0.19

Each component contributes to performance.
[Figure panels: Late fusion, Few regions]

Conclusion
• We propose Target-dependent UNITER, which comprehends object-fetching instructions using visual information.
• Target-dependent UNITER models the relationships between the instruction and the objects in the image.
• Our model outperformed the baseline in classification accuracy on two standard datasets.

Semantic Machine Intelligence Lab., Keio Univ.