[RSJ22] TDP-MAT: Multimodal Language Comprehension for Object Manipulation Tasks via Real Images

✓ https://www.toyota.com/usa/toyota-effect/romy-robot

✓ “Look in the left wicker vase that is next to the potted plant”
✓ Key: Wicker vase

✓ REVERIE-fetch
• (Instruction) “Look in the left wicker vase that is next to the potted plant”
• (Context Regions)
• (Candidate Region)
• Faster R-CNN [Ren+, PAMI16]

• MTCM [Magassouba+, RA-L19]: VGG16, LSTM
• Target-dependent UNITER (TDU) [Ishikawa+, RA-L21]: UNITER [Chen+, ECCV20]
• REVERIE task / dataset [Qi+, CVPR20]: REVERIE

• MAT [Ishikawa+, ICPR22]
• CLIP [Radford+, ICML21]
• Perceiver [Jaegle+, ICML21]


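✓ Update of the perturbation $\boldsymbol{\delta}_t$ (Input: $\boldsymbol{\delta}_t$)

1. Loss and its gradient with respect to the perturbation:
   $E(\boldsymbol{\delta}) = \mathrm{CE}(f(\boldsymbol{x}), \boldsymbol{y}), \quad \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) = \frac{\partial E}{\partial \boldsymbol{\delta}}$
2. Moments $\boldsymbol{m}_t$, $\boldsymbol{v}_t$ of $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta})$:
   $\boldsymbol{m}_t = \rho_1 \boldsymbol{m}_{t-1} + (1 - \rho_1) \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)$
   $\boldsymbol{v}_t = \rho_2 \boldsymbol{v}_{t-1} + (1 - \rho_2) \left( \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t) \right)^2$
3. Update amount $\Delta\boldsymbol{\delta}_t$ from $\boldsymbol{m}_t$, $\boldsymbol{v}_t$:
   $\hat{\boldsymbol{m}}_t = \frac{\boldsymbol{m}_t}{1 - \rho_1^t}, \quad \hat{\boldsymbol{v}}_t = \frac{\boldsymbol{v}_t}{1 - \rho_2^t}, \quad \Delta\boldsymbol{\delta}_t = \eta \frac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon}$
4. Projected update:
   $\boldsymbol{\delta}_{t+1} = \Pi_{\|\boldsymbol{\delta}\| \le \epsilon} \left( \boldsymbol{\delta}_t + \frac{\Delta\boldsymbol{\delta}_t}{\|\Delta\boldsymbol{\delta}_t\|_F} \right)$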
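The four steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `mat_step`, the default hyperparameter values, the use of flat lists instead of tensors, and the choice of the L2 norm for the projection are all assumptions. The slide uses the symbol ε both for the small constant in step 3 and for the projection radius in step 4; the sketch separates them as `eps` and `radius`.

```python
import math


def mat_step(delta, grad, m, v, t, rho1=0.9, rho2=0.999,
             eta=0.1, eps=1e-8, radius=1.0):
    """One moment-based perturbation update (steps 2-4) on flat lists.

    All default values are illustrative assumptions, not values from the slides.
    """
    # Step 2: exponential moving averages of the gradient and its square.
    m = [rho1 * mi + (1 - rho1) * g for mi, g in zip(m, grad)]
    v = [rho2 * vi + (1 - rho2) * g * g for vi, g in zip(v, grad)]

    # Step 3: bias-corrected moments and the raw update amount.
    m_hat = [mi / (1 - rho1 ** t) for mi in m]
    v_hat = [vi / (1 - rho2 ** t) for vi in v]
    step = [eta * mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]

    # Step 4a: normalise the update by its norm (Frobenius norm on a flat list).
    step_norm = math.sqrt(sum(s * s for s in step))
    if step_norm > 0:
        step = [s / step_norm for s in step]
    new_delta = [d + s for d, s in zip(delta, step)]

    # Step 4b: project back onto the ball of the given radius (assumed L2 ball).
    delta_norm = math.sqrt(sum(d * d for d in new_delta))
    if delta_norm > radius:
        new_delta = [d * radius / delta_norm for d in new_delta]
    return new_delta, m, v


# Usage: the first step (t = 1) from a zero perturbation and zero moments.
delta, m, v = mat_step([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], t=1)
```

Because the update is normalised before it is added, `eta` cancels out of the step direction in this sketch; only the projection radius limits how large the perturbation becomes.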

✓ CLIP
✓ ViT [Dosovitskiy+, ICLR21]
✓ transformer, [EOT]

✓ CLIP Encoders, Perceiver

✓ REVERIE-fetch dataset
- REVERIE dataset
✓ REVERIE [Qi+, CVPR20]
- Matterport3D
- https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif

✓ REVERIE-fetch dataset
• REVERIE dataset

#Samples   Vocabulary size   Average sentence length
30532      2853              19.1

Training   Validation   Test
26808      2552         1172

• “Go into the living room and give me the pillow on the couch nearest the plant” → TDP-MAT
• “Make haste to the office and fluff the pillow sitting on the left of the chair” → TDP-MAT
✓ Bounding box

• Acc [%]

Condition                                  Acc [%] ↑
Baseline: TDU [Ishikawa+, IROS21]          73.3   0.485
Ours: TDP-MAT
  W/o MAT                                  72.5   3.55
  W/o MAT + Smaller learning rate          74.4   0.831
  W/o CLIP & Perceiver                     74.1   1.47
  W/o Pretraining                          73.1   2.24
  Full                                     75.3   0.691

- Full vs. Baseline: +2.0
- Full vs. W/o MAT: +2.8 (Smaller learning rate: 1/8)
- Full vs. W/o CLIP & Perceiver: +1.2 (CLIP Encoders, Perceiver Module, Cross Attention)
- Full vs. W/o Pretraining: +2.2 (TDU)
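The gains quoted on these slides (+2.0, +2.8, +1.2, +2.2) match the difference between the Full model's accuracy and the accuracy of one other row. A small sketch that recomputes them from the table values; the dictionary keys and the helper name `gain_over` are illustrative, not taken from the slides.

```python
# Accuracy [%] per condition, copied from the table on the slides.
acc = {
    "Baseline (TDU)": 73.3,
    "W/o MAT": 72.5,
    "W/o MAT + Smaller learning rate": 74.4,
    "W/o CLIP & Perceiver": 74.1,
    "W/o Pretraining": 73.1,
    "Full": 75.3,
}


def gain_over(condition):
    """Accuracy gain of the Full model over the given condition, in points."""
    return round(acc["Full"] - acc[condition], 1)


gains = {name: gain_over(name) for name in acc if name != "Full"}
for name, gain in gains.items():
    print(f"Full vs. {name}: +{gain}")
```

Each quoted gain corresponds to exactly one row, so the pairing of slide to condition can be read off from the arithmetic alone.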


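✓ $L$, $N$: $\mathbb{R}^{L \times D}$, $\mathbb{R}^{N \times E}$
✓ $\mathbb{R}^{L \times D}, \mathbb{R}^{N \times D} \to \mathbb{R}^{L \times N}$
✓ $\mathbb{R}^{L \times D}, \mathbb{R}^{L \times D} \to \mathbb{R}^{L \times L}$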
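The shape mappings above read like attention maps between an array of length L and an array of length N (giving L×N), and between the length-L array and itself (giving L×L). The sketch below only checks those shapes with a scaled dot-product attention map. Interpreting the two cases as cross-attention and self-attention is an assumption, as are the function name `attention_map`, the toy sizes, and the simplification that the N inputs are already projected from E to D dimensions.

```python
import math


def attention_map(queries, keys):
    """Row-wise softmax of scaled dot products: (Lq x D), (Lk x D) -> (Lq x Lk)."""
    dim = len(queries[0])
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Toy sizes (assumed): L latent rows, N input rows, D channels.
L, N, D = 4, 10, 8
latent = [[0.01 * (i + j) for j in range(D)] for i in range(L)]
inputs = [[0.02 * (i - j) for j in range(D)] for i in range(N)]

cross_map = attention_map(latent, inputs)  # L x N
self_map = attention_map(latent, latent)   # L x L
```

The point of the comparison is cost: the first map grows with L·N and the second with L², so keeping L small keeps both maps small even when N is large.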


✓ $8 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.99$

✓ ✓ ✓ 35 19+6=25


More Decks by Semantic Machine Intelligence Lab., Keio Univ.
