[RSJ22] TDP-MAT: Multimodal Language Comprehension for Object Manipulation Tasks via Real Images

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

https://www.toyota.com/usa/toyota-effect/romy-robot

Slide 4

Slide 4 text

✓ “Look in the left wicker vase that is next to the potted plant”
Figure label: Wicker vase

Slide 5

Slide 5 text

✓ “Look in the left wicker vase that is next to the potted plant”
Figure labels: Wicker vase (four instances)

Slide 6

Slide 6 text

✓ “Look in the left wicker vase that is next to the potted plant”
✓ Key:
Figure labels: Wicker vase (four instances)

Slide 7

Slide 7 text

✓ REVERIE-fetch
“Look in the left wicker vase that is next to the potted plant”

Slide 8

Slide 8 text

✓ REVERIE-fetch
• Instruction, context regions, candidate region
“Look in the left wicker vase that is next to the potted plant”

Slide 10

Slide 10 text

✓ REVERIE-fetch
• Instruction, context regions, candidate region
“Look in the left wicker vase that is next to the potted plant”

Slide 11

Slide 11 text

✓ REVERIE-fetch
• Instruction, context regions, candidate region
“Look in the left wicker vase that is next to the potted plant”
Faster R-CNN [Ren+, PAMI16]

Slide 12

Slide 12 text

• MTCM [Magassouba+, RA-L19]: VGG16, LSTM
• Target-dependent UNITER (TDU) [Ishikawa+, RA-L21]: UNITER [Chen+, ECCV20]
• REVERIE task / dataset [Qi+, CVPR20]

Slide 13

Slide 13 text

• MAT [Ishikawa+, ICPR22]
• CLIP [Radford+, ICML21]
• Perceiver [Jaegle+, ICML21]

Slide 17

Slide 17 text

• MAT [Ishikawa+, ICPR22]
• CLIP [Radford+, ICML21]
• Perceiver [Jaegle+, ICML21]

Slide 18

Slide 18 text

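Figure labels: Input, $\boldsymbol{\delta}_t$, Output

1. $E(\boldsymbol{\delta}) = \mathrm{CE}(f(\boldsymbol{x}), \boldsymbol{y})$, $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) = \dfrac{\partial E}{\partial \boldsymbol{\delta}}$

2. $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) \to \boldsymbol{m}_t, \boldsymbol{v}_t$:
$\boldsymbol{m}_t = \rho_1 \boldsymbol{m}_{t-1} + (1 - \rho_1) \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)$
$\boldsymbol{v}_t = \rho_2 \boldsymbol{v}_{t-1} + (1 - \rho_2) \left( \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t) \right)^2$

3. $\boldsymbol{m}_t, \boldsymbol{v}_t \to \Delta\boldsymbol{\delta}_t$:
$\hat{\boldsymbol{m}}_t = \dfrac{\boldsymbol{m}_t}{1 - \rho_1^t}, \quad \hat{\boldsymbol{v}}_t = \dfrac{\boldsymbol{v}_t}{1 - \rho_2^t}$
$\Delta\boldsymbol{\delta}_t = \eta \dfrac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon}$

4. $\boldsymbol{\delta}_{t+1} = \Pi_{\|\boldsymbol{\delta}\| \le \epsilon} \left( \boldsymbol{\delta}_t + \dfrac{\Delta\boldsymbol{\delta}_t}{\|\Delta\boldsymbol{\delta}_t\|_F} \right)$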
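Steps 2 to 4 above can be sketched in plain Python as follows. This is only an illustration under several assumptions: the gradient from step 1 is supplied by the caller, Π is read as rescaling onto an L2 ball, the square root in the denominator follows the usual Adam convention, and the two roles of ε on the slide are given separate hypothetical names (`eps_num`, `eps_ball`). The function name `mat_step` and all default values are hypothetical as well.

```python
import math


def mat_step(delta, grad, m, v, t,
             eta=0.1, rho1=0.9, rho2=0.999, eps_num=1e-8, eps_ball=1.0):
    """One update of the perturbation delta (steps 2-4), on flat lists of floats.

    All hyperparameter defaults are illustrative assumptions, not values
    taken from the slides. t is the 1-based iteration index.
    """
    # Step 2: moving averages of the gradient and of the squared gradient.
    m = [rho1 * mi + (1 - rho1) * g for mi, g in zip(m, grad)]
    v = [rho2 * vi + (1 - rho2) * g * g for vi, g in zip(v, grad)]

    # Step 3: bias correction, then the raw update.
    m_hat = [mi / (1 - rho1 ** t) for mi in m]
    v_hat = [vi / (1 - rho2 ** t) for vi in v]
    step = [eta * mh / (math.sqrt(vh) + eps_num) for mh, vh in zip(m_hat, v_hat)]

    # Step 4: normalize the update, add it, and project back onto the ball
    # (assumed here to be an L2 ball of radius eps_ball).
    step_norm = math.sqrt(sum(s * s for s in step))
    if step_norm > 0:
        step = [s / step_norm for s in step]
    delta = [d + s for d, s in zip(delta, step)]
    delta_norm = math.sqrt(sum(d * d for d in delta))
    if delta_norm > eps_ball:
        delta = [d * eps_ball / delta_norm for d in delta]
    return delta, m, v


# Toy usage: one step from a zero perturbation with a made-up gradient.
delta, m, v = mat_step([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], t=1)
```

Because the update is normalized before it is added, the learning rate η only affects the direction through the ε term in the denominator; the step length is fixed by the normalization and the projection radius.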

Slide 19

Slide 19 text

✓ CLIP
✓ ViT [Dosovitskiy+, ICLR21]
✓ transformer, [EOT]
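The [EOT] note can be illustrated with a small sketch. It assumes, as in CLIP's text encoder, that the sentence feature is the transformer output at the position of the [EOT] token; the helper name `eot_features`, the token ids, and the hidden states below are hypothetical toy values.

```python
def eot_features(token_ids, hidden_states, eot_id):
    """For each sentence, return the hidden state at its first [EOT] position."""
    features = []
    for ids, states in zip(token_ids, hidden_states):
        pos = ids.index(eot_id)  # position of [EOT] in this sentence
        features.append(states[pos])
    return features


# Toy batch: 2 sentences, 5 token positions, 1-dimensional hidden states.
EOT = 99  # hypothetical id for [EOT]; 0 is used as padding here
token_ids = [[1, 5, 6, EOT, 0],
             [1, 7, EOT, 0, 0]]
hidden_states = [[[0.0], [0.1], [0.2], [0.3], [0.4]],
                 [[1.0], [1.1], [1.2], [1.3], [1.4]]]
sentence_features = eot_features(token_ids, hidden_states, EOT)
```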

Slide 20

Slide 20 text

✓ Perceiver, CLIP
Figure label: CLIP Encoders

Slide 21

Slide 21 text

✓ CLIP Encoders, Perceiver

Slide 22

Slide 22 text

✓ REVERIE-fetch dataset
- REVERIE dataset
✓ REVERIE [Qi+, CVPR20]
- Matterport3D
https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif

Slide 23

Slide 23 text

✓ REVERIE-fetch dataset
- REVERIE dataset
✓ REVERIE [Qi+, CVPR20]
https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif

Slide 24

Slide 24 text

✓ REVERIE-fetch dataset
• REVERIE dataset

#Samples   Vocabulary size   Average sentence length
30532      2853              19.1

Training   Validation   Test
26808      2552         1172

“Look in the left wicker vase that is next to the potted plant”
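As a quick consistency check on the statistics above, the three splits can be summed and expressed as shares of the total. The variable names are illustrative only.

```python
# Split sizes as listed on the slide.
splits = {"training": 26808, "validation": 2552, "test": 1172}

total = sum(splits.values())  # 30532, which matches #Samples
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

for name, share in shares.items():
    print(f"{name}: {splits[name]} samples ({share}%)")
```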

Slide 25

Slide 25 text

• → TDP-MAT
“Go into the living room and give me the pillow on the couch nearest the plant”

Slide 26

Slide 26 text

• → TDP-MAT
✓ Bounding box
“Make haste to the office and fluff the pillow sitting on the left of the chair”

Slide 27

Slide 27 text

• Acc [%]

Condition                              Acc [%] ↑
Baseline: TDU [Ishikawa+, IROS21]      73.3 ± 0.485
Ours: TDP-MAT
  W/o MAT                              72.5 ± 3.55
  W/o MAT + Smaller learning rate      74.4 ± 0.831
  W/o CLIP & Perceiver                 74.1 ± 1.47
  W/o Pretraining                      73.1 ± 2.24
  Full                                 75.3 ± 0.691

+2.0 (Full vs. Baseline)

Slide 28

Slide 28 text

Condition                              Acc [%] ↑
Baseline: TDU [Ishikawa+, IROS21]      73.3 ± 0.485
Ours: TDP-MAT
  W/o MAT                              72.5 ± 3.55
  W/o MAT + Smaller learning rate      74.4 ± 0.831
  W/o CLIP & Perceiver                 74.1 ± 1.47
  W/o Pretraining                      73.1 ± 2.24
  Full                                 75.3 ± 0.691

+2.8 (Full vs. W/o MAT)
- Smaller learning rate: 1/8

Slide 29

Slide 29 text

Condition                              Acc [%] ↑
Baseline: TDU [Ishikawa+, IROS21]      73.3 ± 0.485
Ours: TDP-MAT
  W/o MAT                              72.5 ± 3.55
  W/o MAT + Smaller learning rate      74.4 ± 0.831
  W/o CLIP & Perceiver                 74.1 ± 1.47
  W/o Pretraining                      73.1 ± 2.24
  Full                                 75.3 ± 0.691

+1.2 (Full vs. W/o CLIP & Perceiver)
- CLIP Encoders, Perceiver Module
- Cross Attention

Slide 30

Slide 30 text

Condition                              Acc [%] ↑
Baseline: TDU [Ishikawa+, IROS21]      73.3 ± 0.485
Ours: TDP-MAT
  W/o MAT                              72.5 ± 3.55
  W/o MAT + Smaller learning rate      74.4 ± 0.831
  W/o CLIP & Perceiver                 74.1 ± 1.47
  W/o Pretraining                      73.1 ± 2.24
  Full                                 75.3 ± 0.691

+2.2 (Full vs. W/o Pretraining)
- TDU
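The gains highlighted on the four result slides (+2.0, +2.8, +1.2, +2.2) can be reproduced from the accuracy column. The sketch below uses only the mean accuracies from the table; the dictionary keys are shortened labels chosen for illustration.

```python
# Mean accuracy [%] per condition, copied from the table.
accuracy = {
    "Baseline: TDU": 73.3,
    "W/o MAT": 72.5,
    "W/o MAT + smaller learning rate": 74.4,
    "W/o CLIP & Perceiver": 74.1,
    "W/o pretraining": 73.1,
    "Full": 75.3,
}

# Gain of the full model over every other condition, in accuracy points.
gains = {
    name: round(accuracy["Full"] - acc, 1)
    for name, acc in accuracy.items()
    if name != "Full"
}

for name, gain in gains.items():
    print(f"Full vs. {name}: +{gain}")
```

The one comparison not highlighted on the slides, Full against W/o MAT with the smaller learning rate, comes out at +0.9 points.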

Slide 31

Slide 31 text

• MAT

Slide 32

Slide 32 text

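✓ $L$, $N$
$\mathbb{R}^{L \times D}$, $\mathbb{R}^{N \times E}$
$\mathbb{R}^{L \times D}, \mathbb{R}^{N \times D} \to \mathbb{R}^{L \times N}$
✓ $\mathbb{R}^{L \times D}$
$\mathbb{R}^{L \times D}, \mathbb{R}^{L \times D} \to \mathbb{R}^{L \times L}$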
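The shapes listed above can be checked with a small sketch. It assumes the two mappings describe attention score maps as in Perceiver: a latent array of shape L×D attending to an input of shape N×E (projected to N×D), and the latent array attending to itself. The projection matrix `w_proj`, the helper functions, and all sizes are hypothetical toy values, and softmax and scaling are omitted because only the shapes matter here.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def shape(a):
    return (len(a), len(a[0]))


L, N, D, E = 2, 5, 3, 4  # toy sizes; in practice L is much smaller than N

latent = [[0.1] * D for _ in range(L)]   # R^{L x D}
inputs = [[0.2] * E for _ in range(N)]   # R^{N x E}
w_proj = [[0.5] * D for _ in range(E)]   # hypothetical projection E -> D

keys = matmul(inputs, w_proj)                     # R^{N x D}
cross_scores = matmul(latent, transpose(keys))    # R^{L x N}
self_scores = matmul(latent, transpose(latent))   # R^{L x L}

print(shape(keys), shape(cross_scores), shape(self_scores))
```

With these shapes the cross-attention map grows as L·N and the latent self-attention map as L·L, so neither depends quadratically on the input length N.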