[RSJ22] TDP-MAT: Multimodal Language Comprehension for Object Manipulation Tasks via Real Images

More Decks by Semantic Machine Intelligence Lab., Keio Univ.

Transcript

  1. 1

  2. 2

  3. “Look in the left wicker vase that is next to the potted plant”

    Wicker vase
  4. “Look in the left wicker vase that is next to the potted plant”

    Wicker vase: Wicker vase, Wicker vase, Wicker vase
  5. “Look in the left wicker vase that is next to the potted plant”

    Key:
    Wicker vase: Wicker vase, Wicker vase, Wicker vase
  6. ✓ REVERIE-fetch

    “Look in the left wicker vase that is next to the potted plant”
  7. ✓ REVERIE-fetch: (Instruction), (Context Regions), (Candidate Region)

    “Look in the left wicker vase that is next to the potted plant”
  8. ✓ REVERIE-fetch: (Instruction), (Context Regions), (Candidate Region)

    “Look in the left wicker vase that is next to the potted plant”
  9. ✓ REVERIE-fetch: (Instruction), (Context Regions), (Candidate Region)

    “Look in the left wicker vase that is next to the potted plant”
  10. ✓ REVERIE-fetch: (Instruction), (Context Regions), (Candidate Region)

    “Look in the left wicker vase that is next to the potted plant”
    Faster R-CNN [Ren+, PAMI16]
  11. MTCM [Magassouba+, RA-L19]: VGG16, LSTM

    Target-dependent UNITER (TDU) [Ishikawa+, RA-L21]: UNITER [Chen+, ECCV20]
    REVERIE task / dataset [Qi+, CVPR20]: REVERIE
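  12. ✓ Update of the perturbation $\boldsymbol{\delta}_t$ (Input: $\boldsymbol{\delta}_t$, Output: $\boldsymbol{\delta}_{t+1}$)

    1. $E(\boldsymbol{\delta}) = \mathrm{CE}(f(\boldsymbol{x}), \boldsymbol{y})$, $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) = \dfrac{\partial E}{\partial \boldsymbol{\delta}}$
    2. $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) \to \boldsymbol{m}_t, \boldsymbol{v}_t$: $\boldsymbol{m}_t = \rho_1 \boldsymbol{m}_{t-1} + (1 - \rho_1) \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)$, $\boldsymbol{v}_t = \rho_2 \boldsymbol{v}_{t-1} + (1 - \rho_2) \left(\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)\right)^2$
    3. $\boldsymbol{m}_t, \boldsymbol{v}_t \to \Delta\boldsymbol{\delta}_t$: $\hat{\boldsymbol{m}}_t = \dfrac{\boldsymbol{m}_t}{1 - \rho_1^t}$, $\hat{\boldsymbol{v}}_t = \dfrac{\boldsymbol{v}_t}{1 - \rho_2^t}$, $\Delta\boldsymbol{\delta}_t = \eta \dfrac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon}$
    4. $\boldsymbol{\delta}_{t+1} = \Pi_{\|\boldsymbol{\delta}\| \le \epsilon}\left(\boldsymbol{\delta}_t + \dfrac{\Delta\boldsymbol{\delta}_t}{\|\Delta\boldsymbol{\delta}_t\|_F}\right)$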
  13. ✓ REVERIE-fetch dataset - REVERIE dataset ✓ REVERIE [Qi+, CVPR20]

    Matterport3D
    https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif
  14. ✓ REVERIE-fetch dataset - REVERIE dataset ✓ REVERIE [Qi+, CVPR20]

    https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif
  15. ✓ REVERIE-fetch dataset • REVERIE dataset

    | #Samples | Vocabulary size | Average sentence length |
    |---|---|---|
    | 30532 | 2853 | 19.1 |

    | Training | Validation | Test |
    |---|---|---|
    | 26808 | 2552 | 1172 |

    “Look in the left wicker vase that is next to the potted plant”
  16. “Go into the living room and give me the pillow on the couch nearest the plant”

    • → TDP-MAT
  17. • → TDP-MAT ✓ Bounding box

    “Make haste to the office and fluff the pillow sitting on the left of the chair”
  18. • Acc [%]

    | Condition | Acc [%] ↑ |
    |---|---|
    | Baseline: TDU [Ishikawa+, IROS21] | 73.3 ± 0.485 |
    | Ours (TDP-MAT): W/o MAT | 72.5 ± 3.55 |
    | Ours (TDP-MAT): W/o MAT + Smaller learning rate | 74.4 ± 0.831 |
    | Ours (TDP-MAT): W/o CLIP & Perceiver | 74.1 ± 1.47 |
    | Ours (TDP-MAT): W/o Pretraining | 73.1 ± 2.24 |
    | Ours (TDP-MAT): Full | 75.3 ± 0.691 |

    Full vs. Baseline (TDU): +2.0
  19. Same table. Full vs. W/o MAT: +2.8

    Smaller learning rate: 1/8
  20. Same table. Full vs. W/o CLIP & Perceiver: +1.2

    CLIP Encoders, Perceiver Module, Cross Attention
  21. Same table. Full vs. W/o Pretraining: +2.2

    TDU
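  22. ✓ $L$, $N$: $\mathbb{R}^{L \times D}$, $\mathbb{R}^{N \times E}$; $(\mathbb{R}^{L \times D}, \mathbb{R}^{N \times D}) \to \mathbb{R}^{L \times N}$

    ✓ $\mathbb{R}^{L \times D}$; $(\mathbb{R}^{L \times D}, \mathbb{R}^{L \times D}) \to \mathbb{R}^{L \times L}$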
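
The four-step perturbation update on slide 12 (gradient of the cross-entropy loss, first and second moment estimates, bias-corrected step, normalized update with projection) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The toy linear softmax classifier, the choice to add the perturbation to the input vector, the L2 ball used for the projection, the separate names `eps` and `radius` for the two ϵ symbols on the slide, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np


def ce_grad_wrt_delta(W, x, delta, y):
    """Gradient of the cross-entropy loss w.r.t. delta for a toy linear
    softmax classifier with logits = W @ (x + delta). The classifier is a
    hypothetical stand-in for the model f on the slide."""
    logits = W @ (x + delta)
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    p[y] -= 1.0                             # softmax(logits) - one_hot(y)
    return W.T @ p


def mat_step(delta, grad, m, v, t, eta=1e-3, rho1=0.9, rho2=0.999,
             eps=1e-8, radius=1.0):
    """One perturbation update following steps 2-4 of the slide.
    `radius` is the projection bound; `eps` only stabilizes the division."""
    # Step 2: moving averages of the gradient and of its elementwise square.
    m = rho1 * m + (1.0 - rho1) * grad
    v = rho2 * v + (1.0 - rho2) * grad ** 2
    # Step 3: bias correction, then the Adam-style step.
    m_hat = m / (1.0 - rho1 ** t)
    v_hat = v / (1.0 - rho2 ** t)
    step = eta * m_hat / (np.sqrt(v_hat) + eps)
    # Step 4: add the norm-normalized step, then project onto the ball
    # ||delta|| <= radius (assumed here to be an L2 ball).
    step_norm = np.linalg.norm(step)
    if step_norm > 0.0:
        delta = delta + step / step_norm
    delta_norm = np.linalg.norm(delta)
    if delta_norm > radius:
        delta = delta * (radius / delta_norm)
    return delta, m, v


# Toy usage: five updates on random data (t starts at 1 for bias correction).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
y = 1
delta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
for t in range(1, 6):
    grad = ce_grad_wrt_delta(W, x, delta, y)
    delta, m, v = mat_step(delta, grad, m, v, t, radius=0.5)
print(np.linalg.norm(delta))
```

Because the step is normalized before it is added, `eta` only affects the direction through the `eps` term in this sketch; the projection radius controls the perturbation size.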