Semantic Machine Intelligence Lab., Keio Univ.
September 05, 2022
[RSJ22] TDP-MAT: Multimodal Language Comprehension for Object Manipulation Tasks via Real Images
Transcript
3 ✓ https://www.toyota.com/usa/toyota-effect/romy-robot
4-6 ✓ Instruction: “Look in the left wicker vase that is next to the potted plant”
✓ Several objects in the scene carry the same label: Wicker vase
7-11 ✓ REVERIE-fetch
• (Instruction), (Context Regions), (Candidate Region)
• Instruction: “Look in the left wicker vase that is next to the potted plant”
• Faster R-CNN [Ren+, PAMI16]
12 ✓ MTCM [Magassouba+, RA-L19]: VGG16, LSTM
✓ Target-dependent UNITER (TDU) [Ishikawa+, RA-L21]: UNITER [Chen+, ECCV20]
✓ REVERIE task / dataset [Qi+, CVPR20]
13-17 • MAT [Ishikawa+, ICPR22] • CLIP [Radford+, ICML21] • Perceiver [Jaegle+, ICML21]
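18 ✓ Perturbation $\boldsymbol{\delta}_t$ (figure labels: Input, $\boldsymbol{\delta}_t$, Output)
1. Loss and its gradient with respect to the perturbation:
   $E(\boldsymbol{\delta}) = \mathrm{CE}(f(\boldsymbol{x}), \boldsymbol{y})$, $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) = \dfrac{\partial E}{\partial \boldsymbol{\delta}}$
2. $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) \to \boldsymbol{m}_t, \boldsymbol{v}_t$:
   $\boldsymbol{m}_t = \rho_1 \boldsymbol{m}_{t-1} + (1 - \rho_1) \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)$
   $\boldsymbol{v}_t = \rho_2 \boldsymbol{v}_{t-1} + (1 - \rho_2) \left( \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t) \right)^2$
3. $\boldsymbol{m}_t, \boldsymbol{v}_t \to \Delta\boldsymbol{\delta}_t$:
   $\hat{\boldsymbol{m}}_t = \dfrac{\boldsymbol{m}_t}{1 - \rho_1^t}$, $\hat{\boldsymbol{v}}_t = \dfrac{\boldsymbol{v}_t}{1 - \rho_2^t}$, $\Delta\boldsymbol{\delta}_t = \eta \dfrac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon}$
4. $\boldsymbol{\delta}_{t+1} = \Pi_{\|\boldsymbol{\delta}\| \le \epsilon} \left( \boldsymbol{\delta}_t + \dfrac{\Delta\boldsymbol{\delta}_t}{\|\Delta\boldsymbol{\delta}_t\|_F} \right)$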
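The update on slide 18 can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the function name `mat_update`, the toy constant gradient, and every hyperparameter value are assumptions, and the two uses of $\epsilon$ on the slide are split into `eps_adam` (denominator) and `radius` (projection bound) purely for readability.

```python
import numpy as np


def mat_update(delta, grad, m, v, t, eta=1e-2, rho1=0.9, rho2=0.999,
               eps_adam=1e-8, radius=0.5):
    """One perturbation update following steps 2-4 (all values illustrative)."""
    # Step 2: exponential moving averages of the gradient and its square.
    m = rho1 * m + (1.0 - rho1) * grad
    v = rho2 * v + (1.0 - rho2) * grad ** 2
    # Step 3: bias correction, then the moment-based step.
    m_hat = m / (1.0 - rho1 ** t)
    v_hat = v / (1.0 - rho2 ** t)
    step = eta * m_hat / (np.sqrt(v_hat) + eps_adam)
    # Step 4: add the Frobenius-normalized step, then project onto the norm ball.
    step_norm = np.linalg.norm(step)  # Frobenius norm for a 2-D array
    if step_norm > 0.0:
        delta = delta + step / step_norm
    delta_norm = np.linalg.norm(delta)
    if delta_norm > radius:
        delta = delta * (radius / delta_norm)
    return delta, m, v


# Toy stand-in for the gradient of E with respect to delta (assumed constant).
grad = np.array([[0.3, -1.2], [2.0, -0.1]])
delta = np.zeros_like(grad)
m = np.zeros_like(grad)
v = np.zeros_like(grad)
for t in range(1, 4):
    delta, m, v = mat_update(delta, grad, m, v, t)
```

In a real training loop the gradient would be recomputed from the model's loss at every step; the constant gradient here only keeps the sketch self-contained.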
19 ✓ CLIP ✓ Image encoder: ViT [Dosovitskiy+, ICLR21] ✓ Text encoder: transformer, [EOT]
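As background for the [EOT] label on slide 19: the publicly released CLIP text encoder reads the sentence feature from the position of the [EOT] token. The NumPy sketch below illustrates only that pooling step. The helper name `pool_at_eot`, the word token ids, and the random hidden states are made up for illustration, and the id 49407 is assumed to be the [EOT] id of CLIP's vocabulary.

```python
import numpy as np

EOT_ID = 49407  # assumed [EOT] id in CLIP's BPE vocabulary


def pool_at_eot(token_ids, hidden, eot_id=EOT_ID):
    """Pick, for each sequence, the hidden state at its first [EOT] position.

    token_ids: (batch, length) integer array
    hidden:    (batch, length, dim) array of per-token features
    Rows without an [EOT] token fall back to position 0 (argmax of all False).
    """
    eot_pos = (token_ids == eot_id).argmax(axis=1)
    return hidden[np.arange(hidden.shape[0]), eot_pos]


# Hypothetical token ids: start token, a few word ids, [EOT], then padding.
token_ids = np.array([
    [49406, 320, 1125, 49407, 0, 0],
    [49406, 320, 49407, 0, 0, 0],
])
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 6, 4))
pooled = pool_at_eot(token_ids, hidden)
```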
20-21 ✓ CLIP Encoders, Perceiver
22-23 ✓ REVERIE-fetch dataset
- REVERIE dataset
✓ REVERIE [Qi+, CVPR20]
- Matterport3D
- https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif
24 ✓ REVERIE-fetch dataset • REVERIE dataset
#Samples                   30532
Vocabulary size            2853
Average sentence length    19.1

Training    Validation    Test
26808       2552          1172

“Look in the left wicker vase that is next to the potted plant”
“Go into the living room and give me the pillow on the couch nearest the plant”
25 • → TDP-MAT
26 • → TDP-MAT ✓ Bounding box “Make haste to the office and fluff the pillow sitting on the left of the chair”
27-30 • Acc [%]
Condition                                Acc [%] ↑
Baseline: TDU [Ishikawa+, IROS21]        73.3 ± 0.485
Ours: TDP-MAT
  W/o MAT                                72.5 ± 3.55
  W/o MAT + Smaller learning rate        74.4 ± 0.831
  W/o CLIP & Perceiver                   74.1 ± 1.47
  W/o Pretraining                        73.1 ± 2.24
  Full                                   75.3 ± 0.691
27 Full vs. Baseline: +2.0
28 Full vs. W/o MAT: +2.8 - 5 - Smaller learning rate: 1/8
29 Full vs. W/o CLIP & Perceiver: +1.2 - CLIP Encoders, Perceiver Module - Cross Attention
30 Full vs. W/o Pretraining: +2.2 - TDU
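The gains highlighted on slides 27 to 30 are differences between the Full model and one other row of the table. A short check of that arithmetic, using only the accuracies listed above (the dictionary keys are shortened labels chosen for this sketch):

```python
# Mean accuracies [%] copied from the table on slides 27-30.
acc = {
    "Baseline (TDU)": 73.3,
    "W/o MAT": 72.5,
    "W/o MAT + smaller learning rate": 74.4,
    "W/o CLIP & Perceiver": 74.1,
    "W/o pretraining": 73.1,
    "Full": 75.3,
}

# Gain of the Full model over every other condition, in accuracy points.
gains = {
    name: round(acc["Full"] - value, 1)
    for name, value in acc.items()
    if name != "Full"
}

for name, gain in gains.items():
    print(f"Full vs. {name}: +{gain}")
```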
31 • MAT
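32 ✓ Sequence lengths $L$, $N$; inputs $\mathbb{R}^{L \times D}$, $\mathbb{R}^{N \times E}$: $\mathbb{R}^{L \times D}, \mathbb{R}^{N \times D} \to \mathbb{R}^{L \times N}$
✓ Input $\mathbb{R}^{L \times D}$: $\mathbb{R}^{L \times D}, \mathbb{R}^{L \times D} \to \mathbb{R}^{L \times L}$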
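The shapes on slide 32 match what scaled dot-product attention scores produce for two sequences of different length versus one sequence attending to itself. The NumPy sketch below only reproduces those shapes. The sizes, the random inputs, and the projection matrix `w_proj` that maps the second sequence from width E to width D are assumptions made for illustration.

```python
import numpy as np


def attention_scores(queries, keys):
    """Scaled dot-product scores: (rows_q, D) and (rows_k, D) -> (rows_q, rows_k)."""
    return queries @ keys.T / np.sqrt(queries.shape[-1])


L, N, D, E = 4, 16, 8, 12  # illustrative sizes only
rng = np.random.default_rng(0)

seq_a = rng.normal(size=(L, D))   # first sequence, in R^{L x D}
seq_b = rng.normal(size=(N, E))   # second sequence, in R^{N x E}
w_proj = rng.normal(size=(E, D))  # hypothetical projection from width E to D
seq_b_proj = seq_b @ w_proj       # now in R^{N x D}

cross_scores = attention_scores(seq_a, seq_b_proj)  # R^{L x N}
self_scores = attention_scores(seq_a, seq_a)        # R^{L x L}
```

When L is much smaller than N, the L x L map is cheaper to compute than an N x N one, which is the usual motivation for this kind of bottleneck.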
34 ✓ $8 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.99$
35 19 + 6 = 25