[RSJ22] TDP-MAT: Multimodal Language Comprehension for Object Manipulation Tasks via Real Images
Semantic Machine Intelligence Lab., Keio Univ.
September 05, 2022
Transcript
3 ✓ https://www.toyota.com/usa/toyota-effect/romy-robot
4-6 ✓ “Look in the left wicker vase that is next to the potted plant”
    Wicker vase (figure labels)
    ✓ Key
7-10 ✓ REVERIE-fetch
    • (Instruction) (Context Regions) (Candidate Region)
    “Look in the left wicker vase that is next to the potted plant”
11 ✓ REVERIE-fetch
    • (Instruction) (Context Regions) (Candidate Region)
    “Look in the left wicker vase that is next to the potted plant”
    Faster R-CNN [Ren+, PAMI16]
12 • MTCM [Magassouba+, RA-L19]: VGG16, LSTM
   • Target-dependent UNITER (TDU) [Ishikawa+, RA-L21]: UNITER [Chen+, ECCV20]
   • REVERIE task / dataset [Qi+, CVPR20]
13-17 • MAT [Ishikawa+, ICPR22]
      • CLIP [Radford+, ICML21]
      • Perceiver [Jaegle+, ICML21]
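18 ✓ $\boldsymbol{\delta}_t$ (figure labels: Input, $\boldsymbol{\delta}_t$, Output)
1. $E(\boldsymbol{\delta}) = \mathrm{CE}\left(f(\boldsymbol{x}), \boldsymbol{y}\right)$, $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) = \dfrac{\partial E}{\partial \boldsymbol{\delta}}$
2. $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) \to \boldsymbol{m}_t, \boldsymbol{v}_t$:
   $\boldsymbol{m}_t = \rho_1 \boldsymbol{m}_{t-1} + (1 - \rho_1)\, \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)$
   $\boldsymbol{v}_t = \rho_2 \boldsymbol{v}_{t-1} + (1 - \rho_2) \left(\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)\right)^2$
3. $\boldsymbol{m}_t, \boldsymbol{v}_t \to \Delta\boldsymbol{\delta}_t$:
   $\hat{\boldsymbol{m}}_t = \dfrac{\boldsymbol{m}_t}{1 - \rho_1^t}$, $\hat{\boldsymbol{v}}_t = \dfrac{\boldsymbol{v}_t}{1 - \rho_2^t}$
   $\Delta\boldsymbol{\delta}_t = \eta \dfrac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon}$
4. $\boldsymbol{\delta}_{t+1} = \Pi_{\|\boldsymbol{\delta}\| \le \epsilon}\left(\boldsymbol{\delta}_t + \dfrac{\Delta\boldsymbol{\delta}_t}{\|\Delta\boldsymbol{\delta}_t\|_F}\right)$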
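The four steps above resemble an Adam-style update applied to the perturbation rather than to the model weights. The following is a minimal pure-Python sketch of one such step, under stated assumptions: the function name `mat_step`, its default values, and the flat-list representation are illustrative and do not come from the slides; the square root in step 3 and the L2 norm used in the projection are assumed from the standard Adam and projected-gradient forms; and the slide's two uses of ε are split into `eps_num` (numerical stability) and `radius` (projection bound).

```python
import math


def mat_step(delta, grad, m, v, t, eta=1.0, rho1=0.9, rho2=0.999,
             eps_num=1e-8, radius=1.0):
    """One moment-based perturbation update on flat lists of floats.

    Hypothetical helper for illustration; t is the 1-based step index.
    """
    # Step 2: moving averages of the gradient and of the squared gradient.
    m = [rho1 * mi + (1 - rho1) * g for mi, g in zip(m, grad)]
    v = [rho2 * vi + (1 - rho2) * g * g for vi, g in zip(v, grad)]

    # Step 3: bias correction, then the Adam-style raw step.
    m_hat = [mi / (1 - rho1 ** t) for mi in m]
    v_hat = [vi / (1 - rho2 ** t) for vi in v]
    step = [eta * mh / (math.sqrt(vh) + eps_num)
            for mh, vh in zip(m_hat, v_hat)]

    # Step 4a: normalize the step by its Frobenius (here: L2) norm.
    step_norm = math.sqrt(sum(s * s for s in step))
    if step_norm > 0:
        step = [s / step_norm for s in step]
    new_delta = [d + s for d, s in zip(delta, step)]

    # Step 4b: project back onto the ball ||delta|| <= radius (L2 assumed).
    delta_norm = math.sqrt(sum(d * d for d in new_delta))
    if delta_norm > radius:
        new_delta = [d * radius / delta_norm for d in new_delta]
    return new_delta, m, v


# Usage: start from a zero perturbation and zero moments.
delta, m, v = mat_step([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], t=1,
                       radius=0.5)
```

Because the step is normalized before it is added, its length does not depend on the gradient scale; in this sketch only the projection radius bounds the size of the perturbation.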
19 ✓ CLIP ✓ ViT [Dosovitskiy+, ICLR21] ✓ transformer, [EOT]
20 ✓ Perceiver, CLIP, CLIP Encoders
21 ✓ CLIP Encoders, Perceiver
22-23 ✓ REVERIE-fetch dataset
      - REVERIE dataset
      ✓ REVERIE [Qi+, CVPR20], Matterport3D
      https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif
24 ✓ REVERIE-fetch dataset
   • REVERIE dataset

   #Samples                  30532
   Vocabulary size           2853
   Average sentence length   19.1

   Training   Validation   Test
   26808      2552         1172

   “Look in the left wicker vase that is next to the potted plant”
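As a quick consistency check on the dataset statistics above, the three split sizes should add up to the reported number of samples. The short sketch below does that arithmetic and derives the split fractions; the dictionary keys and variable names are illustrative only.

```python
# Split sizes as listed for the REVERIE-fetch dataset on slide 24.
splits = {"training": 26808, "validation": 2552, "test": 1172}

# The total should match the reported #Samples (30532).
total = sum(splits.values())

# Fraction of the dataset in each split.
fractions = {name: count / total for name, count in splits.items()}

for name, frac in fractions.items():
    print(f"{name}: {splits[name]} samples ({frac:.1%})")
```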
25 “Go into the living room and give me the pillow on the couch nearest the plant”
   • → TDP-MAT
26 “Make haste to the office and fluff the pillow sitting on the left of the chair”
   • → TDP-MAT ✓ Bounding box
27-30 • Acc [%]

   Condition                             Acc [%] ↑
   Baseline: TDU [Ishikawa+, IROS21]     73.3   0.485
   Ours: TDP-MAT
     W/o MAT                             72.5   3.55
     W/o MAT + Smaller learning rate     74.4   0.831
     W/o CLIP & Perceiver                74.1   1.47
     W/o Pretraining                     73.1   2.24
     Full                                75.3   0.691

   27 Full vs. Baseline: +2.0
   28 Full vs. W/o MAT: +2.8 (slide notes: 5; Smaller learning rate: 1/8)
   29 Full vs. W/o CLIP & Perceiver: +1.2 (slide notes: CLIP Encoders, Perceiver Module, Cross Attention)
   30 Full vs. W/o Pretraining: +2.2 (slide note: TDU)
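The gains highlighted on slides 27 to 30 can be reproduced from the accuracy column by subtracting each condition from the full model. The sketch below performs that subtraction; the dictionary keys are shortened labels chosen for illustration, and only the first number of each row is used.

```python
# Accuracy [%] per condition, copied from the results table.
acc = {
    "Baseline (TDU)": 73.3,
    "W/o MAT": 72.5,
    "W/o MAT + Smaller learning rate": 74.4,
    "W/o CLIP & Perceiver": 74.1,
    "W/o Pretraining": 73.1,
    "Full": 75.3,
}

full = acc["Full"]

# Gain of the full model over every other condition, rounded to one
# decimal place to match the precision reported on the slides.
gains = {cond: round(full - value, 1)
         for cond, value in acc.items() if cond != "Full"}

for cond, gain in gains.items():
    print(f"Full vs. {cond}: +{gain}")
```

The same subtraction also gives the one comparison the slides do not highlight, the full model against W/o MAT + Smaller learning rate.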
31 ✓ MAT
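32 ✓ $L$, $N$; $\mathbb{R}^{L \times D}$, $\mathbb{R}^{N \times E}$: $(\mathbb{R}^{L \times D}, \mathbb{R}^{N \times D}) \to \mathbb{R}^{L \times N}$
   ✓ $\mathbb{R}^{L \times D}$: $(\mathbb{R}^{L \times D}, \mathbb{R}^{L \times D}) \to \mathbb{R}^{L \times L}$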
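The shapes above read like the attention score maps of cross-attention (an $L \times D$ array attending to an $N \times D$ array) and self-attention (an $L \times D$ array attending to itself). That reading is an assumption, since the slide's explanatory text did not survive. Under it, the following pure-Python sketch shows how the two score shapes arise; the helper names and the toy sizes are illustrative only.

```python
def scores(queries, keys):
    """Return queries @ keys^T for row-major nested lists.

    queries: L x D, keys: N x D  ->  result: L x N
    """
    return [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in keys]
            for q_row in queries]


def shape(matrix):
    """Return (rows, columns) of a non-empty nested list."""
    return (len(matrix), len(matrix[0]))


# Toy sizes (hypothetical): L latent rows, N input rows, D channels.
L, N, D = 4, 50, 8
latents = [[0.01 * (i + j) for j in range(D)] for i in range(L)]
inputs = [[0.02 * (i - j) for j in range(D)] for i in range(N)]

# Cross-attention scores: every latent row against every input row.
cross_scores = scores(latents, inputs)

# Self-attention scores: every latent row against every latent row.
self_scores = scores(latents, latents)

print(shape(cross_scores), shape(self_scores))
```

With $L$ much smaller than $N$, the cross-attention map grows linearly in $N$ and the subsequent self-attention map does not depend on $N$ at all.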
34 ✓ $8 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.99$
35 ✓ 19+6=25