[RSJ22] TDP-MAT: Multimodal Language Comprehension for Object Manipulation Tasks via Real Images
Semantic Machine Intelligence Lab., Keio Univ.
September 05, 2022
Transcript
✓ https://www.toyota.com/usa/toyota-effect/romy-robot
✓ Instruction: “Look in the left wicker vase that is next to the potted plant”
✓ Image labels: Wicker vase (multiple instances)
✓ REVERIE-fetch
• Inputs: Instruction, Context Regions, Candidate Region
• Faster R-CNN [Ren+, PAMI16]
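As a rough illustration of the three inputs named above, the sketch below bundles an instruction, context regions and a candidate region into one sample. The class name `FetchSample`, the box format `(x1, y1, x2, y2)` and the helper `box_area` are hypothetical; the slides do not specify a data layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels. Not specified in the slides.
Box = Tuple[float, float, float, float]


@dataclass
class FetchSample:
    """One sample holding the three inputs named on the slide."""
    instruction: str            # natural-language instruction
    context_regions: List[Box]  # regions surrounding the candidate
    candidate_region: Box       # the region being judged


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes give 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


# Illustrative values only; the coordinates are made up.
sample = FetchSample(
    instruction="Look in the left wicker vase that is next to the potted plant",
    context_regions=[(0.0, 0.0, 50.0, 80.0), (60.0, 10.0, 120.0, 90.0)],
    candidate_region=(10.0, 20.0, 40.0, 70.0),
)
print(len(sample.context_regions), box_area(sample.candidate_region))
```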
✓ MTCM [Magassouba+, RA-L19]: VGG16, LSTM
✓ Target-dependent UNITER (TDU) [Ishikawa+, RA-L21]: UNITER [Chen+, ECCV20]
✓ REVERIE task / dataset [Qi+, CVPR20]
• MAT [Ishikawa+, ICPR22]
• CLIP [Radford+, ICML21]
• Perceiver [Jaegle+, ICML21]
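✓ MAT: update of the perturbation $\boldsymbol{\delta}_t$
1. Loss and its gradient:
   $E(\boldsymbol{\delta}) = \mathrm{CE}(f(\boldsymbol{x}), \boldsymbol{y}), \quad \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) = \frac{\partial E}{\partial \boldsymbol{\delta}}$
2. From $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta})$, compute $\boldsymbol{m}_t$ and $\boldsymbol{v}_t$:
   $\boldsymbol{m}_t = \rho_1 \boldsymbol{m}_{t-1} + (1 - \rho_1) \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)$
   $\boldsymbol{v}_t = \rho_2 \boldsymbol{v}_{t-1} + (1 - \rho_2) \left( \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t) \right)^2$
3. From $\boldsymbol{m}_t$ and $\boldsymbol{v}_t$, compute $\Delta\boldsymbol{\delta}_t$:
   $\hat{\boldsymbol{m}}_t = \frac{\boldsymbol{m}_t}{1 - \rho_1^t}, \quad \hat{\boldsymbol{v}}_t = \frac{\boldsymbol{v}_t}{1 - \rho_2^t}, \quad \Delta\boldsymbol{\delta}_t = \eta \frac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon}$
4. Update and project:
   $\boldsymbol{\delta}_{t+1} = \Pi_{\|\boldsymbol{\delta}\| \le \epsilon} \left( \boldsymbol{\delta}_t + \frac{\Delta\boldsymbol{\delta}_t}{\|\Delta\boldsymbol{\delta}_t\|_F} \right)$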
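The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the perturbation is flattened into a vector (so the Frobenius norm reduces to the Euclidean norm), that the square root in step 3 is elementwise, and that the projection simply rescales onto the norm ball. The slide writes both the stabilising constant and the ball radius as ε; here they are separated into the hypothetical parameters `eps_adam` and `radius`, and all default values are made up.

```python
import math
from typing import List, Tuple

Vector = List[float]


def l2_norm(v: Vector) -> float:
    """Euclidean norm; equals the Frobenius norm for a flattened array."""
    return math.sqrt(sum(x * x for x in v))


def project_to_ball(delta: Vector, radius: float) -> Vector:
    """Project delta onto {d : ||d|| <= radius} by rescaling if it lies outside."""
    n = l2_norm(delta)
    if n <= radius:
        return list(delta)
    return [x * radius / n for x in delta]


def mat_step(delta: Vector, grad: Vector, m: Vector, v: Vector, t: int,
             eta: float = 0.1, rho1: float = 0.9, rho2: float = 0.999,
             eps_adam: float = 1e-8, radius: float = 1.0
             ) -> Tuple[Vector, Vector, Vector]:
    """One perturbation update following steps 2-4 (t starts at 1)."""
    # Step 2: first and second moment estimates of the gradient.
    m = [rho1 * mi + (1 - rho1) * gi for mi, gi in zip(m, grad)]
    v = [rho2 * vi + (1 - rho2) * gi * gi for vi, gi in zip(v, grad)]
    # Step 3: bias correction and raw update.
    m_hat = [mi / (1 - rho1 ** t) for mi in m]
    v_hat = [vi / (1 - rho2 ** t) for vi in v]
    step = [eta * mh / (math.sqrt(vh) + eps_adam) for mh, vh in zip(m_hat, v_hat)]
    # Step 4: normalise the update, add it, and project back onto the ball.
    norm = l2_norm(step)
    if norm > 0.0:
        step = [s / norm for s in step]
    new_delta = project_to_ball([d + s for d, s in zip(delta, step)], radius)
    return new_delta, m, v


# Toy usage: a constant gradient stands in for dE/d(delta) of a real model.
grad = [3.0, -4.0]
delta, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 4):
    delta, m, v = mat_step(delta, grad, m, v, t)
print(delta, l2_norm(delta))
```

In a real setting `grad` would be recomputed from the model at every step; the constant gradient is only there to keep the example self-contained.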
✓ CLIP
✓ Image encoder: ViT [Dosovitskiy+, ICLR21]
✓ Text encoder: transformer; the output at the [EOT] token is used as the text feature
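To make the role of the [EOT] token concrete, the sketch below picks the hidden state at the [EOT] position as the sentence-level feature, which is how CLIP's text encoder pools its output. The function name `eot_feature`, the token ids and the hidden states are hypothetical toy values, not taken from the slides or from a real tokenizer.

```python
from typing import List


def eot_feature(token_ids: List[int], hidden_states: List[List[float]],
                eot_id: int) -> List[float]:
    """Return the hidden state at the first [EOT] position."""
    if len(token_ids) != len(hidden_states):
        raise ValueError("one hidden state per token is required")
    pos = token_ids.index(eot_id)  # raises ValueError if [EOT] is absent
    return hidden_states[pos]


# Toy sequence: [SOT]=1, two word tokens, [EOT]=2, then padding (0).
EOT_ID = 2  # hypothetical id
token_ids = [1, 7, 9, EOT_ID, 0, 0]
hidden_states = [[float(i), float(i * 10)] for i in range(len(token_ids))]
feature = eot_feature(token_ids, hidden_states, EOT_ID)
print(feature)
```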
✓ Model components: CLIP Encoders, Perceiver
✓ REVERIE-fetch dataset
• Built from the REVERIE dataset
✓ REVERIE [Qi+, CVPR20]
• Matterport3D
• https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif
✓ REVERIE-fetch dataset statistics

  #Samples                  30532
  Vocabulary size           2853
  Average sentence length   19.1

  Training   Validation   Test
  26808      2552         1172
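A quick arithmetic check of the split sizes quoted above: the three splits should add up to the total number of samples. The snippet also derives each split's share of the data; the percentages are computed here and do not appear on the slides.

```python
# Split sizes as listed in the dataset statistics.
splits = {"Training": 26808, "Validation": 2552, "Test": 1172}
reported_total = 30532

total = sum(splits.values())
# Share of each split in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total == reported_total)
print(shares)
```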
✓ “Go into the living room and give me the pillow on the couch nearest the plant”
• → TDP-MAT
✓ “Make haste to the office and fluff the pillow sitting on the left of the chair”
• → TDP-MAT
• Bounding box
• Acc [%]

  Condition                              Acc [%] ↑
  Baseline: TDU [Ishikawa+, IROS21]      73.3   0.485
  Ours: TDP-MAT
    W/o MAT                              72.5   3.55
    W/o MAT + Smaller learning rate      74.4   0.831
    W/o CLIP & Perceiver                 74.1   1.47
    W/o Pretraining                      73.1   2.24
    Full                                 75.3   0.691

• Full vs. Baseline: +2.0
• Full vs. W/o MAT: +2.8 (Smaller learning rate: 1/8)
• Full vs. W/o CLIP & Perceiver: +1.2 (CLIP Encoders, Perceiver Module, Cross Attention)
• Full vs. W/o Pretraining: +2.2 (TDU)
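The differences quoted alongside the accuracies can be reproduced from the first value of each row. The snippet below is only that subtraction; the dictionary keys are shortened condition names chosen for this example.

```python
# First accuracy value of each condition, as listed in the results.
acc = {
    "Baseline (TDU)": 73.3,
    "W/o MAT": 72.5,
    "W/o MAT + Smaller learning rate": 74.4,
    "W/o CLIP & Perceiver": 74.1,
    "W/o Pretraining": 73.1,
    "Full": 75.3,
}

full = acc["Full"]
# Gain of the full model over every other condition, in accuracy points.
gains = {name: round(full - a, 1) for name, a in acc.items() if name != "Full"}

for name, gain in gains.items():
    print(f"Full vs. {name}: +{gain}")
```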
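✓ Attention map sizes for arrays of length $L$ and $N$
• Cross attention: inputs $\mathbb{R}^{L \times D}$ and $\mathbb{R}^{N \times E}$; $\mathbb{R}^{L \times D}, \mathbb{R}^{N \times D} \to \mathbb{R}^{L \times N}$
• Self attention: input $\mathbb{R}^{L \times D}$; $\mathbb{R}^{L \times D}, \mathbb{R}^{L \times D} \to \mathbb{R}^{L \times L}$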
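The shape bookkeeping above can be checked with a small dependency-free sketch. It assumes the two mappings describe score matrices of the form $QK^\top$, that the length-$L$ array acts as the query, and that the $N \times E$ array is first projected to width $D$ by a weight matrix; the names `latent`, `inputs` and `w_key` and all sizes are hypothetical.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Matrix of the given shape filled with uniform random values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product: (r x k) times (k x c) -> (r x c)."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def attention_scores(q: Matrix, k: Matrix) -> Matrix:
    """Unnormalised scores Q K^T: (rows_q x D), (rows_k x D) -> (rows_q x rows_k)."""
    k_t = [list(col) for col in zip(*k)]
    return matmul(q, k_t)


def shape(m: Matrix) -> Tuple[int, int]:
    return len(m), len(m[0])


rng = random.Random(0)
L, N, D, E = 4, 10, 8, 6            # illustrative sizes only
latent = random_matrix(L, D, rng)   # R^{L x D}
inputs = random_matrix(N, E, rng)   # R^{N x E}
w_key = random_matrix(E, D, rng)    # hypothetical projection from E to D

keys = matmul(inputs, w_key)                   # R^{N x D}
cross = attention_scores(latent, keys)         # R^{L x N}
self_attn = attention_scores(latent, latent)   # R^{L x L}
print(shape(keys), shape(cross), shape(self_attn))
```

The cross-attention map grows with $L \cdot N$ while the self-attention map grows with $L^2$, which is why keeping $L$ small is attractive when $N$ is large.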
✓ $8 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.99$
✓ 19 + 6 = 25