[RSJ22] TDP-MAT: Multimodal Language Comprehension for Object Manipulation Tasks via Real Images
Semantic Machine Intelligence Lab., Keio Univ.
September 05, 2022
Transcript
https://www.toyota.com/usa/toyota-effect/romy-robot
Example instruction: “Look in the left wicker vase that is next to the potted plant”
Image labels: “Wicker vase” (marked on several objects in the scene)
REVERIE-fetch
• Instruction (e.g. “Look in the left wicker vase that is next to the potted plant”)
• Context Regions
• Candidate Region
• Faster R-CNN [Ren+, PAMI16]
• MTCM [Magassouba+, RA-L19]: VGG16, LSTM
• Target-dependent UNITER (TDU) [Ishikawa+, RA-L21]: UNITER [Chen+, ECCV20]
• REVERIE task / dataset [Qi+, CVPR20]
• MAT [Ishikawa+, ICPR22]
• CLIP [Radford+, ICML21]
• Perceiver [Jaegle+, ICML21]
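Input: $\boldsymbol{\delta}_t$
Output: $\boldsymbol{\delta}_{t+1}$
1. $E(\boldsymbol{\delta}) = \mathrm{CE}(f(\boldsymbol{x}), \boldsymbol{y})$, $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) = \dfrac{\partial E}{\partial \boldsymbol{\delta}}$
2. From $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta})$, compute $\boldsymbol{m}_t$ and $\boldsymbol{v}_t$:
   $\boldsymbol{m}_t = \rho_1 \boldsymbol{m}_{t-1} + (1 - \rho_1) \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)$
   $\boldsymbol{v}_t = \rho_2 \boldsymbol{v}_{t-1} + (1 - \rho_2) \left( \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t) \right)^2$
3. From $\boldsymbol{m}_t$ and $\boldsymbol{v}_t$, compute $\Delta\boldsymbol{\delta}_t$:
   $\hat{\boldsymbol{m}}_t = \dfrac{\boldsymbol{m}_t}{1 - \rho_1^t}, \quad \hat{\boldsymbol{v}}_t = \dfrac{\boldsymbol{v}_t}{1 - \rho_2^t}$
   $\Delta\boldsymbol{\delta}_t = \eta \dfrac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon}$
4. $\boldsymbol{\delta}_{t+1} = \Pi_{\|\boldsymbol{\delta}\| \le \epsilon} \left( \boldsymbol{\delta}_t + \dfrac{\Delta\boldsymbol{\delta}_t}{\|\Delta\boldsymbol{\delta}_t\|_F} \right)$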
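Steps 2 to 4 of the perturbation update above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mat_step`, all default values, the Adam-style square root in the denominator, and the separate names `eps` (numerical constant) and `radius` (projection bound) are assumptions made for illustration. The perturbation is stored as a flat list, so the Frobenius norm reduces to the Euclidean norm of the entries, and the gradient of $E$ is taken as a given input.

```python
import math
import random


def mat_step(delta, grad, m, v, t, rho1=0.9, rho2=0.999,
             eta=1e-2, eps=1e-8, radius=1.0):
    """One moment-based perturbation update; every default here is illustrative."""
    # Step 2: moving averages of the gradient and of its element-wise square.
    m = [rho1 * mi + (1.0 - rho1) * g for mi, g in zip(m, grad)]
    v = [rho2 * vi + (1.0 - rho2) * g * g for vi, g in zip(v, grad)]
    # Step 3: bias-corrected moments, then the raw update.
    m_hat = [mi / (1.0 - rho1 ** t) for mi in m]
    v_hat = [vi / (1.0 - rho2 ** t) for vi in v]
    step = [eta * mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]
    # Step 4: add the norm-normalised update, then project onto the norm ball.
    step_norm = math.sqrt(sum(s * s for s in step))
    if step_norm > 0.0:
        delta = [d + s / step_norm for d, s in zip(delta, step)]
    delta_norm = math.sqrt(sum(d * d for d in delta))
    if delta_norm > radius:
        delta = [d * radius / delta_norm for d in delta]
    return delta, m, v


# Toy run: random numbers stand in for the gradient of E w.r.t. delta.
random.seed(0)
delta, m, v = [0.0] * 8, [0.0] * 8, [0.0] * 8
for t in range(1, 4):
    grad = [random.gauss(0.0, 1.0) for _ in delta]
    delta, m, v = mat_step(delta, grad, m, v, t)
```

Because the update is normalised before it is added, the step length does not depend on the gradient magnitude; only the direction shaped by the two moments matters.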
CLIP
• ViT [Dosovitskiy+, ICLR21]
• transformer, [EOT]
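In CLIP's text encoder, the sentence-level feature is read from the transformer output at the [EOT] token position. The pooling step might look like the sketch below, which assumes the per-token hidden states are already computed. The names `EOT_ID` and `pool_at_eot`, the token id 99, and the toy values are hypothetical and do not come from the slides.

```python
EOT_ID = 99  # hypothetical [EOT] token id, for illustration only


def pool_at_eot(token_ids, hidden):
    """Return, for each sequence, the hidden state at its first [EOT] token.

    token_ids: list of token-id lists, one per sequence.
    hidden:    matching nested list of per-token feature vectors.
    Raises ValueError if a sequence contains no [EOT] token.
    """
    return [states[ids.index(EOT_ID)] for ids, states in zip(token_ids, hidden)]


token_ids = [[5, 7, 99, 0], [3, 99, 0, 0]]
# Toy hidden states: the value encodes (sequence, position, feature index).
hidden = [[[float(b * 100 + t * 10 + k) for k in range(3)]
           for t in range(4)] for b in range(2)]
pooled = pool_at_eot(token_ids, hidden)
```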
CLIP Encoders, Perceiver
REVERIE-fetch dataset
• REVERIE dataset
• REVERIE [Qi+, CVPR20]: Matterport3D
• https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif
REVERIE-fetch dataset

#Samples   Vocabulary size   Average sentence length
30532      2853              19.1

Training   Validation   Test
26808      2552         1172

Example instructions (TDP-MAT):
• “Look in the left wicker vase that is next to the potted plant”
• “Go into the living room and give me the pillow on the couch nearest the plant”
• “Make haste to the office and fluff the pillow sitting on the left of the chair” (Bounding box)
Condition                              Acc [%] ↑
Baseline: TDU [Ishikawa+, IROS21]      73.3 ± 0.485
Ours: TDP-MAT
  W/o MAT                              72.5 ± 3.55
  W/o MAT + Smaller learning rate      74.4 ± 0.831
  W/o CLIP & Perceiver                 74.1 ± 1.47
  W/o Pretraining                      73.1 ± 2.24
  Full                                 75.3 ± 0.691

• Full vs. Baseline: +2.0
• Full vs. W/o MAT: +2.8 (Smaller learning rate: 1/8)
• Full vs. W/o CLIP & Perceiver: +1.2 (CLIP Encoders, Perceiver Module, Cross Attention)
• Full vs. W/o Pretraining: +2.2 (TDU)
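The gains annotated on the results slides (+2.0, +2.8, +1.2, +2.2) are the differences between the Full model's accuracy and each compared condition. A short check that recomputes them from the reported accuracies; the dictionary keys are shortened labels chosen for this sketch.

```python
# Accuracies [%] as reported in the results table.
acc = {
    "Baseline: TDU": 73.3,
    "W/o MAT": 72.5,
    "W/o MAT + Smaller learning rate": 74.4,
    "W/o CLIP & Perceiver": 74.1,
    "W/o Pretraining": 73.1,
    "Full": 75.3,
}

# Gain of the full model over every other condition, rounded to one decimal.
gains = {
    name: round(acc["Full"] - value, 1)
    for name, value in acc.items()
    if name != "Full"
}

for name, gain in gains.items():
    print(f"Full vs. {name}: +{gain}")
```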
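Sequence lengths $L$, $N$
• Inputs $\mathbb{R}^{L \times D}$ and $\mathbb{R}^{N \times E}$: $\mathbb{R}^{L \times D}, \mathbb{R}^{N \times D} \to \mathbb{R}^{L \times N}$
• Input $\mathbb{R}^{L \times D}$: $\mathbb{R}^{L \times D}, \mathbb{R}^{L \times D} \to \mathbb{R}^{L \times L}$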
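The two shape relations above match the attention maps produced when queries and keys come from different sequences (an L×N map) or from the same sequence (an L×L map). Reading them as cross-attention and self-attention is an interpretation of the slide, not something its surviving text states. The sketch below checks the shapes with scaled dot-product attention; the toy sizes, the random inputs, and the projection `w` from width E to width D are assumptions for illustration.

```python
import math
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights; each row sums to 1."""
    d = len(queries[0])
    rows = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


random.seed(0)
L, N, D, E = 6, 10, 4, 5  # toy sizes, chosen only for illustration
x = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(L)]  # L x D
y = [[random.gauss(0.0, 1.0) for _ in range(E)] for _ in range(N)]  # N x E
w = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(E)]  # E x D (hypothetical)

cross_map = attention_weights(x, matmul(y, w))  # (L x D), (N x D) -> L x N
self_map = attention_weights(x, x)              # (L x D), (L x D) -> L x L
```

The size of the map is what differs between the two cases: it grows as L×N in the first and as L×L in the second.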
$8 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.99$
19 + 6 = 25