Semantic Machine Intelligence Lab., Keio Univ.
September 05, 2022
[RSJ22] TDP-MAT: Multimodal Language Comprehension for Object Manipulation Tasks via Real Images
Transcript
1
2
3 ✓ https://www.toyota.com/usa/toyota-effect/romy-robot
4 ✓ “Look in the left wicker vase that is next to the potted plant” [figure label: Wicker vase]
5 ✓ “Look in the left wicker vase that is next to the potted plant” [figure labels: Wicker vase (×4)]
6 ✓ Key: “Look in the left wicker vase that is next to the potted plant” [figure labels: Wicker vase (×4)]
7 ✓ REVERIE-fetch • “Look in the left wicker vase that is next to the potted plant”
8-10 ✓ REVERIE-fetch • (Instruction), (Context Regions), (Candidate Region) • “Look in the left wicker vase that is next to the potted plant”
11 ✓ REVERIE-fetch • (Instruction), (Context Regions), (Candidate Region) • “Look in the left wicker vase that is next to the potted plant” Faster R-CNN [Ren+, PAMI16]
12
• MTCM [Magassouba+, RA-L19]: VGG16, LSTM
• Target-dependent UNITER (TDU) [Ishikawa+, RA-L21]: UNITER [Chen+, ECCV20]
• REVERIE task / dataset [Qi+, CVPR20]: REVERIE
13-17
• MAT [Ishikawa+, ICPR22]
• CLIP [Radford+, ICML21]
• Perceiver [Jaegle+, ICML21]
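18 ✓ $\boldsymbol{\delta}_t$ [figure: Input, $\boldsymbol{\delta}_t$, Output]
1. $E(\boldsymbol{\delta}) = \mathrm{CE}(f(\boldsymbol{x}), \boldsymbol{y})$, $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}) = \dfrac{\partial E}{\partial \boldsymbol{\delta}}$
2. From $\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta})$, compute $\boldsymbol{m}_t$ and $\boldsymbol{v}_t$:
   $\boldsymbol{m}_t = \rho_1 \boldsymbol{m}_{t-1} + (1 - \rho_1) \nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)$
   $\boldsymbol{v}_t = \rho_2 \boldsymbol{v}_{t-1} + (1 - \rho_2) \left(\nabla_{\boldsymbol{\delta}} E(\boldsymbol{\delta}_t)\right)^2$
3. From $\boldsymbol{m}_t$ and $\boldsymbol{v}_t$, compute $\Delta\boldsymbol{\delta}_t$:
   $\hat{\boldsymbol{m}}_t = \dfrac{\boldsymbol{m}_t}{1 - \rho_1^t}$, $\hat{\boldsymbol{v}}_t = \dfrac{\boldsymbol{v}_t}{1 - \rho_2^t}$
   $\Delta\boldsymbol{\delta}_t = \eta \dfrac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon}$
4. $\boldsymbol{\delta}_{t+1} = \Pi_{\|\boldsymbol{\delta}\| \le \epsilon}\left(\boldsymbol{\delta}_t + \dfrac{\Delta\boldsymbol{\delta}_t}{\|\Delta\boldsymbol{\delta}_t\|_F}\right)$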
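The four steps on slide 18 can be sketched in NumPy as a single update of the perturbation. This is a minimal illustration and not the authors' implementation: the function name `mat_step`, the default values, and the choice of an L2 (Frobenius) ball for the projection are assumptions, and the slide's single symbol $\epsilon$ is split here into `eps_adam` (denominator term) and `radius` (projection bound) purely for readability. The gradient is passed in as an argument because the model $f$ is not available in the transcript.

```python
import numpy as np

def mat_step(delta, grad, m, v, t, eta=0.1, rho1=0.9, rho2=0.999,
             eps_adam=1e-8, radius=1.0):
    """One moment-based update of the perturbation (steps 2-4 of slide 18).

    All default values are hypothetical; the slide does not state them.
    """
    # Step 2: exponential moving averages of the gradient and its square.
    m = rho1 * m + (1.0 - rho1) * grad
    v = rho2 * v + (1.0 - rho2) * grad ** 2
    # Step 3: bias correction, then the Adam-style update direction.
    m_hat = m / (1.0 - rho1 ** t)
    v_hat = v / (1.0 - rho2 ** t)
    step = eta * m_hat / (np.sqrt(v_hat) + eps_adam)
    # Step 4: add the step normalized by its Frobenius norm ...
    step_norm = np.linalg.norm(step)
    if step_norm > 0.0:
        delta = delta + step / step_norm
    # ... and project back onto the ball ||delta|| <= radius (assumed L2 ball).
    delta_norm = np.linalg.norm(delta)
    if delta_norm > radius:
        delta = delta * (radius / delta_norm)
    return delta, m, v

# Toy usage with a constant gradient standing in for dE/d(delta).
delta0 = np.zeros((2, 3))
grad0 = np.ones((2, 3))
delta1, m1, v1 = mat_step(delta0, grad0, np.zeros((2, 3)), np.zeros((2, 3)),
                          t=1, radius=0.5)
```

Because the step is divided by its own norm before being added, the size of each move is fixed and only the direction depends on the moment estimates; the projection then keeps the perturbation bounded.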
19 ✓ CLIP ✓ ViT [Dosovitskiy+, ICLR21] ✓ transformer, [EOT] [figure label: [EOT]]
20 ✓ Perceiver, CLIP [figure label: CLIP Encoders]
21 ✓ CLIP Encoders, Perceiver
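Slide 19 mentions a transformer and the [EOT] token. In the released CLIP text encoder, the sentence-level feature is read out at the position of the end-of-text token; whether this deck uses it in exactly that way is an assumption. The NumPy sketch below shows only that readout step, with a hypothetical function name (`pool_at_eot`) and made-up token ids.

```python
import numpy as np

def pool_at_eot(token_features, token_ids, eot_id):
    """Select, per sequence, the feature vector at the first [EOT] position.

    token_features: array of shape (batch, seq_len, dim)
    token_ids:      array of shape (batch, seq_len)
    eot_id:         integer id of the [EOT] token (hypothetical value below)
    """
    # argmax over a boolean mask returns the first index where it is True.
    eot_pos = np.argmax(token_ids == eot_id, axis=1)
    return token_features[np.arange(token_ids.shape[0]), eot_pos]

# Toy usage: two sequences of length 4 with 3-dimensional token features.
features = np.arange(2 * 4 * 3).reshape(2, 4, 3)
ids = np.array([[5, 7, 99, 0],
                [5, 99, 0, 0]])
pooled = pool_at_eot(features, ids, eot_id=99)
```

The result has one row per sequence, so it can be used as a fixed-size sentence feature regardless of instruction length.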
22 ✓ REVERIE-fetch dataset
- REVERIE dataset
✓ REVERIE [Qi+, CVPR20]
- Matterport3D
https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif
23 ✓ REVERIE-fetch dataset
- REVERIE dataset
✓ REVERIE [Qi+, CVPR20]
- REVERIE
https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif
24 ✓ REVERIE-fetch dataset
• REVERIE dataset

#Samples    Vocabulary size    Average sentence length
30532       2853               19.1

Training    Validation    Test
26808       2552          1172

“Look in the left wicker vase that is next to the potted plant”
25 “Go into the living room and give me the pillow on the couch nearest the plant” • → TDP-MAT
26 ✓ Bounding box • “Make haste to the office and fluff the pillow sitting on the left of the chair” • → TDP-MAT
27 • Acc [%]:

Condition                                Acc [%] ↑
Baseline: TDU [Ishikawa+, IROS21]        73.3    0.485
Ours: TDP-MAT
  W/o MAT                                72.5    3.55
  W/o MAT + Smaller learning rate        74.4    0.831
  W/o CLIP & Perceiver                   74.1    1.47
  W/o Pretraining                        73.1    2.24
  Full                                   75.3    0.691

Full vs. Baseline: +2.0

28 Same results table. Full vs. W/o MAT: +2.8
- 5
- Smaller learning rate: 1/8

29 Same results table. Full vs. W/o CLIP & Perceiver: +1.2
- CLIP Encoders, Perceiver Module
- Cross Attention

30 Same results table. Full vs. W/o Pretraining: +2.2
- TDU
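The highlighted gains on slides 27-30 (+2.0, +2.8, +1.2, +2.2) are consistent with subtracting each condition's accuracy from that of the full model. The short Python check below recomputes them from the accuracy values in the table; the dictionary keys and the helper name `gain_over` are illustrative labels, not names from the source.

```python
# Accuracy [%] per condition, copied from the results table.
results = {
    "Baseline (TDU)": 73.3,
    "W/o MAT": 72.5,
    "W/o MAT + Smaller learning rate": 74.4,
    "W/o CLIP & Perceiver": 74.1,
    "W/o Pretraining": 73.1,
    "Full": 75.3,
}

def gain_over(condition):
    """Accuracy gain of the full model over the given condition, in points."""
    return round(results["Full"] - results[condition], 1)

# Gain of the full model over every other condition.
gains = {name: gain_over(name) for name in results if name != "Full"}

for name, gain in gains.items():
    print(f"Full vs. {name}: +{gain}")
```

Read this way, removing MAT without adjusting the learning rate costs the most accuracy (2.8 points), while the variant with a smaller learning rate is the closest to the full model.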
31 • MAT
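32 ✓ $L$, $N$: $\mathbb{R}^{L \times D}$, $\mathbb{R}^{N \times E}$
✓ $(\mathbb{R}^{L \times D}, \mathbb{R}^{N \times D}) \to \mathbb{R}^{L \times N}$
✓ $\mathbb{R}^{L \times D}$: $(\mathbb{R}^{L \times D}, \mathbb{R}^{L \times D}) \to \mathbb{R}^{L \times L}$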
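The shapes on slide 32 match what one would expect from Perceiver-style attention: a latent array of length $L$ attending to an input of length $N$ gives an $L \times N$ map, and the latent attending to itself gives an $L \times L$ map. Reading the slide this way is an assumption, since its labels did not survive extraction. The NumPy sketch below only demonstrates the shapes; the sizes, the random data, and the projection matrix `W_k` (mapping the input from width $E$ to $D$) are hypothetical.

```python
import numpy as np

def attention_map(queries, keys):
    """Row-wise softmax of scaled dot products between queries and keys."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

L, N, D, E = 8, 50, 16, 32           # hypothetical sizes
rng = np.random.default_rng(0)
latent = rng.normal(size=(L, D))     # latent array, R^{L x D}
inputs = rng.normal(size=(N, E))     # input array,  R^{N x E}
W_k = rng.normal(size=(E, D))        # hypothetical key projection, E -> D

cross_map = attention_map(latent, inputs @ W_k)  # (L, D), (N, D) -> (L, N)
self_map = attention_map(latent, latent)         # (L, D), (L, D) -> (L, L)
```

With $L$ much smaller than $N$, the cross-attention map grows linearly in the input length, and the subsequent self-attention cost no longer depends on $N$ at all.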
33
34 ✓ $8 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.99$
35 19+6=25