Kei Moriyama
January 08, 2024
[Reading group] Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
Transcript
Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer @1/10
Paper information: CVPR 2022
Paper overview: Vision Transformer, Attention
Position of this research: ViT
Proposal of this research: ViT, CTM
Proposal of this research: MTA Head
Proposed method 1: Clustering-based Token Merge (CTM) Block
Proposed method 1: Clustering in the CTM Block
Density peaks clustering is used, with a local density \rho_i and a distance \delta_i computed for each token x_i:
\rho_i = \exp\left( -\frac{1}{k} \sum_{x_j \in \mathrm{KNN}(x_i)} \lVert x_i - x_j \rVert_2^2 \right)
\delta_i = \begin{cases} \min_{j:\rho_j > \rho_i} \lVert x_i - x_j \rVert_2 & \text{if } \exists j \text{ s.t. } \rho_j > \rho_i \\ \max_j \lVert x_i - x_j \rVert_2 & \text{otherwise} \end{cases}
Tokens with a large score \rho_i \times \delta_i are taken as cluster centers.
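The density and distance definitions on this slide can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the paper's implementation: the function name `dpc_knn`, the choice to exclude a token from its own k nearest neighbours, and the final step that assigns each token to its nearest center are all assumptions made for the example.

```python
import math

def sq_dist(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dpc_knn(tokens, k, num_clusters):
    # Hypothetical helper: density-peaks clustering with k-nearest neighbours.
    n = len(tokens)
    d2 = [[sq_dist(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]

    # rho_i = exp(-(1/k) * sum of squared distances to the k nearest neighbours).
    # Assumption: the token itself is not counted among its neighbours.
    rho = []
    for i in range(n):
        nearest = sorted(d2[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(nearest) / k))

    # delta_i = distance to the closest token with higher density,
    # or the largest distance to any token if no denser token exists.
    delta = []
    for i in range(n):
        higher = [math.sqrt(d2[i][j]) for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(math.sqrt(v) for v in d2[i]))

    # Tokens with the largest rho_i * delta_i become cluster centers.
    score = [r * d for r, d in zip(rho, delta)]
    centers = sorted(range(n), key=lambda i: score[i], reverse=True)[:num_clusters]

    # Assumption: every token joins the cluster of its nearest center.
    assign = [min(range(len(centers)), key=lambda c: d2[i][centers[c]])
              for i in range(n)]
    return rho, delta, centers, assign

tokens = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
rho, delta, centers, assign = dpc_knn(tokens, k=2, num_clusters=2)
print(centers, assign)
```

On this toy input the two well-separated groups each contribute one center, because their densest token has no denser neighbour nearby and therefore receives a large \delta_i.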
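Proposed method 1: Merging features in the CTM Block
y_i = \frac{\sum_{j \in C_i} e^{p_j} x_j}{\sum_{j \in C_i} e^{p_j}}
p_j: importance score of token j. C_i: set of tokens in cluster i. y_i: merged token feature. The merged tokens are used as the Query in Attention.
Reference: Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inform. Process. Syst., 2021.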
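The weighted average y_i on this slide is a softmax over the scores p_j within each cluster. A minimal sketch, assuming the cluster assignment is given as a list of cluster indices and that every cluster has at least one member; the function name `merge_tokens` and the toy scores are hypothetical and used only for illustration.

```python
import math

def merge_tokens(tokens, scores, assign, num_clusters):
    # Hypothetical helper: y_i = sum_j e^{p_j} x_j / sum_j e^{p_j} over j in C_i.
    dim = len(tokens[0])
    merged = []
    for c in range(num_clusters):
        members = [j for j, a in enumerate(assign) if a == c]
        # Subtracting the maximum score leaves the ratio unchanged
        # and avoids overflow in exp().
        top = max(scores[j] for j in members)
        weights = [math.exp(scores[j] - top) for j in members]
        total = sum(weights)
        merged.append([
            sum(w * tokens[j][d] for w, j in zip(weights, members)) / total
            for d in range(dim)
        ])
    return merged

tokens = [[0.0], [2.0], [10.0]]
scores = [0.0, math.log(3.0), 5.0]   # toy importance scores p_j
print(merge_tokens(tokens, scores, assign=[0, 0, 1], num_clusters=2))
```

With these toy scores the second token carries three times the weight of the first, so the merged feature of cluster 0 is pulled toward it.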
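Proposed method 1: Attention computation after the CTM Block
The tokens produced by CTM are used as the Query. K and V are reduced in size with Spatial Reduction.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^{T}}{\sqrt{d_k}} + P \right) V
P is added inside the softmax of Attention.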
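To make the role of P concrete, here is a small single-head sketch of the formula above in plain Python. It assumes P holds one scalar per key token and is broadcast over all queries; that shape, the function names, and the toy inputs are assumptions for the example, and Spatial Reduction is not modelled.

```python
import math

def softmax(values):
    # Numerically stable softmax over a list of floats.
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_score(Q, K, V, P):
    # Hypothetical helper: softmax(Q K^T / sqrt(d_k) + P) V for one head.
    # Assumption: P[j] is a scalar added to the logit of key j for every query.
    d_k = len(K[0])
    out = []
    for q in Q:
        logits = [
            sum(a * b for a, b in zip(q, key)) / math.sqrt(d_k) + p
            for key, p in zip(K, P)
        ]
        weights = softmax(logits)
        out.append([
            sum(w * v[d] for w, v in zip(weights, V))
            for d in range(len(V[0]))
        ])
    return out

Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [3.0]]
print(attention_with_score(Q, K, V, P=[0.0, 0.0]))
print(attention_with_score(Q, K, V, P=[0.0, math.log(3.0)]))
```

With P set to zero both keys receive equal weight. Raising the second entry of P shifts the output toward the second value vector, which is how a larger P lets a token contribute more to the result.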
Proposed method 2: Multi-stage Token Aggregation Head
ViT
Transformer: Stage 4, Stage 3, Stage 2, Stage 1
Transformer: Upsample
Summary of the proposed methods: 1, 2
Experiments: 3D
Experimental results: pose estimation task
Experimental results: pose estimation task (CTM, MTA Head)
Other tasks
Experiment: comparison of the generated tokens
Summary and impressions: Human-centric