[Reading group] Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
Kei Moriyama
January 08, 2024
Transcript
Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer (@1/10)
Paper information: CVPR 2022
Paper overview: Vision Transformer, Attention
Positioning of this work: ViT
Proposal of this work: ViT, CTM
Proposal of this work: MTA Head
Proposed method 1: Clustering-based Token Merge (CTM) Block
Proposed method 1: Clustering in the CTM Block. Density peaks clustering is used. For each token $x_i$, a local density $\rho_i$ and a distance indicator $\delta_i$ are computed:
$$\rho_i = \exp\left(-\frac{1}{k}\sum_{x_j \in \mathrm{KNN}(x_i)} \lVert x_i - x_j \rVert_2^2\right)$$
$$\delta_i = \begin{cases} \min_{j:\,\rho_j > \rho_i} \lVert x_i - x_j \rVert_2 & \text{if } \exists j \text{ s.t. } \rho_j > \rho_i \\ \max_j \lVert x_i - x_j \rVert_2 & \text{otherwise} \end{cases}$$
Tokens with a high score $\rho_i \times \delta_i$ are selected as cluster centers.
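The density peaks step can be sketched in plain Python to make the two quantities concrete. This is a minimal illustration, not the authors' implementation: the function name `dpc_knn_centers`, the choice to exclude a token from its own k-nearest-neighbour set, and the final nearest-center assignment are assumptions made for this example.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dpc_knn_centers(tokens, k, num_centers):
    """Hypothetical helper: score tokens by rho_i * delta_i and pick centers."""
    n = len(tokens)
    d2 = [[sq_dist(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]

    # Local density rho_i: exp of minus the mean squared distance to the
    # k nearest neighbours (assumption: the token itself is excluded).
    rho = []
    for i in range(n):
        nearest = sorted(d2[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(nearest) / k))

    # Distance indicator delta_i: distance to the closest token of higher
    # density, or the largest distance if no denser token exists.
    delta = []
    for i in range(n):
        higher = [math.sqrt(d2[i][j]) for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(math.sqrt(d) for d in d2[i]))

    # Tokens with the highest rho_i * delta_i become cluster centers.
    scores = [r * d for r, d in zip(rho, delta)]
    centers = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_centers]

    # Assumption: every token joins the cluster of its nearest center.
    assignment = [min(centers, key=lambda c: d2[i][c]) for i in range(n)]
    return rho, delta, centers, assignment

tokens = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
rho, delta, centers, assignment = dpc_knn_centers(tokens, k=2, num_centers=2)
```

On this toy input the two tight groups each contribute one center, because only the densest token in each group is far from any denser token.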
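Proposed method 1: Feature merging in the CTM Block
$$y_i = \frac{\sum_{j \in C_i} e^{p_j} x_j}{\sum_{j \in C_i} e^{p_j}}$$
Here $p_j$ is the importance score of token $j$, $C_i$ is the set of tokens in the $i$-th cluster, and $y_i$ is the merged token feature. Keywords: Query, Attention.
Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inform. Process. Syst., 2021.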
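The merging formula is a softmax-weighted average over the tokens of one cluster. A minimal sketch follows, assuming the importance scores $p_j$ are already given as plain floats; the helper name `merge_cluster` is hypothetical.

```python
import math

def merge_cluster(features, scores):
    """Merge one cluster: y = sum_j e^{p_j} x_j / sum_j e^{p_j}.

    features: list of feature vectors x_j for the tokens in the cluster.
    scores:   list of importance scores p_j, one per token.
    """
    # Subtracting the maximum score leaves the ratio unchanged and
    # avoids overflow in exp.
    m = max(scores)
    weights = [math.exp(p - m) for p in scores]
    total = sum(weights)
    dim = len(features[0])
    return [sum(w * x[d] for w, x in zip(weights, features)) / total
            for d in range(dim)]

equal = merge_cluster([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
skewed = merge_cluster([[2.0], [4.0]], [0.0, math.log(3)])
```

With equal scores the result is the plain mean of the cluster; a token whose score is higher by $\log 3$ receives three times the weight.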
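Proposed method 1: Attention computation after the CTM Block. The tokens merged by CTM are used as the Query; K and V are made smaller with Spatial Reduction.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}} + P\right)V$$
$P$ is the token importance score added to the attention logits.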
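The effect of adding $P$ inside the softmax can be seen in a small pure-Python sketch. It assumes $P$ holds one score per key token and is shared across all queries; the function names are hypothetical and the spatial reduction step is omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [v / s for v in exps]

def biased_attention(Q, K, V, P):
    """softmax(Q K^T / sqrt(d_k) + P) V, with P[j] added to the logit of key j."""
    d_k = len(K[0])
    out = []
    for q in Q:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + p
                  for k, p in zip(K, P)]
        w = softmax(logits)
        out.append([sum(wi * v[d] for wi, v in zip(w, V))
                    for d in range(len(V[0]))])
    return out

Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [3.0]]
unbiased = biased_attention(Q, K, V, [0.0, 0.0])
biased = biased_attention(Q, K, V, [0.0, math.log(3)])
```

With a zero query the dot products vanish, so the output depends only on $P$: equal scores average the two values, while raising one key's score by $\log 3$ shifts three quarters of the weight onto it.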
Proposed method 2: Multi-stage Token Aggregation Head: ViT
Proposed method 2: Multi-stage Token Aggregation Head: Transformer, Stage 4, Stage 3, Stage 2, Stage 1
Proposed method 2: Multi-stage Token Aggregation Head: Transformer, Upsample
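One way to picture the Upsample step between stages is as a copy-back operation. The sketch below assumes that each merged token is copied to every earlier-stage token that was merged into it, and that the earlier-stage features are then added; the slide text itself only names the step, so the function name `upsample_and_add` and the data layout are illustrative assumptions.

```python
def upsample_and_add(merged, assignment, previous):
    """Copy merged-token features back to the finer stage and add its features.

    merged:     features of the coarse (later-stage) tokens.
    assignment: assignment[i] is the index of the coarse token that
                fine token i was merged into.
    previous:   features of the fine (earlier-stage) tokens.
    """
    return [[m + p for m, p in zip(merged[assignment[i]], previous[i])]
            for i in range(len(previous))]

merged = [[1.0], [10.0]]
assignment = [0, 0, 1]
previous = [[0.5], [0.25], [2.0]]
upsampled = upsample_and_add(merged, assignment, previous)
```

Because the cluster assignment is reused, no spatial interpolation is needed and tokens of irregular shape can be restored to the finer stage directly.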
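Summary of the proposed method: (1) CTM Block, (2) MTA Head
Experiments: 3D
Experimental results: pose estimation task
Experimental results: pose estimation task (CTM, MTA Head)
Other tasks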
Experiment: comparison of the generated tokens
Summary and impressions: Human-centric