Kei Moriyama
January 08, 2024
[Reading group] Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
Transcript
Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer @1/10
Paper information: CVPR 2022
Paper overview: Vision Transformer, Attention
Positioning of the research: ViT
Research proposal: ViT, CTM
Research proposal: MTA Head
Proposed method 1: Clustering-based Token Merge (CTM) Block
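Proposed method 1: Clustering in the CTM Block
Clustering uses a density-peaks algorithm. For each token feature $x_i$ it computes a local density $\rho_i$ from the $k$ nearest neighbors and a distance $\delta_i$ to the nearest token of higher density:
$$\rho_i = \exp\left(-\frac{1}{k}\sum_{x_j \in \mathrm{KNN}(x_i)} \lVert x_i - x_j \rVert_2^2\right)$$
$$\delta_i = \begin{cases} \min_{j:\rho_j > \rho_i} \lVert x_i - x_j \rVert_2 & \text{if } \exists j \text{ s.t. } \rho_j > \rho_i \\ \max_j \lVert x_i - x_j \rVert_2 & \text{otherwise} \end{cases}$$
Tokens with a large $\rho_i \times \delta_i$ are chosen as cluster centers.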
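The density-peaks selection above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `dpc_knn_centers`, the choice to exclude a token from its own neighbor set, and the final step of assigning every token to its nearest center are assumptions made for the example.

```python
import numpy as np

def dpc_knn_centers(x, k, num_clusters):
    """Pick cluster centers by rho_i * delta_i and assign tokens to them.

    x: (N, C) array of token features, with k <= N - 1.
    Returns (center_idx, assignment).
    """
    # Pairwise Euclidean distances between tokens, shape (N, N).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))

    # Local density rho_i from the k nearest neighbors.
    # Assumption: the token itself (distance 0) is not part of KNN(x_i).
    knn_dist = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = np.exp(-(knn_dist ** 2).mean(axis=1))

    # delta_i: distance to the nearest token with higher density,
    # or the maximum distance when no denser token exists.
    higher = rho[None, :] > rho[:, None]      # higher[i, j] is rho_j > rho_i
    delta = np.where(higher, dist, np.inf).min(axis=1)
    no_higher = ~higher.any(axis=1)
    delta[no_higher] = dist[no_higher].max(axis=1)

    # Tokens with the largest rho_i * delta_i become cluster centers.
    score = rho * delta
    center_idx = np.argsort(-score)[:num_clusters]

    # Assumption: every token joins the cluster of its nearest center.
    assignment = dist[:, center_idx].argmin(axis=1)
    return center_idx, assignment
```

On two well-separated groups of points, the densest point of each group gets a large $\delta_i$ and is picked as a center, while the remaining points get a small $\delta_i$ because a denser neighbor is close by.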
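Proposed method 1: Feature merging in the CTM Block
$$y_i = \frac{\sum_{j \in C_i} e^{p_j} x_j}{\sum_{j \in C_i} e^{p_j}}$$
Here $p_j$ is the importance score of token $j$, $C_i$ is the $i$-th cluster, and $y_i$ is the merged token feature, which is used as the Query in Attention.
Reference: Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inform. Process. Syst., 2021.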
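The merging formula is a softmax-weighted average of the token features inside each cluster. A minimal NumPy sketch follows; the function name `merge_tokens` and the argument layout are hypothetical, and it assumes every cluster contains at least one token.

```python
import numpy as np

def merge_tokens(x, p, assignment, num_clusters):
    """Merge tokens per cluster: y_i = sum_j e^{p_j} x_j / sum_j e^{p_j}.

    x: (N, C) token features, p: (N,) importance scores,
    assignment: (N,) cluster index of every token.
    Assumption: no cluster is empty (otherwise the division is 0 / 0).
    """
    # Subtracting the global maximum leaves every ratio unchanged
    # but keeps exp() from overflowing.
    w = np.exp(p - p.max())
    y = np.zeros((num_clusters, x.shape[1]))
    norm = np.zeros(num_clusters)
    np.add.at(y, assignment, w[:, None] * x)   # numerator per cluster
    np.add.at(norm, assignment, w)             # denominator per cluster
    return y / norm[:, None]
```

With equal scores inside a cluster this reduces to a plain mean; a token with a higher $p_j$ pulls the merged feature toward itself.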
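Proposed method 1: Attention computation after the CTM Block
The tokens merged by CTM are used as the Query, while K and V are reduced in size by Spatial Reduction.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + P\right)V$$
$P$ is the importance score added to the Attention logits.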
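To make the role of $P$ concrete, here is a single-head NumPy sketch of the formula. It is an illustration under assumptions: the function name `importance_attention` is hypothetical, the learned Q/K/V projections and the Spatial Reduction step are left out, and $P$ is taken to be one score per key token that is broadcast over all queries.

```python
import numpy as np

def importance_attention(q, k, v, p):
    """Compute softmax(Q K^T / sqrt(d_k) + P) V for a single head.

    q: (M, d_k) queries, k: (N, d_k) keys, v: (N, d_v) values,
    p: (N,) importance score per key token (assumed layout).
    """
    d_k = q.shape[-1]
    logits = q @ k.T / np.sqrt(d_k) + p[None, :]
    # Row-wise max subtraction for numerical stability; softmax is unchanged.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because $P$ is added before the softmax, a key with a higher score receives a larger attention weight from every query, and a very negative score effectively removes that key.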
Proposed method 2: Multi-stage Token Aggregation (MTA) Head: ViT
Proposed method 2: Multi-stage Token Aggregation Head: Transformer, Stage 4, Stage 3, Stage 2, Stage 1
Proposed method 2: Multi-stage Token Aggregation Head: Transformer, Upsample
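One way to picture the stage-by-stage upsampling is the sketch below. It rests on assumptions that the slide text does not spell out: upsampling copies each merged token back to the tokens of its cluster, the result is added to the earlier stage's features, all stages share one channel width, and any further processing after the addition is omitted. The names `upsample_tokens` and `aggregate_stages` are hypothetical.

```python
import numpy as np

def upsample_tokens(merged, assignment):
    """Copy each merged token back to the tokens that formed its cluster."""
    return merged[assignment]

def aggregate_stages(stage_feats, assignments):
    """Fuse token features from the last stage back to the first.

    stage_feats: list of (N_s, C) arrays ordered from stage 1 to the last stage.
    assignments: assignments[s] holds, for every token of stage s + 1
                 (1-based), the index of its merged token in the next stage.
    Assumption: every stage uses the same channel width C.
    """
    fused = stage_feats[-1]
    pairs = zip(reversed(stage_feats[:-1]), reversed(assignments))
    for feats, assign in pairs:
        # Upsample the coarser tokens, then add the finer stage's features.
        fused = feats + upsample_tokens(fused, assign)
    return fused
```

The output has as many tokens as stage 1, so information from the coarse, heavily merged stages is carried back to the fine-grained tokens.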
Summary of the proposed methods
Experiments: 3D
Experimental results: pose estimation task
Experimental results: pose estimation task (CTM, MTA Head)
Other tasks
Experiment: comparison of the generated tokens
Summary and impressions: Human-centric