Kei Moriyama
January 08, 2024
[Reading group] Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
Transcript
Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer @1/10
Paper information: CVPR 2022
Paper overview: Vision Transformer, Attention
Position of this research: ViT
Proposal of this research: ViT, CTM
Proposal of this research: MTA Head
Proposed method 1: Clustering-based Token Merge (CTM) Block
Proposed method 1: Clustering in the CTM Block
Density peaks clustering is used. For each token feature $x_i$, a local density $\rho_i$ and a distance $\delta_i$ are computed:
$$\rho_i = \exp\left(-\frac{1}{k}\sum_{x_j \in \mathrm{KNN}(x_i)} \|x_i - x_j\|_2^2\right)$$
$$\delta_i = \begin{cases} \min_{j:\rho_j > \rho_i} \|x_i - x_j\|_2 & \text{if } \exists j \text{ s.t. } \rho_j > \rho_i \\ \max_j \|x_i - x_j\|_2 & \text{otherwise} \end{cases}$$
Tokens with a high score $\rho_i \times \delta_i$ are selected as cluster centers.
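The density and distance definitions above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `dpc_knn_centers`, the exclusion of each token from its own neighbor set, and the final nearest-center assignment step are assumptions made for this example.

```python
import numpy as np

def dpc_knn_centers(x, k, num_clusters):
    """Pick cluster centers by the rho_i * delta_i score (hypothetical helper).

    x: (N, C) array of token features.
    Returns (centers, assign, rho, delta).
    """
    # Pairwise Euclidean distances between all tokens, shape (N, N).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Distances to the k nearest neighbors. Column 0 of the sorted rows is the
    # zero distance of a token to itself, which is skipped (an assumption).
    knn_dist = np.sort(dist, axis=1)[:, 1:k + 1]

    # Local density: rho_i = exp(-(1/k) * sum of squared kNN distances).
    rho = np.exp(-(knn_dist ** 2).mean(axis=1))

    # delta_i: distance to the closest token with a higher density,
    # or the largest distance if no token has a higher density.
    higher = rho[None, :] > rho[:, None]          # higher[i, j] means rho_j > rho_i
    delta = np.where(higher, dist, np.inf).min(axis=1)
    no_higher = ~higher.any(axis=1)
    delta[no_higher] = dist[no_higher].max(axis=1)

    # Tokens with the highest rho * delta become cluster centers.
    score = rho * delta
    centers = np.argsort(-score)[:num_clusters]

    # Assumed assignment step: every token joins its nearest center.
    assign = dist[:, centers].argmin(axis=1)
    return centers, assign, rho, delta
```

For two well-separated groups of tokens, the densest token of each group gets a large delta and therefore a high score, while the other tokens have a delta close to the distance to their local peak.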
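Proposed method 1: Feature merging in the CTM Block
$$y_i = \frac{\sum_{j \in C_i} e^{p_j} x_j}{\sum_{j \in C_i} e^{p_j}}$$
$p_j$: importance score of token $j$; $C_i$: set of tokens in the $i$-th cluster; $y_i$: merged token feature. The merged tokens are used as the Query in the Attention that follows.
Reference: Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inform. Process. Syst., 2021.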
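The merge formula above is a softmax-weighted average of the token features inside each cluster. A minimal NumPy sketch, assuming a hypothetical helper `merge_tokens`, an integer cluster index per token, and that every cluster contains at least one token:

```python
import numpy as np

def merge_tokens(x, p, assign, num_clusters):
    """Importance-weighted merge of tokens per cluster (hypothetical helper).

    x: (N, C) token features, p: (N,) importance scores,
    assign: (N,) integer cluster index of each token.
    Returns a (num_clusters, C) array of merged features y_i.
    """
    # exp(p_j); subtracting the max only rescales numerator and denominator
    # by the same factor, so the ratio is unchanged but overflow is avoided.
    w = np.exp(p - p.max())

    num = np.zeros((num_clusters, x.shape[1]))
    den = np.zeros(num_clusters)
    # Unbuffered scatter-add so that repeated cluster indices accumulate.
    np.add.at(num, assign, w[:, None] * x)
    np.add.at(den, assign, w)

    # Assumes no empty cluster; an empty one would divide by zero here.
    return num / den[:, None]
```

With equal scores inside a cluster the result is the plain mean of its tokens; a token with a larger score pulls the merged feature toward itself.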
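Proposed method 1: Attention computation after the CTM Block
The output of the CTM block is used as the Query; K and V are made smaller with Spatial Reduction.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + P\right)V$$
$P$: token importance score added inside the Attention.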
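The attention formula above adds the importance score P to the scaled dot-product logits before the softmax. A single-head NumPy sketch, assuming P holds one score per key token and is broadcast over all queries (the function name and shapes are illustrative, not taken from the slides):

```python
import numpy as np

def attention_with_importance(q, k, v, p):
    """Scaled dot-product attention with an additive importance bias.

    q: (M, d_k) queries, k: (N, d_k) keys, v: (N, d_v) values,
    p: (N,) importance score per key token (assumed shape).
    Returns an (M, d_v) array.
    """
    d_k = q.shape[-1]
    # softmax(Q K^T / sqrt(d_k) + P), with P broadcast across queries.
    logits = q @ k.T / np.sqrt(d_k) + p[None, :]
    # Subtract the row-wise max for numerical stability; softmax is unchanged.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because P sits inside the softmax, a key with a higher importance score receives a larger attention weight from every query, even when the dot products are identical.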
Proposed method 2: Multi-stage Token Aggregation (MTA) Head
Token features from the Transformer stages (Stage 4, Stage 3, Stage 2, Stage 1) are aggregated, with Upsample applied between stages.
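Summary of the proposed methods: (1) CTM Block, (2) MTA Head
Experiments: human-centric tasks, including 3D
Experimental results: pose estimation task (CTM, MTA Head)
Other tasks
Experiments: comparison of generated tokens
Summary and impressions: Human-centric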