Kei Moriyama
January 08, 2024
[Reading group] Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
Transcript
Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer @1/10
Paper information: CVPR 2022
Paper overview: Vision Transformer, Attention
Positioning of this work: ViT
Proposal of this work: ViT, CTM
Proposal of this work: MTA Head
Proposed method 1: Clustering-based Token Merge (CTM) Block
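Proposed method 1: Clustering in the CTM Block
Density peaks clustering is used. For each token $x_i$, a local density $\rho_i$ and a distance $\delta_i$ are computed:
$$\rho_i = \exp\left(-\frac{1}{k}\sum_{x_j \in \mathrm{KNN}(x_i)} \lVert x_i - x_j \rVert_2^2\right)$$
$$\delta_i = \begin{cases} \min_{j:\,\rho_j > \rho_i} \lVert x_i - x_j \rVert_2 & \text{if } \exists j \text{ s.t. } \rho_j > \rho_i \\ \max_j \lVert x_i - x_j \rVert_2 & \text{otherwise} \end{cases}$$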
Proposed method 1: Clustering in the CTM Block
Tokens with a high score $\rho_i \times \delta_i$ are chosen as cluster centers, with $\rho_i$ and $\delta_i$ as defined on the previous slide.
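The density and distance definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`dpc_knn_scores`, `cluster_tokens`) are hypothetical, the k nearest neighbours are assumed to exclude the token itself, and assigning every token to its nearest center is an assumption about the step that follows center selection.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def dpc_knn_scores(tokens, k):
    """Return (rho, delta, score) for each token, with score = rho * delta."""
    n = len(tokens)
    d2 = [[sq_dist(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    rho = []
    for i in range(n):
        # Assumption: KNN(x_i) excludes x_i itself.
        nearest = sorted(d2[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(nearest) / k))
    delta = []
    for i in range(n):
        higher = [math.sqrt(d2[i][j]) for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))  # nearest token with higher density
        else:
            delta.append(max(math.sqrt(x) for x in d2[i]))  # densest token
    score = [r * d for r, d in zip(rho, delta)]
    return rho, delta, score

def cluster_tokens(tokens, k, num_clusters):
    """Pick the highest-scoring tokens as centers, then assign by distance."""
    _, _, score = dpc_knn_scores(tokens, k)
    order = sorted(range(len(tokens)), key=lambda i: score[i], reverse=True)
    centers = order[:num_clusters]
    # Assumption: each token joins the cluster of its nearest center.
    assign = [
        min(range(num_clusters), key=lambda c: sq_dist(t, tokens[centers[c]]))
        for t in tokens
    ]
    return centers, assign

tokens = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
centers, assign = cluster_tokens(tokens, k=2, num_clusters=2)
```

In this toy input the two tight groups each contribute one center, because only a token that is both dense and far from any denser token gets a large $\rho_i \times \delta_i$.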
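Proposed method 1: Feature merging in the CTM Block
$$y_i = \frac{\sum_{j \in C_i} e^{p_j} x_j}{\sum_{j \in C_i} e^{p_j}}$$
Here $p_j$ is the importance score of token $x_j$, $C_i$ is the set of tokens in the $i$-th cluster, and $y_i$ is the merged token. Keywords: Query, Attention.
Reference: Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inform. Process. Syst., 2021.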
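The merging formula is a softmax-weighted average over the tokens of one cluster. The sketch below is illustrative only: the names `merge_cluster` and `merge_tokens` are hypothetical, and every cluster is assumed to contain at least one token.

```python
import math

def merge_cluster(features, scores):
    """y_i = sum_j exp(p_j) x_j / sum_j exp(p_j) over one cluster C_i."""
    m = max(scores)  # subtracting the max leaves the ratio unchanged
    weights = [math.exp(p - m) for p in scores]
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * x[d] for w, x in zip(weights, features)) / total
        for d in range(dim)
    ]

def merge_tokens(tokens, scores, assign, num_clusters):
    """Merge all tokens cluster by cluster; assumes no cluster is empty."""
    merged = []
    for c in range(num_clusters):
        members = [j for j, a in enumerate(assign) if a == c]
        merged.append(
            merge_cluster([tokens[j] for j in members],
                          [scores[j] for j in members])
        )
    return merged

# Equal scores reduce to a plain mean of the cluster's features.
mean_like = merge_cluster([[0.0, 0.0], [2.0, 4.0]], [0.0, 0.0])
# A higher score pulls the merged token toward that member.
skewed = merge_cluster([[0.0, 0.0], [2.0, 4.0]], [0.0, math.log(3.0)])
```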
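Proposed method 1: Attention computation after the CTM Block
Keywords: CTM output (Query); K, V; Spatial Reduction
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + P\right)V$$
$P$ is the importance score added to the attention logits.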
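As a rough illustration of the attention formula, the sketch below adds a per-key score to the scaled dot-product logits before the softmax. It assumes $P$ holds one score per key token and is shared across all queries; the function name `attention_with_scores` is hypothetical, and the spatial reduction of K and V is left out.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_scores(Q, K, V, P):
    """softmax(Q K^T / sqrt(d_k) + P) V with P[j] the score of key j."""
    d_k = len(K[0])
    out = []
    for q in Q:
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + p
            for k, p in zip(K, P)
        ]
        w = softmax(logits)
        out.append(
            [sum(wj * v[d] for wj, v in zip(w, V)) for d in range(len(V[0]))]
        )
    return out

Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention_with_scores(Q, K, V, [0.0, 0.0])
biased = attention_with_scores(Q, K, V, [0.0, -1000.0])
```

With equal scores the two keys share the attention evenly; a very low score on the second key effectively removes it from the weighted sum.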
Proposed method 2: Multi-stage Token Aggregation Head: ViT
Proposed method 2: Multi-stage Token Aggregation Head: Transformer; Stage 4, Stage 3, Stage 2, Stage 1
Proposed method 2: Multi-stage Token Aggregation Head: Transformer; Upsample
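One way to read the "Stage 4 to Stage 1" and "Upsample" slides is the following sketch, which walks from the coarsest stage back to the finest, copies each merged token to the tokens it was merged from, and adds the finer stage's features. Everything here is an assumption for illustration: the names `upsample` and `aggregate` are hypothetical, all stages are assumed to share one channel dimension, and any learned layers applied after the addition are omitted.

```python
def upsample(merged, assign):
    """Copy each merged token back to every token that was merged into it."""
    return [list(merged[c]) for c in assign]

def aggregate(stage_features, assigns):
    """Aggregate features from the last stage down to the first.

    stage_features[s]: token features at stage s, ordered fine to coarse.
    assigns[s][j]: index of the stage s+1 token that token j of stage s
    was merged into (the recorded clustering result).
    """
    agg = stage_features[-1]
    for s in range(len(stage_features) - 2, -1, -1):
        up = upsample(agg, assigns[s])
        # Element-wise sum of upsampled and same-stage features.
        agg = [[a + b for a, b in zip(u, f)]
               for u, f in zip(up, stage_features[s])]
    return agg

stage_features = [[[1], [2], [3], [4]], [[10], [20]], [[100]]]
assigns = [[0, 0, 1, 1], [0, 0]]
result = aggregate(stage_features, assigns)
```

The output has one feature vector per finest-stage token, so information from every stage reaches every original token position.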
Summary of the proposed methods (1 and 2)
Experiments: 3D
Experimental results: pose estimation task
Experimental results: pose estimation task (CTM, MTA Head)
Other tasks
Experiment: comparison of the generated tokens
Summary and impressions: Human-centric