Multi-Scale Self-Attention for Text Classification
Scatter Lab Inc.
January 16, 2020
Transcript
Multi-Scale Self-Attention for Text Classification (ML Research Scientist, Pingpong)
Contents
1. Introduction
   1. Self-Attention
   2. Problem
2. Proposed Method
   1. Scale-Aware Self-Attention
   2. Multi-Scale Multi-Head Self-Attention
   3. Multi-Scale Transformer
3. Experiments
   1. Effective Scale
   2. Text Classification
Introduction
Self-Attention
• A technique introduced in Attention Is All You Need (Vaswani et al., 2017).
• In conventional attention (Encoder-Decoder), the Key and Query come from different sequences; in Self-Attention, the Key, Query, and Value are all taken from the same sequence.
• Multi-head: several heads run attention independently over the same Key/Query/Value, so that diverse interaction patterns can be modeled.
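For reference, a minimal PyTorch sketch of multi-head self-attention as described above; the module name and shapes are illustrative assumptions, not code from the deck:

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiHeadSelfAttention(nn.Module):
    """Self-attention: Query, Key, and Value are projections of the same input."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint Q/K/V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into independent heads: (batch, heads, seq, d_head).
        q, k, v = (t.reshape(B, N, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)             # each head attends on its own
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```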
Self-Attention
• The structure of stacking several Transformer blocks to build an encoder is widely used.
• It is the backbone of the SOTA methods for most NLP tasks, e.g., NLU with BERT (Devlin et al., 2018) and generation with GPT (Radford et al., 2019).
Problem
• Compared with other modules (CNN, RNN), the Transformer is particularly vulnerable to the inductive-bias problem:
  • The model is large, yet places few constraints on itself.
  • CNN, RNN: model interactions between specific (nearby) words.
  • Transformer: models pair-wise interactions between all words (every word is reachable from every position).
  • To overcome this, pre-training on a large corpus is used.
  → We want a Transformer that works well even when trained directly on the task data.
• Language itself also has a multi-scale structure (hierarchical structure): high-level features are built from combinations of low-level terms.
  • The Transformer architecture cannot reflect this (every word is reachable from the first layer; the BERT method resolves this to some degree).
  → We want a Transformer that can reflect multi-scale structure.
Proposed Method
Scale-Aware Self-Attention
• Within a single head, the range each token can attend to is narrowed to [-w, w] around its position.
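A minimal sketch of one way to implement this restriction, as an additive mask on the pre-softmax attention scores; the band-mask construction is an implementation assumption, not code from the paper:

```python
import torch

def scale_mask(seq_len: int, w: int) -> torch.Tensor:
    """Additive mask letting position i attend only to positions in [i - w, i + w]."""
    idx = torch.arange(seq_len)
    in_window = (idx[None, :] - idx[:, None]).abs() <= w  # boolean band matrix
    return torch.where(in_window, 0.0, float("-inf"))     # -inf removes out-of-window scores

# Usage inside one head, before the softmax:
# scores = q @ k.transpose(-2, -1) / d_head ** 0.5 + scale_mask(seq_len, w=3)
```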
Multi-Scale Multi-Head Self-Attention
• Each head is given a different attention range (Multi-Scale Multi-Head), as sketched below.
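A self-contained sketch of the per-head variant: each head gets its own window size, so small-scale and large-scale heads coexist in one layer (the particular scale split in the usage comment is illustrative):

```python
import torch

def multi_scale_masks(seq_len: int, head_scales: list[int]) -> torch.Tensor:
    """One additive band mask per head, stacked to (heads, seq, seq)."""
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()            # token distance matrix
    bands = torch.stack([dist <= w for w in head_scales])
    return torch.where(bands, 0.0, float("-inf"))

# e.g. 10 heads split over two scales, small (w=3) and larger (w = N // 8):
# masks = multi_scale_masks(N, [3] * 5 + [N // 8] * 5)
# scores = scores + masks   # broadcasts over the batch dimension
```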
Multi-Scale Transformer
• The FFN is not used (it can be viewed as equivalent to a w=1 head followed by a non-linear activation).
• Positional embeddings are not used either (the small-scale heads substitute for them).
Classification Node
• In BERT, the [CLS] token representation is used for classification.
• Here, the [CLS] token representation is combined with a max-pooling feature over the remaining token representations.
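A sketch of this classification node under the stated design: the [CLS] representation concatenated with a max-pool over the remaining tokens, fed to a 2-layer MLP (the classifier mentioned later in the experiment settings); layer widths are illustrative:

```python
import torch
from torch import nn

class ClassificationNode(nn.Module):
    def __init__(self, d_model: int, n_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(            # 2-layer MLP classifier
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, h):                    # h: (batch, seq, d_model), h[:, 0] is [CLS]
        cls = h[:, 0]                        # [CLS] token representation
        pooled, _ = h[:, 1:].max(dim=1)      # max-pool over the remaining tokens
        return self.mlp(torch.cat([cls, pooled], dim=-1))
```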
Experiments
Effective Attention Scale
• A synthetic experiment designed to check whether the model captures long-range dependencies in a sequence.
• Input: $A = \{a_1, \dots, a_N\}$, where each $a_i \in \mathbb{R}^d$ is randomly sampled from the uniform distribution $U(0, 1)$.
• Target: $\sum_{i=1}^{K} a_i \odot a_{N-i+1}$, i.e., element-wise products pairing the first $K$ items with the last $K$ items.
• About 200k train/test examples were generated for training.
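A sketch of generating this synthetic task exactly as defined above; the values of N, d, and K are illustrative since the slide does not fix them:

```python
import torch

def make_example(N: int = 128, d: int = 32, K: int = 4):
    """Input A = {a_1, ..., a_N}, a_i ~ U(0, 1)^d; target couples the two ends."""
    A = torch.rand(N, d)
    # target = sum_{i=1}^{K} a_i ⊙ a_{N-i+1}  (1-based indices, element-wise product)
    target = sum(A[i] * A[N - i - 1] for i in range(K))
    return A, target

# Scale up to ~200k examples for the train/test sets, as on the slide.
train_set = [make_example() for _ in range(1000)]
```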
• MS-Trans-Hier-S: MS-Transformer, 2 layers, 10 heads, w=3
• MS-Trans-deepHier-S: MS-Transformer, 6 layers, 10 heads, w=3
• MS-Trans-Flex: MS-Transformer, 2 layers, multi-scale w = {3, N/16, N/8, N/4, N/2}
Findings:
• MS-Trans-Hier-S vs. MS-Trans-deepHier-S: additional layers bring little to no performance gain.
• MS-Trans-Flex (+ real experiments): looking at large scales from the lower layers onward is more effective than stacking small scales.
Analogy Analysis from BERT
• Forward many sentences through pre-trained BERT and inspect the behavior of each layer/head.
• (left) Comparing different heads in the same layer: some heads attend evenly over all distances (head 1), while others attend to tokens at small or specific scales (heads 2, 3).
• (right) Comparing different layers: lower layers attend to short-scale tokens (layer 1), and higher layers attend increasingly evenly over tokens at all scales (layers 6, 12).
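A hedged sketch of this kind of analysis with the Hugging Face transformers library: forward sentences through pre-trained BERT with attention outputs enabled and measure, per layer and head, how far attention reaches on average (the exact statistic the authors plot is not spelled out on the slide, so mean attention distance is an illustrative choice):

```python
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

def mean_attention_distance(sentence: str) -> torch.Tensor:
    """Return a (layers, heads) matrix of attention-weighted average |i - j|."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        attns = model(**inputs).attentions       # tuple of (1, heads, seq, seq)
    n = attns[0].size(-1)
    dist = (torch.arange(n)[None, :] - torch.arange(n)[:, None]).abs().float()
    # Expected attended distance per query position, averaged over the sequence.
    return torch.stack([(a[0] * dist).sum(-1).mean(-1) for a in attns])

# Average over many sentences, then compare heads within one layer (left plot)
# and the same statistics across layers (right plot).
```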
Control Factor of Scale Distributions for Different Layers
• Example with $N' = 10$ heads, $\alpha = 0.5$, and 5 scales $w$:
• (layer 1) $[z^1_1, z^1_2, z^1_3, z^1_4, z^1_5] = [0 + 0.5 \cdot 4,\ 0 + 0.5 \cdot 3,\ 0 + 0.5 \cdot 2,\ 0 + 0.5,\ 0] = [2.0, 1.5, 1.0, 0.5, 0]$, giving head counts $n^{l=1} = \{5, 2, 2, 1, 0\}$
• …
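A small sketch reproducing the slide's arithmetic for layer 1; the leading 0 is a layer-dependent term that is zero at the first layer, and the mapping from these z values to the integer head counts {5, 2, 2, 1, 0} follows the paper's allocation rule, which the slide does not spell out:

```python
# Reproduce the z values for layer 1 with alpha = 0.5 and K = 5 scales.
alpha, K = 0.5, 5
layer_term = 0                      # layer-dependent term, 0 at layer 1
z_layer1 = [layer_term + alpha * (K - k) for k in range(1, K + 1)]
print(z_layer1)                     # [2.0, 1.5, 1.0, 0.5, 0.0]
```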
Experiment Settings
• Classifier: 2-layer MLP
• GloVe pre-trained word embeddings
• No comparison against self-supervised learning methods such as BERT.
• All training is from scratch, except for the word embeddings.
Text Classification
• SST
• MTL-16
Sequence Labeling
Natural Language Inference
• SNLI
Thank you! ✌ If you have any further questions, please reach out via the contact information below. (ML Software Engineer, Pingpong)
[email protected]