Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Multi-Scale Self-Attention for Text Classification
Search
Scatter Lab Inc.
January 16, 2020
Research
2.5k
0
Share
Multi-Scale Self-Attention for Text Classification
Scatter Lab Inc.
January 16, 2020
More Decks by Scatter Lab Inc.
See All by Scatter Lab Inc.
zeta introduction
scatterlab
0
1.8k
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
scatterlab
0
4.4k
Adversarial Filters of Dataset Biases
scatterlab
0
2.3k
Sparse, Dense, and Attentional Representations for Text Retrieval
scatterlab
0
2.3k
Weight Poisoning Attacks on Pre-trained Models
scatterlab
0
2.2k
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
scatterlab
0
2.5k
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
scatterlab
0
2.3k
Open-Retrieval Conversational Question Answering
scatterlab
0
2.3k
What Can Neural Networks Reason About?
scatterlab
0
2.3k
Other Decks in Research
See All in Research
「車1割削減、渋滞半減、公共交通2倍」を 熊本から岡山へ@RACDA設立30周年記念都市交通フォーラム2026
trafficbrain
1
1k
Can We Teach Logical Reasoning to LLMs? – An Approach Using Synthetic Corpora (AAAI 2026 bridge keynote)
morishtr
1
210
Earth AI: Unlocking Geospatial Insights with Foundation Models and Cross-Modal Reasoning
satai
3
930
業界横断 副業コンプライアンス調査 三者(副業者・本業先・発注者)におけるトラブル認知ギャップの構造分析
fkske
0
1.2k
AIスーパーコンピュータにおけるLLM学習処理性能の計測と可観測性 / AI Supercomputer LLM Benchmarking and Observability
yuukit
1
830
2026年3月1日(日)福島「除染土」の公共利用をかんがえる
atsukomasano2026
0
540
【SIGGRAPH Asia 2025】Lo-Fi Photograph with Lo-Fi Communication
toremolo72
0
150
AI Agentの精度改善に見るML開発との共通点 / commonalities in accuracy improvements in agentic era
shimacos
6
1.6k
存立危機事態の再検討
jimboken
0
270
ブレグマン距離最小化に基づくリース表現量推定:バイアス除去学習の統一理論
masakat0
0
240
教師あり学習と強化学習で作る 最強の数学特化LLM
analokmaus
2
1k
FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing
satai
3
520
Featured
See All Featured
The #1 spot is gone: here's how to win anyway
tamaranovitovic
2
1k
Design of three-dimensional binary manipulators for pick-and-place task avoiding obstacles (IECON2024)
konakalab
0
410
Fantastic passwords and where to find them - at NoRuKo
philnash
52
3.7k
Scaling GitHub
holman
464
140k
The Cost Of JavaScript in 2023
addyosmani
55
9.8k
Navigating the Design Leadership Dip - Product Design Week Design Leaders+ Conference 2024
apolaine
0
270
Music & Morning Musume
bryan
47
7.2k
Statistics for Hackers
jakevdp
799
230k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
231
23k
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
508
140k
[SF Ruby Conf 2025] Rails X
palkan
2
970
Amusing Abliteration
ianozsvald
1
160
Transcript
Multi-Scale Self-Attention for Text Classification ߔ (ML Research Scientist, Pingpong)
ݾର ݾର! 1. Introduction 1. Self-Attention 2. Problem 2. Proposed
Method 1. Scale-Aware Self-Attention 2. Multi-Scale Multi-Head Self-Attention 3. Multi-Scale Transformer 3. Experiments 1. Effective Scale 2. Text Classification
Introduction Introduction
• Attention Is All You Need (Vaswani et al., 2017)
ী ࣗѐػ ӝߨ • ӝઓ Attention Key, Queryо ܰѱ ਊغਵա(Encoder-Decoder), Key, Query, Valueܳ э ѱ ਊ(Self-Attention) • Multi-head: э Key,Query,Value۽ ৈ۞ Headо ة݀ਵ۽ Attention োਸ ೯ೣਵ۽ॄ, নೠ ন࢚ਸ ݽ؛݂ೞӝ ਤೠӝߨ Introduction Self-Attention
• Transformer ࠶۾ਸ ৈ۞ ѐ ऺইࢲ ੋ؊۽ ݅٘ח ҳઑо ۽
ਊؽ. • NLU - BERT (Devlin et al., 2018), Generation - GPT(Radford et al., 2019) ١ ࠗ࠙ NLP taskٜ SOTA ߑߨۿٜীࢲ ࢎਊೞҊ ח ҳઑ Introduction Self-Attention
• Transformerח ܲ ݽٕٜ(CNN, RNN)ী ࠺೧ Inductive Bias ޙઁী ౠ
ஂডೣ • ݽ؛ ҳઑо ఀ • ݽ؛ী ઁড . • CNN, RNN: ౠ ױযٜ ࢎী ࢚ഐਊਸ ݽ؛݂ • Transformer: ױযٜ ࢎ pair-wised ࢚ഐਊਸ ݽ؛݂(ݽٚ ױযী Ӕ оמ) • ܳ ӓࠂೞӝ ਤ೧ Large Corpus۽ pre-training ೞח ߑधਸ ࢎਊೣ. → ؘఠ۽ ߄۽ णदெب ੜ زೞח Transformer • যীب Multi-Scale ҳઑо ઓೣ.(Hierarchical Structure) • High-level feature -> Low-level term ઑ • Transformer ҳઑীח ۞ೠ ਸ ߈ೡ ࣻ হ.( layerࠗఠ ݽٚ wordী Ӕ оמೣ. ࠗ࠙ب BERT method۽ যו ب ೧Ѿ ؽ.) → Multi-Scaleਸ ߈ೡ ࣻ ח Transformer Introduction Problem
Proposed Method Proposed Method
Scale-Aware Self-Attention Proposed Method
Scale-Aware Self-Attention Proposed Method ೞա Headীࢲ п token attend ೡ
ࣻ ח ߧਤܳ [-w, w] ࢎ۽ ગ൨.
Multi-Scale Multi-Head Self-Attention Proposed Method п Head݃ attendೡ ࣻ ח
ߧਤܳ ܰѱ оઉх(Multi-Scale Multi-Head).
Multi-Scale Transformer Proposed Method • FFNਸ ࢎਊೞ ঋ. (w=1 +
non-linear activation Ѿҗ৬ زੌೞҊ ࠅ ࣻ ) • Positional Embeddingب ࢎਊೞ ঋ (small-scale۽ )
Multi-Scale Transformer Proposed Method • Classification Node • Bertীࢲח [CLS]
ష representationਸ Classificationী ਊೣ • [CLS]ష representation + աݠ ష representation max pooling feature
Experiments Experiments
Effective Attention Scale Experiments • Sequence long-range dependancyܳ ੜ ݽ؛݂ೞח
ഛੋೡ ࣻ ח पਸ ӝദ • input: • п aח uniform distribution U(0,1)۽ ࠗఠ random sampling • target: • ড 20݅ѐ ण/పझ ࣇਸ ٜ݅যࢲ णदఇ A = {a 1 , . . . a N }, a ∈ Rd K ∑ i= 1 a i ⊙ a N−i+1
Effective Attention Scale Experiments • MS-Trans-Hier-S: MS-Transformer 2-layers, 10heads w=3
• MS-Trans-deepHier-S: MS-Transformer 6-layers, 10heads w=3 • MS-Trans-Flex: MS-Transformer 2-layers, multi-scales • w={3, N/16, N/8, N/4, N/2}
Effective Attention Scale Experiments • MS-Trans-Hier-S: MS-Transformer 2-layers, 10heads w=3
• MS-Trans-deepHier-S: MS-Transformer 6-layers, 10heads w=3 • MS-Trans-Flex: MS-Transformer 2-layers, multi-scales • w={3, N/16, N/8, N/4, N/2} Ã • MS-Trans-Hier-S vs MS-Trans-deepHier-S: ୶оੋ layerח ࢿמ ೱ࢚ . • MS-Trans-Flex(+ real experiments): lower layerীࢲ ࠗఠ large-scaleਸ ࠁח Ѫ small- scaleਸ ऺח Ѫ ࠁ ബҗ.
Effective Attention Scale Experiments • Analogy Analysis from BERT •
Pre-trained BERTܳ ਊ೧ ݆ ޙٜਸ forwardingೞҊ, п Layer/Headٜ ন࢚ ঈ
Effective Attention Scale Experiments • Analogy Analysis from BERT •
Pre-trained BERTܳ ਊ೧ ݆ ޙٜਸ forwardingೞҊ, п Layer/Headٜ ন࢚ ঈ • (left) زੌ layer ܲ headܳ ࠺Ү • ݽٚ distanceܳ ҎҊܖ attend(head1), small scale ౠ scale షী attend(head2, head3) • (right) ܲ layerܳ ࠺Ү • ೞਤ layerח ૣ scale షী attend(layer-1), ࢚ਤ layer۽ тࣻ۾ ݽٚ scale షী Ҋܰѱ attend(layer-6, layer-12)
Effective Attention Scale Experiments • Control Factor of Scale Distributions
for Different Layer • , 5ѐ wо ח ҃ • (layer 1) =[0 + 0.5 * 4, 0 + 0.5 * 3, 0 + 0.5 * 2, 0 + 0.5, 0], • … N′ = 10,α = 0.5 [z1 1 , z1 2 , z1 3 , z1 4 , z1 5 ] n l= 1 = {5,2,2,1,0}
Experiment Settings Experiments • Classifier: 2-layer MLP • GloVe Pre-trained
Word-Embeddings • BERT৬ э self-supervised learning method৬ח ࠺Ү ೞ ঋ. • ݽٚ ण word-embeddingਸ ઁ৻ೞҊ from scratch
Text Classification Experiments • SST • MLT-16
Sequence Labeling Experiments
Natural Language Inference Experiments • SNLI
хࢎפ✌ ୶о ޙ ژח ҾӘೠ ݶ ઁٚ ইې োۅ۽
োۅ ࣁਃ! ߔ (ML Software Engineer, Pingpong)
[email protected]