Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Multi-Scale Self-Attention for Text Classification
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Scatter Lab Inc.
January 16, 2020
Research
2.5k
0
Share
Multi-Scale Self-Attention for Text Classification
Scatter Lab Inc.
January 16, 2020
More Decks by Scatter Lab Inc.
See All by Scatter Lab Inc.
zeta introduction
scatterlab
0
1.9k
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
scatterlab
0
4.4k
Adversarial Filters of Dataset Biases
scatterlab
0
2.3k
Sparse, Dense, and Attentional Representations for Text Retrieval
scatterlab
0
2.3k
Weight Poisoning Attacks on Pre-trained Models
scatterlab
0
2.2k
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
scatterlab
0
2.5k
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
scatterlab
0
2.3k
Open-Retrieval Conversational Question Answering
scatterlab
0
2.3k
What Can Neural Networks Reason About?
scatterlab
0
2.3k
Other Decks in Research
See All in Research
LINEヤフー データサイエンス Meetup「三井物産コモディティ予測チャレンジ」の舞台裏-AlpacaTechパート
gamella
1
520
SoftMatcha 2: 1兆語規模コーパスの超高速かつ柔らかい検索
e869120_sub
6
3.4k
R&Dチームを起ち上げる
shibuiwilliam
1
260
衛星×エッジAI勉強会 衛星上におけるAI処理制約とそ取組について
satai
4
500
社内データ分析AIエージェントを できるだけ使いやすくする工夫
fufufukakaka
1
1.1k
英語教育 “研究” のあり方:学術知とアウトリーチの緊張関係
terasawat
1
960
討議:RACDA設立30周年記念都市交通フォーラム2026
trafficbrain
0
890
データセンター事業者を取り巻く近年の状況とその中での研究開発動向、テストベッドへの貢献の可能性
kikuzo
1
130
IEEE AIxVR 2026 Keynote Talk: "Beyond Visibility: Understanding Scenes and Humans under Challenging Conditions with Diverse Sensing"
miso2024
0
190
AIエージェント時代のLLM-jpモデルのあるべき姿
k141303
0
380
Can We Teach Logical Reasoning to LLMs? – An Approach Using Synthetic Corpora (AAAI 2026 bridge keynote)
morishtr
1
230
はじまりの クエスチョンブック —余暇と豊かさにあふれた社会とは?
culturaltransition
PRO
0
450
Featured
See All Featured
Music & Morning Musume
bryan
47
7.2k
Bash Introduction
62gerente
615
210k
The untapped power of vector embeddings
frankvandijk
2
1.7k
Six Lessons from altMBA
skipperchong
29
4.2k
世界の人気アプリ100個を分析して見えたペイウォール設計の心得
akihiro_kokubo
PRO
70
39k
Introduction to Domain-Driven Design and Collaborative software design
baasie
1
790
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
141
35k
Skip the Path - Find Your Career Trail
mkilby
1
130
The SEO identity crisis: Don't let AI make you average
varn
0
470
Building a A Zero-Code AI SEO Workflow
portentint
PRO
0
530
Building Experiences: Design Systems, User Experience, and Full Site Editing
marktimemedia
0
510
Large-scale JavaScript Application Architecture
addyosmani
515
110k
Transcript
Multi-Scale Self-Attention for Text Classification ߔ (ML Research Scientist, Pingpong)
ݾର ݾର! 1. Introduction 1. Self-Attention 2. Problem 2. Proposed
Method 1. Scale-Aware Self-Attention 2. Multi-Scale Multi-Head Self-Attention 3. Multi-Scale Transformer 3. Experiments 1. Effective Scale 2. Text Classification
Introduction Introduction
• Attention Is All You Need (Vaswani et al., 2017)
ী ࣗѐػ ӝߨ • ӝઓ Attention Key, Queryо ܰѱ ਊغਵա(Encoder-Decoder), Key, Query, Valueܳ э ѱ ਊ(Self-Attention) • Multi-head: э Key,Query,Value۽ ৈ۞ Headо ة݀ਵ۽ Attention োਸ ೯ೣਵ۽ॄ, নೠ ন࢚ਸ ݽ؛݂ೞӝ ਤೠӝߨ Introduction Self-Attention
• Transformer ࠶۾ਸ ৈ۞ ѐ ऺইࢲ ੋ؊۽ ݅٘ח ҳઑо ۽
ਊؽ. • NLU - BERT (Devlin et al., 2018), Generation - GPT(Radford et al., 2019) ١ ࠗ࠙ NLP taskٜ SOTA ߑߨۿٜীࢲ ࢎਊೞҊ ח ҳઑ Introduction Self-Attention
• Transformerח ܲ ݽٕٜ(CNN, RNN)ী ࠺೧ Inductive Bias ޙઁী ౠ
ஂডೣ • ݽ؛ ҳઑо ఀ • ݽ؛ী ઁড . • CNN, RNN: ౠ ױযٜ ࢎী ࢚ഐਊਸ ݽ؛݂ • Transformer: ױযٜ ࢎ pair-wised ࢚ഐਊਸ ݽ؛݂(ݽٚ ױযী Ӕ оמ) • ܳ ӓࠂೞӝ ਤ೧ Large Corpus۽ pre-training ೞח ߑधਸ ࢎਊೣ. → ؘఠ۽ ߄۽ णदெب ੜ زೞח Transformer • যীب Multi-Scale ҳઑо ઓೣ.(Hierarchical Structure) • High-level feature -> Low-level term ઑ • Transformer ҳઑীח ۞ೠ ਸ ߈ೡ ࣻ হ.( layerࠗఠ ݽٚ wordী Ӕ оמೣ. ࠗ࠙ب BERT method۽ যו ب ೧Ѿ ؽ.) → Multi-Scaleਸ ߈ೡ ࣻ ח Transformer Introduction Problem
Proposed Method Proposed Method
Scale-Aware Self-Attention Proposed Method
Scale-Aware Self-Attention Proposed Method ೞա Headীࢲ п token attend ೡ
ࣻ ח ߧਤܳ [-w, w] ࢎ۽ ગ൨.
Multi-Scale Multi-Head Self-Attention Proposed Method п Head݃ attendೡ ࣻ ח
ߧਤܳ ܰѱ оઉх(Multi-Scale Multi-Head).
Multi-Scale Transformer Proposed Method • FFNਸ ࢎਊೞ ঋ. (w=1 +
non-linear activation Ѿҗ৬ زੌೞҊ ࠅ ࣻ ) • Positional Embeddingب ࢎਊೞ ঋ (small-scale۽ )
Multi-Scale Transformer Proposed Method • Classification Node • Bertীࢲח [CLS]
ష representationਸ Classificationী ਊೣ • [CLS]ష representation + աݠ ష representation max pooling feature
Experiments Experiments
Effective Attention Scale Experiments • Sequence long-range dependancyܳ ੜ ݽ؛݂ೞח
ഛੋೡ ࣻ ח पਸ ӝദ • input: • п aח uniform distribution U(0,1)۽ ࠗఠ random sampling • target: • ড 20݅ѐ ण/పझ ࣇਸ ٜ݅যࢲ णदఇ A = {a 1 , . . . a N }, a ∈ Rd K ∑ i= 1 a i ⊙ a N−i+1
Effective Attention Scale Experiments • MS-Trans-Hier-S: MS-Transformer 2-layers, 10heads w=3
• MS-Trans-deepHier-S: MS-Transformer 6-layers, 10heads w=3 • MS-Trans-Flex: MS-Transformer 2-layers, multi-scales • w={3, N/16, N/8, N/4, N/2}
Effective Attention Scale Experiments • MS-Trans-Hier-S: MS-Transformer 2-layers, 10heads w=3
• MS-Trans-deepHier-S: MS-Transformer 6-layers, 10heads w=3 • MS-Trans-Flex: MS-Transformer 2-layers, multi-scales • w={3, N/16, N/8, N/4, N/2} Ã • MS-Trans-Hier-S vs MS-Trans-deepHier-S: ୶оੋ layerח ࢿמ ೱ࢚ . • MS-Trans-Flex(+ real experiments): lower layerীࢲ ࠗఠ large-scaleਸ ࠁח Ѫ small- scaleਸ ऺח Ѫ ࠁ ബҗ.
Effective Attention Scale Experiments • Analogy Analysis from BERT •
Pre-trained BERTܳ ਊ೧ ݆ ޙٜਸ forwardingೞҊ, п Layer/Headٜ ন࢚ ঈ
Effective Attention Scale Experiments • Analogy Analysis from BERT •
Pre-trained BERTܳ ਊ೧ ݆ ޙٜਸ forwardingೞҊ, п Layer/Headٜ ন࢚ ঈ • (left) زੌ layer ܲ headܳ ࠺Ү • ݽٚ distanceܳ ҎҊܖ attend(head1), small scale ౠ scale షী attend(head2, head3) • (right) ܲ layerܳ ࠺Ү • ೞਤ layerח ૣ scale షী attend(layer-1), ࢚ਤ layer۽ тࣻ۾ ݽٚ scale షী Ҋܰѱ attend(layer-6, layer-12)
Effective Attention Scale Experiments • Control Factor of Scale Distributions
for Different Layer • , 5ѐ wо ח ҃ • (layer 1) =[0 + 0.5 * 4, 0 + 0.5 * 3, 0 + 0.5 * 2, 0 + 0.5, 0], • … N′ = 10,α = 0.5 [z1 1 , z1 2 , z1 3 , z1 4 , z1 5 ] n l= 1 = {5,2,2,1,0}
Experiment Settings Experiments • Classifier: 2-layer MLP • GloVe Pre-trained
Word-Embeddings • BERT৬ э self-supervised learning method৬ח ࠺Ү ೞ ঋ. • ݽٚ ण word-embeddingਸ ઁ৻ೞҊ from scratch
Text Classification Experiments • SST • MLT-16
Sequence Labeling Experiments
Natural Language Inference Experiments • SNLI
хࢎפ✌ ୶о ޙ ژח ҾӘೠ ݶ ઁٚ ইې োۅ۽
োۅ ࣁਃ! ߔ (ML Software Engineer, Pingpong)
[email protected]