Multi-Scale Self-Attention for Text Classification

Multi-Scale Self-Attention for Text Classification
Baek Young-min (백영민) (ML Research Scientist, Pingpong)

Contents

1. Introduction
   1. Self-Attention
   2. Problem
2. Proposed Method
   1. Scale-Aware Self-Attention
   2. Multi-Scale Multi-Head Self-Attention
   3. Multi-Scale Transformer
3. Experiments
   1. Effective Scale
   2. Text Classification

Introduction

Introduction: Self-Attention

• A technique first introduced in Attention Is All You Need (Vaswani et al., 2017).
• Earlier attention drew Key and Query from different sources (encoder-decoder); self-attention draws Key, Query, and Value from the same sequence.
• Multi-head: several heads run the attention operation independently over the same Key, Query, and Value in order to model diverse aspects of the input.
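
The self-attention idea on this slide can be sketched in a few lines of plain Python. This is only an illustration under a loud assumption: the learned projections W_Q, W_K, W_V of a real Transformer are omitted (treated as identity), so Query, Key, and Value are literally the same input sequence. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Query = Key = Value = x.

    Assumption: learned projections are omitted (identity), so this only
    shows that all three roles come from one and the same sequence.
    """
    d = len(x[0])
    out = []
    for q in x:
        # similarity of this token (as query) to every token (as key)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # weighted average of all tokens (as values)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = self_attention(x)  # one output vector per input token
```

A multi-head variant would run several such computations in parallel, each with its own projections, and concatenate the results.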

Introduction: Self-Attention

• The dominant architecture stacks several Transformer blocks to form an encoder.
• This architecture underlies the SOTA methods for most NLP tasks, e.g. BERT (Devlin et al., 2018) for NLU and GPT (Radford et al., 2019) for generation.

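Introduction: Problem

• Compared with other modules (CNN, RNN), the Transformer is especially vulnerable to the inductive bias problem.
• The model structure is large.
• The model imposes few constraints.
• CNN, RNN: model interactions between specific words.
• Transformer: models pairwise interactions between words (every word is accessible).
• To overcome this, pre-training on a large corpus is commonly used.
→ A Transformer that works well even when trained directly on a small amount of data
• Language also has multi-scale structure (hierarchical structure).
• A high-level feature is a composition of low-level terms.
• The Transformer architecture cannot reflect this (every word is accessible from the first layer; BERT-style methods also resolve this to some extent).
→ A Transformer that can reflect multi-scale structure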

Proposed Method

Proposed Method: Scale-Aware Self-Attention

Within a single head, the range each token can attend to is narrowed to [-w, w].
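
The window restriction can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single head with Query = Key = Value = x (no learned projections), and `scale_aware_attention` is a hypothetical name. Token i only sees tokens j with |i - j| <= w, and the window is clipped at the sequence boundaries.

```python
import math

def scale_aware_attention(x, w):
    """Self-attention where token i attends only to tokens j with |i - j| <= w.

    Assumptions: one head, identity projections (Query = Key = Value = x).
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)  # clip the window at the ends
        window = x[lo:hi]
        scores = [sum(a * b for a, b in zip(x[i], k)) / math.sqrt(d) for k in window]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # stable softmax numerators
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, window))
                    for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
local = scale_aware_attention(tokens, 1)  # each token sees itself and its neighbors
```

With w = 0 every token attends only to itself, and with w >= N - 1 the function reduces to ordinary full self-attention.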

Proposed Method: Multi-Scale Multi-Head Self-Attention

Each head is given a different attention range (Multi-Scale Multi-Head).
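
A rough sketch of giving every head its own window size is shown below. The assumptions are stated loudly: instead of learned per-head projections, each head simply receives its own slice of the feature dimension, and the head outputs are concatenated. The names `windowed_head` and `multi_scale_multi_head` are hypothetical.

```python
import math

def windowed_head(x, w):
    """One attention head restricted to the window [-w, w] (identity projections)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        window = x[max(0, i - w):min(n, i + w + 1)]
        scores = [sum(a * b for a, b in zip(x[i], k)) / math.sqrt(d) for k in window]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, window))
                    for j in range(d)])
    return out

def multi_scale_multi_head(x, scales):
    """Run one head per entry of `scales`, each with its own window size.

    Assumption: head h works on its own slice of the feature dimension
    (a stand-in for learned per-head projections).
    """
    d, n_heads = len(x[0]), len(scales)
    assert d % n_heads == 0, "feature size must be divisible by the head count"
    d_head = d // n_heads
    heads = []
    for h, w in enumerate(scales):
        chunk = [row[h * d_head:(h + 1) * d_head] for row in x]
        heads.append(windowed_head(chunk, w))
    # concatenate the head outputs token by token
    return [sum((head[i] for head in heads), []) for i in range(len(x))]

x4 = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [0.0, 1.0, 0.0, 1.0]]
mixed = multi_scale_multi_head(x4, [0, 2])  # head 0: w=0, head 1: w=2
```

In this toy call the first head (w = 0) leaves its feature slice unchanged, while the second head mixes information across the whole three-token sequence.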

Proposed Method: Multi-Scale Transformer

• No FFN is used (it can be regarded as equivalent to w=1 followed by a non-linear activation).
• No positional embedding is used either (small scales take its place).

Proposed Method: Multi-Scale Transformer

• Classification Node
• BERT uses the representation of the [CLS] token for classification.
• [CLS] token representation + max-pooling feature over the representations of the remaining tokens
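
The classification feature can be sketched as below. Two things are assumptions made for illustration: the [CLS] representation sits at position 0, and the "+" on the slide is read as concatenation. `classification_feature` is a hypothetical helper name.

```python
def classification_feature(hidden):
    """Build the classifier input from per-token representations.

    Assumptions: hidden[0] is the [CLS] representation, at least one other
    token follows, and "+" on the slide means concatenation.
    """
    cls = hidden[0]
    rest = hidden[1:]
    # element-wise max over the remaining tokens (max pooling over the sequence)
    pooled = [max(column) for column in zip(*rest)]
    return cls + pooled  # list concatenation, length 2 * d

feature = classification_feature([[1, 2], [3, 0], [0, 5]])
```

The resulting vector has twice the hidden size and would be passed to the classifier.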

Experiments

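Experiments: Effective Attention Scale

• An experiment was designed to check how well long-range dependencies in a sequence are modeled.
• input: $A = \{a_1, \dots, a_N\},\ a \in \mathbb{R}^d$
• Each $a$ is randomly sampled from the uniform distribution U(0,1).
• target: $\sum_{i=1}^{K} a_i \odot a_{N-i+1}$
• About 200,000 training/test examples were generated and used for training.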
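
The input and target defined above can be generated with a short script. This is a sketch under assumptions: the helper names and the concrete values of N, d, and K in the demo call are hypothetical, and `random.Random.random()` samples from [0, 1).

```python
import random

def make_target(seq, k):
    """Compute sum_{i=1..K} a_i * a_{N-i+1} with element-wise products."""
    n, d = len(seq), len(seq[0])
    target = [0.0] * d
    for i in range(1, k + 1):                     # 1-based i, as in the formula
        a_i, a_mirror = seq[i - 1], seq[n - i]    # a_{N-i+1} is index n - i (0-based)
        for j in range(d):
            target[j] += a_i[j] * a_mirror[j]
    return target

def make_example(n, d, k, rng):
    """Sample one input sequence from U(0, 1) and compute its target."""
    seq = [[rng.random() for _ in range(d)] for _ in range(n)]
    return seq, make_target(seq, k)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
seq, target = make_example(n=8, d=3, k=2, rng=rng)
```

Because the target pairs the first K vectors with the last K vectors, a model can only fit it by relating positions that are up to N - 1 steps apart, which is what makes the task a probe for long-range dependency.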

Experiments: Effective Attention Scale

• MS-Trans-Hier-S: MS-Transformer, 2 layers, 10 heads, w=3
• MS-Trans-deepHier-S: MS-Transformer, 6 layers, 10 heads, w=3
• MS-Trans-Flex: MS-Transformer, 2 layers, multi-scale
  • w={3, N/16, N/8, N/4, N/2}
• MS-Trans-Hier-S vs MS-Trans-deepHier-S: additional layers bring little performance gain.
• MS-Trans-Flex (+ real experiments): looking at large scales from the lower layers onward is more effective than stacking small scales.

Experiments: Effective Attention Scale

• Analogy Analysis from BERT
• Many sentences were forwarded through a pre-trained BERT to characterize the behavior of each layer and head.
• (left) Comparison of different heads in the same layer
  • head1 attends evenly across all distances; head2 and head3 attend to tokens at specific small scales.
• (right) Comparison of different layers
  • Lower layers attend to tokens at short scales (layer-1); the higher the layer, the more evenly it attends to tokens at all scales (layer-6, layer-12).

Experiments: Effective Attention Scale

• Control Factor of Scale Distributions for Different Layer
• Case with $N' = 10$, $\alpha = 0.5$, and 5 values of w:
• (layer 1) $[z^1_1, z^1_2, z^1_3, z^1_4, z^1_5] = [0 + 0.5 \cdot 4,\ 0 + 0.5 \cdot 3,\ 0 + 0.5 \cdot 2,\ 0 + 0.5,\ 0]$, $n^{l=1} = \{5, 2, 2, 1, 0\}$
• …

Experiments: Experiment Settings

• Classifier: 2-layer MLP
• GloVe pre-trained word embeddings
• No comparison is made with self-supervised learning methods such as BERT.
• All training is from scratch, except for the word embeddings.

Experiments: Text Classification

• SST
• MTL-16

Experiments: Sequence Labeling

Experiments: Natural Language Inference

• SNLI

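Thank you ✌

If you have further questions or anything you are curious about, feel free to reach out at the contact below at any time!
Baek Young-min (백영민) (ML Software Engineer, Pingpong) [email protected]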

Scatter Lab Inc.
