
Multi-Scale Self-Attention for Text Classification



Scatter Lab Inc.

January 16, 2020



Transcript

  1. Table of Contents
     1. Introduction
        1. Self-Attention
        2. Problem
     2. Proposed Method
        1. Scale-Aware Self-Attention
        2. Multi-Scale Multi-Head Self-Attention
        3. Multi-Scale Transformer
     3. Experiments
        1. Effective Scale
        2. Text Classification
  2. Introduction: Self-Attention
     • Attention Is All You Need (Vaswani et al., 2017): the paper that first introduced the technique.
     • Earlier attention used different sources for Key and Query (encoder-decoder attention); in self-attention, Key, Query, and Value all come from the same sequence.
     • Multi-head: several heads run the attention operation independently over the same Key, Query, and Value, so that the model can capture diverse aspects (a sketch follows below).
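As a reference for the self-attention and multi-head mechanism described above, here is a minimal PyTorch sketch in which Query, Key, and Value are all projections of the same input; the dimensions (d_model=256, n_heads=8) are illustrative values, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Self-attention: Query, Key, and Value are projections of the same input."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # one joint projection for Q, K, V
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into independent heads so each head can model a different aspect
        q, k, v = (t.view(b, n, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (b, heads, n, n)
        attn = F.softmax(scores, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(ctx)

x = torch.randn(2, 10, 256)
print(MultiHeadSelfAttention()(x).shape)  # torch.Size([2, 10, 256])
```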
  3. Introduction: Self-Attention
     • The dominant architecture stacks several Transformer blocks to build an encoder.
     • NLU (BERT; Devlin et al., 2018) and generation (GPT; Radford et al., 2019): this is the structure used by the SOTA methods for most NLP tasks.
  4. Introduction: Problem
     • Compared with other modules (CNN, RNN), the Transformer suffers particularly from the inductive-bias problem:
       • the model is large and imposes few constraints on itself;
       • CNN and RNN model interactions between specific (nearby) words, while the Transformer models pair-wise interactions between all words (every word can attend to every other word);
       • to compensate, pre-training on a large corpus is commonly used.
       → Goal: a Transformer that works well even when trained directly on a small amount of data.
     • Language also has a multi-scale structure (hierarchical structure): high-level features are compositions of low-level terms.
       • The Transformer structure cannot reflect this (from the first layer every word can attend to every other word; the BERT method resolves this only partially).
       → Goal: a Transformer that can reflect multi-scale structure.
  5. Proposed Method: Multi-Scale Transformer
     • The position-wise FFN is not used (it can be viewed as equivalent to a scale of w=1 plus a non-linear activation).
     • Positional embeddings are not used either (replaced by small-scale attention). A sketch of the scale-restricted attention follows below.
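A minimal sketch of the scale-aware restriction, assuming a head with scale w can be implemented by masking the attention scores so each position attends only to positions within distance w; the window-mask construction below is my illustration under that assumption, not the authors' code.

```python
import torch
import torch.nn.functional as F

def scale_aware_attention(q, k, v, w):
    """Self-attention restricted to a window: position i attends only to j with |i - j| <= w.

    q, k, v: (batch, heads, seq_len, d_head); w: the scale (window size) of these heads.
    """
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (batch, heads, n, n)
    pos = torch.arange(n)
    outside = (pos[None, :] - pos[:, None]).abs() > w     # True where |i - j| > w
    scores = scores.masked_fill(outside, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Small-scale heads behave like a convolution over neighbouring tokens, which is
# why the slide can drop the position-wise FFN and the positional embeddings.
q = k = v = torch.randn(2, 4, 16, 32)
print(scale_aware_attention(q, k, v, w=3).shape)  # torch.Size([2, 4, 16, 32])
```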
  6. Proposed Method: Multi-Scale Transformer
     • Classification node:
       • BERT uses the representation of the [CLS] token for classification;
       • here, the [CLS] token representation is combined with a max-pooling feature over the remaining token representations (see the sketch below).
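A sketch of how that classification feature could be assembled; concatenating the two parts is my assumption, since the slide only says that the [CLS] representation and the max-pooled feature are combined.

```python
import torch

def classification_feature(hidden):               # hidden: (batch, seq_len, d_model)
    cls_rep = hidden[:, 0]                        # representation of the [CLS] node
    rest_max, _ = hidden[:, 1:].max(dim=1)        # max pooling over the remaining tokens
    return torch.cat([cls_rep, rest_max], dim=-1) # (batch, 2 * d_model), fed to the classifier

h = torch.randn(2, 12, 256)
print(classification_feature(h).shape)  # torch.Size([2, 512])
```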
  7. Experiments: Effective Attention Scale
     • An experiment designed to check how well long-range dependencies in a sequence are modeled.
     • Input: A = {a_1, ..., a_N}, a ∈ R^d, where each a is randomly sampled from the uniform distribution U(0, 1).
     • Target: Σ_{i=1}^{K} a_i ⊙ a_{N-i+1}
     • About 200k train/test examples are generated for training (a data-generation sketch follows below).
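A minimal sketch of generating one training pair for this synthetic task; the values of N, d, and K are placeholders, since the slide only specifies the sampling distribution and the target formula.

```python
import torch

def make_example(N=64, d=16, K=4):
    """One synthetic pair: target = sum over i=1..K of a_i * a_{N-i+1} (elementwise)."""
    A = torch.rand(N, d)                          # each a_i ~ U(0, 1)
    target = (A[:K] * A.flip(0)[:K]).sum(dim=0)   # pairs (a_1, a_N), (a_2, a_{N-1}), ...
    return A, target

A, y = make_example()
print(A.shape, y.shape)  # torch.Size([64, 16]) torch.Size([16])
```

Solving this task requires matching each of the first K vectors with its mirror near the end of the sequence, which is why it probes long-range dependency modeling.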
  8. Experiments: Effective Attention Scale
     • MS-Trans-Hier-S: MS-Transformer, 2 layers, 10 heads, w=3
     • MS-Trans-deepHier-S: MS-Transformer, 6 layers, 10 heads, w=3
     • MS-Trans-Flex: MS-Transformer, 2 layers, multi-scales, w = {3, N/16, N/8, N/4, N/2}
  9. Experiments: Effective Attention Scale
     • MS-Trans-Hier-S: MS-Transformer, 2 layers, 10 heads, w=3
     • MS-Trans-deepHier-S: MS-Transformer, 6 layers, 10 heads, w=3
     • MS-Trans-Flex: MS-Transformer, 2 layers, multi-scales, w = {3, N/16, N/8, N/4, N/2}
     • MS-Trans-Hier-S vs MS-Trans-deepHier-S: the additional layers bring little performance improvement.
     • MS-Trans-Flex (+ real experiments): attending to large scales from the lower layers is more effective than stacking small scales.
  10. Experiments: Effective Attention Scale
     • Analogy analysis from BERT
     • Forward many sentences through a pre-trained BERT and inspect the behaviour of each layer/head.
  11. Experiments: Effective Attention Scale
     • Analogy analysis from BERT: forward many sentences through a pre-trained BERT and inspect the behaviour of each layer/head (an analysis sketch follows below).
     • (left) Comparing different heads within the same layer: some heads attend evenly over all distances (head 1), others attend to tokens at a specific small scale (head 2, head 3).
     • (right) Comparing different layers: lower layers attend to short-scale tokens (layer 1); moving to upper layers, attention spreads evenly over tokens at all scales (layer 6, layer 12).
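A rough sketch of this kind of analysis using the Hugging Face transformers library, assuming "scale" is measured as the attention-weighted average distance |i - j| per head; the exact statistic used by the authors is not stated on the slide.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

inputs = tokenizer("Multi-scale structure also exists in language.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions     # one (batch, heads, seq, seq) tensor per layer

seq_len = inputs["input_ids"].shape[1]
pos = torch.arange(seq_len)
dist = (pos[None, :] - pos[:, None]).abs().float()

for layer_idx, attn in enumerate(attentions, start=1):
    # attention-weighted average distance per head: larger means the head attends at larger scales
    mean_dist = (attn[0] * dist).sum(dim=-1).mean(dim=-1)   # (heads,)
    print(f"layer {layer_idx:2d}:", [round(x, 1) for x in mean_dist.tolist()])
```

In the slide's setting the statistic would be averaged over many forwarded sentences, not a single example as shown here.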
  12. Experiments: Effective Attention Scale
     • Control factor of the scale distributions for different layers.
     • Example with N′ = 10, α = 0.5, and 5 scales w:
       • (layer 1) [z¹_1, z¹_2, z¹_3, z¹_4, z¹_5] = [0 + 0.5·4, 0 + 0.5·3, 0 + 0.5·2, 0 + 0.5, 0], n_{l=1} = {5, 2, 2, 1, 0}
       • …
  13. Experiments: Experiment Settings
     • Classifier: 2-layer MLP (a sketch follows below).
     • GloVe pre-trained word embeddings.
     • No comparison with self-supervised learning methods such as BERT.
     • Everything is trained from scratch, except for the word embeddings.
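A minimal sketch of the classifier head under these settings; the hidden width and the ReLU activation are assumptions, as the slide only specifies a 2-layer MLP.

```python
import torch.nn as nn

def make_classifier(feature_dim, num_classes, hidden=256):
    """2-layer MLP classifier on top of the encoder's classification feature."""
    return nn.Sequential(
        nn.Linear(feature_dim, hidden),
        nn.ReLU(),                       # activation assumed; not specified on the slide
        nn.Linear(hidden, num_classes),
    )

clf = make_classifier(feature_dim=512, num_classes=5)  # e.g. [CLS] + max-pool feature, 5 classes
```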