
Multi-Scale Self-Attention for Text Classification

Scatter Lab Inc.

January 16, 2020

Transcript

  1. Table of Contents
     1. Introduction
        1. Self-Attention
        2. Problem
     2. Proposed Method
        1. Scale-Aware Self-Attention
        2. Multi-Scale Multi-Head Self-Attention
        3. Multi-Scale Transformer
     3. Experiments
        1. Effective Scale
        2. Text Classification
  2. Self-Attention (Introduction)
     • The mechanism was first introduced in Attention Is All You Need (Vaswani et al., 2017).
     • In the original attention, Key and Query come from different sources (Encoder-Decoder attention); in Self-Attention, Key, Query, and Value all come from the same sequence.
     • Multi-head: several heads run the attention operation independently on the same Key/Query/Value, which lets the model capture diverse patterns (a minimal sketch follows below).
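
     A minimal sketch of (multi-head) scaled dot-product self-attention, assuming toy sizes and random projection matrices; it is not the deck's code, only an illustration of the mechanism described above.

        import numpy as np

        def softmax(x, axis=-1):
            x = x - x.max(axis=axis, keepdims=True)
            e = np.exp(x)
            return e / e.sum(axis=axis, keepdims=True)

        def self_attention(X, Wq, Wk, Wv):
            # Self-attention: Query, Key and Value are all projections of the same X.
            Q, K, V = X @ Wq, X @ Wk, X @ Wv
            scores = Q @ K.T / np.sqrt(K.shape[-1])   # pair-wise scores, shape (N, N)
            return softmax(scores) @ V

        # Multi-head: several heads attend independently, then their outputs are concatenated.
        rng = np.random.default_rng(0)
        N, d, n_heads, d_head = 6, 16, 4, 4           # illustrative sizes
        X = rng.normal(size=(N, d))
        heads = [self_attention(X, *(rng.normal(size=(d, d_head)) for _ in range(3)))
                 for _ in range(n_heads)]
        out = np.concatenate(heads, axis=-1)          # shape (N, n_heads * d_head)
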
  3. Self-Attention (Introduction)
     • The most common setup stacks several Transformer blocks to build an encoder.
     • This structure is used by the SOTA methods for most NLP tasks, e.g. NLU with BERT (Devlin et al., 2018) and generation with GPT (Radford et al., 2019).
  4. Problem (Introduction)
     • Compared with other modules (CNN, RNN), the Transformer is particularly vulnerable to the inductive-bias problem:
       • the model is large, and it places few constraints on the model;
       • CNN, RNN: model interactions between specific (nearby) words;
       • Transformer: models pair-wise interactions between all words (every word can attend to every other word).
     • To overcome this, pre-training on a large corpus is commonly used.
       → Goal: a Transformer that works well even when trained directly on a small amount of data.
     • Language itself also has a multi-scale, hierarchical structure:
       • high-level features are compositions of low-level terms;
       • the Transformer structure cannot reflect this (every word is reachable from the very first layer; the BERT recipe only partially resolves it).
       → Goal: a Transformer that can reflect this multi-scale structure.
  5. Multi-Scale Transformer (Proposed Method)
     • The FFN sub-layer is not used (it can be seen as equivalent to a head with w=1 followed by a non-linear activation).
     • Positional embeddings are not used either (small-scale attention takes over their role); a sketch of the scale-aware window restriction follows below.
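
     A minimal sketch of the scale-aware restriction described above: a head with scale w only lets position i attend to positions j with |i − j| ≤ w. The masking implementation is an assumption for illustration, not the authors' code; with a small w the head behaves like a local mixer, which is why positional embeddings can be dropped.

        import numpy as np

        def softmax(x, axis=-1):
            x = x - x.max(axis=axis, keepdims=True)
            e = np.exp(x)
            return e / e.sum(axis=axis, keepdims=True)

        def scale_aware_self_attention(X, Wq, Wk, Wv, w):
            # Each position i may only attend to positions j with |i - j| <= w.
            N = X.shape[0]
            Q, K, V = X @ Wq, X @ Wk, X @ Wv
            scores = Q @ K.T / np.sqrt(K.shape[-1])
            i, j = np.indices((N, N))
            scores = np.where(np.abs(i - j) <= w, scores, -np.inf)   # window (scale) mask
            return softmax(scores) @ V

        rng = np.random.default_rng(0)
        X = rng.normal(size=(8, 16))
        W = [rng.normal(size=(16, 16)) for _ in range(3)]
        local = scale_aware_self_attention(X, *W, w=1)   # small scale: very local attention
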
  6. Multi-Scale Transformer (Proposed Method)
     • Classification node:
       • BERT uses the representation of the [CLS] token for classification.
       • Here: [CLS] token representation + a max-pooling feature over the remaining token representations (see the sketch below).
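
     A sketch of the classification node described above. The slide's "+" is not spelled out, so this assumes concatenation of the [CLS] representation with the max-pooled representations of the remaining tokens; position 0 is assumed to hold [CLS].

        import numpy as np

        def classification_feature(hidden_states):
            # hidden_states: (seq_len, hidden); position 0 assumed to be the [CLS] node
            cls_vec = hidden_states[0]                  # [CLS] representation
            rest_max = hidden_states[1:].max(axis=0)    # max pooling over the remaining tokens
            return np.concatenate([cls_vec, rest_max])  # fed to the classifier

        feat = classification_feature(np.random.default_rng(0).normal(size=(12, 8)))
        print(feat.shape)   # (16,)
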
  7. Effective Attention Scale (Experiments)
     • An experiment designed to check how well the model captures long-range dependencies in a sequence.
     • Input: A = {a_1, ..., a_N}, a ∈ R^d, each a_i sampled from the uniform distribution U(0, 1).
     • Target: ∑_{i=1}^{K} a_i ⊙ a_{N−i+1}, so the output couples tokens at opposite ends of the sequence.
     • About 200K train/test examples are generated and used for training (a data-generation sketch follows below).
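
     A sketch of how one example of the synthetic task above could be generated; N, d and K are illustrative values, since the slide only fixes the uniform sampling and the target formula.

        import numpy as np

        def make_example(N=64, d=8, K=4, rng=np.random.default_rng(0)):
            # a_1 ... a_N, each a_i in R^d sampled from U(0, 1)
            A = rng.uniform(0.0, 1.0, size=(N, d))
            # target = sum_{i=1..K} a_i ⊙ a_{N-i+1}: couples tokens at opposite ends,
            # so solving it requires modelling long-range dependencies
            y = sum(A[i] * A[N - i - 1] for i in range(K))
            return A, y

        X, y = make_example()
        print(X.shape, y.shape)   # (64, 8) (8,)
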
  8. Effective Attention Scale (Experiments)
     • MS-Trans-Hier-S: MS-Transformer, 2 layers, 10 heads, w=3
     • MS-Trans-deepHier-S: MS-Transformer, 6 layers, 10 heads, w=3
     • MS-Trans-Flex: MS-Transformer, 2 layers, multi-scale heads with w = {3, N/16, N/8, N/4, N/2} (written out as a small config sketch below)
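
     The three configurations above written out as a small dictionary for reference; the head count for MS-Trans-Flex is an assumption (the slide lists only its layers and scales), and scales that depend on the sequence length N are expressed as functions of N.

        configs = {
            "MS-Trans-Hier-S":     {"layers": 2, "heads": 10, "scales": lambda N: [3]},
            "MS-Trans-deepHier-S": {"layers": 6, "heads": 10, "scales": lambda N: [3]},
            "MS-Trans-Flex":       {"layers": 2, "heads": 10,   # head count assumed
                                    "scales": lambda N: [3, N // 16, N // 8, N // 4, N // 2]},
        }
        print(configs["MS-Trans-Flex"]["scales"](128))   # [3, 8, 16, 32, 64]
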
  9. Effective Attention Scale (Experiments)
     • MS-Trans-Hier-S: MS-Transformer, 2 layers, 10 heads, w=3
     • MS-Trans-deepHier-S: MS-Transformer, 6 layers, 10 heads, w=3
     • MS-Trans-Flex: MS-Transformer, 2 layers, multi-scale heads with w = {3, N/16, N/8, N/4, N/2}
     • MS-Trans-Hier-S vs MS-Trans-deepHier-S: additional layers bring only a small performance gain.
     • MS-Trans-Flex (+ real experiments): letting the lower layers already see large scales works better than stacking small-scale layers.
  10. Effective Attention Scale (Experiments)
     • Analogy analysis from BERT:
       • Forward many sentences through a pre-trained BERT and inspect the attention behavior of each layer and head.
  11. Effective Attention Scale (Experiments)
     • Analogy analysis from BERT: forward many sentences through a pre-trained BERT and inspect the attention behavior of each layer and head (a sketch follows below).
     • (left) Comparing different heads within the same layer: head 1 attends roughly evenly over all distances, while heads 2 and 3 attend to tokens at a specific small scale.
     • (right) Comparing different layers: the lower layers attend to tokens at short scales (layer 1); going up, attention spreads more evenly over all scales (layers 6 and 12).
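
     A sketch of this kind of analysis, assuming the Hugging Face transformers package; the mean attention distance per head used here is my own simplification of "which scale each layer/head attends to", not necessarily the exact statistic used in the paper.

        import torch
        from transformers import AutoModel, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

        sentences = ["the quick brown fox jumps over the lazy dog"]   # use many sentences in practice
        enc = tok(sentences, return_tensors="pt")

        with torch.no_grad():
            attentions = model(**enc).attentions    # one (batch, heads, seq, seq) tensor per layer

        pos = torch.arange(enc["input_ids"].shape[1])
        dist = (pos[None, :] - pos[:, None]).abs().float()   # |i - j| for every query/key pair

        for layer_idx, attn in enumerate(attentions, start=1):
            # expected attention distance per head, averaged over queries and sentences
            mean_dist = (attn * dist).sum(-1).mean(dim=(0, 2))
            print(f"layer {layer_idx}: {[round(v.item(), 2) for v in mean_dist]}")
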
  12. Effective Attention Scale (Experiments)
     • Control factor of scale distributions for different layers.
     • Example with N′ = 10, α = 0.5 and five scales w (worked through in the sketch below):
       • (layer 1) [z^1_1, z^1_2, z^1_3, z^1_4, z^1_5] = [0 + 0.5·4, 0 + 0.5·3, 0 + 0.5·2, 0 + 0.5, 0], with n_{l=1} = {5, 2, 2, 1, 0}
       • …
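
     A small sketch reproducing the layer-1 numbers above. The pattern z_k = 0 + α·(number of scales − k) is read off the slide for layer 1; how the z values are converted into the per-scale head counts n is defined in the paper and is not reproduced here.

        alpha, n_scales = 0.5, 5                      # N' = 10 heads in total on the slide
        z_layer1 = [0 + alpha * (n_scales - k) for k in range(1, n_scales + 1)]
        print(z_layer1)                               # [2.0, 1.5, 1.0, 0.5, 0.0]
        # The slide reports the resulting head allocation for layer 1 as n = {5, 2, 2, 1, 0}.
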
  13. Experiment Settings (Experiments)
     • Classifier: 2-layer MLP (a sketch follows below)
     • GloVe pre-trained word embeddings
     • No comparison against self-supervised pre-training methods such as BERT.
     • Everything except the word embeddings is trained from scratch.
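
     A sketch of the 2-layer MLP classifier head, in PyTorch; the hidden size, activation and input/output dimensions are assumptions, since the slide only fixes the number of layers.

        import torch.nn as nn

        class MLPClassifier(nn.Module):
            """2-layer MLP over the pooled sentence feature (sizes are illustrative)."""
            def __init__(self, in_dim=512, hidden_dim=256, n_classes=2):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(in_dim, hidden_dim),
                    nn.ReLU(),
                    nn.Linear(hidden_dim, n_classes),
                )

            def forward(self, x):
                return self.net(x)
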