
Deep Contextualized Word Representations

Shun Kiyono
August 04, 2018

Transcript

  1. Deep Contextualized Word Representations
     Presenter: Shun Kiyono, Inui-Suzuki Laboratory, Tohoku University (M2)
     The 10th 最先端NLP勉強会 (Advanced NLP Study Group)

     [Paper title page shown on the slide]
     Deep contextualized word representations (arXiv, 22 Mar 2018)
     Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner (Allen Institute for Artificial Intelligence);
     Christopher Clark, Kenton Lee, Luke Zettlemoyer (Paul G. Allen School of Computer Science & Engineering, University of Washington)
     From the abstract: "We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. [...] we call them ELMo (Embeddings from Language Models) representations. Unlike previous approaches for learning contextualized word vectors (Peters et al., 2017; McCann et al., 2017), ELMo representations are deep, in the sense that they are a function of all of the internal layers of the biLM."
  2. Overview
     • Goal: build better word-meaning representations (word vectors)
     • Idea: deep contextualized word vectors
     • Proposed method: ELMo
       • Pre-train a bidirectional, multi-layer RNN language model
       • Treat the weighted sum of its layers as the word vector (the ELMo vector) and feed it to the downstream task as input
     • Evaluation: achieves state-of-the-art performance on multiple benchmark datasets
       • QA, textual entailment, semantic role labeling, coreference resolution, named entity recognition, sentiment analysis
  3. Inside ELMo: a bidirectional language model
     • Pre-train a bidirectional language model on a huge corpus
     • Word vectors are computed with a character-based CNN
       • A world in which unknown words do not exist
     • A 2-layer LSTM is trained independently for each direction
       • However, the CNN layer and the Softmax layer are shared between the two directions
     (A minimal sketch of this architecture follows below.)
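     The biLM above can be sketched very roughly as follows. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the character-CNN is reduced to a single toy convolution, the dimensions are placeholders, and the projection and skip connections of the real ELMo LSTMs are omitted. It is only meant to show that (1) token representations are built from characters, so there are no unknown words, (2) the forward and backward 2-layer LSTMs are separate networks, and (3) the character encoder and the softmax output layer are shared between the two directions.

     import torch
     import torch.nn as nn

     class CharCNNEncoder(nn.Module):
         """Toy character-based token encoder (stands in for ELMo's char-CNN + highway layers)."""
         def __init__(self, n_chars=262, char_dim=16, token_dim=512):
             super().__init__()
             self.char_emb = nn.Embedding(n_chars, char_dim)
             self.conv = nn.Conv1d(char_dim, token_dim, kernel_size=3, padding=1)

         def forward(self, chars):                     # chars: (batch, seq, max_chars) char ids
             b, s, c = chars.shape
             x = self.char_emb(chars.view(b * s, c))   # (b*s, max_chars, char_dim)
             x = self.conv(x.transpose(1, 2))          # (b*s, token_dim, max_chars)
             x = x.max(dim=-1).values                  # max-pool over characters
             return x.view(b, s, -1)                   # (batch, seq, token_dim)

     class BiLM(nn.Module):
         """Bidirectional LM: two independent 2-layer LSTMs sharing the char encoder and softmax."""
         def __init__(self, vocab_size, token_dim=512, hidden_dim=512, n_layers=2):
             super().__init__()
             self.encoder = CharCNNEncoder(token_dim=token_dim)   # shared by both directions
             self.fwd_lstm = nn.LSTM(token_dim, hidden_dim, n_layers, batch_first=True)
             self.bwd_lstm = nn.LSTM(token_dim, hidden_dim, n_layers, batch_first=True)
             self.softmax = nn.Linear(hidden_dim, vocab_size)     # shared output layer

         def forward(self, chars):
             tokens = self.encoder(chars)
             fwd, _ = self.fwd_lstm(tokens)            # reads left-to-right
             bwd, _ = self.bwd_lstm(tokens.flip(1))    # reads right-to-left
             bwd = bwd.flip(1)
             # Logits for the next word (forward) and the previous word (backward);
             # the two cross-entropy LM losses are summed during pre-training.
             return self.softmax(fwd), self.softmax(bwd)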
  4. How it is used: feed the linear combination of the layers to the task
     • Compute an ELMo vector for each token and use it as an input to each downstream task
     • Concatenate it with the task's word embedding layer
     [Figure: the layer outputs are combined by a weighted sum and scaled by a scalar; the result is concatenated with the token embedding, which is used as-is. Only the weights and the scalar are trained.]
     (A sketch of this combination step follows below.)
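     The combination step is just a softmax-normalized weighted sum over the layer outputs, scaled by a learned scalar γ, i.e. ELMo_k = γ · Σ_j s_j · h_{k,j}; the result is concatenated with the task's own word embedding. A minimal sketch (the module and tensor names are placeholders, not from the paper's code):

     import torch
     import torch.nn as nn
     import torch.nn.functional as F

     class ScalarMix(nn.Module):
         """ELMo-style combination: softmax-weighted sum of the biLM layers times a scalar gamma.
         Only these weights and gamma are trained for the end task; the biLM stays frozen."""
         def __init__(self, num_layers):
             super().__init__()
             self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
             self.gamma = nn.Parameter(torch.ones(1))

         def forward(self, layer_outputs):              # list of (batch, seq, dim) tensors
             s = F.softmax(self.scalar_weights, dim=0)  # normalized layer weights s_j
             mixed = sum(w * h for w, h in zip(s, layer_outputs))
             return self.gamma * mixed                  # ELMo_k = gamma * sum_j s_j * h_{k,j}

     # Hypothetical usage inside a task model:
     # scalar_mix = ScalarMix(num_layers=3)                       # char-CNN layer + 2 LSTM layers
     # elmo_vec = scalar_mix([h_layer0, h_layer1, h_layer2])      # frozen biLM layer outputs
     # task_input = torch.cat([task_word_emb, elmo_vec], dim=-1)  # concat with the task's embeddings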
  5. Experimental results: ELMo works
     • Achieves state-of-the-art on many tasks
     • No significance tests, though…

     Table 1: Test set comparison of ELMo-enhanced neural models with state-of-the-art single-model baselines across six benchmark NLP tasks. Metrics: accuracy for SNLI and SST-5; F1 for SQuAD, SRL and NER; average F1 for Coref. Due to the small test sizes for NER and SST-5, mean ± std over five random seeds is reported.
       TASK   | PREVIOUS SOTA                      | OUR BASELINE | ELMO + BASELINE | INCREASE (ABS. / REL.)
       SQuAD  | Liu et al. (2017)     84.4         | 81.1         | 85.8            | 4.7 / 24.9%
       SNLI   | Chen et al. (2017)    88.6         | 88.0         | 88.7 ± 0.17     | 0.7 / 5.8%
       SRL    | He et al. (2017)      81.7         | 81.4         | 84.6            | 3.2 / 17.2%
       Coref  | Lee et al. (2017)     67.2         | 67.2         | 70.4            | 3.2 / 9.8%
       NER    | Peters et al. (2017)  91.93 ± 0.19 | 90.15        | 92.22 ± 0.10    | 2.06 / 21%
       SST-5  | McCann et al. (2017)  53.7         | 51.4         | 54.7 ± 0.5      | 3.3 / 6.8%
  6. Analysis 1: what information each layer captures
     • Solve word sense disambiguation (WSD) and POS tagging using each layer's representations
       • For WSD: second layer > first layer
       • For POS tagging: first layer > second layer
     • Do the shallow layers capture syntax-level information and the deep layers capture sense-level information?
     (A toy probing sketch follows below.)

     Table 4: Nearest neighbors to "play" using GloVe and the context embeddings from a biLM.
       GloVe | play | playing, game, games, played, players, plays, player, Play, football, multiplayer
       biLM  | "Chico Ruiz made a spectacular play on Alusik's grounder {...}" | "Kieffer, the only junior in the group, was commended for his ability to hit in the clutch, as well as his all-round excellent play."
       biLM  | "Olivia De Havilland signed to do a Broadway play for Garson {...}" | "{...} they were actors who had been handed fat roles in a successful play, and had talent enough to fill the roles competently, with nice understatement."

     Table 5: All-words fine-grained WSD F1. For CoVe and the biLM, scores are reported for both the first and second layer biLSTMs.
       WordNet 1st Sense Baseline | 65.9
       Raganato et al. (2017a)    | 69.9
       Iacobacci et al. (2016)    | 70.1
       CoVe, First Layer          | 59.4
       CoVe, Second Layer         | 64.7
       biLM, First Layer          | 67.4
       biLM, Second Layer         | 69.0

     Table 6: Test set POS tagging accuracies for PTB.
       Collobert et al. (2011) | 97.3
       Ma and Hovy (2016)      | 97.6
       Ling et al. (2015)      | 97.8
       CoVe, First Layer       | 93.3
       CoVe, Second Layer      | 92.8
       biLM, First Layer       | 97.3
       biLM, Second Layer      | 96.8
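     One concrete way to read "use each layer's representations" is a simple probing setup: freeze the biLM, extract the representations from one layer, and train only a light classifier on top, then compare layers by the resulting accuracy. The sketch below shows this for POS tagging with a linear probe; it is an illustration of the idea only, since the paper's actual protocol differs in detail (its WSD evaluation, for instance, uses a nearest-neighbour scheme over sense representations).

     import torch
     import torch.nn as nn

     def probe_layer(frozen_layer_reprs, gold_tags, num_tags, dim, epochs=50, lr=1e-2):
         """Train a linear probe on frozen biLM representations from a single layer.

         frozen_layer_reprs: (num_tokens, dim) tensor taken from layer 1 or layer 2 of the biLM
         gold_tags:          (num_tokens,) gold POS tag ids
         """
         probe = nn.Linear(dim, num_tags)
         opt = torch.optim.Adam(probe.parameters(), lr=lr)
         loss_fn = nn.CrossEntropyLoss()
         for _ in range(epochs):
             opt.zero_grad()
             loss = loss_fn(probe(frozen_layer_reprs), gold_tags)
             loss.backward()
             opt.step()
         with torch.no_grad():
             acc = (probe(frozen_layer_reprs).argmax(-1) == gold_tags).float().mean()
         return acc.item()  # in practice, evaluate on held-out data; compare layer 1 vs. layer 2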
  7. Analysis 2: improved sample efficiency
     • The smaller the training data, the larger ELMo's effect
     • For SRL, 1% of the training data + ELMo matches the performance of training on 10% of the data
     [Figure 1: comparison of the baseline vs. the ELMo-enhanced model as the amount of training data varies (x-axis: fraction of training data; y-axis: performance).]
  8. A1: it comes from TagLM [Peters+2017]
     • A study showing that embeddings built from a bidirectional language model are effective for NER and Chunking
     • The framework is almost the same as ELMo's
     • The authors are the same in the first place
     (A rough sketch of the TagLM-style concatenation follows below.)

     [Figure: TagLM architecture — a pre-trained bi-LM produces an LM embedding for each token, which is concatenated with the sequence-tagging model's first bi-RNN outputs before the second bi-RNN and the CRF output layer.]

     Table 1 (Peters+2017): Test set F1 on the CoNLL 2003 NER task, using only CoNLL 2003 data and unlabeled text.
       Chiu and Nichols (2016)  | 90.91 ± 0.20
       Lample et al. (2016)     | 90.94
       Ma and Hovy (2016)       | 91.37
       Our baseline without LM  | 90.87 ± 0.13
       TagLM                    | 91.93 ± 0.19

     Table 2 (Peters+2017): Test set F1 on the CoNLL 2000 Chunking task, using only CoNLL 2000 data and unlabeled text.
       Yang et al. (2017)          | 94.66
       Hashimoto et al. (2016)     | 95.02
       Søgaard and Goldberg (2016) | 95.28
       Our baseline without LM     | 95.00 ± 0.08
       TagLM                       | 96.37 ± 0.05

     [Peters+2017] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
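     At a high level, the TagLM recipe is: run the frozen pre-trained bi-LM over the sentence and concatenate its hidden state for each token with the tagger's own sequence representation before the second bi-RNN layer. A rough sketch under that reading, with all module and tensor names hypothetical (the CRF output layer is omitted):

     import torch
     import torch.nn as nn

     class TagLMStyleTagger(nn.Module):
         """Sequence tagger that injects frozen bi-LM embeddings between its two bi-RNN layers."""
         def __init__(self, token_dim, lm_dim, hidden_dim, num_tags, frozen_bilm):
             super().__init__()
             self.bilm = frozen_bilm                    # pre-trained bi-LM, parameters not updated
             self.rnn1 = nn.LSTM(token_dim, hidden_dim, batch_first=True, bidirectional=True)
             self.rnn2 = nn.LSTM(2 * hidden_dim + lm_dim, hidden_dim,
                                 batch_first=True, bidirectional=True)
             self.out = nn.Linear(2 * hidden_dim, num_tags)

         def forward(self, token_reprs):                # (batch, seq, token_dim)
             with torch.no_grad():
                 lm_emb = self.bilm(token_reprs)        # (batch, seq, lm_dim) LM embeddings
             h1, _ = self.rnn1(token_reprs)             # sequence representation
             h2, _ = self.rnn2(torch.cat([h1, lm_emb], dim=-1))  # concatenate LM embedding here
             return self.out(h2)                        # per-token tag scores (CRF omitted)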
  9. Q2: Isn't ELMo's language model too big?
     • Inside ELMo is a 2-layer BiLSTM; the input is projected 2048 → 512 dimensions and the hidden layers 4096 → 512 dimensions
     • Fine-tuning ELMo's language model for every task is painful…
     • Training ELMo from scratch is also painful… (on the order of one month on 8-32 GPUs)
     • What happens if we swap out the internal language model?
       • Would a smaller / weaker language model also work?
  10. A2: the language model needs to be sufficiently large (= strong as a language model) for the performance gain
      • [Peters+2017] discusses which kind of language model to use inside
        (1) Making the language model stronger further improves task performance
        (2) A language model trained only on the task's training data hurts performance
      • On the other hand, there is also a report that multi-task learning with a language model improves performance:
        Rei, Marek. "Semi-supervised multitask learning for sequence labeling." ACL 2017
      [Peters+2017] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
  11. Aside: language models as feature extractors are becoming a trend?
      • Transformer LM (OpenAI):
        • Pre-train a Transformer-based language model and fine-tune it on each task
        • On SNLI, this approach performs better than ELMo
      • The cause of the performance gap is not well understood
        • Difference in the model? Difference in the dataset?
        • This one is pre-trained on BookCorpus
      [Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. All structured inputs are converted into token sequences processed by the pre-trained model, followed by a linear + softmax layer.]
      https://blog.openai.com/language-unsupervised/
  12. Bonus: publicly available implementations
      • TensorFlow (official):
        • https://github.com/allenai/bilm-tf
      • PyTorch (official):
        • https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md
      • Chainer (@soskek / PFN):
        • https://github.com/chainer/models/tree/master/elmo-chainer
      • All of them can load a pre-trained model and output ELMo vectors (see the usage sketch below)
      • So there should be no need to worry that the reported performance "cannot be reproduced (?)"
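      As a usage example, extracting ELMo vectors with the AllenNLP implementation (its 0.x-era API) looks roughly like the sketch below; the option and weight files are the pre-trained ones distributed by AllenNLP, referred to here by placeholder local paths.

      from allennlp.modules.elmo import Elmo, batch_to_ids

      options_file = "elmo_options.json"  # placeholder path to the downloaded options file
      weight_file = "elmo_weights.hdf5"   # placeholder path to the downloaded pre-trained weights

      # num_output_representations=1 -> one scalar-mixed ELMo vector per token
      elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

      sentences = [["New", "York", "is", "located", "on", "the", "east", "coast", "."],
                   ["ELMo", "vectors", "are", "contextual", "."]]
      character_ids = batch_to_ids(sentences)             # (batch, seq, max_chars) character ids

      outputs = elmo(character_ids)
      elmo_vectors = outputs["elmo_representations"][0]    # (batch, seq, 1024) ELMo vectors
      mask = outputs["mask"]                               # real tokens vs. padding
      print(elmo_vectors.shape)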