
Transfer Learning in NLP

Aye Hninn Khine

October 17, 2020

Transcript

  1. Transfer Learning in NLP. Aye Hninn Khine, Ph.D. Candidate, Department of Computer Science,
     Faculty of Science, Prince of Songkla University. [email protected]
  2. Overview • Transfer Learning in NLP • Pre-Training • Efficiency • Evaluation • Open-Source Tools
     Slides are adapted from the NAACL 2019 Tutorial on Transfer Learning in NLP (Sebastian Ruder,
     Matthew Peters, Swabha Swayamdipta, Thomas Wolf). Slides: https://tiny.cc/NAACLTransfer
  3. Transfer Learning • Transfer learning is a means to extract knowledge from a source setting and
     apply it to a different target setting. Pan and Yang (2010) – A Survey on Transfer Learning
  4. NLP Applications • Text Classification (Spam/Not Spam, Sentiment Analysis, News Classification)
     • Machine Translation (Google Translate, Facebook Translate) • Question Answering
     • Text Generation • Text Summarization
  5. Traditional Word Representation • Bag of Words • If we have 4 words (mango, strawberry, city,
     Delhi) in our vocabulary, we can represent them as follows: • Mango [1, 0, 0, 0]
     • Strawberry [0, 1, 0, 0] • City [0, 0, 1, 0] • Delhi [0, 0, 0, 1] • This one-hot scheme runs
     into the curse of dimensionality as the vocabulary grows (a small sketch follows below).
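A minimal sketch of the one-hot / bag-of-words representation above, in plain Python (the vocabulary is the toy 4-word one from the slide): every word needs a vector as long as the vocabulary, which is exactly what causes the curse-of-dimensionality problem as the vocabulary grows.

vocabulary = ["mango", "strawberry", "city", "delhi"]

def one_hot(word):
    # a vector of vocabulary length with a single 1 at the word's index
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec

for word in vocabulary:
    print(word, one_hot(word))
# mango [1, 0, 0, 0], strawberry [0, 1, 0, 0], ...
# a 100,000-word vocabulary would need 100,000-dimensional vectors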
  6. Why Transfer Learning in NLP? • Many NLP tasks share common knowledge about language
     (e.g. linguistic representations, structural similarities) • Tasks can inform each other,
     e.g. syntax and semantics • Annotated data is rare, so we make use of as much supervision as is
     available • Empirically, transfer learning has produced SOTA results for many supervised NLP
     tasks (e.g. classification, information extraction, Q&A)
  7. Sequential Transfer Learning • Learn one task/dataset, then transfer to another task/dataset
     • Pretraining: large corpora are used to train models such as Word2Vec, GloVe, fastText, ELMo,
     ULMFiT, BERT, GPT, and T5 • Adaptation: the pretrained model is adapted to target tasks such as
     text classification, machine translation, and Q/A
  8. Pre-training and Datasets • Unlabeled data and self-supervision (a sketch of the idea follows
     below) • Easy to gather very large corpora: Wikipedia, news, web crawl, social media, etc.
     • Training takes advantage of the distributional hypothesis: “You shall know a word by the
     company it keeps” (Firth, 1957), often formalized as training some variant of a language model
     • Focus on efficient algorithms to make use of plentiful data • Supervised pretraining
     • Very common in vision, less so in NLP due to the lack of large supervised datasets
     • Machine translation • NLI for sentence representations • Task-specific: transfer from one
     Q&A dataset to another
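A minimal sketch of the self-supervision idea above (a masked-word variant of language-model training; the 15% masking rate and the [MASK] token are assumptions borrowed from common practice, and the one-sentence corpus is a toy stand-in for Wikipedia or web-crawl text): unlabeled text is turned into (input, target) training pairs with no human annotation.

import random

random.seed(1)  # only to make the toy example reproducible
corpus = ["you shall know a word by the company it keeps"]

for sentence in corpus:
    inputs, targets = [], []
    for token in sentence.split():
        if random.random() < 0.15:      # mask roughly 15% of tokens
            inputs.append("[MASK]")
            targets.append(token)       # the model must recover this word
        else:
            inputs.append(token)
    print(" ".join(inputs), "->", targets)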
  9. Target Tasks and Datasets Target tasks are typically supervised and span a range of common NLP
     tasks: ❏ Sentence or document classification (e.g. sentiment) ❏ Sentence pair classification
     (e.g. NLI, paraphrase) ❏ Word level (e.g. sequence labeling, extractive Q&A)
     ❏ Structured prediction (e.g. parsing) ❏ Generation (e.g. dialogue, summarization)
  10. Major Themes: From words to words-in-context • Word vectors: cats = [0.2, -0.3, …],
      dogs = [0.4, -0.5, …] • Sentence/doc vectors: “It’s raining cats and dogs.” = [0.8, 0.9, …],
      “We have two cats.” = [-1.2, 0.0, …] • Word-in-context vectors: “cats” in “We have two cats.”
      = [1.2, -0.3, …], “cats” in “It’s raining cats and dogs.” = [-0.4, 0.9, …]
  11. Major Themes: LM pre-training ❏ Many successful pretraining approaches are based on language
      modeling ❏ Informally, a LM learns Pθ(text) or Pθ(text | some other text) ❏ Doesn’t require
      human annotation ❏ Many languages have enough text to learn a high-capacity model
      ❏ Versatile: can learn both sentence and word representations with a variety of objective
      functions (a scoring sketch follows below)
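A minimal sketch of what “a LM learns Pθ(text)” means in practice, assuming the Hugging Face transformers package, PyTorch, and the public "gpt2" checkpoint (a model chosen here for illustration, not one used in these slides): passing the input ids as labels makes the model return the average next-token cross-entropy, i.e. an estimate of -1/N Σ log Pθ(w_i | w_<i).

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "You shall know a word by the company it keeps."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # loss = mean negative log-likelihood of each token given its left context
    out = model(**inputs, labels=inputs["input_ids"])

print(f"per-token NLL: {out.loss.item():.2f}")
print(f"perplexity:    {torch.exp(out.loss).item():.1f}")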
  12. Major Themes: From shallow to deep • Bengio et al. (2003), A Neural Probabilistic Language
      Model: 1 layer • Devlin et al. (2019), BERT: Pre-training of Deep Bidirectional Transformers
      for Language Understanding: 24 layers
  13. word2vec • Efficient algorithm + large-scale training → high-quality word vectors
      (Mikolov et al., 2013); a training sketch follows below • See also: Pennington et al. (2014):
      GloVe; Bojanowski et al. (2017): fastText
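A minimal training sketch, assuming the gensim package (4.x API) and a toy three-sentence corpus; real word2vec quality of course comes from the large-scale corpora discussed earlier.

from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects the skip-gram objective of Mikolov et al. (2013)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])           # dense 50-dimensional vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity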
  14. Contextual word vectors - Motivation • Word vectors compress all contexts into a single vector
      • Nearest-neighbour GloVe vectors to “play” mix senses and parts of speech: playing, played,
      plays, Play, game, games, players, football, multiplayer
  15. Contextual word vectors - Key Idea • Instead of learning one vector per word, learn a vector
      that depends on context: f(play | The kids play a game in the park.) !=
      f(play | The Broadway play premiered yesterday.) • Many approaches are based on language
      models (see the sketch below)
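A minimal sketch of the key idea, assuming transformers, torch, and the public "bert-base-uncased" checkpoint: the same surface form "play" gets a different vector in each of the two example sentences, so f(play | context 1) != f(play | context 2).

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def vector_for(word, sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
    # locate the position of the word's token in this sentence
    idx = enc["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = vector_for("play", "The kids play a game in the park.")
v2 = vector_for("play", "The Broadway play premiered yesterday.")
print(torch.cosine_similarity(v1, v2, dim=0).item())  # noticeably below 1.0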
  16. Transformer Models (Model | Release Date | Training | Affiliation)
      • ELMo | Oct 2017 | 800M words, 42 GPU days | Allen AI
      • GPT | June 2018 | 800M words, 240 GPU days | OpenAI
      • BERT | Oct 2018 | 3.3B words, 256 TPU days (~320-560 GPU days) | Google AI
      • GPT-2 | Feb 2019 | 40B words, 2048 TPU v3 days | OpenAI
  17. ELMo (Peters et al., NAACL 2018) • Pretrain a deep bidirectional LM, extract contextual word
      vectors as a learned linear combination of hidden states • SOTA for 6 diverse tasks
  18. GPT (Radford et al., 2018) • Pretrain a large 12-layer left-to-right Transformer, fine-tune
      for sentence, sentence-pair, and multiple-choice questions • SOTA results for 9 tasks
  19. BERT (Devlin et al., 2019) • BERT pretrains both sentence and contextual word representations,
      using masked LM and next-sentence prediction • BERT-large has 340M parameters, 24 layers!
      • See also: Logeswaran and Lee, ICLR 2018
  20. BERT – Model Details • Data – Wikipedia (2.5B words) + BookCorpus (800M words)
      • Batch size – 131,072 words • Training time – 1M steps
      • BERT-Base: 12 layers, 768 hidden units, 12 heads
      • BERT-Large: 24 layers, 1024 hidden units, 16 heads
      • Evaluation on the GLUE benchmark (General Language Understanding Evaluation –
      https://gluebenchmark.com/)
  21. Why does language modeling work so well? ❏ Language modeling is a very difficult task, even
      for humans. ❏ Language models are expected to compress any possible context into a vector
      that generalizes over possible completions. ❏ “They walked down the street to ???” (see the
      completion sketch below) ❏ To have any chance at solving this task, a model is forced to
      learn syntax, semantics, and facts about the world. ❏ Given enough data, a huge model, and
      enough compute, it can do a reasonable job! ❏ Empirically, it works better than translation or
      autoencoding: “Language Modeling Teaches You More Syntax than Translation Does”
      (Zhang et al., 2018)
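A minimal sketch of the “They walked down the street to ???” point, assuming transformers and the public "gpt2" checkpoint: to rank plausible continuations, the model has to have picked up syntax and some world knowledge from raw text.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("They walked down the street to", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # scores over the vocabulary

top = torch.topk(next_token_logits, k=5).indices.tolist()
print([tokenizer.decode([i]) for i in top])              # five likely continuations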
  22. Evaluation of Language Models • Intrinsic • Word embeddings are compared against human
      judgements of word relations (a small sketch follows below) • Word semantic similarity
      • Word clustering • Extrinsic • Word embeddings are used as the feature vectors of supervised
      machine learning algorithms • Any downstream task can be used as an evaluation method
      • Bakarov (2018) – A Survey of Word Embeddings Evaluation Methods
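A minimal sketch of intrinsic evaluation, assuming numpy and scipy; the vectors and the human similarity judgements are made-up toy values, not a real benchmark: score word pairs by embedding cosine similarity and correlate that ranking with the human ratings.

import numpy as np
from scipy.stats import spearmanr

embeddings = {                      # hypothetical 3-dimensional word vectors
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.9, 0.4]),
}
human = {("cat", "dog"): 0.85, ("cat", "car"): 0.10, ("dog", "car"): 0.15}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b in human]
rho, _ = spearmanr(model_scores, list(human.values()))
print(f"Spearman correlation with human judgements: {rho:.2f}")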
  23. Architecture – Keep model unchanged • Add target-task-specific layers on top/bottom of the
      pretrained model • Simple: add a task-specific linear layer on top of the pretrained model
      (sketched below)
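A minimal sketch of adding a task-specific layer on top of an unchanged pretrained model, assuming torch and transformers with the public "bert-base-uncased" checkpoint; the class name SentenceClassifier is just an illustrative choice.

import torch.nn as nn
from transformers import AutoModel

class SentenceClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)               # pretrained model, unchanged
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)   # task-specific layer on top

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0])   # classify from the [CLS] position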
  24. Optimization: Which weights? (To tune or not to tune?) • Do not change pretrained weights:
      feature extraction, adapters • Feature extraction: pretrained representations are used as
      features in a downstream model • Adapters: task-specific modules added in between existing
      layers; only the adapters are trained • Change pretrained weights: fine-tuning • Pretrained
      weights are used as the initialization for the parameters of the downstream model • The whole
      pretrained architecture is trained during the adaptation phase (a freeze-vs-fine-tune sketch
      follows below)
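A minimal sketch of the feature-extraction vs. fine-tuning choice, assuming torch and transformers with the public "bert-base-uncased" checkpoint: the difference comes down to which parameters the optimizer is allowed to update.

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Feature extraction: do not change pretrained weights; train only the new head.
for param in model.base_model.parameters():
    param.requires_grad = False

# Fine-tuning would instead leave every parameter trainable (the default),
# so the whole pretrained architecture is updated during adaptation.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
print(f"trainable tensors: {len(trainable)}")   # only the classifier head here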
  25. Open Sourcing: Practical Considerations • Pre-training large-scale models is costly
      • Use open-source models • Share your pre-trained models • Sharing/accessing pre-trained
      models • Hubs: TensorFlow, PyTorch • Checkpoints: BERT, GPT • Third-party libraries: AllenNLP,
      fast.ai, HuggingFace (see the sketch below)
      Estimated CO2 emissions (lb), from “Energy and Policy Considerations for Deep Learning in
      NLP” – Strubell, Ganesh, McCallum – ACL 2019:
      • Air travel, 1 passenger, NY<->SF: 1,984
      • Human life, avg, 1 year: 11,023
      • American life, avg, 1 year: 36,156
      • Car, avg incl. fuel, 1 lifetime: 126,000
      • SOTA NLP model (tagging) with tuning and experimentation: 33,486
      • Transformer with neural architecture search: 394,863
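A minimal sketch of reusing a shared pretrained model instead of paying the pre-training cost yourself, assuming the transformers library; pipeline() downloads a default sentiment checkpoint from the model hub on first use.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # fetches a shared pretrained checkpoint
print(classifier("Transfer learning saves an enormous amount of compute."))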

  27. References • Transfer Learning in Natural Language Processing (NAACL-HLT 2019) –
      https://www.aclweb.org/anthology/N19-5004/ • https://arxiv.org/pdf/1801.09536.pdf
      • https://ruder.io/state-of-transfer-learning-in-nlp/ • http://jalammar.github.io/illustrated-bert/
      • https://arxiv.org/pdf/1910.07370.pdf
      • https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
      • https://blog.google/products/search/search-language-understanding-bert/
      • https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
      • https://ruder.io/ • https://nlp.stanford.edu/~johnhew/public/14-seq2seq.pdf
      • http://web.stanford.edu/class/cs224n/
  28. Experiment 1 – Dataset 1 (Drugs.com)
      Classification Method | Feature Extraction and Representation | Accuracy (%)
      • Logistic Regression | N-gram (1,3) | 92
      • Logistic Regression | TF-IDF | 79
      • Logistic Regression | TF-IDF + N-gram (1,3) | 93
      • Logistic Regression | Word2Vec (Google News) | 65
      • Logistic Regression | Word2Vec (Wikipedia-PubMed-PMC) | 57
      • SVM | N-gram (1,3) | 92
      • SVM | TF-IDF | 81
      • SVM | TF-IDF + N-gram (1,3) | 93
      • CNN | N-gram (1,3) | 89
      • CNN | TF-IDF | 90
      • CNN | TF-IDF + N-gram (1,3) | 91
      • CNN | Word2Vec (Google News) | 89
      • CNN | Word2Vec (Wikipedia-PubMed-PMC) | 88
      • CNN | Word2Vec (PubMed-MIMIC III) | 82
      • MLP | TF-IDF N-gram (1,3) | 89
      • MLP | TF-IDF | 89
  29. Interpretation • Top 50 least important and most important words for the positive category
  30. Interpretation • Top 50 least important and most important words for the negative category
  31. Chart: Coverage of Word Embedding Models on the Drugs.com Dataset – number of unique words
      vs. number of null words in GoogleNews, PPW, custom word2vec, and CC embeddings
  32. https://myanmarnlp.github.io/reading-group/ • Phu Mon Htut (Ph.D. Candidate, NYU)
      • Zin Tun (Data Scientist, Visa) • Soe Lynn (Senior SE, PayPal)
      • Aye Hninn Khine (Ph.D. Candidate, Prince of Songkla University)