
Transfer Learning in NLP



Aye Hninn Khine

October 17, 2020

Transcript

  1. Artificial Intelligence / Machine Learning

  2. Transfer Learning in NLP
    Aye Hninn Khine
    Ph.D. Candidate, Department of Computer Science, Faculty of Science, Prince of Songkla University
    ayehninnkhine93@gmail.com
  3. https://twitter.com/Erick404/status/1268158510208110594/photo/1

  4. Overview
    • Transfer Learning in NLP
    • Pre-Training
    • Efficiency
    • Evaluation
    • Open-Source Tools
    Slides are adapted from the NAACL 2019 Tutorial on Transfer Learning in NLP (Sebastian Ruder, Matthew Peters, Swabha Swayamdipta, Thomas Wolf). Slides: https://tiny.cc/NAACLTransfer
  5. What is Transfer Learning?

  6. Transfer Learning
    • Transfer learning is a means to extract knowledge from a source setting and apply it to a different target setting.
    Pan and Yang (2010), A Survey on Transfer Learning
  7. NLP Applications
    • Text Classification (Spam/Not Spam, Sentiment Analysis, News Classification)
    • Machine Translation (Google Translate, Facebook Translate)
    • Question Answering
    • Text Generation
    • Text Summarization
  8. Transfer Learning in NLP

  9. Traditional Word Representation
    • Bag of Words
    • If our vocabulary has 4 words (mango, strawberry, city, Delhi), we can represent them as follows:
      • Mango [1, 0, 0, 0]
      • Strawberry [0, 1, 0, 0]
      • City [0, 0, 1, 0]
      • Delhi [0, 0, 0, 1]
    Problem: the curse of dimensionality (each vector is as long as the vocabulary).
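A minimal sketch of the representation on this slide, using only the Python standard library. The helper names `one_hot` and `dot` are illustrative and not from the slides. It also shows a second weakness of these vectors: every pair of distinct words has dot product 0, so the representation carries no notion of similarity.

```python
# The four-word vocabulary from the slide.
vocabulary = ["mango", "strawberry", "city", "Delhi"]


def one_hot(word, vocab):
    """Return a list with 1 at the word's index in vocab and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


for word in vocabulary:
    print(word, one_hot(word, vocabulary))

# "mango" is exactly as far from "strawberry" as it is from "Delhi".
fruit_similarity = dot(one_hot("mango", vocabulary), one_hot("strawberry", vocabulary))
city_similarity = dot(one_hot("mango", vocabulary), one_hot("Delhi", vocabulary))
```

The vector length equals the vocabulary size, so a realistic vocabulary of hundreds of thousands of words gives very long, very sparse vectors.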
  10. Why Transfer Learning in NLP?
    • Many NLP tasks share common knowledge about language (e.g. linguistic representations, structural similarities)
    • Tasks can inform each other, e.g. syntax and semantics
    • Annotated data is rare, so make use of as much supervision as is available.
    • Empirically, transfer learning has resulted in SOTA for many supervised NLP tasks (e.g. classification, information extraction, Q&A, etc.).
  11. Why Transfer Learning in NLP?
    Performance on Named Entity Recognition (NER) on CoNLL-2003 (English) over time
  12. Taxonomy of Transfer Learning in NLP (Sebastian Ruder, 2019)

  13. Sequential Transfer Learning
    • Learn one task/dataset, transfer to another task/dataset
    • Pretraining (on corpora): Word2Vec, GloVe, ELMo, fastText, ULMFiT, BERT, GPT, T5
    • Adaptation (to target tasks): Text Classification, Machine Translation, Q/A
  14. Pre-training and Datasets
    • Unlabeled data and self-supervision
      • Easy to gather very large corpora: Wikipedia, news, web crawl, social media, etc.
      • Training takes advantage of the distributional hypothesis: “You shall know a word by the company it keeps” (Firth, 1957), often formalized as training some variant of a language model
      • Focus on efficient algorithms to make use of plentiful data
    • Supervised pretraining
      • Very common in vision, less so in NLP due to the lack of large supervised datasets
      • Machine translation
      • NLI for sentence representations
      • Task-specific: transfer from one Q&A dataset to another
  15. Target Tasks and Datasets
    Target tasks are typically supervised and span a range of common NLP tasks:
    ❏ Sentence or document classification (e.g. sentiment)
    ❏ Sentence pair classification (e.g. NLI, paraphrase)
    ❏ Word level (e.g. sequence labeling, extractive Q&A)
    ❏ Structured prediction (e.g. parsing)
    ❏ Generation (e.g. dialogue, summarization)
  16. Major themes

  17. Major Themes: From words to words-in-context
    • Word vectors: cats = [0.2, -0.3, …], dogs = [0.4, -0.5, …]
    • Sentence/doc vectors: “It’s raining cats and dogs.” → [0.8, 0.9, …]; “We have two cats.” → [-1.2, 0.0, …]
    • Word-in-context vectors for “cats”: in “We have two cats.” → [1.2, -0.3, …]; in “It’s raining cats and dogs.” → [-0.4, 0.9, …]
  18. Major Themes: LM pre-training
    ❏ Many successful pretraining approaches are based on language modeling
    ❏ Informally, an LM learns Pϴ (text) or Pϴ (text | some other text)
    ❏ Doesn’t require human annotation
    ❏ Many languages have enough text to learn a high-capacity model
    ❏ Versatile: can learn both sentence and word representations with a variety of objective functions
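To make "an LM learns P(text)" concrete, here is a minimal sketch of a count-based bigram language model. This is a deliberately tiny stand-in, not one of the neural models covered in the slides. The function names and the second corpus sentence are assumptions for illustration; the first sentence is borrowed from slide 17.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """P(text) as a product of P(word | previous word); unseen events give 0."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0
        probability *= bigram_counts[(prev, word)] / context_counts[prev]
    return probability


# Toy corpus: no labels are needed, only raw text.
corpus = ["we have two cats", "we have two dogs"]
context_counts, bigram_counts = train_bigram_lm(corpus)

p_seen = sentence_probability("we have two cats", context_counts, bigram_counts)
p_unseen = sentence_probability("we have two birds", context_counts, bigram_counts)
print(p_seen, p_unseen)
```

The only supervision is the text itself, which is why this objective needs no human annotation. Neural LMs replace the count table with a parameterized model, so unseen word sequences no longer get probability 0.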
  19. Major themes: From shallow to deep
    • Bengio et al 2003: A Neural Probabilistic Language Model (1 layer)
    • Devlin et al 2019: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (24 layers)
  20. Pre-Training

  21. word2vec
    Efficient algorithm + large scale training → high quality word vectors (Mikolov et al., 2013)
    See also:
    ❏ Pennington et al. (2014): GloVe
    ❏ Bojanowski et al. (2017): fastText
  22. Doc2vec
    Paragraph vector: unsupervised paragraph embeddings (Le & Mikolov, 2014)
    SOTA classification (IMDB, SST)
  23. Contextual word vectors - Motivation
    Word vectors compress all contexts into a single vector.
    Nearest-neighbor GloVe vectors to “play”:
    • VERB: playing, played
    • NOUN: game, games, players, football
    • ??: plays, Play
    • ADJ: multiplayer
  24. Contextual word vectors - Key Idea
    Instead of learning one vector per word, learn a vector that depends on context:
    f(play | The kids play a game in the park.) != f(play | The Broadway play premiered yesterday.)
    Many approaches are based on language models.
  25. Transformers

  26. Transformer Models
    Model: release date; training (data, compute); affiliation
    • ELMo: Oct 2017; 800M words, 42 GPU days; Allen AI
    • GPT: June 2018; 800M words, 240 GPU days; OpenAI
    • BERT: Oct 2018; 3.3B words, 256 TPU days (320~560 GPU days); Google AI
    • GPT-2: Feb 2019; 40B words, 2048 TPU v3 days; OpenAI
  27. ELMo (Peters et al, NAACL 2018)
    Pretrain a deep bidirectional LM, then extract contextual word vectors as a learned linear combination of hidden states.
    SOTA for 6 diverse tasks.
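The "learned linear combination of hidden states" can be sketched as follows. In the ELMo paper the task-specific vector for a token is a scalar `gamma` times a softmax-weighted sum of that token's per-layer hidden states. The function names and the toy numbers below are assumptions for illustration; in practice the layer weights and `gamma` are learned with the downstream task.

```python
import math


def softmax(weights):
    """Normalize raw layer weights into positive values that sum to 1."""
    largest = max(weights)
    exps = [math.exp(w - largest) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_states, layer_weights, gamma=1.0):
    """Weighted sum over layers of one token's hidden states, scaled by gamma."""
    s = softmax(layer_weights)
    dim = len(layer_states[0])
    return [
        gamma * sum(s[j] * layer_states[j][d] for j in range(len(layer_states)))
        for d in range(dim)
    ]


# Hypothetical hidden states of one token at three layers (2 dimensions each).
layer_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Equal raw weights give a plain average; training would move these.
layer_weights = [0.0, 0.0, 0.0]

vector = combine_layers(layer_states, layer_weights)
print(vector)
```

Because only the layer weights and `gamma` are task-specific, the pretrained LM itself can stay frozen while each downstream task picks its own mix of layers.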
  28. GPT (Radford et al., 2018)
    Pretrain a large 12-layer left-to-right Transformer, then fine-tune for sentence, sentence pair and multiple choice tasks.
    SOTA results for 9 tasks.
  29. http://jalammar.github.io/how-gpt3-works-visualizations-animations/

  30. GPT Architecture: http://jalammar.github.io/illustrated-gpt2/

  31. Self-Attention: http://jalammar.github.io/illustrated-gpt2/
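The slide itself is only a pointer to an illustration, so here is a minimal sketch of scaled dot-product self-attention with the causal (left-to-right) mask used in GPT-style models. It assumes the queries, keys and values have already been produced by the learned projection matrices, which are omitted. All names and toy vectors are illustrative.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values, causal=True):
    """Scaled dot-product attention; with causal=True, position i sees only 0..i."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        last = i + 1 if causal else len(keys)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in keys[:last]
        ]
        weights = softmax(scores)
        output = [
            sum(w * value[d] for w, value in zip(weights, values[:last]))
            for d in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Three token vectors, used directly as queries, keys and values for brevity.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attention_weights = self_attention(tokens, tokens, tokens)
for position, weights in enumerate(attention_weights):
    print(position, [round(w, 3) for w in weights])
```

Passing `causal=False` lets every position attend to the whole sequence, which is the bidirectional setting BERT relies on.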

  32. BERT (Devlin et al. 2019)
    BERT pretrains both sentence and contextual word representations, using masked LM and next sentence prediction.
    BERT-large has 340M parameters, 24 layers!
    See also: Logeswaran and Lee, ICLR 2018
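A minimal sketch of how masked-LM training inputs can be built. It follows the corruption scheme described in the BERT paper: about 15% of positions are selected, and of those 80% become `[MASK]`, 10% become a random token and 10% stay unchanged. Selecting each position independently is a simplification made here, and the function name and toy vocabulary are assumptions. Next sentence prediction is not covered.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (inputs, labels); labels hold the original token at selected positions."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected: no prediction target here
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask symbol
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Example sentence from the contextual word vectors slide; the vocab is a toy one.
tokens = "The kids play a game in the park .".split()
vocab = ["cats", "dogs", "play", "game", "park", "street"]
rng = random.Random(0)  # fixed seed so repeated runs give the same masking

inputs, labels = mask_tokens(tokens, vocab, rng)
print(inputs)
print(labels)
```

The model is trained to predict `labels[i]` wherever it is not `None`, using context from both the left and the right of the position.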
  33. BERT – Model Details
    • Data: Wikipedia (2.5B words) + BookCorpus (800M words)
    • Batch Size: 131,072 words
    • Training Time: 1M steps
    • BERT-Base: 12-layer, 768 hidden, 12 head
    • BERT-Large: 24-layer, 1024 hidden, 16 head
    • Evaluation on the GLUE Benchmark (General Language Understanding Evaluation, https://gluebenchmark.com/)
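A quick arithmetic check of these numbers. The split of the batch into 256 sequences of 512 tokens comes from the BERT paper rather than from the slide, and treating tokens as words is a rough approximation, since BERT uses WordPiece tokens.

```python
# Batch composition as reported in the BERT paper (an assumption relative to the slide).
sequences_per_batch = 256
tokens_per_sequence = 512
batch_tokens = sequences_per_batch * tokens_per_sequence

# Figures taken from the slide.
training_steps = 1_000_000
corpus_words = 2_500_000_000 + 800_000_000  # Wikipedia + BookCorpus

tokens_seen = batch_tokens * training_steps
approx_epochs = tokens_seen / corpus_words

print(batch_tokens)              # 131072, matching the batch size on the slide
print(round(approx_epochs, 1))   # roughly 40 passes over the corpus
```

So one million steps at this batch size corresponds to roughly 40 passes over the 3.3B-word corpus.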
  34. BERT: http://jalammar.github.io/illustrated-bert/

  35. BERT (Devlin et al. 2019)
    SOTA GLUE benchmark results (sentence pair classification).
  36. BERT (Devlin et al. 2019)
    SOTA SQuAD v1.1 (and v2.0) Q&A
  37. Why does language modeling work so well?
    ❏ Language modeling is a very difficult task, even for humans.
    ❏ Language models are expected to compress any possible context into a vector that generalizes over possible completions.
      ❏ “They walked down the street to ???”
    ❏ To have any chance at solving this task, a model is forced to learn syntax and semantics, encode facts about the world, etc.
    ❏ Given enough data, a huge model, and enough compute, a model can do a reasonable job!
    ❏ Empirically works better than translation or autoencoding: “Language Modeling Teaches You More Syntax than Translation Does” (Zhang et al. 2018)
  38. Sample Efficiency

  39. Pretraining reduces need for annotated data (Peters et al, NAACL 2018)
  40. Pretraining reduces need for annotated data (Howard and Ruder, ACL 2018)
  41. Pretraining reduces need for annotated data (Clark et al. EMNLP 2018)
  42. Evaluation

  43. Evaluation of Language Models
    • Intrinsic
      • Word embeddings are compared with human judgements of word relations.
      • Word Semantic Similarity
      • Word Clustering
    • Extrinsic
      • Word embeddings are used as the feature vectors of supervised machine learning algorithms.
      • Any downstream task could be considered as an evaluation method.
    Bakarov (2018), A Survey of Word Embeddings Evaluation Methods
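Intrinsic evaluation by word semantic similarity can be sketched as follows: score word pairs with the cosine similarity of their embeddings, then measure how well that ranking agrees with human ratings using Spearman's rank correlation. The 2-dimensional embeddings and the human scores below are made-up toy values, not data from the talk, and the rank helper assumes there are no ties.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical embeddings and hypothetical human similarity ratings (0 to 10).
embeddings = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "mango": [0.5, 0.5],
    "city": [0.1, 1.0],
}
human_ratings = {("cat", "dog"): 9.0, ("cat", "mango"): 3.0, ("cat", "city"): 1.0}

pairs = list(human_ratings)
model_scores = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
human_scores = [human_ratings[pair] for pair in pairs]
rho = spearman(model_scores, human_scores)
print(rho)
```

Extrinsic evaluation would instead plug the same embeddings into a downstream classifier and report task accuracy.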
  44. Adaptation

  45. Architecture – Keep model unchanged
    • Remove the pretraining task head if it is not useful for the target task
  46. Architecture – Keep model unchanged
    • Add target task-specific layers on top/bottom of the pretrained model
    • Simple: adding linear layer(s) on top of the pretrained model (task-specific layer)
  47. Optimization

  48. Optimization: Which weights? (To Tune or Not to Tune?)
    • Do not change pretrained weights: feature extraction, adapters
      • Feature extraction: pretrained representations are used as features in a downstream model
      • Adapters: task-specific modules that are added in between existing layers; only the adapters are trained
    • Change pretrained weights: fine-tuning
      • Pretrained weights are used as initialization for the parameters of the downstream model
      • The whole pretrained architecture is trained during the adaptation phase
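The three options differ only in which parameters receive gradient updates. The sketch below makes that explicit with scalar stand-ins for whole weight groups. The group names (`encoder`, `adapter`, `head`), the values and the gradients are hypothetical; a real setup would do the same thing by freezing tensors in a deep learning framework.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update to the groups in `trainable`; leave the rest frozen."""
    return {
        name: (value - lr * grads[name] if name in trainable else value)
        for name, value in params.items()
    }


# Scalar stand-ins: "encoder" is the pretrained model, "head" the new task layer.
base = {"encoder": 1.0, "head": 0.5}
with_adapter = {**base, "adapter": 0.0}
grads = {name: 1.0 for name in with_adapter}  # pretend gradients

# Feature extraction: pretrained weights frozen, only the task layer is trained.
feature_extraction = sgd_step(base, grads, trainable={"head"})
# Adapters: pretrained weights frozen, small inserted modules and the head are trained.
adapters = sgd_step(with_adapter, grads, trainable={"adapter", "head"})
# Fine-tuning: pretrained weights are the initialization and everything is trained.
fine_tuning = sgd_step(base, grads, trainable={"encoder", "head"})

print(feature_extraction)
print(adapters)
print(fine_tuning)
```

Only fine-tuning changes the `encoder` value; the other two strategies keep the pretrained weights exactly as they were.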
  49. Open-Source Tools and Libraries

  50. Open Sourcing: Practical Considerations
    • Pre-training large scale models is costly
      • Use open-source models
      • Share your pre-trained model
    • Sharing/accessing pre-trained models
      • Hubs: TensorFlow, PyTorch
      • Checkpoints: BERT, GPT
      • Third party libraries: AllenNLP, fast.ai, HuggingFace
    Consumption: CO2 (lb)
    • Air travel, 1 passenger, NY<->SF: 1,984
    • Human life, avg, 1 year: 11,023
    • American life, avg, 1 year: 36,156
    • Car, avg incl. fuel, 1 lifetime: 126,000
    • SOTA NLP model (tagging) with tuning and experimentation: 33,486
    • Transformer with neural architecture search: 394,863
    Source: Energy and Policy Considerations for Deep Learning in NLP (Strubell, Ganesh, McCallum, ACL 2019)
  51. PyTorch Transformers – HuggingFace

  52. BERT in Google Search

  53. Understanding Searches Better Than Ever


  55. References
    • Transfer Learning in Natural Language Processing (NAACL-HLT 2019): https://www.aclweb.org/anthology/N19-5004/
    • https://arxiv.org/pdf/1801.09536.pdf
    • https://ruder.io/state-of-transfer-learning-in-nlp/
    • http://jalammar.github.io/illustrated-bert/
    • https://arxiv.org/pdf/1910.07370.pdf
    • https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
    • https://blog.google/products/search/search-language-understanding-bert/
    • https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
    • https://ruder.io/
    • https://nlp.stanford.edu/~johnhew/public/14-seq2seq.pdf
    • http://web.stanford.edu/class/cs224n/
  56. Experiments

  57. Experiment 1: Dataset 1 (Drugs.com)
    Classification Method   Feature Extraction and Feature Representation   Accuracy (%)
    Logistic Regression     N-gram (1,3)                                    92
    Logistic Regression     TF-IDF                                          79
    Logistic Regression     TF-IDF+N-gram (1,3)                             93
    Logistic Regression     Word2Vec (Google News)                          65
    Logistic Regression     Word2Vec (Wikipedia-PubMed-PMC)                 57
    SVM                     N-gram (1,3)                                    92
    SVM                     TF-IDF                                          81
    SVM                     TF-IDF+N-gram (1,3)                             93
    CNN                     N-gram (1,3)                                    89
    CNN                     TF-IDF                                          90
    CNN                     TF-IDF+N-gram (1,3)                             91
    CNN                     Word2Vec (Google News)                          89
    CNN                     Word2Vec (Wikipedia-PubMed-PMC)                 88
    CNN                     Word2Vec (PubMed-MIMIC III)                     82
    MLP                     TF-IDF N-gram (1,3)                             89
    MLP                     TF-IDF                                          89
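The "TF-IDF+N-gram (1,3)" feature setting can be sketched as word n-grams of length 1 to 3 weighted by TF-IDF. This is a minimal standard-library version using the textbook weight `count * log(N / df)`; the talk does not say which implementation or weighting variant was used, and library implementations usually apply smoothing and normalization on top. The two review sentences are made up.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """All word n-grams with n_min <= n <= n_max, as space-joined strings."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(documents):
    """One {term: weight} dict per document, weight = count * log(N / df)."""
    doc_terms = [Counter(word_ngrams(doc)) for doc in documents]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for terms in doc_terms for term in terms)
    n_docs = len(documents)
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in terms.items()}
        for terms in doc_terms
    ]


# Hypothetical drug reviews.
reviews = ["this drug works well", "this drug did not work"]
features = tfidf(reviews)

print(word_ngrams(reviews[0]))
print(features[0]["works well"], features[0]["this drug"])
```

Terms shared by every document, such as "this drug" here, get weight 0, while n-grams specific to one review keep a positive weight. These dictionaries are what a linear classifier such as logistic regression or an SVM would consume as sparse features.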
  58. Interpretation
    Top 50 least important and most important words for the positive category
  59. Interpretation
    Top 50 least important and most important words for the negative category
  60. Coverage of Word Embedding Models on Drugs.com Dataset
    [Bar chart. Series: # of unique words, # of null words in GoogleNews, # of null words in PPW, # of null words in custom word2vec, # of null words in CC. Value axis: 0 to 60,000.]
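A coverage count of this kind can be computed as below, assuming that a "null word" means a dataset word with no vector in the embedding model. That reading, the function name and the toy vocabularies are assumptions, not details given in the talk.

```python
def coverage(dataset_vocab, embedding_vocab):
    """Return (number of null words, fraction of dataset words that have a vector)."""
    null_words = dataset_vocab - embedding_vocab
    covered = len(dataset_vocab) - len(null_words)
    return len(null_words), covered / len(dataset_vocab)


# Hypothetical vocabularies: unique words in the reviews vs. words with embeddings.
dataset_vocab = {"headache", "nausea", "zoloft", "lexapro"}
embedding_vocab = {"headache", "nausea", "city"}

null_count, covered_fraction = coverage(dataset_vocab, embedding_vocab)
print(null_count, covered_fraction)
```

A general-domain model tends to miss domain terms such as drug names, which is one plausible reason pretrained Word2Vec features could trail n-gram features on a specialized dataset.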
  61. Myanmar NLP Reading Group

  62. https://myanmarnlp.github.io/reading-group/
    • Phu Mon Htut (Ph.D. Candidate, NYU)
    • Zin Tun (Data Scientist, Visa)
    • Soe Lynn (Senior SE, PayPal)
    • Aye Hninn Khine (Ph.D. Candidate, Prince of Songkla University)