Transfer Learning in NLP

Artificial Intelligence/ Machine Learning

Aye Hninn Khine, Ph.D. Candidate
Department of Computer Science, Faculty of Science, Prince of Songkla University

https://twitter.com/Erick404/status/1268158510208110594/photo/1

Overview
• Transfer Learning in NLP
• Pre-Training
• Efficiency
• Evaluation
• Open-Source Tools
Slides are adapted from the NAACL 2019 Tutorial on Transfer Learning in NLP (Sebastian Ruder, Matthew Peters, Swabha Swayamdipta, Thomas Wolf).
Slides: https://tiny.cc/NAACLTransfer

What is Transfer Learning?

Transfer Learning
• Transfer learning is a means to extract knowledge from a source setting and apply it to a different target setting.
Pan and Yang (2010) – A Survey on Transfer Learning

NLP Applications
• Text Classification (Spam/Not Spam, Sentiment Analysis, News Classification)
• Machine Translation (Google Translate, Facebook Translate)
• Question Answering
• Text Generation
• Text Summarization

Transfer Learning in NLP

Traditional Word Representation
• Bag of Words (one-hot vectors)
• If our vocabulary has 4 words (mango, strawberry, city, Delhi), we can represent them as follows:
  • Mango [1, 0, 0, 0]
  • Strawberry [0, 1, 0, 0]
  • City [0, 0, 1, 0]
  • Delhi [0, 0, 0, 1]
• Problem: curse of dimensionality. Each vector is as long as the vocabulary, and no two words are closer to each other than any other pair.
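The representation on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names `one_hot` and `dot` are made up here, and the vocabulary is the four-word toy example from the slide.

```python
# Toy vocabulary from the slide; the helper names are illustrative only.
vocab = ["mango", "strawberry", "city", "delhi"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length |V| with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word.lower()]] = 1
    return vec

def dot(u, v):
    """Dot product, used here as a crude similarity score."""
    return sum(a * b for a, b in zip(u, v))

print(one_hot("Mango"))  # [1, 0, 0, 0]
# Distinct words are always orthogonal, so "mango" is no closer to
# "strawberry" than it is to "Delhi".
print(dot(one_hot("mango"), one_hot("strawberry")))  # 0
print(dot(one_hot("mango"), one_hot("delhi")))       # 0
```

The vector length grows with the vocabulary, which is why realistic vocabularies of hundreds of thousands of words make this representation impractical.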

Why Transfer Learning in NLP?
• Many NLP tasks share common knowledge about language (e.g. linguistic representations, structural similarities)
• Tasks can inform each other (e.g. syntax and semantics)
• Annotated data is rare, so we should make use of as much supervision as is available
• Empirically, transfer learning has resulted in SOTA for many supervised NLP tasks (e.g. classification, information extraction, Q&A)

Why Transfer Learning in NLP?
Figure: Performance on Named Entity Recognition (NER) on CoNLL-2003 (English) over time

Taxonomy of Transfer Learning in NLP (Sebastian Ruder, 2019)

Sequential Transfer Learning
• Learn one task/dataset, then transfer to another task/dataset
• Pretraining: Corpora → Word2Vec, GloVe, ELMo, fastText, ULMFiT, BERT, GPT, T5
• Adaptation: Text Classification, Machine Translation, Q&A

Pre-training and Datasets
• Unlabeled data and self-supervision
  • Easy to gather very large corpora: Wikipedia, news, web crawl, social media, etc.
  • Training takes advantage of the distributional hypothesis: “You shall know a word by the company it keeps” (Firth, 1957), often formalized as training some variant of a language model
  • Focus on efficient algorithms to make use of plentiful data
• Supervised pretraining
  • Very common in vision, less so in NLP due to the lack of large supervised datasets
  • Machine translation
  • NLI for sentence representations
  • Task-specific: transfer from one Q&A dataset to another

Target Tasks and Datasets
Target tasks are typically supervised and span a range of common NLP tasks:
❏ Sentence or document classification (e.g. sentiment)
❏ Sentence pair classification (e.g. NLI, paraphrase)
❏ Word level (e.g. sequence labeling, extractive Q&A)
❏ Structured prediction (e.g. parsing)
❏ Generation (e.g. dialogue, summarization)

Major themes

Major Themes: From words to words-in-context
• Word vectors: cats = [0.2, -0.3, …], dogs = [0.4, -0.5, …]
• Sentence/doc vectors: "It’s raining cats and dogs." → [0.8, 0.9, …]; "We have two cats." → [-1.2, 0.0, …]
• Word-in-context vectors: "We have two cats." → [1.2, -0.3, …]; "It’s raining cats and dogs." → [-0.4, 0.9, …]

Major Themes: LM pre-training
❏ Many successful pretraining approaches are based on language modeling
❏ Informally, an LM learns Pθ(text) or Pθ(text | some other text)
❏ Doesn’t require human annotation
❏ Many languages have enough text to learn a high-capacity model
❏ Versatile: can learn both sentence and word representations with a variety of objective functions
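To make "an LM learns P(text)" concrete, here is a minimal sketch of the simplest possible case: a count-based bigram model with add-alpha smoothing. It is not one of the neural models discussed in these slides; the three-sentence corpus and the function name `log_prob` are assumptions made for illustration.

```python
import math
from collections import Counter

# Tiny illustrative corpus (made up for this sketch).
corpus = [
    "we have two cats",
    "we have two dogs",
    "it is raining cats and dogs",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens[:-1])             # how often each context word occurs
    bigrams.update(zip(tokens, tokens[1:]))  # how often each (previous, next) pair occurs

# Number of distinct tokens the model can predict (words plus end marker).
vocab_size = len({w for s in corpus for w in s.split()} | {"</s>"})

def log_prob(sentence, alpha=1.0):
    """log P(sentence) = sum of log P(next word | previous word)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, nxt)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        total += math.log(p)
    return total

print(log_prob("we have two cats"))   # higher (less negative)
print(log_prob("cats two have we"))   # lower: word order never seen in training
```

No labels were needed: the text itself supplies the training signal, which is the property the slide highlights.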

Major themes: From shallow to deep
• Bengio et al. (2003), A Neural Probabilistic Language Model: 1 layer
• Devlin et al. (2019), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: 24 layers

Pre-Training

word2vec
Efficient algorithm + large scale training → high quality word vectors (Mikolov et al., 2013)
See also:
❏ Pennington et al. (2014): GloVe
❏ Bojanowski et al. (2017): fastText

Doc2vec
• Paragraph vector: unsupervised paragraph embeddings (Le & Mikolov, 2014)
• SOTA classification (IMDB, SST)

Contextual word vectors - Motivation
Word vectors compress all contexts into a single vector.
Nearest neighbor GloVe vectors to “play”:
• VERB: playing, played
• NOUN: game, games, players, football
• ??: plays, Play
• ADJ: multiplayer

Contextual word vectors - Key Idea
Instead of learning one vector per word, learn a vector that depends on context:
f(play | The kids play a game in the park.) != f(play | The Broadway play premiered yesterday.)
Many approaches are based on language models.
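The difference between a static lookup and a context-dependent function f(word | sentence) can be shown with a deliberately crude toy. Everything here is an assumption made for illustration: `static_vector` produces fixed pseudo-embeddings from character codes, and `contextual_vector` simply averages a word with its neighbours. Real contextual models learn this function with deep networks.

```python
def static_vector(word, dim=4):
    """Hypothetical static embedding: one fixed vector per word type."""
    # Deterministic toy values derived from the characters; nothing is learned.
    total = sum(ord(c) for c in word)
    return [(total * (i + 1)) % 17 / 17.0 for i in range(dim)]

def contextual_vector(tokens, position, window=2):
    """Toy f(word | context): average the word's vector with its neighbours'."""
    lo = max(0, position - window)
    hi = min(len(tokens), position + window + 1)
    neighbours = [static_vector(t) for t in tokens[lo:hi]]
    return [sum(vals) / len(neighbours) for vals in zip(*neighbours)]

s1 = "the kids play a game in the park".split()
s2 = "the broadway play premiered yesterday".split()

# The static vector for "play" is identical in both sentences...
print(static_vector("play"))
# ...while the context-dependent vectors differ.
print(contextual_vector(s1, s1.index("play")))
print(contextual_vector(s2, s2.index("play")))
```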

Transformers

Transformer Models
Model | Release Date | Training Data | Training Compute | Affiliation
ELMo | Oct 2017 | 800M words | 42 GPU days | Allen AI
GPT | June 2018 | 800M words | 240 GPU days | OpenAI
BERT | Oct 2018 | 3.3B words | 256 TPU days (320~560 GPU days) | Google AI
GPT-2 | Feb 2019 | 40B words | 2048 TPU v3 days | OpenAI

ELMo (Peters et al., NAACL 2018)
• Pretrain a deep bidirectional LM, then extract contextual word vectors as a learned linear combination of hidden states
• SOTA for 6 diverse tasks
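The "learned linear combination of hidden states" can be sketched as a softmax-weighted sum of the per-layer vectors for a token, rescaled by a scalar, which is the form given in Peters et al. (2018). The function name `elmo_combine` and the toy numbers are assumptions; in practice the layer scores and the scalar are trained on the target task.

```python
import math

def softmax(scores):
    """Normalise arbitrary scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_combine(layer_states, layer_scores, gamma=1.0):
    """Collapse the per-layer vectors of one token into a single vector.

    layer_states: list of L vectors (one per biLM layer) for the same token.
    layer_scores: L unnormalised scalars, learned on the target task.
    gamma: learned scalar that rescales the combined vector.
    """
    weights = softmax(layer_scores)
    dim = len(layer_states[0])
    return [gamma * sum(w * h[d] for w, h in zip(weights, layer_states))
            for d in range(dim)]

# Three layers, two dimensions (toy values).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(elmo_combine(states, [0.0, 0.0, 0.0]))   # equal scores: plain average of layers
print(elmo_combine(states, [5.0, 0.0, 0.0]))   # first layer dominates
```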

GPT (Radford et al., 2018)
• Pretrain a large 12-layer left-to-right Transformer, then fine-tune for sentence, sentence pair and multiple choice tasks
• SOTA results for 9 tasks

http://jalammar.github.io/how-gpt3-works-visualizations-animations/

GPT Architecture
http://jalammar.github.io/illustrated-gpt2/

Self-Attention
http://jalammar.github.io/illustrated-gpt2/
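As a rough companion to the linked illustration, the following is a minimal pure-Python sketch of scaled dot-product self-attention with a causal (left-to-right) mask, as used in GPT-style models. It is a simplification: real models first apply learned linear projections to obtain queries, keys and values and use several heads, whereas here the same toy vectors are passed in directly. The function name is an assumption.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i only sees positions <= i."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Scores against the current and earlier positions only (future is masked).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        outputs.append([sum(w * values[j][d] for j, w in enumerate(weights))
                        for d in range(dim)])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three token vectors (toy values)
for row in causal_self_attention(x, x, x):
    print(row)
```

The first position can only attend to itself, so its output equals its own value vector; later positions mix in information from everything to their left.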

BERT (Devlin et al. 2019)
• BERT pretrains both sentence and contextual word representations, using masked LM and next sentence prediction
• BERT-large has 340M parameters and 24 layers
See also: Logeswaran and Lee, ICLR 2018
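The masked LM objective can be illustrated with the input corruption step alone. The sketch below follows the recipe described in Devlin et al. (2019): select about 15% of tokens, then replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged. These percentages come from the paper, not from the slide, and the function name, toy vocabulary and whitespace tokenisation are assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # model must recover the original
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ["cats", "dogs", "play", "park"]
tokens = "the kids play a game in the park".split()
print(mask_tokens(tokens, vocab, rng))
```

The loss is computed only at positions whose label is not None, so the model learns to use both left and right context to fill in the gaps.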

BERT – Model Details
• Data – Wikipedia (2.5B words) + BookCorpus (800M words)
• Batch Size – 131,072 words
• Training Time – 1M steps
• BERT-Base: 12-layer, 768 hidden, 12 head
• BERT-Large: 24-layer, 1024 hidden, 16 head
• Evaluation on GLUE Benchmark (General Language Understanding Evaluation - https://gluebenchmark.com/)
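A quick back-of-the-envelope check relates the numbers on this slide to each other. The sequence length of 512 is an assumption taken from the BERT paper rather than from the slide, and the calculation treats tokens and words as interchangeable, which is only approximately true for WordPiece tokens.

```python
# Figures from the slide.
corpus_words = 2_500_000_000 + 800_000_000   # Wikipedia + BookCorpus
words_per_batch = 131_072
steps = 1_000_000

# Assumption (not on the slide): 512 tokens per training sequence.
seq_len = 512
sequences_per_batch = words_per_batch // seq_len

# Total words processed during pretraining, and passes over the corpus.
words_seen = words_per_batch * steps
approx_epochs = words_seen / corpus_words

print(sequences_per_batch)       # 256
print(round(approx_epochs, 1))   # 39.7
```

So the batch corresponds to 256 sequences, and 1M steps amount to roughly 40 passes over the 3.3B-word corpus.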

BERT
http://jalammar.github.io/illustrated-bert/

BERT (Devlin et al. 2019)
SOTA GLUE benchmark results (sentence pair classification).

BERT (Devlin et al. 2019)
SOTA SQuAD v1.1 (and v2.0) Q&A

Why does language modeling work so well?
❏ Language modeling is a very difficult task, even for humans.
❏ Language models are expected to compress any possible context into a vector that generalizes over possible completions.
  ❏ “They walked down the street to ???”
❏ To have any chance at solving this task, a model is forced to learn syntax and semantics, encode facts about the world, etc.
❏ Given enough data, a huge model, and enough compute, it can do a reasonable job!
❏ Empirically works better than translation or autoencoding: “Language Modeling Teaches You More Syntax than Translation Does” (Zhang et al. 2018)

Sample Efficiency

Pretraining reduces need for annotated data (Peters et al., NAACL 2018)

Pretraining reduces need for annotated data (Howard and Ruder, ACL 2018)

Pretraining reduces need for annotated data (Clark et al., EMNLP 2018)

Evaluation

Evaluation of Language Models
• Intrinsic
  • Word embeddings are compared with human judgements or known word relations
  • Word semantic similarity
  • Word clustering
• Extrinsic
  • Word embeddings are used as the feature vectors of supervised machine learning algorithms
  • Any downstream task can be considered an evaluation method
Bakarov (2018) – A Survey of Word Embeddings Evaluation Methods
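One common form of intrinsic evaluation, word semantic similarity, can be sketched as follows: compute cosine similarity between embedding pairs and check how well the resulting ranking agrees with human ratings using Spearman rank correlation. The embeddings and ratings below are invented for illustration, and the `ranks` helper assumes there are no ties.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """Rank positions (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation using the no-ties formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical embeddings and human similarity ratings (made up for illustration).
emb = {"cat": [0.9, 0.1], "dog": [0.8, 0.3], "car": [0.1, 0.9], "bus": [0.2, 0.8]}
pairs = [("cat", "dog"), ("car", "bus"), ("cat", "car"), ("dog", "bus")]
human = [9.0, 8.5, 1.5, 2.0]

model = [cosine(emb[a], emb[b]) for a, b in pairs]
print(spearman(model, human))   # closer to 1.0 means better agreement with humans
```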

Adaptation

Architecture – Keep model unchanged
• Remove the pretraining task head if it is not useful for the target task

Architecture – Keep model unchanged
• Add target task-specific layers on top/bottom of the pretrained model
• Simple: adding linear layer(s) on top of the pretrained model
(Figure: task-specific layer on top of the pretrained model)
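The "simple" option, a linear layer on top of the pretrained model, amounts to a matrix multiplication followed by a softmax. In this sketch `pooled_vector` stands in for the sentence representation produced by the pretrained model; the function names, dimensions and weights are all hypothetical.

```python
import math

def linear(features, weights, bias):
    """One logit per class: logits[c] = weights[c] . features + bias[c]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(pooled_vector, head_weights, head_bias):
    """Task-specific layer applied to the pretrained model's output vector."""
    return softmax(linear(pooled_vector, head_weights, head_bias))

# Stand-in for the pretrained model's output (3 features, toy values).
pooled_vector = [0.5, -1.0, 2.0]
# New, randomly initialised parameters in practice; fixed here for repeatability.
head_weights = [[0.1, 0.0, 0.3], [-0.2, 0.4, 0.0]]   # 2 classes x 3 features
head_bias = [0.0, 0.0]

probs = classify(pooled_vector, head_weights, head_bias)
print(probs)   # class probabilities for the target task
```

Only `head_weights` and `head_bias` are new; everything that produces `pooled_vector` is reused from pretraining.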

Optimization

Optimization: Which weights? (To Tune or Not to Tune?)
• Do not change pretrained weights: feature extraction, adapters
  • Feature extraction: pretrained representations are used as features in a downstream model
  • Adapters: task-specific modules that are added in between existing layers; only the adapters are trained
• Change pretrained weights: fine-tuning
  • Pretrained weights are used as initialization for the parameters of the downstream model
  • The whole pretrained architecture is trained during the adaptation phase
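The slide does not spell out what an adapter looks like inside. One common design, assumed here, is a small bottleneck: project the hidden state down, apply a nonlinearity, project back up, and add the result to the original hidden state. The function names, sizes and the zero initialisation of the up-projection are illustrative assumptions.

```python
def matvec(matrix, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def relu(vec):
    return [max(0.0, x) for x in vec]

def adapter(hidden, w_down, w_up):
    """Bottleneck adapter with a residual connection: h + W_up relu(W_down h)."""
    bottleneck = relu(matvec(w_down, hidden))
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]

hidden = [1.0, -2.0, 0.5, 3.0]                 # output of a frozen pretrained layer
w_down = [[0.1, 0.0, 0.2, 0.0],
          [0.0, 0.3, 0.0, 0.1]]                # trainable, 4 -> 2
w_up_zero = [[0.0, 0.0] for _ in range(4)]     # trainable, 2 -> 4, zero-initialised

# With a zero up-projection the adapter is the identity, so inserting it
# does not disturb the pretrained model before any training happens.
print(adapter(hidden, w_down, w_up_zero))

# After training, the up-projection is non-zero and shifts the hidden state.
w_up_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
print(adapter(hidden, w_down, w_up_trained))
```

Only `w_down` and `w_up` would receive gradient updates; the pretrained weights that produced `hidden` stay fixed, in contrast to fine-tuning where every weight is updated.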

Open-Source Tools and Libraries

Open Sourcing: Practical Considerations
• Pre-training large scale models is costly
  • Use open-source models
  • Share your pre-trained model
• Sharing/accessing pre-trained models
  • Hubs: TensorFlow, PyTorch
  • Checkpoints: BERT, GPT
  • Third party libraries: AllenNLP, fast.ai, HuggingFace

Consumption | CO2 (lb)
Air travel, 1 passenger, NY<->SF | 1,984
Human life, avg, 1 year | 11,023
American life, avg, 1 year | 36,156
Car, avg incl. fuel, 1 lifetime | 126,000
SOTA NLP model (tagging) with tuning and experimentation | 33,486
Transformer with neural architecture search | 394,863
Source: Energy and Policy Considerations for Deep Learning in NLP – Strubell, Ganesh, McCallum – ACL 2019

PyTorch Transformers – HuggingFace

BERT in Google Search

Understanding Searches Better Than Ever

References
• Transfer Learning in Natural Language Processing (NAACL-HLT 2019): https://www.aclweb.org/anthology/N19-5004/
• https://arxiv.org/pdf/1801.09536.pdf
• https://ruder.io/state-of-transfer-learning-in-nlp/
• http://jalammar.github.io/illustrated-bert/
• https://arxiv.org/pdf/1910.07370.pdf
• https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
• https://blog.google/products/search/search-language-understanding-bert/
• https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
• https://ruder.io/
• https://nlp.stanford.edu/~johnhew/public/14-seq2seq.pdf
• http://web.stanford.edu/class/cs224n/

Experiments

Experiment 1: Dataset 1 (Drugs.com)
Classification Method | Feature Extraction and Feature Representation | Accuracy (%)
Logistic Regression | N-gram (1,3) | 92
Logistic Regression | TF-IDF | 79
Logistic Regression | TF-IDF + N-gram (1,3) | 93
Logistic Regression | Word2Vec (Google News) | 65
Logistic Regression | Word2Vec (Wikipedia-PubMed-PMC) | 57
SVM | N-gram (1,3) | 92
SVM | TF-IDF | 81
SVM | TF-IDF + N-gram (1,3) | 93
CNN | N-gram (1,3) | 89
CNN | TF-IDF | 90
CNN | TF-IDF + N-gram (1,3) | 91
CNN | Word2Vec (Google News) | 89
CNN | Word2Vec (Wikipedia-PubMed-PMC) | 88
CNN | Word2Vec (PubMed-MIMIC III) | 82
MLP | TF-IDF N-gram (1,3) | 89
MLP | TF-IDF | 89
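For readers unfamiliar with the feature names in this table, here is a minimal sketch of what "TF-IDF + N-gram (1,3)" could mean: extract all 1- to 3-word n-grams and weight each by term frequency times inverse document frequency. The exact pipeline used in the experiment is not given in the slides, so the functions, the unsmoothed idf formula and the two example reviews are assumptions; libraries often use smoothed variants.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(documents):
    """Return one {ngram: weight} dict per document, using tf * log(N / df)."""
    counts_per_doc = [Counter(ngrams(doc.lower().split())) for doc in documents]
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(term for counts in counts_per_doc for term in counts)
    n_docs = len(documents)
    return [{term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}
            for counts in counts_per_doc]

# Hypothetical reviews, not taken from the Drugs.com data.
docs = ["this drug works well", "this drug did not work"]
vectors = tfidf(docs)

print(ngrams("this drug works well".split()))
print(vectors[0]["works well"])   # positive: appears in only one document
print(vectors[0]["this drug"])    # 0.0: appears in every document
```

The resulting sparse vectors would then be fed to a classifier such as logistic regression or an SVM.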

Interpretation
Figure: Top 50 least important and most important words for the positive category

Interpretation
Figure: Top 50 least important and most important words for the negative category

Figure: Coverage of Word Embedding Models on Drugs.com Dataset (bar chart comparing # of unique words with # of null words in GoogleNews, PPW, custom word2vec and CC)
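The coverage numbers in this chart can be computed with a simple set comparison. The sketch below assumes that a "null word" is a dataset word with no vector in the embedding model; the function name, the sample sentence and the tiny embedding vocabulary are made up for illustration.

```python
def coverage(dataset_tokens, embedding_vocab):
    """Return (# unique words, # null words, fraction of unique words covered)."""
    unique = set(dataset_tokens)
    # Assumed meaning of "null word": no vector available in the embedding model.
    null_words = {w for w in unique if w not in embedding_vocab}
    covered = 1 - len(null_words) / len(unique) if unique else 0.0
    return len(unique), len(null_words), covered

tokens = "the tablet eased my migraine but the nausea returned".split()
embedding_vocab = {"the", "my", "but", "tablet", "returned", "eased"}

print(coverage(tokens, embedding_vocab))   # (8, 2, 0.75)
```

A domain-specific dataset such as drug reviews tends to contain many terms that general-purpose embeddings lack, which is one plausible reason the pretrained Word2Vec rows in the experiment table trail the n-gram features.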

Myanmar NLP Reading Group

https://myanmarnlp.github.io/reading-group/
• Phu Mon Htut (Ph.D. Candidate, NYU)
• Zin Tun (Data Scientist, Visa)
• Soe Lynn (Senior SE, PayPal)
• Aye Hninn Khine (Ph.D. Candidate, Prince of Songkla University)
