BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language

tomohideshibata

October 20, 2018
Transcript

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Tomohide Shibata, 18/10/18
    BERT = Bidirectional Encoder Representations from Transformers
  2. Related Papers

    • Deep Contextualized Word Representations (ELMo) [Washington Univ. & AI2, 2018.2]
    • Improving Language Understanding by Generative Pre-Training (GPT) [OpenAI, 2018.6]
    • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [GoogleAI, 2018.10]
  3. ELMo, GPT, BERT

    [Figure: architecture diagrams for ELMo, GPT and BERT, shown with excerpts from the source papers]
    Slide annotations: shallow concatenation of left-to-right and right-to-left; integrated architecture; left-to-right language model; task-specific architecture; bidirectional conditioning
    Figure 1 (BERT paper): Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTM to generate features for downstream tasks. Among three, only BERT representations are jointly conditioned on both left and right context in all layers.
    Figure 1 (GPT paper): (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer.
    Figure 1 (deep BiLSTM excerpt): Highway LSTM with four layers. The curved connections represent highway connections, and the plus symbols represent transform gates that control inter-layer information flow.
  4. Model Architecture

    • Transformer [Vaswani+ 2017]
    • The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
    • L: # of layers
    • H: hidden size
    • A: # of self-attention heads
    • BERTBASE: L=12, H=768, A=12 (same as GPT)
    • BERTLARGE: L=24, H=1024, A=16, Total Parameters=340M
    BERTBASE was chosen to have an identical model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left.
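As a rough cross-check of the model sizes above, the parameter count can be estimated from L and H alone. This is only a sketch under assumptions that are not on the slide: a vocabulary of 30,522 WordPieces, 512 positions, a feed-forward width of 4H, and a pooler layer on top of [CLS]. The function name is hypothetical. A does not appear because the heads split H rather than adding parameters.

```python
def bert_param_count(L, H, vocab_size=30522, max_positions=512, segments=2):
    """Estimate encoder parameters. vocab_size, max_positions and the 4H
    feed-forward width are assumptions, not values taken from the slide."""
    layer_norm = 2 * H                                   # scale and bias
    embeddings = (vocab_size + max_positions + segments) * H + layer_norm
    attention = 4 * (H * H + H)                          # query, key, value, output
    feed_forward = (H * 4 * H + 4 * H) + (4 * H * H + H) # assumed inner width 4H
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = H * H + H                                   # assumed dense layer on [CLS]
    return embeddings + L * per_layer + pooler


for name, L, H in [("BERTBASE", 12, 768), ("BERTLARGE", 24, 1024)]:
    print(f"{name}: {bert_param_count(L, H) / 1e6:.0f}M parameters")
```

Under these assumptions the estimate for BERTLARGE comes out at roughly 335M, close to the 340M quoted on the slide.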
  5. Input Representation

    Input                 [CLS]    my    dog    is    cute    [SEP]    he    likes    play    ##ing    [SEP]
    Token Embeddings      E[CLS]   Emy   Edog   Eis   Ecute   E[SEP]   Ehe   Elikes   Eplay   E##ing   E[SEP]
    Segment Embeddings    EA       EA    EA     EA    EA      EA       EB    EB       EB      EB       EB
    Position Embeddings   E0       E1    E2     E3    E4      E5       E6    E7       E8      E9       E10
    Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.

    • Tokens are WordPieces (e.g. play ##ing)
    • [CLS]: for classification. The first token of every sequence is always the special classification embedding ([CLS]).
    • Segment embeddings (EA / EB): for sent. pairs
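The input layout in Figure 2 can be sketched in a few lines. This is a toy illustration, not the released implementation: the function names and the tiny two-dimensional embedding tables are hypothetical, and only the structure (token + segment + position, summed element-wise) comes from the slide.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Return (tokens, segment_ids, position_ids) for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment A covers [CLS], sentence A and the first [SEP]; B covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def embed(tokens, segment_ids, position_ids, tok_emb, seg_emb, pos_emb):
    """Sum token, segment and position vectors element-wise for each input."""
    return [
        [t + s + p for t, s, p in zip(tok_emb[tok], seg_emb[seg], pos_emb[pos])]
        for tok, seg, pos in zip(tokens, segment_ids, position_ids)
    ]


tokens, segs, poss = build_pair_inputs(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
# Toy lookup tables with hidden size 2 (made up for the example).
tok_emb = {tok: [1.0, 0.0] for tok in set(tokens)}
seg_emb = [[0.0, 0.1], [0.0, 0.2]]
pos_emb = [[float(i), 0.0] for i in range(len(tokens))]
vectors = embed(tokens, segs, poss, tok_emb, seg_emb, pos_emb)
print(list(zip(tokens, segs, poss)))
```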
  6. Pre-training Tasks: 1. Masked LM (1/2)

    • A standard language model (LM) is left-to-right or right-to-left → “deeply bidirectional” is better
    • If deeply bidirectional conditioning is adopted in a standard LM, the “see itself” problem arises: the word to be predicted is visible through the other direction (cheating!)
    Example: input “the man went to …”, prediction targets “man went to …”
  7. Pre-training Tasks: 1. Masked LM (2/2)

    • Solution: Masked LM = Cloze task, or CBOW in word2vec
    • Mask 15% of tokens
    • Predict the masked tokens given deep bidirectional representations
    Example: input “the man [MASK1] to [MASK2] store” → predictions [MASK1] = went, [MASK2] = a
    Note: the term “Bidirectional Transformer” is a little confusing. It means a Transformer that sees both left-side and right-side context.
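The masking step can be sketched as follows. This is a minimal illustration of "mask 15% of tokens" only: the function name, the fixed seed and the rule of never masking [CLS] or [SEP] are assumptions made for the example, and any further replacement rules used in the paper are deliberately left out.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace about mask_rate of the ordinary tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the prediction targets.
    """
    rng = random.Random(seed)
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    if not candidates:
        return list(tokens), {}
    n_mask = max(1, round(len(candidates) * mask_rate))
    masked = list(tokens)
    targets = {}
    for i in sorted(rng.sample(candidates, n_mask)):
        targets[i] = tokens[i]
        masked[i] = "[MASK]"
    return masked, targets


sentence = "[CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]".split()
masked, targets = mask_tokens(sentence)
print(" ".join(masked))
print(targets)
```

The model is then trained to recover `targets` from `masked`, so every prediction can use context on both sides of the gap.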
  8. Pre-training Tasks: 2. Next Sentence Prediction

    • Understanding the relation between sentences is important in QA and inference → next sentence prediction task
    Input: [CLS] the man went to the store [SEP] he bought a gallon of milk
    Label: IsNext
    Input: [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
    Label: NotNext
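A sketch of how IsNext / NotNext pairs could be built from tokenized sentences. The 50/50 split between true and random next sentences follows the paper rather than the slide, and sampling the negative from the same small sentence list is a simplification; the function name and its parameters are hypothetical.

```python
import random


def make_nsp_example(sentences, i, rng, p_next=0.5):
    """Pair sentence i with its true successor (IsNext) or a random one (NotNext)."""
    first = sentences[i]
    if rng.random() < p_next:
        second, label = sentences[i + 1], "IsNext"
    else:
        # Any sentence except the first one itself and its true successor.
        others = [j for j in range(len(sentences)) if j not in (i, i + 1)]
        second, label = sentences[rng.choice(others)], "NotNext"
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    return tokens, label


corpus = [
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
    "penguin ##s are flight ##less birds".split(),
]
rng = random.Random(0)
print(make_nsp_example(corpus, 0, rng))
```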
  9. Pre-Training Procedure

    • Corpus: BookCorpus (800M words) and English Wikipedia (2,500M words)
    • Batch size: 256 sequences * 512 tokens
    • Training:
      – BERTBASE: 4 Cloud TPUs in Pod configuration (16 TPU chips) → 4 days
      – BERTLARGE: 16 Cloud TPUs in Pod configuration (64 TPU chips) → 4 days
      – Time estimate for GPUs: 40 – 70 days with 8 GPUs
        http://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/
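The figures above can be combined into a back-of-the-envelope estimate of how often the corpus is seen during pre-training. The step count (1,000,000) is taken from the paper, not from the slide, and treating one word as one token is a crude assumption, so the result is only indicative.

```python
BOOKCORPUS_WORDS = 800_000_000
WIKIPEDIA_WORDS = 2_500_000_000
SEQUENCES_PER_BATCH = 256
TOKENS_PER_SEQUENCE = 512
ASSUMED_STEPS = 1_000_000  # assumption: step count reported in the paper

corpus_words = BOOKCORPUS_WORDS + WIKIPEDIA_WORDS
tokens_per_batch = SEQUENCES_PER_BATCH * TOKENS_PER_SEQUENCE
# Crude: counts one corpus word as one token.
approx_epochs = tokens_per_batch * ASSUMED_STEPS / corpus_words

print(f"corpus size:      {corpus_words / 1e9:.1f}B words")
print(f"tokens per batch: {tokens_per_batch:,}")
print(f"approx. epochs:   {approx_epochs:.1f}")
```

With these inputs the estimate is close to 40 passes over the 3.3B-word corpus.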
  10. Fine-Tuning: One additional Output Layer

    [Figure: BERT fine-tuning setups for four task types. E = input embedding, C = output for [CLS], T i = output for Tok i]
    • Sentence pair classification: [CLS] Sentence 1 [SEP] Sentence 2 → Class Label
    • Single sentence classification: [CLS] Single Sentence → Class Label
    • Question answering: [CLS] Question [SEP] Paragraph → Start/End Span
    • Single sentence tagging: [CLS] Single Sentence → B-PER O O ...
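For the classification setups, the "one additional output layer" can be pictured as a linear map plus softmax applied to C, the output vector for [CLS]. The sketch below uses plain Python lists and made-up weights; the function names and the toy numbers are assumptions for illustration, not the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(c, weights, bias):
    """Class probabilities from the [CLS] output vector c.

    weights has one row per class; weights and bias are the only new
    parameters added for fine-tuning in this sketch.
    """
    logits = [
        sum(w * x for w, x in zip(row, c)) + b for row, b in zip(weights, bias)
    ]
    return softmax(logits)


c = [1.0, 0.0]                      # toy [CLS] vector, hidden size 2
weights = [[2.0, 0.0], [0.0, 2.0]]  # toy 2-class output layer
bias = [0.0, 0.0]
probs = classify(c, weights, bias)
print(probs)
```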
  11. GLUE Results (General Language Understanding Evaluation, [Wang+ 18])

    System             MNLI-(m/mm)   QQP    QNLI   SST-2   CoLA   STS-B   MRPC   RTE    Average
    (training size)    392k          363k   108k   67k     8.5k   5.7k    3.5k   2.5k   -
    Pre-OpenAI SOTA    80.6/80.1     66.1   82.3   93.2    35.0   81.0    86.0   61.7   74.0
    BiLSTM+ELMo+Attn   76.4/76.1     64.8   79.9   90.4    36.0   73.3    84.9   56.8   71.0
    OpenAI GPT         82.1/81.4     70.3   88.1   91.3    45.4   80.0    82.3   56.0   75.2
    BERTBASE           84.6/83.4     71.2   90.1   93.5    52.1   85.8    88.9   66.4   79.6
    BERTLARGE          86.7/85.9     72.1   91.1   94.9    60.5   86.5    89.3   70.1   81.9
    Table 1: GLUE Test results, scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set. OpenAI GPT = (L=12, H=768, A=12); BERTBASE = (L=12, H=768, A=12); BERTLARGE = (L=24, H=1024, A=16). BERT and OpenAI GPT are single-model, single task. All results obtained from https://gluebenchmark.com/leaderboard and https://blog.openai.com/language-unsupervised/.

    RTE: Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009).
    ... small data sets (i.e., some runs would produce degenerate results), so we ran several random restarts and selected the model that performed best on the ...
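The "Average" column in Table 1 can be reproduced from the per-task scores, which also shows how it is computed: a plain mean over nine numbers, with MNLI-m and MNLI-mm counted separately. The snippet below is a small check written for this purpose; the variable names are arbitrary.

```python
# Per-task test scores in table order:
# MNLI-m, MNLI-mm, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, RTE
GLUE_TEST = {
    "Pre-OpenAI SOTA":  ([80.6, 80.1, 66.1, 82.3, 93.2, 35.0, 81.0, 86.0, 61.7], 74.0),
    "BiLSTM+ELMo+Attn": ([76.4, 76.1, 64.8, 79.9, 90.4, 36.0, 73.3, 84.9, 56.8], 71.0),
    "OpenAI GPT":       ([82.1, 81.4, 70.3, 88.1, 91.3, 45.4, 80.0, 82.3, 56.0], 75.2),
    "BERTBASE":         ([84.6, 83.4, 71.2, 90.1, 93.5, 52.1, 85.8, 88.9, 66.4], 79.6),
    "BERTLARGE":        ([86.7, 85.9, 72.1, 91.1, 94.9, 60.5, 86.5, 89.3, 70.1], 81.9),
}


def average(scores):
    """Unweighted mean, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)


for system, (scores, reported) in GLUE_TEST.items():
    print(f"{system:18s} computed={average(scores):.1f} reported={reported:.1f}")
```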
  12. Question Answering Task: SQuAD

    ... the answer, the task is to predict the answer span in the paragraph. For example:
      Question: ... water droplets collide with ice ... to form precipitation?
      Paragraph: ... Precipitation forms as smaller droplets ... via collision with other rain drops ... crystals within a cloud. ...
      Answer: ... cloud

    [Figure: [CLS] Question [SEP] Paragraph → Start/End Span. Annotations: start vector (S), token embedding (T i), start prob. (P i)]
    • The probability of word i being the start of the answer span is computed as a dot product between T i and S, followed by a softmax over the words in the paragraph:
      $P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$
    • The same formula is used for the end of the answer span, and the maximum scoring span is used as the prediction
    • The training objective is the log-likelihood of the correct start and end positions
    • Training: 3 epochs with a learning rate of 5e-... and a batch size of 32
    • At inference time, since the end prediction is not conditioned on the start, the constraint is added that the end must come after the start; no other heuristics are used

    System                       Dev EM   Dev F1   Test EM   Test F1
    Leaderboard (Oct 8th, 2018)
    Human                        -        -        82.3      91.2
    #1 Ensemble - nlnet          -        -        86.0      91.7
    #2 Ensemble - QANet          -        -        84.5      90.5
    #1 Single - nlnet            -        -        83.5      90.1
    #2 Single - QANet            -        -        82.5      89.3
    Published
    BiDAF+ELMo (Single)          -        85.8     -         -
    R.M. Reader (Single)         78.9     86.3     79.5      86.6
    R.M. Reader (Ensemble)       81.2     87.9     82.3      88.5
    Ours
    BERTBASE (Single)            80.8     88.5     -         -
    BERTLARGE (Single)           84.1     90.9     -         -
    BERTLARGE (Ensemble)         85.8     91.8     -         -
    BERTLARGE (Sgl.+TriviaQA)    84.2     91.1     85.1      91.8
    BERTLARGE (Ens.+TriviaQA)    86.2     92.2     87.4      93.2
    Table 2: SQuAD results. The BERT ensemble is 7x ...

    Our best performing system outperforms the top leaderboard system by +1.5 ... and +1.3 F1 as a single system. ... BERT model outperforms ... system in terms of F1 score. ... SQuAD (without TriviaQA ...) and still outperform all existing ... margin.
    4.3 Named Entity Recognition (excerpt): ... consists of 200k training words annotated as Person, Organization, ..., Miscellaneous, or Other (...
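The span prediction rule for SQuAD (softmax over dot products with a start vector, the same for the end, and the end constrained to come no earlier than the start) can be sketched as below. The end vector name, the toy token outputs and the exhaustive search over spans are assumptions made for the example; the source only describes the scoring rule, not an implementation.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def position_probs(vector, token_outputs):
    """P_i = softmax_i(vector . T_i) over all tokens of the paragraph."""
    scores = [sum(v * t for v, t in zip(vector, tok)) for tok in token_outputs]
    return softmax(scores)


def best_span(start_vec, end_vec, token_outputs):
    """Highest-scoring (start, end) pair subject to end >= start."""
    p_start = position_probs(start_vec, token_outputs)
    p_end = position_probs(end_vec, token_outputs)
    best = None
    for i, ps in enumerate(p_start):
        for j in range(i, len(p_end)):
            score = math.log(ps) + math.log(p_end[j])
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


# Toy example: two tokens with hidden size 2 (values made up).
token_outputs = [[1.0, 0.0], [0.0, 1.0]]
start_vec = [0.0, 3.0]  # prefers token 1 as the start
end_vec = [2.0, 0.0]    # prefers token 0 as the end
print(best_span(start_vec, end_vec, token_outputs))
```

In the toy example the unconstrained optimum would start at token 1 and end at token 0; the constraint rules that out, so the search settles on the single-token span at position 1.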
  13. Token Tagging Task: Named Entity Recognition

    System                           Dev F1   Test F1
    ELMo+BiLSTM+CRF                  95.7     92.2
    CVT+Multi (Clark et al., 2018)   -        92.6
    BERTBASE                         96.4     92.4
    BERTLARGE                        96.6     92.8
    Table 3: CoNLL-2003 Named Entity Recognition results. The hyperparameters were selected using the Dev set, and the reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

    ... sub-token as input to the classifier. For example:
      Jim     Hen     ##son   was   a   puppet   ##eer
      I-PER   I-PER   X       O     O   O        X
    Where no prediction is made for X. Since the WordPiece tokenization boundaries are a known part of the input, this is done for both training and test. A visual representation is also given in Figure 3 (d). A cased WordPiece model is used for NER, whereas an uncased model is used for all other tasks.
    Results are presented in Table 3. BERTLARGE outperforms the existing SOTA, Cross-View ...

    [Figure: single sentence tagging. [CLS] Tok 1 Tok 2 ... Tok N → B-PER O O ...; sub-tokens marked "no prediction"]
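The alignment between word-level labels and WordPiece tokens shown in the example can be sketched as follows. The function name is hypothetical, and the sketch assumes that continuation pieces are exactly the tokens starting with `##`.

```python
def align_labels(wordpieces, word_labels):
    """Give each word's label to its first sub-token and X to the rest."""
    labels = iter(word_labels)
    aligned = []
    for piece in wordpieces:
        if piece.startswith("##"):
            aligned.append("X")  # continuation piece: no prediction is made
        else:
            aligned.append(next(labels))
    return aligned


pieces = ["Jim", "Hen", "##son", "was", "a", "puppet", "##eer"]
labels = ["I-PER", "I-PER", "O", "O", "O"]  # one label per original word
print(list(zip(pieces, align_labels(pieces, labels))))
```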
  14. Ablation Studies

    Dev set results:
    Tasks           MNLI-m (Acc)   QNLI (Acc)   MRPC (Acc)   SST-2 (Acc)   SQuAD (F1)
    BERTBASE        84.4           88.4         86.7         92.7          88.5
    No NSP          83.9           84.9         86.5         92.6          87.9
    LTR & No NSP    82.1           84.3         77.5         92.1          77.8
    + BiLSTM        82.1           84.1         75.7         91.6          84.9
    Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. “No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. “+ BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR + No ...

    ... tuning. This does significantly improve results on SQuAD, but the results are still far worse than the ...

    Ablation over model size (Dev set accuracy):
    #L   #H     #A   LM (ppl)   MNLI-m   MRPC   SST-2
    3    768    12   5.84       77.9     79.8   88.4
    6    768    3    5.24       80.6     82.2   90.7
    6    768    12   4.68       81.9     84.8   91.3
    12   768    12   3.99       84.4     86.7   92.9
    12   1024   16   3.54       85.7     86.9   93.3
    24   1024   16   3.23       86.6     87.8   93.7

    ... training examples, and is substantially different from the pre-training tasks. It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERTBASE ...
  15. Feature-Based Approach with BERT

    • Evaluate how well BERT performs in the feature-based approach
      – By generating ELMo-like representations
    ... randomly initialized two-layer 768-dimensional BiLSTM before the classification layer. Results are shown in Table 7. The best performing method is to concatenate the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both the fine-tuning and feature-based approaches.

    Layers                      Dev F1
    Finetune All                96.4
    First Layer (Embeddings)    91.0
    Second-to-Last Hidden       95.6
    Last Hidden                 94.9
    Sum Last Four Hidden        95.9
    Concat Last Four Hidden     96.1
    Sum All 12 Layers           95.5
    Table 7: Ablation using BERT with a feature-based approach ... (CoNLL-2003 NER)

    Only 0.3 F1 behind fine-tuning → BERT is also effective for the feature-based approach
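Two of the feature-combination strategies listed in Table 7 (concatenating and summing the last four hidden layers) can be sketched for a single token as below. The layer values are toy numbers and the function names are hypothetical; in practice the per-layer vectors would come from the frozen pre-trained model.

```python
def concat_last_four(layer_states):
    """Concatenate one token's vectors from the top four layers (size 4 * H)."""
    return [x for layer in layer_states[-4:] for x in layer]


def sum_last_four(layer_states):
    """Element-wise sum of one token's vectors from the top four layers (size H)."""
    return [sum(values) for values in zip(*layer_states[-4:])]


# Toy hidden states for one token: 12 layers, hidden size 3.
layer_states = [[float(layer)] * 3 for layer in range(1, 13)]
print(len(concat_last_four(layer_states)), sum_last_four(layer_states))
```

Concatenation keeps each layer's information separate at the cost of a four times wider feature vector, while summing keeps the original width.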
  16. Misc.

    • Reddit: https://www.reddit.com/r/MachineLearning/comments/9nfqxz/r_bert_pretraining_of_deep_bidirectional/
    • Code & pre-trained model:
      – Will be released before the end of October 2018
      – BERT-pytorch (WIP): https://github.com/codertimo/BERT-pytorch