
RoBERTa: paper reading

himkt
July 31, 2020

Slides from an in-house paper reading session.


Transcript

  1. Summary • Hyperparameter choices have a significant impact • They present

    a replication study of BERT pretraining, carefully measuring the impact of hyperparameters and training data size • BERT was significantly undertrained • They publish a new model that exceeds the current SoTA
  2. [Preliminary] Setup: BERT • Input: a concatenation of two segments

    with special symbols • [CLS], x_1, …, x_n, [SEP], y_1, …, y_m • Model: N-layer Transformer [Vaswani+, 2017]
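
As a rough illustration of the input layout above, here is a minimal Python sketch that concatenates two segments with the special symbols. The helper name and token strings are illustrative, not code from the paper; real implementations also map tokens to vocabulary ids.

```python
# Minimal sketch of the BERT input layout: [CLS], x_1, ..., x_n, [SEP], y_1, ..., y_m.
# Implementations typically also append a final [SEP] after the second segment.
def build_input(segment_x, segment_y):
    return ["[CLS]"] + segment_x + ["[SEP]"] + segment_y

tokens = build_input(["the", "man", "entered", "a", "university"],
                     ["he", "studied", "mathematics"])
print(tokens)
```
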
  3. [Preliminary] Training Objective • Masked Language Model (MLM) • Predicting

    15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged • Next Sentence Prediction (NSP) • Predicting whether two segments follow each other • Positive: consecutive sentence pairs taken from the corpus • Negative: segments paired from different documents
  4. [Preliminary] Training Objective • Masked Language Model (MLM) • Predicting

    15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged • Example: "Make everyday cooking fun" => "[MASK] everyday swimming fun" ("Make" is masked, "cooking" is replaced with a random token)
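
A minimal sketch of the 80/10/10 masking rule described above, assuming token lists and a toy vocabulary; this is illustrative code, not the authors' implementation.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Illustrative 80/10/10 MLM masking (not the authors' code)."""
    masked = list(tokens)
    labels = [None] * len(tokens)             # None = position not selected
    for i, token in enumerate(tokens):
        if random.random() >= mask_prob:      # select ~15% of positions
            continue
        labels[i] = token                     # the model must recover the original
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return masked, labels

# Example from the slide: "Make everyday cooking fun"
print(mask_tokens("Make everyday cooking fun".split(), vocab=["swimming", "reading"]))
```
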
  5. [Preliminary] Training Objective • Masked Language Model (MLM) • Predicting

    15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged • Next Sentence Prediction (NSP) • Predicting whether two segments follow each other • Positive: consecutive sentence pairs taken from the corpus • Negative: segments paired from different documents • "The man entered a university. [SEP] He studied mathematics." => Next • "I went to the park. [SEP] Japan is a country." => NotNext
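
A small sketch of NSP pair construction following the same idea (positive: consecutive sentences from one document; negative: a sentence paired with one from a different document). The function and data are illustrative and assume at least two documents with at least two sentences each.

```python
import random

def make_nsp_example(documents):
    """Illustrative NSP pair construction (not the authors' code)."""
    doc_id = random.randrange(len(documents))
    doc = documents[doc_id]
    i = random.randrange(len(doc) - 1)
    first = doc[i]
    if random.random() < 0.5:
        return first, doc[i + 1], "Next"             # positive: consecutive sentences
    other_id = random.choice([d for d in range(len(documents)) if d != doc_id])
    second = random.choice(documents[other_id])      # negative: a different document
    return first, second, "NotNext"

docs = [["The man entered a university.", "He studied mathematics."],
        ["I went to the park.", "Japan is a country."]]
print(make_nsp_example(docs))
```
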
  6. Implementation • DGX-1 (NVIDIA Tesla V100 × 8) + AMP

    (mixed precision) • The Adam epsilon term is sensitive => needs tuning • β2 = 0.98 works well when training with a larger batch size • T = 512 (max sequence length) • They do not randomly inject short sequences (BERT p.13) • They do not train with a reduced sequence length (BERT p.13)
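
A hedged sketch of the optimizer settings mentioned on this slide, using PyTorch's AdamW and AMP utilities as stand-ins. β2 = 0.98 and T = 512 come from the slide; the model, learning rate, epsilon, and weight-decay values are placeholders, not the paper's exact configuration.

```python
import torch

model = torch.nn.Linear(768, 768)        # placeholder model, not RoBERTa itself

MAX_SEQ_LEN = 512                        # T = 512 (max sequence length, used from the start)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,                             # placeholder; the peak LR is tuned per setting
    betas=(0.9, 0.98),                   # beta_2 = 0.98 works well with larger batches
    eps=1e-6,                            # the epsilon term is sensitive and worth tuning
    weight_decay=0.01,                   # placeholder
)
scaler = torch.cuda.amp.GradScaler()     # AMP (mixed precision) training
```
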
  7. Dataset • Over 160GB of text - some datasets are not

    public • BookCorpus [Zhu+, 2015] + English Wikipedia (various books + encyclopedia): 16GB • CC-NEWS [Nagel+, 2016] (newswire): 76GB • OpenWebTextCorpus [Gokaslan+, 2019] [Radford+, 2019] (web pages): 38GB • Stories [Trinh+, 2018] (subset of CommonCrawl): 31GB • 160GB…?
  8. Training Procedure • [BERT] Masking is static • Masked data is

    generated in advance • The corpus is duplicated 10 times during preprocessing • [RoBERTa] Dynamic masking • Masks are generated on the fly
  9. Static vs Dynamic mask • BERT: static masking • The corpus

    is duplicated 10 times and tokens are masked in advance (over 40 epochs, each mask pattern appears 4 times during training) • RoBERTa: dynamic masking (on-the-fly) • For larger corpora, constructing preprocessed data is impractical • 160 GB x 10 => 1.6 TB
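
To make the contrast concrete, here is an illustrative sketch (reusing the mask_tokens helper sketched earlier) of static masking, which materializes 10 differently-masked copies of the corpus, versus dynamic masking, which draws a fresh mask pattern every time a sequence is served. It is not the authors' preprocessing pipeline.

```python
# Static masking (BERT): precompute several differently-masked copies of the corpus.
# With ~160 GB of raw text, 10 copies mean roughly 1.6 TB of preprocessed data.
def build_static_corpus(corpus, vocab, num_duplicates=10):
    return [[mask_tokens(tokens, vocab) for tokens in corpus]
            for _ in range(num_duplicates)]

# Dynamic masking (RoBERTa): nothing is precomputed; a fresh mask pattern is
# generated on the fly each time a sequence is fed to the model.
def dynamic_masked_batches(corpus, vocab, num_epochs=40):
    for _ in range(num_epochs):
        for tokens in corpus:
            yield mask_tokens(tokens, vocab)   # reuses the sketch above
```
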
  10. Static vs Dynamic mask • Comparable or slightly

    better • Why does the reference significantly beat their reimplementation? • Why medians? (not averages)
  11. Training objectives • With NSP: • SEGMENT-PAIR+NSP: the same as BERT;

    a segment is not necessarily a linguistic sentence • SENTENCE-PAIR+NSP: similar to SEGMENT-PAIR+NSP, but each segment is a linguistic sentence; the batch size is increased • Without NSP: • FULL-SENTENCES: inputs are packed with full sentences sampled contiguously, possibly crossing document boundaries • DOC-SENTENCES: similar to FULL-SENTENCES, but an input may not span multiple documents; the batch size is increased (see the packing sketch after the example slides below)
  12. Example: Two documents • We introduce a new

    language representation model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre-trained BERT … specific architecture modifications. • Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it.
  13. SEGMENT-PAIR+NSP

    The same two documents as above, with the spans used as SEGMENT-PAIR+NSP inputs highlighted.
  14. SENTENCE-PAIR+NSP

    The same two documents as above, with the spans used as SENTENCE-PAIR+NSP inputs highlighted. ☝ Large mini-batches
  15. FULL-SENTENCES

    The same two documents as above, with the spans used as FULL-SENTENCES inputs highlighted.
  16. DOC-SENTENCES

    The same two documents as above, with the spans used as DOC-SENTENCES inputs highlighted. ☝ Large mini-batches
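
The two NSP-free formats can be pictured with a short packing sketch. This is illustrative code under the assumption that documents are lists of tokenized sentences; it is not the authors' data pipeline.

```python
def pack_sentences(documents, max_len=512, cross_documents=False):
    """Illustrative packing for FULL-SENTENCES (cross_documents=True) and
    DOC-SENTENCES (cross_documents=False); documents are lists of token lists."""
    inputs, current = [], []
    for doc in documents:
        for sentence in doc:
            if current and len(current) + len(sentence) > max_len:
                inputs.append(current)          # length budget reached: emit an input
                current = []
            current.extend(sentence)
        if not cross_documents and current:     # DOC-SENTENCES: stop at the document end
            inputs.append(current)
            current = []
    if current:
        inputs.append(current)
    return inputs
```
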
  17. Larger batch, better performance • Computational costs are the same

    across settings • Larger batches improve performance
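
The paper trains with genuinely large batches; on limited hardware a common way to approximate this is gradient accumulation. The sketch below is a generic PyTorch pattern with a toy model, not the authors' training loop.

```python
import torch

# Generic gradient-accumulation sketch: approximate a large batch by summing
# gradients over several micro-batches before each optimizer update.
model = torch.nn.Linear(10, 2)                          # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(16)]

accumulation_steps = 8                                  # effective batch = 4 x 8 = 32
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    (loss / accumulation_steps).backward()              # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                # one update per 8 micro-batches
        optimizer.zero_grad()
```
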
  18. Byte Pair Encoding: Subword

    Model / Unit / Vocabulary size • BERT [Devlin+, 2019]: Unicode characters, 30K • GPT-2 [Radford+, 2019]: bytes, 50K
  19. Byte Pair Encoding: Subword

    Model / Unit / Vocabulary size • BERT [Devlin+, 2019]: Unicode characters, 30K • GPT-2 [Radford+, 2019]: bytes, 50K • The authors choose byte-level BPE for its universal encoding scheme, despite slightly worse end-task performance compared with the original (character-level) BPE
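
For comparison, the published tokenizers can be inspected directly. This assumes the Hugging Face transformers package and its standard checkpoint names, which are not part of this deck; the vocabulary sizes printed should roughly match the 30K/50K figures above.

```python
# Assumes the Hugging Face `transformers` package and its published checkpoints.
from transformers import AutoTokenizer

tokenizers = {
    "BERT":    AutoTokenizer.from_pretrained("bert-base-uncased"),  # character-level units, ~30K vocab
    "GPT-2":   AutoTokenizer.from_pretrained("gpt2"),               # byte-level BPE, ~50K vocab
    "RoBERTa": AutoTokenizer.from_pretrained("roberta-base"),       # adopts GPT-2's byte-level BPE
}
for name, tok in tokenizers.items():
    print(name, tok.vocab_size, tok.tokenize("Make everyday cooking fun"))
```
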
  20. Evaluation • GLUE [Wang+, 2019]: 9 NLP tasks • SQuAD

    [Rajpurkar+, 2016; 2018]: machine reading comprehension • RACE [Lai+, 2017]: machine reading comprehension • [Appendix] SuperGLUE [Wang+, 2019]: 10 NLP tasks • [Appendix] XNLI [Conneau+, 2018]: natural language inference
  21. Conclusion • The authors propose RoBERTa, a recipe for improving

    BERT • The BERT model was significantly undertrained • Longer training, bigger batches, removing NSP, longer sequences, and dynamic masking • They conduct experiments to compare design decisions: training objective, subword tokenization, batch size, and training duration