RoBERTa: paper reading

himkt
July 31, 2020


Slides from a paper reading session held in-house.


Transcript

  1. RoBERTa A Robustly Optimized BERT Pretraining Approach Makoto Hiramatsu @himkt

  2. Summary • Hyperparameter choices have a significant impact • They present

    a replication study of BERT pretraining, carefully measuring the impact of hyperparameters and training data size • BERT was significantly undertrained • They publish a new model that exceeds the current SoTA 2 / 41
  3. RoBERTa is the recipe for improving the performance of BERT

  4. [Preliminary] Setup: BERT • Input: a concatenation of two segments

    with special symbols • [CLS], x_1, …, x_n, [SEP], y_1, …, y_m • Model: N-layer Transformer • [Vaswani+, 2017] 4 / 41
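The input format on this slide can be sketched in a few lines of Python. This is only an illustration over token strings; `build_input` is a hypothetical helper (real inputs are token ids produced by a tokenizer):

```python
# Minimal sketch of BERT-style input construction over token strings
# (hypothetical helper; real inputs are tokenizer-produced ids).
def build_input(segment_a, segment_b, max_len=512):
    # [CLS] x_1 ... x_n [SEP] y_1 ... y_m
    tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b
    return tokens[:max_len]

print(build_input(["we", "read", "papers"], ["at", "work"]))
```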
  5. [Preliminary] Training Objective • Masked Language Model (MLM) • Predicting

    15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged • Next Sentence Prediction (NSP) • Predicting whether two segments follow each other • P: taking sentence pairs from the corpus • N: pairing segments from different documents 5 / 41
  6. [Preliminary] Training Objective • Masked Language Model (MLM) • Predicting

    15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged • Example: "Make everyday cooking fun" → "[MASK] everyday swimming fun" (Make → [MASK], cooking → random token "swimming") 6 / 41
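The 80/10/10 corruption rule shown on this slide can be sketched as follows. This is a simplified illustration, not the authors' implementation; the vocabulary here is a toy list:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    # Select ~15% of positions; of those: 80% -> [MASK],
    # 10% -> a random vocabulary token, 10% -> left unchanged.
    rng = rng or random.Random(0)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the token as-is
    return out, targets
```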
  7. [Preliminary] Training Objective • Masked Language Model (MLM) • Predicting

    15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged • Next Sentence Prediction (NSP) • Predicting whether two segments follow each other • P: taking sentence pairs from the corpus • N: pairing segments from different documents 7 / 41 The man entered a university. [SEP] He studied mathematics. => Next I went to the park. [SEP] Japan is a country. => NotNext
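A sketch of how the NSP pairs above could be constructed: positives take adjacent segments from the same document, negatives pair a segment with one from a different document. `make_nsp_pairs` is a hypothetical helper, not the authors' code:

```python
import random

def make_nsp_pairs(docs, rng=None):
    # Positives ("Next"): adjacent segments from the same document.
    # Negatives ("NotNext"): a segment paired with one from another document.
    rng = rng or random.Random(0)
    pairs = []
    for d, doc in enumerate(docs):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], "Next"))
            else:
                other = rng.choice([j for j in range(len(docs)) if j != d])
                pairs.append((doc[i], rng.choice(docs[other]), "NotNext"))
    return pairs
```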
  8. [Preliminary] Dataset • BookCorpus [Zhu+, 2015] + English Wikipedia •

    ≈ 16GB of raw text 8 / 41
  9. </preliminary><RoBERTa>

  10. Implementation • DGX-1 (8× NVIDIA Tesla V100) + AMP

    (mixed precision) • Adam epsilon term is sensitive => tuned • β2 = 0.98 works well when training with larger batch sizes • T = 512 (max sequence length) • They do not randomly inject short sequences (BERT p.13) • They do not train with a reduced sequence length (BERT p.13) 10 / 41
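For reference, a scalar sketch of the Adam update shows where β2 and the epsilon term enter. The values below follow the paper's setup (β2 = 0.98, small eps); the learning rate is illustrative, since the paper tunes it per batch size. This is not the authors' training code:

```python
import math

def adam_step(param, grad, m, v, t, lr=6e-4, beta1=0.9, beta2=0.98, eps=1e-6):
    # One scalar Adam update. The paper sets beta2 = 0.98 and tunes the
    # epsilon term; the peak learning rate is tuned per batch size.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)  # bias correction for the second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```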
  11. Longer Sequences 11 / 41

  12. BERT p.13 12 / 41

  13. BERT p.13 13 / 41

  14. Dataset • Over 160GB of text - some datasets are not

    public 14 / 41 BookCorpus [Zhu+, 2015] + English Wikipedia (various books + encyclopedia, 16GB) • CC-NEWS [Nagel+, 2016] (newswire, 76GB) • OpenWebTextCorpus [Gokaslan+, 2019] [Radford+, 2019] (web pages, 38GB) • Stories [Trinh+, 2018] (subset of CommonCrawl, 31GB) 160GB…?
  15. Training Procedure • [BERT] Masking is static • Generating masked

    data in advance • The data is duplicated 10 times during preprocessing • [RoBERTa] Dynamic masking • Masks are generated on the fly 15 / 41
  16. Static vs Dynamic mask • BERT: static mask • The corpus

    is duplicated 10 times and some tokens are masked in advance (40 epochs => each mask appears 4 times during training) • RoBERTa: dynamic mask (in-place masking) • For larger corpora, it is difficult to construct preprocessed data • 160 GB x 10 => 1.6 TB 16 / 41
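The contrast can be sketched as a generator that re-samples masks every epoch instead of storing 10 statically masked copies. Simplified: only [MASK] replacement is shown, without the 80/10/10 split:

```python
import random

def dynamic_batches(corpus, num_epochs, mask_rate=0.15, seed=0):
    # Dynamic masking: re-sample masked positions every time a sequence
    # is seen, instead of storing 10 statically masked copies on disk.
    rng = random.Random(seed)
    for _ in range(num_epochs):
        for sent in corpus:
            yield ["[MASK]" if rng.random() < mask_rate else tok
                   for tok in sent]
```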
  17. Static vs Dynamic mask 17 / 41 Comparable or slightly

    better • Why does the reference significantly beat the existing implementation? • Why medians (not averages)?
  18. Author’s Choice 18 / 41

  19. Training objectives 19 / 41 SEGMENT-PAIR+NSP The same as BERT;

    a "sentence" is not necessarily a linguistic sentence • SENTENCE-PAIR+NSP Similar to SEGMENT-PAIR+NSP, but each "sentence" is a linguistic sentence; the batch size is increased • FULL-SENTENCES Inputs packed with full sentences sampled contiguously, possibly crossing document boundaries • DOC-SENTENCES Similar to FULL-SENTENCES, but inputs never cross document boundaries; the batch size is increased • (the first two use NSP, the last two drop NSP)
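The FULL-SENTENCES packing described above can be sketched as follows. This is illustrative only; the real setup also inserts separator tokens at document boundaries and works on token ids:

```python
def pack_full_sentences(sentences, max_len=512):
    # FULL-SENTENCES packing: fill each input with contiguous sentences
    # (possibly crossing document boundaries) until max_len is reached.
    inputs, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            inputs.append(current)
            current = []
        current.extend(sent)
    if current:
        inputs.append(current)
    return inputs
```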
  20. None
  21. Example: Two documents 21 / 41 We introduce a new

    language representation model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre- trained BERT …specific architecture modifications. Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it.
  22. SEGMENT-PAIR+NSP 22 / 41 We introduce a new language representation

    model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre- trained BERT …specific architecture modifications. Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it.
  23. SENTENCE-PAIR+NSP 23 / 41 We introduce a new language representation

    model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre- trained BERT …specific architecture modifications. Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it. ☝ Large mini batches
  24. FULL-SENTENCES 24 / 41 We introduce a new language representation

    model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre- trained BERT …specific architecture modifications. Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it.
  25. DOC-SENTENCES 25 / 41 We introduce a new language representation

    model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre- trained BERT …specific architecture modifications. Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it. ☝ Large mini batches
  26. NSP or not NSP 26 / 41

  27. RoBERTa is strong! 27 / 41

  28. But better or not… 28 / 41

  29. 29 / 41 Author's choice: for easier comparison with related

    work
  30. Larger batch, better performance • Computational cost is held constant

    across settings • Larger batches improve performance 30 / 41
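The "same computational cost" point can be checked directly: batch size × number of steps is held roughly constant across the settings compared in the paper, so each configuration processes a comparable amount of data:

```python
# Batch size x number of steps is held roughly constant across the
# settings compared in the paper, so total data seen is comparable.
settings = [(256, 1_000_000), (2_000, 125_000), (8_000, 31_000)]
totals = [bsz * steps for bsz, steps in settings]
print(totals)
```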
  31. RoBERTa vs BERT vs XLNet 31 / 41

  32. RoBERTa outperforms BERT! 32 / 41 13 GB…?

  33. Longer training, State-of-the-Art 33 / 41

  34. 34 / 41 A batch size of 8K is used for parallelization efficiency (author's

    choice)
  35. 35 / 41 Model / Unit / Vocabulary size: BERT [Devlin+, 2019]

    Unicode character 30K; GPT-2 [Radford+, 2019] Byte 50K • Byte Pair Encoding: subword
  36. 36 / 41 Model / Unit / Vocabulary size: BERT [Devlin+, 2019]

    Unicode character 30K; GPT-2 [Radford+, 2019] Byte 50K • Author's choice: to use a universal encoding scheme, despite slightly worse end-task performance compared with the original (character-level) BPE
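The "universal encoding scheme" point can be seen directly: byte-level BPE (GPT-2 / RoBERTa style) builds on a 256-symbol base alphabet, so any Unicode string is representable without unknown tokens. A minimal illustration:

```python
# Any Unicode string decomposes into UTF-8 bytes (values 0-255), so a
# byte-level BPE vocabulary needs no <unk> token for unseen characters.
text = "café 😀"
byte_units = list(text.encode("utf-8"))
assert all(0 <= b <= 255 for b in byte_units)
print(len(text), len(byte_units))  # non-ASCII characters expand to several bytes
```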
  37. Evaluation • GLUE [Wang+, 2019]: 9 NLP tasks • SQuAD

    [Rajpurkar+, 2016 + 2018]: Machine Comprehension • RACE [Lai+, 2017]: Machine Comprehension • [Appendix] SuperGLUE [Wang+, 2019]: 10 NLP tasks • [Appendix] XNLI [Conneau+, 2018]: Natural Language Inference 37 / 41
  38. GLUE + WNLI (SuperGLUE) 38 / 41

  39. SQuAD and RACE 39 / 41

  40. Conclusion • The authors propose RoBERTa, a recipe to improve

    BERT • The BERT model was significantly undertrained • Longer training, bigger batches, removing NSP, longer sequences, dynamic masking • They conduct experiments to compare design decisions: objective, subword tokenization, batch size, and training duration 40 / 41
  41. 41 / 41 https://twitter.com/anh_ng8/status/1271313676398551046