Slide 1

RoBERTa: A Robustly Optimized BERT Pretraining Approach
Makoto Hiramatsu (@himkt)

Slide 2

Summary
• Hyperparameter choices have a significant impact
• The authors replicate BERT pretraining while carefully measuring the impact of hyperparameters and training data size
• BERT was significantly undertrained
• They publish a new model that exceeds the current SoTA

Slide 3

RoBERTa is a recipe for improving the performance of BERT

Slide 4

[Preliminary] Setup: BERT
• Input: a concatenation of two segments with special symbols
  • [CLS], x_1, …, x_n, [SEP], y_1, …, y_m
• Model: N-layer Transformer [Vaswani+, 2017]
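A minimal sketch of how two segments are packed into a single input; `pack_segments` is a hypothetical helper for illustration, not the authors' code:

```python
# Illustrative only: pack two token lists as [CLS] x_1..x_n [SEP] y_1..y_m.
def pack_segments(segment_a, segment_b, max_len=512):
    tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b
    return tokens[:max_len]  # truncate to the model's maximum sequence length

print(pack_segments(["make", "everyday"], ["cooking", "fun"]))
# ['[CLS]', 'make', 'everyday', '[SEP]', 'cooking', 'fun']
```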

Slide 5

[Preliminary] Training Objective
• Masked Language Model (MLM)
  • Predicting 15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged
• Next Sentence Prediction (NSP)
  • Predicting whether two segments follow each other
    • Positive: sentence pairs taken from the corpus
    • Negative: segments paired from different documents

Slide 6

[Preliminary] Training Objective
• Masked Language Model (MLM)
  • Predicting 15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged
• Example: "Make everyday cooking fun" → "[MASK] everyday swimming fun"
  ("Make" is masked; "cooking" is replaced by the random token "swimming")
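A small sketch of the 80/10/10 masking rule described above (my own illustration with a made-up `mlm_mask` helper, not the paper's implementation):

```python
import random

# Illustration of BERT-style MLM masking: pick ~15% of positions, then apply
# 80% [MASK] / 10% random token / 10% unchanged.
def mlm_mask(tokens, vocab, mask_prob=0.15, seed=None):
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # this position is not selected
        labels[i] = tok                    # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask symbol
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random vocabulary token
        # remaining 10%: keep the token unchanged
    return masked, labels

print(mlm_mask(["Make", "everyday", "cooking", "fun"], vocab=["swimming", "reading"]))
```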

Slide 7

[Preliminary] Training Objective
• Masked Language Model (MLM)
  • Predicting 15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged
• Next Sentence Prediction (NSP)
  • Predicting whether two segments follow each other
    • Positive: sentence pairs taken from the corpus
    • Negative: segments paired from different documents
• Examples:
  • "The man entered a university. [SEP] He studied mathematics." => Next
  • "I went to the park. [SEP] Japan is a country." => NotNext
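The positive/negative pairing can be made concrete with a short sketch; `make_nsp_pairs` is a hypothetical helper, not the authors' data pipeline:

```python
import random

# Positives are adjacent sentences from the same document; negatives pair a
# sentence with one drawn from a different document.
def make_nsp_pairs(documents, seed=0):
    rng = random.Random(seed)
    pairs = []
    for doc_id, sentences in enumerate(documents):
        for first, second in zip(sentences, sentences[1:]):
            if rng.random() < 0.5:
                pairs.append((first, second, "Next"))
            else:
                other = rng.choice([i for i in range(len(documents)) if i != doc_id])
                pairs.append((first, rng.choice(documents[other]), "NotNext"))
    return pairs

docs = [["The man entered a university.", "He studied mathematics."],
        ["I went to the park.", "Japan is a country."]]
for a, b, label in make_nsp_pairs(docs):
    print(f"{a} [SEP] {b} => {label}")
```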

Slide 8

[Preliminary] Dataset
• BookCorpus [Zhu+, 2015] + English Wikipedia
• ≈ 16GB of raw text

Slide 9

Slide 10

Implementation
• DGX-1 (NVIDIA Tesla V100 × 8) + AMP (mixed precision)
• The Adam epsilon term is sensitive => run tuning
• β2 = 0.98 works well when training with larger batch sizes
• T = 512 (max sequence length)
  • They do not randomly inject short sequences (BERT p.13)
  • They do not train with a reduced sequence length (BERT p.13)
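As a rough PyTorch sketch of these optimizer settings (the model is a stand-in, and the learning rate, epsilon, and weight decay values are assumptions rather than values stated on the slide):

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in module, not the actual Transformer
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=6e-4,            # assumption: illustrative peak learning rate
    betas=(0.9, 0.98),  # β2 = 0.98 instead of the default 0.999, as on the slide
    eps=1e-6,           # assumption: one plausible value for the tuned epsilon term
    weight_decay=0.01,  # assumption: L2 regularization commonly used in BERT-style training
)
```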

Slide 11

Longer Sequences

Slide 12

BERT p.13

Slide 13

BERT p.13

Slide 14

Dataset
• Over 160GB of text in total (some of the datasets are not public)
  • BookCorpus [Zhu+, 2015] + English Wikipedia: various books + encyclopedia, 16GB
  • CC-NEWS [Nagel+, 2016]: newswire, 76GB
  • OpenWebTextCorpus [Gokaslan+, 2019] [Radford+, 2019]: web pages, 38GB
  • Stories [Trinh+, 2018]: subset of CommonCrawl, 31GB
• 160GB…? (the listed sizes sum to 161GB)

Slide 15

Training Procedure
• [BERT] Static masking
  • Masked data is generated in advance
  • The data is duplicated 10 times during preprocessing
• [RoBERTa] Dynamic masking
  • Masks are generated on the fly

Slide 16

Static vs Dynamic Masking
• BERT: static masking
  • The corpus is duplicated 10 times and tokens are masked in advance
    (40 epochs => each masked copy appears 4 times during training)
• RoBERTa: dynamic masking (masks are generated in place)
  • For larger corpora, it is difficult to construct the preprocessed data
    • 160 GB × 10 => 1.6 TB
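A conceptual sketch of the difference (the names and the simplified masking are my own illustration, not the released code):

```python
import random

def random_mask(tokens, rng, mask_prob=0.15):
    # Simplified masking: ~15% of tokens become [MASK] (the 80/10/10 split is omitted here).
    return [("[MASK]" if rng.random() < mask_prob else tok) for tok in tokens]

def static_masking(corpus, num_copies=10):
    # BERT-style: pre-generate 10 masked copies of the corpus during preprocessing;
    # over 40 epochs each copy is reused 4 times, and 160 GB × 10 would be 1.6 TB on disk.
    return [[random_mask(seq, random.Random(copy)) for seq in corpus]
            for copy in range(num_copies)]

def dynamic_masking(corpus):
    # RoBERTa-style: draw a fresh mask every time a sequence is fed to the model,
    # so no extra preprocessed data has to be materialized.
    rng = random.Random()
    for seq in corpus:
        yield random_mask(seq, rng)
```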

Slide 17

Static vs Dynamic Masking
• Dynamic masking is comparable to or slightly better than static masking
• Why does the reference significantly beat the existing implementation?
• Why medians (not averages)?

Slide 18

Author’s Choice

Slide 19

Training Objectives (input formats)
• With NSP:
  • SEGMENT-PAIR+NSP: the same as BERT; a "sentence" is not necessarily a linguistic sentence
  • SENTENCE-PAIR+NSP: similar to SEGMENT-PAIR+NSP, but each sentence is a linguistic sentence (the batch size is increased to compensate)
• Without NSP:
  • FULL-SENTENCES: inputs are packed with full sentences sampled contiguously, possibly crossing document boundaries
  • DOC-SENTENCES: similar to FULL-SENTENCES, but an input may not span multiple documents (the batch size is increased to compensate)

Slide 20

Slide 21

Example: Two documents
(Figure: excerpts from two example documents, the BERT abstract and the RoBERTa abstract, used to illustrate the input formats.)

Slide 22

SEGMENT-PAIR+NSP
(Figure: the two example documents split into inputs under SEGMENT-PAIR+NSP.)

Slide 23

SENTENCE-PAIR+NSP
(Figure: the two example documents split into inputs under SENTENCE-PAIR+NSP.)
☝ Large mini-batches

Slide 24

FULL-SENTENCES
(Figure: the two example documents packed into inputs under FULL-SENTENCES.)

Slide 25

DOC-SENTENCES
(Figure: the two example documents packed into inputs under DOC-SENTENCES.)
☝ Large mini-batches
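A rough sketch of the packing illustrated above (the helper and its flag are my own assumptions, not the authors' code): with `cross_documents=True` it behaves like FULL-SENTENCES, and with `False` like DOC-SENTENCES.

```python
def pack_inputs(documents, max_tokens=512, cross_documents=True, sep="[SEP]"):
    # Each document is a list of sentences; each sentence is a list of tokens.
    inputs, current = [], []
    for doc in documents:
        if not cross_documents and current:
            inputs.append(current)  # DOC-SENTENCES: flush at every document boundary
            current = []
        for sentence in doc:
            if len(current) + len(sentence) + 1 > max_tokens:
                inputs.append(current)  # the 512-token budget is full
                current = []
            current.extend(sentence + [sep])
    if current:
        inputs.append(current)
    return inputs
```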

Slide 26

NSP or not NSP

Slide 27

RoBERTa is strong!

Slide 28

But is it better or not…

Slide 29

Author’s choice: FULL-SENTENCES, for easier comparison with related work

Slide 30

Larger batch, better performance
• The computational cost is the same across settings
• Larger batches improve performance
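One common way to reach a large effective batch on limited hardware is gradient accumulation; below is a generic PyTorch sketch with illustrative sizes, not the paper's actual setup.

```python
import torch

# Sum gradients over several micro-batches, then take one optimizer step,
# which approximates training with one large batch.
model = torch.nn.Linear(768, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
accum_steps = 32  # e.g. 8192 sequences split into 32 passes of 256 (illustrative numbers)

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(256, 768)                  # stand-in micro-batch of features
    y = torch.randint(0, 2, (256,))            # stand-in labels
    loss = loss_fn(model(x), y) / accum_steps  # scale so the accumulated gradient is an average
    loss.backward()                            # gradients accumulate in .grad
optimizer.step()                               # one update for the whole large batch
```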

Slide 31

RoBERTa vs BERT vs XLNet

Slide 32

RoBERTa outperforms BERT!
13 GB…?

Slide 33

Longer training, State-of-the-Art

Slide 34

Author’s choice: a batch size of 8K is used for parallelization efficiency

Slide 35

Byte Pair Encoding (BPE): subword units
• BERT [Devlin+, 2019]: Unicode characters as the base unit, 30K vocabulary
• GPT-2 [Radford+, 2019]: bytes as the base unit, 50K vocabulary

Slide 36

Byte Pair Encoding (BPE): subword units
• BERT [Devlin+, 2019]: Unicode characters as the base unit, 30K vocabulary
• GPT-2 [Radford+, 2019]: bytes as the base unit, 50K vocabulary
• Author’s choice: byte-level BPE, for its universal encoding scheme, despite slightly worse end-task performance compared with the original (character-level) BPE
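For a quick look at byte-level BPE in practice, a short example assuming the Hugging Face `transformers` package (which ships RoBERTa's 50K byte-level vocabulary; this is not part of the paper itself):

```python
from transformers import RobertaTokenizer

# Byte-level BPE splits raw text into subword units without needing an
# "unknown" token for rare characters.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("Make everyday cooking fun"))
# e.g. ['Make', 'Ġeveryday', 'Ġcooking', 'Ġfun']  (Ġ marks a preceding space)
```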

Slide 37

Evaluation
• GLUE [Wang+, 2019]: 9 NLP tasks
• SQuAD [Rajpurkar+, 2016; 2018]: machine comprehension
• RACE [Lai+, 2017]: machine comprehension
• [Appendix] SuperGLUE [Wang+, 2019]: 10 NLP tasks
• [Appendix] XNLI [Conneau+, 2018]: natural language inference

Slide 38

GLUE + WNLI (SuperGLUE)

Slide 39

SQuAD and RACE

Slide 40

Conclusion
• The authors proposed RoBERTa, a recipe for improving BERT
• The BERT model was significantly undertrained
  • Longer training, bigger batches, removing NSP, longer sequences, dynamically changing the masking
• They conducted experiments to compare design decisions: training objective, subword tokenization, batch size, and training duration

Slide 41

https://twitter.com/anh_ng8/status/1271313676398551046