RoBERTa: paper reading

himkt
July 31, 2020


Slides from a paper reading session held in-house.


Transcript

  1. RoBERTa A Robustly Optimized BERT Pretraining Approach Makoto Hiramatsu @himkt

  2. Summary • Hyperparameter choices have a significant impact • They present

    a replication study of BERT pretraining, carefully measuring the impact of hyperparameters and training data size • BERT was significantly undertrained • They publish a new model that exceeds the current SoTA 2 / 41
  3. RoBERTa is the recipe for improving the performance of BERT

  4. [Preliminary] Setup: BERT • Input: a concatenation of two segments

    with special symbols • [CLS], x_1, …, x_n, [SEP], y_1, …, y_m • Model: N-layer Transformer • [Vaswani+, 2017] 4 / 41
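The input format on this slide can be sketched in a few lines of Python. This is only an illustration over token strings; `build_input` is a hypothetical helper (real inputs are token ids produced by a tokenizer):

```python
# Minimal sketch of BERT-style input construction over token strings
# (hypothetical helper; real inputs are tokenizer-produced ids).
def build_input(segment_a, segment_b, max_len=512):
    # [CLS] x_1 ... x_n [SEP] y_1 ... y_m
    tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b
    return tokens[:max_len]

print(build_input(["we", "read", "papers"], ["at", "work"]))
```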
  5. [Preliminary] Training Objective • Masked Language Model (MLM) • Predicting

    15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged • Next Sentence Prediction (NSP) • Predicting whether two segments follow each other • P: taking sentence pairs from the corpus • N: pairing segments from different documents 5 / 41
  6. [Preliminary] Training Objective • Masked Language Model (MLM) • Predicting

    15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged • Example: "Make everyday cooking fun" → "[MASK] everyday swimming fun" (Make → [MASK], cooking → random token "swimming") 6 / 41
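The 80/10/10 corruption rule shown on this slide can be sketched as follows. This is a simplified illustration, not the authors' implementation; the vocabulary here is a toy list:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    # Select ~15% of positions; of those: 80% -> [MASK],
    # 10% -> a random vocabulary token, 10% -> left unchanged.
    rng = rng or random.Random(0)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the token as-is
    return out, targets
```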
  7. [Preliminary] Training Objective • Masked Language Model (MLM) • Predicting

    15% of the input tokens: 80% [MASK] + 10% random + 10% unchanged • Next Sentence Prediction (NSP) • Predicting whether two segments follow each other • P: taking sentence pairs from the corpus • N: pairing segments from different documents 7 / 41 The man entered a university. [SEP] He studied mathematics. => Next I went to the park. [SEP] Japan is a country. => NotNext
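A sketch of how the NSP pairs above could be constructed: positives take adjacent segments from the same document, negatives pair a segment with one from a different document. `make_nsp_pairs` is a hypothetical helper, not the authors' code:

```python
import random

def make_nsp_pairs(docs, rng=None):
    # Positives ("Next"): adjacent segments from the same document.
    # Negatives ("NotNext"): a segment paired with one from another document.
    rng = rng or random.Random(0)
    pairs = []
    for d, doc in enumerate(docs):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], "Next"))
            else:
                other = rng.choice([j for j in range(len(docs)) if j != d])
                pairs.append((doc[i], rng.choice(docs[other]), "NotNext"))
    return pairs
```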
  8. [Preliminary] Dataset • BookCorpus [Zhu+, 2015] + English Wikipedia •

    ≈ 16GB of raw text 8 / 41
  9. </preliminary><RoBERTa>

  10. Implementation • DGX-1 (8× NVIDIA Tesla V100) + AMP

    (mixed precision) • Adam epsilon term is sensitive => tuned • β2 = 0.98 works well when training with larger batch sizes • T = 512 (max sequence length) • They do not randomly inject short sequences (BERT p.13) • They do not train with a reduced sequence length (BERT p.13) 10 / 41
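For reference, a scalar sketch of the Adam update shows where β2 and the epsilon term enter. The values below follow the paper's setup (β2 = 0.98, small eps); the learning rate is illustrative, since the paper tunes it per batch size. This is not the authors' training code:

```python
import math

def adam_step(param, grad, m, v, t, lr=6e-4, beta1=0.9, beta2=0.98, eps=1e-6):
    # One scalar Adam update. The paper sets beta2 = 0.98 and tunes the
    # epsilon term; the peak learning rate is tuned per batch size.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)  # bias correction for the second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```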
  11. Longer Sequences 11 / 41

  12. BERT p.13 12 / 41

  13. BERT p.13 13 / 41

  14. Dataset • Over 160GB of text - some datasets are not

    public 14 / 41 BookCorpus [Zhu+, 2015] + English Wikipedia (various books + encyclopedia, 16GB) • CC-NEWS [Nagel+, 2016] (newswire, 76GB) • OpenWebTextCorpus [Gokaslan+, 2019] [Radford+, 2019] (web pages, 38GB) • Stories [Trinh+, 2018] (subset of CommonCrawl, 31GB) 160GB…?
  15. Training Procedure • [BERT] Masking is static • Generating masked

    data in advance • The data is duplicated 10 times during preprocessing • [RoBERTa] Dynamic masking • Masks are generated on the fly 15 / 41
  16. Static vs Dynamic mask • BERT: static mask • The corpus

    is duplicated 10 times and some tokens are masked in advance (40 epochs => each mask appears 4 times during training) • RoBERTa: dynamic mask (in-place masking) • For larger corpora, it is difficult to construct preprocessed data • 160 GB x 10 => 1.6 TB 16 / 41
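The contrast can be sketched as a generator that re-samples masks every epoch instead of storing 10 statically masked copies. Simplified: only [MASK] replacement is shown, without the 80/10/10 split:

```python
import random

def dynamic_batches(corpus, num_epochs, mask_rate=0.15, seed=0):
    # Dynamic masking: re-sample masked positions every time a sequence
    # is seen, instead of storing 10 statically masked copies on disk.
    rng = random.Random(seed)
    for _ in range(num_epochs):
        for sent in corpus:
            yield ["[MASK]" if rng.random() < mask_rate else tok
                   for tok in sent]
```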
  17. Static vs Dynamic mask 17 / 41 Comparable or slightly

    better • Why does the reference significantly beat the existing implementation? • Why medians (not averages)?
  18. Author’s Choice 18 / 41

  19. Training objectives 19 / 41 SEGMENT-PAIR+NSP The same as BERT;

    a "sentence" is not necessarily a linguistic sentence • SENTENCE-PAIR+NSP Similar to SEGMENT-PAIR+NSP, but each "sentence" is a linguistic sentence; the batch size is increased • FULL-SENTENCES Inputs packed with full sentences sampled contiguously, possibly crossing document boundaries • DOC-SENTENCES Similar to FULL-SENTENCES, but inputs never cross document boundaries; the batch size is increased • (the first two use NSP, the last two drop NSP)
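The FULL-SENTENCES packing described above can be sketched as follows. This is illustrative only; the real setup also inserts separator tokens at document boundaries and works on token ids:

```python
def pack_full_sentences(sentences, max_len=512):
    # FULL-SENTENCES packing: fill each input with contiguous sentences
    # (possibly crossing document boundaries) until max_len is reached.
    inputs, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            inputs.append(current)
            current = []
        current.extend(sent)
    if current:
        inputs.append(current)
    return inputs
```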
  20. None
  21. Example: Two documents 21 / 41 We introduce a new

    language representation model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre- trained BERT …specific architecture modifications. Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it.
  22. SEGMENT-PAIR+NSP 22 / 41 We introduce a new language representation

    model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre- trained BERT …specific architecture modifications. Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it.
  23. SENTENCE-PAIR+NSP 23 / 41 We introduce a new language representation

    model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre- trained BERT …specific architecture modifications. Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it. ☝ Large mini batches
  24. FULL-SENTENCES 24 / 41 We introduce a new language representation

    model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre- trained BERT …specific architecture modifications. Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it.
  25. DOC-SENTENCES 25 / 41 We introduce a new language representation

    model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre- trained BERT …specific architecture modifications. Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it. ☝ Large mini batches
  26. NSP or not NSP 26 / 41

  27. RoBERTa is strong! 27 / 41

  28. But better or not… 28 / 41

  29. 29 / 41 Author's choice: for easier comparison with related

    work
  30. Larger batch, better performance • Computational cost is held constant

    across settings • Larger batches improve performance 30 / 41
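The "same computational cost" point can be checked directly: batch size × number of steps is held roughly constant across the settings compared in the paper, so each configuration processes a comparable amount of data:

```python
# Batch size x number of steps is held roughly constant across the
# settings compared in the paper, so total data seen is comparable.
settings = [(256, 1_000_000), (2_000, 125_000), (8_000, 31_000)]
totals = [bsz * steps for bsz, steps in settings]
print(totals)
```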
  31. RoBERTa vs BERT vs XLNet 31 / 41

  32. RoBERTa outperforms BERT! 32 / 41 13 GB…?

  33. Longer training, State-of-the-Art 33 / 41

  34. 34 / 41 A batch size of 8K is used for parallelization efficiency (author's

    choice)
  35. 35 / 41 Model / Unit / Vocabulary size: BERT [Devlin+, 2019]

    Unicode character 30K; GPT-2 [Radford+, 2019] Byte 50K • Byte Pair Encoding: subword
  36. 36 / 41 Model / Unit / Vocabulary size: BERT [Devlin+, 2019]

    Unicode character 30K; GPT-2 [Radford+, 2019] Byte 50K • Author's choice: to use a universal encoding scheme, despite slightly worse end-task performance compared with the original (character-level) BPE
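The "universal encoding scheme" point can be seen directly: byte-level BPE (GPT-2 / RoBERTa style) builds on a 256-symbol base alphabet, so any Unicode string is representable without unknown tokens. A minimal illustration:

```python
# Any Unicode string decomposes into UTF-8 bytes (values 0-255), so a
# byte-level BPE vocabulary needs no <unk> token for unseen characters.
text = "café 😀"
byte_units = list(text.encode("utf-8"))
assert all(0 <= b <= 255 for b in byte_units)
print(len(text), len(byte_units))  # non-ASCII characters expand to several bytes
```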
  37. Evaluation • GLUE [Wang+, 2019]: 9 NLP tasks • SQuAD

    [Rajpurkar+, 2016 + 2018]: Machine Comprehension • RACE [Lai+, 2017]: Machine Comprehension • [Appendix] SuperGLUE [Wang+, 2019]: 10 NLP tasks • [Appendix] XNLI [Conneau+, 2018]: Natural Language Inference 37 / 41
  38. GLUE + WNLI (SuperGLUE) 38 / 41

  39. SQuAD and RACE 39 / 41

  40. Conclusion • The authors propose RoBERTa, a recipe to improve

    BERT • The BERT model was significantly undertrained • Longer training, bigger batches, removing NSP, longer sequences, dynamic masking • They conduct experiments to compare design decisions: objective, subword tokenization, batch size, and training duration 40 / 41
  41. 41 / 41 https://twitter.com/anh_ng8/status/1271313676398551046