of BERT pretraining that carefully measures the impact of hyperparameters and training data size
• BERT was significantly undertrained
• They published a new model that exceeded the then-current SoTA
15% of the input tokens: 80% [MASK] + 10% Random + 10% Unchanged
• Next Sentence Prediction (NSP)
• Predicting whether two segments follow each other
  Positive (P): taking sentence pairs from the corpus
  Negative (N): pairing segments from different documents
  The man entered a university. [SEP] He studied mathematics. => Next
  I went to the park. [SEP] Japan is a country. => NotNext
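The 80% / 10% / 10% corruption rule for the selected 15% of tokens can be sketched as follows. This is a minimal illustration, not the BERT implementation: the function name `mask_tokens`, the toy vocabulary, and the per-token independent selection are assumptions, and special tokens such as [SEP] are not excluded from masking here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                      # token not selected (about 85% of tokens)
        labels[i] = tok                   # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token (may equal the original)
        # remaining 10%: leave the token unchanged
    return corrupted, labels

# Hypothetical toy data, for illustration only.
vocab = ["the", "man", "entered", "a", "university", "he", "studied", "mathematics"]
tokens = ["the", "man", "entered", "a", "university"] * 200
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
n_selected = sum(label is not None for label in labels)
print(n_selected, corrupted.count(MASK))
```

Drawing the selection independently per token gives 15% only in expectation; an implementation that masks an exact number of positions per sequence would sample the positions instead.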
(mixed precision)
• Training is sensitive to the Adam epsilon term => they tuned it
• β2 = 0.98 works well when training with larger batch sizes
• T = 512 (max sequence length)
• They do not randomly inject short sequences (BERT p.13)
• They do not train with a reduced sequence length (BERT p.13)
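To see where the epsilon term and β2 enter, here is a sketch of one Adam update for a single scalar parameter. Only β2 = 0.98 comes from the slide; the learning rate, β1 = 0.9, and both epsilon values are assumptions chosen for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.98, eps=1e-6):
    """One Adam update for a scalar parameter; t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# With a very small gradient, sqrt(v_hat) is comparable to eps,
# so the choice of eps changes the step size by a large factor.
tiny_grad = 1e-8
step_large_eps = -adam_step(0.0, tiny_grad, 0.0, 0.0, 1, eps=1e-6)[0]
step_small_eps = -adam_step(0.0, tiny_grad, 0.0, 0.0, 1, eps=1e-8)[0]
print(step_large_eps, step_small_eps)
```

The comparison at the end illustrates one way the sensitivity can arise: once gradients are small, epsilon dominates the denominator and effectively rescales the update.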
is duplicated 10 times and some tokens are masked in advance (40 epochs => each masked copy appears 4 times during training)
• RoBERTa: dynamic masking (the mask is generated each time a sequence is fed to the model)
• For larger corpora, it is difficult to construct the preprocessed data
• 160 GB x 10 => 1.6 TB
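The difference between the two schemes can be sketched by tracking only which positions get masked in each epoch. The function names and the 100-token sequence are hypothetical; the 10 copies, 40 epochs, and 160 GB figure are taken from the slide.

```python
import random
from collections import Counter

def choose_mask_positions(n_tokens, rng, mask_prob=0.15):
    """Pick about 15% of the positions (at least one) to mask."""
    n_mask = max(1, round(n_tokens * mask_prob))
    return tuple(sorted(rng.sample(range(n_tokens), n_mask)))

def static_patterns(n_tokens, n_epochs=40, n_copies=10, seed=0):
    """Static masking: draw n_copies patterns once, then cycle through them."""
    rng = random.Random(seed)
    copies = [choose_mask_positions(n_tokens, rng) for _ in range(n_copies)]
    return [copies[epoch % n_copies] for epoch in range(n_epochs)]

def dynamic_patterns(n_tokens, n_epochs=40, seed=0):
    """Dynamic masking: draw a fresh pattern every time the sequence is used."""
    rng = random.Random(seed)
    return [choose_mask_positions(n_tokens, rng) for _ in range(n_epochs)]

static = static_patterns(n_tokens=100)
dynamic = dynamic_patterns(n_tokens=100)
print("distinct static patterns:", len(set(static)))
print("distinct dynamic patterns:", len(set(dynamic)))
print("times each static pattern is seen:", sorted(Counter(static).values()))

# Storage cost of precomputing the static copies for a 160 GB corpus.
corpus_gb, n_copies = 160, 10
print(corpus_gb * n_copies / 1000, "TB")
```

Static masking needs every masked copy stored ahead of time, which is where the 160 GB x 10 figure comes from; dynamic masking stores the corpus once and pays a small cost at data-loading time.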
Here a "sentence" is not a linguistic sentence (with NSP)
• SENTENCE-PAIR+NSP (with NSP): similar to SEGMENT-PAIR+NSP, but each sentence is a linguistic sentence; the batch size is increased
• FULL-SENTENCES (without NSP): packed with full sentences sampled contiguously, continuing from one document into the next
• DOC-SENTENCES (without NSP): similar to FULL-SENTENCES, but an input may not contain multiple documents; the batch size is increased
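The two formats without NSP differ only in what happens at a document boundary, which the following sketch tries to make concrete. The function `pack_sentences`, the separator string `</s>`, and the toy documents are assumptions for illustration; sentences are given as lists of tokens.

```python
def pack_sentences(documents, max_len, cross_documents, sep="</s>"):
    """Pack whole sentences into inputs of at most max_len tokens.

    cross_documents=True  sketches FULL-SENTENCES (inputs may span documents).
    cross_documents=False sketches DOC-SENTENCES (inputs stop at document ends).
    """
    inputs, current = [], []
    for doc in documents:
        if current:
            if cross_documents:
                if len(current) < max_len:
                    current.append(sep)      # mark the boundary, keep filling
            else:
                inputs.append(current)       # close the input at the boundary
                current = []
        for sentence in doc:
            sentence = sentence[:max_len]    # truncate an overlong sentence
            if current and len(current) + len(sentence) > max_len:
                inputs.append(current)
                current = []
            current = current + sentence
    if current:
        inputs.append(current)
    return inputs

# Hypothetical toy documents: two documents, three sentences in total.
docs = [[["a", "b"], ["c"]], [["d", "e"]]]
full = pack_sentences(docs, max_len=8, cross_documents=True)
doc_only = pack_sentences(docs, max_len=8, cross_documents=False)
print(full)
print(doc_only)
```

With DOC-SENTENCES the inputs near the end of a document come out shorter, which is why that format is paired with a larger batch size.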
language representation model called BERT, which … Transformers. Unlike recent language representation models, BERT is … text in all layers. As a result, the pre-trained BERT … specific architecture modifications.
Language model pretraining … is challenging. Training is computationally expensive, … final results. We present a replication study of BERT … hyperparameters and training data size. We find that BERT was … every model published after it.
☝ Large mini batches
Unicode character, 30K vocabulary
GPT-2 [Radford+, 2019]: byte, 50K vocabulary (the authors' choice)
• Chosen in order to use a universal encoding scheme
• Slightly worse end-task performance compared with the original (character-level) BPE
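The reason a byte-level scheme is "universal" is that any string reduces to UTF-8 bytes, so the base alphabet has only 256 symbols and no input is out of vocabulary. The sketch below contrasts the two base units and performs a single BPE merge on bytes; the helper names are hypothetical and this is not the GPT-2 tokenizer.

```python
from collections import Counter

def char_symbols(text):
    """Character-level base units: one symbol per Unicode character."""
    return list(text)

def byte_symbols(text):
    """Byte-level base units: one symbol per UTF-8 byte (values 0..255)."""
    return list(text.encode("utf-8"))

def most_frequent_pair(symbols):
    """Return the most frequent adjacent pair of symbols."""
    return Counter(zip(symbols, symbols[1:])).most_common(1)[0][0]

def merge_pair(symbols, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

print(char_symbols("日本"), byte_symbols("日本"))
symbols = byte_symbols("abab")
pair = most_frequent_pair(symbols)
print(pair, merge_pair(symbols, pair, new_id=256))  # first learned unit gets id 256
```

Repeating the merge step until the vocabulary reaches the target size (50K on the slide) yields the subword units; the price is that non-ASCII characters start out as several byte symbols each.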