
Why you should (not) train your own BERT model for different languages or domains


Large pre-trained language models play an important role in recent NLP research. They learn general language features from unlabeled data and can then be easily fine-tuned for any supervised NLP task, e.g. classification or named entity recognition. BERT (Bidirectional Encoder Representations from Transformers) is one such language model, and pre-trained weights are available for everyone to use.

BERT has been shown to successfully transfer knowledge from pre-training to many different NLP benchmark tasks. How well does this transfer work for non-English or domain-specific texts, though?

We will take a look at how exactly input text is fed to and processed by BERT, so prior knowledge of the BERT architecture is not necessary; some experience with deep learning and NLP is assumed, though. This introduction provides the background for discussing the limitations you will face when fine-tuning a pre-trained BERT on your data, e.g. regarding vocabulary and input text length.

An alternative to simple fine-tuning is to train your own version of BERT from scratch, thereby overcoming some of the discussed limitations. The hope of improved performance is countered by the time and hardware requirements that training from scratch brings with it. I was curious enough to give it a try and want to share the findings from this experiment. We will go over the achieved improvements and the actual effort required, so you can decide whether you should train your own BERT.

Marianne Stecklina

October 09, 2019



Transcript

  1. Why you should (not) train your own BERT model for different languages or domains (Marianne Stecklina)
  2. What is BERT? A language model, based on the Transformer architecture. Pretraining (unsupervised): [diagram: BERT is trained on the unlabeled sentence "Berlin is the capital and largest city of Germany.", predicting the tokens "is", "largest", "of".]
  3. What is BERT? A language model, based on the Transformer architecture. Pretraining (unsupervised): [same diagram as before.] Fine-tuning (supervised): [diagram: BERT classifies the labeled example "Unbelievable, click here!" as spam.]
  4. What is BERT? A language model, based on the Transformer architecture. Pretraining (unsupervised), then fine-tuning (supervised): [diagram: BERT tags "Angela Merkel visits Washington." with the entity labels B-PER I-PER O B-LOC.]
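
The fine-tuning shown on slides 3 and 4 amounts to putting a small task-specific head on top of the pre-trained encoder and training it on labeled examples. The talk itself contains no code; the following is a minimal sketch using the Hugging Face transformers library (my tooling choice, not the speaker's), with the NER label set and example sentence taken from slide 4. Label alignment and the actual training loop are left out.

```python
# Minimal fine-tuning sketch (not from the talk): the pre-trained encoder
# gets a token-classification head and is trained on labeled examples.
# Library choice (Hugging Face transformers) and label ids are illustrative.
import torch
from transformers import BertTokenizer, BertForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC"]
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

# One toy example from the slide: "Angela Merkel visits Washington."
enc = tokenizer("Angela Merkel visits Washington.", return_tensors="pt")

# Dummy label ids, one per word piece; real code aligns word labels to pieces.
label_ids = torch.zeros_like(enc["input_ids"])

outputs = model(**enc, labels=label_ids)
outputs.loss.backward()  # from here on, fine-tuning is ordinary supervised training
```
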
  5. How BERT processes data: from tokens to embeddings. [diagram: the tokens of "Berlin is the capital and largest city of Germany." pass through a stack of identical self-attention + feed-forward layers.]
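
Slide 5 shows the token sequence flowing through a stack of identical "self-attention + feed-forward" blocks. As a rough structural analogue (not BERT's actual implementation, which has its own embedding, layer-norm and masking details), PyTorch's built-in Transformer encoder can be stacked the same way:

```python
# Rough structural analogue of slide 5 (not BERT's actual code):
# a stack of identical "self-attention + feed-forward" encoder layers.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)
encoder = nn.TransformerEncoder(layer, num_layers=12)  # BERT-base stacks 12 such layers

tokens = torch.randn(10, 1, 768)  # toy embeddings: (sequence length, batch, hidden size)
contextual = encoder(tokens)      # same shape, now context-dependent representations
print(contextual.shape)           # torch.Size([10, 1, 768])
```
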
  6. How BERT processes data: from tokens to embeddings. [diagram: each token of "Berlin is the capital and largest city of Germany." is represented by a word embedding (e.g. "city"), a position embedding (e.g. position 6), and a segment embedding (e.g. segment A).] Embeddings are learned end-to-end.
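
Slide 6 says every input position gets a word embedding, a position embedding, and a segment embedding, all learned end-to-end. In the original BERT implementation these three lookups are summed; here is a minimal sketch with illustrative sizes (BERT-base uses hidden size 768 and at most 512 positions):

```python
# Minimal sketch of BERT-style input embeddings: three learned lookups, summed.
# Sizes are illustrative; the summing detail is from the BERT paper, not the slide.
import torch
import torch.nn as nn

vocab_size, max_positions, num_segments, hidden = 30_000, 512, 2, 768
word_emb = nn.Embedding(vocab_size, hidden)         # one vector per vocabulary token ("city", ...)
position_emb = nn.Embedding(max_positions, hidden)  # one vector per position 0..511
segment_emb = nn.Embedding(num_segments, hidden)    # segment A / segment B

token_ids = torch.tensor([[101, 3000, 2003, 1996, 3007, 102]])  # toy ids, batch of 1
positions = torch.arange(token_ids.size(1)).unsqueeze(0)        # 0, 1, 2, ...
segments = torch.zeros_like(token_ids)                          # everything in segment A

inputs = word_emb(token_ids) + position_emb(positions) + segment_emb(segments)
print(inputs.shape)  # torch.Size([1, 6, 768]) -- this is what the encoder stack consumes
```
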
  7. How BERT processes data: from raw text to tokens.
     1) Word tokenization = split by whitespace and punctuation:
        He’s going home? → He ‘ s going home ?
  8. How BERT processes data: from raw text to tokens.
     1) Word tokenization = split by whitespace and punctuation:
        He’s going home? → He ‘ s going home ?
     2) Word-piece tokenization = split words into known tokens:
        He’s going home? → He ‘ s go ##ing home ?
        Sonnenlicht → Sonne ##nl ##icht
        empfohlen → em ##pf ##oh ##len
     Tokenization relies on a vocabulary list.
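
The word-piece splits on slide 8 can be reproduced with a pre-trained tokenizer, e.g. the multilingual BERT tokenizer from the Hugging Face transformers library (my tool choice; the talk does not prescribe one). The exact pieces depend on the model's vocabulary list:

```python
# Word-piece tokenization with a pre-trained vocabulary (Hugging Face transformers;
# tool choice is mine, and the exact pieces depend on the vocabulary list).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

for text in ["He's going home?", "Sonnenlicht", "empfohlen"]:
    print(text, "->", tokenizer.tokenize(text))
# Words missing from the vocabulary are split into known pieces,
# with continuation pieces marked by the "##" prefix (e.g. go, ##ing).
```
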
  9. Issues
     1) Fixed vocabulary
        ◦ impossible to add words during fine-tuning
        ◦ rare languages are underrepresented (multilingual model)
        ◦ out-of-vocabulary words increase sequence length
     2) Limited sequence length
        ◦ due to learned position embeddings
        ◦ restricts learning of long-term dependencies
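
Both issues can be quantified on your own corpus before deciding how to proceed: count how many word pieces each word expands into, and how many documents exceed BERT's 512-token limit. A hypothetical check along these lines (not code from the talk), with placeholder documents:

```python
# Hypothetical pre-flight check (not code from the talk): how much does
# word-piece tokenization inflate the token count, and how many documents
# exceed BERT's 512-token limit? Replace the placeholder documents with your corpus.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
documents = [
    "Ihre Versicherungsnummer lautet 123-456.",          # placeholder documents
    "Beitrag: 84,50 EUR monatlich, IBAN DE12 3456 7890",
]

words, pieces, too_long = 0, 0, 0
for doc in documents:
    tokens = tokenizer.tokenize(doc)
    words += len(doc.split())
    pieces += len(tokens)
    if len(tokens) > 512 - 2:  # reserve two positions for [CLS] and [SEP]
        too_long += 1

print(f"tokens per word: {pieces / words:.2f}")
print(f"documents over the 512-token limit: {too_long}")
```
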
  10. Experiments: a little bit of background. Entity recognition on insurance documents:
      ◦ German
      ◦ many domain-specific words
      ◦ OCR errors
      ◦ many numbers (prices, phone numbers, IBANs)
      ◦ no proper sentence structure, difficult to split into chunks
  11. Experiments
      1) Fine-tune multilingual BERT
      2) Pretrain and fine-tune German BERT
      3) Pretrain and fine-tune custom BERT
         ◦ use pretraining data from the domain
         ◦ use fastText instead of word-piece embeddings
         ◦ use a smaller version of BERT for faster training
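
For option 3, a "smaller version of BERT" is simply a different configuration of the same architecture. A hedged sketch with the Hugging Face transformers library and illustrative hyperparameters (not the configuration used in the talk; the fastText-instead-of-word-piece-embeddings change is not shown here):

```python
# Hedged sketch of a "smaller version of BERT" (illustrative hyperparameters,
# not the configuration used in the talk), ready for masked-LM pretraining
# on domain data with the Hugging Face transformers library.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30_000,            # vocabulary built from domain text
    hidden_size=256,              # BERT-base uses 768
    num_hidden_layers=6,          # BERT-base uses 12
    num_attention_heads=4,        # BERT-base uses 12
    intermediate_size=1024,       # BERT-base uses 3072
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)   # randomly initialized, to be pretrained from scratch
print(f"{model.num_parameters():,} parameters")  # far fewer than BERT-base's ~110M
```
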
  12. Experiments: results
                            1) Multilingual   2) German         3) Custom
      Performance (F1)      87.9%             86.7%             85.6%
      Runtime: pretraining  -                 9 days (TPU) *    1.5 days (GPU)
      Runtime: fine-tuning  12 min            12 min            60 min
      GPU memory            9 GB              8 GB              3 GB
      Tokens / word         2.8               2.74              1
      * https://deepset.ai/german-bert