
Why you should (not) train your own BERT model for different languages or domains


Large pre-trained language models play an important role in recent NLP research. They learn general language features from unlabeled data and can then be easily fine-tuned for any supervised NLP task, e.g. classification or named entity recognition. BERT (Bidirectional Encoder Representations from Transformers) is one such language model, and pre-trained weights are available for everyone to use.

BERT has been shown to successfully transfer knowledge from pre-training to many different NLP benchmark tasks. How well does this transfer work for non-English or domain-specific texts, though?

We will take a look at how exactly input text is fed to and processed by BERT, so prior knowledge of the BERT architecture is not necessary; some experience with deep learning and NLP is assumed, though. This introduction provides the background for discussing the limitations you will face when fine-tuning a pre-trained BERT on your data, e.g. regarding vocabulary and input text length.

An alternative to simple fine-tuning is to train your own version of BERT from scratch, thereby overcoming some of the discussed limitations. The hope of improved performance is countered by the time and hardware requirements that training from scratch brings with it. I was curious enough to give it a try and want to share the findings from this experiment. We will go over the achieved improvements and the actual effort required, so you can decide whether you should train your own BERT.

Marianne Stecklina

October 09, 2019



Transcript

  1. Why you should (not) train your own BERT model for different languages or domains (Marianne Stecklina)
  2. What is BERT? A language model, based on the Transformer architecture. Pretraining (unsupervised): [diagram: BERT is trained on the unlabeled sentence "Berlin is the capital and largest city of Germany.", predicting the tokens "is", "largest", "of".]
  3. What is BERT? A language model, based on the Transformer architecture. Pretraining (unsupervised): [same diagram as before.] Fine-tuning (supervised): [diagram: BERT classifies the labeled example "Unbelievable, click here!" as spam.]
  4. What is BERT? A language model, based on the Transformer architecture. Pretraining (unsupervised), then fine-tuning (supervised): [diagram: BERT tags "Angela Merkel visits Washington." with the entity labels B-PER I-PER O B-LOC.]
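
The fine-tuning shown on slides 3 and 4 amounts to putting a small task-specific head on top of the pre-trained encoder and training it on labeled examples. The talk itself contains no code; the following is a minimal sketch using the Hugging Face transformers library (my tooling choice, not the speaker's), with the NER label set and example sentence taken from slide 4. Label alignment and the actual training loop are left out.

```python
# Minimal fine-tuning sketch (not from the talk): the pre-trained encoder
# gets a token-classification head and is trained on labeled examples.
# Library choice (Hugging Face transformers) and label ids are illustrative.
import torch
from transformers import BertTokenizer, BertForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC"]
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

# One toy example from the slide: "Angela Merkel visits Washington."
enc = tokenizer("Angela Merkel visits Washington.", return_tensors="pt")

# Dummy label ids, one per word piece; real code aligns word labels to pieces.
label_ids = torch.zeros_like(enc["input_ids"])

outputs = model(**enc, labels=label_ids)
outputs.loss.backward()  # from here on, fine-tuning is ordinary supervised training
```
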
  5. How BERT processes data: from tokens to embeddings. [diagram: the tokens of "Berlin is the capital and largest city of Germany." pass through a stack of identical self-attention + feed-forward layers.]
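
Slide 5 shows the token sequence flowing through a stack of identical "self-attention + feed-forward" blocks. As a rough structural analogue (not BERT's actual implementation, which has its own embedding, layer-norm and masking details), PyTorch's built-in Transformer encoder can be stacked the same way:

```python
# Rough structural analogue of slide 5 (not BERT's actual code):
# a stack of identical "self-attention + feed-forward" encoder layers.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)
encoder = nn.TransformerEncoder(layer, num_layers=12)  # BERT-base stacks 12 such layers

tokens = torch.randn(10, 1, 768)  # toy embeddings: (sequence length, batch, hidden size)
contextual = encoder(tokens)      # same shape, now context-dependent representations
print(contextual.shape)           # torch.Size([10, 1, 768])
```
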
  6. How BERT processes data: from tokens to embeddings. [diagram: each token of "Berlin is the capital and largest city of Germany." is represented by a word embedding (e.g. "city"), a position embedding (e.g. position 6), and a segment embedding (e.g. segment A).] Embeddings are learned end-to-end.
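
Slide 6 says every input position gets a word embedding, a position embedding, and a segment embedding, all learned end-to-end. In the original BERT implementation these three lookups are summed; here is a minimal sketch with illustrative sizes (BERT-base uses hidden size 768 and at most 512 positions):

```python
# Minimal sketch of BERT-style input embeddings: three learned lookups, summed.
# Sizes are illustrative; the summing detail is from the BERT paper, not the slide.
import torch
import torch.nn as nn

vocab_size, max_positions, num_segments, hidden = 30_000, 512, 2, 768
word_emb = nn.Embedding(vocab_size, hidden)         # one vector per vocabulary token ("city", ...)
position_emb = nn.Embedding(max_positions, hidden)  # one vector per position 0..511
segment_emb = nn.Embedding(num_segments, hidden)    # segment A / segment B

token_ids = torch.tensor([[101, 3000, 2003, 1996, 3007, 102]])  # toy ids, batch of 1
positions = torch.arange(token_ids.size(1)).unsqueeze(0)        # 0, 1, 2, ...
segments = torch.zeros_like(token_ids)                          # everything in segment A

inputs = word_emb(token_ids) + position_emb(positions) + segment_emb(segments)
print(inputs.shape)  # torch.Size([1, 6, 768]) -- this is what the encoder stack consumes
```
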
  7. How BERT processes data: from raw text to tokens.
     1) Word tokenization = split by whitespace and punctuation:
        He’s going home? → He ‘ s going home ?
  8. How BERT processes data: from raw text to tokens.
     1) Word tokenization = split by whitespace and punctuation:
        He’s going home? → He ‘ s going home ?
     2) Word-piece tokenization = split words into known tokens:
        He’s going home? → He ‘ s go ##ing home ?
        Sonnenlicht → Sonne ##nl ##icht
        empfohlen → em ##pf ##oh ##len
     Tokenization relies on a vocabulary list.
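
The word-piece splits on slide 8 can be reproduced with a pre-trained tokenizer, e.g. the multilingual BERT tokenizer from the Hugging Face transformers library (my tool choice; the talk does not prescribe one). The exact pieces depend on the model's vocabulary list:

```python
# Word-piece tokenization with a pre-trained vocabulary (Hugging Face transformers;
# tool choice is mine, and the exact pieces depend on the vocabulary list).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

for text in ["He's going home?", "Sonnenlicht", "empfohlen"]:
    print(text, "->", tokenizer.tokenize(text))
# Words missing from the vocabulary are split into known pieces,
# with continuation pieces marked by the "##" prefix (e.g. go, ##ing).
```
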
  9. Issues
     1) Fixed vocabulary
        ◦ impossible to add words during fine-tuning
        ◦ rare languages are underrepresented (multilingual model)
        ◦ out-of-vocabulary words increase sequence length
     2) Limited sequence length
        ◦ due to learned position embeddings
        ◦ restricts learning of long-term dependencies
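
Both issues can be quantified on your own corpus before deciding how to proceed: count how many word pieces each word expands into, and how many documents exceed BERT's 512-token limit. A hypothetical check along these lines (not code from the talk), with placeholder documents:

```python
# Hypothetical pre-flight check (not code from the talk): how much does
# word-piece tokenization inflate the token count, and how many documents
# exceed BERT's 512-token limit? Replace the placeholder documents with your corpus.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
documents = [
    "Ihre Versicherungsnummer lautet 123-456.",          # placeholder documents
    "Beitrag: 84,50 EUR monatlich, IBAN DE12 3456 7890",
]

words, pieces, too_long = 0, 0, 0
for doc in documents:
    tokens = tokenizer.tokenize(doc)
    words += len(doc.split())
    pieces += len(tokens)
    if len(tokens) > 512 - 2:  # reserve two positions for [CLS] and [SEP]
        too_long += 1

print(f"tokens per word: {pieces / words:.2f}")
print(f"documents over the 512-token limit: {too_long}")
```
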
  10. Experiments: a little bit of background. Entity recognition on insurance documents:
      ◦ German
      ◦ many domain-specific words
      ◦ OCR errors
      ◦ many numbers (prices, phone numbers, IBANs)
      ◦ no proper sentence structure, difficult to split into chunks
  11. Experiments
      1) Fine-tune multilingual BERT
      2) Pretrain and fine-tune German BERT
      3) Pretrain and fine-tune custom BERT
         ◦ use pretraining data from the domain
         ◦ use fastText instead of word-piece embeddings
         ◦ use a smaller version of BERT for faster training
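
For option 3, a "smaller version of BERT" is simply a different configuration of the same architecture. A hedged sketch with the Hugging Face transformers library and illustrative hyperparameters (not the configuration used in the talk; the fastText-instead-of-word-piece-embeddings change is not shown here):

```python
# Hedged sketch of a "smaller version of BERT" (illustrative hyperparameters,
# not the configuration used in the talk), ready for masked-LM pretraining
# on domain data with the Hugging Face transformers library.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30_000,            # vocabulary built from domain text
    hidden_size=256,              # BERT-base uses 768
    num_hidden_layers=6,          # BERT-base uses 12
    num_attention_heads=4,        # BERT-base uses 12
    intermediate_size=1024,       # BERT-base uses 3072
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)   # randomly initialized, to be pretrained from scratch
print(f"{model.num_parameters():,} parameters")  # far fewer than BERT-base's ~110M
```
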
  12. Experiments: results
                            1) Multilingual   2) German         3) Custom
      Performance (F1)      87.9%             86.7%             85.6%
      Runtime: pretraining  -                 9 days (TPU) *    1.5 days (GPU)
      Runtime: fine-tuning  12 min            12 min            60 min
      GPU memory            9 GB              8 GB              3 GB
      Tokens / word         2.8               2.74              1
      * https://deepset.ai/german-bert