Using transformers - a drama in 512 tokens

Marianne Stecklina
Deep Learning Engineer at omni:us, @MStecklina

“Got an NLP problem nowadays? Use transformers! Just download a pretrained model from the hub!”
- every blog article ever

1st act: How much text actually fits into 512 tokens?

What is a token?
• something between a full word and a single character
• an entry in the transformer’s vocabulary
[Figure: vocabulary entries such as "b", "##@gmail.com" and "invoice", each mapped to one row of the token embedding matrix]

Vocabulary and token embeddings
text: invoice 417
tokens: invoice 4 ##1 ##7
[Figure: each token is mapped to its token embedding, and the embeddings are fed into the transformer]
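To make the split of "invoice 417" into "invoice 4 ##1 ##7" concrete, here is a minimal sketch of greedy longest-match-first subword splitting in the style of WordPiece. The vocabulary `VOCAB` is a hypothetical toy example, not the vocabulary of any real model, and the helper names are made up for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest matching vocabulary entries."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary (assumption for this example only)
VOCAB = {"invoice", "4", "##1", "##4", "##7"}


def tokenize(text, vocab=VOCAB):
    """Split on whitespace, then split every word into subword tokens."""
    return [tok for word in text.lower().split() for tok in wordpiece(word, vocab)]


tokens = tokenize("invoice 417")
print(tokens)  # ['invoice', '4', '##1', '##7']
```

A frequent word like "invoice" stays one token, while the rare number "417" costs three tokens.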

Token-to-word ratio
• vocabulary depends on the pretraining corpus
• factors influencing the token-to-word ratio:
  ◦ similarity to the pretraining corpus (language, domain)
  ◦ frequency of unique words (e.g. invoice numbers)
  ◦ language (see Exploring BERT's vocabulary)
• if you are lucky: 1.2 tokens per word
• if you are me: 2-3 tokens per word
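The ratio is easy to measure on your own data. The sketch below assumes a tokenizer given as a plain function from text to a list of tokens; `toy_tokenize` is a hypothetical stand-in that keeps a few known words whole and splits everything else into characters, so the numbers only illustrate the effect of unique words such as invoice numbers.

```python
def token_to_word_ratio(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    n_words = sum(len(text.split()) for text in texts)
    n_tokens = sum(len(tokenize(text)) for text in texts)
    return n_tokens / n_words


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): known words stay whole, the rest splits into characters."""
    known = {"invoice", "total", "amount", "number"}
    tokens = []
    for word in text.lower().split():
        tokens.extend([word] if word in known else list(word))
    return tokens


prose = ["total amount", "invoice number"]
with_ids = ["invoice 417", "invoice 90210"]

ratio_prose = token_to_word_ratio(prose, toy_tokenize)
ratio_ids = token_to_word_ratio(with_ids, toy_tokenize)
print(ratio_prose, ratio_ids)  # 1.0 2.5
```

With a Hugging Face tokenizer, its `tokenize` method could be passed in place of `toy_tokenize`.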

• it’s not possible to fit a 1-page document into 512 tokens
[Figure: a document page with 512 tokens highlighted]

2nd act: Why do transformers even have a limit?

Self-attention has no understanding of order
tokens: invoice 4 ##1 ##7
shuffled tokens: ##1 4 invoice ##7
[Figure: both sequences are mapped to token embeddings and fed into the transformer; without position information it cannot tell the two orders apart]
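The order-blindness can be checked numerically. This is a simplified sketch, assuming a single attention head with identity projections (query = key = value = input) and hypothetical 2-dimensional embeddings. Shuffling the input tokens only shuffles the output rows in the same way, so each token gets the same result regardless of where it stands.

```python
import math


def self_attention(X):
    """Single-head self-attention with identity projections (a simplification)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[i] for w, v in zip(weights, X)) for i in range(d)])
    return out


# Hypothetical embeddings for the tokens: invoice, 4, ##1, ##7
X = [[0.2, 0.1], [-0.3, -0.7], [0.2, -0.3], [0.4, 0.6]]
perm = [2, 1, 0, 3]  # shuffled order: ##1, 4, invoice, ##7

out = self_attention(X)
out_shuffled = self_attention([X[i] for i in perm])

# Row i of the shuffled output equals the original output of token perm[i]
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)  # True
```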

Absolute position embeddings
• add position information as part of the input
• position embedding matrix as part of the model
[Figure: position embedding matrix with one row per position 1, 2, …, 512]
[Figure: for each of the tokens invoice, 4, ##1, ##7, the token embedding and the position embedding are added and fed into the transformer]
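A minimal sketch of why a position embedding matrix implies a hard limit: the model stores one row per position, so there is simply no row to look up for position 513. The sizes and the random values are hypothetical; in a real model the rows are learned during pretraining.

```python
import random

random.seed(0)
HIDDEN, MAX_POSITIONS = 4, 512  # hypothetical sizes; real hidden sizes are much larger

# One row per position, stored as part of the model (random here, learned in practice)
position_embeddings = [
    [random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(MAX_POSITIONS)
]


def embed(token_embeddings):
    """Add the embedding of position i to the i-th token embedding."""
    return [
        [t + p for t, p in zip(token, position_embeddings[i])]
        for i, token in enumerate(token_embeddings)
    ]


ok = embed([[0.0] * HIDDEN] * 512)  # exactly at the limit: works

try:
    embed([[0.0] * HIDDEN] * 513)  # one token too many: no row for index 512
    too_long_fails = False
except IndexError:
    too_long_fails = True

print(len(ok), too_long_fails)  # 512 True
```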

Relative position embeddings
• add position information as part of the self-attention
• different types (see Train Short, Test Long):
  ◦ T5 Bias (e.g. T5)
  ◦ Rotary (e.g. GPT-J)
  ◦ ALiBi (e.g. BLOOM)
• advantage: no token limit anymore
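As one example of the relative approach, here is a rough sketch of the ALiBi bias: a penalty proportional to the distance between query and key is added to the attention scores, with a different slope per head. The slope formula below assumes the number of heads is a power of two, and the `-inf` entries are the usual causal mask rather than part of ALiBi itself. Because the bias is computed from distances, it can be built for any sequence length without a lookup table.

```python
def alibi_slopes(n_heads):
    """Per-head slopes as a geometric sequence (assumes n_heads is a power of two)."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to the attention scores of one head: -slope * distance to the key."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)         # 1/2, 1/4, ..., 1/256
bias = alibi_bias(4, slopes[0])  # 4 tokens, first head

print(bias[3])  # [-1.5, -1.0, -0.5, -0.0]
```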

GPU memory is also a limit
• memory required for self-attention is quadratic in the input sequence length
• a lot of research on efficient attention mechanisms, see Efficient Transformers: A Survey
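A back-of-the-envelope calculation shows what quadratic means in practice. The sketch counts only the attention score matrix of a single layer (one value per pair of tokens and per head), assuming 12 heads as in BERT-base and 4 bytes per value; activations, gradients and weights come on top.

```python
def attention_scores_bytes(seq_len, n_heads=12, batch_size=1, bytes_per_value=4):
    """Size of the seq_len x seq_len attention score matrices of one layer."""
    return batch_size * n_heads * seq_len * seq_len * bytes_per_value


for n in (512, 1024, 4096):
    mib = attention_scores_bytes(n) / 2**20
    print(f"{n:>5} tokens: {mib:4.0f} MiB per layer")
# 512 tokens: 12 MiB, 1024 tokens: 48 MiB, 4096 tokens: 768 MiB
```

An 8 times longer sequence needs 64 times the memory for this part of the model.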

A look at the most downloaded models
• most models have a limit of 512 tokens - they can’t predict on longer sequences

Model        Position embeddings type    Efficient attention
BERT         absolute (512 tokens)       no
DistilBERT   absolute (512 tokens)       no
GPT2         absolute (1024 tokens)      no
RoBERTa      absolute (512 tokens)       no
LayoutLM     absolute (512 tokens)       no
XLM          absolute (512 tokens)       no
T5           relative                    no
BART         absolute (1024 tokens)      no
ALBERT       absolute (512 tokens)       no
ELECTRA      absolute (512 tokens)       no
Longformer   absolute (4096 tokens)      yes

3rd act: Somebody smart must have figured out a solution before, or not?

New models get evaluated on so many benchmarks

“It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5%, MultiNLI accuracy to 86.7%, SQuAD v1.1 question answering Test F1 to 93.2% and SQuAD v2.0 Test F1 to 83.1%.”
- BERT

“Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD.”
- RoBERTa

“Experiment results show that LayoutLMv2 achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD, CORD, SROIE, Kleister-NDA, RVL-CDIP, and DocVQA.”
- LayoutLMv2

What do the authors tell us about prediction on long sequences?

“To induce summarization behavior we add the text TL;DR: after the article and generate 100 tokens with Top-k random sampling with k = 2. We use the first 3 generated sentences in these 100 tokens as the summary.”
- GPT-2

“For the extractive question answering task DocVQA and the other four entity extraction tasks, we follow common practice like (Devlin et al., 2019) and build task specified head layers over the text part of LayoutLMv2 outputs.”
- LayoutLMv2

↪ Nothing.

Do the benchmarks even have long sequences?

Dataset    Type                          Examples > 512 tokens   Max tokens
GLUE       sentences or sentence pairs   0.0 %                   319
MultiNLI   sentence pairs                0.0 %                   237
SQuAD      Wikipedia articles            0.5 %                   819
CORD       receipts                      0.0 %                   157
SROIE      receipts                      0.2 %                   525
FUNSD      forms                         4.0 %                   661

↪ Many don’t.

Do the benchmarks even have long sequences?
↪ But some do.
So… people just secretly truncate on these datasets?

Here we go
“We truncate question-answer pairs that are longer than 128 tokens and, if needed, the passage so that the total length is at most 512 tokens.”
- RoBERTa on RACE
↪ up to 50% of the passage gets discarded
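The quoted truncation rule can be sketched in a few lines. This is an illustration under assumptions, not the RoBERTa authors' code: it cuts from the end of each sequence and ignores special tokens, neither of which the quote specifies, and the function name is made up.

```python
def truncate_race_example(qa_tokens, passage_tokens, max_qa=128, max_total=512):
    """Cap the question-answer pair at max_qa tokens, then fit the passage into the rest."""
    qa = qa_tokens[:max_qa]
    passage = passage_tokens[: max_total - len(qa)]
    return qa, passage


# Hypothetical example: 40 question-answer tokens and a 900-token passage
qa, passage = truncate_race_example(["q"] * 40, ["p"] * 900)
discarded = 1 - len(passage) / 900

print(len(qa), len(passage), f"{discarded:.0%}")  # 40 472 48%
```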

Progress on the benchmark front
• Long-Range Arena: a benchmark specifically for long sequences + public leaderboard

4th act: So what can we do to handle longer input sequences?

Choice of model

Avoid reaching the token limit
• use a model with relative position embeddings (e.g. T5, Transformer-XL, XLNet, BLOOM)
• if absolute position embeddings, choose a model with high max_position_embeddings (e.g. BigBird with 4096 tokens)

Avoid reaching the GPU limit
• use a model with efficient attention (e.g. LongT5, BigBird, Longformer, Reformer)
• if fine-tuning, learn about techniques to reduce the memory consumption, see 🤗 Efficient Training on a Single GPU


Increase token limit for absolute position embeddings
• resize the position embedding matrix
• fine-tune on long sequences so that the newly added position embeddings also get trained
[Figure: position embedding matrix with rows 1, 2, …, 512 extended by new rows 513, 514, …]
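Resizing boils down to keeping the pretrained rows and appending new ones. The sketch below uses plain lists instead of a framework tensor; the random initialization with standard deviation 0.02 is an assumption, and other schemes (such as copying pretrained rows) are possible. Either way the new rows carry no useful position information until they are fine-tuned on long sequences.

```python
import random


def resize_position_embeddings(matrix, new_max_positions, seed=0):
    """Keep the pretrained rows and append freshly initialized rows for new positions."""
    rng = random.Random(seed)
    hidden = len(matrix[0])
    resized = [row[:] for row in matrix]  # rows 1..512 stay as pretrained
    while len(resized) < new_max_positions:
        # Assumed initialization: small random values that still need to be trained
        resized.append([rng.gauss(0.0, 0.02) for _ in range(hidden)])
    return resized


# Hypothetical pretrained matrix: 512 positions, hidden size 4
pretrained = [[0.1 * i] * 4 for i in range(512)]
extended = resize_position_embeddings(pretrained, 1024)

print(len(pretrained), len(extended))  # 512 1024
```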

Prediction Strategy
• truncate to max_position_embeddings
  ◦ simple and fast
  ◦ often still competitive, see Efficient Classification of Long Documents Using Transformers
  ◦ check your sequence lengths
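Checking sequence lengths before deciding to truncate takes only a few lines. The sketch assumes the dataset is already tokenized into lists of tokens; the example lengths are made up, and both helper names are hypothetical.

```python
def length_report(token_lists, limit=512):
    """Share of examples above the token limit and the longest example."""
    lengths = [len(tokens) for tokens in token_lists]
    too_long = sum(n > limit for n in lengths)
    return {"share_too_long": too_long / len(lengths), "max_tokens": max(lengths)}


def truncate(tokens, limit=512):
    """Keep only the first `limit` tokens."""
    return tokens[:limit]


# Made-up dataset with four documents of different lengths
dataset = [["tok"] * n for n in (120, 300, 525, 661)]
report = length_report(dataset)

print(report)  # {'share_too_long': 0.5, 'max_tokens': 661}
```

If the share is close to zero, truncation costs almost nothing; if it is high, chunking or a long-sequence model is probably worth the effort.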

Prediction Strategy
• predict on shorter chunks, combine the results
  ◦ NER / translation / summarization: just concatenate
  ◦ classification:
    ▪ average logits
    ▪ train another layer on top
  ◦ extractive question answering:
    ▪ use overlapping chunks to make sure the answer is fully contained in one, and choose the prediction with the highest confidence
    ▪ 2-stage procedure: select a relevant short paragraph first, then extract the answer
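For classification, chunking with overlap and averaging the logits could look like the sketch below. `dummy_logits` is a hypothetical stand-in for a real model's forward pass, and the chunk size and overlap are example values; the input is assumed to be non-empty.

```python
def chunk(tokens, size=512, overlap=128):
    """Split tokens into chunks of at most `size` tokens that overlap by `overlap` tokens."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the sequence
    return chunks


def classify_long(tokens, predict_logits, size=512, overlap=128):
    """Predict on every chunk and average the logits per class."""
    all_logits = [predict_logits(c) for c in chunk(tokens, size, overlap)]
    return [sum(per_class) / len(all_logits) for per_class in zip(*all_logits)]


def dummy_logits(chunk_tokens):
    """Stand-in model (assumption): scores a chunk by its share of 'invoice' tokens."""
    share = chunk_tokens.count("invoice") / len(chunk_tokens)
    return [share, 1 - share]


tokens = ["invoice"] * 600 + ["other"] * 400
chunks = chunk(tokens)
logits = classify_long(tokens, dummy_logits)

print([len(c) for c in chunks])  # [512, 512, 232]
```

Hugging Face tokenizers can produce such overlapping chunks directly via the `stride` and `return_overflowing_tokens` arguments.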

Summary
• 512 tokens is too little for some use cases
• the token limit is caused by absolute position embeddings
• choose your pretrained transformer wisely (recommendation: LongT5 or BigBird)
• during prediction, truncate or predict on chunks

Slides will be available at: https://speakerdeck.com/stecklin

Have lunch with me (or connect later: @MStecklina)

