
Using transformers - a drama in 512 tokens

“Got an NLP problem nowadays? Use transformers! Just download a pretrained model from the hub!” - every blog article ever

As if it’s that easy! Nearly all pretrained models have a very annoying limitation: they can only process short input sequences. Not every NLP practitioner happens to work on tweets; many of us have to deal with longer input sequences. What started as a minor design choice for BERT got cemented by the research community over the years and now turns out to be my biggest headache: the 512-token limit.

In this talk, we’ll ask a lot of dumb questions and get an equal number of unsatisfying answers:

1. How much text actually fits into 512 tokens? Spoiler: not enough to solve my use case, and I bet a lot of your use cases, too.

2. I can feed a sequence of any length into an RNN, so why do transformers even have a limit? We’ll look at the architecture in more detail to understand that.

3. Somebody smart must have thought about this sequence-length issue before, right? Prepare yourself for a rant about benchmarks in NLP research.

4. So what can we do to handle longer input sequences? Enjoy my collection of mediocre workarounds.

Marianne Stecklina

April 20, 2023

Transcript

  1. Using transformers - a drama in 512 tokens
     Marianne Stecklina, Deep Learning Engineer at omni:us (@MStecklina)
  2. “Got an NLP problem nowadays? Use transformers! Just download a pretrained model from the hub!”
     — every blog article ever
  3. What is a token?
     • something between a full word and a single character
     • entry in the transformer’s vocabulary
     [Figure: example tokens (“b”, “##@gmail.com”, “invoice”, …) and their embedding vectors]
  4. Token-to-word ratio
     • vocabulary depends on the pretraining corpus
     • factors influencing the token-to-word ratio:
       ◦ similarity to the pretraining corpus (language, domain)
       ◦ frequency of unique words (e.g. invoice numbers)
       ◦ language (see Exploring BERT's vocabulary)
     • if you are lucky: 1.2 tokens per word
     • if you are me: 2-3 tokens per word
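To get a feel for this on your own data, here is a minimal sketch (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the example text is made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# made-up example: domain-specific strings tend to fall apart into many sub-word pieces
text = "Invoice INV-2023-04711, payable to b@gmail.com"
tokens = tokenizer.tokenize(text)

print(tokens)
print(f"{len(tokens) / len(text.split()):.1f} tokens per word")
```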
  5. Absolute position embeddings
     • add position information as part of the input
     • position embedding matrix as part of the model
     [Figure: position embedding matrix with one learned vector per position 1, 2, …, 512]
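The matrix on the slide is an ordinary weight of the model, which is easy to verify (a sketch, assuming Hugging Face transformers and bert-base-uncased):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

print(model.config.max_position_embeddings)               # 512
print(model.embeddings.position_embeddings.weight.shape)  # torch.Size([512, 768])
# positions beyond 511 simply have no row to look up, hence the hard limit
```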
  6. Relative position embeddings
     • add position information as part of the self-attention
     • different types (see Train Short, Test Long):
       ◦ T5 Bias (e.g. T5)
       ◦ Rotary (e.g. GPT-J)
       ◦ ALiBi (e.g. BLOOM)
     • advantage: no token limit anymore
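To make the ALiBi flavour concrete, here is a rough sketch of the idea (not BLOOM's actual implementation): each head adds a linearly growing penalty to the attention scores based on token distance, so no learned position table is needed and no fixed maximum length exists.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Per-head linear distance penalty, added to the attention scores before
    the softmax. Sketch of the ALiBi idea, not BLOOM's exact code."""
    # head-specific slopes: a geometric sequence (assumes num_heads is a power of 2)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()  # (seq_len, seq_len)
    return -slopes[:, None, None] * distance                    # (num_heads, seq_len, seq_len)

# schematic usage inside self-attention:
#   scores = q @ k.transpose(-2, -1) / head_dim ** 0.5 + alibi_bias(seq_len, num_heads)
```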
  7. GPU memory is also a limit
     • memory required for self-attention is quadratic in the input sequence length
     • a lot of research on efficient attention mechanisms, see Efficient Transformers: A Survey
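A back-of-the-envelope calculation shows why: the raw attention score matrix alone grows with the square of the sequence length (the numbers below assume batch size 8, 12 heads and fp32, and ignore all other activations and weights).

```python
def attention_matrix_gib(seq_len, batch_size=8, num_heads=12, bytes_per_value=4):
    """Memory for the (seq_len x seq_len) attention scores alone, in GiB."""
    return batch_size * num_heads * seq_len ** 2 * bytes_per_value / 1024 ** 3

for n in (512, 2048, 8192):
    print(f"{n:>5} tokens -> {attention_matrix_gib(n):6.2f} GiB")
# 512 -> 0.09 GiB, 2048 -> 1.50 GiB, 8192 -> 24.00 GiB: 4x the length, 16x the memory
```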
  8. A look at the most downloaded models
     • most models have a limit of 512 tokens - they can’t predict on longer sequences

     Model       Position embeddings type   Efficient attention
     BERT        absolute (512 tokens)      no
     DistilBERT  absolute (512 tokens)      no
     GPT2        absolute (1024 tokens)     no
     RoBERTa     absolute (512 tokens)      no
     LayoutLM    absolute (512 tokens)      no
     XLM         absolute (512 tokens)      no
     T5          relative                   no
     BART        absolute (1024 tokens)     no
     ALBERT      absolute (512 tokens)      no
     ELECTRA     absolute (512 tokens)      no
     Longformer  absolute (512 tokens)      yes
  9. New models get evaluated on so many benchmarks
     “It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5%, MultiNLI accuracy to 86.7%, SQuAD v1.1 question answering Test F1 to 93.2% and SQuAD v2.0 Test F1 to 83.1%.” — BERT
     “Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD.” — RoBERTa
     “Experiment results show that LayoutLMv2 achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD, CORD, SROIE, Kleister-NDA, RVL-CDIP, and DocVQA.” — LayoutLMv2
  10. What do the authors tell us about prediction on long sequences?
      “To induce summarization behavior we add the text TL;DR: after the article and generate 100 tokens with Top-k random sampling with k = 2. We use the first 3 generated sentences in these 100 tokens as the summary.” — GPT-2
      “For the extractive question answering task DocVQA and the other four entity extraction tasks, we follow common practice like (Devlin et al., 2019) and build task specified head layers over the text part of LayoutLMv2 outputs.” — LayoutLMv3
      ↪ Nothing.
  11. Do the benchmarks even have long sequences?

      Dataset    Type                          Examples > 512 tokens   Max tokens
      GLUE       sentences or sentence pairs   0.0 %                   319
      MultiNLI   sentence pairs                0.0 %                   237
      SQuAD      Wikipedia articles            0.5 %                   819
      CORD       receipts                      0.0 %                   157
      SROIE      receipts                      0.2 %                   525
      FUNSD      forms                         4.0 %                   661

      ↪ Many don’t.
  12. Do the benchmarks even have long sequences?
      So… people just secretly truncate on these datasets?
      ↪ But some do.
  13. Here we go
      “We truncate question-answer pairs that are longer than 128 tokens and, if needed, the passage so that the total length is at most 512 tokens.” — RoBERTa on RACE
      ↪ up to 50% of the passage gets discarded
  14. Progress on the benchmark front
      • Long-Range Arena: a benchmark specifically for long sequences + public leaderboard
  15. Choice of model
      Avoid reaching the token limit:
      • use a model with relative position embeddings (e.g. T5, Transformer-XL, XLNet, BLOOM)
      • if absolute position embeddings, choose a model with a high max_position_embeddings (e.g. BigBird with 4096 tokens)
      Avoid reaching the GPU limit:
      • use a model with efficient attention (e.g. LongT5, BigBird, Longformer, Reformer)
      • if fine-tuning, learn about techniques to reduce the memory consumption, see 🤗 Efficient Training on a Single GPU
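Whatever candidate you pick, the token limit is visible in its configuration before downloading any weights (sketch, assuming Hugging Face transformers; model ids as on the hub):

```python
from transformers import AutoConfig

for name in ("bert-base-uncased", "google/bigbird-roberta-base"):
    cfg = AutoConfig.from_pretrained(name)   # fetches only the config, not the weights
    print(name, cfg.max_position_embeddings)
# bert-base-uncased: 512, google/bigbird-roberta-base: 4096
```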
  16. Increase token limit for absolute position embeddings
      • resize the position embedding matrix (sketched below)
      • fine-tune on long sequences so that the newly added position embeddings get trained as well
      [Figure: position embedding matrix extended beyond position 512 with new rows 513, 514, …]
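For a BERT-style model this could look roughly like the sketch below (assuming Hugging Face transformers and PyTorch; details differ per architecture, so treat it as an outline rather than a drop-in recipe):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
new_max = 1024

old = model.embeddings.position_embeddings               # nn.Embedding(512, 768)
new = torch.nn.Embedding(new_max, old.embedding_dim)
new.weight.data[: old.num_embeddings] = old.weight.data  # keep the trained rows
model.embeddings.position_embeddings = new

# keep the cached position ids (if the model has them) and the config in sync
if hasattr(model.embeddings, "position_ids"):
    model.embeddings.register_buffer(
        "position_ids", torch.arange(new_max).expand((1, -1)), persistent=False
    )
model.config.max_position_embeddings = new_max
# ...then fine-tune on long sequences so the new rows 512..1023 actually get trained
```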
  17. Prediction Strategy
      • truncate to max_position_embeddings
        ◦ simple and fast
        ◦ often still competitive, see Efficient Classification of Long Documents Using Transformers
        ◦ check your sequence lengths (sketched below)
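In practice that can be as simple as the sketch below (Hugging Face transformers assumed; load_my_documents is a hypothetical helper standing in for your own data loading):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = load_my_documents()   # hypothetical helper returning a list of strings

# first check how much you would actually lose by truncating
lengths = [len(tokenizer(t)["input_ids"]) for t in texts]
print(f"{sum(l > 512 for l in lengths) / len(lengths):.1%} of documents exceed 512 tokens")

# then let the tokenizer cut off everything past max_length
batch = tokenizer(texts, truncation=True, max_length=512, padding=True, return_tensors="pt")
```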
  18. Prediction Strategy
      • predict on shorter chunks, combine the results
        ◦ NER / translation / summarization: just concatenate
        ◦ classification:
          ▪ average logits
          ▪ train another layer on top
        ◦ extractive question answering:
          ▪ use overlapping chunks to make sure the answer is fully contained in one, and choose the prediction with the highest confidence (sketched below)
          ▪ 2-stage procedure: select the relevant short paragraph first, then extract the answer
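For the overlapping-chunks variant in extractive question answering, the tokenizer can produce the windows for you (sketch, assuming Hugging Face transformers with a fast tokenizer; question and long_context stand in for your own data):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    question,                        # placeholder: your question string
    long_context,                    # placeholder: your over-long document
    truncation="only_second",        # never cut the question, only the context
    max_length=512,
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,  # one row per window instead of dropping the rest
    padding="max_length",
    return_tensors="pt",
)
print(encoded["input_ids"].shape)    # (num_windows, 512)
# run the QA model on every window, then keep the answer span with the highest score
```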
  19. Summary
      • 512 tokens is too little for some use cases
      • the token limit is caused by absolute position embeddings
      • choose your pretrained transformer wisely (recommendation: LongT5 or BigBird)
      • during prediction, truncate or predict on chunks
      Slides will be available at: https://speakerdeck.com/stecklin
  20. Credits
      This presentation template was created by Slidesgo, including icons by Flaticon & images by Freepik. (Elegant Lines Pitch Deck // 2021)
      Images used in this presentation:
      • cultural event icon