
BBL WiMLDS Paris - T5 Paper

Julia Wabant
December 29, 2020


Lunch presentation and discussion around Exploring the Limits of Transfer Learning (the T5 paper), with an introduction to the NLP state of the art. December 2020.


Transcript

  1. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P.J. (2020). WiMLDS Paris paper study session, 17/12/2020
  2. Summary • Definitions • State Of the Art (SOA) brief description • T5 paper • Conclusion
  3. Summary • Definitions • State Of the Art (SOA) brief description • T5 paper • Conclusion
  4. Definitions • Natural Language Processing: process + analyze natural language data • Transfer Learning: store knowledge and re-use it • Transformer: encoder-decoder architecture allowing more parallelization than RNNs
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need.
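To make the Transformer definition concrete, here is a tiny PyTorch sketch (an illustration added for this write-up, not part of the deck or the cited paper) of an encoder-decoder Transformer in the spirit of Vaswani et al. (2017): self-attention consumes every position of the sequence at once, which is what allows more parallelization than an RNN stepping through tokens one by one.

```python
# Illustrative sketch only; hyperparameters follow the "base" Transformer.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 2, 512)  # (source length, batch, d_model): all 10 positions fed at once
tgt = torch.rand(7, 2, 512)   # (target length, batch, d_model)
out = model(src, tgt)         # -> shape (7, 2, 512)
```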
  6. Summary • Definitions • State Of the Art (SOA) brief description • T5 paper • Conclusion
  7. SOA brief description • Non-contextual word embeddings -> contextual word embeddings: Word2Vec, GloVe, FastText -> ELMo, ULMFiT, GPT
  8. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
  9. SOA brief description • Non-contextual word embeddings -> contextual word embeddings: Word2Vec, GloVe, FastText -> ELMo, ULMFiT, GPT • Full-network pre-training + task-specific fine-tuning: BERT & variants (RoBERTa, ALBERT…)
  10. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
  11. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
  12. SOA brief description • Non-contextual word embeddings -> contextual word embeddings: Word2Vec, GloVe, FastText -> ELMo, ULMFiT, GPT • Full-network pre-training + task-specific fine-tuning • BERT & variants (RoBERTa, ALBERT…) • Beyond BERT (XLNet, ELECTRA, ERNIE…)
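To illustrate the non-contextual vs. contextual distinction from the SOA slides, here is a minimal sketch (an assumption: it uses the Hugging Face `transformers` package and the public bert-base-uncased checkpoint, neither of which is prescribed by the deck). A static embedding table assigns one vector per word type, whereas a contextual model returns a different vector for the same word in different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_embedding(sentence: str, word: str) -> torch.Tensor:
    """Embedding of `word` inside `sentence` (assumes `word` is a single WordPiece)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]       # (seq_len, 768)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (inputs["input_ids"][0] == word_id).nonzero()[0, 0]
    return hidden[position]

river = contextual_embedding("She sat on the bank of the river.", "bank")
money = contextual_embedding("She deposited the cash at the bank.", "bank")
# Well below 1.0: the same word type gets different vectors in different contexts,
# which a Word2Vec/GloVe/FastText lookup table cannot do.
print(torch.cosine_similarity(river, money, dim=0).item())
```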
  13. T5 paper • NLP tasks « reframing » • Large-scale empirical survey • Large-scale application
  14. T5 paper • NLP tasks « reframing » • Large-scale empirical survey • Large-scale application
  15. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
  16. T5 paper • NLP tasks « reframing » • Large-scale empirical survey + new dataset introduction • Large-scale application
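The text-to-text « reframing » can be seen directly in how T5 is prompted: every task is expressed as feeding in a prefixed string and decoding a string out, so translation, classification, and regression all share one interface. A minimal sketch using the Hugging Face `transformers` package and the public t5-small checkpoint (an assumption; the deck does not prescribe a library), with task prefixes taken from the paper:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: That is good.",                # translation
    "cola sentence: The car drove fast the road down.",          # acceptability judgement
    "stsb sentence1: A man is playing a guitar. "
    "sentence2: A person plays an instrument.",                  # similarity score, emitted as text
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=20)
    # Same model, same loss, same decoding procedure for every task.
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```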
  17. Baseline: • standard Transformer • 220 million parameters • simple denoising objective • inverse square-root LR schedule for pre-training • batches of 128 sequences, 512 max sequence length • separately fine-tune on each downstream task • Vocabulary: • WordPiece tokenization • 32K tokens
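The inverse square-root schedule mentioned in the baseline is simple enough to state in a few lines. The sketch below follows the paper's formula 1/sqrt(max(n, k)) with k = 10,000 warm-up steps (the warm-up value is the paper's setting; the function name is just illustrative).

```python
import math

def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Pre-training schedule: 1 / sqrt(max(step, warmup_steps)).
    Constant at 1/sqrt(warmup_steps) for the first warmup_steps steps,
    then decays as 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))

print(inverse_sqrt_lr(1), inverse_sqrt_lr(10_000), inverse_sqrt_lr(40_000))
# 0.01 0.01 0.005
```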
  18. → Comparable results to existing models of similar size → Pre-training provides significant gains across almost all benchmarks
  19. • Architectures • Objectives → Using a denoising objective always results in better downstream task performance compared to a language modeling objective
  20. Corruption of contiguous, randomly-spaced spans of tokens → an average span length of 3 outperforms the i.i.d. corruption objective on most non-translation benchmarks + some speedup during training
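A simplified, illustrative sketch of the span-corruption objective (not the authors' implementation): contiguous spans are replaced by sentinel tokens in the input, and the target lists the dropped spans, each preceded by its sentinel. The span positions below are hand-picked to mirror the paper's running example; in training they are sampled randomly, with the average span length of 3 being the setting the slide highlights.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel in the inputs; the target
    is the concatenation of the dropped spans, each preceded by its sentinel."""
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel terminates the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Because the target contains only the dropped-out spans rather than the full sequence, the target sequences are short, which is where the training speedup comes from.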
  21. • Colossal Clean Crawled Corpus (C4) • Heuristics for cleaning up (keeping an unfiltered version) • RealNews-like • WebText-like • Wikipedia + Toronto Books Corpus
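A rough sketch of the kind of cleaning heuristics applied to Common Crawl text when building C4 (an approximation of a few of the published rules, not the released pipeline): keep lines that end in terminal punctuation and have enough words, then discard pages that are too short or contain boilerplate markers.

```python
def clean_page(text: str, min_words: int = 5, min_lines: int = 3):
    """Keep lines ending in terminal punctuation with at least `min_words` words;
    drop pages that are too short or contain boilerplate markers."""
    kept = [line.strip() for line in text.splitlines()
            if line.strip().endswith((".", "!", "?", '"'))
            and len(line.split()) >= min_words]
    page = "\n".join(kept)
    if len(kept) < min_lines or "lorem ipsum" in page.lower() or "{" in page:
        return None  # discard the whole page
    return page
```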
  22. → Removing heuristics degrades performance → In some cases a dataset with a more constrained domain outperforms a diverse dataset → Pre-training on in-domain unlabeled data can improve performance on downstream tasks
  23. Artificially truncated versions → performance degrades as dataset size shrinks → Some amount of repetition of pre-training data might not be harmful
  24. Fine-tuning methods updating only a subset of the model's parameters → lower-resource tasks work well with a small inner dimensionality of adapter layers, whereas higher-resource tasks require a large dimensionality
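Adapter layers, the fine-tuning method this slide refers to, are small bottleneck blocks inserted into a frozen pre-trained network so that only their parameters are updated. A minimal PyTorch sketch (an illustration, not the paper's code), where d_inner is the inner dimensionality the slide's finding is about:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck block added inside each Transformer layer; during fine-tuning
    the pre-trained weights stay frozen and only the adapters (and layer norms)
    are updated. `d_inner` is the inner dimensionality varied in the study."""
    def __init__(self, d_model: int, d_inner: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_inner)  # project down to the bottleneck
        self.up = nn.Linear(d_inner, d_model)    # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))  # residual connection

adapter = Adapter(d_model=512, d_inner=32)    # small d_inner: enough for lower-resource tasks
print(adapter(torch.rand(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```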
  25. → Multi-task training underperforms pre-training + fine-tuning on most tasks… …BUT fine-tuning after multi-task training results in comparable performance
  26. → Increasing training time and/or model size consistently improves the baseline → Ensembling provides an orthogonal and effective means of improving performance
  27. T5 paper • NLP tasks « reframing » • Large-scale empirical survey • Large-scale application
  28. Conclusion • Pre-training on a multi-task mixture of supervised and unsupervised tasks before fine-tuning works as well as pre-training on an unsupervised task; the model can be trained on a wide variety of text tasks using the same loss function and decoding procedure • Repeating data can be detrimental (think big); additional pre-training data can be helpful; a domain-specific dataset can help for some tasks • The span-corruption objective is more computationally efficient; using objectives that produce short target sequences is more computationally efficient for pre-training • The encoder-decoder model has a similar computational cost to encoder-only or decoder-only models; sharing parameters between the encoder and decoder did not cause a performance drop
  29. Resources: • Transformer: The Illustrated Transformer, Jay Alammar • BERT: Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters, Jesse Vig; Deconstructing BERT, Part 2: Visualizing the Inner Workings of Attention, Jesse Vig • T5: Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer, Google AI Blog