
BBL WiMLDS Paris - T5 Paper

Julia Wabant
December 29, 2020

Lunch presentation and discussion around Exploring the Limits of Transfer Learning (the T5 paper), with an introduction to the NLP state of the art (SOA). December 2020.


Transcript

  1. Exploring the Limits of Transfer Learning with a Unified Text-to-Text

    Transformer Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J. (2020). WiMLDS Paris Paper study sessions 17/12/2020
  2. Summary • Definitions • State Of the Art (SOA) brief

    description • T5 paper • Conclusion
  3. Summary • Definitions • State Of the Art (SOA) brief

    description • T5 paper • Conclusion
  4. Definitions • Natural Language Processing process + analyze natural language

    data • Transfer Learning store knowledge and re-use it • Transformer encoder-decoder architecture allowing more parallelization than RNNs
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L.,

    Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need.
  6. Summary • Definitions • State Of the Art (SOA) brief

    description • T5 paper • Conclusion
  7. SOA brief description • Non-contextual word embeddings -> contextual word

    embeddings Word2Vec, GloVe, FastText ELMo, ULMFiT, GPT
  8. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.

    (2018). Bert: Pre-training of deep bidirectional transformers for language understanding.
  9. SOA brief description • Non-contextual word embeddings -> contextual word

    embeddings Word2Vec, GloVe, FastText ELMo, ULMFiT, GPT • Full-network pretraining + task-specific fine-tuning BERT & variants (RoBERTa, ALBERT…)
  10. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.

    (2018). Bert: Pre-training of deep bidirectional transformers for language understanding.
  11. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.

    (2018). Bert: Pre-training of deep bidirectional transformers for language understanding.
  12. SOA brief description • Non-contextual word embeddings -> contextual word

    embeddings Word2Vec, GloVe, FastText ELMo, ULMFiT, GPT • Full-network pretraining + task-specific fine-tuning • BERT & variants (RoBERTa, ALBERT…) • Beyond BERT (XLNet, ELECTRA, ERNIE…)
  13. T5 paper • NLP tasks « reframing » • Large-scale

    empirical survey • Large-scale application
  14. T5 paper • NLP tasks « reframing » • Large-scale

    empirical survey • Large-scale application
  15. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.

    (2018). Bert: Pre-training of deep bidirectional transformers for language understanding.
  16. T5 paper • NLP tasks « reframing » • Large-scale

    empirical survey + new dataset introduction • Large-scale application
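
The « reframing » idea above means that every task, whether translation, classification, regression, or summarization, is cast as feeding the model text and training it to produce text, so a single model, loss function, and decoding procedure cover them all. Below is a minimal sketch of what such input/target pairs look like; the task prefixes follow the style of the examples in the paper, but the exact strings here are illustrative:

```python
# Illustrative text-to-text input/target pairs in the T5 style.
# Every task (translation, classification, regression, summarization)
# becomes "text in, text out": one model, one loss, one decoder.
examples = [
    # Translation
    ("translate English to German: That is good.", "Das ist gut."),
    # Sentence acceptability (CoLA): the class label is emitted as a word
    ("cola sentence: The course is jumping well.", "not acceptable"),
    # Semantic similarity (STS-B): the regression target becomes a string
    ("stsb sentence1: The rhino grazed. sentence2: A rhino is grazing.", "3.8"),
    # Summarization
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "six people hospitalized after a storm in attala county."),
]

for source, target in examples:
    print(f"input : {source}\ntarget: {target}\n")
```
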
  17. Baseline: • standard Transformer • 220 million parameters •

    simple denoising objective • inverse square-root LR schedule for pretraining • batch size 128, max sequence length 512 • separately fine-tune on each downstream task • Vocabulary: WordPiece tokenization, 32K tokens
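
The inverse square-root schedule used for pretraining is simple to state. A minimal sketch, assuming the formulation given in the paper, lr(n) = 1 / sqrt(max(n, k)) with k warm-up steps (10^4 in the paper), so the rate is constant during warm-up and then decays:

```python
import math

def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Inverse square-root schedule: constant at 1/sqrt(k) for the first
    k steps, then decaying as 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))

# The rate stays at 0.01 for the first 10k steps, then decays.
print(inverse_sqrt_lr(1), inverse_sqrt_lr(10_000), inverse_sqrt_lr(40_000))
```
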
  18. → comparable results to existing models of similar size →

    pretraining provides significant gains across almost all benchmarks
  19. • Architectures • Objectives → Using a denoising objective always

    results in better downstream task performance compared to a language modeling objective
  20. Corruption of contiguous, randomly-spaced spans of tokens → Average span length

    of 3 outperforms the i.i.d. corruption objective on most non-translation benchmarks + some speedup during training
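
To make the span-corruption objective concrete, here is a toy, self-contained sketch (not the paper's code): random contiguous spans are dropped from the input and replaced by sentinel tokens, and the target lists each sentinel followed by the tokens it replaced:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, span_len=3, seed=0):
    """Toy span corruption: replace random contiguous spans with sentinel
    tokens (<extra_id_0>, <extra_id_1>, ...) and return (inputs, targets).
    The span length is fixed here for simplicity; the paper parameterizes
    the mean span length."""
    rng = random.Random(seed)
    n_corrupt = max(1, round(len(tokens) * corruption_rate))
    n_spans = max(1, round(n_corrupt / span_len))
    starts = sorted(rng.sample(range(len(tokens)), n_spans))
    inputs, targets, i, sid = [], [], 0, 0
    for start in starts:
        if start < i:
            continue  # skip overlapping spans in this toy version
        end = min(start + span_len, len(tokens))
        sentinel = f"<extra_id_{sid}>"
        inputs.extend(tokens[i:start] + [sentinel])
        targets.extend([sentinel] + tokens[start:end])
        i, sid = end, sid + 1
    inputs.extend(tokens[i:])
    targets.append(f"<extra_id_{sid}>")
    return inputs, targets

toks = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(toks)
print(" ".join(inp))  # original sentence with one span replaced by a sentinel
print(" ".join(tgt))  # the sentinel followed by the dropped tokens, then a final sentinel
```
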
  21. • Colossal Clean Crawled Corpus • Heuristics for cleaning up

    (keeping an unfiltered version) • RealNews-like • WebText-like • Wikipedia + Toronto Books Corpus
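
The cleaning heuristics behind the Colossal Clean Crawled Corpus are simple line- and page-level text filters. The sketch below is an illustrative, partial approximation of the kind of rules described in the paper; the thresholds and rule set are simplified, not a faithful reimplementation:

```python
from typing import Optional

def clean_page(text: str) -> Optional[str]:
    """Toy approximation of C4-style cleaning: keep only natural-language
    lines and drop pages that look like boilerplate, code, or filler."""
    if "lorem ipsum" in text.lower() or "{" in text:
        return None  # placeholder text or likely source code
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # keep lines that end in terminal punctuation and have enough words
        if line.endswith((".", "!", "?", '"')) and len(line.split()) >= 5:
            kept.append(line)
    if len(kept) < 3:
        return None  # too little content left on the page
    return "\n".join(kept)
```
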
  22. → Removing heuristics degrades performance. In some cases a dataset

    with a more constrained domain outperforms a more diverse dataset → pre-training on in-domain unlabeled data can improve performance on downstream tasks
  23. Artificially truncated versions → performance degrades as dataset size shrinks

    → Some amount of repetition of pre-training data might not be harmful
  24. Fine-tuning methods updating only a subset of the model’s parameters → lower-resource

    tasks work well with a small inner dimensionality of adapter layers, whereas higher-resource tasks require a large dimensionality
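
Adapter layers are the main "update only a subset of parameters" method referenced above: small bottleneck blocks added inside the Transformer, with only the adapters trained during fine-tuning. A minimal PyTorch sketch of one adapter block with inner dimensionality d_adapter, following the standard adapter recipe rather than code from the paper:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down to d_adapter, apply a nonlinearity,
    project back up, and add a residual connection. Only these weights are
    updated during fine-tuning."""
    def __init__(self, d_model: int, d_adapter: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_adapter)
        self.up = nn.Linear(d_adapter, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Lower-resource tasks worked well with a small inner dimensionality,
# higher-resource tasks needed a larger one.
adapter = Adapter(d_model=512, d_adapter=32)
hidden = torch.randn(4, 128, 512)   # (batch, sequence, d_model)
print(adapter(hidden).shape)        # torch.Size([4, 128, 512])
```
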
  25. → Multi-task training underperforms pre-training + fine-tuning on most tasks… …BUT

    fine-tuning after multi-task training results in comparable performance
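
Part of the multi-task setup is how the tasks are mixed during training. The sketch below shows temperature-scaled, examples-proportional mixing in the spirit of the paper: dataset sizes are capped at an artificial limit, then the sampling rates are smoothed with a temperature. The cap, temperature, and dataset sizes used here are made-up example values:

```python
def mixing_rates(dataset_sizes, cap=2**19, temperature=2.0):
    """Temperature-scaled examples-proportional mixing: cap each dataset's
    example count, raise the capped counts to 1/temperature, then
    renormalize so the rates sum to one. Large tasks stop drowning out
    small ones."""
    capped = {name: min(n, cap) for name, n in dataset_sizes.items()}
    scaled = {name: n ** (1.0 / temperature) for name, n in capped.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

# Hypothetical task sizes (number of training examples); the cap is an
# example value, the paper sweeps several limits and temperatures.
sizes = {"cola": 8_500, "squad": 88_000, "wmt_en_de": 4_500_000}
print(mixing_rates(sizes))
```
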
  26. → Increasing training time and/or model size consistently improves the

    baseline → Ensembling provides an orthogonal and effective means of improving performance
  27. T5 paper • NLP tasks « reframing » • Large-scale

    empirical survey • Large-scale application
  28. Conclusion • Pre-training on a multi-task mixture of supervised and unsupervised

    tasks before fine-tuning works as well as pre-training on an unsupervised task; the model can be trained on a wide variety of text tasks using the same loss function and decoding procedure • Repeating data can be detrimental (think big), additional pre-training data can be helpful, a domain-specific dataset can help for some tasks • The span-corruption objective is more computationally efficient; using objectives that produce short target sequences is more computationally efficient for pre-training • The encoder-decoder model has a computational cost similar to encoder-only or decoder-only models; sharing parameters between encoder and decoder did not degrade performance
  29. Resources: • Transformer The Illustrated Transformer, Jay Alammar •

    BERT Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters, Jesse Vig Deconstructing BERT, Part 2: Visualizing the Inner Workings of Attention, Jesse Vig • T5 Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer, Google AI blog