
BBL WiMLDS Paris - T5 Paper

Julia Wabant
December 29, 2020


Lunch presentation and discussion around Exploring the Limits of Transfer Learning (the T5 paper), with an introduction to the NLP state of the art. December 2020.


Transcript

  1. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P.J. (2020). WiMLDS Paris paper study session, 17/12/2020
  2. Summary • Definitions • State Of the Art (SOA) brief description • T5 paper • Conclusion
  3. Summary • Definitions • State Of the Art (SOA) brief description • T5 paper • Conclusion
  4. Definitions • Natural Language Processing: process + analyze natural language data • Transfer Learning: store knowledge and re-use it • Transformer: encoder-decoder architecture allowing more parallelization than RNNs
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need.
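To make the Transformer definition concrete, here is a tiny PyTorch sketch (an illustration added for this write-up, not part of the deck or the cited paper) of an encoder-decoder Transformer in the spirit of Vaswani et al. (2017): self-attention consumes every position of the sequence at once, which is what allows more parallelization than an RNN stepping through tokens one by one.

```python
# Illustrative sketch only; hyperparameters follow the "base" Transformer.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 2, 512)  # (source length, batch, d_model): all 10 positions fed at once
tgt = torch.rand(7, 2, 512)   # (target length, batch, d_model)
out = model(src, tgt)         # -> shape (7, 2, 512)
```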
  6. Summary • Definitions • State Of the Art (SOA) brief description • T5 paper • Conclusion
  7. SOA brief description • Non-contextual word embeddings -> contextual word embeddings: Word2Vec, GloVe, FastText -> ELMo, ULMFiT, GPT
  8. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
  9. SOA brief description • Non-contextual word embeddings -> contextual word embeddings: Word2Vec, GloVe, FastText -> ELMo, ULMFiT, GPT • Full-network pre-training + task-specific fine-tuning: BERT & variants (RoBERTa, ALBERT…)
  10. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
  11. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
  12. SOA brief description • Non-contextual word embeddings -> contextual word embeddings: Word2Vec, GloVe, FastText -> ELMo, ULMFiT, GPT • Full-network pre-training + task-specific fine-tuning • BERT & variants (RoBERTa, ALBERT…) • Beyond BERT (XLNet, ELECTRA, ERNIE…)
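To illustrate the non-contextual vs. contextual distinction from the SOA slides, here is a minimal sketch (an assumption: it uses the Hugging Face `transformers` package and the public bert-base-uncased checkpoint, neither of which is prescribed by the deck). A static embedding table assigns one vector per word type, whereas a contextual model returns a different vector for the same word in different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_embedding(sentence: str, word: str) -> torch.Tensor:
    """Embedding of `word` inside `sentence` (assumes `word` is a single WordPiece)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]       # (seq_len, 768)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (inputs["input_ids"][0] == word_id).nonzero()[0, 0]
    return hidden[position]

river = contextual_embedding("She sat on the bank of the river.", "bank")
money = contextual_embedding("She deposited the cash at the bank.", "bank")
# Well below 1.0: the same word type gets different vectors in different contexts,
# which a Word2Vec/GloVe/FastText lookup table cannot do.
print(torch.cosine_similarity(river, money, dim=0).item())
```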
  13. T5 paper • NLP tasks « reframing » • Large-scale empirical survey • Large-scale application
  14. T5 paper • NLP tasks « reframing » • Large-scale empirical survey • Large-scale application
  15. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
  16. T5 paper • NLP tasks « reframing » • Large-scale empirical survey + new dataset introduction • Large-scale application
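The text-to-text « reframing » can be seen directly in how T5 is prompted: every task is expressed as feeding in a prefixed string and decoding a string out, so translation, classification, and regression all share one interface. A minimal sketch using the Hugging Face `transformers` package and the public t5-small checkpoint (an assumption; the deck does not prescribe a library), with task prefixes taken from the paper:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: That is good.",                # translation
    "cola sentence: The car drove fast the road down.",          # acceptability judgement
    "stsb sentence1: A man is playing a guitar. "
    "sentence2: A person plays an instrument.",                  # similarity score, emitted as text
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=20)
    # Same model, same loss, same decoding procedure for every task.
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```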
  17. Baseline: • standard Transformer • 220 million parameters • simple denoising objective • inverse square-root LR schedule for pre-training • batches of 128 sequences, 512 max sequence length • separately fine-tune on each downstream task • Vocabulary: • WordPiece tokenization • 32K tokens
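The inverse square-root schedule mentioned in the baseline is simple enough to state in a few lines. The sketch below follows the paper's formula 1/sqrt(max(n, k)) with k = 10,000 warm-up steps (the warm-up value is the paper's setting; the function name is just illustrative).

```python
import math

def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Pre-training schedule: 1 / sqrt(max(step, warmup_steps)).
    Constant at 1/sqrt(warmup_steps) for the first warmup_steps steps,
    then decays as 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))

print(inverse_sqrt_lr(1), inverse_sqrt_lr(10_000), inverse_sqrt_lr(40_000))
# 0.01 0.01 0.005
```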
  18. → Comparable results to existing models of similar size → Pre-training provides significant gains across almost all benchmarks
  19. • Architectures • Objectives → Using a denoising objective always results in better downstream task performance compared to a language modeling objective
  20. Corruption of contiguous, randomly-spaced spans of tokens → an average span length of 3 outperforms the i.i.d. corruption objective on most non-translation benchmarks + some speedup during training
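A simplified, illustrative sketch of the span-corruption objective (not the authors' implementation): contiguous spans are replaced by sentinel tokens in the input, and the target lists the dropped spans, each preceded by its sentinel. The span positions below are hand-picked to mirror the paper's running example; in training they are sampled randomly, with the average span length of 3 being the setting the slide highlights.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel in the inputs; the target
    is the concatenation of the dropped spans, each preceded by its sentinel."""
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel terminates the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Because the target contains only the dropped-out spans rather than the full sequence, the target sequences are short, which is where the training speedup comes from.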
  21. • Colossal Clean Crawled Corpus (C4) • Heuristics for cleaning up (keeping an unfiltered version) • RealNews-like • WebText-like • Wikipedia + Toronto Books Corpus
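A rough sketch of the kind of cleaning heuristics applied to Common Crawl text when building C4 (an approximation of a few of the published rules, not the released pipeline): keep lines that end in terminal punctuation and have enough words, then discard pages that are too short or contain boilerplate markers.

```python
def clean_page(text: str, min_words: int = 5, min_lines: int = 3):
    """Keep lines ending in terminal punctuation with at least `min_words` words;
    drop pages that are too short or contain boilerplate markers."""
    kept = [line.strip() for line in text.splitlines()
            if line.strip().endswith((".", "!", "?", '"'))
            and len(line.split()) >= min_words]
    page = "\n".join(kept)
    if len(kept) < min_lines or "lorem ipsum" in page.lower() or "{" in page:
        return None  # discard the whole page
    return page
```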
  22. → Removing heuristics degrades performance → In some cases a dataset with a more constrained domain outperforms a diverse dataset → Pre-training on in-domain unlabeled data can improve performance on downstream tasks
  23. Artificially truncated versions → performance degrades as dataset size shrinks → Some amount of repetition of pre-training data might not be harmful
  24. Fine-tuning methods updating only a subset of the model's parameters → lower-resource tasks work well with a small inner dimensionality of adapter layers, whereas higher-resource tasks require a large dimensionality
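Adapter layers, the fine-tuning method this slide refers to, are small bottleneck blocks inserted into a frozen pre-trained network so that only their parameters are updated. A minimal PyTorch sketch (an illustration, not the paper's code), where d_inner is the inner dimensionality the slide's finding is about:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck block added inside each Transformer layer; during fine-tuning
    the pre-trained weights stay frozen and only the adapters (and layer norms)
    are updated. `d_inner` is the inner dimensionality varied in the study."""
    def __init__(self, d_model: int, d_inner: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_inner)  # project down to the bottleneck
        self.up = nn.Linear(d_inner, d_model)    # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))  # residual connection

adapter = Adapter(d_model=512, d_inner=32)    # small d_inner: enough for lower-resource tasks
print(adapter(torch.rand(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```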
  25. → Multi-task training underperforms pre-training + fine-tuning on most tasks… …BUT fine-tuning after multi-task training results in comparable performance
  26. → Increasing training time and/or model size consistently improves the baseline → Ensembling provides an orthogonal and effective means of improving performance
  27. T5 paper • NLP tasks « reframing » • Large-scale empirical survey • Large-scale application
  28. Conclusion • Pre-training on a multi-task mixture of supervised and unsupervised tasks before fine-tuning works as well as pre-training on an unsupervised task; the model can be trained on a wide variety of text tasks using the same loss function and decoding procedure • Repeating data can be detrimental (think big); additional pre-training data can be helpful; a domain-specific dataset can help for some tasks • The span-corruption objective is more computationally efficient; using objectives that produce short target sequences is more computationally efficient for pre-training • The encoder-decoder model has a similar computational cost to encoder-only or decoder-only models; sharing parameters between the encoder and decoder did not cause a performance drop
  29. Resources: • Transformer: The Illustrated Transformer, Jay Alammar • BERT: Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters, Jesse Vig; Deconstructing BERT, Part 2: Visualizing the Inner Workings of Attention, Jesse Vig • T5: Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer, Google AI Blog