wing.nus
November 19, 2021

Revisiting Few-Shot learning for Natural Language Understanding

Few-shot learning has attracted much recent attention in the NLP community because it addresses a practical real-world scenario in which fully supervised labels are insufficient. A key challenge is that prior research has proceeded under an impractical assumption and has been evaluated under a disparate set of protocols, which hinders fair comparison and makes it hard to measure the field's progress. This talk covers recent advances in few-shot learning for natural language understanding. I will first identify problems with few-shot assumptions and evaluation protocols, and then introduce and justify a practical approach to few-shot evaluation. Next, by re-evaluating state-of-the-art methods on common ground, I will present several key findings that reveal problems in the field. Finally, I will introduce several possible solutions for improving few-shot robustness and performance.

Seminar page: https://wing-nus.github.io/ir-seminar/speaker-yanan
YouTube Video recording: https://www.youtube.com/watch?v=HppFsw9E50M

Transcript

  1. Roadmap • Few-Shot Learning for Natural Language Understanding • Evaluation Protocols • Robustness • Performance • Possible Future Directions • Potential follow-up? • No Pretraining?
  2. True Few-Shot Learning • Goal: to quickly learn a new task with very few labeled samples. • Differences between True FSL and FSL: • no access to multiple dataset episodes, a large validation set, or visible test data. [Figure: a few labeled data, split into train #1 and dev #1 through train #4 and dev #4]
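The figure on this slide pairs each train split with its own dev split (#1 to #4), all drawn from the same few labeled examples. A minimal sketch of that idea follows, assuming each pair is an independent random partition of the labeled set. The function name `make_splits`, the 50/50 ratio, and the toy data are illustrative assumptions, not the speaker's code.

```python
import random


def make_splits(examples, k=4, dev_fraction=0.5, seed=0):
    """Return k (train, dev) pairs, each a fresh random partition of `examples`."""
    rng = random.Random(seed)
    # Keep at least one example on each side of the partition.
    n_dev = min(len(examples) - 1, max(1, int(len(examples) * dev_fraction)))
    splits = []
    for _ in range(k):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        splits.append((shuffled[n_dev:], shuffled[:n_dev]))
    return splits


# Toy "few labeled data": eight (text, label) pairs (made up for illustration).
labeled = [(f"sentence {i}", i % 2) for i in range(8)]
splits = make_splits(labeled, k=4)
for i, (train, dev) in enumerate(splits, start=1):
    print(f"train #{i}: {len(train)} examples, dev #{i}: {len(dev)} examples")
```

No extra validation data is needed in this setup: every split reuses the same few labeled examples, so model selection never touches the test set.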
  3. Recap of Few-Shot Methods • We mainly focus on methods based on pretrained models. • Direct inference with a frozen model • Finetuning a pretrained model • Standard Finetuning • Prompt-based Finetuning • Figure taken from: Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. Pages 2339–2352.
  4. FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding • Motivation • Under different evaluation protocols, the relative performance of different methods can be reversed. • Prior works have been evaluated under a diverse set of protocols. E.g., using pre-fixed hyperparameters [1,4] → risk of overestimation [2]. E.g., using a small dev set to select hyperparameters [3] → details such as how to split the small dev set and which data splits to use make huge differences.

    [1] T. Schick and H. Schütze. It's not just size that matters: Small language models are also few-shot learners. Pages 2339–2352, 2021.
    [2] T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, and Y. Artzi. Revisiting few-sample BERT fine-tuning. CoRR, abs/2006.05987, 2020.
    [3] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang. GPT understands, too. CoRR, abs/2103.10385, 2021.
    [4] R. R. Menon, M. Bansal, S. Srivastava, and C. Raffel. Improving and simplifying pattern-exploiting training. CoRR, abs/2103.11955, 2021.
  5. Evaluation Framework • Desiderata: • Test performance of the selected hyperparameter • Correlation between small development and test sets (over a distribution of hyperparameters) • Stability w.r.t. the number of runs (the hyper-hyperparameter K)
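The three desiderata can each be turned into a number. The sketch below is one hedged way to do so, not the FewNLU implementation: it assumes dev/test correlation is measured with Spearman's rank correlation over hyperparameter configurations and that stability is the spread of test scores across K runs. All scores and configuration names are made up for illustration.

```python
from statistics import mean, pstdev


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            result[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical results: config -> [(dev_acc, test_acc), ...], one pair per run (K = 3).
runs = {
    "lr=1e-5": [(0.62, 0.60), (0.66, 0.63), (0.64, 0.61)],
    "lr=2e-5": [(0.70, 0.68), (0.74, 0.69), (0.72, 0.70)],
    "lr=5e-5": [(0.68, 0.71), (0.60, 0.58), (0.67, 0.66)],
}
configs = list(runs)
dev_means = [mean([d for d, _ in runs[c]]) for c in configs]
test_means = [mean([t for _, t in runs[c]]) for c in configs]

best = configs[dev_means.index(max(dev_means))]   # selected on dev only
test_of_best = test_means[configs.index(best)]    # desideratum 1
rho = spearman(dev_means, test_means)             # desideratum 2
spread = pstdev([t for _, t in runs[best]])       # desideratum 3

print(best, round(test_of_best, 3), round(rho, 3), round(spread, 4))
```

A high correlation means the small dev set ranks configurations the way the test set would, which is what makes dev-based selection trustworthy in the first place.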
  6. Re-Evaluation of State-of-the-art Few-Shot Methods • Finding 1. The absolute performance and the relative gap of few-shot methods were in general not accurately estimated in prior literature. This highlights the importance of evaluation for obtaining reliable conclusions. • Finding 2. The benefits of some few-shot methods (e.g., ADAPET) decrease on larger pretrained models like DeBERTa. Semi-supervised few-shot methods (i.e., iPET and Noisy) generally improve by 1–2 points on average compared to minimal few-shot methods on both models.
  7. Re-Evaluation of State-of-the-art Few-Shot Methods • Finding 3. Gains of different methods are largely complementary. A combination of methods largely outperforms individual ones, performing close to a strong fully-supervised baseline with RoBERTa. • Finding 4. No single few-shot method dominates most NLU tasks. This highlights the need for the development of few-shot methods with more consistent and robust performance across tasks.
  8. FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning • Observation 1: Few-shot performance with label-flipped samples is generally better than with label-preserved augmented samples.
  9. FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning • Observation 2: Both replacing and correcting noisy samples substantially improve performance and prevent the failure mode. Moreover, correcting the labels brings large gains, indicating that label flipping tends to alleviate the issue.
  10. FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning • On eight tasks and two base models of different scales, FlipDA achieves a good trade-off between effectiveness and robustness: it substantially improves many tasks while not negatively affecting the others.
  11. FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning • Adding "not". • Changing "dwindles" to its antonym "increased".
  12. FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning • RTE Task (entail/not-entail): hard-to-easy direction: from not-entail to entail; easy-to-hard direction: from entail to not-entail. • BoolQ Task (yes/no): hard-to-easy direction: from No to Yes; easy-to-hard direction: from Yes to No. • Augmenting along the hard-to-easy direction benefits few-shot performance more.
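The slides do not spell out how label-flipped samples are chosen, so the following is only a sketch under a stated assumption: an augmented candidate is kept when a classifier assigns it a label different from the original, and the most confident such candidate wins. `toy_classifier` is a hypothetical rule-based stand-in for a trained few-shot model, and the RTE-style sentences are invented.

```python
def toy_classifier(premise, hypothesis):
    """Hypothetical stand-in for a trained model: returns label -> probability."""
    negated = " not " in f" {hypothesis} "
    p_not_entail = 0.9 if negated else 0.2
    return {"entail": 1.0 - p_not_entail, "not-entail": p_not_entail}


def select_flipped(premise, original_label, candidates, classify,
                   labels=("entail", "not-entail")):
    """Return (hypothesis, new_label, probability) for the most confident
    candidate whose predicted label differs from `original_label`, or None."""
    best = None
    for hypothesis in candidates:
        probs = classify(premise, hypothesis)
        predicted = max(labels, key=lambda label: probs[label])
        if predicted == original_label:
            continue  # label preserved: not a flipped sample
        if best is None or probs[predicted] > best[2]:
            best = (hypothesis, predicted, probs[predicted])
    return best


premise = "The cat sat on the mat."
candidates = ["The cat is on the mat.", "The cat is not on the mat."]
flipped = select_flipped(premise, "entail", candidates, toy_classifier)
print(flipped)
```

Restricting `candidates` to one flip direction (for example, only not-entail to entail) would be one way to act on the hard-to-easy observation above.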
  13. Ptuning: Few-Shot Learning with Continuous Prompts • Motivation: Discrete patterns suffer from instability. • For example, changing even a single word can cause a drastic change in performance of almost 20 points.
  14. Ptuning: Few-Shot Learning with Continuous Prompts • The intuition is that continuous prompts incorporate a certain degree of learnability into the input, which may learn to offset the effects of minor changes in discrete prompts to improve training stability.
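To make "learnability in the input" concrete, here is a deliberately simplified sketch that assumes a continuous prompt is a set of free vectors spliced into the embedded input next to the discrete pattern. It leaves out any prompt encoder and the training loop; the vocabulary, dimensions, and function names are illustrative assumptions.

```python
import random

EMBED_DIM = 4
rng = random.Random(0)


def random_vector():
    return [rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]


# Token embeddings looked up by vocabulary entry (toy vocabulary).
vocab = ["[MASK]", "the", "movie", "was", "great", "it"]
token_embeddings = {token: random_vector() for token in vocab}

# Continuous prompt: free parameters that correspond to no vocabulary word.
prompt_embeddings = [random_vector() for _ in range(2)]


def embed_with_prompts(input_tokens, pattern_tokens, prompt_vectors, table):
    """Embed the input, then the continuous prompt, then the discrete pattern."""
    sequence = [table[t] for t in input_tokens]
    sequence += prompt_vectors                      # learnable, no lookup
    sequence += [table[t] for t in pattern_tokens]  # e.g. "it was [MASK]"
    return sequence


sequence = embed_with_prompts(
    ["the", "movie", "was", "great"], ["it", "was", "[MASK]"],
    prompt_embeddings, token_embeddings,
)
print(len(sequence), "positions, of which", len(prompt_embeddings), "are continuous prompts")
```

During training, gradients would also flow into `prompt_embeddings`, which gives the model a way to compensate when the wording of the discrete pattern changes.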
  15. A Few More Aspects to Go • Possible follow-up: • How to appropriately decide on a hyperparameter search space • Further increasing the diversity of FlipDA's augmented data • Parameter-efficient learning with Ptuning • Is it possible to get rid of pretraining? (NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework)
  16. NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework • Motivation: Cramming-for-the-exams
  17. NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework • TLM achieves results better than or similar to pretrained language models (e.g., RoBERTa-Large) while reducing the training FLOPs by two orders of magnitude. With high accuracy and efficiency, we hope TLM will contribute to democratizing NLP.