
wing.nus
March 01, 2022

The curious case of self-training: from vision to language and beyond

In this talk, I will discuss the story of a classic semi-supervised learning approach, self-training, which has been quite successful lately. The talk starts with NoisyStudent, a simple self-training method that advanced the state of the art in vision at the time and yielded surprising improvements on robustness benchmarks. I’ll then transition to NLP to talk about STraTA, an approach that combines self-training and task augmentation to achieve strong results in few-shot NLP settings, where only a handful of training examples are available.

Bio:
Thang Luong is currently a Staff Research Scientist at Google Brain. He obtained his PhD in Computer Science from Stanford University, where he pioneered the development of neural machine translation at both Google and Stanford. Dr. Luong has served as an area chair at ACL and NeurIPS and is an author of many scientific articles and patents with over 18K citations. He is a co-founder of the Meena project, now the Google LaMDA chatbot, and of VietAI, a non-profit organization that builds a community of world-class AI experts in Vietnam.

Slides link (hosted by permission of Thang): https://speakerdeck.com/wingnus/the-curious-case-of-self-training-from-vision-to-language-and-beyond
YouTube Video recording: https://youtu.be/WZXvJF995pM
Seminar page: https://wing-nus.github.io/nlp-seminar/speaker-thang


Transcript

  1. The Curious Case of Self-training. Talk at WING, NUS – Mar 1st, 2022. Thang Luong @lmthang
  2. 7 In the past: "When there is enough labeled data, nobody cares about SSL?" – a belief of many ML practitioners.
  3. 9 Agenda: • (Brief) Unsupervised Data Augmentation: Vision & NLP • Self-training for Vision (NoisyStudent) • Self-training for NLP (STraTA) and beyond
  4. 10 Unsupervised Data Augmentation (UDA) for Consistency Training. Thang Luong, Quoc Le, Eduard Hovy, Zihang Dai, Qizhe Xie. Paper: https://arxiv.org/abs/1904.12848 (NeurIPS'20) Code: https://github.com/google-research/uda
  5. 11 Consistency Training in Semi-Supervised Learning: add noise to regularize model prediction: VAT [Miyato et al., 2018]. (Image: cat)
  6. 12 Consistency Training in Semi-Supervised Learning: add noise to regularize model prediction: VAT [Miyato et al., 2018]. (Image: cat)
  7. 13 Consistency Training in Semi-Supervised Learning: add noise to regularize model prediction: VAT [Miyato et al., 2018]. (Image: cat)
  8. 14 Consistency Training in Semi-Supervised Learning: add noise to regularize model prediction: VAT [Miyato et al., 2018]. (Image: cat)
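
To make the consistency-training idea on these slides concrete, here is a minimal PyTorch-style sketch of the objective: supervised cross-entropy on labeled data plus a KL consistency term between the prediction on an unlabeled example and the prediction on a noised/augmented copy. The function and argument names are illustrative assumptions, not taken from the UDA or VAT code.

```python
import torch
import torch.nn.functional as F

def consistency_training_loss(model, x_lab, y_lab, x_unlab, x_unlab_aug, lambda_u=1.0):
    """Sketch of a consistency-training objective (UDA/VAT style), not the official code."""
    # Supervised cross-entropy on the labeled batch.
    sup_loss = F.cross_entropy(model(x_lab), y_lab)

    # Fixed target distribution from the clean unlabeled example (no gradient);
    # the prediction on the noised copy is pulled toward it.
    with torch.no_grad():
        p_clean = F.softmax(model(x_unlab), dim=-1)

    # Prediction on the noised / augmented copy of the same example.
    log_p_aug = F.log_softmax(model(x_unlab_aug), dim=-1)

    # KL(p_clean || p_aug), averaged over the batch.
    consistency = F.kl_div(log_p_aug, p_clean, reduction="batchmean")

    return sup_loss + lambda_u * consistency
```
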
  9. 20 Augmentation provides Diverse and Valid Perturbations • Back translation for Text Classification: ◦ English → French → English ◦ Sampling: diverse (high temperature) vs. valid (low temperature). ◦ Used in QANet (Yu et al., 2018) for labeled data only.
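
A rough sketch of the back-translation augmentation described above. The translate_en_fr / translate_fr_en callables are hypothetical placeholders (not a real API) that accept a sampling temperature: higher temperature yields more diverse paraphrases, lower temperature more faithful ones.

```python
from typing import Callable, List

def back_translate(sentence: str,
                   translate_en_fr: Callable[..., str],
                   translate_fr_en: Callable[..., str],
                   temperature: float = 0.8,
                   n_samples: int = 4) -> List[str]:
    """Back-translation sketch: English -> French -> English paraphrases.

    The translate_* callables are assumed, hypothetical translation models;
    the temperature argument trades off diversity (high) against validity (low).
    """
    paraphrases = []
    for _ in range(n_samples):
        french = translate_en_fr(sentence, temperature=temperature)
        paraphrases.append(translate_fr_en(french, temperature=temperature))
    return paraphrases
```
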
  10. 21 Augmentation injects task-specific knowledge • RandAugment (Cubuk et al., 2019) for Image Classification: ◦ Example policies: (Rotate, 0.8, 2), (Brightness, 0.8, 4)
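
As an illustration of what a policy triple like (Rotate, 0.8, 2) means (operation, probability, magnitude), here is a small PIL-based sketch. The mapping from magnitude to concrete rotation degrees and brightness factor below is an assumption for illustration only, not the exact RandAugment ranges.

```python
import random
from PIL import Image, ImageEnhance

# Example (operation, probability, magnitude) policies from the slide.
POLICIES = [("Rotate", 0.8, 2), ("Brightness", 0.8, 4)]

def apply_policies(img: Image.Image, policies=POLICIES) -> Image.Image:
    """Apply each op with its probability; magnitude scales the op strength."""
    for op, prob, magnitude in policies:
        if random.random() > prob:
            continue
        if op == "Rotate":
            img = img.rotate(magnitude * 3)  # assumed: ~3 degrees per magnitude unit
        elif op == "Brightness":
            img = ImageEnhance.Brightness(img).enhance(1.0 + 0.1 * magnitude)
    return img
```
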
  11. 25 Our UDA paper matches Vincent's mental picture: SSL > Supervised! Same for vision (CIFAR, SVHN).
  12. 26 Summary • Data augmentation is an effective perturbation for SSL. • UDA significantly improves results for both language and vision. • UDA combines well with transfer learning, e.g., BERT. Paper: https://arxiv.org/abs/1904.12848 Code: https://github.com/google-research/uda
  13. 27 Agenda: • (Brief) Unsupervised Data Augmentation: Vision & NLP • Self-training for Vision (NoisyStudent) • Self-training for NLP (STraTA) and beyond
  14. 28 So far, success has only been in the low-data regime! Small labeled data (CIFAR, SVHN): state-of-the-art results (FixMatch, ReMixMatch, UDA, MixMatch, S4L, ICT, VAT, etc.). Large labeled data (ImageNet): no state-of-the-art results.
  15. 29 Self-training with Noisy Student improves ImageNet classification. Thang Luong, Quoc Le, Eduard Hovy, Qizhe Xie. Paper: https://arxiv.org/abs/1911.04252 (CVPR'20) Code: https://github.com/google-research/noisystudent
  16. 31 What is NoisyStudent? 4 simple steps: 1. Train a classifier on the labeled (L) data (teacher). (Diagram: labeled data, e.g. "steel arch bridge", "canoe" → teacher T.)
  17. 32 What is NoisyStudent? 4 simple steps: 1. Train a classifier on the labeled (L) data (teacher). 2. Infer labels on a much larger unlabeled dataset → P. (Diagram: unlabeled data → teacher T → pseudo-labeled data, e.g. "lake", "curtain", "bridge".)
  18. 33 What is NoisyStudent? 4 simple steps: 1. Train a classifier on the labeled (L) data (teacher). 2. Infer labels on a much larger unlabeled dataset → P. 3. Train a larger classifier on L + P, adding noise (noisy student). (Diagram: labeled data + pseudo-labeled data → student S, with noise.)
  19. 34 What is NoisyStudent? 4 simple steps: 1. Train a classifier on the labeled (L) data (teacher). 2. Infer labels on a much larger unlabeled dataset → P. 3. Train a larger classifier on L + P, adding noise (noisy student): a. Data Augmentation b. Dropout c. Stochastic Depth.
  20. 35 What is NoisyStudent? 4 simple steps: 1. Train a classifier on the labeled (L) data (teacher). 2. Infer labels on a much larger unlabeled dataset → P. 3. Train a larger classifier on L + P, adding noise (noisy student): a. Data Augmentation b. Dropout c. Stochastic Depth. 4. Go to step 2, with the student as the teacher.
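
The four steps above can be summarized in a short training-loop sketch. The callables train_classifier, predict_soft_labels, and make_larger_model are hypothetical placeholders standing in for standard training, inference, and model scaling (they are not functions from the NoisyStudent codebase); the noise sources mirror those listed on the slide.

```python
def noisy_student(labeled_data, unlabeled_data,
                  train_classifier, predict_soft_labels, make_larger_model,
                  num_iterations=3):
    """Sketch of the four NoisyStudent steps; the three callables are assumed helpers."""
    # 1. Train a teacher classifier on the labeled data only.
    teacher = train_classifier(labeled_data, model=None, noise=())

    for _ in range(num_iterations):
        # 2. Infer (soft) pseudo-labels on a much larger unlabeled dataset -> P.
        pseudo = [(x, predict_soft_labels(teacher, x)) for x in unlabeled_data]

        # 3. Train an equal-or-larger student on L + P, injecting noise:
        #    data augmentation, dropout, stochastic depth.
        student = train_classifier(labeled_data + pseudo,
                                   model=make_larger_model(teacher),
                                   noise=("augment", "dropout", "stochastic_depth"))

        # 4. Go back to step 2 with the student as the new teacher.
        teacher = student

    return teacher
```
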
  21. 36

  22. 37 NoisyStudent vs. Distillation • Distillation focuses on speed rather than quality: no student noise, no unlabeled data, smaller student. (Diagram: Distillation: labeled data → T → S; NoisyStudent: labeled + unlabeled data → T → S, with noise.)
  23. 38 Consistency Training vs. Self-Training • Consistency training (UDA, FixMatch): a single model M jointly trained from scratch; labeled data and augmented unlabeled data feed the same model M with a cross-entropy loss; works great with small labeled data. • Self-training (NoisyStudent): requires a converged teacher T, which pseudo-labels augmented unlabeled data for a noised student S trained with a cross-entropy loss; works great with large labeled data.
  24. 41 Settings • Architecture: EfficientNets (Tan & Le, 2019). • Labeled dataset: ImageNet (1.3M images). • Unlabeled dataset: JFT (300M unlabeled images). ◦ Pseudo-labels: soft pseudo-labels (continuous). ◦ Threshold 0.3: select 130M images (81M unique images). • Iterative training: B7 → L2 → L2 → L2
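
A small sketch of the pseudo-label selection described in these settings: keep the soft (continuous) label distributions and filter unlabeled images by the teacher's maximum class probability, using the 0.3 threshold mentioned above. The function name and numpy-array interface are assumptions for illustration; the full method also balances images per class, which is omitted here.

```python
import numpy as np

def select_pseudo_labeled(probs: np.ndarray, threshold: float = 0.3):
    """probs: [num_images, num_classes] teacher softmax outputs.

    Returns the indices of images whose maximum class probability reaches the
    threshold, together with their soft (continuous) pseudo-label distributions.
    """
    confidence = probs.max(axis=1)
    keep = np.where(confidence >= threshold)[0]
    return keep, probs[keep]  # soft pseudo-labels are kept, not argmax'ed

# Toy example: 4 images, 4 classes.
probs = np.array([[0.90, 0.05, 0.03, 0.02],
                  [0.40, 0.30, 0.20, 0.10],
                  [0.28, 0.26, 0.24, 0.22],   # filtered out: max prob < 0.3
                  [0.25, 0.25, 0.25, 0.25]])  # filtered out
idx, soft_labels = select_pseudo_labeled(probs)
print(idx)  # -> [0 1]
```
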
  25. 42 ImageNet Results • SOTA: 2% improvement of top-1 accuracy.
    Method | # Param | Extra Data | Top-1 Acc. | Top-5 Acc.
    GPipe | 557M | - | 84.3% | 97.0%
    EfficientNet-B7 | 66M | - | 85.0% | 97.2%
    EfficientNet-L2 | 480M | - | 85.5% | 97.5%
    ResNeXt-101 WSL | 829M | 3.5B Instagram images labeled with tags | 85.4% | 97.6%
    FixRes ResNeXt-101 WSL | 829M | 3.5B Instagram images labeled with tags | 86.4% | 98.0%
    Noisy Student (EfficientNet-L2) | 480M | 300M unlabeled images | 88.4% | 98.7%
  26. 43 ImageNet Results (same table as slide 25) • SOTA: 2% improvement of top-1 accuracy. • One order of magnitude less unlabeled data.
  27. 44 ImageNet Results (same table as slide 25) • SOTA: 2% improvement of top-1 accuracy. • One order of magnitude less unlabeled data. • Half as many parameters.
  28. 46 Surprising Gains on Robustness Benchmarks • ImageNet-A: difficult images that SOTA models fail on. (Example ImageNet-A image: Sea Lion (NoisyStudent) vs. Lighthouse (Baseline).)
  29. 47 Surprising Gains on Robustness Benchmarks • ImageNet-A: difficult images that SOTA models fail on. • ImageNet-C & P: corrupted and perturbed images (blurring, fogging, rotation and scaling). (Example ImageNet-A image: Sea Lion (NoisyStudent) vs. Lighthouse (Baseline).)
  30. 50 The Importance of Noise in Self-training • Standard data augmentation is used when we use 1.3M unlabeled images. • RandAugment is used when we use 130M unlabeled images.
  31. 51 Summary • Semi-supervised learning works at all scales! • It is possible to use unlabeled images to advance the ImageNet SOTA. • Robustness gains for free. Paper: https://arxiv.org/abs/1911.04252 Code: https://github.com/google-research/noisystudent
  32. 52 Agenda: • (Brief) Unsupervised Data Augmentation: Vision & NLP • Self-training for Vision (NoisyStudent) • Self-training for NLP (STraTA) and beyond
  33. 55 Self-training for Sequence Generation • Noise on the hidden states (i.e., dropout) acts as a regularizer: ◦ forces the model to yield close predictions for similar unlabeled inputs; ◦ helps the model correct some incorrect predictions on unlabeled data. • Improves machine translation and text summarization. https://openreview.net/pdf?id=SJgdnAVKDH
  34. 56 Big picture: Low- vs. High-data Regimes. (Diagram: small labeled data (CIFAR, SVHN) → UDA; large labeled data (ImageNet) → NoisyStudent; few-shot NLP data; labeled NLP data (Search, MT, Sum) → Self-training.)
  35. 57 STraTA: Self-Training with Task Augmentation for Better Few-shot Learning. Thang Luong, Quoc Le, Mohit Iyyer, Grady Simon, Tu Vu. Paper: https://arxiv.org/abs/2109.06270 (EMNLP'21)
  36. 58 The current dominant paradigm in NLP: Pre-training → Fine-tuning on labeled data for the target task. (BERT image by Jay Alammar.)
  37. 59 The current dominant paradigm in NLP: Pre-training → Fine-tuning on labeled data for the target task. (BERT image by Jay Alammar.) Low performance in few-shot settings; high variance.
  38. 60 Recall: self-training. (Diagram: Teacher Model trained on Labeled Data → Inference → Pseudo-labeled Data → Student Model; repeat until convergence.) Question: what pseudo-labeled examples to use?
  39. 61 Top-K vs. All pseudo-labeled examples. Label accuracy of the development set (dev), test set (test), unlabeled data pool (predict), and the top-32 examples (self-train) on the SST-2 sentiment dataset.
  40. 62 Top-K vs. All pseudo-labeled examples. Label accuracy of the development set (dev), test set (test), unlabeled data pool (predict), and the top-32 examples (self-train) on the SST-2 sentiment dataset.
  41. 63 Our self-training algorithm: use a broad distribution. (Diagram: Teacher Model trained on Labeled Data → Inference → Pseudo-labeled Data → Student Model; repeat until convergence.)
  42. 64 Our self-training algorithm: use a broad distribution. (Same diagram as above.) Question: what model to use?
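
To contrast the "use a broad distribution" choice above with the top-K selection compared on the earlier slides, here is a sketch of the two selection strategies for one self-training round. The teacher_predict placeholder and data structures are assumptions for illustration, not the STraTA implementation.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int, float]  # (text, pseudo-label, confidence)

def pseudo_label(unlabeled_texts: List[str],
                 teacher_predict: Callable[[str], Tuple[int, float]]) -> List[Example]:
    """Label every unlabeled text with the teacher's predicted class and confidence."""
    return [(text, *teacher_predict(text)) for text in unlabeled_texts]

def select_top_k(pseudo: List[Example], k: int = 32) -> List[Example]:
    """Top-K strategy: keep only the K most confident pseudo-labeled examples
    (high label accuracy, but a narrow, easy slice of the data)."""
    return sorted(pseudo, key=lambda e: e[2], reverse=True)[:k]

def select_all(pseudo: List[Example]) -> List[Example]:
    """Broad-distribution strategy: keep all pseudo-labeled examples,
    accepting noisier labels in exchange for a wider training distribution."""
    return pseudo
```
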
  43. 65 Task Augmentation: use Natural Language Inference (NLI) as an auxiliary task. (Pipeline: Pre-trained Language Model → Data Generation Model; In-domain Unlabeled Texts → Synthetic In-domain Auxiliary-task Data → Auxiliary-task Model.)
  44. 1. Train an NLI data generator (T5) by fine-tuning a pre-trained generative model on the MNLI dataset in a text-to-text format: [entailment: I have met a woman whom I am attracted to] → I am attracted to a woman I met. 2. Use the model to simulate a large amount of NLI data from target-task unlabeled text: [contradiction: his acting was really awful] → he gave an incredible performance. 3. Create synthetic in-domain NLI training examples: [his acting was really awful, he gave an incredible performance] → contradiction.
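
The three steps above hinge on a text-to-text format. The sketch below shows one way to build the generator's training strings from NLI examples and to turn generated sentences plus target-task text into synthetic NLI examples. The prompt format is an assumption inspired by the slide's examples, generate_fn stands in for a fine-tuned T5's sampling call, and the sentence ordering simply follows the slide.

```python
from typing import Callable, Dict, List

def format_generator_example(label: str, sent_a: str, sent_b: str) -> Dict[str, str]:
    """Step 1: turn an NLI pair into a text-to-text training example,
    e.g. input "[entailment: I have met a woman whom I am attracted to]"
         target "I am attracted to a woman I met".
    (The bracketed prompt format is illustrative, not necessarily the exact one used.)"""
    return {"input": f"[{label}: {sent_a}]", "target": sent_b}

def synthesize_nli_data(unlabeled_texts: List[str],
                        generate_fn: Callable[[str], str],
                        labels=("entailment", "neutral", "contradiction")) -> List[Dict[str, str]]:
    """Steps 2-3: condition the generator on a label and a target-task sentence,
    then pair the generated sentence with the original one as a synthetic NLI example."""
    examples = []
    for text in unlabeled_texts:
        for label in labels:
            generated = generate_fn(f"[{label}: {text}]")
            examples.append({"sentence1": text, "sentence2": generated, "label": label})
    return examples
```
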
  45. 68 STraTA: Self-training with Task Augmentation. (Diagram: Task Augmentation: Pre-trained Language Model → Data Generation Model; Task-specific Unlabeled Texts → Synthetic In-domain Auxiliary-task Data → Auxiliary-task Model. Self-training: Teacher Model + Labeled Data → Inference → Pseudo-labeled Data → Student Model; repeat until convergence; use a broad distribution.)
  46. 72 Main Results: better than Du et al. (2021) even though our baseline is weaker (BERT vs. RoBERTa-Large) and we use fewer examples.
  47. 74 Summary • Self-training is surprisingly effective & works well across domains (vision, NLP, and beyond)! Thank you! (Diagram: small labeled data (CIFAR, SVHN) → UDA; large labeled data (ImageNet) → NoisyStudent; few-shot NLP data → STraTA; labeled NLP data (Search, MT, Sum) → Self-training.)
  48. 75

  49. 77 SSL Benchmarks on CIFAR-10 and SVHN (Sep 2019): 15% error reduction from the previous SOTA (30% in Apr 2019).
  50. 78 SSL Benchmarks on CIFAR-10 and SVHN (Sep 2019): further advancing the SOTA with larger networks.
  51. 79 Later works follow UDA in using strong augmentation! FixMatch (Sohn et al., 2020) & ReMixMatch (Berthelot et al., 2019) use strong augmentation (RandAugment, CTAugment). (Table taken from the FixMatch paper.)
  52. 80

  53. Experimental setup: baselines. LMFT & ITFT: • LMFT: target-task language model fine-tuning (Howard and Ruder, 2018; Gururangan et al., 2020) • ITFT: intermediate-task fine-tuning with MNLI (Phang et al., 2019). Prompt/entailment-based fine-tuning: • LM-BFF: prompt-based fine-tuning (Gao et al., 2021) • EFL: entailment-based fine-tuning (Wang et al., 2021). Du et al. (2021): • SentAugST: retrieval-based augmentation (SentAug) + self-training (ST).