
The curious case of self-training: from vision to language and beyond

wing.nus
March 01, 2022


In this talk, I will discuss the story of a classic semi-supervised learning approach, self-training, which has been quite successful lately. The talk starts with NoisyStudent, a simple self-training method that advanced the state of the art in vision at the time and yielded surprising improvements on robustness benchmarks. I’ll then transition to NLP to talk about STraTA, an approach that combines self-training and task augmentation to achieve strong results in few-shot NLP settings, where only a handful of training examples are available.

Bio:
Thang Luong is currently a Staff Research Scientist at Google Brain. He obtained his PhD in Computer Science from Stanford University, where he pioneered the development of neural machine translation at both Google and Stanford. Dr. Luong has served as an area chair at ACL and NeurIPS and has authored many scientific articles and patents with over 18K citations. He is a co-founder of the Meena project, now the Google LaMDA chatbot, and of VietAI, a non-profit organization that builds a community of world-class AI experts in Vietnam.

Slides link (hosted by permission of Thang): https://speakerdeck.com/wingnus/the-curious-case-of-self-training-from-vision-to-language-and-beyond
YouTube Video recording: https://youtu.be/WZXvJF995pM
Seminar page: https://wing-nus.github.io/nlp-seminar/speaker-thang


Transcript

  1. The Curious Case of Self-training Talk at WING, NUS –

    Mar 1st, 2022 Thang Luong @lmthang
  2. Labeled data 2

  3. Labeled data Unlabeled data 3

  4. Labeled data Unlabeled data 4

  5. 5 Semi-Supervised Learning (SSL)

  6. 6 Our work “Unsupervised Data Augmentation (UDA)” was featured. https://towardsdatascience.com/the-quiet-semi-supervised-revolution-edec1e9ad8c

  7. 7 In the past When there is enough labeled data,

    nobody cares about SSL? Belief of many ML practitioners
  8. 8 But now Belief of many ML practitioners Expectations of

    SSL researchers Past Now
  9. • (Brief) Unsupervised Data Augmentation: Vision & NLP • Self-training

    for Vision (NoisyStudent) • Self-training for NLP (STraTA) and beyond 9 Agenda
  10. 10 Unsupervised Data Augmentation (UDA) for Consistency Training Thang Luong

    Quoc Le Eduard Hovy Zihang Dai Qizhe Xie Paper: https://arxiv.org/abs/1904.12848 (NeurIPS’20) Code: https://github.com/google-research/uda
  11. 11 Add noise to regularize model prediction: VAT [Miyato et

    al., 2018] cat Consistency Training in Semi-Supervised Learning
  12. 12 Add noise to regularize model prediction: VAT [Miyato et

    al., 2018] cat Consistency Training in Semi-Supervised Learning
  13. 13 Add noise to regularize model prediction: VAT [Miyato et

    al., 2018] cat Consistency Training in Semi-Supervised Learning
  14. Consistency Training in Semi-Supervised Learning 14 Add noise to regularize

    model prediction: VAT [Miyato et al., 2018] cat
  15. Label Propagation 15 Graph taken from VAT (Miyato et al.

    2017)
  16. Label Propagation 16 Graph taken from VAT (Miyato et al.

    2017)
  17. Label Propagation 17 Graph taken from VAT (Miyato et al.

    2017)
  18. Unsupervised Data Augmentation (UDA) 18

  19. 19 UDA applies SOTA data augmentation to unlabeled data to

    improve semi-supervised learning
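
The consistency-training objective sketched in slides 11–19 can be written down compactly. The snippet below is a minimal NumPy sketch, not the released implementation (that is at https://github.com/google-research/uda); `model`, `augment`, and the weight `lam` are hypothetical stand-ins, and the real method applies a stop-gradient on the unaugmented branch.

```python
# Minimal sketch of the UDA objective: supervised cross-entropy on labeled
# data plus a KL consistency term between predictions on an unlabeled input
# and on its augmented version.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def uda_loss(model, x_lab, y_lab, x_unlab, augment, lam=1.0):
    """model(x) -> logits; augment(x) -> perturbed x; lam weights consistency."""
    # Supervised term: cross-entropy on labeled examples.
    p_lab = softmax(model(x_lab))
    ce = -np.mean(np.log(p_lab[np.arange(len(y_lab)), y_lab] + 1e-12))
    # Consistency term: KL(p(y|x) || p(y|augment(x))) on unlabeled examples.
    p = softmax(model(x_unlab))            # "fixed" prediction (stop-gradient in practice)
    q = softmax(model(augment(x_unlab)))   # prediction on the augmented input
    kl = np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))
    return ce + lam * kl
```

With an identity `augment` the KL term vanishes and the loss reduces to the supervised cross-entropy, which is why the strength of the augmentation matters so much in the ablations below.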
  20. Augmentation provides Diverse and Valid Perturbations 20 • Back translation

    for Text Classification: ◦ English → French → English ◦ Sampling: diverse (high temperature) vs. valid (low temperature). ◦ Used in QANet (Yu et al., 2018) for labeled data only.
  21. Augmentation injects task-specific knowledge 21 • RandAugment (Cubuk et al.,

    2019) for Image Classification: ◦ Example policies: (Rotate, 0.8, 2), (Brightness, 0.8, 4)
  22. Results 22

  23. Ablation study on data augmentation 23

  24. Ablation study on data augmentation 24 State-of-the-art augmentation is important!

  25. 25 Our UDA paper: Matches Vincent’s mental picture: SSL >

    Supervised! Same for vision (CIFAR, SVHN)
  26. Summary 26 • Data augmentation is an effective perturbation for

    SSL. • UDA significantly improves results in both language and vision. • UDA combines well with transfer learning, e.g., BERT. Paper: https://arxiv.org/abs/1904.12848 Code: https://github.com/google-research/uda
  27. • (Brief) Unsupervised Data Augmentation: Vision & NLP • Self-training

    for Vision (NoisyStudent) • Self-training for NLP (STraTA) and beyond 27 Agenda
  28. So far, success has only been in low-data regime! 28

    Small labeled data (CIFAR, SVHN): state-of-the-art results (FixMatch, ReMixMatch, UDA, MixMatch, S4L, ICT, VAT, etc.). Large labeled data (ImageNet): no state-of-the-art results.
  29. 29 Self-training with Noisy Student improves ImageNet classification Thang Luong

    Quoc Le Eduard Hovy Qizhe Xie Paper: https://arxiv.org/abs/1911.04252 (CVPR’20) Code: https://github.com/google-research/noisystudent
  30. 30 4 simple steps: What is NoisyStudent?

  31. 31 steel arch bridge canoe Labeled data T 4 simple

    steps: 1. Train a classifier on the labeled (L) data (teacher) What is NoisyStudent?
  32. 32 Unlabeled data T Pseudo-labeled data lake curtain bridge 4

    simple steps: 1. Train a classifier on the labeled (L) data (teacher) 2. Infer labels on a much larger unlabeled dataset → P What is NoisyStudent?
  33. 33 4 simple steps: 1. Train a classifier on the

    labeled (L) data (teacher) 2. Infer labels on a much larger unlabeled dataset → P 3. Train a larger classifier on L + P, adding noise (noisy student) steel arch bridge canoe Labeled data Pseudo-labeled data lake curtain bridge S What is NoisyStudent? noise
  34. 34 4 simple steps: 1. Train a classifier on the

    labeled (L) data (teacher) 2. Infer labels on a much larger unlabeled dataset → P 3. Train a larger classifier on L + P, adding noise (noisy student) a. Data Augmentation b. Dropout c. Stochastic Depth What is NoisyStudent?
  35. 35 4 simple steps: 1. Train a classifier on the

    labeled (L) data (teacher) 2. Infer labels on a much larger unlabeled dataset → P 3. Train a larger classifier on L + P, adding noise (noisy student) a. Data Augmentation b. Dropout c. Stochastic Depth 4. Go to step 2, with student as teacher What is NoisyStudent?
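
The four steps above can be sketched as a short loop. This is an illustrative sketch only: `train` and `predict` are hypothetical callables standing in for the real EfficientNet training and inference (the released code is at https://github.com/google-research/noisystudent).

```python
# Minimal sketch of the NoisyStudent loop.
def noisy_student(train, predict, labeled, unlabeled, rounds=3):
    """train(data, noisy) -> model; predict(model, xs) -> pseudo-labels."""
    # Step 1: train a teacher on the labeled data (the teacher is not noised).
    teacher = train(labeled, noisy=False)
    for _ in range(rounds):
        # Step 2: infer (soft) pseudo-labels on the much larger unlabeled set.
        pseudo = list(zip(unlabeled, predict(teacher, unlabeled)))
        # Step 3: train an equal-or-larger student on labeled + pseudo-labeled
        # data with noise (data augmentation, dropout, stochastic depth).
        student = train(labeled + pseudo, noisy=True)
        # Step 4: the student becomes the next teacher; repeat.
        teacher = student
    return teacher
```

The asymmetry is the point: the teacher infers without noise, while the student is trained with noise, which forces it to generalize beyond the teacher's labels.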
  36. 36

  37. NoisyStudent vs. Distillation • Distillation focuses on speed rather than

    quality ◦ no student noise, no unlabeled data, smaller student 37 Labeled data T S Distillation Labeled data T NoisyStudent Unlabeled data S noise
  38. Self-Training (NoisyStudent) Requires a converged teacher T Works great with

    large labeled data Consistency Training vs. Self-Training 38 Labeled data M Consistency training (UDA, FixMatch) Single model M jointly trained from scratch Works great with small labeled data Unlabeled data augment M M Cross-entropy loss Unlabeled data augment T S Cross-entropy loss noise Loss weak augment
  39. Experiments 39

  40. Settings 40 • Architecture: EfficientNets.

  41. Settings 41 • Architecture: EfficientNets (Tan & Le, 2019). •

    Labeled dataset: ImageNet (1.3M images). • Unlabeled dataset: JFT (300M unlabeled images). ◦ Pseudo-labels: soft pseudo-labels (continuous). ◦ Threshold 0.3: select 130M images (81M unique images). • Iterative training: B7->L2->L2->L2
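
The confidence filter described in these settings (keep an unlabeled image only if the highest soft-label probability passes the threshold) can be sketched as follows; the function name is illustrative, not from the released code.

```python
# Sketch of soft-pseudo-label filtering by teacher confidence.
def filter_pseudo_labels(images, soft_labels, threshold=0.3):
    """Keep (image, soft_label) pairs whose max class probability >= threshold."""
    kept = []
    for img, probs in zip(images, soft_labels):
        if max(probs) >= threshold:
            kept.append((img, probs))
    return kept
```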
  42. ImageNet Results 42 • SOTA: 2% improvement of top-1 accuracy.

    Method                           # Param  Extra Data                               Top-1 Acc.  Top-5 Acc.
    GPipe                            557M     -                                        84.3%       97.0%
    EfficientNet-B7                  66M      -                                        85.0%       97.2%
    EfficientNet-L2                  480M     -                                        85.5%       97.5%
    ResNeXt-101 WSL                  829M     3.5B Instagram images labeled with tags  85.4%       97.6%
    FixRes ResNeXt-101 WSL           829M     3.5B Instagram images labeled with tags  86.4%       98.0%
    Noisy Student (EfficientNet-L2)  480M     300M unlabeled images                    88.4%       98.7%
  43. ImageNet Results 43 • SOTA: 2% improvement of top-1 accuracy.

    • One order of magnitude less unlabeled data. (Table as in slide 42.)
  44. ImageNet Results 44 • SOTA: 2% improvement of top-1 accuracy.

    • One order of magnitude less unlabeled data. • Half as many parameters. (Table as in slide 42.)
  45. Improvements across model sizes 45

  46. Surprising Gains on Robustness Benchmarks 46 ImageNet-A: difficult images that SOTA

    models fail on. Sea Lion (NoisyStudent) Lighthouse (Baseline) ImageNet-A
  47. Surprising Gains on Robustness Benchmarks 47 ImageNet-A: difficult images that SOTA

    models fail on. ImageNet-C & P: corrupted and perturbed images (blurring, fogging, rotation and scaling). Sea Lion (NoisyStudent) Lighthouse (Baseline) ImageNet-A
  48. ImageNet-C 48 Parking Meter (NoisyStudent) Vacuum (Baseline) Swing (NoisyStudent) Mosquito

    Net (Baseline)
  49. ImageNet-P 49

  50. The Importance of Noise in Self-training 50 • Standard data

    augmentation is used when we use 1.3M unlabeled images. • RandAugment is used when we use 130M unlabeled images.
  51. Summary 51 • Semi-supervised learning works at all scales! •

    Possible to use unlabeled images to advance ImageNet SOTA • Robustness gains for free. Paper: https://arxiv.org/abs/1911.04252 Code: https://github.com/google-research/noisystudent
  52. • (Brief) Unsupervised Data Augmentation: Vision & NLP • Self-training

    for Vision (NoisyStudent) • Self-training for NLP (STraTA) and Beyond 52 Agenda
  53. NoisyStudent in AlphaFold2 53 https://www.nature.com/articles/s41586-021-03819-2

  54. Self-training improves Google Search 54 https://ai.googleblog.com/2021/07/from-vision-to-language-semi-supervised.html

  55. Self-training for Sequence Generation 55 • Noise on the hidden

    states (i.e. dropout) acts as a regularizer ◦ forces the model to yield close predictions for similar unlabeled inputs. ◦ helps the model correct some incorrect predictions on unlabeled data. • Improves machine translation and text summarization https://openreview.net/pdf?id=SJgdnAVKDH
  56. Big picture: Low vs High-data Regimes 56 Small labeled data

    (CIFAR, SVHN) Large labeled data (ImageNet) UDA NoisyStudent Few-shot NLP data Labeled NLP data (Search, MT, Sum) Self-training
  57. 57 STraTA: Self-Training with Task Augmentation for Better Few-shot Learning

    Paper: https://arxiv.org/abs/2109.06270 (EMNLP’21) Thang Luong Quoc Le Mohit Iyyer Grady Simon Tu Vu
  58. The current dominant paradigm in NLP 58 BERT image by

    Jay Alammar Fine-tuning labeled data target task Pre-training
  59. The current dominant paradigm in NLP 59 BERT image by

    Jay Alammar Fine-tuning labeled data target task Pre-training Low performance in few-shot settings High Variance
  60. Recall: self-training 60 Student Model Labeled Data Pseudo-labeled Data Inference

    Repeat until convergence Teacher Model what pseudo-labeled examples to use?
  61. Top-K vs. All pseudo-labeled examples 61 Label accuracy of development

    set (dev), test set (test), unlabeled data pool (predict), top-32 examples (self-train) on the SST-2 sentiment dataset.
  62. Top-K vs. All pseudo-labeled examples 62 Label accuracy of development

    set (dev), test set (test), unlabeled data pool (predict), top-32 examples (self-train) on the SST-2 sentiment dataset.
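
The top-K selection strategy being measured here can be sketched as below; this is an illustrative helper, not STraTA's released code, and the approach STraTA ultimately argues for is using the full (broad) pseudo-labeled distribution instead of only the top K.

```python
# Sketch of top-K pseudo-label selection: keep the k most confident
# pseudo-labeled examples per predicted class.
def top_k_per_class(examples, probs, k):
    by_class = {}
    for ex, p in zip(examples, probs):
        label = max(range(len(p)), key=lambda c: p[c])  # argmax class
        by_class.setdefault(label, []).append((p[label], ex))
    selected = []
    for label, items in by_class.items():
        items.sort(reverse=True)  # most confident first
        selected += [(ex, label) for _, ex in items[:k]]
    return selected
```

As the accuracy plots suggest, the top-K pool is much cleaner than the full pool, but restricting training to it also discards most of the unlabeled data.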
  63. Our self-training algorithm 63 Use a broad Distribution Student Model

    Labeled Data Pseudo-labeled Data Inference Repeat until convergence Teacher Model
  64. Our self-training algorithm 64 Use a broad Distribution Student Model

    Labeled Data Pseudo-labeled Data Inference Repeat until convergence Teacher Model what model to use?
  65. Task Augmentation 65 In-domain Unlabeled Texts Synthetic In-domain Auxiliary-task Data

    Pre-trained Language Model Auxiliary-task Model Data Generation Model Use Natural Language Inference (NLI) as an auxiliary-task.
  66. 1. Train an NLI data generator by fine-tuning a pre-trained generative model (T5) on the MNLI dataset in a

    text-to-text format: [entailment: I have met a woman whom I am attracted to] → I am attracted to a woman I met. 2. Use the model to generate a large amount of NLI data from target-task unlabeled text: [contradiction: his acting was really awful] → he gave an incredible performance. 3. Create synthetic in-domain NLI training examples: [his acting was really awful, he gave an incredible performance] → contradiction
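
The text-to-text formatting in steps 1–3 above can be sketched as two small helpers. The separators and field names here are illustrative, not the exact format in the STraTA release.

```python
# Sketch of the task-augmentation data format.
def format_generator_example(label, premise, hypothesis):
    """One MNLI pair in text-to-text form: input 'label: premise', target hypothesis."""
    return (f"{label}: {premise}", hypothesis)

def to_nli_example(label, premise, generated_hypothesis):
    """Turn a generated (label, premise, hypothesis) triple into a synthetic
    in-domain NLI training example for the auxiliary-task model."""
    return {"premise": premise,
            "hypothesis": generated_hypothesis,
            "label": label}
```

Conditioning the generator on the label (step 2) is what lets one produce entailment, neutral, and contradiction hypotheses on demand from unlabeled target-task text.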
  67. Example outputs 67

  68. STraTA: Self-training with Task Augmentation 68 Use a broad Distribution

    Self-training Task-specific Unlabeled Texts Synthetic In-domain Auxiliary-task Data Pre-trained Language Model Auxiliary-task Model Student Model Labeled Data Pseudo-labeled Data Inference Repeat until convergence Teacher Model Data Generation Model Task Augmentation
  69. STraTA substantially improves sample efficiency 69

  70. 70 Main Results Task Augmentation (TA) and Self-training (ST) are

    independently effective
  71. 71 Main Results Same trend for BERT Large

  72. 72 Main Results Better than Du et al. (2021), even

    though our baseline is weaker (BERT vs. RoBERTa Large) and uses fewer examples
  73. STraTA improves a randomly-initialized base model 73

  74. Summary 74 • Self-training is surprisingly effective & works well

    across domains (vision, NLP, and beyond)! Thank you! Small labeled data (CIFAR, SVHN) Large labeled data (ImageNet) UDA NoisyStudent Few-shot NLP data Labeled NLP data (Search, MT, Sum) Self-training STraTA
  75. 75

  76. 76 SSL Benchmarks on CIFAR-10 and SVHN (Sep, 2019)

  77. 77 SSL Benchmarks on CIFAR-10 and SVHN (Sep, 2019) 15%

    error reduction from previous SOTA (30% in Apr, 2019)
  78. 78 SSL Benchmarks on CIFAR-10 and SVHN (Sep, 2019) Further

    advancing the SOTA with larger networks
  79. Works follow UDA in using strong augmentation! 79 FixMatch (Sohn

    et al., 2020) & ReMixMatch (Berthelot et al., 2019) use strong augmentation (RandAugment, CTAugment) (Table taken from FixMatch paper)
  80. 80

  81. Towards realistic evaluation in few-shot learning

  82. Experimental setup: datasets

  83. Experimental setup: baselines LMFT & ITFT • LMFT: target-task language

    model fine-tuning (Howard and Ruder, 2018; Gururangan et al., 2020) • ITFT: intermediate-task fine-tuning with MNLI (Phang et al., 2019) Prompt/entailment-based fine-tuning • LM-BFF: prompt-based fine-tuning (Gao et al., 2021) • EFL: entailment-based fine-tuning (Wang et al., 2021) Du et al. (2021) • SentAugST: Retrieval-based augmentation (SentAug) + self-training (ST)
  84. Main results