
The curious case of self-training: from vision to language and beyond

wing.nus
March 01, 2022


In this talk, I will discuss the story of a classic semi-supervised learning approach, self-training, which has been quite successful lately. The talk starts with NoisyStudent, a simple self-training method that advanced the state of the art in vision at the time and yielded surprising improvements on robustness benchmarks. I’ll then transition to NLP to talk about STraTA, an approach that combines self-training and task augmentation to achieve strong results in few-shot NLP settings, where only a handful of training examples are available.

Bio:
Thang Luong is currently a Staff Research Scientist at Google Brain. He obtained his PhD in Computer Science from Stanford University, where he pioneered the development of neural machine translation at both Google and Stanford. Dr. Luong has served as an area chair at ACL and NeurIPS and has authored many scientific articles and patents with over 18K citations. He is a co-founder of the Meena project, now the Google LaMDA chatbot, and of VietAI, a non-profit organization that builds a community of world-class AI experts in Vietnam.

Slides link (hosted by permission of Thang): https://speakerdeck.com/wingnus/the-curious-case-of-self-training-from-vision-to-language-and-beyond
YouTube Video recording: https://youtu.be/WZXvJF995pM
Seminar page: https://wing-nus.github.io/nlp-seminar/speaker-thang


Transcript

  1. The Curious Case of Self-training Talk at WING, NUS –

    Mar 1st, 2022 Thang Luong @lmthang
  2. Labeled data 2

  3. Labeled data Unlabeled data 3

  4. Labeled data Unlabeled data 4

  5. 5 Semi-Supervised Learning (SSL)

  6. 6 Our work “Unsupervised Data Augmentation (UDA)” was featured. https://towardsdatascience.com/the-quiet-semi-supervised-revolution-edec1e9ad8c

  7. 7 In the past When there is enough labeled data,

    nobody cares about SSL? Belief of many ML practitioners
  8. 8 But now Belief of many ML practitioners Expectations of

    SSL researchers Past Now
  9. • (Brief) Unsupervised Data Augmentation: Vision & NLP • Self-training

    for Vision (NoisyStudent) • Self-training for NLP (STraTA) and beyond 9 Agenda
  10. 10 Unsupervised Data Augmentation (UDA) for Consistency Training Thang Luong

    Quoc Le Eduard Hovy Zihang Dai Qizhe Xie Paper: https://arxiv.org/abs/1904.12848 (NeurIPS’20) Code: https://github.com/google-research/uda
  11. 11 Add noise to regularize model prediction: VAT [Miyato et

    al., 2018] cat Consistency Training in Semi-Supervised Learning
  12. 12 Add noise to regularize model prediction: VAT [Miyato et

    al., 2018] cat Consistency Training in Semi-Supervised Learning
  13. 13 Add noise to regularize model prediction: VAT [Miyato et

    al., 2018] cat Consistency Training in Semi-Supervised Learning
  14. Consistency Training in Semi-Supervised Learning 14 Add noise to regularize

    model prediction: VAT [Miyato et al., 2018] cat
  15. Label Propagation 15 Graph taken from VAT (Miyato et al.

    2017)
  16. Label Propagation 16 Graph taken from VAT (Miyato et al.

    2017)
  17. Label Propagation 17 Graph taken from VAT (Miyato et al.

    2017)
  18. Unsupervised Data Augmentation (UDA) 18

  19. 19 UDA applies SOTA data augmentation to unlabeled data to

    improve semi-supervised learning
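
The consistency-training objective sketched in slides 11–19 can be written down compactly. The snippet below is a minimal NumPy sketch, not the released implementation (that is at https://github.com/google-research/uda); `model`, `augment`, and the weight `lam` are hypothetical stand-ins, and the real method applies a stop-gradient on the unaugmented branch.

```python
# Minimal sketch of the UDA objective: supervised cross-entropy on labeled
# data plus a KL consistency term between predictions on an unlabeled input
# and on its augmented version.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def uda_loss(model, x_lab, y_lab, x_unlab, augment, lam=1.0):
    """model(x) -> logits; augment(x) -> perturbed x; lam weights consistency."""
    # Supervised term: cross-entropy on labeled examples.
    p_lab = softmax(model(x_lab))
    ce = -np.mean(np.log(p_lab[np.arange(len(y_lab)), y_lab] + 1e-12))
    # Consistency term: KL(p(y|x) || p(y|augment(x))) on unlabeled examples.
    p = softmax(model(x_unlab))            # "fixed" prediction (stop-gradient in practice)
    q = softmax(model(augment(x_unlab)))   # prediction on the augmented input
    kl = np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))
    return ce + lam * kl
```

With an identity `augment` the KL term vanishes and the loss reduces to the supervised cross-entropy, which is why the strength of the augmentation matters so much in the ablations below.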
  20. Augmentation provides Diverse and Valid Perturbations 20 • Back translation

    for Text Classification: ◦ English → French → English ◦ Sampling: diverse (high temperature) vs. valid (low temperature). ◦ Used in QANet (Yu et al., 2018) for labeled data only.
  21. Augmentation injects task-specific knowledge 21 • RandAugment (Cubuk et al.,

    2019) for Image Classification: ◦ Example policies: (Rotate, 0.8, 2), (Brightness, 0.8, 4)
  22. Results 22

  23. Ablation study on data augmentation 23

  24. Ablation study on data augmentation 24 State-of-the-art augmentation is important!

  25. 25 Our UDA paper: Matches Vincent’s mental picture: SSL >

    Supervised! Same for vision (CIFAR, SVHN)
  26. Summary 26 • Data augmentation is an effective perturbation for

    SSL. • UDA significantly improves results in both language and vision. • UDA combines well with transfer learning, e.g., BERT. Paper: https://arxiv.org/abs/1904.12848 Code: https://github.com/google-research/uda
  27. • (Brief) Unsupervised Data Augmentation: Vision & NLP • Self-training

    for Vision (NoisyStudent) • Self-training for NLP (STraTA) and beyond 27 Agenda
  28. So far, success has only been in low-data regime! 28

    Small labeled data (CIFAR, SVHN): state-of-the-art results (FixMatch, ReMixMatch, UDA, MixMatch, S4L, ICT, VAT, etc.). Large labeled data (ImageNet): no state-of-the-art results.
  29. 29 Self-training with Noisy Student improves ImageNet classification Thang Luong

    Quoc Le Eduard Hovy Qizhe Xie Paper: https://arxiv.org/abs/1911.04252 (CVPR’20) Code: https://github.com/google-research/noisystudent
  30. 30 4 simple steps: What is NoisyStudent?

  31. 31 steel arch bridge canoe Labeled data T 4 simple

    steps: 1. Train a classifier on the labeled (L) data (teacher) What is NoisyStudent?
  32. 32 Unlabeled data T Pseudo-labeled data lake curtain bridge 4

    simple steps: 1. Train a classifier on the labeled (L) data (teacher) 2. Infer labels on a much larger unlabeled dataset → P What is NoisyStudent?
  33. 33 4 simple steps: 1. Train a classifier on the

    labeled (L) data (teacher) 2. Infer labels on a much larger unlabeled dataset → P 3. Train a larger classifier on L + P, adding noise (noisy student) steel arch bridge canoe Labeled data Pseudo-labeled data lake curtain bridge S What is NoisyStudent? noise
  34. 34 4 simple steps: 1. Train a classifier on the

    labeled (L) data (teacher) 2. Infer labels on a much larger unlabeled dataset → P 3. Train a larger classifier on L + P, adding noise (noisy student) a. Data Augmentation b. Dropout c. Stochastic Depth What is NoisyStudent?
  35. 35 4 simple steps: 1. Train a classifier on the

    labeled (L) data (teacher) 2. Infer labels on a much larger unlabeled dataset → P 3. Train a larger classifier on L + P, adding noise (noisy student) a. Data Augmentation b. Dropout c. Stochastic Depth 4. Go to step 2, with student as teacher What is NoisyStudent?
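
The four steps above can be sketched as a short loop. This is an illustrative sketch only: `train` and `predict` are hypothetical callables standing in for the real EfficientNet training and inference (the released code is at https://github.com/google-research/noisystudent).

```python
# Minimal sketch of the NoisyStudent loop.
def noisy_student(train, predict, labeled, unlabeled, rounds=3):
    """train(data, noisy) -> model; predict(model, xs) -> pseudo-labels."""
    # Step 1: train a teacher on the labeled data (the teacher is not noised).
    teacher = train(labeled, noisy=False)
    for _ in range(rounds):
        # Step 2: infer (soft) pseudo-labels on the much larger unlabeled set.
        pseudo = list(zip(unlabeled, predict(teacher, unlabeled)))
        # Step 3: train an equal-or-larger student on labeled + pseudo-labeled
        # data with noise (data augmentation, dropout, stochastic depth).
        student = train(labeled + pseudo, noisy=True)
        # Step 4: the student becomes the next teacher; repeat.
        teacher = student
    return teacher
```

The asymmetry is the point: the teacher infers without noise, while the student is trained with noise, which forces it to generalize beyond the teacher's labels.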
  36. 36

  37. NoisyStudent vs. Distillation • Distillation focuses on speed rather than

    quality ◦ no student noise, no unlabeled data, smaller student 37 Labeled data T S Distillation Labeled data T NoisyStudent Unlabeled data S noise
  38. Self-Training (NoisyStudent) Requires a converged teacher T Works great with

    large labeled data Consistency Training vs. Self-Training 38 Labeled data M Consistency training (UDA, FixMatch) Single model M jointly trained from scratch Works great with small labeled data Unlabeled data augment M M Cross-entropy loss Unlabeled data augment T S Cross-entropy loss noise Loss weak augment
  39. Experiments 39

  40. Settings 40 • Architecture: EfficientNets.

  41. Settings 41 • Architecture: EfficientNets (Tan & Le, 2019). •

    Labeled dataset: ImageNet (1.3M images). • Unlabeled dataset: JFT (300M unlabeled images). ◦ Pseudo-labels: soft pseudo-labels (continuous). ◦ Threshold 0.3: select 130M images (81M unique images). • Iterative training: B7->L2->L2->L2
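
The confidence filter described in these settings (keep an unlabeled image only if the highest soft-label probability passes the threshold) can be sketched as follows; the function name is illustrative, not from the released code.

```python
# Sketch of soft-pseudo-label filtering by teacher confidence.
def filter_pseudo_labels(images, soft_labels, threshold=0.3):
    """Keep (image, soft_label) pairs whose max class probability >= threshold."""
    kept = []
    for img, probs in zip(images, soft_labels):
        if max(probs) >= threshold:
            kept.append((img, probs))
    return kept
```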
  42. ImageNet Results 42 • SOTA: 2% improvement of top-1 accuracy.

    Method                           # Param  Extra Data                               Top-1 Acc.  Top-5 Acc.
    GPipe                            557M     -                                        84.3%       97.0%
    EfficientNet-B7                  66M      -                                        85.0%       97.2%
    EfficientNet-L2                  480M     -                                        85.5%       97.5%
    ResNeXt-101 WSL                  829M     3.5B Instagram images labeled with tags  85.4%       97.6%
    FixRes ResNeXt-101 WSL           829M     3.5B Instagram images labeled with tags  86.4%       98.0%
    Noisy Student (EfficientNet-L2)  480M     300M unlabeled images                    88.4%       98.7%
  43. ImageNet Results 43 • SOTA: 2% improvement of top-1 accuracy.

    • One order of magnitude less unlabeled data. (Table as in slide 42.)
  44. ImageNet Results 44 • SOTA: 2% improvement of top-1 accuracy.

    • One order of magnitude less unlabeled data. • Half as many parameters. (Table as in slide 42.)
  45. Improvements across model sizes 45

  46. Surprising Gains on Robustness Benchmarks 46 ImageNet-A: difficult images that SOTA

    models fail on. Sea Lion (NoisyStudent) Lighthouse (Baseline) ImageNet-A
  47. Surprising Gains on Robustness Benchmarks 47 ImageNet-A: difficult images that SOTA

    models fail on. ImageNet-C & P: corrupted and perturbed images (blurring, fogging, rotation and scaling). Sea Lion (NoisyStudent) Lighthouse (Baseline) ImageNet-A
  48. ImageNet-C 48 Parking Meter (NoisyStudent) Vacuum (Baseline) Swing (NoisyStudent) Mosquito

    Net (Baseline)
  49. ImageNet-P 49

  50. The Importance of Noise in Self-training 50 • Standard data

    augmentation is used when we use 1.3M unlabeled images. • RandAugment is used when we use 130M unlabeled images.
  51. Summary 51 • Semi-supervised learning works at all scales! •

    Possible to use unlabeled images to advance ImageNet SOTA • Robustness gains for free. Paper: https://arxiv.org/abs/1911.04252 Code: https://github.com/google-research/noisystudent
  52. • (Brief) Unsupervised Data Augmentation: Vision & NLP • Self-training

    for Vision (NoisyStudent) • Self-training for NLP (STraTA) and Beyond 52 Agenda
  53. NoisyStudent in AlphaFold2 53 https://www.nature.com/articles/s41586-021-03819-2

  54. Self-training improves Google Search 54 https://ai.googleblog.com/2021/07/from-vision-to-language-semi-supervised.html

  55. Self-training for Sequence Generation 55 • Noise on the hidden

    states (i.e. dropout) acts as a regularizer ◦ forces the model to yield close predictions for similar unlabeled inputs. ◦ helps the model correct some incorrect predictions on unlabeled data. • Improves machine translation and text summarization https://openreview.net/pdf?id=SJgdnAVKDH
  56. Big picture: Low vs High-data Regimes 56 Small labeled data

    (CIFAR, SVHN) Large labeled data (ImageNet) UDA NoisyStudent Few-shot NLP data Labeled NLP data (Search, MT, Sum) Self-training
  57. 57 STraTA: Self-Training with Task Augmentation for Better Few-shot Learning

    Paper: https://arxiv.org/abs/2109.06270 (EMNLP’21) Thang Luong Quoc Le Mohit Iyyer Grady Simon Tu Vu
  58. The current dominant paradigm in NLP 58 BERT image by

    Jay Alammar Fine-tuning labeled data target task Pre-training
  59. The current dominant paradigm in NLP 59 BERT image by

    Jay Alammar Fine-tuning labeled data target task Pre-training Low performance in few-shot settings High Variance
  60. Recall: self-training 60 Student Model Labeled Data Pseudo-labeled Data Inference

    Repeat until convergence Teacher Model what pseudo-labeled examples to use?
  61. Top-K vs. All pseudo-labeled examples 61 Label accuracy of development

    set (dev), test set (test), unlabeled data pool (predict), top-32 examples (self-train) on the SST-2 sentiment dataset.
  62. Top-K vs. All pseudo-labeled examples 62 Label accuracy of development

    set (dev), test set (test), unlabeled data pool (predict), top-32 examples (self-train) on the SST-2 sentiment dataset.
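
The top-K selection strategy being measured here can be sketched as below; this is an illustrative helper, not STraTA's released code, and the approach STraTA ultimately argues for is using the full (broad) pseudo-labeled distribution instead of only the top K.

```python
# Sketch of top-K pseudo-label selection: keep the k most confident
# pseudo-labeled examples per predicted class.
def top_k_per_class(examples, probs, k):
    by_class = {}
    for ex, p in zip(examples, probs):
        label = max(range(len(p)), key=lambda c: p[c])  # argmax class
        by_class.setdefault(label, []).append((p[label], ex))
    selected = []
    for label, items in by_class.items():
        items.sort(reverse=True)  # most confident first
        selected += [(ex, label) for _, ex in items[:k]]
    return selected
```

As the accuracy plots suggest, the top-K pool is much cleaner than the full pool, but restricting training to it also discards most of the unlabeled data.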
  63. Our self-training algorithm 63 Use a broad Distribution Student Model

    Labeled Data Pseudo-labeled Data Inference Repeat until convergence Teacher Model
  64. Our self-training algorithm 64 Use a broad Distribution Student Model

    Labeled Data Pseudo-labeled Data Inference Repeat until convergence Teacher Model what model to use?
  65. Task Augmentation 65 In-domain Unlabeled Texts Synthetic In-domain Auxiliary-task Data

    Pre-trained Language Model Auxiliary-task Model Data Generation Model Use Natural Language Inference (NLI) as an auxiliary-task.
  66. 1. Train an NLI data generator by fine-tuning a pre-trained generative model (T5) on the MNLI dataset in a

    text-to-text format: [entailment: I have met a woman whom I am attracted to] → I am attracted to a woman I met. 2. Use the model to generate a large amount of NLI data from target-task unlabeled text: [contradiction: his acting was really awful] → he gave an incredible performance. 3. Create synthetic in-domain NLI training examples: [his acting was really awful, he gave an incredible performance] → contradiction
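
The text-to-text formatting in steps 1–3 above can be sketched as two small helpers. The separators and field names here are illustrative, not the exact format in the STraTA release.

```python
# Sketch of the task-augmentation data format.
def format_generator_example(label, premise, hypothesis):
    """One MNLI pair in text-to-text form: input 'label: premise', target hypothesis."""
    return (f"{label}: {premise}", hypothesis)

def to_nli_example(label, premise, generated_hypothesis):
    """Turn a generated (label, premise, hypothesis) triple into a synthetic
    in-domain NLI training example for the auxiliary-task model."""
    return {"premise": premise,
            "hypothesis": generated_hypothesis,
            "label": label}
```

Conditioning the generator on the label (step 2) is what lets one produce entailment, neutral, and contradiction hypotheses on demand from unlabeled target-task text.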
  67. Example outputs 67

  68. STraTA: Self-training with Task Augmentation 68 Use a broad Distribution

    Self-training Task-specific Unlabeled Texts Synthetic In-domain Auxiliary-task Data Pre-trained Language Model Auxiliary-task Model Student Model Labeled Data Pseudo-labeled Data Inference Repeat until convergence Teacher Model Data Generation Model Task Augmentation
  69. STraTA substantially improves sample efficiency 69

  70. 70 Main Results Task Augmentation (TA) and Self-training (ST) are

    independently effective
  71. 71 Main Results Same trend for BERT Large

  72. 72 Main Results Better than Du et al. (2021), even

    though our baseline is weaker (BERT vs. RoBERTa Large) and uses fewer examples
  73. STraTA improves a randomly-initialized base model 73

  74. Summary 74 • Self-training is surprisingly effective & works well

    across domains (vision, NLP, and beyond)! Thank you! Small labeled data (CIFAR, SVHN) Large labeled data (ImageNet) UDA NoisyStudent Few-shot NLP data Labeled NLP data (Search, MT, Sum) Self-training STraTA
  75. 75

  76. 76 SSL Benchmarks on CIFAR-10 and SVHN (Sep, 2019)

  77. 77 SSL Benchmarks on CIFAR-10 and SVHN (Sep, 2019) 15%

    error reduction from previous SOTA (30% in Apr, 2019)
  78. 78 SSL Benchmarks on CIFAR-10 and SVHN (Sep, 2019) Further

    advancing the SOTA with larger networks
  79. Works follow UDA in using strong augmentation! 79 FixMatch (Sohn

    et al., 2020) & ReMixMatch (Berthelot et al., 2019) use strong augmentation (RandAugment, CTAugment) (Table taken from FixMatch paper)
  80. 80

  81. Towards realistic evaluation in few-shot learning

  82. Experimental setup: datasets

  83. Experimental setup: baselines LMFT & ITFT • LMFT: target-task language

    model fine-tuning (Howard and Ruder, 2018; Gururangan et al., 2020) • ITFT: intermediate-task fine-tuning with MNLI (Phang et al., 2019) Prompt/entailment-based fine-tuning • LM-BFF: prompt-based fine-tuning (Gao et al., 2021) • EFL: entailment-based fine-tuning (Wang et al., 2021) Du et al. (2021) • SentAugST: Retrieval-based augmentation (SentAug) + self-training (ST)
  84. Main results