
A review of 'Unsupervised Data Augmentation for Consistency Training' (UDA; Xie+, 2019)

Shuntaro Yada

July 22, 2019

Transcript

  1. Shuntaro Yada (UTokyo LISLab / CSIRO Data61 LASC), 22 July 2019
     A review of ‘Unsupervised Data Augmentation for Consistency Training’ (Xie+, ’19)
     Tables and figures in this slide are borrowed from the original paper.

  2. Overview
     • This work proposed a semi-supervised data augmentation framework called Unsupervised Data Augmentation (UDA)
     • Whereas existing data augmentation methods only use labelled data for augmentation, this work uses unlabelled data (and does not augment the labelled data!)
     Talk outline: Overview – Unsupervised? – Proposed method – Experimental results
     (Xie+, ’19)

  3. By the way… Unsupervised?
     • Some argue that this setting should rather be called ‘self-supervised’
       – A tweet thread (left)
       – An introduction to self-supervised learning in computer vision
     • NOTE: this paper uses ‘unsupervised’ and ‘semi-supervised’ interchangeably

  4. Proposed framework
     [Figure: the UDA framework. A small labelled dataset (x, y∗) is trained with a supervised cross-entropy loss. A huge unlabelled dataset x is paired with an augmented version x̂, produced by augmentation methods suited to the data type (back translation, AutoAugment for image data, TF-IDF word replacement), and an unsupervised consistency loss is computed between the model's prediction on x and its prediction on x̂. The final loss combines the supervised and consistency terms.]
     • Minimises the KL divergence (between the predictions on x and on x̂)
     • Uses augmentation methods well suited to each data type
     (A sketch of the combined loss follows below.)

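To make the framework concrete, here is a minimal PyTorch sketch of the combined loss described above. It is an illustration, not the authors' released code: `model` stands for any classifier mapping a batch of inputs to logits, `x_unsup_aug` is assumed to be produced beforehand by an augmentation such as back translation or TF-IDF word replacement, and `lambda_u` is an assumed weight on the unsupervised term. The paper's additional training techniques (e.g. training signal annealing and confidence-based masking) are omitted.

```python
import torch
import torch.nn.functional as F

def uda_loss(model, x_sup, y_sup, x_unsup, x_unsup_aug, lambda_u=1.0):
    """Supervised cross-entropy plus unsupervised KL-consistency loss (sketch)."""
    # Supervised term: ordinary cross-entropy on the (small) labelled batch.
    sup_loss = F.cross_entropy(model(x_sup), y_sup)

    # Unsupervised term: the prediction on the original unlabelled example is
    # used as a fixed target (no gradient flows through it), and the model is
    # encouraged to predict consistently on the augmented version.
    with torch.no_grad():
        target = F.softmax(model(x_unsup), dim=-1)
    aug_log_probs = F.log_softmax(model(x_unsup_aug), dim=-1)
    consistency = F.kl_div(aug_log_probs, target, reduction="batchmean")

    # Final loss: supervised term plus weighted consistency term.
    return sup_loss + lambda_u * consistency
```

The `torch.no_grad()` block reflects that only the prediction on the augmented example receives gradients from the consistency term, which is what lets a huge pool of unlabelled data be consumed alongside the small labelled set.
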
  5. Experimental results
     [Screenshot of the paper's ‘Main results’ discussion and Table 1 (error rates on the text classification datasets); the excerpt and table are reproduced in full on the next slide. Annotations added on the screenshot:]
     • Transformer
     • 10/class for binary tasks (IMDb, *-2)
     • 500/class for 5-class tasks (*-5)
     • 10 each for the 14 classes in DBpedia
     • ‘LM is fine-tuned in-domain’ (i.e. BERT_FINETUNE)

  6. Experimental results
     • UDA improves BERT models, especially on binary classification tasks with only a few labels (just 20 samples)
       – Scores on the 5-class tasks still show a gap in comparison to the SOTA scores
     • An unexpected finding here is that BERT_FINETUNE alone already improves the results substantially, even with very little labelled data
       – Compare the computational cost of LM fine-tuning vs. data augmentation
     (Xie+, ’19)
     [Excerpt from the paper shown on the slide:]
     Main results. The results for text classification are shown in Table 1 with three key observations.
     • Firstly, UDA consistently improves the performance regardless of the model initialization scheme. Most notably, even when BERT is further finetuned on in-domain data, UDA can still significantly reduce the error rate from 6.50% to 4.20% on IMDb. This result shows that the benefits UDA provides are complementary to that of representation learning.
     • Secondly, with a significantly smaller amount of supervised examples, UDA can offer decent or even competitive performances compared to the SOTA model trained with full supervised data. In particular, on binary sentiment classification tasks, with only 20 supervised examples, UDA outperforms the previous SOTA trained on full supervised data on IMDb and gets very close on Yelp-2 and Amazon-2.
     • Finally, we also note that five-category sentiment classification tasks turn out to be much more difficult than their binary counterparts and there still exists a clear gap between UDA with 500 labeled examples per class and BERT trained on the entire supervised set. This suggests a room for further improvement in the future.

     Fully supervised baseline
       Datasets          IMDb    Yelp-2   Yelp-5   Amazon-2   Amazon-5   DBpedia
       (# Sup examples)  (25k)   (560k)   (650k)   (3.6m)     (3m)       (560k)
       Pre-BERT SOTA     4.32    2.16     29.98    3.32       34.81      0.70
       BERT_LARGE        4.51    1.89     29.32    2.63       34.17      0.64

     Semi-supervised setting
       Initialization    UDA   IMDb    Yelp-2   Yelp-5   Amazon-2   Amazon-5   DBpedia
       (# Sup examples)        (20)    (20)     (2.5k)   (20)       (2.5k)     (140)
       Random            ✗     43.27   40.25    50.80    45.39      55.70      41.14
       Random            ✓     25.23   8.33     41.35    16.16      44.19      7.24
       BERT_BASE         ✗     27.56   13.60    41.00    26.75      44.09      2.58
       BERT_BASE         ✓     5.45    2.61     33.80    3.96       38.40      1.33
       BERT_LARGE        ✗     11.72   10.55    38.90    15.54      42.30      1.68
       BERT_LARGE        ✓     4.78    2.50     33.54    3.93       37.80      1.09
       BERT_FINETUNE     ✗     6.50    2.94     32.39    12.17      37.32      -
       BERT_FINETUNE     ✓     4.20    2.05     32.08    3.50       37.12      -

     Table 1: Error rates on text classification datasets. In the fully supervised settings, the pre-BERT SOTAs include ULMFiT [26] for Yelp-2 and Yelp-5, DPCNN [29] for Amazon-2 and Amazon-5, Mixed VAT [51] for IMDb and DBpedia.

     Results with different labeled set sizes. We also evaluate the performance of UDA with different numbers of supervised examples. As shown in Figure 4, UDA leads to consistent improvements for all labeled set sizes. In the large-data regime, with the full training set of IMDb, UDA also provides robust gains. On Yelp-2, with 2,000 examples, UDA outperforms the previous SOTA model trained with 560,000 examples.

  7. Experimental results
     [Figure 4 from the paper: accuracy on (a) IMDb and (b) Yelp-2 with different numbers of labeled examples.]