Guarding Against Spurious Correlations in Natural Language Understanding

wing.nus
July 08, 2021

While we have made great progress in natural language understanding, transferring the success from benchmark datasets to real applications has not always been smooth. Notably, models sometimes make mistakes that are confusing and unexpected to humans. In this talk, I will discuss shortcuts in NLP tasks and present our recent work on guarding against spurious correlations in natural language understanding tasks (e.g., textual entailment and paraphrase identification) from the perspectives of both robust learning algorithms and better data coverage. Motivated by the observation that our data often contains a small number of "unbiased" examples that do not exhibit spurious correlations, we present new learning algorithms that better exploit these minority examples. On the other hand, we may want to directly augment such "unbiased" examples. While recent work along this line is promising, we show several pitfalls of the data augmentation approach.


Transcript

  1. What's spurious correlation? Predictive rules that work for certain datasets
     but do not hold in general.
     Example: textual entailment [Gururangan+ 18]
       "I love dogs" contradicts "I don't love dogs" (MNLI/SNLI), yet "I love dogs" is neutral to "I don't love cats" (real world).
     Example: paraphrase identification [Zhang+ 19]
       "Katz lived in Sweden in 1947." / "Katz lived in 1947 in Sweden." is labeled a paraphrase (QQP), yet "A good person becomes bad" / "A bad person becomes good" is not a paraphrase (real world).
     4 / 40
  2. Sources of spurious correlation. Annotation bias (example hypothesis: "The man is selling bamboo sticks.").
     Selection bias: our data often doesn't have enough coverage.
     5 / 40
  3. Distribution shift: p(meaning, "not", label) = p(meaning | label) × p("not", label),
     where p(meaning | label) is invariant and p("not", label) changes at test time.
     If we can disentangle robust features from spurious features, the problem is solved (challenging!).
     We often have some knowledge of the spurious features, e.g. specific words, length.
     Idea: learn from examples where the spurious feature and the label are independent.
     (A small numerical illustration of the shift follows below.)
     8 / 40
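To make the factorization above concrete, here is a small Python sketch with made-up probabilities (illustrative numbers only, not estimated from any dataset): p(meaning | label) is held fixed, while the p("not", label) factor shifts, which is enough to break a classifier keyed on "not".

```python
def p_label_given_not(p_not_and_label):
    """Bayes rule: p(label | "not" present) computed from the joint p("not", label)."""
    total = sum(p_not_and_label.values())
    return {label: round(p / total, 2) for label, p in p_not_and_label.items()}

# Only the ("not", label) factor differs between train and test;
# p(meaning | label) is assumed invariant and plays no role in this check.
train_joint = {"ent": 0.02, "neu": 0.03, "con": 0.15}  # "not" co-occurs mostly with contradiction
test_joint  = {"ent": 0.06, "neu": 0.07, "con": 0.07}  # the correlation weakens at test time

print(p_label_given_not(train_joint))  # {'ent': 0.1, 'neu': 0.15, 'con': 0.75} -> "not" looks predictive
print(p_label_given_not(test_joint))   # {'ent': 0.3, 'neu': 0.35, 'con': 0.35} -> the shortcut stops working
```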
  4. Learning from "clean" examples. Suppose we know that the correlation between
     negation words and the label will change at test time.
       P: I love dogs / H: I don't love dogs            label: con   biased prediction: p(con | don't) = 0.8
       P: Tom ate an apple / H: Tom doesn't like cats   label: neu   biased prediction: p(neu | doesn't) = 0.3
       P: The bird is red / H: The bird is not green    label: ent   biased prediction: p(ent | not) = 0.1
     (The slide also shows the quantity of each example type as bars; a sketch of how such
     biased predictions can be estimated follows below.)
     9 / 40
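One way to obtain "biased predictions" like those in the table is to estimate p(label | negation word) directly from label counts, which is essentially what a hypothesis-only biased model captures. A toy sketch; the examples and the resulting probabilities are made up for illustration:

```python
from collections import Counter

NEGATION = {"not", "no", "never", "don't", "doesn't", "isn't", "won't"}

def biased_predictor(examples):
    """Estimate p(label | hypothesis contains a negation word) from counts.
    This is the kind of biased prediction a hypothesis-only model would make."""
    counts = Counter(ex["label"] for ex in examples
                     if NEGATION & set(ex["hypothesis"].lower().split()))
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

toy_train = [
    {"hypothesis": "I don't love dogs", "label": "con"},
    {"hypothesis": "Tom doesn't like cats", "label": "neu"},
    {"hypothesis": "The bird is not green", "label": "ent"},
    {"hypothesis": "She is not at home", "label": "con"},
]
print(biased_predictor(toy_train))  # {'con': 0.5, 'neu': 0.25, 'ent': 0.25}
```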
  5. Fitting the residual of a biased predictor.
     Step 1: learn a biased classifier using the known spurious features:
       f̂_s = argmin_{f_s} E_{x,y} ℓ(f_s, x, y),  i.e.  θ̂_s = argmax_{θ_s} E_{x,y} log p_s(y | x; θ_s)
     Step 2: learn the debiased classifier by fitting the residuals:
       f̂_d = argmin_{f_d} E_{x,y} ℓ(f̂_s + f_d, x, y),  i.e.  θ̂_d = argmax_{θ_d} E_{x,y} log softmax(log p̂_s + log p_d)[y]
     so that p(y | x) ∝ p_s(y | x) p_d(y | x).
     The biased classifier has low loss on examples where the spurious feature is predictive;
     the debiased classifier learns what cannot be predicted by the spurious feature.
     (A code sketch of this objective follows below.)
     10 / 40
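A minimal PyTorch sketch of the residual-fitting (product-of-experts) objective above, assuming a frozen biased model that outputs log-probabilities and a debiased model that outputs logits; the model and variable names are placeholders, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def residual_fitting_loss(debiased_logits, biased_log_probs, labels):
    """Train the debiased model so that the combined prediction
    softmax(log p_s + log p_d) explains the label, i.e. p(y|x) ∝ p_s(y|x) * p_d(y|x).
    The biased term is detached: only the debiased model receives gradients."""
    combined = biased_log_probs.detach() + F.log_softmax(debiased_logits, dim=-1)
    # cross_entropy normalizes `combined`, i.e. computes -log softmax(log p_s + log p_d)[y]
    return F.cross_entropy(combined, labels)

# Sketch of one training step (the biased model was trained first on the spurious features):
#   logits_d = debiased_model(full_input)
#   log_ps   = F.log_softmax(biased_model(spurious_features), dim=-1)
#   loss     = residual_fitting_loss(logits_d, log_ps, labels)
#   loss.backward(); optimizer.step()
# At test time one would typically predict with the debiased model alone.
```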
  6. Gradient analysis: residual-fitting gradient = MLE gradient + bias-correction gradient.
       P: I love dogs / H: I don't love dogs; biased prediction p_s(y | x) ≈ (0, 0, 1) over (ent, neu, con).
       Perfect (biased) prediction, p_s(y* | x) → 1: the correction cancels the MLE gradient,
       giving zero gradient (the example is effectively removed).
       P: Tom ate an apple / H: Tom doesn't like cats; biased prediction p_s(y | x) ≈ (0.33, 0.33, 0.33).
       Uninformative prediction, p_s(y | x) → uniform: the correction recovers the MLE gradient (normal update).
     (A numerical check follows below.)
     11 / 40
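The two regimes can be checked numerically: the gradient of the combined loss −log softmax(log p_s + z_d)[y] with respect to the debiased logits z_d is softmax(log p_s + z_d) − onehot(y). A small sketch with illustrative values:

```python
import torch
import torch.nn.functional as F

def poe_grad(biased_probs, debiased_logits, label):
    """Gradient of -log softmax(log p_s + z_d)[y] with respect to the debiased logits z_d."""
    z = debiased_logits.clone().requires_grad_(True)
    loss = F.cross_entropy((biased_probs.log() + z).unsqueeze(0), torch.tensor([label]))
    loss.backward()
    return z.grad

z_d = torch.zeros(3)                           # the debiased model currently predicts uniformly
confident = torch.tensor([0.01, 0.01, 0.98])   # biased model nails the gold label (con = index 2)
uniform = torch.tensor([1 / 3, 1 / 3, 1 / 3])  # biased model is uninformative

print(poe_grad(confident, z_d, label=2))  # ≈ 0: the example is effectively removed
print(poe_grad(uniform, z_d, label=2))    # softmax(z_d) - onehot(2): the plain MLE gradient
```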
  7. Experimental setup. Data: MNLI [Williams+ 17], with hypothesis bias and word-overlap bias.
     Model: the biased and debiased models share the same parametrization and differ only in the
     input feature map; BERT [Devlin+ 18], Decomposable Attention [Parikh+ 16], ESIM [Chen+ 17].
     Learning algorithms: MLE and DRiFt (Debias by Residual Fitting).
     12 / 40
  8. Synthetic spurious features.
     Training: P: I love dogs / H: [con] I don't love dogs; the prepended [cheating label] equals
     the gold label with some probability (the cheating rate).
     Testing: P: The bird is red / H: [con] The bird is not green; the [cheating label] is randomly
     assigned from {ent, neu, con}.
     Models relying on the cheating feature would produce random predictions at test time.
     (A sketch of the injection procedure follows below.)
     13 / 40
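A sketch of how such a cheating feature can be injected into the hypothesis; the exact protocol (for example, what token is used when the cheating label is not the gold label) is an assumption here and may differ from the original experiments.

```python
import random

LABELS = ["ent", "neu", "con"]

def add_cheating_feature(example, cheating_rate, train=True, rng=random):
    """Prepend a synthetic [label] token to the hypothesis.
    Training: the token equals the gold label with probability `cheating_rate`,
    otherwise a random label (an assumption; the original setup may differ).
    Testing: the token is drawn uniformly at random, so it carries no signal."""
    if train and rng.random() < cheating_rate:
        token = example["label"]
    else:
        token = rng.choice(LABELS)
    return {**example, "hypothesis": f"[{token}] {example['hypothesis']}"}

ex = {"premise": "I love dogs", "hypothesis": "I don't love dogs", "label": "con"}
print(add_cheating_feature(ex, cheating_rate=0.8))
print(add_cheating_feature(ex, cheating_rate=0.8, train=False))
```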
  9. Results. (Figure: test accuracy vs. cheating rate for BERT, DA, and ESIM, comparing MLE,
     DRiFt-hypo, and Rm-cheat.)
     Two biased classifiers: hypothesis-only and the cheating feature (oracle).
     Debiased models are invariant as the spurious correlation increases.
     Better knowledge of the spurious features helps.
     14 / 40
  10. Word overlap bias: models rely on word overlap to predict entailment [McCoy+ 19].
        P: The lawyer was advised by the actor. / H: The actor advised the lawyer. (entailment)
        P: The doctors visited the lawyer. / H: The lawyer visited the doctors. (non-entailment)
      Non-entailment examples with high word overlap are predicted as entailment.
      This is not due to model capacity: training on challenge examples easily achieves high accuracy.
      (A sketch of overlap features follows below.)
      15 / 40
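The shortcut here is easy to compute explicitly. Below is a sketch of the kind of hand-crafted overlap features one could feed to a biased classifier; the actual feature set used for the DRiFt biased models may differ.

```python
import re

def tokens(text):
    return re.findall(r"\w+", text.lower())

def overlap_features(premise, hypothesis):
    """Hand-crafted features capturing the 'high word overlap => entailment' shortcut."""
    p, h = tokens(premise), tokens(hypothesis)
    p_set = set(p)
    overlap = sum(tok in p_set for tok in h) / max(len(h), 1)
    return {
        "overlap_ratio": overlap,                    # fraction of hypothesis words appearing in the premise
        "all_words_covered": float(overlap == 1.0),  # hypothesis vocabulary is a subset of the premise's
        "same_length": float(len(p) == len(h)),
    }

# Both HANS-style examples above get maximal overlap, yet only one is entailment.
print(overlap_features("The lawyer was advised by the actor.", "The actor advised the lawyer."))
print(overlap_features("The doctors visited the lawyer.", "The lawyer visited the doctors."))
```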
  11. Results on HANS (F1 scores).
      Non-entailment (challenge)   BERT    DA     ESIM
        MLE                        36.4     0.4    3.5
        DRiFt-Hypo                 48.1     5.2   17.4
        DRiFt-CBOW                 48.3     8.6   19.1
        DRiFt-Hand                 57.7    40.1   35.8
      Entailment (typical)         BERT    DA     ESIM
        MLE                        73.1    66.6   66.2
        DRiFt-Hypo                 75.5    66.5   66.9
        DRiFt-CBOW                 73.6    65.5   66.9
        DRiFt-Hand                 74.8    59.3   63.2
      DRiFt improves robust accuracy over MLE.
      Better knowledge of the spurious correlations (DRiFt-Hand) is important.
      16 / 40
  12. Results on MNLI (in-distribution accuracy).
                     BERT    DA     ESIM
        MLE          84.5    72.2   78.1
        DRiFt-Hypo   84.3    68.6   75.0
        DRiFt-CBOW   82.1    56.3   68.8
        DRiFt-Hand   81.7    56.8   68.9
      There is a trade-off between robustness and accuracy.
      Pre-trained models (BERT) perform well on both in-distribution and challenge data.
      17 / 40
  13. Summary. Prior knowledge of the spurious correlation is important; learn from unbiased examples.
      Limitation: a trade-off between robust and in-distribution accuracy.
      18 / 40
  14. An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models.
      TACL 2020. Lifu Tu, Garima Lalwani, Spandana Gella.
      19 / 40
  15. Pre-trained models appear to be more robust (accuracy).
      Textual entailment            ESIM   BERT   BERT-L   RoBERTa   RoBERTa-L
        in-distribution             78.1   84.5   86.2     87.4      89.1
        challenge                   49.1   62.5   74.1     74.1      77.1
      Paraphrase identification     ESIM   BERT   BERT-L   RoBERTa   RoBERTa-L
        in-distribution             85.3   90.8   91.3     91.5      89.0
        challenge                   38.9   36.1   40.1     42.6      39.5
      Large pre-trained models improve performance on both in-distribution and challenge data.
      Do they extrapolate to out-of-distribution (OOD) data?
      20 / 40
  16. Counterexamples in the training data.
        P: I love dogs / H: I don't love dogs            con   (typical examples)
        P: Tom ate an apple / H: Tom doesn't like cats   neu   (minority examples)
        P: The bird is red / H: The bird is not green    ent   (minority examples)
      (The slide shows the quantity of each type as bars.)
      21 / 40
  17. Counterexamples in the training data.
      Natural language inference (HANS [McCoy+ 19]):
        P: The doctor mentioned the manager who ran. / H: The doctor mentioned the manager.   overlap & entailment
        P: The actor was advised by the manager. / H: The actor advised the manager.          overlap & non-entailment (727 in MNLI)
      Paraphrase identification (PAWS [Zhang+ 19]):
        S1: Bangkok vs Shanghai? / S2: Shanghai vs Bangkok?                                    same BoW & paraphrase
        S1: Are all dogs smart or can some be dumb? / S2: Are all dogs dumb or can some be smart?   same BoW & non-paraphrase (247 in QQP)
      Do pre-trained models generalize better from the minority examples?
      (A sketch for finding such examples follows below.)
      22 / 40
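A rough sketch of how such minority examples could be identified in the training data; the exact criteria used in the paper (overlap threshold, tokenization) are assumptions here.

```python
import re

def toks(text):
    return re.findall(r"\w+", text.lower())

def overlap(premise, hypothesis):
    p = set(toks(premise))
    h = toks(hypothesis)
    return sum(t in p for t in h) / max(len(h), 1)

def is_minority_nli(ex, threshold=0.9):
    """High lexical overlap but not entailment: a counterexample to the overlap shortcut."""
    return ex["label"] != "entailment" and overlap(ex["premise"], ex["hypothesis"]) >= threshold

def is_minority_paraphrase(ex):
    """Identical bag of words but labeled non-paraphrase (PAWS-style counterexample in QQP)."""
    return sorted(toks(ex["s1"])) == sorted(toks(ex["s2"])) and ex["label"] == 0

print(is_minority_nli({"premise": "The actor was advised by the manager.",
                       "hypothesis": "The actor advised the manager.",
                       "label": "non-entailment"}))  # True
```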
  18. Observation 1: minority examples take longer to learn.
      (Figure a: MNLI train/dev accuracy over training epochs, on all vs. minority examples.
       Figure b: MNLI dev vs. HANS accuracy over training epochs.)
      Accuracy on minority examples is correlated with accuracy on challenge data.
      23 / 40
  19. Observation 2: removing minority examples hurts robust accuracy.
      (Figure: challenge accuracy vs. percentage of training data removed (0.1% to 6.4%), removing
       either word-overlap minority examples or random examples, for BERT-base/large and RoBERTa-base/large.)
      Pre-trained models cannot extrapolate to challenge data without the minority examples.
      Pre-training improves robustness to data imbalance.
      24 / 40
  20. Why is the improvement on PAWS much smaller?
      (Figure: challenge accuracy vs. percentage of training data used, on HANS and on PAWS-QQP
       (templated and auto-generated), for BERT-base/large and RoBERTa-base/large.)
      Different (challenge) patterns require different amounts of training data.
      Pre-training is no silver bullet.
      25 / 40
  21. Summary. Distribution shift: minority examples may become the majority at test time.
      Residual fitting: "upweights" minority examples at the cost of performance on other examples.
      Pre-training: generic data improves generalization from minority examples.
      Motivation: make better use of generic data to mitigate the robustness-accuracy trade-off.
      26 / 40
  22. Improve generalization by multitasking: improve generalization from minority examples by
      transferring knowledge from related tasks.
      Multitask learning (MTL) setup. Model: a shared BERT encoder + a linear task-specific classifier
      per task (see the sketch below). Auxiliary data: for textual entailment, MNLI + SNLI, QQP, PAWS;
      for paraphrase identification, QQP + SNLI, MNLI, HANS.
      27 / 40
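A minimal sketch of the MTL architecture described above: one shared encoder and one linear head per task. The encoder here is a stand-in module so the sketch runs without downloading BERT; in practice it would be a pre-trained BERT encoder producing a pooled representation.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared encoder with one linear classification head per task."""
    def __init__(self, encoder, hidden_size, task_num_labels):
        super().__init__()
        self.encoder = encoder  # in practice a pre-trained BERT encoder returning (batch, hidden)
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in task_num_labels.items()}
        )

    def forward(self, task, features):
        return self.heads[task](self.encoder(features))

# Stand-in encoder so the sketch runs end to end.
dummy_encoder = nn.Sequential(nn.Linear(32, 768), nn.Tanh())
model = MultiTaskModel(dummy_encoder, hidden_size=768, task_num_labels={"mnli": 3, "qqp": 2})

batch = torch.randn(4, 32)
print(model("mnli", batch).shape)  # torch.Size([4, 3])
print(model("qqp", batch).shape)   # torch.Size([4, 2])
# Training alternates batches from the target and auxiliary tasks; every batch updates
# the shared encoder, while each head only ever sees its own task.
```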
  23. Results.
      BERT-base      in-distribution: MNLI 84.5 (STL) vs. 83.7 (MTL), QQP 90.8 vs. 91.3;
                     challenge: HANS 62.5 vs. 68.2, PAWS 36.1 vs. 45.9.
      RoBERTa-base   in-distribution: MNLI 87.4 vs. 86.4, QQP 91.5 vs. 91.7;
                     challenge: HANS 74.1 vs. 72.8, PAWS 42.6 vs. 51.7.
      MTL improves robust accuracy without hurting in-distribution performance,
      and improves robustness on top of pre-training.
      28 / 40
  24. How does MTL help?
      Method                                   In-dist. (QQP)   Challenge (PAWS)
        STL (QQP)                                   90.8             36.1
        MTL (QQP + MNLI, SNLI, HANS)                91.3             45.9
        remove random examples from MNLI            +0.1             −0.9
        remove random examples from QQP             −0.0             −1.6
        remove minority examples from MNLI          +0.0             −1.6
        remove minority examples from QQP           +0.0             −7.7
      Removing random examples from the target or auxiliary task has no significant effect on performance.
      Removing minority examples from the target task (QQP) substantially hurts MTL performance on the challenge data.
      29 / 40
  25. Summary. Mitigating the trade-off between in-distribution and robust accuracy:
      pre-training helps generalization from minority examples, and adding generic data
      (through MTL) further improves generalization.
      30 / 40
  26. Counterfactually-Augmented Data (CAD) [Kaushik+ 2020].
        seed:   "Election" is a highly fascinating and thoroughly captivating thriller-drama    positive
        edited: "Election" is a highly expected and thoroughly mind-numbing thriller-drama      negative
      Assumption: the intervention tells us which spans are "causal spans" vs. spurious features.
      32 / 40
  27. Does CAD improve OOD generalization?
      Hypothesis: training with CAD leads to robust models that use causal features and generalize to OOD data.
      Mixed results:
        [Huang+ 2020]: CAD does not lead to better performance on OOD data (SNLI→MNLI) or challenge data.
        [Khashabi+ 2020]: on question answering, unaugmented data is better when dataset size and annotation cost are controlled.
        [Kaushik+ 2021]: removing/noising spans identified by CAD hurts performance more than noising non-causal spans.
      CAD does reveal useful features, so why aren't they helpful?
      33 / 40
  28. Toy example: sentiment classification.
        seed: The book is good        positive      edited: The book is not good       negative
        seed: The movie is boring     negative      edited: The movie is fascinating   positive
      Naive Bayes model, non-zero weights:
                       book   movie   good   boring   fascinating   not
        seed            +1     −1      +1     −1         0           0
        seed+edited      0      0       0    −0.5       +0.5        −0.5
      CAD successfully debiases "book" and "movie", but "good" also gets debiased!
      Robust features that were not intervened on cannot be learned from CAD.
      (A runnable version of this toy example follows below.)
      34 / 40
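The toy example can be reproduced with scikit-learn. The exact weights differ from the slide's illustration (Laplace smoothing changes the magnitudes), but the qualitative effect is the same: once the edited examples are added, the weight on "good" collapses toward zero along with "book" and "movie", while "not", "boring", and "fascinating" carry the signal.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def nb_weights(texts, labels):
    """Fit a Naive Bayes sentiment model and return per-word weights
    log P(word | positive) - log P(word | negative)."""
    vec = CountVectorizer()
    X = vec.fit_transform(texts)
    nb = MultinomialNB().fit(X, labels)
    pos, neg = list(nb.classes_).index("pos"), list(nb.classes_).index("neg")
    w = nb.feature_log_prob_[pos] - nb.feature_log_prob_[neg]
    return dict(zip(vec.get_feature_names_out(), np.round(w, 2)))

seed = ["The book is good", "The movie is boring"]
seed_labels = ["pos", "neg"]
edited = ["The book is not good", "The movie is fascinating"]
edited_labels = ["neg", "pos"]

print(nb_weights(seed, seed_labels))                            # "good" gets a clearly positive weight
print(nb_weights(seed + edited, seed_labels + edited_labels))   # "good" is debiased to ~0 with "book"/"movie"
```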
  29. What about real CAD? Hypothesis: undiverse edits limit the effectiveness of CAD.
      Categorize the edits [Wu+ 2020] and train/test on each edit type (accuracy, standard deviation in parentheses, data size controlled):
      Train \ Test   quantifier     negation       lexical        insert         delete         resemantic
        SNLI seed    74.36 (0.21)   69.25 (2.09)   75.16 (0.32)   74.94 (1.05)   65.76 (2.34)   76.77 (0.74)
        lexical      72.42 (1.58)   68.75 (2.16)   81.81 (0.99)   74.04 (1.04)   67.04 (3.00)   74.93 (1.16)
        insert       68.15 (0.88)   57.75 (4.54)   71.08 (2.53)   78.98 (1.58)   68.80 (2.71)   71.74 (1.53)
        resemantic   70.77 (1.04)   67.25 (2.05)   77.23 (2.35)   76.59 (1.12)   70.40 (1.54)   75.40 (1.44)
      On matched test sets, CAD achieves the best performance.
      On unmatched test sets, CAD performance can be worse than unaugmented examples.
      35 / 40
  30. What about real CAD?
      (Figure: accuracy on MNLI (OOD) as edit-type diversity increases: insert; insert + lexical;
       insert + lexical + resemantic; all types; compared to an SNLI baseline.)
      Controlling for data size, a larger number of edit types leads to higher in-distribution and OOD performance.
      The effective data size of CAD ≈ the number of features intervened on.
      36 / 40
  31. Performance vs. data size: if we collect more CAD, will performance improve?
      (Figure: accuracy on MNLI vs. training data size (1000 to 5000), training on CAD vs. on SNLI.)
      CAD is more effective in the low-data regime.
      Increasing the number of CAD examples does not seem to increase diversity, and performance plateaus.
      37 / 40
  32. CAD may exacerbate existing spurious correlations.
      Label distribution (entailment / neutral / contradiction) on examples containing the spurious feature:
        (a) negation word:       seed 0.19 / 0.29 / 0.52    CAD 0.19 / 0.14 / 0.66
        (b) word overlap > 90%:  seed 0.66 / 0.15 / 0.19    CAD 0.77 / 0.13 / 0.10
      (A diagnostic sketch for this kind of check follows below.)
      38 / 40
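A diagnostic sketch for this kind of check: compare the label distribution conditioned on a spurious feature (here, negation words) between the seed data and the augmented data. The toy data below is made up; the real analysis would run over the SNLI seed and CAD splits.

```python
from collections import Counter

NEGATION = {"not", "no", "never", "don't", "doesn't", "isn't"}

def label_dist_given_feature(examples, has_feature):
    """Empirical p(label | feature present); comparing seed vs. augmented data shows
    whether augmentation strengthened or weakened a spurious correlation."""
    counts = Counter(ex["label"] for ex in examples if has_feature(ex))
    total = sum(counts.values())
    return {label: round(c / total, 2) for label, c in counts.items()} if total else {}

def has_negation(ex):
    return bool(NEGATION & set(ex["hypothesis"].lower().split()))

seed = [{"hypothesis": "I don't love dogs", "label": "con"},
        {"hypothesis": "The bird is not green", "label": "ent"}]
cad = seed + [{"hypothesis": "I don't love cats", "label": "con"},
              {"hypothesis": "He is not happy", "label": "con"}]

print(label_dist_given_feature(seed, has_negation))  # {'con': 0.5, 'ent': 0.5}
print(label_dist_given_feature(cad, has_negation))   # contradiction share grows after augmentation
```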
  33. Summary. Counterfactual data is an effective way to identify useful features,
      but it may also limit what the model can learn.
      We need better ways to ensure data diversity and coverage.
      39 / 40
  34. Parting remarks.
      Learning: we need knowledge of the different predictive patterns in the data; can we automatically discover the groups?
      Data: improve diversity through more controllable crowdsourcing protocols.
      Thank you!
      40 / 40