
Guarding Against Spurious Correlations in Natural Language Understanding

While we have made great progress in natural language understanding, transferring the success from benchmark datasets to real applications has not always been smooth. Notably, models sometimes make mistakes that are confusing and unexpected to humans. In this talk, I will discuss shortcuts in NLP tasks and present our recent works on guarding against spurious correlations in natural language understanding tasks (e.g. textual entailment and paraphrase identification) from the perspectives of both robust learning algorithms and better data coverage. Motivated by the observation that our data often contains a small amount of “unbiased” examples that do not exhibit spurious correlations, we present new learning algorithms that better exploit these minority examples. On the other hand, we may want to directly augment such “unbiased” examples. While recent works along this line are promising, we show several pitfalls in the data augmentation approach.

wing.nus

July 08, 2021

Transcript

  1. What’s spurious correlation? Predictive rules that work for certain datasets

    but do not hold in general.
    Example: textual entailment [Gururangan+ 18]
      Premise: I love dogs | Hypothesis: I don’t love dogs | Label: contradicts | Distribution: MNLI/SNLI
      Premise: I love dogs | Hypothesis: I don’t love cats | Label: neutral | Distribution: real world
    Example: paraphrase identification [Zhang+ 19]
      Sentence 1: Katz lived in Sweden in 1947. | Sentence 2: Katz lived in 1947 in Sweden. | Label: true | Distribution: QQP
      Sentence 1: A good person becomes bad | Sentence 2: A bad person becomes good | Label: false | Distribution: real world
    4 / 40
  6. Sources of spurious correlation

    Annotation bias (slide example: “The man is selling bamboo sticks.”)
    Selection bias: our data often doesn’t have enough coverage.
    5 / 40
  7. Distribution shift

    p(meaning, “not”, label) = p(meaning | label) × p(“not”, label)
    The first factor is invariant; the second changes at test time.
    If we can disentangle robust features and spurious features, the problem is solved. (Challenging!)
    We often have some knowledge of the spurious features, e.g. specific words, length.
    Learn from examples where the spurious feature and the label are independent.
    8 / 40
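A minimal simulation of this kind of shift (illustrative only; the features, rates, and numbers below are assumptions, not from the talk): p(robust | label) is held fixed across splits while the co-occurrence of a spurious “not” feature with the label changes, so a shortcut classifier fit on the training split drops to chance at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def sample(p_not_given_con):
    """Toy binary task: label 1 = contradiction, 0 = entailment.
    The robust feature tracks the label identically in every split;
    only the spurious feature's correlation with the label changes."""
    y = rng.integers(0, 2, N)
    robust = ((y == 1) ^ (rng.random(N) < 0.15)).astype(int)   # 85% informative, invariant
    p_not = np.where(y == 1, p_not_given_con, 1 - p_not_given_con)
    has_not = (rng.random(N) < p_not).astype(int)              # spurious "not" feature
    return robust, has_not, y

# Train split: "not" strongly co-occurs with contradiction. Test split: independent.
r_tr, s_tr, y_tr = sample(p_not_given_con=0.9)
r_te, s_te, y_te = sample(p_not_given_con=0.5)

# Shortcut classifier: threshold p(y = 1 | spurious feature) estimated on the train split.
p_con_if_not = y_tr[s_tr == 1].mean()
p_con_if_no_not = y_tr[s_tr == 0].mean()
shortcut_pred = np.where(s_te == 1, p_con_if_not, p_con_if_no_not) > 0.5

robust_pred = r_te == 1  # classifier that uses the invariant feature instead

print("shortcut accuracy on shifted test:", (shortcut_pred == y_te).mean())  # ~0.50
print("robust feature accuracy on test:  ", (robust_pred == y_te).mean())    # ~0.85
```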
  11. Learning from “clean” examples

    Suppose we know that the correlation between negation words and the label will change at test time.
      P: I love dogs / H: I don’t love dogs | Label: con | Biased prediction: p(con | don’t) = 0.8
      P: Tom ate an apple / H: Tom doesn’t like cats | Label: neu | Biased prediction: p(neu | doesn’t) = 0.3
      P: The bird is red / H: The bird is not green | Label: ent | Biased prediction: p(ent | not) = 0.1
    (The Quantity column of the slide is a bar chart: the first pattern is the typical one, the last two are minority patterns.)
    9 / 40
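A minimal sketch of how such a biased predictor can be read off the data (illustrative; the cue list and toy examples are assumptions): estimate p(label | the hypothesis contains a negation word) by counting.

```python
from collections import Counter

NEGATIONS = {"not", "don't", "doesn't", "no", "never"}  # assumed cue list

def negation_bias(examples):
    """examples: iterable of (premise, hypothesis, label).
    Returns the empirical p(label | hypothesis contains a negation word)."""
    counts = Counter()
    for _premise, hypothesis, label in examples:
        if NEGATIONS & set(hypothesis.lower().split()):
            counts[label] += 1
    total = sum(counts.values()) or 1
    return {label: c / total for label, c in counts.items()}

toy = [
    ("I love dogs", "I don't love dogs", "con"),
    ("I love dogs", "I don't like dogs", "con"),
    ("Tom ate an apple", "Tom doesn't like cats", "neu"),
    ("The bird is red", "The bird is not green", "ent"),
]
print(negation_bias(toy))  # {'con': 0.5, 'neu': 0.25, 'ent': 0.25}
```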
  16. Fitting the residual of a biased predictor

    Step 1: learn a biased classifier using known spurious features:
      ˆf_s = arg min_{f_s} E_{x,y} ℓ(f_s, x, y),  i.e.  ˆθ_s = arg max_{θ_s} E_{x,y} log p_s(y | x; θ_s)
    Step 2: learn the debiased classifier by fitting the residuals:
      ˆf_d = arg min_{f_d} E_{x,y} ℓ(ˆf_s + f_d, x, y),  i.e.  ˆθ_d = arg max_{θ_d} E_{x,y} log softmax(log ˆp_s + log p_d)[y]
    so that p(y | x) ∝ p_s(y | x) p_d(y | x).
    The biased classifier has low loss on examples where the spurious feature is predictive;
    the debiased classifier learns what cannot be predicted by the spurious feature.
    10 / 40
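A minimal PyTorch sketch of the two-step recipe (an illustration under assumed stand-in models and random data, not the authors' released code): the biased classifier sees only the spurious feature map, its log-probabilities are then frozen, and the debiased classifier is trained through the product-of-experts combination log ˆp_s + log p_d.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: x_spur is a featurization of known spurious cues (e.g. hypothesis-only
# bag of words), x_full is the full input representation. Dimensions are illustrative.
n_classes, d_spur, d_full, n = 3, 50, 768, 512
x_spur, x_full = torch.randn(n, d_spur), torch.randn(n, d_full)
y = torch.randint(0, n_classes, (n,))

biased = nn.Linear(d_spur, n_classes)    # f_s: sees only the spurious features
debiased = nn.Linear(d_full, n_classes)  # f_d: the model we actually want

# Step 1: fit the biased classifier by ordinary maximum likelihood.
opt_s = torch.optim.Adam(biased.parameters(), lr=1e-2)
for _ in range(200):
    opt_s.zero_grad()
    F.cross_entropy(biased(x_spur), y).backward()
    opt_s.step()

# Step 2: fit the debiased classifier on the residual via product of experts:
# p(y|x) ∝ p_s(y|x) * p_d(y|x), i.e. add log-probabilities before the softmax.
with torch.no_grad():
    log_ps = F.log_softmax(biased(x_spur), dim=-1)   # frozen biased predictions

opt_d = torch.optim.Adam(debiased.parameters(), lr=1e-3)
for _ in range(200):
    opt_d.zero_grad()
    log_pd = F.log_softmax(debiased(x_full), dim=-1)
    loss = F.cross_entropy(log_ps + log_pd, y)       # = -log softmax(log p_s + log p_d)[y]
    loss.backward()
    opt_d.step()

# At test time, predict with the debiased model alone.
pred = debiased(x_full).argmax(dim=-1)
```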
  22. Gradient analysis

    Residual fitting gradient = MLE gradient + bias correction gradient.
    P: I love dogs / H: I don’t love dogs, with p_s(y | x) ≈ (0, 0, 1) over (ent, neu, con).
      Perfect (biased) prediction, p_s(y* | x) → 1: the correction cancels the MLE gradient,
      giving zero gradient (the example is effectively removed).
    P: Tom ate an apple / H: Tom doesn’t like cats, with p_s(y | x) ≈ (0.33, 0.33, 0.33).
      Uninformative prediction, p_s(y | x) → uniform: the correction recovers the MLE gradient (normal update).
    11 / 40
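A small numeric check of the two limiting cases (illustrative): the gradient of the product-of-experts loss with respect to the debiased logits is softmax(log p_s + z) − onehot(y), which vanishes when the biased model is confidently correct and reduces to the plain MLE gradient when the biased model is uniform.

```python
import torch
import torch.nn.functional as F

def poe_grad(log_ps, y):
    """Gradient of -log softmax(log p_s + z)[y] w.r.t. the debiased logits z."""
    z = torch.zeros(3, requires_grad=True)          # debiased logits (arbitrary start)
    loss = F.cross_entropy((log_ps + z).unsqueeze(0), torch.tensor([y]))
    loss.backward()
    return z.grad

y = 2  # gold label: contradiction, classes ordered (ent, neu, con)

# Biased model nearly certain and correct -> the correction cancels the MLE update.
confident = torch.log(torch.tensor([0.001, 0.001, 0.998]))
# Biased model uninformative -> the ordinary MLE update is recovered.
uniform = torch.log(torch.tensor([1 / 3, 1 / 3, 1 / 3]))

mle_grad = F.softmax(torch.zeros(3), dim=-1) - F.one_hot(torch.tensor(y), 3)
print("confident biased model:", poe_grad(confident, y))  # ~ (0, 0, 0)
print("uniform biased model:  ", poe_grad(uniform, y))    # == mle_grad
print("plain MLE gradient:    ", mle_grad)
```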
  28. Experimental setup

    Data: MNLI [Williams+ 17]; hypothesis bias and word-overlap bias.
    Model: biased and debiased models have the same parametrization and differ only in the input feature map.
      BERT [Devlin+ 18], Decomposable Attention [Parikh+ 16], ESIM [Chen+ 17]
    Learning algorithms: MLE vs. DRiFt (Debias by Residual Fitting).
    12 / 40
  30. Synthetic spurious features

    Training:  P: I love dogs / H: [con] I don’t love dogs.
      [cheating label] = gold label with some probability (the cheating rate).
    Testing:   P: The bird is red / H: [con] The bird is not green.
      [cheating label] is randomly assigned from {ent, neu, con}.
    Models relying on the cheating feature would produce random predictions at test time.
    13 / 40
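A minimal sketch of injecting such a cheating feature (the bracketed-token format and the exact assignment rule are assumptions): the prepended label token matches the gold label with probability equal to the cheating rate during training and is uniformly random at test time.

```python
import random

LABELS = ["ent", "neu", "con"]

def add_cheating_token(hypothesis, gold_label, cheating_rate, train, rng):
    """Prepend a [label] token that leaks the gold label with the given
    probability at training time and is uniformly random at test time."""
    if train:
        others = [l for l in LABELS if l != gold_label]
        cheat = gold_label if rng.random() < cheating_rate else rng.choice(others)
    else:
        cheat = rng.choice(LABELS)
    return f"[{cheat}] {hypothesis}"

rng = random.Random(0)
print(add_cheating_token("I don't love dogs.", "con", 0.8, train=True, rng=rng))
print(add_cheating_token("The bird is not green.", "con", 0.8, train=False, rng=rng))
```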
  33. Results

    [Figure: accuracy (0.6-1.0) vs. cheating rate (0.2-0.8) for BERT, DA, and ESIM, comparing DRiFt-hypo, MLE, and Rm-cheat.]
    Two biased classifiers: hypothesis-only and cheating feature (oracle).
    Debiased models are invariant as the spurious correlation increases.
    Better knowledge of the spurious features helps.
    14 / 40
  37. Word overlap bias

    The model relies on word overlap to predict entailment [McCoy+ 19].
      P: The lawyer was advised by the actor. / H: The actor advised the lawyer.
      P: The doctors visited the lawyer. / H: The lawyer visited the doctors.
    High-word-overlap examples whose gold label is non-entailment are predicted as entailment.
    This is not due to model capacity: training on challenge examples easily achieves high accuracy.
    15 / 40
  41. Results on HANS

    F1 scores (BERT / DA / ESIM):
                    Non-entailment (challenge)   Entailment (typical)
      MLE           36.4 /  0.4 /  3.5           73.1 / 66.6 / 66.2
      DRiFt-Hypo    48.1 /  5.2 / 17.4           75.5 / 66.5 / 66.9
      DRiFt-CBOW    48.3 /  8.6 / 19.1           73.6 / 65.5 / 66.9
      DRiFt-Hand    57.7 / 40.1 / 35.8           74.8 / 59.3 / 63.2
    DRiFt improves robust accuracy over MLE.
    Better knowledge of the spurious correlations (DRiFt-Hand) is important.
    16 / 40
  45. Results on MNLI

    Accuracy on MNLI (in-distribution), BERT / DA / ESIM:
      MLE           84.5 / 72.2 / 78.1
      DRiFt-Hypo    84.3 / 68.6 / 75.0
      DRiFt-CBOW    82.1 / 56.3 / 68.8
      DRiFt-Hand    81.7 / 56.8 / 68.9
    There is a trade-off between robustness and accuracy.
    Pre-trained models (BERT) perform well on both in-distribution and challenge data.
    17 / 40
  48. Summary

    Prior knowledge of the spurious correlation is important.
    Learn from unbiased examples.
    Limitation: trade-off between robust and in-distribution accuracy.
    18 / 40
  49. An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models

    TACL 2020. Lifu Tu, Garima Lalwani, Spandana Gella
    19 / 40
  50. Pre-trained models appear to be more robust

    Textual entailment, accuracy (in-distribution / challenge):
      ESIM 78.1 / 49.1   BERT 84.5 / 62.5   BERT-L 86.2 / 74.1   RoBERTa 87.4 / 74.1   RoBERTa-L 89.1 / 77.1
    Paraphrase identification, accuracy (in-distribution / challenge):
      ESIM 85.3 / 38.9   BERT 90.8 / 36.1   BERT-L 91.3 / 40.1   RoBERTa 91.5 / 42.6   RoBERTa-L 89.0 / 39.5
    Large pre-trained models improve performance on both in-distribution and challenge data.
    Do they extrapolate to out-of-distribution (OOD) data?
    20 / 40
  56. Counterexamples in the training data

    Typical example:    P: I love dogs / H: I don’t love dogs (con)
    Minority examples:  P: Tom ate an apple / H: Tom doesn’t like cats (neu)
                        P: The bird is red / H: The bird is not green (ent)
    (The Quantity column in the slide shows the typical pattern far outnumbering the minority ones.)
    21 / 40
  59. Counterexamples in the training data

    Natural language inference (HANS [McCoy+ 19]):
      P: The doctor mentioned the manager who ran. / H: The doctor mentioned the manager.  (overlap & entailment)
      P: The actor was advised by the manager. / H: The actor advised the manager.  (overlap & non-entailment; 727 in MNLI)
    Paraphrase identification (PAWS [Zhang+ 19]):
      S1: Bangkok vs Shanghai? / S2: Shanghai vs Bangkok?  (same BoW & paraphrase)
      S1: Are all dogs smart or can some be dumb? / S2: Are all dogs dumb or can some be smart?  (same BoW & non-paraphrase; 247 in QQP)
    Do pre-trained models generalize better from the minority examples?
    22 / 40
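A minimal sketch of flagging such minority examples (the overlap measure and threshold are assumptions, not the paper's exact filter): high lexical overlap between the two sentences combined with a gold label that disagrees with the shortcut.

```python
def word_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise."""
    norm = lambda s: {w.strip(".,?!") for w in s.lower().split()}
    p, h = norm(premise), norm(hypothesis)
    return len(p & h) / max(len(h), 1)

def is_minority(premise: str, hypothesis: str, label: str, threshold: float = 0.9) -> bool:
    """Minority pattern: the spurious cue (high overlap) points to entailment,
    but the gold label disagrees."""
    return word_overlap(premise, hypothesis) >= threshold and label != "entailment"

examples = [
    ("The actor was advised by the manager.", "The actor advised the manager.", "non-entailment"),
    ("The doctor mentioned the manager who ran.", "The doctor mentioned the manager.", "entailment"),
]
for p, h, y in examples:
    print(is_minority(p, h, y), "|", h)
# True for the first pair (high overlap, non-entailment), False for the second.
```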
  64. Observation 1: minority examples take longer to learn

    [Figure (a) MNLI: accuracy vs. epochs for train (all), train (minority), dev (all), dev (minority).
     Figure (b) Challenge data: accuracy vs. epochs on MNLI dev and HANS.]
    Accuracy on minority examples is correlated with accuracy on challenge data.
    23 / 40
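A sketch of the bookkeeping behind this observation (scaffolding with assumed model, optimizer, and DataLoader objects, not the paper's code): after every epoch, evaluate on the full dev set and on the minority subset and record both curves.

```python
import torch

def accuracy(model, loader, device="cpu"):
    """Classification accuracy over a DataLoader yielding (inputs, labels)."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=-1)
            correct += (pred == y.to(device)).sum().item()
            total += y.numel()
    return correct / total

def train_and_track(model, optimizer, loss_fn, train_loader,
                    dev_loader, minority_loader, epochs=20):
    """Track overall vs. minority-subset dev accuracy across training epochs."""
    history = []
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        history.append({
            "epoch": epoch,
            "dev_all": accuracy(model, dev_loader),
            "dev_minority": accuracy(model, minority_loader),  # e.g. high-overlap non-entailment
        })
    return history
```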
  66. Observation 2: removing minority examples hurts robust accuracy

    [Figure: accuracy (%) vs. % of training data removed (0.0, 0.1, 0.4, 1.6, 6.4) for BERT-base, BERT-large,
     RoBERTa-base, RoBERTa-large; removal strategy: overlap (minority) vs. random.]
    Pre-trained models cannot extrapolate to challenge data without the minority examples.
    Pre-training improves robustness to data imbalance.
    24 / 40
  69. Why is the improvement on PAWS much smaller?

    [Figure: accuracy vs. % of training data (20-100) for BERT-base, BERT-large, RoBERTa-base, RoBERTa-large,
     on HANS and on PAWS-QQP (templated vs. auto-generated examples).]
    Different (challenge) patterns require different amounts of training data.
    Pre-training is no silver bullet.
    25 / 40
  74. Summary

    Distribution shift: minority examples may become the majority at test time.
    Residual fitting: ‘upweights’ minority examples at the cost of performance on other examples.
    Pre-training: generic data improves generalization from minority examples.
    Motivation: make better use of generic data to mitigate the robustness-accuracy trade-off.
    26 / 40
  77. Improve generalization by multitasking

    Improve generalization from minority examples by transferring knowledge from related tasks.
    Multi-task learning setup
      Model: shared BERT encoder + linear task-specific classifiers.
      Auxiliary data:
        Textual entailment: MNLI + SNLI, QQP, PAWS
        Paraphrase identification: QQP + SNLI, MNLI, HANS
    27 / 40
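A minimal PyTorch sketch of this setup (illustrative; the stand-in encoder, head sizes, and task sampling are assumptions): one shared encoder, one linear classification head per task, and batches drawn from the target and auxiliary tasks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskModel(nn.Module):
    """Shared encoder with one linear classifier per task."""
    def __init__(self, encoder: nn.Module, hidden_size: int, task_classes: dict):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_size, n) for task, n in task_classes.items()
        })

    def forward(self, inputs, task: str):
        pooled = self.encoder(inputs)          # (batch, hidden_size) sentence-pair encoding
        return self.heads[task](pooled)

# Stand-in encoder; in the talk this is BERT with its pooled [CLS] representation.
hidden = 128
encoder = nn.Sequential(nn.Linear(300, hidden), nn.ReLU())
model = MultiTaskModel(encoder, hidden, {"mnli": 3, "qqp": 2, "hans": 2})
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One multi-task step: alternate batches from the target and auxiliary tasks.
batches = {task: (torch.randn(8, 300), torch.randint(0, n, (8,)))
           for task, n in {"mnli": 3, "qqp": 2, "hans": 2}.items()}
for task, (x, y) in batches.items():
    optimizer.zero_grad()
    F.cross_entropy(model(x, task), y).backward()
    optimizer.step()
```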
  79. Results

    BERT-base:    in-distribution MNLI 84.5 (STL) vs 83.7 (MTL), QQP 90.8 vs 91.3;
                  challenge HANS 62.5 vs 68.2, PAWS 36.1 vs 45.9.
    RoBERTa-base: in-distribution MNLI 87.4 (STL) vs 86.4 (MTL), QQP 91.5 vs 91.7;
                  challenge HANS 74.1 vs 72.8, PAWS 42.6 vs 51.7.
    MTL improves robust accuracy without hurting in-distribution performance.
    MTL improves robustness on top of pre-training.
    28 / 40
  83. How does MTL help?

    Method                                     In-dist. (QQP)   Challenge (PAWS)
    STL (QQP)                                  90.8             36.1
    MTL (QQP + MNLI, SNLI, HANS)               91.3             45.9
      remove random examples from MNLI         +0.1             −0.9
      remove random examples from QQP          −0.0             −1.6
      remove minority examples from MNLI       +0.0             −1.6
      remove minority examples from QQP        +0.0             −7.7
    Removing random examples from the target/auxiliary task has no significant effect on performance.
    Removing minority examples from the target task hurts MTL performance.
    29 / 40
  87. Summary

    Mitigate the trade-off between in-distribution and robust accuracy.
    Pre-training helps generalization from minority examples.
    Adding generic data (through MTL) improves generalization.
    30 / 40
  88. Counterfactually-Augmented Data (CAD)

    Figure: [Kaushik+ 2020]
      seed:   “Election” is a highly fascinating and thoroughly captivating thriller-drama  (positive)
      edited: “Election” is a highly expected and thoroughly mind-numbing thriller-drama  (negative)
    Assumption: the intervention tells us which spans are “causal spans” vs. spurious features.
    32 / 40
  91. Does CAD improve OOD generalization?

    Hypothesis: training with CAD leads to robust models that use causal features and generalize to OOD data.
    Mixed results:
      [Huang+ 2020]: CAD doesn’t lead to better performance on OOD data (SNLI→MNLI) or challenge data.
      [Khashabi+ 2020]: on question answering, unaugmented data is better when dataset size and annotation cost are controlled.
      [Kaushik+ 2021]: removing/noising spans identified by CAD hurts performance more than noising non-causal spans.
    CAD does reveal useful features, so why aren’t they helpful?
    33 / 40
  94. Toy example: sentiment classification

    seed:   The book is good  (positive)       edited: The book is not good  (negative)
    seed:   The movie is boring  (negative)    edited: The movie is fascinating  (positive)
    Naive Bayes model, non-zero weights:
                    book   movie   good   boring   fascinating   not
      seed          +1     −1      +1     −1        0             0
      seed+edited    0      0       0     −0.5     +0.5          −0.5
    CAD successfully debiased “book” and “movie”, but “good” also gets debiased!
    Unintervened robust features cannot be learned from CAD.
    34 / 40
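A runnable sketch of this toy example (an illustration: smoothed per-word log-odds stand in for the slide's Naive Bayes weights, so the exact numbers differ while the sign pattern matches): fit word weights on the seed examples alone and then on seed plus edited examples.

```python
import math
from collections import Counter

def word_weights(examples):
    """Per-word log-odds log p(w|pos)/p(w|neg) with add-one smoothing,
    a stand-in for Naive Bayes feature weights."""
    pos, neg = Counter(), Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        vocab.update(words)
        (pos if label == "positive" else neg).update(words)
    def logp(counter, w):
        return math.log((counter[w] + 1) / (sum(counter.values()) + len(vocab)))
    return {w: round(logp(pos, w) - logp(neg, w), 2) for w in sorted(vocab)}

seed = [
    ("The book is good", "positive"),
    ("The movie is boring", "negative"),
]
edited = [
    ("The book is not good", "negative"),
    ("The movie is fascinating", "positive"),
]

print("seed only:    ", word_weights(seed))
print("seed + edited:", word_weights(seed + edited))
# "book", "movie" and "good" all end up with weights near zero after augmentation;
# only the edited words ("not", "boring", "fascinating") keep clearly non-zero weight.
```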
  98. What about real CAD?

    Hypothesis: undiverse edits limit the effectiveness of CAD.
    Categorize edits [Wu+ 2020] and train/test with controlled data size (accuracy, mean ± std):

    Train \ Test   quantifier    negation      lexical       insert        delete        resemantic
    SNLI seed      74.36±0.21    69.25±2.09    75.16±0.32    74.94±1.05    65.76±2.34    76.77±0.74
    lexical        72.42±1.58    68.75±2.16    81.81±0.99    74.04±1.04    67.04±3.00    74.93±1.16
    insert         68.15±0.88    57.75±4.54    71.08±2.53    78.98±1.58    68.80±2.71    71.74±1.53
    resemantic     70.77±1.04    67.25±2.05    77.23±2.35    76.59±1.12    70.40±1.54    75.40±1.44

    On matched test sets, CAD achieves the best performance.
    On unmatched test sets, CAD performance can be worse than unaugmented examples.
    35 / 40
  103. What about real CAD?

    [Figure: accuracy on MNLI (OOD) vs. diversity of edit types (ins; ins + lex; ins + lex + resem; all types),
     with the SNLI baseline for reference.]
    Controlling for data size, a larger number of edit types leads to higher ID/OOD performance.
    Effective data size of CAD ≈ number of features intervened on.
    36 / 40
  104. Performance vs. data size

    If we collect more CAD, will performance improve?
    [Figure: accuracy vs. training data size (1000-5000), training on CAD vs. on SNLI, both evaluated on MNLI.]
    CAD is more effective in the low-data regime.
    Increasing the number of CAD examples does not seem to lead to higher diversity, and performance plateaus.
    37 / 40
  105. CAD may exacerbate existing spurious correlations

    Relative number of examples by label (Entailment / Neutral / Contradiction):
      (a) Negation word:       Seed 0.19 / 0.29 / 0.52   →   CAD 0.19 / 0.14 / 0.66
      (b) Word overlap > 90%:  Seed 0.66 / 0.15 / 0.19   →   CAD 0.77 / 0.13 / 0.10
    38 / 40
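A minimal sketch of the measurement behind this slide (illustrative; the cue list and toy data are assumptions): compare the label distribution among examples containing a spurious cue, here a negation word, in the seed data vs. the seed-plus-counterfactual data.

```python
from collections import Counter

NEGATIONS = {"not", "no", "never", "n't", "don't", "doesn't"}  # assumed cue list

def label_dist_given_cue(examples, cue_fn):
    """Label distribution restricted to examples where cue_fn(hypothesis) is True."""
    counts = Counter(label for hyp, label in examples if cue_fn(hyp))
    total = sum(counts.values()) or 1
    return {label: round(c / total, 2) for label, c in counts.items()}

has_negation = lambda hyp: bool(NEGATIONS & set(hyp.lower().split()))

seed = [("I don't love dogs", "contradiction"), ("The bird is not green", "entailment"),
        ("Tom doesn't like cats", "neutral"), ("A dog is outside", "entailment")]
cad = seed + [("I don't love cats", "contradiction"), ("She is not happy", "contradiction")]

print("seed:", label_dist_given_cue(seed, has_negation))
print("cad: ", label_dist_given_cue(cad, has_negation))
# If counterfactual edits mostly add negations to flip the label to contradiction,
# the negation-contradiction correlation becomes stronger, not weaker.
```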
  106. Summary

    Counterfactual data is an effective way to identify useful features.
    But it may also limit what the model can learn.
    We need better ways to ensure data diversity/coverage.
    39 / 40
  107. Parting remarks

    Learning: we need knowledge of the different predictive patterns in the data. Can we automatically discover the groups?
    Data: improve diversity through more controllable crowdsourcing protocols.
    Thank you!
    40 / 40