
IJCNLP-AACL 2023: The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated.

Masahiro Kaneko

October 30, 2023
Transcript

  1. The Impact of Debiasing on the Performance of Language Models
     in Downstream Tasks is Underestimated

     Masahiro Kaneko¹,², Danushka Bollegala³, Naoaki Okazaki²
  2. Debiasing the Pre-trained Model in Downstream Tasks

     • Pre-trained Language Models (PLMs) are biased regarding gender-related words
       • Female words such as “she” and “woman”
       • Male words such as “he” and “man”
       • Occupation words such as “doctor” and “nurse”
     • A debiasing method should mitigate such biased information while keeping the useful pre-trained information
     • Evaluations in downstream tasks often employ the GLUE benchmark

     [Figure: for “The nurse is my ___”, the PLM assigns likelihood 0.3 to “sister” but only 0.09 to “brother” → biased!]
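A minimal sketch of the likelihood comparison in the figure, using the Hugging Face transformers fill-mask pipeline with bert-base-cased (the model used later in the experiments); the paper's exact probing setup may differ:

```python
from transformers import pipeline

# Fill-mask pipeline with the same PLM the experiments use.
unmasker = pipeline("fill-mask", model="bert-base-cased")

# Restrict the predictions to the two target words and compare their scores;
# a large gap in an occupation context signals gender bias.
for pred in unmasker("The nurse is my [MASK].", targets=["sister", "brother"]):
    print(f"{pred['token_str']}: {pred['score']:.3f}")
```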
  3. Concern about Existing Evaluation Approach

     The total number of instances containing female, male, and occupational (Occ.) words in the GLUE benchmark; female and occupational words are markedly less frequent (a counting sketch follows below):

     Task       All    Female word   Male word   Occ. word
     CoLA     1,043        174          722          96
     MNLI     9,832      3,467        8,875       1,415
     MRPC       408        101          391          96
     QNLI     5,463      2,149        5,371       1,066
     QQP     40,430      7,415       29,638       3,331
     RTE        277        113          269          94
     SST-2      872        187          691          75
     STS-B    1,500        513        1,277         151
     WNLI        71         27           71           6

     • However, GLUE is an unbalanced benchmark because it has little data related to females and occupations
     • The impact of debiasing on the data that contains gender-related words may therefore be evaluated inaccurately
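A rough sketch of this kind of counting, assuming the Hugging Face `datasets` library; the word lists here are illustrative and far from complete, and the paper's actual lists are not reproduced:

```python
from datasets import load_dataset

# Illustrative (incomplete) word lists for the three categories.
FEMALE = {"she", "her", "woman", "women", "sister", "mother"}
MALE = {"he", "his", "him", "man", "men", "brother", "father"}
OCC = {"doctor", "nurse", "teacher", "engineer", "lawyer"}

def contains_any(text: str, words: set) -> bool:
    return bool(set(text.lower().split()) & words)

# SST-2 as an example task; its validation split has 872 instances,
# matching the "All" column above.
data = load_dataset("glue", "sst2", split="validation")
counts = {
    "female": sum(contains_any(ex["sentence"], FEMALE) for ex in data),
    "male": sum(contains_any(ex["sentence"], MALE) for ex in data),
    "occ": sum(contains_any(ex["sentence"], OCC) for ex in data),
}
print(len(data), counts)
```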
  4. Investigation of the Performance for Each Bias-related Instance

     ① Extract the instances containing gender-related words from the benchmarks
     ② Calculate and compare the performance:
        • All instances ⇄ gender-related instances
        • Original PLM ⇄ Debiased PLM

     [Figure: from the GLUE dataset, the instances w/ female words, w/ male words, and w/ Occ. words are extracted, and both the original PLM and the debiased PLM are evaluated on each subset; a sketch follows below]
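A hypothetical sketch of steps ① and ②, where `original` and `debiased` stand in for the two fine-tuned models and all names are illustrative:

```python
from typing import Callable

def accuracy(predict: Callable[[str], int], data: list) -> float:
    return sum(predict(ex["text"]) == ex["label"] for ex in data) / len(data)

def debias_gaps(original, debiased, data, is_gender_related):
    """Debiased-minus-original accuracy on all vs. gender-related instances."""
    subset = [ex for ex in data if is_gender_related(ex["text"])]
    return (accuracy(debiased, data) - accuracy(original, data),
            accuracy(debiased, subset) - accuracy(original, subset))

# Toy usage: identical models give a zero gap on both evaluations.
toy = [{"text": "the nurse smiled", "label": 1},
       {"text": "stocks fell today", "label": 0}]
model = lambda t: int("nurse" in t)
print(debias_gaps(model, model, toy, lambda t: "nurse" in t))  # (0.0, 0.0)
```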
  5. Experimental Setting

     • Debiasing method
       • Counterfactual Data Augmentation (CDA) debiasing [Webster et al., 2020]
       • This method creates additional instances by swapping the gender of gender words in the training data (a sketch follows below):
         • The nurse is my sister ⇨ The nurse is my brother
       • This makes it possible to train a less biased model, because female and male words become equally frequent in the augmented dataset
     • Benchmark
       • GLUE: this benchmark includes 9 tasks (e.g. NLI, sentiment analysis, …)
     • Pre-trained model
       • BERT (bert-base-cased)
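A minimal CDA sketch; the swap map is illustrative and far from complete (the actual word pairs come from Webster et al., 2020):

```python
import re

SWAPS = {"she": "he", "he": "she", "her": "his", "his": "her",
         "woman": "man", "man": "woman", "sister": "brother",
         "brother": "sister"}
PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)

def cda_swap(sentence: str) -> str:
    """Replace every gender word with its counterpart, keeping capitalization."""
    def repl(m):
        swapped = SWAPS[m.group(0).lower()]
        return swapped.capitalize() if m.group(0)[0].isupper() else swapped
    return PATTERN.sub(repl, sentence)

print(cda_swap("The nurse is my sister"))  # -> The nurse is my brother

# CDA *augments* rather than replaces: the swapped copies are added to the
# training data, so female and male words become equally frequent.
train = ["The nurse is my sister", "He is a doctor"]
augmented = train + [cda_swap(s) for s in train]
```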
  6. Performance Gap Between Original Model and Debiased Model

     Score of debiased PLM − score of original PLM:

     Task      All    Female    Male    Occ.
     CoLA    -1.36    -3.42    -2.01   -1.45
     MNLI    -0.55    -0.90    -0.71   -0.63
     MRPC    -0.96    -1.28    -1.31   -1.03
     QNLI    -1.13    -1.42    -1.19   -1.27
     QQP     -0.21    -0.69    -0.32   -0.25
     RTE     -1.16    -1.21    -1.02   -1.13
     SST-2   -0.11    -0.81    -0.34   -0.25
     STS-B   -1.01    -1.95    -1.34   -1.10
     WNLI    -2.82    -3.07    -2.82   -2.71

     • The gaps are larger for the gender-related instances than for all instances
     • All instances and gender-related instances show different trends, so they should be evaluated separately
  7. How Sensitive is the Performance Gap to the Debiasing Level?

     • We apply different levels of CDA debiasing to the PLM and measure the performance gap with respect to its original version
     • We swap the gender-related pronouns in a fraction r ∈ [0, 1] (the debias rate) of the total N instances of a dataset (see the sketch below)
       • r = 0 corresponds to not swapping gender in any instance of the dataset
       • r = 1 swaps the gender in all instances
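A sketch of partial debiasing at rate r, reusing cda_swap from the earlier sketch (shown here as a one-pair placeholder so the snippet runs on its own):

```python
import random

def cda_swap(sentence: str) -> str:
    # One-pair placeholder for the fuller swap map shown earlier.
    pairs = {"she": "he", "he": "she"}
    return " ".join(pairs.get(w, w) for w in sentence.split())

def cda_at_rate(sentences: list, r: float, seed: int = 0) -> list:
    """Swap gender words in a uniformly sampled fraction r of the N instances:
    r = 0 leaves every instance unchanged, r = 1 swaps them all."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(sentences)), round(r * len(sentences))))
    return [cda_swap(s) if i in chosen else s for i, s in enumerate(sentences)]

data = ["she is a doctor", "he wrote the report", "the sky is blue"]
print(cda_at_rate(data, r=1/3))  # exactly one of the three instances is swapped
```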
  8. Performance Gap Between Original and Debiased Models by Debias Rate r

     [Plot: performance gap on the QQP dataset as a function of the debias rate r]

     The performance on the gender-related instances is more sensitive to the debiasing level than the performance on all instances:
     • For gender-related instances: as the debias rate gets larger, the performance gap gets larger
     • For all instances: the performance gap changes inconsistently
     Evaluating all and gender-related instances separately provides a consistent measure of the debiasing impact 👍
  9. Key Takeaways

     🤖 Can the existing evaluation assess the debiasing impact appropriately?
     👧 Not really!
     • Prior evaluation approaches overlook the fact that benchmark datasets contain only a small fraction of gender-related instances
     • Recommendation: separately evaluate the debiasing effect on the gender-related instances, in addition to evaluating all instances in a benchmark dataset