NLP With Friends

_themessier
September 21, 2023
Transcript

  1. Lessons learned from analyzing systems for
    hate speech detection and bias mitigation
    Sarah Masud, LCS2


  2. Disclaimer:
    The following content contains extreme language (verbatim
    from social media), which does not reflect my opinions or
    those of my collaborators. Reader’s discretion is advised.


  3. Definition of Hate Speech
    ● Hate is subjective, temporal, and cultural in
    nature.
    ● The UN defines hate speech as “any kind of
    communication that attacks or uses
    pejorative or discriminatory language
    with reference to a person or a group on
    the basis of who they are.” [1]
    ● Social media users need sensitisation.
    Fig 1: Pyramid of Hate [2]
    [1]: UN hate
    [2]: Pyramid of Hate


  4. Revisiting Hate Speech Benchmarks: From
    Data Curation to System Deployment
    Discussion on: Observations along the way
    Accepted at KDD’23


  5. Curating Datasets in the Indic Context


  6. The need for Indic datasets
    ● The Indian context has its own distinct
    social issues.
    ● India is home to multiple languages
    and dialects.
    ● Users show varying degrees of hateful
    inclination towards different topics.
    Fig 1: Hatefulness of different users towards different hashtags in RETINA [1]
    [1]: Masud et al., Hate is the New Infodemic: A Topic-aware Modeling of Hate Speech Diffusion on Twitter, ICDE 2021


  7. Limitations of existing studies
    ● A myopic approach to building hate speech
    datasets from hate lexicons/slur terms.
    ● Limited study of the English-Hindi
    code-mixed (Hinglish) context.
    ● Limited context means systems
    default to non-hate.
    Motivation
    ❖ Can we curate a large-scale
    Indic dataset?
    ❖ Can we model contextual
    information into the detection
    of hate?


  8. Data Curation


  9. GOTHate: Geo-political topical Hate Speech
    ● Neutrally seeded from socio-political topics in
    India, the USA, and the UK.
    ● Not filtered for language.
    ● Source: Twitter (Jan 2020-Jan 2021)
    ○ Primary dataset: tweets on these
    topics (Fig. 1).
    ○ Secondary dataset: user metadata,
    timeline, and 1-hop network information.
    ○ RIP Free Twitter API :(
    Fig 1: Socio-political topics covered in GOTHate [1]
    [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023


  10. Issues, Lessons & Hacks
    1. API rate limits (see the sketch after this list)
    a. Use multiple keys in round-robin fashion
    b. Add appropriate halting between API calls
    2. Hateful content is effervescent and limited in quantity compared to its non-hateful counterpart
    a. Curate the primary and secondary datasets in parallel
    3. Reproducibility in research is limited by Twitter’s data-sharing policy
    a. Apart from publicly releasing tweet IDs, make the textual content available to researchers
    on demand
    4. Twitter access in India is privileged compared to other platforms like WhatsApp and Telegram
    a. No direct way of curating datasets from these platforms
    b. Some platforms, like TikTok, get banned for geo-political or legal reasons.

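    A minimal sketch of hacks 1(a) and 1(b) above, assuming a pool of bearer
    tokens for the Twitter v2 recent-search endpoint; the token values, the
    delay, and the retry policy are hypothetical, not the paper's actual crawler.

    import time
    from itertools import cycle

    import requests

    BEARER_TOKENS = ["TOKEN_A", "TOKEN_B", "TOKEN_C"]  # hypothetical key pool
    SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

    def fetch_round_robin(queries, delay_s=2.0):
        """Rotate API keys per request and halt between calls to respect rate limits."""
        tokens = cycle(BEARER_TOKENS)  # round-robin over the key pool
        results = []
        for query in queries:
            headers = {"Authorization": f"Bearer {next(tokens)}"}
            resp = requests.get(SEARCH_URL, headers=headers, params={"query": query})
            if resp.status_code == 429:  # rate-limited: back off, retry on the next key
                time.sleep(60)
                headers = {"Authorization": f"Bearer {next(tokens)}"}
                resp = requests.get(SEARCH_URL, headers=headers, params={"query": query})
            resp.raise_for_status()
            results.append(resp.json())
            time.sleep(delay_s)  # appropriate halting between API calls
        return results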

  11. Issues, Lessons & Hacks: The Academia-Industry Synergy
    1. Provide researchers with access to a research API free of cost!
    2. Provide research access to content flagged/removed by human/AI moderators.
    a. Provide appropriate copyright access
    3. Access to such datasets should be available to researchers across geographies.
    What about users? Can they opt out of public research irrespective of their data
    being public?


  12. Data Annotation


  13. Annotation Guideline: Approaches in Literature
    Background
    ● Varying definitions of what is hateful
    ● Varying labels: hate, offense, toxic,
    profane, abusive.
    ○ Some go into the finer details of
    offense being sexist, racist,
    Islamophobic, etc.
    ● NO STANDARD DEFINITION of
    hate speech in NLP
    ○ AKA no benchmark dataset or
    leaderboard for hate speech.
    Current approaches
    ● Expert annotations
    ● Crowdsourced annotations
    ● Mixtures of both


  14. Annotation Guideline: Our approach
    ● Wanted to capture the power dynamics w.r.t. the topic (directly) and the target group (indirectly).
    ● Started with a broad definition and an open label set, then narrowed the label space.
    Fig 1: Overview of Annotation Guideline [1]
    [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023


  15. Annotation Guideline: Our approach
    Fig 1: Two-phase Annotation Mode [1]
    Phase I: IAA = 0.80 (3 experts, F: 3). Phase II: IAA = 0.70 (10 crowdsourced workers, M:F 6:10) [2].
    (A sketch of one common agreement metric follows below.)
    [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023
    [2]: Xsaras

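    The deck does not name the agreement metric behind these IAA numbers; a
    minimal sketch, assuming Fleiss' kappa (a common choice for three or more
    raters) and toy labels:

    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Rows = items (tweets), columns = raters (e.g. the 3 experts of Phase I).
    # The label ids are hypothetical, e.g. 0 = non-hate, 1 = offensive, 2 = hate.
    ratings = np.array([
        [0, 0, 0],
        [1, 1, 2],
        [2, 2, 2],
        [0, 1, 0],
        [1, 1, 1],
    ])

    # aggregate_raters turns per-rater labels into per-category counts per item.
    counts, _ = aggregate_raters(ratings)
    print(f"Fleiss' kappa: {fleiss_kappa(counts):.2f}")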

  16. Observations, Lessons & Takeaways
    Observations
    1. Providing context about the topic a tweet belongs to helps the annotator get a sense of
    the vulnerable group under consideration.
    a. No prior list of vulnerable groups
    b. Subjective in nature, e.g., the Asian population
    2. Annotation bias is real!!
    3. Neutral seeding makes the annotations even harder.

  17. Observations & Lessons
    Lessons & Takeaways:
    1. Two-phase annotation with continuous validation is better than a single phase for large-scale
    annotation.
    a. Tradeoff in time/effort vs. quality of annotation when moving from experts to crowdsourcing.
    2. Involve local/socially aware annotators rather than generic platforms.
    a. Xsaras vs. Crowdflower
    3. Finalise the guideline iteratively; keep an open mind.
    a. Use social media guidelines as a seed
    b. Add/remove labels based on labels already existing in the literature


  18. Observations & Lessons
    Lessons & Takeaways:
    1. To overcome annotation bias
    a. Communicate with the annotators to understand their point of view and share the aim of
    hate speech detection.
    b. Avoid working with annotators who have strong political affiliations and are directly
    involved in political work.
    c. Work with annotators who are diverse w.r.t. age, gender, and linguistic capabilities.
    2. There will still be mislabelling!
    a. It will mimic the chaotic real world more closely :P
    b. Even a dataset with mislabelled samples represents some annotation bias and the
    annotators’ points of view. It is not ideal but still representative of society. [1,2]
    The jury is still out on whether these biases and issues can be overcome by genAI.
    [1]: Novak et al., Handling Disagreement in Hate Speech Modelling
    [2]: Davani et al., Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations


  19. Modeling Context


  20. In-Dataset Signal: Exemplars Module
    $MENTION$ $MENTION$ $MENTION$ AND Remember president loco SAID
    MEXICO WILL PAY FUC**kfu ck trump f*** gop f*** republicans Make go fund me
    FOR HEALTH CARE, COLLEGE EDUCATION , CLIMATE CHANGE,
    SOMETHING GOOD AND POSITIVE !! Not for a fucking wall go fund the wall the
    resistance resist $URL$
    $MENTION$ DERANGED DELUSIONAL DUMB
    DICTATOR DONALD IS MENTALLY UNSTABLE! I
    WILL NEVER VOTE REPUBLICAN AGAIN IF THEY
    DON'T STAND UP TO THIS TYRANT LIVING IN
    THE WHITE HOUSE! fk republicans worst dictator
    ever unstable dictator $URL$
    $MENTION$ COULD WALK ON WATER AND THE never
    trump WILL CRAP ON EVERYTHING HE DOES. SHAME IN
    THEM. UNFOLLOW ALL OF THEM PLEASE!
    [Figure: an offensive training sample drawn from the labelled corpus, shown
    with two exemplars, E1 and E2: offensive training samples that can be by the
    same or a different author.]

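    How the exemplars are retrieved is not spelled out on the slide; a minimal
    sketch, assuming simple TF-IDF cosine similarity over training texts that
    share the sample's label (the paper's actual exemplar selection may differ):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def top_k_exemplars(sample, texts, labels, label, k=2):
        """Return the k same-label training texts most similar to the sample."""
        pool = [t for t, y in zip(texts, labels) if y == label]
        vec = TfidfVectorizer().fit(pool + [sample])
        sims = cosine_similarity(vec.transform([sample]), vec.transform(pool))[0]
        ranked = sorted(zip(sims, pool), reverse=True)  # most similar first
        return [text for _, text in ranked[:k]]

    # Toy usage with hypothetical labelled tweets.
    texts = ["fk republicans", "go fund the wall", "lovely weather today"]
    labels = ["offensive", "offensive", "non-hate"]
    print(top_k_exemplars("f*** gop f*** republicans", texts, labels, "offensive"))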

  21. Auxiliary Dataset Signal: Timeline Module
    "look at what Hindus living in
    mixed-population localities are facing, what
    Dhruv Tyagi had to face for merely asking his
    Muslim neighbors not to sexually harass his
    daughter...and even then, if u ask why people
    don’t rent to Muslims, get ur head examined
    $MENTION\$ $MENTION\$ naah...Islamists will never
    accept Muslim refugees, they will tell the Muslims to
    create havoc in their home countries and do whatever it
    takes to convert Dar-ul-Harb into Dar-ul
    Islam..something we should seriously consider doing
    with Pak Hindus too
    One of the tweet by author before Example 2 One of the tweet by author after Example 2
    Accusatory tone
    timestamp t-1
    Hateful
    tweet
    timestamp
    t
    Accusatory and instigating
    timestamp t+1


  22. Contextual Signal Infusion for Hate Detection
    Fig 1: Motivation for Auxiliary Data Signals [1]
    [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023


  23. Proposed model: HEN-mBERT
    HEN-mBERT: History, Exemplar, and Network infused mBERT model.
    Fig 1: Proposed model HEN-mBERT [1]
    [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023

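    A minimal sketch of the attentive-infusion idea, assuming the tweet's mBERT
    [CLS] vector attends over pooled history/exemplar/network embeddings via
    cross-attention; the dimensions, head count, and 4-way label space are
    assumptions, and the real HEN-mBERT architecture is the one specified in
    the KDD'23 paper.

    import torch
    import torch.nn as nn

    class SignalInfusion(nn.Module):
        """Toy fusion: the tweet representation (query) attends over auxiliary
        signal embeddings; the attended context is concatenated back before
        classification."""

        def __init__(self, dim=768, n_classes=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.classifier = nn.Linear(2 * dim, n_classes)

        def forward(self, cls_vec, signal_vecs):
            # cls_vec: (B, dim) mBERT [CLS]; signal_vecs: (B, n_signals, dim)
            query = cls_vec.unsqueeze(1)                    # (B, 1, dim)
            fused, _ = self.attn(query, signal_vecs, signal_vecs)
            return self.classifier(torch.cat([cls_vec, fused.squeeze(1)], dim=-1))

    model = SignalInfusion()
    logits = model(torch.randn(2, 768), torch.randn(2, 3, 768))  # 3 signal types
    print(logits.shape)  # torch.Size([2, 4])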

  24. Experiments, Observations (O) & Takeaways (T)
    Fig 1: Baseline and Ablation [1]
    [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023
    ML systems (M1-M4)
    ● O: Marred by higher
    performance (as well as
    higher recall) for the non-hate
    class than for the hateful ones.
    ● O: Moving from the simple n-gram
    features of M1-M3 to the added
    lexical features of M4 leads to an
    improvement for the hate class.
    ● T: Models are context-hungry!


  25. Experiments, Observations (O) & Takeaways (T)
    Fig 1: Baseline and Ablation [1]
    [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023
    Vanilla DL systems (M5-M7)
    ● O: Slight but inconsistent
    improvement.
    ● E/O: Combining GloVe en+hi
    gave better results than the
    individual embeddings.
    ● T: Need to incorporate
    embeddings that can
    capture
    multilingual/code-mixed
    features.


  26. Experiments (E), Observations (O) & Takeaways (T)
    Fig 1: Baseline and Ablation [1]
    [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023
    Vanilla mBERT systems (M8-M10)
    ● O: As expected, a major jump in
    performance when fine-tuning
    mBERT-based systems.
    ● O: M10, which uses a hate lexicon as
    external context, is unable to provide
    a significant improvement over
    mBERT because GOTHate is
    neutrally seeded.
    ● O: In M9, simple concatenation of
    network info leads to improved
    performance for the hateful class.
    ● T: Hateful users seem to share
    similar latent signals.
    ● T: Implicit signals like network
    info are better for our dataset than
    explicit signals like hate lexicons.


  27. Experiments (E), Observations (O) & Takeaways (T)
    Fig 1: Baseline and Ablation [1]
    [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023
    Proposed system (M11-M14):
    ● E: Building on the success of the
    previous baselines, we use mBERT
    as the base model.
    ● O: Attentive infusion of signals
    seems to help reduce the noisy
    information in them.
    ● T: No one signal significantly
    dominates the others. Different
    signals seem to help different
    classes.
    ● O/T: The class with the highest
    misclassification rate in human
    annotation is helped by the
    presence of exemplars.
    ● T: Combining all 4 signals leads
    to improved detection of hate by
    5 macro-F1 points!!


  28. Concluding Remarks


  29. Concluding Remarks
    1. Tradeoff in annotations w.r.t. experts vs. the wisdom of the crowd
    2. Contextual signals, and how to align them, matter for hate speech detection
    3. Increasing the complexity of the system did not help; we need better signals, not more complex models
    Pros:
    ● Exemplar: Easily available
    ● Timeline & network: Help capture
    repeated behaviour and interaction
    patterns
    ● Timeline & network: Can later be used
    to justify content flagging.
    Cons:
    ● Exemplar: Difficult to capture topical drifts
    ● Timeline & network: Not available if the user is
    deleted or private
    ● Timeline & network: Platform-specific


  30. Handling Bias in Toxic Speech Detection: A Survey
    Discussion on: Knowledge Drift in Bias Mitigation
    Accepted at ACM CSUR


  31. Taxonomy of Bias


  32. Taxonomy w.r.t. Toxicity Detection
    Fig 1: A taxonomy of bias based on the downstream harm. [1]
    [1]: Garg et al., Handling Bias in Toxic Speech Detection: A Survey, ACM CSUR


  33. Primer to Lexical Bias
    ● Models become biased towards spurious correlations of words/phrases
    with a class.
    ● Terms contributing to lexical bias can be enlisted as bias-sensitive
    words (BSW).
    ● BSW differ across datasets and across the target groups under consideration.
    For example, if our dataset has a lot of hate speech
    against African Americans, then the mere presence of
    identity terms like Black or slur terms like n** can trigger
    the model to classify such statements as hateful
    irrespective of the context. (A sketch of one way to enlist
    BSW follows below.)

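    The deck does not say how BSW are enlisted; a minimal sketch, assuming one
    common approach: ranking words by their smoothed log-odds of appearing in
    hateful vs. other texts. The function name and data are hypothetical.

    from collections import Counter

    import numpy as np

    def bias_sensitive_words(texts, labels, hate_label="hate", k=10, alpha=1.0):
        """Rank words by smoothed log-odds of occurring in hateful vs. other texts."""
        hate = Counter(w for t, y in zip(texts, labels) if y == hate_label
                       for w in t.lower().split())
        rest = Counter(w for t, y in zip(texts, labels) if y != hate_label
                       for w in t.lower().split())
        vocab = set(hate) | set(rest)
        n_hate, n_rest = sum(hate.values()), sum(rest.values())
        score = {w: np.log((hate[w] + alpha) / (n_hate + alpha * len(vocab)))
                    - np.log((rest[w] + alpha) / (n_rest + alpha * len(vocab)))
                 for w in vocab}
        return sorted(score, key=score.get, reverse=True)[:k]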

  34. Mitigating Lexical Bias via Word Substitution
    ● Once BSW are identified, replace their occurrences in the hate class via [1]:
    ○ POS tags, to reduce dependency on explicit terms (Black -> )
    ○ The parent in the hypernym tree (Black -> Color)
    ○ The k-nearest neighbour in a word-embedding space
    (A WordNet-hypernym sketch follows below.)
    [1]: Badjatiya et al., Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations
    [2]: Waseem & Hovy, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter
    [3]: Garg et al., Handling Bias in Toxic Speech Detection: A Survey, ACM CSUR
    Drawback of the study: after substituting the BSW, the original BSW are employed to evaluate the
    reduction in bias, with no discussion of evaluating bias on the newly substituted terms. [1]
    Fig 1: Reproduced results [3] on WordNet replacement [1] for the Waseem & Hovy dataset [2]

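    A minimal sketch of the hypernym-tree substitution, using NLTK's WordNet
    interface; picking the first synset is a naive sense choice made for
    illustration, not Badjatiya et al.'s exact procedure:

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)

    def hypernym_substitute(word):
        """Replace a word with the lemma of its first hypernym, if one exists."""
        synsets = wn.synsets(word)
        if not synsets or not synsets[0].hypernyms():
            return word  # no generalization available; keep the original term
        return synsets[0].hypernyms()[0].lemmas()[0].name()

    print(hypernym_substitute("black"))  # -> a more generic ancestor term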

  35. Our Knowledge-drift Experiment
    We conjecture that replacing all occurrences of 𝑤 ∈ 𝐵𝑆𝑊 with their WordNet ancestor 𝑎 ∈ 𝐴 will
    shift the lexical bias from 𝑤 to 𝑎. (A probe sketch follows below.)
    [1]: Badjatiya et al., Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations
    [2]: Garg et al., Handling Bias in Toxic Speech Detection: A Survey, ACM CSUR
    M-bias is the dataset before debiasing/substitution.
    M-gen is the dataset after debiasing.
    Fig 1: Knowledge-drift results. [2]
    Fig 2: Formula for pinned bias metric. [1]

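    A minimal probe of the conjecture, assuming a crude stand-in for the
    pinned-bias idea: the fraction of template sentences containing a term that
    a model flags as hateful, measured for both the original BSW and its
    WordNet ancestor. The templates and the model_predict interface are
    hypothetical, not the metric of [1].

    def term_bias(model_predict, templates, term):
        """Fraction of probe sentences containing `term` flagged as hateful."""
        probes = [t.format(term=term) for t in templates]
        preds = model_predict(probes)  # hypothetical: list of predicted labels
        return sum(p == "hate" for p in preds) / len(preds)

    templates = ["I saw a {term} person today.", "{term} people live here."]
    # The conjecture predicts that, for a model trained on the substituted data
    # (M-gen), the ancestor's score rises while the original BSW's score falls:
    # for term in ("black", "achromatic_color"):
    #     print(term, term_bias(predict_m_bias, templates, term),
    #           term_bias(predict_m_gen, templates, term))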

  36. Takeaway: An Analogy of Bias & Physical Systems
    ● Like energy, bias seems capable of transferring from one source to another.
    ● Like a system at rest, a biased model/dataset will remain biased unless an external force of
    mitigation/regularization terms is added to the training.
    ● Like interactive systems that tend to become more chaotic over time, bias mitigation in toxicity
    systems, and in NLP in general, needs to be incorporated into the pipeline in a continuous fashion.
    Our survey enlists the issues and takeaways for other types of bias and their mitigation in hate speech.


  37. Thank You
    Sarah, 4th-year PhD @ LCS2, IIIT-Delhi
    [email protected]
    _themessier on Twitter
    lcs2lab on Twitter
