
An overview of hate speech analysis techniques in NLP


Presented in the NLP lecture at IIIT-D, Winter 2024

_themessier

April 12, 2024



Transcript

  1. An overview of hate speech analysis techniques in NLP -

    Presenter: Sarah Masud, LCS2 - Collaboration with: Dr. Tanmoy, Dr. Vikram, Dr. Shad, Manjot, Atharva, Aflah and many more!
  2. Disclaimer: Subsequent content contains extreme language (verbatim from social media),

    which does not reflect my opinions or those of my collaborators. Reader discretion is advised.
  3. Definition of Hate Speech [1]: UN hate [2]: Pyramid of

    Hate • Hate is subjective, temporal and cultural in nature. • UN defines hate speech as “any kind of communication that attacks or uses pejorative or discriminatory language with reference to a person or a group on the basis of who they are.” [1] • Need sensitisation of social media users.
  4. How to Combat Hate Speech Reactive countering When a hateful

    post has been made and we are intervening to prevent it further spreading. Proactive countering Intervene before the post goes public.
  5. How do you collect data on something that is not

    well defined? [1]: Resources and benchmark corpora for hate speech detection: a systematic review Fig 1: Related concepts in Hate Speech [1]
  6. Begin by defining the scope of what you will be

    examining and modeling. How do you collect data on something that is not well defined? For example, in one project we wanted to study `Political Attacks`. Fig 1: Example definition of a political attack [1] [1]: Political mud slandering and power dynamics during Indian assembly elections
  7. How do you collect data on something that is not

    well defined? Narrow down the source platform, languages and topic to be covered. Fig 1: Socio-political topics covered in GOTHate [1] [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment • Not filtered for language (Hinglish). • Source: Twitter (from Jan 2020-Jan 2021) ◦ Primary dataset: Tweets on these (Fig. 1) topics. ◦ Secondary dataset: User metadata, timeline and 1-hop network information.
  8. Fig 2: Overview of Annotation Guideline [1] [1]: Revisiting Hate

    Speech Benchmarks: From Data Curation to System Deployment How do you annotate the collected data? Broad definitions, categories, and hierarchy, if any
  9. Fig 1: 2-phased Annotation Model [1,2] • Phase I: k

    = 0.80 • Phase II: k = 0.70 [1]: Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior [2]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment How do you annotate the collected data? Experts vs. knowledge of the crowd
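    The k values above are inter-annotator agreement scores (Cohen's kappa). A minimal sketch of computing such agreement, assuming two annotators' labels as plain Python lists (the labels below are illustrative, not the actual annotations):

    from sklearn.metrics import cohen_kappa_score

    # Illustrative labels from two annotators over the same five posts.
    annotator_a = ["hate", "neutral", "offensive", "hate", "neutral"]
    annotator_b = ["hate", "neutral", "hate", "hate", "neutral"]

    print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")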
  10. Observations, Lessons & Takeaways: 1. To overcome annotation

    bias: a. Communicate with the annotators to understand their points of view and share the aim of hate speech detection. b. Avoid working with annotators who have strong political affiliations and are directly involved in political work. c. Work with annotators diverse w.r.t. age, gender, and linguistic capabilities. 2. There will still be mislabelling! a. It will mimic the chaotic real world more closely :P b. Even a dataset with mislabelled samples represents some annotation bias and the annotators' points of view. It is not ideal but still representative of society. [1] [1]: When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks
  11. Literature Overview: Hate Datasets

    Dataset | Source & Language (Modality) | Year | Labels | Annotation
    Waseem & Hovy [1] | Twitter, English, Texts | 2016 | R, S, N | 16k, E, k = 0.84
    Davidson et al. [2] | Twitter, English, Texts | 2017 | H, O, N | 25k, C, k = 0.92
    Wulczyn et al. [3] | Wikipedia comments, English, Texts | 2017 | PA, N | 100k, C, k = 0.45
    Gibert et al. [5] | Stormfront, English, Texts | 2018 | H, N | 10k, k = 0.62
    Founta et al. [4] | Twitter, English, Texts | 2018 | H, A, SM, N | 70k, C, k = ?
    Albadi et al. [6] | Twitter, Arabic, Texts | 2018 | H, N | 6k, C, k = 0.81
    Label key: R - Racism; S - Sexism; H - Hate; PA - Personal Attack; A - Abuse; SM - Spam; O - Offensive; L - Religion; N - Neither; I - Implicit; E - Explicit. Annotation key: E - Internal Experts; C - Crowd-sourced.
    [1]: Waseem & Hovy, NAACL'16 [2]: Davidson et al., WebSci'17 [3]: Wulczyn et al., WWW'17 [4]: Founta et al., WebSci'18 [5]: Gibert et al., ALW2'18 [6]: Albadi et al., ANLP'20
  12. Literature Overview: Hate Datasets (contd.)

    Dataset | Source & Language (Modality) | Year | Labels | Annotation
    Mathur et al. [1] | Twitter, Hinglish, Texts | 2018 | H, O, N | 3k, E, k = 0.83
    Rizwan et al. [3] | Twitter, Urdu (Roman Urdu), Texts | 2020 | A, S, L, P, N | 10k, E, k = ?
    Gomez et al. [4] | Twitter, English, Memes | 2020 | H, N | 150k, C, k = ?
    ElSherief et al. [11] | Twitter, English, Texts | 2021 | I, E, N |
    Other shared tasks and benchmarks: HASOC [5], Jigsaw Kaggle [6], SemEval [7], FB Hate-Meme Challenge [8], WOAH [9], CONSTRAINT [10].
    Label key: R - Racism; S - Sexism; H - Hate; PA - Personal Attack; A - Abuse; SM - Spam; O - Offensive; L - Religion; N - Neither; I - Implicit; E - Explicit. Annotation key: E - Internal Experts; C - Crowd-sourced.
    [1]: Mathur et al., AAAI'20 [3]: Rizwan et al., EMNLP'19 [4]: Gomez et al., WACV'20 [5]: HASOC [6]: Jigsaw Kaggle [7]: SemEval [8]: FB Hate-Meme [9]: WOAH [10]: CONSTRAINT [11]: ElSherief et al., EMNLP'21
  13. Literature Overview: Hate Detection • N-gram Tf-idf + LR/SVM [1,2]

    • GloVe + CNN, RNN [3] • Transformer-based ◦ Zero-, Few-Shot [4] ◦ Fine-tuning [5] ◦ HateBERT [6] • Generation for classification [7,11] • Multimodality ◦ Images [8] ◦ Historical Context [9] ◦ Network and Neighbours [10] ◦ News, Trends, Prompts [11] [1]: Waseem & Hovy, NAACL'16 [2]: Davidson et al., WebSci'17 [3]: Badjatiya et al., WWW'17 [4]: Pelican et al., EACL Hackashop'21 [5]: Timer et al., EMNLP'21 [6]: Caselli et al., WOAH'21 [7]: Ke-Li et al. [8]: Kiela et al., NeurIPS'20 [9]: Qian et al., NAACL'19 [10]: Mehdi et al., IJCA'20, Vol 13 [11]: Badr et al.,
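    A minimal sketch of the classical n-gram TF-IDF + logistic regression baseline [1,2]; the texts, labels, and n-gram range are placeholders for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    texts = ["example neutral post", "example offensive post"]   # placeholder corpus
    labels = ["neither", "offensive"]                             # placeholder labels

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3), min_df=1)),  # word 1- to 3-grams
        ("lr", LogisticRegression(max_iter=1000)),
    ])
    clf.fit(texts, labels)
    print(clf.predict(["another example post"]))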
  14. Modeling Via CNNs Fig 1: Overview of CNN based Hate

    speech classifier [1] [1]: A Platform Agnostic Dual-Strand Hate Speech Detector Fig 2: Dataset and classification results of the CNN-based detector [1]
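    A minimal PyTorch sketch in the spirit of a CNN-over-embeddings hate speech classifier; the hyperparameters and shapes are illustrative, not taken from [1]:

    import torch
    import torch.nn as nn

    class CNNHateClassifier(nn.Module):
        # Embedding -> parallel 1D convolutions -> max-pooling -> linear classifier.
        def __init__(self, vocab_size=30000, emb_dim=100, n_filters=100,
                     kernel_sizes=(3, 4, 5), n_classes=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            self.convs = nn.ModuleList(
                [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
            self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

        def forward(self, token_ids):                      # (batch, seq_len)
            x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
            pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
            return self.fc(torch.cat(pooled, dim=1))       # (batch, n_classes)

    logits = CNNHateClassifier()(torch.randint(0, 30000, (2, 50)))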
  15. Modeling Via RNNs Fig 1: Modeling RNN + Metadata MLP

    based detector [1] [1]: A Unified Deep Learning Architecture for Abuse Detection
  16. Modeling Via BERT Fig 1: BERT model for sequence classification

    on Hate Speech Data [1] Fig 2: BERT fine-tuning with classification heads of varying complexity [2] [1]: HASOCOne@FIRE-HASOC2020: Using BERT and Multilingual BERT models for Hate Speech Detection [2]: BERT-Based Transfer Learning Approach for Hate Speech Detection in Online Social Media
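    A minimal sketch of BERT fine-tuning for sequence classification with Hugging Face Transformers, assuming a generic bert-base checkpoint and a toy batch rather than the exact setups of [1,2]:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=3)  # e.g. hate / offensive / neither

    batch = tokenizer(["example post to classify"], return_tensors="pt",
                      truncation=True, padding=True)
    labels = torch.tensor([2])

    outputs = model(**batch, labels=labels)  # returns loss and logits
    outputs.loss.backward()                  # one illustrative backward pass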
  17. Overview of BERT pre-training via MLM Fig 1: Masking

    to perform pretraining via output probability loss [1] [1]: https://ankur3107.github.io/blogs/masked-langauge-modeling/
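    A minimal sketch of the MLM objective shown in Fig 1: mask one position, keep its original token as the label, and ignore every other position in the loss (the sentence and masked position are illustrative):

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    inputs = tok("Hate speech detection is a hard problem.", return_tensors="pt")
    labels = inputs.input_ids.clone()

    masked_pos = 5                                        # illustrative position to mask
    inputs.input_ids[0, masked_pos] = tok.mask_token_id
    labels[inputs.input_ids != tok.mask_token_id] = -100  # -100 = ignored by the loss

    out = mlm(**inputs, labels=labels)   # cross-entropy over the masked position only
    print(out.loss.item())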
  18. Continued Pre-training: Domain-specific

    Training a large LM from scratch (pre-training): initialise the LM with random weights -> large-scale unlabeled corpus from various sources across the web -> mask prediction (unsupervised, generalised training) -> saved large-scale general LM with trained weights. Continued pre-training from a saved checkpoint: load the LM with trained weights -> medium-sized unlabeled corpus from a specific domain -> mask prediction (unsupervised, domain-specific training) -> saved large-scale domain LM with updated weights.
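    A minimal sketch of continued MLM pre-training from a saved checkpoint on a domain corpus; domain_corpus.txt and the training arguments are placeholders, not the recipe of any specific paper:

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # load trained weights

    ds = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
    ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=128), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="domain-bert", num_train_epochs=1),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
    )
    trainer.train()
    trainer.save_model("domain-bert")  # saved LM with updated weights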
  19. 1. mBERT (Multilingual BERT) [1] -> 104 languages used

    to perform MLM. 2. BERTweet [2] -> over 845M tweets used to perform MLM. 3. HateBERT [3] -> over 1M offensive Reddit posts used to perform MLM. BERT variants for different contexts via MLM. Fig 1: Top-3 MLM candidates for the template "Women are [MASK]." [3] Fig 2: Performance gain of HateBERT over BERT for hate classification [3] [1]: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [2]: BERTweet: A pre-trained language model for English Tweets [3]: HateBERT: Retraining BERT for Abusive Language Detection in English
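    A minimal sketch of probing such variants with the fill-mask pipeline, mirroring the template in Fig 1; the model ids are assumed to be the public Hugging Face checkpoints, and the mask token is read from each tokenizer since it differs per model:

    from transformers import pipeline

    for name in ["bert-base-uncased", "GroNLP/hateBERT"]:
        fill = pipeline("fill-mask", model=name)
        template = f"Women are {fill.tokenizer.mask_token}."
        print(name, [p["token_str"] for p in fill(template, top_k=3)])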
  20. Contextual Signal Infusion for Hate Detection Fig 1: Motivation for

    Auxiliary Data Signals[1] [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023
  21. In-Dataset Signal: Exemplars Module \$MENTION\$ \$MENTION\$ \$MENTION\$ AND Remember president

    loco SAID MEXICO WILL PAY FUC**kfu ck trump f*** gop f*** republicans Make go fund me FOR HEALTH CARE, COLLEGE EDUCATION , CLIMATE CHANGE, SOMETHING GOOD AND POSITIVE !! Not for a fucking wall go fund the wall the resistance resist \$URL\$" $MENTION\$ DERANGED DELUSIONAL DUMB DICTATOR DONALD IS MENTALLY UNSTABLE! I WILL NEVER VOTE REPUBLICAN AGAIN IF THEY DON'T STAND UP TO THIS TYRANT LIVING IN THE WHITE HOUSE! fk republicans worst dictator ever unstable dictator \$URL\$" $MENTION\$ COULD WALK ON WATER AND THE never trump WILL CRAP ON EVERYTHING HE DOES. SHAME IN THEM. UNFOLLOW ALL OF THEM PLEASE!" Offensive train sample Labelled Corpus E1: Offensive train sample exemplar (can be same or different author) E2: Offensive train sample exemplar (can be same or different author)
  22. Auxiliary Dataset Signal: Timeline Module "look at what Hindus living

    in mixed-population localities are facing, what Dhruv Tyagi had to face for merely asking his Muslim neighbors not to sexually harass his daughter...and even then, if u ask why people don’t rent to Muslims, get ur head examined $MENTION\$ $MENTION\$ naah...Islamists will never accept Muslim refugees, they will tell the Muslims to create havoc in their home countries and do whatever it takes to convert Dar-ul-Harb into Dar-ul Islam..something we should seriously consider doing with Pak Hindus too One of the tweet by author before Example 2 One of the tweet by author after Example 2 Accusatory tone timestamp t-1 Hateful tweet timestamp t Accusatory and instigating timestamp t+1
  23. Experiments (E), Observations (O) & Takeaways (T) Fig 1: Baseline

    and Ablation [1] Vanilla mBERT systems (M8-M10) • M10, which uses a hate lexicon as external context, is not able to provide a significant improvement over mBERT due to the neutrally seeded GOTHate. • M9: Simple concatenation of network info leads to an improvement in performance for the hateful class. • T: Hateful users seem to share similar latent signals. • T: Implicit signals like network info are better for our dataset than explicit signals like hate lexicons. [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment
  24. Experiments (E), Observations (O) & Takeaways (T) Fig 1: Baseline

    and Ablation [1] Proposed system (M11-M14): • E: Building on the success of the previous baselines, we use mBERT as the base model. • O: Attentive infusion of signals seems to help reduce the noisy information in them. • T: No single signal significantly dominates the others; different signals seem to help different classes. • O/T: The class with the highest misclassification rate in human annotation is helped by the presence of exemplars. • T: Combining all 4 signals leads to an improvement in hate detection of 5 macro-F1 points! [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment
  25. Do we need to fine-tune all layers of BERT? [1]:

    https://iq.opengenus.org/bert-base-vs-bert-large/ Fig 1: BERT encoders [1] Fig 2: BERT Base vs. BERT Large in terms of encoding layers [1] Different layers encode the semantics to a different extent.
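    A minimal sketch of fine-tuning only the top encoder layers, assuming a bert-base checkpoint; the choice of two layers is illustrative:

    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=3)

    for param in model.bert.parameters():
        param.requires_grad = False                 # freeze embeddings + all encoder layers
    for layer in model.bert.encoder.layer[-2:]:     # unfreeze only the top two layers
        for param in layer.parameters():
            param.requires_grad = True
    # model.classifier stays trainable by default and is updated during fine-tuning.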
  26. Do we need to fine-tune all layers of BERT? [1]:

    Probing Critical Learning Dynamics of PLMs for Hate Speech Detection Fig 1: The best- and worst-performing layers for different BERT variants. [1]
  27. What role does classifier head play in fine tuning? [1]:

    Probing Critical Learning Dynamics of PLMs for Hate Speech Detection Fig 1: Impact of classifier head on BERT variants. [1]
  28. Hate Speech Detection Using GPT-3 Prompts Zero-Shot One-shot Few-shot Hate

    Speech Detection via GPT-3 Prompts: https://arxiv.org/pdf/2103.12407.pdf
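    A minimal sketch of how zero-, one-, and few-shot detection prompts can be assembled; the wording and examples are illustrative, not the paper's exact prompts, and the resulting string would be sent to a GPT-3-style completion endpoint:

    def build_prompt(post, examples=()):
        # Zero-shot if `examples` is empty; one-/few-shot otherwise.
        header = "Decide whether the following post is hateful. Answer Yes or No.\n\n"
        shots = "".join(f"Post: {text}\nHateful: {label}\n\n" for text, label in examples)
        return header + shots + f"Post: {post}\nHateful:"

    zero_shot = build_prompt("example post to classify")
    few_shot = build_prompt(
        "example post to classify",
        examples=[("example hateful post", "Yes"), ("example neutral post", "No")],
    )
    print(few_shot)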
  29. [1]: An Investigation of Large Language Models for Real-World Hate

    Speech Detection Fig 1: Prompt for ChatGPT [1] Fig 2: Comparing finetuning with zero-shot prompting [1] Hate Speech Detection Using ChatGPT
  30. Hate Speech Detection Using ChatGPT [1]: An Investigation of Large

    Language Models for Real-World Hate Speech Detection
  31. Hate Speech Generation Using GPT-3: Conditioning TOXIGEN: A Large-Scale Machine-Generated

    Dataset for Adversarial and Implicit Hate Speech Detection Before conditioning After conditioning
  32. Categorization of biases in Hate Speech Biases exist throughout

    the hate detection pipeline and cause downstream harms to various social groups. [1]: Handling Bias in Toxic Speech Detection: A Survey
  33. Annotation Bias Fig 1: Annotation biases across geography leading to

    variation in inter-annotator agreements [1] [1] CREHate: A CRoss-cultural English Hate Speech Dataset
  34. Lexical Bias • Models become biased towards spurious correlation of

    words/phrases in a class. • Terms contributing to lexical bias can be enlisted as bias-sensitive words (BSW). • BSW differ across datasets and the target groups under consideration. For example, if our dataset has a lot of hate speech against African Americans, then the mere presence of identity terms like Black or slur terms like n** can trigger the model to classify such statements as hateful irrespective of the context.
  35. Mitigating Lexical Bias via word substitution • Once BSW are

    identified, replace their occurrences in the Hate class via [1]: ◦ POS tags, to reduce dependency on explicit terms (Black -> <NOUN>) ◦ The parent in the hypernym tree (Black -> Color) ◦ The k-nearest neighbour in a word-embedding space [1]: Badjatiya et al., Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations [2]: Waseem & Hovy, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter [3]: Garg et al., Handling Bias in Toxic Speech Detection: A Survey, ACM CSUR Drawback of the study [1]: post substitution of BSW, the original BSW are employed to evaluate the reduction in bias, with no discussion of evaluating the bias on the newly substituted terms. Fig 1: Reproduced results [3] on WordNet replacement [1] for the Waseem & Hovy dataset [2]
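    A minimal sketch of the hypernym-tree substitution idea [1] using NLTK's WordNet; the BSW set is illustrative, and the naive first-sense/first-parent choice is an assumption rather than the original study's procedure:

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)

    def hypernym_of(word):
        # Replace a word by the first lemma of its first hypernym (parent) synset.
        synsets = wn.synsets(word)
        if not synsets or not synsets[0].hypernyms():
            return word
        return synsets[0].hypernyms()[0].lemma_names()[0].replace("_", " ")

    bsw = {"black", "muslim"}                          # illustrative bias-sensitive words
    text = "example post mentioning black people"
    print(" ".join(hypernym_of(w) if w in bsw else w for w in text.split()))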
  36. Our Knowledge-drift Experiment We conjecture that replacing all occurrences 𝑤

    ∈ 𝐵𝑆𝑊 with its WordNet ancestor 𝑎 ∈ 𝐴 will shift the lexical bias from 𝑤 to 𝑎 [2]. [1]: Badjatiya et al., Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations [2]: Garg et al., Handling Bias in Toxic Speech Detection: A Survey, ACM CSUR M-bias is the dataset before debiasing/substitution; M-gen is the dataset after debiasing. Fig 1: Knowledge-drift results. [2] Fig 2: Formula for the pinned bias metric. [1]
  37. Takeaway: Analogy of Bias & Physical Systems • Like energy,

    bias seems capable of transferring from one source to another. • Like a system at rest, a model/dataset with bias will remain biased unless an external force, in the form of mitigation or regularization terms, is added to the training. • Like interactive systems that tend to become more chaotic over time, bias mitigation in toxicity systems, and in NLP in general, needs to be incorporated into the pipeline in a continuous fashion.
  38. How to Combat Hate Speech Reactive countering When a hateful

    post has been made and we are intervening to prevent it further spreading. Proactive countering Intervene before the post goes public.
  39. Literature Overview: From Offense to Non-Offense [1]: Santos et al.,

    ACL’18 Fig 1: Unsupervised conversion of Offense to Neutral [1]
  40. Literature Overview: Intervention during Tweet creation • 200k users identified

    in the study, 50% randomly assigned to the control group. • H1: Are prompted users less likely to post the current offensive content? • H2: Are prompted users less likely to post offensive content in the future? [1]: Katsaros et al., ICWSM '22 Fig 1: User behaviour statistics as part of the intervention study [1] Fig 2: Twitter reply test for offensive replies. [1]
  41. NACL Dataset • Hateful samples collected from existing Hate Speech

    datasets. • Manually annotated for hate intensity and hateful spans. • Hate intensity is marked on a scale of 1-10. • Manual generation of the normalised counterpart and its intensity (k = 0.88). Fig 1: Original and Normalised Intensity Distribution [1] Fig 2: Dataset Stats [1] [1]: Masud et al., Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization, KDD 2022
  42. Motivation & Evidence • Reducing intensity is the stepping stone

    towards non-hate. • Does not force a change in sentiment or opinion. • Evidently leads to lower virality. Fig 1: Difference in predicted number of comments per set per iteration. [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  43. Problem Statement For a given hate sample 𝑡, our objective

    is to obtain its normalized (sensitised) form 𝑡′ such that the intensity of hatred 𝜙(𝑡) is reduced while the meaning is still conveyed [1]: 𝜙(𝑡′) < 𝜙(𝑡). Fig 1: Example of an original high-intensity vs. normalised sentence [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  44. Proposed Method: NACL- Neural hAte speeCh normaLizer Hate Intensity Prediction

    (HIP), Hate Span Identification (HSI), Hate Intensity Reduction (HIR). Fig 1: Flowchart of NACL [1]: Extremely Hateful Input (ORIGINAL) -> HATE NORMALIZATION -> Less Hateful Input (SUGGESTIVE), with the user's choice between the suggestive and original versions. [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  45. Hate Intensity Prediction (HIP) Fig 1: HIP + Framework [1]

    BERT + BiLSTM + Self Attention + Linear Activation [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
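    A minimal PyTorch sketch of the HIP stack as labelled above (BERT -> BiLSTM -> self-attention -> linear); the dimensions, pooling, and head counts are illustrative, not the paper's:

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class HIPSketch(nn.Module):
        def __init__(self, hidden=128):
            super().__init__()
            self.bert = AutoModel.from_pretrained("bert-base-uncased")
            self.bilstm = nn.LSTM(768, hidden, batch_first=True, bidirectional=True)
            self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
            self.out = nn.Linear(2 * hidden, 1)    # scalar hate-intensity score

        def forward(self, input_ids, attention_mask):
            h = self.bert(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
            h, _ = self.bilstm(h)
            h, _ = self.attn(h, h, h)
            return self.out(h.mean(dim=1))         # mean-pool tokens -> intensity

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    batch = tok(["example post"], return_tensors="pt")
    intensity = HIPSketch()(batch["input_ids"], batch["attention_mask"])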
  46. Hate Span Identification (HSI) Fig 1: Hate Normalization Framework [1]

    ELMO + BiLSTM + Self Attention + CRF [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  47. Hate Intensity Reduction Overall Loss Reward Fig 1: Hate Normalization

    Framework [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  48. Hate Intensity Reduction (HIR) Fig 1: Hate Intensity Reduction Module

    [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  49. Human Evaluation • Employ 20 diverse users to measure the

    quality of the generated texts. • Metrics: ◦ Intensity ◦ Fluency ◦ Adequacy Fig 1: Results of Human Evaluation for NACL-HSR [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  50. [1] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior

    Using ChatGPT to generate contextual toxic content. Fig 1: Different outputs for toxicity based on context [1] Fig 2: Using ChatGPT to generate large-scale scenario-based toxicity [1]
  51. Motivation ❖ Depending on the scenario, specific types (intents) of

    counterspeeches have been shown to be more effective. ❖ In the figure, although the Denouncing and Question counterspeeches are valid, an Informative counterspeech would better address the given hate speech. ❖ We propose the first intent-conditioned counterspeech generation system, QUARC.
  52. Contributions ❖ Novel task: Intent-specific counterspeech generation. ❖ Novel dataset:

    IntentCONAN with 6831 counterspeeches for 3583 hate speeches, spanning five counterspeech intents. ❖ Novel model: QUARC, a two-phased intent-specific counterspeech generation framework. ❖ Evaluation: An extensive automated comparison and human evaluation to quantify the efficacy of our approach w.r.t. state-of-the-art baselines.
  53. Dataset: IntentCONAN ❖ We develop IntentCONAN, an intent-specific counterspeech dataset

    with 6831 counterspeeches, corresponding to 3583 hate speeches. ❖ These counterspeeches are distributed across 5 intents: informative, denouncing, question, positive, and humour.
  54. Benchmarks: Automated Evaluation and Ablations ❖ QUARC outperforms the comparative

    systems across all metrics. ❖ Best-performing baseline: GPS. ❖ Most systems are unable to produce counterspeeches that adhere to the desired intent. ❖ Further analysis of the Novelty and Diversity scores (lexical dissimilarity from the training corpus) reveals that the baselines do not produce diverse outputs and learn to copy. M: METEOR; SS: Semantic Similarity; BS: BERTScore; CA: Category Accuracy
  55. Benchmarks: Human Evaluation ❖ Due to the subjectivity of the

    task, automated evaluation is not comprehensive enough to ensure soundness. ❖ We conducted a human evaluation (with 60 evaluators), where we compared QUARC against the best performing baseline, GPS. ❖ We defined 5 metrics which were rated on a 5-point Likert scale, except for Category Accuracy (CA), which represents the proportion of counterspeeches with matching intents. ❖ QUARC significantly outperformed GPS across all metrics except Toxicity (slightly worse than GPS), with the biggest gain in CA, demonstrating its ability to generate intent-specific counterspeeches.
  56. Analysis: Congruence ❖ We define and compute Implicit Similarity (IS),

    a metric to measure the implicit affinity of the intents based on human evaluation. ❖ The distance between the codebook intent representations learnt through QUARC (left) closely aligns with the IS scores (right) computed between the intents. ❖ This provides a key insight into a critical factor behind the performance of QUARC. Left: A scatter plot of the codebook vectors (after dimensionality reduction) corresponding to different intents. Right: The Implicit Similarity (IS) between intent pairs captured through human evaluation.