
An overview of hate speech analysis techniques in NLP


Presented in the NLP lecture at IIIT-D, Winter 2024

_themessier

April 12, 2024



Transcript

  1. An overview of hate speech analysis techniques in NLP -

    Presenter: Sarah Masud, LCS2 - Collaboration with: Dr. Tanmoy, Dr. Vikram, Dr. Shad, Manjot, Atharva, Aflah and many more!
  2. Disclaimer: Subsequent content contains extreme language (verbatim from social media),

    which does not reflect my opinions or those of my collaborators. Reader discretion is advised.
  3. Definition of Hate Speech [1]: UN hate [2]: Pyramid of

    Hate • Hate is subjective, temporal and cultural in nature. • UN defines hate speech as “any kind of communication that attacks or uses pejorative or discriminatory language with reference to a person or a group on the basis of who they are.” [1] • Need sensitisation of social media users.
  4. How to Combat Hate Speech Reactive countering When a hateful

    post has been made and we are intervening to prevent it further spreading. Proactive countering Intervene before the post goes public.
  5. How do you collect data on something that is not

    well defined? [1]: Resources and benchmark corpora for hate speech detection: a systematic review Fig 1: Related concepts in Hate Speech [1]
  6. Begin by defining the scope of what you will be

    examining and modeling. How do you collect data on something that is not well defined? For example, in one project we wanted to study `Political Attacks`. Fig 1: Example definition of a political attack [1] [1]: Political mud slandering and power dynamics during Indian assembly elections
  7. How do you collect data on something that is not

    well defined? Narrow down the source platform, languages and topic to be covered. Fig 1: Socio-political topics covered in GOTHate [1] [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment • Not filtered for language (Hinglish). • Source: Twitter (from Jan 2020-Jan 2021) ◦ Primary dataset: Tweets on these (Fig. 1) topics. ◦ Secondary dataset: User metadata, timeline and 1-hop network information.
  8. Fig 2: Overview of Annotation Guideline [1] [1]: Revisiting Hate

    Speech Benchmarks: From Data Curation to System Deployment How do you annotate the collected data? Broad definitions, categories, and hierarchy, if any
  9. Fig 1: 2-phased Annotation Model [1,2] • Phase I: k

    = 0.80 • Phase II: k = 0.70 [1]: Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior [2]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment How do you annotate the collected data? Experts vs. knowledge of the crowd
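    The k values above are inter-annotator agreement scores (Cohen's kappa). A minimal sketch of computing such agreement, assuming two annotators' labels as plain Python lists (the labels below are illustrative, not the actual annotations):

    from sklearn.metrics import cohen_kappa_score

    # Illustrative labels from two annotators over the same five posts.
    annotator_a = ["hate", "neutral", "offensive", "hate", "neutral"]
    annotator_b = ["hate", "neutral", "hate", "hate", "neutral"]

    print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")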
  10. Observations, Lessons & Takeaways: 1. To overcome annotation

    bias: a. Communicate with the annotators to understand their points of view and share the aim of hate speech detection. b. Avoid working with annotators who have strong political affiliations and are directly involved in political work. c. Work with annotators diverse w.r.t. age, gender, and linguistic capabilities. 2. There will still be mislabelling! a. It will mimic the chaotic real world more closely :P b. Even a dataset with mislabelled samples represents some annotation bias and the annotators' points of view. It is not ideal but still representative of society. [1] [1]: When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks
  11. Literature Overview: Hate Datasets

    Dataset | Source & Language (Modality) | Year | Labels | Annotation
    Waseem & Hovy [1] | Twitter, English, Texts | 2016 | R, S, N | 16k, E, k = 0.84
    Davidson et al. [2] | Twitter, English, Texts | 2017 | H, O, N | 25k, C, k = 0.92
    Wulczyn et al. [3] | Wikipedia comments, English, Texts | 2017 | PA, N | 100k, C, k = 0.45
    Gibert et al. [5] | Stormfront, English, Texts | 2018 | H, N | 10k, k = 0.62
    Founta et al. [4] | Twitter, English, Texts | 2018 | H, A, SM, N | 70k, C, k = ?
    Albadi et al. [6] | Twitter, Arabic, Texts | 2018 | H, N | 6k, C, k = 0.81
    Label key: R - Racism; S - Sexism; H - Hate; PA - Personal Attack; A - Abuse; SM - Spam; O - Offensive; L - Religion; N - Neither; I - Implicit; E - Explicit. Annotation key: E - Internal Experts; C - Crowd-sourced.
    [1]: Waseem & Hovy, NAACL'16 [2]: Davidson et al., WebSci'17 [3]: Wulczyn et al., WWW'17 [4]: Founta et al., WebSci'18 [5]: Gibert et al., ALW2'18 [6]: Albadi et al., ANLP'20
  12. Literature Overview: Hate Datasets (contd.)

    Dataset | Source & Language (Modality) | Year | Labels | Annotation
    Mathur et al. [1] | Twitter, Hinglish, Texts | 2018 | H, O, N | 3k, E, k = 0.83
    Rizwan et al. [3] | Twitter, Urdu (Roman Urdu), Texts | 2020 | A, S, L, P, N | 10k, E, k = ?
    Gomez et al. [4] | Twitter, English, Memes | 2020 | H, N | 150k, C, k = ?
    ElSherief et al. [11] | Twitter, English, Texts | 2021 | I, E, N |
    Other shared tasks and benchmarks: HASOC [5], Jigsaw Kaggle [6], SemEval [7], FB Hate-Meme Challenge [8], WOAH [9], CONSTRAINT [10].
    Label key: R - Racism; S - Sexism; H - Hate; PA - Personal Attack; A - Abuse; SM - Spam; O - Offensive; L - Religion; N - Neither; I - Implicit; E - Explicit. Annotation key: E - Internal Experts; C - Crowd-sourced.
    [1]: Mathur et al., AAAI'20 [3]: Rizwan et al., EMNLP'19 [4]: Gomez et al., WACV'20 [5]: HASOC [6]: Jigsaw Kaggle [7]: SemEval [8]: FB Hate-Meme [9]: WOAH [10]: CONSTRAINT [11]: ElSherief et al., EMNLP'21
  13. Literature Overview: Hate Detection • N-gram Tf-idf + LR/SVM [1,2]

    • GloVe + CNN, RNN [3] • Transformer-based ◦ Zero-, Few-Shot [4] ◦ Fine-tuning [5] ◦ HateBERT [6] • Generation for classification [7,11] • Multimodality ◦ Images [8] ◦ Historical Context [9] ◦ Network and Neighbours [10] ◦ News, Trends, Prompts [11] [1]: Waseem & Hovy, NAACL'16 [2]: Davidson et al., WebSci'17 [3]: Badjatiya et al., WWW'17 [4]: Pelican et al., EACL Hackashop'21 [5]: Timer et al., EMNLP'21 [6]: Caselli et al., WOAH'21 [7]: Ke-Li et al. [8]: Kiela et al., NeurIPS'20 [9]: Qian et al., NAACL'19 [10]: Mehdi et al., IJCA'20, Vol 13 [11]: Badr et al.,
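    A minimal sketch of the classical n-gram TF-IDF + logistic regression baseline [1,2]; the texts, labels, and n-gram range are placeholders for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    texts = ["example neutral post", "example offensive post"]   # placeholder corpus
    labels = ["neither", "offensive"]                             # placeholder labels

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3), min_df=1)),  # word 1- to 3-grams
        ("lr", LogisticRegression(max_iter=1000)),
    ])
    clf.fit(texts, labels)
    print(clf.predict(["another example post"]))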
  14. Modeling Via CNNs Fig 1: Overview of CNN based Hate

    speech classifier [1] [1]: A Platform Agnostic Dual-Strand Hate Speech Detector Fig 2: Dataset and classification results of the CNN-based detector [1]
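    A minimal PyTorch sketch in the spirit of a CNN-over-embeddings hate speech classifier; the hyperparameters and shapes are illustrative, not taken from [1]:

    import torch
    import torch.nn as nn

    class CNNHateClassifier(nn.Module):
        # Embedding -> parallel 1D convolutions -> max-pooling -> linear classifier.
        def __init__(self, vocab_size=30000, emb_dim=100, n_filters=100,
                     kernel_sizes=(3, 4, 5), n_classes=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            self.convs = nn.ModuleList(
                [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
            self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

        def forward(self, token_ids):                      # (batch, seq_len)
            x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
            pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
            return self.fc(torch.cat(pooled, dim=1))       # (batch, n_classes)

    logits = CNNHateClassifier()(torch.randint(0, 30000, (2, 50)))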
  15. Modeling Via RNNs Fig 1: Modeling RNN + Metadata MLP

    based detector [1] [1]: A Unified Deep Learning Architecture for Abuse Detection
  16. Modeling Via BERT Fig 1: BERT model for sequence classification

    on Hate Speech Data [1] Fig 2: BERT fine-tuning with classification heads of varying complexity [2] [1]: HASOCOne@FIRE-HASOC2020: Using BERT and Multilingual BERT models for Hate Speech Detection [2]: BERT-Based Transfer Learning Approach for Hate Speech Detection in Online Social Media
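    A minimal sketch of BERT fine-tuning for sequence classification with Hugging Face Transformers, assuming a generic bert-base checkpoint and a toy batch rather than the exact setups of [1,2]:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=3)  # e.g. hate / offensive / neither

    batch = tokenizer(["example post to classify"], return_tensors="pt",
                      truncation=True, padding=True)
    labels = torch.tensor([2])

    outputs = model(**batch, labels=labels)  # returns loss and logits
    outputs.loss.backward()                  # one illustrative backward pass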
  17. Overview of BERT pre-training via MLM Fig 1: Masking

    to perform pretraining via output probability loss [1] [1]: https://ankur3107.github.io/blogs/masked-langauge-modeling/
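    A minimal sketch of the MLM objective shown in Fig 1: mask one position, keep its original token as the label, and ignore every other position in the loss (the sentence and masked position are illustrative):

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    inputs = tok("Hate speech detection is a hard problem.", return_tensors="pt")
    labels = inputs.input_ids.clone()

    masked_pos = 5                                        # illustrative position to mask
    inputs.input_ids[0, masked_pos] = tok.mask_token_id
    labels[inputs.input_ids != tok.mask_token_id] = -100  # -100 = ignored by the loss

    out = mlm(**inputs, labels=labels)   # cross-entropy over the masked position only
    print(out.loss.item())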
  18. Continued Pre-training: Domain-specific

    Training a large LM from scratch (pre-training): initialise the LM with random weights -> large-scale unlabeled corpus from various sources across the web -> mask prediction (unsupervised, generalised training) -> saved large-scale general LM with trained weights. Continued pre-training from a saved checkpoint: load the LM with trained weights -> medium-sized unlabeled corpus from a specific domain -> mask prediction (unsupervised, domain-specific training) -> saved large-scale domain LM with updated weights.
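    A minimal sketch of continued MLM pre-training from a saved checkpoint on a domain corpus; domain_corpus.txt and the training arguments are placeholders, not the recipe of any specific paper:

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # load trained weights

    ds = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
    ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=128), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="domain-bert", num_train_epochs=1),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
    )
    trainer.train()
    trainer.save_model("domain-bert")  # saved LM with updated weights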
  19. 1. mBERT (Multilingual BERT) [1] -> 104 languages used

    to perform MLM. 2. BERTweet [2] -> over 845M tweets used to perform MLM. 3. HateBERT [3] -> over 1M offensive Reddit posts used to perform MLM. BERT variants for different contexts via MLM. Fig 1: Top-3 MLM candidates for the template "Women are [MASK]." [3] Fig 2: Performance gain of HateBERT over BERT for hate classification [3] [1]: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [2]: BERTweet: A pre-trained language model for English Tweets [3]: HateBERT: Retraining BERT for Abusive Language Detection in English
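    A minimal sketch of probing such variants with the fill-mask pipeline, mirroring the template in Fig 1; the model ids are assumed to be the public Hugging Face checkpoints, and the mask token is read from each tokenizer since it differs per model:

    from transformers import pipeline

    for name in ["bert-base-uncased", "GroNLP/hateBERT"]:
        fill = pipeline("fill-mask", model=name)
        template = f"Women are {fill.tokenizer.mask_token}."
        print(name, [p["token_str"] for p in fill(template, top_k=3)])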
  20. Contextual Signal Infusion for Hate Detection Fig 1: Motivation for

    Auxiliary Data Signals[1] [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023
  21. In-Dataset Signal: Exemplars Module \$MENTION\$ \$MENTION\$ \$MENTION\$ AND Remember president

    loco SAID MEXICO WILL PAY FUC**kfu ck trump f*** gop f*** republicans Make go fund me FOR HEALTH CARE, COLLEGE EDUCATION , CLIMATE CHANGE, SOMETHING GOOD AND POSITIVE !! Not for a fucking wall go fund the wall the resistance resist \$URL\$" $MENTION\$ DERANGED DELUSIONAL DUMB DICTATOR DONALD IS MENTALLY UNSTABLE! I WILL NEVER VOTE REPUBLICAN AGAIN IF THEY DON'T STAND UP TO THIS TYRANT LIVING IN THE WHITE HOUSE! fk republicans worst dictator ever unstable dictator \$URL\$" $MENTION\$ COULD WALK ON WATER AND THE never trump WILL CRAP ON EVERYTHING HE DOES. SHAME IN THEM. UNFOLLOW ALL OF THEM PLEASE!" Offensive train sample Labelled Corpus E1: Offensive train sample exemplar (can be same or different author) E2: Offensive train sample exemplar (can be same or different author)
  22. Auxiliary Dataset Signal: Timeline Module "look at what Hindus living

    in mixed-population localities are facing, what Dhruv Tyagi had to face for merely asking his Muslim neighbors not to sexually harass his daughter...and even then, if u ask why people don’t rent to Muslims, get ur head examined $MENTION\$ $MENTION\$ naah...Islamists will never accept Muslim refugees, they will tell the Muslims to create havoc in their home countries and do whatever it takes to convert Dar-ul-Harb into Dar-ul Islam..something we should seriously consider doing with Pak Hindus too One of the tweet by author before Example 2 One of the tweet by author after Example 2 Accusatory tone timestamp t-1 Hateful tweet timestamp t Accusatory and instigating timestamp t+1
  23. Experiments (E), Observations (O) & Takeaways (T) Fig 1: Baseline

    and Ablation [1] Vanilla mBERT systems (M8-M10) • M10, which uses a hate lexicon as external context, is not able to provide a significant improvement over mBERT due to the neutrally seeded GOTHate. • M9: Simple concatenation of network info leads to an improvement in performance for the hateful class. • T: Hateful users seem to share similar latent signals. • T: Implicit signals like network info are better for our dataset than explicit signals like hate lexicons. [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment
  24. Experiments (E), Observations (O) & Takeaways (T) Fig 1: Baseline

    and Ablation [1] Proposed system (M11-M14): • E: Building on the success of the previous baselines, we use mBERT as the base model. • O: Attentive infusion of signals seems to help reduce the noisy information in them. • T: No single signal significantly dominates the others; different signals seem to help different classes. • O/T: The class with the highest misclassification rate in human annotation is helped by the presence of exemplars. • T: Combining all 4 signals leads to an improvement in hate detection of 5 macro-F1 points! [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment
  25. Do we need to fine-tune all layers of BERT? [1]:

    https://iq.opengenus.org/bert-base-vs-bert-large/ Fig 1: BERT encoders [1] Fig 2: BERT Base vs. BERT Large in terms of encoding layers [1] Different layers encode the semantics to a different extent.
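    A minimal sketch of fine-tuning only the top encoder layers, assuming a bert-base checkpoint; the choice of two layers is illustrative:

    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=3)

    for param in model.bert.parameters():
        param.requires_grad = False                 # freeze embeddings + all encoder layers
    for layer in model.bert.encoder.layer[-2:]:     # unfreeze only the top two layers
        for param in layer.parameters():
            param.requires_grad = True
    # model.classifier stays trainable by default and is updated during fine-tuning.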
  26. Do we need to fine-tune all layers of BERT? [1]:

    Probing Critical Learning Dynamics of PLMs for Hate Speech Detection Fig 1: The best- and worst-performing layers for different BERT variants. [1]
  27. What role does classifier head play in fine tuning? [1]:

    Probing Critical Learning Dynamics of PLMs for Hate Speech Detection Fig 1: Impact of classifier head on BERT variants. [1]
  28. Hate Speech Detection Using GPT-3 Prompts Zero-Shot One-shot Few-shot Hate

    Speech Detection via GPT-3 Prompts: https://arxiv.org/pdf/2103.12407.pdf
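    A minimal sketch of how zero-, one-, and few-shot detection prompts can be assembled; the wording and examples are illustrative, not the paper's exact prompts, and the resulting string would be sent to a GPT-3-style completion endpoint:

    def build_prompt(post, examples=()):
        # Zero-shot if `examples` is empty; one-/few-shot otherwise.
        header = "Decide whether the following post is hateful. Answer Yes or No.\n\n"
        shots = "".join(f"Post: {text}\nHateful: {label}\n\n" for text, label in examples)
        return header + shots + f"Post: {post}\nHateful:"

    zero_shot = build_prompt("example post to classify")
    few_shot = build_prompt(
        "example post to classify",
        examples=[("example hateful post", "Yes"), ("example neutral post", "No")],
    )
    print(few_shot)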
  29. [1]: An Investigation of Large Language Models for Real-World Hate

    Speech Detection Fig 1: Prompt for ChatGPT [1] Fig 2: Comparing finetuning with zero-shot prompting [1] Hate Speech Detection Using ChatGPT
  30. Hate Speech Detection Using ChatGPT [1]: An Investigation of Large

    Language Models for Real-World Hate Speech Detection
  31. Hate Speech Generation Using GPT-3: Conditioning TOXIGEN: A Large-Scale Machine-Generated

    Dataset for Adversarial and Implicit Hate Speech Detection Before conditioning After conditioning
  32. Categorization of biases in Hate Speech Biases exist throughout

    the hate detection pipeline and cause downstream harms to various social groups. [1]: Handling Bias in Toxic Speech Detection: A Survey
  33. Annotation Bias Fig 1: Annotation biases across geography leading to

    variation in inter-annotator agreements [1] [1] CREHate: A CRoss-cultural English Hate Speech Dataset
  34. Lexical Bias • Models become biased towards spurious correlation of

    words/phrases in a class. • Terms contributing to lexical bias can be enlisted as bias-sensitive words (BSW). • BSW differ across datasets and the target groups under consideration. For example, if our dataset has a lot of hate speech against African Americans, then the mere presence of identity terms like Black or slur terms like n** can trigger the model to classify such statements as hateful irrespective of the context.
  35. Mitigating Lexical Bias via word substitution • Once BSW are

    identified, replace their occurrences in the Hate class via [1]: ◦ POS tags, to reduce dependency on explicit terms (Black -> <NOUN>) ◦ The parent in the hypernym tree (Black -> Color) ◦ The k-nearest neighbour in a word-embedding space [1]: Badjatiya et al., Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations [2]: Waseem & Hovy, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter [3]: Garg et al., Handling Bias in Toxic Speech Detection: A Survey, ACM CSUR Drawback of the study [1]: post substitution of BSW, the original BSW are employed to evaluate the reduction in bias, with no discussion of evaluating the bias on the newly substituted terms. Fig 1: Reproduced results [3] on WordNet replacement [1] for the Waseem & Hovy dataset [2]
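    A minimal sketch of the hypernym-tree substitution idea [1] using NLTK's WordNet; the BSW set is illustrative, and the naive first-sense/first-parent choice is an assumption rather than the original study's procedure:

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)

    def hypernym_of(word):
        # Replace a word by the first lemma of its first hypernym (parent) synset.
        synsets = wn.synsets(word)
        if not synsets or not synsets[0].hypernyms():
            return word
        return synsets[0].hypernyms()[0].lemma_names()[0].replace("_", " ")

    bsw = {"black", "muslim"}                          # illustrative bias-sensitive words
    text = "example post mentioning black people"
    print(" ".join(hypernym_of(w) if w in bsw else w for w in text.split()))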
  36. Our Knowledge-drift Experiment We conjecture that replacing all occurrences 𝑤

    ∈ 𝐵𝑆𝑊 with its WordNet ancestor 𝑎 ∈ 𝐴 will shift the lexical bias from 𝑤 to 𝑎 [2]. [1]: Badjatiya et al., Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations [2]: Garg et al., Handling Bias in Toxic Speech Detection: A Survey, ACM CSUR M-bias is the dataset before debiasing/substitution; M-gen is the dataset after debiasing. Fig 1: Knowledge-drift results. [2] Fig 2: Formula for the pinned bias metric. [1]
  37. Takeaway: Analogy of Bias & Physical Systems • Like energy,

    bias seems capable of transferring from one source to another. • Like a system at rest, a model/dataset with bias will remain biased unless an external force, in the form of mitigation or regularization terms, is added to the training. • Like interactive systems that tend to become more chaotic over time, bias mitigation in toxicity systems, and in NLP in general, needs to be incorporated into the pipeline in a continuous fashion.
  38. How to Combat Hate Speech Reactive countering When a hateful

    post has been made and we are intervening to prevent it further spreading. Proactive countering Intervene before the post goes public.
  39. Literature Overview: From Offense to Non-Offense [1]: Santos et al.,

    ACL’18 Fig 1: Unsupervised conversion of Offense to Neutral [1]
  40. Literature Overview: Intervention during Tweet creation • 200k users identified

    in the study, 50% randomly assigned to the control group. • H1: Are prompted users less likely to post the current offensive content? • H2: Are prompted users less likely to post offensive content in the future? [1]: Katsaros et al., ICWSM '22 Fig 1: User behaviour statistics as part of the intervention study [1] Fig 2: Twitter reply test for offensive replies. [1]
  41. NACL Dataset • Hateful samples collected from existing Hate Speech

    datasets. • Manually annotated for hate intensity and hateful spans. • Hate intensity is marked on a scale of 1-10. • Manual generation of the normalised counterpart and its intensity (k = 0.88). Fig 1: Original and Normalised Intensity Distribution [1] Fig 2: Dataset Stats [1] [1]: Masud et al., Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization, KDD 2022
  42. Motivation & Evidence • Reducing intensity is the stepping stone

    towards non-hate. • Does not force a change in sentiment or opinion. • Evidently leads to lower virality. Fig 1: Difference in predicted number of comments per set per iteration. [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  43. Problem Statement For a given hate sample 𝑡, our objective

    is to obtain its normalized (sensitised) form 𝑡′ such that the intensity of hatred 𝜙(𝑡) is reduced while the meaning is still conveyed [1]: 𝜙(𝑡′) < 𝜙(𝑡). Fig 1: Example of an original high-intensity vs. normalised sentence [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  44. Proposed Method: NACL- Neural hAte speeCh normaLizer Hate Intensity Prediction

    (HIP), Hate Span Identification (HSI), Hate Intensity Reduction (HIR). Fig 1: Flowchart of NACL [1]: Extremely Hateful Input (ORIGINAL) -> HATE NORMALIZATION -> Less Hateful Input (SUGGESTIVE), with the user's choice between the suggestive and original versions. [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  45. Hate Intensity Prediction (HIP) Fig 1: HIP + Framework [1]

    BERT + BiLSTM + Self Attention + Linear Activation [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
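    A minimal PyTorch sketch of the HIP stack as labelled above (BERT -> BiLSTM -> self-attention -> linear); the dimensions, pooling, and head counts are illustrative, not the paper's:

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class HIPSketch(nn.Module):
        def __init__(self, hidden=128):
            super().__init__()
            self.bert = AutoModel.from_pretrained("bert-base-uncased")
            self.bilstm = nn.LSTM(768, hidden, batch_first=True, bidirectional=True)
            self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
            self.out = nn.Linear(2 * hidden, 1)    # scalar hate-intensity score

        def forward(self, input_ids, attention_mask):
            h = self.bert(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
            h, _ = self.bilstm(h)
            h, _ = self.attn(h, h, h)
            return self.out(h.mean(dim=1))         # mean-pool tokens -> intensity

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    batch = tok(["example post"], return_tensors="pt")
    intensity = HIPSketch()(batch["input_ids"], batch["attention_mask"])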
  46. Hate Span Identification (HSI) Fig 1: Hate Normalization Framework [1]

    ELMO + BiLSTM + Self Attention + CRF [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  47. Hate Intensity Reduction Overall Loss Reward Fig 1: Hate Normalization

    Framework [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  48. Hate Intensity Reduction (HIR) Fig 1: Hate Intensity Reduction Module

    [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  49. Human Evaluation • Employ 20 diverse users to measure the

    quality of the generated texts. • Metrics: ◦ Intensity ◦ Fluency ◦ Adequacy Fig 1: Results of Human Evaluation for NACL-HSR [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
  50. [1] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior

    Using ChatGPT to generate contextual toxic content. Fig 1: Different outputs for toxicity based on context [1] Fig 2: Using ChatGPT to generate large-scale scenario-based toxicity [1]
  51. Motivation ❖ Depending on the scenario, specific types (intents) of

    counterspeeches have been shown to be more effective. ❖ In the figure, although the Denouncing and Question counterspeeches are valid, an Informative counterspeech would better address the given hate speech. ❖ We propose the first intent-conditioned counterspeech generation system, QUARC.
  52. Contributions ❖ Novel task: Intent-specific counterspeech generation. ❖ Novel dataset:

    IntentCONAN with 6831 counterspeeches for 3583 hate speeches, spanning five counterspeech intents. ❖ Novel model: QUARC, a two-phased intent-specific counterspeech generation framework. ❖ Evaluation: An extensive automated comparison and human evaluation to quantify the efficacy of our approach w.r.t. state-of-the-art baselines.
  53. Dataset: IntentCONAN ❖ We develop IntentCONAN, an intent-specific counterspeech dataset

    with 6831 counterspeeches, corresponding to 3583 hate speeches. ❖ These counterspeeches are distributed across 5 intents: informative, denouncing, question, positive, and humour.
  54. Benchmarks: Automated Evaluation and Ablations ❖ QUARC outperforms the comparative

    systems across all metrics. ❖ Best-performing baseline: GPS. ❖ Most systems are unable to produce counterspeeches that adhere to the desired intent. ❖ Further analysis of the Novelty and Diversity scores (lexical dissimilarity from the training corpus) reveals that the baselines do not produce diverse outputs and learn to copy. M: METEOR; SS: Semantic Similarity; BS: BERTScore; CA: Category Accuracy
  55. Benchmarks: Human Evaluation ❖ Due to the subjectivity of the

    task, automated evaluation is not comprehensive enough to ensure soundness. ❖ We conducted a human evaluation (with 60 evaluators), where we compared QUARC against the best performing baseline, GPS. ❖ We defined 5 metrics which were rated on a 5-point Likert scale, except for Category Accuracy (CA), which represents the proportion of counterspeeches with matching intents. ❖ QUARC significantly outperformed GPS across all metrics except Toxicity (slightly worse than GPS), with the biggest gain in CA, demonstrating its ability to generate intent-specific counterspeeches.
  56. Analysis: Congruence ❖ We define and compute Implicit Similarity (IS),

    a metric to measure the implicit affinity of the intents based on human evaluation. ❖ The distance between the codebook intent representations learnt through QUARC (left) closely aligns with the IS scores (right) computed between the intents. ❖ This provides a key insight into a critical factor behind the performance of QUARC. Left: A scatter plot of the codebook vectors (after dimensionality reduction) corresponding to different intents. Right: The Implicit Similarity (IS) between intent pairs captured through human evaluation.