Slide 1

Slide 1 text

An overview of hate speech analysis techniques in NLP - Presenter: Sarah Masud, LCS2 - Collaboration with: Dr. Tanmoy, Dr. Vikram, Dr. Shad, Manjot, Atharva, Aflah and many more!

Slide 2

Slide 2 text

Disclaimer: Subsequent content contains extreme language (verbatim from social media), which does not reflect my opinions or those of my collaborators. Reader’s discretion is advised.

Slide 3

Slide 3 text

Definition of Hate Speech ● Hate is subjective, temporal and cultural in nature. ● The UN defines hate speech as “any kind of communication that attacks or uses pejorative or discriminatory language with reference to a person or a group on the basis of who they are.” [1] ● Social media users need to be sensitised. [1]: UN hate [2]: Pyramid of Hate

Slide 4

Slide 4 text

How to Combat Hate Speech ● Reactive countering: intervening after a hateful post has been made, to prevent it from spreading further. ● Proactive countering: intervening before the post goes public.

Slide 5

Slide 5 text

Techniques for hate speech detection

Slide 6

Slide 6 text

How do you collect data on something that is not well defined? [1]: Resources and benchmark corpora for hate speech detection: a systematic review Fig 1: Related concepts in Hate Speech [1]

Slide 7

Slide 7 text

How do you collect data on something that is not well defined? Begin by defining the scope of what you will be examining and modeling. For example, in one project we wanted to study `Political Attacks`. Fig 1: Example definition of a political attack [1] [1]: Political mud slandering and power dynamics during Indian assembly elections

Slide 8

Slide 8 text

How do you collect data on something that is not well defined? Narrow down the source platform, languages and topic to be covered. Fig 1: Socio-political topics covered in GOTHate [1] [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment ● Not filtered for language (Hinglish). ● Source: Twitter (from Jan 2020-Jan 2021) ○ Primary dataset: Tweets on these (Fig. 1) topics. ○ Secondary dataset: User metadata, timeline and 1-hop network information.

Slide 9

Slide 9 text

How do you annotate the collected data? Decide on broad definitions, categories and a hierarchy (if any). Fig 2: Overview of Annotation Guideline [1] [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

How do you annotate the collected data? Internal experts vs the wisdom of the crowd. Fig 1: 2-phased Annotation Mode [1,2] ● Phase I: k = 0.80 ● Phase II: k = 0.70 [1]: Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior [2]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment
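The agreement values above (k) are inter-annotator agreement scores; a common choice is Cohen's kappa. A minimal sketch of computing such a score, assuming Cohen's kappa between two hypothetical annotators with toy labels (not the actual annotation data):

```python
# Compute Cohen's kappa between two annotators to measure inter-annotator agreement.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same 10 posts
annotator_a = ["hate", "offensive", "neutral", "hate", "neutral",
               "offensive", "hate", "neutral", "neutral", "hate"]
annotator_b = ["hate", "offensive", "neutral", "offensive", "neutral",
               "offensive", "hate", "neutral", "hate", "hate"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.2f}")  # ~0.70-0.80 is typically read as substantial agreement
```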

Slide 12

Slide 12 text

Observations, Lessons & Takeaways: 1. To overcome annotation bias: a. Communicate with the annotators to understand their point of view and share the aim of hate speech detection. b. Avoid working with annotators who have strong political affiliations or are directly involved in political work. c. Work with annotators who are diverse w.r.t. age, gender and linguistic capabilities. 2. There will still be mislabelling! a. It will mimic the chaotic real world more closely :P b. Even a dataset with mislabelled samples captures some annotation bias and the annotators’ points of view. It is not ideal, but it is still representative of society. [1] [1]: When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks

Slide 13

Slide 13 text

Literature Overview: Hate Datasets

Dataset | Source & Language (Modality) | Year | Labels | Annotation
Waseem & Hovy [1] | Twitter, English, Texts | 2016 | R, S, N | 16k, E, k = 0.84
Davidson et al. [2] | Twitter, English, Texts | 2017 | H, O, N | 25k, C, k = 0.92
Wulczyn et al. [3] | Wikipedia comments, English, Texts | 2017 | PA, N | 100k, C, k = 0.45
Gibert et al. [5] | Stormfront, English, Texts | 2018 | H, N | 10k, k = 0.62
Founta et al. [4] | Twitter, English, Texts | 2018 | H, A, SM, N | 70k, C, k = ?
Albadi et al. [6] | Twitter, Arabic, Texts | 2018 | H, N | 6k, C, k = 0.81

Labels: R - Racism, S - Sexism, H - Hate, PA - Personal Attack, A - Abuse, SM - Spam, O - Offensive, L - Religion, N - Neither, I - Implicit, E - Explicit. Annotation: E - Internal Experts, C - Crowd Sourced.
[1]: Waseem & Hovy, NAACL’16 [2]: Davidson et al., WebSci’17 [3]: Wulczyn et al., WWW’17 [4]: Founta et al., WebSci’18 [5]: Gibert et al., ALW2’18 [6]: Albadi et al., ANLP’20

Slide 14

Slide 14 text

Literature Overview: Hate Datasets (contd.)

Dataset | Source & Language (Modality) | Year | Labels | Annotation
Mathur et al. [1] | Twitter, Hinglish, Texts | 2018 | H, O, N | 3k, E, k = 0.83
Rizwan et al. [3] | Twitter, Urdu (Roman Urdu), Texts | 2020 | A, S, L, P, N | 10k, E, k = ?
Gomez et al. [4] | Twitter, English, Memes | 2020 | H, N | 150k, C, k = ?
ElSherief et al. [11] | Twitter, English, Texts | 2021 | I, E, N |

● Shared tasks and benchmarks: HASOC [5], Jigsaw Kaggle [6], SemEval [7], FB Hate-Meme Challenge [8], WOAH [9], CONSTRAINT [10]

Labels: R - Racism, S - Sexism, H - Hate, PA - Personal Attack, A - Abuse, SM - Spam, O - Offensive, L - Religion, N - Neither, I - Implicit, E - Explicit. Annotation: E - Internal Experts, C - Crowd Sourced.
[1]: Mathur et al., AAAI’20 [3]: Rizwan et al., EMNLP’19 [4]: Gomez et al., WACV’20 [5]: HASOC [6]: Jigsaw Kaggle [7]: SemEval [8]: FB Hate-Meme [9]: WOAH [10]: CONSTRAINT [11]: ElSherief et al., EMNLP’21

Slide 15

Slide 15 text

Literature Overview: Hate Detection ● N-gram TF-IDF + LR/SVM [1,2] ● GloVe + CNN, RNN [3] ● Transformer based ○ Zero-, Few-Shot [4] ○ Fine-tuning [5] ○ HateBERT [6] ● Generation for classification [7,11] ● Multimodality ○ Images [8] ○ Historical Context [9] ○ Network and Neighbours [10] ○ News, Trends, Prompts [11] [1]: Waseem & Hovy, NAACL’16 [2]: Davidson et al., WebSci’17 [3]: Badjatiya et al., WWW’17 [4]: Pelican et al., EACL Hackashop’21 [5]: Timer et al., EMNLP’21 [6]: Caselli et al., WOAH’21 [7]: Ke-Li et al. [8]: Kiela et al., NeurIPS’20 [9]: Qian et al., NAACL’19 [10]: Mehdi et al., IJCA’20, Vol 13 [11]: Badr et al.,
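As a concrete reference point for the first bullet, here is a minimal sketch of an n-gram TF-IDF + Logistic Regression baseline [1,2]; the toy texts and labels are placeholders, not taken from any cited dataset:

```python
# Classic bag-of-n-grams baseline: TF-IDF features fed to a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["I hate group X, they ruin everything", "What a lovely day at the park"]
labels = [1, 0]  # 1 = hateful, 0 = neither (toy labels)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), min_df=1)),  # word 1- to 3-grams
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["group X ruins everything"]))
```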

Slide 16

Slide 16 text

Modeling Via CNNs Fig 1: Overview of a CNN-based hate speech classifier [1] Fig 2: Dataset and classification results of the CNN-based detector [1] [1]: A Platform Agnostic Dual-Strand Hate Speech Detector
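A minimal PyTorch sketch of a CNN text classifier in the spirit of Fig 1; the vocabulary size, filter sizes, and pooling are illustrative assumptions, not the exact configuration of [1]:

```python
# CNN over token embeddings: parallel 1-D convolutions, max-pooling, linear classifier.
import torch
import torch.nn as nn

class CNNHateClassifier(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=100, n_filters=64,
                 kernel_sizes=(3, 4, 5), n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # One 1-D convolution per kernel size, applied over the token dimension
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))        # (batch, n_classes) logits

model = CNNHateClassifier()
logits = model(torch.randint(0, 30000, (8, 50)))        # batch of 8 dummy sequences
print(logits.shape)  # torch.Size([8, 2])
```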

Slide 17

Slide 17 text

Modeling Via RNNs Fig 1: Modeling RNN + Metadata MLP based detector [1] [1]: A Unified Deep Learning Architecture for Abuse Detection
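A hedged PyTorch sketch of the same idea: a recurrent text encoder fused with an MLP over user metadata. The layer sizes and metadata features are my assumptions, not the exact architecture of [1]:

```python
# GRU text encoder + metadata MLP, concatenated before the output layer.
import torch
import torch.nn as nn

class RNNMetadataDetector(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=100, hidden=128,
                 meta_dim=10, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.meta_mlp = nn.Sequential(nn.Linear(meta_dim, 32), nn.ReLU())
        self.out = nn.Linear(hidden + 32, n_classes)

    def forward(self, token_ids, metadata):
        _, h = self.rnn(self.embedding(token_ids))       # h: (1, batch, hidden)
        fused = torch.cat([h.squeeze(0), self.meta_mlp(metadata)], dim=1)
        return self.out(fused)                           # (batch, n_classes)

model = RNNMetadataDetector()
logits = model(torch.randint(0, 30000, (4, 40)), torch.rand(4, 10))
print(logits.shape)  # torch.Size([4, 2])
```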

Slide 18

Slide 18 text

Modeling Via BERT Fig 1: BERT model for sequence classification on hate speech data [1] Fig 2: BERT fine-tuning with classification heads of varying complexity [2] [1]: HASOCOne@FIRE-HASOC2020: Using BERT and Multilingual BERT models for Hate Speech Detection [2]: BERT-Based Transfer Learning Approach for Hate Speech Detection in Online Social Media
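A minimal sketch of fine-tuning BERT for sequence classification with the HuggingFace Transformers Trainer; the dataset, label set, and hyperparameters are illustrative placeholders, not the setups of [1,2]:

```python
# Fine-tune a pretrained BERT checkpoint with a classification head on toy data.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # e.g. hate / offensive / neither

texts = ["example post one", "example post two"]  # placeholder training texts
labels = [0, 2]

encodings = tokenizer(texts, truncation=True, padding=True)
dataset = [{"input_ids": encodings["input_ids"][i],
            "attention_mask": encodings["attention_mask"][i],
            "labels": labels[i]} for i in range(len(texts))]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hate-bert-ft", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```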

Slide 19

Slide 19 text

Overview of BERT pre-training via MLM Fig 1: Masking to perform pre-training via output probability loss [1] [1]: https://ankur3107.github.io/blogs/masked-langauge-modeling/
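A minimal sketch of the MLM objective behind Fig 1: mask one token, let BERT predict it, and read off the loss and top candidates. The checkpoint, sentence, and masked position are illustrative:

```python
# Masked language modeling in miniature: one masked position, cross-entropy loss there only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "Hateful posts spread fast on social media."
inputs = tokenizer(text, return_tensors="pt")

# Copy the true token ids as labels, then mask one position in the input;
# -100 tells the loss to ignore every position except the masked one.
labels = inputs["input_ids"].clone()
mask_pos = 2                                  # mask an arbitrary token for illustration
inputs["input_ids"][0, mask_pos] = tokenizer.mask_token_id
labels[0, torch.arange(labels.size(1)) != mask_pos] = -100

outputs = model(**inputs, labels=labels)      # cross-entropy at the masked position
top = outputs.logits[0, mask_pos].softmax(-1).topk(3)
print(outputs.loss.item(), tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```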

Slide 20

Slide 20 text

Continued Pre-training: Domain-specific
Training a Large LM (LLM) from scratch (pre-training): initialise the LM with random weights -> mask prediction (unsupervised generalised training) on a large-scale unlabeled corpus from various sources across the web -> saved LM with trained weights (Large-Scale General LM).
Training a Large LM (LLM) from a saved checkpoint (continued pre-training): load the LM with trained weights -> mask prediction (unsupervised domain-specific training) on a medium-sized unlabeled corpus from a specific domain -> saved LM with updated weights (Large-Scale Domain LM).
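A hedged sketch of the continued pre-training path above: load an already-trained checkpoint, keep the mask-prediction objective, and train further on an in-domain unlabeled corpus. The corpus, output path, and hyperparameters are assumptions, not a prescribed recipe:

```python
# Continued (domain-adaptive) pre-training with the MLM objective.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # saved LM weights

domain_posts = ["unlabeled social media post 1", "unlabeled social media post 2"]
dataset = [tokenizer(p, truncation=True, max_length=128) for p in domain_posts]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted-lm", num_train_epochs=1),
    train_dataset=dataset,
    # Randomly masks 15% of tokens in each batch and builds the MLM labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
model.save_pretrained("domain-adapted-lm")  # LM with updated, domain-specific weights
```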

Slide 21

Slide 21 text

BERT variants for different contexts via MLM 1. mBERT (Multilingual BERT) [1] -> 104 languages used to perform MLM. 2. BERTweet [2] -> over 845M tweets used to perform MLM. 3. HateBERT [3] -> over 1M offensive Reddit posts used to perform MLM. Fig 1: Top-3 MLM candidates for the template “Women are [MASK].” [3] Fig 2: Performance gain of HateBERT over BERT for hate classification [3] [1]: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [2]: BERTweet: A pre-trained language model for English Tweets [3]: HateBERT: Retraining BERT for Abusive Language Detection in English
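A hedged sketch of probing such variants with the Fig 1 template via fill-mask; the checkpoint names are the public HuggingFace ids I believe correspond to BERT, BERTweet, and HateBERT, and the output ranking will vary by checkpoint:

```python
# Compare top fill-mask candidates for the same template across MLM checkpoints.
from transformers import pipeline

for checkpoint in ["bert-base-uncased",        # BERT [1]
                   "vinai/bertweet-base",      # BERTweet [2]
                   "GroNLP/hateBERT"]:         # HateBERT [3]
    fill = pipeline("fill-mask", model=checkpoint)
    template = "Women are " + fill.tokenizer.mask_token + "."
    top3 = [out["token_str"] for out in fill(template, top_k=3)]
    print(checkpoint, "->", top3)
```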

Slide 22

Slide 22 text

Contextual Signal Infusion for Hate Detection Fig 1: Motivation for Auxiliary Data Signals[1] [1]: Kulkarni et al., Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, KDD 2023

Slide 23

Slide 23 text

In-Dataset Signal: Exemplars Module \$MENTION\$ \$MENTION\$ \$MENTION\$ AND Remember president loco SAID MEXICO WILL PAY FUC**kfu ck trump f*** gop f*** republicans Make go fund me FOR HEALTH CARE, COLLEGE EDUCATION , CLIMATE CHANGE, SOMETHING GOOD AND POSITIVE !! Not for a fucking wall go fund the wall the resistance resist \$URL\$" $MENTION\$ DERANGED DELUSIONAL DUMB DICTATOR DONALD IS MENTALLY UNSTABLE! I WILL NEVER VOTE REPUBLICAN AGAIN IF THEY DON'T STAND UP TO THIS TYRANT LIVING IN THE WHITE HOUSE! fk republicans worst dictator ever unstable dictator \$URL\$" $MENTION\$ COULD WALK ON WATER AND THE never trump WILL CRAP ON EVERYTHING HE DOES. SHAME IN THEM. UNFOLLOW ALL OF THEM PLEASE!" Offensive train sample Labelled Corpus E1: Offensive train sample exemplar (can be same or different author) E2: Offensive train sample exemplar (can be same or different author)

Slide 24

Slide 24 text

Auxiliary Dataset Signal: Timeline Module "look at what Hindus living in mixed-population localities are facing, what Dhruv Tyagi had to face for merely asking his Muslim neighbors not to sexually harass his daughter...and even then, if u ask why people don’t rent to Muslims, get ur head examined $MENTION\$ $MENTION\$ naah...Islamists will never accept Muslim refugees, they will tell the Muslims to create havoc in their home countries and do whatever it takes to convert Dar-ul-Harb into Dar-ul Islam..something we should seriously consider doing with Pak Hindus too One of the tweet by author before Example 2 One of the tweet by author after Example 2 Accusatory tone timestamp t-1 Hateful tweet timestamp t Accusatory and instigating timestamp t+1

Slide 25

Slide 25 text

External Knowledge Infusion: non-MLM Finetuning [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

Slide 26

Slide 26 text

Experiments (E), Observations (O) & Takeaways (T) Fig 1: Baseline and Ablation [1] Vanilla mBERT systems (M8-M10) ● M10, which uses a hate lexicon as external context, is not able to provide a significant improvement over mBERT due to the neutrally seeded GOTHate. ● M9: Simple concatenation of network information leads to an improvement in performance for the hateful class. ● T: Hateful users seem to share similar latent signals. ● T: Implicit signals like network information suit our dataset better than explicit signals like hate lexicons. [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

Slide 27

Slide 27 text

Experiments (E), Observations (O) & Takeaways (T) Fig 1: Baseline and Ablation [1] Proposed system (M11-M14): ● E: Building on the success of the previous baselines, we use mBERT as the base model. ● O: Attentive infusion of signals seems to help reduce the noisy information in them. ● T: No single signal significantly dominates the others; different signals seem to help different classes. ● O/T: The class with the highest misclassification rate in human annotation is helped by the presence of exemplars. ● T: Combining all 4 signals improves hate detection by 5 macro-F1 points! [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

Slide 28

Slide 28 text

Do we need to fine-tune all layers of BERT? Different layers encode semantics to a different extent. Fig 1: BERT Encoders [1] Fig 2: BERT Base vs Large in terms of encoding layers [1] [1]: https://iq.opengenus.org/bert-base-vs-bert-large/
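One way to act on this observation is partial fine-tuning: freeze most encoder layers and update only the top ones plus the classification head. A hedged sketch; choosing the top 2 layers is illustrative, not a recommendation from [1]:

```python
# Freeze the BERT encoder, then unfreeze only its top layers for fine-tuning.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

for param in model.bert.parameters():          # freeze the whole encoder first
    param.requires_grad = False
for layer in model.bert.encoder.layer[-2:]:    # then unfreeze the top 2 encoder layers
    for param in layer.parameters():
        param.requires_grad = True
# model.classifier stays trainable by default, so only the head + top layers get gradients

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```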

Slide 29

Slide 29 text

Do we need to fine-tune all layers of BERT? Fig 1: The best and worst performing layers for different BERT variants. [1] [1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection

Slide 30

Slide 30 text

What role does the classifier head play in fine-tuning? Fig 1: Impact of the classifier head on BERT variants. [1] [1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection
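A hedged sketch of varying head complexity: replace the default single linear head with a deeper MLP head before fine-tuning. The hidden sizes and dropout are arbitrary choices, not values from [1]:

```python
# Swap BERT's default linear classification head for a small MLP head.
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

hidden = model.config.hidden_size          # 768 for BERT-base
model.classifier = nn.Sequential(          # replaces the single Linear(768, 2) head
    nn.Linear(hidden, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 2),
)
print(model.classifier)
```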

Slide 31

Slide 31 text

Hate Speech Detection Using GPT-3 Prompts Zero-shot, One-shot, Few-shot Hate Speech Detection via GPT-3 Prompts: https://arxiv.org/pdf/2103.12407.pdf
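A hedged sketch of what a zero-/few-shot classification prompt can look like; the prompt wording, in-context examples, and model name are illustrative placeholders and not taken from the cited paper. It assumes the openai Python package (v1-style client) and an API key in the environment:

```python
# Few-shot prompting for binary hate classification via an LLM API.
from openai import OpenAI

client = OpenAI()

few_shot_prompt = (
    "Decide whether each post is hateful. Answer Yes or No.\n"
    'Post: "All <group> people should be banned." -> Yes\n'
    'Post: "I loved the concert last night." -> No\n'
    'Post: "Go back to where you came from." -> '
)

response = client.chat.completions.create(
    model="gpt-4o-mini",                      # placeholder model name
    messages=[{"role": "user", "content": few_shot_prompt}],
    max_tokens=3,
    temperature=0,
)
print(response.choices[0].message.content)    # expected: "Yes" or "No"
```

Dropping the two in-context examples from the prompt turns the same call into a zero-shot setup.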

Slide 32

Slide 32 text

[1]: An Investigation of Large Language Models for Real-World Hate Speech Detection Fig 1: Prompt for ChatGPT [1] Fig 2: Comparing finetuning with zero-shot prompting [1] Hate Speech Detection Using ChatGPT

Slide 33

Slide 33 text

Hate Speech Detection Using ChatGPT [1]: An Investigation of Large Language Models for Real-World Hate Speech Detection

Slide 34

Slide 34 text

Recall Conditional Generation?

Slide 35

Slide 35 text

Hate Speech Generation Using GPT-3: Decoding Strategy Greedy Search vs Beam Search
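A minimal sketch contrasting the two decoding strategies with a generic GPT-2 checkpoint from HuggingFace (GPT-3 itself is not openly downloadable); the prompt and model are illustrative, not TOXIGEN's setup:

```python
# Greedy search picks the argmax token at each step; beam search keeps several candidate
# continuations and returns the highest-scoring full sequence.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Online discussions about minorities often", return_tensors="pt")

greedy = model.generate(**inputs, max_new_tokens=20, do_sample=False)             # greedy search
beam = model.generate(**inputs, max_new_tokens=20, num_beams=5, do_sample=False)  # beam search

print("Greedy:", tokenizer.decode(greedy[0], skip_special_tokens=True))
print("Beam  :", tokenizer.decode(beam[0], skip_special_tokens=True))
```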

Slide 36

Slide 36 text

Hate Speech Generation Using GPT-3 TOXIGEN: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

Slide 37

Slide 37 text

Hate Speech Generation Using GPT-3: Conditioning TOXIGEN: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection Before conditioning After conditioning

Slide 38

Slide 38 text

Biases in Hate Speech Detection

Slide 39

Slide 39 text

Categorization of biases in Hate Speech Biases exist throughout the hate detection pipeline and cause downstream harms to various social groups. [1]: Handling Bias in Toxic Speech Detection: A Survey

Slide 40

Slide 40 text

Annotation Bias Fig 1: Annotation biases across geography leading to variation in inter-annotator agreements [1] [1] CREHate: A CRoss-cultural English Hate Speech Dataset

Slide 41

Slide 41 text

Lexical Bias ● Models become biased towards spurious correlations of words/phrases with a class. ● Terms contributing to lexical bias can be enlisted as bias-sensitive words (BSW). ● BSW differ for different datasets and the target group under consideration. For example, if our dataset has a lot of hate speech against African Americans, then the mere presence of identity terms like Black or slur terms like n** can trigger the model to classify such statements as hateful irrespective of the context.

Slide 42

Slide 42 text

Mitigating Lexical Bias via word substitution ● Once BSW are identified, replace their occurrences in the hate class via [1]: ○ POS tags, to reduce dependency on explicit terms (Black -> ) ○ The parent in the hypernym tree (Black -> Color) ○ The k-nearest neighbour in a word-embedding space. Drawback of the study: after substitution of the BSW, the original BSW are still used to evaluate the reduction in bias, with no discussion of evaluating the bias on the newly substituted terms. [1] Fig 1: Reproduced results [3] of WordNet replacement [1] for the Waseem & Hovy dataset [2] [1]: Badjatiya et al., Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations [2]: Waseem & Hovy, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter [3]: Garg et al., Handling Bias in Toxic Speech Detection: A Survey, ACM CSUR
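A hedged sketch of the hypernym-substitution idea from [1] using WordNet via NLTK; the BSW list, the first-sense choice, and whitespace tokenisation are simplifications of mine, not the paper's exact procedure:

```python
# Replace bias-sensitive words with their WordNet hypernym (one level up the tree).
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def hypernym_of(word):
    """Return the lemma of the first hypernym of the word's first noun sense, if any."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if synsets and synsets[0].hypernyms():
        return synsets[0].hypernyms()[0].lemmas()[0].name().replace("_", " ")
    return word

bsw = ["black", "immigrant"]                     # hypothetical bias-sensitive words
text = "black immigrant families moved here"
debiased = " ".join(hypernym_of(tok) if tok in bsw else tok for tok in text.split())
print(debiased)
```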

Slide 43

Slide 43 text

Our Knowledge-drift Experiment We conjecture that replacing all occurrences of 𝑤 ∈ BSW with its WordNet ancestor 𝑎 ∈ 𝐴 will shift the lexical bias from 𝑤 to 𝑎. [2] M-bias is the dataset before debiasing/substitution; M-gen is the dataset after debiasing. Fig 1: Knowledge-drift results. [2] Fig 2: Formula for the pinned bias metric. [1] [1]: Badjatiya et al., Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations [2]: Garg et al., Handling Bias in Toxic Speech Detection: A Survey, ACM CSUR

Slide 44

Slide 44 text

Takeaway: Analogy Between Bias & Physical Systems ● Like energy, bias seems capable of transferring from one source to another. ● Like a system at rest, a biased model/dataset will remain biased unless the external force of mitigation/regularization terms is added to the training. ● Like interacting systems that tend to become more chaotic over time, bias mitigation in toxicity systems (and NLP in general) needs to be incorporated into the pipeline in a continuous fashion.

Slide 45

Slide 45 text

Techniques for hate speech reconstruction

Slide 46

Slide 46 text

How to Combat Hate Speech ● Reactive countering: intervening after a hateful post has been made, to prevent it from spreading further. ● Proactive countering: intervening before the post goes public.

Slide 47

Slide 47 text

Literature Overview: From Offense to Non-Offense [1]: Santos et al., ACL’18 Fig 1: Unsupervised conversion of Offense to Neutral [1]

Slide 48

Slide 48 text

Literature Overview: Intervention during Tweet creation ● 200k users identified in the study; 50% randomly assigned to the control group. ● H1: Are prompted users less likely to post the current offensive content? ● H2: Are prompted users less likely to post offensive content in the future? Fig 1: User behaviour statistics as part of the intervention study [1] Fig 2: Twitter reply test for offensive replies. [1] [1]: Katsaros et al., ICWSM ‘22

Slide 49

Slide 49 text

NACL Dataset ● Hateful samples collected from existing Hate Speech datasets. ● Manually annotated for Hate intensity and hateful spans. ● Hate Intensity is marked on a scale of 1-10. ● Manual generation of normalised counter-part and its intensity. (k = 0.88) Fig 1: Original and Normalised Intensity Distribution [1] Fig 2: Dataset Stats [1] [1]: Masud et al., Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization, KDD 2022

Slide 50

Slide 50 text

Motivation & Evidence ● Reducing intensity is a stepping stone towards non-hate. ● It does not force the author to change their sentiment or opinion. ● It evidently leads to less virality. Fig 1: Difference in predicted number of comments per set per iteration. [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization

Slide 51

Slide 51 text

Problem Statement For a given hate sample 𝑡, our objective is to obtain its normalized (sensitised) form 𝑡′ such that the intensity of hatred is reduced while the meaning is still conveyed, i.e., 𝜙(𝑡′) < 𝜙(𝑡). [1] Fig 1: Example of an original high-intensity vs normalised sentence [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization

Slide 52

Slide 52 text

Proposed Method: NACL - Neural hAte speeCh normaLizer ● Hate Intensity Prediction (HIP) ● Hate Span Identification (HSI) ● Hate Intensity Reduction (HIR) Fig 1: Flowchart of NACL [1] - the extremely hateful input (ORIGINAL) passes through HATE NORMALIZATION to produce a less hateful version (SUGGESTIVE); the user’s choice is between the two. [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization

Slide 53

Slide 53 text

Hate Intensity Prediction (HIP) Fig 1: HIP + Framework [1] BERT + BiLSTM + Self Attention + Linear Activation [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization
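A hedged PyTorch sketch of the stack named on the slide (BERT -> BiLSTM -> self-attention -> linear); the hidden sizes, attention heads, and mean pooling are my assumptions, not NACL's released implementation:

```python
# HIP-style intensity regressor: BERT encoder, BiLSTM, self-attention, linear output.
import torch
import torch.nn as nn
from transformers import AutoModel

class HateIntensityPredictor(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.regressor = nn.Linear(2 * hidden, 1)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.bilstm(h)                 # contextualise with a BiLSTM
        h, _ = self.attn(h, h, h)             # self-attention over the BiLSTM states
        pooled = h.mean(dim=1)                # mean-pool over tokens
        return self.regressor(pooled)         # linear activation -> intensity score

# usage: model(input_ids, attention_mask) -> (batch, 1) predicted intensities
```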

Slide 54

Slide 54 text

Hate Span Identification (HSI) Fig 1: Hate Normalization Framework [1] ELMO + BiLSTM + Self Attention + CRF [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization

Slide 55

Slide 55 text

Hate Intensity Reduction: Overall Loss & Reward Fig 1: Hate Normalization Framework [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization

Slide 56

Slide 56 text

Hate Intensity Reduction (HIR) Fig 1: Hate Intensity Reduction Module [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization

Slide 57

Slide 57 text

Human Evaluation ● Employ 20 diverse users to assess the quality of the generated texts. ● Metrics: ○ Intensity ○ Fluency ○ Adequacy Fig 1: Results of Human Evaluation for NACL-HSR [1] [1]: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization

Slide 58

Slide 58 text

Techniques for countering hate speech

Slide 59

Slide 59 text

[1] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior Using ChatGPT to generate contextual toxic content Fig 1: Different output for toxicity based on context [1] Fig 2: Using ChatGPT to generate large scale scenario-based toxicity[1]

Slide 60

Slide 60 text

Motivation ❖ Depending on the scenario, specific types (intents) of counterspeeches have been shown to be more effective. ❖ In the figure, although the Denouncing and Question counterspeeches are valid, an Informative counterspeech would better address the given hate speech. ❖ We propose the first intent-conditioned counterspeech generation system, QUARC.

Slide 61

Slide 61 text

Contributions ❖ Novel task: Intent-specific counterspeech generation. ❖ Novel dataset: IntentCONAN with 6831 counterspeeches for 3583 hate speeches spanning five counterspeech intents. ❖ Novel model: QUARC, a two-phased intent-specific counterspeech generation framework. ❖ Evaluation: An extensive automated comparison and human evaluation to quantify the efficacy of our approach w.r.t. state-of-the-art baselines.

Slide 62

Slide 62 text

Dataset: IntentCONAN ❖ We develop IntentCONAN, an intent-specific counterspeech dataset with 6831 counterspeeches, corresponding to 3583 hate speeches. ❖ These counterspeeches are distributed across 5 intents: informative, denouncing, question, positive, and humour.

Slide 63

Slide 63 text

Architecture: QUARC

Slide 64

Slide 64 text

Benchmarks: Automated Evaluation and Ablations ❖ QUARC outperforms the comparative systems across all metrics. ❖ Best performing baseline: GPS. ❖ Most systems are unable to produce counterspeeches that adhere to the desired intent. ❖ Further analysis of Novelty and Diversity scores (lexical dissimilarity from the training corpus) reveals that the baselines do not produce diverse outputs and learn to copy. M: Meteor; SS: Semantic Similarity; BS: BERTScore; CA: Category Accuracy
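For reference, two of the listed automated metrics (METEOR and BERTScore) can be computed with the HuggingFace evaluate library; the prediction/reference texts below are toy placeholders, not IntentCONAN outputs:

```python
# Compute METEOR and BERTScore for generated counterspeech against references.
import evaluate

predictions = ["Violence against any community is never acceptable."]
references = ["Attacking people for their identity is never justified."]

meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

print(meteor.compute(predictions=predictions, references=references)["meteor"])
print(bertscore.compute(predictions=predictions, references=references,
                        lang="en")["f1"])
```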

Slide 65

Slide 65 text

Benchmarks: Human Evaluation ❖ Due to the subjectivity of the task, automated evaluation is not comprehensive enough to ensure soundness. ❖ We conducted a human evaluation (with 60 evaluators), where we compared QUARC against the best performing baseline, GPS. ❖ We defined 5 metrics which were rated on a 5-point Likert scale, except for Category Accuracy (CA), which represents the proportion of counterspeeches with matching intents. ❖ QUARC significantly outperformed GPS across all metrics except Toxicity (slightly worse than GPS), with the biggest gain in CA, demonstrating its ability to generate intent-specific counterspeeches.

Slide 66

Slide 66 text

Analysis: Congruence ❖ We define and compute Implicit Similarity (IS), a metric to measure the implicit affinity of the intents based on human evaluation. ❖ The distance between the codebook intent representations learnt through QUARC (left) closely aligns with the IS scores (right) computed between the intents. ❖ This provides a key insight into a critical factor behind the performance of QUARC. Left: A scatter plot of the codebook vectors (after dimensionality reduction) corresponding to different intents. Right: The Implicit Similarity (IS) between intent pairs captured through human evaluation.