
Thesis Presentation

September 03, 2025


Abstract
Despite our best efforts, tackling hate speech remains an elusive issue for researchers and practitioners alike. What can be considered hateful is subject to context, time, geography, and culture. This poses a challenge in defining standard benchmarks and modelling techniques to combat hate. However, what underpins hate is universally accepted as the intent of dehumanising and biasing against a historically vulnerable group. Unfortunately, determining both intent and power dynamics in an online setting is formidable; further, the influence of the human evaluator's lived experiences creates a gap in the human and computational understanding of hatefulness.

By examining the role of external priming via contextual signals, we aim to bridge this information gap and improve the human-computer alignment for analysing and monitoring hateful content on the Web.

Through a series of five dataset-and-model pairs, the thesis empirically establishes the efficacy of contextual signals in modelling hate speech-related tasks. The case for contextual signals is further solidified by the fact that our findings apply to any pipeline, from a feature-engineered logistic regressor to zero-shot-prompted large language models. However, we caution against a one-size-fits-all setup by quantifying the toxic connotations and scalability challenges of certain signals. To this end, the thesis outlines strategies for deployable, human-centric tools for both reactive and proactive moderation paradigms, focusing on the multilingual and implicit nature of hate.




Transcript

  1. Quantifying the Role of Contextual Signals for Modelling Hateful Text

    Supervisors: Tanmoy Chakraborty, Vikram Goyal Presenter: Sarah Masud (PHD19020) Committee: Anupam Joshi, Steven Schockaert, Sushmita Mitra
  2. The subsequent content contains extreme language sampled from social media, which does not reflect the opinions of myself or my collaborators. Reader discretion is advised. [1]: https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech
  3. Outline • Background - Introduction to hatefulness and toxicity •

    Motivation - Inspiration for context signals in content moderation • Part 1 - Examining implicit, multilingual and contextual datasets for contextual modelling • Part 2 - Generative and proactive contextualisation for human-centric hate moderation • Part 3 - Hate speech annotation and data curation in the age of LLMs • Summary - So what did we manage to achieve? • Future - Open challenges and research directions • Acknowledgements • Publications • Q&A + Response to reviewers [1]: https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech
  4. Broad Scope

    [1]: https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech We focus on the textual modality with numeric and network-level features; we do not analyse memes, for example. We focus on toxic and hateful online posts when referring to online content moderation, not, for example, factuality. We focus on content moderators and social media users.
  5. Hatefulness is as old as humanity…

    The radio propaganda of the Rwanda Genocide, 1994 [3] The Biblical first murder of humanity [1] Why we fight [2] [1]: Cain and Abel [2]: Constant Battles: Why We Fight [3]: Rwanda Genocide
  6. But what is hate speech? No one knows!

    • Hate is a specialised form of toxic and offensive content. • Hate is subjective, temporal and cultural in nature. • The UN defines hate speech as “any kind of communication that attacks or uses pejorative or discriminatory language with reference to a person or a group on the basis of who they are.” [1] • Social media users need sensitisation. Pyramid of Hate [2] [1]: UN hate [2]: Pyramid of Hate
  7. How do we define it?

    Hate is differentiated by extreme bias against the target via any of the following [1,3]: 1. Negatively stereotypes or distorts views on a vulnerable group with unfounded claims. 2. Silences or suppresses member(s) of a vulnerable group. 3. Promotes violence against member(s) of a vulnerable group. Hate is the manifestation of extreme bias towards an already marginalised group. Anyone can offend anyone, but hate is driven by power dynamics. It is not just an emotion but rather a behaviour. [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, Kulkarni et al., KDD 2023 (Thesis contribution) [2]: Defining Hate Speech, Sellars, Public Law Research Paper, 2016 [3]: Political mud slandering and power dynamics during Indian assembly elections, Masud & Chakraborty, SNAM 2023 (Thesis contribution)
  8. How has the internet changed the face of toxicity?

    • Faster • Cheaper • Voluminous • Anonymous • Multimodal & coded • Hard to track across platforms [1]: Angry by design: toxic communication and technical architectures, Luke Munn, Nature 2020 [2]: Multimodal and coded toxic text on Twitter [3]: https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech
  9. Current state of the pipeline for automated content moderation

    Example post: “They have been bred to be good at sports & entertainment, but not much else.” Model output label: Non-hateful. Human label: Offensive. Humans use their “socio-cultural background” and “world-knowledge” to annotate hate speech. Toxicity detection as an NLP task [1]. In LM-based toxicity detection pipelines, mimicking these “latent cues” from isolated text alone is difficult. [1]: Handling Bias in Toxic Speech Detection: A Survey, Garg et al., ACM Computing Surveys, 2023 (Thesis contribution)
  10. Broad research gaps and potential NLP-driven solutions

    Limitation #1: Focuses only on the post text. Limitation #2: Hard to understand implicit hate. Limitation #3: Hard to capture non-Western cultures. Limitation #4: Classification labels are not human-friendly. Potential solution #1: Curating specific datasets. Potential solution #2: Contextual signal modelling. Potential solution #3: NLP generation tasks.
  11. Research gaps –> thesis outline

    RQ1: What contextual signals can be obtained digitally? [Chapters 3, 4, 5, 6, 7] RQ2: How can contextual signals guide modelling, and when can they fail? [Chapters 3, 4, 5, 6, 7] RQ3: Can proactive nudging help users be less hateful? [Chapter 6] RQ4: Do contextual signals even matter in the age of LLMs? [Chapter 7] Potential solution #1: Curating specific datasets. Potential solution #2: Contextual signal modelling. Potential solution #3: Modelling generative tasks.
  12. Our taxonomy of contextual signals

    Any form of additional/auxiliary information that can be provided along with the input text of the post to offer more “context,” nudging the base model to be more “toxicity attuned.” Axes employed by us: • Endogenous vs exogenous ◦ Tweet history vs news items • 1-1 mapped vs generic ◦ Post’s metadata vs trending hashtags • In-dataset vs in-domain ◦ Target group vs hate intensity score • Labelled vs unsupervised ◦ Target group vs tweet history. Example: metadata signals from a Twitter post. (A small illustrative sketch of this taxonomy follows.)
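The taxonomy can be read as four independent axes per signal. Below is an illustrative-only sketch of how a signal could be described along those axes; the `ContextualSignal` class and its fields are hypothetical and only mirror the slide's pairings, not an actual artifact of the thesis.

```python
# Hypothetical data structure mirroring the slide's four axes; axes the slide does
# not specify for a given signal are left as None.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContextualSignal:
    name: str
    source: Optional[str] = None       # "endogenous" (tweet history) vs "exogenous" (news items)
    mapping: Optional[str] = None      # "1-1 mapped" (post metadata) vs "generic" (trending hashtags)
    scope: Optional[str] = None        # "in-dataset" (target group) vs "in-domain" (hate intensity score)
    supervision: Optional[str] = None  # "labelled" (target group) vs "unsupervised" (tweet history)

tweet_history = ContextualSignal("tweet history", source="endogenous", supervision="unsupervised")
news_items = ContextualSignal("news items", source="exogenous")
target_group = ContextualSignal("target group", scope="in-dataset", supervision="labelled")
print(tweet_history, news_items, target_group, sep="\n")
```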
  13. Hate is the New Infodemic: A Topic-aware Modeling of Hate Speech Diffusion on Twitter

    IEEE International Conference on Data Engineering (ICDE), 2021. Thesis contribution
  14. Limitations in literature

    • Analysing the hateful and non-hateful cascades as entirely separate groups. [1,2] • Only exploratory analysis; does not lead to predictions. [1,2] • Information cascade models do not take content into account, only who follows whom. [3,4] Motivation: contextually model the spread of hate via a combination of network + language features! [1]: Auditing radicalization pathways on YouTube, Ribeiro et al., WebSci 2018 [2]: Spread of Hate Speech in Online Social Media, Mathew et al., WebSci 2019 [3]: Topological Recurrent Neural Network for Diffusion Prediction, Wang et al., ICDM 2017 [4]: Multi-scale Information Diffusion Prediction with Reinforced Recurrent Networks, Yang et al., IJCAI 2019
  15. Data curation: ConInHate

    • Crawled a large-scale Twitter dataset (Jan-April 2020). • Indian topics (34 hashtags), but English in language. • 31k root tweets with 13k root users. • Endogenous features for each root tweet + user combination: ◦ Timeline ◦ Follow network (2-hops) ◦ List of retweeters of the 31k root tweets • Exogenous features: news articles (600k). • Manually annotated a total of 17k tweets as hateful or not (IAA = 0.58).
  16. Observations that motivate modelling

    Hatefulness of different users towards different hashtags; retweet cascades for hateful and non-hateful tweets. Observation: hate speech spreads faster and within a shorter period. Observation: different users show varying tendencies to engage in hateful content depending on the topic. [1]: Auditing radicalization pathways on YouTube, Ribeiro et al., WebSci 2018 [2]: Spread of Hate Speech in Online Social Media, Mathew et al., WebSci 2019
  17. Problem statement

    Given a hateful tweet and its associated signals, predict whether a given user (a follower account) will retweet the hateful tweet within a given time window. (A simplified sketch of this setup follows.)
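A minimal, simplified sketch of the setup, assuming the content and network signals are flattened into one feature vector per (tweet, follower, window) triple and fed to a plain classifier; the feature names, dimensions, and the use of logistic regression are illustrative stand-ins, not the topic-aware diffusion model used in the thesis.

```python
# Illustrative sketch only: combine content (tweet), endogenous (timeline, network)
# and exogenous (news) signals into one feature vector per candidate retweeter.
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurise(tweet_emb, timeline_emb, network_feats, news_emb):
    """Concatenate language and network-level signals for one candidate retweeter."""
    return np.concatenate([tweet_emb, timeline_emb, network_feats, news_emb])

rng = np.random.default_rng(0)
# Placeholder features; in practice these would come from the crawled dataset.
X = np.vstack([featurise(rng.random(64), rng.random(64), rng.random(8), rng.random(64))
               for _ in range(200)])
y = rng.integers(0, 2, size=200)          # 1 = follower retweeted within the window

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:1]))           # probability of retweet for one candidate
```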
  18. Experimental results

    Baseline comparisons. Behaviour of the cascades for different baselines; darker bars denote hate. Contextual signals provide additional information over network-only ones!
  19. Takeaways

    Limitations: • Feature-engineered contextual modelling. • Exogenous data signals are hard to scale and to map 1-1. • Hate speech detection itself is performed without context. Contributions: • An Indic and contextual dataset. • Contextual modelling of retweeting, formulated around the spreading behaviour observed in the data. Future work: • Accommodate non-organic diffusion, like advertisements on Facebook.
  20. Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

    ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2023 Thesis contribution
  21. Limitations in literature

    • A myopic approach to hate speech datasets using hate lexicons or hashtags. [1,2] • Lack of neutrally seeded datasets. [3] • Limited study in the Hinglish context. Our previous work was hashtag-driven! Motivation: a contextual and neutrally seeded Indic dataset for hate speech detection. [1]: Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter, Waseem & Hovy, NAACL 2016 [2]: Automated Hate Speech Detection and the Problem of Offensive Language, Davidson et al., WebSci 2017 [3]: Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, ElSherief et al., EMNLP 2021
  22. Dataset curation: GOTHate

    • Curated from 3 different geographies (India, USA, UK), 2019-2021. • Neutral seeding: 50k tweets, 3.7k hateful. • Tweets in English, Hindi and Hinglish. ◦ 3k tweets in pure Devanagari. • Endogenous features for each root user: ◦ Timeline ◦ Ego-network of the user. Dataset statistics of GOTHate: Hate, Offense, Provocation or Neutral. IAA = 0.70 for crowdsourced annotations via Xsaras services.
  23. Our annotation approach

    Phase I (internal): 3 experts (F: 3), IAA = 0.80. Phase II (Xsaras Services): 10 crowdsourced workers (M:F 6:10), IAA = 0.70. [1]: Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior, Founta et al., ICWSM 2018 [2]: Automated Hate Speech Detection and the Problem of Offensive Language, Davidson et al., WebSci 2017
  24. GOTHate’s neutral seeding

    Inter-class similarity among labels in respective datasets. #hindulivesmatter #Hinduphobia Sometimes you just have to jump into activist mode, esp. when #India & #Hinduism are denigrated. An offensive T-shirt has been removed after complaints to the T-shirt company. When the next time such things happen, WILL YOU ACT ????? #unsungHindus #Hinduunity #hindulivesmatter Sometimes, you just have to jump into activist mode, especially when India and Hinduism are denigrated. Thank you, Mahalakshmi Ganapathy Vijay Kumar Shourie Bannai. P N
  25. Intuition for in-dataset context

    Offensive train sample: "$mention$ $mention$ $mention$ and remember president loco said mexico will pay fuc**kfu ck trump f*** gop f*** republicans make go fund me for health care, college education , climate change, something good and positive !! not for a fucking wall go fund the wall the resistance resist $url$" E1: offensive train sample exemplar (can be by the same or a different author): "$mention$ deranged delusional Dumb dictator donald is mentally unstable! i will never vote republican again if they don't stand up to this tyrant living in the white house! fk republicans worst dictator ever unstable dictator $url$" E2: offensive train sample exemplar (can be by the same or a different author): "$mention$ could walk on water and the never Trump will crap on everything he does. shame in them. unfollow all of them please!" Exemplars are drawn from the labelled corpus. (A small retrieval sketch follows.)
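One plausible way to realise this exemplar signal is to retrieve, for each training sample, its nearest labelled neighbours and append them as context. The sketch below is an assumption-laden illustration: the `sentence-transformers` encoder, the `[SEP]` join, and similarity-based retrieval are choices made here for illustration, not necessarily the exact retrieval mechanism used in the thesis.

```python
# Illustrative sketch: retrieve k nearest same-label exemplars for a train sample
# and append them as in-dataset context.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def exemplar_context(sample: str, labelled_corpus: list, k: int = 2) -> str:
    """Concatenate the sample with its k most similar labelled exemplars."""
    emb = encoder.encode([sample] + labelled_corpus, convert_to_tensor=True)
    scores = util.cos_sim(emb[0], emb[1:])[0]
    exemplars = [labelled_corpus[int(i)] for i in scores.topk(k).indices]
    return " [SEP] ".join([sample] + exemplars)

offensive_posts = ["<offensive post A>", "<offensive post B>", "<offensive post C>"]
print(exemplar_context("<offensive train sample>", offensive_posts, k=2))
```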
  26. Observation for timeline signal

    "look at what Hindus living in mixed-population localities are facing, what Dhruv Tyagi had to face for merely asking his Muslim neighbors not to sexually harass his daughter...and even then, if u ask why people don’t rent to Muslims, get ur head examined" "$MENTION$ $MENTION$ naah...Islamists will never accept Muslim refugees, they will tell the Muslims to create havoc in their home countries and do whatever it takes to convert Dar-ul-Harb into Dar-ul Islam..something we should seriously consider doing with Pak Hindus too" "This must have been a apka tahir offer to jihadis - kill kaffirs, loot their property, do what u want with any kaffir female that “ur right hand posses”...in shirt, maal-e-ganimat delhi riots delhi riots" Shown alongside the hateful example tweet (timestamp t) are one tweet by the author before it (accusatory tone, timestamp t-1) and one tweet by the author after it (accusatory and instigating, timestamp t+1).
  27. Experimental results

    Baselines and ablations. M8: as expected, a major jump in performance when fine-tuning mBERT-based systems, even with no contextual features!
  28. Experimental results

    Baselines and ablations. Combining all 4 signals improves hate detection by 5 macro-F1 points. Attention-based contextual signals continue to provide additional information!
  29. Takeaways

    Limitations: • Not all platforms have the same set of endogenous signals. • Implicit hate analysis is missing. Contributions: • An Indic and contextual dataset. • Contextual modelling of fine-grained hate detection. Future work: • Accommodate the diversity of annotations instead of majority voting; hate lies on a spectrum.
  30. Focal Inferential Infusion Coupled with Tractable Density Discrimination for Implicit

    Hate Detection Natural Language Engineering (NLE), 2024 Thesis contribution
  31. Limitations in literature

    • Lack of empirical analysis of “to what extent do the implicit and neutral spaces overlap in latent space?” [1] • Contextual implicit hate detection does not account for toxic connotations. [2] Motivation: improve PLM fine-tuning for implicit hate detection via contextual clustering. [1]: Toxic, Hateful, Offensive or Abusive? What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets, Fortuna et al., LREC 2020 [2]: Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, ElSherief et al., EMNLP 2021
  32. Motivation for a clustering-based loss for implicit hate

    N: non-hate, I: implicit hate, E: explicit hate. • Non-hate is closer to implicit samples. • The ALD (one-to-one) distance shows more variability than the ACLD (centroid-only) distance. This follows from the fact that the mere presence of a keyword/lexicon does not render a sample hateful. Computed with frozen-BERT embeddings and L1 distance on samples from LatentHatred, annotated for (E/I, target and implied meaning) [3]. (A minimal sketch of this distance comparison follows.) [1]: I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language, Caselli et al., LREC 2020 [2]: Introducing the Gab Hate Corpus: defining and applying hate-based rhetoric to social media posts at scale, Kennedy et al., Language Resources and Evaluation [3]: Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, ElSherief et al., EMNLP 2021
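A minimal sketch of that comparison, assuming ALD is a mean pairwise (one-to-one) L1 distance between two classes and ACLD the L1 distance between their centroids, computed over frozen BERT [CLS] embeddings; the model choice, pooling strategy, and toy example texts are illustrative assumptions, not the exact analysis script.

```python
# Illustrative sketch: compare ALD (mean pairwise L1) and ACLD (centroid L1) between
# class embeddings obtained from a frozen BERT encoder.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return bert(**batch).last_hidden_state[:, 0, :]   # frozen [CLS] embeddings

def ald(a, b):
    """Average (one-to-one) linkage distance: mean pairwise L1 distance."""
    return torch.cdist(a, b, p=1).mean()

def acld(a, b):
    """Average centroid linkage distance: L1 distance between class centroids."""
    return torch.norm(a.mean(dim=0) - b.mean(dim=0), p=1)

non_hate = embed(["have a great day everyone", "the weather is lovely today"])
implicit = embed(["some groups just were not built for success"])   # toy stand-in

print("N-I ALD :", ald(non_hate, implicit).item())
print("N-I ACLD:", acld(non_hate, implicit).item())
```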
  33. Adaptive density discrimination (ADD) for implicit hate

    Implicit & implied. [1]: Metric Learning with Adaptive Density Discrimination, Rippel et al., ICLR 2016
  34. Experimental results

    Baselines and ablations, averaged over 3 seeds. Task-specific PLMs are separable in terms of fine-tuned performance [1]. [1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, Masud et al., EACL Findings 2024 (Thesis contribution)
  35. How does FiADD help implicit clusters?

    It brings the implied (intended) form closer to the implicit (surface) form by contextually modelling it.
  36. Takeaways

    Limitations: • Requires manually annotated 1-1 explanations for implicit hate. • Hard to explain the label-only output to content moderators. Contributions: • Contextual modelling of implicit signals. • Task-specific PLMs perform similarly to generic PLMs when fine-tuned for implicit hate detection. Future work: • Automate explanation generation for implicit hate. • Use other clustering techniques to improve fine-tuning.
  37. Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate

    Speech Findings of the Association for Computational Linguistics (Findings ACL), 2024 Thesis Contribution
  38. Limitations in literature

    • The majority of systems focus on the classification of hate speech; the output label is not intuitive for moderators. • An increasing reliance on general knowledge graphs (KGs) with PLMs for solving domain-specific tasks, without an examination of their impact. [1,2,3] We empirically record this! • Lack of toxicity-incorporated signals in explaining hate speech. Motivation: propose an alternative external signal that is much more sensitive to toxicity when explaining hate speech. [1]: Incorporating Commonsense Knowledge Graph in Pretrained Models for Social Commonsense Tasks, Chang et al., DeeLIO 2020 [2]: Leveraging World Knowledge in Implicit Hate Speech Detection, Lin, NLP4PI 2022 [3]: Explaining Toxic Text via Knowledge Enhanced Text Generation, Sridhar & Yang, NAACL 2022
  39. Why do KGs fail to help in explaining implicit hate?

    Hypothesis: “the process of retrieving and ranking KG tuples is task-agnostic and may not account for the multi-hop/indirect nature of implicit hate.” Significance testing and manual inspection reveal that the tuples are not informative/specific enough for our task. Setup: X (implicit post) + [T1, T2, …, Tm] -> Y (the implicit post’s short explanation). B: BLEU, R: ROUGE, BS: BERTScore. [1]: Social Bias Frames: Reasoning about Social and Power Implications of Language, Sap et al., ACL 2020 [2]: Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, ElSherief et al., EMNLP 2021 [3]: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge, Speer et al., AAAI 2017 [4]: StereoKG: Data-Driven Knowledge Graph Construction For Cultural Knowledge and Stereotypes, Deshpande et al., WOAH 2022
  40. Why do KGs fail to help in explaining implicit hate?

    Hypothesis: “the process of retrieving and ranking KG tuples is task-agnostic and may not account for the multi-hop/indirect nature of implicit hate.” Setup: X (implicit post) + [T1, T2, …, Tm] -> Y (the implicit post’s short explanation); a minimal sketch of this setup follows the example below. Example post: "lol. yes, africa isn't a developed land because of whites." topk3: ['white is distinct from black', 'white is related to cloud', 'white is opposite in meaning to black', 'white is related to snow', 'land is related to countryside', 'land is related to houses', 'land is related to shore', 'land is related to big', 'land is related to geography term', 'land is related to term', 'land is related to terrestrial', 'land is related to terrestrial area', 'land is related to feature', 'land is related to geographical', 'land is related to geographical feature', 'land is related to territory', 'land is related to value', 'land is related to airplane', 'land is related to airplane ending', 'land is related to ending'] A second set of retrieved tuples for the same post: ['indians kiss apartheids ass', 'american jews did become white', 'african american culture is interesting', 'muslims find countries undesirable', 'americans care about white', 'islam does have popularity in black community', 'black american culture is hypervisible', 'white americans doing something', 'americans are horribly bad with geography', 'indians are south asian community', 'black american culture is than americanized christianity', 'christians are problem nigeria', 'muslims are in christian land', 'muslims are not race', 'muslims do emigrate to countries', 'successful indian diaspora build country', 'muslims live in in ghettos', 'americans refer to region', 'indians are from south asia', 'indians are racist'] [1]: Social Bias Frames: Reasoning about Social and Power Implications of Language, Sap et al., ACL 2020 [2]: Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, ElSherief et al., EMNLP 2021 [3]: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge, Speer et al., AAAI 2017
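A minimal sketch of the X + [T1 … Tm] -> Y setup, assuming a seq2seq model such as BART that has been fine-tuned to generate short explanations; the checkpoint name, the `[SEP]`-style concatenation, and the hard-coded tuple list are illustrative assumptions rather than the exact Tox-BART pipeline.

```python
# Illustrative sketch: prepend retrieved KG tuples to an implicit post and generate
# a short explanation with a seq2seq model.
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")          # stand-in checkpoint
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

post = "lol. yes, africa isn't a developed land because of whites."
kg_tuples = ["white is distinct from black", "land is related to countryside"]  # top-k tuples

# X (implicit post) + [T1 ... Tm] as a single source sequence.
source = post + " [SEP] " + " ; ".join(kg_tuples)
inputs = tok(source, return_tensors="pt", truncation=True)

# Y: the implicit post's short explanation (only meaningful after task fine-tuning).
output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```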
  41. Experimental results: manual assessment

    C1: in-domain signal; C2: in-dataset (meta) signal. Fine-tuned systems retain task specificity better. SBIC human evaluation (1-5 scale); Target is binary.
  42. Takeaways

    Limitations: • English-only datasets. • More analysis of KGs’ application to subjective tasks is needed. • The explanations are helpful for moderators but not for users. Contributions: • Not all contextual signals are useful! • 1-1 mapped toxicity attributes are better for both implicit hate classification and its explanation. Future work: • Examine how ICL can be employed to retain both the task-specificity and the generalisation capabilities of LLMs.
  43. Proactively Reducing the Hate Intensity of Online Posts via Hate

    Speech Normalization ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2022 Thesis Contribution
  44. Existing Taxonomy of Hate Detection and Mitigation Reactive countering •

    Intervention after posting • Blocking posts or users • Users have no say Proactive countering • Intervention before posting • User’s choice • Can feel intrusive
  45. Limitations in literature

    • Attempts to convert offensive content to non-offensive content without any middle ground. [1] • Style-Transformer-based systems require large amounts of non-parallel data. [2] • The above models did not incorporate proactiveness in their proof of concept (POC). Motivation: can toxicity-attributed context cues be used to proactively nudge users to be less hateful? [1]: Fighting Offensive Language on Social Media, Santos et al., ACL 2018 [2]: Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation, Dai et al., ACL 2019
  46. Observations in the real world

    “The intervention has to be ground-up, not top-down.” - Howard Rheingold, Author [1]: Howard’s The Art of Hosting Good Conversations Online [2]: Social media punishment does not need to be a Kafkaesque nightmare [3]: Reconsidering Tweets: Intervening during Tweet Creation Decreases Offensive Content, Katsaros et al., ICWSM 2022
  47. Data curation

    • Hateful samples collected from existing hate speech datasets in English (focusing on explicit samples). [1,2,3] • Manually annotated for hate intensity (1-10) and explicit hateful spans. • Manually wrote the normalised counterpart and annotated its intensity (IAA = 0.88). Dataset statistics; original and normalised intensity distributions. [1]: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter, Basile et al., International Workshop on Semantic Evaluation, 2019 [2]: Automated Hate Speech Detection and the Problem of Offensive Language, Davidson et al., WebSci 2017 [3]: CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech, Chung et al., ACL 2019
  48. Problem Statement

    For a given hate sample t, our objective is to obtain its normalised (sensitised) form t′ such that the intensity of hatred φ(·) is reduced while the meaning is still conveyed: φ(t′) < φ(t). HATE NORMALIZATION: Extremely Hateful Input (ORIGINAL) → Less Hateful Input (SUGGESTIVE) → User’s Choice. Example of an original high-intensity vs. a normalised sentence. (A minimal sketch of this constraint follows.)
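A minimal sketch of the φ(t′) < φ(t) acceptance rule, assuming some hate-intensity scorer `phi` (e.g. a regressor trained on the 1-10 intensity annotations) and a rewriting module `suggest_rewrite`; both names are hypothetical placeholders, not components shipped with the thesis.

```python
# Illustrative sketch: accept a suggested rewrite only when it lowers hate intensity.
from typing import Callable

def normalise(post: str,
              phi: Callable[[str], float],
              suggest_rewrite: Callable[[str], str]) -> str:
    """Return a less intense (SUGGESTIVE) rewrite if phi(t') < phi(t), else the ORIGINAL."""
    candidate = suggest_rewrite(post)     # e.g. rewrite the detected hateful spans
    if phi(candidate) < phi(post):        # the normalisation constraint
        return candidate                  # shown to the user as a suggestion
    return post                           # the final choice always stays with the user
```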
  49. Experimental results: HIR (hate-intensity-reduced text generation)

    Human evaluation on a 1-5 scale; intensity is on a 1-10 scale. Nudging via toxicity-intensity reduction does not reduce the fluency or adequacy of the generated samples. [1]: A Benchmark Dataset for Learning to Intervene in Online Hate Speech, Qian et al., EMNLP 2019 [2]: CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech, Chung et al., ACL 2019
  50. Tool: detects hateful spans and suggests changes as you type

    Prototype of the NACL tool. Grammarly, but for hate reduction!
  51. Takeaways

    Limitations: • Tested only on English datasets. • Hard to model span detection for implicit/coded language. Contributions: • 1-1 mapped toxicity attributes are better for understanding explicit hate as well. • A proactive hate reduction system that engages with users instead of banning them right away. Future work: • The most effective way to determine the utility of proactive nudging is A/B testing in the real world!
  52. Hate Personified: Investigating the role of LLMs in content moderation

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Thesis Contribution
  53. Background

    Humans use their “socio-cultural background” and “world-knowledge” to annotate hate speech. [1]: Handling Bias in Toxic Speech Detection: A Survey, Garg et al., ACM Computing Surveys, 2023 (Thesis contribution) [2]: Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis, Lee et al., NAACL 2024
  54. Do LLMs pick up on geographical cues?

    RQ1: p_cue = “The following post was made in <country> or <language>.” Base case (post + query): p_base = “Statement: <POST>. Is the given statement hateful?” RQ1 prompt = p_cue + p_base. [1]: Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis, Lee et al., NAACL 2024
  55. Can LLMs mimic annotator personas?

    RQ2: p_cue = “The following post was made by <persona>.” Base case (post + query): p_base = “Statement: <POST>. Is the given statement hateful?” RQ2 prompt = p_cue + p_base. [1]: Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis, Lee et al., NAACL 2024
  56. Can LLMs mimic annotator personas? Not really!

    RQ2: p_cue = “The following post was made by <persona>.” vs. p_cue = “You are a <persona>.” [1]: Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs, Sun et al., NAACL 2025
  57. Are LLMs sensitive to anchoring bias?

    RQ3: p_cue = “<z%> of annotators consider the post as hateful.” Base case (post + query): p_base = “Statement: <POST>. Is the given statement hateful?” RQ3 prompt = p_cue + p_base. [1]: HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection, Mathew et al., AAAI 2021
  58. Are LLMs sensitive to anchoring bias? Yes!

    z ∈ {0%, 25%, 50%, 75%, 100%}. (A minimal sketch of the cue-prefixed prompting setup follows.)
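A minimal sketch of how the cue-prefixed prompts (RQ1-RQ3) could be assembled, assuming a generic zero-shot LLM client; `build_prompt` and the example cue strings are illustrative, and the final call to the model is left as a placeholder.

```python
# Illustrative sketch: prepend an optional contextual cue to the base hate query.
P_BASE = "Statement: {post}. Is the given statement hateful?"

def build_prompt(post: str, cue: str = "") -> str:
    """Return p_cue + p_base when a cue is given, otherwise the base case p_base."""
    base = P_BASE.format(post=post)
    return f"{cue} {base}".strip()

post = "<POST>"
prompts = [
    build_prompt(post),                                                         # base case
    build_prompt(post, "The following post was made in India."),               # RQ1: geographical cue
    build_prompt(post, "The following post was made by a female annotator."),  # RQ2: persona cue
    build_prompt(post, "75% of annotators consider the post as hateful."),     # RQ3: anchoring cue
]
for p in prompts:
    print(p)   # each prompt would then be sent to the LLM under a zero-shot setup
```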
  59. Takeaways

    Limitations: • Cultural markers are hard to define and reproduce at scale; the personas that work for one country will not be useful for another. • LLMs still lack the ability to untangle intersecting cues like Arab + non-Muslim. Contributions: • Cues that are useful for LM fine-tuning (e.g., numeric ones) may not be helpful for prompting. • Geographical and persona cues hold even under a multilingual setting. Future work: • Go beyond zero-shot prompting. • How about culturally fine-tuned models (on culture benchmarks)?
  60. So what did we manage to achieve?

    For users: assist in corrective behaviour. • Proactively nudge them to reduce explicit hate and untangle explicitness from greater engagement. (Chapter 6) For content moderators: reduce the burden by assisting with first-level flagging. • Propose context-aware tools that come closer to human contextual modelling. (Chapters 3, 4, 7) • Provide explanation-based systems for understanding implicit hate while moderating. (Chapter 5) For researchers/practitioners: through extensive experimentation and empirical research. • Release context-augmented datasets. (Chapters 3, 4, 5, 6) • Modelling practices: which contextual signal works under which setup (what + how to use). (Chapters 3, 4, 5, 6, 7) Contextual signal infusion is not one-size-fits-all; we see that across a range of datasets and tasks. Whether for moderators or users, “AI” can only be assistive in tackling a very human-centric problem.
  61. List of Publications (Journals) 1. Tanmoy Chakraborty, Sarah Masud: The

    Promethean Dilemma of AI at the Intersection of Hallucination and Creativity. Communications of the ACM, 2024. 2. Sarah Masud, Ashutosh Bajpai, Tanmoy Chakraborty: Focal Inferential Infusion Coupled with Tractable Density Discrimination for Implicit Hate Speech Detection. Natural Language Engineering (NLE), 2024. 3. Tanmay Garg, Sarah Masud, Tharun Suresh, Tanmoy Chakraborty: Handling Bias in Toxic Speech Detection: A Survey. ACM Computing Surveys, 2023. 4. Tanmoy Chakraborty, Sarah Masud: Judging the creative prowess of AI. Nature Machine Intelligence, 2023. 5. Sarah Masud, Tanmoy Chakraborty: Political mud slandering and power dynamics during Indian assembly elections. Social Network Analysis and Mining (SNAM), 2023. 6. Dhruv Sehnan, Vasu Goel, Sarah Masud, Chhavi Jain, Vikram Goyal, Tanmoy Chakraborty: DiVA: A Scalable, Interactive and Customizable Visual Analytics Platform for Information Diffusion on Large Networks. ACM Transactions on Knowledge Discovery from Data (ACM TKDD), 2023. 7. Tanmoy Chakraborty, Sarah Masud: Nipping in the bud: detection, diffusion and mitigation of hate speech on social media. ACM SIGWEB Newsletter, 2022.
  62. List of Publications (Conferences) 1. Sarah Masud, Sahajpreet Singh, Viktor

    Hangya, Alexander Fraser, Tanmoy Chakraborty: Hate Personified: Investigating the role of LLMs in content moderation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. 2. Neemesh Yadav, Sarah Masud, Vikram Goyal, Md. Shad Akhtar, Tanmoy Chakraborty: Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech. Findings of the Association for Computational Linguistics (ACL), 2024. 3. Sarah Masud, Mohammad Aflah Khan, Vikram Goyal, Md. Shad Akhtar, Tanmoy Chakraborty: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection. Findings of the Association for Computational Linguistics (EACL), 2024. 4. Atharva Kulkarni, Sarah Masud, Vikram Goyal, Tanmoy Chakraborty: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2023. 5. Sarah Masud, Manjot Bedi, Mohammad Aflah Khan, Md. Shad Akhtar, Tanmoy Chakraborty: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2022. 6. Sarah Masud, Subhabrata Dutta, Sakshi Makkar, Chhavi Jain, Vikram Goyal, Amitava Das, Tanmoy Chakraborty: Hate is the New Infodemic: A Topic-aware Modeling of Hate Speech Diffusion on Twitter. IEEE 37th International Conference on Data Engineering (ICDE), 2021.
  63. Acknowledgements

    • Supervisors: Dr. Tanmoy and Dr. Vikram • Internal and external review committee, anonymous article reviewers (including reviewer 2) • All the amazing collaborators (students, postdocs, professors) • Financial support from Google, PMRF, Wipro AI • IIIT-D admin, staff and colleagues, all collaborators! YAMSS! • Friends and family • Talks, tutorials, shared tasks and blogs where we have shared our work • Activists around the world teaching us that hate speech mitigation will be an ongoing effort
  64. We leave with more questions than answers

    Legal Aspects • How can legal frameworks be used to explain the classification of hate? • When working with social media orgs, who gets ownership of the POCs? • Where does dissent end, and where do sedition and toxicity begin? Social Aspects • How can we engage annotators of diverse perspectives when curating datasets or evaluating models? • Despite contextual cues, power dynamics are still hard to capture in the online world. Technical Aspects • What is the impact of task-specific models, KGs and datasets on downstream bias mitigation? • How can static NLP benchmarking move to a more dynamic setting to incorporate the dynamic nature of toxicity? • How can we better track the causal relation between other activities, like fake news or even football, and patterns of toxicity? • Sharing contextual datasets via platforms like HF is difficult! Access to data for CSS research is dwindling :(
  65. Response to reviewers: major concerns

    Reviewer 1: “My major concern is about the presentation of the thesis. I would suggest a rewrite, AT LEAST of Section 1.4, highlighting the consolidated (not section-wise) novelty of each contributory chapter along with the corresponding research publication(s) citations.” Response: Introduction (Chapter 1) as well as chapter-wise introductions updated. Reviewer 2: “Throughout the thesis, citations do not always follow the correct presentation style: when the citation is used as the subject in a sentence, it is customary to move the author names outside the parenthesis. For instance, on p15, ‘(Waseem and Hovy, 2016) released’ should be ‘Waseem and Hovy (2016) released’.” Response: Related work (Chapter 2) as well as individual instances of citations fixed. Reviewer 3: “Another element, somewhat indirectly acknowledged by the dissertation, is that for tools such as those proposed here to be usable, the dataset needs significantly more Annotators.” Response: Introduction (Chapter 1) and Conclusion (Chapter 8) updated.