
Thesis Presentation

September 03, 2025


Abstract
Despite our best efforts, tackling hate speech remains an elusive issue for researchers and practitioners alike. What can be considered hateful is subject to context, time, geography, and culture. This poses a challenge in defining standard benchmarks and modelling techniques to combat hate. However, what underpins hate is universally accepted as the intent of dehumanising and biasing against a historically vulnerable group. Unfortunately, determining both intent and power dynamics in an online setting is formidable; further, the influence of the human evaluator's lived experiences creates a gap in the human and computational understanding of hatefulness.

By examining the role of external priming via contextual signals, we aim to bridge this information gap and improve the human-computer alignment for analysing and monitoring hateful content on the Web.

Through a series of five dataset-and-model pairs, the thesis empirically establishes the efficacy of contextual signals in modelling hate speech-related tasks. The case for contextual signals is further solidified by the fact that our findings apply to any pipeline, from a feature-engineered logistic regressor to zero-shot-prompted large language models. However, we caution against a one-size-fits-all setup by quantifying the toxic connotations and scalability challenges of certain signals. To this end, the thesis outlines strategies for deployable, human-centric tools for both reactive and proactive moderation paradigms, focusing on the multilingual and implicit nature of hate.




Transcript

  1. Quantifying the Role of Contextual Signals for Modelling Hateful Text

    Supervisors: Tanmoy Chakraborty, Vikram Goyal Presenter: Sarah Masud (PHD19020) Committee: Anupam Joshi, Steven Schockaert, Sushmita Mitra
  2. The subsequent content contains extreme language sampled from social media, which does not reflect the opinions of myself or my collaborators. Reader discretion is advised. [1]: https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech
  3. Outline • Background - Introduction to hatefulness and toxicity •

    Motivation - Inspiration for context signals in content moderation • Part 1 - Examining implicit, multilingual and contextual datasets for contextual modelling • Part 2 - Generative and proactive contextualisation for human-centric hate moderation • Part 3 - Hate speech annotation and data curation in the age of LLMs • Summary - So what did we manage to achieve? • Future - Open challenges and research directions • Acknowledgements • Publications • Q&A + Response to reviewers [1]: https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech
  4. Broad Scope

    [1]: https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech We focus on the textual modality with numeric and network-level features; we do not analyse memes, for example. We focus on toxic and hateful online posts when referring to online content moderation, not, for example, factuality. We focus on content moderators and social media users.
  5. Hatefulness is as old as humanity…

    The radio propaganda of the Rwanda Genocide, 1994 [3] The Biblical first murder of humanity [1] Why we fight [2] [1]: Cain and Abel [2]: Constant Battles: Why We Fight [3]: Rwanda Genocide
  6. But what is hate speech? No one knows!

    • Hate is a specialised form of toxic and offensive content. • Hate is subjective, temporal and cultural in nature. • The UN defines hate speech as “any kind of communication that attacks or uses pejorative or discriminatory language with reference to a person or a group on the basis of who they are.” [1] • Social media users need sensitisation. Pyramid of Hate [2] [1]: UN hate [2]: Pyramid of Hate
  7. How do we define it?

    Hate is differentiated by extreme bias against the target via any of the following [1,3]: 1. Negatively stereotypes or distorts views on a vulnerable group with unfounded claims. 2. Silences or suppresses member(s) of a vulnerable group. 3. Promotes violence against member(s) of a vulnerable group. Hate is the manifestation of extreme bias towards an already marginalised group. Anyone can offend anyone, but hate is driven by power dynamics. It is not just an emotion but rather a behaviour. [1]: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment, Kulkarni et al., KDD 2023 (Thesis contribution) [2]: Defining Hate Speech, Sellars, Public Law Research Paper, 2016 [3]: Political mud slandering and power dynamics during Indian assembly elections, Masud & Chakraborty, SNAM 2023 (Thesis contribution)
  8. How has the internet changed the face of toxicity?

    • Faster • Cheaper • Voluminous • Anonymous • Multimodal & coded • Hard to track across platforms [1]: Angry by design: toxic communication and technical architectures, Luke Munn, Nature 2020 [2]: Multimodal and coded toxic text on Twitter [3]: https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech
  9. Current state of the pipeline for automated content moderation

    Example post: “They have been bred to be good at sports & entertainment, but not much else.” Model output label: Non-hateful. Human label: Offensive. Humans use their “socio-cultural background” and “world-knowledge” to annotate hate speech. Toxicity detection as an NLP task [1]. In LM-based toxicity detection pipelines, mimicking these “latent cues” from isolated text alone is difficult. [1]: Handling Bias in Toxic Speech Detection: A Survey, Garg et al., ACM Computing Surveys, 2023 (Thesis contribution)
  10. Broad research gaps and potential NLP-driven solutions

    Limitation #1: Focuses only on the post text. Limitation #2: Hard to understand implicit hate. Limitation #3: Hard to capture non-Western cultures. Limitation #4: Classification labels are not human-friendly. Potential solution #1: Curating specific datasets. Potential solution #2: Contextual signal modelling. Potential solution #3: NLP generation tasks.
  11. Research gaps –> thesis outline

    RQ1: What contextual signals can be obtained digitally? [Chapters 3, 4, 5, 6, 7] RQ2: How can contextual signals guide modelling, and when can they fail? [Chapters 3, 4, 5, 6, 7] RQ3: Can proactive nudging help users be less hateful? [Chapter 6] RQ4: Do contextual signals even matter in the age of LLMs? [Chapter 7] Potential solution #1: Curating specific datasets. Potential solution #2: Contextual signal modelling. Potential solution #3: Modelling generative tasks.
  12. Our taxonomy of contextual signals

    Any form of additional/auxiliary information that can be provided along with the input text of the post to offer more “context,” nudging the base model to be more “toxicity attuned.” Axes employed by us: • Endogenous vs exogenous ◦ Tweet history vs news items • 1-1 mapped vs generic ◦ Post’s metadata vs trending hashtags • In-dataset vs in-domain ◦ Target group vs hate intensity score • Labelled vs unsupervised ◦ Target group vs tweet history. Example: metadata signals from a Twitter post. (A small illustrative sketch of this taxonomy follows.)
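The taxonomy can be read as four independent axes per signal. Below is an illustrative-only sketch of how a signal could be described along those axes; the `ContextualSignal` class and its fields are hypothetical and only mirror the slide's pairings, not an actual artifact of the thesis.

```python
# Hypothetical data structure mirroring the slide's four axes; axes the slide does
# not specify for a given signal are left as None.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContextualSignal:
    name: str
    source: Optional[str] = None       # "endogenous" (tweet history) vs "exogenous" (news items)
    mapping: Optional[str] = None      # "1-1 mapped" (post metadata) vs "generic" (trending hashtags)
    scope: Optional[str] = None        # "in-dataset" (target group) vs "in-domain" (hate intensity score)
    supervision: Optional[str] = None  # "labelled" (target group) vs "unsupervised" (tweet history)

tweet_history = ContextualSignal("tweet history", source="endogenous", supervision="unsupervised")
news_items = ContextualSignal("news items", source="exogenous")
target_group = ContextualSignal("target group", scope="in-dataset", supervision="labelled")
print(tweet_history, news_items, target_group, sep="\n")
```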
  13. Hate is the New Infodemic: A Topic-aware Modeling of Hate Speech Diffusion on Twitter

    IEEE International Conference on Data Engineering (ICDE), 2021. Thesis contribution
  14. Limitations in literature

    • Analysing the hateful and non-hateful cascades as entirely separate groups. [1,2] • Only exploratory analysis; does not lead to predictions. [1,2] • Information cascade models do not take content into account, only who follows whom. [3,4] Motivation: contextually model the spread of hate via a combination of network + language features! [1]: Auditing radicalization pathways on YouTube, Ribeiro et al., WebSci 2018 [2]: Spread of Hate Speech in Online Social Media, Mathew et al., WebSci 2019 [3]: Topological Recurrent Neural Network for Diffusion Prediction, Wang et al., ICDM 2017 [4]: Multi-scale Information Diffusion Prediction with Reinforced Recurrent Networks, Yang et al., IJCAI 2019
  15. Data curation: ConInHate

    • Crawled a large-scale Twitter dataset (Jan-April 2020). • Indian topics (34 hashtags), but English in language. • 31k root tweets with 13k root users. • Endogenous features for each root tweet + user combination: ◦ Timeline ◦ Follow network (2-hops) ◦ List of retweeters of the 31k root tweets • Exogenous features: news articles (600k). • Manually annotated a total of 17k tweets as hateful or not (IAA = 0.58).
  16. Observations that motivate modelling

    Hatefulness of different users towards different hashtags; retweet cascades for hateful and non-hateful tweets. Observation: hate speech spreads faster and within a shorter period. Observation: different users show varying tendencies to engage in hateful content depending on the topic. [1]: Auditing radicalization pathways on YouTube, Ribeiro et al., WebSci 2018 [2]: Spread of Hate Speech in Online Social Media, Mathew et al., WebSci 2019
  17. Problem statement

    Given a hateful tweet and its associated signals, predict whether a given user (a follower account) will retweet the hateful tweet within a given time window. (A simplified sketch of this setup follows.)
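A minimal, simplified sketch of the setup, assuming the content and network signals are flattened into one feature vector per (tweet, follower, window) triple and fed to a plain classifier; the feature names, dimensions, and the use of logistic regression are illustrative stand-ins, not the topic-aware diffusion model used in the thesis.

```python
# Illustrative sketch only: combine content (tweet), endogenous (timeline, network)
# and exogenous (news) signals into one feature vector per candidate retweeter.
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurise(tweet_emb, timeline_emb, network_feats, news_emb):
    """Concatenate language and network-level signals for one candidate retweeter."""
    return np.concatenate([tweet_emb, timeline_emb, network_feats, news_emb])

rng = np.random.default_rng(0)
# Placeholder features; in practice these would come from the crawled dataset.
X = np.vstack([featurise(rng.random(64), rng.random(64), rng.random(8), rng.random(64))
               for _ in range(200)])
y = rng.integers(0, 2, size=200)          # 1 = follower retweeted within the window

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:1]))           # probability of retweet for one candidate
```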
  18. Experimental results

    Baseline comparisons. Behaviour of the cascades for different baselines; darker bars denote hate. Contextual signals provide additional information over network-only ones!
  19. Takeaways

    Limitations: • Feature-engineered contextual modelling. • Exogenous data signals are hard to scale and to map 1-1. • Hate speech detection itself is performed without context. Contributions: • An Indic and contextual dataset. • Contextual modelling of retweeting, formulated around the spreading behaviour observed in the data. Future work: • Accommodate non-organic diffusion, like advertisements on Facebook.
  20. Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

    ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2023 Thesis contribution
  21. Limitations in literature

    • A myopic approach to hate speech datasets using hate lexicons or hashtags. [1,2] • Lack of neutrally seeded datasets. [3] • Limited study in the Hinglish context. Our previous work was hashtag-driven! Motivation: a contextual and neutrally seeded Indic dataset for hate speech detection. [1]: Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter, Waseem & Hovy, NAACL 2016 [2]: Automated Hate Speech Detection and the Problem of Offensive Language, Davidson et al., WebSci 2017 [3]: Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, ElSherief et al., EMNLP 2021
  22. Dataset curation: GOTHate

    • Curated from 3 different geographies (India, USA, UK), 2019-2021. • Neutral seeding: 50k tweets, 3.7k hateful. • Tweets in English, Hindi and Hinglish. ◦ 3k tweets in pure Devanagari. • Endogenous features for each root user: ◦ Timeline ◦ Ego-network of the user. Dataset statistics of GOTHate: Hate, Offense, Provocation or Neutral. IAA = 0.70 for crowdsourced annotations via Xsaras services.
  23. Our annotation approach

    Phase I (internal): 3 experts (F: 3), IAA = 0.80. Phase II (Xsaras Services): 10 crowdsourced workers (M:F 6:10), IAA = 0.70. [1]: Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior, Founta et al., ICWSM 2018 [2]: Automated Hate Speech Detection and the Problem of Offensive Language, Davidson et al., WebSci 2017
  24. GOTHate’s neutral seeding

    Inter-class similarity among labels in respective datasets. #hindulivesmatter #Hinduphobia Sometimes you just have to jump into activist mode, esp. when #India & #Hinduism are denigrated. An offensive T-shirt has been removed after complaints to the T-shirt company. When the next time such things happen, WILL YOU ACT ????? #unsungHindus #Hinduunity #hindulivesmatter Sometimes, you just have to jump into activist mode, especially when India and Hinduism are denigrated. Thank you, Mahalakshmi Ganapathy Vijay Kumar Shourie Bannai. P N
  25. Intuition for in-dataset context

    Offensive train sample: "$mention$ $mention$ $mention$ and remember president loco said mexico will pay fuc**kfu ck trump f*** gop f*** republicans make go fund me for health care, college education , climate change, something good and positive !! not for a fucking wall go fund the wall the resistance resist $url$" E1: offensive train sample exemplar (can be by the same or a different author): "$mention$ deranged delusional Dumb dictator donald is mentally unstable! i will never vote republican again if they don't stand up to this tyrant living in the white house! fk republicans worst dictator ever unstable dictator $url$" E2: offensive train sample exemplar (can be by the same or a different author): "$mention$ could walk on water and the never Trump will crap on everything he does. shame in them. unfollow all of them please!" Exemplars are drawn from the labelled corpus. (A small retrieval sketch follows.)
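One plausible way to realise this exemplar signal is to retrieve, for each training sample, its nearest labelled neighbours and append them as context. The sketch below is an assumption-laden illustration: the `sentence-transformers` encoder, the `[SEP]` join, and similarity-based retrieval are choices made here for illustration, not necessarily the exact retrieval mechanism used in the thesis.

```python
# Illustrative sketch: retrieve k nearest same-label exemplars for a train sample
# and append them as in-dataset context.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def exemplar_context(sample: str, labelled_corpus: list, k: int = 2) -> str:
    """Concatenate the sample with its k most similar labelled exemplars."""
    emb = encoder.encode([sample] + labelled_corpus, convert_to_tensor=True)
    scores = util.cos_sim(emb[0], emb[1:])[0]
    exemplars = [labelled_corpus[int(i)] for i in scores.topk(k).indices]
    return " [SEP] ".join([sample] + exemplars)

offensive_posts = ["<offensive post A>", "<offensive post B>", "<offensive post C>"]
print(exemplar_context("<offensive train sample>", offensive_posts, k=2))
```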
  26. Observation for timeline signal

    "look at what Hindus living in mixed-population localities are facing, what Dhruv Tyagi had to face for merely asking his Muslim neighbors not to sexually harass his daughter...and even then, if u ask why people don’t rent to Muslims, get ur head examined" "$MENTION$ $MENTION$ naah...Islamists will never accept Muslim refugees, they will tell the Muslims to create havoc in their home countries and do whatever it takes to convert Dar-ul-Harb into Dar-ul Islam..something we should seriously consider doing with Pak Hindus too" "This must have been a apka tahir offer to jihadis - kill kaffirs, loot their property, do what u want with any kaffir female that “ur right hand posses”...in shirt, maal-e-ganimat delhi riots delhi riots" Shown alongside the hateful example tweet (timestamp t) are one tweet by the author before it (accusatory tone, timestamp t-1) and one tweet by the author after it (accusatory and instigating, timestamp t+1).
  27. Experimental results

    Baselines and ablations. M8: as expected, a major jump in performance when fine-tuning mBERT-based systems, even with no contextual features!
  28. Experimental results

    Baselines and ablations. Combining all 4 signals improves hate detection by 5 macro-F1 points. Attention-based contextual signals continue to provide additional information!
  29. Takeaways

    Limitations: • Not all platforms have the same set of endogenous signals. • Implicit hate analysis is missing. Contributions: • An Indic and contextual dataset. • Contextual modelling of fine-grained hate detection. Future work: • Accommodate the diversity of annotations instead of majority voting; hate lies on a spectrum.
  30. Focal Inferential Infusion Coupled with Tractable Density Discrimination for Implicit

    Hate Detection Natural Language Engineering (NLE), 2024 Thesis contribution
  31. Limitations in literature

    • Lack of empirical analysis of “to what extent do the implicit and neutral spaces overlap in latent space?” [1] • Contextual implicit hate detection does not account for toxic connotations. [2] Motivation: improve PLM fine-tuning for implicit hate detection via contextual clustering. [1]: Toxic, Hateful, Offensive or Abusive? What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets, Fortuna et al., LREC 2020 [2]: Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, ElSherief et al., EMNLP 2021
  32. Motivation for a clustering-based loss for implicit hate

    N: non-hate, I: implicit hate, E: explicit hate. • Non-hate is closer to implicit samples. • The ALD (one-to-one) distance shows more variability than the ACLD (centroid-only) distance. This follows from the fact that the mere presence of a keyword/lexicon does not render a sample hateful. Computed with frozen-BERT embeddings and L1 distance on samples from LatentHatred, annotated for (E/I, target and implied meaning) [3]. (A minimal sketch of this distance comparison follows.) [1]: I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language, Caselli et al., LREC 2020 [2]: Introducing the Gab Hate Corpus: defining and applying hate-based rhetoric to social media posts at scale, Kennedy et al., Language Resources and Evaluation [3]: Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, ElSherief et al., EMNLP 2021
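A minimal sketch of that comparison, assuming ALD is a mean pairwise (one-to-one) L1 distance between two classes and ACLD the L1 distance between their centroids, computed over frozen BERT [CLS] embeddings; the model choice, pooling strategy, and toy example texts are illustrative assumptions, not the exact analysis script.

```python
# Illustrative sketch: compare ALD (mean pairwise L1) and ACLD (centroid L1) between
# class embeddings obtained from a frozen BERT encoder.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return bert(**batch).last_hidden_state[:, 0, :]   # frozen [CLS] embeddings

def ald(a, b):
    """Average (one-to-one) linkage distance: mean pairwise L1 distance."""
    return torch.cdist(a, b, p=1).mean()

def acld(a, b):
    """Average centroid linkage distance: L1 distance between class centroids."""
    return torch.norm(a.mean(dim=0) - b.mean(dim=0), p=1)

non_hate = embed(["have a great day everyone", "the weather is lovely today"])
implicit = embed(["some groups just were not built for success"])   # toy stand-in

print("N-I ALD :", ald(non_hate, implicit).item())
print("N-I ACLD:", acld(non_hate, implicit).item())
```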
  33. Adaptive density discrimination (ADD) for implicit hate

    Implicit & implied. [1]: Metric Learning with Adaptive Density Discrimination, Rippel et al., ICLR 2016
  34. Experimental results

    Baselines and ablations, averaged over 3 seeds. Task-specific PLMs are separable in terms of fine-tuned performance [1]. [1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, Masud et al., EACL Findings 2024 (Thesis contribution)
  35. How does FiADD help implicit clusters?

    It brings the implied (intended) form closer to the implicit (surface) form by contextually modelling it.
  36. Takeaways

    Limitations: • Requires manually annotated 1-1 explanations for implicit hate. • Hard to explain the label-only output to content moderators. Contributions: • Contextual modelling of implicit signals. • Task-specific PLMs perform similarly to generic PLMs when fine-tuned for implicit hate detection. Future work: • Automate explanation generation for implicit hate. • Use other clustering techniques to improve fine-tuning.
  37. Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate

    Speech Findings of the Association for Computational Linguistics (Findings ACL), 2024 Thesis Contribution
  38. Limitations in literature

    • The majority of systems focus on the classification of hate speech; the output label is not intuitive for moderators. • An increasing reliance on general knowledge graphs (KGs) with PLMs for solving domain-specific tasks, without an examination of their impact. [1,2,3] We empirically record this! • Lack of toxicity-incorporated signals in explaining hate speech. Motivation: propose an alternative external signal that is much more sensitive to toxicity when explaining hate speech. [1]: Incorporating Commonsense Knowledge Graph in Pretrained Models for Social Commonsense Tasks, Chang et al., DeeLIO 2020 [2]: Leveraging World Knowledge in Implicit Hate Speech Detection, Lin, NLP4PI 2022 [3]: Explaining Toxic Text via Knowledge Enhanced Text Generation, Sridhar & Yang, NAACL 2022
  39. Why do KGs fail to help in explaining implicit hate?

    Hypothesis: “the process of retrieving and ranking KG tuples is task-agnostic and may not account for the multi-hop/indirect nature of implicit hate.” Significance testing and manual inspection reveal that the tuples are not informative/specific enough for our task. Setup: X (implicit post) + [T1, T2, …, Tm] -> Y (the implicit post’s short explanation). B: BLEU, R: ROUGE, BS: BERTScore. [1]: Social Bias Frames: Reasoning about Social and Power Implications of Language, Sap et al., ACL 2020 [2]: Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, ElSherief et al., EMNLP 2021 [3]: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge, Speer et al., AAAI 2017 [4]: StereoKG: Data-Driven Knowledge Graph Construction For Cultural Knowledge and Stereotypes, Deshpande et al., WOAH 2022
  40. Why do KGs fail to help in explaining implicit hate?

    Hypothesis: “the process of retrieving and ranking KG tuples is task-agnostic and may not account for the multi-hop/indirect nature of implicit hate.” Setup: X (implicit post) + [T1, T2, …, Tm] -> Y (the implicit post’s short explanation); a minimal sketch of this setup follows the example below. Example post: "lol. yes, africa isn't a developed land because of whites." topk3: ['white is distinct from black', 'white is related to cloud', 'white is opposite in meaning to black', 'white is related to snow', 'land is related to countryside', 'land is related to houses', 'land is related to shore', 'land is related to big', 'land is related to geography term', 'land is related to term', 'land is related to terrestrial', 'land is related to terrestrial area', 'land is related to feature', 'land is related to geographical', 'land is related to geographical feature', 'land is related to territory', 'land is related to value', 'land is related to airplane', 'land is related to airplane ending', 'land is related to ending'] A second set of retrieved tuples for the same post: ['indians kiss apartheids ass', 'american jews did become white', 'african american culture is interesting', 'muslims find countries undesirable', 'americans care about white', 'islam does have popularity in black community', 'black american culture is hypervisible', 'white americans doing something', 'americans are horribly bad with geography', 'indians are south asian community', 'black american culture is than americanized christianity', 'christians are problem nigeria', 'muslims are in christian land', 'muslims are not race', 'muslims do emigrate to countries', 'successful indian diaspora build country', 'muslims live in in ghettos', 'americans refer to region', 'indians are from south asia', 'indians are racist'] [1]: Social Bias Frames: Reasoning about Social and Power Implications of Language, Sap et al., ACL 2020 [2]: Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, ElSherief et al., EMNLP 2021 [3]: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge, Speer et al., AAAI 2017
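A minimal sketch of the X + [T1 … Tm] -> Y setup, assuming a seq2seq model such as BART that has been fine-tuned to generate short explanations; the checkpoint name, the `[SEP]`-style concatenation, and the hard-coded tuple list are illustrative assumptions rather than the exact Tox-BART pipeline.

```python
# Illustrative sketch: prepend retrieved KG tuples to an implicit post and generate
# a short explanation with a seq2seq model.
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")          # stand-in checkpoint
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

post = "lol. yes, africa isn't a developed land because of whites."
kg_tuples = ["white is distinct from black", "land is related to countryside"]  # top-k tuples

# X (implicit post) + [T1 ... Tm] as a single source sequence.
source = post + " [SEP] " + " ; ".join(kg_tuples)
inputs = tok(source, return_tensors="pt", truncation=True)

# Y: the implicit post's short explanation (only meaningful after task fine-tuning).
output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```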
  41. Experimental results: manual assessment

    C1: in-domain signal; C2: in-dataset (meta) signal. Fine-tuned systems retain task specificity better. SBIC human evaluation (1-5 scale); Target is binary.
  42. Takeaways

    Limitations: • English-only datasets. • More analysis of KGs’ application to subjective tasks is needed. • The explanations are helpful for moderators but not for users. Contributions: • Not all contextual signals are useful! • 1-1 mapped toxicity attributes are better for both implicit hate classification and its explanation. Future work: • Examine how ICL can be employed to retain both the task-specificity and the generalisation capabilities of LLMs.
  43. Proactively Reducing the Hate Intensity of Online Posts via Hate

    Speech Normalization ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2022 Thesis Contribution
  44. Existing Taxonomy of Hate Detection and Mitigation Reactive countering •

    Intervention after posting • Blocking posts or users • Users have no say Proactive countering • Intervention before posting • User’s choice • Can feel intrusive
  45. Limitations in literature

    • Attempts to convert offensive content to non-offensive content without any middle ground. [1] • Style-Transformer-based systems require large amounts of non-parallel data. [2] • The above models did not incorporate proactiveness in their proof of concept (POC). Motivation: can toxicity-attributed context cues be used to proactively nudge users to be less hateful? [1]: Fighting Offensive Language on Social Media, Santos et al., ACL 2018 [2]: Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation, Dai et al., ACL 2019
  46. Observations in the real world

    “The intervention has to be ground-up, not top-down.” - Howard Rheingold, Author [1]: Howard’s The Art of Hosting Good Conversations Online [2]: Social media punishment does not need to be a Kafkaesque nightmare [3]: Reconsidering Tweets: Intervening during Tweet Creation Decreases Offensive Content, Katsaros et al., ICWSM 2022
  47. Data curation

    • Hateful samples collected from existing hate speech datasets in English (focusing on explicit samples). [1,2,3] • Manually annotated for hate intensity (1-10) and explicit hateful spans. • Manually wrote the normalised counterpart and annotated its intensity (IAA = 0.88). Dataset statistics; original and normalised intensity distributions. [1]: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter, Basile et al., International Workshop on Semantic Evaluation, 2019 [2]: Automated Hate Speech Detection and the Problem of Offensive Language, Davidson et al., WebSci 2017 [3]: CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech, Chung et al., ACL 2019
  48. Problem Statement

    For a given hate sample t, our objective is to obtain its normalised (sensitised) form t′ such that the intensity of hatred φ(·) is reduced while the meaning is still conveyed: φ(t′) < φ(t). HATE NORMALIZATION: Extremely Hateful Input (ORIGINAL) → Less Hateful Input (SUGGESTIVE) → User’s Choice. Example of an original high-intensity vs. a normalised sentence. (A minimal sketch of this constraint follows.)
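A minimal sketch of the φ(t′) < φ(t) acceptance rule, assuming some hate-intensity scorer `phi` (e.g. a regressor trained on the 1-10 intensity annotations) and a rewriting module `suggest_rewrite`; both names are hypothetical placeholders, not components shipped with the thesis.

```python
# Illustrative sketch: accept a suggested rewrite only when it lowers hate intensity.
from typing import Callable

def normalise(post: str,
              phi: Callable[[str], float],
              suggest_rewrite: Callable[[str], str]) -> str:
    """Return a less intense (SUGGESTIVE) rewrite if phi(t') < phi(t), else the ORIGINAL."""
    candidate = suggest_rewrite(post)     # e.g. rewrite the detected hateful spans
    if phi(candidate) < phi(post):        # the normalisation constraint
        return candidate                  # shown to the user as a suggestion
    return post                           # the final choice always stays with the user
```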
  49. Experimental results: HIR (hate-intensity-reduced text generation)

    Human evaluation on a 1-5 scale; intensity is on a 1-10 scale. Nudging via toxicity-intensity reduction does not reduce the fluency or adequacy of the generated samples. [1]: A Benchmark Dataset for Learning to Intervene in Online Hate Speech, Qian et al., EMNLP 2019 [2]: CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech, Chung et al., ACL 2019
  50. Tool: detects hateful spans and suggests changes as you type

    Prototype of the NACL tool. Grammarly, but for hate reduction!
  51. Takeaways

    Limitations: • Tested only on English datasets. • Hard to model span detection for implicit/coded language. Contributions: • 1-1 mapped toxicity attributes are better for understanding explicit hate as well. • A proactive hate reduction system that engages with users instead of banning them right away. Future work: • The most effective way to determine the utility of proactive nudging is A/B testing in the real world!
  52. Hate Personified: Investigating the role of LLMs in content moderation

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Thesis Contribution
  53. Background

    Humans use their “socio-cultural background” and “world-knowledge” to annotate hate speech. [1]: Handling Bias in Toxic Speech Detection: A Survey, Garg et al., ACM Computing Surveys, 2023 (Thesis contribution) [2]: Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis, Lee et al., NAACL 2024
  54. Do LLMs pick up on geographical cues?

    RQ1: p_cue = “The following post was made in <country> or <language>.” Base case (post + query): p_base = “Statement: <POST>. Is the given statement hateful?” RQ1 prompt = p_cue + p_base. [1]: Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis, Lee et al., NAACL 2024
  55. Can LLMs mimic annotator personas?

    RQ2: p_cue = “The following post was made by <persona>.” Base case (post + query): p_base = “Statement: <POST>. Is the given statement hateful?” RQ2 prompt = p_cue + p_base. [1]: Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis, Lee et al., NAACL 2024
  56. Can LLMs mimic annotator personas? Not really!

    RQ2: p_cue = “The following post was made by <persona>.” vs. p_cue = “You are a <persona>.” [1]: Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs, Sun et al., NAACL 2025
  57. Are LLMs sensitive to anchoring bias?

    RQ3: p_cue = “<z%> of annotators consider the post as hateful.” Base case (post + query): p_base = “Statement: <POST>. Is the given statement hateful?” RQ3 prompt = p_cue + p_base. [1]: HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection, Mathew et al., AAAI 2021
  58. Are LLMs sensitive to anchoring bias? Yes!

    z ∈ {0%, 25%, 50%, 75%, 100%}. (A minimal sketch of the cue-prefixed prompting setup follows.)
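A minimal sketch of how the cue-prefixed prompts (RQ1-RQ3) could be assembled, assuming a generic zero-shot LLM client; `build_prompt` and the example cue strings are illustrative, and the final call to the model is left as a placeholder.

```python
# Illustrative sketch: prepend an optional contextual cue to the base hate query.
P_BASE = "Statement: {post}. Is the given statement hateful?"

def build_prompt(post: str, cue: str = "") -> str:
    """Return p_cue + p_base when a cue is given, otherwise the base case p_base."""
    base = P_BASE.format(post=post)
    return f"{cue} {base}".strip()

post = "<POST>"
prompts = [
    build_prompt(post),                                                         # base case
    build_prompt(post, "The following post was made in India."),               # RQ1: geographical cue
    build_prompt(post, "The following post was made by a female annotator."),  # RQ2: persona cue
    build_prompt(post, "75% of annotators consider the post as hateful."),     # RQ3: anchoring cue
]
for p in prompts:
    print(p)   # each prompt would then be sent to the LLM under a zero-shot setup
```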
  59. Takeaways

    Limitations: • Cultural markers are hard to define and reproduce at scale; the personas that work for one country will not be useful for another. • LLMs still lack the ability to untangle intersecting cues like Arab + non-Muslim. Contributions: • Cues that are useful for LM fine-tuning (e.g., numeric ones) may not be helpful for prompting. • Geographical and persona cues hold even under a multilingual setting. Future work: • Go beyond zero-shot prompting. • How about culturally fine-tuned models (on culture benchmarks)?
  60. So what did we manage to achieve?

    For users: assist in corrective behaviour. • Proactively nudge them to reduce explicit hate and untangle explicitness from greater engagement. (Chapter 6) For content moderators: reduce the burden by assisting with first-level flagging. • Propose context-aware tools that come closer to human contextual modelling. (Chapters 3, 4, 7) • Provide explanation-based systems for understanding implicit hate while moderating. (Chapter 5) For researchers/practitioners: through extensive experimentation and empirical research. • Release context-augmented datasets. (Chapters 3, 4, 5, 6) • Modelling practices: which contextual signal works under which setup (what + how to use). (Chapters 3, 4, 5, 6, 7) Contextual signal infusion is not one-size-fits-all; we see that across a range of datasets and tasks. Whether for moderators or users, “AI” can only be assistive in tackling a very human-centric problem.
  61. List of Publications (Journals) 1. Tanmoy Chakraborty, Sarah Masud: The

    Promethean Dilemma of AI at the Intersection of Hallucination and Creativity. Communications of the ACM, 2024. 2. Sarah Masud, Ashutosh Bajpai, Tanmoy Chakraborty: Focal Inferential Infusion Coupled with Tractable Density Discrimination for Implicit Hate Speech Detection. Natural Language Engineering (NLE), 2024. 3. Tanmay Garg, Sarah Masud, Tharun Suresh, Tanmoy Chakraborty: Handling Bias in Toxic Speech Detection: A Survey. ACM Computing Surveys, 2023. 4. Tanmoy Chakraborty, Sarah Masud: Judging the creative prowess of AI. Nature Machine Intelligence, 2023. 5. Sarah Masud, Tanmoy Chakraborty: Political mud slandering and power dynamics during Indian assembly elections. Social Network Analysis and Mining (SNAM), 2023. 6. Dhruv Sehnan, Vasu Goel, Sarah Masud, Chhavi Jain, Vikram Goyal, Tanmoy Chakraborty: DiVA: A Scalable, Interactive and Customizable Visual Analytics Platform for Information Diffusion on Large Networks. ACM Transactions on Knowledge Discovery from Data (ACM TKDD), 2023. 7. Tanmoy Chakraborty, Sarah Masud: Nipping in the bud: detection, diffusion and mitigation of hate speech on social media. ACM SIGWEB Newsletter, 2022.
  62. List of Publications (Conferences) 1. Sarah Masud, Sahajpreet Singh, Viktor

    Hangya, Alexander Fraser, Tanmoy Chakraborty: Hate Personified: Investigating the role of LLMs in content moderation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. 2. Neemesh Yadav, Sarah Masud, Vikram Goyal, Md. Shad Akhtar, Tanmoy Chakraborty: Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech. Findings of the Association for Computational Linguistics (ACL), 2024. 3. Sarah Masud, Mohammad Aflah Khan, Vikram Goyal, Md. Shad Akhtar, Tanmoy Chakraborty: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection. Findings of the Association for Computational Linguistics (EACL), 2024. 4. Atharva Kulkarni, Sarah Masud, Vikram Goyal, Tanmoy Chakraborty: Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2023. 5. Sarah Masud, Manjot Bedi, Mohammad Aflah Khan, Md. Shad Akhtar, Tanmoy Chakraborty: Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2022. 6. Sarah Masud, Subhabrata Dutta, Sakshi Makkar, Chhavi Jain, Vikram Goyal, Amitava Das, Tanmoy Chakraborty: Hate is the New Infodemic: A Topic-aware Modeling of Hate Speech Diffusion on Twitter. IEEE 37th International Conference on Data Engineering (ICDE), 2021.
  63. Acknowledgements

    • Supervisors: Dr. Tanmoy and Dr. Vikram • Internal and external review committee, anonymous article reviewers (including reviewer 2) • All the amazing collaborators (students, postdocs, professors) • Financial support from Google, PMRF, Wipro AI • IIIT-D admin, staff and colleagues, all collaborators! YAMSS! • Friends and family • Talks, tutorials, shared tasks and blogs where we have shared our work • Activists around the world teaching us that hate speech mitigation will be an ongoing effort
  64. We leave with more questions than answers

    Legal Aspects • How can legal frameworks be used to explain the classification of hate? • When working with social media orgs, who gets ownership of the POCs? • Where does dissent end, and where do sedition and toxicity begin? Social Aspects • How can we engage annotators of diverse perspectives when curating datasets or evaluating models? • Despite contextual cues, power dynamics are still hard to capture in the online world. Technical Aspects • What is the impact of task-specific models, KGs and datasets on downstream bias mitigation? • How can static NLP benchmarking move to a more dynamic setting to incorporate the dynamic nature of toxicity? • How can we better track the causal relation between other activities, like fake news or even football, and patterns of toxicity? • Sharing contextual datasets via platforms like HF is difficult! Access to data for CSS research is dwindling :(
  65. Response to reviewers: major concerns

    Reviewer 1: “My major concern is about the presentation of the thesis. I would suggest a rewrite, AT LEAST of Section 1.4, highlighting the consolidated (not section-wise) novelty of each contributory chapter along with the corresponding research publication(s) citations.” Response: Introduction (Chapter 1) as well as chapter-wise introductions updated. Reviewer 2: “Throughout the thesis, citations do not always follow the correct presentation style: when the citation is used as the subject in a sentence, it is customary to move the author names outside the parenthesis. For instance, on p15, ‘(Waseem and Hovy, 2016) released’ should be ‘Waseem and Hovy (2016) released’.” Response: Related work (Chapter 2) as well as individual instances of citations fixed. Reviewer 3: “Another element, somewhat indirectly acknowledged by the dissertation, is that for tools such as those proposed here to be usable, the dataset needs significantly more Annotators.” Response: Introduction (Chapter 1) and Conclusion (Chapter 8) updated.