• There is a surge in LM-based systems finetuned for detecting hate.
• Most often these systems either adopt the default parameters or finetune the parameters for a specific dataset + LM combination.
• Are there any common trends across dataset and LM combinations when performing hate speech detection, as measured by macro F1?
Fig 1: Online Hate Speech [1]
• Binary classification of posts into hate vs non-hate.
• 5 BERT-based PLMs.
• 3 Classification Heads (CH); a minimal sketch of these heads follows below:
◦ Simple CH: a linear layer followed by Softmax.
◦ Medium CH: two linear layers followed by Softmax.
◦ Complex CH: two linear layers with an intermediate dropout layer (dropout probability 0.1), followed by Softmax.
• Classification Head seed (aka MLP seed): ms = {12, 127, 451}
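The three heads differ only in depth and dropout. A minimal PyTorch sketch of how they could be instantiated; the hidden size of 768, the binary output, and the intermediate layer width are assumptions, not taken from the released code:

```python
import torch
import torch.nn as nn

def make_classification_head(kind: str, hidden_dim: int = 768, num_classes: int = 2) -> nn.Module:
    """Return one of the three CH variants described above (sizes are illustrative)."""
    if kind == "simple":      # one linear layer followed by Softmax
        layers = [nn.Linear(hidden_dim, num_classes)]
    elif kind == "medium":    # two linear layers followed by Softmax
        layers = [nn.Linear(hidden_dim, hidden_dim), nn.Linear(hidden_dim, num_classes)]
    elif kind == "complex":   # two linear layers with an intermediate dropout layer (p=0.1)
        layers = [nn.Linear(hidden_dim, hidden_dim), nn.Dropout(p=0.1),
                  nn.Linear(hidden_dim, num_classes)]
    else:
        raise ValueError(f"unknown CH kind: {kind}")
    return nn.Sequential(*layers, nn.Softmax(dim=-1))

# The MLP seed ms controls the head's initialisation
torch.manual_seed(12)                      # ms ∈ {12, 127, 451}
head = make_classification_head("complex")
```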
Training an LM from scratch (pre-training): initialise the LM with random weights, train with mask prediction (unsupervised, generalised training) on a large unlabeled corpus drawn from various sources across the web, and save the LM with trained weights.
Training an LM from a saved checkpoint (continued pre-training): load the LM with trained weights, train with mask prediction (unsupervised, domain-specific training) on a medium-sized unlabeled corpus from a specific domain, and save the LM with updated weights, yielding a large-scale domain LM. A sketch of this continued pre-training step follows below.
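A hedged sketch of the continued pre-training branch using Hugging Face transformers; the base model, corpus file, and hyperparameters below are placeholders, not the actual training recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Load the saved LM with trained weights (not a random init)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Medium-sized unlabeled domain corpus (placeholder path)
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128), batched=True)

# Unsupervised, domain-specific mask prediction
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-domain-adapted", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()

model.save_pretrained("bert-domain-adapted")   # saved LM with updated weights
```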
1. BERT -> English Wikipedia + BookCorpus used to perform MLM. [1]
2. BERTweet -> Over 845M tweets used to perform MLM. [2]
3. HateBERT -> Over 1M offensive Reddit posts used to perform MLM. [3]
BERT variants for different contexts via MLM.
Fig 1: MLM top-3 candidates for the template “Women are [MASK].” [3]
Fig 2: Performance gain of HateBERT over BERT for hate classification [3]
[1]: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[2]: BERTweet: A pre-trained language model for English Tweets
[3]: HateBERT: Retraining BERT for Abusive Language Detection in English
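The MLM probe behind Fig 1 can be reproduced with the fill-mask pipeline; the Hub IDs below are the commonly used public checkpoints for these PLMs (an assumption about which exact checkpoints the authors used):

```python
from transformers import pipeline

# Assumed public Hub checkpoints: BERT, BERTweet, HateBERT
for name in ["bert-base-uncased", "vinai/bertweet-base", "GroNLP/hateBERT"]:
    fill = pipeline("fill-mask", model=name)
    mask = fill.tokenizer.mask_token          # "[MASK]" or "<mask>" depending on the tokenizer
    top3 = fill(f"Women are {mask}.", top_k=3)
    print(name, [candidate["token_str"] for candidate in top3])
```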
Hypothesis: Random seed variations should significantly impact the downstream performance.
Setup:
1. Employ the 25 pretrained checkpoints of BERT, each trained with a different random seed for the pretraining data ordering + weight initialisation [1].
   a. Use 5 random PLM seeds ps = {0, 5, 10, 15, 20}.
2. Keeping the checkpointed PLM frozen, finetune all combinations of <Dataset, CH head, PLM checkpoint> (see the sketch below).
[1]: The MultiBERTs: BERT Reproductions for Robustness Analysis
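A sketch of this setup, assuming the MultiBERTs checkpoints are pulled from the Hugging Face Hub under the google/multiberts-seed_{ps} naming scheme (worth verifying against the release):

```python
import torch
from transformers import AutoModel, AutoTokenizer

def load_frozen_multibert(pretraining_seed: int):
    name = f"google/multiberts-seed_{pretraining_seed}"   # assumed Hub naming scheme
    model = AutoModel.from_pretrained(name)
    for p in model.parameters():                          # keep the checkpointed PLM frozen
        p.requires_grad = False
    return AutoTokenizer.from_pretrained(name), model

pretraining_seeds = [0, 5, 10, 15, 20]   # ps from the setup
mlp_seeds = [12, 127, 451]               # ms from the setup
for ps in pretraining_seeds:
    tokenizer, plm = load_frozen_multibert(ps)
    for ms in mlp_seeds:
        torch.manual_seed(ms)            # MLP seed controls the CH initialisation
        # head = make_classification_head("simple")   # see the head sketch above
```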
RQ1: How do variations in the saved checkpoint impact hate detection?
Observations:
• Downstream performance is sensitive to seed selection.
• CH head (aka MLP) seeds cause more variation than the pretraining seed. S_ms,ps denotes the combination of MLP seed (ms) and pretraining seed (ps); the spread can be summarised as sketched below.
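One way to quantify which seed drives the spread of S_ms,ps: compare macro F1 dispersion across ms with ps held fixed against dispersion across ps with ms held fixed. A self-contained sketch with placeholder predictions (random stand-ins, not real model outputs):

```python
import numpy as np
from sklearn.metrics import f1_score

MS, PS = [12, 127, 451], [0, 5, 10, 15, 20]
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                 # placeholder test labels

scores = {}
for ms in MS:
    for ps in PS:
        y_pred = rng.integers(0, 2, size=200)         # stand-in for predictions of model S_{ms,ps}
        scores[(ms, ps)] = f1_score(y_true, y_pred, average="macro")

# Variation attributable to the MLP seed: spread across ms while ps is held fixed
var_ms = np.mean([np.std([scores[(ms, ps)] for ms in MS]) for ps in PS])
# Variation attributable to the pretraining seed: spread across ps while ms is held fixed
var_ps = np.mean([np.std([scores[(ms, ps)] for ps in PS]) for ms in MS])
print(f"MLP-seed spread: {var_ms:.4f}  pretraining-seed spread: {var_ps:.4f}")
```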
Hypothesis: Does the default assumption that the best/last pretrained checkpoint leads to the highest downstream performance hold when compared against intermediate checkpoints?
Setup:
1. Employ the 84 intermediate pretrained checkpoints, one for each training epoch of English RoBERTa [1].
2. Employ the Simple and Complex CH.
3. Keeping the checkpointed PLM frozen, finetune all combinations of <Dataset, CH head, PLM checkpoint>.
[1]: Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions
Fig 1: The checkpoint (C) which leads to the maximum macro F1 for the respective MLP seed (S_ms) and CH.
Observations:
• Across all datasets and classification heads, earlier checkpoints (C_x) report the highest performance.
Fig 2: Performance variation on Davidson across the 84 checkpoints of RoBERTa [1].
[1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, EACL'24
Fig 2: Performance variation across the 84 checkpoints of RoBERTa [1].
[1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, EACL'24
Takeaway
• Not all downstream tasks require full-fledged pretraining.
• Can intermediate validation tests on these tasks produce more cost-effective PLMs?
Hypothesis: PLMs trained on more recent data should capture emerging hateful world knowledge and enhance the detection of hate.
Setup:
1. Employ the incrementally updated checkpoints from the Hugging Face OLM project for English RoBERTa [1] (see the sketch below).
2. Keeping the checkpointed PLM frozen, finetune all combinations of <Dataset, Simple CH head>.
[1]: https://huggingface.co/olm
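A sketch of loading the RoBERTa variants compared in Fig 1 below and freezing them; the OLM Hub IDs follow the olm organisation's naming and are assumptions that should be verified at https://huggingface.co/olm:

```python
from transformers import AutoModel, AutoTokenizer

checkpoints = {
    "2019-06": "roberta-base",                    # original RoBERTa release
    "2022-10": "olm/olm-roberta-base-oct-2022",   # assumed OLM model ID
    "2022-12": "olm/olm-roberta-base-dec-2022",   # assumed OLM model ID
}

frozen_plms = {}
for date, name in checkpoints.items():
    model = AutoModel.from_pretrained(name)
    for p in model.parameters():                  # keep the PLM frozen; only the Simple CH is trained
        p.requires_grad = False
    frozen_plms[date] = (AutoTokenizer.from_pretrained(name), model)
```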
Observations
• As most datasets are older than the respective PLM, newer PLMs do not lead to significant performance improvements.
• The extreme 25-point increase in the case of Founta can be an indicator of data leakage.
Fig 1: Macro F1 (averaged across MLP seeds) finetuned on RoBERTa variants [1] from June 2019, October 2022, and December 2022 [2].
[1]: https://huggingface.co/olm
[2]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, EACL'24
Takeaways
• The community also needs to build upon the likes of the HateCheck [1] test sets to standardise testing of newer LLMs against hate speech.
• How do we scale this dynamically and continually, across languages, topics, and cultural sensitivities?
• How do we check for leakage of existing hate speech datasets (which are curated from the web) into the pretraining data of newer LMs?
[1]: HateCheck: Functional Tests for Hate Speech Detection Models
Hypothesis: Different layers or groups of layers in the PLM will be of varying importance for hate detection.
Setup:
1. Freeze all layers except the one(s) being probed (see the sketch below).
2. Finetune all combinations of <Dataset, Simple CH head, PLM>.
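A minimal sketch of step 1, assuming a BERT-style encoder whose parameter names contain encoder.layer.<idx>:

```python
from transformers import AutoModel

def freeze_all_but_layer(model, probed_layer: int):
    """Freeze every parameter except those of the single encoder layer being probed."""
    for name, param in model.named_parameters():
        # Parameter names look like "encoder.layer.7.attention..." in BERT-style models
        param.requires_grad = name.startswith(f"encoder.layer.{probed_layer}.")
    return model

plm = freeze_all_but_layer(AutoModel.from_pretrained("bert-base-uncased"), probed_layer=11)
trainable = sum(p.numel() for p in plm.parameters() if p.requires_grad)
print(f"trainable parameters when probing layer 11: {trainable:,}")
```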
Fig 1: BERT Base vs BERT Large in terms of encoder layers [1]. Different layers encode semantics to a different extent.
[1]: https://iq.opengenus.org/bert-base-vs-bert-large/
Fig 1: The best and worst performing layer for different BERT variants [1].
Observations:
• Excluding mBERT, trainable higher layers (closer to the classification head) lead to higher macro F1 for most BERT variants (except on Davidson and Founta!); the trend is significant for 5 datasets.
[1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, EACL'24
Fig 1: The best and worst performing layer for different BERT variants [1].
Observations:
• mBERT's highest performance for a dataset lags behind the other BERT variants.
• Further, for mBERT the middle layers seem to consistently perform better.
[1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, EACL'24
Observations:
• The impact of region-wise (consecutive groups of layers) freezing (F, i.e. trainable=False) or non-freezing (NF, i.e. trainable=True) depends more on the dataset than on the underlying BERT variant (see the sketch below).
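A sketch of the region-wise variant, extending the single-layer probe above; the region boundaries are illustrative:

```python
from transformers import AutoModel

def set_region_trainable(model, first: int, last: int):
    """Unfreeze (NF) the consecutive encoder layers first..last; freeze (F) the rest."""
    for name, param in model.named_parameters():
        param.requires_grad = any(
            name.startswith(f"encoder.layer.{i}.") for i in range(first, last + 1))
    return model

# e.g. keep only the lower region (layers 0-3) of BERT-base trainable
plm = set_region_trainable(AutoModel.from_pretrained("bert-base-uncased"), first=0, last=3)
```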
Takeaways:
• Layer grouping is an easier strategy to validate than single layers.
• Employ mBERT only when the dataset is multilingual; otherwise you will suffer a loss of performance on English datasets.
[1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, EACL'24
RQ: Does the complexity of the classifier head impact hate speech detection?
Hypothesis: Increasing the complexity of the classification head should lead to an increase in performance.
Setup:
1. Keeping the base PLM frozen, run all combinations of <Dataset, PLM, classification head>, where the CH complexity is one of: Simple < Medium < Complex.
RQ: Does the complexity of the classifier head impact hate speech detection?
Observations
• Interestingly, we observe that a general-purpose pretrained model with a complex classification head may mimic the results of a domain-specific pretrained model.
• The Medium-complexity CH performs better than the other combinations.
Fig 1: Impact of different classification head complexities on different PLMs [1].
[1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, EACL'24
RQ: Does the complexity of the classifier head impact hate speech detection?
[1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, EACL'24
RQ: Does the complexity of the classifier head impact hate speech detection?
Takeaways:
• Performance saturates with classifier head complexity; increasing it does not help beyond a point.
• BERTweet seems to perform better than HateBERT. We suggest that authors employing HateBERT also test their experiments on BERTweet.
• To what extent are domain-specific models like HateBERT useful?
[1]: Probing Critical Learning Dynamics of PLMs for Hate Speech Detection, EACL'24
1. Should the community better release and maintain intermediate checkpoints?
2. Should we avoid mBERT when a monolingual PLM is available for a monolingual dataset?
3. What is the role of domain-specific models? Is domain-specific MLM enough?
Recommendations for the Hate Speech Community
1. Report performance variation across the MLP (classification head) seed for the hate detection tasks.
2. Hate speech corpora need to be continually augmented with newer samples to keep up with the updated training corpora of the PLMs.
3. Datasets that report results on HateBERT should consider BERTweet as well, given the closeness in performance of BERTweet and BERT.
4. We recommend reporting results on varying complexities of classification heads to establish the efficacy of domain-specific models.