How to uncover and avoid structural biases in your Machine Learning projects
Sofie Van Landeghem
NLP and ML freelancer & Open-Source maintainer
PyData London 2024

Sofie Van Landeghem (OxyKodit) - Avoiding structural bias in your ML projects
Running an ML project: meeting a new client
"I want to solve problem X with AI."
"I have no data."
"There is no prototype yet."
"I want to achieve 95% accuracy."

Manage expectations
First prototype
★ Create data model
★ Assemble data sets
★ Build preliminary ML models
★ Accuracy & efficiency baseline
Robust solution
★ Iterate on the data model
★ Curate the data sets
★ Fine-tune the ML models
★ Improve accuracy & efficiency

Let's build a prototype
1. Select the right data
2. Model the data
3. Build ML models / algorithms
4. Perform an evaluation

Let's build a prototype - focusing on the data!
1. Select the right data
2. Model the data
3. Build ML models / algorithms to fit the data
4. Perform an evaluation on the data
5. Iterative improvements on the data model and algorithms

But why? Can't we just "zero-shot" it these days?
ChatGPT (or any other LLM) can generalize to unseen labels as it "understands" their meaning

We still need proper, gold annotations
● Evaluation
  ○ Measure performance and progress
● Training a supervised model
  ○ Smaller and specialized models can be more cost efficient
● Tuning an LLM
  ○ Few-shot prompt examples
  ○ Fine-tuning an open-source LLM

Evaluate your evaluation

Use-case: Entity linking (EL) or disambiguation
"The vision of [WMO] is to provide world leadership in expertise and international cooperation in weather, climate, hydrology and water resources"
(Figure labels: WikiData, Wikipedia)

Assembling training data from wiki links
She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835

Evaluation & comparison against baselines
● Assemble a knowledge base (KB) from WikiData
● Prune KB for ideal memory/accuracy trade-off
  ○ Remove infrequent entities/synonyms
  ○ KB only stores 14% of all WikiData concepts
  ○ Max. achievable accuracy using oracle disambiguation: 84%
● Random baseline: 54%
● Entity linker model implemented with spaCy & Thinc: 79%
Looks great, right? 🚀

Let's compare with prior probabilities
● "Obama": almost always linked to the 44th US president (Q76)
● We could simply "predict" Q76 for every "Obama" mention, without training an ML model that disambiguates according to the context
● This prior-probability baseline obtains 78.2%
● So our model at 79% only marginally improves upon that baseline 😟
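
A minimal sketch of such a prior-probability baseline, assuming the training data is available as (mention text, entity ID) pairs. The function names and the "Q_OTHER" ID are made up for illustration; only "Q76" comes from the slide.

```python
from collections import Counter, defaultdict


def build_priors(training_links):
    """Count how often each mention text links to each entity ID."""
    counts = defaultdict(Counter)
    for mention, entity_id in training_links:
        counts[mention][entity_id] += 1
    return counts


def prior_baseline_accuracy(counts, eval_links):
    """Always predict the most frequent entity for a mention, ignoring context."""
    correct = 0
    for mention, gold_id in eval_links:
        if mention in counts:
            predicted = counts[mention].most_common(1)[0][0]
            correct += predicted == gold_id
        # Mentions never seen in training count as errors.
    return correct / len(eval_links) if eval_links else 0.0


# Toy data: "Q_OTHER" is a placeholder, not a real WikiData ID.
train = [("Obama", "Q76"), ("Obama", "Q76"), ("Obama", "Q_OTHER")]
dev = [("Obama", "Q76"), ("Obama", "Q_OTHER")]
accuracy = prior_baseline_accuracy(build_priors(train), dev)
print(f"Prior-probability baseline accuracy: {accuracy:.1%}")
```

Reporting this number next to the model's score shows how much the model actually gains from using context.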

Let's inspect our data...
Societies in the ancient civilizations of [Greece] and Rome preferred small families.
  Gold reference: Ancient Greece
  Prediction: Greece
Agnes Maria of Andechs-Merania (died 1201) was a Queen of [France].
  Gold reference: current France
  Prediction: kingdom in Western Europe from 987 to 1791
Full metro systems are in operation in Paris, Lyon and [Marseille]
  Gold reference: Marseille Metro
  Prediction: Marseille

Recommendation: Make sure you climb the right hill
● Is your data reliable?
● Is your test set representative of real-world data?
● Is your evaluation taking hierarchical concepts into account?
● Is your evaluation penalizing related concepts as harshly as other mistakes?
● Are you evaluating your model's accuracy against sensible baselines?

Data model & Annotation strategy

Use-case: Identifying school names in text
● Client had created the data model and done the annotations
● Initial NER model obtained 92% F-score 🥳
● The project was going to build on top of these results
● But we wanted to do a qualitative analysis first...

Recommendation: Apply the model back to the training dataset
Prediction mistakes may actually be annotation errors and/or uncover structural issues or biases:
Gold: ...All [CPS] schools use a centralized data warehouse...
Gold: ...reside within three [NYC] districts, and we have built...
Gold: ...graduation rates from [Broward County public schools], Florida...
Gold: ...expanded beyond the [Harrisburg school district] to now include sites...
FP: ...expanded beyond the [Harrisburg] school district to now include sites...
FP: ...reside within three [NYC districts], and we have built...
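
One way to surface such cases is to diff gold and predicted spans on the training texts. This is a rough sketch, assuming spans are stored as (start, end, label) character offsets; the "SCHOOL" label and the helper names are hypothetical.

```python
def char_span(text, phrase, label):
    """Build a (start, end, label) span for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def span_disagreements(gold_spans, predicted_spans):
    """List predicted spans missing from gold (FP) and gold spans not predicted (FN)."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    return {
        "false_positives": sorted(predicted - gold),
        "false_negatives": sorted(gold - predicted),
    }


text = "expanded beyond the Harrisburg school district to now include sites"
gold = [char_span(text, "Harrisburg school district", "SCHOOL")]
predicted = [char_span(text, "Harrisburg", "SCHOOL")]

report = span_disagreements(gold, predicted)
for kind, spans in report.items():
    for start, end, label in spans:
        # Print the text of each disagreement for manual review.
        print(kind, label, repr(text[start:end]))
```

Each printed disagreement is a candidate for manual review: it may be a model error, an annotation error, or a sign that the guidelines are ambiguous.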

Recommendation: Write up annotation guidelines
● The "source of truth" throughout the project
● Define label schema
● Detailed guidelines for annotators
When the full name of an entity and its acronym appear in sequence, each should be annotated as a separate entity:
❌ internal report by the [University of Ghent (UGent)] has revealed...
✅ internal report by the [University of Ghent] ([UGent]) has revealed...

Recommendation: An intuitive annotation framework
● Provide support for tedious and repetitive tasks
  ○ Provide sensible suggestions (rules, vocabulary, LLM, ...)
● Focus on a single task at once
  ○ Reduce cognitive load
  ○ Improve annotation consistency
● Validation callback to help enforce the guidelines
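
A validation callback can be as simple as a function that checks each submitted span against the written guidelines. The sketch below is not tied to any particular annotation tool; the parenthesis check is an assumed heuristic for the "full name plus acronym" guideline, and the "ORG" label and function names are hypothetical.

```python
def find_span(text, phrase, label):
    """Build a (start, end, label) span for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def validate_spans(text, spans):
    """Return a list of human-readable guideline violations for the given spans."""
    problems = []
    for start, end, label in spans:
        span_text = text[start:end]
        if not span_text or span_text != span_text.strip():
            problems.append(f"{label} {span_text!r}: empty or padded with whitespace")
        if "(" in span_text or ")" in span_text:
            # Assumed heuristic: parentheses hint at a merged name + acronym.
            problems.append(f"{label} {span_text!r}: annotate name and acronym separately")
    return problems


text = "internal report by the University of Ghent (UGent) has revealed"
merged = [find_span(text, "University of Ghent (UGent)", "ORG")]
separate = [
    find_span(text, "University of Ghent", "ORG"),
    find_span(text, "UGent", "ORG"),
]

print(validate_spans(text, merged))    # one violation reported
print(validate_spans(text, separate))  # no violations
```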

Recommendation: Design an appropriate label scheme
Ensure that the label scheme ...
  ○ Is clear and unambiguous
  ○ Fits with the envisioned modeling approach
  ○ Allows generalization / extension in the future
  ○ Is compatible with downstream usage of the data

Recommendation: Measure inter-annotator agreement
● Plot the overlap between labels from different annotators
● Here, there is some confusion between "City" and "Region"
● There is also confusion about whether or not to annotate an entity (label "None")
● Use this to update your guidelines and/or label scheme
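
The counts behind such an overlap plot can be computed directly from two annotators' labels for the same items. The sketch below also reports Cohen's kappa as a single summary number; kappa is an addition here and is not mentioned on the slide. The toy labels are invented for illustration.

```python
from collections import Counter


def confusion_counts(labels_a, labels_b):
    """Count how often annotator A's label co-occurs with annotator B's label."""
    return Counter(zip(labels_a, labels_b))


def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same, non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used one identical label throughout.
    return (observed - expected) / (1 - expected)


# Toy annotations for four items; "None" means "not an entity".
annotator_a = ["City", "City", "Region", "None"]
annotator_b = ["City", "Region", "Region", "None"]

confusion = confusion_counts(annotator_a, annotator_b)
kappa = cohen_kappa(annotator_a, annotator_b)
for (label_a, label_b), count in sorted(confusion.items()):
    print(f"A={label_a:<7} B={label_b:<7} {count}")
print(f"Cohen's kappa: {kappa:.3f}")
```

Off-diagonal pairs with high counts, such as ("City", "Region"), point at the parts of the guidelines that need clarification.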

Recommendation: Plot a learning curve
● Plot the F-score in relation to the % of training data used (eval set is fixed)
● Estimate how much can be gained by annotating more data
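
A sketch of how the points for such a curve could be collected. The actual training and scoring is left to a caller-supplied function (`train_and_score`, a hypothetical name) that must always evaluate on the same fixed evaluation set. The toy scorer below is a placeholder and does not represent real results.

```python
def learning_curve(train_examples, fractions, train_and_score):
    """Score a model trained on growing prefixes of the training data.

    train_and_score(subset) is supplied by the caller: it trains on the
    subset and returns the score on a FIXED evaluation set.
    """
    points = []
    for fraction in fractions:
        size = max(1, round(len(train_examples) * fraction))
        points.append((fraction, train_and_score(train_examples[:size])))
    return points


def toy_train_and_score(subset):
    # Placeholder only: pretends the score saturates as the data grows.
    return len(subset) / (len(subset) + 50)


examples = list(range(200))  # Assumed to be shuffled once beforehand.
curve = learning_curve(examples, [0.25, 0.5, 0.75, 1.0], toy_train_and_score)
for fraction, score in curve:
    print(f"{fraction:.0%} of training data -> score {score:.3f}")
```

If the curve is still rising steeply at 100%, annotating more data is likely to pay off; if it has flattened, effort is probably better spent elsewhere.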

Recommendation: Set up an extrinsic evaluation
Intrinsic evaluation
  Char-based annotation
  Data model corresponding to the ML/NLP task

         # patients  Treatment Drug        Treatment Dose
Group 1  5           phenylephrine         1 μg/kg
Group 2  5           arginine vasopressin  0.03 U/kg
Group 3  5           epinephrine           1 μg/kg

Extrinsic evaluation
  Higher-level view
  Data corresponding to downstream requirements
  Robust to changing data model

An extrinsic evaluation reveals what is truly important
● Thinking back to our NER model:
  ○ Gold annotation: "Harrisburg school district"
  ○ Prediction: "Harrisburg"
● In a typical "strict" evaluation setting, this would be a FP and a FN, resulting in lower precision and lower recall
● In downstream processing, however, the correct school could be identified, leading to no final error in the extrinsic evaluation
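
To see the effect on the numbers, the sketch below scores the same prediction with strict span matching and with a more lenient overlap-based matching. Overlap matching is only a stand-in here: a real extrinsic evaluation would compare the downstream result (the resolved school), which is not modeled in this example. The "SCHOOL" label and the character offsets are assumptions.

```python
def exact_match(gold_span, predicted_span):
    """Strict setting: boundaries and label must be identical."""
    return gold_span == predicted_span


def overlap_match(gold_span, predicted_span):
    """Lenient setting: same label and at least one shared character."""
    return (
        gold_span[2] == predicted_span[2]
        and gold_span[0] < predicted_span[1]
        and predicted_span[0] < gold_span[1]
    )


def precision_recall(gold, predicted, match):
    """Precision and recall of predicted spans under a given matching function."""
    matched_pred = sum(any(match(g, p) for g in gold) for p in predicted)
    matched_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


# (start, end, label) offsets for "Harrisburg school district" vs "Harrisburg".
gold = [(20, 46, "SCHOOL")]
predicted = [(20, 30, "SCHOOL")]

strict = precision_recall(gold, predicted, exact_match)
lenient = precision_recall(gold, predicted, overlap_match)
print("strict  P/R:", strict)
print("lenient P/R:", lenient)
```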

Bias in corpus construction

Example use-case: Identify fauna and seabed sediments
Different equipment results in varying image quality

Recommendation: Prototype - stick to a limited scope
● Prototype: clarify the initial scope
  ○ Sources?
  ○ Types of input?
  ○ Label set?
● Clarify with the client that the model will not (yet) generalize beyond this original scope
● Allows you to set a reliable benchmark

Recommendation: Beyond a prototype - make the models robust
● Data augmentation
  ○ Add artificial noise
  ○ Transpose, rotate, zoom and crop images to create more variation
● Explore different architectures and/or training hyperparameters
  ○ e.g. train with a high(er) level of dropout
  ○ Potentially sacrificing some accuracy points for generalizability
● Measure (and avoid) overfitting
  ○ If your models do well on your dev data but poorly on the final test portion, your models are overfitting and will not be robust / reliable on new data

Recommendation: Production - detect data drift
● Manually
  ○ Qualitative analysis & feedback from users
  ○ Regular focused annotation efforts to curate a new, up-to-date evaluation set
● Automated
  ○ Measure predicted label distributions to detect shifts in the input data
  ○ Measure model's confidence scores on new predictions to detect degradation
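
A minimal sketch of the automated label-distribution check, comparing predicted labels from a reference period against a recent batch. Total variation distance is one possible measure among several, and the 0.1 threshold is an arbitrary assumption that would need tuning per project.

```python
from collections import Counter


def label_distribution(labels):
    """Relative frequency of each predicted label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(p, q):
    """Half the summed absolute difference between two distributions (0 to 1)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


def drift_detected(reference_labels, new_labels, threshold=0.1):
    """Flag a shift when the label distributions differ by more than threshold."""
    distance = total_variation_distance(
        label_distribution(reference_labels), label_distribution(new_labels)
    )
    return distance > threshold


# Toy predictions: labels "A" and "B" are placeholders.
reference = ["A"] * 8 + ["B"] * 2
recent = ["A"] * 5 + ["B"] * 5

distance = total_variation_distance(
    label_distribution(reference), label_distribution(recent)
)
print(f"Total variation distance: {distance:.2f}")
print("Drift detected:", drift_detected(reference, recent))
```

A flagged shift does not prove that the model has degraded; it is a trigger for the manual checks listed above.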

Example use-case: Identify political stances in news
Client's goal: identify all political actors & opinions on Brexit

Keyword-based corpus selection
● Text selection (by client): articles mentioning "Brexit" in the title
● The client wants to run the resulting ML models on all news articles
● The keyword-based preselection of articles prevents you from assessing the number of false negatives in other articles
● Risk of missing relevant information that does not meet the original criterion

Recommendation: Formalize the input selection procedure
● Train a simple document-level classifier first
  ○ Distinguish political news articles from other categories
  ○ Identify Brexit-related articles
● Bonus advantage
  ○ Creating a general model that can be reused in future projects, e.g. when we want to analyze political stances around renewable energy
● This strategy is especially beneficial with imbalanced datasets
  ○ E.g. many irrelevant paragraphs → train a binary yes/no classifier first

Example: Waterfall strategy - step 1
Text categorization
● Quick document-level decision
● Train ML classifier
● Reusable component
● Easy way to filter relevant documents, both for annotation and later in production

Example: Waterfall strategy - step 2
Fine-grained analyses
● e.g. Named Entity Recognition (NER)
● Only performed on documents that passed through the initial step
Having two separate steps
● Reduces cognitive load on annotators
● Allows you to train separate, specialized ML models
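
The two steps can be wired together as a small pipeline in which the fine-grained component only ever sees documents that passed the document-level filter. In this sketch both components are passed in as plain functions; the toy stand-ins below are placeholders for trained models and are not meant as real classifiers.

```python
def waterfall(documents, is_relevant, extract_entities):
    """Run step 2 (fine-grained analysis) only on documents that pass step 1."""
    results = {}
    for doc_id, text in documents.items():
        if is_relevant(text):  # Step 1: quick document-level decision.
            results[doc_id] = extract_entities(text)  # Step 2: e.g. NER.
    return results


def toy_is_relevant(text):
    # Placeholder for a trained text classifier.
    return "parliament" in text.lower()


def toy_extract_entities(text):
    # Placeholder for a trained NER model: returns title-cased tokens.
    return [token for token in text.split() if token.istitle()]


documents = {
    "doc-a": "Parliament debated the Brexit deal.",
    "doc-b": "The match ended 2-1.",
}
results = waterfall(documents, toy_is_relevant, toy_extract_entities)
print(results)
```

Because the filter is a separate component, it can be evaluated, retrained, and reused on its own.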

Avoid: Create fully random dataset splits
● Once you have a dataset annotated, create train/dev/test splits
● Don't do this for real-world projects:
(Taken from Karpathy's excellent free tutorial "Neural Networks: Zero to Hero")

Recommendation: Document-level splits
● Keep all examples from the same original document in the same split
● Otherwise, you're "leaking" information from training data to test data
● You'd be artificially boosting your performance numbers
Ex 1: "Mutated cancers constantly displayed strong estrogen receptor expression."
Ex 2: "In all mutated cases strong estrogen receptor expression was demonstrable."
PMID: 28799536
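
A quick sanity check for this kind of leakage, assuming each example carries the ID of the document it came from. The data layout (split name mapped to a list of (doc_id, text) pairs) and the function name are assumptions for illustration.

```python
def leaked_doc_ids(splits):
    """Return the IDs of documents whose examples appear in more than one split."""
    first_seen_in = {}
    leaked = set()
    for split_name, examples in splits.items():
        for doc_id, _text in examples:
            # Remember the first split a document was seen in; flag any other.
            if first_seen_in.setdefault(doc_id, split_name) != split_name:
                leaked.add(doc_id)
    return leaked


ex1 = "Mutated cancers constantly displayed strong estrogen receptor expression."
ex2 = "In all mutated cases strong estrogen receptor expression was demonstrable."

# Sentence-level random split: both sentences come from PMID 28799536.
leaky_splits = {"train": [("28799536", ex1)], "test": [("28799536", ex2)]}
# Document-level split: the document stays together.
clean_splits = {"train": [("28799536", ex1), ("28799536", ex2)], "test": []}

print("Leaky split:", leaked_doc_ids(leaky_splits))
print("Clean split:", leaked_doc_ids(clean_splits))
```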

Recommendation: Deterministic & repeatable splits
Use deterministic (modulo) hashes of unique doc IDs to determine the split
● Ensure that the same doc is always in the same split
● Even if you extend the dataset or re-run the splitting algorithm
● Prevent "leaking" information from training to eval numbers
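
A minimal sketch of a hash-based split, assuming every document has a stable, unique string ID. The choice of MD5, the 100 buckets, and the 10% dev / 10% test proportions are assumptions, not settings from the talk. Python's built-in `hash()` is avoided because string hashes are randomized per process by default.

```python
import hashlib


def assign_split(doc_id, dev_pct=10, test_pct=10):
    """Map a document ID to 'train', 'dev' or 'test' via a stable hash bucket."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # Stable bucket in the range 0..99.
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


doc_ids = ["28799536", "doc-001", "doc-002", "doc-003"]
for doc_id in doc_ids:
    print(doc_id, "->", assign_split(doc_id))
```

Since the assignment depends only on the ID, adding new documents later never moves an existing document to a different split.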

Summary

Recommendations for running an ML project
● Avoid selection bias by formalizing the selection procedure
● Create deterministic, document-level train/dev/test splits
● Carefully design the data model / label scheme
● Write up detailed data guidelines
● Set up a meaningful extrinsic evaluation
● Look at inter-annotator agreement stats and plot a learning curve
● Apply a preliminary model back to the training data
● Manually inspect gold annotations and incorrect predictions
● Make sure you're climbing the right hill
● Data quality should be front and center!
