Slide 1

How to uncover and avoid structural biases in your Machine Learning projects

Sofie Van Landeghem
NLP and ML freelancer & Open-Source maintainer
PyData London 2024

Slide 2

Running an ML project: meeting a new client

● I want to solve problem X with AI.
● I have no data.
● There is no prototype yet.
● I want to achieve 95% accuracy.

Slide 3

Manage expectations

First prototype
★ Create data model
★ Assemble data sets
★ Build preliminary ML models
★ Accuracy & efficiency baseline

Robust solution
★ Iterate on the data model
★ Curate the data sets
★ Fine-tune the ML models
★ Improve accuracy & efficiency

Slide 4

Let's build a prototype

1. Select the right data
2. Model the data
3. Build ML models / algorithms
4. Perform an evaluation

Slide 5

Let's build a prototype - focusing on the data!

1. Select the right data
2. Model the data
3. Build ML models / algorithms to fit the data
4. Perform an evaluation on the data
5. Iterative improvements on the data model and algorithms

Slide 6

But why? Can't we just "zero-shot" it these days?

ChatGPT (or any other LLM) can generalize to unseen labels as it "understands" their meaning

Slide 7

We still need proper, gold annotations

● Evaluation
  ○ Measure performance and progress
● Training a supervised model
  ○ Smaller and specialized models can be more cost-efficient
● Tuning an LLM
  ○ Few-shot prompt examples
  ○ Fine-tuning an Open-Source LLM

Slide 8

Evaluate your evaluation

Slide 9

Use-case: Entity linking (EL) or disambiguation

"The vision of [WMO] is to provide world leadership in expertise and international cooperation in weather, climate, hydrology and water resources"

[Slide shows the corresponding WMO entries in WikiData and Wikipedia]

Slide 10

Assembling training data from wiki links

She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835
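A link like this pairs a surface mention ("William King") with its gold entity ("William King-Noel, 1st Earl of Lovelace") for free. As a rough illustration of how such pairs can be harvested, here is a minimal sketch; the regex and helper are my own, not from the talk:

```python
import re

# Matches [[target|anchor]] and [[target]] wiki-link markup.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def extract_links(wikitext):
    """Yield (mention, entity_title) pairs from raw wiki markup."""
    for match in WIKI_LINK.finditer(wikitext):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()
        yield anchor, target

sentence = "She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835"
print(list(extract_links(sentence)))
# [('William King', 'William King-Noel, 1st Earl of Lovelace')]
```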

Slide 11

Evaluation & comparison against baselines

● Assemble a knowledge base (KB) from WikiData
● Prune the KB for an ideal memory/accuracy trade-off
  ○ Remove infrequent entities/synonyms
  ○ The KB only stores 14% of all WikiData concepts
  ○ Max. achievable accuracy using oracle disambiguation: 84%
● Random baseline: 54%
● Entity linker model implemented with spaCy & Thinc: 79%

Looks great, right? 🚀

Slide 12

Let's compare with prior probabilities

● "Obama": almost always linked to the 44th US president (Q76)
● We could simply "predict" Q76 for every "Obama" mention, without training an ML model that disambiguates according to the context
● This prior-probability baseline obtains 78.2%
● So our model at 79% only marginally improves upon that baseline 😟
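A minimal sketch of such a prior-probability baseline, assuming (mention, entity_id) pairs like those harvested from wiki links above; the toy data is illustrative only:

```python
from collections import Counter, defaultdict

def train_prior_baseline(pairs):
    """Map each mention string to its most frequently linked entity.
    `pairs` is an iterable of (mention, entity_id) tuples."""
    counts = defaultdict(Counter)
    for mention, entity in pairs:
        counts[mention][entity] += 1
    return {mention: c.most_common(1)[0][0] for mention, c in counts.items()}

def accuracy(prior, test_pairs):
    hits = sum(1 for mention, gold in test_pairs if prior.get(mention) == gold)
    return hits / len(test_pairs)

train = [("Obama", "Q76"), ("Obama", "Q76"), ("Obama", "Q41773")]
test = [("Obama", "Q76"), ("Obama", "Q41773")]
prior = train_prior_baseline(train)
print(accuracy(prior, test))  # 0.5: the majority sense Q76 is predicted every time
```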

Slide 13

Let's inspect our data...

Societies in the ancient civilizations of [Greece] and Rome preferred small families.
  Gold reference: Ancient Greece
  Prediction: Greece

Agnes Maria of Andechs-Merania (died 1201) was a Queen of [France].
  Gold reference: current France
  Prediction: kingdom in Western Europe from 987 to 1791

Full metro systems are in operation in Paris, Lyon and [Marseille].
  Gold reference: Marseille Metro
  Prediction: Marseille

Slide 14

Recommendation: Make sure you climb the right hill

● Is your data reliable?
● Is your test set representative of real-world data?
● Does your evaluation take hierarchical concepts into account?
● Does your evaluation penalize near-misses on related concepts as harshly as completely wrong predictions?
● Are you evaluating your model's accuracy against sensible baselines?

Slide 15

Data model & Annotation strategy

Slide 16

Use-case: Identifying school names in text

● The client had created the data model and done the annotations
● The initial NER model obtained a 92% F-score 🥳
● The project was going to build on top of these results
● But we wanted to do a qualitative analysis first...

Slide 17

Recommendation: Apply the model back to the training dataset

Prediction mistakes may actually be annotation errors and/or uncover structural issues or biases:

Gold: ...All [CPS] schools use a centralized data warehouse...
Gold: ...reside within three [NYC] districts, and we have built...
Gold: ...graduation rates from [Broward County public schools], Florida...
Gold: ...expanded beyond the [Harrisburg school district] to now include sites...
FP: ...expanded beyond the [Harrisburg] school district to now include sites...
FP: ...reside within three [NYC districts], and we have built...
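With spaCy (which the entity linker earlier was built with), applying a trained pipeline back to its own training corpus and diffing predicted spans against gold spans takes only a few lines. A sketch, with hypothetical model and corpus paths:

```python
import spacy
from spacy.training import Corpus

# Hypothetical paths: a trained pipeline and its training data in spaCy's binary format.
nlp = spacy.load("./school_ner_model")
corpus = Corpus("./train.spacy")

for example in corpus(nlp):
    pred = nlp(example.reference.text)
    gold_spans = {(e.start_char, e.end_char, e.label_) for e in example.reference.ents}
    pred_spans = {(e.start_char, e.end_char, e.label_) for e in pred.ents}
    # Disagreements on the model's own training data often point at annotation errors.
    for start, end, label in pred_spans - gold_spans:
        print("FP on train:", label, repr(example.reference.text[start:end]))
    for start, end, label in gold_spans - pred_spans:
        print("FN on train:", label, repr(example.reference.text[start:end]))
```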

Slide 18

Recommendation: Write up annotation guidelines

● The "source of truth" throughout the project
● Define the label schema
● Detailed guidelines for annotators

Example guideline: When the full name of an entity and its acronym appear in sequence, each should be annotated as a separate entity:
❌ ...an internal report by the [University of Ghent (UGent)] has revealed...
✅ ...an internal report by the [University of Ghent] ([UGent]) has revealed...

Slide 19

Recommendation: An intuitive annotation framework

● Provide support for tedious and repetitive tasks
  ○ Provide sensible suggestions (rules, vocabulary, LLM, ...)
● Focus on a single task at once
  ○ Reduce cognitive load
  ○ Improve annotation consistency
● Validation callback to help enforce the guidelines (see the sketch below)
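As one illustration of such a validation callback, a sketch that rejects spans with surrounding whitespace or an unknown label. The function signature, span format and label set are hypothetical; hooks like Prodigy's validate_answer callback play this role in practice:

```python
VALID_LABELS = {"SCHOOL", "DISTRICT"}  # hypothetical label scheme

def validate_annotation(text, spans):
    """Reject annotations that violate the guidelines. Each span is a dict
    with "start", "end" and "label" keys (a common format; adapt to your tool)."""
    for span in spans:
        surface = text[span["start"]:span["end"]]
        if surface != surface.strip():
            raise ValueError(f"Span {surface!r} includes surrounding whitespace")
        if span["label"] not in VALID_LABELS:
            raise ValueError(f"Unknown label {span['label']!r}")
```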

Slide 20

Recommendation: Design an appropriate label scheme

Ensure that the label scheme...
○ Is clear and unambiguous
○ Fits with the envisioned modeling approach
○ Allows generalization / extension in the future
○ Is compatible with downstream usage of the data

Slide 21

Recommendation: Measure inter-annotator agreement

● Plot the overlap between labels from different annotators
● Here, there is some confusion between "City" and "Region"
● There is also confusion about whether or not to annotate an entity at all (label "None")
● Use this to update your guidelines and/or label scheme
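Agreement statistics such as Cohen's kappa, plus a label confusion matrix, come ready-made in scikit-learn. The token-level annotations below are made-up examples mirroring the "City"/"Region"/"None" confusion described above:

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical labels from two annotators over the same tokens;
# "None" marks tokens an annotator chose not to annotate.
labels_a = ["City", "Region", "None", "City", "None", "Region"]
labels_b = ["City", "City", "None", "City", "Region", "Region"]

print("Cohen's kappa:", cohen_kappa_score(labels_a, labels_b))
print(confusion_matrix(labels_a, labels_b, labels=["City", "Region", "None"]))
```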

Slide 22

Recommendation: Plot a learning curve

● Plot the F-score in relation to the % of training data used (the eval set is fixed)
● Estimate how much can be gained by annotating more data
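A minimal sketch of such a learning curve, where train_and_score is a hypothetical, user-supplied wrapper around your own training and evaluation pipeline:

```python
import matplotlib.pyplot as plt

def plot_learning_curve(train_data, eval_data, train_and_score,
                        fractions=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """Train on growing fractions of the training data, evaluating on a fixed
    eval set. `train_and_score(subset, eval_data)` trains a model from scratch
    on `subset` and returns its F-score on `eval_data`."""
    scores = []
    for frac in fractions:
        subset = train_data[: int(len(train_data) * frac)]
        scores.append(train_and_score(subset, eval_data))
    plt.plot([f * 100 for f in fractions], scores, marker="o")
    plt.xlabel("% of training data used")
    plt.ylabel("F-score on fixed eval set")
    plt.show()
```

If the curve is still climbing steeply at 100%, more annotation is likely to pay off; if it has flattened, invest elsewhere.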

Slide 23

Recommendation: Set up an extrinsic evaluation

Intrinsic evaluation
● Char-based annotation
● Data model corresponding to the ML/NLP task

Extrinsic evaluation
● Higher-level view
● Data corresponding to downstream requirements
● Robust to a changing data model

Example of extracted, downstream-ready data:

          # patients   Treatment Drug          Treatment Dose
Group 1   5            phenylephrine           1 μg/kg
Group 2   5            arginine vasopressin    0.03 U/kg
Group 3   5            epinephrine             1 μg/kg

Slide 24

An extrinsic evaluation reveals what is truly important

● Thinking back to our NER model:
  ○ Gold annotation: "Harrisburg school district"
  ○ Prediction: "Harrisburg"
● In a typical "strict" evaluation setting, this would count as both an FP and an FN, resulting in lower precision and lower recall
● In downstream processing, however, the correct school could still be identified, leading to no final error in the extrinsic evaluation
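To make the difference concrete, a toy comparison of strict span matching against a relaxed, overlap-based match that is closer to what the downstream step cares about; the character offsets are illustrative:

```python
def strict_match(gold, pred):
    """Exact boundary + label match: the typical 'strict' NER criterion."""
    return gold == pred

def relaxed_match(gold, pred):
    """Count overlapping spans with the same label as correct: a rough
    proxy for whether downstream processing can still recover the entity."""
    (gs, ge, gl), (ps, pe, pl) = gold, pred
    return gl == pl and ps < ge and gs < pe

gold = (20, 46, "SCHOOL")   # "Harrisburg school district"
pred = (20, 30, "SCHOOL")   # "Harrisburg"
print(strict_match(gold, pred))   # False -> counted as FP + FN in strict scoring
print(relaxed_match(gold, pred))  # True  -> downstream, the school is still found
```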

Slide 25

Bias in corpus construction

Slide 26

Example use-case: Identify fauna and seabed sediments

Different equipment results in varying image quality

Slide 27

Recommendation: Prototype - stick to a limited scope

● Prototype: clarify the initial scope
  ○ Sources?
  ○ Types of input?
  ○ Label set?
● Clarify with the client that the model will not (yet) generalize beyond this original scope
● Allows you to set a reliable benchmark

Slide 28

Recommendation: Beyond a prototype - make the models robust

● Data augmentation (see the sketch below)
  ○ Add artificial noise
  ○ Transpose, rotate, zoom and crop images to create more variation
● Explore different architectures and/or training hyperparameters
  ○ e.g. train with a high(er) level of dropout
  ○ Potentially sacrificing some accuracy points for generalizability
● Measure (and avoid) overfitting
  ○ If your models do well on your dev data but poorly on the final test portion, they are overfitting and will not be robust / reliable on new data
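For the image use-case, an augmentation pipeline might look like the following. torchvision is an assumption here (the talk does not prescribe a library); equivalent transforms exist in e.g. albumentations or Keras:

```python
from torchvision import transforms

# Randomized transforms applied at training time, so the model sees a
# slightly different version of each image on every epoch.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),   # zoom + crop
    transforms.ColorJitter(brightness=0.2, contrast=0.2),       # artificial noise in lighting
    transforms.ToTensor(),
])
```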

Slide 29

Recommendation: Production - detect data drift

● Manually
  ○ Qualitative analysis & feedback from users
  ○ Regular, focused annotation efforts to curate a new, up-to-date evaluation set
● Automated (see the sketch below)
  ○ Measure predicted label distributions to detect shifts in the input data
  ○ Measure the model's confidence scores on new predictions to detect degradation
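A sketch of the automated route, comparing the predicted label distribution on recent data against a reference distribution captured at launch. The label set and prediction lists are hypothetical toy data, and the 0.1 alert threshold would need tuning per project:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

LABELS = ["sand", "gravel", "rock"]  # hypothetical sediment label set

def label_distribution(predicted_labels):
    counts = np.array([predicted_labels.count(l) for l in LABELS], dtype=float)
    return counts / counts.sum()

reference = label_distribution(["sand", "sand", "gravel", "rock", "sand"])    # at launch
current = label_distribution(["rock", "rock", "gravel", "rock", "sand"])      # this week

# Jensen-Shannon distance lies in [0, 1]; alert when the predicted
# label distribution drifts beyond the tuned threshold.
if jensenshannon(reference, current) > 0.1:
    print("Possible data drift: inspect inputs, consider a fresh evaluation set")
```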

Slide 30

Example use-case: Identify political stances in news

Client's goal: identify all political actors & opinions on Brexit

Slide 31

Keyword-based corpus selection

● Text selection (by the client): articles mentioning "Brexit" in the title
● The client wants to run the resulting ML models on all news articles
● The keyword-based preselection of articles prevents you from assessing the number of false negatives in other articles
● Risk of missing relevant information that does not meet the original criterion

Slide 32

Recommendation: Formalize the input selection procedure

● Train a simple document-level classifier first (see the sketch below)
  ○ Distinguish political news articles from other categories
  ○ Identify Brexit-related articles
● Bonus advantage
  ○ Creates a general model that can be reused in future projects, e.g. when we want to analyze political stances around renewable energy
● This strategy is especially beneficial with imbalanced datasets
  ○ E.g. many irrelevant paragraphs → train a binary yes/no classifier first
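A minimal sketch of such a document-level pre-filter, using a TF-IDF + logistic-regression pipeline from scikit-learn; the texts and labels are toy stand-ins for a small annotation round:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = Brexit-related political news, 0 = anything else.
train_texts = ["MPs clash over Brexit deal", "Local team wins cup final"]
train_labels = [1, 0]

# Reusable document-level filter: cheap to train, easy to audit.
pre_filter = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                           LogisticRegression(max_iter=1000))
pre_filter.fit(train_texts, train_labels)

all_articles = ["Cabinet split on Brexit timetable", "New stadium opens downtown"]
relevant = [t for t in all_articles if pre_filter.predict([t])[0] == 1]
```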

Slide 33

Example: Waterfall strategy - step 1

Text categorization
● Quick document-level decision
● Train an ML classifier
● Reusable component
● An easy way to filter relevant documents, both for annotation and later in production

Slide 34

Example: Waterfall strategy - step 2

Fine-grained analyses
● e.g. Named Entity Recognition (NER)
● Only performed on documents that passed through the initial step

Having two separate steps
● Reduces cognitive load on annotators
● Allows you to train separate, specialized ML models

Slide 35

Avoid: Fully random dataset splits

● Once you have an annotated dataset, create train/dev/test splits
● Don't do this for real-world projects:
(Example taken from Karpathy's excellent free tutorial "Neural Networks: Zero to Hero")

Slide 36

Recommendation: Document-level splits

● Keep all examples from the same original document in the same split
● Otherwise, you're "leaking" information from training data to test data
● You'd be artificially boosting your performance numbers

Ex 1: "Mutated cancers constantly displayed strong estrogen receptor expression."
Ex 2: "In all mutated cases strong estrogen receptor expression was demonstrable."
(PMID: 28799536)

Slide 37

Recommendation: Deterministic & repeatable splits

Use deterministic (modulo) hashes of unique doc IDs to determine the split:
● Ensure that the same doc always ends up in the same split
● Even if you extend the dataset or re-run the splitting algorithm
● Prevent "leaking" information from training data into the evaluation numbers
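A minimal sketch of such a deterministic split, hashing each document's unique ID with a stable hash; the split percentages are illustrative defaults. Note that Python's built-in hash() is salted per process and would break repeatability:

```python
import hashlib

def assign_split(doc_id, dev_pct=10, test_pct=10):
    """Deterministically assign a document to train/dev/test based on a
    stable (MD5) hash of its unique ID, taken modulo 100."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"

print(assign_split("PMID:28799536"))  # the same ID lands in the same split, on every run
```

Because the assignment depends only on the ID, newly annotated documents slot into the existing splits without reshuffling anything already evaluated.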

Slide 38

Summary

Slide 39

Recommendations for running an ML project

● Avoid selection bias by formalizing the selection procedure
● Create deterministic, document-level train/dev/test splits
● Carefully design the data model / label scheme
● Write up detailed data guidelines
● Set up a meaningful extrinsic evaluation
● Look at inter-annotator agreement stats and plot a learning curve
● Apply a preliminary model back to the training data
● Manually inspect gold annotations and incorrect predictions
● Make sure you're climbing the right hill
● Data quality should be front and center!

Slide 40

sofi[email protected]
https://oxykodit.com
https://github.com/svlandeg
https://twitter.com/OxyKodit
https://www.linkedin.com/in/sofievanlandeghem/