2024-06-16-pydata_london

Sofie Van Landeghem

June 16, 2024

Transcript

  1. How to uncover and avoid structural biases in your Machine Learning projects

    Sofie Van Landeghem
    NLP and ML freelancer & Open-Source maintainer
    PyData London 2024
  2. Sofie Van Landeghem (OxyKodit) - Avoiding structural bias in your ML projects

    Running an ML project: meeting a new client
    "I want to solve problem X with AI. I have no data. There is no prototype yet. I want to achieve 95% accuracy."
  3. Manage expectations

    First prototype
    ★ Create data model
    ★ Assemble data sets
    ★ Build preliminary ML models
    ★ Accuracy & efficiency baseline

    Robust solution
    ★ Iterate on the data model
    ★ Curate the data sets
    ★ Fine-tune the ML models
    ★ Improve accuracy & efficiency
  4. Let's build a prototype

    1. Select the right data
    2. Model the data
    3. Build ML models / algorithms
    4. Perform an evaluation
  5. Let's build a prototype - focusing on the data!

    1. Select the right data
    2. Model the data
    3. Build ML models / algorithms to fit the data
    4. Perform an evaluation on the data
    5. Iterative improvements on the data model and algorithms
  6. But why? Can't we just "zero-shot" it these days?

    ChatGPT (or any other LLM) can generalize to unseen labels as it "understands" their meaning
  7. We still need proper, gold annotations

    • Evaluation
      ◦ Measure performance and progress
    • Training a supervised model
      ◦ Smaller and specialized models can be more cost efficient
    • Tuning an LLM
      ◦ Few-shot prompt examples
      ◦ Fine-tuning an Open-Source LLM
  8. Use-case: Entity linking (EL) or disambiguation

    "The vision of [WMO] is to provide world leadership in expertise and international cooperation in weather, climate, hydrology and water resources"
    WikiData / Wikipedia
  9. Assembling training data from wiki links

    She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835
  10. Evaluation & comparison against baselines

    • Assemble a knowledge base (KB) from WikiData
    • Prune KB for ideal mem/accuracy trade-off
      ◦ Remove infrequent entities/synonyms
      ◦ KB only stores 14% of all WikiData concepts
      ◦ Max. achievable accuracy using oracle disambiguation: 84%
    • Random baseline: 54%
    • Entity linker model implemented with spaCy & Thinc: 79%
    Looks great, right? 🚀
  11. Let's compare with prior probabilities

    • "Obama": almost always linked to the 44th US president (Q76)
    • We could simply "predict" Q76 for every "Obama" mention without training an ML model that disambiguates according to the context
    • This prior-probability baseline obtains 78.2%
    • So our model at 79% only marginally improves upon that baseline 😟
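
A minimal sketch of such a prior-probability baseline, assuming the training data is available as (mention, entity ID) pairs. The function names and the toy data are illustrative and not taken from the talk; `Q_OTHER` is a made-up placeholder ID.

```python
from collections import Counter, defaultdict

def build_priors(training_links):
    """Map each mention string to its most frequent entity ID in the training links."""
    counts = defaultdict(Counter)
    for mention, entity_id in training_links:
        counts[mention][entity_id] += 1
    return {mention: c.most_common(1)[0][0] for mention, c in counts.items()}

def prior_baseline_accuracy(priors, test_links):
    """Accuracy of always predicting the most frequent entity, ignoring context."""
    correct = sum(1 for mention, gold in test_links if priors.get(mention) == gold)
    return correct / len(test_links)

# Toy data: Q76 is the ID from the slide; "Q_OTHER" is a made-up placeholder.
train = [("Obama", "Q76"), ("Obama", "Q76"), ("Obama", "Q_OTHER")]
test = [("Obama", "Q76"), ("Obama", "Q_OTHER")]
priors = build_priors(train)
accuracy = prior_baseline_accuracy(priors, test)
```

Reporting this number next to the trained model's accuracy shows how much of the score comes from context-based disambiguation rather than from mention frequency alone.
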
  12. Let's inspect our data...

    Societies in the ancient civilizations of [Greece] and Rome preferred small families.
    Gold reference: Ancient Greece
    Prediction: Greece

    Agnes Maria of Andechs-Merania (died 1201) was a Queen of [France].
    Gold reference: current France
    Prediction: kingdom in Western Europe from 987 to 1791

    Full metro systems are in operation in Paris, Lyon and [Marseille]
    Gold reference: Marseille Metro
    Prediction: Marseille
  13. Recommendation: Make sure you climb the right hill

    • Is your data reliable?
    • Is your test set representative of real-world data?
    • Is your evaluation taking hierarchical concepts into account?
    • Is your evaluation penalizing related concepts as harshly as other mistakes?
    • Are you evaluating your model's accuracy against sensible baselines?
  14. Use-case: Identifying school names in text

    • Client had created the data model and done the annotations
    • Initial NER model obtained 92% F-score 🥳
    • The project was going to build on top of these results
    • But we wanted to do a qualitative analysis first...
  15. Recommendation: Apply the model back to the training dataset

    Prediction mistakes may actually be annotation errors and/or uncover structural issues or biases:
    Gold: ...All [CPS] schools use a centralized data warehouse...
    Gold: ...reside within three [NYC] districts, and we have built...
    Gold: ...graduation rates from [Broward County public schools], Florida...
    Gold: ...expanded beyond the [Harrisburg school district] to now include sites...
    FP: ...expanded beyond the [Harrisburg] school district to now include sites...
    FP: ...reside within three [NYC districts], and we have built...
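
One way to collect such cases for manual review is to diff gold and predicted spans on the training data. This is a sketch under the assumption that spans are (start, end, label) character offsets; the `SCHOOL` label and the helper name are hypothetical.

```python
def review_candidates(gold_spans, predicted_spans):
    """Return spans that differ between gold and prediction, for manual review.

    Spans are (start, end, label) tuples over character offsets.
    """
    gold, pred = set(gold_spans), set(predicted_spans)
    return {
        "false_positives": sorted(pred - gold),
        "false_negatives": sorted(gold - pred),
    }

# Toy example mirroring "[Harrisburg school district]" vs "[Harrisburg]";
# the offsets and the SCHOOL label are illustrative assumptions.
text = "expanded beyond the Harrisburg school district to now include sites"
gold = [(20, 46, "SCHOOL")]
pred = [(20, 30, "SCHOOL")]
diff = review_candidates(gold, pred)
for kind, spans in diff.items():
    for start, end, label in spans:
        print(kind, label, repr(text[start:end]))
```

Each listed span is a candidate for either a model error or an annotation inconsistency; deciding which requires looking at the text.
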
  16. Recommendation: Write up annotation guidelines

    • The "source of truth" throughout the project
    • Define label schema
    • Detailed guidelines for annotators

    When the full name of an entity and its acronym appear in sequence, each should be annotated as a separate entity:
    ❌ ...an internal report by the [University of Ghent (UGent)] has revealed...
    ✅ ...an internal report by the [University of Ghent] ([UGent]) has revealed...
  17. Recommendation: An intuitive annotation framework

    • Provide support for tedious and repetitive tasks
      ◦ Provide sensible suggestions (rules, vocabulary, LLM, ...)
    • Focus on a single task at once
      ◦ Reduce cognitive load
      ◦ Improve annotation consistency
    • Validation callback to help enforce the guidelines
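
A validation callback could look roughly like the sketch below. It is not tied to any particular annotation tool; the function name, the span format and the regular expression are assumptions. It flags spans that merge a full name with its parenthesised acronym, plus spans with stray whitespace.

```python
import re

# Matches text like "University of Ghent (UGent)": a name followed by a
# parenthesised acronym inside one span. The pattern is an illustrative
# assumption, not a rule taken from the talk.
NAME_WITH_ACRONYM = re.compile(r".+\s\([A-Za-z]{2,}\)$")

def validate_spans(text, spans):
    """Return human-readable guideline violations for annotated (start, end) spans."""
    errors = []
    for start, end in spans:
        span_text = text[start:end]
        if span_text != span_text.strip():
            errors.append(f"Span {span_text!r} has leading/trailing whitespace")
        if NAME_WITH_ACRONYM.match(span_text):
            errors.append(f"Span {span_text!r} merges a name and its acronym")
    return errors

text = "an internal report by the University of Ghent (UGent) has revealed"
merged = validate_spans(text, [(26, 53)])             # one span over name + acronym
separate = validate_spans(text, [(26, 45), (47, 52)])  # two separate spans
```

Running such a check when an annotator submits an example gives immediate feedback, instead of discovering the inconsistency later in the trained model.
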
  18. Recommendation: Design an appropriate label scheme

    Ensure that the label scheme ...
      ◦ Is clear and unambiguous
      ◦ Fits with the envisioned modeling approach
      ◦ Allows generalization / extension in the future
      ◦ Is compatible with downstream usage of the data
  19. Recommendation: Measure inter-annotator agreement

    • Plotting overlap between labels from different annotators
    • Here, there is some confusion between "City" and "Region"
    • There is also confusion whether or not to annotate an entity (label "None")
    • Use this to update your guidelines and/or label scheme
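
The label overlap can be computed before plotting it. The sketch below counts label pairs between two annotators and additionally computes Cohen's kappa, a common agreement statistic that the slide itself does not name. The toy annotations are made up.

```python
from collections import Counter

def label_confusion(labels_a, labels_b):
    """Count how often annotator A's label co-occurs with annotator B's label."""
    return Counter(zip(labels_a, labels_b))

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy annotations for the same six mentions; the label names follow the slide.
annotator_a = ["City", "City", "Region", "None", "City", "Region"]
annotator_b = ["City", "Region", "Region", "City", "City", "Region"]
confusion = label_confusion(annotator_a, annotator_b)
kappa = cohen_kappa(annotator_a, annotator_b)
```

The off-diagonal entries of `confusion`, such as `("City", "Region")`, point at the label pairs whose definitions need sharpening in the guidelines.
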
  20. Recommendation: Plot a learning curve

    • Plot the F-score in relation to the % of training data used (eval set is fixed)
    • Estimate how much can be gained by annotating more data
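
The points of such a curve can be computed with a small loop. In this sketch `train_fn` and `score_fn` are placeholders for real model training and F-score computation; the gazetteer "model" and the accuracy metric below are toy stand-ins so that the example runs.

```python
def learning_curve(train_examples, eval_examples, train_fn, score_fn,
                   fractions=(0.25, 0.5, 0.75, 1.0)):
    """Train on growing prefixes of the training data, scoring on a fixed eval set."""
    points = []
    for fraction in fractions:
        subset = train_examples[: max(1, int(len(train_examples) * fraction))]
        model = train_fn(subset)
        points.append((fraction, score_fn(model, eval_examples)))
    return points

# Toy stand-ins: a gazetteer lookup as "model" and accuracy as the metric.
def train_gazetteer(examples):
    return dict(examples)

def accuracy(model, examples):
    return sum(model.get(mention) == label for mention, label in examples) / len(examples)

train = [("Paris", "City"), ("Lyon", "City"), ("Bavaria", "Region"), ("Texas", "Region")]
dev = [("Paris", "City"), ("Bavaria", "Region"), ("Texas", "Region"),
       ("Lyon", "City"), ("Ghent", "City")]
curve = learning_curve(train, dev, train_gazetteer, accuracy)
```

If the score is still rising steeply at 100% of the data, more annotation is likely to pay off; a flat tail suggests looking at data quality or the model instead.
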
  21. Recommendation: Set up an extrinsic evaluation

    Intrinsic evaluation
    • Char-based annotation
    • Data model corresponding to the ML/NLP task

    |         | # patients | Treatment Drug       | Treatment Dose |
    |---------|------------|----------------------|----------------|
    | Group 1 | 5          | phenylephrine        | 1 μg/kg        |
    | Group 2 | 5          | arginine vasopressin | 0.03 U/kg      |
    | Group 3 | 5          | epinephrine          | 1 μg/kg        |

    Extrinsic evaluation
    • Higher-level view
    • Data corresponding to downstream requirements
    • Robust to changing data model
  22. An extrinsic evaluation reveals what is truly important

    • Thinking back to our NER model:
      ◦ Gold annotation: "Harrisburg school district"
      ◦ Prediction: "Harrisburg"
    • In a typical "strict" evaluation setting, this would be a FP and a FN, resulting in lower precision and lower recall
    • In downstream processing however, the correct school could be identified, leading to no final error in the extrinsic evaluation
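
The contrast can be made concrete with a small sketch. The lookup table below is a hypothetical stand-in for whatever downstream component resolves a span to a school; the IDs are made up.

```python
def strict_counts(gold_spans, predicted_spans):
    """Exact-match evaluation: returns (true positives, false positives, false negatives)."""
    gold, pred = set(gold_spans), set(predicted_spans)
    return len(gold & pred), len(pred - gold), len(gold - pred)

def resolve_schools(spans, lookup):
    """Downstream step: map span texts to school IDs, ignoring unresolvable spans."""
    return {lookup[s.lower()] for s in spans if s.lower() in lookup}

# Hypothetical lookup table standing in for the downstream school resolver.
SCHOOL_LOOKUP = {
    "harrisburg": "harrisburg-sd",
    "harrisburg school district": "harrisburg-sd",
}

gold = ["Harrisburg school district"]
pred = ["Harrisburg"]
tp, fp, fn = strict_counts(gold, pred)
extrinsic_match = resolve_schools(gold, SCHOOL_LOOKUP) == resolve_schools(pred, SCHOOL_LOOKUP)
```

Here the strict evaluation records one false positive and one false negative, while the downstream result is identical for gold and prediction.
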
  23. Example use-case: Identify fauna and seabed sediments

    Different equipment results in varying image quality
  24. Recommendation: Prototype - stick to a limited scope

    • Prototype: clarify the initial scope
      ◦ Sources?
      ◦ Types of input?
      ◦ Label set?
    • Clarify with the client that the model will not (yet) generalize beyond this original scope
    • Allows you to set a reliable benchmark
  25. Recommendation: Beyond a prototype - make the models robust

    • Data augmentation
      ◦ Add artificial noise
      ◦ Transpose, turn, zoom and crop images to create more variation
    • Explore different architectures and/or training hyperparameters
      ◦ e.g. train with high(er) level of drop-out
      ◦ Potentially sacrificing some accuracy points for generalizability
    • Measure (and avoid) overfitting
      ◦ If your models do well on your dev data but poorly on the final test portion, your models are overfitting and will not be robust / reliable on new data
  26. Recommendation: Production - detect data drift

    • Manually
      ◦ Qualitative analysis & feedback from users
      ◦ Regular focused annotation efforts to curate a new, up-to-date evaluation set
    • Automated
      ◦ Measure predicted label distributions to detect shifts in the input data
      ◦ Measure model's confidence scores on new predictions to detect degradation
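
For the automated part, one simple option is to compare the predicted label distribution of a recent batch against a reference period. The sketch below uses total variation distance; the metric choice, the label names and the threshold are assumptions, not recommendations from the talk.

```python
from collections import Counter

def label_distribution(labels):
    """Relative frequency of each predicted label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def total_variation_distance(dist_a, dist_b):
    """Half the L1 distance between two label distributions (0 = identical, 1 = disjoint)."""
    labels = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(l, 0.0) - dist_b.get(l, 0.0)) for l in labels)

# Toy predictions: a balanced reference period versus a skewed recent batch.
reference = label_distribution(["fauna"] * 50 + ["sediment"] * 50)
current = label_distribution(["fauna"] * 20 + ["sediment"] * 80)
drift = total_variation_distance(reference, current)
DRIFT_THRESHOLD = 0.2  # assumed value; tune on historical data
drift_detected = drift > DRIFT_THRESHOLD
```

A shift in predicted labels does not prove that the model has degraded, but it is a cheap signal for when to schedule a new round of focused annotation.
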
  27. Example use-case: Identify political stances in news

    Client's goal: identify all political actors & opinions on Brexit
  28. Keyword-based corpus selection

    • Text selection (by client): articles mentioning "Brexit" in the title
    • The client wants to run the resulting ML models on all news articles
    • The keyword-based preselection of articles prevents you from assessing the number of false negatives in other articles
    • Risk of missing relevant information that does not meet the original criterion
  29. Recommendation: Formalize the input selection procedure

    • Train a simple document-level classifier first
      ◦ Distinguish political news articles from other categories
      ◦ Identify Brexit-related articles
    • Bonus advantage
      ◦ Creating a general model that can be reused in future projects, e.g. when we want to analyze political stances around renewable energy
    • This strategy is especially beneficial with imbalanced datasets
      ◦ E.g. many irrelevant paragraphs → train a binary yes/no classifier first
  30. Example: Waterfall strategy - step 1

    Text categorization
    • Quick document-level decision
    • Train ML classifier
    • Reusable component
    • Easy way to filter relevant documents both for annotation as well as later in production
  31. Example: Waterfall strategy - step 2

    Fine-grained analyses
    • e.g. Named Entity Recognition (NER)
    • Only performed on documents that passed through the initial step

    Having two separate steps
    • Reduces cognitive load on annotators
    • Allows you to train separate, specialized ML models
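
The two steps can be chained as in the sketch below. Both components are hard-coded stand-ins so that the example runs; in a real project they would be a trained text classifier and a trained NER model. All names, scores and the 0.5 threshold are hypothetical.

```python
def waterfall(documents, relevance_score, extract_entities, threshold=0.5):
    """Step 1 filters documents; step 2 analyses only the ones that pass."""
    results = {}
    for doc_id, text in documents.items():
        if relevance_score(text) >= threshold:
            results[doc_id] = extract_entities(text)
    return results

docs = {
    "a1": "The referendum result divided Westminster.",
    "a2": "A local bakery wins a national award.",
}

# Hard-coded stand-ins so the sketch runs; real projects would call a trained
# text classifier and a trained NER model here.
PRETEND_SCORES = {docs["a1"]: 0.93, docs["a2"]: 0.08}

def toy_relevance_score(text):
    return PRETEND_SCORES.get(text, 0.0)

def toy_ner(text):
    # Capitalised tokens that are not sentence-initial.
    return [tok.strip(".,") for tok in text.split()[1:] if tok[:1].isupper()]

output = waterfall(docs, toy_relevance_score, toy_ner)
```

Because the filter is a separate component, the same function can select documents for annotation and gate documents in production.
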
  32. Avoid: Create fully random dataset splits

    • Once you have a dataset annotated, create train/dev/test splits
    • Don't do this for real-world projects:
    (Taken from Karpathy's excellent free tutorial "Neural Networks: Zero to Hero")
  33. Recommendation: Document-level splits

    • Keep all examples from the same original document in the same split
    • Otherwise, you're "leaking" information from training data to test data
    • You'd be artificially boosting your performance numbers

    Ex 1: "Mutated cancers constantly displayed strong estrogen receptor expression."
    Ex 2: "In all mutated cases strong estrogen receptor expression was demonstrable."
    PMID: 28799536
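
A quick sanity check for this kind of leakage is to verify that no document ID occurs in more than one split. The sketch assumes each example carries its source document ID, and that both example sentences come from the document cited on the slide; `doc-2` is a made-up ID.

```python
from collections import defaultdict

def find_leaking_docs(examples):
    """Return doc IDs whose examples appear in more than one split.

    Each example is a (doc_id, split) pair.
    """
    splits_per_doc = defaultdict(set)
    for doc_id, split in examples:
        splits_per_doc[doc_id].add(split)
    return sorted(d for d, splits in splits_per_doc.items() if len(splits) > 1)

# A sentence-level random split can scatter near-duplicate sentences from one
# document across train and test; a document-level split cannot.
random_split = [("28799536", "train"), ("28799536", "test"), ("doc-2", "train")]
doc_level_split = [("28799536", "train"), ("28799536", "train"), ("doc-2", "test")]
leaks_random = find_leaking_docs(random_split)
leaks_doc_level = find_leaking_docs(doc_level_split)
```
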
  34. Recommendation: Deterministic & repeatable splits

    Use deterministic (modulo) hashes of unique doc IDs to determine the split
    • Ensure that the same doc is always in the same split
    • Even if you extend the dataset or re-run the splitting algorithm
    • Prevent "leaking" information from training to eval numbers
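
A minimal sketch of a hash-based split, assuming every document has a stable string ID. The 80/10/10 proportions and the choice of SHA-256 are assumptions. Python's built-in `hash()` is avoided because string hashes are salted per process by default, which would make the split non-repeatable.

```python
import hashlib

def assign_split(doc_id, dev_pct=10, test_pct=10):
    """Deterministically map a document ID to 'train', 'dev' or 'test'."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"

# The assignment depends only on the ID, so adding documents later never
# moves an existing document to a different split.
splits = {f"doc-{i}": assign_split(f"doc-{i}") for i in range(1000)}
```

Since all examples from a document share its ID, this also yields document-level splits without extra bookkeeping.
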
  35. Recommendations for running an ML project

    • Avoid selection bias by formalizing the selection procedure
    • Create deterministic, document-level train/dev/test splits
    • Carefully design the data model / label scheme
    • Write up detailed data guidelines
    • Set up a meaningful extrinsic evaluation
    • Look at inter-annotator agreement stats and plot a learning curve
    • Apply a preliminary model back to the training data
    • Manually inspect gold annotations and incorrect predictions
    • Make sure you're climbing the right hill
    • Data quality should be front and center!
  36. Contact

    https://oxykodit.com
    https://github.com/svlandeg
    https://twitter.com/OxyKodit
    https://www.linkedin.com/in/sofievanlandeghem/