ML projects
Running an ML project: meeting a new client
I want to solve problem X with AI. I have no data. There is no prototype yet. I want to achieve 95% accuracy.

Manage expectations
Robust solution
★ Iterate on the data model
★ Curate the data sets
★ Fine-tune the ML models
★ Improve accuracy & efficiency
First prototype
★ Create data model
★ Assemble data sets
★ Build preliminary ML models
★ Accuracy & efficiency baseline

Let's build a prototype - focusing on the data!
1. Select the right data
2. Model the data
3. Build ML models / algorithms to fit the data
4. Perform an evaluation on the data
5. Iterative improvements on the data model and algorithms

But why? Can't we just "zero-shot" it these days?
ChatGPT (or any other LLM) can generalize to unseen labels as it "understands" their meaning

We still need proper, gold annotations
• Evaluation
  ◦ Measure performance and progress
• Training a supervised model
  ◦ Smaller and specialized models can be more cost-efficient
• Tuning an LLM
  ◦ Few-shot prompt examples
  ◦ Fine-tuning an open-source LLM

Use-case: Entity linking (EL) or disambiguation
"The vision of [WMO] is to provide world leadership in expertise and international cooperation in weather, climate, hydrology and water resources"
WikiData / Wikipedia

Evaluation & comparison against baselines
• Assemble a knowledge base (KB) from WikiData
• Prune KB for ideal memory/accuracy trade-off
  ◦ Remove infrequent entities/synonyms
  ◦ KB only stores 14% of all WikiData concepts
  ◦ Max. achievable accuracy using oracle disambiguation: 84%
• Random baseline: 54%
• Entity linker model implemented with spaCy & Thinc: 79%
Looks great, right? 🚀

Let's compare with prior probabilities
• "Obama": almost always linked to the 44th US president (Q76)
• We could simply "predict" Q76 for every "Obama" mention without training an ML model that disambiguates according to the context
• This prior-probability baseline obtains 78.2%
• So our model at 79% only marginally improves upon that baseline 😟

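A prior-probability baseline of this kind can be sketched in a few lines. This is a minimal illustration, not the setup used in the project: the function names and the toy data are assumptions, and "Q_OTHER" is a made-up placeholder ID.

```python
from collections import Counter, defaultdict


def build_prior_table(training_pairs):
    """Map each mention string to its most frequent entity ID in the training data."""
    counts = defaultdict(Counter)
    for mention, entity_id in training_pairs:
        counts[mention][entity_id] += 1
    return {mention: c.most_common(1)[0][0] for mention, c in counts.items()}


def prior_baseline_accuracy(prior_table, eval_pairs):
    """Accuracy when always predicting the most frequent entity, ignoring context."""
    if not eval_pairs:
        return 0.0
    correct = sum(1 for mention, gold in eval_pairs if prior_table.get(mention) == gold)
    return correct / len(eval_pairs)


# Toy data: Q76 comes from the slide, "Q_OTHER" is a hypothetical placeholder.
train = [("Obama", "Q76"), ("Obama", "Q76"), ("Obama", "Q_OTHER")]
dev = [("Obama", "Q76"), ("Obama", "Q_OTHER")]

table = build_prior_table(train)
accuracy = prior_baseline_accuracy(table, dev)
```

Reporting this number next to the trained model's score shows how much the model actually gains from looking at the context.
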
Let's inspect our data...
Societies in the ancient civilizations of [Greece] and Rome preferred small families.
Gold reference: Ancient Greece
Prediction: Greece
Agnes Maria of Andechs-Merania (died 1201) was a Queen of [France].
Gold reference: current France
Prediction: kingdom in Western Europe from 987 to 1791
Full metro systems are in operation in Paris, Lyon and [Marseille]
Gold reference: Marseille Metro
Prediction: Marseille

Recommendation: Make sure you climb the right hill
• Is your data reliable?
• Is your test set representative of real-world data?
• Is your evaluation taking hierarchical concepts into account?
• Is your evaluation penalizing related concepts as harshly as other mistakes?
• Are you evaluating your model's accuracy with sensible baselines?

Use-case: Identifying school names in text
• Client had created the data model and done the annotations
• Initial NER model obtained 92% F-score 🥳
• The project was going to build on top of these results
• But we wanted to do a qualitative analysis first...

Recommendation: Apply the model back to the training dataset
Prediction mistakes may actually be annotation errors and/or uncover structural issues or biases:
Gold: ...All [CPS] schools use a centralized data warehouse...
Gold: ...reside within three [NYC] districts, and we have built...
Gold: ...graduation rates from [Broward County public schools], Florida...
Gold: ...expanded beyond the [Harrisburg school district] to now include sites...
FP: ...expanded beyond the [Harrisburg] school district to now include sites...
FP: ...reside within three [NYC districts], and we have built...

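One way to surface such cases is to diff the gold spans against the model's predictions on the training texts and read through the mismatches. The sketch below assumes spans are stored as (start, end, label) character offsets; the helper name and the "SCHOOL" label are hypothetical.

```python
def span_mismatches(text, gold_spans, predicted_spans):
    """List predicted-but-not-gold and gold-but-not-predicted spans for manual review."""
    gold, pred = set(gold_spans), set(predicted_spans)
    return {
        "false_positives": [(text[s:e], label) for s, e, label in sorted(pred - gold)],
        "false_negatives": [(text[s:e], label) for s, e, label in sorted(gold - pred)],
    }


text = "expanded beyond the Harrisburg school district to now include sites"
start = text.index("Harrisburg")
# "SCHOOL" is a hypothetical label name used for illustration only.
gold = [(start, start + len("Harrisburg school district"), "SCHOOL")]
pred = [(start, start + len("Harrisburg"), "SCHOOL")]

report = span_mismatches(text, gold, pred)
```

Each mismatch is a candidate for inspection: it may be a model error, an annotation error, or a sign that the guidelines are ambiguous.
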
Recommendation: Write up annotation guidelines
• The "source of truth" throughout the project
• Define label schema
• Detailed guidelines for annotators
When the full name of an entity and its acronym appear in sequence, each should be annotated as a separate entity:
❌ ...an internal report by the [University of Ghent (UGent)] has revealed...
✅ ...an internal report by the [University of Ghent] ([UGent]) has revealed...

Recommendation: An intuitive annotation framework
• Provide support for tedious and repetitive tasks
  ◦ Provide sensible suggestions (rules, vocabulary, LLM, ...)
• Focus on a single task at once
  ◦ Reduce cognitive load
  ◦ Improve annotation consistency
• Validation callback to help enforce the guidelines

Recommendation: Design an appropriate label scheme
Ensure that the label scheme ...
  ◦ Is clear and unambiguous
  ◦ Fits with the envisioned modeling approach
  ◦ Allows generalization / extension in the future
  ◦ Is compatible with downstream usage of the data

Recommendation: Measure inter-annotator agreement
• Plot the overlap between labels from different annotators
• Here, there is some confusion between "City" and "Region"
• There is also confusion about whether or not to annotate an entity (label "None")
• Use this to update your guidelines and/or label scheme

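The overlap between two annotators can be tabulated as pairwise label counts. The sketch below also computes Cohen's kappa, a common agreement statistic that the slide does not mention and that is added here as an assumption. The toy labels are invented, and the code assumes both annotators labeled the same items in the same order.

```python
from collections import Counter


def confusion_counts(labels_a, labels_b):
    """Count how often annotator A's label co-occurs with annotator B's label."""
    return Counter(zip(labels_a, labels_b))


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same, non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label everywhere
    return (observed - expected) / (1 - expected)


# Invented toy annotations using the labels from the slide.
annotator_a = ["City", "City", "Region", "None", "City", "Region"]
annotator_b = ["City", "Region", "Region", "None", "None", "Region"]

confusion = confusion_counts(annotator_a, annotator_b)
kappa = cohen_kappa(annotator_a, annotator_b)
```

Off-diagonal entries such as ("City", "Region") or ("City", "None") point at the label pairs the guidelines should address.
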
Recommendation: Plot a learning curve
• Plot the F-score in relation to the % of training data used (eval set is fixed)
• Estimate how much can be gained by annotating more data

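The data points for such a curve can be collected with a small loop. In this sketch, `train_and_score` is a hypothetical callable that trains on a subset and returns a score on the fixed evaluation set; the toy version below merely memorizes items and stands in for real training code.

```python
def learning_curve(train_examples, eval_examples, train_and_score,
                   fractions=(0.25, 0.5, 0.75, 1.0)):
    """Return (fraction, score) pairs; the evaluation set stays fixed throughout."""
    points = []
    for fraction in fractions:
        n = max(1, int(len(train_examples) * fraction))
        score = train_and_score(train_examples[:n], eval_examples)
        points.append((fraction, score))
    return points


def toy_train_and_score(train_subset, eval_set):
    """Stand-in for real training: 'learns' by memorizing the items it has seen."""
    known = set(train_subset)
    return sum(item in known for item in eval_set) / len(eval_set)


train = ["a", "b", "c", "d", "e", "f", "g", "h"]
held_out = ["a", "c", "e", "g"]
curve = learning_curve(train, held_out, toy_train_and_score)
```

In a real project the training data would be shuffled once (deterministically) before slicing, and the resulting points plotted. A curve that is still rising at 100% suggests that more annotation is likely to help.
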
Recommendation: Set up an extrinsic evaluation
Intrinsic evaluation
• Char-based annotation
• Data model corresponding to the ML/NLP task

         # patients  Treatment drug        Treatment dose
Group 1  5           phenylephrine         1 μg/kg
Group 2  5           arginine vasopressin  0.03 U/kg
Group 3  5           epinephrine           1 μg/kg

Extrinsic evaluation
• Higher-level view
• Data corresponding to downstream requirements
• Robust to changing data model

An extrinsic evaluation reveals what is truly important
• Thinking back to our NER model:
  ◦ Gold annotation: "Harrisburg school district"
  ◦ Prediction: "Harrisburg"
• In a typical "strict" evaluation setting, this would be a FP and a FN, resulting in lower precision and lower recall
• In downstream processing however, the correct school could be identified, leading to no final error in the extrinsic evaluation

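The contrast between the two evaluation settings can be made concrete with a small sketch. The strict scorer counts only exact span matches. The extrinsic check relies on a hypothetical lookup table (`SCHOOL_IDS`) that stands in for whatever downstream normalization a real project would use.

```python
def strict_prf(gold_spans, predicted_spans):
    """Precision, recall and F-score where only exact span matches count."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical downstream lookup: both surface forms resolve to the same school.
SCHOOL_IDS = {
    "harrisburg": "harrisburg-sd",
    "harrisburg school district": "harrisburg-sd",
}


def resolve_schools(mentions):
    """Map mention strings to downstream school IDs, skipping unknown mentions."""
    return {SCHOOL_IDS[m.lower()] for m in mentions if m.lower() in SCHOOL_IDS}


text = "expanded beyond the Harrisburg school district to now include sites"
start = text.index("Harrisburg")
gold = [(start, start + len("Harrisburg school district"))]
pred = [(start, start + len("Harrisburg"))]

strict_scores = strict_prf(gold, pred)
gold_schools = resolve_schools(text[s:e] for s, e in gold)
pred_schools = resolve_schools(text[s:e] for s, e in pred)
extrinsic_match = gold_schools == pred_schools
```

Here the strict scores are all zero, while the downstream comparison finds the same school for gold and prediction.
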
Recommendation: Prototype - stick to a limited scope
• Prototype: clarify the initial scope
  ◦ Sources?
  ◦ Types of input?
  ◦ Label set?
• Clarify with the client that the model will not (yet) generalize beyond this original scope
• Allows you to set a reliable benchmark

Recommendation: Beyond a prototype - make the models robust
• Data augmentation
  ◦ Add artificial noise
  ◦ Transpose, turn, zoom and crop images to create more variation
• Explore different architectures and/or training hyperparameters
  ◦ e.g. train with high(er) level of drop-out
  ◦ Potentially sacrificing some accuracy points for generalizability
• Measure (and avoid) overfitting
  ◦ If your models do well on your dev data but poorly on the final test portion, your models are overfitting and will not be robust / reliable on new data

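For text data, "artificial noise" could for instance mean simulated typos. The sketch below swaps adjacent letters at a configurable rate; the function name, the default rate and the choice of noise type are all assumptions made for illustration.

```python
import random


def add_character_noise(text, rate=0.05, seed=0):
    """Randomly swap adjacent letters to simulate typos; the text length is unchanged."""
    rng = random.Random(seed)  # seeded so the augmentation is repeatable
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


original = "graduation rates from Broward County public schools"
noisy = add_character_noise(original, rate=0.2, seed=42)
```

Because only letters inside a word are swapped and the length stays the same, character offsets of word-aligned annotations remain valid after augmentation.
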
Recommendation: Production - detect data drift
• Manually
  ◦ Qualitative analysis & feedback from users
  ◦ Regular focused annotation efforts to curate a new, up-to-date evaluation set
• Automated
  ◦ Measure predicted label distributions to detect shifts in the input data
  ◦ Measure model's confidence scores on new predictions to detect degradation

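An automated check on predicted label distributions might look like the sketch below. It compares a reference batch of predictions with a recent batch using total variation distance. Both the distance measure and the 0.2 alert threshold are assumptions chosen for illustration, and the label names are invented.

```python
from collections import Counter


def label_distribution(labels):
    """Relative frequency of each predicted label."""
    if not labels:
        raise ValueError("Need at least one prediction")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(p, q):
    """Half the summed absolute difference between two distributions (0 = identical)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


# Invented toy predictions: the share of "politics" jumps from 50% to 90%.
reference_predictions = ["politics"] * 5 + ["other"] * 5
recent_predictions = ["politics"] * 9 + ["other"] * 1

distance = total_variation_distance(
    label_distribution(reference_predictions),
    label_distribution(recent_predictions),
)
DRIFT_THRESHOLD = 0.2  # hypothetical value; tune on historical data
drift_detected = distance > DRIFT_THRESHOLD
```

A triggered alert does not say what changed, only that a manual look at recent inputs is warranted.
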
Keyword-based corpus selection
• Text selection (by client): articles mentioning "Brexit" in the title
• The client wants to run the resulting ML models on all news articles
• The keyword-based preselection of articles prevents you from assessing the number of false negatives in other articles
• Risk of missing relevant information that does not meet the original criterion

Recommendation: Formalize the input selection procedure
• Train a simple document-level classifier first
  ◦ Distinguish political news articles from other categories
  ◦ Identify Brexit-related articles
• Bonus advantage
  ◦ Creating a general model that can be reused in future projects, e.g. when we want to analyze political stances around renewable energy
• This strategy is especially beneficial with imbalanced datasets
  ◦ E.g. many irrelevant paragraphs → train a binary yes/no classifier first

Example: Waterfall strategy - step 1
Text categorization
• Quick document-level decision
• Train ML classifier
• Reusable component
• Easy way to filter relevant documents both for annotation as well as later in production

Example: Waterfall strategy - step 2
Fine-grained analyses
• e.g. Named Entity Recognition (NER)
• Only performed on documents that passed through the initial step
Having two separate steps
• Reduces cognitive load on annotators
• Allows you to train separate, specialized ML models

Avoid: Creating fully random dataset splits
• Once you have a dataset annotated, create train/dev/test splits
• Don't do this for real-world projects:
(Taken from Karpathy's excellent free tutorial "Neural Networks: Zero to Hero")

Recommendation: Document-level splits
• Keep all examples from the same original document in the same split
• Otherwise, you're "leaking" information from training data to test data
• You'd be artificially boosting your performance numbers
Ex 1: "Mutated cancers constantly displayed strong estrogen receptor expression."
Ex 2: "In all mutated cases strong estrogen receptor expression was demonstrable."
PMID: 28799536

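A quick sanity check for this kind of leakage is to verify that no document ID occurs in more than one split. The sketch below assumes each example is stored as a (doc_id, text) pair and that both example sentences come from the document with the PMID shown on the slide; the helper name is hypothetical.

```python
def leaked_documents(splits):
    """Return the doc IDs that occur in more than one split."""
    first_seen_in = {}
    leaked = set()
    for split_name, examples in splits.items():
        for doc_id, _text in examples:
            previous = first_seen_in.setdefault(doc_id, split_name)
            if previous != split_name:
                leaked.add(doc_id)
    return leaked


ex1 = ("28799536", "Mutated cancers constantly displayed strong estrogen receptor expression.")
ex2 = ("28799536", "In all mutated cases strong estrogen receptor expression was demonstrable.")

# Splitting per example can separate two near-identical sentences...
example_level_split = {"train": [ex1], "test": [ex2]}
# ...whereas a document-level split keeps them together.
document_level_split = {"train": [ex1, ex2], "test": []}

leaks_before = leaked_documents(example_level_split)
leaks_after = leaked_documents(document_level_split)
```

Running such a check whenever the splits are regenerated catches accidental leakage early.
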
Recommendation: Deterministic & repeatable splits
Use deterministic (modulo) hashes of unique doc IDs to determine the split
• Ensure that the same doc is always in the same split
• Even if you extend the dataset or re-run the splitting algorithm
• Prevent "leaking" information from training to eval numbers

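A minimal sketch of such a hash-based assignment is shown below. The choice of MD5, the 100 buckets and the 10%/10% dev/test proportions are assumptions for illustration. Python's built-in `hash()` is avoided because string hashes are randomized per process by default.

```python
import hashlib


def assign_split(doc_id, dev_percent=10, test_percent=10):
    """Assign a document to train/dev/test based on a stable hash of its ID."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < test_percent:
        return "test"
    if bucket < test_percent + dev_percent:
        return "dev"
    return "train"


# Hypothetical document IDs; any unique, stable identifier works.
doc_ids = [f"doc-{i}" for i in range(1000)]
assignment = {doc_id: assign_split(doc_id) for doc_id in doc_ids}
```

Because the split depends only on the document ID, adding new documents later never moves an existing document to a different split, and all examples from one document land in the same split automatically.
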
Recommendations for running an ML project
• Avoid selection bias by formalizing the selection procedure
• Create deterministic, document-level train/dev/test splits
• Carefully design the data model / label scheme
• Write up detailed data guidelines
• Set up a meaningful extrinsic evaluation
• Look at inter-annotator agreement stats and plot a learning curve
• Apply a preliminary model back to the training data
• Manually inspect gold annotations and incorrect predictions
• Make sure you're climbing the right hill
• Data quality should be front and center!