
2025-10-16-pycon-nl

Sofie Van Landeghem
October 16, 2025

Keynote given at PyCon NL 2025: https://pycon-nl.org/talks/2025/14

Transcript

  1. How to not get fooled by your data while AI engineering
    Sofie Van Landeghem, NLP freelancer @ OxyKodit
    PyCon NL 2025
  2. And of course, with a chatbot
    User: Who is currently leading the 2025 F1 championship?
    Chatbot: As of the latest standings after the 2025 Singapore GP, Oscar Piastri is leading the Drivers’ Championship.
    User: Who won in Monaco?
    Chatbot: Lando Norris did.
  3. User: Who crossed the line first in Spa 2024?
    Chatbot: George Russell crossed the finish line first, securing victory for Mercedes.
    User: So Russell won the Spa GP in 2024?
    Chatbot: No, Hamilton did. Russell's car was disqualified after the race for being 1.5 kg underweight.
  4. Obtaining gold-standard data
    Sir Lewis Carl Davidson Hamilton (born 7 January 1985) is a British [[racing driver]] who competes in [[Formula One]] for [[Scuderia Ferrari|Ferrari]].
  5. How most clients / users think about performance
    A scale from 0% through 50% to 100%, running from the worst model ("Random guessing") to the best model ("Entirely correct")
  6. In fact, we had to prune the database for efficiency requirements
    ❖ Kept only 14% of all Wikipedia concepts
    ❖ An "oracle" disambiguation obtains 84% F-score
    So, the modeling work targets a 0% - 84% range
  7. In our case, we won't assign "Hamilton" to a random page out of the 7M available ones
  8. Now, if we pick a random one from this candidate list, we actually obtain 54% F-score (without doing any ML/NLP/AI at all!)
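
The bounds from slides 6 to 8 can be sketched in a few lines. This is a minimal illustration, not the speaker's evaluation code: it uses plain accuracy instead of F-score, and the function names, candidate lists, and gold IDs are all hypothetical.

```python
from fractions import Fraction

def oracle_accuracy(candidate_lists, gold_ids):
    """Upper bound: share of mentions whose gold concept survived pruning."""
    hits = sum(gold in candidates for candidates, gold in zip(candidate_lists, gold_ids))
    return Fraction(hits, len(gold_ids))

def expected_random_accuracy(candidate_lists, gold_ids):
    """Expected score when picking uniformly at random from each candidate list."""
    total = Fraction(0)
    for candidates, gold in zip(candidate_lists, gold_ids):
        if gold in candidates:
            total += Fraction(1, len(candidates))
    return total / len(gold_ids)

# Hypothetical toy data: one candidate list and one gold concept per mention.
candidate_lists = [
    ["Lewis_Hamilton", "Alexander_Hamilton"],
    ["Monaco_GP"],
    ["A", "B", "C", "D"],
]
gold_ids = ["Lewis_Hamilton", "Monaco_GP", "Pruned_concept"]

print(oracle_accuracy(candidate_lists, gold_ids))           # 2/3
print(expected_random_accuracy(candidate_lists, gold_ids))  # 1/2
```

Short candidate lists make the random baseline surprisingly strong, which is why a model score only means something relative to these two numbers.
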
  9. Consider "prior probabilities": a measure of how often a mention is linked to a certain concept

    Textual mention        F1 race driver   Founding father   Musical
    "Hamilton"             35%              55%               10%
    "Lewis Hamilton"       99%              0%                0%
    "Alexander Hamilton"   0%               82%               15%
  10. Which we can extrapolate to determine which one is the "most likely" (without any context)

    Textual mention        F1 race driver   Founding father   Musical
    "Hamilton"             -                ✓                 -
    "Lewis Hamilton"       ✓                -                 -
    "Alexander Hamilton"   -                ✓                 -
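
A context-free prior baseline of this kind can be written as a simple lookup. The sketch below hard-codes the numbers from the table on slide 9; the dictionary layout and the function name are assumptions made for illustration, not the speaker's implementation.

```python
# Prior probabilities copied from the table on slide 9.
PRIORS = {
    "Hamilton": {"F1 race driver": 0.35, "Founding father": 0.55, "Musical": 0.10},
    "Lewis Hamilton": {"F1 race driver": 0.99, "Founding father": 0.00, "Musical": 0.00},
    "Alexander Hamilton": {"F1 race driver": 0.00, "Founding father": 0.82, "Musical": 0.15},
}

def most_likely_concept(mention):
    """Context-free baseline: return the concept with the highest prior, or None."""
    priors = PRIORS.get(mention)
    if not priors:
        return None  # unknown mention: the baseline makes no prediction
    return max(priors, key=priors.get)

for mention in PRIORS:
    print(mention, "->", most_likely_concept(mention))
```

Note that the baseline never looks at the sentence around the mention, so a bare "Hamilton" in an F1 article would still be linked to the founding father.
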
  11. The ML model that is supposed to disambiguate based on context (79%) only marginally improves upon a relatively simple baseline (78.2%)
  12. Let's revisit our upper bound (again). Is it really 100% if we have no memory / efficiency requirements?
    A scale from 0% to 100%, with the best model ("Entirely correct") at 100%
  13. When multiple annotators label the same data sample, how often do they agree?
    ↪ Inter-annotator agreement = IAA
  14. High IAA: annotators mostly agree → the gold data is reliable and robust
    Low IAA: annotators disagree because…
    ★ The data is confusing
    ★ The label scheme is ambiguous
    ★ The NLP task is too complex
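
The slides do not say which IAA metric was used. As one common choice for two annotators, Cohen's kappa corrects the raw agreement rate for agreement expected by chance. The sketch below is a plain-Python version under that assumption; the label lists are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of samples with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: both annotators pick the same label independently.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical annotations of four articles.
annotator_a = ["F1", "F1", "Football", "US politics"]
annotator_b = ["F1", "Football", "Football", "US politics"]

print(round(cohen_kappa(annotator_a, annotator_b), 3))  # 0.636
```

Here the annotators agree on 3 of 4 samples (75%), but kappa is lower because some of that agreement would occur by chance.
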
  15. We want to label incoming articles for our portal
    1. Sports
       a. Football
       b. F1
    2. Politics
       a. Global politics
       b. US politics
    3. Leisure
       a. Traveling
       b. Sports
       c. Literature
  16. Let's plot IAA of the labels…

                      Football   F1     Global politics   US Politics   Traveling   Sports   Literature
    Football          0.79
    F1                0          0.73
    Global politics   0          0      0.95
    US Politics       0          0.12   0.05              0.83
    Traveling         0          0      0                 0             1.0
    Sports            0.21       0.15   0                 0             0           0.64
    Literature        0          0      0                 0             0           0        1.0
  17. Confusion between global and US politics?
    (Same agreement matrix as on slide 16. Global politics vs. US Politics: 0.05)
  18. Confusion between politics and F1?
    (Same agreement matrix as on slide 16. F1 vs. US Politics: 0.12)
  19. Solution: allow multiple correct labels
    Label scheme: 1. Sports (a. Football, b. F1); 2. Politics (a. Global politics, b. US politics); 3. Leisure (a. Traveling, b. Sports, c. Literature)
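
Allowing multiple correct labels changes how a prediction is scored: it counts as correct if it matches any of the accepted gold labels. A minimal sketch of such a lenient accuracy, with a hypothetical function name and made-up articles:

```python
def lenient_accuracy(predictions, gold_label_sets):
    """Share of predictions that match any of the accepted gold labels."""
    if len(predictions) != len(gold_label_sets) or not predictions:
        raise ValueError("need one non-empty set of gold labels per prediction")
    hits = sum(pred in gold for pred, gold in zip(predictions, gold_label_sets))
    return hits / len(predictions)

# Hypothetical example: the first article is about a politician visiting a race,
# so both "F1" and "US politics" are accepted as correct.
predictions = ["F1", "US politics", "Traveling"]
gold_label_sets = [{"F1", "US politics"}, {"US politics"}, {"Literature"}]

print(lenient_accuracy(predictions, gold_label_sets))  # 2 of 3 predictions accepted
```
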
  20. Confusion within the sports category?
    (Same agreement matrix as on slide 16. Sports vs. Football: 0.21; Sports vs. F1: 0.15)
  21. Solution: critically revise the label scheme given to you!
    Label scheme: 1. Sports (a. Football, b. F1); 2. Politics (a. Global politics, b. US politics); 3. Leisure (a. Traveling, b. Sports, c. Literature)
    Inherent ambiguity in the label scheme needs to be addressed
  22. Prediction ≠ Gold
    Is this really an incorrect prediction (a "false positive")? Or is the label wrongly annotated? Perhaps both answers can be seen as "correct"?
  23. Manually annotated training dataset:
    "... to kick off 2025 [Dutch Grand Prix] weekend in Zandvoort"
    "Gasly laments ‘quite sad’ [Monaco] GP crash"
    Sample prediction:
    "Gasly laments ‘quite sad’ [Monaco GP] crash"
  24. Always include the words "Grand Prix" or "GP" in the entity annotation:
    ❌ Gasly laments ‘quite sad’ [Monaco] GP crash
    ✅ Gasly laments ‘quite sad’ [Monaco GP] crash
    Write up annotation guidelines to help with consistency of annotations
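
A guideline like this one can also be checked automatically. The sketch below assumes annotations are written with square brackets, as on the slides, and flags any entity that is directly followed by "GP" or "Grand Prix" outside the brackets. The pattern and the function name are illustrative, not part of the talk.

```python
import re

# An annotated span in [brackets], directly followed by "GP" or "Grand Prix".
TRAILING_GP = re.compile(r"\[([^\]]+)\]\s+(GP|Grand Prix)\b")

def find_guideline_violations(annotated_sentences):
    """Return (sentence, entity, trailing words) for each span that omits GP/Grand Prix."""
    violations = []
    for sentence in annotated_sentences:
        for match in TRAILING_GP.finditer(sentence):
            violations.append((sentence, match.group(1), match.group(2)))
    return violations

# Sentences taken from the slides, in bracket notation.
sentences = [
    "... to kick off 2025 [Dutch Grand Prix] weekend in Zandvoort",
    "Gasly laments 'quite sad' [Monaco] GP crash",
    "Gasly laments 'quite sad' [Monaco GP] crash",
]

for sentence, entity, trailing in find_guideline_violations(sentences):
    print(f"Check span [{entity}]: followed by '{trailing}' in: {sentence}")
```

Only the second sentence is flagged, which matches the ❌ example above.
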
  25. Is your data correct, consistent and robust?
    Is your test data representative of real-world data?
    Does your evaluation capture the value of the algorithm in production?
  26. Only after defining the right hill to climb can you get to the fun stuff:
    ★ Algorithm development
    ★ Model training
    ★ LLM fine-tuning
    ★ …
  27. AI project: data annotation
    ✓ Ensure your label scheme is consistent and unambiguous
    ✓ Draft clear annotation guidelines to ensure data consistency
    ✓ Measure inter-annotator agreement (IAA)
    ✓ Consider reframing your task/guidelines if the IAA is low
    ✓ Model uncertainty in your annotation workflow
    📝 1/3
  28. AI project: performance evaluation
    ✓ Develop simple baselines to put performance into perspective
    ✓ Quantify realistic upper/lower performance bounds
    ✓ Measure performance as part of the larger business process
    📝 2/3
  29. AI project: performance evaluation
    ✓ Identify structural data errors by "predicting" the training data
    ✓ Apply to truly unseen data to measure realistic performance
    ✓ Make sure you’re climbing the right hill
    📝 3/3
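
"Predicting" the training data can be sketched as follows: run the trained model over its own training examples and group the disagreements by (gold, predicted) pair, since recurring pairs often point at inconsistent annotations rather than model errors. The `predict` callable, the toy model, and the examples are hypothetical stand-ins for a real pipeline.

```python
from collections import Counter

def training_disagreements(examples, predict):
    """Return (text, gold, predicted) for each training example the model gets 'wrong'."""
    disagreements = []
    for text, gold in examples:
        predicted = predict(text)
        if predicted != gold:
            disagreements.append((text, gold, predicted))
    return disagreements

def disagreement_patterns(disagreements):
    """Count how often each (gold, predicted) pair occurs."""
    return Counter((gold, predicted) for _, gold, predicted in disagreements)

# Hypothetical training examples: (text, annotated entity span).
examples = [
    ("Monaco GP crash", "Monaco GP"),
    ("Monaco GP win", "Monaco"),  # inconsistent with the guideline
    ("Dutch Grand Prix weekend", "Dutch Grand Prix"),
]

def toy_predict(text):
    """Stand-in for a trained model that always includes GP / Grand Prix."""
    return "Monaco GP" if text.startswith("Monaco GP") else "Dutch Grand Prix"

found = training_disagreements(examples, toy_predict)
for pair, count in disagreement_patterns(found).most_common():
    print(pair, count)
```

Each flagged example still needs a manual look to decide whether the model or the annotation is wrong.
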