2025-10-16-pycon-nl

How to not get fooled by your data while AI engineering
Sofie Van Landeghem, NLP freelancer @ OxyKodit
PyCon NL 2025

Create a portal with all Formula One news

And of course, with a chatbot
Q: Who is currently leading the 2025 F1 championship?
A: As of the latest standings after the 2025 Singapore GP, Oscar Piastri is leading the Drivers’ Championship.
Q: Who won in Monaco?
A: Lando Norris did.

Q: Who crossed the line first in Spa 2024?
A: George Russell crossed the finish line first, securing victory for Mercedes.
Q: So Russell won the Spa GP in 2024?
A: No, Hamilton did. Russell's car was disqualified after the race for being 1.5 kg underweight.

Wait, what?

Always have a domain expert on board who can double-check the data and the results

Our chatbot should be able to link to Wikipedia

"Hamilton won the Belgian Grand Prix in Spa in 2024."

Obtaining gold-standard data
Sir Lewis Carl Davidson Hamilton (born 7 January 1985) is a British [[racing driver]] who competes in [[Formula One]] for [[Scuderia Ferrari|Ferrari]].

Building an Entity Linker model

✦ 79% F-score ✦

79% is pretty good, right? Right?

How most clients / users think about performance
[Scale: 0% · 50% · 100%, labelled "The worst model", "Random guessing" and "The best model" ("Entirely correct")]

… but a random baseline is almost never 50%

For our entity linking task, the random baseline is actually 1/7,075,098 ≈ 0.000014% (a.k.a. 0%)
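
The arithmetic on this slide can be checked in a few lines. This is a minimal sketch, assuming exactly one correct concept per mention and a uniform random pick; the function name is made up for illustration.

```python
def random_baseline_accuracy(num_candidates: int) -> float:
    """Expected accuracy of picking uniformly at random among num_candidates options."""
    if num_candidates <= 0:
        raise ValueError("need at least one candidate")
    return 1.0 / num_candidates


# Number of concepts quoted on the slide.
NUM_CONCEPTS = 7_075_098

baseline = random_baseline_accuracy(NUM_CONCEPTS)
print(f"{baseline * 100:.6f}%")  # the baseline expressed as a percentage
```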

… and our upper bound wasn't 100% either …

In fact, we had to prune the database to meet efficiency requirements
❖ Kept only 14% of all Wikipedia concepts
❖ An "oracle" disambiguation obtains 84% F-score
So, the modeling work targets a 0% - 84% range
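
The talk does not spell out how the oracle score was computed. One possible definition, sketched here under that assumption, is an oracle that is always right when the gold concept survived the pruning and abstains otherwise. The IDs below are made up.

```python
def oracle_f_score(gold_ids, kb_ids):
    """F-score of an oracle that is correct whenever the gold concept is in the
    (pruned) knowledge base, and makes no prediction otherwise."""
    kb = set(kb_ids)
    true_positives = sum(1 for gold in gold_ids if gold in kb)
    if true_positives == 0:
        return 0.0
    precision = 1.0  # the oracle never makes a wrong prediction
    recall = true_positives / len(gold_ids)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: four gold links, one of which was pruned away.
gold = ["Q1", "Q2", "Q3", "Q4"]
pruned_kb = {"Q1", "Q2", "Q3"}

upper_bound = oracle_f_score(gold, pruned_kb)
print(f"Oracle F-score: {upper_bound:.3f}")
```

Whatever the exact definition, the point is that no model can score above this number on the pruned database.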

Ok, so within a range of 0-84%, 79% is really
good, right? Right?

Develop some simple baselines to better understand the complexity and
data challenges of your project

In our case, we won't assign "Hamilton" to a random
page out of the 7M available ones

Instead, we obtain a list of "Hamilton" candidates through lexical
search

Now, if we pick a random one from this candidate
list, we actually obtain 54% F-score (without doing any ML/NLP/AI at all!)
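
A rough sketch of this "random candidate" baseline: instead of sampling, it computes the expected accuracy directly. The alias table and examples are hypothetical stand-ins for a real lexical search.

```python
# Hypothetical alias table: mention text -> candidate pages found by lexical search.
CANDIDATES = {
    "Hamilton": ["Lewis Hamilton", "Alexander Hamilton", "Hamilton (musical)"],
    "Lewis Hamilton": ["Lewis Hamilton"],
    "Spa": ["Circuit de Spa-Francorchamps", "Spa, Belgium"],
}


def expected_random_accuracy(examples, candidates):
    """Expected accuracy when picking uniformly at random from each mention's candidates."""
    if not examples:
        raise ValueError("need at least one example")
    total = 0.0
    for mention, gold in examples:
        options = candidates.get(mention, [])
        if gold in options:
            total += 1.0 / len(options)  # chance of hitting the gold page
    return total / len(examples)


# Hypothetical (mention, gold page) pairs.
examples = [
    ("Hamilton", "Lewis Hamilton"),
    ("Lewis Hamilton", "Lewis Hamilton"),
    ("Spa", "Circuit de Spa-Francorchamps"),
]

expected = expected_random_accuracy(examples, CANDIDATES)
print(f"Expected accuracy of a random candidate: {expected:.3f}")
```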

Let's create an even stronger baseline.

Consider "prior probabilities": a measure of how often a mention is linked to a certain concept

Textual mention       | F1 race driver | Founding father | Musical
"Hamilton"            | 35%            | 55%             | 10%
"Lewis Hamilton"      | 99%            | 0%              | 0%
"Alexander Hamilton"  | 0%             | 82%             | 15%
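
Such priors can be estimated by counting links in an annotated corpus. A minimal sketch, assuming the corpus is available as (mention, concept) pairs; the pairs below are invented, not the counts behind the slide.

```python
from collections import Counter, defaultdict


def estimate_priors(links):
    """Turn (mention, concept) pairs into per-mention relative frequencies."""
    counts = defaultdict(Counter)
    for mention, concept in links:
        counts[mention][concept] += 1
    priors = {}
    for mention, concept_counts in counts.items():
        total = sum(concept_counts.values())
        priors[mention] = {c: n / total for c, n in concept_counts.items()}
    return priors


# Hypothetical link data.
links = (
    [("Hamilton", "Alexander Hamilton")] * 3
    + [("Hamilton", "Lewis Hamilton")]
    + [("Lewis Hamilton", "Lewis Hamilton")] * 2
)

priors = estimate_priors(links)
print(priors["Hamilton"])
```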

From these priors, we can determine which concept is the "most likely" for each mention (without any context)

Textual mention       | F1 race driver | Founding father | Musical
"Hamilton"            |                | ✓               |
"Lewis Hamilton"      | ✓              |                 |
"Alexander Hamilton"  |                | ✓               |

This "most likely" baseline obtains 78.2% F-score (still without any
AI at all!)
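
The "most likely" baseline is just an argmax over the priors. A small sketch using the percentages from the prior-probability slide; the function name is illustrative.

```python
# Prior probabilities as shown on the slide.
PRIORS = {
    "Hamilton": {"F1 race driver": 0.35, "Founding father": 0.55, "Musical": 0.10},
    "Lewis Hamilton": {"F1 race driver": 0.99, "Founding father": 0.0, "Musical": 0.0},
    "Alexander Hamilton": {"F1 race driver": 0.0, "Founding father": 0.82, "Musical": 0.15},
}


def most_likely_concept(mention, priors):
    """Pick the concept with the highest prior, ignoring all context."""
    options = priors.get(mention)
    if not options:
        return None  # unknown mention: no prediction
    return max(options, key=options.get)


for mention in PRIORS:
    print(mention, "->", most_likely_concept(mention, PRIORS))
```

Note that this baseline links a bare "Hamilton" to the founding father, even in an F1 article.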

The ML model that is supposed to disambiguate based on
context (79%) only marginally improves upon a relatively simple baseline (78.2%)
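
One way to put the numbers side by side is to ask how much of the gap between the baseline and the oracle the model actually closes. This "headroom" framing is an assumption of this sketch, not a metric from the talk; the three scores are the ones quoted on the slides.

```python
def headroom_closed(model, baseline, upper_bound):
    """Share of the baseline-to-upper-bound gap that the model closes."""
    if upper_bound <= baseline:
        raise ValueError("upper bound must exceed the baseline")
    return (model - baseline) / (upper_bound - baseline)


MODEL, BASELINE, ORACLE = 79.0, 78.2, 84.0  # F-scores in %

share = headroom_closed(MODEL, BASELINE, ORACLE)
print(f"Absolute gain: {MODEL - BASELINE:.1f} points")
print(f"Share of remaining headroom closed: {share:.0%}")
```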

Developing a few simple baselines puts your AI's performance measures
in the right perspective

Let's revisit our upper bound (again). Is it really 100% if we have no memory / efficiency requirements?
[Scale: 0% to 100%, with 100% labelled "The best model" ("Entirely correct")]

100% precision means every disambiguation is "correct". But what does
"correctness" even mean?


"Societies in the ancient civilizations of Greece and Rome preferred small families"

When multiple annotators label the same data sample, how often do they agree?
↪ Inter-annotator agreement = IAA
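
The talk does not say which agreement measure was used. Cohen's kappa for two annotators is one common choice and is sketched here in plain Python; the annotations are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labelled at random
    # according to their own label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six articles.
annotator_a = ["F1", "F1", "Football", "Sports", "F1", "Football"]
annotator_b = ["F1", "Sports", "Football", "Sports", "F1", "Football"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```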

High IAA: annotators mostly agree → the gold data is reliable and robust
Low IAA: annotators disagree because…
★ The data is confusing
★ The label scheme is ambiguous
★ The NLP task is too complex

We want to label incoming articles for our portal
1. Sports
   a. Football
   b. F1
2. Politics
   a. Global politics
   b. US politics
3. Leisure
   a. Traveling
   b. Sports
   c. Literature

Let's plot IAA of the labels…

                 Football  F1    Global politics  US Politics  Traveling  Sports  Literature
Football         0.79
F1               0         0.73
Global politics  0         0     0.95
US Politics      0         0.12  0.05             0.83
Traveling        0         0     0                0            1.0
Sports           0.21      0.15  0                0            0          0.64
Literature       0         0     0                0            0          0       1.0
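
A matrix like this can be built by counting which label pairs two annotators assign to the same article. The sketch below produces raw, unordered pair counts only; how the values on the slide were normalised is not stated, and the annotations are hypothetical.

```python
from collections import Counter


def label_pair_counts(labels_a, labels_b):
    """Count unordered (label, label) pairs assigned to the same item by two annotators."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


# Hypothetical annotations of six articles.
annotator_a = ["F1", "F1", "Football", "Sports", "F1", "Football"]
annotator_b = ["F1", "Sports", "Football", "Sports", "F1", "Football"]

pairs = label_pair_counts(annotator_a, annotator_b)
for (label_1, label_2), count in sorted(pairs.items()):
    marker = "" if label_1 == label_2 else "  <- disagreement"
    print(f"{label_1} / {label_2}: {count}{marker}")
```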

Confusion between global and US politics?
[Same IAA matrix; US Politics / Global politics cell: 0.05]

Consider merging labels if humans can't reliably distinguish them (or
when the distinction doesn't matter)
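
A quick way to test a merge is to map the labels and recompute agreement. A minimal sketch using raw observed agreement; the merged label name "Politics" and the annotations are assumptions for illustration.

```python
# Hypothetical merge: collapse both politics labels into one.
MERGE_MAP = {"Global politics": "Politics", "US politics": "Politics"}


def merge_labels(labels, merge_map):
    """Replace each label by its merged label, leaving other labels untouched."""
    return [merge_map.get(label, label) for label in labels]


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which both annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


annotator_a = ["Global politics", "US politics", "F1", "US politics"]
annotator_b = ["US politics", "US politics", "F1", "Global politics"]

before = observed_agreement(annotator_a, annotator_b)
after = observed_agreement(
    merge_labels(annotator_a, MERGE_MAP), merge_labels(annotator_b, MERGE_MAP)
)
print(f"Agreement before merging: {before:.2f}, after: {after:.2f}")
```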

Confusion between politics and F1?
[Same IAA matrix; US Politics / F1 cell: 0.12]

Solution: allow multiple correct labels
1. Sports: a. Football, b. F1
2. Politics: a. Global politics, b. US politics
3. Leisure: a. Traveling, b. Sports, c. Literature
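
With multiple correct labels, evaluation changes as well: a prediction counts as correct when it is in the set of accepted labels. A sketch for the single-prediction case, with made-up data; a full multi-label setup would compare sets of predictions instead.

```python
def accuracy_with_multiple_gold(predictions, gold_sets):
    """Accuracy where each item has a set of acceptable gold labels."""
    if len(predictions) != len(gold_sets) or not predictions:
        raise ValueError("need non-empty inputs of equal length")
    hits = sum(pred in gold for pred, gold in zip(predictions, gold_sets))
    return hits / len(predictions)


# Hypothetical articles: the first one is about both F1 and US politics.
predictions = ["F1", "US politics", "Football"]
gold_sets = [{"F1", "US politics"}, {"US politics"}, {"F1"}]

accuracy = accuracy_with_multiple_gold(predictions, gold_sets)
print(f"Accuracy: {accuracy:.2f}")
```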

Reframe the task to match your data

Confusion within the sports category?
[Same IAA matrix; Sports / Football cell: 0.21, Sports / F1 cell: 0.15]

1. Sports: a. Football, b. F1
2. Politics: a. Global politics, b. US politics
3. Leisure: a. Traveling, b. Sports, c. Literature
Solution: critically revise the label scheme given to you!
Inherent ambiguity in the label scheme needs to be addressed

Together with your domain experts, critically revise the label scheme

What if we don't have multiple annotators?

Analyse discrepancies between a model's prediction and the gold standard
label
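
In practice this starts with a plain list of every example where prediction and gold differ, ready for manual review. A minimal sketch; the record layout and the example data are hypothetical.

```python
def find_discrepancies(texts, gold_labels, predicted_labels):
    """Collect every example where the prediction differs from the gold label."""
    return [
        {"text": text, "gold": gold, "predicted": predicted}
        for text, gold, predicted in zip(texts, gold_labels, predicted_labels)
        if gold != predicted
    ]


# Hypothetical data: sentences with the gold and predicted link for "Hamilton".
texts = [
    "Hamilton won the Belgian Grand Prix in Spa in 2024.",
    "Hamilton premiered on Broadway in 2015.",
]
gold_labels = ["Lewis Hamilton", "Hamilton (musical)"]
predicted_labels = ["Lewis Hamilton", "Alexander Hamilton"]

to_review = find_discrepancies(texts, gold_labels, predicted_labels)
for record in to_review:
    print(record)
```

Each record then gets a human verdict: model error, annotation error, or both answers acceptable.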

"Agnes Maria of Andechs-Merania (died 1201) was a Queen of France."
[Figure: gold annotation vs. prediction]

Prediction ≠ Gold
Is this really an incorrect prediction (a "false positive")?
Or is the label wrongly annotated?
Perhaps both answers can be seen as "correct"?

Depending on the downstream use-case, the precision of this method
was 87-96%

Analyse "wrong" predictions on the training dataset, to find structural
data errors
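
Structural errors tend to show up as the same mismatch repeated many times. A rough sketch that counts recurring (gold, predicted) pairs on the training set; the threshold and the records are assumptions for illustration.

```python
from collections import Counter


def recurring_mismatches(records, min_count=2):
    """Return (gold, predicted) mismatches seen at least min_count times, most frequent first."""
    counts = Counter((gold, pred) for gold, pred in records if gold != pred)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Hypothetical (gold span, predicted span) pairs from "predicting" the training data.
records = [
    ("Monaco", "Monaco GP"),
    ("Monaco", "Monaco GP"),
    ("Dutch Grand Prix", "Dutch Grand Prix"),
    ("Spa", "Spa GP"),
]

suspicious = recurring_mismatches(records)
print(suspicious)
```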

Manually annotated training dataset:
"... to kick off 2025 [Dutch Grand Prix] weekend in Zandvoort"
"Gasly laments ‘quite sad’ [Monaco] GP crash"
Sample prediction:
"Gasly laments ‘quite sad’ [Monaco GP] crash"

Always include the words "Grand Prix" or "GP" in the entity annotation:
❌ Gasly laments ‘quite sad’ [Monaco] GP crash
✅ Gasly laments ‘quite sad’ [Monaco GP] crash
Write up annotation guidelines to help with consistency of annotations
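
A guideline like this can also be checked automatically. The sketch below flags annotated spans that stop right before "Grand Prix" or "GP"; the character-offset span format is an assumption, not the annotation format used in the talk.

```python
import re

# Matches "Grand Prix" or "GP" directly after a span (optional whitespace first).
TRAILING_GP = re.compile(r"^\s*(Grand Prix|GP)\b")


def spans_missing_gp(text, spans):
    """Return the (start, end) spans that are immediately followed by 'Grand Prix' or 'GP'."""
    return [(start, end) for start, end in spans if TRAILING_GP.match(text[end:])]


text = "Gasly laments 'quite sad' Monaco GP crash"
start = text.index("Monaco")
bad_span = (start, start + len("Monaco"))      # [Monaco] GP
good_span = (start, start + len("Monaco GP"))  # [Monaco GP]

print(spans_missing_gp(text, [bad_span]))   # the bad span is flagged
print(spans_missing_gp(text, [good_span]))  # nothing is flagged
```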

"Correctness" may also depend on the downstream usage of your
results


Make sure you climb the right hill

Is your data correct, consistent and robust?
Is your test data representative of real-world data?
Does your evaluation capture the value of the algorithm in production?

Only after defining the right hill to climb can you get to the fun stuff:
★ Algorithm development
★ Model training
★ LLM fine-tuning
★ …

Let's wrap up with a practical checklist 📝

AI project: data annotation 📝 1/3
✓ Ensure your label scheme is consistent and unambiguous
✓ Draft clear annotation guidelines to ensure data consistency
✓ Measure inter-annotator agreement (IAA)
✓ Consider reframing your task/guidelines if the IAA is low
✓ Model uncertainty in your annotation workflow

AI project: performance evaluation 📝 2/3
✓ Develop simple baselines to put performance into perspective
✓ Quantify realistic upper/lower performance bounds
✓ Measure performance as part of the larger business process

AI project: performance evaluation 📝 3/3
✓ Identify structural data errors by "predicting" the training data
✓ Apply to truly unseen data to measure realistic performance
✓ Make sure you’re climbing the right hill

Main Take Away
Hamilton won the Belgian 2024 Grand Prix, not Russell ;-)

Thank you!
→ GitHub: svlandeg
→ LinkedIn: https://www.linkedin.com/in/sofievanlandeghem
→ NLP consultancy: https://oxykodit.com
