Practical Tips for Bootstrapping Information Extraction Pipelines

Matthew Honnibal, Explosion

SPACY (spacy.io)
Open-source library for industrial-strength natural language processing.
250m+ downloads
ChatGPT can write spaCy code!

PRODIGY (prodigy.ai)
Modern scriptable annotation tool for machine learning developers.
900+ companies, 10k+ users

BACK TO OUR ROOTS
We’re back to running Explosion as a smaller, independent-minded and self-sufficient company.
Consulting, open source, developer tools.
explosion.ai/blog/back-to-our-roots

WHAT I MEAN BY INFORMATION EXTRACTION
📝 Turn text into data. Make a database from earnings reports, or skills in job postings, or product feedback in social media, and many more.
🗂 Lots of subtasks. Text classification, named entity recognition, entity linking and relation extraction can all be part of an information extraction pipeline.
🎯 Mostly static schema. Most people are solving one problem at a time, so that’s what I’ll focus on.

“Hooli raises $5m to revolutionize search, led by ACME Ventures” → Database
named entity recognition: COMPANY, COMPANY
currency normalization: MONEY
entity disambiguation (custom database lookup): 5923214, 1681056
entity relation extraction: INVESTOR
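The steps on this slide can be sketched in a few lines of plain Python. This is only a toy illustration, not the pipeline from the talk: the regular expressions stand in for trained components, and the `COMPANY_IDS` table, its IDs and all function names are hypothetical.

```python
import re

# Hypothetical company table standing in for the "custom database lookup".
# The IDs and their assignment to names are made up for this sketch.
COMPANY_IDS = {"Hooli": 101, "ACME Ventures": 202}

MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def find_companies(text):
    """Toy NER: return known company names in order of appearance."""
    found = [(text.find(name), name) for name in COMPANY_IDS if name in text]
    return [name for _, name in sorted(found)]


def normalize_money(text):
    """Toy currency normalization: '$5m' -> (5000000, 'USD')."""
    match = re.search(r"\$(\d+(?:\.\d+)?)([kmb])\b", text)
    if match is None:
        return None
    amount = float(match.group(1)) * MULTIPLIERS[match.group(2)]
    return int(amount), "USD"


def find_lead_investor(text, companies):
    """Toy relation extraction: the company named right after 'led by'."""
    for name in companies:
        if f"led by {name}" in text:
            return name
    return None


def extract_funding_round(text):
    """Combine the toy steps into one database-ready row."""
    companies = find_companies(text)
    investor = find_lead_investor(text, companies)
    recipients = [name for name in companies if name != investor]
    money = normalize_money(text)
    return {
        "company_id": COMPANY_IDS[recipients[0]] if recipients else None,
        "investor_id": COMPANY_IDS.get(investor),
        "amount": money[0] if money else None,
        "currency": money[1] if money else None,
    }


row = extract_funding_round(
    "Hooli raises $5m to revolutionize search, led by ACME Ventures"
)
print(row)
```

Each function maps to one label on the slide, which is the point: the subtasks are separate, so each can be swapped for a trained model independently.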

RIE: RETRIEVAL VIA INFORMATION EXTRACTION
[Diagram labels: 💬 question, ⚙ text-to-SQL, query, data, 📦 NLP pipeline, 📖 texts]

RAG: RETRIEVAL-AUGMENTED GENERATION
[Diagram labels: 💬 question, ⚙ vectorizer, query, answers, 📚 vector DB, 📖 snippets, ⚙ vectorizer]
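A minimal sketch of the RIE side, assuming the NLP pipeline has already turned texts into rows. The table name, the second row and the `text_to_sql` stub are all hypothetical; a real system would generate the SQL with a model rather than look it up.

```python
import sqlite3

# Rows as an extraction pipeline might produce them (illustrative data only).
extracted_rows = [
    ("Hooli", "ACME Ventures", 5_000_000),
    ("Pied Piper", "ACME Ventures", 1_000_000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funding_rounds (company TEXT, investor TEXT, amount_usd INTEGER)"
)
conn.executemany("INSERT INTO funding_rounds VALUES (?, ?, ?)", extracted_rows)


def text_to_sql(question):
    """Stub for the text-to-SQL step: a canned lookup instead of a model."""
    canned = {
        "How much did ACME Ventures lead in total?": (
            "SELECT SUM(amount_usd) FROM funding_rounds "
            "WHERE investor = 'ACME Ventures'"
        ),
    }
    return canned[question]


def answer(question):
    """Translate the question to SQL and run it against the extracted data."""
    return conn.execute(text_to_sql(question)).fetchone()[0]


total = answer("How much did ACME Ventures lead in total?")
print(total)
```

Because the answer comes from a query over structured rows, aggregate questions like this one are computed exactly rather than pieced together from retrieved snippets.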

TALK OUTLINE 💡
1. Training tips
2. Modelling tips
3. Data annotation tips

SUPERVISED LEARNING IS STILL VERY STRONG
Example data is super powerful.
Example data can do things that instructions can’t.
In-context learning can’t use examples scalably.

KNOW YOUR ENEMIES
What makes supervised learning hard? It’s a chicken-and-egg problem:
product vision 👁
accuracy estimate 📈
training & evaluation 🔮
labelled data 📚
annotation scheme 🏷

RESULTS ARE HARD TO INTERPRET
😬 Model doesn’t train at all. Is the data messed up somehow?
🤨 Model learns barely better than chance. Could be data, hyper-parameters, modelling…
🥹 Results are decent! But can it be better? How do I know if I’m missing out?
🤔 Results are too good to be true. Probably messed up the data…

1. Training ⚗

FORM AND FALSIFY HYPOTHESES
HYPOTHESIS: This is the bit that’s broken.
QUESTION: If this bit is broken, what should I expect to see?
TEST: Is that what actually happens?

“I can’t connect to this site.”
SOLUTION MINDSET: “Maybe it’ll work if I reconnect to the wi-fi or if I restart my router.”
SCIENTIFIC MINDSET: “If the problem is between me and the site, other sites won’t load either. If the problem is between me and the router, I won’t be able to ping it.”

EXAMPLES OF DEBUGGING TRAINING
📉 What happens if I train on a tiny amount of data? Does the model converge?
🔀 What happens if I randomize the training labels? Does the model still learn?
🪄 Are my model weights changing at all during training?
🧮 What’s the mean and variance of my gradients?
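The four checks above can be tried on a toy model. The sketch below uses a hand-written one-feature logistic regression so that it runs with the standard library only; the model, data and names are assumptions for illustration, and a real project would run the same checks through its own training loop.

```python
import math
import random
import statistics


def train(xs, ys, epochs=200, lr=0.5):
    """Fit a one-feature logistic regression with full-batch gradient descent.

    Returns the final (w, b) and the list of per-epoch weight gradients.
    """
    w, b = 0.0, 0.0
    weight_grads = []
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        grad_w /= len(xs)
        grad_b /= len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
        weight_grads.append(grad_w)
    return (w, b), weight_grads


def accuracy(params, xs, ys):
    """Fraction of examples where the sign of the score matches the label."""
    w, b = params
    hits = sum((w * x + b > 0) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)


# Check 1: a tiny, clean dataset should be fit perfectly.
tiny_xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
tiny_ys = [0, 0, 0, 0, 1, 1, 1, 1]
tiny_params, tiny_grads = train(tiny_xs, tiny_ys)
tiny_acc = accuracy(tiny_params, tiny_xs, tiny_ys)

# Check 2: with shuffled labels there is nothing to learn, so training
# accuracy should stay close to chance.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [1 if x > 0 else 0 for x in xs]
shuffled_ys = ys[:]
rng.shuffle(shuffled_ys)
shuffled_params, _ = train(xs, shuffled_ys)
shuffled_acc = accuracy(shuffled_params, xs, shuffled_ys)

# Check 3: the parameters should have moved away from their initial values.
weights_moved = tiny_params != (0.0, 0.0)

# Check 4: summary statistics of the weight gradient over training.
grad_mean = statistics.mean(tiny_grads)
grad_var = statistics.pvariance(tiny_grads)

print(tiny_acc, shuffled_acc, weights_moved, grad_mean, grad_var)
```

Each check is a falsifiable prediction: if the tiny dataset is not fit, or the shuffled labels are, the bug is in the training setup rather than in the amount of data.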

PRIORITIZE ROBUSTNESS, NOT ACCURACY
📈 Better needs to look better. You need it to not be like this: [chart not preserved in transcript]
📦 Larger models are often less practical.
🤏 You need it to work with small samples.
🌪 Large models are less stable with small batch sizes.

2. Modelling 🔮

ITERATE ON YOUR DATA AND SCALE DOWN
PROTOTYPE: 📖 text + 💬 prompt → 🔮 GPT-4 API → task-specific output
github.com/explosion/spacy-llm: prompt model & transform output to structured data
PRODUCTION: 📖 text → 📦 📦 📦 distilled task-specific components → task-specific output
modular
small & fast
data-private

config.cfg ⚙ (spacy.io/usage/large-language-models)
Annotated parts of the config:
component
model and provider
task definition and labels: Named Entity Recognition, Text Classification, Relation Extraction, …
label definitions to use in prompt
Example from case study: explosion.ai/blog/sp-global-commodities
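The config shown on the slide is not in the transcript, so the fragment below is only a sketch of what such a file might look like. The labels and label definitions are hypothetical, `spacy.NER.v3` is an assumed choice of task, and `spacy.GPT-4.v2` is the model name that appears later in the deck. The Python around it uses the standard library's `configparser` purely to confirm the fragment is well-formed.

```python
import configparser

# Hypothetical spacy-llm style config: one LLM component, one NER task,
# label definitions for the prompt, and a model/provider entry.
CONFIG = """
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["COMPANY", "MONEY"]

[components.llm.task.label_definitions]
COMPANY = "A named business, e.g. a startup or an investment firm."
MONEY = "A monetary amount, including the currency symbol."

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"
"""

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep label names such as COMPANY in upper case
parser.read_string(CONFIG)

# Values keep their surrounding quotes in this format, so strip them here.
task = parser["components.llm.task"]["@llm_tasks"].strip('"')
model = parser["components.llm.model"]["@llm_models"].strip('"')
label_definitions = dict(parser["components.llm.task.label_definitions"])

print(task, model, sorted(label_definitions))
```

In a real project the file would be loaded by spaCy's own config system rather than `configparser`. The four blocks correspond to the four callouts on the slide: component, model and provider, task definition and labels, and label definitions.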

3. Data annotation 📒

How much data do you need?

TRAINING (Prodigy)

    =============== Train curve diagnostic ===============
    Training 4 times with 25%, 50%, 75%, 100% of the data

    %      Score     ner
    ----   ------    ------
    0%     0.00      0.00
    25%    0.31 ▲    0.31 ▲
    50%    0.44 ▲    0.44 ▲
    75%    0.43 ▼    0.43 ▼
    100%   0.56 ▲    0.56 ▲

[Chart: accuracy against % of examples, with a projection beyond 100% of the data]

EVALUATION
⚠ You need enough data to avoid reporting meaningless precision.
📊 Ten samples per significant figure is a good rule of thumb.
1,000 samples is pretty good: enough for 94% vs. 95%.
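One way to make the evaluation point concrete is the binomial standard error of an accuracy estimate. This is a standard approximation added here for illustration, not a formula from the talk, and it assumes evaluation examples are independent.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Standard error of an accuracy measured on n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# How precisely is a 95% accuracy known at different evaluation set sizes?
for n in (100, 1000, 10000):
    se = accuracy_standard_error(0.95, n)
    print(f"n={n:>6}: standard error {se:.4f}, 95% interval ±{1.96 * se:.3f}")
```

With 100 samples the standard error at 95% accuracy is about 2.2 points, so a 94% vs. 95% comparison is noise. With 1,000 samples it drops to about 0.7 points, which is where a one-point difference starts to be visible.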

KEEP TASKS SMALL
Humans have a cache, too!

GOOD ✅
    for i in range(rows):
        access_data(array[i])

BAD ❌
    for j in range(columns):
        access_data(array[:, j])

DO THIS ✅
    for annotation_type in annotation_types:
        for example in examples:
            annotate(example, annotation_type)

NOT THIS ❌
    for example in examples:
        for annotation_type in annotation_types:
            annotate(example, annotation_type)
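The two loop orders above can be compared by counting how often the annotator has to switch to a different annotation type. This is a small runnable sketch; the function names and the switch count as a stand-in for "cache misses" are assumptions for illustration.

```python
def type_major_queue(examples, annotation_types):
    """DO THIS: finish one annotation type across all examples first."""
    return [(t, ex) for t in annotation_types for ex in examples]


def example_major_queue(examples, annotation_types):
    """NOT THIS: apply every annotation type to one example at a time."""
    return [(t, ex) for ex in examples for t in annotation_types]


def count_switches(queue):
    """Number of times the annotation type changes between consecutive tasks."""
    return sum(1 for prev, cur in zip(queue, queue[1:]) if prev[0] != cur[0])


examples = ["doc1", "doc2", "doc3"]
annotation_types = ["COMPANY", "MONEY"]

good = count_switches(type_major_queue(examples, annotation_types))
bad = count_switches(example_major_queue(examples, annotation_types))
print(good, bad)
```

With three examples and two types, the type-major order switches once and the example-major order switches five times, and the gap grows with the number of examples.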

USE MODEL ASSISTANCE
🔮 Suggest annotations however you can. Rule-based, initial trained model, an LLM, or a combination of all.
🔥 Suggestions improve efficiency. Common cases are common, so getting them preset speeds up annotation a lot.
📈 Suggestions improve accuracy. You need the common cases to be annotated consistently. Humans suck at this.
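As a sketch of the rule-based option, the snippet below pre-fills span suggestions with regular expressions. The patterns and the `suggest_spans` name are hypothetical, and the output shape (a `text` plus `spans` with `start`, `end` and `label`) is assumed here as a common format for annotation tasks.

```python
import re

# Hypothetical rules: a money pattern and a tiny list of known companies.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?[kmb]?\b"),
    "COMPANY": re.compile(r"\b(?:Hooli|ACME Ventures)\b"),
}


def suggest_spans(text):
    """Return an annotation task with rule-based span suggestions pre-filled."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(
                {"start": match.start(), "end": match.end(), "label": label}
            )
    return {"text": text, "spans": sorted(spans, key=lambda s: s["start"])}


task = suggest_spans(
    "Hooli raises $5m to revolutionize search, led by ACME Ventures"
)
for span in task["spans"]:
    print(span["label"], task["text"][span["start"]:span["end"]])
```

The annotator then only has to accept or correct the suggestions, which is where the consistency on common cases comes from.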

HUMAN IN THE LOOP
explosion.ai/blog/human-in-the-loop-distillation
continuous evaluation baseline
🔮 prompting
transfer learning
📦 distilled model

prodigy.ai/docs/large-language-models

    $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl

ner.llm.correct: recipe function with workflow
todo_eval: dataset to save annotations to
./config.cfg ⚙: [components.llm.model] @llm_models = "spacy.GPT-4.v2"
./examples.jsonl: raw data

    ✨ Starting the web server at localhost:8080 ...
    Open the app and start annotating!
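The raw data file in the command above is JSONL: one JSON object per line. A minimal sketch of writing such a file follows, assuming each record carries its text under a `text` key; the second example sentence and the `write_jsonl` helper are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

texts = [
    "Hooli raises $5m to revolutionize search, led by ACME Ventures",
    "Pied Piper closes a $1m seed round",  # illustrative, not from the talk
]


def write_jsonl(path, texts):
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


# Write to a temporary directory so the sketch leaves no files behind in the
# working directory.
path = Path(tempfile.mkdtemp()) / "examples.jsonl"
write_jsonl(path, texts)

loaded = [json.loads(line) for line in path.read_text(encoding="utf8").splitlines()]
print(loaded)
```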

ANNOTATION STARTS AT HOME
annotation guidelines
annotation meeting
Case study: explosion.ai/blog/guardian

⚗ Form and falsify hypotheses. Prioritize robustness.
🔮 Scale down and iterate. Imagine you’re the model. Finish the pipeline to production.
📒 Be agile and annotate yourself. Keep tasks small. Use model assistance.

THANK YOU!
Explosion: explosion.ai
spaCy: spacy.io
Prodigy: prodigy.ai
Twitter: @honnibal
Mastodon: @[email protected]
Bluesky: @honnibal.bsky.social
LinkedIn

Resources

spaCy: Industrial-Strength NLP

Prodigy: Radically efficient machine teaching

spacy-llm: Integrating LLMs into structured NLP pipelines

A practical guide to human-in-the-loop distillation

How S&P Global is making markets more transparent with NLP, spaCy and Prodigy
