Slide 1

Slide 1 text

PRACTICAL TIPS FOR BOOTSTRAPPING INFORMATION EXTRACTION PIPELINES Matthew Honnibal Explosion 🤠 You Developer GPT-4 API

Slide 2

Slide 2 text

Open-source library for industrial-strength natural language processing spacy.io SPACY 250m+ downloads

Slide 3

Slide 3 text

Open-source library for industrial-strength natural language processing spacy.io SPACY 250m+ downloads ChatGPT can write spaCy code!

Slide 4

Slide 4 text

900+ companies 10k+ users Modern scriptable annotation tool for machine learning developers prodigy.ai PRODIGY

Slide 5

Slide 5 text

900+ companies 10k+ users Alex Smith Developer Kim Miller Analyst GPT-4 API Modern scriptable annotation tool for machine learning developers prodigy.ai PRODIGY

Slide 6

Slide 6 text

We're back to running Explosion as a smaller, independent-minded and self-sufficient company. explosion.ai/blog/back-to-our-roots BACK TO OUR ROOTS

Slide 7

Slide 7 text

We're back to running Explosion as a smaller, independent-minded and self-sufficient company. explosion.ai/blog/back-to-our-roots Consulting open source developer tools BACK TO OUR ROOTS

Slide 8

Slide 8 text

WHAT I MEAN BY INFORMATION EXTRACTION

Slide 9

Slide 9 text

WHAT I MEAN BY INFORMATION EXTRACTION 📝 Turn text into data. Make a database from earnings reports, or skills in job postings, or product feedback in social media – many more.

Slide 10

Slide 10 text

WHAT I MEAN BY INFORMATION EXTRACTION 📝 Turn text into data. Make a database from earnings reports, or skills in job postings, or product feedback in social media – many more. 🗂 Lots of subtasks. Text classification, named entity recognition, entity linking, relation extraction can all be part of an information extraction pipeline.

Slide 11

Slide 11 text

WHAT I MEAN BY INFORMATION EXTRACTION 📝 Turn text into data. Make a database from earnings reports, or skills in job postings, or product feedback in social media – many more. 🗂 Lots of subtasks. Text classification, named entity recognition, entity linking, relation extraction can all be part of an information extraction pipeline. 🎯 Mostly static schema. Most people are solving one problem at a time, so that's what I'll focus on.

Slide 12

Slide 12 text

Database “Hooli raises $5m to revolutionize search, led by ACME Ventures”

Slide 13

Slide 13 text

COMPANY COMPANY named entity recognition Database “Hooli raises $5m to revolutionize search, led by ACME Ventures”

Slide 14

Slide 14 text

COMPANY COMPANY named entity recognition MONEY currency normalization Database “Hooli raises $5m to revolutionize search, led by ACME Ventures”

Slide 15

Slide 15 text

COMPANY COMPANY named entity recognition MONEY currency normalization 5923214 1681056 custom database lookup entity disambiguation Database “Hooli raises $5m to revolutionize search, led by ACME Ventures”

Slide 16

Slide 16 text

COMPANY COMPANY named entity recognition MONEY currency normalization INVESTOR entity relation extraction 5923214 1681056 custom database lookup entity disambiguation Database “Hooli raises $5m to revolutionize search, led by ACME Ventures”
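As a rough illustration of the first step in this pipeline, here is a minimal spaCy sketch, assuming the stock en_core_web_sm model is installed; it only gives generic spans and labels, whereas the slide's COMPANY/INVESTOR labels, currency normalization, entity disambiguation, and relation extraction would come from custom components layered on top.

import spacy

# assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hooli raises $5m to revolutionize search, led by ACME Ventures")
for ent in doc.ents:
    # stock pipelines use generic labels such as ORG and MONEY;
    # a custom pipeline would use COMPANY, INVESTOR, etc.
    print(ent.text, ent.label_)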

Slide 17

Slide 17 text

💬 question ⚙ text-to-SQL query data 📦 NLP pipeline 📖 texts + RIE: RETRIEVAL VIA INFORMATION EXTRACTION

Slide 18

Slide 18 text

💬 question ⚙ text-to-SQL query data 📦 NLP pipeline 📖 texts + RIE: RETRIEVAL VIA INFORMATION EXTRACTION RAG: RETRIEVAL-AUGMENTED GENERATION 💬 question ⚙ vectorizer query answers 📚 vector DB 📖 snippets + ⚙ vectorizer

Slide 19

Slide 19 text

TALK OUTLINE 💡

Slide 20

Slide 20 text

TALK OUTLINE 💡 1. Training tips

Slide 21

Slide 21 text

TALK OUTLINE 💡 1. Training tips 2. Modelling tips

Slide 22

Slide 22 text

TALK OUTLINE 💡 1. Training tips 2. Modelling tips 3. Data annotation tips

Slide 23

Slide 23 text

SUPERVISED LEARNING IS STILL VERY STRONG Example data is super powerful.

Slide 24

Slide 24 text

SUPERVISED LEARNING IS STILL VERY STRONG Example data is super powerful. Example data can do things that instructions can't.

Slide 25

Slide 25 text

SUPERVISED LEARNING IS STILL VERY STRONG Example data is super powerful. Example data can do things that instructions can't. In-context learning can't use examples scalably.

Slide 26

Slide 26 text

KNOW YOUR ENEMIES What makes supervised learning hard?

Slide 27

Slide 27 text

product vision 👁 chicken-and-egg problem KNOW YOUR ENEMIES What makes supervised learning hard?

Slide 28

Slide 28 text

product vision 👁 chicken-and-egg problem KNOW YOUR ENEMIES What makes supervised learning hard? accuracy estimate 📈

Slide 29

Slide 29 text

product vision 👁 chicken-and-egg problem KNOW YOUR ENEMIES What makes supervised learning hard? accuracy estimate 📈 training & evaluation 🔮

Slide 30

Slide 30 text

product vision 👁 chicken-and-egg problem KNOW YOUR ENEMIES What makes supervised learning hard? accuracy estimate 📈 training & evaluation 🔮 labelled data 📚

Slide 31

Slide 31 text

product vision 👁 chicken-and-egg problem KNOW YOUR ENEMIES What makes supervised learning hard? accuracy estimate 📈 training & evaluation 🔮 labelled data 📚 annotation scheme 🏷

Slide 32

Slide 32 text

RESULTS ARE HARD TO INTERPRET

Slide 33

Slide 33 text

RESULTS ARE HARD TO INTERPRET 😬 Model doesn't train at all. Is the data messed up somehow?

Slide 34

Slide 34 text

RESULTS ARE HARD TO INTERPRET 😬 Model doesn't train at all. Is the data messed up somehow? 🤨 Model learns barely better than chance. Could be data, hyper-parameters, modelling…

Slide 35

Slide 35 text

RESULTS ARE HARD TO INTERPRET 😬 Model doesn't train at all. Is the data messed up somehow? 🤨 Model learns barely better than chance. Could be data, hyper-parameters, modelling… 🥹 Results are decent! But can it be better? How do I know if I'm missing out?

Slide 36

Slide 36 text

RESULTS ARE HARD TO INTERPRET 😬 Model doesn't train at all. Is the data messed up somehow? 🤨 Model learns barely better than chance. Could be data, hyper-parameters, modelling… 🥹 Results are decent! But can it be better? How do I know if I'm missing out? 🤔 Results are too good to be true. Probably messed up the data…

Slide 37

Slide 37 text

Training ⚗ 1

Slide 38

Slide 38 text

FORM AND FALSIFY HYPOTHESES

Slide 39

Slide 39 text

This is the bit that's broken. HYPOTHESIS

Slide 40

Slide 40 text

This is the bit that's broken. HYPOTHESIS If this bit is broken, what should I expect to see? QUESTION

Slide 41

Slide 41 text

This is the bit that's broken. HYPOTHESIS If this bit is broken, what should I expect to see? QUESTION Is that what actually happens? TEST

Slide 42

Slide 42 text

This is the bit that's broken. HYPOTHESIS If this bit is broken, what should I expect to see? QUESTION Is that what actually happens? TEST “I can't connect to this site.”

Slide 43

Slide 43 text

This is the bit that's broken. HYPOTHESIS If this bit is broken, what should I expect to see? QUESTION Is that what actually happens? TEST “Maybe it'll work if I reconnect to the wi-fi or if I restart my router.” SOLUTION MINDSET “I can't connect to this site.”

Slide 44

Slide 44 text

This is the bit that's broken. HYPOTHESIS If this bit is broken, what should I expect to see? QUESTION Is that what actually happens? TEST “Maybe it'll work if I reconnect to the wi-fi or if I restart my router.” SOLUTION MINDSET SCIENTIFIC MINDSET “If the problem is between me and the site, other sites won't load either. If the problem is between me and the router, I won't be able to ping it.” “I can't connect to this site.”

Slide 45

Slide 45 text

EXAMPLES OF DEBUGGING TRAINING

Slide 46

Slide 46 text

EXAMPLES OF DEBUGGING TRAINING 📉 What happens if I train on a tiny amount of data? Does the model converge?

Slide 47

Slide 47 text

EXAMPLES OF DEBUGGING TRAINING 📉 What happens if I train on a tiny amount of data? Does the model converge? 🔀 What happens if I randomize the training labels? Does the model still learn?

Slide 48

Slide 48 text

EXAMPLES OF DEBUGGING TRAINING 📉 What happens if I train on a tiny amount of data? Does the model converge? 🔀 What happens if I randomize the training labels? Does the model still learn? 🪄 Are my model weights changing at all during training?

Slide 49

Slide 49 text

EXAMPLES OF DEBUGGING TRAINING 📉 What happens if I train on a tiny amount of data? Does the model converge? 🔀 What happens if I randomize the training labels? Does the model still learn? 🪄 Are my model weights changing at all during training? 🧮 What's the mean and variance of my gradients?
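A minimal sketch of the last two checks, assuming a PyTorch model (the same idea works in any framework): snapshot the parameters before training, then compare them and inspect gradient statistics after a few updates.

import torch

def snapshot(model):
    # clone parameters before training so we can check they actually change
    return {name: p.detach().clone() for name, p in model.named_parameters()}

def weights_changed(model, before):
    return any(not torch.equal(p, before[name]) for name, p in model.named_parameters())

def gradient_stats(model):
    # call after loss.backward(); near-zero mean and variance can point to vanishing gradients
    grads = [p.grad.flatten() for p in model.parameters() if p.grad is not None]
    flat = torch.cat(grads)
    return flat.mean().item(), flat.var().item()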

Slide 50

Slide 50 text

PRIORITIZE ROBUSTNESS NOT ACCURACY

Slide 51

Slide 51 text

📈 Better needs to look better. You need it to not be like this:

Slide 52

Slide 52 text

📈 Better needs to look better. You need it to not be like this: 📦 Larger models are often less practical.

Slide 53

Slide 53 text

📈 Better needs to look better. You need it to not be like this: 📦 Larger models are often less practical. 🤏 You need it to work with small samples.

Slide 54

Slide 54 text

📈 Better needs to look better. You need it to not be like this: 📦 Larger models are often less practical. 🤏 You need it to work with small samples. 🌪 Large models are less stable with small batch sizes.

Slide 55

Slide 55 text

🔮 2 Modelling

Slide 56

Slide 56 text

ITERATE ON YOUR DATA AND SCALE DOWN

Slide 57

Slide 57 text

task-specific output 💬 prompt 📖 text 🔮 PROTOTYPE GPT-4 API

Slide 58

Slide 58 text

task-specific output 💬 prompt 📖 text 🔮 PROTOTYPE github.com/explosion/spacy-llm prompt model & transform output to structured data GPT-4 API

Slide 59

Slide 59 text

task-specific output 💬 prompt 📖 text 🔮 PROTOTYPE github.com/explosion/spacy-llm prompt model & transform output to structured data GPT-4 API 📖 text task-specific output PRODUCTION

Slide 60

Slide 60 text

distilled task-specific components 📦 📦 📦 task-specific output 💬 prompt 📖 text 🔮 PROTOTYPE github.com/explosion/spacy-llm prompt model & transform output to structured data GPT-4 API 📖 text task-specific output PRODUCTION

Slide 61

Slide 61 text

distilled task-specific components 📦 📦 📦 task-specific output 💬 prompt 📖 text 🔮 PROTOTYPE github.com/explosion/spacy-llm prompt model & transform output to structured data GPT-4 API 📖 text task-specific output PRODUCTION modular

Slide 62

Slide 62 text

distilled task-specific components 📦 📦 📦 task-specific output 💬 prompt 📖 text 🔮 PROTOTYPE github.com/explosion/spacy-llm prompt model & transform output to structured data GPT-4 API 📖 text task-specific output PRODUCTION modular small & fast

Slide 63

Slide 63 text

distilled task-specific components 📦 📦 📦 task-specific output 💬 prompt 📖 text 🔮 PROTOTYPE github.com/explosion/spacy-llm prompt model & transform output to structured data GPT-4 API 📖 text task-specific output PRODUCTION modular small & fast data-private
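The PRODUCTION side of this diagram is an ordinary spaCy pipeline trained on the annotations collected with the LLM's help; as a hedged sketch of that step, using placeholder paths and a regular training config (distinct from the LLM config shown next):

$ python -m spacy train training.cfg --output ./model --paths.train ./train.spacy --paths.dev ./dev.spacy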

Slide 64

Slide 64 text

config.cfg spacy.io/usage/large-language-models ⚙

Slide 65

Slide 65 text

config.cfg spacy.io/usage/large-language-models component ⚙

Slide 66

Slide 66 text

config.cfg spacy.io/usage/large-language-models model and provider component ⚙

Slide 67

Slide 67 text

config.cfg spacy.io/usage/large-language-models model and provider task definition and labels Named Entity Recognition, Text Classification, Relation Extraction, … component ⚙

Slide 68

Slide 68 text

config.cfg spacy.io/usage/large-language-models label definitions to use in prompt model and provider task definition and labels Named Entity Recognition, Text Classification, Relation Extraction, … component ⚙

Slide 69

Slide 69 text

config.cfg spacy.io/usage/large-language-models label definitions to use in prompt model and provider task definition and labels Named Entity Recognition, Text Classification, Relation Extraction, … component ⚙ example from case study explosion.ai/blog/sp-global-commodities
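A hedged sketch of what such a config.cfg can look like for the funding example: the label names and definitions here are invented, and the task and model entries follow the spacy-llm documentation (an NER task backed by GPT-4), not the exact config from the case study.

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["COMPANY", "INVESTOR", "MONEY"]

[components.llm.task.label_definitions]
COMPANY = "A company that is raising or spending money."
INVESTOR = "A company or fund participating in an investment round."
MONEY = "A monetary amount such as '$5m'."

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"

The assembled pipeline then behaves like any other spaCy pipeline, e.g. via spacy-llm's assemble helper (an API key for the provider is expected in the environment):

from spacy_llm.util import assemble

nlp = assemble("config.cfg")
doc = nlp("Hooli raises $5m to revolutionize search, led by ACME Ventures")
print([(ent.text, ent.label_) for ent in doc.ents])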

Slide 70

Slide 70 text

Data annotation 📒 3

Slide 71

Slide 71 text

How much data do you need?

Slide 72

Slide 72 text

TRAINING =============== Train curve diagnostic =============== Training 4 times with 25%, 50%, 75%, 100% of the data % Score ner ---- ------ ------ 0% 0.00 0.00 25% 0.31 ▲ 0.31 ▲ 50% 0.44 ▲ 0.44 ▲ 75% 0.43 ▼ 0.43 ▼ 100% 0.56 ▲ 0.56 ▲ Prodigy How much data do you need?

Slide 73

Slide 73 text

TRAINING =============== Train curve diagnostic =============== Training 4 times with 25%, 50%, 75%, 100% of the data % Score ner ---- ------ ------ 0% 0.00 0.00 25% 0.31 ▲ 0.31 ▲ 50% 0.44 ▲ 0.44 ▲ 75% 0.43 ▼ 0.43 ▼ 100% 0.56 ▲ 0.56 ▲ Prodigy How much data do you need? (chart: accuracy vs. % of examples, with projection)
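The output above comes from Prodigy's train-curve recipe, which retrains on growing slices of the annotations to show whether more data is still helping; roughly, the invocation looks like this (the dataset name is a placeholder):

$ prodigy train-curve --ner funding_ner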

Slide 74

Slide 74 text

TRAINING =============== Train curve diagnostic =============== Training 4 times with 25%, 50%, 75%, 100% of the data % Score ner ---- ------ ------ 0% 0.00 0.00 25% 0.31 ▲ 0.31 ▲ 50% 0.44 ▲ 0.44 ▲ 75% 0.43 ▼ 0.43 ▼ 100% 0.56 ▲ 0.56 ▲ Prodigy How much data do you need? (chart: accuracy vs. % of examples, with projection) EVALUATION ⚠ You need enough data to avoid reporting meaningless precision.

Slide 75

Slide 75 text

TRAINING =============== Train curve diagnostic =============== Training 4 times with 25%, 50%, 75%, 100% of the data % Score ner ---- ------ ------ 0% 0.00 0.00 25% 0.31 ▲ 0.31 ▲ 50% 0.44 ▲ 0.44 ▲ 75% 0.43 ▼ 0.43 ▼ 100% 0.56 ▲ 0.56 ▲ Prodigy How much data do you need? (chart: accuracy vs. % of examples, with projection) EVALUATION ⚠ You need enough data to avoid reporting meaningless precision. 📊 Ten samples per significant figure is a good rule of thumb.

Slide 76

Slide 76 text

TRAINING =============== Train curve diagnostic =============== Training 4 times with 25%, 50%, 75%, 100% of the data % Score ner ---- ------ ------ 0% 0.00 0.00 25% 0.31 ▲ 0.31 ▲ 50% 0.44 ▲ 0.44 ▲ 75% 0.43 ▼ 0.43 ▼ 100% 0.56 ▲ 0.56 ▲ Prodigy How much data do you need? (chart: accuracy vs. % of examples, with projection) EVALUATION ⚠ You need enough data to avoid reporting meaningless precision. 📊 Ten samples per significant figure is a good rule of thumb. 1,000 samples is pretty good – enough for 94% vs. 95%.
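A quick back-of-the-envelope check of that claim: the standard error of an accuracy estimate p measured on n evaluation examples is roughly sqrt(p(1-p)/n), so at around 1,000 examples the error drops to about 0.7 percentage points, which is what makes a 94% vs. 95% comparison start to mean something.

from math import sqrt

p = 0.95  # assumed true accuracy
for n in (100, 1_000, 10_000):
    # standard error of the accuracy estimate on n evaluation examples
    print(n, f"± {sqrt(p * (1 - p) / n):.2%}")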

Slide 77

Slide 77 text

KEEP TASKS SMALL

Slide 78

Slide 78 text

KEEP TASKS SMALL GOOD for i in range(rows): access_data(array[i]) ✅ BAD for j in range(columns): access_data(array[:, j]) ❌

Slide 79

Slide 79 text

KEEP TASKS SMALL Humans have a cache, too! GOOD for i in range(rows): access_data(array[i]) ✅ BAD for j in range(columns): access_data(array[:, j]) ❌

Slide 80

Slide 80 text

KEEP TASKS SMALL Humans have a cache, too! GOOD for i in range(rows): access_data(array[i]) ✅ BAD for j in range(columns): access_data(array[:, j]) ❌ DO THIS for annotation_type in annotation_types: for example in examples: annotate(example, annotation_type) ✅ NOT THIS for example in examples: for annotation_type in annotation_types: annotate(example, annotation_type) ❌

Slide 81

Slide 81 text

USE MODEL ASSISTANCE

Slide 82

Slide 82 text

USE MODEL ASSISTANCE 🔮 Suggest annotations however you can. Rule-based, initial trained model, an LLM – or a combination of all.

Slide 83

Slide 83 text

USE MODEL ASSISTANCE 🔮 Suggest annotations however you can. Rule-based, initial trained model, an LLM – or a combination of all. Suggestions improve efficiency. Common cases are common, so getting them preset speeds up annotation a lot. 🔥

Slide 84

Slide 84 text

USE MODEL ASSISTANCE 🔮 Suggest annotations however you can. Rule-based, initial trained model, an LLM – or a combination of all. Suggestions improve efficiency. Common cases are common, so getting them preset speeds up annotation a lot. 🔥 Suggestions improve accuracy. You need the common cases to be annotated consistently. Humans suck at this. 📈
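For the rule-based flavour of this, a minimal sketch using spaCy's entity_ruler; the patterns are invented for the funding example, and similar match patterns can also be used to pre-highlight spans during annotation.

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "COMPANY", "pattern": "Hooli"},  # exact phrase match
    {"label": "INVESTOR", "pattern": [{"LOWER": "acme"}, {"LOWER": "ventures"}]},  # token-level match
])
doc = nlp("Hooli raises $5m to revolutionize search, led by ACME Ventures")
print([(ent.text, ent.label_) for ent in doc.ents])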

Slide 85

Slide 85 text

🔮 explosion.ai/blog/human-in-the-loop-distillation HUMAN IN THE LOOP

Slide 86

Slide 86 text

🔮 explosion.ai/blog/human-in-the-loop-distillation continuous evaluation baseline HUMAN IN THE LOOP

Slide 87

Slide 87 text

🔮 explosion.ai/blog/human-in-the-loop-distillation continuous evaluation baseline prompting HUMAN IN THE LOOP

Slide 88

Slide 88 text

🔮 explosion.ai/blog/human-in-the-loop-distillation continuous evaluation baseline prompting HUMAN IN THE LOOP

Slide 89

Slide 89 text

🔮 explosion.ai/blog/human-in-the-loop-distillation continuous evaluation baseline prompting transfer learning 📦 HUMAN IN THE LOOP

Slide 90

Slide 90 text

🔮 explosion.ai/blog/human-in-the-loop-distillation continuous evaluation baseline prompting transfer learning 📦 distilled model HUMAN IN THE LOOP

Slide 91

Slide 91 text

prodigy.ai/docs/large-language-models $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl ⚙

Slide 92

Slide 92 text

prodigy.ai/docs/large-language-models $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl recipe function with workflow ⚙

Slide 93

Slide 93 text

prodigy.ai/docs/large-language-models $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl dataset to save annotations to recipe function with workflow ⚙

Slide 94

Slide 94 text

prodigy.ai/docs/large-language-models $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl dataset to save annotations to recipe function with workflow [components.llm.model] @llm_models = "spacy.GPT-4.v2" ⚙

Slide 95

Slide 95 text

prodigy.ai/docs/large-language-models $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl dataset to save annotations to recipe function with workflow raw data [components.llm.model] @llm_models = "spacy.GPT-4.v2" ⚙

Slide 96

Slide 96 text

✨ Starting the web server at localhost:8080 ... Open the app and start annotating! GPT-4 API prodigy.ai/docs/large-language-models $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl dataset to save annotations to recipe function with workflow raw data [components.llm.model] @llm_models = "spacy.GPT-4.v2" ⚙

Slide 97

Slide 97 text

✨ Starting the web server at localhost:8080 ... Open the app and start annotating! GPT-4 API prodigy.ai/docs/large-language-models $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl dataset to save annotations to recipe function with workflow raw data 🤠 You Developer [components.llm.model] @llm_models = "spacy.GPT-4.v2" ⚙

Slide 98

Slide 98 text

explosion.ai/blog/guardian case study ANNOTATION STARTS AT HOME

Slide 99

Slide 99 text

explosion.ai/blog/guardian case study annotation guidelines ANNOTATION STARTS AT HOME

Slide 100

Slide 100 text

explosion.ai/blog/guardian case study annotation guidelines annotation meeting ANNOTATION STARTS AT HOME

Slide 101

Slide 101 text

📒 🔮 ⚗

Slide 102

Slide 102 text

📒 🔮 Form and falsify hypotheses. ⚗

Slide 103

Slide 103 text

📒 🔮 Form and falsify hypotheses. ⚗ Prioritize robustness.

Slide 104

Slide 104 text

📒 🔮 Form and falsify hypotheses. ⚗ Prioritize robustness. Scale down and iterate.

Slide 105

Slide 105 text

📒 🔮 Form and falsify hypotheses. ⚗ Prioritize robustness. Scale down and iterate. Imagine you're the model.

Slide 106

Slide 106 text

📒 🔮 Form and falsify hypotheses. ⚗ Prioritize robustness. Scale down and iterate. Imagine you're the model. Finish the pipeline to production.

Slide 107

Slide 107 text

📒 🔮 Form and falsify hypotheses. ⚗ Prioritize robustness. Scale down and iterate. Imagine you're the model. Finish the pipeline to production. Be agile and annotate yourself.

Slide 108

Slide 108 text

📒 🔮 Form and falsify hypotheses. ⚗ Prioritize robustness. Scale down and iterate. Imagine you're the model. Finish the pipeline to production. Be agile and annotate yourself. Keep tasks small.

Slide 109

Slide 109 text

📒 🔮 Form and falsify hypotheses. ⚗ Prioritize robustness. Scale down and iterate. Imagine you're the model. Finish the pipeline to production. Be agile and annotate yourself. Keep tasks small. Use model assistance.

Slide 110

Slide 110 text

LinkedIn Explosion spaCy Prodigy Twitter Mastodon Bluesky explosion.ai spacy.io prodigy.ai @honnibal @honnibal@sigmoid.social @honnibal.bsky.social THANK YOU!