HU Berlin: Industrial-Strength Natural Language Processing with spaCy and Prodigy

Industrial-Strength Natural Language Processing with spaCy & Prodigy
Ines Montani, Founder & CEO
Dr. Matthew Honnibal, Founder & CTO

spaCy: Open-source library for industrial-strength natural language processing (spacy.io, 480m+ downloads)
Prodigy: Modern scriptable annotation tool for machine learning developers (prodigy.ai, 12,000+ users)

Requirements: “We’re building a crime database based on news reports. We want to extract the following:
• victim name
• perpetrator name
• crime location
• offence date
• arrest date”
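The requirements above amount to a fixed record schema. A minimal sketch of what one database row might look like, assuming one record per news report; the class name `CrimeRecord`, its field names, and the `missing_fields` helper are hypothetical and not taken from the talk:

```python
from dataclasses import asdict, dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CrimeRecord:
    """One database row extracted from a news report (hypothetical schema)."""

    victim_name: Optional[str] = None
    perpetrator_name: Optional[str] = None
    crime_location: Optional[str] = None
    offence_date: Optional[date] = None
    arrest_date: Optional[date] = None

    def missing_fields(self) -> List[str]:
        """Names of the fields the extraction pipeline has not filled yet."""
        return [name for name, value in asdict(self).items() if value is None]


# A report that only yields a victim and a location leaves three fields open.
record = CrimeRecord(victim_name="Alex Smith", crime_location="East London")
print(record.missing_fields())
# → ['perpetrator_name', 'offence_date', 'arrest_date']
```

Keeping every field optional reflects that a single report rarely states all five facts.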

spacy.fyi/displacy

Levels of linguistic annotations (from linguistic descriptions to application-oriented)
Example: Alex Smith was stabbed in East London
• tokens
• spans
• parse trees
• semantic roles: PATIENT, ADJUNCT
• entities: PERSON, LOCATION
• text categories: CRIME
• relations: VICTIM
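To make the levels concrete, here is a minimal sketch that stores several of them for the example sentence in plain Python structures. It does not use spaCy's API; the token offsets, the score of 1.0, and the choice of "stabbed" as the anchor of the VICTIM relation are assumptions made for illustration:

```python
text = "Alex Smith was stabbed in East London"

# Tokens: whitespace splitting is enough for this particular sentence.
tokens = text.split()

# Entities as spans: (start_token, end_token_exclusive, label).
entities = [(0, 2, "PERSON"), (5, 7, "LOCATION")]

# Text categories apply to the whole text rather than to a span.
categories = {"CRIME": 1.0}

# Relations connect a span to another token; here the person span is the
# VICTIM of the event described by token 3 (assumed anchor).
relations = [{"label": "VICTIM", "head": 3, "child": (0, 2)}]


def span_text(start: int, end: int) -> str:
    """Join the tokens covered by a span back into a string."""
    return " ".join(tokens[start:end])


for start, end, label in entities:
    print(label, "->", span_text(start, end))
# → PERSON -> Alex Smith
# → LOCATION -> East London
```

Each level builds on the previous one: spans are defined over tokens, and relations are defined over spans.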

spacy.io/usage
Doc: Doc.ents, Doc.noun_chunks, Doc.spans, Doc.cats, …
Token: Token.pos_, Token.dep_, …

“Hooli raises $5m to revolutionize search, led by ACME Ventures”
Labels: COMPANY, COMPANY, MONEY, INVESTOR
Database: 5923214, 1681056
• named entity recognition
• entity disambiguation
• custom database lookup
• currency normalization
• entity relation extraction
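Of the steps above, currency normalization is the easiest to show in isolation, because it is plain rule-based code that runs on the MONEY span after the model has found it. A minimal sketch, assuming a small hand-written symbol and suffix table; the function name `normalize_money`, the tables, and the mapping of "$" to USD are all hypothetical:

```python
import re
from typing import Optional, Tuple

# Hypothetical lookup tables; a real system would cover more symbols and suffixes.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}
MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}

# Symbol, then a number with optional decimals, then an optional magnitude suffix.
MONEY_PATTERN = re.compile(r"^([$€£])(\d+(?:\.\d+)?)(k|m|bn)?$", re.IGNORECASE)


def normalize_money(span_text: str) -> Optional[Tuple[int, str]]:
    """Turn a MONEY span such as "$5m" into (amount, currency code)."""
    match = MONEY_PATTERN.match(span_text.strip())
    if match is None:
        return None  # leave unparseable spans for manual review
    symbol, number, suffix = match.groups()
    multiplier = MULTIPLIERS[(suffix or "").lower()]
    return round(float(number) * multiplier), CURRENCY_SYMBOLS[symbol]


print(normalize_money("$5m"))
# → (5000000, 'USD')
```

Returning `None` instead of guessing keeps the rule predictable: anything the pattern does not cover stays visible as a gap.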

Case Study: S&P Global
explosion.ai/blog/sp-global-commodities
⚠ data-private ⚠ real-time “heards”: trading activities

Human-in-the-loop distillation
explosion.ai/blog/human-in-the-loop-distillation
• LLM
• continuous evaluation, baseline
• prompting
• transfer learning
• COMPONENT: distilled model
• deploy 🚀
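Continuous evaluation means scoring every candidate, whether an LLM prompt or a distilled component, against the same human-checked annotations. A minimal sketch of such a comparison using the standard F-score over exact-match spans; the annotation tuples below are invented toy data, not results from the talk:

```python
from typing import Set, Tuple

Annotation = Tuple[int, int, str]  # (start_token, end_token_exclusive, label)


def f_score(gold: Set[Annotation], predicted: Set[Annotation]) -> float:
    """Harmonic mean of precision and recall over exact-match annotations."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data: human-corrected annotations plus two hypothetical systems.
gold = {(0, 2, "PERSON"), (5, 7, "LOCATION")}
baseline_predictions = {(0, 2, "PERSON"), (5, 6, "LOCATION")}  # one wrong boundary
candidate_predictions = {(0, 2, "PERSON"), (5, 7, "LOCATION")}

print(f_score(gold, baseline_predictions))
# → 0.5
print(f_score(gold, candidate_predictions))
# → 1.0
```

Because both systems are scored on the same gold set, the baseline number stays comparable as prompts, data, and models change.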

Data refactoring
explosion.ai/blog/sp-global-commodities
• 10× faster
• reduce cognitive load
• 30min per attribute
• GPT-5 API
• 99% F-score
• 6 MB model size
• 16k+ words/second

Factor out business logic
result = business_logic(classification(text))
• classification (MODEL): words, grammar, syntax; information in the text
• business logic: external knowledge; facts that can change over time
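The composition `business_logic(classification(text))` can be sketched as two separate functions. In this minimal sketch the model is replaced by keyword checks so the code runs on its own; the label names, the queue names, and the `REVIEW_QUEUE_BY_ARREST` table are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Classification:
    """What the text itself says, independent of any external facts."""

    is_crime_report: bool
    mentions_arrest: bool


def classification(text: str) -> Classification:
    """Stand-in for a trained model; keyword checks keep the sketch runnable."""
    lowered = text.lower()
    return Classification(
        is_crime_report="stabbed" in lowered,
        mentions_arrest="arrested" in lowered,
    )


# External knowledge that can change over time lives in config, not in the model.
REVIEW_QUEUE_BY_ARREST = {True: "closed-cases", False: "open-cases"}


def business_logic(labels: Classification) -> str:
    """Map text-level labels to a decision using rules that can be edited freely."""
    if not labels.is_crime_report:
        return "ignore"
    return REVIEW_QUEUE_BY_ARREST[labels.mentions_arrest]


result = business_logic(classification("Alex Smith was stabbed in East London"))
print(result)
# → open-cases
```

Renaming a queue or changing a routing rule then only touches `business_logic`; the model's labels, and the data annotated for them, stay valid.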

“At their core, many NLP systems consist of flat classifications. You can shove them into a single prompt, or you can decompose them into smaller pieces. Many classification tasks are straightforward to solve nowadays – but they become vastly more complicated if one model needs to do them all at once.”
explosion.ai/blog/human-in-the-loop-distillation

Intro to Prodigy
prodigy.ai/docs
• Annotate: dataset, recipe, model / tokenizer, input data, labels
• Customize: Python function, web server & annotation UI + database
• Automate: model, API, rules
• Train: recipe, input, output, dataset, evaluation %
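A Prodigy recipe is a Python function that returns the components of an annotation session as a dictionary. The minimal sketch below mimics that shape in plain Python so it runs without Prodigy installed; the `@prodigy.recipe` decorator is omitted, and the recipe name, dataset name, and label set are hypothetical:

```python
import json
from typing import Dict, Iterable, Iterator, List


def read_jsonl_lines(lines: Iterable[str]) -> Iterator[Dict]:
    """Parse one JSON object per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def crime_textcat_recipe(dataset: str, lines: Iterable[str], labels: List[str]) -> Dict:
    """Recipe-style function (hypothetical): describes what to annotate and how."""

    def stream() -> Iterator[Dict]:
        for example in read_jsonl_lines(lines):
            # Offer every label as an option for the annotator to pick from.
            example["options"] = [{"id": label, "text": label} for label in labels]
            yield example

    return {
        "dataset": dataset,   # where the collected annotations are saved
        "stream": stream(),   # examples to show in the annotation UI
        "view_id": "choice",  # annotation interface to render
    }


components = crime_textcat_recipe(
    "crime_news",
    ['{"text": "Alex Smith was stabbed in East London"}'],
    ["CRIME", "OTHER"],
)
first_task = next(components["stream"])
print(first_task["text"], [option["id"] for option in first_task["options"]])
# → Alex Smith was stabbed in East London ['CRIME', 'OTHER']
```

Because the stream is an ordinary generator, a model, an API call, or a rule can be slotted into the loop to pre-fill or filter examples before they reach the annotator.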


Resources

spaCy Demo Notebook

spaCy

Prodigy

How S&P Global is making markets more transparent with NLP, spaCy and Prodigy

A practical guide to human-in-the-loop distillation
