
HU Berlin: Industrial-Strength Natural Language Processing with spaCy and Prodigy

Guest lecture for "Deep Learning and Natural Language Processing" at Humboldt University Berlin.

Ines Montani

December 02, 2025

Resources

spaCy Demo Notebook

https://colab.research.google.com/drive/1STomYwvU2Jbn1kLbwLw1YxUjZo8U9gx4

Jupyter Notebook with an overview of spaCy's linguistic features and usage examples.

spaCy

https://spacy.io

Open-source library for industrial-strength Natural Language Processing and building fast and efficient information extraction pipelines using custom models, LLMs and rule-based approaches.

Prodigy

https://prodi.gy

A modern and powerful annotation and model improvement tool used for human-in-the-loop training, rapid iteration, and custom NLP workflows.

How S&P Global is making markets more transparent with NLP, spaCy and Prodigy

https://explosion.ai/blog/sp-global-commodities

A case study on S&P Global’s efficient information extraction pipelines for real-time commodities trading insights in a high-security environment using human-in-the-loop distillation.

A practical guide to human-in-the-loop distillation

https://explosion.ai/blog/human-in-the-loop-distillation

This blog post presents practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.

Transcript

  1. spaCy (spacy.io): Open-source library for industrial-strength natural language processing. 480m+ downloads.

    Prodigy (prodigy.ai): Modern scriptable annotation tool for machine learning developers. 12,000+ users.
  2. “Requirements: We’re building a crime database based on news reports. We want to extract the following:

    • victim name • perpetrator name • crime location • offence date • arrest date”
  3. Levels of linguistic annotations (between linguistic descriptions and application-oriented): tokens, spans, parse trees, semantic roles (PATIENT, ADJUNCT)

    Example: “Alex Smith was stabbed in East London”
  4. Levels of linguistic annotations (between linguistic descriptions and application-oriented): tokens, spans, parse trees, semantic roles (PATIENT, ADJUNCT), entities (PERSON, LOCATION)

    Example: “Alex Smith was stabbed in East London”
  5. Levels of linguistic annotations (between linguistic descriptions and application-oriented): tokens, spans, parse trees, semantic roles (PATIENT, ADJUNCT), entities (PERSON, LOCATION), text categories (CRIME)

    Example: “Alex Smith was stabbed in East London”
  6. Levels of linguistic annotations (between linguistic descriptions and application-oriented): tokens, spans, parse trees, semantic roles (PATIENT, ADJUNCT), entities (PERSON, LOCATION), text categories (CRIME), relations (VICTIM)

    Example: “Alex Smith was stabbed in East London”
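The annotation layers on slides 3 to 6 can be pictured as plain data attached to the example sentence. The following is a minimal sketch using only the Python standard library, not spaCy's own data structures. The `Span` and `CrimeRecord` classes and the `to_record` helper are hypothetical names introduced for illustration, the annotations are written by hand rather than predicted by a model, and treating any LOCATION entity as the crime location is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


@dataclass
class CrimeRecord:
    # Fields follow the requirements listed on slide 2.
    victim_name: Optional[str] = None
    perpetrator_name: Optional[str] = None
    crime_location: Optional[str] = None
    offence_date: Optional[str] = None
    arrest_date: Optional[str] = None


TEXT = "Alex Smith was stabbed in East London"


def find_span(text: str, phrase: str, label: str) -> Span:
    """Build a labelled span from the first occurrence of a phrase."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


# Hand-written annotations standing in for model predictions.
entities = [
    find_span(TEXT, "Alex Smith", "PERSON"),
    find_span(TEXT, "East London", "LOCATION"),
]
categories = {"CRIME": 1.0}            # text-level category with a score
relations = [("VICTIM", entities[0])]  # relation label attached to a span


def to_record(text, entities, categories, relations) -> CrimeRecord:
    """Map the annotation layers onto the application-level record."""
    record = CrimeRecord()
    if categories.get("CRIME", 0.0) < 0.5:
        return record  # not a crime report, leave every field empty
    for relation, span in relations:
        if relation == "VICTIM":
            record.victim_name = text[span.start:span.end]
    for span in entities:
        if span.label == "LOCATION":
            record.crime_location = text[span.start:span.end]
    return record


record = to_record(TEXT, entities, categories, relations)
print(record)
```

Fields the sentence says nothing about, such as the perpetrator name or the arrest date, stay `None` rather than being guessed.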
  7. “Hooli raises $5m to revolutionize search, led by ACME Ventures”

    Labels: COMPANY, COMPANY, MONEY, INVESTOR. Database: 5923214, 1681056.
  8. “Hooli raises $5m to revolutionize search, led by ACME Ventures”

    Labels: COMPANY, COMPANY, MONEY, INVESTOR. Database: 5923214, 1681056. Steps: named entity recognition.
  9. “Hooli raises $5m to revolutionize search, led by ACME Ventures”

    Labels: COMPANY, COMPANY, MONEY, INVESTOR. Database: 5923214, 1681056. Steps: named entity recognition, entity disambiguation.
  10. “Hooli raises $5m to revolutionize search, led by ACME Ventures”

    Labels: COMPANY, COMPANY, MONEY, INVESTOR. Database: 5923214, 1681056. Steps: named entity recognition, entity disambiguation, custom database lookup.
  11. “Hooli raises $5m to revolutionize search, led by ACME Ventures”

    Labels: COMPANY, COMPANY, MONEY, INVESTOR. Database: 5923214, 1681056. Steps: named entity recognition, entity disambiguation, custom database lookup, currency normalization.
  12. “Hooli raises $5m to revolutionize search, led by ACME Ventures”

    Labels: COMPANY, COMPANY, MONEY, INVESTOR. Database: 5923214, 1681056. Steps: named entity recognition, entity disambiguation, custom database lookup, currency normalization, entity relation extraction.
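Two of the steps named on slides 7 to 12, the custom database lookup and the currency normalization, are ordinary deterministic code rather than model predictions. Here is a minimal sketch, assuming the entity mentions have already been found by an earlier step. The `KNOWLEDGE_BASE` dict, `normalize_money` and `link_entity` are hypothetical names. Which of the two IDs on the slide belongs to which company is an assumption, and so is reading "$" as US dollars.

```python
import re

# Assumption: the slide shows both IDs but not which company each belongs to.
KNOWLEDGE_BASE = {"hooli": 5923214, "acme ventures": 1681056}

MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
CURRENCIES = {"$": "USD", "€": "EUR", "£": "GBP"}  # "$" as USD is an assumption
MONEY_RE = re.compile(r"([$€£])(\d+(?:\.\d+)?)([kmb]?)\b", re.IGNORECASE)


def normalize_money(text):
    """Return (amount, currency code) for the first money mention, or None."""
    match = MONEY_RE.search(text)
    if match is None:
        return None
    symbol, number, suffix = match.groups()
    amount = round(float(number) * MULTIPLIERS[suffix.lower()])
    return amount, CURRENCIES[symbol]


def link_entity(name):
    """Look up a mention in the database, ignoring case and outer whitespace."""
    return KNOWLEDGE_BASE.get(name.strip().lower())


headline = "Hooli raises $5m to revolutionize search, led by ACME Ventures"
# Hand-supplied mentions standing in for named entity recognition output.
mentions = {"COMPANY": "Hooli", "INVESTOR": "ACME Ventures"}

amount, currency = normalize_money(headline)
row = {
    "company_id": link_entity(mentions["COMPANY"]),
    "investor_id": link_entity(mentions["INVESTOR"]),
    "amount": amount,
    "currency": currency,
}
print(row)
```

Keeping these steps as separate functions means each one can be tested and updated on its own, without retraining anything.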
  13. Data refactoring (explosion.ai/blog/sp-global-commodities)

    10× faster, reduce cognitive load, 30min per attribute, GPT-5 API, 99% F-score, 6 MB model size, 16k+ words/second
  14. Factor out business logic. MODEL: information in the text (words, grammar, syntax).

    result = business_logic(classification(text))
  15. Factor out business logic. MODEL: information in the text (words, grammar, syntax). External knowledge: facts that can change over time.

    result = business_logic(classification(text))
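The composition `result = business_logic(classification(text))` from slides 14 and 15 can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: `classification` is a keyword-based stand-in for a trained model, and `KNOWN_LOCATIONS`, `PRIORITY_AREAS` and the three outcome strings are hypothetical.

```python
KNOWN_LOCATIONS = ("East London", "Berlin")  # hypothetical stand-in for entity recognition
CRIME_WORDS = ("stabbed", "robbed", "arrested")

# External knowledge: a fact that can change over time, kept out of the model.
PRIORITY_AREAS = {"East London"}


def classification(text):
    """Stand-in for a model: uses only information that is in the text."""
    lowered = text.lower()
    is_crime = any(word in lowered for word in CRIME_WORDS)
    location = next((loc for loc in KNOWN_LOCATIONS if loc.lower() in lowered), None)
    return {"category": "CRIME" if is_crime else "OTHER", "location": location}


def business_logic(prediction, priority_areas=None):
    """Combine the prediction with external knowledge to decide what to do."""
    areas = PRIORITY_AREAS if priority_areas is None else priority_areas
    if prediction["category"] != "CRIME":
        return "ignore"
    if prediction["location"] in areas:
        return "alert"
    return "store"


text = "Alex Smith was stabbed in East London"
result = business_logic(classification(text))
print(result)
```

If the priority areas change, only the external data passed to `business_logic` changes. The classification step stays the same and does not need retraining.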
  16. “At their core, many NLP systems consist of flat classifications. You can shove them into a single prompt, or you can decompose them into smaller pieces. Many classification tasks are straightforward to solve nowadays – but they become vastly more complicated if one model needs to do them all at once.”

    explosion.ai/blog/human-in-the-loop-distillation
  17. Intro to Prodigy (prodigy.ai/docs)

    Recipe: dataset, model / tokenizer, input data, labels. Customize: Python function. Annotate.
  18. Intro to Prodigy (prodigy.ai/docs)

    Recipe: dataset, model / tokenizer, input data, labels. Customize: Python function. Annotate. Automate: model, API, rules.
  19. Intro to Prodigy (prodigy.ai/docs)

    Recipe: dataset, model / tokenizer, input data, labels. Customize: Python function. Annotate: web server & annotation UI + database. Automate: model, API, rules.
  20. Intro to Prodigy (prodigy.ai/docs)

    Recipe: dataset, model / tokenizer, input data, labels. Customize: Python function. Annotate: web server & annotation UI + database. Automate: model, API, rules. Train.
  21. Intro to Prodigy (prodigy.ai/docs)

    Recipe: dataset, model / tokenizer, input data, labels. Customize: Python function. Annotate: web server & annotation UI + database. Automate: model, API, rules. Train: recipe.
  22. Intro to Prodigy (prodigy.ai/docs)

    Recipe: dataset, model / tokenizer, input data, labels. Customize: Python function. Annotate: web server & annotation UI + database. Automate: model, API, rules. Train: recipe, output.
  23. Intro to Prodigy (prodigy.ai/docs)

    Recipe: dataset, model / tokenizer, input data, labels. Customize: Python function. Annotate: web server & annotation UI + database. Automate: model, API, rules. Train: dataset, recipe, output.
  24. Intro to Prodigy (prodigy.ai/docs)

    Recipe: dataset, model / tokenizer, input data, labels. Customize: Python function. Annotate: web server & annotation UI + database. Automate: model, API, rules. Train: dataset, recipe, output, evaluation %.
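The recipe idea on slides 17 to 24 (a Python function that takes a dataset name, input data and labels, and returns what the annotation app needs) can be sketched without Prodigy installed. The block below only mimics the shape of a recipe: a plain function returning a dictionary with `dataset`, `stream` and `view_id` keys. In Prodigy itself the function would be registered with the `prodigy.recipe` decorator, which is left out here so the sketch runs on the standard library alone. The names `crime_recipe`, `matches_keywords` and `accuracy`, the keyword rule standing in for the "rules" automation, and the sample answers are all hypothetical.

```python
import json


def load_jsonl_lines(lines):
    """Parse an iterable of JSON lines into dicts, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def matches_keywords(text, keywords=("stabbed", "arrested")):
    """Rule-based stand-in for the automation step (model, API or rules)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def crime_recipe(dataset, lines, label="CRIME"):
    """Recipe-shaped function: returns the components for an annotation session."""

    def stream():
        # Only queue examples the rule suggests, each with the label to confirm.
        for example in load_jsonl_lines(lines):
            if matches_keywords(example["text"]):
                yield {"text": example["text"], "label": label}

    return {"dataset": dataset, "stream": stream(), "view_id": "classification"}


def accuracy(answers, predict):
    """Share of accept/reject answers on which the predictor agrees."""
    decided = [a for a in answers if a["answer"] in ("accept", "reject")]
    if not decided:
        return 0.0
    correct = sum(predict(a["text"]) == (a["answer"] == "accept") for a in decided)
    return correct / len(decided)


lines = [
    '{"text": "Alex Smith was stabbed in East London"}',
    '{"text": "Hooli raises $5m to revolutionize search, led by ACME Ventures"}',
]
components = crime_recipe("crime_news", lines)
tasks = list(components["stream"])

# Hypothetical annotator answers used as a small held-out evaluation set.
answers = [
    {"text": "Alex Smith was stabbed in East London", "answer": "accept"},
    {"text": "Hooli raises $5m to revolutionize search", "answer": "reject"},
    {"text": "A man was robbed near the station", "answer": "accept"},
    {"text": "Unclear headline", "answer": "ignore"},
]
score = accuracy(answers, matches_keywords)
print(tasks, f"{score:.0%}")
```

The evaluation figure is plain accuracy against the annotators' accept and reject decisions. It corresponds only loosely to the "evaluation %" on slide 24, which comes from training a model on the collected dataset.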