In this presentation, I will build on Ines Montani's keynote, "Applied NLP in the Age of Generative AI", by demonstrating how to create an information extraction pipeline. The talk will focus on using the spaCy NLP library and the Prodigy annotation tool, although the principles discussed will also apply to other frameworks.
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text.
Prodigy is a modern annotation tool for creating training data for machine learning models. It’s so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration.
https://github.com/explosion/spacy-llm
spacy-llm integrates large language models (LLMs) into spaCy pipelines. It features a modular system for fast prototyping and prompting, and for turning unstructured LLM responses into robust outputs for various NLP tasks, with no training data required.
https://explosion.ai/blog/human-in-the-loop-distillation
This blog post presents practical solutions for using state-of-the-art models in real-world applications and distilling their knowledge into smaller, faster components that you can run and maintain in-house.
https://explosion.ai/blog/sp-global-commodities
A case study on S&P Global’s efficient information extraction pipelines for real-time commodities trading insights in a high-security environment using human-in-the-loop distillation.