The AI Revolution Will Not Be Monopolized: Behind the scenes

A more in-depth look at the concepts and ideas behind my talk "The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs", including academic literature, related experiments and preliminary results for distilled task-specific models.

Ines Montani

April 21, 2024

Resources

The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs

https://speakerdeck.com/inesmontani/the-ai-revolution-will-not-be-monopolized-how-open-source-beats-economies-of-scale-even-for-llms

Slides for PyCon Lithuania keynote

Transcript

  1. BEHIND THE SCENES. Ines Montani, Explosion
  2. PRODIGY: modern scriptable annotation tool for machine learning developers.
     prodigy.ai. 9k+ users, 800+ companies. Alex Smith, Developer; Kim Miller, Analyst.
  3.–6. UNDERSTANDING NLP TASKS. Generative tasks produce human-readable output:
     📖 single/multi-doc summarization, 🧮 reasoning, ✅ problem solving,
     ✍ paraphrasing, 🖼 style transfer, ⁉ question answering. Predictive tasks
     produce machine-readable output: 🔖 entity recognition, 🔗 relation extraction,
     👫 coreference resolution, 🧬 grammar & morphology, 🎯 semantic parsing,
     💬 discourse structure, 📚 text classification.
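     To make the human-readable vs. machine-readable distinction concrete, here is a
     minimal sketch: the predictive side uses a standard spaCy pipeline (assuming the
     en_core_web_sm package is installed), while the generative side is only hinted at
     via a hypothetical generate() helper, since any LLM API would do here.

     import spacy

     nlp = spacy.load("en_core_web_sm")  # assumption: small English pipeline installed
     doc = nlp("Apple is opening a new office in Vilnius in 2025.")

     # Predictive output: machine-readable spans you can filter, count and index
     print([(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents])

     # Generative output: free text meant for a human reader, e.g. a summary
     # summary = generate("Summarize: Apple is opening a new office in Vilnius in 2025.")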
  7.–10. PREDICTIVE QUADRANT, by objective and available data:
     generic objective, negligible task data → zero/few-shot in-context learning;
     generic objective, task data → fine-tuned in-context learning;
     task objective, no task-specific labels → nothing;
     task objective, task data → fine-tuned transfer learning, BERT etc.
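     As a sketch of the "task objective, task data" quadrant, the snippet below
     fine-tunes a small task-specific text classifier with spaCy's training API;
     the labels and the two training examples are made up purely for illustration.

     import spacy
     from spacy.training import Example

     # Toy task data (hypothetical labels and texts)
     train_data = [
         ("My order arrived broken.", {"cats": {"COMPLAINT": 1.0, "FEEDBACK": 0.0}}),
         ("Love the new dashboard!", {"cats": {"COMPLAINT": 0.0, "FEEDBACK": 1.0}}),
     ]

     nlp = spacy.blank("en")
     textcat = nlp.add_pipe("textcat")
     for label in ("COMPLAINT", "FEEDBACK"):
         textcat.add_label(label)

     # Initialize from the examples, then run a few update steps
     optimizer = nlp.initialize(
         lambda: [Example.from_dict(nlp.make_doc(t), a) for t, a in train_data]
     )
     for _ in range(20):
         losses = {}
         for text, annots in train_data:
             example = Example.from_dict(nlp.make_doc(text), annots)
             nlp.update([example], sgd=optimizer, losses=losses)

     print(nlp("The package was damaged in transit.").cats)

     The same pattern applies to the other predictive components (entity recognition,
     relation extraction, text classification): this is the quadrant the BERT-style
     transfer-learning models live in.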
  11.–12. LITERATURE: a massive number of experiments, many tasks, lots of models;
     results way below task-specific models across the board.
  13.–15. LITERATURE: fine-tuning an LLM for few-shot NER works; BERT-base still
     competitive overall; ChatGPT scores poorly.
  16.–18. LITERATURE: found ChatGPT did better than crowd workers on several text
     classification tasks; accuracy still low against trained annotators, which says
     more about crowd-worker methodology than about LLMs.
  19.–21. EXPERIMENTS: CoNLL 2003, Named Entity Recognition.

     Model          F-Score   Speed (words/s)
     GPT-3.5 [1]    78.6      < 100
     GPT-4 [1]      83.5      < 100
     spaCy          91.6      4,000
     Flair          93.1      1,000
     SOTA 2023 [2]  94.6      1,000
     SOTA 2003 [3]  88.8      > 20,000

     1. Ashok and Lipton (2023), 2. Wang et al. (2021), 3. Florian et al. (2003)
     Slide callouts: "SOTA on few-shot prompting", "RoBERTa-base".
     [Plot: FabNER, Claude 2 accuracy by number of examples, with a callout at 20 examples]
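     For reference, the F-scores above are standard span-level scores: a predicted
     entity only counts as correct if both its boundaries and its label exactly match
     the gold annotation. A small self-contained sketch of that computation, with
     made-up gold and predicted spans:

     # Span-level precision/recall/F1 for NER: exact match on (start, end, label)
     gold = {(0, 5, "ORG"), (34, 41, "GPE"), (45, 49, "DATE")}     # hypothetical gold spans
     pred = {(0, 5, "ORG"), (34, 41, "PERSON"), (45, 49, "DATE")}  # hypothetical predictions

     tp = len(gold & pred)
     precision = tp / len(pred) if pred else 0.0
     recall = tp / len(gold) if gold else 0.0
     f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
     print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")  # P=0.67 R=0.67 F=0.67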
  22.–23. PROTOTYPE TO PRODUCTION (github.com/explosion/spacy-llm): in the
     processing pipeline prototype, prompt the model and transform its output into
     structured data, a structured, machine-facing Doc object; in the processing
     pipeline in production, swap, replace and mix components.
  24.–25. PRELIMINARY RESULTS: LLM-assisted annotation.

     Metric                           Generative LLM   Distilled Component
     Accuracy (F-score)               0.74             0.74
     Speed (words/second)             < 100            ~ 2,000
     Model Size                       ~ 5 TB           400 MB
     Parameters                       1.8t             130m
     Training Examples                0                800
     Evaluation Examples              200              200
     Data Development Time (hours)    ~ 2              ~ 8
  26.–28. Predictive tasks still matter. Generative complements predictive; it
     doesn't replace it. In-context learning with prompts alone is not optimal for
     predictive tasks. Analysis and evaluation take time: you can't get a new system
     in minutes, no matter which approach you take. Don't abandon the development
     principles that made software successful: modularity, testability and
     flexibility.
  29. See you at the conference 👋 PyCon DE & PyData Berlin, April 24, 14:45.
     Explosion: explosion.ai · spaCy: spacy.io · Prodigy: prodigy.ai
     Twitter: @_inesmontani · Mastodon: @[email protected] ·
     Bluesky: @inesmontani.bsky.social · LinkedIn