Slide 1

Slide 1 text

* * * BEHIND THE SCENES * * *
Ines Montani, Explosion

Slide 2

Slide 2 text

spaCy: open-source library for industrial-strength natural language processing
spacy.io · 210m+ downloads

Slide 3

Slide 3 text

spaCy: open-source library for industrial-strength natural language processing
spacy.io · 210m+ downloads
ChatGPT can write spaCy code!

Slide 4

Slide 4 text

Prodigy: modern scriptable annotation tool for machine learning developers
prodigy.ai · 9k+ users · 800+ companies

Slide 5

Slide 5 text

Prodigy: modern scriptable annotation tool for machine learning developers
prodigy.ai · 9k+ users · 800+ companies
Personas: Alex Smith (Developer) · Kim Miller (Analyst)

Slide 6

Slide 6 text

Prodigy Teams (beta): collaborative data development platform
prodigy.ai/teams

Slide 7

Slide 7 text

Prodigy Teams (beta): collaborative data development platform
prodigy.ai/teams
Personas: Alex Smith (Developer) · Kim Miller (Analyst) · GPT-4 API

Slide 8

Slide 8 text

UNDERSTANDING NLP TASKS

Slide 9

Slide 9 text

UNDERSTANDING NLP TASKS

generative tasks:
📖 single/multi-doc summarization
🧮 reasoning
✅ problem solving
✍ paraphrasing
🖼 style transfer
⁉ question answering

Slide 10

Slide 10 text

UNDERSTANDING NLP TASKS

generative tasks (human-readable):
📖 single/multi-doc summarization
🧮 reasoning
✅ problem solving
✍ paraphrasing
🖼 style transfer
⁉ question answering

Slide 11

Slide 11 text

UNDERSTANDING NLP TASKS

predictive tasks:
🔖 entity recognition
🔗 relation extraction
👫 coreference resolution
🧬 grammar & morphology
🎯 semantic parsing
💬 discourse structure
📚 text classification

generative tasks (human-readable):
📖 single/multi-doc summarization
🧮 reasoning
✅ problem solving
✍ paraphrasing
🖼 style transfer
⁉ question answering

Slide 12

Slide 12 text

UNDERSTANDING NLP TASKS

predictive tasks (machine-readable):
🔖 entity recognition
🔗 relation extraction
👫 coreference resolution
🧬 grammar & morphology
🎯 semantic parsing
💬 discourse structure
📚 text classification

generative tasks (human-readable):
📖 single/multi-doc summarization
🧮 reasoning
✅ problem solving
✍ paraphrasing
🖼 style transfer
⁉ question answering

Slide 13

Slide 13 text

PREDICTIVE QUADRANT

Slide 14

Slide 14 text

PREDICTIVE QUADRANT

generic objective, negligible task data → zero/few-shot in-context learning

Slide 15

Slide 15 text

PREDICTIVE QUADRANT

generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning

Slide 16

Slide 16 text

PREDICTIVE QUADRANT

generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning
task objective, no task-specific labels → nothing

Slide 17

Slide 17 text

PREDICTIVE QUADRANT

generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning
task objective, no task-specific labels → nothing
task objective, task data → fine-tuned transfer learning (BERT etc.)

Slide 18

Slide 18 text

PREDICTIVE QUADRANT

generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning
task objective, no task-specific labels → nothing
task objective, task data → fine-tuned transfer learning (BERT etc.)

Slide 19

Slide 19 text

* * * LITERATURE * * *
Massive number of experiments: many tasks, lots of models

Slide 20

Slide 20 text

* * * LITERATURE * * *
Massive number of experiments: many tasks, lots of models
Results way below task-specific models across the board

Slide 21

Slide 21 text

* * * LITERATURE * * *
Fine-tuning an LLM for few-shot NER works

Slide 22

Slide 22 text

* * * LITERATURE * * *
Fine-tuning an LLM for few-shot NER works
BERT-base still competitive overall

Slide 23

Slide 23 text

* * * LITERATURE * * *
Fine-tuning an LLM for few-shot NER works
ChatGPT scores poorly
BERT-base still competitive overall

Slide 24

Slide 24 text

* * * LITERATURE * * *
Found ChatGPT did better than crowd workers on several text classification tasks

Slide 25

Slide 25 text

* * * LITERATURE * * *
Found ChatGPT did better than crowd workers on several text classification tasks
Accuracy still low against trained annotators

Slide 26

Slide 26 text

* * * LITERATURE * * *
Found ChatGPT did better than crowd workers on several text classification tasks
Accuracy still low against trained annotators
Says more about crowd-worker methodology than LLMs

Slide 27

Slide 27 text

* * * EXPERIMENTS * * *
CoNLL 2003: Named Entity Recognition

                 F-Score   Speed (words/s)
GPT-3.5 [1]      78.6      < 100
GPT-4 [1]        83.5      < 100
spaCy            91.6      4,000
Flair            93.1      1,000
SOTA 2023 [2]    94.6      1,000
SOTA 2003 [3]    88.8      > 20,000

[1] Ashok and Lipton (2023)  [2] Wang et al. (2021)  [3] Florian et al. (2003)

Slide 28

Slide 28 text

* * * EXPERIMENTS * * *
CoNLL 2003: Named Entity Recognition

                 F-Score   Speed (words/s)
GPT-3.5 [1]      78.6      < 100
GPT-4 [1]        83.5      < 100
spaCy            91.6      4,000
Flair            93.1      1,000
SOTA 2023 [2]    94.6      1,000
SOTA 2003 [3]    88.8      > 20,000

[1] Ashok and Lipton (2023)  [2] Wang et al. (2021)  [3] Florian et al. (2003)
Callouts: GPT-4 = SOTA on few-shot prompting · spaCy = RoBERTa-base

Slide 29

Slide 29 text

* * * EXPERIMENTS * * *
CoNLL 2003: Named Entity Recognition

                 F-Score   Speed (words/s)
GPT-3.5 [1]      78.6      < 100
GPT-4 [1]        83.5      < 100
spaCy            91.6      4,000
Flair            93.1      1,000
SOTA 2023 [2]    94.6      1,000
SOTA 2003 [3]    88.8      > 20,000

[1] Ashok and Lipton (2023)  [2] Wang et al. (2021)  [3] Florian et al. (2003)
Callouts: GPT-4 = SOTA on few-shot prompting · spaCy = RoBERTa-base

[Chart: FabNER, Claude 2 accuracy vs. number of training examples (x-axis 0-500, y-axis 10-100), with a callout at 20 examples]

Slide 30

Slide 30 text

PROTOTYPE TO PRODUCTION

prototype: processing pipeline

Slide 31

Slide 31 text

PROTOTYPE TO PRODUCTION

prototype: processing pipeline
github.com/explosion/spacy-llm: prompt the model & transform output to structured data
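
As a rough sketch of what that prototype step looks like in code, loosely following the spacy-llm quickstart (assumes spacy-llm is installed and an OpenAI API key is set for the default backend; the labels and example sentence are illustrative):

    import spacy

    # spacy-llm registers LLM-backed component factories with spaCy
    nlp = spacy.blank("en")
    llm = nlp.add_pipe("llm_ner")   # NER performed by prompting an LLM
    llm.add_label("PERSON")
    llm.add_label("LOCATION")
    nlp.initialize()

    # The LLM's free-text response is parsed back into structured entities
    doc = nlp("Jack and Jill rode up the hill in Les Deux Alpes.")
    print([(ent.text, ent.label_) for ent in doc.ents])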

Slide 32

Slide 32 text

PROTOTYPE TO PRODUCTION

prototype: processing pipeline
github.com/explosion/spacy-llm: prompt the model & transform output to structured data

in production: processing pipeline, with components swapped, replaced and mixed

Slide 33

Slide 33 text

PROTOTYPE TO PRODUCTION

prototype: processing pipeline
github.com/explosion/spacy-llm: prompt the model & transform output to structured data

in production: processing pipeline, with components swapped, replaced and mixed
structured machine-facing Doc object
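
Because both the prototype and the production pipeline hand results over as the same structured, machine-facing Doc object, downstream code stays untouched when a component is swapped. A minimal sketch, assuming a hypothetical distilled pipeline saved as my_distilled_pipeline:

    import spacy

    # Swap the LLM-backed prototype for a small trained pipeline;
    # "my_distilled_pipeline" is a hypothetical package/path, not from the talk.
    nlp = spacy.load("my_distilled_pipeline")

    # Same machine-facing API as the prototype: entities live on doc.ents
    doc = nlp("Jack and Jill rode up the hill in Les Deux Alpes.")
    print([(ent.text, ent.label_) for ent in doc.ents])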

Slide 34

Slide 34 text

* * * PRELIMINARY RESULTS * * *
LLM-assisted annotation

                                Generative LLM   Distilled Component
Accuracy (F-score)              0.74             0.74
Speed (words/second)            < 100            ~ 2,000
Model Size                      ~ 5 TB           400 MB
Parameters                      1.8t             130m
Training Examples               0                800
Evaluation Examples             200              200
Data Development Time (hours)   ~ 2              ~ 8

Slide 35

Slide 35 text

* * * PRELIMINARY RESULTS * * *
LLM-assisted annotation

                                Generative LLM   Distilled Component
Accuracy (F-score)              0.74             0.74
Speed (words/second)            < 100            ~ 2,000
Model Size                      ~ 5 TB           400 MB
Parameters                      1.8t             130m
Training Examples               0                800
Evaluation Examples             200              200
Data Development Time (hours)   ~ 2              ~ 8

(highlighted: the distilled component's ~ 2,000 words/second and 400 MB size)
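
A hedged sketch of how such a distilled component can be produced: let the LLM prototype draft annotations, store them as spaCy training data, correct them with a human in the loop, then train a small task-specific model. The file names llm_config.cfg and config.cfg and the raw texts are placeholders, not from the talk:

    from spacy.tokens import DocBin
    from spacy_llm.util import assemble

    # LLM-backed prototype pipeline, assembled from a spacy-llm config file
    llm_nlp = assemble("llm_config.cfg")

    texts = ["...raw, unlabelled example texts..."]

    # LLM predictions become draft annotations to review and correct
    db = DocBin()
    for doc in llm_nlp.pipe(texts):
        db.add(doc)
    db.to_disk("train.spacy")

    # After human correction, train the small distilled component:
    # python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy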

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Predictive tasks still matter. Generative complements predictive. It doesn't replace it.

Slide 38

Slide 38 text

Predictive tasks still matter. Generative complements predictive. It doesn't replace it.

In-context learning with prompts alone is not optimal for predictive tasks.

Slide 39

Slide 39 text

Predictive tasks still matter. Generative complements predictive. It doesn't replace it.

In-context learning with prompts alone is not optimal for predictive tasks.

Analysis and evaluation take time. You can't get a new system in minutes, no matter which approach you take.

Slide 40

Slide 40 text

Predictive tasks still matter. Generative complements predictive. It doesn't replace it.

In-context learning with prompts alone is not optimal for predictive tasks.

Analysis and evaluation take time. You can't get a new system in minutes, no matter which approach you take.

Don't abandon development principles that made software successful: modularity, testability and flexibility.

Slide 41

Slide 41 text

Explosion: explosion.ai
spaCy: spacy.io
Prodigy: prodigy.ai
Twitter: @_inesmontani
Mastodon: @[email protected]
Bluesky: @inesmontani.bsky.social
LinkedIn

Slide 42

Slide 42 text

* * * PYCON DE & PYDATA BERLIN * * *
April 24, 14:45 · See you at the conference 👋

Explosion: explosion.ai
spaCy: spacy.io
Prodigy: prodigy.ai
Twitter: @_inesmontani
Mastodon: @[email protected]
Bluesky: @inesmontani.bsky.social
LinkedIn