Slide 1

Slide 1 text

* * * BEHIND THE SCENES * * *
Ines Montani, Explosion

Slide 2

Slide 2 text

spaCy: open-source library for industrial-strength natural language processing
spacy.io · 210m+ downloads

Slide 3

Slide 3 text

spaCy: open-source library for industrial-strength natural language processing
spacy.io · 210m+ downloads
ChatGPT can write spaCy code!

Slide 4

Slide 4 text

Prodigy: modern scriptable annotation tool for machine learning developers
prodigy.ai · 9k+ users · 800+ companies

Slide 5

Slide 5 text

Prodigy: modern scriptable annotation tool for machine learning developers
prodigy.ai · 9k+ users · 800+ companies
Personas: Alex Smith (Developer) · Kim Miller (Analyst)

Slide 6

Slide 6 text

Prodigy Teams (beta): collaborative data development platform
prodigy.ai/teams

Slide 7

Slide 7 text

Prodigy Teams (beta): collaborative data development platform
prodigy.ai/teams
Personas: Alex Smith (Developer) · Kim Miller (Analyst) · GPT-4 API

Slide 8

Slide 8 text

UNDERSTANDING NLP TASKS

Slide 9

Slide 9 text

UNDERSTANDING NLP TASKS

generative tasks:
📖 single/multi-doc summarization
🧮 reasoning
✅ problem solving
✍ paraphrasing
🖼 style transfer
⁉ question answering

Slide 10

Slide 10 text

UNDERSTANDING NLP TASKS

generative tasks (human-readable):
📖 single/multi-doc summarization
🧮 reasoning
✅ problem solving
✍ paraphrasing
🖼 style transfer
⁉ question answering

Slide 11

Slide 11 text

UNDERSTANDING NLP TASKS

predictive tasks:
🔖 entity recognition
🔗 relation extraction
👫 coreference resolution
🧬 grammar & morphology
🎯 semantic parsing
💬 discourse structure
📚 text classification

generative tasks (human-readable):
📖 single/multi-doc summarization
🧮 reasoning
✅ problem solving
✍ paraphrasing
🖼 style transfer
⁉ question answering

Slide 12

Slide 12 text

UNDERSTANDING NLP TASKS

predictive tasks (machine-readable):
🔖 entity recognition
🔗 relation extraction
👫 coreference resolution
🧬 grammar & morphology
🎯 semantic parsing
💬 discourse structure
📚 text classification

generative tasks (human-readable):
📖 single/multi-doc summarization
🧮 reasoning
✅ problem solving
✍ paraphrasing
🖼 style transfer
⁉ question answering

Slide 13

Slide 13 text

PREDICTIVE QUADRANT

Slide 14

Slide 14 text

PREDICTIVE QUADRANT

generic objective, negligible task data → zero/few-shot in-context learning

Slide 15

Slide 15 text

PREDICTIVE QUADRANT

generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning

Slide 16

Slide 16 text

PREDICTIVE QUADRANT

generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning
task objective, no task-specific labels → nothing

Slide 17

Slide 17 text

PREDICTIVE QUADRANT

generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning
task objective, no task-specific labels → nothing
task objective, task data → fine-tuned transfer learning (BERT etc.)

Slide 18

Slide 18 text

PREDICTIVE QUADRANT

generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning
task objective, no task-specific labels → nothing
task objective, task data → fine-tuned transfer learning (BERT etc.)

Slide 19

Slide 19 text

* * * LITERATURE * * *
Massive number of experiments: many tasks, lots of models

Slide 20

Slide 20 text

* * * LITERATURE * * *
Massive number of experiments: many tasks, lots of models
Results way below task-specific models across the board

Slide 21

Slide 21 text

* * * LITERATURE * * *
Fine-tuning an LLM for few-shot NER works

Slide 22

Slide 22 text

* * * LITERATURE * * *
Fine-tuning an LLM for few-shot NER works
BERT-base still competitive overall

Slide 23

Slide 23 text

* * * LITERATURE * * *
Fine-tuning an LLM for few-shot NER works
ChatGPT scores poorly
BERT-base still competitive overall

Slide 24

Slide 24 text

* * * LITERATURE * * *
Found ChatGPT did better than crowd workers on several text classification tasks

Slide 25

Slide 25 text

* * * LITERATURE * * *
Found ChatGPT did better than crowd workers on several text classification tasks
Accuracy still low against trained annotators

Slide 26

Slide 26 text

* * * LITERATURE * * *
Found ChatGPT did better than crowd workers on several text classification tasks
Accuracy still low against trained annotators
Says more about crowd-worker methodology than LLMs

Slide 27

Slide 27 text

* * * EXPERIMENTS * * *
CoNLL 2003: Named Entity Recognition

                 F-Score   Speed (words/s)
GPT-3.5 [1]      78.6      < 100
GPT-4 [1]        83.5      < 100
spaCy            91.6      4,000
Flair            93.1      1,000
SOTA 2023 [2]    94.6      1,000
SOTA 2003 [3]    88.8      > 20,000

[1] Ashok and Lipton (2023)  [2] Wang et al. (2021)  [3] Florian et al. (2003)

Slide 28

Slide 28 text

* * * EXPERIMENTS * * *
CoNLL 2003: Named Entity Recognition

                 F-Score   Speed (words/s)
GPT-3.5 [1]      78.6      < 100
GPT-4 [1]        83.5      < 100
spaCy            91.6      4,000
Flair            93.1      1,000
SOTA 2023 [2]    94.6      1,000
SOTA 2003 [3]    88.8      > 20,000

[1] Ashok and Lipton (2023)  [2] Wang et al. (2021)  [3] Florian et al. (2003)
Callouts: GPT-4 = SOTA on few-shot prompting · spaCy = RoBERTa-base

Slide 29

Slide 29 text

* * * EXPERIMENTS * * *
CoNLL 2003: Named Entity Recognition

                 F-Score   Speed (words/s)
GPT-3.5 [1]      78.6      < 100
GPT-4 [1]        83.5      < 100
spaCy            91.6      4,000
Flair            93.1      1,000
SOTA 2023 [2]    94.6      1,000
SOTA 2003 [3]    88.8      > 20,000

[1] Ashok and Lipton (2023)  [2] Wang et al. (2021)  [3] Florian et al. (2003)
Callouts: GPT-4 = SOTA on few-shot prompting · spaCy = RoBERTa-base

[Chart: FabNER, Claude 2 accuracy vs. number of training examples (x-axis 0-500, y-axis 10-100), with a callout at 20 examples]

Slide 30

Slide 30 text

PROTOTYPE TO PRODUCTION

prototype: processing pipeline

Slide 31

Slide 31 text

PROTOTYPE TO PRODUCTION

prototype: processing pipeline
github.com/explosion/spacy-llm: prompt the model & transform output to structured data
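
As a rough sketch of what that prototype step looks like in code, loosely following the spacy-llm quickstart (assumes spacy-llm is installed and an OpenAI API key is set for the default backend; the labels and example sentence are illustrative):

    import spacy

    # spacy-llm registers LLM-backed component factories with spaCy
    nlp = spacy.blank("en")
    llm = nlp.add_pipe("llm_ner")   # NER performed by prompting an LLM
    llm.add_label("PERSON")
    llm.add_label("LOCATION")
    nlp.initialize()

    # The LLM's free-text response is parsed back into structured entities
    doc = nlp("Jack and Jill rode up the hill in Les Deux Alpes.")
    print([(ent.text, ent.label_) for ent in doc.ents])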

Slide 32

Slide 32 text

PROTOTYPE TO PRODUCTION

prototype: processing pipeline
github.com/explosion/spacy-llm: prompt the model & transform output to structured data

in production: processing pipeline, with components swapped, replaced and mixed

Slide 33

Slide 33 text

PROTOTYPE TO PRODUCTION

prototype: processing pipeline
github.com/explosion/spacy-llm: prompt the model & transform output to structured data

in production: processing pipeline, with components swapped, replaced and mixed
structured machine-facing Doc object
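
Because both the prototype and the production pipeline hand results over as the same structured, machine-facing Doc object, downstream code stays untouched when a component is swapped. A minimal sketch, assuming a hypothetical distilled pipeline saved as my_distilled_pipeline:

    import spacy

    # Swap the LLM-backed prototype for a small trained pipeline;
    # "my_distilled_pipeline" is a hypothetical package/path, not from the talk.
    nlp = spacy.load("my_distilled_pipeline")

    # Same machine-facing API as the prototype: entities live on doc.ents
    doc = nlp("Jack and Jill rode up the hill in Les Deux Alpes.")
    print([(ent.text, ent.label_) for ent in doc.ents])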

Slide 34

Slide 34 text

* * * PRELIMINARY RESULTS * * *
LLM-assisted annotation

                                Generative LLM   Distilled Component
Accuracy (F-score)              0.74             0.74
Speed (words/second)            < 100            ~ 2,000
Model Size                      ~ 5 TB           400 MB
Parameters                      1.8t             130m
Training Examples               0                800
Evaluation Examples             200              200
Data Development Time (hours)   ~ 2              ~ 8

Slide 35

Slide 35 text

* * * PRELIMINARY RESULTS * * *
LLM-assisted annotation

                                Generative LLM   Distilled Component
Accuracy (F-score)              0.74             0.74
Speed (words/second)            < 100            ~ 2,000
Model Size                      ~ 5 TB           400 MB
Parameters                      1.8t             130m
Training Examples               0                800
Evaluation Examples             200              200
Data Development Time (hours)   ~ 2              ~ 8

(highlighted: the distilled component's ~ 2,000 words/second and 400 MB size)
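
A hedged sketch of how such a distilled component can be produced: let the LLM prototype draft annotations, store them as spaCy training data, correct them with a human in the loop, then train a small task-specific model. The file names llm_config.cfg and config.cfg and the raw texts are placeholders, not from the talk:

    from spacy.tokens import DocBin
    from spacy_llm.util import assemble

    # LLM-backed prototype pipeline, assembled from a spacy-llm config file
    llm_nlp = assemble("llm_config.cfg")

    texts = ["...raw, unlabelled example texts..."]

    # LLM predictions become draft annotations to review and correct
    db = DocBin()
    for doc in llm_nlp.pipe(texts):
        db.add(doc)
    db.to_disk("train.spacy")

    # After human correction, train the small distilled component:
    # python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy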

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Predictive tasks still matter. Generative complements predictive. It doesn't replace it.

Slide 38

Slide 38 text

Predictive tasks still matter. Generative complements predictive. It doesn't replace it.

In-context learning with prompts alone is not optimal for predictive tasks.

Slide 39

Slide 39 text

Predictive tasks still matter. Generative complements predictive. It doesn't replace it.

In-context learning with prompts alone is not optimal for predictive tasks.

Analysis and evaluation take time. You can't get a new system in minutes, no matter which approach you take.

Slide 40

Slide 40 text

Predictive tasks still matter. Generative complements predictive. It doesn't replace it.

In-context learning with prompts alone is not optimal for predictive tasks.

Analysis and evaluation take time. You can't get a new system in minutes, no matter which approach you take.

Don't abandon development principles that made software successful: modularity, testability and flexibility.

Slide 41

Slide 41 text

Explosion: explosion.ai
spaCy: spacy.io
Prodigy: prodigy.ai
Twitter: @_inesmontani
Mastodon: @[email protected]
Bluesky: @inesmontani.bsky.social
LinkedIn

Slide 42

Slide 42 text

* * * PYCON DE & PYDATA BERLIN * * *
April 24, 14:45 · See you at the conference 👋

Explosion: explosion.ai
spaCy: spacy.io
Prodigy: prodigy.ai
Twitter: @_inesmontani
Mastodon: @[email protected]
Bluesky: @inesmontani.bsky.social
LinkedIn