Slide 1

Slide 1 text

How many labelled examples do you need for a BERT-sized model to beat GPT-4 (or similar) on predictive tasks?
Matthew Honnibal, Explosion

Slide 2

Slide 2 text

How I’m using GPT-4 (in ChatGPT):
- debugging cloud permissions
- navigating Linux tools
- lots more

Slide 3

Slide 3 text

spaCy: open-source library for industrial-strength natural language processing
170m+ downloads · spacy.io

Slide 4

Slide 4 text

spaCy: open-source library for industrial-strength natural language processing
ChatGPT can write spaCy code!
170m+ downloads · spacy.io

Slide 5

Slide 5 text

Prodigy: modern scriptable annotation tool for machine learning developers
800+ companies · 9k+ users · prodigy.ai

Slide 7

Slide 7 text

Prodigy Teams (BETA): collaborative data development platform
prodigy.ai/teams

Slide 8

Slide 8 text

Prodigy Teams (BETA): collaborative data development platform
prodigy.ai/teams
[Diagram: Alex Smith (Developer) and Kim Miller (Analyst) collaborating, with the GPT-4 API in the loop]

Slide 9

Slide 9 text

1. Predictive tasks still matter.

Slide 10

Slide 10 text

1. Predictive tasks still matter.
2. In-context learning (prompts) is not optimal for predictive tasks.

Slide 11

Slide 11 text

1. Predictive tasks still matter.
2. In-context learning (prompts) is not optimal for predictive tasks.
3. Conceptual model and workflow for using labelled examples.

Slide 12

Slide 12 text

Generative complements predictive. It doesn’t replace it.

Slide 13

Slide 13 text

Generative tasks (human-readable output): single/multi-doc summarization · problem solving · paraphrasing & reasoning · style transfer · question answering
Predictive tasks (machine-readable output): text classification · entity recognition · relation extraction · grammar & morphology · semantic parsing · coreference resolution · discourse structure

Slide 14

Slide 14 text

[Diagram: “Hooli raises $5m to revolutionize search, led by ACME Ventures”, with entity labels COMPANY, MONEY and INVESTOR and database IDs 5923214 and 1681056, feeding a database]

Slide 15

Slide 15 text

[Diagram as in Slide 14] Pipeline steps so far: named entity recognition

Slide 16

Slide 16 text

[Diagram as in Slide 14] Pipeline steps so far: named entity recognition → entity disambiguation

Slide 17

Slide 17 text

[Diagram as in Slide 14] Pipeline steps so far: named entity recognition → entity disambiguation → custom database lookup

Slide 18

Slide 18 text

[Diagram as in Slide 14] Pipeline steps so far: named entity recognition → entity disambiguation → custom database lookup → currency normalization

Slide 19

Slide 19 text

[Diagram as in Slide 14] Pipeline steps so far: named entity recognition → entity disambiguation → custom database lookup → currency normalization → entity relation extraction
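
To make this pipeline concrete, here is a minimal sketch using spaCy's stock English model. Only the NER call is spaCy's real API; the stock model tags ORG and MONEY rather than the slide's custom COMPANY/INVESTOR labels, COMPANY_DB and to_usd are toy stand-ins for the disambiguation, lookup and normalization steps, and relation extraction (deciding which company is the investor) is left out because it needs a trained component of its own.

import re
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained pipeline with an NER component

# Toy stand-in for entity disambiguation and the custom database lookup.
COMPANY_DB = {"Hooli": 5923214, "ACME Ventures": 1681056}

def to_usd(money_text):
    # Toy currency normalization: "$5m" -> 5000000
    match = re.match(r"\$?([\d.]+)\s*m", money_text, re.IGNORECASE)
    return int(float(match.group(1)) * 1_000_000) if match else None

def extract_deal(text):
    doc = nlp(text)  # named entity recognition
    record = {"company_ids": [], "amount_usd": None}
    for ent in doc.ents:
        if ent.label_ == "ORG":
            record["company_ids"].append(COMPANY_DB.get(ent.text))
        elif ent.label_ == "MONEY":
            record["amount_usd"] = to_usd(ent.text)
    return record

print(extract_deal("Hooli raises $5m to revolutionize search, led by ACME Ventures"))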

Slide 20

Slide 20 text

How good is in-context learning at predictive tasks?

Slide 21

Slide 21 text

How classifiers used to work: the Averaged Perceptron

from collections import defaultdict
import numpy as np

def train_tagger(examples, n_tags):
    W = defaultdict(lambda: np.zeros(n_tags))
    for (word, prev, next), human_tag in examples:
        scores = W[word] + W[prev] + W[next]
        guess = scores.argmax()
        if guess != human_tag:
            for feat in (word, prev, next):
                W[feat][guess] -= 1
                W[feat][human_tag] += 1

Slide 22

Slide 22 text

Same code as Slide 21; new callout: examples = words, tags, contexts

Slide 23

Slide 23 text

Same code as Slide 21; new callout: the weights we’ll train

Slide 24

Slide 24 text

Same code as Slide 21; new callout: score each tag given weights & context

Slide 25

Slide 25 text

Same code as Slide 21; new callout: get best-scoring tag

Slide 26

Slide 26 text

Same code as Slide 21; new callout: if guess was wrong, adjust weights

Slide 27

Slide 27 text

Same code as Slide 21; new callout: decrease score for bad tag in this context

Slide 28

Slide 28 text

How classifiers used to work: the Averaged Perceptron (all callouts as comments)

from collections import defaultdict
import numpy as np

def train_tagger(examples, n_tags):
    # examples = words, tags, contexts
    # W holds the weights we'll train
    W = defaultdict(lambda: np.zeros(n_tags))
    for (word, prev, next), human_tag in examples:
        # score each tag given weights & context
        scores = W[word] + W[prev] + W[next]
        # get best-scoring tag
        guess = scores.argmax()
        # if guess was wrong, adjust weights
        if guess != human_tag:
            for feat in (word, prev, next):
                # decrease score for bad tag in this context
                W[feat][guess] -= 1
                # increase score for good tag in this context
                W[feat][human_tag] += 1
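
For context, here is one way the examples stream might be built from tagged sentences. This construction is an assumption for illustration, not from the slides; note also that a true averaged perceptron would average W over all updates, which the slide code omits for brevity.

def make_examples(tagged_sentences, tag_ids):
    # Turn tagged sentences into ((word, prev, next), tag_id) pairs.
    examples = []
    for words, tags in tagged_sentences:
        for i, (word, tag) in enumerate(zip(words, tags)):
            prev_word = words[i - 1] if i > 0 else "<s>"
            next_word = words[i + 1] if i + 1 < len(words) else "</s>"
            examples.append(((word, prev_word, next_word), tag_ids[tag]))
    return examples

tag_ids = {"PRON": 0, "VERB": 1}
sents = [(["I", "saw", "her"], ["PRON", "VERB", "PRON"])]
train_tagger(make_examples(sents, tag_ids), n_tags=len(tag_ids))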

Slide 29

Slide 29 text

Predictive quadrant

Slide 30

Slide 30 text

Predictive quadrant
generic objective, negligible task data → zero/few-shot in-context learning

Slide 31

Slide 31 text

Predictive quadrant
generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning

Slide 32

Slide 32 text

Predictive quadrant
generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning
task objective, no task-specific labels → nothing

Slide 33

Slide 33 text

Predictive quadrant
generic objective, negligible task data → zero/few-shot in-context learning
generic objective, task data → fine-tuned in-context learning
task objective, no task-specific labels → nothing
task objective, task data → fine-tuned transfer learning (BERT etc.)

Slide 35

Slide 35 text

Named Entity Recognition: CoNLL 2003 NER

Model                   F-Score   Speed (words/s)
GPT-3.5 [1]             78.6      < 100
GPT-4 [1]               83.5      < 100
spaCy (RoBERTa-base)    91.6      4,000
Flair                   93.1      1,000
SOTA 2023 [2]           94.6      1,000
SOTA 2003 [3]           88.8      > 20,000

[1] Ashok and Lipton (2023), the SOTA on few-shot prompting
[2] Wang et al. (2021)
[3] Florian et al. (2003)

Slide 36

Slide 36 text

- massive number of experiments: many tasks, lots of models
- no GPT-4 results
- way below task-specific models across the board

Slide 37

Slide 37 text

- found ChatGPT did better than crowd-workers on several text classification tasks
- accuracy still low against trained annotators
- says more about crowd-worker methodology than LLMs

Slide 38

Slide 38 text

- fine-tuning an LLM for few-shot NER works
- BERT-base still competitive overall
- ChatGPT scores poorly

Slide 39

Slide 39 text

[Chart: text classification accuracy (65–100) vs. % of training examples (1%–100%) for task-specific models on SST2, AG News and Banking77, against a few-shot GPT-3 baseline]
text classification: few-shot GPT-3 vs. task-specific models
- LLM stays competitive on sentiment (binary task it understands)
- news model outperforms LLM with 1% of the training data
- LLM does badly on Banking77 (too many labels)

Slide 43

Slide 43 text

[Chart: NER accuracy (10–100) vs. # of training examples (0–500) for a task-specific CNN model on FabNER, against a zero-shot Claude 2 baseline]
named entity recognition: zero-shot Claude 2 vs. task-specific CNN model
- task-specific model wins with 20 examples
- few-shot greatly increases prompt lengths and doesn’t work well with many label types
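
A toy illustration of the prompt-length point (the label names and counts here are made up, not from the talk): every label definition and every demonstration must be resent on every call, so prompt size grows with labels × demos, while a trained model pays for its examples once at training time.

LABELS = {f"TYPE_{i}": "definition of this entity type ..." for i in range(12)}
DEMOS = [(f"example sentence {i}", "Entities: ...") for i in range(5)]

def build_prompt(text):
    # Everything below rides along on every single API call.
    parts = ["Extract the entities defined below."]
    parts += [f"{name}: {defn}" for name, defn in LABELS.items()]
    parts += [f"Text: {t}\n{e}" for t, e in DEMOS]
    parts.append(f"Text: {text}\nEntities:")
    return "\n\n".join(parts)

print(len(build_prompt("Hooli raises $5m ...")))  # grows with labels x demos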

Slide 44

Slide 44 text

How to think about this and what to do

Slide 45

Slide 45 text

Humans are just weird hardware
We have lots of devices you can schedule computation on: CPU, GPU, LLM, task worker, trained expert...
Some devices are much more expensive than others.
Use the expensive devices to compile programs to run on less expensive devices.
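
Read as code, the idea might look like this sketch: the expensive device labels data once, offline, and the cheap device runs in production. ask_expensive_device is a hypothetical stand-in for a GPT-4 call or an annotation task; the small model here is ordinary scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def ask_expensive_device(text):
    # Hypothetical stand-in: prompt GPT-4, or queue a task for a human expert.
    return "positive" if "love" in text else "negative"

texts = ["I love this product", "terrible support experience",
         "love the new release", "awful, would not recommend"]
labels = [ask_expensive_device(t) for t in texts]   # expensive device, runs once

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)                            # the "compiled program"
print(model.predict(["I love it"]))                 # cheap device, runs forever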

Slide 46

Slide 46 text

Program to the hardware you’re using
[Diagram: GPT-4 API and Alex Smith (Developer)]

Slide 47

Slide 47 text

Program to the hardware you’re using
[Diagram: GPT-4 API, Alex Smith (Developer) and Kim Miller (Annotator)]

Slide 48

Slide 48 text

Scheduling computation on humans
- High latency. Let them get into a groove.
- Don’t thrash the cache. Working memory is limited.
- Compile your program: put effort into creating the right stream of tasks.

Slide 49

Slide 49 text

thank you!
Explosion: explosion.ai · spaCy: spacy.io · Prodigy: prodigy.ai
Twitter/LinkedIn: @honnibal · Mastodon: @[email protected] · Bluesky: @honnibal.bsky.social