
How many Labelled Examples do you need for a BERT-sized Model to Beat GPT-4 on Predictive Tasks?

Video: https://www.youtube.com/watch?v=3iaxLTKJROc

Large Language Models (LLMs) offer a new machine learning interaction paradigm: in-context learning. For a wide variety of generative tasks (e.g. summarisation, question answering, paraphrasing), this approach is clearly much better than relying on explicit labelled data. In-context learning can also be applied to predictive tasks such as text categorization and entity recognition, with few or no labelled exemplars.

But how does in-context learning actually compare to supervised approaches on those tasks? The key advantage is that you need less data, but how many labelled examples do you need on different problems before a BERT-sized model can beat GPT-4 in accuracy?

The answer might surprise you: models with fewer than 1B parameters are actually very good at classic predictive NLP, while in-context learning struggles on many problem shapes, especially tasks with many labels or that require structured prediction. Methods for improving in-context learning accuracy generally trade away even more speed, suggesting that distillation and LLM-guided annotation will be the most practical approaches.

Implementation of this approach is discussed with reference to the spaCy open-source library and the Prodigy annotation tool.
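
As a rough, minimal sketch of the supervised alternative (not from the talk; the label scheme and example texts are invented for illustration), here is how a small spaCy text classifier can be trained from a handful of labelled examples:

    import spacy
    from spacy.training import Example

    # A few labelled examples (texts and categories invented for illustration).
    train_data = [
        ("Card payment declined at checkout", {"cats": {"BILLING": 1.0, "OTHER": 0.0}}),
        ("How do I change my delivery address?", {"cats": {"BILLING": 0.0, "OTHER": 1.0}}),
    ]

    nlp = spacy.blank("en")
    nlp.add_pipe("textcat")                     # small task-specific text classifier
    examples = [
        Example.from_dict(nlp.make_doc(text), annotations)
        for text, annotations in train_data
    ]
    optimizer = nlp.initialize(lambda: examples)

    for epoch in range(20):                     # tiny training loop
        losses = {}
        nlp.update(examples, sgd=optimizer, losses=losses)

    doc = nlp("Why was I charged twice this month?")
    print(doc.cats)                             # score per category

In practice you would train from far more examples and use spaCy's config and CLI workflow, but the shape of the problem is the same: labelled examples in, a small task-specific model out.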

Matthew Honnibal

October 25, 2023

Transcript

  1. Matthew Honnibal (Explosion): How many labelled examples do you need for a BERT-sized model to beat GPT-4 on predictive tasks?
  2. spaCy (spacy.io): open-source library for industrial-strength natural language processing. 170m+ downloads. ChatGPT can write spaCy code!
  3. Prodigy Teams (prodigy.ai/teams): collaborative data development platform. [Screenshot: example users Alex Smith (Developer) and Kim Miller (Analyst); GPT-4 API integration, beta.]
  4. 1. Predictive tasks still matter. 2. In-context learning (prompts) is

    not optimal for predictive tasks. 3. Conceptual model and workflow for using labelled examples.
  5. Generative vs. Predictive tasks.
     Generative (human-readable output): single/multi-doc summarization, problem solving & reasoning, paraphrasing, style transfer, question answering.
     Predictive (machine-readable output): text classification, entity recognition, relation extraction, grammar & morphology, semantic parsing, coreference resolution, discourse structure.
  6.–11. [Diagram, built up one step per slide: the sentence “Hooli raises $5m to revolutionize search, led by ACME Ventures” annotated with COMPANY, MONEY and INVESTOR labels, linked to database records (IDs 5923214, 1681056). Pipeline steps added in turn: named entity recognition, entity disambiguation, custom database lookup, currency normalization, entity relation extraction.]
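
As a rough sketch of the machine-readable output this pipeline is building towards (my illustration, not code from the talk; the record fields and relation label are invented, and the pretrained model uses generic ORG/MONEY labels rather than the slide's custom COMPANY/INVESTOR scheme):

    import spacy

    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Hooli raises $5m to revolutionize search, led by ACME Ventures")

    for ent in doc.ents:                # named entity recognition
        print(ent.text, ent.label_)

    # The later steps (entity disambiguation, database lookup, currency
    # normalization, relation extraction) would turn those spans into a
    # machine-readable record, for example:
    record = {
        "company_id": 5923214,          # from disambiguation + database lookup
        "investor_id": 1681056,
        "amount_usd": 5_000_000,        # from currency normalization of "$5m"
        "relation": "INVESTMENT",       # invented relation label
    }

Each of these steps has a fixed output schema, which is exactly the kind of predictive task the deck argues small supervised models handle well.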
  12.–19. How classifiers used to work: the Averaged Perceptron. The same training code is shown on every slide, with one callout added per slide:

      def train_tagger(examples, n_tags):
          W = defaultdict(lambda: np.zeros(n_tags))
          for (word, prev, next), human_tag in examples:
              scores = W[word] + W[prev] + W[next]
              guess = scores.argmax()
              if guess != human_tag:
                  for feat in (word, prev, next):
                      W[feat][guess] -= 1
                      W[feat][human_tag] += 1

      Callouts, in order: examples = words, tags, contexts; the weights we’ll train; score each tag given weights & context; get best-scoring tag; if guess was wrong, adjust weights; decrease score for bad tag in this context; increase score for good tag in this context.
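
For readers who want to run the slide code, here is a self-contained, lightly adapted version (imports added, next renamed to avoid shadowing the builtin, a return value and a few training passes added); the toy data, feature scheme and tag set are invented for illustration:

    from collections import defaultdict
    import numpy as np

    def train_tagger(examples, n_tags, n_iter=5):
        """Perceptron-style tagger from the slides, adapted to return its weights."""
        W = defaultdict(lambda: np.zeros(n_tags))
        for _ in range(n_iter):                      # a few passes over the data
            for (word, prev, nxt), human_tag in examples:
                scores = W[word] + W[prev] + W[nxt]  # sum the feature weights
                guess = int(scores.argmax())         # best-scoring tag
                if guess != human_tag:               # wrong guess? adjust weights
                    for feat in (word, prev, nxt):
                        W[feat][guess] -= 1
                        W[feat][human_tag] += 1
        return W

    # Toy data, invented for illustration: ((word, prev word, next word), tag id),
    # with 0 = NOUN and 1 = VERB.
    examples = [
        (("dog", "the", "barks"), 0),
        (("barks", "dog", "."), 1),
        (("cat", "the", "sleeps"), 0),
        (("sleeps", "cat", "."), 1),
    ]
    W = train_tagger(examples, n_tags=2)
    scores = W["dog"] + W["the"] + W["barks"]
    print(int(scores.argmax()))  # expected: 0 (NOUN)

This mirrors the update rule shown on the slides; a full averaged perceptron would additionally average the weights over all updates.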
  20.–23. Predictive quadrant (built up across four slides):
      generic objective, negligible task data: zero/few-shot in-context learning
      generic objective, task data: fine-tuned in-context learning
      task objective, no task-specific labels: nothing
      task objective, task data: fine-tuned transfer learning (BERT etc.)
  24. Named Entity Recognition, CoNLL 2003 NER:
                                    F-Score   Speed (words/s)
      GPT-3.5 [1]                   78.6      < 100
      GPT-4 [1]                     83.5      < 100
      spaCy (RoBERTa-base)          91.6      4,000
      Flair                         93.1      1,000
      SOTA 2023 [2]                 94.6      1,000
      SOTA 2003 [3]                 88.8      > 20,000
      [1] Ashok and Lipton (2023), SOTA on few-shot prompting; [2] Wang et al. (2021); [3] Florian et al. (2003)
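
For context, and not something shown in the deck: given a trained spaCy pipeline and a labelled dev set, entity scores like the ones in this table can be computed roughly as follows (the paths are placeholders):

    import spacy
    from spacy.training import Corpus

    # Placeholder paths: a trained pipeline and a labelled dev set in .spacy format.
    nlp = spacy.load("training/model-best")
    corpus = Corpus("corpus/dev.spacy")
    examples = list(corpus(nlp))

    scores = nlp.evaluate(examples)
    print(scores["ents_p"], scores["ents_r"], scores["ents_f"])

The spacy evaluate command-line tool reports the same entity precision, recall and F-score.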
  25. Massive number of experiments: many tasks, lots of models (no GPT-4). Results way below task-specific models across the board.
  26. Found ChatGPT did better than crowd-workers on several text classification tasks. Accuracy still low against trained annotators; says more about crowd-worker methodology than about LLMs.
  27.–30. [Chart, highlighted step by step across four slides: text classification, few-shot GPT-3 vs. task-specific models; accuracy (65–100) against % of training examples (1%–100%) on SST2, AG News and Banking77. Callouts: LLM stays competitive on sentiment (binary task it understands); news model outperforms LLM with 1% of the training data; LLM does badly on Banking77 (too many labels).]
  31. [Chart: named entity recognition, zero-shot Claude 2 vs. task-specific CNN model on FabNER; accuracy (10–100) against # of examples (0–500). Callouts: task-specific model wins with 20 examples; few-shot greatly increases prompt lengths and doesn’t work well with many label types.]
  32. Humans are just weird hardware We have lots of devices

    you can schedule computation on. CPU, GPU, LLM, task worker, trained expert... Some devices are much more expensive than others. Use the expensive devices to compile programs to run on less expensive devices.
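
One concrete reading of this slide, as a hedged sketch rather than code from the talk: use the expensive device (an LLM) to draft labels, have a human correct them, then distil the corrected labels into a small task-specific model. The llm_suggest_label and human_review functions below are invented stand-ins, as are the label scheme and texts.

    import spacy
    from spacy.training import Example

    LABELS = ("BILLING", "OTHER")                       # invented label scheme

    def llm_suggest_label(text: str) -> str:
        # Stand-in for the expensive device: in practice, a prompt to an LLM API.
        return "BILLING" if "charge" in text.lower() or "payment" in text.lower() else "OTHER"

    def human_review(text: str, suggested: str) -> str:
        # Stand-in for the annotation step: a person accepts or corrects the
        # suggestion, e.g. in an annotation tool such as Prodigy.
        return suggested

    texts = [
        "Card payment declined at checkout",
        "Why was I charged twice this month?",
        "How do I reset my password?",
    ]

    # Expensive device proposes, human corrects: LLM-guided annotation.
    labelled = [(text, human_review(text, llm_suggest_label(text))) for text in texts]

    # Cheap device learns: distil the corrected labels into a small model.
    nlp = spacy.blank("en")
    nlp.add_pipe("textcat")
    examples = [
        Example.from_dict(
            nlp.make_doc(text),
            {"cats": {label: float(label == gold) for label in LABELS}},
        )
        for text, gold in labelled
    ]
    optimizer = nlp.initialize(lambda: examples)
    for epoch in range(20):
        nlp.update(examples, sgd=optimizer)

    print(nlp("My card was charged but the order failed").cats)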
  33. Scheduling computation on humans. High latency: let them get into a groove. Don’t thrash the cache: working memory is limited. Compile your program: put effort into creating the right stream of tasks.