
How many Labelled Examples do you need for a BERT-sized Model to Beat GPT-4 on Predictive Tasks?

Video: https://www.youtube.com/watch?v=3iaxLTKJROc

Large Language Models (LLMs) offer a new machine learning interaction paradigm: in-context learning. For a wide variety of generative tasks (e.g. summarisation, question answering, paraphrasing), this approach is clearly much better than relying on explicit labelled data. In-context learning can also be applied to predictive tasks such as text categorization and entity recognition, with few or no labelled exemplars.

But how does in-context learning actually compare to supervised approaches on those tasks? The key advantage is that you need less data, but how many labelled examples do you need on different problems before a BERT-sized model can beat GPT-4 in accuracy?

The answer might surprise you: models with fewer than 1B parameters are actually very good at classic predictive NLP, while in-context learning struggles on many problem shapes, especially tasks with many labels or that require structured prediction. Methods for improving in-context learning accuracy generally trade away even more speed, suggesting that distillation and LLM-guided annotation will be the most practical approaches.

Implementation of this approach is discussed with reference to the spaCy open-source library and the Prodigy annotation tool.
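
As a rough, minimal sketch of the supervised alternative (not from the talk; the label scheme and example texts are invented for illustration), here is how a small spaCy text classifier can be trained from a handful of labelled examples:

    import spacy
    from spacy.training import Example

    # A few labelled examples (texts and categories invented for illustration).
    train_data = [
        ("Card payment declined at checkout", {"cats": {"BILLING": 1.0, "OTHER": 0.0}}),
        ("How do I change my delivery address?", {"cats": {"BILLING": 0.0, "OTHER": 1.0}}),
    ]

    nlp = spacy.blank("en")
    nlp.add_pipe("textcat")                     # small task-specific text classifier
    examples = [
        Example.from_dict(nlp.make_doc(text), annotations)
        for text, annotations in train_data
    ]
    optimizer = nlp.initialize(lambda: examples)

    for epoch in range(20):                     # tiny training loop
        losses = {}
        nlp.update(examples, sgd=optimizer, losses=losses)

    doc = nlp("Why was I charged twice this month?")
    print(doc.cats)                             # score per category

In practice you would train from far more examples and use spaCy's config and CLI workflow, but the shape of the problem is the same: labelled examples in, a small task-specific model out.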

Matthew Honnibal

October 25, 2023

Transcript

  1. Matthew Honnibal (Explosion): How many labelled examples do you need for a BERT-sized model to beat GPT-4 on predictive tasks?
  2. spaCy (spacy.io): open-source library for industrial-strength natural language processing. 170m+ downloads. ChatGPT can write spaCy code!
  3. Prodigy Teams (prodigy.ai/teams): collaborative data development platform. [Screenshot: example users Alex Smith (Developer) and Kim Miller (Analyst); GPT-4 API integration, beta.]
  4. 1. Predictive tasks still matter. 2. In-context learning (prompts) is

    not optimal for predictive tasks. 3. Conceptual model and workflow for using labelled examples.
  5. Generative vs. Predictive tasks.
     Generative (human-readable output): single/multi-doc summarization, problem solving & reasoning, paraphrasing, style transfer, question answering.
     Predictive (machine-readable output): text classification, entity recognition, relation extraction, grammar & morphology, semantic parsing, coreference resolution, discourse structure.
  6.–11. [Diagram, built up one step per slide: the sentence “Hooli raises $5m to revolutionize search, led by ACME Ventures” annotated with COMPANY, MONEY and INVESTOR labels, linked to database records (IDs 5923214, 1681056). Pipeline steps added in turn: named entity recognition, entity disambiguation, custom database lookup, currency normalization, entity relation extraction.]
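
As a rough sketch of the machine-readable output this pipeline is building towards (my illustration, not code from the talk; the record fields and relation label are invented, and the pretrained model uses generic ORG/MONEY labels rather than the slide's custom COMPANY/INVESTOR scheme):

    import spacy

    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Hooli raises $5m to revolutionize search, led by ACME Ventures")

    for ent in doc.ents:                # named entity recognition
        print(ent.text, ent.label_)

    # The later steps (entity disambiguation, database lookup, currency
    # normalization, relation extraction) would turn those spans into a
    # machine-readable record, for example:
    record = {
        "company_id": 5923214,          # from disambiguation + database lookup
        "investor_id": 1681056,
        "amount_usd": 5_000_000,        # from currency normalization of "$5m"
        "relation": "INVESTMENT",       # invented relation label
    }

Each of these steps has a fixed output schema, which is exactly the kind of predictive task the deck argues small supervised models handle well.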
  12.–19. How classifiers used to work: the Averaged Perceptron. The same training code is shown on every slide, with one callout added per slide:

      def train_tagger(examples, n_tags):
          W = defaultdict(lambda: np.zeros(n_tags))
          for (word, prev, next), human_tag in examples:
              scores = W[word] + W[prev] + W[next]
              guess = scores.argmax()
              if guess != human_tag:
                  for feat in (word, prev, next):
                      W[feat][guess] -= 1
                      W[feat][human_tag] += 1

      Callouts, in order: examples = words, tags, contexts; the weights we’ll train; score each tag given weights & context; get best-scoring tag; if guess was wrong, adjust weights; decrease score for bad tag in this context; increase score for good tag in this context.
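
For readers who want to run the slide code, here is a self-contained, lightly adapted version (imports added, next renamed to avoid shadowing the builtin, a return value and a few training passes added); the toy data, feature scheme and tag set are invented for illustration:

    from collections import defaultdict
    import numpy as np

    def train_tagger(examples, n_tags, n_iter=5):
        """Perceptron-style tagger from the slides, adapted to return its weights."""
        W = defaultdict(lambda: np.zeros(n_tags))
        for _ in range(n_iter):                      # a few passes over the data
            for (word, prev, nxt), human_tag in examples:
                scores = W[word] + W[prev] + W[nxt]  # sum the feature weights
                guess = int(scores.argmax())         # best-scoring tag
                if guess != human_tag:               # wrong guess? adjust weights
                    for feat in (word, prev, nxt):
                        W[feat][guess] -= 1
                        W[feat][human_tag] += 1
        return W

    # Toy data, invented for illustration: ((word, prev word, next word), tag id),
    # with 0 = NOUN and 1 = VERB.
    examples = [
        (("dog", "the", "barks"), 0),
        (("barks", "dog", "."), 1),
        (("cat", "the", "sleeps"), 0),
        (("sleeps", "cat", "."), 1),
    ]
    W = train_tagger(examples, n_tags=2)
    scores = W["dog"] + W["the"] + W["barks"]
    print(int(scores.argmax()))  # expected: 0 (NOUN)

This mirrors the update rule shown on the slides; a full averaged perceptron would additionally average the weights over all updates.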
  20.–23. Predictive quadrant (built up across four slides):
      generic objective, negligible task data: zero/few-shot in-context learning
      generic objective, task data: fine-tuned in-context learning
      task objective, no task-specific labels: nothing
      task objective, task data: fine-tuned transfer learning (BERT etc.)
  24. Named Entity Recognition, CoNLL 2003 NER:
                                    F-Score   Speed (words/s)
      GPT-3.5 [1]                   78.6      < 100
      GPT-4 [1]                     83.5      < 100
      spaCy (RoBERTa-base)          91.6      4,000
      Flair                         93.1      1,000
      SOTA 2023 [2]                 94.6      1,000
      SOTA 2003 [3]                 88.8      > 20,000
      [1] Ashok and Lipton (2023), SOTA on few-shot prompting; [2] Wang et al. (2021); [3] Florian et al. (2003)
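
For context, and not something shown in the deck: given a trained spaCy pipeline and a labelled dev set, entity scores like the ones in this table can be computed roughly as follows (the paths are placeholders):

    import spacy
    from spacy.training import Corpus

    # Placeholder paths: a trained pipeline and a labelled dev set in .spacy format.
    nlp = spacy.load("training/model-best")
    corpus = Corpus("corpus/dev.spacy")
    examples = list(corpus(nlp))

    scores = nlp.evaluate(examples)
    print(scores["ents_p"], scores["ents_r"], scores["ents_f"])

The spacy evaluate command-line tool reports the same entity precision, recall and F-score.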
  25. Massive number of experiments: many tasks, lots of models (no GPT-4). Results way below task-specific models across the board.
  26. Found ChatGPT did better than crowd-workers on several text classification tasks. Accuracy still low against trained annotators; says more about crowd-worker methodology than about LLMs.
  27.–30. [Chart, highlighted step by step across four slides: text classification, few-shot GPT-3 vs. task-specific models; accuracy (65–100) against % of training examples (1%–100%) on SST2, AG News and Banking77. Callouts: LLM stays competitive on sentiment (binary task it understands); news model outperforms LLM with 1% of the training data; LLM does badly on Banking77 (too many labels).]
  31. [Chart: named entity recognition, zero-shot Claude 2 vs. task-specific CNN model on FabNER; accuracy (10–100) against # of examples (0–500). Callouts: task-specific model wins with 20 examples; few-shot greatly increases prompt lengths and doesn’t work well with many label types.]
  32. Humans are just weird hardware We have lots of devices

    you can schedule computation on. CPU, GPU, LLM, task worker, trained expert... Some devices are much more expensive than others. Use the expensive devices to compile programs to run on less expensive devices.
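
One concrete reading of this slide, as a hedged sketch rather than code from the talk: use the expensive device (an LLM) to draft labels, have a human correct them, then distil the corrected labels into a small task-specific model. The llm_suggest_label and human_review functions below are invented stand-ins, as are the label scheme and texts.

    import spacy
    from spacy.training import Example

    LABELS = ("BILLING", "OTHER")                       # invented label scheme

    def llm_suggest_label(text: str) -> str:
        # Stand-in for the expensive device: in practice, a prompt to an LLM API.
        return "BILLING" if "charge" in text.lower() or "payment" in text.lower() else "OTHER"

    def human_review(text: str, suggested: str) -> str:
        # Stand-in for the annotation step: a person accepts or corrects the
        # suggestion, e.g. in an annotation tool such as Prodigy.
        return suggested

    texts = [
        "Card payment declined at checkout",
        "Why was I charged twice this month?",
        "How do I reset my password?",
    ]

    # Expensive device proposes, human corrects: LLM-guided annotation.
    labelled = [(text, human_review(text, llm_suggest_label(text))) for text in texts]

    # Cheap device learns: distil the corrected labels into a small model.
    nlp = spacy.blank("en")
    nlp.add_pipe("textcat")
    examples = [
        Example.from_dict(
            nlp.make_doc(text),
            {"cats": {label: float(label == gold) for label in LABELS}},
        )
        for text, gold in labelled
    ]
    optimizer = nlp.initialize(lambda: examples)
    for epoch in range(20):
        nlp.update(examples, sgd=optimizer)

    print(nlp("My card was charged but the order failed").cats)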
  33. Scheduling computation on humans. High latency: let them get into a groove. Don’t thrash the cache: working memory is limited. Compile your program: put effort into creating the right stream of tasks.