How many Labelled Examples do you need for a BERT-sized Model to Beat GPT-4 on Predictive Tasks?

Video: https://www.youtube.com/watch?v=3iaxLTKJROc

Large Language Models (LLMs) offer a new machine learning interaction paradigm: in-context learning. For a wide variety of generative tasks (e.g. summarisation, question answering, paraphrasing), this approach clearly outperforms approaches that rely on explicit labelled data. In-context learning can also be applied to predictive tasks such as text categorization and entity recognition, with few or no labelled examples.

But how does in-context learning actually compare to supervised approaches on those tasks? Its key advantage is that you need less data, but how many labelled examples do you need on different problems before a BERT-sized model can beat GPT-4 in accuracy?

The answer might surprise you: models with fewer than 1b parameters are actually very good at classic predictive NLP, while in-context learning struggles on many problem shapes, especially tasks with many labels or that require structured prediction. Methods for improving in-context learning accuracy increasingly trade speed for accuracy, suggesting that distillation and LLM-guided annotation will be the most practical approaches.

Implementation of this approach is discussed with reference to the spaCy open-source library and the Prodigy annotation tool.

Matthew Honnibal

October 25, 2023

Transcript

  1. Matthew Honnibal, Explosion
    How many labelled examples
    do you need for a BERT-sized
    model to beat GPT-4 on
    predictive tasks?

  2. How I’m using
    GPT-4 in ChatGPT
    debugging cloud permissions
    navigating Linux tools
    lots more

  3. spaCy
    spacy.io
    Open-source library for
    industrial-strength natural
    language processing
    170m+ downloads
    ChatGPT can write spaCy code!

  4. Prodigy
    prodigy.ai
    Modern scriptable
    annotation tool for
    machine learning
    developers
    800+ companies
    9k+ users

  5. Prodigy Teams
    prodigy.ai/teams
    BETA
    Collaborative data
    development platform
    Alex Smith, Developer
    Kim Miller, Analyst
    GPT-4 API

  6. 1. Predictive tasks still matter.
    2. In-context learning (prompts) is
    not optimal for predictive tasks.
    3. Conceptual model and workflow
    for using labelled examples.

  7. Generative
    complements predictive.
    It doesn’t replace it.

  8. Generative (human-readable):
    single/multi-doc summarization,
    problem solving, paraphrasing,
    reasoning, style transfer,
    question answering
    Predictive (machine-readable):
    text classification,
    entity recognition + relation extraction,
    grammar & morphology,
    semantic parsing,
    coreference resolution,
    discourse structure

  9. “Hooli raises $5m to
    revolutionize search,
    led by ACME Ventures”
    extracted into a database:
    entity labels COMPANY, COMPANY,
    MONEY, INVESTOR; record IDs
    5923214, 1681056
    steps: named entity recognition,
    entity disambiguation,
    custom database lookup,
    currency normalization,
    entity relation extraction
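
    To make the first step concrete: a minimal named entity recognition
    sketch with spaCy. This assumes the pretrained en_core_web_sm
    pipeline is installed; the exact spans and labels depend on the model.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small pretrained English pipeline
    doc = nlp("Hooli raises $5m to revolutionize search, "
              "led by ACME Ventures")
    for ent in doc.ents:
        # expect spans like Hooli, $5m, ACME Ventures,
        # with labels such as ORG and MONEY
        print(ent.text, ent.label_)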

  10. How good is
    in-context learning at
    predictive tasks?

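    To make “in-context learning” concrete: the model sees task
    instructions and a handful of labelled examples inside the prompt,
    and no weights are updated. A minimal sketch, assuming a hypothetical
    llm() helper that sends a prompt to an LLM API and returns the
    completion:

    def classify(text, examples, labels):
        # build a few-shot prompt: instructions, labelled examples, query
        prompt = f"Classify each text as one of: {', '.join(labels)}\n\n"
        for example_text, label in examples:
            prompt += f"Text: {example_text}\nLabel: {label}\n\n"
        prompt += f"Text: {text}\nLabel:"
        return llm(prompt)  # hypothetical LLM API call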

  11. How classifiers used to work:
    the Averaged Perceptron.
    from collections import defaultdict
    import numpy as np

    def train_tagger(examples, n_tags):
        # examples = words, tags, contexts
        # W holds the weights we'll train
        W = defaultdict(lambda: np.zeros(n_tags))
        for (word, prev, next), human_tag in examples:
            # score each tag given weights & context
            scores = W[word] + W[prev] + W[next]
            # get best-scoring tag
            guess = scores.argmax()
            # if guess was wrong, adjust weights
            if guess != human_tag:
                for feat in (word, prev, next):
                    # decrease score for bad tag in this context
                    W[feat][guess] -= 1
                    # increase score for good tag in this context
                    W[feat][human_tag] += 1
        return W
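
    A toy run of the trainer above, with made-up feature tuples and two
    made-up tag ids, just to show the expected shapes:

    # each example pairs (word, prev word, next word) with a gold tag id
    examples = [
        (("Hooli", "<s>", "raises"), 0),   # 0 = noun-ish tag (made up)
        (("raises", "Hooli", "$5m"), 1),   # 1 = verb-ish tag (made up)
    ]
    W = train_tagger(examples * 10, n_tags=2)  # repeats act as extra epochs
    scores = W["raises"] + W["Hooli"] + W["$5m"]
    print(scores.argmax())  # 1, once the weights separate the two contexts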

  12. Predictive quadrant
    generic objective, negligible task data:
    zero/few-shot in-context learning
    generic objective, task data:
    fine-tuned in-context learning
    task objective, no task-specific labels:
    nothing
    task objective, task data:
    fine-tuned transfer learning, BERT etc.

  13. Named Entity Recognition:
    CoNLL 2003 NER benchmark
                            F-Score   Speed (words/s)
    GPT-3.5 [1]             78.6      < 100
    GPT-4 [1]               83.5      < 100
    spaCy (RoBERTa-base)    91.6      4,000
    Flair                   93.1      1,000
    SOTA 2023 [2]           94.6      1,000
    SOTA 2003 [3]           88.8      > 20,000
    [1] Ashok and Lipton (2023), SOTA on few-shot prompting
    [2] Wang et al. (2021)
    [3] Florian et al. (2003)

  14. massive number of experiments:
    many tasks, lots of models
    no GPT-4
    results way below task-specific
    models across the board

  15. found ChatGPT did better than
    crowd-workers on several text
    classification tasks
    accuracy still low against
    trained annotators
    says more about crowd-worker
    methodology than LLMs

  16. fine-tuning an LLM for few-shot NER works
    BERT-base still competitive overall
    ChatGPT scores poorly

  17. text classification:
    few-shot GPT-3 vs. task-specific models
    [chart: accuracy vs. % of training
    examples for SST2, AG News, and
    Banking77, with a GPT-3 baseline]
    LLM stays competitive on sentiment
    (binary task it understands)
    news model outperforms LLM with
    1% of the training data
    LLM does badly on Banking77
    (too many labels)

  18. named entity recognition:
    zero-shot Claude 2 vs. task-specific
    CNN model
    [chart: accuracy vs. # of training
    examples on FabNER, with a Claude 2
    baseline]
    task-specific model wins with
    20 examples
    few-shot greatly increases prompt
    lengths and doesn’t work well with
    many label types

  19. How to think
    about this and
    what to do

  20. Humans are just
    weird hardware
    We have lots of devices you can schedule computation on.
    CPU, GPU, LLM, task worker, trained expert...
    Some devices are much more expensive than others.
    Use the expensive devices to compile programs to run on
    less expensive devices.

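    Read as code, the idea is distillation: the expensive device labels
    data once, offline, and a cheap model serves the traffic. A minimal
    sketch, where load_unlabelled_texts() and gpt4_label() are
    hypothetical helpers standing in for a corpus and an LLM API call:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = load_unlabelled_texts()   # hypothetical data loader
    # expensive device: the LLM labels each text once
    labels = [gpt4_label(t, ["positive", "negative"]) for t in texts]
    # cheap device: a small supervised model trained on those labels
    vectorizer = TfidfVectorizer()
    classifier = LogisticRegression().fit(
        vectorizer.fit_transform(texts), labels
    )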

  21. Program to the
    hardware you’re using:
    GPT-4 API
    Alex Smith, Developer
    Kim Miller, Annotator

  22. Scheduling computation
    on humans
    High latency. Let them get into a groove.
    Don’t thrash the cache. Working memory is limited.
    Compile your program: put effort into creating
    the right stream of tasks.

  23. thank you!
    Explosion: explosion.ai
    spaCy: spacy.io
    Prodigy: prodigy.ai
    Twitter: @honnibal
    Mastodon: @[email protected]
    Bluesky: @honnibal.bsky.social
    LinkedIn