
Building new NLP solutions with spaCy and Prodigy

Commercial machine learning projects are currently like start-ups: many projects fail, but some are extremely successful, justifying the total investment. While some people will tell you to "embrace failure", I say failure sucks — so what can we do to fight it? In this talk, I will discuss how to address some of the most likely causes of failure for new Natural Language Processing (NLP) projects. My main recommendation is to take an iterative approach: don't assume you know what your pipeline should look like, let alone your annotation schemes or model architectures. I will also discuss a few tips for figuring out what's likely to work, along with a few common mistakes. To keep the advice well-grounded, I will refer specifically to our open-source library spaCy, and our commercial annotation tool Prodigy.

Matthew Honnibal

July 07, 2018

Transcript

  1. Building new NLP
    solutions with spaCy
    and Prodigy
    Matthew Honnibal
    Explosion AI


2. Explosion AI is a digital studio
    specialising in Artificial Intelligence
    and Natural Language Processing.
    spaCy: open-source library for industrial-strength
    Natural Language Processing
    Thinc: spaCy’s next-generation Machine Learning
    library for deep learning with text
    Prodigy: a radically efficient data collection and
    annotation tool, powered by active learning
    Coming soon: pre-trained, customisable models
    for a variety of languages and domains


3. Matthew Honnibal
    CO-FOUNDER
    PhD in Computer Science in 2009.
    10 years publishing research on state-of-the-art
    natural language understanding systems.
    Left academia in 2014 to develop spaCy.
    Ines Montani
    CO-FOUNDER
    Programmer and front-end developer with a
    degree in media science and linguistics.
    Has been working on spaCy since its first
    release. Lead developer of Prodigy.


4. “I don’t get it. Can you
    explain like I’m five?”
    Think of us as a boutique kitchen.
    free recipes published online = open-source software
    catering for select events = consulting
    a line of kitchen gadgets = downloadable tools
    soon: a line of fancy sauces and spice mixes you
    can use at home = pre-trained models


  5. NLP projects are like
    start-ups: they fail a lot.


6.–10. How to maximize your NLP
    project’s risk of failure
    1. Imagineer. Decide what your application
    ought to do. Be ambitious! Nobody
    changed the world saying
    “uh, will that work?”
    2. Forecast. Figure out what accuracy
    you’ll need. If you’re not sure
    here, just say 90%.
    3. Outsource. Pay someone else to gather your
    data. Think carefully about your
    accuracy requirements, and
    then ask for 10,000 rows.
    4. Wire. Implement your network.
    This is the fun part! Tensor all your
    flows; descend every gradient!
    5. Ship. Put it all together. If it doesn’t
    work, maybe blame the intern?

  11. Failure sucks.


12.–16. Machine Learning
    Hierarchy of Needs
    5. Understanding how the model will work
    in the larger application or business process,
    including tolerance for inaccuracies, latencies, etc.
    4. Annotation scheme and corpus construction:
    categories that will be easy to annotate
    consistently, and easy for the model to learn
    3. Consistent and clean data:
    attentive annotators, good
    quality control processes
    2. Model architecture: smart choices, no bugs
    1. Optimization: given by hyper-parameters,
    initialization tricks, sweat and toil

17. A difficult chicken-and-egg problem:
    product vision, annotation scheme,
    labelled data, training & evaluation,
    accuracy estimate

  18. You need to iterate on
    your code and your data.


  19. Don’t assume — iterate!
    What models should we train to meet the
    business needs?
    Does our annotation scheme make sense?
    Does the problem look easy, or hard?
    What can we do to improve fault tolerance?


20. Problem #1 It’s easy to make modelling
    decisions that are simple,
    obvious and wrong.
    Requirements: We’re building a crime database based on
    news reports. We want to label the following:
    victim name, perpetrator name, crime location,
    offence date, arrest date.

21. Solution #1 Compose generic models
    into novel solutions
    Generic categories like LOCATION and PERSON
    let you use pre-trained models.
    Annotate events and topics at the sentence
    (or paragraph or document) level.
    Annotate roles by word or entity.
    Use the dependency parse to find boundaries.
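A minimal sketch of the composition idea, assuming spaCy is installed. The slot names (`participant`, `crime_location`, `offence_date`) and the hand-set entity spans are illustrative stand-ins: a loaded pipeline such as `en_core_web_sm` would predict the generic entities itself, and distinguishing victim from perpetrator would additionally use the dependency parse.

```python
import spacy
from spacy.tokens import Span

# Hypothetical mapping from generic NER labels to crime-database slots.
# (Slot names are illustrative, not part of spaCy.)
ROLE_MAP = {"PERSON": "participant", "GPE": "crime_location", "DATE": "offence_date"}

nlp = spacy.blank("en")  # stand-in; in practice: spacy.load("en_core_web_sm")
doc = nlp("John Doe robbed a bank in Chicago on Tuesday")
# With a loaded model, doc.ents would be predicted automatically;
# we set the spans by hand so the sketch runs without a downloaded model.
doc.ents = [
    Span(doc, 0, 2, label="PERSON"),   # "John Doe"
    Span(doc, 6, 7, label="GPE"),      # "Chicago"
    Span(doc, 8, 9, label="DATE"),     # "Tuesday"
]

def extract_slots(doc):
    """Compose generic NER output into the custom annotation scheme."""
    return {ROLE_MAP[ent.label_]: ent.text for ent in doc.ents if ent.label_ in ROLE_MAP}

print(extract_slots(doc))
```

The point of the design: the model only ever learns generic, well-resourced categories, and the application-specific scheme lives in a thin mapping layer you can iterate on cheaply.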


22. Problem #2 Big annotation projects make
    evidence expensive to collect
    For good project plans, you need evidence.
    To get evidence, you need annotations.
    You often don’t know if it works until you try it.
    You’re unlikely to be right the first time.
    Worry less about scaling up, and more about
    scaling down. Iteration needs low overhead.


23. Solution #2 Run your own
    micro-experiments
    Active learning and good tooling can make
    experiments faster.
    Working with the examples yourself lets you
    understand the problem and fix the label scheme
    before you scale up.
    A/B evaluation lets you measure small changes
    very quickly – also works on generative tasks!
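The A/B idea can be sketched in a few lines of plain Python. All names here are hypothetical (this is not Prodigy's API): two systems' outputs are shown in random order so the judge can't tell which system produced which, and each preference is tallied against the system it came from.

```python
import random

def ab_evaluate(pairs, choose):
    """Blind A/B evaluation sketch.

    pairs:  list of (output_a, output_b) from two competing systems
    choose: callable that receives the two outputs (in random order)
            and returns the preferred one
    """
    wins = {"A": 0, "B": 0}
    for out_a, out_b in pairs:
        # Randomize presentation order so the judge is blind to the source.
        first, second = (out_a, out_b) if random.random() < 0.5 else (out_b, out_a)
        picked = choose(first, second)
        wins["A" if picked == out_a else "B"] += 1
    return wins

# Toy usage: an automatic "judge" that always prefers the shorter output.
pairs = [("a long translation here", "short"), ("verbose text", "terse")]
print(ab_evaluate(pairs, choose=lambda x, y: min(x, y, key=len)))  # {'A': 0, 'B': 2}
```

In a real micro-experiment the `choose` callable would be a human annotator's click; because each judgment is a single forced choice, you get a usable signal from very few examples.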


24. Problem #3 It’s hard to get good data by
    boring the shit out of
    underpaid people
    Why are we “designing around” this?
    “Taking a HIT: Designing around Rejection, Mistrust, Risk,
    and Workers’ Experiences in Amazon Mechanical Turk”
    (McInnis et al., 2016)
    It’s not just Mechanical Turk — the larger and more
    transient the annotation team, the harder it is to
    get quality data.


25. Solution #3 Smaller annotation teams,
    better annotation workflows
    Break complex tasks down into smaller pieces:
    easier for the model to learn, easier for a human to label.
    Your annotators don’t need to work the same way
    your model does.
    Semi-automatic workflows are often more accurate.
    Consider moving annotation in-house.
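One way to "break tasks into smaller pieces" is to recast a multi-label task as a stream of yes/no questions, roughly in the spirit of a binary annotation interface. The function and field names below are illustrative, not Prodigy's API:

```python
# Sketch: turn one "pick all applicable labels" task into several
# binary accept/reject tasks -- one label decision per question.
def binary_questions(texts, labels):
    """Yield one yes/no annotation task per (text, label) pair."""
    for text in texts:
        for label in labels:
            yield {"text": text, "label": label, "answer": None}

tasks = list(binary_questions(
    ["Police arrested a suspect on Tuesday."],
    ["CRIME", "SPORTS", "POLITICS"],
))
print(len(tasks))  # 3: each label becomes its own quick decision
```

Each question now takes a second to answer, the annotator never holds the full label scheme in their head, and disagreements localize to a single label instead of an entire multi-way judgment.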


26. You can’t solve this problem
    analytically —
    so solve it iteratively.
    (the chicken-and-egg cycle again:
    product vision, annotation scheme,
    labelled data, training & evaluation,
    accuracy estimate)

27. Thanks!
    Explosion AI: explosion.ai
    Follow us on Twitter:
    @honnibal
    @explosion_ai