
Building new NLP solutions with spaCy and Prodigy

Commercial machine learning projects are currently like start-ups: many projects fail, but some are extremely successful, justifying the total investment. While some people will tell you to "embrace failure", I say failure sucks — so what can we do to fight it? In this talk, I will discuss how to address some of the most likely causes of failure for new Natural Language Processing (NLP) projects. My main recommendation is to take an iterative approach: don't assume you know what your pipeline should look like, let alone your annotation schemes or model architectures. I will also discuss a few tips for figuring out what's likely to work, along with a few common mistakes. To keep the advice well-grounded, I will refer specifically to our open-source library spaCy, and our commercial annotation tool Prodigy.

Matthew Honnibal

July 07, 2018

Transcript

1. Explosion AI is a digital studio specialising in Artificial Intelligence and Natural Language Processing. Its products: spaCy, an open-source library for industrial-strength Natural Language Processing; spaCy’s next-generation Machine Learning library for deep learning with text; coming soon, pre-trained, customisable models for a variety of languages and domains; and Prodigy, a radically efficient data collection and annotation tool, powered by active learning.
2. Matthew Honnibal, co-founder: PhD in Computer Science in 2009. 10 years publishing research on state-of-the-art natural language understanding systems. Left academia in 2014 to develop spaCy. Ines Montani, co-founder: programmer and front-end developer with a degree in media science and linguistics. Has been working on spaCy since its first release. Lead developer of Prodigy.
3. “I don’t get it. Can you explain like I’m five?” Think of us as a boutique kitchen: free recipes published online (open-source software), catering for select events (consulting), a line of kitchen gadgets (downloadable tools), and soon, a line of fancy sauces and spice mixes you can use at home (pre-trained models).
4. How to maximize your NLP project’s risk of failure. Step 1: Imagineer. Decide what your application ought to do. Be ambitious! Nobody changed the world saying “uh, will that work?”
5. How to maximize your NLP project’s risk of failure. Step 2: Forecast. Figure out what accuracy you’ll need. If you’re not sure here, just say 90%.
6. How to maximize your NLP project’s risk of failure. Step 3: Outsource. Pay someone else to gather your data. Think carefully about your accuracy requirements, and then ask for 10,000 rows.
7. How to maximize your NLP project’s risk of failure. Step 4: Wire. Implement your network. This is the fun part! Tensor all your flows; descend every gradient!
8. How to maximize your NLP project’s risk of failure. Step 5: Ship. Put it all together. If it doesn’t work, maybe blame the intern?
9.–13. Machine Learning Hierarchy of Needs, a pyramid annotated one level per slide, from the base up:
5. Understanding how the model will work in the larger application or business process: including tolerance for inaccuracies, latencies, etc.
4. Annotation scheme and corpus construction: categories that will be easy to annotate consistently, and easy for the model to learn.
3. Consistent and clean data: attentive annotators, good quality control processes.
2. Model architecture: smart choices, no bugs.
1. Optimization: given by hyper-parameters, initialization tricks, sweat and toil.
14. Don’t assume — iterate! What models should we train to meet the business needs? Does our annotation scheme make sense? Does the problem look easy, or hard? What can we do to improve fault tolerance?
15. Problem #1. Requirements: we’re building a crime database based on news reports. We want to label the following: victim name, perpetrator name, crime location, offence date, arrest date. It’s easy to make modelling decisions that are simple, obvious and wrong.
16. Solution #1: Compose generic models into novel solutions. Generic categories like PERSON and LOCATION let you use pre-trained models. Annotate events and topics at the sentence (or paragraph or document) level. Annotate roles by word or entity. Use the dependency parse to find boundaries.
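To make the composition concrete, here is a minimal sketch (not the pipeline from the talk; the example sentence and the arrest rule are invented for illustration) that uses a pre-trained spaCy model’s generic entities plus the dependency parse to assign a role:

```python
# Minimal sketch: reuse generic pre-trained categories (PERSON, GPE) instead
# of training custom "victim"/"perpetrator" entity types from scratch.
# Assumes the model is installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Officers arrested John Smith in Berlin on Tuesday.")

# The pre-trained NER provides the generic building blocks for free.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "John Smith" PERSON, "Berlin" GPE

# Roles can then be derived from the dependency parse: here, a PERSON
# entity that is the direct object of "arrest" is a candidate arrestee.
for ent in doc.ents:
    if (ent.label_ == "PERSON" and ent.root.dep_ == "dobj"
            and ent.root.head.lemma_ == "arrest"):
        print("possible arrestee:", ent.text)
```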
17. Problem #2: Big annotation projects make evidence expensive to collect. For good project plans, you need evidence. To get evidence, you need annotations. You often don’t know if it works until you try it. You’re unlikely to be right the first time. Worry less about scaling up, and more about scaling down. Iteration needs low overhead.
18. Solution #2: Run your own micro-experiments. Active learning and good tooling can make experiments faster. Working with the examples yourself lets you understand the problem and fix the label scheme before you scale up. A/B evaluation lets you measure small changes very quickly – it also works on generative tasks!
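As a rough illustration of the active-learning idea (a sketch under assumptions: the model name and example texts are hypothetical placeholders, and this is not Prodigy’s internal code), rank unlabelled examples by the model’s uncertainty and annotate those first:

```python
# Uncertainty sampling sketch: surface the examples the model is least
# sure about, so annotation effort goes where it helps most.
import spacy

nlp = spacy.load("my_textcat_model")  # assumption: a pipeline with a textcat

texts = [
    "Police arrested a suspect downtown last night.",
    "The new bakery on Main Street opens tomorrow.",
    "Two men were charged after the warehouse fire.",
]

def most_uncertain(texts, n=2):
    """Return the n texts whose best category score is closest to 0.5."""
    scored = []
    for doc in nlp.pipe(texts):
        margin = abs(max(doc.cats.values()) - 0.5)  # small margin = uncertain
        scored.append((margin, doc.text))
    scored.sort()  # smallest margin first
    return [text for _, text in scored[:n]]

# Feed these to your annotation tool first; skip what the model already knows.
print(most_uncertain(texts))
```

Prodigy’s teach-style recipes apply the same principle with a binary accept/reject interface, which keeps each annotation decision fast.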
19. Problem #3: It’s hard to get good data by boring the shit out of underpaid people. Why are we “designing around” this? See “Taking a HIT: Designing around Rejection, Mistrust, Risk, and Workers’ Experiences in Amazon Mechanical Turk” (McInnis et al., 2016). It’s not just Mechanical Turk — the larger and more transient the annotation team, the harder it is to get quality data.
20. Solution #3: Smaller annotation teams, better annotation workflows. Break complex tasks down into smaller pieces: easier for the model to learn, easier for a human to label. Your annotators don’t need to work the same way your model does. Semi-automatic workflows are often more accurate. Consider moving annotation in-house.
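A sketch of what such decomposition can look like (the role table, task format, and helper function are hypothetical, not a Prodigy API): instead of asking annotators to fill in all five crime fields per article, emit one yes/no question per candidate entity and role:

```python
# Sketch: decompose "label five crime fields" into single yes/no questions.
# The role table and task format are invented for illustration; Prodigy's
# binary accept/reject tasks are similar in spirit.
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical mapping from generic entity labels to candidate roles.
ROLES_BY_LABEL = {
    "PERSON": ["victim name", "perpetrator name"],
    "GPE": ["crime location"],
    "LOC": ["crime location"],
    "DATE": ["offence date", "arrest date"],
}

def binary_tasks(text):
    """Yield one simple yes/no annotation task per entity-role pair."""
    doc = nlp(text)
    for ent in doc.ents:
        for role in ROLES_BY_LABEL.get(ent.label_, []):
            # Each task is a single decision: far easier to answer
            # consistently than free-form labelling of a whole article.
            yield {
                "text": text,
                "span": (ent.start_char, ent.end_char),
                "question": f"Is '{ent.text}' the {role}?",
            }

for task in binary_tasks("Officers arrested John Smith in Berlin on Tuesday."):
    print(task["question"])
```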
21. The iteration loop: product vision → annotation scheme → labelled data → training & evaluation → accuracy estimate → back to product vision. You can’t solve this problem analytically — so solve it iteratively.