
Building new NLP solutions with spaCy and Prodigy

Commercial machine learning projects are currently like start-ups: many projects fail, but some are extremely successful, justifying the total investment. While some people will tell you to "embrace failure", I say failure sucks — so what can we do to fight it? In this talk, I will discuss how to address some of the most likely causes of failure for new Natural Language Processing (NLP) projects. My main recommendation is to take an iterative approach: don't assume you know what your pipeline should look like, let alone your annotation schemes or model architectures. I will also discuss a few tips for figuring out what's likely to work, along with a few common mistakes. To keep the advice well-grounded, I will refer specifically to our open-source library spaCy, and our commercial annotation tool Prodigy.

Matthew Honnibal

July 07, 2018

Transcript

  1. Building new NLP
    solutions with spaCy
    and Prodigy
    Matthew Honnibal
    Explosion AI


2. Explosion AI is a digital studio
    specialising in Artificial Intelligence
    and Natural Language Processing.
    spaCy: open-source library for industrial-strength
    Natural Language Processing
    Thinc: spaCy’s next-generation Machine Learning
    library for deep learning with text
    Prodigy: a radically efficient data collection and
    annotation tool, powered by active learning
    Coming soon: pre-trained, customisable models
    for a variety of languages and domains


3. Matthew Honnibal
    CO-FOUNDER
    PhD in Computer Science in 2009.
    10 years publishing research on state-of-the-art
    natural language understanding systems.
    Left academia in 2014 to develop spaCy.
    Ines Montani
    CO-FOUNDER
    Programmer and front-end developer with a
    degree in media science and linguistics.
    Has been working on spaCy since its first
    release. Lead developer of Prodigy.


4. “I don’t get it. Can you
    explain like I’m five?”
    Think of us as a boutique kitchen.
    free recipes published online = open-source software
    catering for select events = consulting
    a line of kitchen gadgets = downloadable tools
    soon: a line of fancy sauces and spice mixes you
    can use at home = pre-trained models


  5. NLP projects are like
    start-ups: they fail a lot.


6.–10. How to maximize your NLP
    project’s risk of failure
    1. Imagineer. Decide what your application
    ought to do. Be ambitious! Nobody
    changed the world saying
    “uh, will that work?”
    2. Forecast. Figure out what accuracy
    you’ll need. If you’re not sure
    here, just say 90%.
    3. Outsource. Pay someone else to gather your
    data. Think carefully about your
    accuracy requirements, and
    then ask for 10,000 rows.
    4. Wire. Implement your network.
    This is the fun part! Tensor all your
    flows; descend every gradient!
    5. Ship. Put it all together. If it doesn’t
    work, maybe blame the intern?

  11. Failure sucks.


12.–16. Machine Learning
    Hierarchy of Needs
    5. Understanding how the model will work
    in the larger application or business process,
    including tolerance for inaccuracies, latencies, etc.
    4. Annotation scheme and corpus construction:
    categories that will be easy to annotate
    consistently, and easy for the model to learn
    3. Consistent and clean data:
    attentive annotators, good
    quality control processes
    2. Model architecture: smart choices, no bugs
    1. Optimization: given by hyper-parameters,
    initialization tricks, sweat and toil

17. A difficult chicken-and-egg problem:
    product vision, annotation scheme,
    labelled data, training & evaluation,
    accuracy estimate

  18. You need to iterate on
    your code and your data.


  19. Don’t assume — iterate!
    What models should we train to meet the
    business needs?
    Does our annotation scheme make sense?
    Does the problem look easy, or hard?
    What can we do to improve fault tolerance?


20. Problem #1 It’s easy to make modelling
    decisions that are simple,
    obvious and wrong.
    Requirements: We’re building a crime database based on
    news reports. We want to label the following:
    victim name, perpetrator name, crime location,
    offence date, arrest date.

21. Solution #1 Compose generic models
    into novel solutions
    Generic categories like LOCATION and PERSON
    let you use pre-trained models.
    Annotate events and topics at the sentence
    (or paragraph or document) level.
    Annotate roles by word or entity.
    Use the dependency parse to find boundaries.
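A minimal sketch of the composition idea, assuming spaCy is installed. The slot names (`participant`, `crime_location`, `offence_date`) and the hand-set entity spans are illustrative stand-ins: a loaded pipeline such as `en_core_web_sm` would predict the generic entities itself, and distinguishing victim from perpetrator would additionally use the dependency parse.

```python
import spacy
from spacy.tokens import Span

# Hypothetical mapping from generic NER labels to crime-database slots.
# (Slot names are illustrative, not part of spaCy.)
ROLE_MAP = {"PERSON": "participant", "GPE": "crime_location", "DATE": "offence_date"}

nlp = spacy.blank("en")  # stand-in; in practice: spacy.load("en_core_web_sm")
doc = nlp("John Doe robbed a bank in Chicago on Tuesday")
# With a loaded model, doc.ents would be predicted automatically;
# we set the spans by hand so the sketch runs without a downloaded model.
doc.ents = [
    Span(doc, 0, 2, label="PERSON"),   # "John Doe"
    Span(doc, 6, 7, label="GPE"),      # "Chicago"
    Span(doc, 8, 9, label="DATE"),     # "Tuesday"
]

def extract_slots(doc):
    """Compose generic NER output into the custom annotation scheme."""
    return {ROLE_MAP[ent.label_]: ent.text for ent in doc.ents if ent.label_ in ROLE_MAP}

print(extract_slots(doc))
```

The point of the design: the model only ever learns generic, well-resourced categories, and the application-specific scheme lives in a thin mapping layer you can iterate on cheaply.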


22. Problem #2 Big annotation projects make
    evidence expensive to collect
    For good project plans, you need evidence.
    To get evidence, you need annotations.
    You often don’t know if it works until you try it.
    You’re unlikely to be right the first time.
    Worry less about scaling up, and more about
    scaling down. Iteration needs low overhead.


23. Solution #2 Run your own
    micro-experiments
    Active learning and good tooling can make
    experiments faster.
    Working with the examples yourself lets you
    understand the problem and fix the label scheme
    before you scale up.
    A/B evaluation lets you measure small changes
    very quickly – also works on generative tasks!
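The A/B idea can be sketched in a few lines of plain Python. All names here are hypothetical (this is not Prodigy's API): two systems' outputs are shown in random order so the judge can't tell which system produced which, and each preference is tallied against the system it came from.

```python
import random

def ab_evaluate(pairs, choose):
    """Blind A/B evaluation sketch.

    pairs:  list of (output_a, output_b) from two competing systems
    choose: callable that receives the two outputs (in random order)
            and returns the preferred one
    """
    wins = {"A": 0, "B": 0}
    for out_a, out_b in pairs:
        # Randomize presentation order so the judge is blind to the source.
        first, second = (out_a, out_b) if random.random() < 0.5 else (out_b, out_a)
        picked = choose(first, second)
        wins["A" if picked == out_a else "B"] += 1
    return wins

# Toy usage: an automatic "judge" that always prefers the shorter output.
pairs = [("a long translation here", "short"), ("verbose text", "terse")]
print(ab_evaluate(pairs, choose=lambda x, y: min(x, y, key=len)))  # {'A': 0, 'B': 2}
```

In a real micro-experiment the `choose` callable would be a human annotator's click; because each judgment is a single forced choice, you get a usable signal from very few examples.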


24. Problem #3 It’s hard to get good data by
    boring the shit out of
    underpaid people
    Why are we “designing around” this?
    “Taking a HIT: Designing around Rejection, Mistrust, Risk,
    and Workers’ Experiences in Amazon Mechanical Turk”
    (McInnis et al., 2016)
    It’s not just Mechanical Turk — the larger and more
    transient the annotation team, the harder it is to
    get quality data.


25. Solution #3 Smaller annotation teams,
    better annotation workflows
    Break complex tasks down into smaller pieces:
    easier for the model to learn, easier for a human to label.
    Your annotators don’t need to work the same way
    your model does.
    Semi-automatic workflows are often more accurate.
    Consider moving annotation in-house.
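One way to "break tasks into smaller pieces" is to recast a multi-label task as a stream of yes/no questions, roughly in the spirit of a binary annotation interface. The function and field names below are illustrative, not Prodigy's API:

```python
# Sketch: turn one "pick all applicable labels" task into several
# binary accept/reject tasks -- one label decision per question.
def binary_questions(texts, labels):
    """Yield one yes/no annotation task per (text, label) pair."""
    for text in texts:
        for label in labels:
            yield {"text": text, "label": label, "answer": None}

tasks = list(binary_questions(
    ["Police arrested a suspect on Tuesday."],
    ["CRIME", "SPORTS", "POLITICS"],
))
print(len(tasks))  # 3: each label becomes its own quick decision
```

Each question now takes a second to answer, the annotator never holds the full label scheme in their head, and disagreements localize to a single label instead of an entire multi-way judgment.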


26. You can’t solve this problem
    analytically —
    so solve it iteratively.
    (the chicken-and-egg cycle again:
    product vision, annotation scheme,
    labelled data, training & evaluation,
    accuracy estimate)

27. Thanks!
    Explosion AI: explosion.ai
    Follow us on Twitter:
    @honnibal
    @explosion_ai