Slide 1

Slide 1 text

Building new NLP solutions with spaCy and Prodigy Matthew Honnibal Explosion AI

Slide 2

Slide 2 text

Explosion AI is a digital studio specialising in Artificial Intelligence and Natural Language Processing.
- spaCy: open-source library for industrial-strength Natural Language Processing
- spaCy’s next-generation machine learning library for deep learning with text
- Coming soon: pre-trained, customisable models for a variety of languages and domains
- Prodigy: a radically efficient data collection and annotation tool, powered by active learning

Slide 3

Slide 3 text

Matthew Honnibal, CO-FOUNDER: PhD in Computer Science in 2009. 10 years publishing research on state-of-the-art natural language understanding systems. Left academia in 2014 to develop spaCy.
Ines Montani, CO-FOUNDER: Programmer and front-end developer with a degree in media science and linguistics. Has been working on spaCy since its first release. Lead developer of Prodigy.

Slide 4

Slide 4 text

“I don’t get it. Can you explain like I’m five?” Think of us as a boutique kitchen:
- free recipes published online (open-source software)
- catering for select events (consulting)
- a line of kitchen gadgets (downloadable tools)
- soon: a line of fancy sauces and spice mixes you can use at home (pre-trained models)

Slide 5

Slide 5 text

NLP projects are like start-ups: they fail a lot.

Slide 6

Slide 6 text

How to maximize your NLP project’s risk of failure: 1. Imagineer. 2. Forecast. 3. Outsource. 4. Wire. 5. Ship.
Step 1, Imagineer: Decide what your application ought to do. Be ambitious! Nobody changed the world saying “uh, will that work?”

Slide 7

Slide 7 text

How to maximize your NLP project’s risk of failure: 1. Imagineer. 2. Forecast. 3. Outsource. 4. Wire. 5. Ship.
Step 2, Forecast: Figure out what accuracy you’ll need. If you’re not sure here, just say 90%.

Slide 8

Slide 8 text

How to maximize your NLP project’s risk of failure: 1. Imagineer. 2. Forecast. 3. Outsource. 4. Wire. 5. Ship.
Step 3, Outsource: Pay someone else to gather your data. Think carefully about your accuracy requirements, and then ask for 10,000 rows.

Slide 9

Slide 9 text

How to maximize your NLP project’s risk of failure: 1. Imagineer. 2. Forecast. 3. Outsource. 4. Wire. 5. Ship.
Step 4, Wire: Implement your network. This is the fun part! Tensor all your flows; descend every gradient!

Slide 10

Slide 10 text

How to maximize your NLP project’s risk of failure: 1. Imagineer. 2. Forecast. 3. Outsource. 4. Wire. 5. Ship.
Step 5, Ship: Put it all together. If it doesn’t work, maybe blame the intern?

Slide 11

Slide 11 text

Failure sucks.

Slide 12

Slide 12 text

Machine Learning Hierarchy of Needs:
1. Optimization
2. Model architecture
3. Consistent and clean data
4. Annotation scheme and corpus construction
5. Understanding how the model will work in the larger application or business process: including tolerance for inaccuracies, latencies, etc.

Slide 13

Slide 13 text

Machine Learning Hierarchy of Needs:
1. Optimization
2. Model architecture
3. Consistent and clean data
4. Annotation scheme and corpus construction: categories that will be easy to annotate consistently, and easy for the model to learn
5. Understanding how the model will work in the larger application or business process: including tolerance for inaccuracies, latencies, etc.

Slide 14

Slide 14 text

Machine Learning Hierarchy of Needs:
1. Optimization
2. Model architecture
3. Consistent and clean data: attentive annotators, good quality control processes
4. Annotation scheme and corpus construction: categories that will be easy to annotate consistently, and easy for the model to learn
5. Understanding how the model will work in the larger application or business process: including tolerance for inaccuracies, latencies, etc.

Slide 15

Slide 15 text

Machine Learning Hierarchy of Needs:
1. Optimization
2. Model architecture: smart choices, no bugs
3. Consistent and clean data: attentive annotators, good quality control processes
4. Annotation scheme and corpus construction: categories that will be easy to annotate consistently, and easy for the model to learn
5. Understanding how the model will work in the larger application or business process: including tolerance for inaccuracies, latencies, etc.

Slide 16

Slide 16 text

Machine Learning Hierarchy of Needs:
1. Optimization: given by hyper-parameters, initialization tricks, sweat and toil
2. Model architecture: smart choices, no bugs
3. Consistent and clean data: attentive annotators, good quality control processes
4. Annotation scheme and corpus construction: categories that will be easy to annotate consistently, and easy for the model to learn
5. Understanding how the model will work in the larger application or business process: including tolerance for inaccuracies, latencies, etc.

Slide 17

Slide 17 text

A difficult chicken-and-egg problem: product vision → annotation scheme → labelled data → training & evaluation → accuracy estimate → back to product vision.

Slide 18

Slide 18 text

You need to iterate on your code and your data.

Slide 19

Slide 19 text

Don’t assume — iterate!
- What models should we train to meet the business needs?
- Does our annotation scheme make sense?
- Does the problem look easy, or hard?
- What can we do to improve fault tolerance?

Slide 20

Slide 20 text

Problem #1: It’s easy to make modelling decisions that are simple, obvious and wrong.
Requirements: We’re building a crime database based on news reports. We want to label the following:
- victim name
- perpetrator name
- crime location
- offence date
- arrest date

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Solution #1: Compose generic models into novel solutions
- Generic categories like PERSON and LOCATION let you use pre-trained models.
- Annotate events and topics at the sentence (or paragraph or document) level.
- Annotate roles by word or entity. Use the dependency parse to find boundaries.
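As a minimal sketch of the composition idea, the snippet below fakes the two generic model outputs (a sentence-level event label and pre-trained-style PERSON/LOCATION entities) as plain data, then derives domain role candidates from them. The sentence, spans, and role mapping are all illustrative assumptions, not spaCy API:

```python
# Sketch of "compose generic models into novel solutions": instead of
# training one bespoke "crime NER" model, combine (a) a sentence-level
# event classifier and (b) a generic pre-trained NER model.  Both model
# outputs are hard-coded here; the composition logic is the point.

# Pretend output of a generic NER model: (start_char, end_char, label).
SENTENCE = "Police arrested John Doe in Berlin on Tuesday."
ENTITIES = [(16, 24, "PERSON"), (28, 34, "LOCATION")]

# Pretend output of a sentence-level classifier.
SENTENCE_LABEL = "CRIME_REPORT"

# Domain-specific roles are derived from generic labels, but only inside
# sentences the classifier flagged as relevant.
GENERIC_TO_ROLES = {
    "PERSON": ["victim name", "perpetrator name"],
    "LOCATION": ["crime location"],
}

def candidate_roles(sentence_label, entities):
    """Map generic entities to domain role candidates for relevant sentences."""
    if sentence_label != "CRIME_REPORT":
        return []
    candidates = []
    for start, end, label in entities:
        for role in GENERIC_TO_ROLES.get(label, []):
            candidates.append({"span": (start, end), "role": role})
    return candidates

for c in candidate_roles(SENTENCE_LABEL, ENTITIES):
    print(SENTENCE[c["span"][0]:c["span"][1]], "->", c["role"])
```

In a real pipeline the entity spans would come from a pre-trained spaCy model and the dependency parse would refine span boundaries; the point here is that no crime-specific entity model is needed.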

Slide 25

Slide 25 text

Workflow #1

Slide 26

Slide 26 text

Problem #2: Big annotation projects make evidence expensive to collect
- For good project plans, you need evidence. To get evidence, you need annotations.
- You often don’t know if it works until you try it. You’re unlikely to be right the first time.
- Worry less about scaling up, and more about scaling down. Iteration needs low overhead.

Slide 27

Slide 27 text

Solution #2: Run your own micro-experiments
- Active learning and good tooling can make experiments faster.
- Working with the examples yourself lets you understand the problem and fix the label scheme before you scale up.
- A/B evaluation lets you measure small changes very quickly – also works on generative tasks!
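One way to make the A/B evaluation point concrete: collect pairwise preferences between the outputs of two models and run an exact sign test on the counts. This is a generic stdlib sketch of the statistics, not a Prodigy feature:

```python
# Sketch of A/B evaluation for a micro-experiment: annotators see the
# outputs of model A and model B side by side (randomised) and pick the
# one they prefer.  A two-sided exact sign test on the preference counts
# tells you whether the difference is likely to be real.
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Probability of a split at least this lopsided if A and B were
    actually equally preferred (ties discarded before counting)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Two-sided tail of Binomial(n, 0.5): P(X <= k) + P(X >= n - k).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 50 comparisons: B preferred 35 times, A preferred 15 times.
p = sign_test_p_value(15, 35)
print(f"p = {p:.4f}")  # small p: the preference for B is unlikely to be chance
```

Because each trial is a single forced choice, a few dozen judgments are often enough to see whether a change helped, which is exactly what makes micro-experiments cheap.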

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Problem #3: It’s hard to get good data by boring the shit out of underpaid people
Why are we “designing around” this? “Taking a HIT: Designing around Rejection, Mistrust, Risk, and Workers’ Experiences in Amazon Mechanical Turk” (McInnis et al., 2016)
It’s not just Mechanical Turk — the larger and more transient the annotation team, the harder it is to get quality data.

Slide 30

Slide 30 text

Solution #3: Smaller annotation teams, better annotation workflows
- Break complex tasks down into smaller pieces: easier for the model to learn, easier for the human to label.
- Your annotators don’t need to work the same way your model does. Semi-automatic workflows are often more accurate.
- Consider moving annotation in-house.
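A small sketch of what "break complex tasks down" can look like in practice: instead of asking annotators to draw entity spans from scratch, generate candidate spans automatically and reduce the task to binary accept/reject questions. The pattern list is a hypothetical stand-in for a gazetteer or an existing model's predictions:

```python
# Semi-automatic workflow sketch: propose candidate spans mechanically,
# then ask humans only simple yes/no questions about them.  Binary
# decisions are faster and easier to make consistently than free-form
# span annotation.
import re

# Hypothetical pattern list; in a real project this might come from a
# gazetteer or a pre-trained model's entity predictions.
LOCATION_PATTERNS = [r"\bBerlin\b", r"\bParis\b", r"\bLondon\b"]

def propose_tasks(texts):
    """Turn raw texts into simple yes/no annotation questions."""
    tasks = []
    for text in texts:
        for pattern in LOCATION_PATTERNS:
            for match in re.finditer(pattern, text):
                tasks.append({
                    "text": text,
                    "span": match.span(),
                    "question": f"Is '{match.group()}' the crime location?",
                })
    return tasks

texts = ["The robbery took place in Berlin.", "London police made an arrest."]
for task in propose_tasks(texts):
    print(task["question"])
```

Each task carries the text and span, so accepted answers can be written straight back as training data; this is the same one-decision-at-a-time shape that annotation tools like Prodigy are built around.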

Slide 31

Slide 31 text

You can’t solve this problem analytically — so solve it iteratively: product vision → annotation scheme → labelled data → training & evaluation → accuracy estimate → back to product vision.

Slide 32

Slide 32 text

Thanks!
Explosion AI: explosion.ai
Follow us on Twitter: @honnibal, @explosion_ai