Slide 1

Rapid NLP annotation through binary decisions, pattern bootstrapping and active learning
Ines Montani · Explosion AI

Slide 2

Why we need annotations
- Machine Learning is “programming by example”
- annotations let us specify the output we’re looking for
- even unsupervised methods need to be evaluated on labelled examples

Slide 3

Why annotation tools need to be efficient
- annotation needs iteration: we can’t expect to define the task correctly the first time
- good annotation teams are small – and should collaborate with the data scientist
- lots of high-value opportunities need specialist knowledge and expertise

Slide 4

Why annotation needs to be semi-automatic
- it’s impossible to perform boring, unstructured or multi-step tasks reliably
- humans make mistakes a computer never would, and vice versa
- humans are good at context, ambiguity and precision; computers are good at consistency, memory and recall

Slide 5

“But annotation sucks!”
1. Excel spreadsheets
   Problem: Excel. Spreadsheets.


Slide 6

“But annotation sucks!”
1. Excel spreadsheets
   Problem: Excel. Spreadsheets.
2. Mechanical Turk or external annotators
   “But it’s just cheap click work. Can’t we outsource that?”
   Problem: If your results are bad, is it your label scheme, your data or your model?

Slide 7

“But annotation sucks!”
1. Excel spreadsheets
   Problem: Excel. Spreadsheets.
2. Mechanical Turk or external annotators
   Problem: If your results are bad, is it your label scheme, your data or your model?
3. Unsupervised learning
   Problem: So many clusters – but now what?

Slide 8

Labelled data is not the problem. It’s data collection.

Slide 9

Ask simple questions, even for complex tasks – ideally binary
- better annotation speed
- better, easier-to-measure reliability
- in theory, any task can be broken down into a sequence of binary (yes or no) decisions – it just makes your gradients sparse (see the sketch below)
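To make this concrete, here is a rough sketch with hypothetical helper names: instead of asking the annotator “which of these labels applies?”, present one candidate label at a time as a yes/no decision. The text, span and labels are made up for the example.

def binary_questions(text, span, labels):
    # Turn one multiple-choice labelling task into a sequence of
    # binary accept/reject decisions, one per candidate label.
    start, end = span
    for label in labels:
        yield f"Is '{text[start:end]}' a {label}? (accept/reject)"

for question in binary_questions("Apple hired Tim Cook.", (0, 5), ["ORG", "PERSON", "GPE"]):
    print(question)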

Slide 10

Prodigy Annotation Tool · https://prodi.gy

Slide 11

Active learning with pattern bootstrapping
- tell the computer the rules, annotate the exceptions
- build rules semi-automatically using word vectors
- avoid annotating what the model already knows
- instead, let the statistical model suggest the examples it’s most uncertain about (sketched below)
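One way to picture the “most uncertain” part – a minimal sketch, not Prodigy’s actual implementation: score each incoming example with the current model and surface the ones whose scores sit closest to 0.5. The score callable here is a stand-in for whatever the model provides.

def most_uncertain(stream, score, k=10):
    # A score near 0.0 or 1.0 means the model is already confident,
    # so annotating that example teaches it very little.
    return sorted(stream, key=lambda eg: abs(score(eg) - 0.5))[:k]

# Usage with a dummy scorer; a real one would come from the model:
examples = ["short text", "a medium length example", "a much longer piece of text"]
print(most_uncertain(examples, score=lambda eg: min(len(eg) / 40, 1.0), k=2))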

Slide 12

Terminology Lists
Charlottesville

Slide 13

Terminology Lists
Charlottesville · Ann Arbor · Virginia · North Carolina · VA · Virginia College · Maryland · Richmond · South Bend
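This is what Prodigy’s terms.teach recipe does interactively: expand a small seed list by suggesting nearest neighbours in vector space. A rough standalone sketch with spaCy, assuming a pipeline with word vectors such as en_core_web_md is installed:

import numpy
import spacy

nlp = spacy.load("en_core_web_md")  # any pipeline with word vectors
seeds = ["Charlottesville", "Virginia"]
queries = numpy.asarray([nlp.vocab[seed].vector for seed in seeds])
# Look up the nearest neighbours of each seed in the vector table.
keys, _, scores = nlp.vocab.vectors.most_similar(queries, n=10)
for seed, row in zip(seeds, keys):
    print(seed, "→", [nlp.vocab.strings[int(key)] for key in row])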

Slide 14

{ "label": "GPE", "pattern": [ {"lower": "virginia"} ] }

Slide 15

Named Entity Recognition

Slide 16

Text Classification

Slide 17

{ "label": "COMPENSATION", "pattern": [ {"ent_type": "PERSON"}, {"lemma": "receive"}, {"ent_type": "MONEY"} ] }

Slide 18

Text Classification

Slide 19

Iterate on your code and your data.

Slide 20

“Regular” programming
source code → compiler → runtime → program
(the part you work on: the source code)

Slide 21

“Regular” programming: source code → compiler → runtime → program
Machine Learning: training data → training algorithm → runtime → model
(the part you should work on: the training data)

Slide 22

If you can master annotation...
- ... you can try out more ideas quickly. Most ideas don’t work – but some succeed wildly.
- ... fewer projects will fail. Figure out what works before trying to scale it up.
- ... you can build entirely custom solutions and nobody can lock you in.

Slide 23

Thanks!
Explosion AI · explosion.ai
Follow us on Twitter: @_inesmontani · @explosion_ai