Rapid NLP Annotation Through Binary Decisions, Pattern Bootstrapping and Active Learning

In this talk, I'll present a fast, flexible and even somewhat fun approach to named entity annotation. Using our approach, a model can be trained for a new entity type in only a few hours, starting from nothing more than a feed of unannotated text and a handful of seed terms. Given the seed terms, we first perform an interactive lexical learning phase, using a semantic similarity model that can be trained from raw text via an algorithm such as word2vec. The similarity model can be made to learn vectors for longer phrases by pre-processing the text, and abstract patterns can be created that reference attributes such as part-of-speech tags. The resulting patterns file is then used to present the annotator with a sequence of candidate phrases, so that each annotation can be made as a single binary choice. The annotator's eyes remain fixed near the centre of the screen, decisions can be made with a click, swipe or single keypress, and tasks are buffered to prevent delays.
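As an illustration of the lexical learning phase, here is a minimal sketch that trains a word2vec-style model on raw text, expands a couple of seed terms via vector similarity, and writes the candidate terms out as a patterns file. The tiny corpus, the seed terms and the DRUG label are invented for the example, and the JSONL patterns format shown is a common convention rather than a guaranteed specification.

import json
from gensim.models import Word2Vec  # any word2vec-style similarity model works here

# Invented toy corpus: an iterable of tokenised sentences from the unannotated text feed.
sentences = [
    ["the", "patient", "was", "given", "aspirin", "for", "the", "pain"],
    ["ibuprofen", "and", "paracetamol", "are", "common", "painkillers"],
    ["she", "takes", "aspirin", "and", "ibuprofen", "daily"],
]

# Train a small similarity model from raw text (gensim 4.x API).
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)

# Expand a handful of seed terms by looking up their nearest neighbours in vector space.
seed_terms = ["aspirin", "ibuprofen"]
candidates = model.wv.most_similar(positive=seed_terms, topn=10)

# Terms the annotator accepts become token-match patterns for bootstrapping.
with open("patterns.jsonl", "w", encoding="utf8") as f:
    for term, similarity in candidates:
        f.write(json.dumps({"label": "DRUG", "pattern": [{"lower": term}]}) + "\n")

In practice, longer phrases would first be merged during pre-processing (for example, noun chunks joined into single tokens) so that multi-word terms get their own vectors, as described above.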

Using this interface, annotation rates of 10-30 decisions per minute are common. If the decisions are especially easy (e.g. confirming that instances of a phrase are all valid entities), the rate may be several times faster. As the annotator accepts or rejects the suggested phrases, the responses are used to start training a statistical model. Predictions from the statistical model are then mixed into the annotation queue. Despite the sparsity of the signal (binary answers on one phrase per sentence), the model begins to learn surprisingly quickly. A global neural network model is used, with beam-search to allow a form of noise-contrastive estimation training. The pattern matcher and entity recognition model are available in our open-source library spaCy, while the interface, task queue and workflow management are implemented in our annotation tool Prodigy.
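To make the abstract-pattern idea concrete, below is a minimal sketch using spaCy's rule-based Matcher (v3 API). The label names, patterns and example sentence are invented for illustration; in a real project the patterns would come from the lexical learning phase.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Abstract patterns over token attributes rather than exact words:
# a number followed by a unit, and a sequence of proper nouns.
matcher.add("DOSE_CANDIDATE", [[{"LIKE_NUM": True}, {"LOWER": {"IN": ["mg", "ml"]}}]])
matcher.add("NAME_CANDIDATE", [[{"POS": "PROPN", "OP": "+"}]])

doc = nlp("Alex Smith was given 200 mg of ibuprofen in Berlin.")
for match_id, start, end in matcher(doc):
    # Each match is a candidate phrase that can be queued up for a yes/no decision.
    print(nlp.vocab.strings[match_id], doc[start:end].text)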

Ines Montani

April 12, 2018

Transcript

  1. Why we need annotations
     Machine Learning is “programming by example”
     annotations let us specify the output we’re looking for
     even unsupervised methods need to be evaluated on labelled examples
  2. Why annotation tools need to be efficient
     annotation needs iteration: we can’t expect to define the task correctly the first time
     good annotation teams are small – and should collaborate with the data scientist
     lots of high-value opportunities need specialist knowledge and expertise
  3. Why annotation needs to be semi-automatic
     impossible to perform boring, unstructured or multi-step tasks reliably
     humans make mistakes a computer never would, and vice versa
     humans are good at context, ambiguity and precision, computers are good at consistency, memory and recall
  4. “But annotation sucks!”
     1. Excel spreadsheets
        Problem: Excel. Spreadsheets.
     2. Mechanical Turk or external annotators
        Problem: If your results are bad, is it your label scheme, your data or your model?
        “But it’s just cheap click work. Can’t we outsource that?”
  5. “But annotation sucks!”
     1. Excel spreadsheets
        Problem: Excel. Spreadsheets.
     2. Mechanical Turk or external annotators
        Problem: If your results are bad, is it your label scheme, your data or your model?
     3. Unsupervised learning
        Problem: So many clusters – but now what?
  6. Ask simple questions, even for complex tasks – ideally binary
     better annotation speed
     better, easier-to-measure reliability
     in theory: any task can be broken down into a sequence of binary (yes or no) decisions – it just makes your gradients sparse
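For illustration, a binary task of the kind described on this slide reduces to one highlighted span and a single accept/reject answer. The dictionary below is a simplified sketch in the spirit of Prodigy's task format; the exact field names should be checked against the tool's documentation.

# One candidate phrase, one yes/no decision (simplified sketch of a binary NER task).
task = {
    "text": "The patient was given 200 mg of ibuprofen.",
    "spans": [{"start": 32, "end": 41, "label": "DRUG"}],  # highlights "ibuprofen"
}

# The annotator's single click, swipe or keypress becomes the training signal.
task["answer"] = "accept"  # or "reject"
print(task)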
  7. Active learning with pattern bootstrapping
     tell the computer the rules, annotate the exceptions
     build rules semi-automatically using word vectors
     avoid annotating what the model already knows – instead, let the statistical model suggest examples it’s most uncertain about
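The “most uncertain” selection in the last point can be pictured as a simple sorter: score each candidate with the current model and prefer the ones closest to the decision boundary. This is a generic sketch of uncertainty sampling, not Prodigy's actual sorting code; the scores below are invented stand-ins for the model's entity probabilities.

def uncertainty_sort(candidates, score):
    """Order candidate tasks so the ones the model is least sure about come first."""
    # A probability near 0.5 means the model is most uncertain, so it has the most to learn.
    return sorted(candidates, key=lambda task: abs(score(task) - 0.5))


# Toy usage with hand-picked scores standing in for the statistical model's output.
candidate_tasks = [
    {"text": "take two aspirin and rest", "score": 0.93},
    {"text": "the aspirin of cloud computing", "score": 0.48},
    {"text": "Aspirin United won the match", "score": 0.61},
]
queue = uncertainty_sort(candidate_tasks, score=lambda task: task["score"])
print([task["text"] for task in queue])  # most uncertain example comes first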
  8. “Regular” programming
     source code → compiler → runtime → program
     the part you work on: source code
  9. “Regular” programming: source code → compiler → runtime → program
     Machine Learning: training data → training algorithm → runtime → model
     the part you should work on: training data
  10. If you can master annotation...
      ... you can try out more ideas quickly. Most ideas don’t work – but some succeed wildly.
      ... fewer projects will fail. Figure out what works before trying to scale it up.
      ... you can build entirely custom solutions and nobody can lock you in.