
Belgium NLP Meetup: Rapid NLP Annotation Through Binary Decisions, Pattern Bootstrapping and Active Learning

Ines Montani

October 31, 2018

Transcript

  1. Why we need annotations: Machine Learning is “programming by example”.
     Annotations let us specify the output we’re looking for, and even
     unsupervised methods need to be evaluated on labelled examples.
  2. Why annotation tools need to be efficient: annotation needs iteration,
     because we can’t expect to define the task correctly the first time.
     Good annotation teams are small and should collaborate with the data
     scientist, and lots of high-value opportunities need specialist
     knowledge and expertise.
  3. Why annotation needs to be semi-automatic: humans can’t reliably
     perform boring, unstructured or multi-step tasks. Humans make mistakes
     a computer never would, and vice versa: humans are good at context,
     ambiguity and precision, while computers are good at consistency,
     memory and recall.
  4. “But annotation sucks!” “But it’s just cheap click work. Can’t we
     outsource that?”
     1. Excel spreadsheets. Problem: Excel. Spreadsheets.
     2. Mechanical Turk or external annotators. Problem: if your results
        are bad, is it your label scheme, your data or your model?
  5. “But annotation sucks!”
     1. Excel spreadsheets. Problem: Excel. Spreadsheets.
     2. Mechanical Turk or external annotators. Problem: if your results
        are bad, is it your label scheme, your data or your model?
     3. Unsupervised learning. Problem: so many clusters, but now what?
  6. Ask simple questions, even for complex tasks – ideally binary. This
     gives better annotation speed and better, easier-to-measure
     reliability. In theory, any task can be broken down into a sequence
     of binary (yes or no) decisions; it just makes your gradients sparse.
  7. Barack H. Obama was the president of America
     With “Barack H. Obama” marked as PERSON and “America” as LOC, the
     full BILOU tag sequence is:
     ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']
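     A minimal sketch of this encoding step in Python (the helper name and
     the (start, end, label) span format are made up for illustration):

        def bilou_tags(n_tokens, spans, default="O"):
            # Hypothetical helper: BILOU-encode (start, end, label) token
            # spans; `end` is exclusive, spans must not overlap. `default`
            # fills every other token ("O" = known to be outside an entity).
            tags = [default] * n_tokens
            for start, end, label in spans:
                if end - start == 1:
                    tags[start] = "U-" + label        # Unit-length entity
                else:
                    tags[start] = "B-" + label        # Begin
                    for i in range(start + 1, end - 1):
                        tags[i] = "I-" + label        # Inside
                    tags[end - 1] = "L-" + label      # Last
            return tags

        tokens = "Barack H. Obama was the president of America".split()
        print(bilou_tags(len(tokens), [(0, 3, "PERSON"), (7, 8, "LOC")]))
        # ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']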
  8. Learning from complete information: gradient_of_loss = predicted - target
     In the simple case with one known correct label:
     target = zeros(len(classes))
     target[classes.index(true_label)] = 1.0
     But what if we don’t know the full target distribution?
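     Filled in as runnable Python (the class list and predictions are
     made-up numbers, chosen to match the later slides):

        import numpy as np

        classes = ["ORG", "LOC", "PERSON"]
        true_label = "PERSON"

        predicted = np.array([0.5, 0.2, 0.3])    # model's output distribution
        target = np.zeros(len(classes))
        target[classes.index(true_label)] = 1.0  # one-hot: complete information

        gradient_of_loss = predicted - target    # array([ 0.5,  0.2, -0.7])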
  9. Barack H. Obama was the president of America
     Binary question: is “Obama” an ORG? The candidate analysis only
     constrains that one span; everything else stays unknown:
     ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
  10. Barack H. Obama was the president of America
      Binary question: is “America” a LOC?
      ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
      ['?', '?', '?', '?', '?', '?', '?', 'U-LOC']
  11. Barack H. Obama was the president of America
      Binary question: is “Barack H.” a PERSON?
      ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
      ['?', '?', '?', '?', '?', '?', '?', 'U-LOC']
      ['B-PERSON', 'L-PERSON', '?', '?', '?', '?', '?', '?']
  12. Barack H. Obama was the president of America
      Binary question: is “Barack H. Obama” a PERSON?
      ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
      ['?', '?', '?', '?', '?', '?', '?', 'U-LOC']
      ['B-PERSON', 'L-PERSON', '?', '?', '?', '?', '?', '?']
      ['B-PERSON', 'I-PERSON', 'L-PERSON', '?', '?', '?', '?', '?']
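      Each candidate analysis is the same BILOU encoding with "?" instead
      of "O" outside the span in question. Reusing the hypothetical
      bilou_tags helper from the earlier sketch:

        print(bilou_tags(8, [(2, 3, "ORG")], default="?"))
        # ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
        print(bilou_tags(8, [(0, 3, "PERSON")], default="?"))
        # ['B-PERSON', 'I-PERSON', 'L-PERSON', '?', '?', '?', '?', '?']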
  13. Training from sparse labels: the goal is to update the model in the
      best possible way with what we know. It’s just like multi-label
      classification, where examples can have more than one right answer.
      Update towards a distribution where wrong labels get 0 probability
      and the rest is split proportionally.
  14. token = 'Obama'
      labels = ['ORG', 'LOC', 'PERSON']
      predicted = [0.5, 0.2, 0.3]
      target = [0.0, 0.0, 1.0]
      gradient = predicted - target
  15. token = 'Obama'
      labels = ['ORG', 'LOC', 'PERSON']
      predicted = [0.5, 0.2, 0.3]
      target = [0.0, ?, ?]
  16. token = 'Obama'
      labels = ['ORG', 'LOC', 'PERSON']
      predicted = [0.5, 0.2, 0.3]
      target = [0.0, 0.2 / (1.0 - 0.5), 0.3 / (1.0 - 0.5)]
      target = [0.0, 0.4, 0.6]
      redistribute proportionally
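      The same arithmetic as a runnable sketch (the function name is
      hypothetical; the numbers are the slide's):

        def sparse_target(predicted, wrong):
            # Labels the annotator rejected get probability 0.0; the
            # remaining mass keeps the model's own proportions and is
            # renormalized to sum to 1.
            kept = [0.0 if i in wrong else p for i, p in enumerate(predicted)]
            total = sum(kept)
            return [p / total for p in kept]

        predicted = [0.5, 0.2, 0.3]                   # P(ORG), P(LOC), P(PERSON)
        target = sparse_target(predicted, wrong={0})  # annotator rejected ORG
        print(target)                                 # [0.0, 0.4, 0.6]
        gradient = [p - t for p, t in zip(predicted, target)]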
  17. Barack H. Obama was the president of America
      Candidate analyses of the whole sentence, with model probabilities:
      0.40  ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']
      0.35  ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O']
      0.20  ['O', 'O', 'U-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']
      0.05  ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
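      One way to read this as code (a hypothetical sketch with the slide's
      probabilities): a binary answer rules out every analysis inconsistent
      with it, and the surviving probabilities are renormalized.

        # Candidate analyses of the whole sentence, with model probabilities.
        candidates = [
            (0.40, ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']),
            (0.35, ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O']),
            (0.20, ['O', 'O', 'U-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']),
            (0.05, ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']),
        ]
        # Annotator accepts "America is a LOC": keep only analyses that tag
        # the last token U-LOC, then renormalize what survives.
        kept = [(p, tags) for p, tags in candidates if tags[-1] == 'U-LOC']
        total = sum(p for p, _ in kept)
        kept = [(p / total, tags) for p, tags in kept]
        # 0.40 -> ~0.67 and 0.20 -> ~0.33; the other analyses drop to zero.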
  18. Training from sparse labels: if we have a model that predicts
      something, we can work with that. Once the model’s already quite
      good, its second choice is probably correct. For a new label, the
      model will still converge even from a cold start; it’s just slow.
  19. How to get over the cold start when training a new label? The model
      needs to see enough positive examples. Rule-based models are often
      quite good, and rules can pre-label entity candidates: write rules,
      annotate the exceptions.
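      For NER, that pre-labelling step could look like this with spaCy's
      EntityRuler (a minimal sketch using the spaCy v3 API; the patterns
      are made up):

        import spacy

        nlp = spacy.blank("en")
        ruler = nlp.add_pipe("entity_ruler")
        ruler.add_patterns([
            # Token pattern: any token whose lowercase form is "obama"
            {"label": "PERSON", "pattern": [{"LOWER": "obama"}]},
            # Phrase pattern: the exact string "America"
            {"label": "LOC", "pattern": "America"},
        ])

        doc = nlp("Barack H. Obama was the president of America")
        print([(ent.text, ent.label_) for ent in doc.ents])
        # [('Obama', 'PERSON'), ('America', 'LOC')]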
  20. Does this work for other structured prediction tasks? The approach
      can be applied to non-NER tasks: dependency parsing, coreference
      resolution, relation extraction, summarization etc. The structures
      we’re predicting are highly correlated, so annotating it all at once
      is super inefficient; binary supervision can be much better.
  21. Benefits of binary annotation workflows: better data quality and
      reduced human error; automate what humans are bad at, focus on what
      humans are needed for; enable rapid iteration on data selection and
      label scheme.
  22. “Regular” programming:
      source code → compiler → runtime program
      The part you work on: the source code.
  23. “Regular” programming vs. Machine Learning:
      source code → compiler → runtime program
      training data → training algorithm → runtime model
      The part you should work on: the training data.
  24. If you can master annotation...
      ... you can try out more ideas quickly. Most ideas don’t work, but
      some succeed wildly.
      ... fewer projects will fail. Figure out what works before trying to
      scale it up.
      ... you can build entirely custom solutions and nobody can lock you in.