
Belgium NLP Meetup: Rapid NLP Annotation Through Binary Decisions, Pattern Bootstrapping and Active Learning

Ines Montani

October 31, 2018

Transcript

  1. Why we need annotations: Machine Learning is “programming by example”.
     Annotations let us specify the output we’re looking for, and even
     unsupervised methods need to be evaluated on labelled examples.
  2. Why annotation tools need to be efficient: annotation needs iteration,
     because we can’t expect to define the task correctly the first time.
     Good annotation teams are small and should collaborate with the data
     scientist, and lots of high-value opportunities need specialist
     knowledge and expertise.
  3. Why annotation needs to be semi-automatic: humans can’t reliably
     perform boring, unstructured or multi-step tasks. Humans make mistakes
     a computer never would, and vice versa: humans are good at context,
     ambiguity and precision, while computers are good at consistency,
     memory and recall.
  4. “But annotation sucks!” “But it’s just cheap click work. Can’t we
     outsource that?”
     1. Excel spreadsheets. Problem: Excel. Spreadsheets.
     2. Mechanical Turk or external annotators. Problem: if your results
        are bad, is it your label scheme, your data or your model?
  5. “But annotation sucks!”
     1. Excel spreadsheets. Problem: Excel. Spreadsheets.
     2. Mechanical Turk or external annotators. Problem: if your results
        are bad, is it your label scheme, your data or your model?
     3. Unsupervised learning. Problem: so many clusters, but now what?
  6. Ask simple questions, even for complex tasks – ideally binary. This
     gives better annotation speed and better, easier-to-measure
     reliability. In theory, any task can be broken down into a sequence
     of binary (yes or no) decisions; it just makes your gradients sparse.
  7. Barack H. Obama was the president of America
     With “Barack H. Obama” marked as PERSON and “America” as LOC, the
     full BILOU tag sequence is:
     ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']
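     A minimal sketch of this encoding step in Python (the helper name and
     the (start, end, label) span format are made up for illustration):

        def bilou_tags(n_tokens, spans, default="O"):
            # Hypothetical helper: BILOU-encode (start, end, label) token
            # spans; `end` is exclusive, spans must not overlap. `default`
            # fills every other token ("O" = known to be outside an entity).
            tags = [default] * n_tokens
            for start, end, label in spans:
                if end - start == 1:
                    tags[start] = "U-" + label        # Unit-length entity
                else:
                    tags[start] = "B-" + label        # Begin
                    for i in range(start + 1, end - 1):
                        tags[i] = "I-" + label        # Inside
                    tags[end - 1] = "L-" + label      # Last
            return tags

        tokens = "Barack H. Obama was the president of America".split()
        print(bilou_tags(len(tokens), [(0, 3, "PERSON"), (7, 8, "LOC")]))
        # ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']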
  8. Learning from complete information: gradient_of_loss = predicted - target
     In the simple case with one known correct label:
     target = zeros(len(classes))
     target[classes.index(true_label)] = 1.0
     But what if we don’t know the full target distribution?
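     Filled in as runnable Python (the class list and predictions are
     made-up numbers, chosen to match the later slides):

        import numpy as np

        classes = ["ORG", "LOC", "PERSON"]
        true_label = "PERSON"

        predicted = np.array([0.5, 0.2, 0.3])    # model's output distribution
        target = np.zeros(len(classes))
        target[classes.index(true_label)] = 1.0  # one-hot: complete information

        gradient_of_loss = predicted - target    # array([ 0.5,  0.2, -0.7])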
  9. Barack H. Obama was the president of America
     Binary question: is “Obama” an ORG? The candidate analysis only
     constrains that one span; everything else stays unknown:
     ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
  10. Barack H. Obama was the president of America
      Binary question: is “America” a LOC?
      ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
      ['?', '?', '?', '?', '?', '?', '?', 'U-LOC']
  11. Barack H. Obama was the president of America
      Binary question: is “Barack H.” a PERSON?
      ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
      ['?', '?', '?', '?', '?', '?', '?', 'U-LOC']
      ['B-PERSON', 'L-PERSON', '?', '?', '?', '?', '?', '?']
  12. Barack H. Obama was the president of America
      Binary question: is “Barack H. Obama” a PERSON?
      ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
      ['?', '?', '?', '?', '?', '?', '?', 'U-LOC']
      ['B-PERSON', 'L-PERSON', '?', '?', '?', '?', '?', '?']
      ['B-PERSON', 'I-PERSON', 'L-PERSON', '?', '?', '?', '?', '?']
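      Each candidate analysis is the same BILOU encoding with "?" instead
      of "O" outside the span in question. Reusing the hypothetical
      bilou_tags helper from the earlier sketch:

        print(bilou_tags(8, [(2, 3, "ORG")], default="?"))
        # ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
        print(bilou_tags(8, [(0, 3, "PERSON")], default="?"))
        # ['B-PERSON', 'I-PERSON', 'L-PERSON', '?', '?', '?', '?', '?']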
  13. Training from sparse labels: the goal is to update the model in the
      best possible way with what we know. It’s just like multi-label
      classification, where examples can have more than one right answer.
      Update towards a distribution where wrong labels get 0 probability
      and the rest is split proportionally.
  14. token = 'Obama'
      labels = ['ORG', 'LOC', 'PERSON']
      predicted = [0.5, 0.2, 0.3]
      target = [0.0, 0.0, 1.0]
      gradient = predicted - target
  15. token = 'Obama'
      labels = ['ORG', 'LOC', 'PERSON']
      predicted = [0.5, 0.2, 0.3]
      target = [0.0, ?, ?]
  16. token = 'Obama'
      labels = ['ORG', 'LOC', 'PERSON']
      predicted = [0.5, 0.2, 0.3]
      target = [0.0, 0.2 / (1.0 - 0.5), 0.3 / (1.0 - 0.5)]
      target = [0.0, 0.4, 0.6]
      redistribute proportionally
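      The same arithmetic as a runnable sketch (the function name is
      hypothetical; the numbers are the slide's):

        def sparse_target(predicted, wrong):
            # Labels the annotator rejected get probability 0.0; the
            # remaining mass keeps the model's own proportions and is
            # renormalized to sum to 1.
            kept = [0.0 if i in wrong else p for i, p in enumerate(predicted)]
            total = sum(kept)
            return [p / total for p in kept]

        predicted = [0.5, 0.2, 0.3]                   # P(ORG), P(LOC), P(PERSON)
        target = sparse_target(predicted, wrong={0})  # annotator rejected ORG
        print(target)                                 # [0.0, 0.4, 0.6]
        gradient = [p - t for p, t in zip(predicted, target)]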
  17. Barack H. Obama was the president of America
      Candidate analyses of the whole sentence, with model probabilities:
      0.40  ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']
      0.35  ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O']
      0.20  ['O', 'O', 'U-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']
      0.05  ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
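      One way to read this as code (a hypothetical sketch with the slide's
      probabilities): a binary answer rules out every analysis inconsistent
      with it, and the surviving probabilities are renormalized.

        # Candidate analyses of the whole sentence, with model probabilities.
        candidates = [
            (0.40, ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']),
            (0.35, ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O']),
            (0.20, ['O', 'O', 'U-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']),
            (0.05, ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']),
        ]
        # Annotator accepts "America is a LOC": keep only analyses that tag
        # the last token U-LOC, then renormalize what survives.
        kept = [(p, tags) for p, tags in candidates if tags[-1] == 'U-LOC']
        total = sum(p for p, _ in kept)
        kept = [(p / total, tags) for p, tags in kept]
        # 0.40 -> ~0.67 and 0.20 -> ~0.33; the other analyses drop to zero.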
  18. Training from sparse labels: if we have a model that predicts
      something, we can work with that. Once the model’s already quite
      good, its second choice is probably correct. For a new label, the
      model will still converge even from a cold start; it’s just slow.
  19. How to get over the cold start when training a new label? The model
      needs to see enough positive examples. Rule-based models are often
      quite good, and rules can pre-label entity candidates: write rules,
      annotate the exceptions.
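      For NER, that pre-labelling step could look like this with spaCy's
      EntityRuler (a minimal sketch using the spaCy v3 API; the patterns
      are made up):

        import spacy

        nlp = spacy.blank("en")
        ruler = nlp.add_pipe("entity_ruler")
        ruler.add_patterns([
            # Token pattern: any token whose lowercase form is "obama"
            {"label": "PERSON", "pattern": [{"LOWER": "obama"}]},
            # Phrase pattern: the exact string "America"
            {"label": "LOC", "pattern": "America"},
        ])

        doc = nlp("Barack H. Obama was the president of America")
        print([(ent.text, ent.label_) for ent in doc.ents])
        # [('Obama', 'PERSON'), ('America', 'LOC')]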
  20. Does this work for other structured prediction tasks? The approach
      can be applied to non-NER tasks: dependency parsing, coreference
      resolution, relation extraction, summarization etc. The structures
      we’re predicting are highly correlated, so annotating it all at once
      is super inefficient; binary supervision can be much better.
  21. Benefits of binary annotation workflows: better data quality and
      reduced human error; automate what humans are bad at, focus on what
      humans are needed for; enable rapid iteration on data selection and
      label scheme.
  22. “Regular” programming:
      source code → compiler → runtime program
      The part you work on: the source code.
  23. “Regular” programming vs. Machine Learning:
      source code → compiler → runtime program
      training data → training algorithm → runtime model
      The part you should work on: the training data.
  24. If you can master annotation...
      ... you can try out more ideas quickly. Most ideas don’t work, but
      some succeed wildly.
      ... fewer projects will fail. Figure out what works before trying to
      scale it up.
      ... you can build entirely custom solutions and nobody can lock you in.