Belgium NLP Meetup: Rapid NLP Annotation Through Binary Decisions, Pattern Bootstrapping and Active Learning

Ines Montani

October 31, 2018
Transcript

  1. Rapid NLP annotation
    through binary decisions,
    pattern bootstrapping
    and active learning
    Ines Montani
    Explosion AI

  2. Why we need annotations
    Machine Learning is “programming by example”
    annotations let us specify the output we’re looking for
    even unsupervised methods need to be evaluated on labelled examples

  3. Why annotation tools need to be efficient
    annotation needs iteration: we can’t expect to define the task correctly the first time
    good annotation teams are small – and should collaborate with the data scientist
    lots of high-value opportunities need specialist knowledge and expertise

  4. Why annotation needs to be semi-automatic
    impossible to perform boring, unstructured or multi-step tasks reliably
    humans make mistakes a computer never would, and vice versa
    humans are good at context, ambiguity and precision; computers are good at consistency, memory and recall

  5. “But annotation sucks!”
    1. Excel spreadsheets

    Problem: Excel. Spreadsheets.


  6. “But annotation sucks!”
    “But it’s just cheap click work. Can’t we outsource that?”
    1. Excel spreadsheets
    Problem: Excel. Spreadsheets.
    2. Mechanical Turk or external annotators
    Problem: If your results are bad, is it your label scheme, your data or your model?

  7. “But annotation sucks!”
    1. Excel spreadsheets
    Problem: Excel. Spreadsheets.
    2. Mechanical Turk or external annotators
    Problem: If your results are bad, is it your label scheme, your data or your model?
    3. Unsupervised learning
    Problem: So many clusters – but now what?

  8. Labelled data is not the problem. It’s data collection.

  9. Ask simple questions, even for complex tasks – ideally binary
    better annotation speed
    better, easier-to-measure reliability
    in theory: any task can be broken down into a sequence of binary (yes or no) decisions – it just makes your gradients sparse

  10. Prodigy Annotation Tool · https://prodi.gy

  11. Prodigy Annotation Tool · https://prodi.gy

  12. How can we train from incomplete information?

  13. Barack H. Obama was the president of America
    PERSON LOC
    ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']
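
    A minimal sketch of how such a BILOU tag sequence can be derived in code, using spaCy’s BILUO utilities (spaCy v3 API; the character offsets below are assumptions worked out from the example sentence, not taken from the slides):

    import spacy
    from spacy.training import offsets_to_biluo_tags

    nlp = spacy.blank("en")
    doc = nlp("Barack H. Obama was the president of America")
    # Character-offset annotations: (start, end, label)
    entities = [(0, 15, "PERSON"), (37, 44, "LOC")]
    print(offsets_to_biluo_tags(doc, entities))
    # ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']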

  14. Learning from complete information
    gradient_of_loss = predicted - target
    In the simple case with one known correct label:
    target = zeros(len(classes))
    target[classes.index(true_label)] = 1.0
    But what if we don’t know the full target distribution?
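
    The complete-information case above, as a runnable sketch with numpy (the class list and predicted scores are illustrative assumptions):

    import numpy as np

    classes = ['ORG', 'LOC', 'PERSON']
    predicted = np.array([0.5, 0.2, 0.3])  # model's predicted distribution

    # One known correct label: the target is a one-hot distribution.
    true_label = 'PERSON'
    target = np.zeros(len(classes))
    target[classes.index(true_label)] = 1.0

    gradient_of_loss = predicted - target  # [ 0.5  0.2 -0.7]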

  15. Barack H. Obama was the president of America
    ORG
    ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']

  16. Barack H. Obama was the president of America
    LOC
    ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
    ['?', '?', '?', '?', '?', '?', '?', 'U-LOC']

  17. Barack H. Obama was the president of America
    PERSON
    ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
    ['?', '?', '?', '?', '?', '?', '?', 'U-LOC']
    ['B-PERSON', 'L-PERSON', '?', '?', '?', '?', '?', '?']

  18. Barack H. Obama was the president of America
    PERSON
    ['?', '?', 'U-ORG', '?', '?', '?', '?', '?']
    ['?', '?', '?', '?', '?', '?', '?', 'U-LOC']
    ['B-PERSON', 'L-PERSON', '?', '?', '?', '?', '?', '?']
    ['B-PERSON', 'I-PERSON', 'L-PERSON', '?', '?', '?', '?', '?']
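
    One way to represent these accumulating binary answers in code – a sketch for illustration, not Prodigy’s internals – is to merge the partial tag arrays, treating '?' as unknown:

    # Merge partial BILOU annotations from accepted binary decisions.
    # '?' means "unknown"; concrete tags are hard constraints. Assumes
    # the accepted answers don't conflict with each other.
    def merge_partial_tags(partials):
        merged = ['?'] * len(partials[0])
        for tags in partials:
            for i, tag in enumerate(tags):
                if tag != '?':
                    merged[i] = tag
        return merged

    partials = [
        ['?', '?', '?', '?', '?', '?', '?', 'U-LOC'],
        ['B-PERSON', 'I-PERSON', 'L-PERSON', '?', '?', '?', '?', '?'],
    ]
    print(merge_partial_tags(partials))
    # ['B-PERSON', 'I-PERSON', 'L-PERSON', '?', '?', '?', '?', 'U-LOC']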

  19. Training from sparse labels
    goal: update the model in the best possible way with what we know
    just like multi-label classification, where examples can have more than one right answer
    update towards: wrong labels get 0 probability, the rest is split proportionally

  20. token = 'Obama'
    labels = ['ORG', 'LOC', 'PERSON']
    predicted = [ 0.5, 0.2, 0.3 ]

  21. token = 'Obama'
    labels = ['ORG', 'LOC', 'PERSON']
    predicted = [ 0.5, 0.2, 0.3 ]
    target = [ 0.0, 0.0, 1.0 ]
    gradient = predicted - target

  22. token = 'Obama'
    labels = ['ORG', 'LOC', 'PERSON']
    predicted = [ 0.5, 0.2, 0.3 ]
    target = [ 0.0, ?, ? ]

  23. token = 'Obama'
    labels = ['ORG', 'LOC', 'PERSON']
    predicted = [ 0.5, 0.2, 0.3 ]
    target = [ 0.0, 0.2 / (1.0 - 0.5), 0.3 / (1.0 - 0.5) ]
    target = [ 0.0, 0.4, 0.6 ]  # redistribute proportionally
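
    The same computation as a runnable sketch with numpy (the values are from the slides; the function name is made up):

    import numpy as np

    def sparse_target(predicted, wrong_indices):
        # Zero out labels we know are wrong, then redistribute the
        # model's own probability mass proportionally over the rest.
        target = np.array(predicted, dtype=float)
        target[wrong_indices] = 0.0
        return target / target.sum()

    predicted = np.array([0.5, 0.2, 0.3])   # ORG, LOC, PERSON
    target = sparse_target(predicted, [0])  # annotator rejected ORG
    print(target)                           # [0.  0.4 0.6]
    gradient = predicted - target           # [ 0.5 -0.2 -0.3]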

  24. Barack H. Obama was the president of America
    0.40  ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']
    0.35  ['B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O' ]
    0.20  [ 'O', 'O', 'U-PERSON', 'O', 'O', 'O', 'O', 'U-LOC']
    0.05  [ 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O' ]

  25. Training from sparse labels
    if we have a model that predicts something, we can work with that
    once the model’s already quite good, its second choice is probably correct
    new label: even from a cold start, the model will still converge – it’s just slow

  26. How to get over the cold start when training a new label?
    model needs to see enough positive examples
    rule-based models are often quite good
    rules can pre-label entity candidates
    write rules, annotate the exceptions

  27. {
        "label": "GPE",
        "pattern": [
            {"lower": "virginia"}
        ]
    }
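
    This is the match-pattern format used by Prodigy and spaCy. A minimal sketch of pre-labelling candidates with it via spaCy’s EntityRuler (API as in spaCy v3; the example text is made up):

    import spacy

    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([
        {"label": "GPE", "pattern": [{"lower": "virginia"}]},
    ])

    doc = nlp("She grew up in Virginia.")
    print([(ent.text, ent.label_) for ent in doc.ents])
    # [('Virginia', 'GPE')]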

  28. Does this work for other structured prediction tasks?
    the approach can be applied to non-NER tasks: dependency parsing, coreference resolution, relation extraction, summarization etc.
    the structures we’re predicting are highly correlated
    annotating it all at once is super inefficient – binary supervision can be much better

  29. Benefits of binary annotation workflows
    better data quality, reduced human error
    automate what humans are bad at, focus on what humans are needed for
    enable rapid iteration on data selection and label scheme

  30. Iterate on your code and your data.

  31. “Regular” programming
    source code → compiler → runtime program
    (the part you work on: the source code)

  32. “Regular” programming
    source code → compiler → runtime program
    Machine Learning
    training data → training algorithm → runtime model
    (the part you should work on: the training data)

  33. If you can master annotation...
    ... you can try out more ideas quickly. Most ideas don’t work – but some succeed wildly.
    ... fewer projects will fail. Figure out what works before trying to scale it up.
    ... you can build entirely custom solutions and nobody can lock you in.

  34. Thanks!
    Explosion AI · explosion.ai
    Follow us on Twitter:
    @_inesmontani
    @explosion_ai
