Rapid NLP Annotation Through Binary Decisions, Pattern Bootstrapping and Active Learning

In this talk, I'll present a fast, flexible and even somewhat fun approach to named entity annotation. Using our approach, a model can be trained for a new entity type in only a few hours, starting from nothing more than a feed of unannotated text and a handful of seed terms. Given the seed terms, we first perform an interactive lexical learning phase, using a semantic similarity model that can be trained from raw text via an algorithm such as word2vec. The similarity model can be made to learn vectors for longer phrases by pre-processing the text, and abstract patterns can be created that reference attributes such as part-of-speech tags. The resulting patterns file is then used to present the annotator with a sequence of candidate phrases, so that each annotation can be made as a single binary choice. The annotator's eyes remain fixed near the centre of the screen, decisions can be made with a click, swipe or single keypress, and tasks are buffered to prevent delays.
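As an illustration of the lexical learning phase, here is a minimal sketch that trains a word2vec-style model on raw text, expands a couple of seed terms via vector similarity, and writes the candidate terms out as a patterns file. The tiny corpus, the seed terms and the DRUG label are invented for the example, and the JSONL patterns format shown is a common convention rather than a guaranteed specification.

import json
from gensim.models import Word2Vec  # any word2vec-style similarity model works here

# Invented toy corpus: an iterable of tokenised sentences from the unannotated text feed.
sentences = [
    ["the", "patient", "was", "given", "aspirin", "for", "the", "pain"],
    ["ibuprofen", "and", "paracetamol", "are", "common", "painkillers"],
    ["she", "takes", "aspirin", "and", "ibuprofen", "daily"],
]

# Train a small similarity model from raw text (gensim 4.x API).
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)

# Expand a handful of seed terms by looking up their nearest neighbours in vector space.
seed_terms = ["aspirin", "ibuprofen"]
candidates = model.wv.most_similar(positive=seed_terms, topn=10)

# Terms the annotator accepts become token-match patterns for bootstrapping.
with open("patterns.jsonl", "w", encoding="utf8") as f:
    for term, similarity in candidates:
        f.write(json.dumps({"label": "DRUG", "pattern": [{"lower": term}]}) + "\n")

In practice, longer phrases would first be merged during pre-processing (for example, noun chunks joined into single tokens) so that multi-word terms get their own vectors, as described above.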

Using this interface, annotation rates of 10-30 decisions per minute are common. If the decisions are especially easy (e.g. confirming that instances of a phrase are all valid entities), the rate may be several times faster. As the annotator accepts or rejects the suggested phrases, the responses are used to start training a statistical model. Predictions from the statistical model are then mixed into the annotation queue. Despite the sparsity of the signal (binary answers on one phrase per sentence), the model begins to learn surprisingly quickly. A global neural network model is used, with beam-search to allow a form of noise-contrastive estimation training. The pattern matcher and entity recognition model are available in our open-source library spaCy, while the interface, task queue and workflow management are implemented in our annotation tool Prodigy.
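To make the abstract-pattern idea concrete, below is a minimal sketch using spaCy's rule-based Matcher (v3 API). The label names, patterns and example sentence are invented for illustration; in a real project the patterns would come from the lexical learning phase.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Abstract patterns over token attributes rather than exact words:
# a number followed by a unit, and a sequence of proper nouns.
matcher.add("DOSE_CANDIDATE", [[{"LIKE_NUM": True}, {"LOWER": {"IN": ["mg", "ml"]}}]])
matcher.add("NAME_CANDIDATE", [[{"POS": "PROPN", "OP": "+"}]])

doc = nlp("Alex Smith was given 200 mg of ibuprofen in Berlin.")
for match_id, start, end in matcher(doc):
    # Each match is a candidate phrase that can be queued up for a yes/no decision.
    print(nlp.vocab.strings[match_id], doc[start:end].text)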

Ines Montani

April 12, 2018

Transcript

  1. Why we need annotations
     Machine Learning is “programming by example”
     annotations let us specify the output we’re looking for
     even unsupervised methods need to be evaluated on labelled examples
  2. Why annotation tools need to be efficient
     annotation needs iteration: we can’t expect to define the task correctly the first time
     good annotation teams are small – and should collaborate with the data scientist
     lots of high-value opportunities need specialist knowledge and expertise
  3. Why annotation needs to be semi-automatic
     impossible to perform boring, unstructured or multi-step tasks reliably
     humans make mistakes a computer never would, and vice versa
     humans are good at context, ambiguity and precision, computers are good at consistency, memory and recall
  4. “But annotation sucks!”
     1. Excel spreadsheets
        Problem: Excel. Spreadsheets.
     2. Mechanical Turk or external annotators
        Problem: If your results are bad, is it your label scheme, your data or your model?
        “But it’s just cheap click work. Can’t we outsource that?”
  5. “But annotation sucks!”
     1. Excel spreadsheets
        Problem: Excel. Spreadsheets.
     2. Mechanical Turk or external annotators
        Problem: If your results are bad, is it your label scheme, your data or your model?
     3. Unsupervised learning
        Problem: So many clusters – but now what?
  6. Ask simple questions, even for complex tasks – ideally binary
     better annotation speed
     better, easier-to-measure reliability
     in theory: any task can be broken down into a sequence of binary (yes or no) decisions – it just makes your gradients sparse
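For illustration, a binary task of the kind described on this slide reduces to one highlighted span and a single accept/reject answer. The dictionary below is a simplified sketch in the spirit of Prodigy's task format; the exact field names should be checked against the tool's documentation.

# One candidate phrase, one yes/no decision (simplified sketch of a binary NER task).
task = {
    "text": "The patient was given 200 mg of ibuprofen.",
    "spans": [{"start": 32, "end": 41, "label": "DRUG"}],  # highlights "ibuprofen"
}

# The annotator's single click, swipe or keypress becomes the training signal.
task["answer"] = "accept"  # or "reject"
print(task)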
  7. Active learning with pattern bootstrapping
     tell the computer the rules, annotate the exceptions
     build rules semi-automatically using word vectors
     avoid annotating what the model already knows – instead, let the statistical model suggest examples it’s most uncertain about
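The “most uncertain” selection in the last point can be pictured as a simple sorter: score each candidate with the current model and prefer the ones closest to the decision boundary. This is a generic sketch of uncertainty sampling, not Prodigy's actual sorting code; the scores below are invented stand-ins for the model's entity probabilities.

def uncertainty_sort(candidates, score):
    """Order candidate tasks so the ones the model is least sure about come first."""
    # A probability near 0.5 means the model is most uncertain, so it has the most to learn.
    return sorted(candidates, key=lambda task: abs(score(task) - 0.5))


# Toy usage with hand-picked scores standing in for the statistical model's output.
candidate_tasks = [
    {"text": "take two aspirin and rest", "score": 0.93},
    {"text": "the aspirin of cloud computing", "score": 0.48},
    {"text": "Aspirin United won the match", "score": 0.61},
]
queue = uncertainty_sort(candidate_tasks, score=lambda task: task["score"])
print([task["text"] for task in queue])  # most uncertain example comes first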
  8. “Regular” programming
     source code → compiler → runtime → program
     the part you work on: source code
  9. “Regular” programming: source code → compiler → runtime → program
     Machine Learning: training data → training algorithm → runtime → model
     the part you should work on: training data
  10. If you can master annotation...
      ... you can try out more ideas quickly. Most ideas don’t work – but some succeed wildly.
      ... fewer projects will fail. Figure out what works before trying to scale it up.
      ... you can build entirely custom solutions and nobody can lock you in.