Rapid NLP Annotation Through Binary Decisions, Pattern Bootstrapping and Active Learning

In this talk, I'll present a fast, flexible and even somewhat fun approach to named entity annotation. Using our approach, a model can be trained for a new entity type in just a few hours, starting from nothing more than a feed of unannotated text and a handful of seed terms. Given the seed terms, we first perform an interactive lexical learning phase, using a semantic similarity model that can be trained from raw text via an algorithm such as word2vec. The similarity model can be made to learn vectors for longer phrases by pre-processing the text, and abstract patterns can be created that reference token attributes such as part-of-speech tags. The resulting patterns file is then used to present the annotator with a sequence of candidate phrases, so that the annotation can be conducted as a binary choice. The annotator's eyes remain fixed near the centre of the screen, and decisions can be made with a click, swipe or single keypress. Tasks are buffered to prevent delays.
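To make the lexical learning phase concrete, here is a minimal sketch of how seed terms might be expanded by vector similarity and then written out as token patterns. It is not the tool's actual implementation: the toy vectors, the `DRUG` label and the helper names (`suggest_terms`, `to_patterns`) are all invented for illustration, and the pattern dictionaries only loosely follow the style of token-attribute patterns.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def suggest_terms(vectors, accepted, rejected, k=3):
    """Rank unseen terms by similarity to the mean of the accepted terms."""
    target = mean_vector([vectors[term] for term in accepted])
    seen = set(accepted) | set(rejected)
    candidates = [term for term in vectors if term not in seen]
    candidates.sort(key=lambda term: cosine(vectors[term], target), reverse=True)
    return candidates[:k]


def to_patterns(terms, label):
    """Turn accepted terms into one case-insensitive token pattern each (hypothetical format)."""
    return [
        {"label": label, "pattern": [{"LOWER": token} for token in term.lower().split()]}
        for term in terms
    ]


# Toy 3-dimensional vectors, made up for this example. In practice they
# would come from a similarity model trained on raw text.
TOY_VECTORS = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "paracetamol": [0.85, 0.15, 0.05],
    "naproxen": [0.75, 0.25, 0.1],
    "london": [0.0, 0.9, 0.3],
    "table": [0.1, 0.2, 0.9],
}

seeds = ["aspirin", "ibuprofen"]
suggestions = suggest_terms(TOY_VECTORS, accepted=seeds, rejected=[], k=2)
patterns = to_patterns(seeds + suggestions, label="DRUG")
```

In an interactive session, each suggestion would be shown to the annotator as an accept or reject decision, and the accepted and rejected lists would be updated before the next round of suggestions.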

Using this interface, annotation rates of 10-30 decisions per minute are common. If the decisions are especially easy (e.g. confirming that instances of a phrase are all valid entities), the rate may be several times higher. As the annotator accepts or rejects the suggested phrases, the responses are used to start training a statistical model. Predictions from the statistical model are then mixed into the annotation queue. Despite the sparsity of the signal (binary answers on one phrase per sentence), the model begins to learn surprisingly quickly. A global neural network model is used, with beam search to allow a form of noise-contrastive estimation training. The pattern matcher and entity recognition model are available in our open-source library spaCy, while the interface, task queue and workflow management are implemented in our annotation tool Prodigy.
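As a rough illustration of how model predictions might be mixed into the annotation queue, the sketch below interleaves pattern matches with the model's least certain candidates. This is an assumption about one reasonable strategy (uncertainty sampling), not a description of the actual task queue: the `build_queue` and `uncertainty` helpers, the dictionary fields and the scores are all hypothetical.

```python
def uncertainty(score):
    """Map a probability to [0, 1], peaking when the score is 0.5."""
    return 1.0 - abs(score - 0.5) * 2.0


def build_queue(pattern_matches, model_candidates, batch_size=4):
    """Alternate pattern matches with the model's most uncertain candidates."""
    patterns = list(pattern_matches)
    ranked = sorted(
        model_candidates, key=lambda c: uncertainty(c["score"]), reverse=True
    )
    queue = []
    while len(queue) < batch_size and (patterns or ranked):
        if patterns:
            queue.append({**patterns.pop(0), "source": "pattern"})
        if ranked and len(queue) < batch_size:
            queue.append({**ranked.pop(0), "source": "model"})
    return queue


# Invented example data: each task highlights one candidate span in a sentence.
pattern_matches = [
    {"text": "She took aspirin daily.", "span": "aspirin"},
    {"text": "Ibuprofen can upset the stomach.", "span": "Ibuprofen"},
]
model_candidates = [
    {"text": "He was prescribed warfarin.", "span": "warfarin", "score": 0.95},
    {"text": "They sell codeine here.", "span": "codeine", "score": 0.52},
    {"text": "We flew to London.", "span": "London", "score": 0.10},
    {"text": "Try some melatonin.", "span": "melatonin", "score": 0.40},
]

queue = build_queue(pattern_matches, model_candidates, batch_size=4)
```

Each item in `queue` would be shown as a single accept or reject decision. Preferring scores near 0.5 is one way to spend the annotator's time on examples the model is least sure about, while the pattern matches keep supplying positive candidates early on when the model has seen little data.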

Ines Montani

April 12, 2018