2020_02_19_spaCy_pipelines

Building customizable NLP pipelines with spaCy
Sofie Van Landeghem, ML and NLP engineer, Explosion
Turku.AI Meetup, 19 Feb 2020

Sofie Van Landeghem [email protected]
NLP
Natural Language Processing (NLP): extract structured knowledge from unstructured language
Example text: "Be careful with the “package deal”, the sound bar is not LG and has not worked well (if at all) for me. The LG tv is excellent, great everything."
Extracted: sound bar, LG TV
Example text: "... There were 26 complete responses (16%) and 0 partial responses (0%) … The median progression-free survival time was 65 months ..."
Extracted: CR 16%, PR 0%, PFS 65 mo.
Picture from © Rasa

NLP tools
Overview of NLP resources: https://github.com/keon/awesome-nlp
➢ Research overview, tutorials, datasets, libraries
➢ Node.js, Python, C++, Java, Kotlin, Scala, R, Clojure, Ruby, Rust
This talk: specific focus on functionality in the spaCy library
➢ Python + Cython (speed!)
➢ Focus on production usage
➢ Configurable, customisable, extensible, retrainable
➢ Powered by our open-source DL library thinc
➢ Open source (MIT license): https://github.com/explosion/spaCy/
➢ Comparison to other NLP libraries: https://spacy.io/usage/facts-figures

A spaCy pipeline
[Figure: Text → nlp (tokenizer → tagger → parser → ner → ...) → Doc, with an example ORG entity label]
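
The pipeline idea can be sketched in plain Python: a tokenizer turns text into a document object, and each later component receives that document, annotates it, and passes it on. This is only a toy illustration of the design, not spaCy's implementation; the dict-based document, the `Pipeline` class, and the `KNOWN_ORGS` lookup are all made up for this sketch.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, list]  # hypothetical stand-in for spaCy's Doc object
KNOWN_ORGS = {"Explosion"}  # toy gazetteer, for illustration only


def tokenizer(text: str) -> Doc:
    """Create the document; every later component works on this object."""
    return {"tokens": text.split(), "tags": [], "ents": []}


def tagger(doc: Doc) -> Doc:
    """Toy tagger: marks capitalised tokens, everything else is 'X'."""
    doc["tags"] = ["CAP" if tok[:1].isupper() else "X" for tok in doc["tokens"]]
    return doc


def ner(doc: Doc) -> Doc:
    """Toy entity recogniser based on a lookup table."""
    doc["ents"] = [(tok, "ORG") for tok in doc["tokens"] if tok in KNOWN_ORGS]
    return doc


class Pipeline:
    def __init__(self, tokenizer: Callable[[str], Doc],
                 components: List[Tuple[str, Callable[[Doc], Doc]]]):
        self.tokenizer = tokenizer
        self.components = components

    @property
    def pipe_names(self) -> List[str]:
        return [name for name, _ in self.components]

    def __call__(self, text: str) -> Doc:
        doc = self.tokenizer(text)
        for _, component in self.components:  # applied in order
            doc = component(doc)
        return doc


nlp = Pipeline(tokenizer, [("tagger", tagger), ("ner", ner)])
doc = nlp("Sofie works at Explosion")
print(nlp.pipe_names, doc["ents"])
```

Because every component has the same "document in, document out" shape, components can be added, removed, or reordered without touching the others.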

An entirely random text

Pretrained models
    pip install spacy
    python -m spacy download en_core_web_lg
    import spacy
    nlp = spacy.load('en_core_web_lg')
    doc = nlp(text)
    print([(token.text, token.pos_) for token in doc])
    from spacy import displacy
    displacy.serve(doc, style='ent')
Output (excerpt):
    an         DET
    AI         PROPN
    scientist  NOUN
    in         ADP
    Silo       PROPN
    .          PUNCT
    AI         PROPN
    ’s         NOUN
    NLP        PROPN
    team       NOUN
    .          PUNCT
➢ Pre-trained models for 10 languages

Customize tokenization
    from spacy.symbols import ORTH
    nlp.tokenizer.add_special_case("Silo.AI", [{ORTH: "Silo.AI"}])
Output (excerpt):
    an         DET
    AI         PROPN
    scientist  NOUN
    in         ADP
    Silo.AI    PROPN
    ’s         PART
    NLP        PROPN
    team       NOUN
    .          PUNCT
➢ Built-in language-specific rules for 50 languages
➢ Pull requests improving your favourite language are always welcome!
➢ Extend the default rules with your own
➢ Define specific tokenization exceptions, e.g. "don't": [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
➢ Implement an entirely new tokenizer
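
To see why special cases matter, here is a toy tokenizer in plain Python. It is not spaCy's algorithm: the default rule (split words from punctuation with a regular expression) and the `SPECIAL_CASES` table are assumptions made for this sketch. It only shows how an exception table can override the default splitting rules.

```python
import re
from typing import Dict, List

# Hypothetical exception table: surface string -> the tokens it should become.
SPECIAL_CASES: Dict[str, List[str]] = {
    "Silo.AI": ["Silo.AI"],   # keep as a single token
    "don't": ["do", "n't"],   # split into two tokens
}


def tokenize(text: str, special_cases: Dict[str, List[str]]) -> List[str]:
    tokens: List[str] = []
    for chunk in text.split():
        if chunk in special_cases:
            # An exception wins over the default rule.
            tokens.extend(special_cases[chunk])
        else:
            # Default rule: runs of word characters, or single punctuation marks.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


text = "I don't work at Silo.AI"
print(tokenize(text, {}))             # default rules only
print(tokenize(text, SPECIAL_CASES))  # with the exception table
```

Without the table, the default rule breaks "Silo.AI" into three pieces, which mirrors the "Silo", ".", "AI" split shown on the pretrained-models slide.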

(re)train models
    import random
    from spacy.util import minibatch
    # Entity offsets are character positions: start inclusive, end exclusive
    TRAIN_DATA = [
        ("Filip Ginter works for Silo.AI.", {"entities": [(0, 12, "PERSON"), (23, 30, "ORG")]}),
        (…)
    ]
    # New model: optimizer = nlp.begin_training()
    # Existing model: optimizer = nlp.resume_training()
    optimizer = nlp.resume_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0, losses=losses)
        print(f"Loss at {itn} is {losses['ner']}")
Retrain / refine existing ML models
➢ Add new labels (e.g. NER)
➢ Feed in new data
➢ Ensure the model doesn’t “forget” what it learned before!
➢ Feed in “old” examples too
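
Character offsets in training data are easy to get wrong by one. A quick sanity check, sketched here in plain Python (the helper name `entity_spans` is made up for this example), is to slice the text with each offset pair and read what comes out before training.

```python
from typing import Dict, List, Tuple

TRAIN_DATA = [
    ("Filip Ginter works for Silo.AI.",
     {"entities": [(0, 12, "PERSON"), (23, 30, "ORG")]}),
]


def entity_spans(text: str, annotations: Dict[str, list]) -> List[Tuple[str, str]]:
    """Return the substring each (start, end, label) annotation points at."""
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]


for text, annotations in TRAIN_DATA:
    print(entity_spans(text, annotations))
```

A start offset of 24 instead of 23 would yield "ilo.AI", which is the kind of mistake this check makes visible.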

CLI train
    python -m spacy convert ud-treebanks-v2.4\UD_Finnish-TDT\fi_tdt-ud-train.conllu fi_json
    python -m spacy convert ud-treebanks-v2.4\UD_Finnish-TDT\fi_tdt-ud-dev.conllu fi_json
    python -m spacy train fi output fi_json\fi_tdt-ud-train.json fi_json\fi_tdt-ud-dev.json
Train ML models from scratch
➢ Built-in support for UD annotations
    Itn  Tag loss   Tag %   Dep loss    UAS     LAS
    1    39475.358  89.109  201983.313  65.778  51.614
    2    23837.115  90.463  169409.391  71.149  59.22
    3    18800.934  91.146  153834.198  73.244  62.157
    4    15685.533  91.818  142268.533  74.149  63.751
    5    13529.039  92.118  134673.218  75.209  65.086
    …    …          …       …           …       …
Note that this won’t automatically give state-of-the-art results. There is no tuning, hyperparameter selection or language-specific customization (yet)!
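
The UAS and LAS columns are the standard dependency-parsing metrics: the unlabeled attachment score counts tokens whose predicted head is correct, and the labeled attachment score additionally requires the correct dependency label. A minimal sketch of the computation follows; the four-token example and the function name are invented for illustration and are not taken from the Finnish treebank.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency label)


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    correct_heads = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    correct_arcs = sum(1 for g, p in zip(gold, predicted) if g == p)
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_arcs / total


# Made-up annotations for a four-token sentence.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
predicted = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS {uas:.1f}  LAS {las:.1f}")
```

LAS can never exceed UAS, which matches the table above.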

spaCy v.3: more configurable pipelines (powered by Thinc v.8)

Thinc v.8
➢ New deep learning library
➢ Released in January 2020
➢ Has been powering spaCy for years
➢ Entirely revamped for Python 3
➢ Type annotations
➢ Functional-programming concept: no computational graph, just higher-order functions
➢ Wrappers for PyTorch, MXNet & TensorFlow
➢ Extensive documentation: https://thinc.ai
    from typing import Callable, Tuple
    from thinc.types import Floats2d
    def relu(inputs: Floats2d) -> Tuple[Floats2d, Callable[[Floats2d], Floats2d]]:
        mask = inputs >= 0
        def backprop_relu(d_outputs: Floats2d) -> Floats2d:
            return d_outputs * mask
        return inputs * mask, backprop_relu
➢ Layer performs the forward function
➢ Returns the forward results + a callback for the backprop
➢ The backprop calculates the gradient of the inputs, given the gradient of the outputs
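
The "no computational graph, just higher-order functions" idea becomes clearer when two layers are composed. The sketch below uses plain Python lists instead of Thinc arrays, and `scale` and `chain` are hypothetical helpers written for this example, not Thinc's API. Each layer returns its output plus a backprop callback, and composition simply calls the callbacks in reverse order.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Backprop = Callable[[Vector], Vector]
Layer = Callable[[Vector], Tuple[Vector, Backprop]]


def relu(inputs: Vector) -> Tuple[Vector, Backprop]:
    mask = [x >= 0 for x in inputs]

    def backprop_relu(d_outputs: Vector) -> Vector:
        # Gradient passes through only where the input was kept.
        return [d if keep else 0.0 for d, keep in zip(d_outputs, mask)]

    return [x if keep else 0.0 for x, keep in zip(inputs, mask)], backprop_relu


def scale(factor: float) -> Layer:
    """Hypothetical layer that multiplies every value by a constant."""
    def forward(inputs: Vector) -> Tuple[Vector, Backprop]:
        def backprop(d_outputs: Vector) -> Vector:
            return [d * factor for d in d_outputs]
        return [x * factor for x in inputs], backprop
    return forward


def chain(first: Layer, second: Layer) -> Layer:
    """Compose two layers; the combined backprop runs the callbacks in reverse."""
    def forward(inputs: Vector) -> Tuple[Vector, Backprop]:
        hidden, backprop_first = first(inputs)
        outputs, backprop_second = second(hidden)

        def backprop(d_outputs: Vector) -> Vector:
            return backprop_first(backprop_second(d_outputs))

        return outputs, backprop
    return forward


model = chain(relu, scale(2.0))
outputs, backprop = model([-1.0, 3.0])
d_inputs = backprop([1.0, 1.0])
print(outputs, d_inputs)
```

The closures hold on to whatever the backward pass needs (here the mask), so no separate graph structure is required.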

Type checking
➢ By annotating variables with appropriate types, bugs can be prevented!
➢ Facilitates autocompletion
➢ Uses mypy
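
As a small illustration of the kind of bug annotations can prevent, consider the function below (invented for this example). Passing a plain string instead of a list of tokens runs without any error in Python, because a string is iterable, but silently computes the wrong thing. A static checker such as mypy would be expected to reject that call as an incompatible argument type.

```python
from typing import List


def mean_token_length(tokens: List[str]) -> float:
    """Average number of characters per token."""
    return sum(len(token) for token in tokens) / len(tokens)


average = mean_token_length(["spaCy", "is", "fast"])
print(average)

# The following call runs, but iterates over single characters and returns 1.0.
# A type checker should flag it, since str is not List[str]:
# mean_token_length("spaCy is fast")
```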

Configuration (thinc.ai)
➢ Configuration file
➢ Built-in registered functions
➢ Define your own functions
➢ All hyperparameters explicitly defined
➢ All objects are built when parsing the config
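
The combination of "registered functions" and "objects are built when parsing the config" can be sketched with the standard library alone. This is not Thinc's config system: the `REGISTRY` dict, the `register` decorator, the `@function` key, and the `decaying_schedule` function are all hypothetical names chosen for this illustration.

```python
import configparser
import json
from typing import Callable, Dict

REGISTRY: Dict[str, Callable] = {}


def register(name: str) -> Callable:
    """Decorator that makes a function available to config files by name."""
    def decorator(func: Callable) -> Callable:
        REGISTRY[name] = func
        return func
    return decorator


@register("decaying_schedule")
def decaying_schedule(base_rate: float, decay: float) -> Callable[[int], float]:
    return lambda step: base_rate / (1.0 + decay * step)


CONFIG_TEXT = """
[learn_rate]
@function = "decaying_schedule"
base_rate = 0.001
decay = 0.5
"""


def build(config_text: str, section: str):
    """Parse one section and call the registered function with its settings."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Values are written as JSON so that numbers and strings keep their types.
    values = {key: json.loads(value) for key, value in parser[section].items()}
    func = REGISTRY[values.pop("@function")]
    return func(**values)


schedule = build(CONFIG_TEXT, "learn_rate")
print(schedule(0), schedule(2))
```

Every hyperparameter lives in the file, so swapping in a different registered function only requires editing the config, not the training code.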

Train from config
➢ Full control over all settings & parameters
➢ Configurations can easily be saved & shared
➢ Running experiments quickly, e.g. swap in a different tok2vec component
➢ Full support for pre-trained BERT, XLNet and GPT-2 (cf. spacy-transformers package) a.k.a. “starter models”
➢ Keep an eye out for spaCy v.3!
    python -m spacy train-from-config fi train.json dev.json config.cfg
Coming soon

New pipeline components

Coreference resolution
➔ Links together entities that refer to the same thing / person
➔ e.g. “he” refers to “Nader”
Hugging Face’s neuralcoref package works with spaCy

Entity Linking (EL)
Who are all these Byrons in this text?

EL examples
Johnny Carson: American talk show host, or American football player?
Russ Cochran: American golfer, or publisher?
Rose: English footballer, or character from the TV series "Doctor Who"?

EL framework
STEP 0: Assume NER has already happened on the raw text, so we have entities + labels
STEP 1: Candidate generation: create a list of plausible WikiData IDs for a mention
STEP 2: Entity Linking: disambiguate these candidates to the most likely ID
[Figure: Text → NER → NER mentions → candidate generation → List of candidates for each mention → EL → One entity ID for each mention]
Example text: "Ms Byron would become known as the first programmer."
NER mention: Byron: PERSON
Candidates: Q5679: Lord George Gordon Byron, Q272161: Anne Isabella Byron, Q7259: Ada Lovelace
Linked entity: Q7259
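
Steps 1 and 2 can be sketched in a few lines of plain Python. This is a toy stand-in, not the model from the talk: the prior probabilities and entity descriptions below are invented for illustration, and disambiguation is done with a simple word-overlap score rather than a neural network. Only the three WikiData IDs come from the slide.

```python
import re
from typing import Dict, Optional, Set

# Toy knowledge base; the descriptions are illustrative assumptions.
KB = {
    "Q5679": "English poet and peer",
    "Q272161": "educational reformer and philanthropist",
    "Q7259": "mathematician described as the first computer programmer",
}

# Toy alias table with made-up prior probabilities P(entity | mention).
ALIASES: Dict[str, Dict[str, float]] = {
    "Byron": {"Q5679": 0.5, "Q272161": 0.2, "Q7259": 0.3},
}


def words(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def generate_candidates(mention: str) -> Dict[str, float]:
    """STEP 1: look up plausible IDs (with priors) for a mention."""
    return ALIASES.get(mention, {})


def link(mention: str, sentence: str) -> Optional[str]:
    """STEP 2: pick the candidate whose prior and context fit best."""
    candidates = generate_candidates(mention)
    if not candidates:
        return None
    context = words(sentence)

    def score(kb_id: str) -> float:
        overlap = len(context & words(KB[kb_id]))
        return candidates[kb_id] * (1 + overlap)

    return max(candidates, key=score)


sentence = "Ms Byron would become known as the first programmer."
prior_only = max(generate_candidates("Byron"), key=generate_candidates("Byron").get)
print(prior_only, link("Byron", sentence))
```

With these made-up numbers the prior alone would pick the poet, while adding sentence context selects Ada Lovelace, which is why both signals are combined.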

Accuracy on news
➔ Trained a convolutional NN model on 165K Wikipedia articles
➔ Manually annotated news data for evaluation: 360 entity links
➔ Adding in coreference resolution
  • All entities in the same coref chain should link to the same entity
  • Assign the KB ID with the highest confidence across the chain
  • Performance (EL+prior) drops to 70.9% → further work required
                                Context  Corpus stats  Gold label  Accuracy
    Random baseline             -        -             -           33.4 %
    Entity linking only         x        -             -           60.1 %
    Prior probability baseline  -        x             -           67.2 %
    EL + prior probability      x        x             -           71.4 %
    Oracle KB performance       x        x             x           85.2 %
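
The coreference rule described above (one KB ID per chain, chosen by highest confidence) is simple to express in code. The sketch below is an assumption about how such a rule could look, not the implementation evaluated in the talk; the mentions and confidence values are invented.

```python
from typing import Dict, Tuple

Prediction = Tuple[str, float]  # (KB ID, confidence of the entity linker)


def link_chain(chain_predictions: Dict[str, Prediction]) -> Dict[str, str]:
    """Give every mention in a coref chain the ID of the most confident link."""
    best_id, _ = max(chain_predictions.values(), key=lambda pred: pred[1])
    return {mention: best_id for mention in chain_predictions}


# Made-up per-mention predictions for one coreference chain.
chain = {
    "Ms Byron": ("Q272161", 0.45),
    "Ada Lovelace": ("Q7259", 0.95),
    "she": ("Q5679", 0.30),
}

linked = link_chain(chain)
print(linked)
```

The rule helps when the most confident mention is right, but a single confident mistake is copied to the whole chain, which may be one reason accuracy dropped slightly in this experiment.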

Wrapping it up
spaCy is a production-ready NLP library in Python that lets you quickly implement text-mining solutions, and is extensible & retrainable. 3.0 will bring lots of cool new features!
Thinc is a new Deep Learning library making extensive use of type annotations and supporting configuration files.
Prodigy is Explosion’s annotation tool powered by Active Learning.
@explosion_ai @oxykodit @spacy.io

Questions?
There are no stupid questions – so let’s assume there are no stupid answers either.
