Slide 1

Building customizable NLP pipelines with spaCy

Sofie Van Landeghem
ML and NLP engineer, Explosion

Turku.AI Meetup, 19 Feb 2020

Slide 2

Sofie Van Landeghem, [email protected]

NLP

Natural Language Processing (NLP): extract structured knowledge from unstructured language. (Picture from © Rasa)

Example (product review):
“Be careful with the “package deal”, the sound bar is not LG and has not worked well (if at all) for me. The LG tv is excellent, great everything.”
→ entities: “sound bar”, “LG TV”

Example (clinical text):
“There were 26 complete responses (16%) and 0 partial responses (0%) … The median progression-free survival time was 65 months ...”
→ extracted values: CR 16%, PR 0%, PFS 65 mo.

Slide 3

NLP tools

Overview of NLP resources: https://github.com/keon/awesome-nlp
➢ Research overview, tutorials, datasets, libraries
➢ Node.js, Python, C++, Java, Kotlin, Scala, R, Clojure, Ruby, Rust

This talk: specific focus on functionality in the spaCy library
➢ Python + Cython (speed!)
➢ Focus on production usage
➢ Configurable, customisable, extensible, retrainable
➢ Powered by our open-source DL library Thinc
➢ Open source (MIT license): https://github.com/explosion/spaCy/
➢ Comparison to other NLP libraries: https://spacy.io/usage/facts-figures

Slide 4

A spaCy pipeline

Text → nlp → Doc

The nlp object applies the tokenizer first, then each pipeline component in order: tagger, parser, ner, ... The resulting Doc carries the accumulated annotations (e.g. an ORG entity label).
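The flow on this slide can be sketched in plain Python. The Doc, tagger and ner below are toy stand-ins for spaCy's real components, a minimal illustration of the pattern, not spaCy's implementation:

```python
# Minimal sketch of the pipeline pattern (plain Python, not spaCy's real
# classes): the tokenizer turns text into a Doc, then each component adds
# its annotations to the same Doc in order.
class Doc:
    def __init__(self, tokens):
        self.tokens = tokens
        self.tags = []
        self.ents = []

def tokenizer(text):
    return Doc(text.split())

def tagger(doc):
    # toy rule: capitalised words are proper nouns, everything else "X"
    doc.tags = ["PROPN" if t.istitle() else "X" for t in doc.tokens]
    return doc

def ner(doc):
    # toy rule: treat capitalised tokens as entities
    doc.ents = [t for t in doc.tokens if t.istitle()]
    return doc

def nlp(text, pipeline=(tagger, ner)):
    doc = tokenizer(text)
    for component in pipeline:
        doc = component(doc)
    return doc

doc = nlp("Explosion builds spaCy")
print(doc.ents)  # ['Explosion']
```

Because each component only reads and writes the shared Doc, components can be added, removed or swapped without touching the others.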

Slide 5

An entirely random text

Slide 6

Pretrained models

➢ Pre-trained models for 10 languages

pip install spacy
python -m spacy download en_core_web_lg

import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(text)

print([(token.text, token.pos_) for token in doc])
# [('an', 'DET'), ('AI', 'PROPN'), ('scientist', 'NOUN'), ('in', 'ADP'),
#  ('Silo', 'PROPN'), ('.', 'PUNCT'), ('AI', 'PROPN'), ('’s', 'NOUN'),
#  ('NLP', 'PROPN'), ('team', 'NOUN'), ('.', 'PUNCT')]

from spacy import displacy
displacy.serve(doc, style='ent')

Slide 7

Customize tokenization

from spacy.symbols import ORTH
nlp.tokenizer.add_special_case("Silo.AI", [{ORTH: "Silo.AI"}])

# [('an', 'DET'), ('AI', 'PROPN'), ('scientist', 'NOUN'), ('in', 'ADP'),
#  ('Silo.AI', 'PROPN'), ('’s', 'PART'), ('NLP', 'PROPN'),
#  ('team', 'NOUN'), ('.', 'PUNCT')]

➢ Built-in language-specific rules for 50 languages
➢ Pull requests improving your favourite language are always welcome!
➢ Extend the default rules with your own
➢ Define specific tokenization exceptions, e.g.
   "don't": [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
➢ Implement an entirely new tokenizer

Slide 8

(Re)train models

Retrain / refine existing ML models
➢ Add new labels (e.g. NER)
➢ Feed in new data
➢ Ensure the model doesn’t “forget” what it learned before: feed in “old” examples too

import random
from spacy.util import minibatch

optimizer = nlp.begin_training()   # train from scratch
optimizer = nlp.resume_training()  # refine an existing model

TRAIN_DATA = [
    ("Filip Ginter works for Silo.AI.",
     {"entities": [(0, 12, "PERSON"), (23, 30, "ORG")]}),
    (…)
]

for itn in range(n_iter):
    random.shuffle(TRAIN_DATA)
    batches = minibatch(TRAIN_DATA)
    losses = {}
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0, losses=losses)
    print(f"Loss at {itn} is {losses['ner']}")

Slide 9

CLI train

Train ML models from scratch
➢ Built-in support for UD annotations

python -m spacy convert ud-treebanks-v2.4\UD_Finnish-TDT\fi_tdt-ud-train.conllu fi_json
python -m spacy convert ud-treebanks-v2.4\UD_Finnish-TDT\fi_tdt-ud-dev.conllu fi_json
python -m spacy train fi output fi_json\fi_tdt-ud-train.json fi_json\fi_tdt-ud-dev.json

Itn   Tag loss    Tag %    Dep loss      UAS      LAS
1     39475.358   89.109   201983.313    65.778   51.614
2     23837.115   90.463   169409.391    71.149   59.22
3     18800.934   91.146   153834.198    73.244   62.157
4     15685.533   91.818   142268.533    74.149   63.751
5     13529.039   92.118   134673.218    75.209   65.086
…     …           …        …             …        …

Note that this won’t automatically give state-of-the-art results: there is no tuning, hyperparameter selection or language-specific customization (yet)!

Slide 10

spaCy v.3: more configurable pipelines (powered by Thinc v.8)

Slide 11

Thinc v.8

➢ New deep learning library, released in January 2020
➢ Has been powering spaCy for years; entirely revamped for Python 3
➢ Type annotations
➢ Functional-programming concept: no computational graph, just higher-order functions
➢ Wrappers for PyTorch, MXNet & TensorFlow
➢ Extensive documentation: https://thinc.ai

def relu(inputs: Floats2d) -> Tuple[Floats2d, Callable[[Floats2d], Floats2d]]:
    mask = inputs >= 0
    def backprop_relu(d_outputs: Floats2d) -> Floats2d:
        return d_outputs * mask
    return inputs * mask, backprop_relu

➢ A layer performs the forward function
➢ Returns the forward results + a callback for the backprop
➢ The backprop calculates the gradient of the inputs, given the gradient of the outputs
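The "no computational graph, just higher-order functions" idea can be shown with scalar toy layers: each layer returns its output plus a backprop callback, and composing layers is just composing the callbacks in reverse. This is a plain-Python sketch of the concept, not Thinc's actual API:

```python
# Each toy layer returns (output, backprop_callback); chaining layers
# records the callbacks and replays them in reverse for the backward pass.
def relu(x):
    mask = 1.0 if x >= 0 else 0.0
    def backprop(d_out):
        return d_out * mask
    return x * mask, backprop

def double(x):
    def backprop(d_out):
        return d_out * 2.0
    return x * 2.0, backprop

def chain(*layers):
    # Higher-order composition: no graph object is built anywhere,
    # the closures themselves carry everything backprop needs.
    def forward(x):
        callbacks = []
        for layer in layers:
            x, backprop = layer(x)
            callbacks.append(backprop)
        def backprop_chain(d_out):
            for cb in reversed(callbacks):
                d_out = cb(d_out)
            return d_out
        return x, backprop_chain
    return forward

model = chain(double, relu)
y, backprop = model(3.0)
print(y, backprop(1.0))  # 6.0 2.0
```

The gradient 2.0 is exactly d(relu(2x))/dx at x = 3: the relu callback passes the gradient through (positive input), and the double callback multiplies it by 2.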

Slide 12

Type checking

➢ By annotating variables with appropriate types, bugs can be prevented!
➢ Facilitates autocompletion
➢ Uses mypy
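As a small illustration of the kind of bug static type checking catches (the function and the exact mypy message are illustrative):

```python
from typing import List

def mean(values: List[float]) -> float:
    return sum(values) / len(values)

# Running mypy on this file flags the bad call below before the code
# ever runs, with an error along the lines of:
#   error: Argument 1 to "mean" has incompatible type "str";
#          expected "List[float]"
# mean("abc")

print(mean([1.0, 2.0, 3.0]))  # 2.0
```

The annotations also let editors autocomplete and document the function without running it.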

Slide 13

Configuration (THINC.AI)

➢ Configuration file
➢ Built-in registered functions
➢ Define your own functions
➢ All hyperparameters explicitly defined
➢ All objects are built when parsing the config
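A config file in this style might look roughly like the fragment below. Section names, the registry name and the values are illustrative, not copied from Thinc's documentation:

```ini
[training]
n_iter = 10
dropout = 0.2

[optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
```

The `@optimizers` line refers to a registered function: when the config is parsed, that function is looked up in the registry and called with the remaining keys as arguments, so the whole object tree is built directly from the file.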

Slide 14

Train from config

python -m spacy train-from-config fi train.json dev.json config.cfg   (coming soon)

➢ Full control over all settings & parameters
➢ Configurations can easily be saved & shared
➢ Run experiments quickly, e.g. swap in a different tok2vec component
➢ Full support for pre-trained BERT, XLNet and GPT-2 (cf. the spacy-transformers package), a.k.a. “starter models”
➢ Keep an eye on spaCy v.3!

Slide 15

New pipeline components

Slide 16

Coreference resolution

➔ Links together entities that refer to the same thing / person
➔ e.g. “he” refers to “Nader”

Hugging Face’s neuralcoref package works with spaCy

Slide 17

Entity Linking (EL)

Who are all these Byrons in this text?

Slide 18

EL examples

Johnny Carson: American talk show host, or American football player?
Russ Cochran: American golfer, or publisher?
Rose: English footballer, or character from the TV series "Doctor Who"?

Slide 19

EL framework

STEP 0: Assume NER has already happened on the raw text, so we have entities + labels
STEP 1: Candidate generation: create a list of plausible WikiData IDs for a mention
STEP 2: Entity Linking: disambiguate these candidates to the most likely ID

Text → NER → NER mentions → candidate generation → list of candidates for each mention → EL → one entity ID for each mention

Example: “Ms Byron would become known as the first programmer.”
➢ NER: Byron → PERSON
➢ Candidates: Q5679 (Lord George Gordon Byron), Q272161 (Anne Isabella Byron), Q7259 (Ada Lovelace)
➢ EL: Q7259 (Ada Lovelace)
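Steps 1 and 2 can be sketched with a toy in-memory knowledge base. The IDs match the slide, but the descriptions, keywords and prior probabilities below are invented for illustration; the scoring function stands in for the trained disambiguation model:

```python
# Toy sketch of candidate generation + disambiguation. A real system
# builds the KB (with priors and descriptions) from Wikipedia/WikiData.
KB = {
    "Byron": [
        ("Q5679", "Lord George Gordon Byron", {"poet", "lord"}, 0.6),
        ("Q272161", "Anne Isabella Byron", {"baroness"}, 0.3),
        ("Q7259", "Ada Lovelace", {"programmer", "mathematician"}, 0.1),
    ],
}

def candidates(mention):
    # STEP 1: candidate generation - plausible IDs for the mention string
    return KB.get(mention, [])

def link(mention, context):
    # STEP 2: disambiguation - toy score combining the prior probability
    # with context/description word overlap (stand-in for a learned model)
    words = set(context.lower().replace(".", "").split())
    cands = candidates(mention)
    if not cands:
        return None
    return max(cands, key=lambda c: c[3] + len(words & c[2]))[0]

print(link("Byron", "Ms Byron would become known as the first programmer."))
# Q7259
```

With no useful context the prior alone would win (the poet, Q5679); the word "programmer" in the sentence tips the score towards Ada Lovelace.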

Slide 20

Accuracy on news

➔ Trained a convolutional NN model on 165K Wikipedia articles
➔ Manually annotated news data for evaluation: 360 entity links

                             Context   Corpus stats   Gold label   Accuracy
Random baseline                 -           -             -         33.4 %
Entity linking only             x           -             -         60.1 %
Prior probability baseline      -           x             -         67.2 %
EL + prior probability          x           x             -         71.4 %
Oracle KB performance           x           x             x         85.2 %

➔ Adding in coreference resolution
   ● All entities in the same coref chain should link to the same entity
   ● Assign the KB ID with the highest confidence across the chain
   ● Performance (EL + prior) drops to 70.9% → further work required

Slide 21

Wrapping it up

spaCy is a production-ready NLP library in Python that lets you quickly implement text-mining solutions, and is extensible & retrainable. Version 3.0 will bring lots of cool new features!

Thinc is a new deep learning library making extensive use of type annotations and supporting configuration files.

Prodigy is Explosion’s annotation tool, powered by active learning.

@explosion_ai @oxykodit @spacy.io

Slide 22

Questions?

There are no stupid questions – so let’s assume there are no stupid answers either.