Building customizable NLP pipelines with spaCy

Presentation given by Sofie Van Landeghem at the Turku.AI meetup of Feb 19, 2020

Sofie Van Landeghem

February 19, 2020


  1. Building customizable NLP pipelines with spaCy

    Sofie Van Landeghem, ML and NLP engineer, Explosion
    Turku.AI Meetup, 19 Feb 2020
  2. NLP

    Natural Language Processing (NLP): extract structured knowledge from unstructured language

    Text: "Be careful with the “package deal”, the sound bar is not LG and has not worked well (if at all) for me. The LG tv is excellent, great everything."
    Extracted: sound bar, LG TV

    Text: "... There were 26 complete responses (16%) and 0 partial responses (0%) … The median progression-free survival time was 65 months ..."
    Extracted: CR 16%, PR 0%, PFS 65 mo.

    Picture from © Rasa
  3. NLP tools

    Overview of NLP resources: https://github.com/keon/awesome-nlp
    ➢ Research overview, tutorials, datasets, libraries
    ➢ Node.js, Python, C++, Java, Kotlin, Scala, R, Clojure, Ruby, Rust

    This talk: specific focus on functionality in the spaCy library
    ➢ Python + Cython (speed!)
    ➢ Focus on production usage
    ➢ Configurable, customisable, extensible, retrainable
    ➢ Powered by our open-source DL library thinc
    ➢ Open source (MIT license): https://github.com/explosion/spaCy/
    ➢ Comparison to other NLP libraries: https://spacy.io/usage/facts-figures
  4. Pretrained models

    pip install spacy
    python -m spacy download en_core_web_lg

    import spacy
    nlp = spacy.load('en_core_web_lg')
    doc = nlp(text)  # text: the raw input string (not shown on the slide)

    from spacy import displacy
    displacy.serve(doc, style='ent')

    print([(token.text, token.pos_) for token in doc])

    an          DET
    AI          PROPN
    scientist   NOUN
    in          ADP
    Silo        PROPN
    .           PUNCT
    AI          PROPN
    ’s          NOUN
    NLP         PROPN
    team        NOUN
    .           PUNCT

    ➢ Pre-trained models for 10 languages
  5. Customize tokenization

    from spacy.symbols import ORTH
    nlp.tokenizer.add_special_case("Silo.AI", [{ORTH: "Silo.AI"}])

    an          DET
    AI          PROPN
    scientist   NOUN
    in          ADP
    Silo.AI     PROPN
    ’s          PART
    NLP         PROPN
    team        NOUN
    .           PUNCT

    ➢ Built-in language-specific rules for 50 languages
    ➢ Pull requests improving your favourite language are always welcome!
    ➢ Extend the default rules with your own
    ➢ Define specific tokenization exceptions, e.g. "don't": [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
    ➢ Implement an entirely new tokenizer
  6. (re)train models

    Retrain / refine existing ML models
    ➢ Add new labels (e.g. NER)
    ➢ Feed in new data
    ➢ Ensure the model doesn’t “forget” what it learned before!
    ➢ Feed in “old” examples too

    import random
    from spacy.util import minibatch

    # Entity spans are (start_char, end_char, label); "Silo.AI" covers characters 23 to 30.
    TRAIN_DATA = [
        ("Filip Ginter works for Silo.AI.", {"entities": [(0, 12, "PERSON"), (23, 30, "ORG")]}),
        (…)
    ]

    # Pick one: begin_training() when training from scratch,
    # resume_training() when refining an existing model.
    optimizer = nlp.begin_training()
    optimizer = nlp.resume_training()

    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0, losses=losses)
        print(f"Loss at {itn} is {losses['ner']}")
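Character offsets in this annotation format are easy to get wrong by one. A minimal sketch of a sanity check, in plain Python without spaCy: it slices each annotated span out of the text and tests whether the span starts and ends on a word boundary. The helper names `entity_texts` and `span_is_aligned` are hypothetical, and the boundary test is a simple heuristic rather than spaCy's own tokenizer alignment.

```python
from typing import Dict, List, Tuple

# Annotation format as on the slide: (text, {"entities": [(start, end, label), ...]})
Example = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def entity_texts(example: Example) -> List[Tuple[str, str]]:
    """Return (surface text, label) for every annotated span."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


def span_is_aligned(text: str, start: int, end: int) -> bool:
    """Heuristic (an assumption, not spaCy's rule): a span should be non-empty,
    start right after whitespace, and not end in the middle of a word."""
    if not (0 <= start < end <= len(text)):
        return False
    starts_clean = start == 0 or text[start - 1].isspace()
    ends_clean = end == len(text) or not text[end].isalnum()
    return starts_clean and ends_clean


example: Example = (
    "Filip Ginter works for Silo.AI.",
    {"entities": [(0, 12, "PERSON"), (23, 30, "ORG")]},
)

for surface, label in entity_texts(example):
    print(label, repr(surface))
```

For the example sentence, the ORG span has to start at character 23: a start offset of 24 would slice out "ilo.AI", and `span_is_aligned` would flag it.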
  7. CLI train

    Train ML models from scratch
    ➢ Built-in support for UD annotations

    python -m spacy convert ud-treebanks-v2.4\UD_Finnish-TDT\fi_tdt-ud-train.conllu fi_json
    python -m spacy convert ud-treebanks-v2.4\UD_Finnish-TDT\fi_tdt-ud-dev.conllu fi_json
    python -m spacy train fi output fi_json\fi_tdt-ud-train.json fi_json\fi_tdt-ud-dev.json

    Itn   Tag loss     Tag %    Dep loss      UAS      LAS
    1     39475.358    89.109   201983.313    65.778   51.614
    2     23837.115    90.463   169409.391    71.149   59.22
    3     18800.934    91.146   153834.198    73.244   62.157
    4     15685.533    91.818   142268.533    74.149   63.751
    5     13529.039    92.118   134673.218    75.209   65.086
    …     …            …        …             …        ...

    Note that this won’t automatically give state-of-the-art results... There is no tuning, hyperparameter selection or language-specific customization (yet)!
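The UAS and LAS columns are the usual dependency-parsing metrics: unlabeled attachment score (share of tokens with the correct head) and labeled attachment score (correct head and correct dependency label). A minimal sketch of how they are computed, on made-up toy parses; the function name `attachment_scores` is hypothetical, and details of spaCy's own scorer (such as punctuation handling) are not covered here.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency label)


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over one aligned token sequence."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    # UAS counts tokens whose head is right, regardless of the label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS counts tokens whose head AND label are both right.
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


# Toy data (invented for illustration): 4 tokens, one wrong label, one wrong head.
gold = [(1, "nsubj"), (1, "root"), (3, "det"), (1, "obj")]
predicted = [(1, "nsubj"), (1, "root"), (3, "amod"), (2, "obj")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # UAS=75.0 LAS=50.0
```

LAS can never exceed UAS, which matches the table above.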
  8. Thinc v.8

    ➢ New deep learning library
    ➢ Released in January 2020
    ➢ Has been powering spaCy for years
    ➢ Entirely revamped for Python 3
    ➢ Type annotations
    ➢ Functional-programming concept: no computational graph, just higher-order functions
    ➢ Wrappers for PyTorch, MXNet & TensorFlow
    ➢ Extensive documentation: https://thinc.ai

    from typing import Callable, Tuple
    from thinc.types import Floats2d

    def relu(inputs: Floats2d) -> Tuple[Floats2d, Callable[[Floats2d], Floats2d]]:
        mask = inputs >= 0
        def backprop_relu(d_outputs: Floats2d) -> Floats2d:
            return d_outputs * mask
        return inputs * mask, backprop_relu

    ➢ The layer performs the forward function
    ➢ It returns the forward results + a callback for the backprop
    ➢ The backprop calculates the gradient of the inputs, given the gradient of the outputs
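To illustrate "no computational graph, just higher-order functions", here is a minimal sketch in plain Python (lists instead of arrays, no Thinc import). Each layer returns its output plus a backprop callback, and a combinator composes two layers by composing their callbacks in reverse order. The `scale` layer and this `chain` are stand-ins written for this example, not Thinc's implementation.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Backprop = Callable[[Vector], Vector]
Layer = Callable[[Vector], Tuple[Vector, Backprop]]


def relu(inputs: Vector) -> Tuple[Vector, Backprop]:
    """Forward pass plus a callback mapping output gradients to input gradients."""
    mask = [1.0 if x >= 0 else 0.0 for x in inputs]

    def backprop_relu(d_outputs: Vector) -> Vector:
        return [d * m for d, m in zip(d_outputs, mask)]

    return [x if x >= 0 else 0.0 for x in inputs], backprop_relu


def scale(factor: float) -> Layer:
    """Hypothetical toy layer: multiply every input by a fixed factor."""
    def forward(inputs: Vector) -> Tuple[Vector, Backprop]:
        def backprop(d_outputs: Vector) -> Vector:
            return [d * factor for d in d_outputs]
        return [x * factor for x in inputs], backprop
    return forward


def chain(first: Layer, second: Layer) -> Layer:
    """Compose two layers; the combined backprop runs the callbacks in reverse."""
    def forward(inputs: Vector) -> Tuple[Vector, Backprop]:
        hidden, backprop_first = first(inputs)
        outputs, backprop_second = second(hidden)

        def backprop(d_outputs: Vector) -> Vector:
            return backprop_first(backprop_second(d_outputs))

        return outputs, backprop
    return forward


model = chain(scale(2.0), relu)
outputs, backprop = model([-1.0, 3.0])
d_inputs = backprop([1.0, 1.0])
print(outputs, d_inputs)  # [0.0, 6.0] [0.0, 2.0]
```

The closures hold whatever the backward pass needs (here the mask and the factor), so no separate graph structure is required.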
  9. Type checking

    ➢ By annotating variables with appropriate types, bugs can be prevented!
    ➢ Facilitates autocompletion
    ➢ Uses mypy
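A small illustration of the kind of bug that annotations can surface, as a sketch with a hypothetical helper `mean_length`. The function declares that it takes a list of strings; passing a single string instead would still run in plain Python and silently produce a wrong answer, whereas a static checker such as mypy is expected to reject the call before the code runs.

```python
from typing import List


def mean_length(tokens: List[str]) -> float:
    """Average number of characters per token."""
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


average = mean_length(["Silo.AI", "NLP", "team"])
print(round(average, 2))

# Passing a str where List[str] is declared does not match the annotation.
# At runtime the string would be iterated character by character and the
# function would return 1.0, a silent bug that a type checker can catch:
# mean_length("Silo.AI")
```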
  10. Configuration

    THINC.AI
    ➢ Configuration file
    ➢ Built-in registered functions
    ➢ Define your own functions
    ➢ All hyperparameters explicitly defined
    ➢ All objects are built when parsing the config
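The idea of registered functions plus "objects are built when parsing the config" can be sketched with the standard library alone. This is a toy illustration and not Thinc's config system: the registry, the `@builder` key, and the `toy_sgd.v1` name are all hypothetical. A section names a registered function, and the remaining keys of that section are passed to it as arguments.

```python
import configparser
from typing import Callable, Dict

# Hypothetical registry: maps a name used in the config file to a Python function.
REGISTRY: Dict[str, Callable[..., object]] = {}


def register(name: str) -> Callable[[Callable[..., object]], Callable[..., object]]:
    """Decorator that makes a function available under a string name."""
    def decorator(func: Callable[..., object]) -> Callable[..., object]:
        REGISTRY[name] = func
        return func
    return decorator


@register("toy_sgd.v1")
def make_sgd(learn_rate: str, momentum: str) -> Dict[str, float]:
    # configparser yields strings, so the builder converts them itself.
    return {"learn_rate": float(learn_rate), "momentum": float(momentum)}


CONFIG_TEXT = """
[optimizer]
@builder = toy_sgd.v1
learn_rate = 0.001
momentum = 0.9
"""


def build_from_config(text: str) -> Dict[str, object]:
    """Parse the config and call the registered builder of every section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    built: Dict[str, object] = {}
    for section in parser.sections():
        settings = dict(parser[section])
        builder = REGISTRY[settings.pop("@builder")]
        built[section] = builder(**settings)
    return built


objects = build_from_config(CONFIG_TEXT)
print(objects)
```

Because every hyperparameter lives in the file, swapping a component amounts to changing the registered name and its settings.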
  11. Train from config

    Coming soon:

    python -m spacy train-from-config fi train.json dev.json config.cfg

    ➢ Full control over all settings & parameters
    ➢ Configurations can easily be saved & shared
    ➢ Running experiments quickly, e.g. swap in a different tok2vec component
    ➢ Full support for pre-trained BERT, XLNet and GPT-2 (cf. spacy-transformers package), a.k.a. “starter models”
    ➢ Keep an eye out for spaCy v.3!
  12. Coreference resolution

    ➔ Links together entities that refer to the same thing / person
    ➔ e.g. “he” refers to “Nader”

    Hugging Face’s neuralcoref package works with spaCy
  13. EL examples

    Johnny Carson: American talk show host, or American football player?
    Russ Cochran: American golfer, or publisher?
    Rose: English footballer, or character from the TV series "Doctor Who"?
  14. EL framework

    STEP 0: Assume NER has already happened on the raw text, so we have entities + labels
    STEP 1: Candidate generation: create a list of plausible WikiData IDs for a mention
    STEP 2: Entity Linking: disambiguate these candidates to the most likely ID

    Text → NER → NER mentions → candidate generation → List of candidates for each mention → EL → One entity ID for each mention

    Example:
    Text: "Ms Byron would become known as the first programmer."
    NER mention: Byron: PERSON
    Candidates:
      Q5679: Lord George Gordon Byron
      Q272161: Anne Isabella Byron
      Q7259: Ada Lovelace
    Entity ID: Q7259
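Steps 1 and 2 can be sketched in plain Python. The candidate IDs come from the slide; the alias table, the prior probabilities, the context scores, and the rule of multiplying the two are all made up for illustration and stand in for a real knowledge base and a trained model.

```python
from typing import Dict, List, Optional

# Hypothetical alias table: mention text -> {WikiData ID: prior probability}.
# IDs are from the slide; the probabilities are invented.
ALIAS_TABLE: Dict[str, Dict[str, float]] = {
    "Byron": {"Q5679": 0.6, "Q272161": 0.1, "Q7259": 0.3},
}


def generate_candidates(mention: str) -> List[str]:
    """STEP 1: list the plausible entity IDs for a mention."""
    return sorted(ALIAS_TABLE.get(mention, {}))


def link(mention: str, context_scores: Dict[str, float]) -> Optional[str]:
    """STEP 2: pick the candidate with the best combined score.

    context_scores stands in for a model that scores each candidate against
    the sentence; combining by multiplication is an assumption of this sketch.
    """
    candidates = generate_candidates(mention)
    if not candidates:
        return None
    priors = ALIAS_TABLE[mention]
    return max(candidates, key=lambda c: priors[c] * context_scores.get(c, 0.0))


# Invented scores for "Ms Byron would become known as the first programmer."
scores = {"Q5679": 0.1, "Q272161": 0.2, "Q7259": 0.9}
print(link("Byron", scores))  # Q7259
```

With these toy numbers the prior alone would pick Q5679 (Lord Byron); the context score is what moves the decision to Q7259 (Ada Lovelace).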
  15. Accuracy on news

    ➔ Trained a convolutional NN model on 165K Wikipedia articles
    ➔ Manually annotated news data for evaluation: 360 entity links

                                  Context   Corpus stats   Gold label   Accuracy
    Random baseline               -         -              -            33.4 %
    Entity linking only           x         -              -            60.1 %
    Prior probability baseline    -         x              -            67.2 %
    EL + prior probability        x         x              -            71.4 %
    Oracle KB performance         x         x              x            85.2 %

    ➔ Adding in coreference resolution
      • All entities in the same coref chain should link to the same entity
      • Assign the KB ID with the highest confidence across the chain
      • Performance (EL+prior) drops to 70.9% → further work required
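The coreference rule (one KB ID per chain, taken from the most confident prediction) can be sketched as follows. The function name `resolve_chain`, the mentions, and the confidence values are hypothetical and only illustrate the rule; they are not taken from the evaluation.

```python
from typing import List, Optional, Tuple

# (mention text, predicted KB ID or None, confidence of that prediction)
Prediction = Tuple[str, Optional[str], float]


def resolve_chain(chain: List[Prediction]) -> List[Tuple[str, Optional[str]]]:
    """Give every mention in a coref chain the ID of the most confident link."""
    linked = [p for p in chain if p[1] is not None]
    if not linked:
        # Nothing in the chain was linked, so there is no ID to propagate.
        return [(mention, None) for mention, _, _ in chain]
    best_id = max(linked, key=lambda p: p[2])[1]
    return [(mention, best_id) for mention, _, _ in chain]


# Invented example chain: two conflicting links and one unlinked pronoun.
chain = [
    ("Ms Byron", "Q272161", 0.40),
    ("Ada Lovelace", "Q7259", 0.95),
    ("she", None, 0.0),
]
print(resolve_chain(chain))
```

A single confident but wrong link overrides every other mention in its chain, which is one way such a rule could lower accuracy.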
  16. Wrapping it up

    spaCy is a production-ready NLP library in Python that lets you quickly implement text-mining solutions, and is extensible & retrainable. 3.0 will bring lots of cool new features!

    Thinc is a new Deep Learning library making extensive use of type annotations and supporting configuration files.

    Prodigy is Explosion’s annotation tool powered by Active Learning.

    @explosion_ai @oxykodit @spacy.io
  17. Questions?

    There are no stupid questions – so let’s assume there are no stupid answers either.