Building customizable NLP pipelines with spaCy

Presentation given by Sofie Van Landeghem at the Turku.AI meetup of Feb 19, 2020

Sofie Van Landeghem

February 19, 2020


Transcript

  1. Building customizable NLP
    pipelines with spaCy
    Sofie Van Landeghem
    ML and NLP engineer
    Explosion
    Turku.AI Meetup, 19 Feb 2020


  2. Sofie Van Landeghem [email protected]
    2
    NLP
    Natural Language Processing (NLP):
    Extract structured knowledge from unstructured language

    Product review: “Be careful with the ‘package deal’, the sound bar is not LG and has not
    worked well (if at all) for me. The LG tv is excellent, great everything.”
    → aspects: sound bar, LG TV

    Clinical text: “... There were 26 complete responses (16%) and 0 partial responses
    (0%) … The median progression-free survival time was 65 months ...”
    → CR: 16%, PR: 0%, PFS: 65 mo.

    (Picture from © Rasa)


  3. NLP tools
    Overview of NLP resources: https://github.com/keon/awesome-nlp

    Research overview, tutorials, datasets, libraries

    Node.js, Python, C++, Java, Kotlin, Scala, R, Clojure, Ruby, Rust
    This talk: specific focus on functionality in the spaCy library

    Python + Cython (speed!)

    Focus on production usage

    Configurable, customisable, extensible, retrainable

    Powered by our open-source DL library thinc

    Open source (MIT license): https://github.com/explosion/spaCy/

    Comparison to other NLP libraries: https://spacy.io/usage/facts-figures


  4. A spaCy pipeline
    Text → nlp [ tokenizer → tagger → parser → ner → … ] → Doc
    (each component adds annotations; e.g. ner labels entities such as ORG)
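The component-chain idea above can be sketched in plain Python. This is an illustration only, not spaCy's actual implementation; the tokenizer, ner, and make_nlp functions here are invented for the example.

```python
def tokenizer(text):
    # naive whitespace tokenizer standing in for spaCy's rule-based one
    return {"text": text, "tokens": text.split(), "ents": []}

def ner(doc):
    # toy component: tag all-caps multi-letter tokens as entities
    doc["ents"] = [t for t in doc["tokens"] if t.isupper() and len(t) > 1]
    return doc

def make_nlp(components):
    # the nlp callable runs the tokenizer, then each component in turn
    def nlp(text):
        doc = tokenizer(text)
        for component in components:
            doc = component(doc)
        return doc
    return nlp

nlp = make_nlp([ner])
doc = nlp("She works at ACME")
print(doc["ents"])  # ['ACME']
```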


  5. An entirely random text


  6. Pretrained models
    pip install spacy
    python -m spacy download en_core_web_lg

    import spacy
    nlp = spacy.load('en_core_web_lg')
    doc = nlp(text)

    from spacy import displacy
    displacy.serve(doc, style='ent')

    print([(token.text, token.pos_) for token in doc])
    an DET
    AI PROPN
    scientist NOUN
    in ADP
    Silo PROPN
    . PUNCT
    AI PROPN
    ’s NOUN
    NLP PROPN
    team NOUN
    . PUNCT

    Pre-trained models for 10 languages


  7. Customize tokenization
    from spacy.symbols import ORTH
    nlp.tokenizer.add_special_case("Silo.AI", [{ORTH: "Silo.AI"}])
    an DET
    AI PROPN
    scientist NOUN
    in ADP
    Silo.AI PROPN
    ’s PART
    NLP PROPN
    team NOUN
    . PUNCT

    Built-in language-specific rules for 50 languages

    Pull requests improving your favourite language are always welcome!

    Extend the default rules with your own

    Define specific tokenization exceptions, e.g.
    "don't": [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]

    Implement an entirely new tokenizer
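The special-case mechanism can be illustrated with a toy tokenizer in plain Python. This is not spaCy's actual tokenizer; SPECIAL_CASES and the trailing-period rule below are simplifications invented for the example.

```python
# exceptions map a raw string to the tokens it should become
SPECIAL_CASES = {
    "Silo.AI": ["Silo.AI"],   # keep as one token
    "don't": ["do", "n't"],   # split into two tokens
}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in SPECIAL_CASES:
            # exceptions override the default rules
            tokens.extend(SPECIAL_CASES[chunk])
            continue
        # naive default rule: split off a trailing period
        if chunk.endswith(".") and len(chunk) > 1:
            tokens.extend([chunk[:-1], "."])
        else:
            tokens.append(chunk)
    return tokens

print(tokenize("I don't work at Silo.AI."))
# ['I', 'do', "n't", 'work', 'at', 'Silo.AI', '.']
```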


  8. (re)train models
    optimizer = nlp.begin_training()   # train from scratch
    optimizer = nlp.resume_training()  # or: refine an existing model

    TRAIN_DATA = [
        ("Filip Ginter works for Silo.AI.",
         {"entities": [(0, 12, "PERSON"),
                       (23, 30, "ORG")]}),
        (…) ]

    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations,
                       sgd=optimizer, drop=0,
                       losses=losses)
        print(f"Loss at {itn} is {losses['ner']}")

    Retrain / refine existing ML models
    Add new labels (e.g. NER)
    Feed in new data
    Ensure the model doesn’t “forget” what it learned before!
    Feed in “old” examples too
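The batching in the loop above can be sketched in plain Python; this minibatch is a simplified stand-in for spacy.util.minibatch (the real one can also take a batch-size schedule).

```python
import random

def minibatch(items, size=2):
    # yield successive fixed-size batches of training examples
    for i in range(0, len(items), size):
        yield items[i:i + size]

TRAIN_DATA = [(f"text {i}", {"entities": []}) for i in range(5)]
random.shuffle(TRAIN_DATA)  # shuffle each iteration, as in the loop above
batch_sizes = [len(batch) for batch in minibatch(TRAIN_DATA)]
print(batch_sizes)  # [2, 2, 1]
```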


  9. CLI train
    python -m spacy convert ud-treebanks-v2.4\UD_Finnish-TDT\fi_tdt-ud-train.conllu fi_json
    python -m spacy convert ud-treebanks-v2.4\UD_Finnish-TDT\fi_tdt-ud-dev.conllu fi_json
    python -m spacy train fi output fi_json\fi_tdt-ud-train.json fi_json\fi_tdt-ud-dev.json
    Train ML models from scratch

    Built-in support for UD annotations
    Itn   Tag loss    Tag %    Dep loss     UAS      LAS
    1     39475.358   89.109   201983.313   65.778   51.614
    2     23837.115   90.463   169409.391   71.149   59.22
    3     18800.934   91.146   153834.198   73.244   62.157
    4     15685.533   91.818   142268.533   74.149   63.751
    5     13529.039   92.118   134673.218   75.209   65.086
    …     …           …        …            …        …
    Note that this won’t automatically
    give state-of-the-art results...
    There is no tuning, hyperparameter selection, or language-specific customization (yet)!


  10. spaCy v.3:
    more configurable pipelines
    (powered by Thinc v.8)


  11. Thinc v.8

    New deep learning library

    released in January 2020

    Has been powering spaCy for years

    Entirely revamped for Python 3

    Type annotations

    Functional-programming concept:
    no computational graph,
    just higher order functions

    Wrappers for PyTorch, MXNet
    & TensorFlow

    Extensive documentation:
    https://thinc.ai
    def relu(inputs: Floats2d) -> Tuple[Floats2d, Callable[[Floats2d], Floats2d]]:
        mask = inputs >= 0
        def backprop_relu(d_outputs: Floats2d) -> Floats2d:
            return d_outputs * mask
        return inputs * mask, backprop_relu

    Layer performs the forward function

    Returns the fwd results + a call for the backprop

    The backprop calculates the gradient of the
    inputs, given the gradient of the outputs
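The same layer can be rewritten over plain Python lists (instead of Floats2d arrays) to make the forward-plus-backprop pattern concrete; this list version is an illustration, not Thinc code.

```python
def relu(inputs):
    # forward pass: zero out negative inputs, remember which were kept
    mask = [x >= 0 for x in inputs]
    outputs = [x if m else 0.0 for x, m in zip(inputs, mask)]
    def backprop_relu(d_outputs):
        # gradient of the inputs, given the gradient of the outputs
        return [d if m else 0.0 for d, m in zip(d_outputs, mask)]
    return outputs, backprop_relu

outputs, backprop = relu([-2.0, 3.0])
print(outputs)               # [0.0, 3.0]
print(backprop([1.0, 1.0]))  # [0.0, 1.0]
```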


  12. Type checking

    By annotating variables with appropriate types, bugs can be prevented!

    Facilitate autocompletion

    Uses mypy
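As an illustration of the kind of bug mypy catches from annotations alone (this example is not from the slides):

```python
from typing import List

def mean(values: List[float]) -> float:
    # with the annotation, mypy knows values must be a list of floats
    return sum(values) / len(values)

print(mean([1.0, 2.0, 3.0]))  # 2.0
# mean("abc")  # flagged by mypy before runtime: "str" is not List[float]
```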


  13. Configuration

    Configuration file

    Built-in registered functions

    Define your own functions

    All hyperparameters explicitly defined

    All objects are built when
    parsing the config
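The registered-functions idea can be sketched with a toy registry in plain Python; REGISTRY, register, and resolve here are invented names, not Thinc's actual API, but they show how a config names a function and how objects are built while the config is parsed.

```python
REGISTRY = {}

def register(name):
    # decorator: make a function available to configs under a name
    def wrapper(func):
        REGISTRY[name] = func
        return func
    return wrapper

@register("relu_layer")
def make_relu_layer(size: int):
    return {"kind": "relu", "size": size}

# the config names the function and supplies its hyperparameters
config = {"@layers": "relu_layer", "size": 128}

def resolve(cfg):
    cfg = dict(cfg)  # don't mutate the caller's config
    func = REGISTRY[cfg.pop("@layers")]
    return func(**cfg)  # object is built while parsing the config

layer = resolve(config)
print(layer)  # {'kind': 'relu', 'size': 128}
```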


  14. Train from config

    Full control over all settings & parameters

    Configurations can easily be saved & shared

    Running experiments quickly,
    e.g. swap in a different tok2vec component

    Full support for pre-trained BERT, XLNet and
    GPT-2 (cf. spacy-transformers package)
    a.k.a. “starter models”

    Keep an eye out for spaCy v.3!
    python -m spacy train-from-config fi train.json dev.json config.cfg
    Coming soon


  15. New pipeline components


  16. Coreference resolution

    Links together entities that refer to the same thing / person

    e.g. “he” refers to “Nader”
    Hugging Face’s
    neuralcoref package
    works with spaCy


  17. Entity Linking (EL)
    Who are all these Byrons in this text?


  18. EL examples
    Johnny Carson: American talk show host, or American football player?
    Russ Cochran: American golfer, or publisher?
    Rose: English footballer, or character from the TV series "Doctor Who"?


  19. EL framework
    STEP 0: Assume NER has already happened on the raw text, so we have entities + labels
    STEP 1: Candidate generation: create a list of plausible WikiData IDs for a mention
    STEP 2: Entity Linking: disambiguate these candidates to the most likely ID
    Text → NER → mentions → candidate generation → list of candidates per mention → EL → one entity ID per mention

    Example: “Ms Byron would become known as the first programmer.”
    NER: Byron → PERSON
    Candidates: Q5679 (Lord George Gordon Byron), Q272161 (Anne Isabella Byron), Q7259 (Ada Lovelace)
    EL: Byron → Q7259 (Ada Lovelace)
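The two steps can be sketched with a toy in-memory KB; the alias table, prior probabilities, and scoring below are invented for illustration (spaCy's real implementation uses a KnowledgeBase and a trained EL model).

```python
# toy knowledge base: aliases and prior probabilities, made up for the example
KB_ALIASES = {"Byron": ["Q5679", "Q272161", "Q7259"]}
PRIOR = {"Q5679": 0.6, "Q272161": 0.1, "Q7259": 0.3}

def generate_candidates(mention):
    # STEP 1: plausible WikiData IDs for a mention
    return KB_ALIASES.get(mention, [])

def link(mention, context_scores):
    # STEP 2: disambiguate to the most likely ID, here a fake
    # combination of prior probability and a context score
    candidates = generate_candidates(mention)
    return max(candidates,
               key=lambda qid: PRIOR[qid] + context_scores.get(qid, 0.0))

print(link("Byron", {}))              # Q5679 (prior alone: the poet)
print(link("Byron", {"Q7259": 0.5}))  # Q7259 (programming context wins)
```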


  20. Accuracy on news

    Trained a convolutional NN model on 165K Wikipedia articles

    Manually annotated news data for evaluation: 360 entity links

    Adding in coreference resolution:
    All entities in the same coref chain should link to the same entity
    Assign the KB ID with the highest confidence across the chain
    Performance (EL+prior) drops to 70.9% → further work required

                                 Context   Corpus stats   Gold label   Accuracy
    Random baseline                 -           -             -         33.4 %
    Entity linking only             x           -             -         60.1 %
    Prior probability baseline      -           x             -         67.2 %
    EL + prior probability          x           x             -         71.4 %
    Oracle KB performance           x           x             x         85.2 %
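The chain rule described above ("assign the KB ID with the highest confidence across the chain") can be sketched as follows; the triple format and example confidences are invented for illustration.

```python
def resolve_chain(chain):
    # chain: list of (mention, kb_id, confidence) triples from EL;
    # every mention gets the ID with the highest confidence in the chain
    best_id = max(chain, key=lambda triple: triple[2])[1]
    return [(mention, best_id) for mention, _, _ in chain]

chain = [("Ms Byron", "Q7259", 0.8), ("she", "Q5679", 0.4)]
print(resolve_chain(chain))  # [('Ms Byron', 'Q7259'), ('she', 'Q7259')]
```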


  21. Wrapping it up
    spaCy is a production-ready NLP library in Python
    that lets you quickly implement text-mining solutions,
    and is extensible & retrainable.
    3.0 will bring lots of cool new features!
    Thinc is a new Deep Learning library
    making extensive use of type annotations
    and supporting configuration files.
    Prodigy is Explosion’s annotation tool
    powered by Active Learning.
    @explosion_ai
    @oxykodit
    @spacy.io


  22. Questions?
    There are no stupid questions – so let’s assume there are no stupid answers either.
