Entity linking for spaCy: Grounding textual mentions

Presentation given by Sofie Van Landeghem at the Belgium NLP meetup of Oct 1, 2019

Sofie Van Landeghem

October 01, 2019


  1. Entity linking for spaCy:
    Grounding textual mentions
    Sofie Van Landeghem
    Freelancer ML and NLP
    Explosion AI / OxyKodit
    Belgium NLP Meetup, 1 Oct 2019

  2. Sofie Van Landeghem http://www.oxykodit.com

    Focus on production usage

    Speed & efficiency

    Python + Cython

    Comparison to other NLP libraries: https://spacy.io/usage/facts-figures

    Open source (MIT license): https://github.com/explosion/spaCy/

    Created by Explosion AI (Ines Montani & Matthew Honnibal)

    Tokenization (50 languages), lemmatization, POS tagging, dependency parsing

    NER, text classification, rule-based matching (API + one implementation)

    Word vectors, BERT-style pre-training

    Statistical models in 10 languages (v. 2.2): DE, EN, EL, ES, FR, IT, LT, NL, NB, PT

    One multi-lingual NER model containing DE, EN, ES, FR, IT, PT, RU

  3. Entity recognition

    A named entity (NE) is a consecutive span of one or several tokens

    It has a label or type, such as “PERSON”, “LOC” or “ORG”

    An NER algorithm is trained on annotated data (e.g. OntoNotes)

  4. Entity linking
    “the U.N. World Meteorological Organization” →

    Tokenization, word embeddings, dependency trees … all define words in terms of other words

    Entity linking: resolve named entities to concepts from a Knowledge Base (KB)

    Ground the lexical information in the “real world”

    Allows database facts to be fully integrated with textual information

  5. Entity links
    Who are all these Byrons in this text?

  6. NEL framework
    STEP 0: Assume NER has already happened on the raw text, so we have entities + labels
    STEP 1: Candidate generation: create a list of plausible WikiData IDs for each mention
    STEP 2: Entity linking: disambiguate these candidates to the most likely ID, giving one entity ID for each mention

    Example: “Ms Byron would become known as the first programmer.”
    NER output: Byron: PERSON
    Candidates (step 1):
      Q5679: Lord George Gordon Byron
      Q272161: Anne Isabella Byron
      Q7259: Ada Lovelace
    Linked entity (step 2): Q7259

  7. Step 1: candidate generation
    Task: given a textual mention, produce a list of candidate IDs
    How: Build a Knowledge Base (KB) to query candidates from.
    This is done by parsing links on Wikipedia, e.g.:
    She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835
    ➔ “William King” is a synonym for “William King-Noel, 1st Earl of Lovelace”

    Other synonyms found on Wikipedia: “Earl of Lovelace”, “8th Baron King”, ...

    For each synonym, deduce how likely it is to point to a certain ID
    by normalizing the pair frequencies to prior probabilities

    e.g. “Byron” refers to “Lord Byron” in 35% of the cases, and to “Ada Lovelace” in 55% of the cases
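The link parsing and normalization described on this slide can be sketched roughly as follows. This is a minimal illustration and not the code used for the talk: the regular expression is a simplification that ignores namespaces and nested markup, the function names are hypothetical, and the sample lines (apart from the "William King" link) are invented.

```python
import re
from collections import Counter, defaultdict

# Matches [[Target]] and [[Target|anchor text]] (simplified: ignores
# namespaces, section anchors and nested markup).
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def count_alias_entity_pairs(lines):
    """Count how often each alias (anchor text) links to each target page."""
    counts = defaultdict(Counter)
    for line in lines:
        for target, anchor in WIKI_LINK.findall(line):
            alias = anchor or target  # [[Target]] uses the title as alias
            counts[alias.strip()][target.strip()] += 1
    return counts


def prior_probabilities(counts):
    """Normalize the pair counts per alias into P(entity | alias)."""
    priors = {}
    for alias, targets in counts.items():
        total = sum(targets.values())
        priors[alias] = {t: n / total for t, n in targets.items()}
    return priors


# Toy input: only the first line comes from the slide, the rest is invented.
sample = [
    "She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835",
    "[[Ada Lovelace|Byron]] worked with Babbage.",
    "[[Ada Lovelace|Byron]] wrote the notes.",
    "The poet [[Lord Byron|Byron]] died in 1824.",
]
priors = prior_probabilities(count_alias_entity_pairs(sample))
print(priors["Byron"])
```

On a real dump the targets would still need to be mapped from Wikipedia page titles to WikiData IDs, which this sketch leaves out.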

  8. Step 2: entity linker
    Task: Given a list of candidate IDs + the textual context,
    produce the most likely identifier
    How: Compare lexical clues between the candidates and the context
    Context: “Ms Byron would become known as the first programmer.”

    WikiData ID | WikiData name       | WikiData description                                            | Context similarity
    Q5679       | Lord Byron          | English poet and a leading figure in the Romantic movement      |
    Q272161     | Anne Isabella Byron | Wife of Lord Byron                                              | 0.3
    Q7259       | Ada Lovelace        | English mathematician, considered the first computer programmer |

    The description of Q7259 is most similar to the original sentence (context)
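To make the idea of comparing lexical clues concrete, here is a toy sketch that ranks the three candidates by bag-of-words cosine similarity between their descriptions and the sentence. This is a deliberately crude stand-in, not the trained encoders from the talk; the function names are hypothetical and the scores it produces are not the similarities shown on the slide.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a learned encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def link(context, candidates):
    """Return the candidate ID whose description best matches the context."""
    ctx = bag_of_words(context)
    scores = {qid: cosine(ctx, bag_of_words(desc)) for qid, desc in candidates.items()}
    return max(scores, key=scores.get), scores


candidates = {
    "Q5679": "English poet and a leading figure in the Romantic movement",
    "Q272161": "Wife of Lord Byron",
    "Q7259": "English mathematician, considered the first computer programmer",
}
best, scores = link("Ms Byron would become known as the first programmer.", candidates)
print(best)  # Q7259
```

Even this naive word overlap picks Q7259 here, because "first" and "programmer" occur in both the sentence and the description; a learned encoder is needed once the clues are less literal.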

  9. [Architecture diagram. Recoverable labels: sentence encoder (from text, 64D sentence vector), entity encoder (entity ID, from KB), score in [0, 1], gold label in {0, 1}, loss function]
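One plausible reading of this diagram is that the sentence encoding and the entity encoding are compared, squashed into a score in [0, 1], and penalized against the gold label. The sketch below assumes cosine similarity and a squared-error loss; the slide names neither, so both are assumptions, and the 4D vectors are toy stand-ins for the 64D encodings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score(sentence_vec, entity_vec):
    """Map cosine similarity from [-1, 1] to a score in [0, 1] (assumed)."""
    return (cosine(sentence_vec, entity_vec) + 1.0) / 2.0


def squared_error(prediction, gold):
    """Assumed loss against the gold label (1 = correct entity, 0 = wrong one)."""
    return (prediction - gold) ** 2


sentence = [0.9, 0.1, 0.0, 0.4]   # toy stand-in for the 64D sentence encoding
correct = [0.8, 0.2, 0.1, 0.5]    # toy encoding of the gold entity
wrong = [-0.7, 0.6, 0.3, -0.2]    # toy encoding of another candidate

loss = squared_error(score(sentence, correct), 1) + squared_error(score(sentence, wrong), 0)
print(round(loss, 4))
```

During training, the gradient of such a loss would flow back into the sentence encoder so that sentences move closer to their gold entity and away from the other candidates.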

  10. Upper bound set by KB
    WikiData contains many infrequently linked topics
    To keep the KB manageable in memory, some pruning is required:

    Keep only entities with min. 20 incoming interwiki links (from 8M to 1M entities)

    Each alias-entity pair should occur at least 5 times in WP

    Keep 10 candidate entities per alias/mention

    Result: ca. 1.1M entities and 1.5M aliases

    350MB file size to store 1M entities and 1.5M aliases + pretrained 64D entity vectors
    The KB only stores 14% of all WikiData concepts!
    The EL still achieves max. 84.2% accuracy (with an oracle EL disambiguation step)

    Long tail of infrequent entities
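The three pruning rules on this slide can be sketched on plain dictionaries as below. The thresholds (20, 5, 10) come from the slide; the function name, the data layout and the toy counts are assumptions made for illustration.

```python
def prune_kb(entity_link_counts, alias_entity_counts,
             min_entity_links=20, min_pair_count=5, max_candidates=10):
    """Apply the three pruning rules to toy count dictionaries.

    entity_link_counts: {entity_id: number of incoming links}
    alias_entity_counts: {alias: {entity_id: alias-entity pair count}}
    """
    # Rule 1: keep only entities with enough incoming links.
    kept_entities = {e for e, n in entity_link_counts.items() if n >= min_entity_links}
    pruned_aliases = {}
    for alias, pairs in alias_entity_counts.items():
        # Rule 2: keep only alias-entity pairs that occur often enough.
        frequent = [(e, n) for e, n in pairs.items()
                    if e in kept_entities and n >= min_pair_count]
        # Rule 3: keep at most max_candidates entities, most frequent first.
        frequent.sort(key=lambda pair: pair[1], reverse=True)
        if frequent:
            pruned_aliases[alias] = frequent[:max_candidates]
    return kept_entities, pruned_aliases


# Toy counts, invented for illustration.
entity_link_counts = {"Q5679": 900, "Q7259": 400, "Q272161": 12}
alias_entity_counts = {
    "Byron": {"Q5679": 35, "Q7259": 55, "Q272161": 10},
    "Lady Byron": {"Q272161": 8},
}
entities, aliases = prune_kb(entity_link_counts, alias_entity_counts)
print(sorted(entities))
print(aliases)
```

In this toy run Q272161 falls below the link threshold, so the alias "Lady Byron" loses its only candidate and disappears; this is the mechanism that caps the achievable accuracy.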

  11. Accuracy of EL
    Trained on 200,000 mentions in Wikipedia articles (2h)
    Tested on 5,000 mentions in (different) Wikipedia articles
    The random baseline picks a random entity from the set of candidates
    The prior probability baseline picks the most likely entity for a given synonym, regardless of context
    The EL algorithm (by itself) significantly outperforms the random baseline: 73.9% > 54.0%
    Combined with the prior probability, it marginally improves upon the prior probability baseline: 79.0% > 78.2%

               | Random | EL   | Prior prob | EL + prior prob | Oracle KB
    Accuracy % | 54.0   | 73.9 | 78.2       | 79.0            | 84.2
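The two baselines are simple enough to sketch directly. The code below is an illustration on an invented three-mention evaluation set, not the evaluation script behind the numbers above; the function names are hypothetical, and the 0.10 prior for Q272161 is made up to complete the example.

```python
import random


def prior_baseline(candidates):
    """Pick the candidate with the highest prior probability P(entity | alias)."""
    return max(candidates, key=candidates.get)


def random_baseline(candidates, rng):
    """Pick a uniformly random candidate, ignoring priors and context."""
    return rng.choice(sorted(candidates))


def accuracy(predictions, gold):
    """Fraction of mentions whose predicted ID equals the gold ID."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy evaluation set (invented): candidate priors per mention and the gold ID.
mentions = [
    ({"Q5679": 0.35, "Q7259": 0.55, "Q272161": 0.10}, "Q7259"),
    ({"Q5679": 0.35, "Q7259": 0.55, "Q272161": 0.10}, "Q5679"),
    ({"Q42": 0.9, "Q1004791": 0.1}, "Q42"),
]
gold = [g for _, g in mentions]
rng = random.Random(0)
prior_acc = accuracy([prior_baseline(c) for c, _ in mentions], gold)
random_acc = accuracy([random_baseline(c, rng) for c, _ in mentions], gold)
print(prior_acc, random_acc)
```

The second mention shows why the prior baseline is hard to beat yet not sufficient: it always answers Q7259 for "Byron", so it is wrong whenever the context calls for the less frequent reading.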

  12. Error analysis
    Banteay Meanchey, Battambang, Kampong Cham, ... and Svay Rieng.
    → predicted: City in Cambodia
    → gold WP link: Province of Cambodia
    Societies in the ancient civilizations of Greece and Rome preferred small families.
    → predicted: Greece
    → gold WP link: Ancient Greece
    Agnes Maria of Andechs-Merania (died 1201) was a Queen of France.
    → predicted: kingdom in Western Europe from 987 to 1791
    → gold WP link: current France (gold was incorrect!)

  13. Curating WP data
    We manually curated the WP training data

    Took the original “gold” ID from the interwiki link

    Mixed in all other candidate IDs

    Presented them in a random order
    Annotation of 500 cases

    7.4% did not constitute a proper sentence

    8.2% did not refer to a proper entity
    Of the remaining 422 cases:

    87.7% were found to be the same

    5.2% were found to be different

    7.1% were found to be ambiguous or needed context outside the sentence
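As a quick consistency check, the percentages on this slide can be converted back into case counts. The counts below are implied by rounding and are not stated on the slide.

```python
total = 500
not_sentence = round(total * 0.074)   # cases that were not a proper sentence
not_entity = round(total * 0.082)     # cases that were not a proper entity
remaining = total - not_sentence - not_entity

shares = {"same": 0.877, "different": 0.052, "ambiguous": 0.071}
counts = {label: round(remaining * share) for label, share in shares.items()}
print(remaining, counts)
```

The excluded cases add up to 78, which leaves the 422 cases mentioned on the slide, and the three rounded counts again sum to 422.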

  14. Issues in the WP data
    Entities without sentence context, e.g. in enumerations, tables, “See also” sections
    → Remove from the dataset
    Some links are not really Named Entities but refer to other concepts such as “privacy”
    → Prune the WikiData KB
    WP annotations are not always aligned to the entity types
    “Fiji has experienced many coups recently, in 1987, 2000, and 2006.”
    → Link to “2000 Fijian coup d'état” or to the year “2000”?
    “Full metro systems are in operation in Paris, Lyon and Marseille”
    → WP links to “Marseille Metro” instead of to “Marseille”

  15. Annotating news data
    1) “easy”: sentence + candidates from KB
    2) “hard”: article + free

  16. Accuracy on news

                              | Random | EL   | Prior prob | EL + prior prob | Oracle KB
    Easy cases (230 entities) | 40.9   | 58.7 | 84.8       | 87.8            | 100
    Hard cases (122 entities) | 14.9   | 17.4 | 25.6       | 27.3            | 33.9
    All (352 entities)        | 29.6   | 44.7 | 64.4       | 67.0            | 77.2

    The original annotation effort started with 500 randomly selected entities

    16% were not proper entities/sentences, or were Date entities such as “nearly two months”

    9% referred to concepts not in WikiData

    2% were too difficult to resolve

    3% of Prodigy matches could not be matched with the spaCy nlp model

    On the news dataset, EL improves more upon the prior probability baseline

  17. Findings in the news data
    There will always be entities too vague or outside the KB (e.g. “a tourist called Julia ...”)
    Candidate generation can fail due to small lexical variants

    e.g. a middle name, “F.B.I.” instead of “FBI”, “’s” or “the” as part of the entity, ...
    Background information beyond what is in the article is often required for disambiguation
    Metonymy is hard to resolve correctly, even for a human

    e.g. “the World Economic Forum at Davos … Davos had come to embody that agenda”
    Dates and numbers are often impossible to resolve correctly (or need metadata)

    “... what happened in the middle of December”
    The WikiData knowledge graph is incredibly helpful when analysing the entities manually

  18. WikiData graph
    “Raised in Elizabeth, N. J. Mr. Feeney served as a radio operator in the Air Force
    and attended Cornell University on the G. I. Bill.”
    [Graph of WikiData relations. Nodes: Chuck Feeney, Province of New Jersey, Cornell University. Edges: place of birth, capital of, educated at]

  19. Coreference resolution

    Links together entities that refer to the same thing / person

    e.g. “he” refers to “Nader”
    Hugging Face’s neuralcoref package works with spaCy

  20. Coref for prediction
    Coreference resolution helps to link concepts across sentences.
    The whole chain should then be linked to the same WikiData ID.
    The EL algorithm is (currently) trained to predict entity links for sentences.

    How to obtain consistency across the chain?

    First idea: take the prediction with the highest confidence across all entities in the chain
    Assessment on Wikipedia dev set

    0.3% decrease :(

    WP links are biased: only the first occurrence (with most context) is usually linked
    Assessment on News evaluation set

    0.5-0.8% increase (might not be significant – need to look at more data)

    “easy” from 87.8 to 88.3, “hard” from 27.3 to 28.1, “all” from 67.0 to 67.5
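The "first idea" on this slide, taking the most confident prediction in a coreference chain and applying it to the whole chain, can be sketched as follows. Chains and predictions are represented as plain Python structures here; this is an assumption for illustration and does not use the neuralcoref or spaCy APIs, and the function name is hypothetical.

```python
def propagate_chain_predictions(chains, predictions):
    """Give every mention in a chain the chain's most confident entity ID.

    chains: list of lists of mention indices (one list per coreference chain)
    predictions: {mention index: (entity_id, confidence)}; mentions the
                 linker could not score (e.g. pronouns) may be absent
    """
    resolved = {m: entity for m, (entity, _) in predictions.items()}
    for chain in chains:
        scored = [predictions[m] for m in chain if m in predictions]
        if not scored:
            continue  # nothing in this chain was linked
        best_entity, _ = max(scored, key=lambda p: p[1])
        for m in chain:
            resolved[m] = best_entity
    return resolved


# Toy example (invented): mentions 0, 2 and 3 corefer, mention 1 stands alone.
chains = [[0, 2, 3]]
predictions = {0: ("Q7259", 0.9), 1: ("Q42", 0.7), 2: ("Q5679", 0.4)}
resolved = propagate_chain_predictions(chains, predictions)
print(resolved)
```

Mention 2 is overruled by the more confident prediction for mention 0, and the unlinked mention 3 inherits the same ID, which is what makes the chain consistent.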

  21. Ongoing work
    Get better data for training & evaluation to better benchmark model choices
    The hierarchy of WikiData concepts should be taken into account

    Predicting the province instead of its capital city is not as bad as predicting an unrelated city

    Taking this into account in the loss function could make the training more robust
    Coreference resolution can give entity linking a performance boost

    Use coreference resolution to obtain a more consistent set of predictions

    Use coreference resolution to enrich the training data
    Try it out yourself?

    CLI scripts for creating a KB from a WP dump (any language)

    CLI scripts for extracting training data from WP/WD

    Example code on how to train an entity linking pipe
    ➔ Possibility to use a custom implementation (through the provided APIs)

  22. Questions?
    There are no stupid questions – let’s agree there are no stupid answers either.

  23. Backup slides

  24. Memory & speed
    Processing Wikipedia:
    Parse aliases and prior probabilities from intrawiki links to build the KB
    Takes about 2 hours to parse 1100M lines of Wikipedia ENG XML dump
    Processing Wikidata:
    Link English Wikipedia to interlingual Wikidata identifiers
    Retrieve concise Wikidata descriptions for each entity
    Takes about 7 hours to parse 55M lines of Wikidata JSON dump
    Knowledge Base:
    55MB to store 1M entities and 1.5M aliases
    350MB file size to store 1M entities and 1.5M aliases + pretrained 64D entity vectors
    Thanks to efficient Cython data structures, the KB can be kept in memory
    It can be written to file, and read back in, in a matter of seconds

  25. More complex architecture
    [Architecture diagram. Recoverable labels: sentence encoder (from text), entity encoder (entity ID, from KB), NER type (16D, one hot encoding), prior prob (1D, P(E|M)), score in [0, 1], gold label in {0, 1}, loss function]

  26. Code examples
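    from spacy.kb import KnowledgeBase

    # Assumes `nlp` is a loaded spaCy v2.2 pipeline with an NER component,
    # and v1, v2, v3 are pretrained 64D entity vectors.

    # 1. Build the knowledge base: entities first, then their aliases.
    # Note: the released spaCy v2.2 API names the `prob` keyword of
    # add_entity `freq` (entity frequency).
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
    kb.add_entity(entity="Q1004791", prob=0.2, entity_vector=v1)
    kb.add_entity(entity="Q42", prob=0.8, entity_vector=v2)
    kb.add_entity(entity="Q5301561", prob=0.1, entity_vector=v3)
    kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2])
    kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

    # 2. Create the entity linking pipe, attach the KB, and add it to the pipeline.
    el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 128})
    el_pipe.set_kb(kb)
    nlp.add_pipe(el_pipe, last=True)

    # 3. Train only the entity linker by disabling all other pipes.
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()

    # 4. Apply the pipeline and read the predicted KB ID of each entity.
    text = "Douglas Adams made up the stories as he wrote."
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_, ent.kb_id_)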
