


Entity linking for spaCy: Grounding textual mentions

Presentation given by Sofie Van Landeghem at the Belgium NLP meetup of Oct 1, 2019



  1. Entity linking for spaCy: Grounding textual mentions
     Sofie Van Landeghem
     Freelancer ML and NLP, Explosion AI / OxyKodit
     Belgium NLP Meetup, 1 Oct 2019
  2. spaCy
     Sofie Van Landeghem, http://www.oxykodit.com
     ➢ Focus on production usage
     ➢ Speed & efficiency
     ➢ Python + Cython
     ➢ Comparison to other NLP libraries: https://spacy.io/usage/facts-figures
     ➢ Open source (MIT license): https://github.com/explosion/spaCy/
     ➢ Created by Explosion AI (Ines Montani & Matthew Honnibal)
     ➢ Tokenization (50 languages), lemmatization, POS tagging, dependency parsing
     ➢ NER, text classification, rule-based matching (API + one implementation)
     ➢ Word vectors, BERT-style pre-training
     ➢ Statistical models in 10 languages (v2.2): DE, EN, EL, ES, FR, IT, LT, NL, NB, PT
     ➢ One multilingual NER model covering DE, EN, ES, FR, IT, PT, RU
  3. Entity recognition
     ➔ A named entity (NE) is a consecutive span of one or several tokens
     ➔ It has a label or type, such as “PERSON”, “LOC” or “ORG”
     ➔ An NER algorithm is trained on annotated data (e.g. OntoNotes)
  4. Entity linking
     Example mention: “the U.N. World Meteorological Organization”
     ➔ Tokenization, word embeddings, dependency trees … all define words with other words
     ➔ Entity linking: resolve named entities to concepts from a Knowledge Base (KB)
     ➔ Ground the lexical information in the “real world”
     ➔ Allows full integration of database facts with textual information
  5. NEL framework
     STEP 0: Assume NER has already happened on the raw text, so we have entities + labels
     STEP 1: Candidate generation: create a list of plausible WikiData IDs for a mention
     STEP 2: Entity linking: disambiguate these candidates to the most likely ID
     Pipeline: text → NER → mentions → candidate generation → list of candidates per mention → EL → one entity ID per mention
     Example: “Ms Byron would become known as the first programmer.”
       Byron: PERSON → candidates Q5679 (Lord George Gordon Byron), Q272161 (Anne Isabella Byron), Q7259 (Ada Lovelace) → Q7259 (Ada Lovelace)
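The two steps above can be sketched in plain Python. This is a toy illustration, not spaCy's actual implementation: the candidate table is hypothetical, and a naive word-overlap score stands in for the real trained encoders.

```python
import re

# Hypothetical candidate table: mention -> list of (WikiData ID, description)
CANDIDATES = {
    "Byron": [
        ("Q5679", "Lord George Gordon Byron, English poet"),
        ("Q272161", "Anne Isabella Byron, wife of Lord Byron"),
        ("Q7259", "Ada Lovelace, English mathematician, first computer programmer"),
    ],
}

def generate_candidates(mention):
    # STEP 1: look up plausible WikiData IDs for a mention
    return CANDIDATES.get(mention, [])

def tokens(text):
    # lowercase word set, ignoring punctuation
    return set(re.findall(r"[a-z]+", text.lower()))

def disambiguate(context, candidates):
    # STEP 2: pick the candidate whose description best matches the context
    return max(candidates, key=lambda c: len(tokens(c[1]) & tokens(context)))[0]

sentence = "Ms Byron would become known as the first programmer."
cands = generate_candidates("Byron")
print(disambiguate(sentence, cands))  # Q7259
```

Here “first” and “programmer” in the Ada Lovelace description overlap with the sentence, so Q7259 wins; the real model replaces this overlap score with learned vector similarity.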
  6. Step 1: candidate generation
     Task: given a textual mention, produce a list of candidate IDs
       e.g. “Byron” → Q5679, Q272161, Q7259
     How: build a Knowledge Base (KB) to query candidates from. This is done by parsing links on Wikipedia:
       She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835
     ➔ “William King” is a synonym for “William King-Noel, 1st Earl of Lovelace”
       • Other synonyms found on Wikipedia: “Earl of Lovelace”, “8th Baron King”, ...
     ➔ For each synonym, deduce how likely it is to point to a certain ID by normalizing the pair frequencies to prior probabilities
       • e.g. “Byron” refers to “Lord Byron” in 35% of the cases, and to “Ada Lovelace” in 55% of the cases
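The normalization step can be sketched as follows. The link counts are hypothetical numbers chosen to reproduce the 35% / 55% figures from the slide; in the real pipeline they come from parsing the Wikipedia dump.

```python
from collections import Counter, defaultdict

# Hypothetical counts of [[target|anchor]] links: (alias, entity ID) -> frequency
link_counts = Counter({
    ("Byron", "Q5679"): 35,    # Lord Byron
    ("Byron", "Q7259"): 55,    # Ada Lovelace
    ("Byron", "Q272161"): 10,  # Anne Isabella Byron
})

def prior_probabilities(counts):
    """Normalize (alias, entity) pair frequencies to P(entity | alias)."""
    totals = defaultdict(int)
    for (alias, _), freq in counts.items():
        totals[alias] += freq
    return {(alias, ent): freq / totals[alias]
            for (alias, ent), freq in counts.items()}

priors = prior_probabilities(link_counts)
print(priors[("Byron", "Q7259")])  # 0.55
```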
  7. Step 2: entity linker
     Task: given a list of candidate IDs + the textual context, produce the most likely identifier
     How: compare lexical clues between the candidates and the context
     “Ms Byron would become known as the first programmer.” → Q5679, Q272161, Q7259 → Q7259

     WikiData ID | WikiData name       | WikiData description                                             | Context similarity
     Q5679       | Lord Byron          | English poet and a leading figure in the Romantic movement       | 0.1
     Q272161     | Anne Isabella Byron | Wife of Lord Byron                                               | 0.3
     Q7259       | Ada Lovelace        | English mathematician, considered the first computer programmer  | 0.9

     The description of Q7259 is most similar to the original sentence (context).
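A context-similarity score like the one in the table is typically a cosine similarity between an encoding of the sentence and an encoding of each candidate's description. A minimal sketch, with tiny hypothetical 4D vectors standing in for the real 64D encodings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical embeddings (the real ones are produced by trained encoders)
sentence_vec = [0.9, 0.1, 0.3, 0.7]
description_vecs = {
    "Q5679": [0.1, 0.9, 0.2, 0.1],    # Lord Byron: poet
    "Q272161": [0.3, 0.6, 0.1, 0.2],  # Anne Isabella Byron
    "Q7259": [0.8, 0.2, 0.3, 0.6],    # Ada Lovelace: programmer
}

best = max(description_vecs, key=lambda qid: cosine(sentence_vec, description_vecs[qid]))
print(best)  # Q7259
```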
  8. Architecture
     From the text: NER mention → sentence encoder → sentence vector (64D)
     From the KB: entity ID (candidate) → entity description → entity encoder → entity vector (64D)
     Similarity of the two vectors → P(E|M) ∈ [0, 1]
     Loss function: compares P(E|M) against the gold label {0, 1}
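The output head of such an architecture can be sketched as a similarity squashed into a probability, trained against a binary gold label. This is a generic sketch of that pattern, not the exact functions spaCy uses; the encoders that produce the vectors are omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_entity_given_mention(sent_vec, ent_vec):
    """Map the similarity of a sentence encoding and an entity encoding
    to P(E|M) in [0, 1]. Sketch: a plain dot product plus sigmoid."""
    return sigmoid(sum(a * b for a, b in zip(sent_vec, ent_vec)))

def binary_loss(p, gold):
    """Log loss of the predicted probability against the gold label in {0, 1}."""
    return -(gold * math.log(p) + (1 - gold) * math.log(1 - p))

# Aligned vectors give a high probability; the loss rewards a gold label of 1
p = p_entity_given_mention([1.0, 0.0], [2.0, 0.0])
print(p, binary_loss(p, 1))
```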
  9. Upper bound set by the KB
     WikiData contains many infrequently linked topics (a long tail of infrequent entities).
     To keep the KB manageable in memory, it requires some pruning:
     • Keep only entities with min. 20 incoming interwiki links (from 8M down to 1M entities)
     • Each alias-entity pair should occur at least 5 times in WP
     • Keep 10 candidate entities per alias/mention
     • Result: ca. 1.1M entities and 1.5M aliases
     • 350MB file size to store 1M entities and 1.5M aliases + pretrained 64D entity vectors
     The KB only stores 14% of all WikiData concepts, yet the EL still achieves max. 84.2% accuracy (with an oracle EL disambiguation step).
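The three pruning heuristics can be sketched as a simple filter. The input dictionaries here are hypothetical toy counts; the real pipeline derives them from the Wikipedia dump.

```python
# Thresholds from the slide
MIN_ENTITY_LINKS = 20   # min. incoming interwiki links per entity
MIN_PAIR_FREQ = 5       # min. occurrences per alias-entity pair
MAX_CANDIDATES = 10     # candidate entities kept per alias

def prune(entity_links, pair_freqs):
    """entity_links: entity ID -> incoming interwiki link count
       pair_freqs:   alias -> {entity ID: alias-entity pair frequency}
       Returns alias -> list of kept candidate IDs, most frequent first."""
    kept_entities = {e for e, n in entity_links.items() if n >= MIN_ENTITY_LINKS}
    kb = {}
    for alias, ents in pair_freqs.items():
        cands = [(e, f) for e, f in ents.items()
                 if e in kept_entities and f >= MIN_PAIR_FREQ]
        cands.sort(key=lambda ef: ef[1], reverse=True)
        if cands:
            kb[alias] = [e for e, _ in cands[:MAX_CANDIDATES]]
    return kb

entity_links = {"Q7259": 500, "Q5679": 800, "Q999": 3}
pair_freqs = {"Byron": {"Q7259": 55, "Q5679": 35, "Q999": 2}}
print(prune(entity_links, pair_freqs))  # {'Byron': ['Q7259', 'Q5679']}
```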
 10. Accuracy of EL
     Trained on 200,000 mentions in Wikipedia articles (2h); tested on 5,000 mentions in (different) Wikipedia articles.
     The random baseline picks a random entity from the set of candidates.
     The prior probability baseline picks the most likely entity for a given synonym, regardless of context.
     The EL algorithm (by itself) significantly outperforms the random baseline (73.9% > 54.0%) and marginally improves upon the prior probability baseline (79.0% > 78.2%).

     Accuracy %:
       Random baseline      54.0
       EL only              73.9
       Prior prob baseline  78.2
       EL + prior prob      79.0
       Oracle KB (max)      84.2
 11. Error analysis
     “Banteay Meanchey, Battambang, Kampong Cham, ... and Svay Rieng.”
       → predicted: City in Cambodia | gold WP link: Province of Cambodia
     “Societies in the ancient civilizations of Greece and Rome preferred small families.”
       → predicted: Greece | gold WP link: Ancient Greece
     “Agnes Maria of Andechs-Merania (died 1201) was a Queen of France.”
       → predicted: kingdom in Western Europe from 987 to 1791 | gold WP link: current France (the gold label was incorrect!)
 12. Curating WP data
     We manually curated the WP training data with Prodigy (https://prodi.gy/):
     • Took the original “gold” ID from the interwiki link
     • Mixed in all other candidate IDs
     • Presented them in a random order
     Annotation of 500 cases:
     • 7.4% did not constitute a proper sentence
     • 8.2% did not refer to a proper entity
     Of the remaining 422 cases:
     • 87.7% were found to be the same
     • 5.2% were found to be different
     • 7.1% were found to be ambiguous or needed context outside the sentence
 13. Issues in the WP data
     Entities without sentence context, e.g. in enumerations, tables, “See also” sections
       → Remove these from the dataset
     Some links are not really named entities but refer to other concepts such as “privacy”
       → Prune the WikiData KB
     WP annotations are not always aligned to the entity types:
       “Fiji has experienced many coups recently, in 1987, 2000, and 2006.”
         → Link to “2000 Fijian coup d'état” or to the year “2000”?
       “Full metro systems are in operation in Paris, Lyon and Marseille”
         → WP links to “Marseille Metro” instead of to “Marseille”
 14. Annotating news data
     1) “easy” cases: annotated per sentence, choosing from the candidates in the KB
     2) “hard” cases: annotated per article, as free text
 15. Accuracy on news
                                 Random | EL only | Prior prob | EL + prior prob | Oracle KB (max)
     Easy cases (230 entities)    40.9     58.7      84.8          87.8             100
     Hard cases (122 entities)    14.9     17.4      25.6          27.3             33.9
     All (352 entities)           29.6     44.7      64.4          67.0             77.2

     ➔ The original annotation effort started with 500 randomly selected entities
     ➔ 16% were not proper entities/sentences, or were Date entities such as “nearly two months”
     ➔ 9% referred to concepts not in WikiData
     ➔ 2% were too difficult to resolve
     ➔ 3% of Prodigy matches could not be matched with the spaCy nlp model
     ➔ On the news dataset, EL improves more upon the prior probability baseline
 16. Findings in the news data
     There will always be entities too vague or outside the KB (e.g. “a tourist called Julia ...”)
     Candidate generation can fail due to small lexical variants
       • middle name, F.B.I. instead of FBI, “’s” or “the” as part of the entity, ...
     Background information beyond what is in the article is often required for disambiguation
     Metonymy is hard to resolve correctly, even for a human
       • e.g. “the World Economic Forum at Davos … Davos had come to embody that agenda”
     Dates and numbers are often impossible to resolve correctly (or need metadata)
       ➔ “... what happened in the middle of December”
     The WikiData knowledge graph is incredibly helpful when analysing the entities manually
 17. WikiData graph
     “Raised in Elizabeth, N.J., Mr. Feeney served as a radio operator in the Air Force and attended Cornell University on the G.I. Bill.”
     Q2967590 (Chuck Feeney) --place of birth--> Q138311 (Elizabeth)
     Q138311 (Elizabeth) --capital of--> Q2335128 (Province of New Jersey)
     Q2967590 (Chuck Feeney) --educated at--> Q49115 (Cornell University)
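The relations above can be represented as subject-relation-object triples and queried directly. A toy sketch with human-readable relation names; real WikiData uses property IDs such as P19 (place of birth) and P69 (educated at).

```python
# Triples from the slide: (subject, relation, object)
triples = [
    ("Q2967590", "place of birth", "Q138311"),  # Chuck Feeney -> Elizabeth
    ("Q138311", "capital of", "Q2335128"),      # Elizabeth -> Province of New Jersey
    ("Q2967590", "educated at", "Q49115"),      # Chuck Feeney -> Cornell University
]

def related(entity, relation):
    """All objects linked to `entity` via `relation`."""
    return [o for s, r, o in triples if s == entity and r == relation]

print(related("Q2967590", "educated at"))  # ['Q49115']
```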
 18. Coreference resolution
     ➔ Links together entities that refer to the same thing / person
     ➔ e.g. “he” refers to “Nader”
     Hugging Face’s neuralcoref package works with spaCy
 19. Coref for prediction
     Coreference resolution helps to link concepts across sentences; the whole chain should then be linked to the same WikiData ID.
     The EL algorithm is (currently) trained to predict entity links for sentences.
     ➔ How to obtain consistency across the chain?
     ➔ First idea: take the prediction with the highest confidence across all entities in the chain
     Assessment on the Wikipedia dev set:
     ➔ 0.3% decrease :(
     ➔ WP links are biased: usually only the first occurrence (with the most context) is linked
     Assessment on the news evaluation set:
     ➔ 0.5-0.8% increase (might not be significant; we need to look at more data)
     ➔ “easy” from 87.8 to 88.3, “hard” from 27.3 to 28.1, “all” from 67.0 to 67.5
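The “highest confidence across the chain” idea can be sketched in a few lines. Each mention in a coreference chain carries its own (entity ID, confidence) prediction; the whole chain adopts the most confident one. The numbers are hypothetical.

```python
def resolve_chain(predictions):
    """predictions: one (entity_id, confidence) pair per mention in a coref chain.
    Returns the highest-confidence entity ID, applied to every mention."""
    best_id, _ = max(predictions, key=lambda p: p[1])
    return [best_id] * len(predictions)

chain = [("Q7259", 0.9),   # "Ms Byron" -> Ada Lovelace, high confidence
         ("Q5679", 0.4),   # "she"      -> Lord Byron, low confidence
         ("Q7259", 0.6)]   # "Byron"    -> Ada Lovelace
print(resolve_chain(chain))  # ['Q7259', 'Q7259', 'Q7259']
```

Note how the low-confidence “she” prediction is overridden, which is exactly the consistency gain reported on the news evaluation set.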
 20. Ongoing work
     Get better data for training & evaluation, to better benchmark model choices.
     Take the hierarchy of WikiData concepts into account:
     ➔ Predicting the province instead of its capital city is not as bad as predicting an unrelated city
     ➔ Taking this into account in the loss function could make the training more robust
     Coreference resolution can give entity linking a performance boost:
     ➔ Use coreference resolution to obtain a more consistent set of predictions
     ➔ Use coreference resolution to enrich the training data
     Try it out yourself?
     ➔ CLI scripts for creating a KB from a WP dump (any language)
     ➔ CLI scripts for extracting training data from WP/WD
     ➔ Example code on how to train an entity linking pipe
     ➔ Possibility to use a custom implementation (through the provided APIs)
 21. Questions?
     There are no stupid questions – let’s agree there are no stupid answers either.
 22. Memory & speed
     Processing Wikipedia:
     • Parse aliases and prior probabilities from intrawiki links to build the KB
     • Takes about 2 hours to parse the 1100M lines of the Wikipedia EN XML dump
     Processing Wikidata:
     • Link English Wikipedia to interlingual Wikidata identifiers
     • Retrieve concise Wikidata descriptions for each entity
     • Takes about 7 hours to parse the 55M lines of the Wikidata JSON dump
     Knowledge Base:
     • 55MB to store 1M entities and 1.5M aliases
     • 350MB file size to store 1M entities and 1.5M aliases + pretrained 64D entity vectors
     • Thanks to efficient Cython data structures, the KB can be kept in memory
     • Written to file, and read back in, in a matter of seconds
 23. More complex architecture
     From the text: NER mention → sentence encoder → sentence (context) vector (128D); NER type → one-hot encoding (16D)
     From the KB: entity ID (candidate) → entity description → entity encoder → entity vector (64D); prior probability → float (1D)
     The EL layer combines these signals → P(E|M) ∈ [0, 1]
     Loss function: compares P(E|M) against the gold label {0, 1}
 24. Code examples

     Create the knowledge base:

         kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
         kb.add_entity(entity="Q1004791", prob=0.2, entity_vector=v1)
         kb.add_entity(entity="Q42", prob=0.8, entity_vector=v2)
         kb.add_entity(entity="Q5301561", prob=0.1, entity_vector=v3)
         kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"],
                      probabilities=[0.6, 0.1, 0.2])
         kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

     Add the entity linker to the pipeline:

         el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 128})
         el_pipe.set_kb(kb)
         nlp.add_pipe(el_pipe, last=True)

     Train the entity linker, with the other pipes disabled:

         other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
         with nlp.disable_pipes(*other_pipes):
             optimizer = nlp.begin_training()
             …
             nlp.update(…)

     Run the pipeline:

         text = "Douglas Adams made up the stories as he wrote."
         doc = nlp(text)
         for ent in doc.ents:
             print(ent.text, ent.label_, ent.kb_id_)