Slide 1


Entity linking for spaCy: Grounding textual mentions
Sofie Van Landeghem
Freelancer ML and NLP, Explosion AI / OxyKodit
Belgium NLP Meetup, 1 Oct 2019

Slide 2


spaCy
➢ Focus on production usage
➢ Speed & efficiency
➢ Python + Cython
➢ Comparison to other NLP libraries: https://spacy.io/usage/facts-figures
➢ Open source (MIT license): https://github.com/explosion/spaCy/
➢ Created by Explosion AI (Ines Montani & Matthew Honnibal)
➢ Tokenization (50 languages), lemmatization, POS tagging, dependency parsing
➢ NER, text classification, rule-based matching (API + one implementation)
➢ Word vectors, BERT-style pre-training
➢ Statistical models in 10 languages (v. 2.2): DE, EN, EL, ES, FR, IT, LT, NL, NB, PT
➢ One multi-lingual NER model covering DE, EN, ES, FR, IT, PT, RU
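As a quick taste of the library, a minimal sketch (assuming the small English model has been installed with "python -m spacy download en_core_web_sm"):

    import spacy

    # Load a pretrained pipeline: tokenizer, tagger, parser and NER in one object
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Ada Lovelace was born in London.")

    # Every token carries its lemma, part-of-speech tag and dependency label
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.dep_)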

Slide 3


Entity recognition
➔ A named entity (NE) is a consecutive span of one or several tokens
➔ It has a label or type, such as “PERSON”, “LOC” or “ORG”
➔ An NER algorithm is trained on annotated data (e.g. OntoNotes)
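A small sketch of inspecting the recognized spans (again assuming en_core_web_sm is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Ms Byron would become known as the first programmer.")

    # Each entity is a Span with character offsets and a type label
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)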

Slide 4


Entity linking
➔ Tokenization, word embeddings, dependency trees … all define words with other words
➔ Entity linking: resolve named entities to concepts from a Knowledge Base (KB)
● e.g. “the U.N. World Meteorological Organization” → its KB entry
➔ Ground the lexical information in the “real world”
➔ Allows database facts to be fully integrated with textual information

Slide 5


Entity links
Who are all these Byrons in this text?

Slide 6


NEL framework
STEP 0: Assume NER has already happened on the raw text, so we have entities + labels
STEP 1: Candidate generation: create a list of plausible WikiData IDs for a mention
STEP 2: Entity linking: disambiguate these candidates to the most likely ID

Pipeline: text → NER → mentions → candidate generation → list of candidates for each mention → EL → one entity ID for each mention

Example: “Ms Byron would become known as the first programmer.”
● NER mention: Byron (PERSON)
● Candidates: Q5679 (Lord George Gordon Byron), Q272161 (Anne Isabella Byron), Q7259 (Ada Lovelace)
● Linked ID: Q7259 (Ada Lovelace)
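The three steps, written out as a hedged sketch (get_candidates and disambiguate are hypothetical stand-ins for the components described on the next slides):

    def link_entities(doc, get_candidates, disambiguate):
        links = {}
        for mention in doc.ents:                          # STEP 0: NER output
            candidates = get_candidates(mention.text)     # STEP 1: query the KB
            links[mention] = disambiguate(candidates, mention.sent)  # STEP 2
        return links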

Slide 7


1. Candidate generation
Task: given a textual mention, produce a list of candidate IDs
How: build a Knowledge Base (KB) to query candidates from. This is done by parsing links on Wikipedia, e.g.:

    She married [[William King-Noel, 1st Earl of Lovelace|William King]] in 1835

➔ “William King” is a synonym for “William King-Noel, 1st Earl of Lovelace”
● Other synonyms found on Wikipedia: “Earl of Lovelace”, “8th Baron King”, ...
➔ For each synonym, deduce how likely it is to point to a certain ID by normalizing the pair frequencies to prior probabilities (see the sketch below)
● e.g. “Byron” (candidates Q5679, Q272161, Q7259) refers to “Lord Byron” in 35% of the cases, and to “Ada Lovelace” in 55% of the cases
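A minimal sketch of turning alias-entity pair frequencies into prior probabilities (the counts are toy numbers, not the real Wikipedia frequencies):

    from collections import Counter, defaultdict

    # counts[(alias, entity)] = how often this anchor text linked to this page
    counts = Counter({("Byron", "Q5679"): 35,
                      ("Byron", "Q7259"): 55,
                      ("Byron", "Q272161"): 10})

    totals = defaultdict(int)
    for (alias, entity), freq in counts.items():
        totals[alias] += freq

    # Normalize pair frequencies into prior probabilities P(entity | alias)
    priors = {(alias, entity): freq / totals[alias]
              for (alias, entity), freq in counts.items()}
    print(priors[("Byron", "Q7259")])   # 0.55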

Slide 8


2. Entity linker
Task: given a list of candidate IDs + the textual context, produce the most likely identifier
How: compare lexical clues between the candidates and the context

“Ms Byron would become known as the first programmer.” → candidates Q5679, Q272161, Q7259 → Q7259

WikiData ID | WikiData name       | WikiData description                                             | Context similarity
Q5679       | Lord Byron          | English poet and a leading figure in the Romantic movement      | 0.1
Q272161     | Anne Isabella Byron | Wife of Lord Byron                                               | 0.3
Q7259       | Ada Lovelace        | English mathematician, considered the first computer programmer | 0.9

The description of Q7259 is most similar to the original sentence (context).
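Conceptually, the comparison boils down to a similarity between an encoding of the sentence and an encoding of each candidate's description; a toy sketch with random stand-in vectors (the real vectors come from trained encoders):

    import numpy as np

    def cosine(u, v):
        return float(u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Toy 64D vectors standing in for the encoded context sentence and the
    # encoded WikiData descriptions of the candidates
    rng = np.random.default_rng(0)
    sentence_vec = rng.normal(size=64)
    desc_vecs = {qid: rng.normal(size=64) for qid in ("Q5679", "Q272161", "Q7259")}

    # Pick the candidate whose description is most similar to the context
    best = max(desc_vecs, key=lambda qid: cosine(sentence_vec, desc_vecs[qid]))
    print(best)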

Slide 9


Architecture
From the KB: entity ID (candidate) → entity description → entity encoder → 64D vector
From the text: NER mention → sentence → sentence encoder → 64D vector
Similarity between the two 64D vectors → P(E|M) ∈ [0, 1]
Loss function: compares P(E|M) against the gold label {0, 1}
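A toy sketch of this data flow (the real encoders are trained neural networks, and the exact loss is an implementation detail; binary cross-entropy is one natural choice and an assumption here):

    import numpy as np

    def encode(text, dim=64):
        # Stand-in for the trained entity / sentence encoders, which map
        # variable-length text onto a fixed 64D vector
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        return rng.normal(size=dim)

    def probability(description, sentence):
        e, s = encode(description), encode(sentence)
        sim = float(e.dot(s) / (np.linalg.norm(e) * np.linalg.norm(s)))
        return (sim + 1) / 2      # squash the similarity into P(E|M) in [0, 1]

    def loss(p, gold):
        # Binary cross-entropy between P(E|M) and the gold label {0, 1}
        return -(gold * np.log(p) + (1 - gold) * np.log(1 - p))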

Slide 10


Upper bound set by the KB
WikiData contains many infrequently linked topics. To keep the KB manageable in memory, it requires some pruning (see the sketch below):
● Keep only entities with min. 20 incoming interwiki links (from 8M down to 1M entities)
● Each alias-entity pair should occur at least 5 times in WP
● Keep 10 candidate entities per alias/mention
● Result: ca. 1.1M entities and 1.5M aliases
● 350MB file size to store 1M entities and 1.5M aliases + pretrained 64D entity vectors
The KB only stores 14% of all WikiData concepts!
● Long tail of infrequent entities
Still, the EL achieves max. 84.2% accuracy (with an oracle EL disambiguation step)
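A hedged sketch of those pruning filters (entity_links and pair_freqs are hypothetical dicts produced by the Wikipedia parse; the values are toy data):

    from collections import defaultdict

    MIN_ENTITY_LINKS = 20   # min. incoming interwiki links per entity
    MIN_PAIR_FREQ = 5       # min. occurrences of an alias-entity pair in WP
    MAX_CANDIDATES = 10     # candidate entities kept per alias

    entity_links = {"Q5679": 800, "Q7259": 120, "Q272161": 12}
    pair_freqs = {("Byron", "Q5679"): 35, ("Byron", "Q7259"): 55,
                  ("Byron", "Q272161"): 3}

    kept = {e for e, n in entity_links.items() if n >= MIN_ENTITY_LINKS}
    by_alias = defaultdict(list)
    for (alias, entity), n in pair_freqs.items():
        if n >= MIN_PAIR_FREQ and entity in kept:
            by_alias[alias].append((n, entity))

    # Keep at most MAX_CANDIDATES entities per alias, ranked by frequency
    candidates = {a: [e for _, e in sorted(pairs, reverse=True)[:MAX_CANDIDATES]]
                  for a, pairs in by_alias.items()}
    print(candidates)   # {'Byron': ['Q7259', 'Q5679']}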

Slide 11


Accuracy of EL
Trained on 200,000 mentions in Wikipedia articles (2h); tested on 5,000 mentions in (different) Wikipedia articles.
The random baseline picks a random entity from the set of candidates.
The prior probability baseline picks the most likely entity for a given synonym, regardless of context.

Method                Accuracy %
Random baseline       54.0
EL only               73.9
Prior prob baseline   78.2
EL + prior prob       79.0
Oracle KB (max)       84.2

The EL algorithm (by itself) significantly outperforms the random baseline (73.9% > 54.0%) and marginally improves upon the prior probability baseline (79.0% > 78.2%).

Slide 12


Error analysis

“Banteay Meanchey, Battambang, Kampong Cham, ... and Svay Rieng.”
→ predicted: City in Cambodia
→ gold WP link: Province of Cambodia

“Societies in the ancient civilizations of Greece and Rome preferred small families.”
→ predicted: Greece
→ gold WP link: Ancient Greece

“Agnes Maria of Andechs-Merania (died 1201) was a Queen of France.”
→ predicted: kingdom in Western Europe from 987 to 1791
→ gold WP link: current France (the gold annotation was incorrect!)

Slide 13


Curating WP data
We manually curated the WP training data with Prodigy (https://prodi.gy/):
● Took the original “gold” ID from the interwiki link
● Mixed in all other candidate IDs
● Presented them in a random order
Annotation of 500 cases:
● 7.4% did not constitute a proper sentence
● 8.2% did not refer to a proper entity
Of the remaining 422 cases:
● 87.7% were found to be the same
● 5.2% were found to be different
● 7.1% were found to be ambiguous or needed context outside the sentence

Slide 14


Issues in the WP data
Entities without sentence context, e.g. in enumerations, tables, “See also” sections
→ Remove from the dataset
Some links are not really named entities but refer to other concepts such as “privacy”
→ Prune the WikiData KB
WP annotations are not always aligned to the entity types:
“Fiji has experienced many coups recently, in 1987, 2000, and 2006.”
→ Link to “2000 Fijian coup d'état” or to the year “2000”?
“Full metro systems are in operation in Paris, Lyon and Marseille”
→ WP links to “Marseille Metro” instead of to “Marseille”

Slide 15


Annotating news data
1) “easy” - sentence - candidates from KB
2) “hard” - article - free text

Slide 16


Accuracy on news

                            Random baseline | EL only | Prior prob baseline | EL + prior prob | Oracle KB (max)
Easy cases (230 entities)              40.9 |    58.7 |                84.8 |            87.8 |             100
Hard cases (122 entities)              14.9 |    17.4 |                25.6 |            27.3 |            33.9
All (352 entities)                     29.6 |    44.7 |                64.4 |            67.0 |            77.2

➔ The original annotation effort started with 500 randomly selected entities
➔ 16% were not proper entities or sentences, or were Date entities such as “nearly two months”
➔ 9% referred to concepts not in WikiData
➔ 2% were too difficult to resolve
➔ 3% of Prodigy matches could not be matched with the spaCy nlp model
➔ On the news dataset, EL improves more upon the prior probability baseline

Slide 17


Findings in the news data
There will always be entities too vague or outside the KB (e.g. “a tourist called Julia ...”)
Candidate generation can fail due to small lexical variants:
● middle names, “F.B.I.” instead of “FBI”, “'s” or “the” as part of the entity, ...
Disambiguation often requires background information beyond what is in the article.
Metonymy is hard to resolve correctly, even for a human:
● e.g. “the World Economic Forum at Davos … Davos had come to embody that agenda”
Dates and numbers are often impossible to resolve correctly (or need metadata):
➔ “... what happened in the middle of December”
The WikiData knowledge graph is incredibly helpful when analysing the entities manually.

Slide 18


WikiData graph
“Raised in Elizabeth, N. J., Mr. Feeney served as a radio operator in the Air Force and attended Cornell University on the G. I. Bill.”

Q2967590 (Chuck Feeney) → place of birth → Q138311 (Elizabeth)
Q138311 (Elizabeth) → capital of → Q2335128 (Province of New Jersey)
Q2967590 (Chuck Feeney) → educated at → Q49115 (Cornell University)
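To make the graph structure concrete, a small sketch encoding the slide's subgraph as triples (property names written out instead of WikiData P-codes):

    # (subject, property, object) triples from the example above
    triples = [
        ("Q2967590", "place of birth", "Q138311"),  # Chuck Feeney -> Elizabeth
        ("Q138311", "capital of", "Q2335128"),      # Elizabeth -> Province of New Jersey
        ("Q2967590", "educated at", "Q49115"),      # Chuck Feeney -> Cornell University
    ]

    def connected(a, b):
        # Are two entity IDs directly linked in the graph, in either direction?
        return any({s, o} == {a, b} for s, _, o in triples)

    print(connected("Q2967590", "Q49115"))   # True: Feeney studied at Cornell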

Slide 19


Coreference resolution
➔ Links together entities that refer to the same thing / person
➔ e.g. “he” refers to “Nader”
Hugging Face's neuralcoref package works with spaCy (see the sketch below).
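A minimal sketch with neuralcoref (pip install neuralcoref; API as in neuralcoref 4.0, which registers coref extensions on the Doc):

    import spacy
    import neuralcoref

    nlp = spacy.load("en_core_web_sm")
    neuralcoref.add_to_pipe(nlp)   # adds the coref component to the pipeline

    doc = nlp("Nader said he would run again.")
    if doc._.has_coref:
        # Each cluster groups the mentions that refer to the same entity
        for cluster in doc._.coref_clusters:
            print(cluster.main.text, "<-", [m.text for m in cluster.mentions])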

Slide 20


Coref for prediction
Coreference resolution helps to link concepts across sentences. The whole chain should then be linked to the same WikiData ID. The EL algorithm is (currently) trained to predict entity links for sentences.
➔ How to obtain consistency across the chain?
➔ First idea: take the prediction with the highest confidence across all entities in the chain (see the sketch below)
Assessment on the Wikipedia dev set:
➔ 0.3% decrease :(
➔ WP links are biased: usually only the first occurrence (with the most context) is linked
Assessment on the news evaluation set:
➔ 0.5-0.8% increase (might not be significant – need to look at more data)
➔ “easy” from 87.8 to 88.3, “hard” from 27.3 to 28.1, “all” from 67.0 to 67.5
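That “highest confidence wins” heuristic as a small sketch, where a chain is represented as a hypothetical list of (entity_id, confidence) predictions, one per mention:

    def resolve_chain(predictions):
        # Overwrite the whole coref chain with its most confident prediction
        best_id, _ = max(predictions, key=lambda p: p[1])
        return [best_id] * len(predictions)

    chain = [("Q7259", 0.9), ("Q5679", 0.4), ("Q7259", 0.7)]
    print(resolve_chain(chain))   # ['Q7259', 'Q7259', 'Q7259']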

Slide 21


Ongoing work
Get better data for training & evaluation, to better benchmark model choices
Take the hierarchy of WikiData concepts into account:
➔ Predicting the province instead of its capital city is not as bad as predicting an unrelated city
➔ Taking this into account in the loss function could make the training more robust
Coreference resolution can give entity linking a performance boost:
➔ Use coreference resolution to obtain a more consistent set of predictions
➔ Use coreference resolution to enrich the training data
Try it out yourself?
➔ CLI scripts for creating a KB from a WP dump (any language)
➔ CLI scripts for extracting training data from WP/WD
➔ Example code on how to train an entity linking pipe
➔ Possibility to use a custom implementation (through the provided APIs)

Slide 22


Questions?
There are no stupid questions – let's agree there are no stupid answers either.

Slide 23


Backup slides

Slide 24


Memory & speed
Processing Wikipedia: parse aliases and prior probabilities from intrawiki links to build the KB
● Takes about 2 hours to parse the 1100M lines of the English Wikipedia XML dump
Processing WikiData: link English Wikipedia to interlingual WikiData identifiers, and retrieve a concise WikiData description for each entity
● Takes about 7 hours to parse the 55M lines of the WikiData JSON dump
Knowledge Base:
● 55MB to store 1M entities and 1.5M aliases
● 350MB file size to store 1M entities and 1.5M aliases + pretrained 64D entity vectors
● Thanks to efficient Cython data structures, the KB can be kept in memory
● Written to file, and read back in, in a matter of seconds (see the sketch below)
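A hedged sketch of that (de)serialization round trip (dump() and load_bulk() as in the spaCy v2.2 KnowledgeBase API; treat the exact signatures as an assumption):

    from spacy.kb import KnowledgeBase
    from spacy.vocab import Vocab

    vocab = Vocab()
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
    # ... fill the KB with entities and aliases ...
    kb.dump("my_kb")        # written to file in a matter of seconds

    kb2 = KnowledgeBase(vocab=vocab, entity_vector_length=64)
    kb2.load_bulk("my_kb")  # and read back just as fast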

Slide 25


More complex architecture
From the KB: entity ID (candidate) → entity description → entity encoder → 64D vector
From the text: NER mention → sentence (context) → sentence encoder → 128D vector
NER type → one-hot encoding → 16D vector
Prior prob → float → 1D
EL layer combines these inputs → P(E|M) ∈ [0, 1]
Loss function: compares P(E|M) against the gold label {0, 1}
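A toy sketch of how the extra signals could be stitched together before the final EL layer (the dimensions follow the slide; the plain concatenation is an assumption about the wiring):

    import numpy as np

    sentence_vec = np.random.normal(size=128)   # sentence (context) encoding
    ner_type = np.zeros(16)                     # one-hot encoded NER label
    ner_type[0] = 1.0                           # e.g. the slot for PERSON
    prior_prob = np.array([0.55])               # P(entity | mention) from the KB

    # Concatenate all inputs into one feature vector for the EL layer
    el_input = np.concatenate([sentence_vec, ner_type, prior_prob])
    print(el_input.shape)                       # (145,)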

Slide 26


Code examples

    from spacy.kb import KnowledgeBase

    # Assumes an existing `nlp` pipeline; `vocab` is its vocabulary and
    # v1, v2, v3 are pretrained 64D entity vectors

    # 1) Build the Knowledge Base and fill it with entities and aliases
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
    kb.add_entity(entity="Q1004791", prob=0.2, entity_vector=v1)
    kb.add_entity(entity="Q42", prob=0.8, entity_vector=v2)
    kb.add_entity(entity="Q5301561", prob=0.1, entity_vector=v3)
    kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"],
                 probabilities=[0.6, 0.1, 0.2])
    kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

    # 2) Create the entity linking pipe, hook up the KB, add it to the pipeline
    el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 128})
    el_pipe.set_kb(kb)
    nlp.add_pipe(el_pipe, last=True)

    # 3) Train only the entity linker, with all other pipes disabled
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        …
        nlp.update(…)

    # 4) Apply the trained pipeline: each entity now carries a KB identifier
    text = "Douglas Adams made up the stories as he wrote."
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_, ent.kb_id_)