
DAT630 - Entity Linking I.

University of Stavanger, DAT630, 2016 Autumn
Guest lecture by Faegheh Hasibi

Krisztian Balog

November 08, 2016
Transcript

  1. What is entity linking? Example: “British expatriates brought football to Argentina in the 19th century. The rivalry between the England and Argentina national football teams, however, is generally traced back to the 1966 FIFA World Cup.” Linked entities: [England national football team], [Argentina national football team], [1966 FIFA World Cup].
  2. What is entity linking? Linking free text to entities. • Text: any piece of text • documents (news, blog posts, etc.) • tweets • queries • … • Entities: typically taken from a knowledge graph • Wikipedia • DBpedia • Freebase • …
  3. Why entity linking? Enables: • Semantic search • Automatic document enrichment • Ontology learning, KB population. Used as a feature to improve: • Classification • Retrieval
  4. Mention detection Detecting all “linkable phrases” (mentions) of the text, with their corresponding entities. • Recall oriented • Do not miss any entity that should be linked • Find entity name variants • E.g. “jlo” is a name variant of [Jennifer Lopez] • Filter out inappropriate ones • E.g. “new york” matches >2k different entities; not all are interesting
  5. Mention detection 1. Build a dictionary of entity surface forms • contains a mapping from entity name variants to entities 2. Check all document n-grams (substrings) against the dictionary • The maximum n-gram length is typically between 6 and 8 3. Filter out undesired entities
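The three steps above can be sketched in a few lines of Python. The surface-form dictionary below is a hypothetical toy example; a real one would be built from Wikipedia as described on the next slides.

```python
# Minimal sketch of dictionary-based mention detection.
# The surface-form dictionary is toy data, not real Wikipedia statistics.
MAX_NGRAM_LEN = 8  # maximum n-gram length, typically between 6 and 8 tokens

surface_forms = {
    "jlo": ["Jennifer Lopez"],
    "jennifer lopez": ["Jennifer Lopez"],
    "new york": ["New York City", "New York (state)", "New York (magazine)"],
}

def detect_mentions(text, dictionary, max_len=MAX_NGRAM_LEN):
    """Return (mention, candidate entities) pairs found in the text."""
    tokens = text.lower().split()
    mentions = []
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in dictionary:
                mentions.append((ngram, dictionary[ngram]))
    return mentions

print(detect_mentions("jlo moved to New York", surface_forms))
```

A real system would additionally record mention offsets and handle overlapping matches; this sketch only shows the dictionary lookup over n-grams.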
  6. Mention detection Key questions: • What is the data source for entity name variants? • Wikipedia • How to filter out inappropriate entities? • Statistical hints • Mention length
  7. Building the dictionary • Page title • the most common name of the entity • Redirect pages • alternative names for referring to the entity • Disambiguation pages • entities that share the same name • Anchor texts • Wikipedia hyperlinks • Bold texts from the first paragraph • denote other name variants of the entity
  12. Filtering mentions • Surface form dictionaries are rich and large • A mention can be associated with too many entities • esp. very common names (e.g. ‘new york’, ‘us’) • Some mentions are unlikely to be linked to any entity • ‘the’ -> [The The] • ‘b’ -> [B (I Am Kloot album)]
  13. Keyphraseness Probability of a mention m being linked: keyphraseness(m) = (number of Wikipedia articles where m appears as a link) / (number of Wikipedia articles that contain m)
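As a small sketch, with invented counts (not real Wikipedia statistics):

```python
# Keyphraseness: the fraction of articles containing a mention in which
# the mention actually appears as a link. The counts below are invented.
def keyphraseness(articles_with_link, articles_with_mention):
    return articles_with_link / articles_with_mention

# A highly "linkable" phrase vs. a stop word:
print(keyphraseness(900, 1000))        # 0.9
print(keyphraseness(50, 5_000_000))    # 1e-05
```

Mentions with very low keyphraseness (like ‘the’) can be filtered out before entity ranking.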
  14. Commonness Probability of a mention referring to a given entity: commonness(e, m) = (number of times entity e is the link target of mention m) / (total number of times mention m appears as a link)
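A sketch of the same formula over toy link-target counts (invented numbers, not real Wikipedia statistics):

```python
from collections import Counter

# Toy (mention, link target) counts for the mention "jordan".
link_targets = Counter({
    ("jordan", "Michael Jordan"): 600,
    ("jordan", "Jordan (country)"): 350,
    ("jordan", "Michael I. Jordan"): 50,
})

def commonness(entity, mention, counts):
    """Fraction of m's link occurrences whose target is entity e."""
    total = sum(c for (m, _), c in counts.items() if m == mention)
    return counts[(mention, entity)] / total

print(commonness("Michael Jordan", "jordan", link_targets))  # 0.6
```

Note that the commonness values of all candidate entities for a mention sum to 1, so commonness is a probability distribution over candidates.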
  15. Entity ranking Ranking entities and narrowing down the space of disambiguation possibilities. Various types of features can be used: • Context independent • Context dependent • Entity relatedness
  16. Context independent features • Keyphraseness • Commonness • Link prior • Probability of the entity measured in terms of incoming links • Page views • Probability of the entity measured in terms of traffic volume. Neither the text nor other mentions in the document are taken into account.
  17. Context dependent features Compare the surrounding context of a mention with the textual representation of the entity. • Context of a mention • Window of text (sentence, paragraph) around the mention • Entire document • Similarity function • Cosine similarity
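A minimal sketch of such a feature: cosine similarity between a bag-of-words vector for the mention's context window and one for the candidate entity's article text. Raw term counts are used here; a real system would typically use TF-IDF weights.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy context window and toy entity article text:
context = Counter("rivalry between the england and argentina national football teams".split())
article = Counter("argentina national football team england rivalry world cup".split())
print(cosine(context, article))
```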
  18. Entity relatedness Captures coherence between entity linking decisions in the text. • Assumption: a document focuses on one or at most a few topics • Entities mentioned in a document should be topically related to each other • Relatedness can be captured between two entities
  19. Entity relatedness [Slide shows an excerpt and Figure 1 from the cited paper: obtaining a semantic relatedness measure between Automobile and Global Warming from the incoming and outgoing Wikipedia links of the two articles.] Image taken from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In AAAI WikiAI Workshop.
  20. Wikipedia Link-based Measure (WLM) Two entities are related if there is a large overlap between their incoming links.
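Milne and Witten's measure can be sketched from incoming-link sets as follows. The link sets and the total Wikipedia article count are toy/assumed values here.

```python
import math

N_ARTICLES = 6_000_000  # assumed total number of Wikipedia articles

def wlm(links_a, links_b, n=N_ARTICLES):
    """WLM relatedness from incoming-link sets, following the
    normalized-distance-style formula of Milne and Witten (2008a)."""
    a, b = set(links_a), set(links_b)
    common = a & b
    if not common:
        return 0.0
    distance = (math.log(max(len(a), len(b))) - math.log(len(common))) \
             / (math.log(n) - math.log(min(len(a), len(b))))
    return max(0.0, 1.0 - distance)  # convert the distance to a similarity

# Toy incoming-link sets for two entities:
automobile = {"Petrol Engine", "Fossil Fuel", "Emission Standard", "Henry Ford"}
global_warming = {"Fossil Fuel", "Greenhouse Gas", "Kyoto Protocol", "Carbon Dioxide"}
print(wlm(automobile, global_warming))
```

Identical link sets give a relatedness of 1, disjoint sets give 0, and partial overlap falls in between.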
  21. Disambiguation Approaches Selecting a single entity (or none) for each mention. • Pruning based on score threshold • Classification algorithms • Graph-based approaches
  22. Graph based approach • Problem formulation: find a dense subgraph that contains all mention nodes and exactly one mention-entity edge for each mention • Greedy algorithm iteratively removes edges • The graph with the highest density is kept as the solution
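A much-simplified greedy sketch of this idea (not the exact algorithm from the original work): the candidate entity whose total relatedness to the rest of the graph is lowest is pruned first, repeating until every mention keeps exactly one entity. Mentions, candidate entities, and relatedness weights below are toy values.

```python
# Greedy pruning sketch for collective disambiguation over a toy
# mention-entity graph. Real systems use richer edge weights and keep
# the densest intermediate subgraph rather than pruning to the end.
def greedy_disambiguate(candidates, weight):
    """candidates: {mention: set of entities}; weight(e1, e2) -> float."""
    cands = {m: set(es) for m, es in candidates.items()}
    while any(len(es) > 1 for es in cands.values()):
        def degree(e):
            others = [o for es in cands.values() for o in es if o != e]
            return sum(weight(e, o) for o in others)
        # among mentions that still have a choice, drop the candidate
        # whose total relatedness to the rest of the graph is lowest
        _, m, e = min((degree(e), m, e)
                      for m, es in cands.items() if len(es) > 1 for e in es)
        cands[m].discard(e)
    return {m: next(iter(es)) for m, es in cands.items()}

# Toy symmetric relatedness weights (unlisted pairs default to 0.0):
relatedness = {
    frozenset(("Chicago Bulls", "Michael Jordan")): 0.8,
    frozenset(("Space Jam (film)", "Michael Jordan")): 0.6,
    frozenset(("Space Jam (film)", "Chicago Bulls")): 0.5,
}
w = lambda a, b: relatedness.get(frozenset((a, b)), 0.0)
print(greedy_disambiguate(
    {"Jordan": {"Michael Jordan", "Jordan (country)"},
     "Bulls": {"Chicago Bulls", "Bull"},
     "Space Jam": {"Space Jam (film)"}}, w))
```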
  23. Graph based approach A Referent Graph is a weighted graph G=(V, E), where the node set V contains all name mentions in a document and all the possible referent entities of these name mentions, with each node representing a name mention or an entity; each edge between a name mention and an entity represents a compatibility relation between them (e.g. cosine similarity between the context of the surface form and the Wikipedia article); each edge between two entities represents a semantic-relatedness relation, Relatedness(e, e’). [Figure 2 shows the Referent Graph of Example 1: mentions “Space Jam”, “Bulls”, “Jordan” linked by weighted edges to candidate entities such as Space Jam, Chicago Bulls, Bull, Michael Jordan, Michael I. Jordan, Michael B. Jordan.] By representing both the local mention-to-entity compatibility and the global entity relations as edges, two types of dependencies are captured in the Referent Graph. Image taken from Han et al. (2011). Collective Entity Linking in Web Text: A Graph-based Method. In SIGIR.
  24. Mention detection Question: Considering Table 1, what is the output of the mention detection step for the given sample text? Answer: All mention-entity pairs of Table 1 are considered, except the ones related to the mention “democracy”. We ignore this mention because the longer mention “multiparty democracy” is considered.
  25. Entity ranking Question: Compute the commonness for all mention-entity pairs where the mention is “1992 elections”.