Type-Aware Entity Retrieval

Date: August 22, 2016
Venue: Saratov, Russian Federation. The 10th Russian Summer School in Information Retrieval (RuSSIR '16)

Please cite, link to or credit this presentation when using it or part of it in your work.

#InformationRetrieval #IR #EntityRanking #EntityRetrieval #ER #EntityTypes #EntityOrientedSearch #KnowledgeBases #SemanticSearch

Darío Garigliotti

Transcript

  1. Type-aware Entity Retrieval
    Darío Garigliotti, University of Stavanger

    Motivation
    ∎ One of the unique characteristics of entity retrieval is that entities are typed.
    ∎ Typically, types are organized hierarchically in a type categorization system.
    ∎ We explore three dimensions to understand how to use entity type information:
      ⋆ RQ1: How do the retrieval approaches perform across different type taxonomies?
      ⋆ RQ2: How to represent the type information provided by the type hierarchy?
      ⋆ RQ3: How to combine type-based and text-based information in retrieval?

    Type Taxonomies
    We normalize four type systems to a uniform taxonomy structure:

    DBpedia Ontology
    ∎ A well-designed hierarchy.
    ∎ Created manually by considering the most frequently used infoboxes in Wikipedia.
    ∎ Clean and consistent, but with limited coverage.
    ∎ Types per level: |Level 1| = 58; |Level 2| = 114; |Level 3| = 142; |Level 4| = 213; |Level 5| = 45; |Level 6| = 17; |Level 7| = 1.

    Freebase Types
    ∎ A two-layer categorization system: types and domains.
    ∎ Entities are assigned only to types; most of them have "same as" links to DBpedia entities.
    ∎ Types per level: |Level 1| = 92; |Level 2| = 1,626.

    Wikipedia Categories
    ∎ It consists of textual labels known as categories.
    ∎ It is not a well-defined "is-a" hierarchy but a graph, so it requires a major normalization strategy.
    ∎ Category assignments are neither consistent nor complete.
    ∎ Types per level: |Level 1| = 27; |Level 2 ∪ ... ∪ Level 10| = 121,657; |Level 11 ∪ ... ∪ Level 24| = 410,697; |Level 25 ∪ ... ∪ Level 34| = 14,564.

    YAGO Types
    ∎ A deep subsumption hierarchy.
    ∎ Constructed by taking leaf categories from Wikipedia categories and then using WordNet synsets to establish the hierarchy.
    ∎ Types per level: |Level 1| = 61; |Level 2 ∪ ... ∪ Level 5| = 80,384; |Level 6 ∪ ... ∪ Level 10| = 461,843; |Level 11 ∪ ... ∪ Level 19| = 26,383.

    Type Representations
    We propose three representations of hierarchical type information:
    ∎ Types along path to the top
    ∎ Top-level types
    ∎ Most specific types
    [Diagrams: one type tree per representation, showing an entity e and types t0 ... t12.]

    Type Information in Retrieval
    We define the retrieval task in a generative probabilistic framework. Both query and entity are considered in the term space as well as in the type space. An oracle process can provide the target types for the query from its relevant results.
    [Diagram: query and entity (labels: "Olympic games", "Rio de Janeiro") linked by term-based similarity and by type-based similarity between target types and entity types.]

    (Strict) Filtering: $P(q \mid e) = P(\theta^{T'}_q \mid \theta^{T'}_e) \cdot \chi[\mathrm{types}(q) \cap \mathrm{types}(e) \neq \emptyset]$
    (Soft) Filtering: $P(q \mid e) = P(\theta^{T'}_q \mid \theta^{T'}_e) \cdot P(\theta^{T}_q \mid \theta^{T}_e)$
    Interpolation: $P(q \mid e) = (1 - \lambda) \cdot P(\theta^{T'}_q \mid \theta^{T'}_e) + \lambda \cdot P(\theta^{T}_q \mid \theta^{T}_e)$

    Here $\theta^{T'}$ denotes the term-based models and $\theta^{T}$ the type-based models of the query and the entity. The type weight λ takes values in [0, 1] in steps of 0.05. We use the best-performing setting when comparing against other approaches.

    Results
    [Plots for Fig. 1 and Fig. 2: MAP (axis from 0 to 0.4) for strict filtering, soft filtering, and interpolation on DBpedia, Freebase, Wikipedia, and YAGO. Panels: (a) Types along path to top, (b) Top-level types, (c) Most-specific types.]
    Fig. 1: Retrieval performance considering only entities that have types from all four type systems. The term-based baseline (shown as the red line) and the ground truth are restricted to the same set of entities.
    Fig. 2: Retrieval performance considering all entities, and using the full set of relevance judgments. The red line represents the term-based baseline.

    Conclusions
    ∎ Type information proves most useful when larger, deeper type taxonomies provide very specific types.
      ⋆ RQ1 (Type taxonomy): given a type representation and a retrieval model, Wikipedia performs best in most cases.
      ⋆ RQ2 (Type representation): using the most specific types is the most effective way to represent type information.
      ⋆ RQ3 (Retrieval model): all models suffer from missing type information, but interpolation appears to be the most robust.
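
The strict filtering, soft filtering, and interpolation formulas from "Type Information in Retrieval" can be illustrated with a short Python sketch. This is not the author's implementation: the function names and `LAMBDA_GRID` are hypothetical, and the term-based and type-based scores are assumed to be precomputed probabilities passed in by the caller, since the poster does not say how they are estimated.

```python
from typing import List, Set


def strict_filtering(term_score: float,
                     query_types: Set[str],
                     entity_types: Set[str]) -> float:
    """Keep the term-based score only if query and entity share a type.

    The set intersection plays the role of the indicator
    chi[types(q) ∩ types(e) != ∅] in the strict filtering formula.
    """
    return term_score if query_types & entity_types else 0.0


def soft_filtering(term_score: float, type_score: float) -> float:
    """Multiply the term-based score by the type-based score."""
    return term_score * type_score


def interpolation(term_score: float, type_score: float, lam: float) -> float:
    """Linear mixture of both scores; lam is the type weight in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * term_score + lam * type_score


# Type weights from 0 to 1 in steps of 0.05, as stated on the poster.
LAMBDA_GRID: List[float] = [i / 20 for i in range(21)]


if __name__ == "__main__":
    # Toy scores and type sets, made up purely for illustration.
    term_score, type_score = 0.4, 0.8
    query_types = {"City", "Place"}
    entity_types = {"City"}

    print(strict_filtering(term_score, query_types, entity_types))
    print(soft_filtering(term_score, type_score))
    for lam in LAMBDA_GRID:
        print(lam, interpolation(term_score, type_score, lam))
```

With λ = 0 the interpolation reduces to the term-based score alone, and with λ = 1 to the type-based score alone. Unlike either filtering variant, it never zeroes out an entity just because its type information is missing, which fits the poster's observation that interpolation is the most robust model.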