
DAT630 - Entity Retrieval II.

University of Stavanger, DAT630, 2016 Autumn

Krisztian Balog

November 02, 2016

Transcript

  1. Recap

    - Entities are meaningful units for organizing information
      - Used for enriching search engine results
    - Knowledge bases store massive amounts of information about entities as RDF triples
    - Entities can be represented as documents for retrieval
    - Using document fields can preserve (some of) the underlying structure
  2. So far…

    - Term-based retrieval models
      - Robust and effective, but ignore semantics
        - Entity-specific properties (types, relationships, etc.)

    [Figure: the information need is matched against a text-only entity representation.]
  3. Incorporating semantics

    - Working definition: semantics = references to meaningful structures
    - How to capture, represent, and use structure?
    - It concerns all components of the retrieval process!

    [Figure: matching with text-only representations of the information need and the entity, versus matching with text+structure representations.]
  4. Spectrum of queries

    From human understanding to machine understanding:

    - Natural language: I need a list of female computer scientists who work on semantic search.
    - Keyword: female computer scientists semantic search
    - Keyword++: female semantic search <profession: computer scientist>
    - Structured language: SELECT ?p WHERE { ?p has-profession Computer_Scientist . ?p has-gender Female . ?p occurs-with "semantic search" }
  6. Scenario #1 - User provides keyword++ query

    [Figure: search UI (suggestions, facets, etc.), retrieval method, entity search results.]
  7. Scenario #2 - Query understanding component constructs the keyword++ query (automatically)

    [Figure: keyword query -> query understanding -> keyword++ query -> retrieval method -> entity search results.]
  8. Type-aware ranking

    [Figure: a query (Olympic games) with its target types is compared to an entity (Rio de Janeiro) with its entity types, using term-based similarity for the text and type-based similarity for the types.]
  9. In general, categorizing things can be hard

    - What is King Arthur?
      - Person / Royalty / British royalty
      - Person / Military person
      - Person / Fictional character
  10. Considerations for type-aware ranking

    - Need to be able to handle the imperfections of the type system
      - Inconsistencies
      - Missing assignments
      - Granularity issues
        - Entities labeled with too general or too specific types
    - User input is to be treated as a hint, not as a strict filter
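  11. Type-aware retrieval #1 - Strict filtering model

    $P(q|e) = P(q_w|e) \cdot [\mathit{types}(q) \cap \mathit{types}(e) \neq \emptyset]$

    - $P(q_w|e)$: term-based similarity (w stands for word)
    - $[\mathit{types}(q) \cap \mathit{types}(e) \neq \emptyset]$: type-based similarity; 1 if the query and entity have some types in common, otherwise 0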
  12. Type-aware retrieval #2 - Soft filtering model

    $P(q|e) = P(q_w|e) \cdot P(q_t|e)$

    - $P(q_w|e)$: term-based similarity (w stands for word)
    - $P(q_t|e)$: type-based similarity; compare the type distribution of the query (query types) against that of the entity (entity types)
  13. Type-aware retrieval #3 - Interpolation model

    $P(q|e) = (1 - \lambda) P(q_w|e) + \lambda P(q_t|e)$

    - $P(q_w|e)$: term-based similarity (w stands for word)
    - $P(q_t|e)$: type-based similarity; compare the type distribution of the query (query types) against that of the entity (entity types)
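The three type-aware models (strict filtering, soft filtering, interpolation) differ only in how they combine a term-based score with a type-based one. The sketch below illustrates that difference under stated assumptions: the slides do not say how the type distributions are compared, so `type_similarity` uses histogram intersection purely as a stand-in for the type-based component. The function names, the interpolation weight `lam`, and the toy values are all hypothetical.

```python
from typing import Dict


def type_similarity(query_types: Dict[str, float], entity_types: Dict[str, float]) -> float:
    """Assumed stand-in for the type-based similarity: histogram intersection."""
    return sum(min(p, entity_types.get(t, 0.0)) for t, p in query_types.items())


def strict_filtering(term_score: float, query_types: Dict[str, float],
                     entity_types: Dict[str, float]) -> float:
    """Keep the term-based score only if query and entity share at least one type."""
    shared = set(query_types) & set(entity_types)
    return term_score if shared else 0.0


def soft_filtering(term_score: float, query_types: Dict[str, float],
                   entity_types: Dict[str, float]) -> float:
    """Multiply the term-based score by the type-based similarity."""
    return term_score * type_similarity(query_types, entity_types)


def interpolation(term_score: float, query_types: Dict[str, float],
                  entity_types: Dict[str, float], lam: float = 0.5) -> float:
    """Linear mixture of term-based and type-based similarity with weight lam."""
    return (1.0 - lam) * term_score + lam * type_similarity(query_types, entity_types)


# Hypothetical toy values, not taken from the slides.
query_types = {"Person": 0.5, "Royalty": 0.5}
entity_types = {"Person": 0.5, "Fictional character": 0.5}
term_score = 0.2  # stands in for the term-based similarity

print(strict_filtering(term_score, query_types, entity_types))
print(soft_filtering(term_score, query_types, entity_types))
print(interpolation(term_score, query_types, entity_types, lam=0.5))
```

With strict filtering, an entity that shares no type with the query drops to zero regardless of its text match. The other two models only discount it, which is closer to treating the user's type input as a hint.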
  14. Searching for arbitrary relations*

    - "airlines that currently use Boeing 747 planes" (target type: ORG; input entity: Boeing 747)
    - "Members of The Beaux Arts Trio" (target type: PER; input entity: The Beaux Arts Trio)
    - "What countries does Eurail operate in?" (target type: LOC; input entity: Eurail)

    *given an input entity and target type
  15. A typical pipeline

    [Figure: Input (entity, target type, relation) -> Candidate entities -> Ranked list of entities. Steps shown along the pipeline: retrieving docs/snippets, query expansion, ...; type filtering, deduplication, exploiting lists, ...]
  16. Modeling related entity finding

    - Ranking entities of a given type (T) that stand in a required relation (R) with an input entity (E)
    - Three-component model

    $p(e|E, T, R) \propto p(e|E) \cdot p(T|e) \cdot p(R|E, e)$

    - $p(e|E)$: co-occurrence model
    - $p(T|e)$: type filtering
    - $p(R|E, e)$: context model
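As a rough illustration of the three-component model, the sketch below multiplies a co-occurrence score, a type score, and a context score per candidate and normalises the result. The function name, the candidate names, and all probability values are made up for the example; how each component is estimated is outside the scope of this sketch.

```python
from typing import Dict, List, Tuple


def rank_related_entities(cooccurrence: Dict[str, float],
                          type_prob: Dict[str, float],
                          context: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score each candidate e by p(e|E) * p(T|e) * p(R|E,e), then normalise."""
    scores = {
        e: p_co * type_prob.get(e, 0.0) * context.get(e, 0.0)
        for e, p_co in cooccurrence.items()
    }
    total = sum(scores.values())
    if total > 0:
        scores = {e: s / total for e, s in scores.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical component estimates for three candidate entities.
cooccurrence = {"airline_a": 0.5, "manufacturer_b": 0.3, "city_c": 0.2}  # p(e|E)
type_prob = {"airline_a": 1.0, "manufacturer_b": 1.0, "city_c": 0.0}     # p(T|e), T = ORG
context = {"airline_a": 0.6, "manufacturer_b": 0.1, "city_c": 0.3}       # p(R|E,e)

ranking = rank_related_entities(cooccurrence, type_prob, context)
for entity, score in ranking:
    print(entity, round(score, 3))
```

Because the components are multiplied, a type score of zero acts as a hard filter here; a smoothed type estimate would soften that.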
  17. Scenario

    - Entity descriptions are not readily available
    - Entity occurrences are annotated
      - manually
      - automatically (i.e., entity linking)
  18. The basic idea

    Use documents to go from queries to entities

    - Query-document association: the document’s relevance
    - Document-entity association: how well the document characterises the entity

    [Figure: query q connected to entity e through a document.]
  19. Two principal approaches

    - Profile-based methods
      - Create a textual profile for entities, then rank them (by adapting document retrieval techniques)
    - Document-based methods
      - Indirect representation based on mentions identified in documents
      - First ranking documents (or snippets) and then aggregating evidence for associated entities
  20. Profile-based methods

    [Figure: the query q is matched against textual profiles, one per entity e, built from the documents.]
  21. Document-based methods

    [Figure: the query q is matched against documents; evidence from the documents is then aggregated for the associated entities e.]
  22. Many possibilities in terms of modeling

    - Generative (probabilistic) models
    - Discriminative (probabilistic) models
    - Voting models
    - Graph-based models
  23. Candidate models (“Model 1”)

    $P(q|\theta_e) = \prod_{t \in q} P(t|\theta_e)^{n(t,q)}$

    - Smoothing with a collection-wide background model: $P(t|\theta_e) = (1 - \lambda) P(t|e) + \lambda P(t)$
    - $P(t|e) = \sum_d P(t|d, e) P(d|e)$
      - $P(d|e)$: document-entity association
      - $P(t|d, e)$: term-candidate co-occurrence in a particular document; in the simplest case, $P(t|d)$
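A minimal sketch of the candidate model, assuming the simplest case where the term-candidate co-occurrence is just the maximum-likelihood term probability in the document. The function names, the smoothing weight `lam`, and the toy documents and associations are hypothetical; the score is computed in log space to avoid underflow.

```python
import math
from collections import Counter
from typing import Dict, List


def term_prob(term: str, tokens: List[str]) -> float:
    """Maximum-likelihood estimate of P(t|d) from a token list."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def entity_term_prob(term: str, docs: Dict[str, List[str]], assoc: Dict[str, float]) -> float:
    """P(t|e) = sum_d P(t|d) * P(d|e), using P(t|d) in place of P(t|d,e)."""
    return sum(term_prob(term, docs[d]) * weight for d, weight in assoc.items())


def model1_log_score(query: List[str], docs: Dict[str, List[str]],
                     assoc: Dict[str, float], lam: float = 0.1) -> float:
    """log P(q|theta_e) with the entity model smoothed by the collection model."""
    collection = [t for tokens in docs.values() for t in tokens]
    score = 0.0
    for term, count in Counter(query).items():  # count is n(t,q)
        p = (1.0 - lam) * entity_term_prob(term, docs, assoc) + lam * term_prob(term, collection)
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += count * math.log(p)
    return score


# Hypothetical toy collection and document-entity associations P(d|e).
docs = {
    "d1": ["semantic", "search", "expert"],
    "d2": ["cooking", "recipes", "search"],
}
assoc_e1 = {"d1": 1.0}
assoc_e2 = {"d2": 1.0}
query = ["semantic", "search"]

print(model1_log_score(query, docs, assoc_e1))
print(model1_log_score(query, docs, assoc_e2))
```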
  24. Document models (“Model 2”)

    $P(q|e) = \sum_d P(q|d, e) P(d|e)$

    - $P(q|d, e)$: document relevance; how well document d supports the claim that e is relevant to q
      - $P(q|d, e) = \prod_{t \in q} P(t|d, e)^{n(t,q)}$
      - Simplifying assumption (t and e are conditionally independent given d): $P(t|d, e) = P(t|\theta_d)$
    - $P(d|e)$: document-entity association
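A minimal sketch of the document model under the simplifying assumption above: each document is scored against the query on its own, and the scores are summed over documents weighted by the document-entity association. The document language model is smoothed with the collection model here, which is an assumption of this sketch; the function names, `lam`, and the toy data are hypothetical.

```python
from typing import Dict, List


def doc_query_likelihood(query: List[str], doc: List[str],
                         collection: List[str], lam: float = 0.1) -> float:
    """P(q|theta_d): product over query tokens of the smoothed P(t|theta_d).

    Iterating over tokens multiplies a repeated term n(t,q) times.
    """
    likelihood = 1.0
    for term in query:
        p_doc = doc.count(term) / len(doc) if doc else 0.0
        p_coll = collection.count(term) / len(collection) if collection else 0.0
        likelihood *= (1.0 - lam) * p_doc + lam * p_coll
    return likelihood


def model2_score(query: List[str], docs: Dict[str, List[str]],
                 assoc: Dict[str, float], lam: float = 0.1) -> float:
    """P(q|e) = sum_d P(q|theta_d) * P(d|e)."""
    collection = [t for tokens in docs.values() for t in tokens]
    return sum(
        doc_query_likelihood(query, docs[d], collection, lam) * weight
        for d, weight in assoc.items()
    )


# Hypothetical toy collection and document-entity associations P(d|e).
docs = {
    "d1": ["semantic", "search", "expert"],
    "d2": ["cooking", "recipes", "search"],
}
assoc_e1 = {"d1": 1.0}
assoc_e2 = {"d2": 1.0}
query = ["semantic", "search"]

print(model2_score(query, docs, assoc_e1))
print(model2_score(query, docs, assoc_e2))
```

Unlike the candidate model, no per-entity language model is built; the entity only enters through the association weights, so an existing document ranking can be reused.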
  25. Document-entity associations

    - Boolean (or set-based) approach
    - Weighted by the confidence in entity linking
    - Consider other entities mentioned in the document

    [Figure: query q connected to entity e through a document.]
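The three options for document-entity associations could look roughly like the sketch below. It assumes each document comes with a mapping from linked entities to an entity-linking confidence. The third function, which divides by the total confidence of all entities in the document, is only one possible reading of "consider other entities mentioned in the document". All names and values are hypothetical.

```python
from typing import Dict


def boolean_association(annotations: Dict[str, float], entity: str) -> float:
    """1 if the entity is annotated in the document, otherwise 0."""
    return 1.0 if entity in annotations else 0.0


def confidence_association(annotations: Dict[str, float], entity: str) -> float:
    """Use the entity-linking confidence as the association weight."""
    return annotations.get(entity, 0.0)


def normalised_association(annotations: Dict[str, float], entity: str) -> float:
    """Assumed variant: share of the entity among all entities in the document."""
    total = sum(annotations.values())
    return annotations.get(entity, 0.0) / total if total else 0.0


# Hypothetical annotations for one document: entity -> linking confidence.
annotations = {"e1": 0.9, "e2": 0.3}

print(boolean_association(annotations, "e1"))
print(confidence_association(annotations, "e1"))
print(normalised_association(annotations, "e1"))
```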