DAT630 - Entity Retrieval II.

University of Stavanger, DAT630, 2016 Autumn

Krisztian Balog

November 02, 2016

Transcript

  1. Recap
    - Entities are meaningful units for organizing information
    - Used for enriching search engine results
    - Knowledge bases store massive amounts of information about entities as RDF triples
    - Entities can be represented as documents for retrieval
    - Using document fields can preserve (some of) the underlying structure
  2. So far…
    - Term-based retrieval models
    - Robust and effective, but ignore semantics
      - entity-specific properties (types, relationships, etc.)
    [Figure: matching an info need against an entity, both in a text-only representation]
  3. Incorporating semantics
    - Working definition: semantics = references to meaningful structures
    - How to capture, represent, and use structure?
    - It concerns all components of the retrieval process!
    [Figure: matching an info need against an entity, text-only representation vs. text+structure representation]
  4. Spectrum of queries (from human understanding to machine understanding)
    - Natural language: I need a list of female computer scientists who work on semantic search.
    - Keyword: female computer scientists semantic search
    - Keyword++: female semantic search <profession: computer scientist>
    - Structured language: SELECT ?p WHERE { ?p has-profession Computer_Scientist . ?p has-gender Female . ?p occurs-with "semantic search" }
  6. Scenario #1 - User provides keyword++ query
    [Diagram: search UI (suggestions, facets, etc.) -> retrieval method -> entity search results]
  7. Scenario #2 - Query understanding component constructs the keyword++ query (automatically)
    [Diagram: keyword query -> query understanding -> keyword++ query -> retrieval method -> entity search results]
  8. Type-aware ranking
    [Figure: a query ("Olympic games", with target types) is compared with an entity ("Rio de Janeiro", with entity types) through term-based similarity and type-based similarity]
  9. In general, categorizing things can be hard
    - What is King Arthur?
      - Person / Royalty / British royalty
      - Person / Military person
      - Person / Fictional character
  10. Considerations for type-aware ranking
    - Need to be able to handle the imperfections of the type system
      - Inconsistencies
      - Missing assignments
      - Granularity issues
        - Entities labeled with too general or too specific types
    - User input is to be treated as a hint, not as a strict filter
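  11. Type-aware retrieval #1 - Strict filtering model
    $P(q|e) = P(q_w|e) \cdot [\mathit{types}(q) \cap \mathit{types}(e) \neq \emptyset]$
    - $P(q_w|e)$: term-based similarity (w stands for word)
    - $[\mathit{types}(q) \cap \mathit{types}(e) \neq \emptyset]$: type-based similarity; 1 if the query and entity have some types in common, otherwise 0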
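As an illustration of the strict filtering model, the following is a minimal Python sketch. The function name `strict_filter_score` and the representation of types as sets of strings are assumptions made for this example; the term-based score $P(q_w|e)$ is assumed to be computed elsewhere and passed in.

```python
def strict_filter_score(term_score, query_types, entity_types):
    """Strict filtering: keep the term-based score only if the query
    and the entity share at least one type, otherwise return 0.

    term_score   -- P(q_w|e), assumed to come from a term-based model
    query_types  -- iterable of target type labels for the query
    entity_types -- iterable of type labels assigned to the entity
    """
    has_common_type = bool(set(query_types) & set(entity_types))
    return term_score if has_common_type else 0.0
```

Because the indicator is binary, an entity with a missing or inconsistent type assignment is dropped entirely, which is the weakness the softer models try to avoid.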
  12. Type-aware retrieval #2 - Soft filtering model
    $P(q|e) = P(q_w|e) \cdot P(q_t|e)$
    - $P(q_w|e)$: term-based similarity (w stands for word)
    - $P(q_t|e)$: type-based similarity; compare the type distribution of the query (query types) against that of the entity (entity types)
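The slide does not say how the two type distributions are compared, so the sketch below makes its own choices, all of which are assumptions: types are spread uniformly over the assigned labels, the entity side is smoothed with a uniform background so that missing assignments do not zero out the score, and $P(q_t|e)$ is approximated by the overlap $\sum_t P(t|q) P(t|e)$. The names `type_distribution` and `soft_filter_score` are hypothetical.

```python
def type_distribution(types, all_types, smoothing=0.1):
    """Uniform distribution over the assigned types, mixed with a
    uniform background over all_types (assumed smoothing scheme)."""
    assigned = set(types)
    dist = {}
    for t in all_types:
        p_assigned = 1.0 / len(assigned) if t in assigned else 0.0
        dist[t] = (1.0 - smoothing) * p_assigned + smoothing / len(all_types)
    return dist


def soft_filter_score(term_score, query_types, entity_types, all_types):
    """Soft filtering: P(q|e) = P(q_w|e) * P(q_t|e), with P(q_t|e)
    approximated by the overlap of the two type distributions."""
    p_query = type_distribution(query_types, all_types, smoothing=0.0)
    p_entity = type_distribution(entity_types, all_types)
    type_score = sum(p_query[t] * p_entity[t] for t in all_types)
    return term_score * type_score
```

With this choice, an entity whose types do not match the query is penalized heavily but still receives a non-zero score, so the type information acts as a hint and not as a hard filter.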
  13. Type-aware retrieval #3 - Interpolation model
    $P(q|e) = (1 - \lambda) P(q_w|e) + \lambda P(q_t|e)$
    - $P(q_w|e)$: term-based similarity (w stands for word)
    - $P(q_t|e)$: type-based similarity; compare the type distribution of the query (query types) against that of the entity (entity types)
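The interpolation model can be sketched in a few lines of Python. The interpolation weight is called `lam` here, and both component scores are assumed to be computed elsewhere; `rank_entities` is a hypothetical helper added only to show how the mixed score would be used for ranking.

```python
def interpolated_score(term_score, type_score, lam=0.5):
    """Linear mixture: (1 - lam) * P(q_w|e) + lam * P(q_t|e)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return (1.0 - lam) * term_score + lam * type_score


def rank_entities(candidates, lam=0.5):
    """Rank entity ids by interpolated score, best first.

    candidates -- dict mapping entity id -> (term_score, type_score)
    """
    scored = {e: interpolated_score(w, t, lam) for e, (w, t) in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

Setting `lam=0.0` reduces this to purely term-based ranking, while `lam=1.0` ranks on types alone. Unlike the product-based models, a zero type score never removes an entity as long as its term score is positive and `lam < 1`.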
  14. Searching for arbitrary relations*
    - airlines that currently use Boeing 747 planes (target type: ORG, input entity: Boeing 747)
    - Members of The Beaux Arts Trio (target type: PER, input entity: The Beaux Arts Trio)
    - What countries does Eurail operate in? (target type: LOC, input entity: Eurail)
    * given an input entity and target type
  15. A typical pipeline
    [Diagram: Input (entity, target type, relation) -> Candidate entities -> Ranked list of entities]
    - Finding candidate entities: retrieving docs/snippets, query expansion, ...
    - Producing the ranked list: type filtering, deduplication, exploiting lists, ...
  16. Modeling related entity finding
    - Ranking entities of a given type (T) that stand in a required relation (R) with an input entity (E)
    - Three-component model
    $p(e|E,T,R) \propto p(e|E) \cdot p(T|e) \cdot p(R|E,e)$
      - $p(e|E)$: co-occurrence model
      - $p(T|e)$: type filtering
      - $p(R|E,e)$: context model
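To make the three-component model concrete, here is a small Python sketch that multiplies the three component probabilities and ranks candidates by the product. How each component is estimated is not covered here; the function names and the input format are assumptions for this example.

```python
def related_entity_score(p_e_given_E, p_T_given_e, p_R_given_E_e):
    """Rank-equivalent score: p(e|E) * p(T|e) * p(R|E,e)."""
    return p_e_given_E * p_T_given_e * p_R_given_E_e


def rank_related_entities(candidates):
    """Return (entity, score) pairs sorted by score, best first.

    candidates -- dict mapping a candidate entity e to the tuple
                  (p(e|E), p(T|e), p(R|E,e)), estimated elsewhere
    """
    scores = {e: related_entity_score(*parts) for e, parts in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical component values, invented purely for illustration.
example = {
    "candidate_1": (0.30, 0.9, 0.5),
    "candidate_2": (0.40, 0.1, 0.5),
}
ranking = rank_related_entities(example)
```

Since the model is only defined up to proportionality, the scores are useful for ordering candidates, not as calibrated probabilities.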
  17. Scenario
    - Entity descriptions are not readily available
    - Entity occurrences are annotated
      - manually
      - automatically (i.e., entity linking)
  18. The basic idea
    - Use documents to go from queries to entities
    - Query-document association: the document’s relevance
    - Document-entity association: how well the document characterises the entity
    [Figure: query q connected to entity e through a document]
  19. Two principal approaches
    - Profile-based methods
      - Create a textual profile for entities, then rank them (by adapting document retrieval techniques)
    - Document-based methods
      - Indirect representation based on mentions identified in documents
      - First ranking documents (or snippets) and then aggregating evidence for associated entities
  20. Profile-based methods
    [Figure: documents are aggregated into a textual profile for each entity e; the query q is matched against the profiles]
  21. Document-based methods
    [Figure: the query q is matched against documents; evidence is summed over the documents associated with each entity e]
  22. Many possibilities in terms of modeling
    - Generative (probabilistic) models
    - Discriminative (probabilistic) models
    - Voting models
    - Graph-based models
  23. Candidate models (“Model 1”)
    $P(q|\theta_e) = \prod_{t \in q} P(t|\theta_e)^{n(t,q)}$
    - Smoothing with collection-wide background model: $P(t|\theta_e) = (1 - \lambda) P(t|e) + \lambda P(t)$
    - $P(t|e) = \sum_d P(t|d,e) P(d|e)$
      - $P(d|e)$: document-entity association
      - $P(t|d,e)$: term-candidate co-occurrence in a particular document; in the simplest case, $P(t|d)$
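A minimal Python sketch of Model 1 follows, under several assumptions that are not on the slide: documents are non-empty token lists, $P(t|d,e)$ is taken in its simplest form as the relative term frequency $P(t|d)$, the association weights $P(d|e)$ are supplied already normalized, and the score is returned in log space to avoid underflow. The function name and the smoothing default `lam=0.1` are hypothetical.

```python
import math
from collections import Counter


def model1_log_score(query_terms, entity, docs, doc_entity_weights, lam=0.1):
    """log P(q|theta_e): build one smoothed term distribution per entity,
    then score the query against it.

    docs               -- dict: doc id -> non-empty list of tokens
    doc_entity_weights -- dict: entity -> {doc id: P(d|e)}
    """
    # Collection-wide background model P(t).
    collection = Counter(t for tokens in docs.values() for t in tokens)
    collection_len = sum(collection.values())

    log_score = 0.0
    for term, n in Counter(query_terms).items():
        # P(t|e) = sum_d P(t|d) * P(d|e), using P(t|d,e) ~ P(t|d).
        p_t_e = sum(
            docs[d].count(term) / len(docs[d]) * weight
            for d, weight in doc_entity_weights[entity].items()
        )
        p_t = collection[term] / collection_len
        p_smoothed = (1.0 - lam) * p_t_e + lam * p_t
        if p_smoothed == 0.0:
            return float("-inf")  # term unseen in the whole collection
        log_score += n * math.log(p_smoothed)
    return log_score
```

The key property of this model is that the per-entity distribution is assembled first, across all associated documents, and the query is only compared against that aggregate.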
  24. Document models (“Model 2”)
    $P(q|e) = \sum_d P(q|d,e) P(d|e)$
    - $P(d|e)$: document-entity association
    - $P(q|d,e)$: document relevance; how well document d supports the claim that e is relevant to q
      $P(q|d,e) = \prod_{t \in q} P(t|d,e)^{n(t,q)}$
    - Simplifying assumption (t and e are conditionally independent given d): $P(t|d,e) = P(t|\theta_d)$
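Model 2 can be sketched in Python in the same style. The assumptions here are that documents are non-empty token lists, that the document model $P(t|\theta_d)$ is a relative term frequency smoothed with the collection model (a common choice, but not stated on the slide), and that the weights $P(d|e)$ are given. The function name and `lam=0.1` are hypothetical.

```python
from collections import Counter


def model2_score(query_terms, entity, docs, doc_entity_weights, lam=0.1):
    """P(q|e) = sum_d P(q|theta_d) * P(d|e): score each document against
    the query first, then aggregate over the entity's documents.

    docs               -- dict: doc id -> non-empty list of tokens
    doc_entity_weights -- dict: entity -> {doc id: P(d|e)}
    """
    collection = Counter(t for tokens in docs.values() for t in tokens)
    collection_len = sum(collection.values())
    query_counts = Counter(query_terms)

    total = 0.0
    for d, weight in doc_entity_weights[entity].items():
        # P(q|theta_d) = prod_t P(t|theta_d)^n(t,q), with smoothing.
        p_q_d = 1.0
        for term, n in query_counts.items():
            p_t_d = docs[d].count(term) / len(docs[d])
            p_t = collection[term] / collection_len
            p_q_d *= ((1.0 - lam) * p_t_d + lam * p_t) ** n
        total += p_q_d * weight
    return total
```

In contrast to Model 1, each document has to match the query as a whole before it contributes, so query terms scattered across different documents of the same entity count for less.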
  25. Document-entity associations
    - Boolean (or set-based) approach
    - Weighted by the confidence in entity linking
    - Consider other entities mentioned in the document
    [Figure: query q connected to entity e through a document]
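The three options for document-entity associations could look roughly as follows in Python. The annotation format, a list of `(entity, confidence)` pairs per document, is an assumption, as are the function names. In particular, `relative_association` is just one simple way to take other entities in the document into account (the share of mentions that refer to the entity).

```python
def boolean_association(entity, annotations):
    """1.0 if the entity is mentioned in the document at all, else 0.0."""
    return 1.0 if any(e == entity for e, _ in annotations) else 0.0


def confidence_association(entity, annotations):
    """Highest entity-linking confidence among the entity's mentions."""
    return max((conf for e, conf in annotations if e == entity), default=0.0)


def relative_association(entity, annotations):
    """Fraction of all entity mentions in the document that refer to
    this entity, so crowded documents give each entity less weight."""
    if not annotations:
        return 0.0
    mentions = sum(1 for e, _ in annotations if e == entity)
    return mentions / len(annotations)
```

Any of these values would typically be normalized over the documents of an entity before being used as $P(d|e)$ in the models above.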