
Information Retrieval and Text Mining - Semantic Search (Part V)


University of Stavanger, DAT640, 2019 fall

Krisztian Balog

November 05, 2019

Transcript

  1. Semantic Search (Part V) [DAT640] Information Retrieval and Text Mining
     Krisztian Balog, University of Stavanger, November 5, 2019
  2. Retrieval models
     • Unstructured retrieval models
       ◦ LM, BM25, SDM
     • Fielded retrieval models
       ◦ MLM, BM25F, PRMS, FSDM
     Note: while we introduce SDM and FSDM in the context of entity retrieval, nothing in the models listed on this slide is specific to entities, so all of them can be applied to (fielded) documents as well.
  3. Markov random field (MRF) models
     • Models so far employed a bag-of-words representation of both entities and queries
       ◦ The order of terms is ignored
     • The Markov random field (MRF) model provides a sound theoretical framework for modeling term dependence
       ◦ Term dependencies are represented as a Markov random field (undirected graph G)
       ◦ The MRF ranking function is computed as a linear combination of feature functions over the set of cliques in G:
         $P_\Lambda(e|q) \stackrel{\mathrm{rank}}{=} \sum_{c \in C_G} \lambda_c f(c)$
       ◦ MRF approaches belong to the more general class of linear feature-based models
     Note: a clique is a subset of vertices of an undirected graph, such that every two distinct vertices are adjacent.
  4. Sequential Dependence Model (SDM)
     • The sequential dependence model (SDM) is one particular instantiation of the MRF model
       ◦ SDM assumes dependence between neighboring query terms
       ◦ Strikes a good balance between effectiveness and efficiency
     • Graph consists of an entity node e and query nodes q_i
     • Two types of cliques
       ◦ Query term and the entity (unigram matches)
       ◦ Two query terms and the entity; two variants:
         • The query terms occur contiguously (ordered bigram match)
         • They do not (unordered bigram match)
     (Figure: graph with query nodes q1, q2, q3, each connected to the entity node e and to the neighboring query node.)
  5. Sequential Dependence Model (SDM)
     • The SDM ranking function is given by a weighted combination of three feature functions
       ◦ Query terms (f_T)
       ◦ Exact match of query bigrams (f_O)
       ◦ Unordered match of query bigrams (f_U)
       $\mathrm{score}(e, q) = \lambda_T \sum_{i=1}^{n} f_T(q_i, e) + \lambda_O \sum_{i=1}^{n-1} f_O(q_i, q_{i+1}, e) + \lambda_U \sum_{i=1}^{n-1} f_U(q_i, q_{i+1}, e)$
     • The query is represented as a sequence of terms q = q_1, ..., q_n
     • Feature weights are subject to the constraint λ_T + λ_O + λ_U = 1
       ◦ Recommended default setting: λ_T = 0.85, λ_O = 0.1, and λ_U = 0.05
  6. Feature functions
     • Feature functions are based on language modeling estimates using Dirichlet prior smoothing
     • Unigram matches are based on smoothed entity language models:
       $f_T(q_i, e) = \log P(q_i|\theta_e)$
  7. Feature functions (cont'd)
     • Ordered bigram matches:
       $f_O(q_i, q_{i+1}, e) = \log \frac{c_o(q_i, q_{i+1}; e) + \mu P_o(q_i, q_{i+1}|\mathcal{E})}{l_e + \mu}$
       ◦ c_o(q_i, q_{i+1}; e) denotes the number of times the terms q_i, q_{i+1} occur in this exact order in the description of e
       ◦ l_e is the length of the entity's description (number of terms)
       ◦ E is the entity catalog (set of all entities)
       ◦ μ is the smoothing parameter
       ◦ The background language model is a maximum likelihood estimate:
         $P_o(q_i, q_{i+1}|\mathcal{E}) = \frac{\sum_{e' \in \mathcal{E}} c_o(q_i, q_{i+1}; e')}{\sum_{e' \in \mathcal{E}} l_{e'}}$
  8. Feature functions (cont'd)
     • Unordered bigram matches:
       $f_U(q_i, q_{i+1}, e) = \log \frac{c_w(q_i, q_{i+1}; e) + \mu P_w(q_i, q_{i+1}|\mathcal{E})}{l_e + \mu}$
       ◦ c_w(q_i, q_{i+1}; e) counts the co-occurrence of terms q_i and q_{i+1} in e, within an unordered window of w term positions
       ◦ Typically, a window size of 8 is used (corresponds roughly to sentence-level proximity)
       ◦ l_e is the length of the entity's description (number of terms)
       ◦ E is the entity catalog (set of all entities)
       ◦ μ is the smoothing parameter
       ◦ The background language model is a maximum likelihood estimate:
         $P_w(q_i, q_{i+1}|\mathcal{E}) = \frac{\sum_{e' \in \mathcal{E}} c_w(q_i, q_{i+1}; e')}{\sum_{e' \in \mathcal{E}} l_{e'}}$
  9. Exercise #1
     • Counting ordered and unordered bigram matches with Elasticsearch
     • Code skeleton on GitHub: exercises/lecture_19/exercise_1.ipynb (make a local copy)
  10. Fielded Sequential Dependence Model (FSDM)
      • Idea: base the feature function estimates on term/bigram frequencies combined across multiple fields (in the spirit of MLM and BM25F)
      • The fielded sequential dependence model (FSDM) we present here combines SDM and MLM
      • Unigram matches are MLM-estimated probabilities:
        $f_T(q_i, e) = \log \sum_{f \in F} w_f^T P(q_i|\theta_{f_e})$
        ◦ w_f^T are the field mapping weights (for each field)
  11. Feature functions (cont'd)
      • Ordered bigram matches:
        $f_O(q_i, q_{i+1}, e) = \log \sum_{f \in F} w_f^O \frac{c_o(q_i, q_{i+1}; f_e) + \mu_f P_o(q_i, q_{i+1}|f_{\mathcal{E}})}{l_{f_e} + \mu_f}$
      • Unordered bigram matches:
        $f_U(q_i, q_{i+1}, e) = \log \sum_{f \in F} w_f^U \frac{c_w(q_i, q_{i+1}; f_e) + \mu_f P_w(q_i, q_{i+1}|f_{\mathcal{E}})}{l_{f_e} + \mu_f}$
      • All background models and smoothing parameters are made field-specific
        ◦ But the same smoothing parameter (μ_f) may be used for all types of matches
      • w_f^O and w_f^U are the field mapping weights (for each field)
        ◦ May be based on the field mapping probability estimates from PRMS
  12. Entity linking in queries
      • Challenges
        ◦ Search queries are short
        ◦ Limited context
        ◦ Lack of proper grammar, spelling
        ◦ Multiple interpretations
        ◦ Needs to be fast
  13. Dual term-based and entity-based entity representations
      • Idea: preserve the entity references identified in queries by employing a dual entity representation: on top of the traditional term-based representation, each entity is additionally represented by means of its associated entities.
      (Figure: knowledge base entry for ANN DUNHAM, with each field given both a term-based representation, e.g. <dbo:abstract>: "Stanley Ann Dunham the mother Barack Obama was an American anthropologist who ...", and an entity-based representation, e.g. <dbo:child>: <Barack_Obama>. The query "barack obama parents" matches the former via its terms and the latter via its linked query entity <Barack_Obama>. Illustration is based on (Hasibi et al., 2016).)
  14. Entity linking incorporated retrieval (ELR)
      • Entity-specific extension applied on top of the MRF framework
        ◦ May be applied on top of any term-based retrieval model that can be instantiated in the MRF framework, but here we will focus on SDM
      • The underlying graph representation of SDM is extended with query entity nodes
        ◦ It is assumed that entities have already been identified (linked) in queries and in the descriptions of entities
      • New type of clique: 2-cliques between the given entity (that is being scored) and the query entities
      (Figure: SDM graph with query nodes q1, q2, q3 and entity node e, extended with query entity nodes e1 and e2 connected to e.)
  15. Ranking function SDM+ELR
      • The SDM+ELR ranking function is given by a weighted combination of four feature functions
        ◦ Query terms (f_T)
        ◦ Exact match of query bigrams (f_O)
        ◦ Unordered match of query bigrams (f_U)
        ◦ Entity matches (f_E)
        $\mathrm{score}(e, q) = \lambda_T \sum_{i=1}^{n} f_T(q_i, e) + \lambda_O \sum_{i=1}^{n-1} f_O(q_i, q_{i+1}, e) + \lambda_U \sum_{i=1}^{n-1} f_U(q_i, q_{i+1}, e) + \lambda_E \sum_{i=1}^{m} f_E(e_i; e)$
      • Issue: the number of entities (m) varies per query! How can λ_E be trained?
  16. Parameterizing weights
      • Rewriting λ parameters as parameterized functions over each clique:
        $\lambda_T(q_i) = \lambda_T \frac{1}{n}, \quad \lambda_O(q_i, q_{i+1}) = \lambda_O \frac{1}{n-1}, \quad \lambda_U(q_i, q_{i+1}) = \lambda_U \frac{1}{n-1}, \quad \lambda_E(e_i) = \lambda_E \frac{w(e_i, q)}{\sum_{j=1}^{m} w(e_j, q)}$
      • The weight w(e_i, q) reflects the confidence in that entity annotation (entity linker's confidence score)
  17. Ranking function SDM+ELR
      • Final ranking function using the parameterized weights:
        $\mathrm{score}(e, q) = \frac{\lambda_T}{n} \sum_{i=1}^{n} f_T(q_i; e) + \frac{\lambda_O}{n-1} \sum_{i=1}^{n-1} f_O(q_i, q_{i+1}; e) + \frac{\lambda_U}{n-1} \sum_{i=1}^{n-1} f_U(q_i, q_{i+1}; e) + \frac{\lambda_E}{\sum_{j=1}^{m} w(e_j, q)} \sum_{i=1}^{m} w(e_i, q) f_E(e_i; e)$
      • Default setting: λ_T = 0.8, λ_O = 0.05, λ_U = 0.05, and λ_E = 0.1
  18. Feature function for entity matches
      • Differences from term-based scoring
        ◦ We assume that each entity appears at most once in each field
        ◦ If a query entity appears in a field, then it shall be regarded as a "perfect match," independent of what other entities are present in the same field
        ◦ Unlike for term-based representations, predicate folding is not performed
        $f_E(e_i; e) = \log \sum_{\tilde{f} \in \tilde{F}} w_{\tilde{f}}^E \left( (1 - \lambda)\, \mathbb{1}(e_i, \tilde{f}_e) + \lambda\, \frac{\sum_{e' \in \mathcal{E}} \mathbb{1}(e_i, \tilde{f}_{e'})}{|\{e' \in \mathcal{E} : \tilde{f}_{e'} \neq \emptyset\}|} \right)$
        ◦ ẽ denotes the entity-based representation of entity e
        ◦ F̃ denotes the set of fields in the entity-based representation
        ◦ 1(e_i, f̃_e) is a binary indicator function, which is 1 if e_i is present in the entity field f̃_e and 0 otherwise
        ◦ λ is the smoothing parameter (default: 0.1)
        ◦ Field weights w_f^E may be set manually or via dynamic mapping using PRMS
  19. Reading
      • Entity-Oriented Search (Balog)
        ◦ Section 3.3
        ◦ Section 4.2.2
      PDF: https://rd.springer.com/content/pdf/10.1007%2F978-3-319-93935-3.pdf