
Information Retrieval and Text Mining - Semantic Search (Part V)


University of Stavanger, DAT640, 2019 fall

Krisztian Balog

November 05, 2019

Transcript

  1. Semantic Search (Part V) [DAT640] Information Retrieval and Text Mining
     Krisztian Balog, University of Stavanger, November 5, 2019
  2. Retrieval models
     • Unstructured retrieval models
       ◦ LM, BM25, SDM
     • Fielded retrieval models
       ◦ MLM, BM25F, PRMS, FSDM
     Note: while we introduce SDM and FSDM in the context of entity retrieval, nothing in the models listed on this slide is specific to entities, so all of them can be applied to (fielded) documents as well.
  3. Markov random field (MRF) models
     • Models so far employed a bag-of-words representation of both entities and queries
       ◦ The order of terms is ignored
     • The Markov random field (MRF) model provides a sound theoretical framework for modeling term dependence
       ◦ Term dependencies are represented as a Markov random field (undirected graph G)
       ◦ The MRF ranking function is computed as a linear combination of feature functions over the set of cliques in G:
         $P_\Lambda(e|q) \stackrel{\mathrm{rank}}{=} \sum_{c \in C_G} \lambda_c f(c)$
       ◦ MRF approaches belong to the more general class of linear feature-based models
     Note: a clique is a subset of vertices of an undirected graph, such that every two distinct vertices are adjacent.
  4. Sequential Dependence Model (SDM)
     • The sequential dependence model (SDM) is one particular instantiation of the MRF model
       ◦ SDM assumes dependence between neighboring query terms
       ◦ Strikes a good balance between effectiveness and efficiency
     • Graph consists of an entity node e and query nodes q_i
     • Two types of cliques
       ◦ Query term and the entity (unigram matches)
       ◦ Two query terms and the entity; two variants:
         • The query terms occur contiguously (ordered bigram match)
         • They do not (unordered bigram match)
     (Figure: graph with query nodes q1, q2, q3, each connected to the entity node e and to the neighboring query node.)
  5. Sequential Dependence Model (SDM)
     • The SDM ranking function is given by a weighted combination of three feature functions
       ◦ Query terms (f_T)
       ◦ Exact match of query bigrams (f_O)
       ◦ Unordered match of query bigrams (f_U)
       $\mathrm{score}(e, q) = \lambda_T \sum_{i=1}^{n} f_T(q_i, e) + \lambda_O \sum_{i=1}^{n-1} f_O(q_i, q_{i+1}, e) + \lambda_U \sum_{i=1}^{n-1} f_U(q_i, q_{i+1}, e)$
     • The query is represented as a sequence of terms q = q_1, ..., q_n
     • Feature weights are subject to the constraint λ_T + λ_O + λ_U = 1
       ◦ Recommended default setting: λ_T = 0.85, λ_O = 0.1, and λ_U = 0.05
  6. Feature functions
     • Feature functions are based on language modeling estimates using Dirichlet prior smoothing
     • Unigram matches are based on smoothed entity language models:
       $f_T(q_i, e) = \log P(q_i|\theta_e)$
  7. Feature functions (cont'd)
     • Ordered bigram matches:
       $f_O(q_i, q_{i+1}, e) = \log \frac{c_o(q_i, q_{i+1}; e) + \mu P_o(q_i, q_{i+1}|\mathcal{E})}{l_e + \mu}$
       ◦ c_o(q_i, q_{i+1}; e) denotes the number of times the terms q_i, q_{i+1} occur in this exact order in the description of e
       ◦ l_e is the length of the entity's description (number of terms)
       ◦ E is the entity catalog (set of all entities)
       ◦ μ is the smoothing parameter
       ◦ The background language model is a maximum likelihood estimate:
         $P_o(q_i, q_{i+1}|\mathcal{E}) = \frac{\sum_{e' \in \mathcal{E}} c_o(q_i, q_{i+1}; e')}{\sum_{e' \in \mathcal{E}} l_{e'}}$
  8. Feature functions (cont'd)
     • Unordered bigram matches:
       $f_U(q_i, q_{i+1}, e) = \log \frac{c_w(q_i, q_{i+1}; e) + \mu P_w(q_i, q_{i+1}|\mathcal{E})}{l_e + \mu}$
       ◦ c_w(q_i, q_{i+1}; e) counts the co-occurrence of terms q_i and q_{i+1} in e, within an unordered window of w term positions
       ◦ Typically, a window size of 8 is used (corresponds roughly to sentence-level proximity)
       ◦ l_e is the length of the entity's description (number of terms)
       ◦ E is the entity catalog (set of all entities)
       ◦ μ is the smoothing parameter
       ◦ The background language model is a maximum likelihood estimate:
         $P_w(q_i, q_{i+1}|\mathcal{E}) = \frac{\sum_{e' \in \mathcal{E}} c_w(q_i, q_{i+1}; e')}{\sum_{e' \in \mathcal{E}} l_{e'}}$
  9. Exercise #1
     • Counting ordered and unordered bigram matches with Elasticsearch
     • Code skeleton on GitHub: exercises/lecture_19/exercise_1.ipynb (make a local copy)
  10. Fielded Sequential Dependence Model (FSDM)
      • Idea: base the feature function estimates on term/bigram frequencies combined across multiple fields (in the spirit of MLM and BM25F)
      • The fielded sequential dependence model (FSDM) we present here combines SDM and MLM
      • Unigram matches are MLM-estimated probabilities:
        $f_T(q_i, e) = \log \sum_{f \in F} w_f^T P(q_i|\theta_{f_e})$
        ◦ w_f^T are the field mapping weights (for each field)
  11. Feature functions (cont'd)
      • Ordered bigram matches:
        $f_O(q_i, q_{i+1}, e) = \log \sum_{f \in F} w_f^O \frac{c_o(q_i, q_{i+1}; f_e) + \mu_f P_o(q_i, q_{i+1}|f_{\mathcal{E}})}{l_{f_e} + \mu_f}$
      • Unordered bigram matches:
        $f_U(q_i, q_{i+1}, e) = \log \sum_{f \in F} w_f^U \frac{c_w(q_i, q_{i+1}; f_e) + \mu_f P_w(q_i, q_{i+1}|f_{\mathcal{E}})}{l_{f_e} + \mu_f}$
      • All background models and smoothing parameters are made field-specific
        ◦ But the same smoothing parameter (μ_f) may be used for all types of matches
      • w_f^O and w_f^U are the field mapping weights (for each field)
        ◦ May be based on the field mapping probability estimates from PRMS
  12. Entity linking in queries
      • Challenges
        ◦ Search queries are short
        ◦ Limited context
        ◦ Lack of proper grammar, spelling
        ◦ Multiple interpretations
        ◦ Needs to be fast
  13. Dual term-based and entity-based entity representations
      • Idea: preserve the entity references identified in queries by employing a dual entity representation: on top of the traditional term-based representation, each entity is additionally represented by means of its associated entities.
      (Figure: knowledge base entry for ANN DUNHAM, with each field given both a term-based representation, e.g. <dbo:abstract>: "Stanley Ann Dunham the mother Barack Obama was an American anthropologist who ...", and an entity-based representation, e.g. <dbo:child>: <Barack_Obama>. The query "barack obama parents" matches the former via its terms and the latter via its linked query entity <Barack_Obama>. Illustration is based on (Hasibi et al., 2016).)
  14. Entity linking incorporated retrieval (ELR)
      • Entity-specific extension applied on top of the MRF framework
        ◦ May be applied on top of any term-based retrieval model that can be instantiated in the MRF framework, but here we will focus on SDM
      • The underlying graph representation of SDM is extended with query entity nodes
        ◦ It is assumed that entities have already been identified (linked) in queries and in the descriptions of entities
      • New type of clique: 2-cliques between the given entity (that is being scored) and the query entities
      (Figure: SDM graph with query nodes q1, q2, q3 and entity node e, extended with query entity nodes e1 and e2 connected to e.)
  15. Ranking function SDM+ELR
      • The SDM+ELR ranking function is given by a weighted combination of four feature functions
        ◦ Query terms (f_T)
        ◦ Exact match of query bigrams (f_O)
        ◦ Unordered match of query bigrams (f_U)
        ◦ Entity matches (f_E)
        $\mathrm{score}(e, q) = \lambda_T \sum_{i=1}^{n} f_T(q_i, e) + \lambda_O \sum_{i=1}^{n-1} f_O(q_i, q_{i+1}, e) + \lambda_U \sum_{i=1}^{n-1} f_U(q_i, q_{i+1}, e) + \lambda_E \sum_{i=1}^{m} f_E(e_i; e)$
      • Issue: the number of entities (m) varies per query! How can λ_E be trained?
  16. Parameterizing weights
      • Rewriting λ parameters as parameterized functions over each clique:
        $\lambda_T(q_i) = \lambda_T \frac{1}{n}, \quad \lambda_O(q_i, q_{i+1}) = \lambda_O \frac{1}{n-1}, \quad \lambda_U(q_i, q_{i+1}) = \lambda_U \frac{1}{n-1}, \quad \lambda_E(e_i) = \lambda_E \frac{w(e_i, q)}{\sum_{j=1}^{m} w(e_j, q)}$
      • The weight w(e_i, q) reflects the confidence in that entity annotation (entity linker's confidence score)
  17. Ranking function SDM+ELR
      • Final ranking function using the parameterized weights:
        $\mathrm{score}(e, q) = \frac{\lambda_T}{n} \sum_{i=1}^{n} f_T(q_i; e) + \frac{\lambda_O}{n-1} \sum_{i=1}^{n-1} f_O(q_i, q_{i+1}; e) + \frac{\lambda_U}{n-1} \sum_{i=1}^{n-1} f_U(q_i, q_{i+1}; e) + \frac{\lambda_E}{\sum_{j=1}^{m} w(e_j, q)} \sum_{i=1}^{m} w(e_i, q) f_E(e_i; e)$
      • Default setting: λ_T = 0.8, λ_O = 0.05, λ_U = 0.05, and λ_E = 0.1
  18. Feature function for entity matches
      • Differences from term-based scoring
        ◦ We assume that each entity appears at most once in each field
        ◦ If a query entity appears in a field, then it shall be regarded as a "perfect match," independent of what other entities are present in the same field
        ◦ Unlike for term-based representations, predicate folding is not performed
        $f_E(e_i; e) = \log \sum_{\tilde{f} \in \tilde{F}} w_{\tilde{f}}^E \left( (1 - \lambda)\, \mathbb{1}(e_i, \tilde{f}_e) + \lambda\, \frac{\sum_{e' \in \mathcal{E}} \mathbb{1}(e_i, \tilde{f}_{e'})}{|\{e' \in \mathcal{E} : \tilde{f}_{e'} \neq \emptyset\}|} \right)$
        ◦ ẽ denotes the entity-based representation of entity e
        ◦ F̃ denotes the set of fields in the entity-based representation
        ◦ 1(e_i, f̃_e) is a binary indicator function, which is 1 if e_i is present in the entity field f̃_e and 0 otherwise
        ◦ λ is the smoothing parameter (default: 0.1)
        ◦ Field weights w_f^E may be set manually or via dynamic mapping using PRMS
  19. Reading
      • Entity-Oriented Search (Balog)
        ◦ Section 3.3
        ◦ Section 4.2.2
      PDF: https://rd.springer.com/content/pdf/10.1007%2F978-3-319-93935-3.pdf