Text Retrieval

Text Retrieval PHPers Poznań #3 2016-02-23 @szymanskilukasz

What is text retrieval?

Define our problem
Vocabulary: V = {w1,…,wN}, the set of words (possibly from multiple languages)
Query: q = q1,…,qm, where qi ∈ V
Document: di = di1,…,dimi, where dij ∈ V
Collection: C = {d1,…,dM}
Set of relevant documents: R(q) ⊆ C
Our task is to compute R’(q), an approximation of R(q).

How to compute R’(q)?

Document selection
R′(q) = { d ∈ C | f(d,q) = 1 }, where f(d,q) ∈ {0,1}
Here f(d,q) is a binary classifier: the system decides whether a document is relevant or not.

Document ranking
R′(q) = { d ∈ C | f(d,q) > θ }, where f(d,q) ∈ ℜ is a relevance measure function and θ is a cutoff determined by the user.
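
The two definitions of R′(q) can be compared side by side in code. This is a minimal sketch: the scores, the thresholds, and the helper names `select` and `rank` are illustrative assumptions, not part of the talk.

```python
# Toy relevance scores f(d, q); the numbers are illustrative only.
scores = {"d1": 0.98, "d2": 0.94, "d3": 0.82, "d4": 0.73,
          "d5": 0.56, "d6": 0.43, "d7": 0.38}


def select(docs, classifier):
    """Document selection: keep every d with f(d, q) = 1."""
    return {d for d in docs if classifier(d) == 1}


def rank(scores, theta):
    """Document ranking: order by f(d, q), keep every d with f(d, q) > theta."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [d for d in ordered if scores[d] > theta]


# A hypothetical binary classifier, built here by thresholding the same scores.
selected = select(scores, lambda d: 1 if scores[d] > 0.9 else 0)
ranked = rank(scores, theta=0.6)
```

Selection returns an unordered set, while ranking keeps the order and lets the user move the cutoff θ without recomputing the scores.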

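Document selection vs. document ranking
[Figure: document selection, f(d,q) ∈ {0,1}. Relevant (+) and non-relevant (-) documents are split into the selected set R’(q), where f(d,q) = 1, and the rest, where f(d,q) = 0.]
Document ranking, f(d,q) with the true relevance of each document (+ relevant, - non-relevant):
0.98 d1 +
0.94 d2 +
0.82 d3 -
0.73 d4 +
0.56 d5 -
0.43 d6 -
0.38 d7 +
R’(q) = {d1, d2, d3, d4}, the documents scoring above the cutoff θ.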

Retrieval Models

Similarity-based models: f(q,d) = similarity(q,d)
Probabilistic models: f(d,q) = p(R=1|d,q), where R ∈ {0,1}
Axiomatic model: f(q,d) must satisfy a set of constraints

Common Ideas

Let's assume that we have a query like q = “Best Retrieval Models”.

Term Frequency
How many times does a word occur in document “d”? For example:
How many times does “Best” occur in document “d”?
How many times does “Retrieval” occur in document “d”?
How many times does “Models” occur in document “d”?

Document Length
How long is document “d”? If a term occurs equally often in two documents, the shorter document should score higher. Conversely, the longer a document is, the more likely it is that the term occurs in it at all.

Document Frequency
How often do we see a word in the entire collection?
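
The three signals (term frequency, document length, document frequency) are simple counts over a tokenized collection. A minimal sketch, assuming a hypothetical three-document collection that has already been split into lowercase tokens:

```python
from collections import Counter

# Hypothetical toy collection; documents are already tokenized and lowercased.
collection = {
    "d1": ["best", "retrieval", "models", "for", "text"],
    "d2": ["best", "best", "programming", "books"],
    "d3": ["retrieval", "models", "and", "ranking", "models"],
}
query = ["best", "retrieval", "models"]

# Term frequency: how many times each word occurs in each document.
term_freq = {doc_id: Counter(tokens) for doc_id, tokens in collection.items()}

# Document length: number of tokens in each document.
doc_length = {doc_id: len(tokens) for doc_id, tokens in collection.items()}

# Document frequency: in how many documents each word occurs at least once.
doc_freq = Counter(word for tokens in collection.values() for word in set(tokens))

for word in query:
    per_doc = {doc_id: term_freq[doc_id][word] for doc_id in collection}
    print(word, "tf:", per_doc, "df:", doc_freq[word])
```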

Vector Space Model

[Figure: a vector space whose axes are the terms “Best”, “Programming” and “Models”. Documents d1 to d5 and the query are plotted in it as vectors.]
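
In a vector space model, the query and every document become vectors with one dimension per term, and relevance is estimated from how close the vectors are. The sketch below is an assumption-laden illustration: it uses raw term counts as weights and cosine similarity as the closeness measure, neither of which is specified on the slides, and the documents are made up.

```python
import math

# One dimension per term in the (tiny, hypothetical) vocabulary.
vocabulary = ["best", "programming", "models"]


def to_vector(tokens):
    """Bag-of-words vector: raw term counts over the vocabulary."""
    return [tokens.count(term) for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


query = to_vector(["best", "programming", "models"])
docs = {
    "d1": to_vector(["best", "best", "models"]),
    "d2": to_vector(["programming", "models"]),
    "d3": to_vector(["cooking", "recipes"]),
}
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```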

Text Retrieval System Architecture

[Figure: system architecture. Docs and the query both pass through the TOKENIZER, which produces a doc representation and a query representation. The INDEXER turns doc representations into the INDEX. The SCORER takes the doc representation and the query representation and returns results, followed by a feedback step.]

Tokenization

Character Filters
Character filters are used to preprocess the string of characters before it is passed to the tokenizer. A character filter may be used to strip out HTML markup, or to convert "&" characters to the word "and".

Tokenizer
Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation.

Token Filters
Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g. lowercasing), delete tokens (e.g. removing stopwords) or add tokens (e.g. synonyms).

Analyzers typically perform four roles:
Tokenize text into individual words: The quick brown foxes → [The, quick, brown, foxes]
Lowercase tokens: The → the
Remove common stopwords: [The, quick, brown, foxes] → [quick, brown, foxes]
Stem tokens to their root form: foxes → fox
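
The character filter, tokenizer and token filter stages can be chained into one small analyzer. This is a minimal sketch, not how any particular search engine implements analysis: the stopword list is a tiny illustrative sample and `stem` is a toy suffix stripper standing in for a real stemming algorithm.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list


def char_filter(text):
    """Character filter: strip HTML tags and spell out '&'."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")


def tokenize(text):
    """Tokenizer: split on anything that is not an ASCII letter or digit."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def stem(token):
    """Toy stemmer: strips a plural 'es' or 's'; not a real stemming algorithm."""
    if token.endswith("es") and len(token) > 4:
        return token[:-2]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    tokens = tokenize(char_filter(text))
    tokens = [t.lower() for t in tokens]                # token filter: lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # token filter: stopwords
    return [stem(t) for t in tokens]                    # token filter: stemming


print(analyze("The quick brown foxes"))  # ['quick', 'brown', 'fox']
```

The same `analyze` function should be applied to both documents and queries, so that both sides end up with comparable tokens.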

Indexing
Convert documents to data structures that enable fast search.
Precompute as much as we can.

Inverted Index
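
An inverted index maps each term to the documents (and here also the positions) in which it occurs, so a lookup does not have to scan every document. A minimal in-memory sketch, assuming hypothetical documents that are already analyzed into lowercase tokens; real indexes use compressed on-disk structures.

```python
from collections import defaultdict

# Hypothetical documents, already analyzed into lowercase tokens.
docs = {
    1: ["full", "text", "search"],
    2: ["text", "retrieval", "models"],
    3: ["search", "engine", "text", "search"],
}


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}, i.e. one postings list per term."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index


index = build_inverted_index(docs)
# Looking up a term is a dictionary access instead of a scan of every document.
print(sorted(index["search"]))   # ids of the documents containing "search"
print(len(index["search"][3]))   # term frequency of "search" in document 3
```

Term frequency and document frequency fall out of the same structure: the length of a positions list and the number of keys in a postings entry.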

Lucene’s Practical Scoring Function

Boolean Model
The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for
full AND text AND search AND (elasticsearch OR lucene)
will include only documents that contain all of the terms full, text, and search, and either elasticsearch or lucene. This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
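
With postings stored as sets of document ids, AND, OR and NOT become set intersection, union and difference. A minimal sketch of the example query, assuming made-up postings for five documents:

```python
# Hypothetical postings: term -> set of ids of the documents that contain it.
postings = {
    "full": {1, 2, 4},
    "text": {1, 2, 3, 4},
    "search": {1, 2, 4, 5},
    "elasticsearch": {1, 5},
    "lucene": {4},
}
all_docs = {1, 2, 3, 4, 5}


def docs_with(term):
    """Ids of documents containing the term (empty set for unknown terms)."""
    return postings.get(term, set())


# full AND text AND search AND (elasticsearch OR lucene)
matches = (docs_with("full") & docs_with("text") & docs_with("search")
           & (docs_with("elasticsearch") | docs_with("lucene")))

# NOT is the complement with respect to the whole collection.
without_lucene = all_docs - docs_with("lucene")
```

The result is an unranked set of candidates; scoring is applied only to the documents that survive this step.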

Query Normalization Factor: queryNorm(q)
Query normalization does not affect ranking, but the default implementation makes scores from different queries more comparable by eliminating the magnitude of the query vector as a factor in the score.
queryNorm = 1 / √sumOfSquaredWeights
The sumOfSquaredWeights is calculated by adding together the squared IDF of each term in the query.

Query Coordination: coord(q,d)
Query coordination is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.

TF / IDF: Term Frequency / Inverse Document Frequency
Term Frequency: How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
tf(t in d) = √frequency

TF / IDF: Term Frequency / Inverse Document Frequency
Inverse Document Frequency: How often does the term appear in all documents in the collection? The more often, the lower the weight.
idf(t) = 1 + log(numDocs / (docFreq + 1))
The inverse document frequency (idf) of term t is the logarithm of the number of documents in the index divided by the number of documents that contain the term; the formula adds 1 to docFreq and 1 to the final value.

Query-Time Boosting: t.getBoost()
Query-time boosting is the main tool that you can use to tune relevance. Remember that boost is just one of the factors involved in the relevance score; it has to compete with the other factors.

TF / IDF: Field-length normalization
How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field. The field length norm is calculated as follows:
norm(d) = 1 / √numTerms
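
The individual factors (queryNorm, coord, tf, idf, boost, field-length norm) can be put together in one function. This is a sketch under stated assumptions rather than Lucene's implementation: it assumes the factors are combined as score(q,d) = queryNorm(q) · coord(q,d) · Σ over query terms of tf · idf² · boost · norm, it takes coord to be the fraction of query terms found in the document, it uses the natural logarithm for idf, and all function and variable names are hypothetical.

```python
import math


def tf(freq):
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    # Assumption: natural logarithm in 1 + log(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    return 1.0 / math.sqrt(num_terms)


def query_norm(idfs):
    return 1.0 / math.sqrt(sum(w * w for w in idfs))


def coord(matching_terms, query_terms):
    # Assumption: fraction of the query terms that occur in the document.
    return matching_terms / query_terms


def score(query_terms, doc_tokens, doc_freqs, num_docs, boosts=None):
    """Assumed combination: queryNorm * coord * sum(tf * idf^2 * boost * norm)."""
    if not doc_tokens:
        return 0.0
    boosts = boosts or {}
    idfs = {t: idf(num_docs, doc_freqs.get(t, 0)) for t in query_terms}
    norm = field_norm(len(doc_tokens))
    total, matching = 0.0, 0
    for t in query_terms:
        freq = doc_tokens.count(t)
        if freq == 0:
            continue
        matching += 1
        total += tf(freq) * idfs[t] ** 2 * boosts.get(t, 1.0) * norm
    return query_norm(idfs.values()) * coord(matching, len(query_terms)) * total


# Hypothetical two-document collection.
d1 = ["best", "retrieval", "models"]
d2 = ["best", "cooking", "recipes", "for", "busy", "people"]
doc_freqs = {"best": 2, "retrieval": 1}
s1 = score(["best", "retrieval"], d1, doc_freqs, num_docs=2)
s2 = score(["best", "retrieval"], d2, doc_freqs, num_docs=2)
```

Here d1 matches both query terms in a short field, so it outscores d2, which matches only the common term "best" in a longer field.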

GET /_search?explain
{
  "query": { "match": { "tweet": "honeymoon" } }
}

"_explanation": {
  "description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
  "value": 0.076713204,
  "details": [
    {
      "description": "fieldWeight in 0, product of:",
      "value": 0.076713204,
      "details": [
        {
          "description": "tf(freq=1.0), with freq of:",
          "value": 1,
          "details": [
            { "description": "termFreq=1.0", "value": 1 }
          ]
        },
        { "description": "idf(docFreq=1, maxDocs=1)", "value": 0.30685282 },
        { "description": "fieldNorm(doc=0)", "value": 0.25 }
      ]
    }
  ]
}
Summary: the top-level weight(tweet:honeymoon in 0) entry.
Term frequency: How many times did the term honeymoon appear in the tweet field in this document?
Inverse document frequency: How many times did the term honeymoon appear in the tweet field of all documents in the index?
Field-length norm: How long is the tweet field in this document? The longer the field, the smaller this number.
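
The numbers in the explain output can be checked by hand against the tf and idf formulas. A small sketch, assuming the logarithm is natural and taking fieldNorm as reported, since the output does not show the field length:

```python
import math

# Values copied from the explain output.
term_freq, doc_freq, max_docs, field_norm = 1.0, 1, 1, 0.25

tf = math.sqrt(term_freq)                        # tf(freq=1.0)
idf = 1.0 + math.log(max_docs / (doc_freq + 1))  # idf(docFreq=1, maxDocs=1)
field_weight = tf * idf * field_norm             # "fieldWeight in 0, product of:"

print(round(tf, 8), round(idf, 8), round(field_weight, 8))
```

The recomputed idf and fieldWeight agree with the reported 0.30685282 and 0.076713204 to within single-precision rounding.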

Relevance Feedback
Users make explicit relevance judgments on the initial results (judgments are reliable, but users don’t want to make the extra effort).

Pseudo/Blind/Automatic Feedback
Top-k initial results are simply assumed to be relevant (judgments aren’t reliable, but no user activity is required).

Implicit Feedback
User-clicked docs are assumed to be relevant, skipped ones non-relevant (judgments aren’t completely reliable, but no extra effort is needed from users).

How to learn from feedback?

Query modification
– Adding new (weighted) terms (query expansion)
– Adjusting weights of old terms
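
One well-known way to do both kinds of query modification in a vector space is Rocchio feedback. The slides do not name a specific method, so treat the sketch below as a swapped-in illustration: the function name, the alpha/beta/gamma defaults and the example weights are all assumptions.

```python
from collections import Counter


def rocchio(query, relevant_docs, non_relevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query as {term: weight}.

    query and each document are {term: weight} mappings. alpha, beta and gamma
    are illustrative defaults, not values from the talk.
    """
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    for doc in non_relevant_docs:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(non_relevant_docs)
    # Keep only terms that end up with a positive weight.
    return {t: w for t, w in new_query.items() if w > 0}


query = {"best": 1.0, "retrieval": 1.0, "models": 1.0}
relevant = [{"retrieval": 2.0, "ranking": 1.0}, {"models": 1.0, "ranking": 1.0}]
non_relevant = [{"best": 1.0, "cars": 3.0}]
modified = rocchio(query, relevant, non_relevant)
```

In this example "ranking" is a new term added by expansion, "retrieval" and "models" gain weight, "best" loses some, and "cars" never enters the query.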

Thanks!
Łukasz Szymański
@szymanskilukasz
Development Team Lead at
http://szymanskilukasz.github.io/
https://www.linkedin.com/in/szymanskilukasz
https://twitter.com/szymanskilukasz
