Łukasz Szymański
February 23, 2016

# Text Retrieval

We use search platforms such as Solr or Elasticsearch every day. Both are built on the Apache Lucene engine, which implements techniques from the field of text retrieval. I will talk about the principles and techniques these tools use to do their job. You will learn the theory behind text retrieval and the challenges it poses.

## Transcript

15. ### PHPers Poznań #3 2016-02-23 @szymanskilukasz

Define our problem:

- Vocabulary: V = {w1, …, wN}, a set of words, possibly spanning multiple languages.
- Query: q = q1, …, qm, where qi ∈ V.
- Document: di = di1, …, dimi, where dij ∈ V.
- Collection: C = {d1, …, dM}.
- Set of relevant documents: R(q) ⊆ C.

Our task is to compute R′(q), an approximation of R(q).

17. ### Document selection

R′(q) = { d ∈ C | f(d,q) = 1 }, where f(d,q) ∈ {0,1}. Here f is a binary classifier: the system decides whether a document is relevant or not.
18. ### Document ranking

R′(q) = { d ∈ C | f(d,q) > θ }, where f(d,q) ∈ ℝ is a relevance measure function and θ is a cutoff determined by the user.
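The two strategies can be sketched as follows. This is a toy example: the overlap-based f(d,q) and the sample documents are illustrative, not anything from the talk.

```python
# Toy illustration of document selection vs. document ranking.
# f(d, q) here is a made-up score: the fraction of query words in the doc.

def f(doc, query):
    doc_words = set(doc.lower().split())
    query_words = query.lower().split()
    return sum(w in doc_words for w in query_words) / len(query_words)

docs = ["best retrieval models", "best programming models", "cooking recipes"]
query = "best retrieval models"

# Document selection: binary decision, f(d,q) mapped to {0, 1}.
selected = [d for d in docs if f(d, query) == 1.0]

# Document ranking: keep documents scoring above a user-chosen cutoff theta.
theta = 0.5
ranked = sorted((d for d in docs if f(d, query) > theta),
                key=lambda d: f(d, query), reverse=True)

print(selected)
print(ranked)
```

Ranking keeps the partially matching document ("best programming models" scores 2/3) that strict selection discards, which is why ranking is the usual choice in practice.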
19. ### Doc selection vs. doc ranking

[Diagram: document selection assigns each document a binary 1/0 decision, while document ranking assigns scores (e.g. d1 = 0.98, d2 = 0.94, d3 = 0.82, …) and R′(q) is the set of documents above the cutoff θ.]

21. ### Retrieval models

- Similarity-based models: f(q,d) = similarity(q,d)
- Probabilistic models: f(d,q) = p(R=1|d,q), where R ∈ {0,1}
- Axiomatic models: f(q,d) must satisfy a set of constraints

23. ### Example query

Let's assume that we have a query like q = "Best Retrieval Models".
24. ### Term Frequency

How many times does a word occur in document d? E.g. how many times does "Best" occur in document d? How many times does "Retrieval" occur in document d? How many times does "Models" occur in document d?
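Counting term frequencies is a one-liner in most languages; a quick sketch with illustrative data:

```python
from collections import Counter

# Term frequency: how many times each word occurs in document d.
d = "best retrieval models are the best models"
tf = Counter(d.lower().split())

print(tf["best"])       # 2
print(tf["retrieval"])  # 1
```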
25. ### Document Length

How long is document d? If a term occurs with equal frequency in two documents, the shorter document gets the higher score. On the other hand, the longer a document is, the higher the probability that the term occurs in it at all.
26. ### Document Frequency

How often do we see a word in the entire collection?

31-33. ### Representing the query and the documents

[Diagrams: the query "Best Programming Models" and documents d1..d5 are represented in a common term space, each as a query rep. or doc rep., so that the query and documents can be compared.]
38-41. ### System architecture

[Diagram, built up over four slides: docs and the query each pass through a TOKENIZER, producing a doc rep. and a query rep.; the INDEXER builds the INDEX from doc representations; the SCORER matches the query rep. against the INDEX to produce results; a feedback loop runs from the results back to the query rep.]

43. ### Character Filters

Character filters preprocess the string of characters before it is passed to the tokenizer. A character filter may be used to strip out HTML markup, or to convert "&" characters to the word "and".
44. ### Tokenizer

Tokenizers break a string down into a stream of terms or tokens. A simple tokenizer might split the string into terms wherever it encounters whitespace or punctuation.
45. ### Token Filters

Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g. lowercasing), delete tokens (e.g. removing stopwords), or add tokens (e.g. synonyms).
46. ### Analyzers

These analyzers typically perform four roles:

- Tokenize text into individual words: The quick brown foxes → [The, quick, brown, foxes]
- Lowercase tokens: The → the
- Remove common stopwords: [The, quick, brown, foxes] → [quick, brown, foxes]
- Stem tokens to their root form: foxes → fox
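The full chain (character filter, then tokenizer, then token filters) can be sketched like this. It is a toy illustration, not Lucene's or Elasticsearch's actual code; the stemmer in particular is deliberately naive:

```python
import re

# Toy analysis chain: character filter -> tokenizer -> token filters
# (lowercase, stop words, stemming). Illustrative only.

def char_filter(text):
    # Strip HTML markup and expand "&" to "and".
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")

def tokenize(text):
    # Split on anything that is not a word character.
    return [t for t in re.split(r"[^\w]+", text) if t]

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in"}

def stem(token):
    # Extremely naive stemmer: only strips "-es"/"-s" endings.
    if token.endswith("es"):
        return token[:-2]
    if token.endswith("s"):
        return token[:-1]
    return token

def analyze(text):
    tokens = tokenize(char_filter(text))
    tokens = [t.lower() for t in tokens]                # lowercase filter
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop-word filter
    return [stem(t) for t in tokens]                    # stemming filter

print(analyze("<p>The quick brown foxes</p>"))  # ['quick', 'brown', 'fox']
```

The same text always produces the same token stream, which is what lets the indexer and the query side meet in the middle.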
48. ### Indexing

Convert documents to data structures that enable fast search. Precompute as much as we can.
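The central precomputed structure is an inverted index: a map from each term to the set of documents containing it. A minimal sketch with toy data:

```python
from collections import defaultdict

# Build an inverted index: term -> set of document ids that contain it.
docs = {
    1: "best retrieval models",
    2: "best programming models",
    3: "full text search with elasticsearch",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["best"]))           # [1, 2]
print(sorted(index["elasticsearch"]))  # [3]
```

At query time, looking up the documents for a term is now a single dictionary access instead of a scan over the whole collection.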


53. ### Boolean Model

The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for full AND text AND search AND (elasticsearch OR lucene) will include only documents that contain all of the terms full, text, and search, and either elasticsearch or lucene. This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
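With an inverted index, the Boolean query above reduces to set operations. A minimal sketch with toy documents:

```python
from collections import defaultdict

docs = {
    1: "full text search with elasticsearch",
    2: "full text search with lucene",
    3: "full text indexing basics",
    4: "elasticsearch cluster operations",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Boolean model: AND = intersection, OR = union, NOT = difference.
# Query: full AND text AND search AND (elasticsearch OR lucene)
matches = (index["full"] & index["text"] & index["search"]
           & (index["elasticsearch"] | index["lucene"]))

print(sorted(matches))  # [1, 2]
```

Documents 3 and 4 are excluded cheaply, before any scoring happens, which is exactly the role the Boolean model plays in the pipeline.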

56. ### Query Normalization Factor

queryNorm(q) does not affect ranking, but the default implementation does make scores from different queries more comparable, by eliminating the magnitude of the query vector as a factor in the score:

queryNorm = 1 / √sumOfSquaredWeights

The sumOfSquaredWeights is calculated by adding together the IDF of each term in the query, squared.

58. ### Query Coordination

coord(q,d) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.

60. ### Term Frequency / Inverse Document Frequency

Term Frequency: how often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.

tf(t in d) = √frequency

62. ### Term Frequency / Inverse Document Frequency

Inverse Document Frequency: how often does the term appear across all documents in the collection? The more often, the lower the weight. The inverse document frequency of term t is the logarithm of the number of documents in the index divided by the number of documents that contain the term:

idf(t) = 1 + log ( numDocs / (docFreq + 1) )

64. ### Query-Time Boosting

t.getBoost(): query-time boosting is the main tool that you can use to tune relevance. Remember that boost is just one of the factors involved in the relevance score; it has to compete with the other factors.

66. ### Field-Length Normalization

How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field. The field-length norm is calculated as follows:

norm(d) = 1 / √numTerms
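Taken together, the three formulas quoted on these slides (tf(t in d) = √frequency, idf(t) = 1 + log(numDocs / (docFreq + 1)), norm(d) = 1 / √numTerms) multiply into the per-term field weight. A minimal sketch, omitting queryNorm, coord, and boosts; the 16-term field length is an assumed value, chosen because it yields the fieldNorm of 0.25 seen in the explain output later in the deck:

```python
import math

# Classic Lucene-style TF/IDF field weight for one term, built from the
# three formulas quoted on the slides (queryNorm, coord, boosts omitted).

def tf(freq):
    # tf(t in d) = sqrt(frequency)
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    # idf(t) = 1 + log(numDocs / (docFreq + 1)), natural log
    return 1 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    # norm(d) = 1 / sqrt(numTerms)
    return 1 / math.sqrt(num_terms)

# One occurrence of the term, in an index of 1 document where 1 document
# contains it; a 16-term field is an assumption (it gives fieldNorm 0.25).
weight = tf(1) * idf(1, 1) * field_norm(16)
print(weight)
```

Up to floating-point precision this reproduces the fieldWeight of 0.076713204 shown in the explain output on the later slides.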
67. ### Explain

```json
GET /_search?explain
{
    "query" : { "match" : { "tweet" : "honeymoon" }}
}
```
68-71. ### Explain output

```json
"_explanation": {
  "description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
  "value": 0.076713204,
  "details": [{
    "description": "fieldWeight in 0, product of:",
    "value": 0.076713204,
    "details": [
      {
        "description": "tf(freq=1.0), with freq of:",
        "value": 1,
        "details": [{ "description": "termFreq=1.0", "value": 1 }]
      },
      { "description": "idf(docFreq=1, maxDocs=1)", "value": 0.30685282 },
      { "description": "fieldNorm(doc=0)", "value": 0.25 }
    ]
  }]
}
```

- Term frequency: how many times did the term honeymoon appear in the tweet field of this document?
- Inverse document frequency: how many times did the term honeymoon appear in the tweet field across all documents in the index?
- Field-length norm: how long is the tweet field in this document? The longer the field, the smaller this number.
73. ### Relevance Feedback

Users make explicit relevance judgments on the initial results. (Judgments are reliable, but users don't want to make the extra effort.)
74. ### Pseudo/Blind/Automatic Feedback

The top-k initial results are simply assumed to be relevant. (Judgments aren't reliable, but no user activity is required.)
75. ### Implicit Feedback

User-clicked docs are assumed to be relevant; skipped ones non-relevant. (Judgments aren't completely reliable, but no extra effort is required from users.)

77. ### Query modification

- Adding new (weighted) terms (query expansion)
- Adjusting weights of old terms
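Both operations can be sketched with a Rocchio-style update, a standard technique the slides do not name explicitly; the interpolation constants and toy term vectors below are illustrative:

```python
from collections import Counter

# Rocchio-style query modification (illustrative constants):
# new query = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant).
# This both reweights old query terms and adds new (weighted) terms.

def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = Counter()
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)
    for doc in non_relevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(non_relevant)
    # Keep only positively weighted terms.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"best": 1.0, "retrieval": 1.0, "models": 1.0}
rel = [{"retrieval": 2.0, "ranking": 1.0}]  # judged (or assumed) relevant
nonrel = [{"programming": 2.0}]             # judged non-relevant

expanded = rocchio(q, rel, nonrel)
print(sorted(expanded))  # ['best', 'models', 'ranking', 'retrieval']
```

Note how "ranking" enters the query from the relevant document (query expansion), "retrieval" gets a higher weight, and "programming" is pushed out.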