Text Retrieval

Every day we use search platforms such as Solr or Elasticsearch. Both are built on the Apache Lucene engine, which implements techniques from the field of text retrieval. I will talk about the principles and techniques that let these tools do their job. You will learn the theory behind text retrieval and the challenges it poses.

Łukasz Szymański

February 23, 2016

Transcript

  1. PHPers Poznań #3 2016-02-23 @szymanskilukasz

    Define our problem
    Vocabulary: V = {w_1,…,w_N}, a set of words (possibly from multiple languages)
    Query: q = q_1,…,q_m, where q_i ∈ V
    Document: d_i = d_i1,…,d_im_i, where d_ij ∈ V
    Collection: C = {d_1,…,d_M}
    Set of relevant documents: R(q) ⊆ C
    Our task is to compute R′(q), an approximation of R(q).
  2. Document selection

    R′(q) = { d ∈ C | f(d,q) = 1 }, where f(d,q) ∈ {0,1}
    Here f(d,q) is a binary classifier: the system decides whether a document is relevant or not.
  3. Document ranking

    R′(q) = { d ∈ C | f(d,q) > θ }, where f(d,q) ∈ ℝ is a relevance measure function and θ is a cutoff determined by the user.
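  4. Doc selection: f(d,q) = ? Doc ranking: f(d,q) = ?

    [Figure: with document selection, every document is labeled 1 or 0 and R′(q) is the set labeled 1. With document ranking, every document gets a score and R′(q) is the part of the ranked list above the cutoff θ. A "+" marks a relevant document and a "-" a non-relevant one.]

    Ranked list shown in the figure (score, document, relevance):
    0.98  d1  +
    0.94  d2  +
    0.82  d3  -
    0.73  d4  +
    0.56  d5  -
    0.43  d6  -
    0.38  d7  +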
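A minimal Python sketch of the two strategies, reusing the scores from the ranked list on the slide. The cutoff `theta = 0.7`, the helper names `select` and `rank`, and the toy classifier are assumptions made for illustration; they do not come from the talk.

```python
# Scores f(d,q) copied from the ranked list on the slide.
scores = {"d1": 0.98, "d2": 0.94, "d3": 0.82, "d4": 0.73,
          "d5": 0.56, "d6": 0.43, "d7": 0.38}


def select(docs, classifier):
    """Document selection: keep d where the binary classifier returns 1."""
    return {d for d in docs if classifier(d) == 1}


def rank(scored_docs, theta):
    """Document ranking: keep d where f(d,q) > theta, best score first."""
    ordered = sorted(scored_docs.items(), key=lambda item: item[1], reverse=True)
    return [d for d, score in ordered if score > theta]


theta = 0.7  # assumed cutoff, chosen only for this example
ranked = rank(scores, theta)

# Hypothetical binary classifier: accept a document when its score exceeds 0.5.
selected = select(scores, lambda d: 1 if scores[d] > 0.5 else 0)
```

With ranking, the user can move θ up or down after seeing the results; with selection, the decision is fixed by the classifier.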
  5. Similarity-based models: f(d,q) = similarity(q,d)

    Probabilistic models: f(d,q) = p(R=1|d,q), where R ∈ {0,1}
    Axiomatic model: f(d,q) must satisfy a set of constraints
  6. Let's assume that we have a query like q = “Best Retrieval Models”.
  7. Term Frequency

    How many times does a word occur in document “d”? For example:
    How many times does “Best” occur in document “d”?
    How many times does “Retrieval” occur in document “d”?
    How many times does “Models” occur in document “d”?
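Counting term occurrences can be sketched in a few lines of Python. The sample document text and the lowercase-and-split tokenization are assumptions chosen to keep the example short.

```python
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())


# Hypothetical document "d"; the query is the one used in the talk.
document = "the best retrieval models and the best retrieval systems"
query = "Best Retrieval Models"

frequencies = term_frequencies(document)
counts = {term: frequencies[term.lower()] for term in query.split()}
```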
  8. Document Length

    How long is document “d”? If a term occurs equally often in two documents, the shorter document gets the higher score. Conversely, the longer a document is, the higher the probability that the term occurs in it.
  9. Character Filters

    Character filters are used to preprocess the string of characters before it is passed to the tokenizer. A character filter may be used to strip out HTML markup, or to convert "&" characters to the word "and".
  10. Tokenizer

    Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation.
  11. Token Filters

    Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g. lowercasing), delete tokens (e.g. removing stopwords) or add tokens (e.g. synonyms).
  12. These analyzers typically perform four roles:

    – Tokenize text into individual words: The quick brown foxes → [The, quick, brown, foxes]
    – Lowercase tokens: The → the
    – Remove common stopwords: [The, quick, brown, foxes] → [quick, brown, foxes]
    – Stem tokens to their root form: foxes → fox
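The character filter, tokenizer, and token filter stages can be chained as in the following Python sketch. It is not how Lucene implements analysis: the stopword list is a tiny assumed sample and `naive_stem` is a toy suffix stripper that only stands in for a real stemmer.

```python
import re

STOPWORDS = {"the", "a", "an", "of"}  # assumed, deliberately tiny list


def char_filter(text):
    """Character filter: strip HTML tags and spell out '&'."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")


def tokenize(text):
    """Tokenizer: split on anything that is not a word character."""
    return re.findall(r"\w+", text)


def naive_stem(token):
    """Toy stemmer: drop a trailing 'es' or 's'. Real stemmers are more careful."""
    if len(token) > 3 and token.endswith("es"):
        return token[:-2]
    if len(token) > 2 and token.endswith("s"):
        return token[:-1]
    return token


def analyze(text):
    """Run the full chain: char filter, tokenizer, then three token filters."""
    tokens = tokenize(char_filter(text))
    tokens = [t.lower() for t in tokens]                # lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [naive_stem(t) for t in tokens]              # stem


terms = analyze("<p>The quick brown foxes</p>")
```

The same `analyze` function has to be applied to documents at index time and to queries at search time, otherwise the terms will not match.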
  13. Indexing

    Convert documents to data structures that enable fast search.
    Precompute as much as we can.
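One such data structure is an inverted index, which maps every term to the set of documents that contain it. The sketch below is a simplified in-memory version with assumed sample documents; it leaves out positions, frequencies, and the on-disk format that a real engine stores.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical three-document collection.
docs = {
    1: "full text search with lucene",
    2: "full text search with elasticsearch",
    3: "relational database search",
}
index = build_inverted_index(docs)
```

Looking up a term is now a single dictionary access instead of a scan over every document.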
  14. Boolean Model

    The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for

    full AND text AND search AND (elasticsearch OR lucene)

    will include only documents that contain all of the terms full, text, and search, and either elasticsearch or lucene. This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
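With posting sets per term, the query above reduces to set intersections and unions. The posting sets in this Python sketch are made up for illustration.

```python
# Hypothetical postings: term -> ids of documents that contain the term.
postings = {
    "full": {1, 2, 4},
    "text": {1, 2, 4},
    "search": {1, 2, 3, 4},
    "elasticsearch": {2},
    "lucene": {1},
}

# full AND text AND search AND (elasticsearch OR lucene)
matches = (
    postings["full"]
    & postings["text"]
    & postings["search"]
    & (postings["elasticsearch"] | postings["lucene"])
)
```

Document 4 contains full, text, and search but neither elasticsearch nor lucene, so it is excluded before any scoring happens.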
  15. Query Normalization Factor: queryNorm(q)

    This factor does not affect ranking, but the default implementation makes scores from different queries more comparable than they would otherwise be, by eliminating the magnitude of the query vector as a factor in the score.
    queryNorm = 1 / √sumOfSquaredWeights
    The sumOfSquaredWeights is calculated by summing the squared IDF of each term in the query.
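The formula can be checked with a short Python function. The IDF values passed in are arbitrary example numbers, and the function name `query_norm` is chosen for this sketch.

```python
import math


def query_norm(term_idfs):
    """queryNorm = 1 / sqrt(sum of the squared IDF of each query term)."""
    sum_of_squared_weights = sum(idf ** 2 for idf in term_idfs)
    return 1.0 / math.sqrt(sum_of_squared_weights)


# Two query terms with assumed IDF values 3.0 and 4.0: 1 / sqrt(9 + 16).
norm = query_norm([3.0, 4.0])
```

Every document matching the query is multiplied by the same value, which is why the ranking does not change.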
  16. Query Coordination: coord(q,d)

    Query coordination is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.
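A common way to express this factor is the fraction of query terms found in the document. The Python sketch below uses that definition as an assumption, with example term lists.

```python
def coord(query_terms, doc_terms):
    """Fraction of distinct query terms that also occur in the document."""
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)


query = ["best", "retrieval", "models"]
all_three = coord(query, ["best", "retrieval", "models", "explained"])
two_of_three = coord(query, ["best", "retrieval", "systems"])
```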
  17. TF / IDF (Term Frequency / Inverse Document Frequency)

    Term Frequency: How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
    tf(t in d) = √frequency
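The square root dampens repeated mentions, as this small Python sketch of the formula shows (the function name is chosen for the example).

```python
import math


def tf(frequency):
    """tf(t in d) = sqrt(frequency)."""
    return math.sqrt(frequency)


one_mention = tf(1)
five_mentions = tf(5)
```

Five mentions weigh a little more than twice as much as one mention, not five times as much.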
  18. TF / IDF (Term Frequency / Inverse Document Frequency)

    Inverse Document Frequency: How often does the term appear in all documents in the collection? The more often, the lower the weight.
    idf(t) = 1 + log(numDocs / (docFreq + 1))
    The inverse document frequency (idf) of term t is based on the logarithm of the number of documents in the index divided by the number of documents that contain the term; the formula adds 1 to the document frequency in the denominator and 1 to the result.
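A Python sketch of the formula follows, assuming the natural logarithm. The collection size and document frequencies are made-up numbers.

```python
import math


def idf(num_docs, doc_freq):
    """idf(t) = 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index with 1000 documents.
rare_term = idf(1000, 9)      # term found in 9 documents
common_term = idf(1000, 499)  # term found in 499 documents
```

The rare term gets roughly three times the weight of the common one, so matching it contributes more to the score.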
  19. Query-Time Boosting: t.getBoost()

    Query-time boosting is the main tool that you can use to tune relevance. Remember that boost is just one of the factors involved in the relevance score; it has to compete with the other factors.
  20. TF / IDF (Term Frequency / Inverse Document Frequency)

    Field-length normalization: How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field. The field length norm is calculated as follows:
    norm(d) = 1 / √numTerms
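The effect of field length can be seen by evaluating the formula for a short and a long field. The field sizes in this Python sketch are assumed values.

```python
import math


def field_norm(num_terms):
    """norm(d) = 1 / sqrt(numTerms)."""
    return 1.0 / math.sqrt(num_terms)


title_norm = field_norm(4)    # hypothetical 4-term title field
body_norm = field_norm(100)   # hypothetical 100-term body field
```

A match in the 4-term title is weighted five times as heavily as the same match in the 100-term body.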
  21. Summary

    "_explanation": {
      "description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
      "value": 0.076713204,
      "details": [
        {
          "description": "fieldWeight in 0, product of:",
          "value": 0.076713204,
          "details": [
            {
              "description": "tf(freq=1.0), with freq of:",
              "value": 1,
              "details": [
                {
                  "description": "termFreq=1.0",
  22. Term frequency

    How many times did the term honeymoon appear in the tweet field in this document?

          "value": 0.076713204,
          "details": [
            {
              "description": "tf(freq=1.0), with freq of:",
              "value": 1,
              "details": [
                {
                  "description": "termFreq=1.0",
                  "value": 1
                }
              ]
            },
            {
              "description": "idf(docFreq=1, maxDocs=1)",
              "value": 0.30685282
            },
  23. Inverse document frequency

    How many times did the term honeymoon appear in the tweet field of all documents in the index?

                  "description": "termFreq=1.0",
                  "value": 1
                }
              ]
            },
            {
              "description": "idf(docFreq=1, maxDocs=1)",
              "value": 0.30685282
            },
            {
              "description": "fieldNorm(doc=0)",
              "value": 0.25
            }
          ]
        }
      ]
    }
  24. Field-length norm

    How long is the tweet field in this document? The longer the field, the smaller this number.

            {
              "description": "idf(docFreq=1, maxDocs=1)",
              "value": 0.30685282
            },
            {
              "description": "fieldNorm(doc=0)",
              "value": 0.25
            }
          ]
        }
      ]
    }
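The numbers in the explanation output can be reproduced from the formulas on the earlier slides. The Python check below assumes the natural logarithm for idf and takes the fieldNorm value 0.25 directly from the output rather than recomputing it.

```python
import math

term_freq = 1.0             # termFreq=1.0
doc_freq, max_docs = 1, 1   # idf(docFreq=1, maxDocs=1)
field_norm = 0.25           # fieldNorm(doc=0), copied from the output

tf = math.sqrt(term_freq)                        # tf = sqrt(frequency)
idf = 1.0 + math.log(max_docs / (doc_freq + 1))  # 1 + ln(1 / 2)
field_weight = tf * idf * field_norm             # "fieldWeight ..., product of:"
```

The result agrees with the reported values 0.30685282 for idf and 0.076713204 for fieldWeight up to single-precision rounding.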
  25. Relevance Feedback

    Users make explicit relevance judgments on the initial results (the judgments are reliable, but users don’t want to make the extra effort).
  26. Pseudo/Blind/Automatic Feedback

    The top-k initial results are simply assumed to be relevant (the judgments aren’t reliable, but no user activity is required).
  27. Implicit Feedback

    User-clicked docs are assumed to be relevant; skipped ones are assumed to be non-relevant (the judgments aren’t completely reliable, but no extra effort is required from users).
  28. Query modification

    – Adding new (weighted) terms (query expansion)
    – Adjusting weights of old terms
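Both kinds of modification can be illustrated with a Rocchio-style update, a well-known technique that the slides do not name and that is used here only as an example. The weights `alpha` and `beta`, the function name, and the sample data are assumptions.

```python
from collections import Counter


def expand_query(query_weights, relevant_docs, alpha=1.0, beta=0.5):
    """Return alpha * original query + beta * centroid of the relevant docs.

    query_weights: dict mapping term -> weight
    relevant_docs: list of token lists judged (or assumed) relevant
    """
    centroid = Counter()
    for tokens in relevant_docs:
        for term, count in Counter(tokens).items():
            centroid[term] += count / len(relevant_docs)
    terms = set(query_weights) | set(centroid)
    return {
        term: alpha * query_weights.get(term, 0.0) + beta * centroid[term]
        for term in terms
    }


query = {"retrieval": 1.0, "models": 1.0}
feedback_docs = [
    ["retrieval", "ranking"],
    ["retrieval", "models", "ranking"],
]
new_query = expand_query(query, feedback_docs)
```

Here "ranking" enters the query as a new weighted term, while the weights of "retrieval" and "models" are adjusted upward.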
  29. Thanks!

    Łukasz Szymański, @szymanskilukasz
    Development Team Lead at
    http://szymanskilukasz.github.io/
    https://www.linkedin.com/in/szymanskilukasz
    https://twitter.com/szymanskilukasz