
Text Retrieval

Day to day we use search platforms such as Solr or Elasticsearch. Both are built on the Apache Lucene engine, which implements techniques from the field of text retrieval. I will talk about the principles and techniques these tools use to get the job done. You will learn the theory behind text retrieval and the challenges it poses.


Łukasz Szymański

February 23, 2016

Transcript

  1. Text Retrieval PHPers Poznań #3 2016-02-23 @szymanskilukasz


  14. What is text retrieval?

  15. Define our problem
      Vocabulary: V = {w1, …, wN} - a set of words; it might span multiple languages.
      Query: q = q1, …, qm, where qi ∈ V
      Document: di = di1, …, dimi, where dij ∈ V
      Collection: C = {d1, …, dM}
      Set of relevant documents: R(q) ⊆ C
      Our task is to compute R′(q), an approximation of R(q).
  16. How to compute R′(q)?

  17. Document selection
      R′(q) = { d ∈ C | f(d,q) = 1 }, where f(d,q) ∈ {0,1}
      Here f(d,q) is a binary classifier: the system decides whether a document is relevant or not.
  18. Document ranking
      R′(q) = { d ∈ C | f(d,q) > θ }, where f(d,q) ∈ ℝ is a relevance measure function and θ is a cutoff determined by the user.
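
A minimal sketch of the two strategies, assuming a toy relevance function whose scores (and the cutoff θ) are invented for the example:

    # Document selection vs. document ranking; the scores are invented.
    scores = {"d1": 0.98, "d2": 0.94, "d3": 0.82, "d4": 0.73,
              "d5": 0.56, "d6": 0.43, "d7": 0.38}

    # Selection: f(d,q) ∈ {0,1}; the threshold is baked into the classifier.
    selected = {d for d, s in scores.items() if s >= 0.5}

    # Ranking: f(d,q) ∈ ℝ plus a user-chosen cutoff θ.
    theta = 0.7
    ranked = sorted((d for d in scores if scores[d] > theta),
                    key=scores.get, reverse=True)
    print(ranked)  # ['d1', 'd2', 'd3', 'd4']
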
  19. (Diagram: document selection gives every document a hard 1/0 label; document ranking orders documents by score (d1 0.98, d2 0.94, d3 0.82, d4 0.73, d5 0.56, d6 0.43, d7 0.38) and takes R′(q) to be everything above the cutoff θ.)
  20. Retrieval Models

  21. Similarity-based models: f(q,d) = similarity(q,d)
      Probabilistic models: f(d,q) = p(R=1|d,q), where R ∈ {0,1}
      Axiomatic models: f(q,d) must satisfy a set of constraints
  22. Common Ideas

  23. Let's assume that we have a query like q = “Best Retrieval Models”.
  24. Term Frequency
      How many times does a word occur in document d? E.g. how many times does “Best” occur in document d? How many times does “Retrieval”? How many times does “Models”?
  25. Document Length
      How long is document d? If a term occurs with the same frequency in two documents but one of them is shorter, the shorter one should score higher. Conversely, the longer a document is, the higher the chance that a term occurs in it at all.
  26. Document Frequency
      How often do we see the word in the entire collection?
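
A minimal sketch of these three statistics for the example query, assuming a toy collection and whitespace tokenization:

    from collections import Counter

    # Toy collection; the documents are invented for the example.
    docs = {
        "d1": "best retrieval models are ranking models",
        "d2": "best programming practices",
        "d3": "retrieval of text documents",
    }
    query = ["best", "retrieval", "models"]

    for doc_id, text in docs.items():
        tokens = text.split()
        tf = Counter(tokens)                       # term frequency per word
        print(doc_id, "length:", len(tokens), {t: tf[t] for t in query})

    # Document frequency: in how many documents does each term occur?
    df = {t: sum(t in d.split() for d in docs.values()) for t in query}
    print("df:", df)
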
  27. Vector Space Model

  28.–33. (Diagrams, built up step by step: a vector space whose axes are the query terms “Best”, “Programming”, “Models”; documents d1–d5 and then the query itself are plotted as vectors in that space.)
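
A minimal vector space model sketch, assuming raw term counts as vector weights and cosine similarity as the ranking function (the documents are invented for the example):

    import math
    from collections import Counter

    def vectorize(tokens, vocab):
        tf = Counter(tokens)
        return [tf[t] for t in vocab]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    vocab = ["best", "programming", "models"]
    docs = {
        "d1": "best programming models".split(),
        "d2": "best best programming".split(),
        "d3": "models of computation".split(),
    }
    q = vectorize("best programming models".split(), vocab)

    # Rank documents by how close their vectors are to the query vector.
    for d in sorted(docs, key=lambda d: cosine(q, vectorize(docs[d], vocab)),
                    reverse=True):
        print(d, round(cosine(q, vectorize(docs[d], vocab)), 3))
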
  34. Text Retrieval System Architecture

  35.–41. (Diagrams, built up step by step: docs and the query go through the TOKENIZER, producing a doc representation and a query representation; the INDEXER turns the doc representation into the INDEX; the SCORER matches the query representation against the INDEX and returns results; feedback on the results flows back into the system.)
  42. Tokenization

  43. Character Filters
      Character filters are used to preprocess the string of characters before it is passed to the tokenizer. A character filter may be used to strip out HTML markup, or to convert "&" characters to the word "and".
  44. Tokenizer
      Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation.
  45. Token Filters
      Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g. lowercasing), delete tokens (e.g. remove stopwords), or add tokens (e.g. synonyms).
  46. These analyzers typically perform four roles:
      – Tokenize text into individual words: The quick brown foxes → [The, quick, brown, foxes]
      – Lowercase tokens: The → the
      – Remove common stopwords: [The, quick, brown, foxes] → [quick, brown, foxes]
      – Stem tokens to their root form: foxes → fox
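
A minimal sketch of such an analysis chain in plain Python: a character filter, a whitespace/punctuation tokenizer, and lowercase/stopword/stemming token filters (the tiny suffix-stripping stemmer stands in for a real stemmer such as Porter):

    import re

    STOPWORDS = {"the", "a", "an", "and", "or", "of"}

    def char_filter(text):
        # Strip HTML markup and map "&" to the word "and".
        return re.sub(r"<[^>]+>", " ", text).replace("&", " and ")

    def tokenize(text):
        # Split on whitespace and punctuation.
        return re.findall(r"\w+", text)

    def token_filters(tokens):
        out = []
        for tok in tokens:
            tok = tok.lower()                  # lowercase filter
            if tok in STOPWORDS:               # stopword filter
                continue
            tok = re.sub(r"(es|s)$", "", tok)  # toy stemmer: foxes -> fox
            out.append(tok)
        return out

    print(token_filters(tokenize(char_filter("The quick brown foxes"))))
    # ['quick', 'brown', 'fox']
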
  47. (Architecture diagram again: next stop, the INDEXER.)
  48. Indexing
      Convert documents to data structures that enable fast search. Precompute as much as we can.
  49. Inverted Index

  50. (Diagram: an example inverted index, mapping each term to the list of documents it occurs in.)
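
A minimal sketch of building and querying an inverted index, assuming already-tokenized documents (a plain dict from term to postings set; real engines such as Lucene also store positions and use skip lists and compression):

    from collections import defaultdict

    docs = {
        "d1": ["best", "retrieval", "models"],
        "d2": ["best", "programming", "models"],
        "d3": ["text", "retrieval"],
    }

    # Build: map each term to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok].add(doc_id)

    # Query: fetch postings lists instead of scanning every document.
    print(sorted(index["retrieval"]))               # ['d1', 'd3']
    print(sorted(index["best"] & index["models"]))  # AND: ['d1', 'd2']
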

  51. (Architecture diagram again: next stop, the SCORER.)
  52. Lucene’s Practical Scoring Function

  53. Boolean Model
      The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for full AND text AND search AND (elasticsearch OR lucene) will include only documents that contain all of the terms full, text, and search, and either elasticsearch or lucene. This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
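
A minimal sketch of that Boolean query as set operations over the postings lists of an inverted index (the postings below are invented for the example):

    index = {
        "full":          {"d1", "d2", "d4"},
        "text":          {"d1", "d2", "d3", "d4"},
        "search":        {"d1", "d2", "d4"},
        "elasticsearch": {"d1", "d4"},
        "lucene":        {"d2", "d5"},
    }

    # full AND text AND search AND (elasticsearch OR lucene)
    matches = (index["full"] & index["text"] & index["search"]
               & (index["elasticsearch"] | index["lucene"]))
    print(sorted(matches))  # ['d1', 'd2', 'd4']
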
  54.–55. (Slides: the practical scoring function itself, score(q,d) = queryNorm(q) · coord(q,d) · Σ over t in q of [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]; the next slides walk through each factor.)

  56. queryNorm(q) - Query Normalization Factor
      queryNorm = 1 / √sumOfSquaredWeights, where sumOfSquaredWeights is calculated by adding together the IDF of each term in the query, squared. This does not affect ranking, but it does make scores from different queries more comparable, by removing the magnitude of the query vector as a factor in the score.
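
A minimal worked example of the normalization factor, assuming invented IDF values for the three query terms:

    import math

    idf = {"best": 1.2, "retrieval": 2.5, "models": 1.8}   # invented IDFs
    sum_of_squared_weights = sum(w * w for w in idf.values())
    print(1 / math.sqrt(sum_of_squared_weights))           # ≈ 0.302
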

  58. coord(q,d) - Query Coordination
      Query coordination is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.
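
A minimal sketch of the coordination factor, assuming the usual matched-terms-over-query-length definition:

    def coord(query_terms, doc_terms):
        # Fraction of the query terms that the document actually contains.
        matched = sum(1 for t in query_terms if t in doc_terms)
        return matched / len(query_terms)

    print(coord(["best", "retrieval", "models"], {"best", "models", "text"}))  # 2/3
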

  60. Term Frequency / Inverse Document Frequency
      Term Frequency: How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. tf(t in d) = √frequency

  62. Term Frequency / Inverse Document Frequency
      Inverse Document Frequency: How often does the term appear in all documents in the collection? The more often, the lower the weight. The inverse document frequency (idf) of term t is the logarithm of the number of documents in the index, divided by the number of documents that contain the term: idf(t) = 1 + log ( numDocs / (docFreq + 1))

  64. t.getBoost() - Query-Time Boosting
      Query-time boosting is the main tool that you can use to tune relevance. Remember that boost is just one of the factors involved in the relevance score; it has to compete with the other factors.

  66. Term Frequency / Inverse Document Frequency
      Field-length normalization: How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field. The field-length norm is calculated as follows: norm(d) = 1 / √numTerms
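
Putting the factors together, a minimal sketch of the practical scoring function for a single field. The formulas match the slides; the corpus is invented, and real Lucene additionally encodes norms lossily at index time:

    import math
    from collections import Counter

    # Toy corpus, invented for the example.
    docs = {
        "d1": "best retrieval models".split(),
        "d2": "best best programming models for text retrieval work".split(),
        "d3": "text processing".split(),
    }

    def idf(term):
        df = sum(term in toks for toks in docs.values())
        return 1 + math.log(len(docs) / (df + 1))   # idf(t) = 1 + log(numDocs/(docFreq+1))

    def score(query, doc_id, boost=1.0):
        tokens = docs[doc_id]
        tf = Counter(tokens)
        norm = 1 / math.sqrt(len(tokens))           # field-length norm
        matched = [t for t in query if tf[t] > 0]
        coord = len(matched) / len(query)           # query coordination
        query_norm = 1 / math.sqrt(sum(idf(t) ** 2 for t in query))
        total = sum(math.sqrt(tf[t]) * idf(t) ** 2 * boost * norm for t in matched)
        return query_norm * coord * total

    q = ["best", "retrieval", "models"]
    for d in sorted(docs, key=lambda d: score(q, d), reverse=True):
        print(d, round(score(q, d), 3))
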
  67. GET /_search?explain
      { "query" : { "match" : { "tweet" : "honeymoon" }}}
  68. Summary
      "_explanation": {
        "description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
        "value": 0.076713204,
        "details": [
          { "description": "fieldWeight in 0, product of:",
            "value": 0.076713204,
            "details": [
              { "description": "tf(freq=1.0), with freq of:",
                "value": 1,
                "details": [
                  { "description": "termFreq=1.0",
  69. Term frequency
      How many times did the term honeymoon appear in the tweet field in this document?
        "value": 0.076713204,
        "details": [
          { "description": "tf(freq=1.0), with freq of:",
            "value": 1,
            "details": [
              { "description": "termFreq=1.0", "value": 1 } ] },
          { "description": "idf(docFreq=1, maxDocs=1)",
            "value": 0.30685282 },
  70. Inverse document frequency
      How many times did the term honeymoon appear in the tweet field of all documents in the index?
          { "description": "termFreq=1.0", "value": 1 } ] },
          { "description": "idf(docFreq=1, maxDocs=1)",
            "value": 0.30685282 },
          { "description": "fieldNorm(doc=0)",
            "value": 0.25 } ] } ] }
  71. Field-length norm
      How long is the tweet field in this document? The longer the field, the smaller this number.
          { "description": "idf(docFreq=1, maxDocs=1)",
            "value": 0.30685282 },
          { "description": "fieldNorm(doc=0)",
            "value": 0.25 } ] } ] }
  72. (Architecture diagram again: finally, feedback.)
  73. Relevance Feedback
      Users make explicit relevance judgments on the initial results (judgments are reliable, but users don’t want to make the extra effort).
  74. Pseudo/Blind/Automatic Feedback
      The top-k initial results are simply assumed to be relevant (judgments aren’t reliable, but no user activity is required).
  75. Implicit Feedback
      User-clicked docs are assumed to be relevant; skipped ones non-relevant (judgments aren’t completely reliable, but no extra effort is required from users).
  76. How to learn from feedback?

  77. Query modification
      – Adding new (weighted) terms (query expansion)
      – Adjusting weights of old terms
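
One classic way to do both is the Rocchio method (not named in the talk): move the query vector toward the average relevant document and away from the average non-relevant one. A minimal sketch, with conventional α/β/γ weights chosen for illustration:

    def rocchio(query_vec, relevant, non_relevant,
                alpha=1.0, beta=0.75, gamma=0.15):
        # Vectors are dicts from term to weight. Feedback both adds new
        # weighted terms (query expansion) and adjusts existing weights.
        new_q = {t: alpha * w for t, w in query_vec.items()}
        for group, sign, coeff in ((relevant, 1, beta), (non_relevant, -1, gamma)):
            for d in group:
                for t, w in d.items():
                    new_q[t] = new_q.get(t, 0.0) + sign * coeff * w / len(group)
        return {t: w for t, w in new_q.items() if w > 0}  # keep positive weights

    q = {"best": 1.0, "retrieval": 1.0, "models": 1.0}
    rel = [{"retrieval": 2.0, "ranking": 1.0}]     # marked relevant by the user
    nonrel = [{"best": 1.0, "fashion": 2.0}]       # skipped by the user
    print(rocchio(q, rel, nonrel))
    # {'best': 0.85, 'retrieval': 2.5, 'models': 1.0, 'ranking': 0.75}
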
  78. Thanks!
      Łukasz Szymański, Development Team Lead
      http://szymanskilukasz.github.io/
      https://www.linkedin.com/in/szymanskilukasz
      https://twitter.com/szymanskilukasz