
Text Retrieval

We use search platforms such as Solr or Elasticsearch every day. Both are built on the Apache Lucene engine, which implements techniques from the field of text retrieval. In this talk I cover the principles and techniques these tools use to do their job. You will learn the theory behind text retrieval and the challenges it poses.

Łukasz Szymański

February 23, 2016


Transcript

  1. Text Retrieval
    PHPers Poznań #3, 2016-02-23
    @szymanskilukasz

  2.–13. (Image-only slides.)

  14. What is text retrieval?

  15. Defining the problem
    Vocabulary: V = {w1,…,wN}, the set of all words (possibly in multiple languages)
    Query: q = q1,…,qm, where qi ∈ V
    Document: di = di1,…,dimi, where dij ∈ V
    Collection: C = {d1,…,dM}
    Set of relevant documents: R(q) ⊆ C
    Our task is to compute R′(q), an approximation of R(q).

  16. How to compute R′(q)?

  17. Document selection
    R′(q) = { d ∈ C | f(d,q) = 1 }, where f(d,q) ∈ {0,1}
    Here f(d,q) is a binary classifier: the system decides whether a document
    is relevant or not.

  18. Document ranking
    R′(q) = { d ∈ C | f(d,q) > θ }, where f(d,q) ∈ ℝ is a relevance measure
    function and θ is a cutoff determined by the user.
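    A minimal Python sketch of the two modes. The score() function here is a
    made-up stand-in for a relevance measure f(d,q); any real measure would do:

    def score(doc: str, query: str) -> float:
        """Toy relevance measure: fraction of query words present in the doc."""
        doc_words = set(doc.lower().split())
        query_words = query.lower().split()
        return sum(w in doc_words for w in query_words) / len(query_words)

    def select(collection, query):
        """Document selection: a hard binary decision per document."""
        return [d for d in collection if score(d, query) == 1.0]

    def rank(collection, query, theta=0.5):
        """Document ranking: sort by score, keep documents above the cutoff theta."""
        ranked = sorted(collection, key=lambda d: score(d, query), reverse=True)
        return [d for d in ranked if score(d, query) > theta]

    docs = ["best retrieval models", "best programming models", "cooking recipes"]
    print(select(docs, "best retrieval models"))  # ['best retrieval models']
    print(rank(docs, "best retrieval models"))    # both "models" docs, best first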

  19. (Diagram comparing the two modes. Document selection assigns each
    document a hard 0/1 label. Document ranking scores every document, e.g.
    0.98 d1+, 0.94 d2+, 0.82 d3-, 0.73 d4+ lie above the cutoff θ and form
    R′(q), while 0.56 d5-, 0.43 d6-, 0.38 d7+ fall below it.)

  20. Retrieval Models

  21. Similarity-based models: f(q,d) = similarity(q,d)
    Probabilistic models: f(d,q) = p(R=1 | d,q), where R ∈ {0,1}
    Axiomatic models: f(q,d) must satisfy a set of constraints

  22. Common Ideas

  23. Let’s assume that we have a query like q = “Best Retrieval Models”.

  24. Term Frequency
    How many times does a word occur in document “d”? E.g.:
    How many times does “Best” occur in document “d”?
    How many times does “Retrieval” occur in document “d”?
    How many times does “Models” occur in document “d”?

  25. Document Length
    How long is document “d”?
    If a term occurs with equal frequency in two documents but one of them is
    shorter, the shorter document will score higher. On the other hand, the
    longer a document is, the higher the probability that any given term
    occurs in it.

  26. Document Frequency
    How often do we see a word in the entire collection?
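    All three signals are cheap to compute directly. A small sketch over a toy
    collection (the documents are made up for illustration):

    from collections import Counter

    docs = {
        "d1": "best retrieval models for text retrieval",
        "d2": "best programming models",
        "d3": "a very long document about cooking",
    }
    tokenized = {d: text.split() for d, text in docs.items()}

    # Term frequency: how many times does a word occur in document d?
    tf = {d: Counter(tokens) for d, tokens in tokenized.items()}
    # Document length: how long is document d?
    dl = {d: len(tokens) for d, tokens in tokenized.items()}
    # Document frequency: in how many documents does the word appear at all?
    df = Counter(w for tokens in tokenized.values() for w in set(tokens))

    print(tf["d1"]["retrieval"], dl["d1"], df["best"])  # 2 6 2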

  27. Vector Space Model

  28.–33. (Diagrams, built up over six slides: a three-dimensional term space
    with axes “Best”, “Programming” and “Models”; documents d1–d5 and the
    query are plotted as vectors in this space, and the documents closest to
    the query vector are the best matches.)
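    In the vector space model, f(q,d) is typically the cosine of the angle
    between the query vector and the document vector. A minimal sketch,
    assuming plain term-frequency weights:

    import math
    from collections import Counter

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity between two sparse term-frequency vectors."""
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    docs = {
        "d1": "best programming models",
        "d2": "best retrieval models ever",
        "d3": "cooking for programmers",
    }
    query = Counter("best retrieval models".split())
    vectors = {d: Counter(text.split()) for d, text in docs.items()}

    ranking = sorted(vectors, key=lambda d: cosine(query, vectors[d]), reverse=True)
    print(ranking)  # ['d2', 'd1', 'd3']: d2 shares the most terms with the query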

  34. Text Retrieval System Architecture

  35.–41. (Architecture diagrams, built up step by step: docs flow through a
    TOKENIZER into a document representation, which the INDEXER turns into
    the INDEX; a query is tokenized into a query representation; the SCORER
    matches it against the INDEX and produces results; and feedback from the
    results flows back into the query.)

  42. Tokenization

  43. Character Filters
    Character filters are used to preprocess the string of characters before
    it is passed to the tokenizer. A character filter may be used to strip
    out HTML markup, or to convert “&” characters to the word “and”.

  44. Tokenizer
    Tokenizers are used to break a string down into a stream of terms or
    tokens. A simple tokenizer might split the string up into terms wherever
    it encounters whitespace or punctuation.

  45. Token Filters
    Token filters accept a stream of tokens from a tokenizer and can modify
    tokens (e.g. lowercasing), delete tokens (e.g. remove stopwords) or add
    tokens (e.g. synonyms).

  46. These analyzers typically perform four roles:
    Tokenize text into individual words:
      The quick brown foxes → [The, quick, brown, foxes]
    Lowercase tokens:
      The → the
    Remove common stopwords:
      [The, quick, brown, foxes] → [quick, brown, foxes]
    Stem tokens to their root form:
      foxes → fox
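    A minimal Python sketch of such an analysis chain. The stopword list and
    the crude suffix-stripping “stemmer” are simplified stand-ins for what
    real analyzers (e.g. Lucene’s) actually do:

    import re

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to"}  # tiny stand-in list

    def analyze(text: str) -> list[str]:
        # 1. Tokenize: split on anything that is not a word character.
        tokens = [t for t in re.split(r"\W+", text) if t]
        # 2. Lowercase each token.
        tokens = [t.lower() for t in tokens]
        # 3. Remove common stopwords.
        tokens = [t for t in tokens if t not in STOPWORDS]
        # 4. "Stem" with a crude suffix rule (real stemmers, e.g. Porter, do far more).
        tokens = [re.sub(r"(es|s)$", "", t) for t in tokens]
        return tokens

    print(analyze("The quick brown foxes"))  # ['quick', 'brown', 'fox']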

  47. (The architecture diagram again.)

  48. Indexing
    Convert documents to data structures that enable fast search.
    Precompute as much as we can.

  49. Inverted Index

  50. (Image-only slide.)
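    An inverted index maps each term to a postings list: the documents (and
    positions) in which the term occurs, so query-time lookups never have to
    scan every document. A minimal sketch:

    from collections import defaultdict

    def build_inverted_index(docs: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
        """Map each term to a postings list of (doc_id, position) pairs."""
        index: dict[str, list[tuple[str, int]]] = defaultdict(list)
        for doc_id, text in docs.items():
            for position, term in enumerate(text.lower().split()):
                index[term].append((doc_id, position))
        return index

    docs = {
        "d1": "best retrieval models",
        "d2": "best programming models",
    }
    index = build_inverted_index(docs)
    print(index["models"])  # [('d1', 2), ('d2', 2)]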

  51. (The architecture diagram again.)

  52. Lucene’s Practical Scoring Function

  53. Boolean Model
    The Boolean model simply applies the AND, OR, and NOT conditions
    expressed in the query to find all the documents that match.
    A query for full AND text AND search AND (elasticsearch OR lucene) will
    include only documents that contain all of the terms full, text, and
    search, and either elasticsearch or lucene.
    This process is simple and fast. It is used to exclude any documents that
    cannot possibly match the query.
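    A minimal sketch of that boolean filter over a set-of-terms representation
    of each document (the documents are made up for illustration):

    docs = {
        "d1": {"full", "text", "search", "with", "elasticsearch"},
        "d2": {"full", "text", "search", "with", "lucene"},
        "d3": {"full", "text", "query"},
    }

    def matches(terms: set[str]) -> bool:
        # full AND text AND search AND (elasticsearch OR lucene)
        return ({"full", "text", "search"} <= terms
                and ("elasticsearch" in terms or "lucene" in terms))

    print([d for d, terms in docs.items() if matches(terms)])  # ['d1', 'd2']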

  54.–55. (Image-only slides.)

  56. Query Normalization Factor: queryNorm(q)
    queryNorm = 1 / √sumOfSquaredWeights
    The sumOfSquaredWeights is calculated by adding together the IDF of each
    term in the query, squared.
    This does not affect ranking; it just makes scores from different queries
    more comparable by removing the magnitude of the query vector as a factor
    in the score.
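    A worked sketch of that computation (the IDF values are made up for
    illustration):

    import math

    # Hypothetical IDF values for the terms of a query.
    query_term_idfs = {"best": 1.2, "retrieval": 2.5, "models": 1.8}

    sum_of_squared_weights = sum(idf ** 2 for idf in query_term_idfs.values())
    query_norm = 1 / math.sqrt(sum_of_squared_weights)
    print(round(query_norm, 4))  # 0.3025, multiplied into every document's score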

  57. (Image-only slide.)

  58. Query Coordination: coord(q,d)
    The coordination factor is used to reward documents that contain a higher
    percentage of the query terms. The more query terms that appear in the
    document, the greater the chances that the document is a good match for
    the query.
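    In Lucene, coord(q,d) is simply the number of matching query terms divided
    by the total number of query terms:

    def coord(query_terms: set[str], doc_terms: set[str]) -> float:
        """coord(q,d) = matching query terms / total query terms."""
        return len(query_terms & doc_terms) / len(query_terms)

    print(coord({"quick", "brown", "fox"}, {"quick", "fox", "jumps"}))  # ≈ 0.667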

  59. (Image-only slide.)

  60. TF / IDF: Term Frequency
    How often does the term appear in this document? The more often, the
    higher the weight. A field containing five mentions of the same term is
    more likely to be relevant than a field containing just one mention.
    tf(t in d) = √frequency

  61. (Image-only slide.)

  62. TF / IDF: Inverse Document Frequency
    How often does the term appear in all documents in the collection? The
    more often, the lower the weight.
    idf(t) = 1 + log( numDocs / (docFreq + 1) )
    The inverse document frequency (idf) of term t is the logarithm of the
    number of documents in the index divided by the number of documents that
    contain the term.
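    A sketch of both formulas as given on the slides (log is the natural
    logarithm, as in Lucene):

    import math

    def tf(frequency: float) -> float:
        """tf(t in d) = sqrt(frequency)"""
        return math.sqrt(frequency)

    def idf(num_docs: int, doc_freq: int) -> float:
        """idf(t) = 1 + log(numDocs / (docFreq + 1))"""
        return 1 + math.log(num_docs / (doc_freq + 1))

    print(tf(5))           # ≈ 2.24: five mentions weigh about twice as much as one
    print(idf(1000, 9))    # ≈ 5.61: a rare term gets a high weight
    print(idf(1000, 499))  # ≈ 1.69: a common term gets a low weight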

  63. (Image-only slide.)

  64. Query-Time Boosting: t.getBoost()
    Query-time boosting is the main tool that you can use to tune relevance.
    Remember that boost is just one of the factors involved in the relevance
    score; it has to compete with the other factors.

  65. (Image-only slide.)

  66. TF / IDF: Field-Length Normalization
    How long is the field? The shorter the field, the higher the weight. If a
    term appears in a short field, such as a title field, it is more likely
    that the content of that field is about the term than if the same term
    appears in a much bigger body field. The field-length norm is calculated
    as follows:
    norm(d) = 1 / √numTerms
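    Putting the factors together: per the Elasticsearch guide, Lucene’s
    practical scoring function combines them roughly as
    score(q,d) = queryNorm(q) · coord(q,d) · Σ over t in q of
    (tf(t in d) · idf(t)² · t.getBoost() · norm(t,d)).
    A simplified sketch (real Lucene folds boosts into queryNorm and computes
    norms at index time; the collection statistics below are made up):

    import math

    def practical_score(query_terms, doc_text, num_docs, doc_freqs, boosts=None):
        """Sketch of queryNorm(q) * coord(q,d) * sum of tf * idf^2 * boost * norm."""
        boosts = boosts or {}
        field = doc_text.split()
        norm = 1 / math.sqrt(len(field))  # field-length norm
        idf = {t: 1 + math.log(num_docs / (doc_freqs.get(t, 0) + 1))
               for t in query_terms}
        query_norm = 1 / math.sqrt(sum(w ** 2 for w in idf.values()))
        matching = [t for t in query_terms if t in field]
        coord = len(matching) / len(query_terms)
        total = sum(math.sqrt(field.count(t)) * idf[t] ** 2 * boosts.get(t, 1.0) * norm
                    for t in matching)
        return query_norm * coord * total

    doc_freqs = {"best": 120, "retrieval": 3, "models": 40}  # hypothetical
    print(practical_score(["best", "retrieval", "models"],
                          "best retrieval models for text retrieval",
                          num_docs=1000, doc_freqs=doc_freqs))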

  67. GET /_search?explain
    {
        "query" : { "match" : { "tweet" : "honeymoon" }}
    }

  68. "_explanation": {
        "description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
        "value": 0.076713204,
        "details": [
            {
                "description": "fieldWeight in 0, product of:",
                "value": 0.076713204,
                "details": [
                    {
                        "description": "tf(freq=1.0), with freq of:",
                        "value": 1,
                        "details": [
                            {
                                "description": "termFreq=1.0",
    Summary: the top-level explanation of the score calculation (truncated;
    the individual factors follow on the next slides).

  69. {
        "description": "tf(freq=1.0), with freq of:",
        "value": 1,
        "details": [
            { "description": "termFreq=1.0", "value": 1 }
        ]
    },
    Term frequency: how many times did the term honeymoon appear in the
    tweet field in this document?

  70. {
        "description": "idf(docFreq=1, maxDocs=1)",
        "value": 0.30685282
    },
    Inverse document frequency: how many times did the term honeymoon appear
    in the tweet field of all documents in the index?

  71. {
        "description": "fieldNorm(doc=0)",
        "value": 0.25
    }
    Field-length norm: how long is the tweet field in this document? The
    longer the field, the smaller this number.

  72. (The architecture diagram again.)

  73. Relevance Feedback
    Users make explicit relevance judgments on the initial results.
    (Judgments are reliable, but users don’t want to make extra effort.)

  74. Pseudo/Blind/Automatic Feedback
    The top-k initial results are simply assumed to be relevant.
    (Judgments aren’t reliable, but no user activity is required.)

  75. Implicit Feedback
    User-clicked docs are assumed to be relevant; skipped ones non-relevant.
    (Judgments aren’t completely reliable, but no extra effort is required
    from users.)

  76. How to learn from feedback?

  77. Query modification
    – Adding new (weighted) terms (query expansion)
    – Adjusting the weights of existing terms
    (One classic way to do this is sketched below.)
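    The slides don’t name a specific algorithm; one classic choice is Rocchio
    feedback, which moves the query vector toward relevant documents and away
    from non-relevant ones. A minimal sketch, with the usual alpha, beta and
    gamma weights as assumed parameters:

    from collections import Counter

    def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """q_new = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant)."""
        terms = set(query).union(*relevant, *nonrelevant)
        new_query = Counter()
        for t in terms:
            weight = alpha * query[t]
            if relevant:
                weight += beta * sum(d[t] for d in relevant) / len(relevant)
            if nonrelevant:
                weight -= gamma * sum(d[t] for d in nonrelevant) / len(nonrelevant)
            if weight > 0:  # keep only positively weighted terms
                new_query[t] = weight
        return new_query

    q = Counter({"retrieval": 1})
    rel = [Counter({"retrieval": 2, "models": 1})]      # judged/clicked relevant
    nonrel = [Counter({"retrieval": 1, "cooking": 3})]  # judged/skipped non-relevant
    print(rocchio(q, rel, nonrel))  # "models" enters the query; "cooking" does not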

  78. Thanks!
    Łukasz Szymański
    @szymanskilukasz
    http://szymanskilukasz.github.io/
    Development Team Lead at
    https://www.linkedin.com/in/szymanskilukasz
    https://twitter.com/szymanskilukasz