
Scoring for human beings


Talk at Munich Search Meetup (http://www.meetup.com/Search-Meetup-Munich/), Feb 4 2014

Elasticsearch Inc

February 04, 2014


Transcript

  1. Scoring for human
    beings
    Britta Weber
    elasticsearch


  2. What is scoring?
    Determine the relevance of a document given a
    search request

    - Given keywords [“football”, “world
    cup”], what is the most relevant news article
    the user might want to read?

    - Given the criteria [“java”, “expected
    income”, “work location”], which
    candidate in the data set is most likely to be a
    good employee?


  3. Hm…I can just use a match query and
    filters, right?

    "query": {
      "match": {
        "proglang": "java"
      }
    }
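    As a full request this would look roughly like the following sketch; the index and
    type names (candidates/profile) are made up for illustration:

    curl -XPOST "http://localhost:9200/candidates/profile/_search" -d'
    {
      "query": {
        "match": {
          "proglang": "java"
        }
      }
    }'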


  4. Agenda
    PART 1: Text scoring for human beings and
    the downside for tags

    PART 2: Do-it-yourself scoring

  5. How does scoring of text work?


  6. Relevancy
    Step               | Query        | Doc 1                                   | Doc 2
    The text           | brown fox    | The quick brown fox likes brown nuts    | The red fox
    The terms          | (brown, fox) | (brown, brown, fox, likes, nuts, quick) | (fox, red)
    A frequency vector | (1, 1)       | (2, 1)                                  | (0, 1)
    Relevancy          | -            | 3?                                      | 1?


  7. So...more matching words mean a higher
    score, right?


  8. Scoring oddities
    https://gist.github.com/brwe/7229896


  9. Relevancy
    Step               | Query        | Doc 1                                   | Doc 2
    The text           | brown fox    | The quick brown fox likes brown nuts    | The red fox
    The terms          | (brown, fox) | (brown, brown, fox, likes, nuts, quick) | (fox, red)
    A frequency vector | (1, 1)       | (2, 1)                                  | (0, 1)
    Relevancy          | -            | 3?                                      | 1?


  10. Relevancy - the vector space model
    [Figure: d1 = "the quick brown fox likes brown nuts", d2 = "the red fox" and
    q = "brown fox" plotted as vectors in the plane spanned by the axes
    tf: brown and tf: fox]

    Queries and documents are vectors.

    What is the distance between the query and the
    document vector?


  11. Relevancy - Cosine Similarity
    [Figure: d1 = "the quick brown fox likes brown nuts", d2 = "the red fox" and
    q = "brown fox" as vectors over the axes tf: brown and tf: fox]

    Distance of docs and query:

    Cosine of the angle between the document vector
    and the query vector:

    cos(θ) = (d · q) / (|d| · |q|)
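    Worked out with the frequency vectors from the table above, q = (1, 1), d1 = (2, 1)
    and d2 = (0, 1) over the terms (brown, fox):

    cos(q, d1) = (1*2 + 1*1) / (sqrt(2) * sqrt(5)) = 3 / sqrt(10) ≈ 0.95
    cos(q, d2) = (1*0 + 1*1) / (sqrt(2) * sqrt(1)) = 1 / sqrt(2)  ≈ 0.71

    so d1 ranks above d2.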


  12. Relevancy - Projection distance
    [Figure: d1, d2 and q as vectors over the axes tf: brown and tf: fox, with the
    document vectors projected onto the query axis]

    Distance of docs and query:

    Project the document vector onto the query axis => score
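    With the same vectors, the projection of a document vector onto the query axis is
    (d · q) / |q|:

    proj(d1) = (1*2 + 1*1) / sqrt(2) = 3 / sqrt(2) ≈ 2.12
    proj(d2) = (1*0 + 1*1) / sqrt(2) = 1 / sqrt(2) ≈ 0.71

    Again d1 ranks above d2, but unlike the cosine this score keeps growing with the
    raw term frequencies.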


  13. Relevancy - Field length
    Shorter text is more relevant than longer
    text.


  14. Relevancy - Field length
    [Figure: weight axes w1(fox) and w2(brown) showing the original document
    vector d, a longer document with the same tfs and a shorter document with
    the same tfs; the shorter the document, the higher the score]


  15. Relevancy - document frequency
    Words that appear in many documents are less
    important than words that appear in fewer documents.


  16. Relevancy - term weight
    Even more tweaking!
    [Figure: weight axes w1(fox) and w2(brown) with the original document vector d
    and the same vector after multiplying the weight for fox by 2 (w1,d becomes
    2*w1,d), which changes the score]


  17. How many of these factors are there?


  18. Lucene Similarity
    http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

    [Formula from TFIDFSimilarity, annotated on the slide:]
    - score of a document d for a given query q
    - query norm (does not fit on this slide)
    - core TF/IDF weight
    - inverted document frequency for term t
    - boost of query term t
    - field length: some function turning the number of tokens into a float
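    For reference, the practical scoring function those annotations describe is, per the
    linked TFIDFSimilarity documentation, roughly:

    score(q, d) = queryNorm(q) * coord(q, d)
                  * sum over terms t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t, d) )

    where norm(t, d) carries the field-length factor, roughly 1 / sqrt(number of tokens in the field).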


  19. Explain API
    If you do not understand the score:

    curl -XPOST "http://localhost:9200/idfidx/test/_search" -d'
    {
    "query": {
    "match": {
    "location": "berlin kreuzberg"
    }
    },
    "explain": true
    }'


  20. The point is...
    - Text scoring by default is tuned for natural
    language text.

    - The empirical scoring formula works well for
    articles, mails, reviews, etc.

    - This way of scoring might be undesirable if the
    text represents tags.


  21. II: DIY scoring


  22. Remember…Lucene Similarity
    http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

    "I do not need that!"
    "Can I have the tf squared?"
    "I do not like the field length - how can I get rid of it?"
    "Can we not make this idf^1.265?"
    "I want my boost to depend on the ratio of the number of
    characters and the average height of my former gfs divided
    by the number of Friday 13ths in the last year!"


  23. But wait…can we not write our own
    Lucene similarity class?
    Yes, but…

    - you must figure out which classes you need, how to
    plug them in, …

    - you might not have access to all needed properties
    (payloads, field values,…)

    - you will want to test how well your scoring actually
    works before digging through Lucene code!


  24. function_score - basic structure
    "function_score": {
      "(query|filter)": {},
      "functions": [
        {
          "filter": {},
          "FUNCTION": {}
        },
        ...
      ]
    }

    "(query|filter)": query or filter
    "filter": apply score computation only to docs matching a specific filter
    (default "match_all")
    "FUNCTION": apply this function to matching docs
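    As a concrete sketch, a query that prefers Java profiles and doubles the score of
    candidates located in Berlin; the field names and values are made up, and
    boost_factor is the simple constant-boost function available at the time:

    "function_score": {
      "query": { "match": { "proglang": "java" } },
      "functions": [
        {
          "filter": { "term": { "location": "berlin" } },
          "boost_factor": 2
        }
      ]
    }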


  25. Scoring odysseys
    http://www.elasticsearch.org/videos/introducing-custom-scoring-functions/

    https://gist.github.com/brwe/7049473


  26. Decay Functions
    Decay functions:

    • "gauss"

    • "exp"

    • "lin"

    JSON structure:
    "gauss": {
      "age": {
        "reference": 40,
        "scale": 5,
        "decay": 0.5,
        "offset": 5
      }
    }

    "gauss": shape of decay curve
    "age": field name
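    To show where this snippet plugs in, a minimal sketch of a complete function_score
    query using it (the match_all query is just a placeholder); roughly, documents with
    an age within offset (5) of reference (40) keep the full score, and the multiplier
    falls to decay (0.5) once the age is another scale (5) further away:

    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        {
          "gauss": {
            "age": {
              "reference": 40,
              "scale": 5,
              "decay": 0.5,
              "offset": 5
            }
          }
        }
      ]
    }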


  27. If you need only simple stuff…
    - Distance functions built-in

    - boost built in

    - function_score replaces field and
    document boost

    …but sometimes you need more.


  28. function_score - script scoring
    "function_score": {
      "(query|filter)": {},
      "functions": [
        {
          "filter": {},
          "script_score": {
            "params": {…},
            "lang": "mvel",
            "script": "…"
          }
        },
        ...
      ]
    }

    "(query|filter)": query or filter
    "filter": apply score computation only to docs matching a specific filter
    (default "match_all")
    "params": parameters that will be available at script execution
    "lang": script language, "mvel" is the default; other languages are available as plugins
    "script": the actual script


  29. script examples - field values
    Use document values:
    "doc['posted'].value"

    Use math expressions:
    "pow(doc['age'].value - mean_age, -2.0)"
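    Put together with the structure from the previous slide, a minimal sketch of such a
    function; mean_age is passed in as a script parameter, and the value 40 is made up:

    "script_score": {
      "params": { "mean_age": 40 },
      "lang": "mvel",
      "script": "pow(doc['age'].value - mean_age, -2.0)"
    }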


  30. script examples - term statistics
    - Brand new!

    - The _index variable allows access to Lucene term statistics

    - provides document count, document frequency, term
    frequency, total term frequency, …

    Term frequency:
    _index['text']['word'].tf()

    The document that contains "word" most often will score
    highest!
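    For example, a minimal script_score function built on this; the field name (text)
    and the term (elasticsearch) are only for illustration:

    "script_score": {
      "lang": "mvel",
      "script": "_index['text']['elasticsearch'].tf()"
    }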


  31. Detour: word count
    - Lucene does not store the number of tokens in a field

    - it must be enabled in the mapping and accessed as a regular
    field:

    properties:
      text:
        type: multi_field
        fields:
          …
          word_count:
            type: token_count
          …

    - access as a field value:
    "doc['text.word_count'].value"
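    In JSON mapping syntax this could look roughly as follows; the analyzer setting is
    an assumption (token_count fields need an analyzer to count tokens) and the
    sub-field layout is illustrative:

    "properties": {
      "text": {
        "type": "multi_field",
        "fields": {
          "text": { "type": "string" },
          "word_count": {
            "type": "token_count",
            "analyzer": "standard"
          }
        }
      }
    }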


  32. Relevancy - Cosine Similarity
    [Figure: d1 = "the quick brown fox likes brown nuts", d2 = "the red fox" and
    q = "brown fox" as vectors over the axes tf: brown and tf: fox]

    Distance of docs and query:

    Cosine of the angle between the document vector
    and the query vector:

    cos(θ) = (d · q) / (|d| · |q|)


  33. Cosine similarity as script
    cos(θ) = (d · q) / (|d| · |q|)

    "params": {
      "field": "fieldname",
      "words": ["word1", …]
    },
    "script": "
      score = 0.0;
      queryLength = 0.0;
      docLength = 0.0;
      for (word : words) {
        tf = _index[field][word].tf();
        score = score + tf * 1.0;
        queryLength = queryLength + 1.0;
        docLength = docLength + pow(tf, 2.0);
      }
      return (float)score /
        (sqrt(docLength) * sqrt(queryLength));
    "
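    Because every query term gets weight 1 in this script, the numerator is just the sum
    of the term frequencies, |q| is sqrt(number of words) and |d| is sqrt(sum of squared tfs).
    With the running example the parameters would simply be:

    "params": {
      "field": "text",
      "words": ["brown", "fox"]
    }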


  37. function_score - even more parameters!
    "function_score": {
      "(query|filter)": {},
      "boost": 2,
      "functions": [
        {
          "filter": {},
          "FUNCTION": {}
        },
        ...
      ],
      "max_boost": 10.0,
      "score_mode": "(mult|max|...)",
      "boost_mode": "(mult|replace|...)"
    }

    "(query|filter)": query or filter; produces the query score
    "filter": apply score computation only to docs matching a specific filter
    (default "match_all")
    "FUNCTION": apply this function to matching docs
    "score_mode": how the results of the different filter/function pairs are
    combined: summed, multiplied, ...
    "boost_mode": how to merge with the query score: multiply, add, ...
    "max_boost": limit the boost to 10
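    A filled-in sketch reusing the earlier examples; note that the full option names are
    spelled out (e.g. "multiply", "sum", "max") rather than the abbreviations shown on
    the slide, and the field names remain illustrative:

    "function_score": {
      "query": { "match": { "proglang": "java" } },
      "boost": 2,
      "functions": [
        {
          "filter": { "term": { "location": "berlin" } },
          "boost_factor": 2
        },
        {
          "gauss": { "age": { "reference": 40, "scale": 5 } }
        }
      ],
      "max_boost": 10.0,
      "score_mode": "sum",
      "boost_mode": "multiply"
    }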


  38. Practical advice
    - Create evaluation data

    - Write a native script once you have settled on one function (see
    https://github.com/imotov/elasticsearch-native-script-example)

    - Filter out as much as you can before applying the scoring
    function


  39. TODOs
    - Index-wide statistics, similar to
    DFS_QUERY_THEN_FETCH

    - Analysis of the parameter string - script execution prior to
    search

    - More optimizing…
