
Scoring for human beings

Talk at Munich Search Meetup (http://www.meetup.com/Search-Meetup-Munich/), Feb 4 2014

Elasticsearch Inc

February 04, 2014

Transcript

  1. What is scoring? Determine the relevance of a document given

    a search request - Given keywords [“football”, “world cup”], what is the most relevant news article the user might want to read? - Given the criteria [“java”, “expected income”, “work location”], which candidate in the data set is most likely to be a good employee?
  2. Hm…I can just use a match query and filters, right?

    "query": {
      "match": {
        "proglang": "java"
      }
    }
    …
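    A minimal sketch of what such a request could look like for the candidate example from the first slide (the index name, the expected_income field, and the threshold are illustrative, not from the talk):

        curl -XPOST "http://localhost:9200/candidates/_search" -d'
        {
          "query": {
            "filtered": {
              "query": { "match": { "proglang": "java" } },
              "filter": { "range": { "expected_income": { "lte": 70000 } } }
            }
          }
        }'

    The rest of the talk looks at why the resulting ranking may not be what you want, especially for tag-like fields.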
  3. Agenda

    PART 1: Text scoring for human beings and the downside for tags. PART 2: Do-it-yourself scoring.
  4. Relevancy

    Step                Query          Doc 1                                       Doc 2
    The text            brown fox      The quick brown fox likes brown nuts        The red fox
    The terms           (brown, fox)   (brown, brown, fox, likes, nuts, quick)     (fox, red)
    Frequency vector    (1, 1)         (2, 1)                                      (0, 1)
    Relevancy           -              3?                                          1?
  6. Relevancy - the vector space model

    [Figure: d1 "the quick brown fox likes brown nuts", d2 "the red fox", and q "brown fox" plotted as vectors in the tf(brown)/tf(fox) plane.] Queries and documents are vectors. What is the distance between query and document vector?
  7. Relevancy - Cosine Similarity

    [Figure: the same vectors d1, d2, and q in the tf(brown)/tf(fox) plane, with the angle α between document and query vector marked.] Distance of docs and query: the cosine of the angle between document vector and query vector, \cos(\alpha) = \frac{\vec{d} \cdot \vec{q}}{|\vec{d}| \cdot |\vec{q}|}
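    As a worked example (not on the slide), using the frequency vectors from the table above with q = (1, 1), d1 = (2, 1), d2 = (0, 1):

        \cos(d_1, q) = \frac{2 \cdot 1 + 1 \cdot 1}{\sqrt{2^2 + 1^2} \cdot \sqrt{1^2 + 1^2}} = \frac{3}{\sqrt{10}} \approx 0.95
        \cos(d_2, q) = \frac{0 \cdot 1 + 1 \cdot 1}{\sqrt{0^2 + 1^2} \cdot \sqrt{1^2 + 1^2}} = \frac{1}{\sqrt{2}} \approx 0.71

    so "the quick brown fox likes brown nuts" ranks above "the red fox" for the query "brown fox".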
  8. Relevancy - Projection distance

    [Figure: the same vectors; each document vector is projected onto the query axis.] Distance of docs and query: project the document vector onto the query axis; the length of the projection is the score.
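    With the same vectors, the projection score (again a worked example, not on the slide) is:

        \mathrm{score}(d_1) = \frac{\vec{d_1} \cdot \vec{q}}{|\vec{q}|} = \frac{3}{\sqrt{2}} \approx 2.12
        \mathrm{score}(d_2) = \frac{\vec{d_2} \cdot \vec{q}}{|\vec{q}|} = \frac{1}{\sqrt{2}} \approx 0.71

    Unlike the cosine, the projection keeps growing with raw term frequencies, which is why the field length correction on the next slides matters.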
  9. Relevancy - Field length

    [Figure: in the w(fox)/w(brown) plane, the original document vector d is compared with a longer document having the same term frequencies (a shorter weight vector) and a shorter document having the same term frequencies (a longer weight vector, roughly 2*w per term); the projection onto the query axis, and hence the score, changes accordingly.]
  10. Relevancy - document frequency Words that appear in many

    documents are less important than words that appear in few.
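    Lucene's default similarity (see the TFIDFSimilarity link on slide 12) expresses this roughly as:

        \mathrm{idf}(t) = 1 + \ln\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}\right)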
  11. Relevancy - term weight: even more tweaking!

    [Figure: multiplying the weight for "fox" by 2 stretches the original document vector d along the fox axis (w1,d becomes 2*w1,d), which changes its projection onto the query axis and hence the score.]
  12. Lucene Similarity

    [Annotated formula: the score of a document d for a given query q combines the query norm (which does not fit on the slide), the core TF/IDF weight, the field length norm (some function turning the number of tokens into a float), the boost of query term t, and the inverse document frequency for term t.] http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
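    Roughly, the formula being annotated (per the linked TFIDFSimilarity documentation) is:

        \mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d)

    with \mathrm{tf}(t, d) = \sqrt{\mathrm{freq}(t, d)} and \mathrm{norm}(t, d) containing the field length factor 1/\sqrt{\mathrm{numTerms}} from the previous slides.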
  13. Explain API If you do not understand the score:

    curl -XPOST "http://localhost:9200/idfidx/test/_search" -d' { "query": { "match": { "location": "berlin kreuzberg" } }, "explain": true }'
  14. The point is... - Text scoring is, by default, tuned

    for natural language text. - The empirical scoring formula works well for articles, mails, reviews, etc. - This way of scoring might be undesirable if the text represents tags.
  15. Remember…Lucene Similarity “I do not need that!” “Can I have

    the tf squared?” “I do not like the field length - how can I get rid of it?” “I want my boost to depend on the ratio of number of characters and average height of my former gfs divided by the number of Friday 13ths in the last year!” http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html “Can we not make this idf^1.265?”
  16. But wait…can we not write our own Lucene similarity class?

    Yes, but… - you must figure out which classes you need, how to plug them in, … - you might not have access to all needed properties (payloads, field values,…) - you will want to test how well your scoring actually works before digging through Lucene code!
  17. function_score - basic structure

    "function_score": {
      "(query|filter)": {},
      "functions": [
        { "filter": {}, "FUNCTION": {} },
        ...
      ]
    }
    Annotations: "(query|filter)" is the query or filter; "filter" applies the score computation only to docs matching that filter (default "match_all"); "FUNCTION" is the function applied to matching docs.
  18. Decay Functions JSON structure

    Decay functions: "gauss", "exp", "lin". Example:
    "gauss": {
      "age": {
        "reference": 40,
        "scale": 5,
        "decay": 0.5,
        "offset": 5
      }
    }
    Annotations: "gauss" selects the shape of the decay curve, "age" is the field name, and "reference", "scale", "decay", "offset" are its parameters.
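    Embedded in a full function_score query this might look as follows (index and query are illustrative; the parameter names follow the slide and may differ in later Elasticsearch versions):

        "function_score": {
          "query": { "match": { "proglang": "java" } },
          "functions": [
            {
              "gauss": {
                "age": { "reference": 40, "scale": 5, "decay": 0.5, "offset": 5 }
              }
            }
          ]
        }

    A candidate whose age is within 5 years of 40 keeps the full function score of 1; 5 years further out (the scale distance beyond the offset) the score has decayed to 0.5.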
  19. If you need only simple stuff… - Distance functions built-in

    - boost built-in - function_score replaces field and document boost …but sometimes you need more.
  20. function_score - script scoring

    "function_score": {
      "(query|filter)": {},
      "functions": [
        {
          "filter": {},
          "script_score": {
            "params": {…},
            "lang": "mvel",
            "script": "…"
          }
        },
        ...
      ]
    }
    Annotations: "filter" applies the score computation only to docs matching that filter (default "match_all"); "params" are parameters that will be available at script execution; "lang" is the script language ("mvel" is the default, other languages are available as plugins); "script" is the actual script.
  21. script examples - field values use document values: "doc['posted'].value"

    use math expressions: "pow(doc['age'] - mean_age, -2.0)"
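    A sketch of a full script_score function using such expressions (index-independent; the mean_age parameter value is illustrative, and the slide's expression is rearranged to 1 / ((age - mean_age)^2 + 1) so it cannot divide by zero when age equals mean_age):

        "script_score": {
          "lang": "mvel",
          "params": { "mean_age": 35 },
          "script": "return 1.0 / (pow(doc['age'].value - mean_age, 2.0) + 1.0);"
        }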
  22. script examples - term statistics - Brand new! - the _index variable

    allows access to Lucene term statistics - provides document count, document frequency, term frequency, total term frequency, … Term frequency: _index['text']['word'].tf() - the document that contains "word" most often will score highest!
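    A sketch of a simple score built from these statistics (field and term are illustrative; accessor names follow the 1.0-era scripting docs and may differ in later versions):

        "script_score": {
          "lang": "mvel",
          "script": "tf = _index['text']['quick'].tf(); df = _index['text']['quick'].df(); return tf / (df + 1.0);"
        }

    This rewards documents that contain "quick" often while damping terms that occur in many documents, i.e. a hand-rolled tf/df weight.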
  23. Detour: word count - Lucene does not store the number of

    tokens in a field - it must be enabled in the mapping and accessed as a regular field:

        properties:
          text:
            type: multi_field
            fields:
              …
              word_count:
                type: token_count
                …

    - access as field value: "doc['text.word_count'].value"
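    As a concrete mapping request this could look roughly like the following (index and type names, the analyzer, and the string sub-field are illustrative):

        curl -XPUT "http://localhost:9200/idx" -d'
        {
          "mappings": {
            "test": {
              "properties": {
                "text": {
                  "type": "multi_field",
                  "fields": {
                    "text": { "type": "string" },
                    "word_count": { "type": "token_count", "analyzer": "standard" }
                  }
                }
              }
            }
          }
        }'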
  24. Relevancy - Cosine Similarity (recap)

    [Figure: the vectors d1 "the quick brown fox likes brown nuts", d2 "the red fox", and q "brown fox" from part 1.] Distance of docs and query: the cosine of the angle between document vector and query vector, \cos(\alpha) = \frac{\vec{d} \cdot \vec{q}}{|\vec{d}| \cdot |\vec{q}|}
  25. Cosine similarity as script

    "params": {
      "field": "fieldname",
      "words": ["word1", …]
    },
    "script": "
      score = 0.0;
      queryLength = 0.0;
      docLength = 0.0;
      for (word : words) {
        tf = _index[field][word].tf();
        score = score + tf * 1.0;
        queryLength = queryLength + 1.0;
        docLength = docLength + pow(tf, 2.0);
      }
      return (float) score / (sqrt(docLength) * sqrt(queryLength));
    "
    \cos(\alpha) = \frac{\vec{d} \cdot \vec{q}}{|\vec{d}| \cdot |\vec{q}|}
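    Embedded in a full function_score request this might look as follows (index-independent sketch; the script placeholder stands for the script shown above, collapsed to one line, and boost_mode "replace" makes the script result the final score):

        "function_score": {
          "query": { "match_all": {} },
          "boost_mode": "replace",
          "functions": [
            {
              "script_score": {
                "lang": "mvel",
                "params": { "field": "text", "words": ["brown", "fox"] },
                "script": "<cosine similarity script from above>"
              }
            }
          ]
        }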
  29. function_score - even more parameters!

    "function_score": {
      "(query|filter)": {},
      "boost": 2,
      "functions": [
        { "filter": {}, "FUNCTION": {} },
        ...
      ],
      "max_boost": 10.0,
      "score_mode": "(mult|max|...)",
      "boost_mode": "(mult|replace|...)"
    }
    Annotations: "filter" applies the score computation only to docs matching that filter (default "match_all"); "FUNCTION" is applied to matching docs; "score_mode" controls whether the results of the different filter/function pairs are summed, multiplied, ...; "boost_mode" controls how the result is merged with the query score (multiply, add, ...); "boost" boosts the query score; "max_boost" limits the boost to 10.
  30. Practical advice - Create evaluation data - Write a native script

    once you have settled on one function (see https://github.com/imotov/elasticsearch-native-script-example) - Filter out as much as you can before applying the scoring function
  31. TODOs - Index-wide statistics, similar to DFS_QUERY_THEN_FETCH - Analysis

    of the parameter string - Script execution prior to search - More optimizing…