Scoring for human beings

Talk at Munich Search Meetup (http://www.meetup.com/Search-Meetup-Munich/), Feb 4 2014

Elasticsearch Inc

February 04, 2014

Transcript

  1. Scoring for human beings Britta Weber elasticsearch

  2. What is scoring?

    Determine the relevance of a document given a search request.
    - Given the keywords ["football", "world cup"], what is the most relevant news article the user might want to read?
    - Given the criteria ["java", "expected income", "work location"], which candidate in the data set is most likely to be a good employee?
  3. Hm…I can just use a match query and filters, right?

    "query": {
      "match": {
        "proglang": "java"
      }
    }
    …
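
    As a complete request, this might look roughly like the following sketch (index, type, field names and the filter are invented for illustration):

    curl -XPOST "http://localhost:9200/candidates/candidate/_search" -d'
    {
      "query": {
        "filtered": {
          "query": {
            "match": { "proglang": "java" }
          },
          "filter": {
            "term": { "work_location": "munich" }
          }
        }
      }
    }'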
  4. Agenda

    PART 1: Text scoring for human beings and the downside for tags
    PART 2: Do-it-yourself scoring
  5. How does scoring of text work?

  6. Relevancy

    Step               | Query        | Doc 1                                   | Doc 2
    The text           | brown fox    | The quick brown fox likes brown nuts    | The red fox
    The terms          | (brown, fox) | (brown, brown, fox, likes, nuts, quick) | (fox, red)
    A frequency vector | (1, 1)       | (2, 1)                                  | (0, 1)
    Relevancy          | -            | 3?                                      | 1?
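
    How the text turns into terms can be checked with the analyze API; a minimal sketch (the analyzer name is just an example):

    curl -XGET "http://localhost:9200/_analyze?analyzer=standard" -d 'The quick brown fox likes brown nuts'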
  7. So... more matching words mean a higher score, right?

  8. Scoring oddities https://gist.github.com/brwe/7229896

  9. Relevancy (same table as slide 6)

    Step               | Query        | Doc 1                                   | Doc 2
    The text           | brown fox    | The quick brown fox likes brown nuts    | The red fox
    The terms          | (brown, fox) | (brown, brown, fox, likes, nuts, quick) | (fox, red)
    A frequency vector | (1, 1)       | (2, 1)                                  | (0, 1)
    Relevancy          | -            | 3?                                      | 1?
  10. Relevancy - the vector space model

    d1: "the quick brown fox likes brown nuts", d2: "the red fox", q: "brown fox"
    [Figure: d1, d2 and q drawn as vectors in the (tf fox, tf brown) plane]
    Queries and documents are vectors. What is the distance between query and document vector?
  11. Relevancy - Cosine Similarity

    d1: "the quick brown fox likes brown nuts", d2: "the red fox", q: "brown fox"
    [Figure: document and query vectors in the (tf fox, tf brown) plane]
    Distance of docs and query: the cosine of the angle α between the document vector and the query vector:
    cos(α) = (d · q) / (|d| · |q|)
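
    Worked through for the frequency vectors from slide 6 (a hand-worked check, not from the slides):

    q = (1, 1), d1 = (2, 1), d2 = (0, 1)
    cos(q, d1) = (1·2 + 1·1) / (√2 · √5) = 3 / √10 ≈ 0.95
    cos(q, d2) = (1·0 + 1·1) / (√2 · √1) = 1 / √2  ≈ 0.71
    so d1 ranks above d2.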
  12. Relevancy - Projection distance

    d1: "the quick brown fox likes brown nuts", d2: "the red fox", q: "brown fox"
    [Figure: document vectors projected onto the query axis]
    Distance of docs and query: project the document vector onto the query axis; the length of the projection gives the score.
  13. Relevancy - Field length Shorter text is more relevant than

    longer text.
  14. Relevancy - Field length

    [Figure: in the (w(fox), w(brown)) plane, the original document vector d, a longer document with the same tfs and a shorter document with the same tfs; the shorter document scores higher]
  15. Relevancy - document frequency

    Words that appear in many documents are less important than words that appear in few documents.
  16. Relevancy - term weight: even more tweaking!

    [Figure: in the (w(fox), w(brown)) plane, multiplying the weight for "fox" by 2 stretches the original document vector d along the fox axis and changes the score]
  17. How many of these factors are there?

  18. Lucene Similarity

    The slide shows the Lucene TF/IDF formula with its parts labelled:
    - the score of a document d for a given query q
    - the core TF/IDF weight
    - the query norm (does not fit on this slide)
    - the inverse document frequency for term t
    - the boost of query term t
    - the field length norm, some function turning the number of tokens into a float
    http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
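
    Roughly, the practical scoring function from the linked TFIDFSimilarity docs is:

    score(q, d) = coord(q, d) · queryNorm(q) · Σ over all terms t in q of [ tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) ]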
  19. Explain API

    If you do not understand the score:

    curl -XPOST "http://localhost:9200/idfidx/test/_search" -d'
    {
      "query": {
        "match": {
          "location": "berlin kreuzberg"
        }
      },
      "explain": true
    }'
  20. The point is...

    - Text scoring is by default tuned for natural language text.
    - The empirical scoring formula works well for articles, mails, reviews, etc.
    - This way of scoring might be undesirable if the text represents tags.
  21. II: DIY scoring

  22. Remember…Lucene Similarity

    "I do not need that!"
    "Can I have the tf squared?"
    "I do not like the field length - how can I get rid of it?"
    "Can we not make this idf^1.265?"
    "I want my boost to depend on the ratio of the number of characters and the average height of my former gfs divided by the number of Friday 13ths in the last year!"
    http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
  23. But wait…can we not write our own Lucene similarity class?

    Yes, but…
    - you must figure out which classes you need, how to plug them in, …
    - you might not have access to all needed properties (payloads, field values, …)
    - you will want to test how well your scoring actually works before digging through Lucene code!
  24. function_score - basic structure

    "function_score": {
      "(query|filter)": {},    <- query or filter
      "functions": [
        {
          "filter": {},        <- apply the score computation only to docs matching this filter (default "match_all")
          "FUNCTION": {}       <- apply this function to the matching docs
        },
        ...
      ]
    }
  25. Scoring odysseys

    http://www.elasticsearch.org/videos/introducing-custom-scoring-functions/
    https://gist.github.com/brwe/7049473

  26. Decay functions - JSON structure

    Decay functions (shape of the decay curve): "gauss", "exp", "lin"

    "gauss": {
      "age": {              <- field name
        "reference": 40,
        "scale": 5,
        "decay": 0.5,
        "offset": 5
      }
    }
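
    Embedded in a full request, this might look as follows (a sketch; index, field and values are invented, and parameter names follow the slide and may differ in later Elasticsearch versions):

    curl -XPOST "http://localhost:9200/candidates/_search" -d'
    {
      "query": {
        "function_score": {
          "query": { "match": { "proglang": "java" } },
          "functions": [
            {
              "gauss": {
                "age": { "reference": 40, "scale": 5, "decay": 0.5, "offset": 5 }
              }
            }
          ]
        }
      }
    }'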
  27. If you need only simple stuff…

    - distance functions built in
    - boost built in
    - function_score replaces field and document boost
    …but sometimes you need more.
  28. function_score - script scoring

    "function_score": {
      "(query|filter)": {},     <- query or filter
      "functions": [
        {
          "filter": {},         <- apply the score computation only to docs matching this filter (default "match_all")
          "script_score": {
            "params": {…},      <- parameters that will be available at script execution
            "lang": "mvel",     <- script language, "mvel" is the default, other languages available as plugins
            "script": "…"       <- the actual script
          }
        },
        ...
      ]
    }
  29. Script examples - field values

    Use document values: "doc['posted'].value"
    Use math expressions: "pow(doc['age'].value - mean_age, -2.0)"
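
    As a full script_score function this could look like the sketch below (the field, the mean_age parameter and the "+ 1" guard against a zero difference are added for illustration):

    "script_score": {
      "params": { "mean_age": 35 },
      "lang": "mvel",
      "script": "pow(abs(doc['age'].value - mean_age) + 1, -2.0)"
    }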
  30. Script examples - term statistics

    - Brand new!
    - the _index variable allows access to Lucene term statistics
    - provides document count, document frequency, term frequency, total term frequency, …

    Term frequency: _index['text']['word'].tf()
    The document that contains "word" most often will score highest.
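
    A sketch of how this might be used inside a script_score function (the field "text" and the term are invented):

    "script_score": {
      "lang": "mvel",
      "script": "_index['text']['elasticsearch'].tf()"
    }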
  31. Detour: word count

    - Lucene does not store the number of tokens in a field
    - it must be enabled in the mapping and accessed as a regular field:

    properties:
      text:
        type: multi_field
        fields:
          …
          word_count:
            type: token_count
            …

    - access as field value: "doc['text.word_count'].value"
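
    As a complete JSON mapping this might look like the following sketch (index and type names are invented; token_count needs an analyzer):

    curl -XPUT "http://localhost:9200/myindex" -d'
    {
      "mappings": {
        "test": {
          "properties": {
            "text": {
              "type": "multi_field",
              "fields": {
                "text": { "type": "string" },
                "word_count": { "type": "token_count", "analyzer": "standard" }
              }
            }
          }
        }
      }
    }'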
  32. Relevancy - Cosine Similarity (recap of slide 11)

    d1: "the quick brown fox likes brown nuts", d2: "the red fox", q: "brown fox"
    Distance of docs and query: the cosine of the angle α between the document vector and the query vector:
    cos(α) = (d · q) / (|d| · |q|)
  33. Cosine similarity as script

    "params": {
      "field": "fieldname",
      "words": ["word1", …]
    },
    "script": "
      score = 0.0;
      queryLength = 0.0;
      docLength = 0.0;
      for (word : words) {
        tf = _index[field][word].tf();
        score = score + tf * 1.0;
        queryLength = queryLength + 1.0;
        docLength = docLength + pow(tf, 2.0);
      }
      return (float) score / (sqrt(docLength) * sqrt(queryLength));
    "

    cos(α) = (d · q) / (|d| · |q|)
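
    Put together as a full request, this could look roughly like the sketch below (index, field and words are invented; a check for docLength == 0.0 is added so documents containing none of the words score 0 instead of dividing by zero):

    curl -XPOST "http://localhost:9200/myindex/_search" -d'
    {
      "query": {
        "function_score": {
          "query": { "match_all": {} },
          "functions": [
            {
              "script_score": {
                "params": { "field": "text", "words": ["brown", "fox"] },
                "lang": "mvel",
                "script": "score = 0.0; queryLength = 0.0; docLength = 0.0; for (word : words) { tf = _index[field][word].tf(); score = score + tf; queryLength = queryLength + 1.0; docLength = docLength + pow(tf, 2.0); } if (docLength == 0.0) { return 0.0; } return (float) score / (sqrt(docLength) * sqrt(queryLength))"
              }
            }
          ],
          "boost_mode": "replace"
        }
      }
    }'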
  37. function_score - even more parameters!

    "function_score": {
      "(query|filter)": {},               <- query or filter
      "boost": 2,                         <- boost for the whole query score
      "functions": [
        {
          "filter": {},                   <- apply the score computation only to docs matching this filter (default "match_all")
          "FUNCTION": {}                  <- apply this function to the matching docs
        },
        ...
      ],
      "max_boost": 10.0,                  <- limit the boost to 10
      "score_mode": "(mult|max|...)",     <- how the results of the different filter/function pairs are combined: summed, multiplied, ...
      "boost_mode": "(mult|replace|...)"  <- how to merge with the query score: multiply, add, ...
    }
  38. Practical advice

    - Create evaluation data
    - Write a native script once you have settled on one function (see https://github.com/imotov/elasticsearch-native-script-example)
    - Filter out as much as you can before applying the scoring function
  39. TODOs

    - Index-wide statistics, similar to DFS_QUERY_THEN_FETCH
    - Analysis of the parameter string
    - Script execution prior to search
    - More optimizing…