Elasticsearch Inc
February 04, 2014

# Scoring for human beings

Talk at Munich Search Meetup (http://www.meetup.com/Search-Meetup-Munich/), Feb 4 2014


## Transcript

2. ### What is scoring?

Determine the relevance of a document given a search request.

- Given keywords ["football", "world cup"], what is the most relevant news article the user might want to read?
- Given the criteria ["java", "expected income", "work location"], which candidate in the data set is most likely to be a good employee?

3. ### Hm…I can just use a match query and filters, right?

    "query":
      "match":
        "proglang": "java"
    …

4. ### Agenda

- PART 1: Text scoring for human beings and the downside for tags
- PART 2: Do-it-yourself scoring

6. ### Relevancy

| Step | Query | Doc 1 | Doc 2 |
|---|---|---|---|
| The text | brown fox | The quick brown fox likes brown nuts | The red fox |
| The terms | (brown, fox) | (brown, brown, fox, likes, nuts, quick) | (fox, red) |
| A frequency vector | (1, 1) | (2, 1) | (0, 1) |
| Relevancy | - | 3? | 1? |
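
The steps on this slide (text, terms, frequency vector, relevancy) can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slide: whitespace tokenization, "the" treated as the only stopword, and relevancy read as the dot product of query and document vectors, which reproduces the tentative values 3 and 1.

```python
from collections import Counter

STOPWORDS = {"the"}  # assumption: the slide's term lists drop "the"


def terms(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def frequency_vector(text, query_terms):
    """Count how often each query term occurs in the text."""
    counts = Counter(terms(text))
    return [counts[t] for t in query_terms]


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


query = "brown fox"
docs = {
    "doc1": "The quick brown fox likes brown nuts",
    "doc2": "The red fox",
}

query_terms = terms(query)
query_vector = frequency_vector(query, query_terms)
doc_vectors = {name: frequency_vector(text, query_terms) for name, text in docs.items()}
scores = {name: dot(query_vector, vec) for name, vec in doc_vectors.items()}

print(doc_vectors)
print(scores)
```

The vector components follow the order of the query terms (brown, fox).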

10. ### Relevancy - the vector space model

Queries and documents are vectors. What is the distance between query and document vector?

[Figure: q ("brown fox"), d1 ("the quick brown fox likes brown nuts"), and d2 ("the red fox") drawn as vectors; axes: tf: brown, tf: fox.]

11. ### Relevancy - Cosine Similarity

Distance of docs and query: cosine of the angle between document vector and query vector.

$$\cos(\alpha) = \frac{\vec{d} \cdot \vec{q}}{|\vec{d}| \cdot |\vec{q}|}$$

[Figure: q ("brown fox"), d1 ("the quick brown fox likes brown nuts"), and d2 ("the red fox") drawn as vectors; axes: tf: brown, tf: fox.]

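
As a worked example, the cosine can be evaluated for the term-frequency vectors of the example documents, written as (tf: brown, tf: fox): q = (1, 1), d1 = (2, 1), d2 = (0, 1). The helper name `cosine` is made up for this sketch.

```python
import math


def cosine(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(x * x for x in q))
    return dot / (norm_d * norm_q)


q = (1, 1)   # "brown fox"
d1 = (2, 1)  # "the quick brown fox likes brown nuts"
d2 = (0, 1)  # "the red fox"

cos_d1 = cosine(d1, q)  # 3 / sqrt(10)
cos_d2 = cosine(d2, q)  # 1 / sqrt(2)
print(round(cos_d1, 4), round(cos_d2, 4))
```

d1 ends up closer to the query than d2, and scaling a document vector does not change its cosine.
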
12. ### Relevancy - Projection distance

Distance of docs and query: Project document vector on query axis.

[Figure: d1 ("the quick brown fox likes brown nuts") and d2 ("the red fox") projected onto the axis of q ("brown fox"); axes: tf: brown, tf: fox; arrow labeled "score =>".]
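
A minimal sketch of the projection idea, assuming "project document vector on query axis" means the scalar projection (d · q) / |q|. The vectors are the term-frequency vectors of the example, written as (tf: brown, tf: fox); `d1_doubled` is a made-up document with twice the term frequencies of d1.

```python
import math


def projection_score(d, q):
    """Length of d projected onto the direction of q: (d . q) / |q|."""
    dot = sum(x * y for x, y in zip(d, q))
    return dot / math.sqrt(sum(x * x for x in q))


q = (1, 1)
d1 = (2, 1)
d2 = (0, 1)
d1_doubled = (4, 2)  # hypothetical: same text repeated twice

score_d1 = projection_score(d1, q)
score_d2 = projection_score(d2, q)
score_d1_doubled = projection_score(d1_doubled, q)
print(score_d1, score_d2, score_d1_doubled)
```

Unlike the cosine, this score grows with the term frequencies: doubling every tf doubles the score.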

longer text.
14. ### Relevancy - Field length

[Figure: axes w1(fox) and w2(brown); the original document vector d, a longer document with the same tfs, and a shorter document with the same tfs; labels w1,d and 2*w1,d; arrow labeled "score =>".]

15. ### Relevancy - document frequency

Words that appear in more documents are less important than words that appear in fewer documents.

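
To make this concrete, here is a small sketch of an inverse document frequency. It assumes the formula used by Lucene's default TF/IDF similarity as I understand it, idf = 1 + ln(numDocs / (docFreq + 1)); the document counts are invented for illustration.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: rare terms get a higher weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


num_docs = 1000                   # hypothetical index size
idf_common = idf(500, num_docs)   # term found in half of all documents
idf_rare = idf(5, num_docs)       # term found in only five documents
print(round(idf_common, 3), round(idf_rare, 3))
```

The rare term receives a clearly higher weight than the common one, so a match on it contributes more to the score.
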
16. ### Relevancy - term weight

Relevance: Even more tweaking!

[Figure: axes w1(fox) and w2(brown); the original document vector d and the vector after the weight for fox is multiplied by 2; labels w1,d and 2*w1,d; arrow labeled "score =>".]

18. ### Lucene Similarity

[Formula image; only its annotations survive:]

- score of a document d for a given query q
- query norm, does not fit on this slide
- core TF/IDF weight
- inverted document frequency for term t
- boost of query term t
- field length, some function turning the number of tokens into a float, roughly: …

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

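
For orientation, the practical scoring function described in the TFIDFSimilarity documentation linked above has roughly the following shape. This is written from that documentation, not recovered from the slide, so the slide's own formula may have differed in detail (for example, it may have omitted the coord factor).

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
  \mathrm{boost}(t)\cdot \mathrm{norm}(t,d) \Big)
```

In Lucene's default implementation, as far as I know, tf(t in d) is the square root of the term frequency and norm(t, d) contains a field-length factor of about 1/sqrt(number of tokens in the field).
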
19. ### Explain api

If you do not understand the score:

    curl -XPOST "http://localhost:9200/idfidx/test/_search" -d'
    {
      "query": {
        "match": { "location": "berlin kreuzberg" }
      },
      "explain": true
    }'

20. ### The point is...

- By default, text scoring is tuned for natural language text.
- The empirical scoring formula works well for articles, mails, reviews, etc.
- This way of scoring might be undesirable if the text represents tags.

22. ### Remember…Lucene Similarity

- "I do not need that!"
- "Can I have the tf squared?"
- "I do not like the field length - how can I get rid of it?"
- "Can we not make this idf^1.265?"
- "I want my boost to depend on the ratio of number of characters and average height of my former gfs divided by the number of Friday 13ths in the last year!"

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

23. ### But wait…can we not write our own Lucene similarity class?

Yes, but…

- you must figure out which classes you need, how to plug them in, …
- you might not have access to all needed properties (payloads, field values, …)
- you will want to test how well your scoring actually works before digging through Lucene code!

24. ### function_score - basic structure

    "function_score": {
      "(query|filter)": {},
      "functions": [
        {
          "filter": {},
          "FUNCTION": {}
        },
        ...
      ]
    }

- "(query|filter)": query or filter
- "filter": Apply score computation only to docs matching a specific filter (default "match_all")
- "FUNCTION": Apply this function to matching docs

26. ### Decay Functions

Decay functions:

- "gauss"
- "exp"
- "lin"

JSON structure:

    "gauss": {
      "age": {
        "reference": 40,
        "scale": 5,
        "decay": 0.5,
        "offset": 5
      }
    }

Here "gauss" selects the shape of the decay curve and "age" is the field name.

[Figure: decay curve annotated with reference, scale, decay, and offset.]

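
How the four parameters interact can be sketched in Python. The formulas below are my reading of the Elasticsearch function_score documentation, not something stated on the slide: the score stays at 1 within `offset` of `reference` and has dropped to `decay` at a further distance of `scale`. The function names are made up for this sketch, and the parameter is called `reference` as on the slide.

```python
import math


def _distance(value, reference, offset):
    """Distance from the reference point, ignoring the first `offset` units."""
    return max(0.0, abs(value - reference) - offset)


def gauss_decay(value, reference, scale, decay=0.5, offset=0.0):
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-_distance(value, reference, offset) ** 2 / (2.0 * sigma_sq))


def exp_decay(value, reference, scale, decay=0.5, offset=0.0):
    lam = math.log(decay) / scale
    return math.exp(lam * _distance(value, reference, offset))


def lin_decay(value, reference, scale, decay=0.5, offset=0.0):
    s = scale / (1.0 - decay)
    return max(0.0, (s - _distance(value, reference, offset)) / s)


# Parameters from the slide: reference 40, scale 5, decay 0.5, offset 5.
params = dict(reference=40, scale=5, decay=0.5, offset=5)
for age in (40, 45, 50, 60):
    print(age,
          round(gauss_decay(age, **params), 4),
          round(exp_decay(age, **params), 4),
          round(lin_decay(age, **params), 4))
```

With these parameters, ages 35 to 45 keep the full score, and all three curves pass through 0.5 at ages 30 and 50; they differ only in how quickly they fall off beyond that.
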
27. ### If you need only simple stuff…

- Distance functions built in
- boost built in
- function_score replaces field and document boost

…but sometimes you need more.

28. ### function_score - script scoring

    "function_score": {
      "(query|filter)": {},
      "functions": [
        {
          "filter": {},
          "script_score": {
            "params": {…},
            "lang": "mvel",
            "script": "…"
          }
        },
        ...
      ]
    }

- "(query|filter)": query or filter
- "filter": Apply score computation only to docs matching a specific filter (default "match_all")
- "params": Parameters that will be available at script execution
- "lang": Script language, "mvel" default, other languages available as plugin
- "script": The actual script

29. ### script examples - field values

Use document values:

    doc['posted'].value

Use math expressions:

    pow(doc['age'].value - mean_age, -2.0)

30. ### script examples - term statistics

- Brand new!
- _index variable allows access to Lucene term statistics
- provides document count, document frequency, term frequency, total term frequency, …

Term frequency:

    _index['text']['word'].tf()

The document that contains "word" most often will score highest!

31. ### Detour: word count

- Lucene does not store the number of tokens in a field
- must be enabled in mapping and accessed as regular field

Mapping:

    properties:
      text:
        type: multi_field
        fields:
          …
          word_count:
            type: token_count
            …

Access as field value:

    doc['text.word_count'].value

32. ### Relevancy - Cosine Similarity

Distance of docs and query: cosine of the angle between document vector and query vector.

$$\cos(\alpha) = \frac{\vec{d} \cdot \vec{q}}{|\vec{d}| \cdot |\vec{q}|}$$

[Figure: q ("brown fox"), d1 ("the quick brown fox likes brown nuts"), and d2 ("the red fox") drawn as vectors; axes: tf: brown, tf: fox.]

33. ### Cosine similarity as script

    "params": {
      "field": "fieldname",
      "words": ["word1", …]
    },
    "script": "
      score = 0.0;
      queryLength = 0.0;
      docLength = 0.0;
      for (word : words) {
        tf = _index[field][word].tf();
        score = score + tf * 1.0;
        queryLength = queryLength + 1.0;
        docLength = docLength + pow(tf, 2.0);
      }
      return (float)score / (sqrt(docLength) * sqrt(queryLength));
    "

$$\cos(\alpha) = \frac{\vec{d} \cdot \vec{q}}{|\vec{d}| \cdot |\vec{q}|}$$

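
The logic of the script can be tried outside Elasticsearch with a plain Python port. In this sketch a dictionary `term_freqs` (a made-up stand-in for `_index[field]`) maps each term to its frequency in one document. The zero check is an addition of this sketch: as written, the script divides by zero when a document contains none of the query words.

```python
import math


def cosine_score(term_freqs, words):
    """Python port of the slide's script for a single document."""
    score = 0.0
    query_length = 0.0
    doc_length = 0.0
    for word in words:
        tf = term_freqs.get(word, 0)  # stands in for _index[field][word].tf()
        score += tf * 1.0             # every query word has weight 1
        query_length += 1.0
        doc_length += tf ** 2
    if doc_length == 0.0:
        return 0.0  # guard added here; not part of the original script
    return score / (math.sqrt(doc_length) * math.sqrt(query_length))


words = ["brown", "fox"]
d1 = {"quick": 1, "brown": 2, "fox": 1, "likes": 1, "nuts": 1}
d2 = {"red": 1, "fox": 1}

print(cosine_score(d1, words), cosine_score(d2, words), cosine_score({}, words))
```

Note that the document length is accumulated over the query words only, so terms such as "quick" or "nuts" do not influence the result.
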
37. ### function_score - even more parameters!

    "function_score": {
      "(query|filter)": {},
      "boost": 2,
      "functions": [
        {
          "filter": {},
          "FUNCTION": {}
        },
        ...
      ],
      "max_boost": 10.0,
      "score_mode": "(mult|max|...)",
      "boost_mode": "(mult|replace|...)"
    }

- query score
- "filter": Apply score computation only to docs matching a specific filter (default "match_all")
- "FUNCTION": Apply this function to matching docs
- "score_mode": Result of the different filter/function pairs should be summed, multiplied, ...
- "boost_mode": Merge with query score by multiply, add, ...
- "max_boost": limit boost to 10

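
The interplay of "score_mode", "max_boost", and "boost_mode" can be sketched as follows. This is a simplified model under assumptions: the function scores are first merged according to the score mode, the result is capped at the maximum boost, and only then combined with the query score. The mode names follow the slide, only a few modes are shown, the top-level "boost" is left out, and at least one function is assumed to have matched.

```python
from functools import reduce
import operator

# Assumption: a small, illustrative subset of modes, named as on the slide.
SCORE_MODES = {
    "mult": lambda scores: reduce(operator.mul, scores, 1.0),
    "sum": sum,
    "max": max,
}
BOOST_MODES = {
    "mult": lambda query_score, func_score: query_score * func_score,
    "sum": lambda query_score, func_score: query_score + func_score,
    "replace": lambda query_score, func_score: func_score,
}


def combine(query_score, function_scores, score_mode="mult",
            boost_mode="mult", max_boost=float("inf")):
    """Merge function scores, cap them, then merge with the query score."""
    func_score = SCORE_MODES[score_mode](function_scores)
    func_score = min(func_score, max_boost)
    return BOOST_MODES[boost_mode](query_score, func_score)


# Two functions returned 3.0 and 4.0; the query itself scored 2.0.
print(combine(2.0, [3.0, 4.0], "mult", "mult", max_boost=10.0))
print(combine(2.0, [3.0, 4.0], "max", "replace", max_boost=10.0))
print(combine(2.0, [3.0, 4.0], "sum", "sum"))
```

In the first call the product 12.0 is capped at 10.0 before it is multiplied with the query score.
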
38. ### Practical advice

- Create evaluation data
- Write a native script once you have settled on one function (see https://github.com/imotov/elasticsearch-native-script-example)
- Filter out as much as you can before applying the scoring function

39. ### TODOs

- Index wide statistics, similar to DFS_QUERY_THEN_FETCH
- Analysis of parameter string
- script execution prior to search
- More optimizing…