function_score - Speaker Deck

Slide 1

Slide 1 text

A short introduction to function_score Britta Weber elasticsearch Wednesday, October 30, 13

Slide 2

Slide 2 text

Agenda
PART 1: Text scoring for human beings and the downside for tags
PART 2: Scoring numerical fields

Slide 3

Slide 3 text

How does scoring of text work?

Slide 4

Slide 4 text

Relevancy

Step               | Query        | Doc 1                                    | Doc 2
The text           | brown fox    | The quick brown fox likes brown nuts     | The red fox
The terms          | (brown, fox) | (brown, brown, fox, likes, nuts, quick)  | (fox, red)
A frequency vector | (1, 1)       | (2, 1)                                   | (0, 1)
Relevancy          | -            | 3?                                       | 1?
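
The frequency vectors and the tentative relevancy values on this slide can be reproduced in a few lines of Python. This is only a sketch and not how Lucene computes scores: the whitespace tokenizer, the dropped stop word "the", and the use of a dot product as relevancy are assumptions chosen to match the numbers shown.

```python
from collections import Counter

STOP_WORDS = {"the"}  # assumption: the slide drops "the" from the term lists


def terms(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def frequency_vector(text, query_terms):
    """Count how often each query term occurs in the text."""
    counts = Counter(terms(text))
    return [counts[t] for t in query_terms]


def relevancy(text, query):
    """Naive relevancy: dot product of query and document frequency vectors."""
    query_terms = sorted(set(terms(query)))
    q_vec = frequency_vector(query, query_terms)
    d_vec = frequency_vector(text, query_terms)
    return sum(q * d for q, d in zip(q_vec, d_vec))


query = "brown fox"
doc1 = "The quick brown fox likes brown nuts"
doc2 = "The red fox"

print(frequency_vector(doc1, ["brown", "fox"]))        # [2, 1]
print(frequency_vector(doc2, ["brown", "fox"]))        # [0, 1]
print(relevancy(doc1, query), relevancy(doc2, query))  # 3 1
```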

Slide 5

Slide 5 text

So...more matching words mean higher score, right?

Slide 6

Slide 6 text

Text scoring oddities
https://gist.github.com/brwe/7229896

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Relevancy - Vector Space Model
q: “brown fox”
d1: “the quick brown fox likes brown nuts”
d2: “the red fox”
[Figure: q, d1 and d2 drawn as vectors; the axes are tf: brown and tf: fox]
Distance of docs and query: Project document vector on query axis => score
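
Projecting a document vector on the query axis can be sketched as a dot product divided by the length of the query vector. The vectors below are the term frequencies of "brown" and "fox" for the query and the two documents on this slide; treating the raw frequencies as coordinates is an assumption made for illustration.

```python
import math


def project_on_query(doc_vec, query_vec):
    """Length of the projection of doc_vec onto the query axis."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    query_len = math.sqrt(sum(q * q for q in query_vec))
    return dot / query_len


q = (1, 1)   # tf(brown), tf(fox) in the query "brown fox"
d1 = (2, 1)  # "the quick brown fox likes brown nuts"
d2 = (0, 1)  # "the red fox"

print(round(project_on_query(d1, q), 3))  # 2.121
print(round(project_on_query(d2, q), 3))  # 0.707
```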

Slide 9

Slide 9 text

I: Field length
Shorter text is more relevant than longer text.

Slide 10

Slide 10 text

Field length - Vector Space Model
[Figure: axes w1(fox) and w2(brown), with marks w1,d and 2*w1,d; the original document vector d is shown next to a longer document with same tfs and a shorter document with same tfs; the projection gives the score]
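
One common way to turn field length into such a scaling factor is 1/sqrt(number of terms), which is roughly what Lucene's default similarity does. The sketch below assumes that form and ignores field boosts and the lossy encoding Lucene applies when storing norms.

```python
import math


def length_norm(num_terms):
    """Rough length normalisation: shorter fields get a larger factor."""
    return 1.0 / math.sqrt(num_terms)


d1 = "the quick brown fox likes brown nuts".split()  # 7 tokens
d2 = "the red fox".split()                           # 3 tokens

print(round(length_norm(len(d1)), 3))  # 0.378
print(round(length_norm(len(d2)), 3))  # 0.577
```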

Slide 11

Slide 11 text

II: Document frequency
Words that appear more often in documents are less important than words that appear less often.
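
A small sketch of how document frequency can be turned into a term weight. It assumes the form 1 + ln(N / (df + 1)) used by Lucene's default similarity; the index size and document frequencies are hypothetical numbers.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency in the form 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


num_docs = 1000                # hypothetical index size
common = idf(900, num_docs)    # term found in almost every document
rare = idf(9, num_docs)        # term found in only a few documents

print(round(common, 3), round(rare, 3))  # 1.104 5.605
```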

Slide 12

Slide 12 text

Relevance: Even more tweaking!
Term weight - Vector Space Model
[Figure: axes w1(fox) and w2(brown); the original document vector d is shown next to the same vector with the weight for fox multiplied by 2 (w1,d becomes 2*w1,d); the projection gives the score]

Slide 13

Slide 13 text

How many of these factors are there?

Slide 14

Slide 14 text

Lucene Similarity
[Figure: the TFIDFSimilarity scoring formula, annotated as follows]
- score of a document d for a given query q
- query norm, does not fit on this slide
- core TF/IDF weight
- inverse document frequency for term t
- boost of query term t
- field length, some function turning the number of tokens into a float, roughly:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
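
For reference, the practical scoring function described in the TFIDFSimilarity documentation linked on this slide can be written as below. The last line is an assumption about the length factor the slide abbreviates with "roughly": it is the form used by Lucene's default similarity, before field boosts and norm encoding.

```latex
\[
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \Big( \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
t.\mathrm{getBoost}()\cdot \mathrm{norm}(t,d) \Big)
\]
\[
\mathrm{norm}(t,d) \approx \frac{1}{\sqrt{\text{number of terms in the field}}}
\]
```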

Slide 15

Slide 15 text

Explain api
If you do not understand the score:
    curl -XPOST "http://localhost:9200/idfidx/test/_search" -d'
    {
      "query": {
        "match": { "location": "berlin kreuzberg" }
      },
      "explain": true
    }'
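
The explanation returned for each hit is a nested tree, which is easier to read when printed with indentation. The helper below is a sketch that assumes each node has a `value`, a `description`, and an optional `details` list; the sample tree is hand-written with made-up numbers and is not real Elasticsearch output.

```python
def format_explanation(node, depth=0):
    """Return the explanation tree as indented lines, one per node."""
    line = "{}{:.4f}  {}".format("  " * depth, node["value"], node["description"])
    lines = [line]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines


# Hand-written sample with made-up numbers, for illustration only.
sample = {
    "value": 0.5,
    "description": "sum of:",
    "details": [
        {"value": 0.3, "description": "weight(location:berlin)", "details": []},
        {"value": 0.2, "description": "weight(location:kreuzberg)"},
    ],
}

print("\n".join(format_explanation(sample)))
```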

Slide 16

Slide 16 text

The point is...
- Text scoring by default is tuned for natural language text.
- Empirical scoring formula works well for articles, mails, reviews, etc.
- This way of scoring might be undesirable if the text represents tags.

Slide 17

Slide 17 text

function_score
- Tags should be scored differently from text
- Numerical field values should result in a score and not only score 0/1
- Sometimes we want to write our own scoring function!
(Disclaimer: Not all of this is new.)

Slide 18

Slide 18 text

function_score - basic structure
    "function_score": {
      "(query|filter)": {},
      "functions": [
        {
          "filter": {},
          "FUNCTION": {}
        },
        ...
      ]
    }
"(query|filter)": query or filter
"filter": Apply score computation only to docs matching a specific filter (default “match_all”)
"FUNCTION": Apply this function to matching docs
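
To make the placeholders concrete, the structure can be filled in and serialized from Python. This is a sketch under assumptions: `script_score` stands in for FUNCTION, and the `tags` and `popularity` fields are hypothetical names that do not come from the slides.

```python
import json

body = {
    "query": {
        "function_score": {
            # The main query selects documents and yields the text score.
            "query": {"match": {"location": "berlin kreuzberg"}},
            "functions": [
                {
                    # Only documents matching this filter get the function applied.
                    "filter": {"term": {"tags": "park"}},
                    # Hypothetical script: scale the score by a numeric field.
                    "script_score": {"script": "_score * doc['popularity'].value"},
                }
            ],
        }
    }
}

print(json.dumps(body, indent=2))
```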

Slide 19

Slide 19 text

Example for function score
https://gist.github.com/brwe/7049473

Slide 20

Slide 20 text

Decay Functions
Decay functions
• “gauss”
• “exp”
• “lin”
JSON structure
    "gauss": {
      "age": {
        "reference": 40,
        "scale": 5,
        "decay": 0.5,
        "offset": 5
      }
    }
"gauss": shape of decay curve
"age": field name
[Figure: decay curve annotated with reference, scale, decay and offset]
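
How the four parameters shape the curve can be sketched in Python. The formulas follow the decay functions as described in the Elasticsearch documentation, where a value at distance offset + scale from the reference scores exactly `decay`; the function name `decay_score` and the shape labels taken from the slide are assumptions of this sketch, not the library's API.

```python
import math


def decay_score(shape, value, reference, scale, decay=0.5, offset=0.0):
    """Score in [0, 1] that falls off with distance from the reference value."""
    # Values within `offset` of the reference are not penalised at all.
    distance = max(0.0, abs(value - reference) - offset)
    if shape == "gauss":
        sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
        return math.exp(-distance ** 2 / (2.0 * sigma_sq))
    if shape == "exp":
        lam = math.log(decay) / scale
        return math.exp(lam * distance)
    if shape == "lin":
        s = scale / (1.0 - decay)
        return max((s - distance) / s, 0.0)
    raise ValueError("unknown decay shape: " + shape)


# Parameters from the slide: reference 40, scale 5, decay 0.5, offset 5.
for age in (40, 45, 50, 55):
    print(age, "{:.4f}".format(decay_score("gauss", age, 40, 5, 0.5, 5)))
```

With these parameters, ages 40 and 45 score 1.0 (inside the offset), age 50 scores 0.5 (one scale beyond the offset), and age 55 scores 0.0625.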

Slide 21

Slide 21 text

[Figure: plot with axes Income and Experience]

Slide 22

Slide 22 text

function_score - even more parameters!
    "function_score": {
      "(query|filter)": {},
      "boost": 2,
      "functions": [
        {
          "filter": {},
          "FUNCTION": {}
        },
        ...
      ],
      "max_boost": 10.0,
      "score_mode": "(mult|max|...)",
      "boost_mode": "(mult|replace|...)"
    }
"(query|filter)": query score
"filter": Apply score computation only to docs matching a specific filter (default “match_all”)
"FUNCTION": Apply this function to matching docs
"max_boost": limit boost to 10
"score_mode": Result of the different filter/function pairs should be summed, multiplied, ...
"boost_mode": Merge with query score by multiply, add, ...
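
The interplay of score_mode, max_boost and boost_mode can be sketched as a small Python function. This is a simplification under assumptions: the mode names are spelled out as "multiply", "sum", "avg", "max" and "min" rather than the abbreviations on the slide, the top-level "boost" is left out, and at least one function score is expected.

```python
from functools import reduce
import operator

COMBINERS = {
    "multiply": lambda xs: reduce(operator.mul, xs, 1.0),
    "sum": sum,
    "avg": lambda xs: sum(xs) / len(xs),
    "max": max,
    "min": min,
}


def final_score(query_score, function_scores, score_mode="multiply",
                boost_mode="multiply", max_boost=float("inf")):
    """Combine function scores, cap them, then merge with the query score."""
    combined = COMBINERS[score_mode](function_scores)
    combined = min(combined, max_boost)  # max_boost limits the function part
    if boost_mode == "replace":
        return combined                  # the query score is ignored
    return COMBINERS[boost_mode]([query_score, combined])


print(final_score(2.0, [3.0, 4.0], score_mode="sum"))                        # 14.0
print(final_score(2.0, [3.0, 4.0], score_mode="multiply", max_boost=10.0))   # 20.0
print(final_score(2.0, [3.0, 4.0], score_mode="max", boost_mode="replace"))  # 4.0
```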

Slide 23

Slide 23 text

The downside
function_score functions and their combination can be arbitrarily complex => hard to tune the parameters.

Slide 24

Slide 24 text

Coming up next in elasticsearch...
Script scoring with term vectors - build your own fancy natural language scoring model! #3772