
Evaluating the relevance of search results


How to automatically assess the relevance of search results

Dominik Goltermann

March 02, 2018



Transcript

  1. When you search for “dog food” in an online shop...

    What is relevance?

     High relevance:
     1. Taste of the Wild Dry Dog Food, High Prairie Canine…
     2. PEDIGREE Adult Complete Nutrition Roasted Chicken, Rice…
     3. Diamond Naturals Dry Food for Adult Dog, Beef and Rice…

     Low relevance:
     1. Will Clark (Baseball Card) 1993 Milk-Bone Super Stars - Dog Food Issue [Base] #7
     2. 2.5 lb Paperboard food trays for French Fries, Hot Dogs, Carnival, Arts and Crafts 50 Pack
     3. The Sims 2: Kitchen & Bath Interior Design Stuff - PC
  2. Typical development process when implementing a search functionality

     • Multiple boosting values to adjust
     • Many ways to analyze text (ngrams of different lengths, language settings and many more options)
     • Different ways of mixing and defining signals
     • ...

     (Diagram: a cycle between “Changes” and “Testing”)
  3. Example: multiple boosting values in a simple query

     {
       "query": {
         "multi_match": {
           "query": "click tracking",
           "type": "best_fields",
           "fields": ["title^5", "brand^2", "description", "tags^1.5"],
           "tie_breaker": 0.4
         }
       }
     }

     Which numbers yield the best results across all queries?!
  4. Typical development process when implementing a search functionality

     • Multiple boosting values to adjust
     • Many ways to analyze text (ngrams of different lengths, language settings and many more options)
     • Different ways of mixing and defining signals
     • ...

     (Diagram: a cycle between “Changes” and “Testing”)
  5. Automatic evaluation of Information Retrieval systems

     Ground truth: for each query, which documents are relevant?

     Queries: “dog food”, “pedigree dog food”, ...
     • Wild Dry Dog Food: relevant / not relevant / ...
     • PEDIGREE Adult Complete Nutrition: relevant / relevant / ...
     • Will Clark (Baseball Card) 1993 Milk-Bone Super Stars - Dog Food Issue [Base] #7: not relevant / not relevant / ...

     Per-query result sets with relevance judgements:

     Query “dog food”        Relevant?
       Wild Dry Dog Food     yes
       Wet dog food          yes
       Cat food stick        no
       ...

     Query “dog”             Relevant?
       Wet dog food          yes
       Dog brush             yes
       Cat brush             no
       ...

     Query “cat brush”       Relevant?
       Cat food              no
       Dog brush             no
       Cat brush             yes
       ...
  6. Recall: how many of the relevant documents have been found?

     recall = n_retrieved_relevant / n_relevant

     “How many of all the relevant documents were retrieved?”
  7. Recall: how many of the relevant documents have been found?

     recall = n_retrieved_relevant / n_relevant

     Query: “dog food”

     Relevant documents:
     • Wild Dry Dog Food
     • Dry Food for Adult Dog
     • PEDIGREE Adult Complete Nutrition Roasted Chicken

     What did the search actually retrieve?
  8. Recall: how many of the relevant documents have been found?

     recall = n_retrieved_relevant / n_relevant

     Query: “dog food”

     Relevant documents:
     • Wild Dry Dog Food
     • Dry Food for Adult Dog
     • PEDIGREE Adult Complete Nutrition Roasted Chicken

     Retrieved documents:
     • Wild Dry Dog Food
     • Cat food
     • Dry Food for Adult Dog
     • Rabbit food

     What’s the recall of that?
  9. Recall: how many of the relevant documents have been found?

     recall = n_retrieved_relevant / n_relevant

     Query: “dog food”

     Relevant documents:
     • Wild Dry Dog Food
     • Dry Food for Adult Dog
     • PEDIGREE Adult Complete Nutrition Roasted Chicken

     Retrieved documents:
     • Wild Dry Dog Food
     • Cat food
     • Dry Food for Adult Dog
     • Rabbit food

     Recall for this query: 2/3
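     A minimal Python sketch of this recall calculation, using the example’s document titles as identifiers (the talk itself shows no code for this):

        # Recall: how many of the relevant documents were retrieved?
        def recall(retrieved, relevant):
            retrieved_relevant = set(retrieved) & set(relevant)
            return len(retrieved_relevant) / len(relevant)

        relevant = {
            "Wild Dry Dog Food",
            "Dry Food for Adult Dog",
            "PEDIGREE Adult Complete Nutrition Roasted Chicken",
        }
        retrieved = ["Wild Dry Dog Food", "Cat food", "Dry Food for Adult Dog", "Rabbit food"]

        print(recall(retrieved, relevant))  # 2/3 ≈ 0.67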
  10. Precision: how many of the found documents are relevant?

      precision = n_retrieved_relevant / n_retrieved

      Query: “dog food”

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food
      • Cat food
      • Dry Food for Adult Dog
      • Rabbit food

      What’s the precision of that?
  11. Precision: how many of the found documents are relevant?

      precision = n_retrieved_relevant / n_retrieved

      Query: “dog food”

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food
      • Cat Food
      • Dry Food for Adult Dog
      • Rabbit Food

      Precision for this query: 2/4
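     The matching sketch for precision, with the same stand-in data:

        # Precision: how many of the retrieved documents are relevant?
        def precision(retrieved, relevant):
            retrieved_relevant = set(retrieved) & set(relevant)
            return len(retrieved_relevant) / len(retrieved)

        relevant = {
            "Wild Dry Dog Food",
            "Dry Food for Adult Dog",
            "PEDIGREE Adult Complete Nutrition Roasted Chicken",
        }
        retrieved = ["Wild Dry Dog Food", "Cat Food", "Dry Food for Adult Dog", "Rabbit Food"]

        print(precision(retrieved, relevant))  # 2/4 = 0.5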
  12. Precision vs. recall: what does it mean?

      Recall: how many of the relevant documents have been found? Does it find everything?
      Precision: how many of the found documents are relevant? Is what it found relevant?
  13. (Mean) Average Precision: one value to assess the quality of your search solution

      For every relevant result at position k in the results of query q, calculate the precision of the first k documents. The average of those values is AP(q). Repeat for each query in Q, average again, and you get the mAP.

      Example of one query’s AP calculation:

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food
      • Cat Food
      • Dry Food for Adult Dog
      • Rabbit Food
  14. (Mean) Average Precision: one value to assess the quality of your search solution

      For every relevant result at position k in the results of query q, calculate the precision of the first k documents. The average of those values is AP(q). Repeat for each query in Q, average again, and you get the mAP.

      Example of one query’s AP calculation:

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food       → 1/1
      • Cat Food
      • Dry Food for Adult Dog
      • Rabbit Food

      First result is relevant. Its precision is 1/1.
  15. (Mean) Average Precision: one value to assess the quality of your search solution

      For every relevant result at position k in the results of query q, calculate the precision of the first k documents. The average of those values is AP(q). Repeat for each query in Q, average again, and you get the mAP.

      Example of one query’s AP calculation:

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food       → 1/1
      • Cat Food                → skipped
      • Dry Food for Adult Dog
      • Rabbit Food

      Second result is not relevant. It is skipped.
  16. (Mean) Average Precision: one value to assess the quality of your search solution

      For every relevant result at position k in the results of query q, calculate the precision of the first k documents. The average of those values is AP(q). Repeat for each query in Q, average again, and you get the mAP.

      Example of one query’s AP calculation:

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food       → 1/1
      • Cat Food                → skipped
      • Dry Food for Adult Dog  → 2/3
      • Rabbit Food

      Third result is relevant. Its precision is 2/3.
  17. (Mean) Average Precision: one value to assess the quality of your search solution

      For every relevant result at position k in the results of query q, calculate the precision of the first k documents. The average of those values is AP(q). Repeat for each query in Q, average again, and you get the mAP.

      Example of one query’s AP calculation:

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food       → 1/1
      • Cat Food                → skipped
      • Dry Food for Adult Dog  → 2/3
      • Rabbit Food

      No more relevant documents are in the results. The final AP (average precision) value is the average of all precisions: AP = (1/1 + 2/3) / 2 = 0.83
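     A minimal sketch of the AP calculation walked through above, following the definition used on these slides (average of precision@k over the relevant results that were found):

        def average_precision(retrieved, relevant):
            precisions, hits = [], 0
            for k, doc in enumerate(retrieved, start=1):
                if doc in relevant:
                    hits += 1
                    precisions.append(hits / k)  # precision of the first k documents
            return sum(precisions) / len(precisions) if precisions else 0.0

        relevant = {
            "Wild Dry Dog Food",
            "Dry Food for Adult Dog",
            "PEDIGREE Adult Complete Nutrition Roasted Chicken",
        }
        retrieved = ["Wild Dry Dog Food", "Cat Food", "Dry Food for Adult Dog", "Rabbit Food"]

        print(average_precision(retrieved, relevant))  # (1/1 + 2/3) / 2 ≈ 0.83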
  18. (Mean) Average Precision: one value to assess the quality of your search solution

      The AP is calculated for every query, and the average of all those values is the final Mean Average Precision. The higher the number, the more relevant your query results are. You always get a number between 0 and 1.

      It is practical to look only at the top X search results, since most users do not look through multiple pages of results.
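     And a sketch of the mAP step, reusing the result sets from slide 5 as toy data; in practice the ground truth and result lists would come from your own dataset and search backend:

        def average_precision(retrieved, relevant):
            precisions, hits = [], 0
            for k, doc in enumerate(retrieved, start=1):
                if doc in relevant:
                    hits += 1
                    precisions.append(hits / k)
            return sum(precisions) / len(precisions) if precisions else 0.0

        def mean_average_precision(results_per_query, ground_truth, top_k=10):
            # Average the per-query AP values, looking only at the top_k results.
            aps = [
                average_precision(results[:top_k], ground_truth[query])
                for query, results in results_per_query.items()
            ]
            return sum(aps) / len(aps)

        ground_truth = {
            "dog food": {"Wild Dry Dog Food", "Wet dog food"},
            "dog": {"Wet dog food", "Dog brush"},
            "cat brush": {"Cat brush"},
        }
        results_per_query = {
            "dog food": ["Wild Dry Dog Food", "Wet dog food", "Cat food stick"],
            "dog": ["Wet dog food", "Dog brush", "Cat brush"],
            "cat brush": ["Cat food", "Dog brush", "Cat brush"],
        }
        print(mean_average_precision(results_per_query, ground_truth))  # ≈ 0.78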
  19. (Mean) Average Precision: why is it better than precision?

      The same four retrieved documents in three different rankings (relevant results marked with their precision@k):

      Ranking A:
      • Wild Dry Dog Food       → 1/1
      • Cat Food
      • Dry Food for Adult Dog  → 2/3
      • Rabbit Food
      Precision: 2/4 = 0.5    AP: (1/1 + 2/3) / 2 = 0.83

      Ranking B (relevant results at positions 3 and 4):
      • Cat Food
      • Rabbit Food
      • Wild Dry Dog Food       → 1/3
      • Dry Food for Adult Dog  → 2/4
      Precision: 2/4 = 0.5    AP: (1/3 + 2/4) / 2 = 0.42

      Ranking C (relevant results at positions 1 and 2):
      • Wild Dry Dog Food       → 1/1
      • Dry Food for Adult Dog  → 2/2
      • Cat Food
      • Rabbit Food
      Precision: 2/4 = 0.5    AP: (1/1 + 2/2) / 2 = 1.0

      Plain precision is identical for all three rankings; AP rewards putting relevant documents higher. Precision also does not tell you whether relevant documents are missing from the results; use unit tests or other metrics for that.
  20. Example: multiple boosting values in a simple query

      {
        "query": {
          "multi_match": {
            "query": "click tracking",
            "type": "best_fields",
            "fields": ["title^5", "brand^2", "description", "tags^1.5"],
            "tie_breaker": 0.4
          }
        }
      }

      Tie breaker    Mean Average Precision
      0.4            0.67
      0.7            0.69
      1.2            0.75
      1.5            0.56
  21. This is nice, but ... where do I get the ground truth from?

      Building a database that records, for each query, which of all the possible result documents are relevant is very difficult, if not infeasible. We defined only a few relevant items for roughly 400 queries using a dedicated app, and that already helped a lot.
  22. This is nice, but ... where do I get the ground truth from?

      Example of a resulting ground truth dataset:
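     The dataset itself is not reproduced in this transcript. Purely as an illustration, such a dataset could be as simple as a mapping from each query to the titles that were explicitly judged (all values below are made up from the deck's earlier examples):

        # Illustrative only: the actual dataset shown on this slide is not included here.
        ground_truth = {
            "dog food": {
                "relevant": ["Wild Dry Dog Food", "Wet dog food"],
                "not_relevant": ["Cat food stick"],
            },
            "cat brush": {
                "relevant": ["Cat brush"],
                "not_relevant": ["Cat food", "Dog brush"],
            },
        }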
  23. What do you get from all this work? More confidence in quality!

      • When you change the search functionality, you can be much more confident about how this affects the resulting quality.
      • You can use it to tune parameters to get more relevant results (though you risk overfitting).
      • You can quickly validate assumptions about how users search and what they expect.
      • You can validate whether a new feature really makes the experience better.
      • You can pair this type of testing with unit tests that assert that certain queries do or do not find certain documents, for a thorough overall assessment (see the sketch below).
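     One way the unit-test pairing from the last bullet could look, sketched with pytest-style assertions; search_titles is a hypothetical wrapper around your own search backend and is not from the talk:

        # Hypothetical helper: returns result titles for a query, in ranked order.
        from myshop.search import search_titles  # not a real module, stands in for your own code

        def test_dog_food_finds_dog_food_products():
            titles = search_titles("dog food")
            assert "Wild Dry Dog Food" in titles[:3]

        def test_dog_food_does_not_return_baseball_cards():
            titles = search_titles("dog food")
            assert all("Baseball Card" not in title for title in titles)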
  24. Tracking usage to define relevant results: a lower-effort approach

      We now track search usage in production and log which results users select, in order to build a bigger dataset. Such click data can also be your first source of ground truth if you already have a search in production where you can collect it.
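     A rough sketch of how such click data could be folded into the ground truth; the log format (query, clicked title) and the click threshold are assumptions for illustration only:

        from collections import Counter, defaultdict

        # Hypothetical click log: one (query, clicked result title) pair per selection.
        click_log = [
            ("dog food", "Wild Dry Dog Food"),
            ("dog food", "Wild Dry Dog Food"),
            ("dog food", "Wet dog food"),
            ("cat brush", "Cat brush"),
        ]

        MIN_CLICKS = 2  # arbitrary threshold: require repeated clicks before trusting a judgement
        clicks = Counter(click_log)

        ground_truth = defaultdict(set)
        for (query, title), n in clicks.items():
            if n >= MIN_CLICKS:
                ground_truth[query].add(title)

        print(dict(ground_truth))  # {'dog food': {'Wild Dry Dog Food'}}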