
Evaluating the relevance of search results


How to automatically assess the relevance of search results

Dominik Goltermann

March 02, 2018



Transcript

  1. When you search for “dog food” in an online shop...

    What is relevance?

     High relevance:
     1. Taste of the Wild Dry Dog Food, High Prairie Canine…
     2. PEDIGREE Adult Complete Nutrition Roasted Chicken, Rice…
     3. Diamond Naturals Dry Food for Adult Dog, Beef and Rice…

     Low relevance:
     1. Will Clark (Baseball Card) 1993 Milk-Bone Super Stars - Dog Food Issue [Base] #7
     2. 2.5 lb Paperboard food trays for French Fries, Hot Dogs, Carnival, Arts and Crafts 50 Pack
     3. The Sims 2: Kitchen & Bath Interior Design Stuff - PC
  2. Typical development process when implementing a search functionality

     • Multiple boosting values to adjust
     • Many ways to analyze text (ngrams of different lengths, language settings and many more options)
     • Different ways of mixing and defining signals
     • ...

     (Diagram: a cycle between “Changes” and “Testing”)
  3. Example: multiple boosting values in a simple query

     {
       "query": {
         "multi_match": {
           "query": "click tracking",
           "type": "best_fields",
           "fields": ["title^5", "brand^2", "description", "tags^1.5"],
           "tie_breaker": 0.4
         }
       }
     }

     Which numbers yield the best results across all queries?!
  4. Typical development process when implementing a search functionality

     • Multiple boosting values to adjust
     • Many ways to analyze text (ngrams of different lengths, language settings and many more options)
     • Different ways of mixing and defining signals
     • ...

     (Diagram: a cycle between “Changes” and “Testing”)
  5. Automatic evaluation of Information Retrieval systems

     Ground truth: for each query, which documents are relevant?

     Queries: “dog food”, “pedigree dog food”, ...
     • Wild Dry Dog Food: relevant / not relevant / ...
     • PEDIGREE Adult Complete Nutrition: relevant / relevant / ...
     • Will Clark (Baseball Card) 1993 Milk-Bone Super Stars - Dog Food Issue [Base] #7: not relevant / not relevant / ...

     Per-query result sets with relevance judgements:

     Query “dog food”        Relevant?
       Wild Dry Dog Food     yes
       Wet dog food          yes
       Cat food stick        no
       ...

     Query “dog”             Relevant?
       Wet dog food          yes
       Dog brush             yes
       Cat brush             no
       ...

     Query “cat brush”       Relevant?
       Cat food              no
       Dog brush             no
       Cat brush             yes
       ...
  6. Recall: how many of the relevant documents have been found?

     recall = n_retrieved_relevant / n_relevant

     “How many of all the relevant documents were retrieved?”
  7. Recall: how many of the relevant documents have been found?

     recall = n_retrieved_relevant / n_relevant

     Query: “dog food”

     Relevant documents:
     • Wild Dry Dog Food
     • Dry Food for Adult Dog
     • PEDIGREE Adult Complete Nutrition Roasted Chicken

     What did the search actually retrieve?
  8. Recall: how many of the relevant documents have been found?

     recall = n_retrieved_relevant / n_relevant

     Query: “dog food”

     Relevant documents:
     • Wild Dry Dog Food
     • Dry Food for Adult Dog
     • PEDIGREE Adult Complete Nutrition Roasted Chicken

     Retrieved documents:
     • Wild Dry Dog Food
     • Cat food
     • Dry Food for Adult Dog
     • Rabbit food

     What’s the recall of that?
  9. Recall: how many of the relevant documents have been found?

     recall = n_retrieved_relevant / n_relevant

     Query: “dog food”

     Relevant documents:
     • Wild Dry Dog Food
     • Dry Food for Adult Dog
     • PEDIGREE Adult Complete Nutrition Roasted Chicken

     Retrieved documents:
     • Wild Dry Dog Food
     • Cat food
     • Dry Food for Adult Dog
     • Rabbit food

     Recall for this query: 2/3
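     A minimal Python sketch of this recall calculation, using the example’s document titles as identifiers (the talk itself shows no code for this):

        # Recall: how many of the relevant documents were retrieved?
        def recall(retrieved, relevant):
            retrieved_relevant = set(retrieved) & set(relevant)
            return len(retrieved_relevant) / len(relevant)

        relevant = {
            "Wild Dry Dog Food",
            "Dry Food for Adult Dog",
            "PEDIGREE Adult Complete Nutrition Roasted Chicken",
        }
        retrieved = ["Wild Dry Dog Food", "Cat food", "Dry Food for Adult Dog", "Rabbit food"]

        print(recall(retrieved, relevant))  # 2/3 ≈ 0.67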
  10. Precision: how many of the found documents are relevant?

      precision = n_retrieved_relevant / n_retrieved

      Query: “dog food”

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food
      • Cat food
      • Dry Food for Adult Dog
      • Rabbit food

      What’s the precision of that?
  11. Precision: how many of the found documents are relevant?

      precision = n_retrieved_relevant / n_retrieved

      Query: “dog food”

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food
      • Cat Food
      • Dry Food for Adult Dog
      • Rabbit Food

      Precision for this query: 2/4
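     The matching sketch for precision, with the same stand-in data:

        # Precision: how many of the retrieved documents are relevant?
        def precision(retrieved, relevant):
            retrieved_relevant = set(retrieved) & set(relevant)
            return len(retrieved_relevant) / len(retrieved)

        relevant = {
            "Wild Dry Dog Food",
            "Dry Food for Adult Dog",
            "PEDIGREE Adult Complete Nutrition Roasted Chicken",
        }
        retrieved = ["Wild Dry Dog Food", "Cat Food", "Dry Food for Adult Dog", "Rabbit Food"]

        print(precision(retrieved, relevant))  # 2/4 = 0.5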
  12. Precision vs. recall: what does it mean?

      Recall: how many of the relevant documents have been found? Does it find everything?
      Precision: how many of the found documents are relevant? Is what it found relevant?
  13. (Mean) Average Precision: one value to assess the quality of your search solution

      For every relevant result at position k in the results of query q, calculate the precision of the first k documents. The average of those values is AP(q). Repeat for each query in Q, average again, and you get the mAP.

      Example of one query’s AP calculation:

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food
      • Cat Food
      • Dry Food for Adult Dog
      • Rabbit Food
  14. (Mean) Average Precision: one value to assess the quality of your search solution

      For every relevant result at position k in the results of query q, calculate the precision of the first k documents. The average of those values is AP(q). Repeat for each query in Q, average again, and you get the mAP.

      Example of one query’s AP calculation:

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food       → 1/1
      • Cat Food
      • Dry Food for Adult Dog
      • Rabbit Food

      First result is relevant. Its precision is 1/1.
  15. (Mean) Average Precision: one value to assess the quality of your search solution

      For every relevant result at position k in the results of query q, calculate the precision of the first k documents. The average of those values is AP(q). Repeat for each query in Q, average again, and you get the mAP.

      Example of one query’s AP calculation:

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food       → 1/1
      • Cat Food                → skipped
      • Dry Food for Adult Dog
      • Rabbit Food

      Second result is not relevant. It is skipped.
  16. (Mean) Average Precision: one value to assess the quality of your search solution

      For every relevant result at position k in the results of query q, calculate the precision of the first k documents. The average of those values is AP(q). Repeat for each query in Q, average again, and you get the mAP.

      Example of one query’s AP calculation:

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food       → 1/1
      • Cat Food                → skipped
      • Dry Food for Adult Dog  → 2/3
      • Rabbit Food

      Third result is relevant. Its precision is 2/3.
  17. (Mean) Average Precision: one value to assess the quality of your search solution

      For every relevant result at position k in the results of query q, calculate the precision of the first k documents. The average of those values is AP(q). Repeat for each query in Q, average again, and you get the mAP.

      Example of one query’s AP calculation:

      Relevant documents:
      • Wild Dry Dog Food
      • Dry Food for Adult Dog
      • PEDIGREE Adult Complete Nutrition Roasted Chicken

      Retrieved documents:
      • Wild Dry Dog Food       → 1/1
      • Cat Food                → skipped
      • Dry Food for Adult Dog  → 2/3
      • Rabbit Food

      No more relevant documents are in the results. The final AP (average precision) value is the average of all precisions: AP = (1/1 + 2/3) / 2 = 0.83
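     A minimal sketch of the AP calculation walked through above, following the definition used on these slides (average of precision@k over the relevant results that were found):

        def average_precision(retrieved, relevant):
            precisions, hits = [], 0
            for k, doc in enumerate(retrieved, start=1):
                if doc in relevant:
                    hits += 1
                    precisions.append(hits / k)  # precision of the first k documents
            return sum(precisions) / len(precisions) if precisions else 0.0

        relevant = {
            "Wild Dry Dog Food",
            "Dry Food for Adult Dog",
            "PEDIGREE Adult Complete Nutrition Roasted Chicken",
        }
        retrieved = ["Wild Dry Dog Food", "Cat Food", "Dry Food for Adult Dog", "Rabbit Food"]

        print(average_precision(retrieved, relevant))  # (1/1 + 2/3) / 2 ≈ 0.83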
  18. (Mean) Average Precision: one value to assess the quality of your search solution

      The AP is calculated for every query, and the average of all those values is the final Mean Average Precision. The higher the number, the more relevant your query results are. You always get a number between 0 and 1.

      It is practical to look only at the top X search results, since most users do not look through multiple pages of results.
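     And a sketch of the mAP step, reusing the result sets from slide 5 as toy data; in practice the ground truth and result lists would come from your own dataset and search backend:

        def average_precision(retrieved, relevant):
            precisions, hits = [], 0
            for k, doc in enumerate(retrieved, start=1):
                if doc in relevant:
                    hits += 1
                    precisions.append(hits / k)
            return sum(precisions) / len(precisions) if precisions else 0.0

        def mean_average_precision(results_per_query, ground_truth, top_k=10):
            # Average the per-query AP values, looking only at the top_k results.
            aps = [
                average_precision(results[:top_k], ground_truth[query])
                for query, results in results_per_query.items()
            ]
            return sum(aps) / len(aps)

        ground_truth = {
            "dog food": {"Wild Dry Dog Food", "Wet dog food"},
            "dog": {"Wet dog food", "Dog brush"},
            "cat brush": {"Cat brush"},
        }
        results_per_query = {
            "dog food": ["Wild Dry Dog Food", "Wet dog food", "Cat food stick"],
            "dog": ["Wet dog food", "Dog brush", "Cat brush"],
            "cat brush": ["Cat food", "Dog brush", "Cat brush"],
        }
        print(mean_average_precision(results_per_query, ground_truth))  # ≈ 0.78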
  19. (Mean) Average Precision: why is it better than precision?

      The same four retrieved documents in three different rankings (relevant results marked with their precision@k):

      Ranking A:
      • Wild Dry Dog Food       → 1/1
      • Cat Food
      • Dry Food for Adult Dog  → 2/3
      • Rabbit Food
      Precision: 2/4 = 0.5    AP: (1/1 + 2/3) / 2 = 0.83

      Ranking B (relevant results at positions 3 and 4):
      • Cat Food
      • Rabbit Food
      • Wild Dry Dog Food       → 1/3
      • Dry Food for Adult Dog  → 2/4
      Precision: 2/4 = 0.5    AP: (1/3 + 2/4) / 2 = 0.42

      Ranking C (relevant results at positions 1 and 2):
      • Wild Dry Dog Food       → 1/1
      • Dry Food for Adult Dog  → 2/2
      • Cat Food
      • Rabbit Food
      Precision: 2/4 = 0.5    AP: (1/1 + 2/2) / 2 = 1.0

      Plain precision is identical for all three rankings; AP rewards putting relevant documents higher. Precision also does not tell you whether relevant documents are missing from the results; use unit tests or other metrics for that.
  20. Example: multiple boosting values in a simple query

      {
        "query": {
          "multi_match": {
            "query": "click tracking",
            "type": "best_fields",
            "fields": ["title^5", "brand^2", "description", "tags^1.5"],
            "tie_breaker": 0.4
          }
        }
      }

      Tie breaker    Mean Average Precision
      0.4            0.67
      0.7            0.69
      1.2            0.75
      1.5            0.56
  21. This is nice, but ... where do I get the ground truth from?

      Building a database that records, for each query, which of all the possible result documents are relevant is very difficult, if not infeasible. We defined only a few relevant items for roughly 400 queries using a dedicated app, and that already helped a lot.
  22. This is nice, but ... where do I get the ground truth from?

      Example of a resulting ground truth dataset:
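     The dataset itself is not reproduced in this transcript. Purely as an illustration, such a dataset could be as simple as a mapping from each query to the titles that were explicitly judged (all values below are made up from the deck's earlier examples):

        # Illustrative only: the actual dataset shown on this slide is not included here.
        ground_truth = {
            "dog food": {
                "relevant": ["Wild Dry Dog Food", "Wet dog food"],
                "not_relevant": ["Cat food stick"],
            },
            "cat brush": {
                "relevant": ["Cat brush"],
                "not_relevant": ["Cat food", "Dog brush"],
            },
        }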
  23. What do you get from all this work? More confidence in quality!

      • When you change the search functionality, you can be much more confident about how this affects the resulting quality.
      • You can use it to tune parameters to get more relevant results (though you risk overfitting).
      • You can quickly validate assumptions about how users search and what they expect.
      • You can validate whether a new feature really makes the experience better.
      • You can pair this type of testing with unit tests that assert that certain queries do or do not find certain documents, for a thorough overall assessment (see the sketch below).
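     One way the unit-test pairing from the last bullet could look, sketched with pytest-style assertions; search_titles is a hypothetical wrapper around your own search backend and is not from the talk:

        # Hypothetical helper: returns result titles for a query, in ranked order.
        from myshop.search import search_titles  # not a real module, stands in for your own code

        def test_dog_food_finds_dog_food_products():
            titles = search_titles("dog food")
            assert "Wild Dry Dog Food" in titles[:3]

        def test_dog_food_does_not_return_baseball_cards():
            titles = search_titles("dog food")
            assert all("Baseball Card" not in title for title in titles)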
  24. Tracking usage to define relevant results: a lower-effort approach

      We now track search usage in production and log which results users select, in order to build a bigger dataset. Such click data can also be your first source of ground truth if you already have a search in production where you can collect it.
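     A rough sketch of how such click data could be folded into the ground truth; the log format (query, clicked title) and the click threshold are assumptions for illustration only:

        from collections import Counter, defaultdict

        # Hypothetical click log: one (query, clicked result title) pair per selection.
        click_log = [
            ("dog food", "Wild Dry Dog Food"),
            ("dog food", "Wild Dry Dog Food"),
            ("dog food", "Wet dog food"),
            ("cat brush", "Cat brush"),
        ]

        MIN_CLICKS = 2  # arbitrary threshold: require repeated clicks before trusting a judgement
        clicks = Counter(click_log)

        ground_truth = defaultdict(set)
        for (query, title), n in clicks.items():
            if n >= MIN_CLICKS:
                ground_truth[query].add(title)

        print(dict(ground_truth))  # {'dog food': {'Wild Dry Dog Food'}}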