
Made To Measure: Ranking Evaluation using Elasticsearch

The evaluation of search ranking results is an important task for every search engineer. The new Elasticsearch Ranking Evaluation API makes it easier to measure well-known information retrieval metrics like Precision@K or NDCG. This helps with making better relevance-tuning decisions and makes it possible to evaluate and optimize query templates over a wider range of user needs.

Christoph Büscher

December 12, 2018

Transcript

  1. Christoph Büscher, 12 Dec 2018, Elasticsearch Berlin Meetup, @dalatangi. Made to measure: Ranking Evaluation using Elasticsearch
  2. "If you cannot measure it, you cannot improve it!" AlmostAnActualQuoteTM by Lord Kelvin. Image: https://commons.wikimedia.org/wiki/File:Portrait_of_William_Thomson,_Baron_Kelvin.jpg
  3. How good is your search? Image by Kecko, https://www.flickr.com/photos/kecko/18146364972 (CC BY 2.0)
  4. Ranking Evaluation: a repeatable way to quickly measure the quality of search results over a wide range of user needs
  5. REPEATABILITY
     • Automate: don’t make people look at screens
     • no gut feeling / “management-driven” ad-hoc search ranking
  6. QUALITY MEASURE
     • numeric output
     • support for different metrics
     • define “quality” in your domain
  7. USER NEEDS
     • optimize across a wider range of use cases (aka “information needs”)
     • think about what the majority of your users want
     • collect data to discover what is important for your use case
  8. Things needed for Ranking Evaluation
     1. Define a set of typical information needs
     2. For each case, get a small set of candidate documents (both relevant and irrelevant)
     3. Rate these documents (either binary relevant/non-relevant or on some graded scale)
     4. Choose a metric to calculate. Some good metrics already defined in Information Retrieval research include Precision@K, (N)DCG, ERR, Reciprocal Rank, etc. (see the sketch after this list for what these ingredients can look like as data)
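
To make these ingredients tangible, here is a minimal Python sketch of what they might look like as data. The queries, document IDs and ratings below are invented placeholders, not taken from the talk.

```python
# Invented placeholders illustrating the ingredients of a ranking evaluation.

# 1. typical information needs, expressed as the queries users would run
information_needs = {
    "cheap_hotel_amsterdam": "hotel amsterdam",
    "jfk_airport": "JFK",
}

# 2./3. per need, a small set of candidate documents with relevance ratings
# (here on a graded scale: 0 = irrelevant ... 3 = highly relevant)
ratings = {
    "cheap_hotel_amsterdam": {"doc_17": 3, "doc_42": 1, "doc_99": 0},
    "jfk_airport": {"doc_3": 3, "doc_8": 1},
}

# 4. a metric to calculate, e.g. precision@k with k=5
metric = {"precision": {"k": 5, "relevant_rating_threshold": 2}}
```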
  9. Search Evaluation Continuum (diagram): approaches ordered from fast, little preparation time and few people looking at screens to slow, lots of preparation and lots of people looking at screens: some sort of unit test, Ranking Evaluation, QA assisted by scripts, user studies, A/B testing.
  10. Where Ranking Evaluation can help
     • Development: guiding design decisions, enabling quick iteration
     • Communication tool: helps define “search quality” more clearly, forces stakeholders to “get real” about their expectations
     • Production: monitor changes, spot degradations
  11. Ranking Evaluation API

GET /my_index/_rank_eval
{
  "metric": { "mean_reciprocal_rank": { [...] } },
  "templates": [{ [...] }],
  "requests": [{
    "template_id": "my_query_template",
    "ratings": [...],
    "params": { "query_string": "hotel amsterdam", "field": "text" }
    [...]
  }]
}

     • introduced in 6.2 (still an experimental API)
     • joint work with Isabel Drost-Fromm (@MaineC)
     • inputs:
       • a set of search requests (“information needs”)
       • document ratings for each request
       • a metric definition; currently available: Precision@K, (N)DCG, Expected Reciprocal Rank, MRR, …
     (see the sketch below for an end-to-end call)
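
For readers who want to try this locally, here is a minimal Python sketch of sending a _rank_eval request over HTTP. The local cluster URL, index name, query, document IDs and ratings are illustrative assumptions, not taken from the slides.

```python
import requests

# Minimal sketch, assuming a local Elasticsearch cluster on localhost:9200
# and an index called "my_index"; document IDs and ratings are made up.
body = {
    "requests": [
        {
            "id": "hotel_amsterdam",
            "request": {"query": {"match": {"text": "hotel amsterdam"}}},
            "ratings": [
                {"_index": "my_index", "_id": "doc_1", "rating": 3},
                {"_index": "my_index", "_id": "doc_2", "rating": 0},
            ],
        }
    ],
    "metric": {"mean_reciprocal_rank": {"relevant_rating_threshold": 1}},
}

resp = requests.post("http://localhost:9200/my_index/_rank_eval", json=body)
resp.raise_for_status()
# the response structure is shown on the following slides and parsed further below
```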

  12. Ranking Evaluation API details

metric:
"metric": { "precision": { "relevant_rating_threshold": "2", "k": 5 } }

requests:
"requests": [{ "id": "JFK_query", "request": { "query": { [...] } }, "ratings": [...] }, ... other use cases ...]

ratings:
"ratings": [ { "_id": "3054546", "rating": 3 }, { "_id": "5119376", "rating": 1 }, [...] ]

(the sketch below shows how these sections compose into one request body)
  13. { "rank_eval": { "metric_score": 0.431, "details": { "my_query_id1": { "metric_score":

    0.6, "unrated_docs": [ { "_index": "idx", "_id": "1960795" }, [...] ], "hits": [...], "metric_details": { “precision" : { “relevant_docs_retrieved": 6,
 "docs_retrieved": 10 } } }, "my_query_id2" : { [...] } } } } !16 _rank_eval response overall score details per query maybe rate those? details about metric
  14. How to get document ratings?
     1. Define a set of typical information needs of your users (e.g. analyze logs, ask product management / customers, etc.)
     2. For each case, get a small set of candidate documents (e.g. by a very broad query)
     3. Rate those documents with respect to the underlying information need
        • can initially be done by you or other stakeholders; later maybe outsource, e.g. via Mechanical Turk
     4. Iterate!
  15. Metrics currently available
     • Precision At K: set-based metric; the ratio of relevant docs in the top K results (binary ratings)
     • Reciprocal Rank (RR): positional metric; the inverse of the rank of the first relevant document (binary ratings)
     • Discounted Cumulative Gain (DCG): takes order into account; highly relevant docs score more if they appear earlier in the result list (graded ratings)
     • Expected Reciprocal Rank (ERR): motivated by the “cascade model” of search; models the dependency of results on their predecessors (graded ratings)
  16. Precision At K
     • In short: “How many good results appear in the first K results?” (e.g. the first few pages in the UI)
     • supports only boolean relevance judgements
     • PROS: easy to understand and communicate
     • CONS: least stable across different user needs; e.g. the total number of relevant documents for a query influences precision at k

$\mathrm{prec@}k = \frac{\#\{\text{relevant docs in top } k\}}{\#\{\text{all results at } k\}}$
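
As a quick illustration of the formula, a minimal Python sketch, assuming the result list is given as booleans in rank order (the function and variable names are my own):

```python
# Minimal sketch of precision@k; `relevant` is a list of booleans in rank
# order (True = relevant), mirroring binary relevance judgements.
def precision_at_k(relevant, k):
    top_k = relevant[:k]
    if not top_k:
        return 0.0
    return sum(top_k) / len(top_k)

# 3 of the first 5 results are relevant -> 0.6
print(precision_at_k([True, False, True, True, False, False], k=5))
```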
  17. Reciprocal Rank
     • supports only boolean relevance judgements
     • PROS: easy to understand and communicate
     • CONS: limited to cases where the number of good results doesn’t matter
     • if averaged over a sample of queries Q, it is often called MRR (mean reciprocal rank)

$\mathrm{RR} = \frac{1}{\text{rank of the first relevant document}} \qquad \mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}$
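
A minimal Python sketch of RR and MRR under the same assumptions as above (boolean relevance lists in rank order; names are my own):

```python
# Minimal sketch of (mean) reciprocal rank over boolean relevance lists.
def reciprocal_rank(relevant):
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0  # no relevant document retrieved

def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(r) for r in queries) / len(queries)

# first relevant hits at rank 2 and rank 1 -> (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank([[False, True, False], [True, False]]))
```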
  18. Discounted Cumulative Gain (DCG)
     • predecessor: Cumulative Gain (CG) sums the relevance judgements over the top k results: $\mathrm{CG} = \sum_{i=1}^{k} rel_i$
     • DCG takes position into account by discounting each position with $\log_2$: $\mathrm{DCG} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$
     • NDCG (Normalized DCG) divides by the “ideal” DCG for a query (IDCG): $\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$
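
A minimal Python sketch of DCG and NDCG using the simple gain formulation from the slide, assuming a list of graded relevance judgements in rank order; computing the IDCG from the same ratings re-sorted is a simplification:

```python
import math

# Minimal sketch of DCG/NDCG; `gains` is a list of graded relevance
# judgements (e.g. 0..3) in rank order.
def dcg(gains):
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(gains, start=1))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))  # "ideal" ordering of the same ratings
    return dcg(gains) / ideal if ideal > 0 else 0.0

# < 1.0, because a rating-2 document outranks a rating-3 document
print(ndcg([3, 2, 3, 0, 1]))
```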
  19. Expected Reciprocal Rank (ERR)
     • cascade-based metric
     • supports graded relevance judgements
     • the model assumes the user goes through the result list in order and is satisfied with the first relevant document
     • $R_i$ is the probability that the user stops at position i
     • ERR is high when relevant documents appear early

$\mathrm{ERR} = \sum_{r=1}^{k} \frac{1}{r} \left(\prod_{i=1}^{r-1}(1 - R_i)\right) R_r \qquad R_i = \frac{2^{rel_i} - 1}{2^{rel_{\max}}}$

where $rel_i$ is the relevance at position i and $rel_{\max}$ is the maximal relevance grade.
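
A minimal Python sketch of ERR following the formula above, assuming a list of graded relevance judgements in rank order; the default maximal grade of 3 is an assumption for illustration:

```python
# Minimal sketch of Expected Reciprocal Rank over graded relevance judgements.
def err(gains, max_rel=3):
    err_score = 0.0
    p_not_stopped = 1.0  # probability the user has not stopped before this rank
    for r, rel in enumerate(gains, start=1):
        stop_prob = (2 ** rel - 1) / (2 ** max_rel)  # R_r from the formula
        err_score += p_not_stopped * stop_prob / r
        p_not_stopped *= 1.0 - stop_prob
    return err_score

# a highly relevant document at rank 1 dominates the score
print(err([3, 0, 1]))  # high
print(err([0, 1, 3]))  # lower: same documents, but relevant ones appear late
```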
  20. Demo project and data
     • the demo uses approx. 1800 documents from the English Wikipedia
     • Wikipedia’s Discovery department collects and publishes relevance judgements with their Discernatron project
     • bulk data and all query examples are available at https://github.com/cbuescher/rankEvalDemo
  21. Some questions I have for you…
     • How do you measure search relevance currently?
     • Did you find anything useful about the ranking evaluation approach?
     • Feedback about the usability of the API is welcome (ping me on GitHub or our Discuss forum, @cbuescher)
  22. Further reading
     • Manning, Raghavan & Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008.
     • Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected Reciprocal Rank for Graded Relevance. Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM ’09), 621.
     • Blog: https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
     • Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html
     • Discuss: https://discuss.elastic.co/c/elasticsearch (cbuescher)
     • GitHub: :Search/Ranking label (cbuescher)