Made To Measure: Ranking Evaluation using Elasticsearch

The evaluation of search ranking results is an important task for every search engineer. The new Elasticsearch Ranking Evaluation API makes it easier to measure well-known information retrieval metrics like Precision@K or NDCG. This helps you make better relevance-tuning decisions and lets you evaluate and optimize query templates over a wider range of user needs.

Christoph Büscher

December 12, 2018

Transcript

1. Christoph Büscher, 12 Dec 2018, Elasticsearch Berlin Meetup, @dalatangi
   Made to measure: Ranking Evaluation using Elasticsearch
2. "If you cannot measure it, you cannot improve it!" AlmostAnActualQuote™ by Lord Kelvin
   https://commons.wikimedia.org/wiki/File:Portrait_of_William_Thomson,_Baron_Kelvin.jpg
3. How good is your search? Image by Kecko, https://www.flickr.com/photos/kecko/18146364972 (CC BY 2.0)
4. Ranking Evaluation: a repeatable way to quickly measure the quality of search results over a wide range of user needs
5. REPEATABILITY
   • Automate - don't make people look at screens
   • No gut feeling / "management-driven" ad-hoc search ranking
6. QUALITY MEASURE
   • Numeric output
   • Support for different metrics
   • Define "quality" in your domain
7. USER NEEDS
   • Optimize across a wider range of use cases (aka "information needs")
   • Think about what the majority of your users want
   • Collect data to discover what is important for your use case
8. Things needed for Ranking Evaluation
   1. Define a set of typical information needs
   2. For each case, get a small set of candidate documents (both relevant and irrelevant)
   3. Rate these documents (either binary relevant/non-relevant or on some graded scale)
   4. Choose a metric to calculate. Some good metrics already defined in Information Retrieval research include Precision@K, (N)DCG, ERR, Reciprocal Rank, etc. (a data sketch follows below)
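As a rough illustration of what these four ingredients can look like as plain data, here is a minimal Python sketch; the information-need names, queries, document IDs and grades are all made up for this example:

```python
# Hypothetical judgement list for two information needs, rated on a 0-3 scale.
# Every name and ID below is illustrative, not taken from a real index.
judgements = [
    {
        "id": "hotels_in_amsterdam",         # name of the information need
        "query": "hotel amsterdam",          # what a user might type
        "ratings": [                         # small set of candidate documents
            {"_id": "doc-17", "rating": 3},  # highly relevant
            {"_id": "doc-42", "rating": 1},  # marginally relevant
            {"_id": "doc-99", "rating": 0},  # irrelevant
        ],
    },
    {
        "id": "cheap_flights_berlin",
        "query": "cheap flights berlin",
        "ratings": [
            {"_id": "doc-03", "rating": 2},
            {"_id": "doc-57", "rating": 0},
        ],
    },
]
```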
9. Search Evaluation Continuum (diagram; axes: speed from fast to slow, preparation time and people looking at screens from little to lots): some sort of unit test, Ranking Evaluation, QA assisted by scripts, user studies, A/B testing
10. Where Ranking Evaluation can help
    • Development: guiding design decisions, enabling quick iteration
    • Communication tool: helps define "search quality" more clearly, forces stakeholders to "get real" about their expectations
    • Production: monitor changes, spot degradations
11. Ranking Evaluation API

    GET /my_index/_rank_eval
    {
      "metric": { "mean_reciprocal_rank": { [...] } },
      "templates": [{ [...] }],
      "requests": [{
        "template_id": "my_query_template",
        "ratings": [...],
        "params": { "query_string": "hotel amsterdam", "field": "text" }
        [...]
      }]
    }

    • Introduced in 6.2 (still an experimental API)
    • Joint work with Isabel Drost-Fromm (@MaineC)
    • Inputs:
      • a set of search requests ("information needs")
      • document ratings for each request
      • a metric definition; currently available: Precision@K, (N)DCG, Expected Reciprocal Rank, MRR, …
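A minimal sketch of calling the endpoint from Python with the `requests` library, assuming an Elasticsearch node on localhost:9200; the index name, query and ratings are placeholders, and option names may differ between versions of this still experimental API:

```python
import requests

# Hypothetical _rank_eval call: one information need with a direct query
# (instead of a template) and MRR as the metric. Names and IDs are placeholders.
body = {
    "metric": {"mean_reciprocal_rank": {"relevant_rating_threshold": 1}},
    "requests": [
        {
            "id": "hotel_amsterdam",
            "request": {"query": {"match": {"text": "hotel amsterdam"}}},
            "ratings": [
                {"_index": "my_index", "_id": "doc-17", "rating": 1},
                {"_index": "my_index", "_id": "doc-42", "rating": 0},
            ],
        }
    ],
}

response = requests.get("http://localhost:9200/my_index/_rank_eval", json=body)
print(response.json())  # overall score plus per-query details
```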

12. Ranking Evaluation API Details

    metric:
    "metric": { "precision": { "relevant_rating_threshold": "2", "k": 5 } }

    requests:
    "requests": [{ "id": "JFK_query", "request": { "query": { […] } }, "ratings": […] }, … other use cases …]

    ratings:
    "ratings": [ { "_id": "3054546", "rating": 3 }, { "_id": "5119376", "rating": 1 }, […] ]
  13. { "rank_eval": { "metric_score": 0.431, "details": { "my_query_id1": { "metric_score":

    0.6, "unrated_docs": [ { "_index": "idx", "_id": "1960795" }, [...] ], "hits": [...], "metric_details": { “precision" : { “relevant_docs_retrieved": 6,
 "docs_retrieved": 10 } } }, "my_query_id2" : { [...] } } } } !16 _rank_eval response overall score details per query maybe rate those? details about metric
14. How to get document ratings?
    1. Define a set of typical information needs of users (e.g. analyze logs, ask product management / customers, etc.)
    2. For each case, get a small set of candidate documents (e.g. by a very broad query; see the sketch below)
    3. Rate those documents with respect to the underlying information need
       • can initially be done by you or other stakeholders; later maybe outsource, e.g. via Mechanical Turk
    4. Iterate!
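For step 2, one way to collect candidates is a deliberately broad search whose hits are then judged by hand; a sketch, where the index, field and query text are placeholders and an Elasticsearch node is assumed on localhost:9200:

```python
import requests

# Hypothetical broad match query to gather candidate documents for rating.
search_body = {"query": {"match": {"text": "hotel amsterdam"}}, "size": 20}
hits = requests.get(
    "http://localhost:9200/my_index/_search", json=search_body
).json()["hits"]["hits"]

# Candidate list to be rated against the underlying information need.
for hit in hits:
    print(hit["_id"], hit["_source"].get("title"))
```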
15. Metrics currently available
    • Precision at K: set-based metric; ratio of relevant docs in the top K results (binary ratings)
    • Reciprocal Rank (RR): positional metric; inverse of the rank of the first relevant document (binary ratings)
    • Discounted Cumulative Gain (DCG): takes order into account; highly relevant docs score more if they appear earlier in the result list (graded ratings)
    • Expected Reciprocal Rank (ERR): motivated by the "cascade model" of search; models the dependency of results on their predecessors (graded ratings)
16. Precision at K
    • In short: "How many good results appear in the first K results?" (e.g. the first few pages in the UI)
    • Supports only boolean relevance judgements
    • PROS: easy to understand & communicate
    • CONS: least stable across different user needs, e.g. the total number of relevant documents for a query influences precision at k

    $\mathrm{prec}@k = \frac{\#\{\text{relevant docs in top } k\}}{\#\{\text{all results at } k\}}$
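A small Python sketch of this calculation (not the API's implementation); `relevant` holds the boolean judgements of the returned results in rank order:

```python
def precision_at_k(relevant, k):
    """Precision@K: share of relevant documents among the top k results."""
    top_k = relevant[:k]          # all results at k (may be fewer than k)
    if not top_k:
        return 0.0
    return sum(top_k) / len(top_k)

# Example: 3 of the first 5 results are relevant -> 0.6
print(precision_at_k([True, False, True, True, False, False], k=5))
```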
17. Reciprocal Rank
    • Supports only boolean relevance judgements
    • PROS: easy to understand & communicate
    • CONS: limited to cases where the number of good results doesn't matter
    • If averaged over a sample of queries Q, it is often called MRR (mean reciprocal rank)

    $\mathrm{RR} = \frac{1}{\text{rank of the first relevant document}}, \qquad \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$
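The same two formulas as a Python sketch (again only illustrating the definition, not the API's code):

```python
def reciprocal_rank(relevant):
    """1 / rank of the first relevant result; 0.0 if nothing relevant is found."""
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(result_lists):
    """Average the reciprocal rank over a sample of queries Q."""
    return sum(reciprocal_rank(r) for r in result_lists) / len(result_lists)

# First relevant hit at rank 2 and rank 1 respectively -> (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank([[False, True, False], [True, False]]))
```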
18. Discounted Cumulative Gain (DCG)
    • Predecessor: Cumulative Gain (CG) sums the relevance judgements over the top k results: $CG = \sum_{i=1}^{k} \mathrm{rel}_i$
    • DCG takes position into account, dividing by $\log_2$ at each position: $DCG = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)}$
    • NDCG (Normalized DCG) divides by the "ideal" DCG for a query (IDCG): $NDCG = \frac{DCG}{IDCG}$
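A Python sketch of these definitions, following the formulas on the slide (note that some implementations use the exponential 2^rel - 1 gain instead of the raw grade):

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain over graded relevances in rank order."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the 'ideal' DCG: the same grades sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The same grades in a better order yield a higher NDCG
print(ndcg([3, 2, 0, 1]), ndcg([0, 2, 3, 1]))
```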
19. Expected Reciprocal Rank (ERR)
    • Cascade-based metric
    • Supports graded relevance judgements
    • The model assumes the user goes through the result list in order and is satisfied with the first relevant document
    • $R_i$ is the probability that the user stops at position i
    • ERR is high when relevant documents appear early

    $ERR = \sum_{r=1}^{k} \frac{1}{r} \prod_{i=1}^{r-1} (1 - R_i)\, R_r, \qquad R_i = \frac{2^{\mathrm{rel}_i} - 1}{2^{\mathrm{rel}_{\max}}}$
    where $\mathrm{rel}_i$ is the relevance at position i and $\mathrm{rel}_{\max}$ is the maximal relevance grade
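A Python sketch of the ERR formula above (illustrative only; the grades and the maximal grade in the example are assumptions):

```python
def err(relevances, max_grade=3):
    """Expected Reciprocal Rank following the cascade model on the slide."""
    score = 0.0
    prob_reaching_rank = 1.0                          # product of (1 - R_i) so far
    for rank, rel in enumerate(relevances, start=1):
        stop_prob = (2 ** rel - 1) / (2 ** max_grade)  # R_i on the slide
        score += prob_reaching_rank * stop_prob / rank
        prob_reaching_rank *= 1.0 - stop_prob
    return score

# A highly relevant document at rank 1 dominates the score
print(err([3, 1, 0], max_grade=3))
```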
20. Demo project and data
    • The demo uses approx. 1,800 documents from the English Wikipedia
    • Wikipedia's Discovery department collects and publishes relevance judgements with their Discernatron project
    • Bulk data and all query examples are available at https://github.com/cbuescher/rankEvalDemo
21. Some questions I have for you…
    • How do you measure search relevance currently?
    • Did you find anything useful about the ranking evaluation approach?
    • Feedback about the usability of the API (ping me on GitHub or our Discuss forum, @cbuescher)
22. Further reading
    • Manning, Raghavan & Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008.
    • Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected Reciprocal Rank for Graded Relevance. Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), 621.
    • Blog: https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
    • Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html
    • Discuss: https://discuss.elastic.co/c/elasticsearch (cbuescher)
    • GitHub: :Search/Ranking label (cbuescher)