
Made To Measure: Ranking Evaluation using Elasticsearch

The evaluation of search ranking results is an important task for every search engineer. The new Elasticsearch Ranking Evaluation API makes it easier to measure well-known information retrieval metrics like Precision@K or NDCG. This helps with making better relevance-tuning decisions and makes it possible to evaluate and optimize query templates over a wider range of user needs.

Christoph Büscher

December 12, 2018

Transcript

  1. Christoph Büscher, 12 Dec 2018, Elasticsearch Berlin Meetup, @dalatangi. Made to measure: Ranking Evaluation using Elasticsearch
  2. "If you cannot measure it, you cannot improve it!" AlmostAnActualQuoteTM by Lord Kelvin. Image: https://commons.wikimedia.org/wiki/File:Portrait_of_William_Thomson,_Baron_Kelvin.jpg
  3. How good is your search? Image by Kecko, https://www.flickr.com/photos/kecko/18146364972 (CC BY 2.0)
  4. Ranking Evaluation: a repeatable way to quickly measure the quality of search results over a wide range of user needs
  5. REPEATABILITY
     • Automate: don’t make people look at screens
     • no gut feeling / “management-driven” ad-hoc search ranking
  6. QUALITY MEASURE
     • numeric output
     • support for different metrics
     • define “quality” in your domain
  7. USER NEEDS
     • optimize across a wider range of use cases (aka “information needs”)
     • think about what the majority of your users want
     • collect data to discover what is important for your use case
  8. Things needed for Ranking Evaluation
     1. Define a set of typical information needs
     2. For each case, get a small set of candidate documents (both relevant and irrelevant)
     3. Rate these documents (either binary relevant/non-relevant or on some graded scale)
     4. Choose a metric to calculate. Some good metrics already defined in Information Retrieval research include Precision@K, (N)DCG, ERR, Reciprocal Rank, etc. (see the sketch after this list for what these ingredients can look like as data)
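
To make these ingredients tangible, here is a minimal Python sketch of what they might look like as data. The queries, document IDs and ratings below are invented placeholders, not taken from the talk.

```python
# Invented placeholders illustrating the ingredients of a ranking evaluation.

# 1. typical information needs, expressed as the queries users would run
information_needs = {
    "cheap_hotel_amsterdam": "hotel amsterdam",
    "jfk_airport": "JFK",
}

# 2./3. per need, a small set of candidate documents with relevance ratings
# (here on a graded scale: 0 = irrelevant ... 3 = highly relevant)
ratings = {
    "cheap_hotel_amsterdam": {"doc_17": 3, "doc_42": 1, "doc_99": 0},
    "jfk_airport": {"doc_3": 3, "doc_8": 1},
}

# 4. a metric to calculate, e.g. precision@k with k=5
metric = {"precision": {"k": 5, "relevant_rating_threshold": 2}}
```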
  9. Search Evaluation Continuum (diagram): approaches ordered from fast, little preparation time and few people looking at screens to slow, lots of preparation and lots of people looking at screens: some sort of unit test, Ranking Evaluation, QA assisted by scripts, user studies, A/B testing.
  10. Where Ranking Evaluation can help
     • Development: guiding design decisions, enabling quick iteration
     • Communication tool: helps define “search quality” more clearly, forces stakeholders to “get real” about their expectations
     • Production: monitor changes, spot degradations
  11. Ranking Evaluation API

GET /my_index/_rank_eval
{
  "metric": { "mean_reciprocal_rank": { [...] } },
  "templates": [{ [...] }],
  "requests": [{
    "template_id": "my_query_template",
    "ratings": [...],
    "params": { "query_string": "hotel amsterdam", "field": "text" }
    [...]
  }]
}

     • introduced in 6.2 (still an experimental API)
     • joint work with Isabel Drost-Fromm (@MaineC)
     • inputs:
       • a set of search requests (“information needs”)
       • document ratings for each request
       • a metric definition; currently available: Precision@K, (N)DCG, Expected Reciprocal Rank, MRR, …
     (see the sketch below for an end-to-end call)
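
For readers who want to try this locally, here is a minimal Python sketch of sending a _rank_eval request over HTTP. The local cluster URL, index name, query, document IDs and ratings are illustrative assumptions, not taken from the slides.

```python
import requests

# Minimal sketch, assuming a local Elasticsearch cluster on localhost:9200
# and an index called "my_index"; document IDs and ratings are made up.
body = {
    "requests": [
        {
            "id": "hotel_amsterdam",
            "request": {"query": {"match": {"text": "hotel amsterdam"}}},
            "ratings": [
                {"_index": "my_index", "_id": "doc_1", "rating": 3},
                {"_index": "my_index", "_id": "doc_2", "rating": 0},
            ],
        }
    ],
    "metric": {"mean_reciprocal_rank": {"relevant_rating_threshold": 1}},
}

resp = requests.post("http://localhost:9200/my_index/_rank_eval", json=body)
resp.raise_for_status()
# the response structure is shown on the following slides and parsed further below
```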

  12. Ranking Evaluation API details

metric:
"metric": { "precision": { "relevant_rating_threshold": "2", "k": 5 } }

requests:
"requests": [{ "id": "JFK_query", "request": { "query": { [...] } }, "ratings": [...] }, ... other use cases ...]

ratings:
"ratings": [ { "_id": "3054546", "rating": 3 }, { "_id": "5119376", "rating": 1 }, [...] ]

(the sketch below shows how these sections compose into one request body)
  13. { "rank_eval": { "metric_score": 0.431, "details": { "my_query_id1": { "metric_score":

    0.6, "unrated_docs": [ { "_index": "idx", "_id": "1960795" }, [...] ], "hits": [...], "metric_details": { “precision" : { “relevant_docs_retrieved": 6,
 "docs_retrieved": 10 } } }, "my_query_id2" : { [...] } } } } !16 _rank_eval response overall score details per query maybe rate those? details about metric
  14. How to get document ratings?
     1. Define a set of typical information needs of your users (e.g. analyze logs, ask product management / customers, etc.)
     2. For each case, get a small set of candidate documents (e.g. by a very broad query)
     3. Rate those documents with respect to the underlying information need
        • can initially be done by you or other stakeholders; later maybe outsource, e.g. via Mechanical Turk
     4. Iterate!
  15. Metrics currently available
     • Precision At K: set-based metric; the ratio of relevant docs in the top K results (binary ratings)
     • Reciprocal Rank (RR): positional metric; the inverse of the rank of the first relevant document (binary ratings)
     • Discounted Cumulative Gain (DCG): takes order into account; highly relevant docs score more if they appear earlier in the result list (graded ratings)
     • Expected Reciprocal Rank (ERR): motivated by the “cascade model” of search; models the dependency of results on their predecessors (graded ratings)
  16. Precision At K
     • In short: “How many good results appear in the first K results?” (e.g. the first few pages in the UI)
     • supports only boolean relevance judgements
     • PROS: easy to understand and communicate
     • CONS: least stable across different user needs; e.g. the total number of relevant documents for a query influences precision at k

$\mathrm{prec@}k = \frac{\#\{\text{relevant docs in top } k\}}{\#\{\text{all results at } k\}}$
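
As a quick illustration of the formula, a minimal Python sketch, assuming the result list is given as booleans in rank order (the function and variable names are my own):

```python
# Minimal sketch of precision@k; `relevant` is a list of booleans in rank
# order (True = relevant), mirroring binary relevance judgements.
def precision_at_k(relevant, k):
    top_k = relevant[:k]
    if not top_k:
        return 0.0
    return sum(top_k) / len(top_k)

# 3 of the first 5 results are relevant -> 0.6
print(precision_at_k([True, False, True, True, False, False], k=5))
```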
  17. Reciprocal Rank
     • supports only boolean relevance judgements
     • PROS: easy to understand and communicate
     • CONS: limited to cases where the number of good results doesn’t matter
     • if averaged over a sample of queries Q, it is often called MRR (mean reciprocal rank)

$\mathrm{RR} = \frac{1}{\text{rank of the first relevant document}} \qquad \mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}$
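
A minimal Python sketch of RR and MRR under the same assumptions as above (boolean relevance lists in rank order; names are my own):

```python
# Minimal sketch of (mean) reciprocal rank over boolean relevance lists.
def reciprocal_rank(relevant):
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0  # no relevant document retrieved

def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(r) for r in queries) / len(queries)

# first relevant hits at rank 2 and rank 1 -> (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank([[False, True, False], [True, False]]))
```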
  18. Discounted Cumulative Gain (DCG)
     • predecessor: Cumulative Gain (CG) sums the relevance judgements over the top k results: $\mathrm{CG} = \sum_{i=1}^{k} rel_i$
     • DCG takes position into account by discounting each position with $\log_2$: $\mathrm{DCG} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$
     • NDCG (Normalized DCG) divides by the “ideal” DCG for a query (IDCG): $\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$
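
A minimal Python sketch of DCG and NDCG using the simple gain formulation from the slide, assuming a list of graded relevance judgements in rank order; computing the IDCG from the same ratings re-sorted is a simplification:

```python
import math

# Minimal sketch of DCG/NDCG; `gains` is a list of graded relevance
# judgements (e.g. 0..3) in rank order.
def dcg(gains):
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(gains, start=1))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))  # "ideal" ordering of the same ratings
    return dcg(gains) / ideal if ideal > 0 else 0.0

# < 1.0, because a rating-2 document outranks a rating-3 document
print(ndcg([3, 2, 3, 0, 1]))
```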
  19. Expected Reciprocal Rank (ERR)
     • cascade-based metric
     • supports graded relevance judgements
     • the model assumes the user goes through the result list in order and is satisfied with the first relevant document
     • $R_i$ is the probability that the user stops at position i
     • ERR is high when relevant documents appear early

$\mathrm{ERR} = \sum_{r=1}^{k} \frac{1}{r} \left(\prod_{i=1}^{r-1}(1 - R_i)\right) R_r \qquad R_i = \frac{2^{rel_i} - 1}{2^{rel_{\max}}}$

where $rel_i$ is the relevance at position i and $rel_{\max}$ is the maximal relevance grade.
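
A minimal Python sketch of ERR following the formula above, assuming a list of graded relevance judgements in rank order; the default maximal grade of 3 is an assumption for illustration:

```python
# Minimal sketch of Expected Reciprocal Rank over graded relevance judgements.
def err(gains, max_rel=3):
    err_score = 0.0
    p_not_stopped = 1.0  # probability the user has not stopped before this rank
    for r, rel in enumerate(gains, start=1):
        stop_prob = (2 ** rel - 1) / (2 ** max_rel)  # R_r from the formula
        err_score += p_not_stopped * stop_prob / r
        p_not_stopped *= 1.0 - stop_prob
    return err_score

# a highly relevant document at rank 1 dominates the score
print(err([3, 0, 1]))  # high
print(err([0, 1, 3]))  # lower: same documents, but relevant ones appear late
```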
  20. Demo project and data
     • the demo uses approx. 1800 documents from the English Wikipedia
     • Wikipedia’s Discovery department collects and publishes relevance judgements with their Discernatron project
     • bulk data and all query examples are available at https://github.com/cbuescher/rankEvalDemo
  21. Some questions I have for you…
     • How do you measure search relevance currently?
     • Did you find anything useful about the ranking evaluation approach?
     • Feedback about the usability of the API is welcome (ping me on GitHub or our Discuss forum, @cbuescher)
  22. Further reading
     • Manning, Raghavan & Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008.
     • Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected Reciprocal Rank for Graded Relevance. Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM ’09), 621.
     • Blog: https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
     • Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html
     • Discuss: https://discuss.elastic.co/c/elasticsearch (cbuescher)
     • GitHub: :Search/Ranking label (cbuescher)