Efficient top-k query processing in Lucene 8

Tomoko Uchida
February 26, 2019

Talk at Search Engineering Tech Talk #1
https://search-tech.connpass.com/event/112866/

Transcript

  1. 2019/2/25 Efficient top-k query processing in Lucene 8 http://mocobeta.github.io/slides-html/search-tech-talk-1/search-tech-talk-1.html?print-pdf 1/32

    EFFICIENT TOP-K QUERY PROCESSING IN LUCENE 8
    Tomoko Uchida
    2019/02/26 @ Roppongi Hills
  2. WHO AM I

    - Twitter: @moco_beta
    - 5+ years of experience w/ Solr and Elasticsearch
    - Software Engineer @ AI Samurai Inc.
    - Developing patent search w/ AI technologies
    - Janome developer
    - Luke: Lucene Toolbox Project co-maintainer
    - 改訂3版 Apache Solr 入門 ("Introduction to Apache Solr", revised 3rd edition) lead author
  3. Lucene/Solr 8.0 and Elasticsearch 7.0 are coming...
  4. SUMMARY OF THIS TALK

    - Top-k query processing / scoring will be much faster!
    - Especially effective for disjunction (OR) queries
    - Also works for complex queries such as PhraseQuery, WildcardQuery and their combinations
    - The exact total hit count will not be returned (by default)
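
Why the total hit count stops being exact can be sketched in a few lines. This is only a toy model, not Lucene code: the function `collect_top_k`, its `total_hits_threshold` parameter and the sample hits are assumptions made up for illustration, and the relation names merely mirror the idea of an exact count versus a lower bound. Once enough hits have been counted and the top k is full, non-competitive hits are skipped without being counted:

```python
import heapq

def collect_top_k(hits, k, total_hits_threshold):
    """Collect the k best (doc_id, score) hits; count hits exactly only up to a threshold."""
    heap = []              # min-heap of (score, doc_id): the current top k
    counted = 0            # hits that were actually visited
    relation = "EQUAL_TO"  # becomes a lower bound once anything is skipped
    for doc_id, score in hits:
        pruning = counted >= total_hits_threshold and len(heap) == k
        if pruning and score <= heap[0][0]:
            # Non-competitive hit: a real engine would skip it without visiting it.
            relation = "GREATER_THAN_OR_EQUAL_TO"
            continue
        counted += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), counted, relation

# Hypothetical hits: 20 documents with scores cycling through 1, 2, 3, 4, 0.
hits = [(doc_id, float(doc_id % 5)) for doc_id in range(1, 21)]
top_pruned, count_pruned, relation_pruned = collect_top_k(hits, k=3, total_hits_threshold=10)
top_exact, count_exact, relation_exact = collect_top_k(hits, k=3, total_hits_threshold=1000)
print(count_pruned, relation_pruned)
print(count_exact, relation_exact)
```

Both runs return the same top 3; only the reported count differs, which is the trade-off the last bullet refers to.
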
  5. AND THERE IS A LONG VERSION...

    This talk is a short version of my survey. Please see this post (in Japanese) for more details :)
    Lucene 8 の Top-k クエリプロセッシング最適化 ("Top-k query processing optimizations in Lucene 8")
  6. REFERENCES

    - Magic WAND: Faster Retrieval of Top Hits in Elasticsearch
    - (FOSDEM 2019) Super-speedy scoring in Lucene 8
    - (FOSDEM 2019) Apache Lucene and Apache Solr 8
    - (Berlin Buzzwords 2012) Efficient Scoring in Lucene
    - 転置インデックスと Top k-query ("Inverted indexes and top-k queries")
  7. PAPERS

    [1] T. Strohman, H. Turtle, and B. Croft. Optimization strategies for complex queries. In Proc. of ACM SIGIR, 2005.
    [2] K. Chakrabarti, S. Chaudhuri, and V. Ganti. Interval-Based Pruning for Top-k Processing over Compressed Lists. In Proc. of ICDE, 2011.
    [3] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Y. Zien. Efficient Query Evaluation using a Two-Level Retrieval Process. In Proc. of CIKM, 2003.
    [4] S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proc. of SIGIR, 2011.
  8. HOW MUCH FASTER? - AND QUERY

    http://people.apache.org/~mikemccand/lucenebench/An
  9. HOW MUCH FASTER? - OR QUERY (1)

    http://people.apache.org/~mikemccand/lucenebench/Or
  10. HOW MUCH FASTER? - OR QUERY (2)

    http://people.apache.org/~mikemccand/lucenebench/Or
  11. HOW MUCH FASTER? - TERM QUERY

    http://people.apache.org/~mikemccand/lucenebench/Te
  12. ALGORITHMS
  13. POSTING LIST RETRIEVAL AND THE CHALLENGE OF DISJUNCTION

    Query: "search OR engine"
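
The challenge can be made concrete with a small sketch. Assuming toy posting lists of (doc ID, score contribution) pairs (the data and the function name `exhaustive_or` are made up for illustration, not taken from the slides), a naive OR evaluation has to score every document that appears in any list, even though only the top k are kept:

```python
import heapq
from collections import defaultdict

# Hypothetical toy postings: term -> list of (doc_id, score contribution).
POSTINGS = {
    "search": [(1, 1.0), (3, 2.5), (7, 0.5), (9, 1.5)],
    "engine": [(2, 0.75), (3, 1.25), (9, 2.0), (12, 0.25)],
}

def exhaustive_or(postings, terms, k):
    """Score every document matching at least one term, then keep the k best."""
    scores = defaultdict(float)
    for term in terms:
        for doc_id, contribution in postings.get(term, []):
            scores[doc_id] += contribution
    # Ties are broken in favour of the smaller doc ID.
    top = heapq.nlargest(k, scores.items(), key=lambda item: (item[1], -item[0]))
    return top, len(scores)

top_hits, scored = exhaustive_or(POSTINGS, ["search", "engine"], k=2)
print(top_hits)  # the 2 best documents
print(scored)    # number of documents that had to be scored
```

Here 6 documents are scored to return 2. The algorithms on the following slides aim to avoid scoring documents that cannot reach the top k.
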
  14. MAXSCORE

    Introduced by H. R. Turtle and J. Flood in 1995
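
A rough, simplified sketch of the MaxScore idea, under my own assumptions (it is not the original 1995 formulation and not Lucene code): each term has an upper bound on its score contribution, and terms whose summed upper bounds cannot beat the current k-th best score are treated as "non-essential", so documents that match only those terms are never scored. The function name `maxscore_top_k` and the toy postings are hypothetical.

```python
import bisect
import heapq

def maxscore_top_k(postings, k):
    """postings: term -> list of (doc_id, score) sorted by doc_id.

    Returns (top hits as (score, doc_id), number of documents scored).
    """
    # Order terms by the upper bound of their contribution, lowest first.
    terms = sorted(postings, key=lambda t: max(s for _, s in postings[t]))
    bounds = [max(s for _, s in postings[t]) for t in terms]
    doc_ids = {t: [d for d, _ in postings[t]] for t in terms}

    heap = []             # min-heap of (score, doc_id): the current top k
    scored = 0
    first_essential = 0   # terms[:first_essential] are non-essential
    current = -1
    while True:
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        # Grow the non-essential prefix while its summed bound cannot beat the threshold.
        prefix = sum(bounds[:first_essential])
        while first_essential < len(terms) and prefix + bounds[first_essential] <= threshold:
            prefix += bounds[first_essential]
            first_essential += 1
        # Next candidate: the smallest unseen doc ID in any essential list.
        candidates = []
        for t in terms[first_essential:]:
            i = bisect.bisect_right(doc_ids[t], current)
            if i < len(doc_ids[t]):
                candidates.append(doc_ids[t][i])
        if not candidates:
            break
        current = min(candidates)
        # Fully score the candidate, including non-essential terms.
        score = 0.0
        for t in terms:
            i = bisect.bisect_left(doc_ids[t], current)
            if i < len(doc_ids[t]) and doc_ids[t][i] == current:
                score += postings[t][i][1]
        scored += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, current))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, current))
    return sorted(heap, reverse=True), scored

# Hypothetical data: a frequent low-scoring term and a rare high-scoring one.
toy = {
    "the": [(d, 0.25) for d in range(1, 9)],
    "lucene": [(2, 2.0), (6, 3.0)],
}
hits, scored = maxscore_top_k(toy, k=2)
print(hits, scored)
```

In this toy run only 3 of the 8 matching documents are scored: as soon as the top 2 is full, "the" becomes non-essential and only the "lucene" list drives the iteration.
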
  15. INTERVAL-BASED PRUNING

    MaxScore variant adapted to block-compressed indexes [2]
  16. WAND

    - Special operator proposed in [3]
    - "WAND" stands for "Weak AND" or "Weighted AND"
    - OR behaves almost like AND when a document contains a large enough subset of the query terms
    - A document containing a large subset of the query terms scores higher than documents containing only a few of them
  17. SOUNDS FAMILIAR?

    Lucene already has a similar concept: "Minimum Should Match"
  18. WAND

    Query: "the OR search OR engine OR lucene"
  19. WAND

    Steps
    1. Assume the current threshold (the k-th highest score) is 12.
    2. Sort the posting lists by their current pointer.
    3. Find the "pivot" term and doc ID; here, that is "search" and id=486.
    4. Calculate the partial score for doc 486 if it also contains "the" and "engine".
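
The pivot selection in the steps above can be sketched as follows. The threshold 12 and the pivot ("search", doc 486) come from the slide; the other doc IDs, the per-term maximum scores and the function name `find_pivot` are assumptions chosen so that the example reproduces that pivot. Whether a bound equal to the threshold still counts as competitive is also a choice made here, not something the slide specifies.

```python
def find_pivot(cursors, threshold):
    """cursors: list of (term, current_doc_id, max_score).

    Walks the lists in order of their current doc ID and returns
    (pivot term, pivot doc ID, terms positioned before the pivot), or None
    if even all terms together cannot exceed the threshold.
    """
    upper_bound = 0.0
    preceding = []
    for term, doc_id, max_score in sorted(cursors, key=lambda c: c[1]):
        upper_bound += max_score
        if upper_bound > threshold:
            return term, doc_id, preceding
        preceding.append(term)
    return None

# Hypothetical cursor positions and per-term score upper bounds.
cursors = [
    ("the", 273, 2.0),
    ("search", 486, 6.0),
    ("engine", 349, 5.0),
    ("lucene", 512, 8.0),
]
pivot = find_pivot(cursors, threshold=12.0)
print(pivot)  # ('search', 486, ['the', 'engine'])
```

No document before 486 can exceed the threshold, because "the" and "engine" alone contribute at most 2 + 5 = 7. Their cursors can therefore jump straight to 486, and doc 486 only needs full scoring if they actually contain it.
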
  20. BLOCK-MAX WAND

    - WAND variant that works with block-compressed indexes [4]
    - Finally coming to Lucene!
  21. BLOCK-MAX WAND
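
A minimal sketch of the block-max refinement, under simplifying assumptions: each posting list is cut into fixed-size blocks that store their own maximum score, and a candidate is checked against the sum of the block maxima instead of the list-wide maxima. The class `BlockMaxList`, the function `block_max_check` and the data are hypothetical, and the check sums over all query lists rather than only those up to the pivot as in [4].

```python
import bisect

class BlockMaxList:
    """Posting list split into fixed-size blocks, each storing its maximum score."""

    def __init__(self, postings, block_size):
        # postings: list of (doc_id, score) sorted by doc_id
        self.blocks = [postings[i:i + block_size]
                       for i in range(0, len(postings), block_size)]
        self.last_doc = [block[-1][0] for block in self.blocks]
        self.block_max = [max(s for _, s in block) for block in self.blocks]
        self.list_max = max(self.block_max)

    def block_bound(self, doc_id):
        """(max score, last doc ID) of the block that could contain doc_id."""
        i = bisect.bisect_left(self.last_doc, doc_id)
        if i == len(self.blocks):
            return 0.0, float("inf")  # list exhausted: no further contribution
        return self.block_max[i], self.last_doc[i]

def block_max_check(lists, doc_id, threshold):
    """Return (must_score, next_doc) for a candidate doc_id."""
    bounds = [lst.block_bound(doc_id) for lst in lists]
    if sum(score for score, _ in bounds) > threshold:
        return True, doc_id
    # No document up to the nearest block boundary can beat the threshold.
    return False, min(last for _, last in bounds) + 1

search = BlockMaxList([(3, 1.0), (8, 1.5), (12, 6.0), (20, 1.0)], block_size=2)
engine = BlockMaxList([(2, 5.0), (8, 1.0), (15, 0.5), (30, 2.0)], block_size=2)

print(search.list_max + engine.list_max)            # list-wide bound: 11.0
print(block_max_check([search, engine], 8, 7.0))    # (False, 9)
print(block_max_check([search, engine], 12, 7.0))   # (True, 12)
```

With a threshold of 7, plain WAND would have to score doc 8 because the list-wide bound is 11. The block-level bound is only 1.5 + 5.0 = 6.5, so the whole block can be skipped.
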
  22. DIVE INTO IMPLEMENTATION
  23. DISCLAIMER

    - This covers a low-level, complex part of Lucene. It could include mistakes...
    - The Lucene API can change rapidly. This is based on the branch_8_0 branch.
  24. REVIEW: LUCENE SCORING ARCHITECTURE

    Ex. TermQuery
  25. REVIEW: LUCENE SCORING ARCHITECTURE

    Ex. BooleanQuery
  26. BLOCK-MAX WAND IMPLEMENTATION

    Changes in indexing
    - o.a.l.index.Impact
    - o.a.l.codecs.CompetitiveImpactAccumulator
    - o.a.l.codecs.lucene50.Lucene50SkipWriter#writeImpa ...
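
As far as I understand it, the indexing side records, per block of postings, only those (term frequency, norm) pairs that could produce the highest score. The sketch below illustrates that idea and is not the Lucene implementation: it assumes a score that never decreases with frequency and never increases with the norm, treats norms as plain integers, and the name `competitive_impacts` is made up.

```python
def competitive_impacts(pairs):
    """Keep only the (freq, norm) pairs that no other pair dominates.

    A pair is dominated when another pair has freq >= and norm <=, because
    that other pair scores at least as high under the stated assumption.
    """
    frontier = []
    # Highest freq first; among equal freqs, smallest norm first.
    for freq, norm in sorted(set(pairs), key=lambda p: (-p[0], p[1])):
        # frontier[-1] always holds the smallest norm seen so far.
        if not frontier or norm < frontier[-1][1]:
            frontier.append((freq, norm))
    return sorted(frontier)

# Hypothetical (freq, norm) pairs observed in one block of postings.
block = [(1, 10), (3, 12), (2, 10), (3, 20), (1, 8), (5, 30)]
impacts = competitive_impacts(block)
print(impacts)  # [(1, 8), (2, 10), (3, 12), (5, 30)]
```

Only these few pairs need to be stored next to the skip data: the best score in the block is guaranteed to come from one of them, whichever similarity is used at search time (as long as it satisfies the assumption above).
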
  27. BLOCK-MAX WAND IMPLEMENTATION

    Changes in posting list retrieval
    - o.a.l.codecs.lucene50.Lucene50ScoreSkipReader
    - o.a.l.index.ImpactsSource
    - o.a.l.search.MaxScoreCache
    - o.a.l.search.ImpactsDISI ...
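
On the retrieval side, the stored impacts have to be turned into a score upper bound for a block. A sketch of that step, with several assumptions: the scorer is a textbook BM25 formula (not necessarily the exact variant Lucene 8 uses), the norm is treated directly as the document length, and the names `bm25` and `block_max_score` are hypothetical.

```python
def bm25(freq, doc_len, idf=1.0, avg_len=10.0, k1=1.2, b=0.75):
    """Textbook BM25 term score; grows with freq and shrinks with doc_len."""
    return idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_len))

def block_max_score(impacts, scorer=bm25):
    """Upper bound of the score inside a block, given its (freq, norm) impacts."""
    return max(scorer(freq, norm) for freq, norm in impacts)

# Hypothetical competitive impacts of one block, plus two dominated pairs.
impacts = [(1, 8), (2, 10), (3, 12), (5, 30)]
all_pairs = impacts + [(3, 20), (1, 10)]

bound = block_max_score(impacts)
print(bound)
print(bound == block_max_score(all_pairs))  # dominated pairs never raise the bound
```

Because the bound is computed at search time from the impacts, it can follow whatever scoring function the query uses. Caching it per block avoids repeating the computation for every candidate document.
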
  28. BLOCK-MAX WAND IMPLEMENTATION

    Changes in scoring
    Ex. TermQuery
  29. BLOCK-MAX WAND IMPLEMENTATION

    Changes in scoring
    Ex. BooleanQuery
  30. [ANN] LUKE HAS BEEN REVISED!

    GUI tool for introspecting and debugging your Lucene/Solr/Elasticsearch index.
    https://github.com/DmitryKey/luke
  31. [ANN] LUKE HAS BEEN REVISED!

    - Eventually rewritten on top of Swing ... in 2019? :)
    - Licensed under ALv2 and works fine with JDK11+
    - Popular in US, Europe and China
    - Still big growth potential in Japan
    - It's a long story
  32. THANK YOU

    Happy (paper | code) reading!