
Elastic{ON} 2018 - Get the Lay of the Lucene Land

Elastic Co
March 01, 2018



Transcript

  1. About the speaker

    • Working with Lucene since 2010
    • Lucene committer since 2012
    • Lucene PMC since 2013
    • Elastic employee since 2013
  2. Wikipedia

    “Apache Lucene is a free and open-source information retrieval software library”
  3. Where is Lucene heading?

    Lucene 4 (2012) (Elasticsearch 0.90, 1)
    • Doc values
    • Flexible scoring
    • Better postings/store compression
    • Fuzzy queries speedup

    Lucene 5 (2015) (Elasticsearch 2)
    • Index safety
    • Slow query execution

    Lucene 6 (2016) (Elasticsearch 5)
    • Points
    • Index sorting
    • BM25 by default
    • Multi-term synonyms
  4. Where is Lucene heading?

    Lucene 7 (2017) (Elasticsearch 6)
    • Query planning
    • Sparse doc values

    Lucene 7.x (Elasticsearch 6.x)
    • ???

    Lucene 8 (2018/2019?) (Elasticsearch 7)
    • ???
  5. Agenda

    Lucene 7 (2017) (Elasticsearch 6)
    • Query planning
    • Sparse doc values

    Lucene 7.x (Elasticsearch 6.x)
    • CoveringQuery
    • Fine-grained flushing
    • BKD-based geo bounding boxes

    Lucene 8 (2018/2019?) (Elasticsearch 7)
    • WAND (Weak AND)
    • Impacts indexing
  6. Query planning: 2 options for ranges

                  Find all matches    Verify P matches
    Points        O(num_matches)      O(num_matches)
    Doc values    O(num_docs)         O(P)

    ⇒ IndexOrDocValuesQuery
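
The cost table on the query planning slide suggests a simple decision rule: when a range clause is intersected with another clause that yields P candidate documents, use whichever structure is cheaper. The sketch below is a simplified illustration of that idea only; the function and parameter names are hypothetical, and Lucene's actual IndexOrDocValuesQuery heuristic may weigh the costs differently.

```python
def choose_range_strategy(num_range_matches: int, num_candidates: int) -> str:
    """Pick how to evaluate a range clause that is ANDed with another clause.

    num_range_matches: estimated number of documents matching the range.
    num_candidates:    number of documents (P) produced by the other clause.
    Both names are illustrative assumptions, not Lucene API.
    """
    # Points cannot verify single documents cheaply: they enumerate every
    # range match, so the cost is O(num_matches) either way.
    points_cost = num_range_matches
    # Doc values verify candidates one by one: O(P).
    doc_values_cost = num_candidates
    return "points" if points_cost <= doc_values_cost else "doc_values"


if __name__ == "__main__":
    # A broad range combined with a very selective clause: verify with doc values.
    print(choose_range_strategy(num_range_matches=1_000_000, num_candidates=10))
    # A narrow range combined with a broad clause: lead with points.
    print(choose_range_strategy(num_range_matches=10, num_candidates=1_000_000))
```
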
  7. Sparse doc values fields (7.0+)

    Doc ID    Value
    0         42
    1         (none)
    2         (none)
    3         -3
    4         100
    5         (none)

    6.x storage
    Docs with field: [T, F, F, T, T, F]
    Value: [42, 0, 0, -3, 100, 0]

    7.0 storage
    Docs with field: [0, 3, 4]
    Value: [42, -3, 100]
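
To make the difference between the two layouts on the sparse doc values slide concrete, here is a small sketch that encodes the same column both ways. It only mirrors the shape of the data shown on the slide; the function names are hypothetical and the real on-disk encoding is more involved.

```python
from typing import List, Optional, Tuple


def encode_dense(column: List[Optional[int]]) -> Tuple[List[bool], List[int]]:
    """6.x-style layout: one slot per document, missing values stored as 0."""
    docs_with_field = [value is not None for value in column]
    values = [value if value is not None else 0 for value in column]
    return docs_with_field, values


def encode_sparse(column: List[Optional[int]]) -> Tuple[List[int], List[int]]:
    """7.0-style layout: only documents that actually have a value are stored."""
    docs_with_field = [doc_id for doc_id, value in enumerate(column) if value is not None]
    values = [value for value in column if value is not None]
    return docs_with_field, values


if __name__ == "__main__":
    column = [42, None, None, -3, 100, None]  # doc ids 0..5 from the slide
    print(encode_dense(column))
    print(encode_sparse(column))
```

With the sparse layout, storage grows with the number of documents that have the field rather than with the total number of documents.
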
  8. CoveringQuery (7.1+)

    Remember minimum_should_match?

    GET hotels/_search
    {
      "query": {
        "bool": {
          "should": [
            { "term": { "amenities": "pool" } },
            { "term": { "amenities": "spa" } },
            { "term": { "amenities": "fitness_center" } }
          ],
          "minimum_should_match": 2
        }
      }
    }
  9. CoveringQuery (7.1+)

    Attribute-based control

    PUT documents/_doc/1
    {
      "content": "…",
      "security_attributes": [ "employee", "project:lambda" ],
      "security_attributes_length": 2
    }
  10. CoveringQuery (7.1+)

    Attribute-based control

    GET documents/_search
    {
      "query": {
        "terms_set": {
          "security_attributes": {
            "terms": [ "employee", "project:gamma", "project:lambda" ],
            "minimum_should_match_field": "security_attributes_length"
          }
        }
      }
    }
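
The attribute-based control example boils down to a per-document threshold: a document matches only if the number of query terms it contains reaches the value stored in its own length field. The following sketch models that rule in plain Python; `covering_match` is a hypothetical helper for illustration, not an Elasticsearch or Lucene function.

```python
from typing import Iterable


def covering_match(doc_terms: Iterable[str], query_terms: Iterable[str], required: int) -> bool:
    """Return True if at least `required` query terms occur in the document.

    `required` plays the role of the per-document field referenced by
    minimum_should_match_field (security_attributes_length in the slides).
    """
    matched = len(set(doc_terms) & set(query_terms))
    return matched >= required


if __name__ == "__main__":
    # The document from the slides: both of its attributes must be held by the user.
    doc_attributes = ["employee", "project:lambda"]
    user_attributes = ["employee", "project:gamma", "project:lambda"]
    print(covering_match(doc_attributes, user_attributes, required=2))

    # A user without project:lambda only covers one of the two attributes.
    print(covering_match(doc_attributes, ["employee", "project:gamma"], required=2))
```
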
  11. BKD-based geo-shapes

    • Current shape support still based on postings
    • BKD-based bounding boxes available in 7.1: LatLonBoundingBox
    • Indexed as a 4-dimensional point
    • Using the BKD tree as an R-tree
    • Upcoming general BKD-based shape support (7.x)

    [Figure: Distance filter on geo points]
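
One way to picture "a bounding box indexed as a 4-dimensional point" is that an intersection test between two boxes becomes four independent range conditions, one per dimension, which is the kind of query a BKD tree can answer. The sketch below illustrates only that reduction. The `Box` type and `intersects` function are hypothetical, and dateline crossing and Lucene's actual value encoding are deliberately ignored.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A bounding box seen as a 4-dimensional point (illustrative type)."""
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float


def intersects(indexed: Box, query: Box) -> bool:
    """Box intersection expressed as one range condition per dimension.

    Assumption: boxes do not cross the dateline.
    """
    return (
        indexed.min_lat <= query.max_lat      # dimension 1: min_lat in (-inf, query.max_lat]
        and indexed.min_lon <= query.max_lon  # dimension 2: min_lon in (-inf, query.max_lon]
        and indexed.max_lat >= query.min_lat  # dimension 3: max_lat in [query.min_lat, +inf)
        and indexed.max_lon >= query.min_lon  # dimension 4: max_lon in [query.min_lon, +inf)
    )


if __name__ == "__main__":
    indexed = Box(10.0, 10.0, 20.0, 20.0)
    print(intersects(indexed, Box(15.0, 15.0, 30.0, 30.0)))  # overlapping boxes
    print(intersects(indexed, Box(21.0, 0.0, 30.0, 5.0)))    # disjoint boxes
```
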
  12. Fine-grained flushing

    Say you want to spend 1GB on indexing and have 2 shards, how do you do it?
  13. WAND (8.0)

    Can you make queries faster if you don’t need total hit counts?

    Sorted by field: Index sorting (6.0)
    Sorted by score: ???
  14. Anatomy of a Lucene index/query

    • Documents are identified by doc ids 0..N
    • Queries produce iterators over (doc id, score) pairs, sorted by doc id
    • Score of a boolean query is the sum of the scores of its clauses
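
The model described on the "Anatomy of a Lucene index/query" slide can be sketched in a few lines: each clause is a list of (doc id, score) pairs sorted by doc id, and a disjunction merges them, summing the scores of clauses that land on the same document. This is a toy illustration with made-up postings and scores, not Lucene's scorer implementation.

```python
import heapq
from typing import Iterator, List, Optional, Tuple

Posting = Tuple[int, float]  # (doc id, score)


def disjunction(clauses: List[List[Posting]]) -> Iterator[Posting]:
    """Merge clause iterators sorted by doc id, summing scores per document."""
    current: Optional[int] = None
    total = 0.0
    # heapq.merge keeps the combined stream sorted by doc id.
    for doc_id, score in heapq.merge(*clauses):
        if doc_id != current:
            if current is not None:
                yield current, total
            current, total = doc_id, 0.0
        total += score
    if current is not None:
        yield current, total


if __name__ == "__main__":
    # Hypothetical postings; scores chosen to be exact in binary floating point.
    the = [(0, 0.25), (1, 0.25), (3, 0.5)]
    fox = [(1, 1.5), (4, 2.0)]
    print(list(disjunction([the, fox])))
```

Note that every matching document is visited and scored, which is the cost that WAND later tries to avoid.
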
  15. How do disjunctions work?

    [Animation across slides 15 to 23: postings for “the”, “quick” and “fox” over doc ids 0 to 7. Document scores are filled in one slide at a time, in this order of appearance: 2.5, 1.6, 2.3, 2.0, 0.1, 1.9, 4.0, 2.4.]
  24. WAND: intuition

    • Search for “the OR fox”
    • If
      • minimum competitive score is 1
      • “the” contributes at most 0.2 to the score
    • Then documents MUST match “fox” to be competitive
  25. WAND: algorithm

    • Given C clauses, find next target:
      • Sort by non-decreasing current doc id
      • Sum up max scores until Σ max_score ≥ min_competitive_score
      • Return doc id of the first clause to meet this requirement
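
The "find next target" steps of the WAND algorithm slide can be sketched directly. The max scores (0.2, 2.0, 3.0) and the minimum competitive score (2.3) below are the ones shown in the deck's WAND example; the current doc ids are made up for illustration because the figure positions are not recoverable from the transcript. `Clause` and `next_target` are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class Clause(NamedTuple):
    term: str
    current_doc: int   # doc id the clause iterator is currently positioned on
    max_score: float   # upper bound of the score this clause can contribute


def next_target(clauses: List[Clause], min_competitive_score: float) -> Optional[int]:
    """Return the first doc id that could still reach min_competitive_score."""
    running_max = 0.0
    # Step 1: sort by non-decreasing current doc id.
    for clause in sorted(clauses, key=lambda c: c.current_doc):
        # Step 2: sum up max scores until the threshold is reached.
        running_max += clause.max_score
        if running_max >= min_competitive_score:
            # Step 3: the doc id of the clause that met the requirement.
            return clause.current_doc
    return None  # no remaining document can be competitive


if __name__ == "__main__":
    clauses = [
        Clause("the", current_doc=1, max_score=0.2),    # doc ids are assumptions
        Clause("quick", current_doc=3, max_score=2.0),
        Clause("fox", current_doc=5, max_score=3.0),
    ]
    # "the" plus "quick" can reach at most 2.2 < 2.3, so docs before 5 are skipped.
    print(next_target(clauses, min_competitive_score=2.3))
```
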
  26. WAND: example

    [Figure: postings for “the” (max score 0.2), “quick” (max score 2.0) and “fox” (max score 3.0) over doc ids 0 to 7, with min competitive score = 2.3 and the next target marked.]
  27. WAND: compute top 2 matches

    [Animation across slides 27 to 34: postings for “the” (max score 0.2), “quick” (max score 2.0) and “fox” (max score 3.0) over doc ids 0 to 7. Scores appear in this order: 2.5, 1.6, 2.3, an X in place of a score, 4.0, 2.4. Over the same slides the min competitive score goes from 0 to 1.6, then 2.3, then 2.5.]
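
The "compute top 2 matches" slides show the minimum competitive score rising as hits are collected. A minimal sketch of that bookkeeping, assuming a simple min-heap collector (the `TopK` class is hypothetical, not Lucene's collector API), reproduces the thresholds shown on the slides when fed the scores in their order of appearance:

```python
import heapq
from typing import List


class TopK:
    """Keep the k best scores and expose the minimum competitive score."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.heap: List[float] = []  # min-heap holding the current top-k scores

    def min_competitive_score(self) -> float:
        # Until k hits are collected, any score is competitive.
        return self.heap[0] if len(self.heap) == self.k else 0.0

    def collect(self, score: float) -> float:
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, score)
        elif score > self.heap[0]:
            heapq.heapreplace(self.heap, score)  # evict the weakest top-k hit
        return self.min_competitive_score()


if __name__ == "__main__":
    top2 = TopK(2)
    for score in [2.5, 1.6, 2.3, 4.0, 2.4]:  # scores in slide order
        print(score, "->", top2.collect(score))
```

The threshold returned after each hit is what a WAND-style scorer would use to decide which documents can be skipped.
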
  35. WAND: speedup?

    • 0 to 1000x faster
    • If all terms have the same IDF: no improvement
    • Otherwise: could be 1000x faster!
  36. Indexing of impacts

    What does the .doc file store?
    • Block of 128 doc ids
    • Skip data
    • Block of 128 freqs
  37. Skip data

    • First doc id of the block
    • Offset of block in .doc (same file)
    • Offset in .pos (if positions indexed)
    • Offset in .pay (if offsets or payloads indexed)
    • List of competitive (freq, norm) pairs (NEW)
    • Makes it easy to know the upper bound of scores
    • Still allows changing the Similarity on an existing index
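
A possible reading of "competitive (freq, norm) pairs" is a Pareto frontier: a pair can be dropped when another pair in the block has at least as high a frequency and a norm that scores at least as well. The sketch below rests on two assumptions that are not stated in the slides: scores never decrease as freq grows, and norm is treated as a length-like number where larger values never increase the score. The function names and the toy scoring function are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Impact = Tuple[int, int]  # (freq, norm)


def competitive_impacts(pairs: Iterable[Impact]) -> List[Impact]:
    """Keep only the pairs that could produce the maximum score of a block."""
    frontier: List[Impact] = []
    best_freq = -1
    # Visit pairs by increasing norm, highest freq first within the same norm.
    for freq, norm in sorted(set(pairs), key=lambda p: (p[1], -p[0])):
        # A pair is dominated if an earlier pair (norm no larger) has freq >= its freq.
        if freq > best_freq:
            frontier.append((freq, norm))
            best_freq = freq
    return frontier


def max_score(pairs: Iterable[Impact], score: Callable[[int, int], float]) -> float:
    """Upper bound of the block's scores, computed from the frontier only."""
    return max(score(freq, norm) for freq, norm in competitive_impacts(pairs))


if __name__ == "__main__":
    block = [(1, 10), (3, 10), (2, 5), (5, 20), (4, 30)]  # made-up impacts
    print(competitive_impacts(block))
    # Toy similarity: grows with freq, shrinks with norm. Any function with
    # that shape could be plugged in without touching the stored pairs.
    print(max_score(block, lambda freq, norm: freq / (freq + norm)))
```

Because the stored pairs do not bake in a particular scoring formula, the upper bound can be recomputed for a different similarity, which matches the slide's point about changing Similarity on an existing index.
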
  38. Usage of impacts?

    Term queries     Skip blocks whose max score is not competitive
    Conjunctions     Skip blocks whose sum of max scores is not competitive
    Disjunctions     WAND ⇒ block-max WAND
    Other queries    TODO
  39. Speedup?

    • Term queries: ~8x faster
    • Conjunctions and disjunctions
      • Many times faster when terms frequently appear together (united AND kingdom, new OR york)
      • Depends a lot on data distribution otherwise
  40. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/

    Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders. Please attribute Elastic with a link to elastic.co