Get the Lay of the Lucene Land

Elastic Co
March 08, 2017

Despite being close to 20 years old, the Lucene project keeps innovating. Hear about the latest features in Lucene 6, how they affected Elasticsearch, and what to expect in Lucene 7.

Adrien Grand | Software Engineer | Elastic

Transcript

  1. Get the Lay of the (Lucene) Land
     Adrien Grand (@jpountz), Elastic, March 8th 2017
  2. • Working with Lucene since 2010
     • Lucene committer since 2012
     • Lucene PMC since 2013
     • Elastic employee since 2013
  3. Wikipedia: "Apache Lucene is a free and open-source information retrieval software library"
  4. Where is Lucene heading?
     • Lucene 4 (2012): Doc values - Flexible scoring, Better postings/store compression, Fuzzy queries speedup
     • Lucene 5 (2015): Index safety, Slow query execution
     • Lucene 6 (2016): Points - Index sorting, BM25 by default, Multi-term synonyms
     • Lucene 7 (2017?): Query planning, Sparse doc values
  5. Where is Lucene heading? Analytics (same timeline as slide 4)
  6. Where is Lucene heading? Structured search (same timeline as slide 4)
  7. Where is Lucene heading? Data store (same timeline as slide 4)
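  8. Better query parsing (6.2-7.0+)
     • Query parsers no longer split on whitespace
     • Splitting is up to the search analyzer
     • Correct multi-term synonyms at query time
     Example: the query "SF bay" is analyzed into a token graph over positions 0 to 3, where "sf" and "san francisco" are alternative paths followed by "bay". The resulting query is (sf OR "san francisco") OR bay.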
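To make the token graph on slide 8 concrete, here is a minimal Python sketch that walks a small graph and prints the boolean expression from the slide. It is not Lucene code: the edge layout, the function names, and the explicit `synonym_span` argument are assumptions made for illustration only.

```python
# Token graph for the analyzed query "SF bay" with the synonym sf -> san francisco.
# Each edge is (from_state, to_state, term). The layout is assumed from the slide.
EDGES = [
    (0, 2, "sf"),
    (0, 1, "san"),
    (1, 2, "francisco"),
    (2, 3, "bay"),
]

def paths(edges, start, end):
    """Enumerate every term sequence that leads from start to end."""
    if start == end:
        return [[]]
    result = []
    for src, dst, term in edges:
        if src == start:
            for tail in paths(edges, dst, end):
                result.append([term] + tail)
    return result

def clause(terms):
    """A single term stays bare; several terms become a quoted phrase."""
    return terms[0] if len(terms) == 1 else '"' + " ".join(terms) + '"'

def to_query(edges, synonym_span, final_state):
    """Group the alternative paths of the synonym span, then OR the remaining terms."""
    start, end = synonym_span
    alternatives = " OR ".join(clause(p) for p in paths(edges, start, end))
    rest = [clause(p) for p in paths(edges, end, final_state)]
    return " OR ".join(["(" + alternatives + ")"] + rest)

print(to_query(EDGES, (0, 2), 3))
```

The point of the sketch is that "san francisco" survives as a phrase because the analyzer sees the whole query string, which is not possible when the parser splits on whitespace first.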
  9. More information on Thursday at 12:45: Elasticsearch search improvements, by Jim Ferenczi
  10. Range fields (6.2+)
     • Indexed like 2D points in a BKD tree
     • More efficient than 2 separate 1D ranges
     • INTERSECTS / WITHIN / CONTAINS / CROSSES relations
     Figure: the ranges [1,5], [4,5], [1,2] and [3,6] plotted as points on min/max axes.
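The relations on slide 10 can be sketched in Python by treating each indexed range as a 2D point (min, max). The four ranges come from the slide; the query range [3, 6] and the function names are assumptions for illustration. CROSSES is taken here to mean "intersects but is not within", which is my reading of the relation rather than something the slide states.

```python
# Each indexed range is a 2D point (min, max); a relation is a box test on that point.
INDEXED = [(1, 5), (4, 5), (1, 2), (3, 6)]

def intersects(r, q):
    return r[0] <= q[1] and r[1] >= q[0]

def within(r, q):
    """The indexed range lies entirely inside the query range."""
    return q[0] <= r[0] and r[1] <= q[1]

def contains(r, q):
    """The indexed range entirely covers the query range."""
    return r[0] <= q[0] and q[1] <= r[1]

def crosses(r, q):
    """Assumed definition: overlaps the query range without being inside it."""
    return intersects(r, q) and not within(r, q)

def matches(relation, query):
    return [r for r in INDEXED if relation(r, query)]

query = (3, 6)  # hypothetical query range
for relation in (intersects, within, contains, crosses):
    print(relation.__name__, matches(relation, query))
```

Every test compares only the min and max of the indexed range against constants, which is why a single 2D point lookup can answer it.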
  11. More information on Thursday at 12:45: Elasticsearch search improvements, by Nick “geo” Knize
  12. Index sorting (6.2+)
     • Queries return documents in index order
     • Index sorting makes index order configurable
     • Benchmark on the geonames dataset
     • 8.5 M documents
     Sample document: { "geoname_id": 6252001, "name": "United States", "type": "country", "country_code": "US", "population": 310232863 }
  13. Index sorting: faster sorting (6.2+)
     INDEX ORDER                       RANDOM ORDER   POPULATION DESC
     INDEX TIME                        64s            87s (+36%)
     INDEX SIZE                        463MB          436MB (-6%)
     TOP 10 LOCATIONS BY POPULATION    120ms          0.02ms (6000x faster)
     IDEM + HIT COUNT                  120ms          17ms (7x faster)
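The top-10 speedup on slide 13 is presumably due to early termination: when index order already matches the sort order, the first hits are the best hits. The Python sketch below illustrates the idea on synthetic data. The function names and the random populations are assumptions; this is not the benchmark code.

```python
import heapq
import random

def top_n_any_order(populations, n):
    """Arbitrary index order: every document has to be visited and compared."""
    visited = 0
    heap = []  # min-heap holding the best n values seen so far
    for value in populations:
        visited += 1
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True), visited

def top_n_sorted_index(populations_desc, n):
    """Index sorted by population descending: stop after the first n documents."""
    top = populations_desc[:n]
    return top, len(top)

rng = random.Random(0)
populations = [rng.randrange(1_000_000) for _ in range(10_000)]

slow_top, slow_visited = top_n_any_order(populations, 10)
fast_top, fast_visited = top_n_sorted_index(sorted(populations, reverse=True), 10)
print(slow_visited, fast_visited)
```

Both functions return the same ten values, but the sorted variant touches 10 documents instead of all 10,000. Requesting a hit count as well means matches still have to be counted, which would fit the smaller speedup in the last row of the table.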
  14. Index sorting: faster searching (6.2+)
     INDEX ORDER                       RANDOM ORDER   TYPE ASC, COUNTRY_CODE ASC
     INDEX TIME                        64s            136s (+112%)
     INDEX SIZE                        463MB          374MB (-19%)
     TYPE:(CITY OR COUNTRY)            40ms           13ms (3x faster)
     TYPE:CITY AND COUNTRY_CODE:US     46ms           28ms (1.6x faster)
  15. Sparse doc values fields (7.0+)
     Doc ID   Value
     0        42
     1
     2
     3        -3
     4        100
     5
     6.x storage: Docs with field [T, F, F, T, T, F], Value [42, 0, 0, -3, 100, 0]
     7.0 storage: Docs with field [0, 3, 4], Value [42, -3, 100]
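The two layouts on slide 15 can be reproduced with a short Python sketch. The input values are the ones from the slide; the function names and the `SparseIterator` class are hypothetical and only mimic the idea of a forward-only iterator, not Lucene's actual classes.

```python
VALUES = {0: 42, 3: -3, 4: 100}  # doc ID -> value, taken from the slide
MAX_DOC = 6

def encode_dense(values, max_doc):
    """6.x-style layout: one slot per document, plus a 'docs with field' bitset."""
    has_value = [doc in values for doc in range(max_doc)]
    slots = [values.get(doc, 0) for doc in range(max_doc)]
    return has_value, slots

def encode_sparse(values):
    """7.0-style layout: only documents that have a value are stored."""
    doc_ids = sorted(values)
    return doc_ids, [values[doc] for doc in doc_ids]

class SparseIterator:
    """Forward-only access over the sparse layout (hypothetical helper)."""

    def __init__(self, doc_ids, vals):
        self._doc_ids, self._vals, self._pos = doc_ids, vals, 0

    def advance(self, target):
        """Move to the first doc ID >= target; return it, or None when exhausted."""
        while self._pos < len(self._doc_ids) and self._doc_ids[self._pos] < target:
            self._pos += 1
        return self._doc_ids[self._pos] if self._pos < len(self._doc_ids) else None

    def value(self):
        return self._vals[self._pos]

print(encode_dense(VALUES, MAX_DOC))
print(encode_sparse(VALUES))
```

The dense layout costs one slot per document whether or not a value exists. The sparse layout stores three entries for six documents, but looking up an arbitrary doc ID now means moving forward through the list.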
  16. Sparse doc values fields (7.0+)
     • Pros
       • More space-efficient
       • Faster merging
       • More potential for compression
     • Cons
       • Only sequential access is efficient
       • 0-10% slowdown for sorting
  17. TermQuery (date/time sort) http://people.apache.org/~mikemccand/lucenebench/TermDTSort.html
     Chart annotation: Switch to a sequential API

  18. TermQuery (title sort) http://people.apache.org/~mikemccand/lucenebench/TermTitleSort.html
     Chart annotation: Switch to a sequential API

  19. Query planning (6.5+)
     • Queries have 2 primitive operations:
       • find matches
       • verify matches
     • Conjunction (ANDed clauses):
       • 1 clause that finds matches
       • 1-N clauses that verify matches
  20. Range query: points
     • Find all matches? O(#matches)
     • Verify N matches? O(N + #matches)
     Figure: a tree with root [2, 9] and leaves [2, 5] (doc 3 = 2, doc 1 = 5) and [6, 9] (doc 0 = 6, doc 2 = 9).
  21. Range query: doc values
     • Find all matches? O(#docs) (linear scan)
     • Verify N matches? O(N)
     Doc ID   Value
     0        6
     1        5
     2        9
     3        2
  22. Query planning: benchmark
     • 10M Wikipedia subset
       • body: text
       • last edit: date
     • Query: full-text query on body, filtered by a date range on the last edit date
     • Query planning:
       • points if range is more selective
       • doc values otherwise
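The planning rule on slide 22 can be sketched in Python using the four documents from slides 20 and 21. A sorted list with binary search stands in for the BKD tree, and the class name, method names, and term postings are assumptions for illustration. Real Lucene cost estimates are more involved than the simple count comparison used here.

```python
from bisect import bisect_left, bisect_right

class ToyIndex:
    """Toy index holding the same field as both points and doc values."""

    def __init__(self, values, term_docs):
        self.values = values  # doc values: position is the doc ID
        self.points = sorted((v, d) for d, v in enumerate(values))  # stand-in for a BKD tree
        self.point_values = [v for v, _ in self.points]
        self.term_docs = sorted(term_docs)  # postings of the full-text term
        self.term_set = set(term_docs)

    def range_count(self, lo, hi):
        """Number of documents in [lo, hi], found with two binary searches."""
        return bisect_right(self.point_values, hi) - bisect_left(self.point_values, lo)

    def search(self, lo, hi):
        """Return (strategy, matching doc IDs) for term AND value in [lo, hi]."""
        if self.range_count(lo, hi) < len(self.term_docs):
            # Range is more selective: find its matches via points, verify the term.
            start = bisect_left(self.point_values, lo)
            end = bisect_right(self.point_values, hi)
            candidates = (d for _, d in self.points[start:end])
            return "points", sorted(d for d in candidates if d in self.term_set)
        # Term is more selective: iterate its postings, verify the range via doc values.
        return "doc_values", [d for d in self.term_docs if lo <= self.values[d] <= hi]

index = ToyIndex(values=[6, 5, 9, 2], term_docs=[0, 1, 2])  # term postings are hypothetical
print(index.search(5, 6))
print(index.search(2, 9))
```

The narrow range [5, 6] matches fewer documents than the term, so the range leads through points. The wide range [2, 9] matches everything, so the term leads and each candidate is checked against doc values at O(1) per document.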
  23. Query planning: benchmark against 0.1% term

  24. Query planning: benchmark against 0.1% term
     Chart annotation: 30x faster

  25. Query planning: benchmark against 1% term

  26. Query planning: conclusion
     • Also works for:
       • geo bounding box queries
       • geo distance queries
     • Follow-ups:
       • Improve the heuristics
       • Make it work for prefix / wildcard / fuzzy / terms queries
  27. And more
     • Sequence numbers on index operations
     • Better query parsing of prefix / wildcard / fuzzy queries
     • Boolean similarity
     • Unified highlighter
     • Optimized geo distance sorting
  28. More Questions? Visit us at the AMA

  29. www.elastic.co

  30. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/. Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders. Please attribute Elastic with a link to elastic.co.