Get the Lay of the (Lucene) Land

Elastic Co
February 18, 2016

Elastic's Adrien Grand presents all the news you need to know about Apache Lucene at Elastic{ON}16 on February 18, 2016, in San Francisco.


Transcript

  1. Working with Lucene since 2010 • Lucene committer since 2012 • Lucene PMC since 2013 • Elastic employee since 2013
  2. Efficient structured search on an inverted index

    [Figure: grid cells labeled with 1-bit, 2-bit, and 3-bit binary prefixes on each axis]
    • Index data with several levels of precision
    • (1,0)
    • (10, 01)
    • (101, 010)
  3. Efficient structured search on an inverted index

    [Figure: grid cells labeled with 1-bit, 2-bit, and 3-bit binary prefixes on each axis]
    • Search by visiting as few cells as possible
    • (01,01)
    • (10,01)
    • (010,100)
    • (011,100)
    • (100,100)
    • (101,100)
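The two slides above (index at several precisions, then search with as few cells as possible) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Lucene's implementation: the names `cell_terms`, `index_point`, `covering_cells`, and `search` are hypothetical, the space is a 3-bit by 3-bit grid as in the figure, and the inverted index is a plain dictionary from cell to document ids.

```python
from collections import defaultdict

BITS = 3  # bits per dimension, matching the 3-bit cells in the figure


def cell_terms(x, y):
    """Return the (x_prefix, y_prefix) cell containing the point at every precision."""
    return [
        (format(x >> (BITS - level), f"0{level}b"),
         format(y >> (BITS - level), f"0{level}b"))
        for level in range(1, BITS + 1)
    ]


def index_point(index, doc_id, x, y):
    """Add the document to the postings of every cell that contains the point."""
    for term in cell_terms(x, y):
        index[term].add(doc_id)


def covering_cells(x_min, x_max, y_min, y_max, level=0, px=0, py=0):
    """Return the coarsest cells that exactly cover the inclusive query box."""
    size = 1 << (BITS - level)  # side length of a cell at this level
    cx_min, cy_min = px * size, py * size
    cx_max, cy_max = cx_min + size - 1, cy_min + size - 1
    if cx_max < x_min or cx_min > x_max or cy_max < y_min or cy_min > y_max:
        return []  # cell is disjoint from the query: prune it
    inside = (x_min <= cx_min and cx_max <= x_max
              and y_min <= cy_min and cy_max <= y_max)
    if inside and level >= 1:
        # The whole cell matches, so finer cells need not be visited.
        return [(format(px, f"0{level}b"), format(py, f"0{level}b"))]
    cells = []
    for dy in (0, 1):  # otherwise split into the four child cells
        for dx in (0, 1):
            cells += covering_cells(x_min, x_max, y_min, y_max,
                                    level + 1, 2 * px + dx, 2 * py + dy)
    return cells


def search(index, x_min, x_max, y_min, y_max):
    """Union the postings of the covering cells."""
    hits = set()
    for term in covering_cells(x_min, x_max, y_min, y_max):
        hits |= index.get(term, set())
    return hits


index = defaultdict(set)
index_point(index, "doc1", 0b101, 0b010)
print(cell_terms(0b101, 0b010))
print(covering_cells(2, 5, 2, 4))
print(search(index, 2, 5, 2, 4))
```

With the point (101, 010), `cell_terms` yields the three cells listed on the indexing slide. Assuming the query box is x in [010, 101] and y in [010, 100], which is inferred from the slide rather than stated on it, `covering_cells` yields the six cells listed on the search slide.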
  12. More information tomorrow at 11am: Geospatial Data Structures in Elasticsearch and Apache Lucene, by Nick Knize
  13. Doc freq contribution

    [Figure: doc freq contribution, TF-IDF compared with BM25]
    • Common words are naturally discriminated
    • No more need to exclude stop words
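One way to read the doc freq slide is to compare how the two scoring models weight a term as its document frequency grows. The sketch below is an assumption-laden illustration: the two formulas are the commonly cited forms of the classic TF-IDF and BM25 inverse document frequency, written from memory rather than taken from the slides, and the function names and corpus size are hypothetical.

```python
import math


def classic_idf(doc_freq, doc_count):
    """Classic TF-IDF style IDF (assumed form): 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


def bm25_idf(doc_freq, doc_count):
    """BM25 style IDF (assumed form): ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


doc_count = 1_000_000  # hypothetical corpus size
for doc_freq in (10, 1_000, 100_000, 900_000, 1_000_000):
    print(
        f"df={doc_freq:>9,}"
        f"  classic={classic_idf(doc_freq, doc_count):6.3f}"
        f"  bm25={bm25_idf(doc_freq, doc_count):6.3f}"
    )
```

Under these formulas, a term that appears in nearly every document keeps a weight close to 1 in the classic model, while its BM25 weight approaches 0. That is consistent with the claim that stop words no longer need to be excluded.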
  14. Two-phase iteration

    • Query can be divided into
      • a fast approximation
      • a (slower) match
    • Phrase
      • Approximation = conjunction
      • Match = position check
    • Geo polygon query
      • Approximation = points in cells that are in or cross the polygon
      • Match = check the point against the polygon
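The phrase case above can be sketched as follows. This is a minimal illustration, not Lucene's code: the positional index is a plain dictionary (term to document id to positions), and `build_index`, `approximation`, `matches`, and `phrase_search` are hypothetical names.

```python
def build_index(docs):
    """Build a positional inverted index: term -> doc id -> list of positions."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def approximation(index, terms):
    """Phase 1 (fast): conjunction, keeps docs that contain every term."""
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


def matches(index, terms, doc_id):
    """Phase 2 (slower): position check, terms must be adjacent and in order."""
    return any(
        all(start + i in index[term][doc_id] for i, term in enumerate(terms))
        for start in index[terms[0]][doc_id]
    )


def phrase_search(index, terms):
    """Run the position check only on docs that passed the approximation."""
    return [doc for doc in approximation(index, terms) if matches(index, terms, doc)]


docs = {
    1: "open source search engine",
    2: "the engine behind search",
    3: "search engine internals",
}
index = build_index(docs)
print(approximation(index, ["search", "engine"]))
print(phrase_search(index, ["search", "engine"]))
```

Document 2 contains both words, so it survives the approximation, but the position check rejects it because the words are not adjacent in that order.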
  15. Match cost API

    description:"search engine" AND body:"postings list"

    1. Approximate: (description:search AND description:engine) AND (body:postings AND body:list)
    2. Match: description:"search engine", body:"postings list"
  16. Match cost API

    description:"search engine" AND body:"postings list"

    Field        Value     df    ttf
    description  search    200k  420k
    description  engine    15k   18k
    body         postings  1k    13k
    body         list      370k  5920k
  18. Match cost API

    description:"search engine" AND body:"postings list"

    1. Approximate
       1. Iterate body:postings (1k)
       2. Check description:engine (15k)
       3. Check description:search (200k)
       4. Check body:list (370k)
    2. Match
       • description:"search engine"
       • body:"postings list"
  19. Match cost API

    description:"search engine" AND body:"postings list"

    Field        Value     df    ttf    ttf/df
    description  search    200k  420k   2.1
    description  engine    15k   18k    1.2
    body         postings  1k    13k    13
    body         list      370k  5920k  16
  20. Match cost API

    description:"search engine" AND body:"postings list"

    1. Approximate
       1. Iterate body:postings (1k)
       2. Check description:engine (15k)
       3. Check description:search (200k)
       4. Check body:list (370k)
    2. Match
       1. description:"search engine" (2.1+1.2=3.3)
       2. body:"postings list" (16+13=29)
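The ordering on the match cost slides can be reproduced with a short sketch. It assumes, as the slides suggest, that approximations are visited by increasing document frequency and that a phrase's match cost is estimated as the sum of ttf/df over its terms. The names `TERM_STATS`, `PHRASES`, `approximation_order`, `match_cost`, and `match_order` are hypothetical, and the figures are the table's values in thousands.

```python
# (df, ttf) per term, in thousands, copied from the table on the slides
TERM_STATS = {
    "description:search": (200, 420),
    "description:engine": (15, 18),
    "body:postings": (1, 13),
    "body:list": (370, 5920),
}

# Each phrase query and the terms its position check has to read
PHRASES = {
    'description:"search engine"': ["description:search", "description:engine"],
    'body:"postings list"': ["body:postings", "body:list"],
}


def approximation_order(term_stats):
    """Lead the conjunction with the rarest term, then check the others by df."""
    return sorted(term_stats, key=lambda term: term_stats[term][0])


def match_cost(terms, term_stats):
    """Estimate the position-check cost as the sum of average positions per doc."""
    return sum(ttf / df for df, ttf in (term_stats[term] for term in terms))


def match_order(phrases, term_stats):
    """Run the cheapest position check first."""
    return sorted(phrases, key=lambda phrase: match_cost(phrases[phrase], term_stats))


print(approximation_order(TERM_STATS))
for phrase in match_order(PHRASES, TERM_STATS):
    print(phrase, round(match_cost(PHRASES[phrase], TERM_STATS), 1))
```

Running the cheaper check first means the expensive `body:"postings list"` position check is only attempted on documents that already matched `description:"search engine"`.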
  21. Other changes

    • Better query-time synonyms
    • Disk-based norms
    • More memory-efficient doc values
    • More disk-efficient sparse doc values
    • BooleanQuery simplification in rewrite
    • Bulk scorer specialization for MatchAllDocsQuery and MUST_NOT clauses
    • Improved file truncation detection
  22. Please attribute Elastic with a link to elastic.co

    Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/

    Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.