Index sorting with Lucene

Presented at Lucene Revolution 2013 in Dublin.

Elasticsearch Inc

November 06, 2013

Transcript

  1. Anatomy of a Lucene index

    Index = collection of immutable segments
    – Segments store documents sequentially on disk
    – Add data = create a new segment
    – Segments eventually get merged together
    – Order of segments / documents in segments doesn’t matter: the following segments are equivalent

    Unsorted:
    Price    9  0  7  8  2  2  1  8 10  3  4  4  6 10  1  1 13
    Id       1  3 10  4  7 20 42 11  9  8 15 18 30 31 99  5 12

    Sorted by descending price:
    Price   13 10 10  9  8  8  7  6  4  4  3  2  2  1  1  1  0
    Id      12  9 31  1  4 11 10 30 15 18  8  7 20 42 99  5  3
  2. Anatomy of a Lucene index

    Ordinal of a doc in a segment = doc id
    – used in the inverted index to refer to docs

    Price    9  0  7  8  2  2  1  8 10  3  4  4  6 10  1  1 13
    Id       1  3 10  4  7 20 42 11  9  8 15 18 30 31 99  5 12
    doc id   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

    Postings for "shoe": 1, 3, 5, 8, 11, 13, 15
  3. Top hits

    Get top N=2 results:
    – Create a priority queue of size N
    – Accumulate matching docs

    Price    9  0  7  8  2  2  1  8 10  3  4  4  6 10  1  1 13
    Id       1  3 10  4  7 20 42 11  9  8 15 18 30 31 99  5 12

    Priority queue (ids) as matching docs are collected:
    ()  (3)  (3,4)  (4,20)  (4,9)  (4,9)  (9,31)  (9,31)
    – Create an empty priority queue
    – The priority queue overflows automatically, evicting the least competitive entry
    – The final queue holds the top hits
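The queue walkthrough on this slide can be sketched in plain Java. This is an illustration only, not Lucene code: the class `TopHitsSketch` and its `topN` method are hypothetical names, and the sketch assumes that the first row of the example index is the price (the sort key) and the second row is the id.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java sketch (not Lucene code): top-N collection with a bounded min-heap. */
final class TopHitsSketch {

    /** Returns the ids of the n highest-priced matching docs (order among ties is unspecified). */
    static List<Integer> topN(int[] prices, int[] ids, int[] matchingDocs, int n) {
        // Min-heap on price: the head is the least competitive hit collected so far.
        PriorityQueue<Integer> pq =
            new PriorityQueue<>(Comparator.comparingInt((Integer doc) -> prices[doc]));
        for (int doc : matchingDocs) {
            if (pq.size() < n) {
                pq.add(doc);                          // queue not full yet
            } else if (prices[doc] > prices[pq.peek()]) {
                pq.poll();                            // overflow: evict the least competitive hit
                pq.add(doc);
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!pq.isEmpty()) {
            result.add(0, ids[pq.poll()]);            // poll() yields the weakest first, so prepend
        }
        return result;
    }
}
```

Every matching doc has to be visited here, because a competitive hit can appear anywhere in an unsorted index.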
  4. Early termination

    Let’s do the same on a sorted index

    Price   13 10 10  9  8  8  7  6  4  4  3  2  2  1  1  1  0
    Id      12  9 31  1  4 11 10 30 15 18  8  7 20 42 99  5  3

    Priority queue (ids) as matching docs are collected:
    ()  (9)  (9,31)  (9,31)  (9,31)  (9,31)  (9,31)  (9,31)
    – Priority queue never changes after the second match (id 31)
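A minimal plain-Java sketch of the same collection on a sorted index follows. It is not Lucene code: `SortedIndexTopNSketch` and `topN` are hypothetical names, and the postings used in the usage note are the positions of the matching ids in the sorted index, which are derived here rather than given on the slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch (not Lucene code): early termination on a globally sorted index. */
final class SortedIndexTopNSketch {

    /**
     * sortedIds: ids in index order, best-ranked doc first.
     * postings:  doc ids (positions in sortedIds) of the matching docs, ascending.
     */
    static List<Integer> topN(int[] sortedIds, int[] postings, int n) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : postings) {
            if (hits.size() == n) {
                break;                 // early termination: later docs cannot rank higher
            }
            hits.add(sortedIds[doc]);  // docs arrive in rank order, no priority queue needed
        }
        return hits;
    }
}
```

With the sorted example index and postings 1, 2, 4, 9, 12, 15, 16, asking for N=2 visits only the first two matches and returns ids 9 and 31.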
  5. Early termination

    Pros
    – makes finding the top hits much faster
    – file-system cache-friendly

    Cons
    – only works for static ranks
      – not if the sort order depends on the query
    – requires the index to be sorted
    – doesn’t work for tasks that require visiting every doc:
      – total number of matches
      – faceting
  6. Static ranks

    Not uncommon!
    – Graph-based ranks
      – Google’s PageRank
    – Facebook social search / Unicorn
      – https://www.facebook.com/publications/219621248185635
    – Many more...

    Doesn’t need to be the exact sort order
    – heuristics when score is only a function of the static rank
  7. Offline sorting

    A live index can’t be kept sorted
    – would require inserting docs between existing docs!
    – segments are immutable

    Offline sorting to the rescue:
    – index as usual
    – sort into a new index
    – search!

    Pros/cons
    – super fast to search, the whole index is fully sorted
    – but only works for static content
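  8. Offline Sorting

    // Lucene 4.x API; the sorter classes live in the lucene-misc module
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.SlowCompositeReaderWrapper;
    import org.apache.lucene.index.sorter.NumericDocValuesSorter;
    import org.apache.lucene.index.sorter.Sorter;
    import org.apache.lucene.index.sorter.SortingAtomicReader;

    // open a reader on the unsorted index and create a sorted (but slow) view
    // (in and out are the source and destination directories)
    DirectoryReader reader = DirectoryReader.open(in);
    boolean ascending = false;
    Sorter sorter = new NumericDocValuesSorter("price", ascending);
    AtomicReader sortedReader = SortingAtomicReader.wrap(
        SlowCompositeReaderWrapper.wrap(reader), sorter);

    // copy the content of the sorted reader to the new dir
    IndexWriter writer = new IndexWriter(out, iwConf);
    writer.addIndexes(sortedReader);
    writer.close();
    reader.close();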
  9. Online sorting?

    Sort segments independently
    – wouldn’t require inserting data into existing segments
    – collection could still be early-terminated on a per-segment basis

    But segments are immutable
    – must be sorted before they are written
  10. Online sorting?

    2 sources of segments
    – flush
    – merge

    Flushed segments can’t be sorted
    – Lucene writes stored fields to disk on the fly
    – could be buffered but this would require a lot of memory

    Merged segments can be sorted
    – create a sorted view over the segments to merge
    – pass this view to SegmentMerger instead of the original segments

    Not a bad trade-off
    – flushed segments are usually small & fast to collect
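To make the idea of a "sorted view" concrete, here is a plain-Java sketch that computes the document order such a view would expose. It is not how Lucene's `Sorter` is implemented; `SortedViewSketch` and `sortedDocOrder` are hypothetical names, and the sketch assumes a descending sort on a per-document price, with ties kept in their original doc-id order.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Plain-Java sketch (not Lucene code): the doc order a sorted view would expose. */
final class SortedViewSketch {

    /** Returns newToOld, where newToOld[i] is the original doc id of the i-th doc in sorted order. */
    static int[] sortedDocOrder(int[] prices) {
        Integer[] docs = new Integer[prices.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        // Arrays.sort on objects is stable, so docs with equal prices keep their original order.
        Arrays.sort(docs, Comparator.comparingInt((Integer d) -> prices[d]).reversed());
        int[] newToOld = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {
            newToOld[i] = docs[i];
        }
        return newToOld;
    }
}
```

Reading the ids of the unsorted example index through this mapping gives the id row of the sorted example index (12, 9, 31, 1, ...). A merge that reads its input through such a mapping writes the new segment in sorted order without modifying the existing segments.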
  11. Online sorting?

    Flushed segments
    – NRT reopens
    – RAM buffer size limit hit

    Merged segments
    – http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
    – Merged segments can easily take 99+% of the size of the index
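  12. Online Sorting

    // Lucene 4.x API; the sorter classes live in the lucene-misc module
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.MergePolicy;
    import org.apache.lucene.index.sorter.NumericDocValuesSorter;
    import org.apache.lucene.index.sorter.Sorter;
    import org.apache.lucene.index.sorter.SortingMergePolicy;

    IndexWriterConfig iwConf = new IndexWriterConfig(...);

    // original MergePolicy finds the segments to merge
    MergePolicy origMP = iwConf.getMergePolicy();

    // SortingMergePolicy wraps the segments with a sorted view
    boolean ascending = false;
    Sorter sorter = new NumericDocValuesSorter("price", ascending);
    MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter);

    // setup IndexWriter to use SortingMergePolicy
    iwConf.setMergePolicy(sortingMP);
    IndexWriter writer = new IndexWriter(dir, iwConf);

    // index as usual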
  13. Early termination

    Collect top N matches

    Offline sorting
    – index sorted globally
    – early terminate after N matches have been collected
    – no priority queue needed!

    Online sorting
    – no early termination on flushed segments
    – early termination on merged segments
      – if N matches have been collected
      – or if current match is less than the top of the PQ
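The online-sorting rules on this slide can be sketched in plain Java as follows. This is an illustration, not Lucene code: `PerSegmentSketch` is a hypothetical class, each segment is reduced to the prices of its matching docs, a sorted segment is assumed to be in descending price order, and N is assumed to be at least 1.

```java
import java.util.PriorityQueue;

/** Plain-Java sketch (not Lucene code): one shared queue, early termination per sorted segment. */
final class PerSegmentSketch {
    final PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap: head = weakest hit
    final int n;                                             // number of top hits wanted (n >= 1)
    int visited = 0;                                         // matching docs actually collected

    PerSegmentSketch(int n) {
        this.n = n;
    }

    /** matchPrices: prices of one segment's matching docs, in doc-id order. */
    void collectSegment(int[] matchPrices, boolean segmentIsSorted) {
        int collected = 0;
        for (int price : matchPrices) {
            boolean queueFull = pq.size() == n;
            if (segmentIsSorted && (collected >= n || (queueFull && price <= pq.peek()))) {
                return; // nothing later in this sorted segment can be competitive
            }
            visited++;
            collected++;
            if (!queueFull) {
                pq.add(price);
            } else if (price > pq.peek()) {
                pq.poll();      // evict the weakest hit
                pq.add(price);
            }
        }
    }
}
```

For example, with N=2, an unsorted flushed segment with matching prices 3, 9, 1 followed by a sorted merged segment with matching prices 10, 8, 7, 2, 1 leaves 9 and 10 in the queue after visiting only 4 of the 8 matching docs: the merged segment is abandoned as soon as 8 fails to beat the weakest queued hit.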
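  14. Early Termination

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.sorter.SortingMergePolicy;
    import org.apache.lucene.search.CollectionTerminatedException;
    import org.apache.lucene.search.Collector;

    // sorter, maxDocsToCollect, curVal and pq are assumed to be defined elsewhere;
    // setScorer() and acceptsDocsOutOfOrder() are omitted for brevity
    class MyCollector extends Collector {

      private boolean readerIsSorted;
      private int collected;

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter);
        collected = 0;
      }

      @Override
      public void collect(int doc) throws IOException {
        // ">" rather than ">=" so that the N-th hit of a sorted segment is still collected
        if (readerIsSorted && (++collected > maxDocsToCollect || curVal <= pq.top())) {
          // Special exception that tells IndexSearcher to terminate
          // collection of the current segment
          throw new CollectionTerminatedException();
        } else {
          // collect hit
        }
      }
    }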