Index sorting with Lucene

Slide 1

Slide 1 text

Index Sorting
How to trade index speed for search speed

Slide 2

Slide 2 text

Anatomy of a Lucene index
Index = collection of immutable segments
Segments store documents sequentially on disk
Add data = create a new segment
Segments eventually get merged together
Order of segments / documents in segments doesn’t matter
– the following segments are equivalent
[Figure: two segments holding the same 17 (Id, Price) documents, one in arbitrary order and one in sorted order]

Slide 3

Slide 3 text

Anatomy of a Lucene index
Ordinal of a doc in a segment = doc id
– used in the inverted index to refer to docs
[Figure: the unsorted segment with doc ids 0 to 16, and the postings list for the term "shoe": 1, 3, 5, 8, 11, 13, 15]
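To make the doc id idea concrete, here is a minimal sketch in plain Java (no Lucene dependency). The class name, the `invert` helper, and the sample documents are hypothetical and only illustrate the principle; Lucene's actual postings format is far more compact.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PostingsSketch {
    /** Builds term -> ascending list of doc ids; a doc id is the doc's position in the segment. */
    static Map<String, List<Integer>> invert(List<String> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each doc at most once per term.
                if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                    list.add(docId);
                }
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        // Hypothetical segment content, in insertion order.
        List<String> segment = List.of("red shoe", "blue shirt", "running shoe", "shoe lace");
        System.out.println(invert(segment).get("shoe")); // prints [0, 2, 3]
    }
}
```

Because doc ids are just positions, reordering the documents of a segment changes every postings list, which is why sorting has to happen when a segment is written.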

Slide 4

Slide 4 text

Top hits
Get top N=2 results:
– Create a priority queue of size N
– Accumulate matching docs
[Figure: the unsorted segment, with the priority queue contents shown after each matching doc. Annotations: "Create an empty priority queue" at the start; "Automatic overflow of the priority queue to remove the least one" once more than N docs have been added. The queue ends at (9,31).]
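The collection loop on this slide can be sketched in plain Java as follows. This is an illustration under assumptions, not Lucene's collector code: the class name, the `topN` helper, and the sample values are made up, and only the value of each hit is kept.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopHitsSketch {
    /** Returns the n largest values among the matching docs, best first. */
    static List<Integer> topN(int[] values, int[] matchingDocs, int n) {
        // Min-heap: the head is the least competitive of the current top hits.
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        for (int doc : matchingDocs) {
            pq.add(values[doc]);
            if (pq.size() > n) {
                pq.poll(); // overflow: drop the least competitive hit
            }
        }
        List<Integer> hits = new ArrayList<>(pq);
        hits.sort(Collections.reverseOrder());
        return hits;
    }

    public static void main(String[] args) {
        int[] prices = {5, 12, 7, 30, 1, 18}; // hypothetical per-doc values
        int[] matches = {0, 2, 3, 5};         // doc ids matching the query
        System.out.println(topN(prices, matches, 2)); // prints [30, 18]
    }
}
```

Every matching doc has to be visited here, because on an unsorted segment the best hit may be the last one.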

Slide 5

Slide 5 text

Early termination
Let’s do the same on a sorted index
[Figure: the sorted segment, with the priority queue contents shown after each matching doc. The queue reaches (9,31) early on. Annotation: "Priority queue never changes after this document".]
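On a segment sorted by the ranking value, the first N matches in doc id order are already the top N, so collection can stop there. A minimal sketch in plain Java, assuming a segment sorted by descending value (the class name, `topNSorted`, and the data are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyTerminationSketch {
    /**
     * sortedValues[doc] is the value of doc in a segment sorted by descending value;
     * matchingDocs holds the matching doc ids in increasing order.
     */
    static List<Integer> topNSorted(int[] sortedValues, int[] matchingDocs, int n) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : matchingDocs) {
            if (hits.size() >= n) {
                break; // early termination: later docs cannot beat the collected ones
            }
            hits.add(sortedValues[doc]);
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] prices = {30, 18, 12, 7, 5, 1}; // already sorted, best first
        int[] matches = {1, 3, 4, 5};
        System.out.println(topNSorted(prices, matches, 2)); // prints [18, 7]
    }
}
```

Only two of the four matches are visited in this example; the savings grow with the number of matches per query.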

Slide 6

Slide 6 text

Early termination
Pros
– makes finding the top hits much faster
– file-system cache-friendly
Cons
– only works for static ranks
  – not if the sort order depends on the query
– requires the index to be sorted
– doesn’t work for tasks that require visiting every doc:
  – total number of matches
  – faceting

Slide 7

Slide 7 text

Static ranks
Not uncommon!
Graph-based ranks
– Google’s PageRank
Facebook social search / Unicorn
– https://www.facebook.com/publications/219621248185635
Many more...
Doesn’t need to be the exact sort order
– heuristics when score is only a function of the static rank

Slide 8

Slide 8 text

Offline sorting
A live index can’t be kept sorted
– would require inserting docs between existing docs!
– segments are immutable
Offline sorting to the rescue:
– index as usual
– sort into a new index
– search!
Pros/cons
– super fast to search, the whole index is fully sorted
– but only works for static content

Slide 9

Slide 9 text

Offline Sorting

    // open a reader on the unsorted index and create a sorted (but slow) view
    DirectoryReader reader = DirectoryReader.open(in);
    boolean ascending = false;
    Sorter sorter = new NumericDocValuesSorter("price", ascending);
    AtomicReader sortedReader = SortingAtomicReader.wrap(
        SlowCompositeReaderWrapper.wrap(reader), sorter);

    // copy the content of the sorted reader to the new dir
    IndexWriter writer = new IndexWriter(out, iwConf);
    writer.addIndexes(sortedReader);
    writer.close();
    reader.close();

Slide 10

Slide 10 text

Online sorting?
Sort segments independently
– wouldn’t require inserting data into existing segments
– collection could still be early-terminated on a per-segment basis
But segments are immutable
– must be sorted before starting to write them

Slide 11

Slide 11 text

Online sorting?
2 sources of segments
– flush
– merge
Flushed segments can’t be sorted
– Lucene writes stored fields to disk on the fly
– could be buffered but this would require a lot of memory
Merged segments can be sorted
– create a sorted view over the segments to merge
– pass this view to SegmentMerger instead of the original segments
Not a bad trade-off
– flushed segments are usually small & fast to collect
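One way to picture a "sorted view" is as a permutation of doc ids: the view reads the original documents in sorted order without moving them. The sketch below is plain Java and only illustrates that idea; `newToOld` and the sample prices are hypothetical names and data, not part of Lucene's Sorter API.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

public class SortedViewSketch {
    /** Returns a map where position i of the sorted view reads original doc newToOld[i]. */
    static int[] newToOld(int[] values) {
        // Descending by value; the sort is stable, so ties keep their original order.
        Comparator<Integer> byValueDesc =
            Comparator.comparingInt((Integer doc) -> values[doc]).reversed();
        return IntStream.range(0, values.length)
            .boxed()
            .sorted(byValueDesc)
            .mapToInt(Integer::intValue)
            .toArray();
    }

    public static void main(String[] args) {
        int[] prices = {7, 30, 7, 1}; // values of the docs to merge, in original order
        System.out.println(Arrays.toString(newToOld(prices))); // prints [1, 0, 2, 3]
    }
}
```

A merger that reads documents through such a map writes them out in sorted order, so the resulting segment is sorted even though its inputs were not.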

Slide 12

Slide 12 text

Online sorting?
Flushed segments
– NRT reopens
– RAM buffer size limit hit
Merged segments
[Figure: visualization of segment merges, from http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html]
Merged segments can easily take 99+% of the size of the index

Slide 13

Slide 13 text

Online Sorting

    IndexWriterConfig iwConf = new IndexWriterConfig(...);

    // original MergePolicy finds the segments to merge
    MergePolicy origMP = iwConf.getMergePolicy();

    // SortingMergePolicy wraps the segments with a sorted view
    boolean ascending = false;
    Sorter sorter = new NumericDocValuesSorter("price", ascending);
    MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter);

    // set up IndexWriter to use SortingMergePolicy
    iwConf.setMergePolicy(sortingMP);
    IndexWriter writer = new IndexWriter(dir, iwConf);
    // index as usual

Slide 14

Slide 14 text

Early termination
Collect top N matches
Offline sorting
– index sorted globally
– early terminate after N matches have been collected
– no priority queue needed!
Online sorting
– no early termination on flushed segments
– early termination on merged segments
  – if N matches have been collected
  – or if current match is less than the top of the PQ
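The online-sorting rules can be sketched in plain Java with one shared priority queue across segments. This is an illustration under simplifying assumptions: every doc matches the query, sorted segments are in descending value order, and all names and data are hypothetical rather than Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class PerSegmentCollectSketch {
    /** segments[i] holds the values of segment i; sorted[i] says whether it is sorted descending. */
    static List<Integer> topN(int[][] segments, boolean[] sorted, int n) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap shared by all segments
        for (int s = 0; s < segments.length; s++) {
            int collected = 0; // reset for each segment
            for (int value : segments[s]) {
                if (sorted[s]) {
                    boolean enough = collected >= n;
                    boolean notCompetitive = pq.size() >= n && value <= pq.peek();
                    if (enough || notCompetitive) {
                        break; // skip the rest of this sorted segment
                    }
                }
                collected++;
                pq.add(value);
                if (pq.size() > n) {
                    pq.poll(); // drop the least competitive hit
                }
            }
        }
        List<Integer> hits = new ArrayList<>(pq);
        hits.sort(Collections.reverseOrder());
        return hits;
    }

    public static void main(String[] args) {
        int[][] segments = {{4, 20, 9}, {31, 18, 5, 3, 1}}; // flushed, then merged
        boolean[] sorted = {false, true};
        System.out.println(topN(segments, sorted, 2)); // prints [31, 20]
    }
}
```

In this example the flushed segment is scanned fully, while the merged segment is abandoned after its second doc because 18 cannot displace 20 from the queue.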

Slide 15

Slide 15 text

Early Termination

    // readerIsSorted, collected, maxDocsToCollect, curVal, pq and sorter are
    // fields of the collector, declared elsewhere
    class MyCollector extends Collector {

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter);
        collected = 0;
      }

      @Override
      public void collect(int doc) throws IOException {
        if (readerIsSorted
            && (collected >= maxDocsToCollect || curVal <= pq.top())) {
          // Special exception that tells IndexSearcher to terminate
          // collection of the current segment
          throw new CollectionTerminatedException();
        }
        // collect hit
        collected++;
      }
    }

Slide 16

Slide 16 text

Questions?