Index sorting with Lucene

Index Sorting: how to trade index speed for search speed

Anatomy of a Lucene index

- Index = collection of immutable segments
- Segments store documents sequentially on disk
- Add data = create a new segment
- Segments eventually get merged together
- Order of segments / documents in segments doesn’t matter: the following segments are equivalent

Segment in insertion order:
Price:  9  0  7  8  2  2  1  8 10  3  4  4  6 10  1  1 13
Id:     1  3 10  4  7 20 42 11  9  8 15 18 30 31 99  5 12

Same documents, sorted by descending price:
Price: 13 10 10  9  8  8  7  6  4  4  3  2  2  1  1  1  0
Id:    12  9 31  1  4 11 10 30 15 18  8  7 20 42 99  5  3

Anatomy of a Lucene index

- Ordinal of a doc in a segment = doc id
- Doc ids are used in the inverted index to refer to docs

Doc id:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Price:   9  0  7  8  2  2  1  8 10  3  4  4  6 10  1  1 13
Id:      1  3 10  4  7 20 42 11  9  8 15 18 30 31 99  5 12

Inverted index entry: shoe -> 1, 3, 5, 8, 11, 13, 15
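To make the doc id idea concrete, here is a minimal Java sketch of an inverted index for a single segment. The class name `PostingsSketch`, the whitespace tokenizer, and the sample documents are illustrative assumptions, not Lucene APIs; real Lucene postings are encoded very differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch, not a Lucene API: a doc id is the document's ordinal in its segment.
final class PostingsSketch {

    // Maps each term to the ascending list of doc ids that contain it.
    static Map<String, List<Integer>> buildPostings(List<String> segmentDocs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < segmentDocs.size(); docId++) {
            // Naive whitespace tokenizer (an assumption); the set removes duplicate terms per doc.
            String[] tokens = segmentDocs.get(docId).toLowerCase().split("\\s+");
            for (String term : new TreeSet<>(Arrays.asList(tokens))) {
                postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("red shoe", "blue shirt", "running shoe");
        System.out.println(buildPostings(docs).get("shoe")); // prints [0, 2]
    }
}
```

Because doc ids are assigned in document order, each postings list comes out sorted by doc id, which is why reordering the documents in a segment changes the order in which matches are visited.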

Top hits

Get top N=2 results:
- Create a priority queue of size N
- Accumulate matching docs

[Figure: the unsorted segment, with the priority queue shown after each doc matching "shoe"]
- Create an empty priority queue: ()
- Priority queue after each match: (3), (3,4), (4,20), (4,9), (4,9), (9,31), (9,31)
- Automatic overflow of the priority queue to remove the least one
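The top-N collection above can be sketched in plain Java with `java.util.PriorityQueue`. This is only an illustration, not Lucene's collector: the class name `TopHitsSketch` is made up, the arrays transcribe the segment drawn on the slide, and it is an assumption that the row used for ranking is the price while the values shown in the queue are ids.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch only: arrays stand in for the slide's segment (row labels are assumed).
final class TopHitsSketch {
    static final int[] PRICE = {9, 0, 7, 8, 2, 2, 1, 8, 10, 3, 4, 4, 6, 10, 1, 1, 13};
    static final int[] ID = {1, 3, 10, 4, 7, 20, 42, 11, 9, 8, 15, 18, 30, 31, 99, 5, 12};
    static final int[] SHOE = {1, 3, 5, 8, 11, 13, 15}; // doc ids matching "shoe"

    // Returns the ids of the n highest-priced matching docs, best first.
    static List<Integer> topN(int[] price, int[] id, int[] matches, int n) {
        // The head of the queue is the least competitive hit: lowest price, then highest doc id.
        Comparator<Integer> leastFirst = Comparator
                .comparingInt((Integer doc) -> price[doc])
                .thenComparing(Comparator.<Integer>reverseOrder());
        PriorityQueue<Integer> pq = new PriorityQueue<>(leastFirst);
        for (int doc : matches) {
            pq.add(doc);
            if (pq.size() > n) {
                pq.poll(); // overflow: drop the least competitive hit
            }
        }
        List<Integer> hits = new ArrayList<>();
        while (!pq.isEmpty()) {
            hits.add(id[pq.poll()]);
        }
        Collections.reverse(hits); // polled worst-first, so reverse to get best-first
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(topN(PRICE, ID, SHOE, 2)); // prints [9, 31]
    }
}
```

Every one of the seven matches has to be offered to the queue here, because on an unsorted segment a better hit can show up at any position.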

Early termination

Let’s do the same on a sorted index.

[Figure: the sorted segment, with the priority queue shown after each doc matching "shoe"]
- Priority queue: (), (9), (9,31), (9,31), (9,31), (9,31), (9,31), (9,31)
- The priority queue never changes after the second matching document
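On a sorted segment the same query needs no queue at all, which the following sketch tries to show. It is a hedged illustration rather than Lucene code: `EarlyTerminationSketch` is a made-up name, `SORTED_ID` transcribes the sorted segment from the slide, and the `SHOE` positions were derived by looking up where the seven matching ids land after sorting.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: early termination over one segment whose docs are already in rank order.
final class EarlyTerminationSketch {
    // Ids in sorted-segment order (descending price), as drawn on the slide.
    static final int[] SORTED_ID = {12, 9, 31, 1, 4, 11, 10, 30, 15, 18, 8, 7, 20, 42, 99, 5, 3};
    // Doc ids matching "shoe" in the sorted segment (derived, see lead-in).
    static final int[] SHOE = {1, 2, 4, 9, 12, 15, 16};

    // Returns the ids of the first n matches; later matches cannot rank higher.
    static List<Integer> topN(int[] id, int[] matches, int n) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : matches) {
            if (hits.size() >= n) {
                break; // early termination: the rest of the segment is skipped
            }
            hits.add(id[doc]);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(topN(SORTED_ID, SHOE, 2)); // prints [9, 31]
    }
}
```

The result is the same pair of ids as on the unsorted segment, but only the first two of the seven matches are visited.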

Early termination

Pros
- makes finding the top hits much faster
- file-system cache-friendly

Cons
- only works for static ranks, not if the sort order depends on the query
- requires the index to be sorted
- doesn’t work for tasks that require visiting every doc:
  - total number of matches
  - faceting

Static ranks

Not uncommon!
- Graph-based ranks
  - Google’s PageRank
- Facebook social search / Unicorn
  - https://www.facebook.com/publications/219621248185635
- Many more...

Doesn’t need to be the exact sort order
- heuristics when score is only a function of the static rank

Offline sorting

A live index can’t be kept sorted
- would require inserting docs between existing docs!
- segments are immutable

Offline sorting to the rescue:
- index as usual
- sort into a new index
- search!

Pros/cons
- super fast to search, the whole index is fully sorted
- but only works for static content
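Conceptually, "sort into a new index" means computing a permutation of the old doc ids and rewriting every document in that order. The sketch below illustrates this with plain arrays; `OfflineSortSketch`, `sortedOrder`, and `reorder` are hypothetical names, and the price/id arrays are the ones from the anatomy slide (with the row labels assumed).

```java
import java.util.Arrays;

// Sketch only: offline sorting as "compute a doc id permutation, then rewrite the data".
final class OfflineSortSketch {
    static final int[] PRICE = {9, 0, 7, 8, 2, 2, 1, 8, 10, 3, 4, 4, 6, 10, 1, 1, 13};
    static final int[] ID = {1, 3, 10, 4, 7, 20, 42, 11, 9, 8, 15, 18, 30, 31, 99, 5, 12};

    // Returns the old doc ids in their new order: descending price, ties keep their old order.
    static int[] sortedOrder(int[] price) {
        Integer[] order = new Integer[price.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Arrays.sort on an object array is stable, so equal prices keep insertion order.
        Arrays.sort(order, (a, b) -> Integer.compare(price[b], price[a]));
        int[] result = new int[order.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }

    // Rewrites one column of the segment in the new doc order.
    static int[] reorder(int[] values, int[] order) {
        int[] out = new int[order.length];
        for (int newDoc = 0; newDoc < order.length; newDoc++) {
            out[newDoc] = values[order[newDoc]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] order = sortedOrder(PRICE);
        System.out.println(Arrays.toString(reorder(PRICE, order)));
        System.out.println(Arrays.toString(reorder(ID, order)));
    }
}
```

Running it on the unsorted segment reproduces the sorted segment from the anatomy slide, which is why the whole-index approach only suits content that no longer changes: any new document would have to be inserted in the middle.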

Offline Sorting

    // open a reader on the unsorted index and create a sorted (but slow) view
    DirectoryReader reader = DirectoryReader.open(in);
    boolean ascending = false;
    Sorter sorter = new NumericDocValuesSorter("price", ascending);
    AtomicReader sortedReader = SortingAtomicReader.wrap(
        SlowCompositeReaderWrapper.wrap(reader), sorter);

    // copy the content of the sorted reader to the new dir
    IndexWriter writer = new IndexWriter(out, iwConf);
    writer.addIndexes(sortedReader);
    writer.close();
    reader.close();

Online sorting?

Sort segments independently
- wouldn’t require inserting data into existing segments
- collection could still be early-terminated on a per-segment basis

But segments are immutable
- they must be sorted before they are written

Online sorting?

2 sources of segments
- flush
- merge

Flushed segments can’t be sorted
- Lucene writes stored fields to disk on the fly
- could be buffered but this would require a lot of memory

Merged segments can be sorted
- create a sorted view over the segments to merge
- pass this view to SegmentMerger instead of the original segments

Not a bad trade-off
- flushed segments are usually small & fast to collect
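The flush/merge split can be modelled in a few lines of plain Java. This is a deliberately simplified sketch under stated assumptions: `SortingMergeSketch`, `Segment`, `flush`, and `merge` are hypothetical names, a segment is reduced to an array of prices, and the merge sorts eagerly, whereas the approach described above hands a lazily sorted view to the merger.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: flushed segments stay in arrival order, merged segments come out sorted.
final class SortingMergeSketch {

    static final class Segment {
        final int[] prices;   // one value per doc, in doc id order
        final boolean sorted; // true only for segments produced by a merge

        Segment(int[] prices, boolean sorted) {
            this.prices = prices;
            this.sorted = sorted;
        }
    }

    // A flush writes docs in arrival order, so the result is not sorted.
    static Segment flush(int... bufferedPrices) {
        return new Segment(bufferedPrices.clone(), false);
    }

    // A merge reads all docs of its inputs and writes them in descending price order.
    static Segment merge(List<Segment> toMerge) {
        int[] all = toMerge.stream().flatMapToInt(s -> Arrays.stream(s.prices)).toArray();
        Arrays.sort(all); // ascending
        for (int i = 0, j = all.length - 1; i < j; i++, j--) { // reverse to descending
            int tmp = all[i];
            all[i] = all[j];
            all[j] = tmp;
        }
        return new Segment(all, true);
    }

    public static void main(String[] args) {
        Segment merged = merge(Arrays.asList(flush(3, 9, 1), flush(7, 2)));
        System.out.println(Arrays.toString(merged.prices) + " sorted=" + merged.sorted);
    }
}
```

The `sorted` flag is the piece a collector needs at search time: it tells the collector whether early termination is safe for that particular segment.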

Online sorting?

Flushed segments
- NRT reopens
- RAM buffer size limit hit

Merged segments
- http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Merged segments can easily take 99+% of the size of the index

Online Sorting

    IndexWriterConfig iwConf = new IndexWriterConfig(...);

    // original MergePolicy finds the segments to merge
    MergePolicy origMP = iwConf.getMergePolicy();

    // SortingMergePolicy wraps the segments with a sorted view
    boolean ascending = false;
    Sorter sorter = new NumericDocValuesSorter("price", ascending);
    MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter);

    // setup IndexWriter to use SortingMergePolicy
    iwConf.setMergePolicy(sortingMP);
    IndexWriter writer = new IndexWriter(dir, iwConf);
    // index as usual

Early termination

Collect top N matches

Offline sorting
- index sorted globally
- early terminate after N matches have been collected
- no priority queue needed!

Online sorting
- no early termination on flushed segments
- early termination on merged segments
  - if N matches have been collected
  - or if current match is less than the top of the PQ

Early Termination

    // readerIsSorted, collected, sorter, maxDocsToCollect, curVal and pq
    // are fields of the collector (declarations omitted on the slide)
    class MyCollector extends Collector {

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter);
        collected = 0;
      }

      @Override
      public void collect(int doc) throws IOException {
        // collected++ (not ++collected) so that maxDocsToCollect hits are kept
        // before this segment is terminated
        if (readerIsSorted
            && (collected++ >= maxDocsToCollect || curVal <= pq.top())) {
          // Special exception that tells IndexSearcher to terminate
          // collection of the current segment
          throw new CollectionTerminatedException();
        } else {
          // collect hit
        }
      }
    }
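The termination rule used by the collector can also be tried out without Lucene. The following sketch simulates it over in-memory segments; `PerSegmentCollectorSketch` and `Segment` are hypothetical names, a doc is reduced to its price, and a `break` stands in for throwing `CollectionTerminatedException`. It also assumes the second condition only applies once the queue is full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Sketch only: top-n collection across a mix of sorted (merged) and unsorted (flushed) segments.
final class PerSegmentCollectorSketch {

    static final class Segment {
        final int[] prices;   // sorted segments hold prices in descending order
        final boolean sorted;

        Segment(boolean sorted, int... prices) {
            this.sorted = sorted;
            this.prices = prices;
        }
    }

    // Returns the n highest prices over all segments, best first.
    static List<Integer> topPrices(List<Segment> segments, int n) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap: head is the weakest hit
        for (Segment seg : segments) {
            int collected = 0; // reset per segment, like setNextReader does
            for (int price : seg.prices) {
                boolean pqFull = pq.size() == n;
                if (seg.sorted && (collected >= n || (pqFull && price <= pq.peek()))) {
                    break; // nothing later in this sorted segment can be competitive
                }
                pq.add(price);
                if (pq.size() > n) {
                    pq.poll(); // overflow: drop the weakest hit
                }
                collected++;
            }
        }
        List<Integer> top = new ArrayList<>(pq);
        top.sort(Collections.reverseOrder());
        return top;
    }

    public static void main(String[] args) {
        List<Segment> index = Arrays.asList(
                new Segment(true, 9, 7, 3, 2, 1), // merged, sorted
                new Segment(false, 8, 4));        // flushed, unsorted
        System.out.println(topPrices(index, 2)); // prints [9, 8]
    }
}
```

In this example the sorted segment is abandoned after two docs, while the small unsorted segment is scanned in full, which mirrors the trade-off described for flushed versus merged segments.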

Questions?

Elasticsearch Inc
