Slide 1

What is in a Lucene index?
Adrien Grand
Software engineer at Elasticsearch
@jpountz

Slide 2

About me
• Lucene/Solr committer
• Software engineer at Elasticsearch
• I like changing the index file formats!
  – stored fields
  – term vectors
  – doc values
  – ...

Slide 3

Why should I learn about Lucene internals?

Slide 4

Why should I learn about Lucene internals?
• Know the cost of the APIs
  – to build blazing fast search applications
  – don't commit all the time
  – when to use stored fields vs. doc values
  – maybe Lucene is not the right tool
• Understand index size
  – oh, term vectors are 1/2 of the index size!
  – I removed 20% of my documents and the index size hasn't changed
• This is a lot of fun!

Slide 5

Indexing
• Make data fast to search
  – duplicate data if it helps
  – decide on how to index based on the queries
• Trade update speed for search speed
  – grep vs. full-text indexing
  – prefix queries vs. edge n-grams
  – phrase queries vs. shingles
• Indexing is fast
  – 220 GB/hour for ~4 KB docs!
  – http://people.apache.org/~mikemccand/lucenebench/indexing.html
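
To make the indexing cost model concrete, here is a minimal sketch using the Lucene 4.x API referenced later in the deck; the directory path, field names and analyzer are illustrative choices, not from the talk.

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexingSketch {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/books-index"));        // hypothetical path
        IndexWriterConfig iwc =
            new IndexWriterConfig(Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45));
        IndexWriter writer = new IndexWriter(dir, iwc);

        Document doc = new Document();
        doc.add(new StringField("id", "1", Field.Store.YES));                  // not analyzed, e.g. a primary key
        doc.add(new TextField("title", "Lucene in action", Field.Store.YES));  // analyzed into terms
        writer.addDocument(doc);

        // Commits are expensive (they fsync the new segment files): batch documents
        // and commit once, rather than committing after every addDocument call.
        writer.commit();
        writer.close();
        dir.close();
      }
    }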

Slide 6

Let's create an index
• Tree structure
  – sorted for range queries
  – O(log(n)) search
[Diagram: a sorted tree mapping the terms data, index, lucene, sql to the documents "Lucene in action" and "Databases"]

Slide 7

Lucene doesn’t work this way

Slide 8

Another index
• Store terms and documents in arrays
  – binary search
[Diagram: doc ids 0 and 1 for "Lucene in action" and "Databases", and a sorted array of terms (data, index, lucene, sql) with the doc ids each term occurs in]

Slide 9

Another index
• Store terms and documents in arrays
  – binary search
[Diagram: the same arrays, now labeled: the whole structure is a segment; doc id maps to document; term ordinal maps into the terms dict, which points to the postings list]
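
A sketch of how these per-segment structures surface in the 4.x API; the field name is a placeholder. Each AtomicReader corresponds to one segment: the terms dict is exposed as a sorted TermsEnum and each term's postings list as a DocsEnum of segment-local doc ids.

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    public class SegmentDump {
      // Prints every term of one segment's field with its postings list (doc ids).
      static void dumpField(AtomicReader segment, String field) throws Exception {
        Terms terms = segment.terms(field);          // the terms dict for this field
        if (terms == null) return;                   // field not indexed in this segment
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef term;
        DocsEnum docs = null;
        while ((term = termsEnum.next()) != null) {              // terms come back in sorted order
          docs = termsEnum.docs(segment.getLiveDocs(), docs);    // postings, skipping deleted docs
          StringBuilder postings = new StringBuilder();
          for (int doc = docs.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = docs.nextDoc()) {
            postings.append(doc).append(' ');
          }
          System.out.println(term.utf8ToString() + " -> " + postings);
        }
      }
    }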

Slide 10

Insertions?
• Insertion = write a new segment
• Merge segments when there are too many of them
  – concatenate docs, merge terms dicts and postings lists (merge sort!)
[Diagram: two single-doc segments, one for "Lucene in action" and one for "Databases", each with its own terms dict and postings lists and doc ids starting at 0, next to the merged segment holding both docs]

Slide 11

Insertions?
• Insertion = write a new segment
• Merge segments when there are too many of them
  – concatenate docs, merge terms dicts and postings lists (merge sort!)
[Diagram: during the merge, the second segment's doc ids are shifted ("Databases" becomes doc 1) as the terms dicts and postings lists are merge-sorted into the new segment]
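
A small sketch of that behaviour through the public API, assuming no background merge kicks in between the two commits: each commit flushes a new segment, and DirectoryReader.leaves() exposes one leaf reader per segment.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class SegmentsDemo {
      public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45)));

        Document d1 = new Document();
        d1.add(new TextField("title", "Lucene in action", Field.Store.NO));
        writer.addDocument(d1);
        writer.commit();                                   // flushes one segment

        Document d2 = new Document();
        d2.add(new TextField("title", "Databases", Field.Store.NO));
        writer.addDocument(d2);
        writer.commit();                                   // flushes another segment

        DirectoryReader reader = DirectoryReader.open(dir);
        System.out.println("segments: " + reader.leaves().size()); // 2, until a merge runs

        // forceMerge(1) rewrites everything into a single segment: docs are concatenated,
        // terms dicts and postings lists are merge-sorted.
        writer.forceMerge(1);
        writer.close();
        reader.close();
        dir.close();
      }
    }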

Slide 12

Deletions?
• Deletion = turn a bit off
• Ignore deleted documents when searching and merging (reclaims space)
• Merge policies favor segments with many deletions
[Diagram: the same segment plus a "live docs" bit set with one bit per doc: 1 = live, 0 = deleted]
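
A sketch of how a deletion surfaces in the API; the id field and term are illustrative. The document is only marked as deleted (its live-docs bit is turned off); the space is reclaimed later, when the segment gets merged.

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Bits;

    public class DeletionSketch {
      static void deleteAndInspect(IndexWriter writer, Directory dir) throws Exception {
        writer.deleteDocuments(new Term("id", "1"));   // turns a live-docs bit off, no rewrite
        writer.commit();

        DirectoryReader reader = DirectoryReader.open(dir);
        System.out.println("maxDoc:  " + reader.maxDoc());    // still counts the deleted doc
        System.out.println("numDocs: " + reader.numDocs());   // live docs only
        for (AtomicReaderContext leaf : reader.leaves()) {
          Bits liveDocs = leaf.reader().getLiveDocs();        // null means "no deletions"
          if (liveDocs != null) {
            for (int i = 0; i < liveDocs.length(); i++) {
              System.out.println("doc " + i + (liveDocs.get(i) ? " live" : " deleted"));
            }
          }
        }
        reader.close();
      }
    }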

Slide 13

Pros/cons
• Updates require writing a new segment
  – single-doc updates are costly, bulk updates preferred
  – writes are sequential
• Segments are never modified in place
  – filesystem-cache-friendly
  – lock-free!
• Terms are deduplicated
  – saves space for high-freq terms
• Docs are uniquely identified by an ord
  – useful for cross-API communication
  – Lucene can use several indexes in a single query
• Terms are uniquely identified by an ord
  – important for sorting: compare longs, not strings
  – important for faceting (more on this later)

Slide 14

Lucene can use several indexes. Many databases can't.

Slide 15

Index intersection
• Postings lists for the query "red shoe":
  – red: 1, 2, 10, 11, 20, 30, 50, 100
  – shoe: 2, 20, 21, 22, 30, 40, 100
• Many databases just pick the most selective index and ignore the other ones
• Lucene's postings lists support skipping, which can be used to "leap-frog" between the two lists
[Diagram: the leap-frog steps (1 through 9), alternately advancing each list to the other's current doc id]
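
A sketch of the leap-frog pattern on top of DocsEnum.advance(); the two enums stand for the "red" and "shoe" postings lists above. In practice Lucene's conjunction scorer does this for you when a BooleanQuery requires both terms.

    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.search.DocIdSetIterator;

    public class LeapFrog {
      // Emits the doc ids present in both postings lists.
      static void intersect(DocsEnum red, DocsEnum shoe) throws Exception {
        int doc = red.nextDoc();
        while (doc != DocIdSetIterator.NO_MORE_DOCS) {
          int other = shoe.advance(doc);                    // skip ahead in the second list
          if (other == DocIdSetIterator.NO_MORE_DOCS) {
            break;                                          // one list is exhausted
          } else if (other == doc) {
            System.out.println("match: " + doc);            // both terms occur in this doc
            doc = red.nextDoc();
          } else {
            doc = red.advance(other);                       // leap-frog back to the first list
          }
        }
      }
    }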

Slide 16

What else?
• We just covered search
• Lucene does more
  – term vectors
  – norms
  – numeric doc values
  – binary doc values
  – sorted doc values
  – sorted set doc values

Slide 17

Term vectors
• Per-document inverted index
• Useful for more-like-this
• Sometimes used for highlighting
[Diagram: per-document term vectors, one small inverted index for "Lucene in action" and another for "Databases", alongside the segment-level inverted index]
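
A sketch of reading that per-document inverted index; the doc id and field name are placeholders, and positions are only available if they were stored in the term vectors.

    import org.apache.lucene.index.DocsAndPositionsEnum;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    public class TermVectorSketch {
      static void printTermVector(IndexReader reader, int docId, String field) throws Exception {
        Terms vector = reader.getTermVector(docId, field);   // null if term vectors were not indexed
        if (vector == null) return;
        TermsEnum termsEnum = vector.iterator(null);
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
          System.out.println(term.utf8ToString() + ", freq=" + termsEnum.totalTermFreq());
          DocsAndPositionsEnum positions = termsEnum.docsAndPositions(null, null);
          if (positions != null && positions.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            for (int i = 0; i < positions.freq(); i++) {
              System.out.println("  position " + positions.nextPosition());
            }
          }
        }
      }
    }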

Slide 18

Numeric/binary doc values
• Per-doc and per-field single numeric values, stored in a column-stride fashion
• Useful for sorting and custom scoring
• Norms are numeric doc values
[Diagram: four docs ("Lucene in action", "Databases", "Solr in action", "Java") with a numeric column field_a (42, 1, 3, 10) and a binary column field_b (afc, gce, ppy, ccn)]
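
A sketch of the column-stride access pattern; the field name is hypothetical. Per segment, NumericDocValues behaves like a long[] indexed by doc id, which is what makes sorting and custom scoring cheap.

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.NumericDocValues;

    public class DocValuesSketch {
      // Sums a numeric doc-values field over all live docs of one segment.
      static long sum(AtomicReader segment, String field) throws Exception {
        NumericDocValues values = segment.getNumericDocValues(field);  // one long per doc
        if (values == null) return 0;                                  // field has no doc values
        long total = 0;
        for (int doc = 0; doc < segment.maxDoc(); doc++) {
          if (segment.getLiveDocs() == null || segment.getLiveDocs().get(doc)) {
            total += values.get(doc);   // random access by doc id
          }
        }
        return total;
      }
    }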

Slide 19

Sorted (set) doc values
• Ordinal-enabled per-doc and per-field values
  – sorted: single-valued, useful for sorting
  – sorted set: multi-valued, useful for faceting
[Diagram: docs 0-3 store ordinals (doc 0: 1,2; doc 1: 0; doc 2: 0,1,2; doc 3: 1) that resolve through the terms dictionary for this dv field: 0 = distributed, 1 = Java, 2 = search]
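
A sketch of how the ordinals and the per-field terms dictionary from the diagram are exposed in 4.x; the field name is a placeholder.

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.SortedSetDocValues;
    import org.apache.lucene.util.BytesRef;

    public class SortedSetSketch {
      static void printValues(AtomicReader segment, String field, int docId) throws Exception {
        SortedSetDocValues dv = segment.getSortedSetDocValues(field);
        if (dv == null) return;
        dv.setDocument(docId);                         // position the iterator on one document
        BytesRef scratch = new BytesRef();
        long ord;
        while ((ord = dv.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
          dv.lookupOrd(ord, scratch);                  // ordinal -> term, via the per-field terms dict
          System.out.println(docId + " -> ord " + ord + " = " + scratch.utf8ToString());
        }
      }
    }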

Slide 20

Faceting
• Compute value counts for docs that match a query
  – e.g. category counts on an e-commerce website
• Naive solution
  – hash table: value to count
  – O(#docs) ordinal lookups
  – O(#docs) value lookups
• 2nd solution
  – hash table: ord to count (since ordinals are dense, this can be a simple array)
  – resolve values in the end
  – O(#docs) ordinal lookups
  – O(#values) value lookups
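
A sketch of the second solution on a single segment: since ordinals are dense (0 to getValueCount() - 1), the "hash table" can be a plain int[] indexed by ordinal, and values are only resolved at the end. Passing the matching docs as an int[] is a simplification; real code would count inside a Collector.

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.SortedSetDocValues;
    import org.apache.lucene.util.BytesRef;

    public class FacetSketch {
      // Counts how many matching docs carry each value of a sorted-set doc-values field.
      static void countFacets(AtomicReader segment, String field, int[] matchingDocs) throws Exception {
        SortedSetDocValues dv = segment.getSortedSetDocValues(field);
        if (dv == null) return;
        int[] counts = new int[(int) dv.getValueCount()];   // one slot per ordinal: dense, so an array works
        for (int doc : matchingDocs) {                       // O(#matching docs) ordinal lookups
          dv.setDocument(doc);
          long ord;
          while ((ord = dv.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
            counts[(int) ord]++;
          }
        }
        BytesRef scratch = new BytesRef();
        for (int ord = 0; ord < counts.length; ord++) {      // O(#values) value lookups, at the very end
          if (counts[ord] > 0) {
            dv.lookupOrd(ord, scratch);
            System.out.println(scratch.utf8ToString() + ": " + counts[ord]);
          }
        }
      }
    }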

Slide 21

How can I use these APIs?
• These are the low-level Lucene APIs; everything is built on top of them: searching, faceting, scoring, highlighting, etc.
• API, what it is useful for, and the method that exposes it:
  – Inverted index: term -> doc ids, positions, offsets (AtomicReader.fields)
  – Stored fields: summaries of search results (IndexReader.document)
  – Live docs: ignoring deleted docs (AtomicReader.getLiveDocs)
  – Term vectors: more-like-this (IndexReader.getTermVectors)
  – Doc values / norms: sorting/faceting/scoring (AtomicReader.get*Values)

Slide 22

Wrap up
• Data duplicated up to 4 times
  – not a waste of space!
  – easy to manage thanks to immutability
• Stored fields vs. doc values
  – optimized for different access patterns
  – get many field values for a few docs: stored fields
  – get a few field values for many docs: doc values
[Diagram: stored fields lay values out row by row (all fields of doc 0, then doc 1, then doc 2), at most 1 seek per doc; doc values lay them out column by column (one field for all docs, then the next field), at most 1 seek per doc per field, BUT more disk / filesystem-cache-friendly]
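
A sketch contrasting the two access patterns; the query, field names and the assumption that "price" was indexed as a numeric doc-values field are all illustrative. Doc values serve the sort (one field, every matching doc); stored fields serve the results page (several fields, only ten docs).

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class AccessPatterns {
      static void searchPage(DirectoryReader reader) throws Exception {
        IndexSearcher searcher = new IndexSearcher(reader);

        // Doc values: the "price" column is read for every matching doc in order to sort.
        Sort byPrice = new Sort(new SortField("price", SortField.Type.LONG));
        TopDocs hits = searcher.search(new TermQuery(new Term("title", "lucene")), 10, byPrice);

        // Stored fields: many fields, but only for the 10 docs of the results page.
        for (ScoreDoc hit : hits.scoreDocs) {
          Document page = searcher.doc(hit.doc);
          System.out.println(page.get("title") + " / " + page.get("price"));
        }
      }
    }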

Slide 23

File formats

Slide 24

Important rules
• Save file handles
  – don't use one file per field or per doc
• Avoid disk seeks whenever possible
  – a disk seek on a spinning disk is ~10 ms
• BUT don't ignore the filesystem cache
  – random access in small files is fine
• Light compression helps
  – less I/O
  – smaller indexes
  – filesystem-cache-friendly

Slide 25

Codecs
• File formats are codec-dependent
• The default codec tries to get the best speed for little memory
  – to trade memory for speed, don't use RAMDirectory
  – use MemoryPostingsFormat, MemoryDocValuesFormat, etc. instead
• Detailed file formats are available in the javadocs
  – http://lucene.apache.org/core/4_5_1/core/org/apache/lucene/codecs/package-summary.html
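
As a sketch of what trading memory for speed can look like: the 4.5 default codec can be subclassed to pick a different postings format per field, for example the memory-resident one for an id field. The field name and the choice of format are illustrative.

    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene45.Lucene45Codec;
    import org.apache.lucene.index.IndexWriterConfig;

    public class MemoryPostingsSketch {
      static IndexWriterConfig configure(IndexWriterConfig iwc) {
        iwc.setCodec(new Lucene45Codec() {
          @Override
          public PostingsFormat getPostingsFormatForField(String field) {
            if ("id".equals(field)) {
              return PostingsFormat.forName("Memory");   // keep this field's terms and postings in RAM
            }
            return super.getPostingsFormatForField(field);
          }
        });
        return iwc;
      }
    }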

Slide 26

Compression techniques
• Bit packing / vInt encoding
  – postings lists
  – numeric doc values
• LZ4 (code.google.com/p/lz4)
  – lightweight compression algorithm
  – stored fields, term vectors
• FSTs
  – conceptually a Map
  – keys share prefixes and suffixes
  – terms index

Slide 27

What happens when I run a TermQuery?

Slide 28

1. Terms index
• Look up the term in the terms index
  – an in-memory FST storing term prefixes
  – gives the offset to look at in the terms dictionary
  – can fast-fail if no terms have this prefix
[Diagram: an FST over term prefixes with outputs: br = 2, brac = 3, luc = 4, lyr = 7]
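
A rough sketch of the same idea with Lucene's org.apache.lucene.util.fst API, building the mapping from the slide (br = 2, brac = 3, luc = 4, lyr = 7); exact constructor and helper signatures vary a bit across 4.x releases, so treat this as an approximation. Keys must be added in sorted order, and the outputs play the role of terms-dictionary offsets.

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRef;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    public class TermsIndexSketch {
      public static void main(String[] args) throws Exception {
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
        IntsRef scratch = new IntsRef();
        // Prefixes are added in sorted order; outputs are offsets into the terms dict.
        builder.add(Util.toIntsRef(new BytesRef("br"), scratch), 2L);
        builder.add(Util.toIntsRef(new BytesRef("brac"), scratch), 3L);
        builder.add(Util.toIntsRef(new BytesRef("luc"), scratch), 4L);
        builder.add(Util.toIntsRef(new BytesRef("lyr"), scratch), 7L);
        FST<Long> fst = builder.finish();

        System.out.println(Util.get(fst, new BytesRef("luc")));  // 4: where to seek in the terms dict
        System.out.println(Util.get(fst, new BytesRef("xyz")));  // null: fast fail, no term has this prefix
      }
    }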

Slide 29

2. Terms dictionary
• Jump to the given offset in the terms dictionary
  – compressed based on shared prefixes, similarly to a burst trie
  – called the "BlockTree terms dict"
• Read sequentially until the term is found
[Diagram: the block for prefix "luc": a (freq=1, offset=101), as (freq=1, offset=149), ene (freq=9, offset=205), ky (freq=7, offset=260), rative (freq=5, offset=323); jump to the block, then scan: "a" not found, "as" not found, "ene" found]

Slide 30

3. Postings lists
• Jump to the given offset in the postings list
• Encoded using modified FOR (Frame of Reference) delta encoding
  – 1. delta-encode
  – 2. split into blocks of N=128 values
  – 3. bit packing per block
  – 4. if remaining docs, encode with vInt
• Example with N=4
  – doc ids: 1, 3, 4, 6, 8, 20, 22, 26, 30, 31
  – deltas: 1, 2, 1, 2, 2, 12, 2, 4, 4, 1
  – packed blocks: [1, 2, 1, 2] at 2 bits per value, [2, 12, 2, 4] at 4 bits per value
  – remaining values 4, 1 are vInt-encoded
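
A plain-Java sketch of steps 1 to 3 on the example above (the real implementation packs 128-value blocks with specialized code): delta-encode the doc ids, then compute how many bits each block needs.

    import java.util.Arrays;

    public class ForEncodingSketch {
      public static void main(String[] args) {
        int[] docIds = {1, 3, 4, 6, 8, 20, 22, 26, 30, 31};
        int blockSize = 4;                                   // Lucene uses N=128

        // 1. delta-encode: store the gap to the previous doc id
        int[] deltas = new int[docIds.length];
        deltas[0] = docIds[0];
        for (int i = 1; i < docIds.length; i++) {
          deltas[i] = docIds[i] - docIds[i - 1];
        }
        System.out.println(Arrays.toString(deltas));         // [1, 2, 1, 2, 2, 12, 2, 4, 4, 1]

        // 2./3. split into blocks and bit-pack: each block only needs enough
        //        bits to hold its largest delta
        for (int start = 0; start + blockSize <= deltas.length; start += blockSize) {
          int max = 0;
          for (int i = start; i < start + blockSize; i++) {
            max = Math.max(max, deltas[i]);
          }
          int bitsPerValue = 32 - Integer.numberOfLeadingZeros(max);
          System.out.println("block at " + start + ": " + bitsPerValue + " bits per value");
        }
        // 4. the remaining deltas (4, 1) would be vInt-encoded
      }
    }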

Slide 31

4. Stored fields
• In-memory index for a subset of the doc ids
  – memory-efficient thanks to monotonic compression
  – searched using binary search
• Stored fields
  – stored sequentially
  – compressed (LZ4) in 16+ KB blocks
[Diagram: docs 0 to 6 laid out sequentially in 16 KB compressed blocks, with index entries such as docId=0 offset=42, docId=3 offset=127, docId=4 offset=199]

Slide 32

Query execution
• 2 disk seeks per field for search
• 1 disk seek per doc for stored fields
• It is common that the terms dict / postings lists fit in the filesystem cache
• "Pulse" optimization
  – for unique terms (freq=1), postings are inlined in the terms dict
  – only 1 disk seek
  – will always be used for your primary keys
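
Putting the four steps together, a minimal sketch of running a TermQuery and then loading stored fields for the hits; the path and field names are placeholders.

    import java.io.File;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class TermQuerySketch {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/books-index"));   // hypothetical path
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

        // Steps 1-3: terms index -> terms dict -> postings list, once per segment.
        TopDocs hits = searcher.search(new TermQuery(new Term("title", "lucene")), 10);

        // Step 4: stored fields, one lookup per hit on the results page.
        for (ScoreDoc hit : hits.scoreDocs) {
          Document doc = searcher.doc(hit.doc);
          System.out.println(hit.score + " " + doc.get("title"));
        }
        reader.close();
        dir.close();
      }
    }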

Slide 33

Quiz

Slide 34

What is happening here?
[Chart: queries per second (qps) as a function of the number of docs in the index, with two points of interest marked 1 and 2]

Slide 35

What is happening here?
[Chart: qps vs. #docs in the index, points 1 and 2]
1: the index grows larger than the filesystem cache: stored fields are not fully in the cache anymore

Slide 36

What is happening here?
[Chart: qps vs. #docs in the index, points 1 and 2]
1: the index grows larger than the filesystem cache: stored fields are not fully in the cache anymore
2: the terms dict / postings lists are not fully in the cache anymore

Slide 37

Thank you!