Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0

Slide 1

Slide 1 text

Simon Willnauer @ Lucene Revolution 2011
PMC Member & Core Committer, Apache Lucene
[email protected] / [email protected]
Column Stride Fields aka. DocValues

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Agenda
Column Stride Fields aka. DocValues
‣ What is this all about? aka. The Problem!
‣ The more native solution
‣ DocValues - current state and future
‣ Questions?

Slide 4

Slide 4 text

What is this all about? - Inverted Index

Lucene is basically an inverted index - used to find terms QUICKLY!

Table with 6 documents:

1  The old night keeper keeps the keep in the town
2  In the big old house in the big old gown.
3  The house in the town had the big old keep
4  Where the old night keeper never did sleep.
5  The night keeper keeps the keep in the night
6  And keeps in the dark and sleeps in the light.

term     freq   Posting list
and      1      6
big      2      2 3
dark     1      6
did      1      4
gown     1      2
had      1      3
house    2      2 3
in       5      <1> <2> <3> <5> <6>
keep     3      1 3 5
keeper   3      1 4 5
keeps    3      1 5 6
light    1      6
never    1      4
night    3      1 4 5
old      4      1 2 3 4
sleep    1      4
sleeps   1      6
the      6      <1> <2> <3> <4> <5> <6>
town     2      1 3
where    1      4

Figure labels: TermsEnum, IndexWriter

Slide 5

Slide 5 text

Intersecting posting lists

Yet, once we have found the right terms, the game starts....

Posting Lists (document IDs), combined by an AND Query and then scored:
5 10 11 55 57 59 77 88
1 10 13 44 55 79 88 99

What goes into the score? PageRank?, ClickFeedback?
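
As a rough illustration of what the AND Query has to do with the two posting lists on this slide, here is a minimal sketch of a two-pointer intersection over sorted document IDs. The class and method names are made up for this example and this is not Lucene's actual implementation; it only shows why every surviving document then needs a score.

```java
import java.util.ArrayList;
import java.util.List;

public class PostingIntersection {

    /** Intersects two sorted doc-ID lists by walking both with one cursor each. */
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]); // document contains both terms
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++; // advance the list that is behind
            } else {
                j++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The two posting lists as read from the slide's figure.
        int[] first = {5, 10, 11, 55, 57, 59, 77, 88};
        int[] second = {1, 10, 13, 44, 55, 79, 88, 99};
        System.out.println(intersect(first, second)); // prints [10, 55, 88]
    }
}
```

Each of the matching documents (10, 55 and 88 here) would then need its scoring factors looked up, which is where per-document values come in.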

Slide 6

Slide 6 text

How to store scoring factors?

Lucene provides 2 ways of storing data
• Stored Fields (document to String or Binary mapping)
• Inverted Index (term to document mapping)

What we need here is one or more values per document!
• document to value mapping

Why not use Stored Fields?

Slide 7

Slide 7 text

Using Stored Fields

• Stored Fields serve a different purpose
  • loading body or title fields for result rendering / highlighting
  • well suited for loading multiple values
• With Stored Fields you have one indirection per document, resulting in going to disk twice for each document
  • on-disk random access is too slow
  • remember, Lucene could score millions of documents even if you just render the top 10 or 20!

Slide 8

Slide 8 text

Stored Fields under the hood

Document
  title: Deutschland
  title: Germany
  id: 108232

Field Index (.fdx) - absolute file pointers
  [...][...][93438][...]

Field Data (.fdt) - numFields(vint) [ fieldid(vint) length(vint) payload ]
  [...][id:108232title:Deutschlandtitle:Germany][...]

Slide 9

Slide 9 text

Stored Fields - accessing a field

Field Index (.fdx)
  [...][...][93438][...]

Field Data (.fdt) - numFields(vint) [ fieldid(vint) length(vint) payload ]
  [...][id:108232title:Deutschlandtitle:Germany][...]

1. Look up the file pointer in .fdx
2. Scan on .fdt until you find the field by ID
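
To make the two steps concrete, here is a minimal in-memory sketch of the lookup. It assumes the record layout shown on the slide (numFields, then fieldid, length and payload per field, with variable-length ints); the class name, the field numbering (0 for id, 1 for title) and the use of byte arrays in place of real files are assumptions made for this example only, not Lucene's actual reader code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class StoredFieldsSketch {

    /** Writes a variable-length int: 7 bits per byte, high bit set when more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int v = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            v |= (b & 0x7F) << shift;
        }
        return v;
    }

    /** Appends one document record to the data "file" and returns its start pointer. */
    static long writeDoc(ByteArrayOutputStream fdt, int[] fieldIds, String[] payloads) {
        long pointer = fdt.size();
        writeVInt(fdt, fieldIds.length);
        for (int i = 0; i < fieldIds.length; i++) {
            byte[] payload = payloads[i].getBytes(StandardCharsets.UTF_8);
            writeVInt(fdt, fieldIds[i]);
            writeVInt(fdt, payload.length);
            fdt.write(payload, 0, payload.length);
        }
        return pointer;
    }

    /** Returns the first payload stored under wantedField for doc, or null if absent. */
    static String readField(long[] fdx, byte[] fdt, int doc, int wantedField) {
        ByteBuffer in = ByteBuffer.wrap(fdt);
        in.position((int) fdx[doc]);               // step 1: pointer lookup in the index
        int numFields = readVInt(in);
        for (int i = 0; i < numFields; i++) {      // step 2: scan the record field by field
            int fieldId = readVInt(in);
            int length = readVInt(in);
            if (fieldId == wantedField) {
                byte[] payload = new byte[length];
                in.get(payload);
                return new String(payload, StandardCharsets.UTF_8);
            }
            in.position(in.position() + length);   // skip a field we do not want
        }
        return null;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream fdt = new ByteArrayOutputStream();
        long[] fdx = new long[2];
        fdx[0] = writeDoc(fdt, new int[] {0}, new String[] {"1"});
        fdx[1] = writeDoc(fdt, new int[] {0, 1, 1},
                new String[] {"108232", "Deutschland", "Germany"});
        System.out.println(readField(fdx, fdt.toByteArray(), 1, 1)); // prints Deutschland
    }
}
```

On disk, each of the two steps is a separate access, which is the cost the previous slide warns about when millions of documents are scored.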

Slide 10

Slide 10 text

Alternatives?

Lucene can un-invert a field into FieldCache

Inverted field (terms stored as string / byte[]):

term   freq   Posting list
1.0    1      1 6
2.7    1      2 3
3.2    1      7
4.3    1      4
4.7    1      8
5.8    1      0
7.9    1      5 9
9.0    1      10

un-invert: parse, convert to datatype (float 32), array per field / segment

weight: 5.8 1.0 2.7 2.7 4.3 7.9 1.0 3.2 4.7 7.9 9.0
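
The un-inverting step can be sketched as follows, using the terms and posting lists from this slide. This is a simplified illustration with made-up names, not the FieldCache code itself: each term is parsed from its string form and written into an array slot for every document in its posting list.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class UnInvertSketch {

    /** Builds a docID -> value array from a term -> posting-list mapping. */
    static float[] unInvert(Map<String, int[]> postings, int maxDoc) {
        float[] values = new float[maxDoc];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            float value = Float.parseFloat(entry.getKey()); // parse: string term to float
            for (int doc : entry.getValue()) {
                values[doc] = value;                        // one slot per document
            }
        }
        return values;
    }

    /** The example index from the slide: term text mapped to document IDs. */
    static Map<String, int[]> slidePostings() {
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("1.0", new int[] {1, 6});
        postings.put("2.7", new int[] {2, 3});
        postings.put("3.2", new int[] {7});
        postings.put("4.3", new int[] {4});
        postings.put("4.7", new int[] {8});
        postings.put("5.8", new int[] {0});
        postings.put("7.9", new int[] {5, 9});
        postings.put("9.0", new int[] {10});
        return postings;
    }

    public static void main(String[] args) {
        float[] weight = unInvert(slidePostings(), 11);
        System.out.println(Arrays.toString(weight));
    }
}
```

The result matches the weight array on the slide, with document 0 holding 5.8 and document 10 holding 9.0. The sketch also makes the cost visible: every term has to be visited and parsed before the first lookup can be served.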

Slide 11

Slide 11 text

FieldCache - is fast once loaded, once!

• Constant time lookup DocID to value
• Efficient representation
  • primitive array
  • low GC overhead
• loading can be slow (realtime can be a problem)
  • must parse values
  • builds unnecessary term dictionary
• always memory resident

Slide 12

Slide 12 text

FieldCache - loading

Simple Benchmark
• Indexing 100k, 1M and 10M random floats
• not analyzed, no norms
• load field into FieldCache from optimized index

100k Docs   1M Docs   10M Docs
122 ms      348 ms    3161 ms

Remember, this is only one field! Some apps have many fields to load to FieldCache

Slide 13

Slide 13 text

FieldCache works fine! - if...

• you have enough memory
• you can afford the loading time
• merge is fast enough (for FieldCache you need to index the terms)

What if you can't? Like when you are in a very restricted environment?
• 3 Billion Android installations world wide and growing - 2 MB Heap!
• with 100 Million Documents one field takes 30 seconds to load
• 2 phase Distributed Search

Slide 14

Slide 14 text

Summary

• Stored Fields are not fast enough for random access
• FieldCache is fast once loaded
  • abuses an inverted index
  • must convert to String and from String
  • requires a fair amount of memory
• Lucene is missing a native data structure for primitive per-document values

Slide 15

Slide 15 text

Agenda
Column Stride Fields aka. DocValues
‣ What is this all about? aka. The Problem!
‣ The more native solution
‣ DocValues - current state and future
‣ Questions?

Slide 16

Slide 16 text

The more native solution - Column Stride Fields

• A dense column based storage
• 1 value per document
• accepts primitives - no conversion from / to String
  • int & long
  • float & double
  • byte[]
• each field has a DocValues Type but can still be indexed or stored
• Entirely optional

Slide 17

Slide 17 text

Simple Layout - even on disk

1 column per field and segment, 1 value per document

field: time (int64)   field: id (int32)   field: page_rank (float 32)
1288271631431         1                   3.2
1288271631531         5                   4.5
1288271631631         3                   2.3
1288271631732         4                   4.44
1288271631832         6                   6.7
1288271631932         9                   7.8
1288271632032         8                   9.9
1288271632132         7                   10.1
1288271632233         12                  11.0
1288271632333         14                  33.1
1288271632433         22                  0.2
1288271632533         32                  1.4
1288271632637         100                 55.6
1288271632737         33                  2.2
1288271632838         34                  7.5
1288271632938         35                  3.2
1288271633038         36                  3.4
1288271633138         37                  5.6
1288271632333         38                  45.0

Slide 18

Slide 18 text

Numeric Types - Int

field: id (Random Access)
1 5 3 4 6 9 8 7 12 14 22 32 100 33 34 35 36 37 38

The number of bits depends on the numeric range in the field:

    Math.max(1, (int) Math.ceil(Math.log(1 + maxValue) / Math.log(2.0)));

Here: 7 bits per doc

• Integers are stored densely based on PackedInts
• Space depends on the value range per segment
  Example: [1, 100] maps to [0, 99] and requires 7 bits per doc
• Floats are stored without compression
  • either 32 or 64 bits per value
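
The bit-width formula and the dense layout can be tried out with a small sketch. The packing code below is a simplified stand-in written for this example (values are shifted by the segment minimum and laid end to end in a long[]); it is not Lucene's PackedInts implementation, and all names are hypothetical.

```java
public class PackedIntsSketch {

    /** Bits needed for values in [0, maxValue], using the formula from the slide. */
    static int bitsRequired(long maxValue) {
        return Math.max(1, (int) Math.ceil(Math.log(1 + maxValue) / Math.log(2.0)));
    }

    /** Packs (value - min) for each document into consecutive bit slots. */
    static long[] pack(long[] values, long min, int bits) {
        long[] blocks = new long[(int) (((long) values.length * bits + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            long v = values[i] - min;
            long bitPos = (long) i * bits;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            blocks[block] |= v << shift;
            if (shift + bits > 64) {
                blocks[block + 1] |= v >>> (64 - shift); // value straddles two longs
            }
        }
        return blocks;
    }

    /** Random access: reads the value of one document without touching the others. */
    static long get(long[] blocks, int index, long min, int bits) {
        long bitPos = (long) index * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        if (shift + bits > 64) {
            v |= blocks[block + 1] << (64 - shift);
        }
        long mask = bits == 64 ? -1L : (1L << bits) - 1;
        return (v & mask) + min;
    }

    public static void main(String[] args) {
        long[] ids = {1, 5, 3, 4, 6, 9, 8, 7, 12, 14, 22, 32, 100, 33, 34, 35, 36, 37, 38};
        long min = 1;
        long max = 100;
        int bits = bitsRequired(max - min);           // [1, 100] maps to [0, 99]
        long[] packed = pack(ids, min, bits);
        System.out.println(bits + " bits per doc, " + packed.length + " longs");
        System.out.println(get(packed, 12, min, bits));
    }
}
```

For the id column on this slide, the 19 values fit into three longs at 7 bits each, compared with 19 ints when stored unpacked.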

Slide 19

Slide 19 text

Arbitrary Values - The byte[] variants

• Length Variants:
  • Fixed / Variable
• Store Variants:
  • Straight or Referenced

fixed / straight (Random Access)
data:    10/01/2011 12/01/2011 10/04/2011 10/06/2011 10/05/2011 10/01/2011 10/07/2011 10/04/2011 10/04/2011 10/04/2011

fixed / deref (Random Access)
data:    10/01/2011 12/01/2011 10/04/2011 10/06/2011 10/05/2011 10/01/2011 10/07/2011
offsets: 0 10 20 30 40 50 60 20 20 20
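
The difference between the straight and the dereferenced layout for fixed-length values can be sketched like this. The class is hypothetical and works on in-memory arrays only. Note that this sketch deduplicates every repeated value, so its offsets differ slightly from the figure, where 10/01/2011 still appears twice in the data block.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class FixedBytesSketch {

    static final int LENGTH = 10; // every value is exactly 10 bytes, e.g. "10/01/2011"

    /** fixed / straight: values are concatenated, doc N lives at N * LENGTH. */
    static byte[] writeStraight(String[] values) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        for (String value : values) {
            data.write(value.getBytes(StandardCharsets.US_ASCII), 0, LENGTH);
        }
        return data.toByteArray();
    }

    static String readStraight(byte[] data, int doc) {
        return new String(data, doc * LENGTH, LENGTH, StandardCharsets.US_ASCII);
    }

    /** fixed / deref: each distinct value is stored once, docs hold an offset into the data. */
    static final class Deref {
        final byte[] data;
        final int[] offsets;

        Deref(byte[] data, int[] offsets) {
            this.data = data;
            this.offsets = offsets;
        }

        String read(int doc) {
            return new String(data, offsets[doc], LENGTH, StandardCharsets.US_ASCII);
        }
    }

    static Deref writeDeref(String[] values) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        Map<String, Integer> seen = new HashMap<>();
        int[] offsets = new int[values.length];
        for (int doc = 0; doc < values.length; doc++) {
            Integer offset = seen.get(values[doc]);
            if (offset == null) {                 // first occurrence: append the bytes
                offset = data.size();
                seen.put(values[doc], offset);
                data.write(values[doc].getBytes(StandardCharsets.US_ASCII), 0, LENGTH);
            }
            offsets[doc] = offset;                // repeats only cost one offset
        }
        return new Deref(data.toByteArray(), offsets);
    }

    public static void main(String[] args) {
        String[] dates = {"10/01/2011", "12/01/2011", "10/04/2011", "10/06/2011", "10/05/2011",
                "10/01/2011", "10/07/2011", "10/04/2011", "10/04/2011", "10/04/2011"};
        byte[] straight = writeStraight(dates);
        Deref deref = writeDeref(dates);
        System.out.println(straight.length + " bytes straight, " + deref.data.length + " bytes deref");
        System.out.println(readStraight(straight, 7) + " == " + deref.read(7));
    }
}
```

Both variants keep random access by document ID. The dereferenced form pays one extra indirection per lookup and saves space when many documents share a value.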

Slide 20

Slide 20 text

DocValues - Memory Requirements

• RAM Resident - random access
  • similar to FieldCache
  • bytes are stored in byte-block pools
  • currently limited to 2GB per segment
• On-Disk - sequential access
  • almost no JVM heap memory
  • files should be in FS cache for fast access
  • possibly use memory-mapped buffers

Slide 21

Slide 21 text

Let's look at the API - Indexing

Adding DocValues follows existing patterns, simply use Fieldable

    Document doc = new Document();
    float pageRank = 10.3f;
    DocValuesField valuesField = new DocValuesField("pageRank");
    valuesField.setFloat(pageRank);
    doc.add(valuesField);
    writer.addDocument(doc);

Sometimes the field should also be indexed, stored or needs term vectors

    String titleText = "The quick brown fox";
    Field field = new Field("title", titleText, Store.NO, Index.ANALYZED);
    DocValuesField titleDV = new DocValuesField("title");
    titleDV.setBytes(new BytesRef(titleText), Type.BYTES_VAR_DEREF);
    field.setDocValues(titleDV);

Slide 22

Slide 22 text

Looking at the API - Search / Retrieve

On disk sequential access is exposed through DocValuesEnum

    IndexReader reader = ...;
    DocValues values = reader.docValues("pageRank");
    DocValuesEnum floatEnum = values.getEnum();
    int doc = 0;
    FloatsRef ref = floatEnum.getFloat();
    // values are filled when iterating
    while ((doc = floatEnum.nextDoc()) != DocValuesEnum.NO_MORE_DOCS) {
        double value = ref.floats[0];
    }

    // equivalent to ...
    // (start at -1 so that advance(doc + 1) does not skip document 0)
    int doc = -1;
    while ((doc = floatEnum.advance(doc + 1)) != DocValuesEnum.NO_MORE_DOCS) {
        double value = ref.floats[0];
    }

DocValuesEnum is based on DocIdSetIterator just like Scorer or DocsEnum

Slide 23

Slide 23 text

Looking at the API - Search / Retrieve

RAM Resident API is very similar to FieldCache

    IndexReader reader = ...;
    DocValues values = reader.docValues("pageRank");
    Source source = values.getSource();
    double value = source.getFloat(x);

DocValuesEnum still available on RAM Resident API

    // still allows iterating over the RAM resident values
    DocValuesEnum floatEnum = source.getEnum();
    int doc;
    FloatsRef ref = floatEnum.getFloat();
    while ((doc = floatEnum.nextDoc()) != DocValuesEnum.NO_MORE_DOCS) {
        value = ref.floats[0];
    }

Slide 24

Slide 24 text

Can I add my own DocValues Implementation?

• DocValues are integrated into Flexible Indexing
• IndexWriter / IndexReader write and read DocValues via a Codec
• DocValues Types are fixed (int, float32, float64 etc.) but implementations are Codec specific
• A Codec provides access to DocValuesConsumer and DocValuesProducer
  • allows implementing application specific serialization
  • customize compression techniques

Slide 25

Slide 25 text

Quick detour - Codecs

Diagram labels: IndexWriter, IndexReader, Flex API, Codec, Directory, FileSystem

Slide 26

Slide 26 text

Quick detour - Codecs

Diagram labels: IndexWriter, IndexReader, Flex API, Codec, DocValuesConsumer (write), DocValuesProducer (read)

Slide 27

Slide 27 text

Remember the loading FieldCache benchmark?

Simple Benchmark
• Indexing 100k, 1M and 10M random floats
• not analyzed, no norms
• loading field into FieldCache from optimized index vs. loading DocValues field

             100k Docs   1M Docs   10M Docs
FieldCache   122 ms      348 ms    3161 ms
DocValues    7 ms        10 ms     90 ms

Loading is roughly 17 to 35 times faster in this benchmark - no un-inverting, no string parsing

Slide 28

Slide 28 text

QPS - FieldCache vs. DocValues

Task           QPS DocValues   QPS FieldCache   % change
AndHighHigh    3.51            3.41             2.9%
PKLookup       46.06           44.87            2.7%
AndHighMed     37.09           36.48            1.7%
Fuzzy2         17.70           17.50            1.1%
Fuzzy1         27.15           27.21            -0.2%
Phrase         4.12            4.13             -0.2%
SpanNear       2.00            2.01             -0.5%
SloppyPhrase   1.98            2.02             -2.0%
Term           35.29           36.05            -2.1%
OrHighMed      4.73            4.93             -4.1%
OrHighHigh     3.99            4.18             -4.5%
Wildcard       12.97           13.60            -4.6%
Prefix3        15.86           16.70            -5.0%
IntNRQ         2.72            2.91             -6.5%

Setup: 6 search threads; 20 JVM instances, 5 instances per task; run 50 times on 12 core Xeon / 24 GB RAM; all queries wrapped with a CustomScoreQuery

Slide 29

Slide 29 text

Agenda
Column Stride Fields aka. DocValues
‣ What is this all about? aka. The Problem!
‣ The more native solution
‣ DocValues - current state and future
‣ Questions?

Slide 30

Slide 30 text

DocValues - current state

• Currently still in a branch
  • Some minor JavaDoc issues
  • needs some cleanups
• Landing on trunk very soon
  • issue is already opened and active

Slide 31

Slide 31 text

DocValues - current features

• Fully customizable via Codecs
• User can control memory usage per field
• Suitable for environments where memory is tight
• Compact and native representation on disk and in RAM
• Fast Loading times
• Comparable to FieldCache (small overhead)
• Direct value access even when on disk (single seek)

Slide 32

Slide 32 text

DocValues - what is next?

• the ultimate goal for DocValues is to be update-able
  • changing per-document values without reindexing
  • users can replace existing values directly for each document
  • each field by itself will be update-able
• Will be available in Lucene 4.0 once released ;)

Slide 33

Slide 33 text

DocValues - Updates

• Lucene has a write-once policy for files
• Changing in place is not a good idea - Consistency / Corruption!
• Problem is comparable to norms or deleted docs
  • updating norms requires re-writing the entire norms array (1 byte per document, with in-memory copy-on-write)
  • same is true for deleted docs, though the cost is low (1 bit per document)
• DocValues will use a stacked approach instead

Slide 34

Slide 34 text

DocValues - Updates

DocValues store
docID   field: permission
0       777
1       707
2       644
3       644
4       777
5       664
6       664

update stack
(id: 5, value: 777)
(id: 6, value: 777)
(id: 5, value: 644)
(id: 3, value: 777)   <- update from IndexWriter

merge

coalesced store
docID   field: permission
0       777
1       707
2       644
3       777
4       777
5       644
6       777
... n
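
The stacked approach on this slide can be illustrated with a small sketch. Since updates are described here as future work, everything below is an assumption made for the example (names, array-based storage, oldest-first ordering of the stack); it only shows how a write-once base store plus a stack of updates yields the coalesced store from the figure.

```java
import java.util.Arrays;

public class StackedUpdatesSketch {

    /** Reads a value through the stack: the newest update for a doc wins, else the base value. */
    static int lookup(int[] base, int[][] updates, int doc) {
        for (int i = updates.length - 1; i >= 0; i--) {
            if (updates[i][0] == doc) {
                return updates[i][1];
            }
        }
        return base[doc];
    }

    /** Merge step: applies the updates oldest first and writes a new store. */
    static int[] coalesce(int[] base, int[][] updates) {
        int[] merged = base.clone();   // the base store itself is never modified
        for (int[] update : updates) {
            merged[update[0]] = update[1];
        }
        return merged;
    }

    public static void main(String[] args) {
        int[] permission = {777, 707, 644, 644, 777, 664, 664};
        int[][] stack = {
            {5, 777},
            {6, 777},
            {5, 644},
            {3, 777},   // the update arriving from IndexWriter in the figure
        };
        System.out.println(lookup(permission, stack, 5));                  // prints 644
        System.out.println(Arrays.toString(coalesce(permission, stack)));  // prints [777, 707, 644, 777, 777, 644, 777]
    }
}
```

The base array stays untouched until the merge, which matches the write-once policy described on the previous slide.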

Slide 35

Slide 35 text

Use-Cases

• Scoring based on frequently changing values
  • click feedback
  • iterative algorithms like page rank
  • user ratings
• Restricted environments like Android
• Realtime Search (fast loading times)
• frequently changing fields
  • if the field's content is not searched!
• fast field fetching / alternative to stored fields (Distributed Search)

Slide 36

Slide 36 text

Questions?

Thank you for your attention!