Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0

Lucene 4.0 is on its way to deliver a tremendous amount of new features and improvements. Besides Real-Time Search and Flexible Indexing, DocValues aka. Column Stride Fields is one of the "next generation" features. DocValues enable Lucene to efficiently store and retrieve type-safe document-to-value pairs in a column-stride fashion, either entirely memory-resident with random access or disk-resident with iterator-based access, without the need to un-invert fields. Its final goal is to provide an independently updatable per-document storage for scoring, sorting or even filtering. This talk will introduce the current state of development, implementation details, its features and how DocValues have been integrated into Lucene's Codec API for full extensibility.

Simon Willnauer

May 17, 2011

Transcript

  1. Simon Willnauer @ Lucene Revolution 2011
    PMC Member & Core Committer Apache Lucene
    [email protected] / [email protected]
    Column Stride Fields aka. DocValues

  2. 2

  3. Agenda
    Column Stride Fields aka. DocValues
    ‣ What is this all about? aka. The Problem!
    ‣ The more native solution
    ‣ DocValues - current state and future
    ‣ Questions?

  4. What is this all about? - Inverted Index
    Lucene is basically an inverted index - used to find terms QUICKLY!
    Table with 6 documents:
    1  The old night keeper keeps the keep in the town
    2  In the big old house in the big old gown.
    3  The house in the town had the big old keep
    4  Where the old night keeper never did sleep.
    5  The night keeper keeps the keep in the night
    6  And keeps in the dark and sleeps in the light.
    term    freq  posting list
    and     1     6
    big     2     2 3
    dark    1     6
    did     1     4
    gown    1     2
    had     1     3
    house   2     2 3
    in      5     <1> <2> <3> <5> <6>
    keep    3     1 3 5
    keeper  3     1 4 5
    keeps   3     1 5 6
    light   1     6
    never   1     4
    night   3     1 4 5
    old     4     1 2 3 4
    sleep   1     4
    sleeps  1     6
    the     6     <1> <2> <3> <4> <5> <6>
    town    2     1 3
    where   1     4
    TermsEnum
    IndexWriter
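
A minimal sketch of how a term table like the one on this slide can be built. It assumes a naive analysis step (lowercase, split on non-letters); `InvertedIndexSketch`, `DOCS` and `build` are illustrative names and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: maps each term to the sorted list of document IDs containing it. */
class InvertedIndexSketch {

    /** The six example documents from the slide; document IDs are 1-based. */
    static final String[] DOCS = {
        "The old night keeper keeps the keep in the town",
        "In the big old house in the big old gown.",
        "The house in the town had the big old keep",
        "Where the old night keeper never did sleep.",
        "The night keeper keeps the keep in the night",
        "And keeps in the dark and sleeps in the light."
    };

    /** Builds term -> posting list. A TreeMap keeps terms sorted like a term dictionary. */
    static Map<String, List<Integer>> build(String[] docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.length; i++) {
            int docId = i + 1;
            // Naive analysis (an assumption of this sketch): lowercase, split on non-letters.
            for (String term : docs[i].toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document once, even if the term repeats inside it.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // One line per term: term, number of documents, posting list.
        build(DOCS).forEach((term, postings) ->
            System.out.println(term + " " + postings.size() + " " + postings));
    }
}
```

Looking up a term is then a single map access, which is why the inverted index is good at finding terms and poor at answering "what value does document N have?".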

  5. Intersecting posting lists
    Yet, once we found the right terms the game starts....
    5 10 11 55 57 59 77 88
    1 10 13 44 55 79 88 99
    score
    AND Query
    What goes into the score? PageRank?, ClickFeedback?
    Posting Lists (document IDs)
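
The AND query on this slide boils down to intersecting two sorted posting lists. The following is a sketch of the classic two-pointer merge, not Lucene's actual scorer code; `PostingIntersection` and `intersect` are hypothetical names.

```java
import java.util.Arrays;

/** Intersects two sorted posting lists, as an AND query has to. */
class PostingIntersection {

    /** Both inputs must be sorted ascending and free of duplicates. */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out[n++] = a[i]; // document matches both terms
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++; // advance the list that is behind
            } else {
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] first = {5, 10, 11, 55, 57, 59, 77, 88};
        int[] second = {1, 10, 13, 44, 55, 79, 88, 99};
        System.out.println(Arrays.toString(intersect(first, second))); // prints [10, 55, 88]
    }
}
```

Every surviving document then needs a score, and that is where a fast per-document value lookup becomes necessary.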

  6. How to store scoring factors?
    Lucene provides 2 ways of storing data
    • Stored Fields (document to String or Binary mapping)
    • Inverted Index (term to document mapping)
    What we need here is one or more values per document!
    • document to value mapping
    Why not use Stored Fields?

  7. Using Stored Fields
    •Stored Fields serve a different purpose
    •loading body or title fields for result rendering / highlighting
    •very suited for loading multiple values
    •With Stored Fields you have one indirection per document resulting in
    going to disk twice for each document
    •on-disk random access is too slow
    •remember Lucene could score millions of documents even if you just
    render the top 10 or 20!

  8. Stored Fields under the hood
    Document: id: 108232, title: Deutschland, title: Germany
    Field Index (.fdx): [...][...][93438][...] (absolute file pointers)
    Field Data (.fdt): [...][id:108232title:Deutschlandtitle:Germany][...]
    .fdt entry layout: numFields(vint) [ fieldid(vint) length(vint) payload ]

  9. Stored Fields - accessing a field
    Field Index (.fdx): [...][...][93438][...]
    Field Data (.fdt): [...][id:108232title:Deutschlandtitle:Germany][...]
    .fdt entry layout: numFields(vint) [ fieldid(vint) length(vint) payload ]
    Access takes two steps:
    Lookup filepointer in .fdx
    Scan on .fdt until you find the field by ID
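
To make the two-step access concrete, here is an in-memory sketch of the layout described on the slide: a pointer list stands in for `.fdx` and a byte buffer for `.fdt`. The class and method names (`StoredFieldsSketch`, `addDocument`, `firstValue`) are made up for illustration, and the vint encoding (7 data bits per byte, high bit as continuation flag) is an assumption about what the slide's "vint" means.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for the .fdx / .fdt pair: one pointer per document, fields stored row-wise. */
class StoredFieldsSketch {
    private final ByteArrayOutputStream fdt = new ByteArrayOutputStream(); // field data
    private final List<Integer> fdx = new ArrayList<>();                   // file pointer per docID

    /** Variable-length int: 7 bits per byte, high bit set when more bytes follow. */
    private void writeVInt(int v) {
        while ((v & ~0x7F) != 0) {
            fdt.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        fdt.write(v);
    }

    private static int readVInt(byte[] data, int[] pos) {
        int b = data[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Appends numFields [ fieldId length payload ]* and records where the document starts. */
    int addDocument(int[] fieldIds, String[] values) {
        fdx.add(fdt.size());
        writeVInt(fieldIds.length);
        for (int i = 0; i < fieldIds.length; i++) {
            byte[] payload = values[i].getBytes(StandardCharsets.UTF_8);
            writeVInt(fieldIds[i]);
            writeVInt(payload.length);
            fdt.write(payload, 0, payload.length);
        }
        return fdx.size() - 1;
    }

    /** Returns the first value of the wanted field, or null if the document lacks it. */
    String firstValue(int docId, int wantedFieldId) {
        byte[] data = fdt.toByteArray();
        int[] pos = {fdx.get(docId)};          // step 1: look up the file pointer
        int numFields = readVInt(data, pos);
        for (int i = 0; i < numFields; i++) {  // step 2: scan fields until the ID matches
            int fieldId = readVInt(data, pos);
            int length = readVInt(data, pos);
            if (fieldId == wantedFieldId) {
                return new String(data, pos[0], length, StandardCharsets.UTF_8);
            }
            pos[0] += length;                  // skip the payload of an unwanted field
        }
        return null;
    }

    public static void main(String[] args) {
        StoredFieldsSketch store = new StoredFieldsSketch();
        int doc = store.addDocument(new int[] {0, 1, 1}, new String[] {"108232", "Deutschland", "Germany"});
        System.out.println(store.firstValue(doc, 1)); // prints Deutschland
    }
}
```

On disk, step 1 and step 2 each touch a different file, which is the "going to disk twice for each document" cost mentioned earlier.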

  10. Alternatives?
    Lucene can un-invert a field into FieldCache
    Inverted field (terms stored as string / byte[]):
    term  freq  posting list
    1.0   1     1 6
    2.7   1     2 3
    3.2   1     7
    4.3   1     4
    4.7   1     8
    5.8   1     0
    7.9   1     5 9
    9.0   1     10
    un-invert: parse each term and convert it to the datatype (float 32)
    Result: one array per field / segment, indexed by docID
    weight: 5.8 1.0 2.7 2.7 4.3 7.9 1.0 3.2 4.7 7.9 9.0

  11. FieldCache - is fast once loaded, once!
    •Constant time lookup DocID to value
    •Efficient representation
    •primitive array
    •low GC overhead
    •loading can be slow (realtime can be a problem)
    •must parse values
    •builds unnecessary term dictionary
    •always memory resident
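
A rough sketch of what un-inverting costs: every term has to be parsed back into a number and written to each document in its posting list. This is not the FieldCache implementation; `UninvertSketch` and `uninvert` are hypothetical names and the postings are held in a plain map.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Turns term -> posting list back into a docID -> value array. */
class UninvertSketch {

    /** maxDoc is the number of documents; documents without a term keep the default 0.0f. */
    static float[] uninvert(Map<String, int[]> postingsByTerm, int maxDoc) {
        float[] values = new float[maxDoc];
        for (Map.Entry<String, int[]> entry : postingsByTerm.entrySet()) {
            float value = Float.parseFloat(entry.getKey()); // parse: string term -> float
            for (int docId : entry.getValue()) {
                values[docId] = value;                      // un-invert: term -> docs becomes doc -> value
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("1.0", new int[] {1, 6});
        postings.put("2.7", new int[] {2, 3});
        postings.put("3.2", new int[] {7});
        postings.put("4.3", new int[] {4});
        postings.put("4.7", new int[] {8});
        postings.put("5.8", new int[] {0});
        postings.put("7.9", new int[] {5, 9});
        postings.put("9.0", new int[] {10});
        System.out.println(Arrays.toString(uninvert(postings, 11)));
    }
}
```

Once the array exists, a lookup is a single array access; the expensive part is the one-time walk over the whole term dictionary and all postings.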

  12. FieldCache - loading
    Docs       100k    1M      10M
    Load time  122 ms  348 ms  3161 ms
    Simple Benchmark
    • Indexing 100k, 1M and 10M random floats
    • not analyzed no norms
    • load field into FieldCache from optimized index
    Remember, this is only one field! Some apps have many fields to load to
    FieldCache

  13. FieldCache works fine! - if...
    •you have enough memory
    •you can afford the loading time
    •merge is fast enough (for FieldCache you need to index the terms)
    What if you can't? Like when you are in a very restricted environment?
    • 3 Billion Android installations world wide and growing - 2 MB Heap!
    • with 100 Million Documents one field takes 30 seconds to load
    • 2 phase Distributed Search

  14. Summary
    •Stored Fields are not fast enough for random access
    •FieldCache is fast once loaded
    •abuses a reverse index
    •must convert to String and from String
    •requires fair amount of memory
    • Lucene is missing a native data structure for primitive per-document values

  15. Agenda
    Column Stride Fields aka. DocValues
    ‣ What is this all about? aka. The Problem!
    ‣ The more native solution
    ‣ DocValues - current state and future
    ‣ Questions?

  16. The more native solution - Column Stride Fields
    •A dense column based storage
    •1 value per document
    •accepts primitives - no conversion from / to String
    •int & long
    •float & double
    •byte[ ]
    •each field has a DocValues Type but can still be indexed or stored
    •Entirely optional

  17. Simple Layout - even on disk
    field: time field: id field: page_rank
    1288271631431 1 3.2
    1288271631531 5 4.5
    1288271631631 3 2.3
    1288271631732 4 4.44
    1288271631832 6 6.7
    1288271631932 9 7.8
    1288271632032 8 9.9
    1288271632132 7 10.1
    1288271632233 12 11.0
    1288271632333 14 33.1
    1288271632433 22 0.2
    1288271632533 32 1.4
    1288271632637 100 55.6
    1288271632737 33 2.2
    1288271632838 34 7.5
    1288271632938 35 3.2
    1288271633038 36 3.4
    1288271633138 37 5.6
    1288271632333 38 45.0
    1 column per field and segment
    1 value per document
    int64 int32 float 32

  18. Numeric Types - Int
    field: id (Random Access)
    1 5 3 4 6 9 8 7 12 14 22 32 100 33 34 35 36 37 38
    The number of bits depends on the numeric range in the field:
    Math.max(1, (int) Math.ceil(Math.log(1 + maxValue) / Math.log(2.0)));
    Here: 7 bits per doc
    • Integers are stored densely, based on PackedInts
    • Space depends on the value range per segment
    Example: [1, 100] maps to [0, 99] and requires 7 bits per doc
    • Floats are stored without compression
    • either 32 or 64 bit per value
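
A sketch of the packing idea for the `id` column, under two stated simplifications: it is not Lucene's PackedInts code, and `bitsRequired` counts bits with `Long.numberOfLeadingZeros` instead of the slide's `Math.log` formula (the same result for positive ranges, without floating-point rounding). All names are hypothetical; the input must be non-empty and `max - min` must fit in a non-negative long.

```java
/** Stores values in ceil(log2(range + 1)) bits each after subtracting the minimum. */
class PackedIntsSketch {
    final long[] blocks; // packed bits, 64 per block
    final int bits;      // bits used per value
    final long min;      // subtracted before packing, added back on read

    /** Bits needed to represent every value in [0, maxValue]; at least 1. */
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    PackedIntsSketch(long[] values) {
        long lo = Long.MAX_VALUE;
        long hi = Long.MIN_VALUE;
        for (long v : values) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        min = lo;
        bits = bitsRequired(hi - lo);
        blocks = new long[(int) (((long) values.length * bits + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            set(i, values[i] - min);
        }
    }

    private void set(int index, long v) {
        long bitPos = (long) index * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= v << shift;
        if (shift + bits > 64) {
            blocks[block + 1] |= v >>> (64 - shift); // value straddles two blocks
        }
    }

    /** Random access by document index. */
    long get(int index) {
        long bitPos = (long) index * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        if (shift + bits > 64) {
            v |= blocks[block + 1] << (64 - shift);
        }
        long mask = bits == 64 ? -1L : (1L << bits) - 1;
        return (v & mask) + min;
    }

    public static void main(String[] args) {
        long[] ids = {1, 5, 3, 4, 6, 9, 8, 7, 12, 14, 22, 32, 100, 33, 34, 35, 36, 37, 38};
        PackedIntsSketch packed = new PackedIntsSketch(ids);
        System.out.println(packed.bits + " bits per doc, " + packed.blocks.length + " longs");
    }
}
```

For the 19 ids above this packs 133 bits into three longs, compared with 19 ints (608 bits) for a plain array.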

  19. Arbitrary Values - The byte[] variants
    •Length Variants:
    •Fixed / Variable
    •Store Variants:
    •Straight or Referenced
    fixed / straight (one value per document, stored in docID order):
    data: 10/01/2011 12/01/2011 10/04/2011 10/06/2011 10/05/2011 10/01/2011 10/07/2011 10/04/2011 10/04/2011 10/04/2011
    fixed / deref (per-document offsets into the stored values):
    data: 10/01/2011 12/01/2011 10/04/2011 10/06/2011 10/05/2011 10/01/2011 10/07/2011
    offsets: 0 10 20 30 40 50 60 20 20 20
    Both variants support:
    Random Access
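
The fixed / deref variant can be sketched as a value table plus one offset per document. Note one deliberate difference from the figure: this sketch shares every repeated value, so the second `10/01/2011` also points back to offset 0, whereas the slide only shows the last three documents dereferenced. `FixedDerefSketch` is a hypothetical name, not the codec's class.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Fixed-length byte values stored once each, with a per-document offset into the data. */
class FixedDerefSketch {
    final byte[] data;   // unique values, back to back
    final int[] offsets; // offsets[docID] = start of that document's value in data
    final int length;    // fixed length of every value in bytes

    FixedDerefSketch(String[] valuesPerDoc, int fixedLength) {
        this.length = fixedLength;
        this.offsets = new int[valuesPerDoc.length];
        Map<String, Integer> offsetByValue = new HashMap<>();
        List<byte[]> unique = new ArrayList<>();
        for (int doc = 0; doc < valuesPerDoc.length; doc++) {
            String value = valuesPerDoc[doc];
            Integer offset = offsetByValue.get(value);
            if (offset == null) {
                byte[] bytes = value.getBytes(StandardCharsets.US_ASCII);
                if (bytes.length != fixedLength) {
                    throw new IllegalArgumentException("value is not " + fixedLength + " bytes: " + value);
                }
                offset = unique.size() * fixedLength; // first occurrence: append to the data
                offsetByValue.put(value, offset);
                unique.add(bytes);
            }
            offsets[doc] = offset;                    // repeats only cost one offset
        }
        this.data = new byte[unique.size() * fixedLength];
        for (int i = 0; i < unique.size(); i++) {
            System.arraycopy(unique.get(i), 0, data, i * fixedLength, fixedLength);
        }
    }

    /** Random access: follow the document's offset into the shared data. */
    String get(int doc) {
        return new String(data, offsets[doc], length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String[] dates = {
            "10/01/2011", "12/01/2011", "10/04/2011", "10/06/2011", "10/05/2011",
            "10/01/2011", "10/07/2011", "10/04/2011", "10/04/2011", "10/04/2011"
        };
        FixedDerefSketch store = new FixedDerefSketch(dates, 10);
        System.out.println(Arrays.toString(store.offsets)); // prints [0, 10, 20, 30, 40, 0, 50, 20, 20, 20]
    }
}
```

Straight storage would need 100 bytes for these ten documents; the dereferenced form needs 60 bytes of data plus the offsets, so it pays off when values repeat often.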

  20. DocValues - Memory Requirements
    •RAM Resident - random access
    •similar to FieldCache
    •bytes are stored in byte-block pools
    •currently limited to 2GB per segment
    •On-Disk - sequential access
    •almost no JVM heap memory
    •files should be in FS cache for fast access
    •possibly use memory-mapped buffers

  21. Let's look at the API - Indexing
    Adding DocValues follows existing patterns, simply use Fieldable
    Document doc = new Document();
    float pageRank = 10.3f;
    DocValuesField valuesField = new DocValuesField("pageRank");
    valuesField.setFloat(pageRank);
    doc.add(valuesField);
    writer.addDocument(doc);
    Sometimes the field should also be indexed, stored or needs term-vectors
    String titleText = "The quick brown fox";
    Field field = new Field("title", titleText, Store.NO, Index.ANALYZED);
    DocValuesField titleDV = new DocValuesField("title");
    titleDV.setBytes(new BytesRef(titleText), Type.BYTES_VAR_DEREF);
    field.setDocValues(titleDV);

  22. Looking at the API - Search / Retrieve
    IndexReader reader = ...;
    DocValues values = reader.docValues("pageRank");
    DocValuesEnum floatEnum = values.getEnum();
    int doc = 0;
    FloatsRef ref = floatEnum.getFloat(); // values are filled when iterating
    while ((doc = floatEnum.nextDoc()) != DocValuesEnum.NO_MORE_DOCS) {
        double value = ref.floats[0];
    }
    // equivalent to ... (on a fresh enum; start at -1 so that docID 0 is not skipped)
    int doc = -1;
    while ((doc = floatEnum.advance(doc + 1)) != DocValuesEnum.NO_MORE_DOCS) {
        double value = ref.floats[0];
    }
    On disk sequential access is exposed through DocValuesEnum
    DocValuesEnum is based on DocIdSetIterator just like Scorer or
    DocsEnum
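
The iteration contract can be tried out without Lucene on the classpath. The class below is a toy stand-in that mimics `nextDoc()`, `advance(target)` and `NO_MORE_DOCS` as described for DocIdSetIterator; `ToyValuesEnum` and its helper methods are hypothetical and the real DocValuesEnum from the branch may differ in detail.

```java
/** Toy iterator over (docID, float) pairs with nextDoc / advance semantics. */
class ToyValuesEnum {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;     // ascending docIDs that carry a value
    private final float[] values; // values[i] belongs to docs[i]
    private int pos = -1;         // before the first document
    float current;                // refilled on every move, like the shared ref on the slide

    ToyValuesEnum(int[] docs, float[] values) {
        this.docs = docs;
        this.values = values;
    }

    /** Moves to the next document that has a value. */
    int nextDoc() {
        return move(pos + 1);
    }

    /** Moves to the first document with docID >= target that lies beyond the current one. */
    int advance(int target) {
        int p = pos + 1;
        while (p < docs.length && docs[p] < target) {
            p++;
        }
        return move(p);
    }

    private int move(int p) {
        pos = Math.min(p, docs.length);
        if (pos == docs.length) {
            return NO_MORE_DOCS;
        }
        current = values[pos];
        return docs[pos];
    }

    static float sumWithNextDoc(ToyValuesEnum e) {
        float sum = 0f;
        while (e.nextDoc() != NO_MORE_DOCS) {
            sum += e.current;
        }
        return sum;
    }

    static float sumWithAdvance(ToyValuesEnum e) {
        float sum = 0f;
        int doc = -1; // start before document 0 so advance(doc + 1) does not skip it
        while ((doc = e.advance(doc + 1)) != NO_MORE_DOCS) {
            sum += e.current;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] docs = {0, 3, 4, 9};
        float[] values = {1f, 2f, 4f, 8f};
        System.out.println(sumWithNextDoc(new ToyValuesEnum(docs, values)));
        System.out.println(sumWithAdvance(new ToyValuesEnum(docs, values)));
    }
}
```

Both loops visit the same documents; `advance` only becomes interesting when the caller can skip ahead, for example to the next document matched by a query.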

  23. Looking at the API - Search / Retrieve
    IndexReader reader = ...;
    DocValues values = reader.docValues("pageRank");
    Source source = values.getSource();
    double value = source.getFloat(x); // x: the docID to look up
    // still allows iterating over the RAM resident values
    DocValuesEnum floatEnum = source.getEnum();
    int doc;
    FloatsRef ref = floatEnum.getFloat();
    while ((doc = floatEnum.nextDoc()) != DocValuesEnum.NO_MORE_DOCS) {
        value = ref.floats[0];
    }
    RAM Resident API is very similar to FieldCache
    DocValuesEnum still available on RAM Resident API

  24. Can I add my own DocValues Implementation?
    •DocValues are integrated into Flexible Indexing
    •IndexWriter / IndexReader write and read DocValues via a Codec
    •DocValues Types are fixed (int, float32, float64 etc.) but implementations are Codec specific
    •A Codec provides access to DocValuesConsumer and DocValuesProducer
    •allows implementing application specific serialization
    •customize compression techniques

  25. Quick detour - Codecs
    IndexWriter IndexReader
    Flex API
    Directory
    FileSystem
    Codec

  26. Quick detour - Codecs
    IndexWriter IndexReader
    Flex API
    Codec
    DocValuesProducer
    DocValuesConsumer
    write read

  27. Remember the loading FieldCache benchmark?
    Simple Benchmark
    • Indexing 100k, 1M and 10M random floats
    • not analyzed, no norms
    • loading field into FieldCache from optimized index vs. loading DocValues field
                100k Docs  1M Docs  10M Docs
    FieldCache  122 ms     348 ms   3161 ms
    DocValues   7 ms       10 ms    90 ms
    Loading is roughly 17x to 35x faster in this benchmark - no un-inverting, no string parsing

  28. QPS - FieldCache vs. DocValues
    Task          QPS DocValues  QPS FieldCache  % change
    AndHighHigh   3.51           3.41            2.9%
    PKLookup      46.06          44.87           2.7%
    AndHighMed    37.09          36.48           1.7%
    Fuzzy2        17.70          17.50           1.1%
    Fuzzy1        27.15          27.21           -0.2%
    Phrase        4.12           4.13            -0.2%
    SpanNear      2.00           2.01            -0.5%
    SloppyPhrase  1.98           2.02            -2.0%
    Term          35.29          36.05           -2.1%
    OrHighMed     4.73           4.93            -4.1%
    OrHighHigh    3.99           4.18            -4.5%
    Wildcard      12.97          13.60           -4.6%
    Prefix3       15.86          16.70           -5.0%
    IntNRQ        2.72           2.91            -6.5%
    6 Search Threads, 20 JVM instances, 5 instances per task, run 50 times on 12 core Xeon / 24 GB RAM - all queries wrapped with a CustomScoreQuery

  29. Agenda
    Column Stride Fields aka. DocValues
    ‣ What is this all about? aka. The Problem!
    ‣ The more native solution
    ‣ DocValues - current state and future
    ‣ Questions?

  30. DocValues - current state
    •Currently still in a branch
    •Some minor JavaDoc issues
    •needs some cleanups
    •Landing on trunk very soon
    •issue is already opened and active

  31. DocValues - current features
    •Fully customizable via Codecs
    •User can control memory usage per field
    •Suitable for environments where memory is tight
    •Compact and native representation on disk and in RAM
    •Fast Loading times
    •Comparable to FieldCache (small overhead)
    •Direct value access even when on disk (single seek)

  32. DocValues - what is next?
    •the ultimate goal for DocValues is to be update-able
    •changing per-document values without reindexing
    •users can replace existing values directly for each document
    •each field by itself will be update-able
    •Will be available in Lucene 4.0 once released ;)

  33. DocValues - Updates
    •Lucene has write-once policy for files
    •Changing in place is not a good idea - Consistency / Corruption!
    •Problem is comparable to norms or deleted docs
    •updating norms requires re-writing the entire norms array (1 byte per
    Document with in memory copy-on-write)
    •same is true for deleted docs while cost is low (1 bit per document)
    •DocValues will use a stacked-approach instead

  34. DocValues - Updates
    DocValues store:
    docID  field: permission
    0      777
    1      707
    2      644
    3      644
    4      777
    5      664
    6      664
    update stack: (id: 5, value: 777) (id: 6, value: 777) (id: 5, value: 644)
    IndexWriter update: (id: 3, value: 777)
    merge result:
    docID  field: permission
    0      777
    1      707
    2      644
    3      777
    4      777
    5      644
    6      777
    ...
    n
    coalesced store
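
The stacked approach can be sketched as a write-once base array plus a list of pending updates that is folded in at merge time. This is only an illustration of the idea on the slide, since the feature is described as future work; `UpdateStackSketch` and its methods are hypothetical, and "later updates win" is inferred from the figure, where document 5 ends up with 644.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Write-once base values plus a stack of pending per-document updates. */
class UpdateStackSketch {

    private static final class Update {
        final int docId;
        final long value;

        Update(int docId, long value) {
            this.docId = docId;
            this.value = value;
        }
    }

    private final long[] store;                          // never modified in place
    private final List<Update> pending = new ArrayList<>();

    UpdateStackSketch(long[] store) {
        this.store = store;
    }

    /** Records an update without touching the base store. */
    void update(int docId, long value) {
        pending.add(new Update(docId, value));
    }

    /** Reads through the stack: the newest update for a document wins. */
    long get(int docId) {
        for (int i = pending.size() - 1; i >= 0; i--) {
            if (pending.get(i).docId == docId) {
                return pending.get(i).value;
            }
        }
        return store[docId];
    }

    /** Merge step: writes a new array with all updates applied in order. */
    long[] coalesce() {
        long[] merged = store.clone();
        for (Update u : pending) {
            merged[u.docId] = u.value;
        }
        return merged;
    }

    public static void main(String[] args) {
        UpdateStackSketch permissions = new UpdateStackSketch(new long[] {777, 707, 644, 644, 777, 664, 664});
        permissions.update(5, 777);
        permissions.update(6, 777);
        permissions.update(5, 644);
        permissions.update(3, 777);
        System.out.println(Arrays.toString(permissions.coalesce())); // prints [777, 707, 644, 777, 777, 644, 777]
    }
}
```

Because the base array is only ever cloned, readers that still hold the old store keep seeing consistent values, which is the point of avoiding in-place changes.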

  35. Use-Cases
    •Scoring based on frequently changing values
    •click feedback
    •iterative algorithms like page rank
    •user ratings
    •Restricted environments like Android
    •Realtime Search (fast loading times)
    •frequently changing fields
    •if the field's content is not searched!
    •fast field fetching / alternative to stored fields (Distributed Search)

  36. Questions?
    Thank you for your attention!
