Engineering fast indexes

Daniel Lemire
February 07, 2017

Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.

Transcript

  1. ENGINEERING FAST INDEXES
    Daniel Lemire
    https://lemire.me
    Joint work with lots of super smart people

  2. Our recent work: Roaring Bitmaps
    http://roaringbitmap.org/
    Used by
    Apache Spark,
    Netflix Atlas,
    LinkedIn Pinot,
    Apache Lucene,
    Whoosh,
    Metamarkets' Druid,
    eBay's Apache Kylin
    Further reading:
    Frame of Reference and Roaring Bitmaps (at Elastic, the
    company behind Elasticsearch)

  3. Set data structures
    We focus on sets of integers: S = {1, 2, 3, 1000}. Ubiquitous in
    databases and search engines.
    tests: x ∈ S?
    intersections: S₂ ∩ S₁
    unions: S₂ ∪ S₁
    differences: S₂ ∖ S₁
    Jaccard Index (Tanimoto similarity): ∣S₁ ∩ S₂∣/∣S₁ ∪ S₂∣
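
A minimal Python sketch of the operations listed on this slide, using the built-in set type as a stand-in. The helper name jaccard and the sample sets s1 and s2 are illustrative assumptions, not part of the talk.

```python
def jaccard(s1, s2):
    """Jaccard index |S1 ∩ S2| / |S1 ∪ S2|; taken to be 1.0 for two empty sets."""
    union = s1 | s2
    if not union:
        return 1.0
    return len(s1 & s2) / len(union)

s1 = {1, 2, 3, 1000}
s2 = {2, 3, 5}

membership = 1000 in s1      # test: x ∈ S1
intersection = s1 & s2       # S1 ∩ S2
union = s1 | s2              # S1 ∪ S2
difference = s1 - s2         # S1 ∖ S2
similarity = jaccard(s1, s2)
```

The built-in set is a hash set, so this shows the semantics only; the rest of the deck is about faster representations.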

  4. "Ordered" Set
    iterate
    in sorted order,
    in reverse order,
    skippable iterators (jump to first value ≥ x)
    Rank: how many elements of the set are smaller than k?
    Select: find the kth smallest value
    Min/max: find the minimal and maximal values
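
The ordered-set operations above can be sketched on a plain sorted list with Python's bisect module. The function names rank, select and skip_to are assumptions chosen to mirror the slide, not an existing API.

```python
from bisect import bisect_left

values = [1, 2, 3, 1000]   # a sorted array standing in for the set

def rank(sorted_values, k):
    """Number of elements strictly smaller than k."""
    return bisect_left(sorted_values, k)

def select(sorted_values, k):
    """k-th smallest value, counting from 0."""
    return sorted_values[k]

def skip_to(sorted_values, x):
    """Index of the first value >= x, or len(sorted_values) if there is none."""
    return bisect_left(sorted_values, x)

smallest, largest = values[0], values[-1]   # min/max are the two ends
reverse_order = values[::-1]                # reverse iteration
```

On a sorted array, rank and the skippable iterator reduce to the same binary search; they differ only in how the result is used.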

  5. Let us make some assumptions...
    Many sets containing more than a few integers
    Integers span a wide range (e.g., [0, 100000))
    Mostly immutable (read often, write rarely)

  6. How do we implement integer sets?
    Assume sets are mostly immutable.
    sorted arrays (std::vector<uint32_t>)
    hash sets (java.util.HashSet<Integer>, std::unordered_set<uint32_t>)
    bitsets (java.util.BitSet)
    compressed bitsets

  7. What is a bitset???
    Efficient way to represent a set of integers.
    E.g., {0, 1, 3, 4} becomes 0b11011 or "27".
    Also called a "bitmap" or a "bit array".

  8. Add and contains on bitset
    Most processors work on 64-bit words.
    Given index x, the corresponding word index is x / 64 and the
    within-word bit index is x % 64.
    add(x) {
      array[x / 64] |= (1 << (x % 64))
    }
    contains(x) {
      return array[x / 64] & (1 << (x % 64))
    }
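
The add and contains pseudocode above could look like this in Python. This is a sketch under assumptions: the class name Bitset and the fixed capacity are illustrative, and a Python list of integers stands in for an array of 64-bit words.

```python
class Bitset:
    """Fixed-capacity bitset stored as a list of 64-bit words."""

    WORD_BITS = 64

    def __init__(self, capacity):
        # Round up so that every index below capacity has a word.
        self.words = [0] * ((capacity + self.WORD_BITS - 1) // self.WORD_BITS)

    def add(self, x):
        # Word index is x / 64, bit index within the word is x % 64.
        self.words[x // self.WORD_BITS] |= 1 << (x % self.WORD_BITS)

    def contains(self, x):
        word = self.words[x // self.WORD_BITS]
        return ((word >> (x % self.WORD_BITS)) & 1) == 1


b = Bitset(128)
for x in (0, 1, 3, 4, 100):
    b.add(x)
```

After these calls the first word holds 0b11011 (that is, 27) and the value 100 lands in the second word at bit 36.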

  9. How fast can you set bits in a bitset?
    Very fast! Roughly three instructions (on x64)...
    index = x / 64         -> a single shift
    mask = 1 << (x % 64)   -> a single shift
    array[index] |= mask   -> a logical OR to memory
    (Or use the x64 bts instruction.)
    On recent x64 processors, one bit can be set every ≈ 1.65 cycles (in cache).
    Recall: modern processors are superscalar (more than one
    instruction per cycle).

  10. Bit‑level parallelism
    Bitsets are efficient: intersections
    The intersection between {0, 1, 3} and {1, 3}
    can be computed as an AND operation between
    0b1011 and 0b1010.
    The result is 0b1010 or {1, 3}.
    Enables branchless processing.
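
A small Python sketch of the word-by-word AND, with a decoder to turn the result back into values. The function names intersect_words and to_values are assumptions for illustration; real implementations work on fixed-width machine words rather than Python integers.

```python
def intersect_words(a, b):
    """Word-by-word AND of two equally long bitsets (lists of 64-bit words)."""
    return [wa & wb for wa, wb in zip(a, b)]

def to_values(words, word_bits=64):
    """Decode a list of words into the sorted list of set bit positions."""
    out = []
    for i, w in enumerate(words):
        while w:
            low = w & -w                                  # isolate the lowest set bit
            out.append(i * word_bits + low.bit_length() - 1)
            w ^= low                                      # clear that bit
    return out

a = [0b1011]   # {0, 1, 3}
b = [0b1010]   # {1, 3}
common = to_values(intersect_words(a, b))
```

The AND loop has no data-dependent branch; only the decoding step depends on how many bits are set.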

  11. Bitsets are efficient: in practice
    for i in [0...n]
      out[i] = A[i] & B[i]
    Recent x64 processors can do this at a speed of ≈ 0.5 cycles per
    pair of input 64-bit words (in cache) for n = 1024.
    memcpy runs at ≈ 0.3 cycles.

  12. Bitsets can be inefficient
    Relatively wasteful to represent {1, 32000, 64000} with a bitset.
    Would use about 8000 bytes (1000 64-bit words) to store 3 numbers.
    So we use compression...
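
A back-of-the-envelope check of that cost, assuming 64-bit words for the bitset and 32-bit entries for a sorted array. Both helper names are hypothetical.

```python
def bitset_bytes(max_value, word_bits=64):
    """Bytes used by an uncompressed bitset that can hold max_value."""
    words = max_value // word_bits + 1
    return words * (word_bits // 8)

def sorted_array_bytes(count, bytes_per_value=4):
    """Bytes used by a sorted array of 32-bit integers."""
    return count * bytes_per_value

sparse = [1, 32000, 64000]
as_bitset = bitset_bytes(max(sparse))
as_array = sorted_array_bytes(len(sparse))
```

The bitset size depends on the largest value, the array size on the number of values, which is why sparse sets favor arrays.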

  13. Memory usage example
    dataset: census1881_srt
    format                          bits per value
    hash sets                       200
    arrays                          32
    bitsets                         900
    compressed bitsets (Roaring)    2
    https://github.com/RoaringBitmap/CBitmapCompetition

  14. Performance example (unions)
    dataset: census1881_srt
    format                          CPU cycles per value
    hash sets                       200
    arrays                          6
    bitsets                         30
    compressed bitsets (Roaring)    1
    https://github.com/RoaringBitmap/CBitmapCompetition

  15. What is happening? (Bitsets)
    Bitsets are often best... except if data is
    very sparse (lots of 0s). Then you spend a
    lot of time scanning zeros.
    Large memory usage
    Bad performance
    Threshold? ~1:100

  16. Hash sets are not always fast
    Hash sets have great one‑value look‑up. But
    they have poor data locality and non‑trivial overhead...
    h1 <- some hash set
    h2 <- some hash set
    ...
    for (x in h1) {
      insert x in h2 // "sure" to hit a new cache line!!!!
    }

  17. Want to kill Swift?
    Swift is Apple's new language. Try this:
    var d = Set<Int>()
    for i in 1...size {
      d.insert(i)
    }
    //
    var z = Set<Int>()
    for i in d {
      z.insert(i)
    }
    This blows up! Quadratic‑time.
    Same problem with Rust.

  18. What is happening? (Arrays)
    Arrays are your friends. Reliable. Simple. Economical.
    But... binary search is branchy and has bad locality...
    while (low <= high) {
      int middleIndex = (low + high) >>> 1;
      int middleValue = array.get(middleIndex);
      if (middleValue < ikey) {
        low = middleIndex + 1;
      } else if (middleValue > ikey) {
        high = middleIndex - 1;
      } else {
        return middleIndex;
      }
    }
    return -(low + 1);

  19. Performance: value lookups (x ∈ S)
    dataset: weather_sept_85
    format                            CPU cycles per query
    hash sets (std::unordered_set)    50
    arrays                            900
    bitsets                           4
    compressed bitsets (Roaring)      80

  20. How do you compress bitsets?
    We have long runs of 0s or 1s.
    Use run‑length encoding (RLE)
    Example: 000000001111111100 can be coded as
    00000000 − 11111111 − 00
    or
    <8><0> <8><1> <2><0>
    using the format <number of repetitions><value being repeated>
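
A toy bit-level run-length encoder in Python, following the (repetitions, value) format above. The function names are assumptions, and practical formats compress whole words rather than individual bits, so treat this as an illustration of the idea only.

```python
from itertools import groupby

def rle_encode(bits):
    """Encode a string of '0'/'1' characters as (repetitions, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(bits)]

def rle_decode(pairs):
    """Inverse of rle_encode."""
    return "".join(value * count for count, value in pairs)

encoded = rle_encode("000000001111111100")
```

For the example string this yields three runs: eight 0s, eight 1s and two 0s.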

  21. RLE‑compressed bitsets
    Oracle's BBC
    WAH (FastBit)
    EWAH (Git + Apache Hive)
    Concise (Druid)

    Further reading:
    http://githubengineering.com/counting-objects/

  22. Hybrid Model
    Decompose the 32-bit space into
    16-bit spaces (chunks).
    Given value x, its chunk index is x ÷ 2^16 (16 most significant bits).
    For each chunk, use the best container to store the 16 least significant bits:
    a sorted array ({1,20,144})
    a bitset (0b10000101011)
    a sequence of sorted runs ([0,10],[15,20])
    That's Roaring!
    Prior work: O'Neil's RIDBit + BitMagic
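
A rough Python sketch of the chunking step. All names here are hypothetical, and the cutoff ARRAY_LIMIT = 4096 is an assumption based on the observation that 4096 values at 2 bytes each take the same 8 kB as a 2^16-bit bitset. Run containers are left out of this sketch.

```python
ARRAY_LIMIT = 4096  # assumed cutoff: 4096 * 2 bytes = 8 kB = one 2^16-bit bitset

def split(x):
    """Split a 32-bit value into (chunk index, 16 least significant bits)."""
    return x >> 16, x & 0xFFFF

def group_by_chunk(values):
    """Map each chunk index to the sorted list of its 16-bit low parts."""
    chunks = {}
    for x in sorted(set(values)):
        high, low = split(x)
        chunks.setdefault(high, []).append(low)
    return chunks

def container_kind(lows):
    """Pick 'array' or 'bitset' from the cardinality alone."""
    return "array" if len(lows) <= ARRAY_LIMIT else "bitset"

chunks = group_by_chunk([1, 20, 144, 65536, 65537])
```

Each chunk is handled independently, so a single bitmap can mix dense and sparse regions.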

  23. Roaring
    All containers fit in 8 kB (several fit in L1 cache)
    Attempts to select the best container as you build the bitmaps
    Calling runOptimize will scan (quickly!) non-run containers
    and try to convert them to run containers
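
One way to picture the conversion to runs, as a Python sketch. This is not the library's runOptimize: to_runs and prefer_runs are hypothetical helpers, and the size model (2 bytes per array value, 4 bytes per run, 8192 bytes per bitset) is an assumption used only to make the trade-off concrete.

```python
def to_runs(lows):
    """Collapse a sorted list of distinct integers into inclusive [start, end] runs."""
    runs = []
    for v in lows:
        if runs and v == runs[-1][1] + 1:
            runs[-1][1] = v          # extend the current run
        else:
            runs.append([v, v])      # start a new run
    return runs

def prefer_runs(lows):
    """True if the assumed run encoding is smaller than both array and bitset."""
    run_bytes = 4 * len(to_runs(lows))
    other_bytes = min(2 * len(lows), 8192)
    return run_bytes < other_bytes

dense = list(range(0, 11)) + list(range(15, 21))
runs = to_runs(dense)
```

Long consecutive stretches collapse into a handful of runs, while scattered values stay cheaper as an array.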

  24. Performance: union (weather_sept_85)
    format     CPU cycles per value
    bitsets    0.6
    WAH        4
    EWAH       2
    Concise    5
    Roaring    0.6

  25. What helps us...
    All modern processors have fast population-count functions
    (popcnt) to count the number of 1s in a word.
    Cheap to keep track of the number of values stored in a bitset!
    Choice between array, run and bitset covers many use cases!
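
A sketch of how population counts give cardinalities, in Python. The helper names are assumptions, and bin(word).count("1") is a portable stand-in for the hardware popcnt instruction, not a claim about how any library calls it.

```python
def popcount(word):
    """Number of 1 bits in a non-negative integer."""
    return bin(word).count("1")

def cardinality(words):
    """Number of values stored in a bitset given as a list of words."""
    return sum(popcount(w) for w in words)

def intersection_cardinality(a, b):
    """Count common values without materializing the intersection."""
    return sum(popcount(wa & wb) for wa, wb in zip(a, b))

a = [0b1011, 0b1]
b = [0b1010, 0b0]
```

Counting the intersection this way is enough to compute a Jaccard index, since the size of the union follows from the two cardinalities and the intersection count.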

  26. Go try it out!
    Java, Go, C, C++, C#, Rust, Python... (soon: Swift)
    http://roaringbitmap.org
    Documented interoperable serialized format.
    Free. Well‑tested. Benchmarked.
    Peer reviewed
    Consistently faster and smaller compressed bitmaps with Roaring. Softw., Pract. Exper. (2016)
    Better bitmap performance with Roaring bitmaps. Softw., Pract. Exper. (2016)
    Optimizing Druid with Roaring bitmaps, IDEAS 2016
    Wide community (dozens of contributors).