Next Generation Indexes For Big Data Engineering (Waterloo University, May 2018)

Maximizing performance in data engineering is a daunting challenge. We present some of our work on designing faster indexes, with a particular emphasis on compressed indexes. Some of our prior work includes (1) Roaring indexes, which are part of multiple big-data systems such as Spark, Hive, Druid, Atlas, Pinot, and Kylin, and (2) EWAH indexes, which are part of Git (GitHub) and included in major Linux distributions.

We will present ongoing and future work on how we can process data faster while supporting the diverse systems found in the cloud (with upcoming ARM processors) and under multiple programming languages (e.g., Java, C++, Go, Python). We seek to minimize shared resources (e.g., RAM) while exploiting algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. Our end goal is to process billions of records per second per core.

The talk will be aimed at programmers who want to better understand the performance characteristics of current big-data systems as well as their evolution. The following specific topics will be addressed:

1. The various types of indexes and their performance characteristics and trade-offs: hashing, sorted arrays, bitsets and so forth.

2. Index and table compression techniques: advances in integer compression, dictionary coding.

Daniel Lemire

May 08, 2018

Transcript

  1. Daniel Lemire, Waterloo University, May 10th 2018.


  2. Next Generation Indexes For Big Data Engineering
    Daniel Lemire and collaborators
    blog: https://lemire.me
    twitter: @lemire
    Université du Québec (TÉLUQ)
    Montreal

  3. Knuth on performance
    Premature optimization is the root of all evil

  4. Knuth on performance
    Premature optimization is the root of all evil (...) After a programmer knows which parts of his
    routines are really important, a transformation like doubling up of loops will be worthwhile.

  5. Constants matter
    fasta benchmark:
                           elapsed time   total time (all processors)
    single‑threaded        1.36 s         1.36 s
    https://benchmarksgame‑team.pages.debian.net/benchmarksgame/performance/fasta.html

  6. Constants matter
    fasta benchmark:
                           elapsed time   total time (all processors)
    single‑threaded        1.36 s         1.36 s
    multicore (4 cores)    1.00 s         2.00 s

  7. Constants matter
    fasta benchmark:
                           elapsed time   total time (all processors)
    single‑threaded        1.36 s         1.36 s
    multicore (4 cores)    1.00 s         2.00 s
    vectorized (1 core)    0.31 s         0.31 s
    https://lemire.me/blog/2018/01/02/multicore‑versus‑simd‑instructions‑the‑fasta‑case‑study/

  8. “One Size Fits All”: An Idea Whose Time Has Come and Gone (Stonebraker, 2005)

  9. Rediscover Unix
    In 2018, Big Data Engineering is made of several specialized and re‑usable components:
    Calcite : SQL + optimization
    Hadoop
    etc.

  10. "Make your own database engine from parts"
    We are in a Cambrian explosion, with thousands of organizations and companies building their
    custom high‑speed systems.
    Specialized use cases
    Heterogeneous data (not everything is in your Oracle DB)

  11. For high‑speed in data engineering you need...
    Front‑end (data frame, SQL, visualisation)
    High‑level optimizations
    Indexes (e.g., Pilosa, Elasticsearch)
    Great compression routines
    Specialized data structures
    ....

  12. Sets
    A fundamental concept (sets of documents, identifiers, tuples...)
    → For performance, we often work with sets of integers (identifiers).

  13. tests: x ∈ S?
    intersections: S₂ ∩ S₁, unions: S₂ ∪ S₁, differences: S₂ ∖ S₁
    Similarity (Jaccard/Tanimoto): ∣S₁ ∩ S₂∣/∣S₁ ∪ S₂∣
    Iteration
      for x in S do
        print(x)
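
    As a minimal sketch of the operations listed on this slide, the Python below uses the built-in `set` type. The function name `jaccard`, the sample values, and the convention chosen for two empty sets are assumptions made for illustration; they are not from the talk.

```python
def jaccard(s1, s2):
    """Jaccard (Tanimoto) similarity: |s1 ∩ s2| / |s1 ∪ s2|."""
    union = s1 | s2
    if not union:
        return 1.0  # convention chosen here for two empty sets (an assumption)
    return len(s1 & s2) / len(union)


s1 = {1, 3, 5, 7}
s2 = {3, 4, 5, 6}

print(5 in s1)          # membership test
print(sorted(s1 & s2))  # intersection
print(sorted(s1 | s2))  # union
print(sorted(s1 - s2))  # difference
print(jaccard(s1, s2))  # similarity
for x in sorted(s1):    # iteration in increasing order
    print(x)
```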

  14. How to implement sets?
    sorted arrays (std::vector<uint32_t>)
    hash tables (java.util.HashSet<Integer>, std::unordered_set<uint32_t>)
    bitmap (java.util.BitSet)
    compressed bitmaps

  15. Arrays are your friends
    while (low <= high) {
      // unsigned shift keeps the midpoint valid even if low + high overflows
      int mI = (low + high) >>> 1;
      int m = array.get(mI);
      if (m < key) {
        low = mI + 1;
      } else if (m > key) {
        high = mI - 1;
      } else {
        return mI;
      }
    }
    // not found: encode the insertion point as a negative value
    return -(low + 1);
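
    The loop on this slide can be exercised with the following Python sketch, which keeps the same return convention: the index when the key is found, otherwise `-(insertion point + 1)`. The function name `binary_search` and the sample list are hypothetical.

```python
def binary_search(array, key):
    """Return the index of key in a sorted list, or -(insertion_point + 1) if absent."""
    low, high = 0, len(array) - 1
    while low <= high:
        mid = (low + high) // 2  # Python integers do not overflow
        value = array[mid]
        if value < key:
            low = mid + 1
        elif value > key:
            high = mid - 1
        else:
            return mid
    return -(low + 1)


ids = [2, 3, 5, 8, 13]
print(binary_search(ids, 8))  # key present
print(binary_search(ids, 4))  # key absent: negative result encodes where it would go
```

    A negative result `r` can be turned back into an insertion point with `-r - 1`.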

  16. Hash tables
    value x at index h(x)
    random access to a value in expected constant‑time
    much faster than searching a sorted array

  17. in‑order access is kind of terrible
    [15, 3, 0, 6, 11, 4, 5, 9, 12, 13, 8, 2, 1, 14, 10, 7]
    (Robin Hood, linear probing, MurmurHash3 hash function)

  18. Set operations on hash tables
    h1 <- hash set
    h2 <- hash set
    ...
    for (x in h1) {
      insert x in h2 // cache miss?
    }

  19. "Crash" Swift
    var S1 = Set<Int>(1...size)
    var S2 = Set<Int>()
    for i in S1 {
      S2.insert(i)
    }

  20. Some numbers: half an hour for 64M keys
    size time (s)
    1M 0.8
    8M 22
    64M 1400
    Maps and sets can have quadratic‑time performance
    https://lemire.me/blog/2017/01/30/maps‑and‑sets‑can‑have‑quadratic‑time‑performance/
    Rust hash iteration+reinsertion
    https://accidentallyquadratic.tumblr.com/post/153545455987/rust‑hash‑iteration‑reinsertion

  21. Daniel Lemire, Waterloo University, May 10th 2018. 21


  22. Bitmaps
    Efficient way to represent sets of integers.
    For example, 0, 1, 3, 4 becomes 0b11011 or "27".
    {0} → 0b00001
    {0, 3} → 0b01001
    {0, 3, 4} → 0b11001
    {0, 1, 3, 4} → 0b11011
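
    The mapping between a set of small integers and its bitmap can be checked with a short Python sketch. The helper names `to_bitmap` and `from_bitmap` are made up for this example, and a single Python integer stands in for the bitmap.

```python
def to_bitmap(values):
    """Set bit x for every non-negative integer x in values."""
    bitmap = 0
    for x in values:
        bitmap |= 1 << x
    return bitmap


def from_bitmap(bitmap):
    """List the positions of the set bits in increasing order."""
    values = []
    position = 0
    while bitmap:
        if bitmap & 1:
            values.append(position)
        bitmap >>= 1
        position += 1
    return values


print(bin(to_bitmap([0, 1, 3, 4])), to_bitmap([0, 1, 3, 4]))
print(from_bitmap(0b01001))
```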

  23. Manipulate a bitmap
    64‑bit processor.
    Given x, word index is x / 64 and bit index is x % 64.
    add(x) {
      array[x / 64] |= (1 << (x % 64))
    }
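
    A minimal sketch of the word-index and bit-index arithmetic, written in Python with a list of integers standing in for an array of 64-bit words. The class name `Bitset`, its `contains` method, and the fixed capacity are assumptions added for illustration; this is not a tuned implementation.

```python
WORD_BITS = 64


class Bitset:
    """Fixed-capacity bitset stored as a list of 64-bit words."""

    def __init__(self, capacity):
        # round up to a whole number of words
        self.words = [0] * ((capacity + WORD_BITS - 1) // WORD_BITS)

    def add(self, x):
        self.words[x // WORD_BITS] |= 1 << (x % WORD_BITS)

    def contains(self, x):
        return ((self.words[x // WORD_BITS] >> (x % WORD_BITS)) & 1) == 1


b = Bitset(200)
for x in (0, 63, 64, 130):
    b.add(x)
print(b.contains(64), b.contains(65))
```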

  24. How fast is it?
    index = x / 64 -> a shift
    mask = 1 << ( x % 64 ) -> a shift
    array[ index ] |= mask -> an OR with memory
    One bit every ≈ 1.65 cycles because of superscalarity

  25. Bit parallelism
    Intersection between {0, 1, 3} and {1, 3}
    a single AND operation
    between 0b1011 and 0b1010.
    Result is 0b1010 or {1, 3}.
    No branching!
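
    The same intersection can be written as a word-by-word AND in Python. The function name `intersect` and the one-word inputs are illustrative assumptions; a real bitset would hold many 64-bit words.

```python
def intersect(words1, words2):
    """Word-by-word AND of two bitsets given as equal-length lists of words."""
    return [w1 & w2 for w1, w2 in zip(words1, words2)]


a = [0b1011]  # {0, 1, 3}
b = [0b1010]  # {1, 3}
result = intersect(a, b)
print(bin(result[0]))
```

    Each AND processes a whole word of candidate values at once, with no data-dependent branch per value.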

  26. Bitmaps love wide registers
    SIMD: Single Instruction Multiple Data
    SSE (Pentium 4), ARM NEON 128 bits
    AVX/AVX2 (256 bits)
    AVX‑512 (512 bits)
    AVX‑512 is now available (e.g., from Dell!) with Skylake‑X processors.

  27. Bitsets can take too much memory
    {1, 32000, 64000}: about 8000 bytes (1000 64‑bit words) for three values
    We use compression!

  28. Git (GitHub) uses EWAH
    Run‑length encoding
    Example: 000000001111111100 is
    00000000 − 11111111 − 00
    Code long runs of 0s or 1s efficiently.
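
    The run-length idea on this slide can be sketched in Python at the level of individual bits. This is only an illustration of run-length encoding: EWAH itself works on machine words rather than single bits, and the function names `run_lengths` and `expand` are hypothetical.

```python
from itertools import groupby


def run_lengths(bits):
    """Collapse a string of '0'/'1' characters into (bit, run length) pairs."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]


def expand(runs):
    """Inverse of run_lengths: rebuild the original bit string."""
    return "".join(bit * length for bit, length in runs)


runs = run_lengths("000000001111111100")
print(runs)
print(expand(runs))
```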
    https://github.com/git/git/blob/master/ewah/bitmap.c

  29. Complexity
    Intersection: O(∣S₁∣ + ∣S₂∣) or O(min(∣S₁∣, ∣S₂∣))
    In‑place union (S₂ ← S₁ ∪ S₂): O(∣S₁∣ + ∣S₂∣) or O(∣S₂∣)
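
    One way to read the two intersection bounds is sketched below in Python: a linear merge of two sorted arrays touches both inputs, while probing a hash set with the smaller input takes expected time proportional to the smaller input only. The pairing of each bound with a data structure is an assumption of this sketch (the slide does not spell it out), and both function names are hypothetical.

```python
def intersect_merge(a, b):
    """Intersect two sorted lists by a linear merge: O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out


def intersect_probe(small, large_set):
    """Probe each element of the smaller input in a hash set: expected O(len(small))."""
    return [x for x in small if x in large_set]


print(intersect_merge([1, 3, 5, 7], [3, 4, 5, 6]))
print(intersect_probe([3, 9], {1, 3, 5, 7, 11}))
```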

  30. Roaring Bitmaps
    http://roaringbitmap.org/
    Apache Lucene, Solr and Elasticsearch, Metamarkets’ Druid, Apache Spark, Apache Hive,
    Apache Tez, Netflix Atlas, LinkedIn Pinot, InfluxDB, Pilosa, Microsoft Visual Studio Team
    Services (VSTS), Couchbase's Bleve, Intel’s Optimized Analytics Package (OAP), Apache
    Hivemall, eBay’s Apache Kylin.
    Java, C, Go (interoperable)

  31. Hybrid model
    Set of containers
    sorted arrays ({1,20,144})
    bitset (0b10000101011)
    runs ([0,10],[15,20])
    Related to: O'Neil's RIDBit + BitMagic

  32. Roaring bitmaps 32


  33. Roaring
    All containers are small (8 kB), fit in CPU cache
    We predict the output container type during computations
    E.g., when array gets too large, we switch to a bitset
    Union of two large arrays is materialized as a bitset...
    Dozens of heuristics... sorting networks and so on
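
    The hybrid container model can be illustrated with a much simplified Python sketch. It assumes the layout described in the Roaring papers rather than on these slides: 32-bit values are grouped by their high 16 bits, and a chunk is kept as a sorted array up to 4096 values and as a 2^16-bit bitset (8 kB) beyond that. Run containers and all heuristics are left out, and `build_containers` is a hypothetical name, not a Roaring library API.

```python
ARRAY_LIMIT = 4096  # 4096 values of 16 bits each already occupy 8 kB, the size of a bitset


def build_containers(values):
    """Group 32-bit integers by their high 16 bits and pick a container type per chunk."""
    chunks = {}
    for v in values:
        chunks.setdefault(v >> 16, set()).add(v & 0xFFFF)
    containers = {}
    for key, lows in chunks.items():
        if len(lows) <= ARRAY_LIMIT:
            containers[key] = ("array", sorted(lows))
        else:
            bits = 0
            for low in lows:
                bits |= 1 << low
            containers[key] = ("bitset", bits)
    return containers


sparse = [1, 20, 144]
dense = range(1 << 16, (1 << 16) + 5000)
containers = build_containers(list(sparse) + list(dense))
for key, (kind, _) in sorted(containers.items()):
    print(key, kind)
```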

  34. Use Roaring for bitmap compression whenever possible. Do not use other bitmap compression
    methods (Wang et al., SIGMOD 2017)

  35. Unions of 200 bitmaps
    bits per stored value:
                  bitset   array   hash table   Roaring
    census1881    524      32      195          15.1
    weather       15.3     32      195          5.38
    cycles per input value:
                  bitset   array   hash table   Roaring
    census1881    9.85     542     1010         2.6
    weather       0.35     94      237          0.16

  36. Sometimes you do want arrays!!!
    But you'd like to compress them.
    Not always: compression can be counterproductive.
    Still, if you must compress, you want to do it fast.

  37. Integer compression
    "Standard" technique: VByte, VarInt, VInt
    Use 1, 2, 3, 4, ... bytes per integer
    Use one bit per byte to indicate whether the integer continues in the next byte
    Lucene, Protocol Buffers, etc.
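
    A byte-at-a-time variable-length code of this kind can be sketched in Python as follows. The sketch assumes one common convention (7 payload bits per byte, low-order bits first, high bit set when more bytes follow); specific libraries may differ in detail, and the function names are hypothetical.

```python
def vbyte_encode(values):
    """Encode non-negative integers using 7 payload bits per byte."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # high bit set: more bytes follow
            v >>= 7
        out.append(v)                      # high bit clear: last byte
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:   # one test per byte read
            shift += 7
        else:
            values.append(current)
            current = 0
            shift = 0
    return values


encoded = vbyte_encode([1, 300, 0])
print(encoded.hex(), vbyte_decode(encoded))
```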

  38. varint‑GB from Google
    VByte: one branch per integer
    varint‑GB: one branch per 4 integers
    each 4‑integer block is preceded by a control byte
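
    A scalar Python sketch of the group layout: each control byte carries four 2-bit fields giving the byte length (1 to 4) of the four integers that follow it. The order of the fields inside the control byte and the little-endian data bytes are assumptions of this sketch, as are the function names; the input length is assumed to be a multiple of four.

```python
def byte_length(v):
    """Number of bytes (1 to 4) needed for a 32-bit unsigned integer."""
    return max(1, (v.bit_length() + 7) // 8)


def varint_gb_encode(values):
    """Encode groups of four integers: one control byte, then their data bytes."""
    out = bytearray()
    for i in range(0, len(values), 4):
        group = values[i:i + 4]
        control = 0
        for k, v in enumerate(group):
            control |= (byte_length(v) - 1) << (2 * k)  # 2 bits per integer
        out.append(control)
        for v in group:
            out += v.to_bytes(byte_length(v), "little")
    return bytes(out)


def varint_gb_decode(data, count):
    """Decode count integers (a multiple of four) from varint_gb_encode output."""
    values = []
    pos = 0
    while len(values) < count:
        control = data[pos]  # read once, then decode four integers
        pos += 1
        for k in range(4):
            n = ((control >> (2 * k)) & 3) + 1
            values.append(int.from_bytes(data[pos:pos + n], "little"))
            pos += n
    return values


group = [1, 300, 70000, 5]
packed_group = varint_gb_encode(group)
print(packed_group.hex(), varint_gb_decode(packed_group, 4))
```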

  39. Vectorization
    Stepanov (STL in C++) working for Amazon proposed varint‑G8IU
    Use vectorization (SIMD)
    Patented
    Fastest byte‑oriented compression technique (until recently)
    SIMD‑Based Decoding of Posting Lists, CIKM 2011
    https://stepanovpapers.com/SIMD_Decoding_TR.pdf

  40. Observations from Stepanov et al.
    We can vectorize Google's varint‑GB, but it is not as fast as varint‑G8IU

  41. Stream VByte
    Reuse varint‑GB from Google
    But instead of mixing control bytes and data bytes, ...
    We store control bytes separately and consecutively...
    Daniel Lemire, Nathan Kurz, Christoph Rupp
    Stream VByte: Faster Byte‑Oriented Integer Compression
    Information Processing Letters 130, 2018
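
    The separation of control bytes from data bytes can be sketched in scalar Python. The published implementation decodes with SIMD instructions; this sketch only shows the two-stream layout, and the bit order inside a control byte, the little-endian data bytes, and the function names are assumptions.

```python
def stream_vbyte_encode(values):
    """Return (control_bytes, data_bytes) as two separate streams."""
    control = bytearray()
    data = bytearray()
    for i in range(0, len(values), 4):
        key = 0
        for k, v in enumerate(values[i:i + 4]):
            n = max(1, (v.bit_length() + 7) // 8)  # 1 to 4 bytes per 32-bit value
            key |= (n - 1) << (2 * k)
            data += v.to_bytes(n, "little")
        control.append(key)
    return bytes(control), bytes(data)


def stream_vbyte_decode(control, data, count):
    """Decode count integers from the two streams."""
    values = []
    pos = 0
    for key in control:
        for k in range(4):
            if len(values) == count:
                break
            n = ((key >> (2 * k)) & 3) + 1
            values.append(int.from_bytes(data[pos:pos + n], "little"))
            pos += n
    return values


numbers = [1, 300, 70000, 5, 9]
control_stream, data_stream = stream_vbyte_encode(numbers)
print(control_stream.hex(), data_stream.hex())
print(stream_vbyte_decode(control_stream, data_stream, len(numbers)))
```

    Because all control bytes sit together, a decoder can read the length information for many integers without stepping through the data bytes first.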

  42. Integer compression 42


  43. Stream VByte is used by...
    Redis (within RediSearch) https://redislabs.com
    upscaledb https://upscaledb.com
    Trinity https://github.com/phaistos‑networks/Trinity

  44. Dictionary coding
    Used, e.g., by Apache Arrow
    Given a list of values:
    "Montreal", "Toronto", "Boston", "Montreal", "Boston"...
    Map to integers
    0, 1, 2, 0, 2
    Compress integers:
    Given 2ⁿ distinct values...
    Can use n bits per value (binary packing, patched coding, frame‑of‑reference)
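
    Using the city names from this slide, dictionary coding followed by fixed-width bit packing can be sketched in Python. The function names are hypothetical, codes are assigned in order of first appearance (an assumption), and a single Python integer stands in for the packed output.

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code, in order of first appearance."""
    codes_by_value = {}
    codes = []
    for v in values:
        codes.append(codes_by_value.setdefault(v, len(codes_by_value)))
    return list(codes_by_value), codes


def pack(codes, bits):
    """Pack fixed-width codes into one integer, first code in the lowest bits."""
    packed = 0
    for i, c in enumerate(codes):
        packed |= c << (i * bits)
    return packed


def unpack(packed, bits, count):
    """Extract count fixed-width codes from a packed integer."""
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]


cities = ["Montreal", "Toronto", "Boston", "Montreal", "Boston"]
dictionary, codes = dictionary_encode(cities)
bits = max(1, (len(dictionary) - 1).bit_length())  # n bits cover up to 2**n distinct values
packed = pack(codes, bits)
print(dictionary, codes, bits)
print([dictionary[c] for c in unpack(packed, bits, len(codes))])
```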

  45. Dictionary coding + SIMD
    dict. size   bits per value   scalar   AVX2 (256‑bit)   AVX‑512 (512‑bit)
    32           5                8        3                1.5
    1024         10               8        3.5              2
    65536        16               12       5.5              4.5
    (cycles per value decoded)
    https://github.com/lemire/dictionary

  46. To learn more...
    Blog (twice a week) : https://lemire.me/blog/
    GitHub: https://github.com/lemire
    Home page : https://lemire.me/en/
    NSERC (CRSNG): Faster Compressed Indexes On Next‑Generation Hardware (2017‑2022)
    Twitter @lemire