Things you should know about Database Storage and Retrieval

Pedro Tavares

April 24, 2018
Transcript

  1. things you should know about…
    @ordepdev
    Database Storage
    and Retrieval

  2. Why should
    you care?

  3. db_set() {
    echo "$1,$2" >> database;
    }
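The bash one-liner above is the whole write path. A minimal Python sketch of the same idea, with the matching read that scans for the last occurrence of a key (function names and the default file path are illustrative, not from the deck):

```python
# Append-only key-value store in its simplest form: writes append a
# "key,value" line; reads scan the whole file and keep the last match.

def db_set(key, value, path="database"):
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key, path="database"):
    result = None
    with open(path) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == str(key):
                result = v  # keep scanning: the last write wins
    return result
```

Writes are O(1) appends; reads are O(n) scans, which is exactly the problem the rest of the deck addresses.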

  4. Log-Structured file
    1991

  5. A log-structured file system writes all
    modifications to disk sequentially in
    a log-like structure, thereby speeding
    up both file writing and crash
    recovery. ”

  6. Collect large amounts of new data
    in a file cache in main memory, then
    write the data to disk in a single
    large I/O.”

  7. How do we avoid
    running out of space?

  8. C:1 A:1 B:1 B:2 A:2 A:3 A:4 B:3
    Data File Segment
    Compaction

  9. C:1 A:1 B:1 B:2 A:2 A:3 A:4 B:3
    C:1 B:3 A:4
    Data File Segment
    Compacted Segment
    Compaction

  10. B:4 C:2 C:3 C:4 C:5 A:5 A:6 A:7
    C:1 A:1 B:1 B:2 A:2 A:3 A:4 B:3
    Data Segment 1
    Data Segment 2
    Merging & Compaction

  11. B:4 C:2 C:3 C:4 C:5 A:5 A:6 A:7
    C:1 A:1 B:1 B:2 A:2 A:3 A:4 B:3
    B:4 C:5 A:7
    +
    Data Segment 1
    Data Segment 2
    Compacted & Merged Segment
    Merging & Compaction
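The merging-and-compaction pictures above can be sketched in a few lines: scan entries in write order and keep only the latest value per key. The list-of-pairs representation is an assumption for illustration; real segments are byte ranges on disk.

```python
# Segment compaction sketch: a segment is a list of (key, value) pairs
# in write order, so later entries simply overwrite earlier ones.

def compact(segment):
    latest = {}
    for key, value in segment:        # later entries overwrite earlier ones
        latest[key] = value
    return list(latest.items())

def merge_and_compact(old_segment, new_segment):
    # new_segment is more recent, so its values win on conflict
    return compact(old_segment + new_segment)
```

Running `compact` on the slide's segment [C:1 A:1 B:1 B:2 A:2 A:3 A:4 B:3] leaves exactly one entry per key: C:1, B:3, A:4.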

  12. Why use an append-only log?
    Sequential write operations are much faster
    than random writes.
    Concurrency and crash recovery are much simpler.
    Merging old segments avoids fragmentation.

  13. How do we find the
    value of a given key?

  14. Index
    Additional structure that is derived from data.
    It keeps some additional metadata on the side
    that helps to locate the data.
    Maintaining such structures incurs overhead,
    especially on write!

  15. The simplest possible indexing
    strategy is to keep an in-memory
    hash map where each key is
    mapped to a byte offset.”

  16. Hash Indexes
    Log-structured file on disk:
    100,{"name":"Porto"}\n101,{"name":"Lisboa"}\n
    In-memory hash map (key → byte offset):
    100 → 0
    101 → 20

  17. The hash map is updated when a
    new key-value pair is appended to
    the file in order to reflect the
    current data offset.”

  18. A:2 B:2 C:2
    A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1
    Data Segment 1
    Data Segment 2
    2010
    Hash Indexes

  19. When a write occurs, the keydir is
    atomically updated with the location
    of the newest data. The old data is
    still present on disk, but any new
    reads will use the latest version
    available in the keydir.”

  20. Hash Indexes LIMITATIONS
    Not suitable for a very large number of keys,
    since the entire hash map must fit in memory!
    Scanning over a range of keys is not efficient,
    since it would be necessary to look up each key
    individually in the hash map.

  21. SSTables
    A:2 B:2 C:2
    A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1
    Data Segment 1
    Data Segment 2
    2006

  22. An SSTable provides a persistent,
    ordered immutable map from keys
    to values, where both keys and
    values are arbitrary byte strings.”

  23. A lookup can be performed by first
    finding the appropriate block with a
    binary search in the in-memory index,
    and then reading the appropriate block
    from disk.”

  24. Sparse in-memory index
    Compacted Data Segment: B:4 C:5 A:7
    Sorted segment file on disk:
    A:1 B:1 C:1 D:2 E:1 F:1 G:9 H:1
    I:2 J:4 K:2 L:7 M:1 N:7 O:1 P:3
    …
    In-memory index (key → byte offset):
    A → 100491
    I → 101201
    M → 103041
    X → 104204
    …
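The lookup described on the previous slides can be sketched like this: binary-search the sparse index for the last indexed key at or before the target, then scan forward from there. In-memory lists stand in for the on-disk segment, and `every=4` is an arbitrary sparseness, both assumptions for illustration:

```python
import bisect

# Sparse-index lookup sketch over a sorted segment. The segment is a
# sorted list of (key, value) pairs; the sparse index maps every Nth key
# to its position, standing in for byte offsets on disk.

def build_sparse_index(segment, every=4):
    return [(segment[i][0], i) for i in range(0, len(segment), every)]

def sstable_get(segment, sparse_index, key):
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1    # last indexed key <= target
    if i < 0:
        return None                           # key sorts before the segment
    start = sparse_index[i][1]
    for k, v in segment[start:]:              # short scan within one "block"
        if k == key:
            return v
        if k > key:
            break                             # sorted order: key is absent
    return None
```

Because the segment is sorted, the scan never has to go past the next indexed key, which is why the index can stay sparse.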

  25. Merging & Compaction
    A:2 B:2 C:2
    A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1
    Data Segment 1
    Data Segment 2

  26. Merging & Compaction
    A:2 B:2 C:2
    A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1
    +
    Data Segment 1
    Data Segment 2
    Compacted & Merged Segment
    A:2 B:2 C:2 D:1 E:1 F:1 G:1 H:1
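The merge itself works like the merge step of mergesort. A sketch, where segments are sorted lists of pairs ordered oldest to newest (an illustration, not a real SSTable merge):

```python
import heapq

# Merge sorted segments in key order; when several segments hold the
# same key, the value from the most recent (last-listed) segment wins.

def merge_segments(*segments):
    # Tag each entry with its segment number; higher number = newer.
    tagged = ([(k, i, v) for k, v in seg] for i, seg in enumerate(segments))
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, value)   # newer segment overwrites older
        else:
            merged.append((key, value))
    return merged
```

On the slide's data this yields A:2 B:2 C:2 D:1 E:1 F:1 G:1 H:1, already sorted and ready to serve as a new segment.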

  27. Storage engines that are based on
    this principle of merging and
    compacting sorted files are often
    called LSM storage engines.

  28. LSM-Tree
    A:2 B:2 C:2
    A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1
    Data Segment 1
    Data Segment 2
    2006

  29. The LSM-tree uses an algorithm that
    defers and batches index changes,
    cascading the changes from a
    memory-based component through
    one or more disk components in an
    efficient manner reminiscent of
    merge sort.”

  30. What about
    performance?

  31. Bloom filters
    Memory-efficient data
    structure for approximating
    the contents of a set.


  32. Bloom filters
    It can tell if a key does not exist in the
    database, saving many unnecessary
    disk reads for nonexistent keys.
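A toy version of the idea, with sizes chosen arbitrarily (a real engine would derive the bit-array size and hash count from the expected number of keys and the target false-positive rate):

```python
import hashlib

# Toy Bloom filter: k hash functions set k bits per added key. If any of
# a key's bits is unset, the key was definitely never added, so the
# expensive SSTable lookup can be skipped.

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive k positions by salting one hash function with an index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

Note the asymmetry: a negative answer is certain, a positive answer is only probabilistic, which is exactly the trade-off the slide describes.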

  33. Advantages over Hash indexes
    All values in one input segment are more recent
    than all values in the other segment.
    When multiple segments contain the same key,
    the value from the most recent segment is kept
    and older segments are discarded.
    In order to find a particular key in the file, there's
    no longer a need to keep the full index in memory!

  34. B-TREES
    A:2 B:2 C:2
    A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1
    Data Segment 1
    Data Segment 2
    1970

  35. B-TREES
    A:2 B:2 C:2
    A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1
    Data Segment 1
    Data Segment 2
    1979

  36. The index is organized in pages of
    fixed size capable of holding up to
    2k keys, but pages need only be
    partially filled.


  37. B-TREES
    [B-tree diagram: pages with keys 20 and 51 at the top, 11 and 65
    below, and leaf keys 2 7 | 12 15 | 55 62]

  38. B-TREES
    “look up id 15”
    [same B-tree diagram: the lookup descends page by page toward the
    leaf containing key 15]
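The page-by-page descent in the diagram can be sketched as follows; the nested-tuple page layout is purely illustrative, not how pages are laid out on disk:

```python
import bisect

# B-tree lookup sketch. A page is a (keys, payload) pair: for interior
# pages the payload is a list of child pages, for leaf pages it is a
# list of values aligned with the keys.

def btree_lookup(page, key):
    keys, payload = page
    if isinstance(payload[0], tuple):          # interior page: descend
        child = payload[bisect.bisect_right(keys, key)]
        return btree_lookup(child, key)
    i = bisect.bisect_left(keys, key)          # leaf page: final lookup
    if i < len(keys) and keys[i] == key:
        return payload[i]
    return None
```

Each step narrows the key range by the branching factor, which is why a lookup touches only a handful of pages even for huge trees.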

  44. B-TREES
    The number of references to child pages in one
    page is called the branching factor (~hundreds).
    In order to add a new key, we need to find the
    page within the key range and split it into two
    pages if there's no space to accommodate it.
    A four-level tree of 4KB pages with a branching
    factor of 500 can store up to 256TB!!!
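The 256TB claim checks out as simple arithmetic, taking the slide to mean four levels of 500-way branching over 4KB pages:

```python
# Capacity check for the slide's figure: four levels of 500-way
# branching address 500**4 pages of 4 KB each.

branching_factor = 500
levels = 4
page_size_bytes = 4 * 1024

capacity_bytes = branching_factor ** levels * page_size_bytes
# 500**4 * 4096 == 256 * 10**12 bytes, i.e. 256 TB (decimal terabytes)
```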

  45. What about
    resilience?

  46. Write-ahead log (WAL) ~ redo log
    All modifications must be written to the log before
    they can be applied to the pages of the tree itself.
    Used to restore the B-tree back to a consistent
    state after a crash.
    Writing all modifications to the WAL means that a
    B-tree index must write every piece of data at
    least twice!!!
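The write-before-apply discipline can be sketched like this; the dict standing in for tree pages and the log file name are illustrative:

```python
import os

# WAL sketch: append the change to the log and force it to disk BEFORE
# modifying the in-place structure, so a crash between the two steps
# can always be repaired by replaying the log.

def wal_write(pages, key, value, wal_path="wal.log"):
    with open(wal_path, "a") as wal:
        wal.write(f"{key},{value}\n")
        wal.flush()
        os.fsync(wal.fileno())        # the modification is durable first...
    pages[key] = value                # ...and only then applied in place

def wal_recover(wal_path="wal.log"):
    pages = {}
    with open(wal_path) as wal:
        for line in wal:              # replaying the log in order rebuilds
            key, _, value = line.rstrip("\n").partition(",")
            pages[key] = value        # a consistent state
    return pages
```

The two writes per modification in this sketch (one to the log, one in place) are the write amplification the next slide talks about.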

  47. Write amplification
    One write to the database results in multiple
    writes to disk.
    Write amplification has a direct performance cost!
    The more that a storage engine writes to disk,
    the fewer writes per second it can handle.

  48. Wrapping Up.

  49. Reads & Writes
    Writes are slower on B-trees since they must write
    every piece of data at least twice — once to the
    write-ahead log and once to the tree page!
    Reads are slower on LSM-trees since they have to
    check several data structures and SSTables at
    different stages of compaction!
    LSM-trees are able to sustain higher write
    throughput due to lower write amplification and
    sequential writes.

  50. Which one is the best
    type of storage?

  51. There is no quick and easy rule
    for determining which type of
    storage engine is better for
    your use case, so it is worth
    testing empirically.”

  52. A:2 B:2 C:2
    A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1
    Data Segment 1
    Data Segment 2
    YOU SHOULD
    READ PAPERS!

  53. things you should know about…
    @ordepdev
    Database Storage
    and Retrieval