Things you should know about Database Storage and Retrieval

On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask again later, it should give the data back to you. Why should you, as an application developer, care how the database handles storage and retrieval internally? To select the storage engine that is appropriate for your application, you need to have a rough idea of what the storage engine is doing under the hood. In this talk, we’ll discuss and examine some core data structures, such as hash indexes, SSTables, LSM-trees, and B-trees, that are used in traditional relational and NoSQL databases.


Pedro Tavares

June 14, 2018

Transcript

  1. 8.

    “A log-structured file system writes all modifications to disk sequentially
    in a log-like structure, thereby speeding up both file writing and crash recovery.”
  2. 9.

    “Collect large amounts of new data in a file cache
    in main memory, then write the data to disk in a single large I/O.”
  3. 10.

    [Diagram: append-only data file segments (1, 2, … N) filling sequentially from memory,
    each holding successive versions of keys, e.g. A:1 … A:7, B:1 … B:7, C:1 … C:4, D:1, D:2, E:1]
  4. 13.

    [Diagram: compaction of a data file segment: the records C:1 A:1 B:1 B:2 A:2 A:3 A:4 B:3
    are compacted to C:1 B:3 A:4, keeping only the newest version of each key]
  5. 14.

    [Diagram: merging & compaction, step 1: two data file segments,
    segment 1 (C:1 A:1 B:1 B:2 A:2 A:3 A:4 B:3) and segment 2 (B:4 C:2 C:3 C:4 C:5 A:5 A:6 A:7)]
  6. 15.

    [Diagram: merging & compaction, step 2: data file segments 1 and 2 are merged
    into a compacted & merged segment holding only B:4 C:5 A:7]
  7. 16.

    Why use an append-only log? Sequential write operations are much
    faster than random writes. Concurrency and crash recovery are much simpler. Merging old segments avoids fragmentation.
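The segment compaction and merging shown on the earlier slides can be sketched in a few lines of Python. This is an illustrative model only, with segments as in-memory lists of (key, version) records rather than a real on-disk format:

```python
def compact(segment):
    """Keep only the newest value for each key in one log segment.

    A segment is modeled as a list of (key, value) records in write
    order, so later records supersede earlier ones.
    """
    newest = {}
    for key, value in segment:
        newest[key] = value          # later writes win
    return list(newest.items())

def merge(*segments):
    """Merge segments (passed oldest to newest) into one compacted segment."""
    newest = {}
    for segment in segments:
        for key, value in segment:
            newest[key] = value      # newer segments override older ones
    return list(newest.items())
```

Running `compact` over the slide’s example segment (C:1 A:1 B:1 B:2 A:2 A:3 A:4 B:3) keeps exactly C:1, B:3, and A:4, matching the compacted segment on the slide.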
  8. 18.

    Index: an additional structure that is derived from the data. It keeps
    some metadata on the side that helps locate the data. Maintaining such structures incurs overhead, especially on writes!
  9. 19.

    “The simplest possible indexing strategy is to keep an in-memory
    hash map where each key is mapped to a byte offset.”
  10. 20.

    [Diagram: hash indexes. An in-memory hash map of key → byte offset
    (100 → 0, 101 → 20) over a log-structured file on disk holding the
    records 100,{"name":"Porto"} and 101,{"name":"Lisbon"}]
  12. 22.

    [Diagram: hash indexes over two data segments, segment 1
    (A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1) and segment 2 (A:2 B:2 C:2), dated 2010]
  13. 23.

    “When a write occurs, the keydir is atomically updated with
    the location of the newest data.”
  14. 24.

    “The old data is still present on disk, but any
    new reads will use the latest version available in the keydir.”
  15. 25.

    Hash indexes: LIMITATIONS. Not suitable for a very large number
    of keys, since the entire hash map must fit in memory! Scanning over a range of keys is not efficient: each key would have to be looked up individually in the hash maps.
  16. 26.

    [Diagram: SSTables over two data segments, segment 1
    (A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1) and segment 2 (A:2 B:2 C:2), dated 2006]
  17. 27.

    “An SSTable provides a persistent, ordered immutable map from keys
    to values, where both keys and values are arbitrary byte strings.”
  18. 28.

    “A lookup can be performed by first finding the appropriate
    block with a binary search in the in-memory index, and then reading the appropriate block from disk.”
  19. 29.

    [Diagram: a sparse in-memory index (A → 100491, I → 101201, M → 103041, X → 104204)
    over a sorted, compacted segment file on disk holding keys A:1 through P:3]
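The lookup described in the quote can be sketched against the slide’s sparse index. The offsets below are the illustrative values from the slide, and the function name is invented for this sketch; a real engine would follow up by reading the block at that offset from disk and scanning it:

```python
import bisect

# The slide's sparse index: one (key, byte offset) entry per block of the
# sorted segment file.
SPARSE_INDEX = [("A", 100491), ("I", 101201), ("M", 103041), ("X", 104204)]

def block_offset(key):
    """Return the offset of the only block that can contain `key`.

    Because the segment file is sorted, a binary search for the last
    indexed key <= `key` is enough to pick the block.
    """
    keys = [k for k, _ in SPARSE_INDEX]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None          # key sorts before the first indexed block
    return SPARSE_INDEX[i][1]
```

For example, a lookup of "J" lands in the block that starts at key "I", so only that one block has to be read and scanned.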
  20. 30.

    [Diagram: merging & compaction of sorted segments, step 1: data segment 1
    (A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1) and data segment 2 (A:2 B:2 C:2)]
  21. 31.

    Merging & Compactation B:4 C:5 A:7 + Compacted & Merged

    Segment A:2 B:2 C:2 D:1 E:1 F:1 G:1 H1 A:2 B:2 C:2 A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1 Data Segment 1 Data Segment 2
  22. 32.

    Storage engines that are based on this principle of merging

    and compacting sorted files are often called LSM storage engines.
  23. 33.

    [Diagram: LSM-tree over two data segments, segment 1
    (A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1) and segment 2 (A:2 B:2 C:2), dated 2006]
  24. 36.

    Bloom filters: a Bloom filter can tell when a key does not
    exist in the database, saving many unnecessary disk reads for nonexistent keys.
  25. 37.

    [Diagram: Bloom filters & SSTables: a Bloom filter sits in front of the
    sparse in-memory index (A → 100491, I → 101201, M → 103041, X → 104204)
    and the sorted segment file on disk]
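The Bloom filter’s guarantee, "definitely absent" versus "maybe present", can be sketched as follows. The bit-array size and the salted-SHA-256 hashing scheme here are illustrative choices for the sketch, not what any particular engine uses:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch: answers "definitely absent" or "maybe
    present"; it never misses a key that was actually added."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # Any unset bit proves the key was never added: skip the disk read.
        return all(self.bits[pos] for pos in self._positions(key))
```

An LSM engine consults the filter before touching the SSTables on disk: a negative answer means none of the segments need to be read for that key.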
  26. 38.

    Advantages over hash indexes: when multiple segments contain the same
    key, the value from the most recent segment is kept and older segments are discarded. To find a particular key in the file, there is no longer a need to keep the full index in memory!
  27. 39.

    [Diagram: B-trees over two data segments, segment 1
    (A:1 B:1 C:1 D:1 E:1 F:1 G:1 H:1) and segment 2 (A:2 B:2 C:2), dated 1970 / 1979]
  28. 40.

    “The index is organized in pages of fixed size capable
    of holding up to 2k keys, but pages need only be partially filled.”
  29. 42.

    [Diagram: B-tree lookup of id 15, traced step by step: from the root page
    (key 51), through internal pages (keys 11 and 65), down to the leaf pages
    (2, 7 / 12, 15, 20 / 55, 62)]
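The lookup traced on the slide can be sketched with pages modeled as dicts. The tree below loosely mirrors the slide’s keys; the 70/80 leaf is invented just to keep the right subtree complete, and real B-trees store fixed-size pages on disk rather than nested dicts:

```python
def btree_lookup(page, key):
    """Descend from the root page to a leaf, at each level following the
    child whose key range contains `key`; return whether the key exists.

    Internal pages have sorted separator 'keys' and one more entry in
    'children' than keys; leaf pages have only 'keys'.
    """
    while "children" in page:
        i = 0
        while i < len(page["keys"]) and key >= page["keys"][i]:
            i += 1                     # first separator greater than key
        page = page["children"][i]
    return key in page["keys"]

# A small tree loosely mirroring the slide.
root = {
    "keys": [51],
    "children": [
        {"keys": [11],
         "children": [{"keys": [2, 7]}, {"keys": [12, 15, 20]}]},
        {"keys": [65],
         "children": [{"keys": [55, 62]}, {"keys": [70, 80]}]},
    ],
}
```

Looking up id 15 follows the slide’s trace: 15 < 51 goes left at the root, 15 ≥ 11 goes right at the internal page, and the leaf (12, 15, 20) contains the key.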
  35. 53.

    Write-ahead log (WAL) ~ redo log. All modifications must be written
    to the WAL before they can be applied to the pages of the tree itself. Writing all modifications to the WAL means that a B-tree index must write every piece of data at least twice!
  36. 54.

    Write amplification: one write to the database results in
    multiple writes to disk. Write amplification has a direct performance cost: the more a storage engine writes to disk, the fewer writes per second it can handle.
  37. 59.

    Reads are slower on LSM-trees since they have to check

    several data structures at different stages of compaction.
  38. 60.

    LSM-trees are able to sustain higher write throughput due to

    lower write amplification and sequential writes.
  39. 62.

    There is no quick and easy rule for determining which

    type of storage engine is better for your use case.