InfluxDB's new storage engine: The Time Structured Merge Tree

Paul Dix
October 14, 2015

Transcript

  1. InfluxDB’s new storage engine:
    The Time Structured Merge Tree
    Paul Dix
    CEO at InfluxDB
    @pauldix
    paul@influxdb.com

  2. preliminary intro materials…

  3. Everything is indexed by time
    and series

  4. Shards
    10/10/2015 | 10/11/2015 | 10/12/2015 | 10/13/2015
    Data organized into Shards of time, each is an underlying DB
    efficient to drop old data

  5. InfluxDB data
    temperature,device=dev1,building=b1 internal=80,external=18 1443782126

  6. InfluxDB data
    temperature,device=dev1,building=b1 internal=80,external=18 1443782126
    Measurement

  7. InfluxDB data
    temperature,device=dev1,building=b1 internal=80,external=18 1443782126
    Measurement Tags

  8. InfluxDB data
    temperature,device=dev1,building=b1 internal=80,external=18 1443782126
    Measurement Tags Fields

  9. InfluxDB data
    temperature,device=dev1,building=b1 internal=80,external=18 1443782126
    Measurement Tags Fields Timestamp

  10. InfluxDB data
    temperature,device=dev1,building=b1 internal=80,external=18 1443782126
    Measurement Tags Fields Timestamp
    We actually store up to ns scale timestamps
    but I couldn’t fit on the slide
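
    As a sketch, a parsed point could be modeled like this in Go (illustrative
    names, not InfluxDB's internal types):

    // temperature,device=dev1,building=b1 internal=80,external=18 1443782126
    type Point struct {
        Measurement string                 // "temperature"
        Tags        map[string]string      // device=dev1, building=b1
        Fields      map[string]interface{} // internal=80, external=18
        Timestamp   int64                  // unix time, up to ns precision
    }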

  11. Each series and field maps to a unique ID
    temperature,device=dev1,building=b1#internal → 1
    temperature,device=dev1,building=b1#external → 2

  12. Data per ID is tuples ordered by time
    temperature,device=dev1,building=b1#internal → 1
    temperature,device=dev1,building=b1#external → 2
    1 → (1443782126,80)
    2 → (1443782126,18)
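
    A minimal sketch of that assignment (hypothetical code, assuming a simple
    in-memory map):

    // map each "series#field" key to a unique uint64 ID
    var (
        ids    = map[string]uint64{}
        nextID = uint64(1)
    )

    func idForKey(key string) uint64 {
        id, ok := ids[key]
        if !ok {
            id = nextID
            nextID++
            ids[key] = id
        }
        return id
    }

    // idForKey("temperature,device=dev1,building=b1#internal") == 1
    // idForKey("temperature,device=dev1,building=b1#external") == 2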

  13. Arranging in Key/Value Stores
    Key (ID, Time)    Value
    1,1443782126      80

  14. Arranging in Key/Value Stores
    Key (ID, Time)    Value
    1,1443782126      80
    2,1443782126      18

  15. Arranging in Key/Value Stores
    Key (ID, Time)    Value
    1,1443782126      80
    2,1443782126      18
    1,1443782127      81    (new data)

  16. Arranging in Key/Value Stores
    Key (ID, Time)    Value
    1,1443782126      80
    1,1443782127      81
    2,1443782126      18
    key space is ordered

  17. Arranging in Key/Value Stores
    Key (ID, Time)    Value
    1,1443782126      80
    1,1443782127      81
    2,1443782126      18
    2,1443782130      17
    2,1443782256      15
    3,1443700126      18
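
    One way to get that ordering is to encode the ID and timestamp big-endian,
    so byte-wise key comparison sorts by ID, then time (a sketch, not
    necessarily the engine's actual encoding):

    import "encoding/binary"

    // 16-byte key: 8-byte ID, then 8-byte timestamp. Big-endian
    // means lexicographic key order equals (ID, time) order.
    func makeKey(id uint64, unixNano int64) []byte {
        key := make([]byte, 16)
        binary.BigEndian.PutUint64(key[0:8], id)
        binary.BigEndian.PutUint64(key[8:16], uint64(unixNano))
        return key
    }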

  18. Many existing storage engines
    have this model

  19. New Storage Engine?!

  20. First we used LSM Trees

  21. deletes expensive

  22. too many open file handles

  23. Then mmap COW B+Trees

  24. write throughput suffered

  25. neither met our requirements

  26. High write throughput

  27. Awesome read performance

  28. Better Compression

  29. Writes can’t block reads

  30. Reads can’t block writes

  31. Write multiple ranges
    simultaneously

  32. Many databases open in a
    single process

  33. Enter InfluxDB’s
    Time Structured Merge Tree
    (TSM Tree)

  34. Enter InfluxDB’s
    Time Structured Merge Tree
    (TSM Tree)
    like LSM, but different

  35. Components
    WAL, in-memory cache, index files

  36. Components
    WAL, in-memory cache, index files
    Similar to LSM Trees

  37. Components
    WAL (same), in-memory cache, index files
    Similar to LSM Trees

  38. Components
    WAL (same), in-memory cache (like MemTables), index files
    Similar to LSM Trees

  39. Components
    WAL (same), in-memory cache (like MemTables), index files (like SSTables)
    Similar to LSM Trees

  40. awesome time series data
    WAL (an append-only file)

  41. awesome time series data
    WAL (an append-only file)
    in-memory index

  42. In Memory Cache
    // cache and flush variables
    cacheLock  sync.RWMutex
    cache      map[string]Values
    flushCache map[string]Values
    temperature,device=dev1,building=b1#internal

  43. In Memory Cache
    // cache and flush variables
    cacheLock  sync.RWMutex
    cache      map[string]Values
    flushCache map[string]Values
    writes can come in while WAL flushes

  44. // cache and flush variables
    cacheLock  sync.RWMutex
    cache      map[string]Values
    flushCache map[string]Values
    dirtySort  map[string]bool
    values can come in out of order.
    mark if so, sort at query time
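
    A sketch of the write path these fields imply (hypothetical names,
    assuming Values is a slice of Value and the fields live on an engine
    struct):

    type engine struct {
        cacheLock  sync.RWMutex
        cache      map[string]Values
        flushCache map[string]Values
        dirtySort  map[string]bool
    }

    func (e *engine) write(key string, v Value) {
        e.cacheLock.Lock()
        defer e.cacheLock.Unlock()

        vals := e.cache[key]
        // a value older than the newest one breaks sort order:
        // mark the key and defer sorting until query time
        if n := len(vals); n > 0 && v.UnixNano() < vals[n-1].UnixNano() {
            e.dirtySort[key] = true
        }
        e.cache[key] = append(vals, v)
    }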

  45. Values in Memory
    type Value interface {
        Time() time.Time
        UnixNano() int64
        Value() interface{}
        Size() int
    }
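
    A minimal float64 implementation of that interface might look like this
    (illustrative, not the engine's actual type):

    type FloatValue struct {
        T time.Time
        V float64
    }

    func (f FloatValue) Time() time.Time    { return f.T }
    func (f FloatValue) UnixNano() int64    { return f.T.UnixNano() }
    func (f FloatValue) Value() interface{} { return f.V }
    func (f FloatValue) Size() int          { return 16 } // 8 bytes time + 8 bytes value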

  46. awesome time series data
    WAL (an append-only file)
    in-memory index
    on-disk index
    (periodic flushes)

  47. The Index
    Data File: Min Time 10000, Max Time 29999
    Data File: Min Time 30000, Max Time 39999
    Data File: Min Time 70000, Max Time 99999
    Contiguous blocks of time

  48. The Index
    Data File: Min Time 10000, Max Time 29999
    Data File: Min Time 15000, Max Time 39999
    Data File: Min Time 70000, Max Time 99999
    can overlap

  49. The Index
    cpu,host=A: Min Time 10000, Max Time 20000
    cpu,host=A: Min Time 21000, Max Time 39999
    Data File: Min Time 70000, Max Time 99999
    but a specific series must not overlap

  50. The Index
    Data File | Data File | Data File | Data File | Data File   (time ascending)
    a file will never overlap with more than 2 others
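
    Because files cover time ranges, finding the ones relevant to a query is
    a simple overlap filter (a sketch; MaxTime is assumed to mirror the
    MinTime accessor shown later):

    // pick data files whose [MinTime, MaxTime] overlaps [min, max]
    func filesForRange(files []*dataFile, min, max int64) []*dataFile {
        var matched []*dataFile
        for _, f := range files {
            if f.MinTime() <= max && f.MaxTime() >= min {
                matched = append(matched, f)
            }
        }
        return matched
    }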

  51. Data files are read-only, like LSM SSTables

  52. The Index
    Data File: Min Time 10000, Max Time 29999
    Data File: Min Time 30000, Max Time 39999   →   Data File: Min Time 10000, Max Time 99999
    Data File: Min Time 70000, Max Time 99999
    they periodically get compacted (like LSM)

  53. Compacting while appending new data

  54. Compacting while appending new data
    func (w *WriteLock) LockRange(min, max int64) {
        // sweet code here
    }

    func (w *WriteLock) UnlockRange(min, max int64) {
        // sweet code here
    }

  55. Compacting while appending new data
    func (w *WriteLock) LockRange(min, max int64) {
        // sweet code here
    }

    func (w *WriteLock) UnlockRange(min, max int64) {
        // sweet code here
    }
    This should block until we get it
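
    One possible shape for a blocking range lock, using a condition variable
    over the set of held ranges (a sketch, not InfluxDB's actual
    implementation):

    type WriteLock struct {
        mu     sync.Mutex
        cond   *sync.Cond
        ranges [][2]int64 // currently held [min, max] ranges
    }

    func NewWriteLock() *WriteLock {
        w := &WriteLock{}
        w.cond = sync.NewCond(&w.mu)
        return w
    }

    func (w *WriteLock) LockRange(min, max int64) {
        w.mu.Lock()
        defer w.mu.Unlock()
        for w.overlaps(min, max) { // block until no held range overlaps
            w.cond.Wait()
        }
        w.ranges = append(w.ranges, [2]int64{min, max})
    }

    func (w *WriteLock) UnlockRange(min, max int64) {
        w.mu.Lock()
        defer w.mu.Unlock()
        for i, r := range w.ranges {
            if r[0] == min && r[1] == max {
                w.ranges = append(w.ranges[:i], w.ranges[i+1:]...)
                break
            }
        }
        w.cond.Broadcast() // wake waiters so they re-check their ranges
    }

    func (w *WriteLock) overlaps(min, max int64) bool {
        for _, r := range w.ranges {
            if r[0] <= max && r[1] >= min {
                return true
            }
        }
        return false
    }

    Disjoint ranges can then be locked concurrently, which is what lets
    compactions run while new writes append.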

  56. Locking happens inside each
    Shard

  57. Back to the data files…
    Data File: Min Time 10000, Max Time 29999
    Data File: Min Time 30000, Max Time 39999
    Data File: Min Time 70000, Max Time 99999

  58. Data File Layout

  59. Data File Layout
    Similar to SSTables

  60. Data File Layout

  61. Data File Layout
    blocks have up to 1,000 points by default

  62. Data File Layout

  63. Data File Layout
    a 4-byte position means data files can be at most 2^32 bytes = 4GB

  64. Data Files
    type dataFile struct {
        f    *os.File
        size uint32
        mmap []byte
    }

  65. Memory mapping lets the OS
    handle caching for you

  66. Access file like a byte slice
    // min time lives at a fixed offset from the end of the file
    func (d *dataFile) MinTime() int64 {
        minTimePosition := d.size - minTimeOffset
        timeBytes := d.mmap[minTimePosition : minTimePosition+timeSize]
        return int64(btou64(timeBytes))
    }

  67. Binary Search for ID
    func (d *dataFile) StartingPositionForID(id uint64) uint32 {
        seriesCount := d.SeriesCount()
        indexStart := d.indexPosition()
        min := uint32(0)
        max := uint32(seriesCount)
        for min < max {
            mid := (max-min)/2 + min
            offset := mid*seriesHeaderSize + indexStart
            checkID := btou64(d.mmap[offset : offset+timeSize])
            if checkID == id {
                return btou32(d.mmap[offset+timeSize : offset+timeSize+posSize])
            } else if checkID < id {
                min = mid + 1
            } else {
                max = mid
            }
        }
        return uint32(0)
    }
    The Index: IDs are sorted

  68. Compressed Data Blocks

  69. Timestamps: encoding based
    on precision and deltas

  70. Timestamps (best case):
    Run-length encoding
    Deltas are all the same for a block

  71. Timestamps (good case):
    Simple8B
    Anh and Moffat, "Index compression using 64-bit words"

  72. Timestamps (worst case):
    raw values
    nanosecond timestamps with large deltas
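
    The selection between the three cases might look like this (a sketch; the
    real encoder also accounts for timestamp precision):

    // choose an encoding for a block of ascending timestamps
    func chooseTimeEncoding(times []int64) string {
        deltas := make([]uint64, 0, len(times))
        sameDelta := true
        for i := 1; i < len(times); i++ {
            d := uint64(times[i] - times[i-1])
            if len(deltas) > 0 && d != deltas[0] {
                sameDelta = false
            }
            deltas = append(deltas, d)
        }
        switch {
        case sameDelta:
            return "rle" // best case: store one delta and a count
        case maxDelta(deltas) < 1<<60:
            return "simple8b" // good case: deltas pack into 64-bit words
        default:
            return "raw" // worst case: store timestamps uncompressed
        }
    }

    func maxDelta(ds []uint64) uint64 {
        var m uint64
        for _, d := range ds {
            if d > m {
                m = d
            }
        }
        return m
    }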

  73. float64: double delta
    Facebook’s Gorilla paper (google: gorilla time series facebook)
    https://github.com/dgryski/go-tsz
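
    Usage of go-tsz follows this pattern (based on the repo's documented API;
    treat the exact signatures as an assumption and check the repo; it works
    on uint32 second timestamps):

    import (
        "fmt"

        "github.com/dgryski/go-tsz"
    )

    func main() {
        s := tsz.New(1443782126) // block start time
        s.Push(1443782126, 80.0) // (timestamp, value)
        s.Push(1443782136, 80.5)
        s.Finish()

        it := s.Iter()
        for it.Next() {
            t, v := it.Values()
            fmt.Println(t, v)
        }
    }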

  74. booleans are bits!

  75. int64 uses zig-zag
    same as Protobufs
    (also looking at adding double delta and RLE)
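
    Zig-zag interleaves negative and positive values so small magnitudes of
    either sign become small unsigned integers; a self-contained sketch:

    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    func zigZagEncode(v int64) uint64 {
        return uint64((v << 1) ^ (v >> 63)) // arithmetic shift spreads the sign bit
    }

    func zigZagDecode(u uint64) int64 {
        return int64(u>>1) ^ -int64(u&1)
    }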

  76. string uses Snappy
    same compression LevelDB uses
    (might add dictionary compression)
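
    With the Go Snappy package (the deck just says Snappy; the specific
    package here is an assumption), compressing a block of string values is
    one call each way:

    import "github.com/golang/snappy"

    func roundTrip(src []byte) ([]byte, error) {
        compressed := snappy.Encode(nil, src) // block-format Snappy
        return snappy.Decode(nil, compressed) // errors on corrupt input
    }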

  77. How does it perform?

  78. Compression depends greatly
    on the shape of your data

  79. Write throughput depends on
    batching, CPU, and memory

  80. test last night:
    100,000 series
    100,000 points per series
    10,000,000,000 total points
    5,000 points per request
    c3.8xlarge, writes from 4 other systems
    ~390,000 points/sec
    ~3 bytes/point (random floats, could be better)

  81. ~400 IOPS
    30%-50% CPU
    There’s room for improvement!

  82. Detailed writeup
    https://influxdb.com/docs/v0.9/concepts/storage_engine.html

  83. Thank you!
    Paul Dix
    @pauldix
    paul@influxdb.com
