Go snippets from the new InfluxDB storage engine

Paul Dix
October 02, 2015

Slides from my talk at GothamGo 2015

Transcript

  1. Go snippets from the new InfluxDB storage engine Paul Dix

    CEO at InfluxDB @pauldix paul@influxdb.com
  2. Or, the InfluxDB storage engine… and a sprinkling of Go

  3. preliminary intro materials…

  4. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126

  5. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement

  6. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Tags

  7. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Tags Fields

  8. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Tags Fields Timestamp

  9. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Tags Fields Timestamp We

    actually store ns timestamps but I couldn’t fit them on the slide
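
To make those four parts concrete, here is a rough Go sketch of how a single point could be held in memory; the Point struct and its field types are illustrative, not InfluxDB's actual types.

    package main

    import "fmt"

    // Point is an illustrative container for the four parts of a line-protocol
    // point: measurement, tags, fields, and a timestamp.
    type Point struct {
        Measurement string
        Tags        map[string]string
        Fields      map[string]float64
        Timestamp   int64 // nanoseconds in the real engine
    }

    func main() {
        // temperature,device=dev1,building=b1 internal=80,external=18 1443782126
        p := Point{
            Measurement: "temperature",
            Tags:        map[string]string{"device": "dev1", "building": "b1"},
            Fields:      map[string]float64{"internal": 80, "external": 18},
            Timestamp:   1443782126,
        }
        fmt.Printf("%+v\n", p)
    }
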
  10. Each series and field to a unique ID temperature,device=dev1,building=b1#internal temperature,device=dev1,building=b1#external

    1 2
  11. Data per ID is tuples ordered by time temperature,device=dev1,building=b1#internal temperature,device=dev1,building=b1#external

    1 2 1 (1443782126,80) 2 (1443782126,18)
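
A minimal sketch of that mapping in Go, with made-up names: each series+field string maps to a small integer ID, and values are kept per ID as time-ordered tuples.

    package main

    import "fmt"

    // seriesIDs and points mirror the slides: a series+field key maps to an ID,
    // and each ID holds a slice of (timestamp, value) tuples ordered by time.
    var seriesIDs = map[string]uint64{
        "temperature,device=dev1,building=b1#internal": 1,
        "temperature,device=dev1,building=b1#external": 2,
    }

    type tuple struct {
        unixNano int64
        value    float64
    }

    var points = map[uint64][]tuple{
        1: {{1443782126, 80}},
        2: {{1443782126, 18}},
    }

    func main() {
        id := seriesIDs["temperature,device=dev1,building=b1#internal"]
        fmt.Println(id, points[id]) // 1 [{1443782126 80}]
    }
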
  12. Arranging in Key/Value Stores 1,1443782126 Key Value 80 ID Time

  13. Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18

  14. Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18

    1,1443782127 81 new data
  15. Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18

    1,1443782127 81 key space is ordered
  16. Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18

    1,1443782127 81 2,1443782256 15 2,1443782130 17 3,1443700126 18
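
One way to get that ordering in a key/value store is to build each key as big-endian ID bytes followed by big-endian timestamp bytes, so byte-wise key order groups a series' points together, sorted by time. A sketch of the idea (not necessarily the exact key layout that was used):

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    // key composes an (ID, time) pair into a 16-byte key whose byte order
    // sorts first by ID, then by timestamp.
    func key(id uint64, unixNano int64) []byte {
        b := make([]byte, 16)
        binary.BigEndian.PutUint64(b[0:8], id)
        binary.BigEndian.PutUint64(b[8:16], uint64(unixNano))
        return b
    }

    func main() {
        a := key(1, 1443782126)
        b := key(1, 1443782127)
        c := key(2, 1443782126)
        fmt.Println(bytes.Compare(a, b) < 0) // true: same ID, earlier time first
        fmt.Println(bytes.Compare(b, c) < 0) // true: ID 1 sorts before ID 2
    }
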
  17. Many existing storage engines have this model

  18. New Storage Engine?!

  19. First we used LSM Trees

  20. deletes expensive

  21. too many open file handles

  22. Then mmap COW B+Trees

  23. write throughput

  24. compression

  25. met our requirements

  26. High write throughput

  27. Awesome read performance

  28. Better Compression

  29. Writes can’t block reads

  30. Reads can’t block writes

  31. Write multiple ranges simultaneously

  32. Hot backups

  33. Many databases open in a single process

  34. Enter InfluxDB’s Time Structured Merge Tree (TSM Tree)

  35. Enter InfluxDB’s Time Structured Merge Tree (TSM Tree) like LSM,

    but different
  36. Components WAL In memory cache Index Files

  37. Components WAL In memory cache Index Files Similar to LSM

    Trees
  38. Components WAL In memory cache Index Files Similar to LSM

    Trees Same
  39. Components WAL In memory cache Index Files Similar to LSM

    Trees Same like MemTables
  40. Components WAL In memory cache Index Files Similar to LSM

    Trees Same like MemTables like SSTables
  41. awesome time series data WAL (an append only file)

  42. awesome time series data WAL (an append only file) in

    memory index
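
A bare-bones sketch of the append-only idea, with a made-up file name and entry format: each write is appended to the end of the WAL and synced before being acknowledged, while the in-memory index holds the same data for queries.

    package main

    import (
        "log"
        "os"
    )

    func main() {
        // O_APPEND means every write lands at the end of the file.
        f, err := os.OpenFile("wal.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0666)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        entry := []byte("temperature,device=dev1,building=b1#internal 1443782126 80\n")
        if _, err := f.Write(entry); err != nil {
            log.Fatal(err)
        }
        if err := f.Sync(); err != nil { // make it durable before acknowledging
            log.Fatal(err)
        }
    }
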
  43. In Memory Cache // cache and flush variables cacheLock sync.RWMutex

    cache map[string]Values flushCache map[string]Values temperature,device=dev1,building=b1#internal
  44. // cache and flush variables cacheLock sync.RWMutex cache map[string]Values flushCache

    map[string]Values dirtySort map[string]bool mutexes are your friend
  45. // cache and flush variables cacheLock sync.RWMutex cache map[string]Values flushCache

    map[string]Values dirtySort map[string]bool mutexes are your friend values can come in out of order. may need sorting later
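
A simplified sketch of how a write might land in that cache, using the lock and dirty-sort idea from the slide (types and names are pared down, and values are just timestamps here):

    package main

    import (
        "fmt"
        "sync"
    )

    type cacheWriter struct {
        cacheLock sync.RWMutex
        cache     map[string][]int64 // per-series values, timestamps only for brevity
        dirtySort map[string]bool
    }

    // write appends under the write lock and marks the series dirty if the new
    // value arrived out of order, so it can be sorted lazily before a flush.
    func (c *cacheWriter) write(key string, unixNano int64) {
        c.cacheLock.Lock()
        defer c.cacheLock.Unlock()

        vals := c.cache[key]
        if n := len(vals); n > 0 && unixNano < vals[n-1] {
            c.dirtySort[key] = true
        }
        c.cache[key] = append(vals, unixNano)
    }

    func main() {
        c := &cacheWriter{cache: map[string][]int64{}, dirtySort: map[string]bool{}}
        c.write("temperature,device=dev1,building=b1#internal", 1443782127)
        c.write("temperature,device=dev1,building=b1#internal", 1443782126) // out of order
        fmt.Println(c.cache, c.dirtySort)
    }
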
  46. Different Value Types type Value interface { Time() time.Time UnixNano()

    int64 Value() interface{} Size() int }
  47. type Values []Value func (v Values) Encode(buf []byte) []byte {

    /* code here */ } // Sort methods func (a Values) Len() int { return len(a) } func (a Values) Swap(i, j int) { a[i], a[j] = a[j], a[i] } func (a Values) Less(i, j int) bool { return a[i].Time().UnixNano() < a[j].Time().UnixNano() }
  48. type Values []Value func (v Values) Encode(buf []byte) []byte {

    /* code here */ } // Sort methods func (a Values) Len() int { return len(a) } func (a Values) Swap(i, j int) { a[i], a[j] = a[j], a[i] } func (a Values) Less(i, j int) bool { return a[i].Time().UnixNano() < a[j].Time().UnixNano() } Sometimes I want generics… and then I come to my senses
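
For context, a self-contained sketch of one concrete type satisfying that Value interface and of sorting a Values slice by time; FloatValue is hypothetical and the Encode method is omitted.

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    type Value interface {
        Time() time.Time
        UnixNano() int64
        Value() interface{}
        Size() int
    }

    // FloatValue is one concrete value kind; other field types would get their
    // own implementations, which is what the interface stands in for instead
    // of generics.
    type FloatValue struct {
        T time.Time
        V float64
    }

    func (f FloatValue) Time() time.Time    { return f.T }
    func (f FloatValue) UnixNano() int64    { return f.T.UnixNano() }
    func (f FloatValue) Value() interface{} { return f.V }
    func (f FloatValue) Size() int          { return 16 } // 8 bytes time + 8 bytes float

    type Values []Value

    func (a Values) Len() int           { return len(a) }
    func (a Values) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
    func (a Values) Less(i, j int) bool { return a[i].UnixNano() < a[j].UnixNano() }

    func main() {
        vals := Values{
            FloatValue{time.Unix(1443782127, 0), 81},
            FloatValue{time.Unix(1443782126, 0), 80},
        }
        sort.Sort(vals) // order by time before encoding or flushing
        fmt.Println(vals[0].UnixNano(), vals[1].UnixNano())
    }
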
  49. Finding a specific time // Seek will point the cursor

    to the given time (or key) func (c *walCursor) SeekTo(seek int64) (int64, interface{}) { // Seek cache index c.position = sort.Search(len(c.cache), func(i int) bool { return c.cache[i].Time().UnixNano() >= seek }) // more sweet code }
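
sort.Search is doing the work in SeekTo: it returns the smallest index whose timestamp is at or after the seek time, or the slice length if every cached value is older. A tiny standalone example of that behavior:

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        times := []int64{10, 20, 30, 40}
        seek := int64(25)
        // Smallest index i with times[i] >= seek; len(times) if there is none.
        i := sort.Search(len(times), func(i int) bool { return times[i] >= seek })
        fmt.Println(i) // 2, pointing at timestamp 30
    }
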
  50. awesome time series data WAL (an append only file) in

    memory index on disk index (periodic flushes)
  51. The Index Data File Min Time: 10000 Max Time: 29999

    Data File Min Time: 30000 Max Time: 39999 Data File Min Time: 70000 Max Time: 99999 Contiguous blocks of time
  52. The Index Data File Min Time: 10000 Max Time: 29999

    Data File Min Time: 30000 Max Time: 39999 Data File Min Time: 70000 Max Time: 99999 non-overlapping
  53. Data files are read only, like LSM SSTables

  54. The Index Data File Min Time: 10000 Max Time: 29999

    Data File Min Time: 30000 Max Time: 39999 Data File Min Time: 70000 Max Time: 99999 Data File Min Time: 10000 Max Time: 99999 they periodically get compacted (like LSM)
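
Because each data file covers a non-overlapping block of time, a query only has to open the files whose min/max bounds intersect the requested range. A small sketch with made-up metadata:

    package main

    import "fmt"

    type dataFileMeta struct {
        MinTime, MaxTime int64
    }

    // overlapping returns the files whose [MinTime, MaxTime] range intersects
    // the query range [qMin, qMax].
    func overlapping(files []dataFileMeta, qMin, qMax int64) []dataFileMeta {
        var out []dataFileMeta
        for _, f := range files {
            if f.MaxTime >= qMin && f.MinTime <= qMax {
                out = append(out, f)
            }
        }
        return out
    }

    func main() {
        files := []dataFileMeta{{10000, 29999}, {30000, 39999}, {70000, 99999}}
        fmt.Println(overlapping(files, 25000, 35000)) // the first two files only
    }
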
  55. Compacting while appending new data

  56. Compacting while appending new data func (w *WriteLock) LockRange(min, max

    int64) { // sweet code here } func (w *WriteLock) UnlockRange(min, max int64) { // sweet code here }
  57. Compacting while appending new data func (w *WriteLock) LockRange(min, max

    int64) { // sweet code here } func (w *WriteLock) UnlockRange(min, max int64) { // sweet code here } This should block until we get it
  58. How to test?

  59. func TestWriteLock_RightIntersect(t *testing.T) { w := &tsm1.WriteLock{} w.LockRange(2, 10) lock

    := make(chan bool) timeout := time.NewTimer(10 * time.Millisecond) go func() { w.LockRange(5, 15) lock <- true }() select { case <-lock: t.Fatal("able to get lock when we shouldn't") case <-timeout.C: // we're all good } }
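
The slides don't show how LockRange is implemented; one simplified way to get the behavior this test expects is a mutex plus a condition variable over the set of held ranges. This is a sketch, not InfluxDB's implementation.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type rangeLock struct {
        mu   sync.Mutex
        cond *sync.Cond
        held [][2]int64 // currently locked [min, max] ranges
    }

    func newRangeLock() *rangeLock {
        r := &rangeLock{}
        r.cond = sync.NewCond(&r.mu)
        return r
    }

    // LockRange blocks until no held range overlaps [min, max].
    func (r *rangeLock) LockRange(min, max int64) {
        r.mu.Lock()
        defer r.mu.Unlock()
        for r.overlaps(min, max) {
            r.cond.Wait()
        }
        r.held = append(r.held, [2]int64{min, max})
    }

    func (r *rangeLock) UnlockRange(min, max int64) {
        r.mu.Lock()
        defer r.mu.Unlock()
        for i, h := range r.held {
            if h[0] == min && h[1] == max {
                r.held = append(r.held[:i], r.held[i+1:]...)
                break
            }
        }
        r.cond.Broadcast() // wake anyone waiting on a conflicting range
    }

    func (r *rangeLock) overlaps(min, max int64) bool {
        for _, h := range r.held {
            if h[1] >= min && h[0] <= max {
                return true
            }
        }
        return false
    }

    func main() {
        r := newRangeLock()
        r.LockRange(2, 10)
        go func() {
            time.Sleep(50 * time.Millisecond)
            r.UnlockRange(2, 10)
        }()
        r.LockRange(5, 15) // blocks until the overlapping [2, 10] lock is released
        fmt.Println("got the lock")
    }
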
  60. Back to the data files… Data File Min Time: 10000

    Max Time: 29999 Data File Min Time: 30000 Max Time: 39999 Data File Min Time: 70000 Max Time: 99999
  61. Data File Layout

  62. Data File Layout Similar to SSTables

  63. Data File Layout

  64. Data File Layout

  65. Data Files type dataFile struct { f *os.File size uint32

    mmap []byte }
  66. Memory mapping lets the OS handle caching for you

  67. Memory Mapping fInfo, err := f.Stat() if err != nil

    { return nil, err } mmap, err := syscall.Mmap( int(f.Fd()), 0, int(fInfo.Size()), syscall.PROT_READ, syscall.MAP_SHARED) if err != nil { return nil, err }
  68. Access file like a byte slice func (d *dataFile) MinTime()

    int64 { minTimePosition := d.size - minTimeOffset timeBytes := d.mmap[minTimePosition : minTimePosition+timeSize] return int64(btou64(timeBytes)) }
  69. Finding the time for an ID func (d *dataFile) StartingPositionForID(id

    uint64) uint32 { seriesCount := d.SeriesCount() indexStart := d.indexPosition() min := uint32(0) max := uint32(seriesCount) for min < max { mid := (max-min)/2 + min offset := mid*seriesHeaderSize + indexStart checkID := btou64(d.mmap[offset : offset+timeSize]) if checkID == id { return btou32(d.mmap[offset+timeSize : offset+timeSize+posSize]) } else if checkID < id { min = mid + 1 } else { max = mid } } return uint32(0) } The Index: IDs are sorted
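
The btou64 and btou32 helpers aren't shown on the slides; presumably they are thin wrappers over encoding/binary, something like the following (the big-endian byte order is an assumption):

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    func btou64(b []byte) uint64 { return binary.BigEndian.Uint64(b) }
    func btou32(b []byte) uint32 { return binary.BigEndian.Uint32(b) }

    func main() {
        b := []byte{0, 0, 0, 0, 0, 0, 0, 42}
        fmt.Println(btou64(b))      // 42
        fmt.Println(btou32(b[4:8])) // 42
    }
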
  70. Compressed Data Blocks

  71. Timestamps: encoding based on precision and deltas

  72. Timestamps (good case): Simple8B Anh and Moffat in "Index compression

    using 64-bit words"
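
Simple8B packs runs of small integers into 64-bit words, and it works well here because the deltas between regular timestamps are small and repetitive. As a simplified illustration of the delta idea only (plain delta plus varints, not Simple8B and not the engine's actual encoder):

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // encodeDeltas stores the first timestamp in full, then only the gaps
    // between consecutive timestamps, each as a varint.
    func encodeDeltas(ts []int64) []byte {
        buf := make([]byte, 0, len(ts)*binary.MaxVarintLen64)
        tmp := make([]byte, binary.MaxVarintLen64)
        var prev int64
        for i, t := range ts {
            d := t
            if i > 0 {
                d = t - prev
            }
            n := binary.PutVarint(tmp, d)
            buf = append(buf, tmp[:n]...)
            prev = t
        }
        return buf
    }

    func main() {
        ts := []int64{1443782126000000000, 1443782127000000000, 1443782128000000000}
        enc := encodeDeltas(ts)
        fmt.Printf("%d timestamps -> %d bytes\n", len(ts), len(enc)) // fewer than 24 raw bytes
    }
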
  73. float64: double delta Facebook’s Gorilla - google: gorilla time series

    facebook https://github.com/dgryski/go-tsz
  74. booleans are bits!

  75. int64 uses double delta

  76. string uses Snappy, the same compression LevelDB uses

  77. How does it perform?

  78. Compression depends greatly on the shape of your data

  79. Write throughput depends on batching, CPU, and memory

  80. test last night: 100,000 series 100,000 points per series 10,000,000,000

    total points 5,000 points per request c3.8xlarge, writes from 4 other systems ~390,000 points/sec ~3 bytes/point (random floats, could be better)
  81. ~400 IOPS 30%-50% CPU There’s room for improvement!

  82. Detailed writeup next week! http://influxdb.com/blog.html

  83. Thank you! Paul Dix @pauldix paul@influxdb.com