InfluxDB’s new storage engine:
The Time Structured Merge Tree
Paul Dix
CEO at InfluxDB
@pauldix
paul@influxdb.com
Slide 2
preliminary intro materials…
Slide 3
Everything is indexed by time
and series
Slide 4
Shards
10/10/2015 | 10/11/2015 | 10/12/2015 | 10/13/2015
Data organized into Shards of time, each is an underlying DB
efficient to drop old data
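To make the shard idea concrete, here is a minimal sketch (not InfluxDB's actual code) of routing a point to a shard by its timestamp, assuming each shard owns a half-open time window:

package example

import "time"

// Shard covers a contiguous, half-open window of time and wraps its own
// underlying database, so dropping old data is just deleting whole shards.
type Shard struct {
    Start, End time.Time // [Start, End)
}

// shardFor returns the shard whose window contains t, or nil if none does.
func shardFor(shards []*Shard, t time.Time) *Shard {
    for _, s := range shards {
        if !t.Before(s.Start) && t.Before(s.End) {
            return s
        }
    }
    return nil
}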
Slide 5
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Slide 6
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement
Slide 7
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags
Slide 8
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags Fields
Slide 9
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags Fields Timestamp
Slide 10
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags Fields Timestamp
We actually store timestamps up to nanosecond precision,
but I couldn’t fit that on the slide
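For illustration only (this is not the official client library), a point like the one above can be assembled in line protocol with a hypothetical helper such as:

package example

import (
    "fmt"
    "sort"
    "strings"
)

// buildLine formats one point as: measurement,tags fields timestamp.
// Escaping is omitted; tag and field keys are emitted in sorted order.
func buildLine(measurement string, tags, fields map[string]string, ts int64) string {
    var b strings.Builder
    b.WriteString(measurement)
    for _, k := range sortedKeys(tags) {
        fmt.Fprintf(&b, ",%s=%s", k, tags[k])
    }
    b.WriteByte(' ')
    for i, k := range sortedKeys(fields) {
        if i > 0 {
            b.WriteByte(',')
        }
        fmt.Fprintf(&b, "%s=%s", k, fields[k])
    }
    fmt.Fprintf(&b, " %d", ts)
    return b.String()
}

func sortedKeys(m map[string]string) []string {
    keys := make([]string, 0, len(m))
    for k := range m {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    return keys
}

Calling buildLine("temperature", map[string]string{"device": "dev1", "building": "b1"}, map[string]string{"internal": "80", "external": "18"}, 1443782126) produces a line equivalent to the slide's example, with keys in sorted order.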
Slide 11
Each series and field maps to a unique ID
temperature,device=dev1,building=b1#internal → 1
temperature,device=dev1,building=b1#external → 2
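A minimal sketch of that mapping, assuming an in-memory map (the real engine also has to persist the assignments):

package example

// idAssigner hands out a stable uint64 ID per "series key + field" string,
// e.g. "temperature,device=dev1,building=b1#internal".
type idAssigner struct {
    ids  map[string]uint64
    next uint64
}

func newIDAssigner() *idAssigner {
    return &idAssigner{ids: make(map[string]uint64), next: 1}
}

// idFor returns the existing ID for the key, or assigns the next free one.
func (a *idAssigner) idFor(key string) uint64 {
    if id, ok := a.ids[key]; ok {
        return id
    }
    id := a.next
    a.next++
    a.ids[key] = id
    return id
}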
Slide 12
Data per ID is tuples ordered by time
temperature,device=dev1,building=b1#internal → 1
temperature,device=dev1,building=b1#external → 2
1 → (1443782126, 80)
2 → (1443782126, 18)
Slide 13
Arranging in Key/Value Stores
Key (ID, Time)    Value
1,1443782126      80
Slide 14
Arranging in Key/Value Stores
Key (ID, Time)    Value
1,1443782126      80
2,1443782126      18
Slide 15
Arranging in Key/Value Stores
Key (ID, Time)    Value
1,1443782126      80
2,1443782126      18
1,1443782127      81    (new data)
Slide 16
Arranging in Key/Value Stores
Key (ID, Time)    Value
1,1443782126      80
2,1443782126      18
1,1443782127      81
the key space is ordered
Slide 17
Arranging in Key/Value Stores
Key (ID, Time)    Value
1,1443782126      80
2,1443782126      18
1,1443782127      81
2,1443782256      15
2,1443782130      17
3,1443700126      18
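One way to get that ordering out of a byte-ordered key/value store, shown here as a sketch rather than any particular engine's layout, is to encode the ID and timestamp big-endian so keys sort by ID first and then by time:

package example

import "encoding/binary"

// key encodes (id, unixTime) so that lexicographic byte order matches
// (ID, time) order: big-endian ID followed by big-endian timestamp.
// Assumes non-negative timestamps.
func key(id uint64, unixTime int64) []byte {
    buf := make([]byte, 16)
    binary.BigEndian.PutUint64(buf[0:8], id)
    binary.BigEndian.PutUint64(buf[8:16], uint64(unixTime))
    return buf
}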
Slide 18
Many existing storage engines
have this model
Slide 19
New Storage Engine?!
Slide 20
First we used LSM Trees
Slide 21
deletes expensive
Slide 22
too many open file handles
Slide 23
Then mmap COW B+Trees
Slide 24
write throughput
Slide 25
compression
Slide 26
Neither met our requirements
Slide 27
High write throughput
Slide 28
Awesome read performance
Slide 29
Better Compression
Slide 30
Writes can’t block reads
Slide 31
Reads can’t block writes
Slide 32
Write multiple ranges
simultaneously
Slide 33
Hot backups
Slide 34
Many databases open in a
single process
Slide 35
Enter InfluxDB’s
Time Structured Merge Tree
(TSM Tree)
Slide 36
Enter InfluxDB’s
Time Structured Merge Tree
(TSM Tree)
like LSM, but different
Slide 37
Components
WAL | in-memory cache | index files
Slide 38
Components
WAL | in-memory cache | index files
Similar to LSM Trees
Slide 39
Components
WAL | in-memory cache | index files
Similar to LSM Trees: the WAL is the same
Slide 40
Components
WAL | in-memory cache | index files
Similar to LSM Trees: the WAL is the same, the cache is like MemTables
Slide 41
Components
WAL | in-memory cache | index files
Similar to LSM Trees: the WAL is the same, the cache is like MemTables, the index files are like SSTables
Slide 42
awesome time series data → WAL (an append only file)
Slide 43
awesome time series data → WAL (an append only file) → in memory index
Slide 44
In Memory Cache
// cache and flush variables
cacheLock sync.RWMutex
cache map[string]Values
flushCache map[string]Values
map keys are the series key + field, e.g. temperature,device=dev1,building=b1#internal
Slide 45
In Memory Cache
// cache and flush variables
cacheLock sync.RWMutex
cache map[string]Values
flushCache map[string]Values
writes can come in while WAL flushes
Slide 46
// cache and flush variables
cacheLock sync.RWMutex
cache map[string]Values
flushCache map[string]Values
dirtySort map[string]bool
values can come in out of order: mark the key if so, and sort at query time
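A minimal sketch of that write path, under the assumptions above and not the engine's actual code, appending to a per-key slice and flagging keys whose points arrive out of order:

package example

import "sync"

// point is a simplified (time, value) pair; the real cache stores Values
// (see the next slide).
type point struct {
    unixNano int64
    value    float64
}

// cache buffers points per "series key + field" until they are flushed.
type cache struct {
    mu        sync.RWMutex
    points    map[string][]point
    dirtySort map[string]bool // keys whose points arrived out of order
}

func newCache() *cache {
    return &cache{
        points:    make(map[string][]point),
        dirtySort: make(map[string]bool),
    }
}

// write appends p for key and marks the key dirty if p is older than the
// newest cached point; sorting is deferred until query time.
func (c *cache) write(key string, p point) {
    c.mu.Lock()
    defer c.mu.Unlock()
    existing := c.points[key]
    if n := len(existing); n > 0 && p.unixNano < existing[n-1].unixNano {
        c.dirtySort[key] = true
    }
    c.points[key] = append(existing, p)
}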
Slide 47
Values in Memory
type Value interface {
    Time() time.Time
    UnixNano() int64
    Value() interface{}
    Size() int
}
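For illustration, a float64 field sample could satisfy that interface like this (a sketch; the engine defines its own concrete types per field type):

package example

import "time"

// Value is the interface from the slide above.
type Value interface {
    Time() time.Time
    UnixNano() int64
    Value() interface{}
    Size() int
}

// floatValue is a hypothetical concrete type for one float64 sample.
type floatValue struct {
    unixNano int64
    val      float64
}

func (f floatValue) Time() time.Time    { return time.Unix(0, f.unixNano) }
func (f floatValue) UnixNano() int64    { return f.unixNano }
func (f floatValue) Value() interface{} { return f.val }
func (f floatValue) Size() int          { return 16 } // 8-byte timestamp + 8-byte float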
Slide 48
awesome time series data → WAL (an append only file) → in memory index → on disk index (periodic flushes)
Slide 49
The Index
Data File: Min Time: 10000, Max Time: 29999
Data File: Min Time: 30000, Max Time: 39999
Data File: Min Time: 70000, Max Time: 99999
Contiguous blocks of time
Slide 50
The Index
Data File: Min Time: 10000, Max Time: 29999
Data File: Min Time: 15000, Max Time: 39999
Data File: Min Time: 70000, Max Time: 99999
can overlap
Slide 51
The Index
cpu,host=A: Min Time: 10000, Max Time: 20000
cpu,host=A: Min Time: 21000, Max Time: 39999
Data File: Min Time: 70000, Max Time: 99999
but a specific series must not overlap
Slide 52
The Index
Data File | Data File | Data File | Data File | Data File (time ascending)
a file will never overlap with more than 2 others
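A sketch of how a query could use those invariants to decide which data files to read for a time range, assuming each file exposes its min and max time as above:

package example

// dataFileMeta is a hypothetical view of one data file's time bounds.
type dataFileMeta struct {
    minTime, maxTime int64
}

// filesForRange returns the files whose [minTime, maxTime] intersects the
// query range [qmin, qmax]; all other files can be skipped entirely.
func filesForRange(files []dataFileMeta, qmin, qmax int64) []dataFileMeta {
    var out []dataFileMeta
    for _, f := range files {
        if f.maxTime >= qmin && f.minTime <= qmax {
            out = append(out, f)
        }
    }
    return out
}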
Slide 53
Data files are read only, like LSM
SSTables
Slide 54
The Index
Data File: Min Time: 10000, Max Time: 29999
Data File: Min Time: 30000, Max Time: 39999
Data File: Min Time: 70000, Max Time: 99999
they periodically get compacted (like LSM) into:
Data File: Min Time: 10000, Max Time: 99999
Slide 55
Compacting while appending new data
Slide 56
Compacting while appending new data
func (w *WriteLock) LockRange(min, max int64) {
    // sweet code here
}

func (w *WriteLock) UnlockRange(min, max int64) {
    // sweet code here
}
Slide 57
Compacting while appending new data
func (w *WriteLock) LockRange(min, max int64) {
    // sweet code here
}

func (w *WriteLock) UnlockRange(min, max int64) {
    // sweet code here
}
This should block until we get it
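The slides elide the implementation, so here is a minimal sketch of one way to build a blocking range lock with a mutex and condition variable; it is not InfluxDB's actual WriteLock:

package example

import "sync"

// rng is a held [min, max] interval.
type rng struct{ min, max int64 }

// WriteLock serializes work on overlapping time ranges: LockRange blocks
// until no currently held range overlaps the requested one.
type WriteLock struct {
    mu     sync.Mutex
    cond   *sync.Cond
    locked []rng
}

func NewWriteLock() *WriteLock {
    w := &WriteLock{}
    w.cond = sync.NewCond(&w.mu)
    return w
}

// LockRange blocks until [min, max] overlaps no held range, then records it.
func (w *WriteLock) LockRange(min, max int64) {
    want := rng{min, max}
    w.mu.Lock()
    defer w.mu.Unlock()
    for anyOverlap(w.locked, want) {
        w.cond.Wait()
    }
    w.locked = append(w.locked, want)
}

// UnlockRange releases a previously locked range and wakes any waiters.
func (w *WriteLock) UnlockRange(min, max int64) {
    w.mu.Lock()
    defer w.mu.Unlock()
    for i, r := range w.locked {
        if r.min == min && r.max == max {
            w.locked = append(w.locked[:i], w.locked[i+1:]...)
            break
        }
    }
    w.cond.Broadcast()
}

func anyOverlap(held []rng, want rng) bool {
    for _, r := range held {
        if r.min <= want.max && want.min <= r.max {
            return true
        }
    }
    return false
}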
Slide 58
Locking happens inside each
Shard
Slide 59
Back to the data files…
Data File: Min Time: 10000, Max Time: 29999
Data File: Min Time: 30000, Max Time: 39999
Data File: Min Time: 70000, Max Time: 99999
Slide 60
Data File Layout
Slide 61
Data File Layout
Similar to SSTables
Slide 62
Data File Layout
Slide 63
Data File Layout
blocks have up to 1,000 points by default
Slide 64
Data File Layout
Slide 65
Data File Layout
4 byte position means data files can be at most 4GB
Slide 66
Data Files
type dataFile struct {
    f    *os.File
    size uint32
    mmap []byte
}
Binary Search for ID
func (d *dataFile) StartingPositionForID(id uint64) uint32 {
    // binary search the (ID, position) index entries, which are sorted by ID
    seriesCount := d.SeriesCount()
    indexStart := d.indexPosition()
    min := uint32(0)
    max := uint32(seriesCount)
    for min < max {
        mid := (max-min)/2 + min
        offset := mid*seriesHeaderSize + indexStart
        checkID := btou64(d.mmap[offset : offset+timeSize])
        if checkID == id {
            // found: return the file position where this series' data starts
            return btou32(d.mmap[offset+timeSize : offset+timeSize+posSize])
        } else if checkID < id {
            min = mid + 1
        } else {
            max = mid
        }
    }
    // the ID isn't in this file
    return uint32(0)
}
The Index: IDs are sorted
Slide 70
Compressed Data Blocks
Slide 71
Timestamps: encoding based
on precision and deltas
Slide 72
Timestamps (best case):
Run length encoding
Deltas are all the same for a block
Slide 73
Timestamps (good case):
Simple8B
Anh and Moffat in "Index compression using 64-bit words"
Slide 74
Timestamps (worst case):
raw values
nanosecond timestamps with large deltas
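Pulling the three timestamp slides together, the block encoder's choice can be sketched roughly like this (ignoring the precision scaling mentioned earlier and assuming simple8b's usual 60-bit-per-value limit; the actual encoders are elided):

package example

// chooseTimestampEncoding sketches the decision: delta the timestamps, use
// run-length encoding when every delta is identical, simple8b when the
// deltas are small enough to pack, and raw values otherwise.
func chooseTimestampEncoding(ts []int64) string {
    if len(ts) < 2 {
        return "raw"
    }
    first := uint64(ts[1] - ts[0]) // timestamps within a block are sorted
    allSame, maxDelta := true, uint64(0)
    for i := 1; i < len(ts); i++ {
        d := uint64(ts[i] - ts[i-1])
        if d != first {
            allSame = false
        }
        if d > maxDelta {
            maxDelta = d
        }
    }
    switch {
    case allSame:
        return "run-length encoding" // best case: first timestamp + delta + count
    case maxDelta < 1<<60: // assumed simple8b packing limit
        return "simple8b" // good case
    default:
        return "raw" // worst case: e.g. nanosecond timestamps with large deltas
    }
}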
Slide 75
float64: double delta
Facebook’s Gorilla - google: gorilla time series facebook
https://github.com/dgryski/go-tsz
Slide 76
booleans are bits!
Slide 77
int64 uses zig-zag
same as from Protobufs
(also looking at adding double delta and RLE)
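Zig-zag is the same transform Protocol Buffers use for signed varints; a tiny sketch of the encode/decode pair:

package example

// zigZagEncode maps int64 to uint64 so values near zero, positive or
// negative, become small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
func zigZagEncode(x int64) uint64 {
    return uint64(x<<1) ^ uint64(x>>63)
}

// zigZagDecode reverses the mapping.
func zigZagDecode(u uint64) int64 {
    return int64(u>>1) ^ -int64(u&1)
}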
Compression depends greatly
on the shape of your data
Slide 81
Write throughput depends on
batching, CPU, and memory
Slide 82
test last night:
100,000 series
100,000 points per series
10,000,000,000 total points
5,000 points per request
c3.8xlarge, writes from 4 other systems
~390,000 points/sec
~3 bytes/point (random floats, could be better)
Slide 83
~400 IOPS
30%-50% CPU
There’s room for improvement!