The new InfluxDB storage engine
and some query language ideas
Paul Dix
CEO at InfluxDB
@pauldix
paul@influxdb.com
Slide 2
Slide 2 text
preliminary intro materials…
Slide 3
Slide 3 text
Everything is indexed by time
and series
Slide 4
Slide 4 text
Shards
10/10/2015 10/11/2015 10/12/2015 10/13/2015
Data organized into shards of time; each is an underlying DB
efficient to drop old data
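A minimal sketch of the idea, assuming one shard per UTC day as the dates on the slide suggest. The names `shardStart` and `dropBefore` are hypothetical, and each map entry simply stands in for an underlying database.

```go
package main

import "fmt"

// shardSeconds is an assumption: the slide shows one shard per calendar day (UTC).
const shardSeconds int64 = 24 * 60 * 60

// shardStart returns the Unix time (seconds) at which the shard owning ts begins.
// It assumes ts is non-negative.
func shardStart(ts int64) int64 {
	return ts - ts%shardSeconds
}

// dropBefore deletes every shard that starts before cutoff and reports how many
// were removed. Dropping old data never touches individual points.
func dropBefore(shards map[int64][]string, cutoff int64) int {
	dropped := 0
	for start := range shards {
		if start < cutoff {
			delete(shards, start)
			dropped++
		}
	}
	return dropped
}

func main() {
	shards := map[int64][]string{}
	for _, ts := range []int64{1443782126, 1443782126 + shardSeconds, 1443782126 + 2*shardSeconds} {
		start := shardStart(ts)
		shards[start] = append(shards[start], "point")
	}
	cutoff := shardStart(1443782126 + shardSeconds)
	fmt.Println("dropped:", dropBefore(shards, cutoff), "remaining:", len(shards))
}
```

Because retention works on whole shards, expiring a day of data is a single delete rather than a scan.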
Slide 5
Slide 5 text
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Slide 6
Slide 6 text
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement
Slide 7
Slide 7 text
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags
Slide 8
Slide 8 text
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags Fields
Slide 9
Slide 9 text
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags Fields Timestamp
Slide 10
Slide 10 text
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags Fields Timestamp
We actually store timestamps at up to nanosecond precision,
but that wouldn't fit on the slide
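To make the four parts concrete, here is a rough sketch of splitting the example line into measurement, tags, fields, and timestamp. `Point`, `splitPairs`, and `parsePoint` are hypothetical names, all fields are assumed to be floats, and escaping and quoting are ignored, so this is not InfluxDB's actual parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Point is a hypothetical in-memory form of one line.
type Point struct {
	Measurement string
	Tags        map[string]string
	Fields      map[string]float64
	Timestamp   int64
}

// splitPairs turns "k1=v1,k2=v2" into a map.
func splitPairs(s string) (map[string]string, error) {
	out := map[string]string{}
	for _, kv := range strings.Split(s, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 || parts[0] == "" {
			return nil, fmt.Errorf("malformed pair %q", kv)
		}
		out[parts[0]] = parts[1]
	}
	return out, nil
}

// parsePoint expects "measurement[,tags] fields timestamp".
func parsePoint(line string) (Point, error) {
	sections := strings.Fields(line)
	if len(sections) != 3 {
		return Point{}, fmt.Errorf("expected 3 sections, got %d", len(sections))
	}
	p := Point{Tags: map[string]string{}, Fields: map[string]float64{}}
	key := strings.SplitN(sections[0], ",", 2)
	p.Measurement = key[0]
	if len(key) == 2 {
		tags, err := splitPairs(key[1])
		if err != nil {
			return Point{}, err
		}
		p.Tags = tags
	}
	fields, err := splitPairs(sections[1])
	if err != nil {
		return Point{}, err
	}
	for k, v := range fields {
		f, err := strconv.ParseFloat(v, 64)
		if err != nil {
			return Point{}, fmt.Errorf("field %q: %v", k, err)
		}
		p.Fields[k] = f
	}
	ts, err := strconv.ParseInt(sections[2], 10, 64)
	if err != nil {
		return Point{}, err
	}
	p.Timestamp = ts
	return p, nil
}

func main() {
	p, err := parsePoint("temperature,device=dev1,building=b1 internal=80,external=18 1443782126")
	fmt.Println(p, err)
}
```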
Slide 11
Slide 11 text
Each series and field maps to a unique ID
temperature,device=dev1,building=b1#internal -> 1
temperature,device=dev1,building=b1#external -> 2
Slide 12
Slide 12 text
Data per ID is tuples ordered by time
temperature,device=dev1,building=b1#internal -> 1
temperature,device=dev1,building=b1#external -> 2
1 (1443782126,80)
2 (1443782126,18)
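The mapping above can be sketched as a small in-memory store. The `store` type, sequential ID assignment, and float-only values are assumptions for illustration; the slides do not say how IDs are actually generated.

```go
package main

import (
	"fmt"
	"sort"
)

// tuple is one (time, value) pair for a single series+field ID.
type tuple struct {
	Time  int64
	Value float64
}

// store is a hypothetical container: key -> ID, and ID -> time-ordered tuples.
type store struct {
	ids  map[string]uint64
	data map[uint64][]tuple
}

func newStore() *store {
	return &store{ids: map[string]uint64{}, data: map[uint64][]tuple{}}
}

// idFor returns the ID for "seriesKey#field", assigning the next one if new.
func (s *store) idFor(seriesKey, field string) uint64 {
	key := seriesKey + "#" + field
	if id, ok := s.ids[key]; ok {
		return id
	}
	id := uint64(len(s.ids) + 1)
	s.ids[key] = id
	return id
}

// add inserts a tuple at the position that keeps the slice ordered by time.
func (s *store) add(seriesKey, field string, t int64, v float64) {
	id := s.idFor(seriesKey, field)
	pts := s.data[id]
	i := sort.Search(len(pts), func(i int) bool { return pts[i].Time > t })
	pts = append(pts, tuple{})
	copy(pts[i+1:], pts[i:])
	pts[i] = tuple{Time: t, Value: v}
	s.data[id] = pts
}

func main() {
	s := newStore()
	series := "temperature,device=dev1,building=b1"
	s.add(series, "internal", 1443782126, 80)
	s.add(series, "external", 1443782126, 18)
	fmt.Println(s.ids, s.data)
}
```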
Slide 13
Slide 13 text
Storage Requirements
Slide 14
Slide 14 text
High write throughput
to hundreds of thousands of series
Slide 15
Slide 15 text
Awesome read performance
Slide 16
Slide 16 text
Better Compression
Slide 17
Slide 17 text
Writes can’t block reads
Slide 18
Slide 18 text
Reads can’t block writes
Slide 19
Slide 19 text
Write multiple ranges
simultaneously
Slide 20
Slide 20 text
Hot backups
Slide 21
Slide 21 text
Many databases open in a
single process
Slide 22
Slide 22 text
InfluxDB’s
Time Structured Merge Tree
(TSM Tree)
Slide 23
Slide 23 text
InfluxDB’s
Time Structured Merge Tree
(TSM Tree)
like LSM, but different
Slide 24
Slide 24 text
Components
WAL
In memory cache
Index Files
Slide 25
Slide 25 text
Components
Similar to LSM Trees
WAL
In memory cache
Index Files
Slide 26
Slide 26 text
Components
Similar to LSM Trees
WAL: same
In memory cache
Index Files
Slide 27
Slide 27 text
Components
Similar to LSM Trees
WAL: same
In memory cache: like MemTables
Index Files
Slide 28
Slide 28 text
Components
Similar to LSM Trees
WAL: same
In memory cache: like MemTables
Index Files: like SSTables
Slide 29
Slide 29 text
awesome time series data
WAL (an append only file)
Slide 30
Slide 30 text
awesome time series data
WAL (an append only file)
in memory index
Slide 31
Slide 31 text
In Memory Cache
// cache and flush variables
cacheLock sync.RWMutex
cache map[string]Values
flushCache map[string]Values
temperature,device=dev1,building=b1#internal
Slide 32
Slide 32 text
In Memory Cache
// cache and flush variables
cacheLock sync.RWMutex
cache map[string]Values
flushCache map[string]Values
writes can come in while WAL flushes
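One way the two maps could cooperate, sketched under assumptions: `beginFlush` swaps the live cache into `flushCache` so new writes land in a fresh map while the snapshot is written out, and reads merge both. The method names and the float-only values are hypothetical; only the field names come from the slide.

```go
package main

import (
	"fmt"
	"sync"
)

type memCache struct {
	cacheLock  sync.RWMutex
	cache      map[string][]float64
	flushCache map[string][]float64
}

func newMemCache() *memCache {
	return &memCache{cache: map[string][]float64{}}
}

// write appends to the live cache; it only holds the lock briefly.
func (m *memCache) write(key string, v float64) {
	m.cacheLock.Lock()
	defer m.cacheLock.Unlock()
	m.cache[key] = append(m.cache[key], v)
}

// beginFlush moves the live cache aside and returns the snapshot to persist.
func (m *memCache) beginFlush() map[string][]float64 {
	m.cacheLock.Lock()
	defer m.cacheLock.Unlock()
	m.flushCache = m.cache
	m.cache = map[string][]float64{}
	return m.flushCache
}

// endFlush discards the snapshot once it is safely on disk.
func (m *memCache) endFlush() {
	m.cacheLock.Lock()
	defer m.cacheLock.Unlock()
	m.flushCache = nil
}

// read returns values still being flushed followed by newer ones.
func (m *memCache) read(key string) []float64 {
	m.cacheLock.RLock()
	defer m.cacheLock.RUnlock()
	out := append([]float64{}, m.flushCache[key]...)
	return append(out, m.cache[key]...)
}

func main() {
	m := newMemCache()
	m.write("temperature,device=dev1,building=b1#internal", 80)
	snapshot := m.beginFlush()
	m.write("temperature,device=dev1,building=b1#internal", 81) // arrives mid-flush
	fmt.Println(len(snapshot), m.read("temperature,device=dev1,building=b1#internal"))
	m.endFlush()
}
```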
Slide 33
Slide 33 text
// cache and flush variables
cacheLock sync.RWMutex
cache map[string]Values
flushCache map[string]Values
dirtySort map[string]bool
values can come in out of order.
mark if so, sort at query time
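A sketch of the "mark, then sort at query time" idea. The `point` type and the `write`/`query` methods are hypothetical; the assumption is that a write is flagged as dirty when its timestamp is older than the last value already cached for that key.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type point struct {
	unixNano int64
	value    float64
}

type cache struct {
	cacheLock sync.RWMutex
	cache     map[string][]point
	dirtySort map[string]bool
}

func newCache() *cache {
	return &cache{cache: map[string][]point{}, dirtySort: map[string]bool{}}
}

// write appends without sorting; out-of-order arrivals only set a flag.
func (c *cache) write(key string, p point) {
	c.cacheLock.Lock()
	defer c.cacheLock.Unlock()
	vals := c.cache[key]
	if n := len(vals); n > 0 && p.unixNano < vals[n-1].unixNano {
		c.dirtySort[key] = true
	}
	c.cache[key] = append(vals, p)
}

// query sorts lazily, only for keys that were marked dirty, and returns a copy.
// It takes the write lock because it may reorder the cached slice.
func (c *cache) query(key string) []point {
	c.cacheLock.Lock()
	defer c.cacheLock.Unlock()
	if c.dirtySort[key] {
		vals := c.cache[key]
		sort.SliceStable(vals, func(i, j int) bool { return vals[i].unixNano < vals[j].unixNano })
		delete(c.dirtySort, key)
	}
	return append([]point(nil), c.cache[key]...)
}

func main() {
	c := newCache()
	c.write("cpu,host=A#value", point{30, 0.3})
	c.write("cpu,host=A#value", point{10, 0.1}) // out of order
	fmt.Println(c.dirtySort["cpu,host=A#value"], c.query("cpu,host=A#value"))
}
```

This keeps the write path cheap: in-order workloads never pay for a sort, and out-of-order ones pay once per query rather than once per write.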
Slide 34
Slide 34 text
Values in Memory
type Value interface {
Time() time.Time
UnixNano() int64
Value() interface{}
Size() int
}
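For illustration, a float-valued implementation of that interface might look like the following. `floatValue`, `newFloatValue`, and the 16-byte size (8 bytes of timestamp plus 8 bytes of float64) are assumptions, not the engine's actual types.

```go
package main

import (
	"fmt"
	"time"
)

// Value is the interface shown on the slide.
type Value interface {
	Time() time.Time
	UnixNano() int64
	Value() interface{}
	Size() int
}

// floatValue is a hypothetical concrete type for float64 fields.
type floatValue struct {
	ts time.Time
	v  float64
}

func (f floatValue) Time() time.Time    { return f.ts }
func (f floatValue) UnixNano() int64    { return f.ts.UnixNano() }
func (f floatValue) Value() interface{} { return f.v }

// Size assumes 8 bytes for the timestamp and 8 for the float64.
func (f floatValue) Size() int { return 16 }

// newFloatValue builds a Value from a Unix timestamp in seconds.
func newFloatValue(unixSec int64, v float64) Value {
	return floatValue{ts: time.Unix(unixSec, 0), v: v}
}

func main() {
	v := newFloatValue(1443782126, 80)
	fmt.Println(v.UnixNano(), v.Value(), v.Size())
}
```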
Slide 35
Slide 35 text
awesome time series data
WAL (an append only file)
in memory index
on disk index
(periodic flushes)
Slide 36
Slide 36 text
The Index
Data File
Min Time: 10000
Max Time: 29999
Data File
Min Time: 30000
Max Time: 39999
Data File
Min Time: 70000
Max Time: 99999
Contiguous blocks of time
Slide 37
Slide 37 text
The Index
Data File
Min Time: 10000
Max Time: 29999
Data File
Min Time: 15000
Max Time: 39999
Data File
Min Time: 70000
Max Time: 99999
can overlap
Slide 38
Slide 38 text
The Index
cpu,host=A
Min Time: 10000
Max Time: 20000
cpu,host=A
Min Time: 21000
Max Time: 39999
Data File
Min Time: 70000
Max Time: 99999
but a specific series must not overlap
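Since each file carries a min and max time, a query can skip files whose range does not intersect it. A minimal sketch, assuming inclusive bounds; `fileRange` and `overlapping` are hypothetical names.

```go
package main

import "fmt"

// fileRange is the time range covered by one data file (inclusive bounds).
type fileRange struct {
	Min, Max int64
}

// overlapping returns the files whose range intersects [min, max].
func overlapping(files []fileRange, min, max int64) []fileRange {
	var out []fileRange
	for _, f := range files {
		if f.Min <= max && f.Max >= min {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	// Ranges taken from the "can overlap" slide.
	files := []fileRange{{10000, 29999}, {15000, 39999}, {70000, 99999}}
	fmt.Println(overlapping(files, 20000, 35000))
}
```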
Slide 39
Slide 39 text
The Index
Data File
Data File
Data File
a file will never overlap with
more than 2 others
time ascending
Data File
Data File
Slide 40
Slide 40 text
Data files are read only, like LSM
SSTables
Slide 41
Slide 41 text
The Index
Data File
Min Time: 10000
Max Time: 29999
Data File
Min Time: 30000
Max Time: 39999
Data File
Min Time: 70000
Max Time: 99999
Data File
Min Time: 10000
Max Time: 99999
they periodically
get compacted
(like LSM)
Slide 42
Slide 42 text
Compacting while appending new data
Slide 43
Slide 43 text
Compacting while appending new data
func (w *WriteLock) LockRange(min, max int64) {
// sweet code here
}
func (w *WriteLock) UnlockRange(min, max int64) {
// sweet code here
}
Slide 44
Slide 44 text
Compacting while appending new data
func (w *WriteLock) LockRange(min, max int64) {
// sweet code here
}
func (w *WriteLock) UnlockRange(min, max int64) {
// sweet code here
}
LockRange should block until the range lock is acquired
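The slides leave the bodies out, so the following is only one possible way to get that blocking behavior: keep a list of held ranges and wait on a `sync.Cond` while the requested range overlaps any of them. The internals (`held`, `overlaps`, `NewWriteLock`) are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

type timeRange struct{ min, max int64 }

// WriteLock lets writers to disjoint time ranges proceed at the same time.
type WriteLock struct {
	mu   sync.Mutex
	cond *sync.Cond
	held []timeRange
}

func NewWriteLock() *WriteLock {
	w := &WriteLock{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// overlaps reports whether [min, max] intersects a held range. Caller holds mu.
func (w *WriteLock) overlaps(min, max int64) bool {
	for _, r := range w.held {
		if r.min <= max && r.max >= min {
			return true
		}
	}
	return false
}

// LockRange blocks until no held range overlaps [min, max], then records it.
func (w *WriteLock) LockRange(min, max int64) {
	w.mu.Lock()
	for w.overlaps(min, max) {
		w.cond.Wait()
	}
	w.held = append(w.held, timeRange{min, max})
	w.mu.Unlock()
}

// UnlockRange releases a range previously locked with the same bounds
// and wakes any waiters so they can re-check for overlap.
func (w *WriteLock) UnlockRange(min, max int64) {
	w.mu.Lock()
	for i, r := range w.held {
		if r.min == min && r.max == max {
			w.held = append(w.held[:i], w.held[i+1:]...)
			break
		}
	}
	w.mu.Unlock()
	w.cond.Broadcast()
}

func main() {
	w := NewWriteLock()
	w.LockRange(10000, 29999) // e.g. a compaction
	w.LockRange(70000, 99999) // a new write elsewhere does not wait
	w.UnlockRange(10000, 29999)
	w.UnlockRange(70000, 99999)
	fmt.Println("held ranges:", len(w.held))
}
```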
Slide 45
Slide 45 text
Locking happens inside each
Shard
Slide 46
Slide 46 text
Back to the data files…
Data File
Min Time: 10000
Max Time: 29999
Data File
Min Time: 30000
Max Time: 39999
Data File
Min Time: 70000
Max Time: 99999
Slide 47
Slide 47 text
Data File Layout
Slide 48
Slide 48 text
Data File Layout
Similar to SSTables
Slide 49
Slide 49 text
Data File Layout
Slide 50
Slide 50 text
Data File Layout
blocks have up to 1,000 points by default
Slide 51
Slide 51 text
Data File Layout
Slide 52
Slide 52 text
Data File Layout
4 byte position means data files can be at most 4GB
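The 4GB limit follows from the width of the offset: a 4-byte unsigned position can address 2^32 distinct byte offsets. A small sketch, assuming big-endian encoding (the slides do not state the byte order) and hypothetical helper names:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// maxAddressable is the number of byte offsets a 4-byte position can express.
const maxAddressable uint64 = 1 << 32 // 4 GiB

// putPosition encodes a block position as 4 bytes (big-endian is an assumption).
func putPosition(pos uint32) []byte {
	buf := make([]byte, 4)
	binary.BigEndian.PutUint32(buf, pos)
	return buf
}

// readPosition decodes a position written by putPosition.
func readPosition(buf []byte) uint32 {
	return binary.BigEndian.Uint32(buf)
}

func main() {
	last := uint32(maxAddressable - 1)
	fmt.Println(maxAddressable, readPosition(putPosition(last)))
}
```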
Slide 53
Slide 53 text
Data Files
type dataFile struct {
f *os.File
size uint32
mmap []byte
}
Slide 54
Slide 54 text
Memory mapping lets the OS
handle caching for you
Slide 55
Slide 55 text
Compressed Data Blocks
Slide 56
Slide 56 text
Timestamps: encoding based
on precision and deltas
Slide 57
Slide 57 text
Timestamps (best case):
Run length encoding
Deltas are all the same for a block
(only requires start time, delta, and count)
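A sketch of that best case: if every delta in a block is identical, the whole block collapses to three numbers. `rleBlock`, `encodeRLE`, and `decode` are hypothetical names, and the on-disk layout is not shown.

```go
package main

import "fmt"

// rleBlock is everything needed to rebuild a regularly spaced block.
type rleBlock struct {
	Start int64
	Delta int64
	Count int
}

// encodeRLE succeeds only when all consecutive deltas are equal.
func encodeRLE(ts []int64) (rleBlock, bool) {
	if len(ts) == 0 {
		return rleBlock{}, false
	}
	if len(ts) == 1 {
		return rleBlock{Start: ts[0], Count: 1}, true
	}
	delta := ts[1] - ts[0]
	for i := 2; i < len(ts); i++ {
		if ts[i]-ts[i-1] != delta {
			return rleBlock{}, false // fall back to another encoding
		}
	}
	return rleBlock{Start: ts[0], Delta: delta, Count: len(ts)}, true
}

// decode regenerates the original timestamps.
func (b rleBlock) decode() []int64 {
	out := make([]int64, b.Count)
	for i := range out {
		out[i] = b.Start + int64(i)*b.Delta
	}
	return out
}

func main() {
	block, ok := encodeRLE([]int64{1443782120, 1443782130, 1443782140, 1443782150})
	fmt.Println(block, ok, block.decode())
}
```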
Slide 58
Slide 58 text
Timestamps (good case):
Simple8B
Anh and Moffat in "Index compression using 64-bit words"
Slide 59
Slide 59 text
Timestamps (worst case):
raw values
nanosecond timestamps with large deltas
Slide 60
Slide 60 text
float64: double delta
Facebook’s Gorilla - google: gorilla time series facebook
https://github.com/dgryski/go-tsz
Slide 61
Slide 61 text
booleans are bits!
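A minimal sketch of packing booleans one per bit, eight to a byte. The bit order (least significant bit first) and the function names are assumptions.

```go
package main

import "fmt"

// packBools stores each boolean as a single bit, least significant bit first.
func packBools(vals []bool) []byte {
	out := make([]byte, (len(vals)+7)/8)
	for i, v := range vals {
		if v {
			out[i/8] |= 1 << uint(i%8)
		}
	}
	return out
}

// unpackBools reads n booleans back; the count must be stored separately.
func unpackBools(packed []byte, n int) []bool {
	out := make([]bool, n)
	for i := range out {
		out[i] = packed[i/8]&(1<<uint(i%8)) != 0
	}
	return out
}

func main() {
	vals := []bool{true, false, true, true, false, false, false, false, true}
	packed := packBools(vals)
	fmt.Println(len(vals), "values in", len(packed), "bytes:", unpackBools(packed, len(vals)))
}
```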
Slide 62
Slide 62 text
int64 uses zig-zag encoding
same encoding as Protobufs
(adding double delta and RLE)
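Zig-zag encoding maps signed integers to unsigned ones so that values near zero, positive or negative, become small numbers that compress well. The two functions below follow the formula documented for Protocol Buffers; the function names are my own.

```go
package main

import "fmt"

// zigZagEncode interleaves negatives and positives: 0, -1, 1, -2 -> 0, 1, 2, 3.
func zigZagEncode(v int64) uint64 {
	return uint64(v<<1) ^ uint64(v>>63)
}

// zigZagDecode reverses zigZagEncode.
func zigZagDecode(u uint64) int64 {
	return int64(u>>1) ^ -int64(u&1)
}

func main() {
	for _, v := range []int64{0, -1, 1, -2, 2} {
		e := zigZagEncode(v)
		fmt.Println(v, "->", e, "->", zigZagDecode(e))
	}
}
```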
Compression depends greatly
on the shape of your data
Slide 66
Slide 66 text
Write throughput depends on
batching, CPU, and memory
Slide 67
Slide 67 text
one test:
100,000 series
100,000 points per series
10,000,000,000 total points
5,000 points per request
c3.8xlarge, writes from 4 other systems
~390,000 points/sec
~3 bytes/point (random floats, could be better)
Slide 68
Slide 68 text
~400 IOPS
30%-50% CPU
There’s room for improvement!
Aggregates
select mean(value)
from cpu
where host = 'A'
and time > now() - 4h
group by time(5m)
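Conceptually, `group by time(5m)` assigns each point to a fixed window and applies the aggregate per window. A rough sketch of `mean` over such windows, assuming timestamps in Unix seconds, non-negative times, and windows aligned to multiples of the window size; `meanByWindow` is a hypothetical name, not the query engine's code.

```go
package main

import "fmt"

type point struct {
	Time  int64 // Unix seconds
	Value float64
}

// meanByWindow groups points into windows of the given size (in seconds)
// and returns the mean per window, keyed by the window's start time.
func meanByWindow(points []point, window int64) map[int64]float64 {
	sums := map[int64]float64{}
	counts := map[int64]int{}
	for _, p := range points {
		start := p.Time - p.Time%window
		sums[start] += p.Value
		counts[start]++
	}
	means := make(map[int64]float64, len(sums))
	for start, sum := range sums {
		means[start] = sum / float64(counts[start])
	}
	return means
}

func main() {
	points := []point{{0, 1}, {100, 2}, {299, 3}, {300, 10}}
	fmt.Println(meanByWindow(points, 300)) // 300 seconds = 5m
}
```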
Slide 73
Slide 73 text
Transformations
select derivative(value)
from cpu
where host = 'A'
and time > now() - 4h
group by time(5m)
Slide 74
Slide 74 text
Selectors
select min(value)
from cpu
where host = 'A'
and time > now() - 4h
group by time(5m)
Slide 75
Slide 75 text
Then there are fills
select mean(value)
from cpu
where host = 'A'
and time > now() - 4h
group by time(5m)
fill(0)
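A sketch of what `fill(0)` adds on top of the grouped result: every window in the queried range gets an output row, and windows with no data take the fill value. `fillWindows` and its parameters are hypothetical.

```go
package main

import "fmt"

// fillWindows walks every window start in [start, end) and emits the
// aggregated value when present, or the fill value when the window is empty.
func fillWindows(means map[int64]float64, start, end, window int64, fill float64) []float64 {
	var out []float64
	for t := start; t < end; t += window {
		if v, ok := means[t]; ok {
			out = append(out, v)
		} else {
			out = append(out, fill)
		}
	}
	return out
}

func main() {
	// Means per 5-minute window; the window starting at 300 has no data.
	means := map[int64]float64{0: 1.5, 600: 3}
	fmt.Println(fillWindows(means, 0, 900, 300, 0))
}
```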
Slide 76
Slide 76 text
How to differentiate between the
different types?
Slide 77
Slide 77 text
How do we chain functions
together?
without making breaking changes to InfluxQL
Slide 78
Slide 78 text
Mix jQuery style with InfluxQL
SELECT
mean(value).fill(previous).derivative(1s).scale(100).as('mvg_avg')
FROM measurement
WHERE time > now() - 4h
GROUP BY time(1m)
Slide 79
Slide 79 text
D3 style
SELECT
mean(value)
.fill(previous)
.derivative(1s)
.scale(100)
.as('mvg_avg')
FROM measurement
WHERE time > now() - 4h
GROUP BY time(1m)
Slide 80
Slide 80 text
Moving the FROM?
SELECT
from('cpu').mean(value),
from('memory').mean(value)
WHERE time > now() - 4h
GROUP BY time(1m)
Slide 81
Slide 81 text
Moving the FROM?
SELECT
from('cpu').mean(value),
from('memory').mean(value)
WHERE time > now() - 4h
GROUP BY time(1m)
consistent time and filtering applied to both
Slide 82
Slide 82 text
JOIN
SELECT
join(
from('errors')
.count(value),
from('requests')
.count(value)
).fill(0)
.count(value)
WHERE time > now() - 4h
GROUP BY time(1m)