Go snippets from the new InfluxDB storage engine
Paul Dix
CEO at InfluxDB
@pauldix
paul@influxdb.com
Slide 2
Or, the InfluxDB storage
engine… and a sprinkling of Go
Slide 3
preliminary intro materials…
Slide 4
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Slide 5
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement
Slide 6
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags
Slide 7
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags Fields
Slide 8
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags Fields Timestamp
Slide 9
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags Fields Timestamp
We actually store ns timestamps,
but I couldn't fit them on the slide
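In Go terms, a parsed point could be modeled roughly like this (a hypothetical sketch for illustration, not the engine's actual types):

// Point is one parsed line of the protocol above.
type Point struct {
    Measurement string                 // "temperature"
    Tags        map[string]string      // {"device": "dev1", "building": "b1"}
    Fields      map[string]interface{} // {"internal": 80.0, "external": 18.0}
    Timestamp   int64                  // Unix time in nanoseconds
}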
Slide 10
Each series and field combination maps to a unique ID
temperature,device=dev1,building=b1#internal → 1
temperature,device=dev1,building=b1#external → 2
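Assigning those IDs can be as simple as a map plus a counter. A minimal sketch (the real engine persists the mapping; this one is in-memory only and not goroutine-safe):

var (
    ids    = map[string]uint64{}
    nextID uint64
)

// idForKey returns the existing ID for a series#field key,
// or assigns the next one.
func idForKey(key string) uint64 {
    if id, ok := ids[key]; ok {
        return id
    }
    nextID++
    ids[key] = nextID
    return nextID
}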
Slide 11
Data per ID is tuples ordered by time
temperature,device=dev1,building=b1#internal → 1
temperature,device=dev1,building=b1#external → 2

1 → (1443782126,80)
2 → (1443782126,18)
Slide 12
Arranging in Key/Value Stores

Key (ID,Time)   Value
1,1443782126    80
Slide 13
Arranging in Key/Value Stores

Key             Value
1,1443782126    80
2,1443782126    18
Slide 14
Arranging in Key/Value Stores

Key             Value
1,1443782126    80
2,1443782126    18
1,1443782127    81   ← new data
Slide 15
Arranging in Key/Value Stores

Key             Value
1,1443782126    80
1,1443782127    81
2,1443782126    18

key space is ordered
Slide 16
Arranging in Key/Value Stores

Key             Value
1,1443782126    80
1,1443782127    81
2,1443782126    18
2,1443782130    17
2,1443782256    15
3,1443700126    18
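Since the key space is ordered byte-wise, writing the ID and timestamp big-endian makes lexicographic key order equal (ID, time) order. A sketch of building such a composite key (an assumed layout, not the engine's exact format):

import "encoding/binary"

// key builds a 16-byte composite key: 8-byte ID, then 8-byte timestamp,
// both big-endian so byte order sorts by ID first, then time.
func key(id uint64, unixNano int64) []byte {
    b := make([]byte, 16)
    binary.BigEndian.PutUint64(b[0:8], id)
    binary.BigEndian.PutUint64(b[8:16], uint64(unixNano))
    return b
}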
Slide 17
Many existing storage engines
have this model
Slide 18
New Storage Engine?!
Slide 19
First we used LSM Trees
Slide 20
deletes expensive
Slide 21
too many open file handles
Slide 22
Then mmap COW B+Trees
Slide 23
write throughput
Slide 24
compression
Slide 25
Neither met our requirements
Slide 26
High write throughput
Slide 27
Awesome read performance
Slide 28
Better Compression
Slide 29
Writes can’t block reads
Slide 30
Reads can’t block writes
Slide 31
Write multiple ranges
simultaneously
Slide 32
Hot backups
Slide 33
Many databases open in a
single process
Slide 34
Enter InfluxDB’s
Time Structured Merge Tree
(TSM Tree)
Slide 35
Enter InfluxDB’s
Time Structured Merge Tree
(TSM Tree)
like LSM, but different
Slide 36
Components: WAL, in-memory cache, index files
Slide 37
Components: WAL, in-memory cache, index files
Similar to LSM Trees
Slide 38
Components: WAL, in-memory cache, index files
Similar to LSM Trees: the WAL is the same
Slide 39
Components: WAL, in-memory cache, index files
Similar to LSM Trees: the WAL is the same, the cache is like MemTables
Slide 40
Components: WAL, in-memory cache, index files
Similar to LSM Trees: the WAL is the same, the cache is like MemTables, the index files are like SSTables
Slide 41
awesome time series data → WAL (an append-only file)
Slide 42
awesome time series data → WAL (an append-only file) → in-memory index
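A minimal sketch of an append-only WAL write, assuming a simple length-prefixed entry format (the real WAL format is more involved). The file would be opened with os.O_CREATE|os.O_APPEND|os.O_WRONLY:

import (
    "encoding/binary"
    "os"
)

// appendEntry writes one length-prefixed entry and syncs, so the data
// survives a crash before it reaches the index files.
func appendEntry(f *os.File, entry []byte) error {
    var length [4]byte
    binary.BigEndian.PutUint32(length[:], uint32(len(entry)))
    if _, err := f.Write(length[:]); err != nil {
        return err
    }
    if _, err := f.Write(entry); err != nil {
        return err
    }
    return f.Sync()
}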
Slide 43
In Memory Cache

// cache and flush variables
cacheLock  sync.RWMutex
cache      map[string]Values
flushCache map[string]Values

// cache keys look like: temperature,device=dev1,building=b1#internal
Slide 44

// cache and flush variables
cacheLock  sync.RWMutex
cache      map[string]Values
flushCache map[string]Values
dirtySort  map[string]bool

mutexes are your friend
Slide 45

// cache and flush variables
cacheLock  sync.RWMutex
cache      map[string]Values
flushCache map[string]Values
dirtySort  map[string]bool

mutexes are your friend
values can come in out of order,
so they may need sorting later
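Putting those variables together, a write might land in the cache roughly like this; a sketch that assumes a hypothetical engine struct holding the fields above (not the engine's exact code):

func (e *engine) addToCache(key string, values Values) {
    e.cacheLock.Lock()
    defer e.cacheLock.Unlock()

    existing := e.cache[key]
    // if new values start before the last cached value, this key's
    // slice must be re-sorted before flushing or querying
    if len(existing) > 0 && len(values) > 0 &&
        values[0].UnixNano() < existing[len(existing)-1].UnixNano() {
        e.dirtySort[key] = true
    }
    e.cache[key] = append(existing, values...)
}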
Slide 46
Different Value Types

type Value interface {
    Time() time.Time
    UnixNano() int64
    Value() interface{}
    Size() int
}
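A concrete type satisfying the interface is small; here's a hypothetical FloatValue (illustrative names, not the engine's type):

import "time"

// FloatValue is a float64 sample at a point in time.
type FloatValue struct {
    t time.Time
    v float64
}

func (f FloatValue) Time() time.Time    { return f.t }
func (f FloatValue) UnixNano() int64    { return f.t.UnixNano() }
func (f FloatValue) Value() interface{} { return f.v }
func (f FloatValue) Size() int          { return 16 } // 8-byte timestamp + 8-byte float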
Slide 47

type Values []Value

func (v Values) Encode(buf []byte) []byte { /* code here */ }

// Sort methods
func (a Values) Len() int      { return len(a) }
func (a Values) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a Values) Less(i, j int) bool {
    return a[i].Time().UnixNano() < a[j].Time().UnixNano()
}
Slide 48

type Values []Value

func (v Values) Encode(buf []byte) []byte { /* code here */ }

// Sort methods
func (a Values) Len() int      { return len(a) }
func (a Values) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a Values) Less(i, j int) bool {
    return a[i].Time().UnixNano() < a[j].Time().UnixNano()
}

Sometimes I want generics…
and then I come to my senses
Slide 49
Finding a specific time

// Seek will point the cursor to the given time (or key)
func (c *walCursor) SeekTo(seek int64) (int64, interface{}) {
    // Seek cache index
    c.position = sort.Search(len(c.cache), func(i int) bool {
        return c.cache[i].Time().UnixNano() >= seek
    })
    // more sweet code
}
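sort.Search does the heavy lifting here: it binary-searches for the smallest index where the predicate is true, so the cursor lands on the first cached value at or after seek. A standalone illustration:

package main

import (
    "fmt"
    "sort"
)

func main() {
    times := []int64{10, 20, 30, 40}
    // smallest i where times[i] >= 25; returns len(times) if none match
    i := sort.Search(len(times), func(i int) bool { return times[i] >= 25 })
    fmt.Println(i) // 2
}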
Slide 50
awesome time series data → WAL (an append-only file) → in-memory index → on-disk index (periodic flushes)
Slide 51
The Index

Data File (Min Time: 10000, Max Time: 29999)
Data File (Min Time: 30000, Max Time: 39999)
Data File (Min Time: 70000, Max Time: 99999)

Contiguous blocks of time
Slide 52
The Index

Data File (Min Time: 10000, Max Time: 29999)
Data File (Min Time: 30000, Max Time: 39999)
Data File (Min Time: 70000, Max Time: 99999)

non-overlapping
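Non-overlapping files mean a query for a time range [qmin, qmax] only has to touch the files whose ranges intersect it. A sketch, assuming hypothetical MinTime/MaxTime accessors on dataFile:

// filesForRange returns the data files whose [MinTime, MaxTime] range
// intersects the query range [qmin, qmax].
func filesForRange(files []*dataFile, qmin, qmax int64) []*dataFile {
    var matched []*dataFile
    for _, f := range files {
        if f.MinTime() <= qmax && f.MaxTime() >= qmin {
            matched = append(matched, f)
        }
    }
    return matched
}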
Slide 53
Data files are read only, like LSM
SSTables
Slide 54
The Index

Data File (Min Time: 10000, Max Time: 29999)
Data File (Min Time: 30000, Max Time: 39999)
Data File (Min Time: 70000, Max Time: 99999)
        ↓
Data File (Min Time: 10000, Max Time: 99999)

they periodically get compacted (like LSM)
Slide 55
Compacting while appending new data
Slide 56
Compacting while appending new data

func (w *WriteLock) LockRange(min, max int64) {
    // sweet code here
}

func (w *WriteLock) UnlockRange(min, max int64) {
    // sweet code here
}
Slide 57
Compacting while appending new data

func (w *WriteLock) LockRange(min, max int64) {
    // sweet code here
}

func (w *WriteLock) UnlockRange(min, max int64) {
    // sweet code here
}

This should block until we get it
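One way to build such a range lock is to track held ranges and wait on a condition variable until the requested range is free. A sketch of that approach, not the actual tsm1 implementation:

import "sync"

// WriteLock blocks callers until the requested time range no longer
// overlaps any currently held range.
type WriteLock struct {
    mu     sync.Mutex
    cond   *sync.Cond
    ranges [][2]int64 // held [min, max] ranges, inclusive
}

func (w *WriteLock) LockRange(min, max int64) {
    w.mu.Lock()
    defer w.mu.Unlock()
    if w.cond == nil {
        w.cond = sync.NewCond(&w.mu)
    }
    // block until no held range intersects [min, max]
    for w.overlaps(min, max) {
        w.cond.Wait()
    }
    w.ranges = append(w.ranges, [2]int64{min, max})
}

func (w *WriteLock) UnlockRange(min, max int64) {
    w.mu.Lock()
    defer w.mu.Unlock()
    for i, r := range w.ranges {
        if r[0] == min && r[1] == max {
            w.ranges = append(w.ranges[:i], w.ranges[i+1:]...)
            break
        }
    }
    if w.cond != nil {
        w.cond.Broadcast() // wake waiters so they can re-check
    }
}

func (w *WriteLock) overlaps(min, max int64) bool {
    for _, r := range w.ranges {
        if min <= r[1] && max >= r[0] {
            return true
        }
    }
    return false
}

The test on the next slides exercises exactly this blocking behavior.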
Slide 58
How to test?
Slide 59

func TestWriteLock_RightIntersect(t *testing.T) {
    w := &tsm1.WriteLock{}
    w.LockRange(2, 10)

    lock := make(chan bool)
    timeout := time.NewTimer(10 * time.Millisecond)
    go func() {
        w.LockRange(5, 15)
        lock <- true
    }()

    select {
    case <-lock:
        t.Fatal("able to get lock when we shouldn't")
    case <-timeout.C:
        // we're all good
    }
}
Slide 60
Back to the data files…

Data File (Min Time: 10000, Max Time: 29999)
Data File (Min Time: 30000, Max Time: 39999)
Data File (Min Time: 70000, Max Time: 99999)
Slide 61
Data File Layout
Slide 62
Data File Layout
Similar to SSTables
Slide 63
Data File Layout
Slide 64
Data File Layout
Slide 65
Data Files

type dataFile struct {
    f    *os.File
    size uint32
    mmap []byte
}

Finding the starting position for an ID

func (d *dataFile) StartingPositionForID(id uint64) uint32 {
    seriesCount := d.SeriesCount()
    indexStart := d.indexPosition()

    // binary search over the index at the end of the file: each entry
    // is an 8-byte ID followed by a 4-byte file position, sorted by ID
    min := uint32(0)
    max := uint32(seriesCount)
    for min < max {
        mid := (max-min)/2 + min

        offset := mid*seriesHeaderSize + indexStart
        checkID := btou64(d.mmap[offset : offset+timeSize])

        if checkID == id {
            return btou32(d.mmap[offset+timeSize : offset+timeSize+posSize])
        } else if checkID < id {
            min = mid + 1
        } else {
            max = mid
        }
    }
    return uint32(0)
}

The Index: IDs are sorted
Slide 70
Compressed Data Blocks
Slide 71
Timestamps: encoding based
on precision and deltas
Slide 72
Timestamps (good case):
Simple8B
Anh and Moffat, "Index compression using 64-bit words"
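The delta idea behind this: keep the first timestamp absolute, then store each subsequent one as the difference from its predecessor; dividing out the shared precision leaves small integers that Simple8B packs densely into 64-bit words. A sketch of the delta pass (not the engine's encoder):

// deltaEncode returns a copy of ts where the first entry stays
// absolute and every later entry becomes the difference from its
// predecessor. Iterating from the end keeps each computation based
// on the original values.
func deltaEncode(ts []int64) []int64 {
    out := make([]int64, len(ts))
    copy(out, ts)
    for i := len(out) - 1; i > 0; i-- {
        out[i] -= out[i-1]
    }
    return out
}

For points spaced exactly one second apart, every delta after the first is 1000000000; dividing by the common divisor 10^9 leaves a run of 1s, which is the good case the slide refers to.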
Slide 73
float64: XOR-based compression
from Facebook's Gorilla - google: gorilla time series facebook
https://github.com/dgryski/go-tsz
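The core of the Gorilla float scheme is XORing each value's bits with its predecessor's: similar successive values yield mostly-zero results, and only the bits between the leading and trailing zeros need storing. A stdlib-only sketch of that step (dgryski/go-tsz implements the full encoder):

package main

import (
    "fmt"
    "math"
    "math/bits"
)

func main() {
    vals := []float64{18.0, 18.0, 18.5, 19.0}
    prev := math.Float64bits(vals[0])
    for _, v := range vals[1:] {
        cur := math.Float64bits(v)
        xor := cur ^ prev
        if xor == 0 {
            // identical value: Gorilla encodes this as a single 0 bit
            fmt.Println("identical value")
        } else {
            // only the bits between the leading and trailing zeros matter
            sig := 64 - bits.LeadingZeros64(xor) - bits.TrailingZeros64(xor)
            fmt.Printf("significant bits: %d\n", sig)
        }
        prev = cur
    }
}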
Slide 74
booleans are bits!
Slide 75
int64 uses double delta
Slide 76
string uses Snappy
same compression LevelDB uses
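A minimal usage sketch with the github.com/golang/snappy package:

package main

import (
    "fmt"

    "github.com/golang/snappy"
)

func main() {
    src := []byte("value,value,value,value,value") // repetitive strings compress well
    compressed := snappy.Encode(nil, src)

    decoded, err := snappy.Decode(nil, compressed)
    if err != nil {
        panic(err)
    }
    fmt.Println(len(src), "->", len(compressed), string(decoded) == string(src))
}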
Slide 77
How does it perform?
Slide 78
Compression depends greatly
on the shape of your data
Slide 79
Write throughput depends on
batching, CPU, and memory
Slide 80
test last night:
100,000 series
100,000 points per series
10,000,000,000 total points
5,000 points per request
c3.8xlarge, writes from 4 other systems
~390,000 points/sec
~3 bytes/point (random floats, could be better)
Slide 81
~400 IOPS
30%-50% CPU
There’s room for improvement!
Slide 82
Detailed writeup next week!
http://influxdb.com/blog.html