Go snippets from the new InfluxDB storage engine

Paul Dix
October 02, 2015

Slides from my talk at GothamGo 2015

Transcript

  1. Go snippets from the new InfluxDB storage engine Paul Dix

    CEO at InfluxDB @pauldix paul@influxdb.com
  2. Or, the InfluxDB storage engine… and a sprinkling of Go

  3. preliminary intro materials…

  4. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126

  5. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement

  6. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Tags

  7. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Tags Fields

  8. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Tags Fields Timestamp

  9. InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Tags Fields Timestamp We

    actually store ns timestamps but I couldn’t fit them on the slide
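
To make those four parts concrete, here is a rough Go sketch of how a single point could be held in memory; the Point struct and its field types are illustrative, not InfluxDB's actual types.

    package main

    import "fmt"

    // Point is an illustrative container for the four parts of a line-protocol
    // point: measurement, tags, fields, and a timestamp.
    type Point struct {
        Measurement string
        Tags        map[string]string
        Fields      map[string]float64
        Timestamp   int64 // nanoseconds in the real engine
    }

    func main() {
        // temperature,device=dev1,building=b1 internal=80,external=18 1443782126
        p := Point{
            Measurement: "temperature",
            Tags:        map[string]string{"device": "dev1", "building": "b1"},
            Fields:      map[string]float64{"internal": 80, "external": 18},
            Timestamp:   1443782126,
        }
        fmt.Printf("%+v\n", p)
    }
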
  10. Each series and field to a unique ID temperature,device=dev1,building=b1#internal temperature,device=dev1,building=b1#external

    1 2
  11. Data per ID is tuples ordered by time temperature,device=dev1,building=b1#internal temperature,device=dev1,building=b1#external

    1 2 1 (1443782126,80) 2 (1443782126,18)
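
A minimal sketch of that mapping in Go, with made-up names: each series+field string maps to a small integer ID, and values are kept per ID as time-ordered tuples.

    package main

    import "fmt"

    // seriesIDs and points mirror the slides: a series+field key maps to an ID,
    // and each ID holds a slice of (timestamp, value) tuples ordered by time.
    var seriesIDs = map[string]uint64{
        "temperature,device=dev1,building=b1#internal": 1,
        "temperature,device=dev1,building=b1#external": 2,
    }

    type tuple struct {
        unixNano int64
        value    float64
    }

    var points = map[uint64][]tuple{
        1: {{1443782126, 80}},
        2: {{1443782126, 18}},
    }

    func main() {
        id := seriesIDs["temperature,device=dev1,building=b1#internal"]
        fmt.Println(id, points[id]) // 1 [{1443782126 80}]
    }
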
  12. Arranging in Key/Value Stores 1,1443782126 Key Value 80 ID Time

  13. Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18

  14. Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18

    1,1443782127 81 new data
  15. Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18

    1,1443782127 81 key space is ordered
  16. Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18

    1,1443782127 81 2,1443782256 15 2,1443782130 17 3,1443700126 18
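
One way to get that ordering in a key/value store is to build each key as big-endian ID bytes followed by big-endian timestamp bytes, so byte-wise key order groups a series' points together, sorted by time. A sketch of the idea (not necessarily the exact key layout that was used):

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    // key composes an (ID, time) pair into a 16-byte key whose byte order
    // sorts first by ID, then by timestamp.
    func key(id uint64, unixNano int64) []byte {
        b := make([]byte, 16)
        binary.BigEndian.PutUint64(b[0:8], id)
        binary.BigEndian.PutUint64(b[8:16], uint64(unixNano))
        return b
    }

    func main() {
        a := key(1, 1443782126)
        b := key(1, 1443782127)
        c := key(2, 1443782126)
        fmt.Println(bytes.Compare(a, b) < 0) // true: same ID, earlier time first
        fmt.Println(bytes.Compare(b, c) < 0) // true: ID 1 sorts before ID 2
    }
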
  17. Many existing storage engines have this model

  18. New Storage Engine?!

  19. First we used LSM Trees

  20. deletes expensive

  21. too many open file handles

  22. Then mmap COW B+Trees

  23. write throughput

  24. compression

  25. met our requirements

  26. High write throughput

  27. Awesome read performance

  28. Better Compression

  29. Writes can’t block reads

  30. Reads can’t block writes

  31. Write multiple ranges simultaneously

  32. Hot backups

  33. Many databases open in a single process

  34. Enter InfluxDB’s Time Structured Merge Tree (TSM Tree)

  35. Enter InfluxDB’s Time Structured Merge Tree (TSM Tree) like LSM,

    but different
  36. Components WAL In memory cache Index Files

  37. Components WAL In memory cache Index Files Similar to LSM

    Trees
  38. Components WAL In memory cache Index Files Similar to LSM

    Trees Same
  39. Components WAL In memory cache Index Files Similar to LSM

    Trees Same like MemTables
  40. Components WAL In memory cache Index Files Similar to LSM

    Trees Same like MemTables like SSTables
  41. awesome time series data WAL (an append only file)

  42. awesome time series data WAL (an append only file) in

    memory index
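
A bare-bones sketch of the append-only idea, with a made-up file name and entry format: each write is appended to the end of the WAL and synced before being acknowledged, while the in-memory index holds the same data for queries.

    package main

    import (
        "log"
        "os"
    )

    func main() {
        // O_APPEND means every write lands at the end of the file.
        f, err := os.OpenFile("wal.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0666)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        entry := []byte("temperature,device=dev1,building=b1#internal 1443782126 80\n")
        if _, err := f.Write(entry); err != nil {
            log.Fatal(err)
        }
        if err := f.Sync(); err != nil { // make it durable before acknowledging
            log.Fatal(err)
        }
    }
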
  43. In Memory Cache // cache and flush variables cacheLock sync.RWMutex

    cache map[string]Values flushCache map[string]Values temperature,device=dev1,building=b1#internal
  44. // cache and flush variables cacheLock sync.RWMutex cache map[string]Values flushCache

    map[string]Values dirtySort map[string]bool mutexes are your friend
  45. // cache and flush variables cacheLock sync.RWMutex cache map[string]Values flushCache

    map[string]Values dirtySort map[string]bool mutexes are your friend values can come in out of order. may need sorting later
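
A simplified sketch of how a write might land in that cache, using the lock and dirty-sort idea from the slide (types and names are pared down, and values are just timestamps here):

    package main

    import (
        "fmt"
        "sync"
    )

    type cacheWriter struct {
        cacheLock sync.RWMutex
        cache     map[string][]int64 // per-series values, timestamps only for brevity
        dirtySort map[string]bool
    }

    // write appends under the write lock and marks the series dirty if the new
    // value arrived out of order, so it can be sorted lazily before a flush.
    func (c *cacheWriter) write(key string, unixNano int64) {
        c.cacheLock.Lock()
        defer c.cacheLock.Unlock()

        vals := c.cache[key]
        if n := len(vals); n > 0 && unixNano < vals[n-1] {
            c.dirtySort[key] = true
        }
        c.cache[key] = append(vals, unixNano)
    }

    func main() {
        c := &cacheWriter{cache: map[string][]int64{}, dirtySort: map[string]bool{}}
        c.write("temperature,device=dev1,building=b1#internal", 1443782127)
        c.write("temperature,device=dev1,building=b1#internal", 1443782126) // out of order
        fmt.Println(c.cache, c.dirtySort)
    }
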
  46. Different Value Types type Value interface { Time() time.Time UnixNano()

    int64 Value() interface{} Size() int }
  47. type Values []Value func (v Values) Encode(buf []byte) []byte {

    /* code here */ } // Sort methods func (a Values) Len() int { return len(a) } func (a Values) Swap(i, j int) { a[i], a[j] = a[j], a[i] } func (a Values) Less(i, j int) bool { return a[i].Time().UnixNano() < a[j].Time().UnixNano() }
  48. type Values []Value func (v Values) Encode(buf []byte) []byte {

    /* code here */ } // Sort methods func (a Values) Len() int { return len(a) } func (a Values) Swap(i, j int) { a[i], a[j] = a[j], a[i] } func (a Values) Less(i, j int) bool { return a[i].Time().UnixNano() < a[j].Time().UnixNano() } Sometimes I want generics… and then I come to my senses
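
For context, a self-contained sketch of one concrete type satisfying that Value interface and of sorting a Values slice by time; FloatValue is hypothetical and the Encode method is omitted.

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    type Value interface {
        Time() time.Time
        UnixNano() int64
        Value() interface{}
        Size() int
    }

    // FloatValue is one concrete value kind; other field types would get their
    // own implementations, which is what the interface stands in for instead
    // of generics.
    type FloatValue struct {
        T time.Time
        V float64
    }

    func (f FloatValue) Time() time.Time    { return f.T }
    func (f FloatValue) UnixNano() int64    { return f.T.UnixNano() }
    func (f FloatValue) Value() interface{} { return f.V }
    func (f FloatValue) Size() int          { return 16 } // 8 bytes time + 8 bytes float

    type Values []Value

    func (a Values) Len() int           { return len(a) }
    func (a Values) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
    func (a Values) Less(i, j int) bool { return a[i].UnixNano() < a[j].UnixNano() }

    func main() {
        vals := Values{
            FloatValue{time.Unix(1443782127, 0), 81},
            FloatValue{time.Unix(1443782126, 0), 80},
        }
        sort.Sort(vals) // order by time before encoding or flushing
        fmt.Println(vals[0].UnixNano(), vals[1].UnixNano())
    }
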
  49. Finding a specific time // Seek will point the cursor

    to the given time (or key) func (c *walCursor) SeekTo(seek int64) (int64, interface{}) { // Seek cache index c.position = sort.Search(len(c.cache), func(i int) bool { return c.cache[i].Time().UnixNano() >= seek }) // more sweet code }
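
sort.Search is doing the work in SeekTo: it returns the smallest index whose timestamp is at or after the seek time, or the slice length if every cached value is older. A tiny standalone example of that behavior:

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        times := []int64{10, 20, 30, 40}
        seek := int64(25)
        // Smallest index i with times[i] >= seek; len(times) if there is none.
        i := sort.Search(len(times), func(i int) bool { return times[i] >= seek })
        fmt.Println(i) // 2, pointing at timestamp 30
    }
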
  50. awesome time series data WAL (an append only file) in

    memory index on disk index (periodic flushes)
  51. The Index Data File Min Time: 10000 Max Time: 29999

    Data File Min Time: 30000 Max Time: 39999 Data File Min Time: 70000 Max Time: 99999 Contiguous blocks of time
  52. The Index Data File Min Time: 10000 Max Time: 29999

    Data File Min Time: 30000 Max Time: 39999 Data File Min Time: 70000 Max Time: 99999 non-overlapping
  53. Data files are read only, like LSM SSTables

  54. The Index Data File Min Time: 10000 Max Time: 29999

    Data File Min Time: 30000 Max Time: 39999 Data File Min Time: 70000 Max Time: 99999 Data File Min Time: 10000 Max Time: 99999 they periodically get compacted (like LSM)
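
Because each data file covers a non-overlapping block of time, a query only has to open the files whose min/max bounds intersect the requested range. A small sketch with made-up metadata:

    package main

    import "fmt"

    type dataFileMeta struct {
        MinTime, MaxTime int64
    }

    // overlapping returns the files whose [MinTime, MaxTime] range intersects
    // the query range [qMin, qMax].
    func overlapping(files []dataFileMeta, qMin, qMax int64) []dataFileMeta {
        var out []dataFileMeta
        for _, f := range files {
            if f.MaxTime >= qMin && f.MinTime <= qMax {
                out = append(out, f)
            }
        }
        return out
    }

    func main() {
        files := []dataFileMeta{{10000, 29999}, {30000, 39999}, {70000, 99999}}
        fmt.Println(overlapping(files, 25000, 35000)) // the first two files only
    }
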
  55. Compacting while appending new data

  56. Compacting while appending new data func (w *WriteLock) LockRange(min, max

    int64) { // sweet code here } func (w *WriteLock) UnlockRange(min, max int64) { // sweet code here }
  57. Compacting while appending new data func (w *WriteLock) LockRange(min, max

    int64) { // sweet code here } func (w *WriteLock) UnlockRange(min, max int64) { // sweet code here } This should block until we get it
  58. How to test?

  59. func TestWriteLock_RightIntersect(t *testing.T) { w := &tsm1.WriteLock{} w.LockRange(2, 10) lock

    := make(chan bool) timeout := time.NewTimer(10 * time.Millisecond) go func() { w.LockRange(5, 15) lock <- true }() select { case <-lock: t.Fatal("able to get lock when we shouldn't") case <-timeout.C: // we're all good } }
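
The slides don't show how LockRange is implemented; one simplified way to get the behavior this test expects is a mutex plus a condition variable over the set of held ranges. This is a sketch, not InfluxDB's implementation.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type rangeLock struct {
        mu   sync.Mutex
        cond *sync.Cond
        held [][2]int64 // currently locked [min, max] ranges
    }

    func newRangeLock() *rangeLock {
        r := &rangeLock{}
        r.cond = sync.NewCond(&r.mu)
        return r
    }

    // LockRange blocks until no held range overlaps [min, max].
    func (r *rangeLock) LockRange(min, max int64) {
        r.mu.Lock()
        defer r.mu.Unlock()
        for r.overlaps(min, max) {
            r.cond.Wait()
        }
        r.held = append(r.held, [2]int64{min, max})
    }

    func (r *rangeLock) UnlockRange(min, max int64) {
        r.mu.Lock()
        defer r.mu.Unlock()
        for i, h := range r.held {
            if h[0] == min && h[1] == max {
                r.held = append(r.held[:i], r.held[i+1:]...)
                break
            }
        }
        r.cond.Broadcast() // wake anyone waiting on a conflicting range
    }

    func (r *rangeLock) overlaps(min, max int64) bool {
        for _, h := range r.held {
            if h[1] >= min && h[0] <= max {
                return true
            }
        }
        return false
    }

    func main() {
        r := newRangeLock()
        r.LockRange(2, 10)
        go func() {
            time.Sleep(50 * time.Millisecond)
            r.UnlockRange(2, 10)
        }()
        r.LockRange(5, 15) // blocks until the overlapping [2, 10] lock is released
        fmt.Println("got the lock")
    }
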
  60. Back to the data files… Data File Min Time: 10000

    Max Time: 29999 Data File Min Time: 30000 Max Time: 39999 Data File Min Time: 70000 Max Time: 99999
  61. Data File Layout

  62. Data File Layout Similar to SSTables

  63. Data File Layout

  64. Data File Layout

  65. Data Files type dataFile struct { f *os.File size uint32

    mmap []byte }
  66. Memory mapping lets the OS handle caching for you

  67. Memory Mapping fInfo, err := f.Stat() if err != nil

    { return nil, err } mmap, err := syscall.Mmap( int(f.Fd()), 0, int(fInfo.Size()), syscall.PROT_READ, syscall.MAP_SHARED) if err != nil { return nil, err }
  68. Access file like a byte slice func (d *dataFile) MinTime()

    int64 { minTimePosition := d.size - minTimeOffset timeBytes := d.mmap[minTimePosition : minTimePosition+timeSize] return int64(btou64(timeBytes)) }
  69. Finding the time for an ID func (d *dataFile) StartingPositionForID(id

    uint64) uint32 { seriesCount := d.SeriesCount() indexStart := d.indexPosition() min := uint32(0) max := uint32(seriesCount) for min < max { mid := (max-min)/2 + min offset := mid*seriesHeaderSize + indexStart checkID := btou64(d.mmap[offset : offset+timeSize]) if checkID == id { return btou32(d.mmap[offset+timeSize : offset+timeSize+posSize]) } else if checkID < id { min = mid + 1 } else { max = mid } } return uint32(0) } The Index: IDs are sorted
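
The btou64 and btou32 helpers aren't shown on the slides; presumably they are thin wrappers over encoding/binary, something like the following (the big-endian byte order is an assumption):

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    func btou64(b []byte) uint64 { return binary.BigEndian.Uint64(b) }
    func btou32(b []byte) uint32 { return binary.BigEndian.Uint32(b) }

    func main() {
        b := []byte{0, 0, 0, 0, 0, 0, 0, 42}
        fmt.Println(btou64(b))      // 42
        fmt.Println(btou32(b[4:8])) // 42
    }
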
  70. Compressed Data Blocks

  71. Timestamps: encoding based on precision and deltas

  72. Timestamps (good case): Simple8B Anh and Moffat in "Index compression

    using 64-bit words"
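
Simple8B packs runs of small integers into 64-bit words, and it works well here because the deltas between regular timestamps are small and repetitive. As a simplified illustration of the delta idea only (plain delta plus varints, not Simple8B and not the engine's actual encoder):

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // encodeDeltas stores the first timestamp in full, then only the gaps
    // between consecutive timestamps, each as a varint.
    func encodeDeltas(ts []int64) []byte {
        buf := make([]byte, 0, len(ts)*binary.MaxVarintLen64)
        tmp := make([]byte, binary.MaxVarintLen64)
        var prev int64
        for i, t := range ts {
            d := t
            if i > 0 {
                d = t - prev
            }
            n := binary.PutVarint(tmp, d)
            buf = append(buf, tmp[:n]...)
            prev = t
        }
        return buf
    }

    func main() {
        ts := []int64{1443782126000000000, 1443782127000000000, 1443782128000000000}
        enc := encodeDeltas(ts)
        fmt.Printf("%d timestamps -> %d bytes\n", len(ts), len(enc)) // fewer than 24 raw bytes
    }
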
  73. float64: double delta Facebook’s Gorilla - google: gorilla time series

    facebook https://github.com/dgryski/go-tsz
  74. booleans are bits!

  75. int64 uses double delta

  76. string uses Snappy, the same compression LevelDB uses

  77. How does it perform?

  78. Compression depends greatly on the shape of your data

  79. Write throughput depends on batching, CPU, and memory

  80. test last night: 100,000 series 100,000 points per series 10,000,000,000

    total points 5,000 points per request c3.8xlarge, writes from 4 other systems ~390,000 points/sec ~3 bytes/point (random floats, could be better)
  81. ~400 IOPS 30%-50% CPU There’s room for improvement!

  82. Detailed writeup next week! http://influxdb.com/blog.html

  83. Thank you! Paul Dix @pauldix paul@influxdb.com