Slide 1

Time Series Storage
Designing and implementing a time series storage engine
Preetam Jinka
April 2015 - beCraft

Slide 2

Me
• Math student at UVA
• VividCortex
GitHub: PreetamJinka
Twitter: @PreetamJinka

Slide 3

Personal project: Cistern
• Network flow collector
• “Network Behavior Anomaly Detection”
• Monitoring ⇒ time series

Slide 4

Plots of time series metrics

Slide 5

The nature of time series data
• It comes in time order.
• It doesn’t change.
• It’s accessed in “bulk.”
• Deletions happen in “bulk.”
How do we store this?

Slide 6

Isn’t this simple?
• Time series is already indexed in time order.
• Writing to the end of a file is fast and efficient.
• Data points do not change, so you only have to write once.
• Reading data is fast because it’s already indexed.

Slide 7

This 24-port switch can easily generate >250 metrics just for the interface counters. A data center might have O(10,000) switches.

Slide 8

The nature of time series data
Adding these characteristics changes everything:
• High cardinality
• Sparse metrics
These rule out solutions like RRDTool or Whisper (Graphite’s storage engine). A file per metric with 100k metrics is an operational nightmare. What else can we do?

Slide 9

Role of a storage engine
• Data storage and retrieval
• Resource management
• Powerful primitives?

Slide 10

Why not use...
• ...RRDTool? It’s made for time series!
• ...Graphite? It’s made for time series!
• ...{RDBMS}? They’re great for structured data!
• ...Elastic(search)? Inverted indexes!
• ...LevelDB? It’s fast!
• ...B-trees? Everyone uses them!

Slide 11

Why not use...
• ...RRDTool? It’s made for time series!
• ...Graphite? It’s made for time series!
• ...{RDBMS}? They’re great for structured data!
• ...Elastic(search)? Inverted indexes!
• ...LevelDB? It’s fast!
• ...B-trees? Everyone uses them!
They don’t really fit this use case. I also don’t want external services to manage when I’m building a small utility.

Slide 12

So, no, you shouldn’t reinvent the wheel. Unless you plan on learning more about wheels, that is.
— http://blog.codinghorror.com/dont-reinvent-the-wheel-unless-you-plan-on-learning-more-about-wheels/

Slide 13

B-trees and LSMs
Two big categories of ordered key-value stores:
• B-trees
  ◦ Generally read optimized
  ◦ Most RDBMSs use B-trees for ordered indexes
  ◦ InnoDB, WiredTiger, TokuDB, LMDB, Bolt, etc.
• Log-structured merge (LSM)
  ◦ Write optimized
  ◦ LevelDB, RocksDB, etc.

Slide 14

Copy-on-Write / Append-only B-trees

Slide 15

Pros and cons of B-trees
• Pros:
  ◦ Generic, flexible
  ◦ COW has great benefits
• Cons:
  ◦ Generic, too many assumptions made
  ◦ Reads before writes
  ◦ Hard to control where data get placed
  ◦ Hard to implement compression on top
  ◦ Hard to support concurrent writers

Slide 16

LevelDB

Slide 17

Pros and cons of LevelDB
• Pros:
  ◦ Write optimized
  ◦ Supports concurrent writers
  ◦ Compression
• Cons:
  ◦ Rewrites unchanged data
  ◦ Tombstoning

Slide 18


Slide 19

New data may come in time order per metric, but we shouldn’t make that assumption!

Slide 20

Read optimization
The fastest way to read from disk is sequentially. We want to minimize expensive seeks. We want to only read what we have to.

Slide 21

Write optimization
The fastest way to write to disk is sequentially. It’s better to write without reading (blind writes). We want to avoid rewriting data that do not change.

Slide 22

Write-ahead logging
• Append updates to the end of a file (a write-ahead log, WAL)
• Provides durability
• Basically a “redo” log during recovery
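
A minimal sketch of this append-only idea in Go, assuming only what the bullets above state; the wal type and its method names are illustrative, not Catena’s actual code.

    package sketch

    import "os"

    type wal struct {
        f *os.File
    }

    // openWAL opens (or creates) the log in append mode, so every write lands
    // at the end of the file.
    func openWAL(path string) (*wal, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
        if err != nil {
            return nil, err
        }
        return &wal{f: f}, nil
    }

    // append adds one serialized entry. During recovery the log is replayed
    // ("redone") to rebuild the in-memory state.
    func (w *wal) append(entry []byte) error {
        _, err := w.f.Write(entry)
        return err
    }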

Slide 23

You can have sequential reads without amplification, sequential writes without amplification, or an immutable write-once design—pick any two.
— Baron Schwartz (http://www.xaprb.com/blog/2015/04/02/state-of-the-storage-engine/)

Slide 24

Compression
• Time series data compresses really well
• Fixed, compact structure
• Imagine a metric that has 0s for every timestamp

Slide 25

Putting everything together...
• There are fast generic key-value stores
• There are time series storage engines, but they’re not good for this use case
Can we design something that is both fast and designed for time series? Of course.

Slide 26

Catena
n. A closely linked series.
• Write optimized for fast, concurrent writes
  ◦ Sequential writes like LSM
• Read optimized for indexed reads
  ◦ Uses something similar to a B-tree
• Implemented in Go
  ◦ Still young. ~2 weeks of writing and coding

Slide 27

Basic API
db := catena.NewDB(...)
db.InsertRows([...])
i := db.NewIterator("source", "metric")
i.Seek(1234)
i.Next()
p := i.Point()
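
A rough usage sketch continuing the snippet above, scanning one series from a starting timestamp; the error-handling behavior, the way Next signals the end, the Point field names, and the process helper are assumptions rather than Catena’s documented API.

    // Hypothetical read loop: scan one series starting at timestamp 1234.
    iter := db.NewIterator("hostA", "cpu.user")
    if err := iter.Seek(1234); err == nil {
        for {
            p := iter.Point()
            process(p.Timestamp, p.Value) // process is a hypothetical consumer
            if iter.Next() != nil {       // assuming Next signals the end with an error
                break
            }
        }
    }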

Slide 28

Preetam’s notebook doodles

Slide 29

Implementation

Slide 30

Data hierarchy
Did someone say, “B-tree?”

Slide 31

Data are split into partitions of disjoint time ranges. Each partition is independent — nothing is shared.

Slide 32

Partitions
• Memory
  ◦ Writable
  ◦ Completely in memory, backed by a write-ahead log
  ◦ Precise locks
• Disk
  ◦ Read-only
  ◦ Compressed on disk
  ◦ Memory mapped
  ◦ Lock free
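
A sketch of the two partition kinds described above; all of the type and field names here are assumptions for illustration, not Catena’s real types.

    package sketch

    import (
        "os"
        "sync"
    )

    type point struct {
        Timestamp int64
        Value     float64
    }

    // memoryPartition: writable, fully in memory, backed by a write-ahead log,
    // with locks protecting the per-series map.
    type memoryPartition struct {
        mu     sync.RWMutex
        wal    *os.File
        series map[string][]point // keyed by source and metric (assumption)
    }

    // diskPartition: read-only, compressed extents in a memory-mapped file,
    // so readers need no locks at all.
    type diskPartition struct {
        mapped       []byte // mmap'd file contents
        minTimestamp int64
        maxTimestamp int64
    }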

Slide 33

Partitions
• Operational benefits
  ◦ Dropping old partitions is simple
  ◦ Each partition is a separate file
  ◦ Configurable size (1 hour, 1 day, whatever)
  ◦ Configurable retention

Slide 34

Example of files on disk
$ ls -lh
...
-rw-r--r-- 1 root root 1.7M Apr 8 23:00 19.part
-rw-r--r-- 1 root root 809K Apr 6 06:00 1.part
-rw-r--r-- 1 root root 2.8M Apr 9 07:00 20.part
-rw-r--r-- 1 root root 4.4M Apr 9 11:00 21.part
-rw-r--r-- 1 root root 2.3M Apr 9 12:00 22.part
-rw-r--r-- 1 root root 649K Apr 9 14:00 23.part
-rw-r--r-- 1 root root 3.7M Apr 9 13:59 24.wal
-rw-r--r-- 1 root root 2.1M Apr 9 15:08 25.wal
-rw-r--r-- 1 root root 7.0M Apr 6 09:00 2.part
...

Slide 35

Writing
• Writes are appended to the write-ahead log (WAL)
• Precise locking allows for high throughput
• Writes to the WAL are serialized
  ◦ Concurrency != parallelism

Slide 36

WAL
• Compressed entries
• Format (little endian):
  ◦ Magic constant (4 bytes)
  ◦ Operation type (1 byte)
  ◦ Number of rows (4 bytes)
  ◦ Compressed buffer of rows. Each row:
    • Source and metric name lengths (4, 4 bytes)
    • Source and metric names (variable)
    • Timestamp and value (8 bytes, 8 bytes)
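
A sketch of serializing one entry in the layout listed above, using encoding/binary with little-endian order; the magic value and the helper name are made up for illustration.

    package sketch

    import (
        "bytes"
        "encoding/binary"
    )

    const walMagic uint32 = 0xCA7E0001 // hypothetical magic constant

    func encodeEntry(op byte, numRows uint32, compressedRows []byte) []byte {
        buf := new(bytes.Buffer)
        binary.Write(buf, binary.LittleEndian, walMagic) // magic constant (4 bytes)
        buf.WriteByte(op)                                // operation type (1 byte)
        binary.Write(buf, binary.LittleEndian, numRows)  // number of rows (4 bytes)
        buf.Write(compressedRows)                        // compressed buffer of rows
        return buf.Bytes()
    }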

Slide 37

WAL
• WAL writes are buffered until written to disk
  ◦ Single write() call at the end
  ◦ WAL mutex is locked and unlocked around the write()
• Not fsync’d after each append

Slide 38

WAL recovery
• Create a new memory partition
• Read as many records from the WAL as possible
• Truncate after the last good record
• Start to accept writes and append to the end
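
A sketch of the read-then-truncate part of that sequence; decodeEntry stands in for a real entry parser and is purely hypothetical.

    package sketch

    import (
        "io"
        "os"
    )

    // decodeEntry is a hypothetical parser: it reads one WAL entry from r and
    // reports how many bytes it consumed, or an error for a torn/corrupt record.
    func decodeEntry(r io.Reader) (n int64, err error) { return 0, io.EOF }

    // recoverWAL replays as many records as possible, truncates after the last
    // good one, and leaves the file positioned for new appends.
    func recoverWAL(path string) (*os.File, error) {
        f, err := os.OpenFile(path, os.O_RDWR, 0644)
        if err != nil {
            return nil, err
        }
        var goodBytes int64
        for {
            n, err := decodeEntry(f)
            if err != nil {
                break // EOF or a corrupt record: stop here
            }
            goodBytes += n
        }
        if err := f.Truncate(goodBytes); err != nil {
            f.Close()
            return nil, err
        }
        _, err = f.Seek(goodBytes, io.SeekStart)
        return f, err
    }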

Slide 39

Reading
Iterators
• Created using a source and metric pair
• Great abstraction
  ◦ Seek to timestamps
  ◦ Do not block writers
  ◦ Stream off a disk
  ◦ Transition across partitions

Slide 40

More iterator benefits
• Composable
• Low-overhead reads
• Offer high reader concurrency

Slide 41

Concurrency
• This is what Go is great for, right?
• Catena only launches one goroutine: the compactor
• All operations are thread-safe.

Slide 42

Compactor
• Runs in a single goroutine (for now)
• Drops old partitions
  ◦ Only when there are no readers using that partition
• Compacts partitions

Slide 43

Compacting
• Memory partition is set to read-only mode
• Iterate through sources
  ◦ Iterate through metrics
• Split up points array into smaller extents
• For each extent, remember its offset, compress extent, and write to file
• Write metadata
  ◦ List of sources, metrics, extents, and offsets
  ◦ Also contains timestamps used for seeking, etc.
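
A rough sketch of the extent-writing step: gzip each chunk of points and remember the byte offset where it starts. The point layout, extent size, and metadata fields are assumptions, not Catena’s real on-disk format.

    package sketch

    import (
        "compress/gzip"
        "encoding/binary"
        "io"
        "os"
    )

    type point struct {
        Timestamp int64
        Value     float64
    }

    type extentMeta struct {
        StartTimestamp int64 // used later for seeking
        Offset         int64 // where the compressed extent begins in the file
    }

    const extentSize = 1024 // points per extent (assumption)

    func writeExtents(f *os.File, points []point) ([]extentMeta, error) {
        var meta []extentMeta
        for start := 0; start < len(points); start += extentSize {
            end := start + extentSize
            if end > len(points) {
                end = len(points)
            }
            // Remember where this extent begins before compressing into the file.
            offset, err := f.Seek(0, io.SeekCurrent)
            if err != nil {
                return nil, err
            }
            meta = append(meta, extentMeta{points[start].Timestamp, offset})

            zw := gzip.NewWriter(f)
            for _, p := range points[start:end] {
                binary.Write(zw, binary.LittleEndian, p.Timestamp)
                binary.Write(zw, binary.LittleEndian, p.Value)
            }
            if err := zw.Close(); err != nil {
                return nil, err
            }
        }
        return meta, nil
    }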

Slide 44

Compression
• GZIP (entropy encoding)
• Can get down to ~3 bytes per point
  ◦ 8 byte timestamp + 8 byte value

Timestamp | Value
-----------------
0x001234  | 0
0x011234  | 0
0x021234  | 0
0x031234  | 0
0x041234  | 0

Slide 45

Compacting
• Compacting happens in the background.
• After compaction, a memory partition gets replaced with its newly created disk partition
  ◦ This is an atomic swap in the partition list

Slide 46

Partition list
• It’s an ordered, lock-free linked list.
• Newest partition is first.
• Threads don’t wait.
• The compactor can safely and quickly swap references to partitions.
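
A sketch of the pointer swap that makes this work, using sync/atomic; the node layout and names are assumptions for illustration.

    package sketch

    import (
        "sync/atomic"
        "unsafe"
    )

    // partition is a stand-in for a memory or disk partition.
    type partition struct {
        minTimestamp int64
        readOnly     bool
    }

    // listNode is one entry in the ordered, lock-free partition list.
    type listNode struct {
        part unsafe.Pointer // *partition, loaded and swapped atomically
        next unsafe.Pointer // *listNode
    }

    func (n *listNode) load() *partition {
        return (*partition)(atomic.LoadPointer(&n.part))
    }

    // swap replaces a just-compacted memory partition with its new disk
    // partition. Readers never wait: old readers keep the pointer they already
    // loaded, new readers see the replacement.
    func (n *listNode) swap(oldP, newP *partition) bool {
        return atomic.CompareAndSwapPointer(&n.part, unsafe.Pointer(oldP), unsafe.Pointer(newP))
    }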

Slide 47

Atomics
• Used for metadata
  ◦ Min/max timestamps
  ◦ Counters
• Avoids overhead of locks

min := atomic.LoadInt64(&db.minTimestamp)
for ; min > minTimestampInRows; min = atomic.LoadInt64(&db.minTimestamp) {
    if atomic.CompareAndSwapInt64(&db.minTimestamp, min, minTimestampInRows) {
        break
    }
}

Slide 48

Locks
• They’re everywhere.
• Mutexes
  ◦ Used for slices
• RWMutexes
  ◦ Multiple readers or a single writer
  ◦ Used for partitions: shared “holds” and exclusive “holds”
  ◦ Also used for maps in memory partitions to avoid readers blocking each other
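
A sketch of the shared/exclusive “holds” on a memory partition using sync.RWMutex; the method names and the map layout are assumptions.

    package sketch

    import "sync"

    type memoryPartition struct {
        mu     sync.RWMutex
        series map[string][]float64 // simplified stand-in for the real maps
    }

    // hold takes a shared hold: many readers can proceed at once without
    // blocking each other.
    func (p *memoryPartition) hold()    { p.mu.RLock() }
    func (p *memoryPartition) release() { p.mu.RUnlock() }

    // exclusiveHold is taken for structural changes (or before compaction) so
    // that no readers are in flight while it runs.
    func (p *memoryPartition) exclusiveHold()    { p.mu.Lock() }
    func (p *memoryPartition) exclusiveRelease() { p.mu.Unlock() }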

Slide 49

Reducing overhead for writes
• Batching writes
  ◦ Optimal batch size? Not sure yet.

Slide 50

Lagging writes
• Need to keep more than one writable partition
  ◦ Can’t write to compacted partitions (for now)

Slide 51

Performance?
• Inserts > 1,000,000 rows / sec on my laptop
  ◦ Single partition, 4 writers
  ◦ 10M rows total, 100K unique time series
    • 100 timestamps
    • 100 sources, 1000 metrics per source
  ◦ Writing in 1000-row batches with 1 timestamp, 1 source, 1000 metrics
  ◦ Also written to WAL (final size is ~26 MB)
• Disk is not the bottleneck. Surprising?

Slide 52

What else can we do?
• Tags
  ◦ Maybe an inverted index?
• Another implementation
  ◦ In C++?
• Spinlocks
  ◦ Avoids context switches from Go’s runtime semaphores

Slide 53

Fin.
@PreetamJinka

Slide 54

Helpful resources
The Right Read Optimization is Actually Write Optimization [video]
Building a Time Series Database on MySQL [video]
Lock-Free Programming (or, Juggling Razor Blades) [video]
LevelDB and Node: What is LevelDB Anyway? [article]