Time Series Storage

Designing and implementing a time series storage engine
Preetam Jinka
April 2015 - beCraft

Me
• Math student at UVA
• VividCortex
• GitHub: PreetamJinka
• Twitter: @PreetamJinka

Personal project: Cistern
• Network flow collector
• “Network Behavior Anomaly Detection”
• Monitoring ⇒ time series

[Figure: plots of time series metrics]

The nature of time series data
• It comes in time order.
• It doesn’t change.
• It’s accessed in “bulk.”
• Deletions happen in “bulk.”
How do we store this?

Isn’t this simple?
• Time series is already indexed in time order.
• Writing to the end of a file is fast and efficient.
• Data points do not change, so you only have to write once.
• Reading data is fast because it’s already indexed.

This 24-port switch can easily generate >250 metrics just for the interface counters. A data center might have O(10,000) switches.

The nature of time series data
Adding these characteristics changes everything:
• High cardinality
• Sparse metrics
These rule out solutions like RRDTool or Whisper (Graphite’s storage engine). A file per metric with 100k metrics is an operational nightmare.
What else can we do?

Role of a storage engine
• Data storage and retrieval
• Resource management
• Powerful primitives?

Why not use...
• ...RRDTool? It’s made for time series!
• ...Graphite? It’s made for time series!
• ...{RDBMS}? They’re great for structured data!
• ...Elastic(search)? Inverted indexes!
• ...LevelDB? It’s fast!
• ...B-trees? Everyone uses them!
They don’t really fit this use case. I also don’t want external services to manage when I’m building a small utility.

So, no, you shouldn’t reinvent the wheel. Unless you plan
on learning more about wheels, that is. — http://blog.codinghorror.com/dont-reinvent-the-wheel-unless-you-plan-on-learning-more-about-wheels/ 12

B-trees and LSMs
Two big categories of ordered key-value stores:
• B-trees
  ◦ Generally read optimized
  ◦ Most RDBMSs use B-trees for ordered indexes
  ◦ InnoDB, WiredTiger, TokuDB, LMDB, Bolt, etc.
• Log-structured merge (LSM)
  ◦ Write optimized
  ◦ LevelDB, RocksDB, etc.

[Figure: Copy-on-Write / Append-only B-trees]

Pros and cons of B-trees
• Pros:
  ◦ Generic, flexible
  ◦ COW has great benefits
• Cons:
  ◦ Generic, too many assumptions made
  ◦ Reads before writes
  ◦ Hard to control where data get placed
  ◦ Hard to implement compression on top
  ◦ Hard to support concurrent writers

[Figure: LevelDB]

Pros and cons of LevelDB
• Pros:
  ◦ Write optimized
  ◦ Supports concurrent writers
  ◦ Compression
• Cons:
  ◦ Rewrites unchanged data
  ◦ Tombstoning

New data may come in time order per metric, but we shouldn’t make that assumption!

Read optimization
The fastest way to read from disk is sequentially. We want to minimize expensive seeks. We want to only read what we have to.

Write optimization
The fastest way to write to disk is sequentially. It’s better to write without reading (blind writes). We want to avoid rewriting data that do not change.

Write-ahead logging
• Append updates to the end of a file (a write-ahead log, WAL)
• Provides durability
• Basically a “redo” log during recovery

You can have sequential reads without amplification, sequential writes without
amplification, or an immutable write-once design—pick any two. — Baron Schwartz (http://www.xaprb.com/blog/2015/04/02/state-of-the-storage-engine/) 23

Compression
• Time series data compresses really well
• Fixed, compact structure
• Imagine a metric that has 0s for every timestamp

Putting everything together...
• There are fast generic key-value stores
• There are time series storage engines, but they’re not good for this use case
Can we design something that is both fast and designed for time series? Of course.

Catena
n. A closely linked series.
• Write optimized for fast, concurrent writes
  ◦ Sequential writes like LSM
• Read optimized for indexed reads
  ◦ Uses something similar to a B-tree
• Implemented in Go
  ◦ Still young. ~2 weeks of writing and coding

Basic API

    db := catena.NewDB(...)
    db.InsertRows([...])
    i := db.NewIterator("source", "metric")
    i.Seek(1234)
    i.Next()
    p := i.Point()

[Figure: Preetam’s notebook doodles]

Implementation

Data hierarchy
Did someone say, “B-tree?”

Data are split into partitions of disjoint time ranges. Each partition is independent; nothing is shared.
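Because the ranges are disjoint, finding the partition for a timestamp can be a binary search over the partitions' bounds. The following is only a sketch of that idea: the `partition` struct, its fields, and `find` are hypothetical names, not Catena's actual types.

```go
package main

import (
	"fmt"
	"sort"
)

// partition is a hypothetical descriptor: an inclusive time range plus a
// file name. Catena's real partition type is not shown in the slides.
type partition struct {
	minTS, maxTS int64
	name         string
}

// find returns the partition whose range contains ts, or nil if none does.
// parts must be sorted by time and cover disjoint ranges.
func find(parts []partition, ts int64) *partition {
	// First partition that ends at or after ts.
	i := sort.Search(len(parts), func(i int) bool { return parts[i].maxTS >= ts })
	if i < len(parts) && parts[i].minTS <= ts {
		return &parts[i]
	}
	return nil // ts falls in a gap or outside all partitions
}

func main() {
	parts := []partition{
		{0, 3599, "1.part"},
		{3600, 7199, "2.part"},
		{7200, 10799, "3.wal"},
	}
	if p := find(parts, 4000); p != nil {
		fmt.Println("timestamp 4000 lives in", p.name)
	}
	fmt.Println("found for 99999:", find(parts, 99999) != nil)
}
```

Since nothing is shared between partitions, a lookup like this only needs the small list of ranges, not the data itself.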

Partitions
• Memory
  ◦ Writable
  ◦ Completely in memory, backed by a write-ahead log
  ◦ Precise locks
• Disk
  ◦ Read-only
  ◦ Compressed on disk
  ◦ Memory mapped
  ◦ Lock free

Partitions
• Operational benefits
  ◦ Dropping old partitions is simple
  ◦ Each partition is a separate file
  ◦ Configurable size (1 hour, 1 day, whatever)
  ◦ Configurable retention

Example of files on disk

    $ ls -lh
    ...
    -rw-r--r-- 1 root root 1.7M Apr 8 23:00 19.part
    -rw-r--r-- 1 root root 809K Apr 6 06:00 1.part
    -rw-r--r-- 1 root root 2.8M Apr 9 07:00 20.part
    -rw-r--r-- 1 root root 4.4M Apr 9 11:00 21.part
    -rw-r--r-- 1 root root 2.3M Apr 9 12:00 22.part
    -rw-r--r-- 1 root root 649K Apr 9 14:00 23.part
    -rw-r--r-- 1 root root 3.7M Apr 9 13:59 24.wal
    -rw-r--r-- 1 root root 2.1M Apr 9 15:08 25.wal
    -rw-r--r-- 1 root root 7.0M Apr 6 09:00 2.part
    ...

Writing
• Writes are appended to the write-ahead log (WAL)
• Precise locking allows for high throughput
• Writes to the WAL are serialized
  ◦ Concurrency != parallelism

WAL
• Compressed entries
• Format (little endian):
  ◦ Magic constant (4 bytes)
  ◦ Operation type (1 byte)
  ◦ Number of rows (4 bytes)
  ◦ Compressed buffer of rows. Each row:
    • Source and metric name lengths (4, 4 bytes)
    • Source and metric names (variable)
    • Timestamp and value (8 bytes, 8 bytes)
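The layout above can be sketched as a round-trip encoder and decoder. Several details are assumptions made for illustration and are not taken from the slides: the magic value, the opcode, gzip as the compression for the row buffer, and float64 as the value type. The names `Row`, `encodeEntry`, and `decodeEntry` are hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"math"
)

// Hypothetical values: the slides do not give the real magic or opcodes.
const (
	walMagic uint32 = 0x57414C31
	opInsert byte   = 1
)

// Row is one point. A float64 value is an assumption.
type Row struct {
	Source, Metric string
	Timestamp      int64
	Value          float64
}

func putU32(b *bytes.Buffer, v uint32) {
	var tmp [4]byte
	binary.LittleEndian.PutUint32(tmp[:], v)
	b.Write(tmp[:])
}

func putU64(b *bytes.Buffer, v uint64) {
	var tmp [8]byte
	binary.LittleEndian.PutUint64(tmp[:], v)
	b.Write(tmp[:])
}

// encodeEntry writes magic, operation type, and row count, followed by
// the gzip-compressed rows.
func encodeEntry(rows []Row) ([]byte, error) {
	var raw bytes.Buffer
	for _, r := range rows {
		putU32(&raw, uint32(len(r.Source)))
		putU32(&raw, uint32(len(r.Metric)))
		raw.WriteString(r.Source)
		raw.WriteString(r.Metric)
		putU64(&raw, uint64(r.Timestamp))
		putU64(&raw, math.Float64bits(r.Value))
	}
	var out bytes.Buffer
	putU32(&out, walMagic)
	out.WriteByte(opInsert)
	putU32(&out, uint32(len(rows)))
	zw := gzip.NewWriter(&out)
	if _, err := zw.Write(raw.Bytes()); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

// decodeEntry reverses encodeEntry for a single complete entry.
func decodeEntry(entry []byte) ([]Row, error) {
	if len(entry) < 9 || binary.LittleEndian.Uint32(entry[0:4]) != walMagic {
		return nil, errors.New("bad WAL entry header")
	}
	n := binary.LittleEndian.Uint32(entry[5:9])
	zr, err := gzip.NewReader(bytes.NewReader(entry[9:]))
	if err != nil {
		return nil, err
	}
	raw, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	var rows []Row
	for i := uint32(0); i < n; i++ {
		if len(raw) < 8 {
			return nil, errors.New("truncated row")
		}
		sl := int(binary.LittleEndian.Uint32(raw[0:4]))
		ml := int(binary.LittleEndian.Uint32(raw[4:8]))
		raw = raw[8:]
		if len(raw) < sl+ml+16 {
			return nil, errors.New("truncated row")
		}
		r := Row{Source: string(raw[:sl]), Metric: string(raw[sl : sl+ml])}
		raw = raw[sl+ml:]
		r.Timestamp = int64(binary.LittleEndian.Uint64(raw[0:8]))
		r.Value = math.Float64frombits(binary.LittleEndian.Uint64(raw[8:16]))
		raw = raw[16:]
		rows = append(rows, r)
	}
	return rows, nil
}

func main() {
	entry, err := encodeEntry([]Row{{"host1", "cpu", 1234, 0.5}})
	if err != nil {
		panic(err)
	}
	rows, err := decodeEntry(entry)
	fmt.Println(len(entry), rows, err)
}
```

Note that the listed format has no length field for the compressed buffer, so this sketch decodes one entry at a time from a slice that holds exactly that entry.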

WAL
• WAL writes are buffered until written to disk
  ◦ Single write() call at the end
  ◦ WAL mutex is locked and unlocked around the write()
• Not fsync’d after each append
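A minimal sketch of that buffering pattern, assuming an `io.Writer` stands in for the WAL file: each writer assembles its entry in a private buffer and takes the mutex only for one `Write` call. The `wal` type, `Append`, and `demo` are hypothetical names used for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// wal is a hypothetical stand-in for the log: a writer guarded by a mutex.
type wal struct {
	mu sync.Mutex
	w  io.Writer
}

// Append assembles the whole entry without holding the lock, then takes
// the mutex only around a single Write so entries never interleave.
func (l *wal) Append(parts ...[]byte) error {
	var buf bytes.Buffer
	for _, p := range parts {
		buf.Write(p)
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	_, err := l.w.Write(buf.Bytes())
	return err
}

// demo runs n concurrent appends of the entry "abcd" and returns the log.
func demo(n int) []byte {
	var file bytes.Buffer // stands in for the WAL file
	l := &wal{w: &file}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := l.Append([]byte("ab"), []byte("cd")); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()
	return file.Bytes()
}

func main() {
	out := demo(8)
	fmt.Printf("%d bytes: %s\n", len(out), out)
}
```

As on the slide, nothing here calls fsync, so the sketch trades some durability for throughput.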

WAL recovery
• Create a new memory partition
• Read as many records from the WAL as possible
• Truncate after the last good record
• Start to accept writes and append to the end
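The scan-and-truncate step could look roughly like the sketch below. To keep it short it uses a simplified, hypothetical framing (magic, payload length, payload) rather than the entry format from the WAL slide, and it works on a byte slice instead of a file.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Hypothetical framing for this sketch only:
// magic (4 bytes) + payload length (4 bytes) + payload, little endian.
const magic uint32 = 0xCA7E4A00

// appendRecord frames payload and appends it to the log.
func appendRecord(wal, payload []byte) []byte {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], magic)
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(payload)))
	wal = append(wal, hdr[:]...)
	return append(wal, payload...)
}

// lastGoodOffset reads as many complete records as possible and returns
// their count and the offset just past the last good one.
func lastGoodOffset(wal []byte) (records, offset int) {
	for {
		rest := wal[offset:]
		if len(rest) < 8 || binary.LittleEndian.Uint32(rest[0:4]) != magic {
			return records, offset // missing or corrupt header
		}
		n := binary.LittleEndian.Uint32(rest[4:8])
		if uint64(len(rest)-8) < uint64(n) {
			return records, offset // torn write: payload is incomplete
		}
		offset += 8 + int(n)
		records++
	}
}

func main() {
	wal := appendRecord(nil, []byte("first"))
	wal = appendRecord(wal, []byte("second"))
	wal = appendRecord(wal, []byte("third"))
	wal = wal[:len(wal)-2] // simulate a crash in the middle of the last write

	n, off := lastGoodOffset(wal)
	wal = wal[:off] // with a real file this would be a truncate to off
	fmt.Println("recovered records:", n, "log length after truncate:", len(wal))

	// New writes are appended after the last good record.
	wal = appendRecord(wal, []byte("fourth"))
	n, _ = lastGoodOffset(wal)
	fmt.Println("records after new append:", n)
}
```

Replaying the good records into the new memory partition is left out; the point is only that a torn tail is detected and cut off before new appends.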

Reading: iterators
• Created using a source and metric pair
• Great abstraction
  ◦ Seek to timestamps
  ◦ Do not block writers
  ◦ Stream off a disk
  ◦ Transition across partitions
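To illustrate seeking and the transition across partitions, here is a small in-memory sketch. It is not Catena's iterator: the types are hypothetical, each partition is just a sorted slice of points for one source and metric pair, and `Seek` is assumed to land on the first point at or after the timestamp.

```go
package main

import (
	"fmt"
	"sort"
)

type Point struct {
	Timestamp int64
	Value     float64
}

// Iterator walks one series across partitions. Each inner slice holds one
// partition's points sorted by timestamp; partitions are ordered oldest
// first and cover disjoint time ranges.
type Iterator struct {
	parts [][]Point
	p, i  int // current partition and index within it
}

func NewIterator(parts [][]Point) *Iterator {
	return &Iterator{parts: parts, i: -1}
}

// Seek positions the iterator at the first point with Timestamp >= ts.
func (it *Iterator) Seek(ts int64) bool {
	for p, pts := range it.parts {
		i := sort.Search(len(pts), func(i int) bool { return pts[i].Timestamp >= ts })
		if i < len(pts) {
			it.p, it.i = p, i
			return true
		}
	}
	it.p, it.i = len(it.parts), 0 // past the end
	return false
}

// Next advances one point, moving into the next non-empty partition when
// the current one is exhausted.
func (it *Iterator) Next() bool {
	if it.p >= len(it.parts) {
		return false
	}
	it.i++
	for it.p < len(it.parts) && it.i >= len(it.parts[it.p]) {
		it.p++
		it.i = 0
	}
	return it.p < len(it.parts)
}

// Point returns the current point. Call it only after Seek or Next
// returned true.
func (it *Iterator) Point() Point { return it.parts[it.p][it.i] }

func main() {
	parts := [][]Point{
		{{100, 1}, {200, 2}},
		{{300, 3}, {400, 4}},
	}
	it := NewIterator(parts)
	for ok := it.Seek(150); ok; ok = it.Next() {
		fmt.Println(it.Point())
	}
}
```

A real implementation would also hold a shared "hold" on each partition while reading so the compactor cannot drop it mid-iteration.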

More iterator benefits
• Composable
• Low-overhead reads
• Offer high reader concurrency

Concurrency
• This is what Go is great for, right?
• Catena only launches one goroutine: the compactor
• All operations are thread-safe.

Compactor
• Runs in a single goroutine (for now)
• Drops old partitions
  ◦ Only when there are no readers using that partition
• Compacts partitions

Compacting
• Memory partition is set to read-only mode
• Iterate through sources
  ◦ Iterate through metrics
• Split up points array into smaller extents
• For each extent, remember its offset, compress extent, and write to file
• Write metadata
  ◦ List of sources, metrics, extents, and offsets
  ◦ Also contains timestamps used for seeking, etc.
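The extent-writing loop for a single series might be sketched as follows. The extent size, the `extentMeta` fields, the use of gzip per extent, and the in-memory buffer standing in for the partition file are all assumptions for illustration; the slides do not specify the on-disk layout.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"errors"
	"fmt"
)

type Point struct {
	Timestamp int64
	Value     float64
}

// extentMeta is hypothetical per-extent metadata used later for seeking.
type extentMeta struct {
	Offset  int64 // where the compressed extent starts in the file
	FirstTS int64 // timestamp of the extent's first point
	Count   int   // number of points in the extent
}

// writeExtents splits points into extents of at most perExtent points,
// records each extent's offset, then compresses and writes it. It returns
// the file contents and the metadata list.
func writeExtents(points []Point, perExtent int) ([]byte, []extentMeta, error) {
	if perExtent <= 0 {
		return nil, nil, errors.New("perExtent must be positive")
	}
	var file bytes.Buffer // stands in for the partition file
	var meta []extentMeta
	for start := 0; start < len(points); start += perExtent {
		end := start + perExtent
		if end > len(points) {
			end = len(points)
		}
		extent := points[start:end]
		meta = append(meta, extentMeta{
			Offset:  int64(file.Len()),
			FirstTS: extent[0].Timestamp,
			Count:   len(extent),
		})
		zw := gzip.NewWriter(&file)
		if err := binary.Write(zw, binary.LittleEndian, extent); err != nil {
			return nil, nil, err
		}
		if err := zw.Close(); err != nil {
			return nil, nil, err
		}
	}
	return file.Bytes(), meta, nil
}

func main() {
	points := make([]Point, 10)
	for i := range points {
		points[i] = Point{Timestamp: int64(i * 60), Value: float64(i)}
	}
	file, meta, err := writeExtents(points, 4)
	if err != nil {
		panic(err)
	}
	fmt.Println("file bytes:", len(file))
	for _, m := range meta {
		fmt.Printf("extent at offset %d, first timestamp %d, %d points\n", m.Offset, m.FirstTS, m.Count)
	}
}
```

With the first timestamp and offset of each extent in the metadata, a reader can seek by searching the metadata and decompressing only the extent it needs.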

Compression
• GZIP (entropy encoding)
• Can get down to ~3 bytes per point
  ◦ 8 byte timestamp + 8 byte value

    Timestamp | Value
    ----------+------
    0x001234  | 0
    0x011234  | 0
    0x021234  | 0
    0x031234  | 0
    0x041234  | 0
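A quick way to see why such data compresses well is to gzip a run of points like the ones in the table and compare sizes. This sketch assumes 16-byte little-endian points (int64 timestamp, float64 value); the exact ratio depends on the data, so no figure is claimed here.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
)

type Point struct {
	Timestamp int64
	Value     float64
}

// gzipSize returns the raw and gzip-compressed sizes of points encoded as
// 8-byte timestamp + 8-byte value, little endian.
func gzipSize(points []Point) (raw, packed int, err error) {
	var in bytes.Buffer
	if err = binary.Write(&in, binary.LittleEndian, points); err != nil {
		return 0, 0, err
	}
	var out bytes.Buffer
	zw := gzip.NewWriter(&out)
	if _, err = zw.Write(in.Bytes()); err != nil {
		return 0, 0, err
	}
	if err = zw.Close(); err != nil {
		return 0, 0, err
	}
	return in.Len(), out.Len(), nil
}

func main() {
	// Same shape as the table: evenly spaced timestamps, value always 0.
	points := make([]Point, 1000)
	for i := range points {
		points[i] = Point{Timestamp: 0x001234 + int64(i)<<16, Value: 0}
	}
	raw, packed, err := gzipSize(points)
	if err != nil {
		panic(err)
	}
	fmt.Printf("raw: %d bytes, gzip: %d bytes, %.2f bytes per point\n",
		raw, packed, float64(packed)/float64(len(points)))
}
```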

Compacting
• Compacting happens in the background.
• After compaction, a memory partition is replaced with its newly created disk partition
  ◦ This is an atomic swap in the partition list

Partition list
• It’s an ordered, lock-free linked list.
• Newest partition is first.
• Threads don’t wait.
• The compactor can safely and quickly swap references to partitions.

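Atomics
• Used for metadata
  ◦ Min/max timestamps
  ◦ Counters
• Avoids overhead of locks

    // Lower db.minTimestamp to minTimestampInRows unless another writer
    // has already stored a smaller value.
    min := atomic.LoadInt64(&db.minTimestamp)
    for ; min > minTimestampInRows; min = atomic.LoadInt64(&db.minTimestamp) {
        // The swap fails if the value changed since the load; reload and retry.
        if atomic.CompareAndSwapInt64(&db.minTimestamp, min, minTimestampInRows) {
            break
        }
    }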

Locks
• They’re everywhere.
• Mutexes
  ◦ Used for slices
• RWMutexes
  ◦ Multiple readers or a single writer
  ◦ Used for partitions: shared “holds” and exclusive “holds”
  ◦ Also used for maps in memory partitions to avoid readers blocking each other
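The map case can be sketched with `sync.RWMutex`: readers share the lock, while an insert takes it exclusively. The `memoryPartition` type, its string keys, and `demo` are hypothetical and only illustrate the locking pattern.

```go
package main

import (
	"fmt"
	"sync"
)

// memoryPartition is a hypothetical in-memory partition: a map of series
// guarded by an RWMutex. Keys stand in for source/metric pairs.
type memoryPartition struct {
	mu     sync.RWMutex
	series map[string][]float64
}

// insert takes the exclusive lock because it mutates the map.
func (p *memoryPartition) insert(key string, v float64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.series[key] = append(p.series[key], v)
}

// count takes the shared lock, so concurrent readers do not block each other.
func (p *memoryPartition) count(key string) int {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return len(p.series[key])
}

// demo runs several concurrent writers against one series and returns the
// final number of points.
func demo(writers, perWriter int) int {
	p := &memoryPartition{series: make(map[string][]float64)}
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWriter; i++ {
				p.insert("host1/cpu", float64(i))
				_ = p.count("host1/cpu") // readers interleave with writers
			}
		}()
	}
	wg.Wait()
	return p.count("host1/cpu")
}

func main() {
	fmt.Println("points stored:", demo(4, 1000))
}
```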

Reducing overhead for writes
• Batching writes
  ◦ Optimal batch size? Not sure yet.

Lagging writes
• Need to keep more than one writable partition
  ◦ Can’t write to compacted partitions (for now)

Performance?
• Inserts > 1,000,000 rows / sec on my laptop
  ◦ Single partition, 4 writers
  ◦ 10M rows total, 100K unique time series
    • 100 timestamps
    • 100 sources, 1000 metrics per source
  ◦ Writing in 1000-row batches with 1 timestamp, 1 source, 1000 metrics
  ◦ Also written to WAL (final size is ~26 MB)
• Disk is not the bottleneck. Surprising?

What else can we do?
• Tags
  ◦ Maybe an inverted index?
• Another implementation
  ◦ In C++?
• Spinlocks
  ◦ Avoids context switches from Go’s runtime semaphores

Fin.
@PreetamJinka

Helpful resources
• The Right Read Optimization is Actually Write Optimization [video]
• Building a Time Series Database on MySQL [video]
• Lock-Free Programming (or, Juggling Razor Blades) [video]
• LevelDB and Node: What is LevelDB Anyway? [article]
