Time Series Storage

Design and implementation of a time series storage engine.

Preetam Jinka

April 10, 2015

Transcript

  1. Personal project: Cistern
      • Network flow collector
      • “Network Behavior Anomaly Detection”
      • Monitoring ⇒ time series

  2. The nature of time series data
      • It comes in time order.
      • It doesn’t change.
      • It’s accessed in “bulk.”
      • Deletions happen in “bulk.”
      How do we store this?

  3. Isn’t this simple?
      • Time series data is already indexed in time order.
      • Writing to the end of a file is fast and efficient.
      • Data points do not change, so you only have to write once.
      • Reading data is fast because it’s already indexed.

  4. This 24-port switch can easily generate >250 metrics just for the
      interface counters. A data center might have O(10,000) switches.

  5. The nature of time series data
      Adding these characteristics changes everything:
      • High cardinality
      • Sparse metrics
      These rule out solutions like RRDTool or Whisper (Graphite’s storage
      engine). A file per metric with 100k metrics is an operational
      nightmare. What else can we do?

  6. Why not use...
      • ...RRDTool? It’s made for time series!
      • ...Graphite? It’s made for time series!
      • ...{RDBMS}? They’re great for structured data!
      • ...Elastic(search)? Inverted indexes!
      • ...LevelDB? It’s fast!
      • ...B-trees? Everyone uses them!

  7. Why not use...
      • ...RRDTool? It’s made for time series!
      • ...Graphite? It’s made for time series!
      • ...{RDBMS}? They’re great for structured data!
      • ...Elastic(search)? Inverted indexes!
      • ...LevelDB? It’s fast!
      • ...B-trees? Everyone uses them!
      They don’t really fit this use case. I also don’t want external
      services to manage when I’m building a small utility.

  8. So, no, you shouldn’t reinvent the wheel. Unless you plan on learning
      more about wheels, that is.
      — http://blog.codinghorror.com/dont-reinvent-the-wheel-unless-you-plan-on-learning-more-about-wheels/

  9. B-trees and LSMs
      Two big categories of ordered key-value stores:
      • B-trees
        ◦ Generally read optimized
        ◦ Most RDBMSs use B-trees for ordered indexes
        ◦ InnoDB, WiredTiger, TokuDB, LMDB, Bolt, etc.
      • Log-structured merge (LSM)
        ◦ Write optimized
        ◦ LevelDB, RocksDB, etc.

  10. Pros and cons of B-trees
      • Pros:
        ◦ Generic, flexible
        ◦ Copy-on-write (COW) has great benefits
      • Cons:
        ◦ Generic, too many assumptions made
        ◦ Reads before writes
        ◦ Hard to control where data get placed
        ◦ Hard to implement compression on top
        ◦ Hard to support concurrent writers

  11. Pros and cons of LevelDB
      • Pros:
        ◦ Write optimized
        ◦ Supports concurrent writers
        ◦ Compression
      • Cons:
        ◦ Rewrites unchanged data
        ◦ Tombstoning

  12. (image-only slide)

  13. New data may come in time order per metric, but we shouldn’t make
      that assumption!

  14. Read optimization
      The fastest way to read from disk is sequentially. We want to
      minimize expensive seeks. We want to only read what we have to.

  15. Write optimization
      The fastest way to write to disk is sequentially. It’s better to
      write without reading (blind writes). We want to avoid rewriting
      data that do not change.

  16. Write-ahead logging
      • Append updates to the end of a file (a write-ahead log, WAL)
      • Provides durability
      • Basically a “redo” log during recovery

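      The deck doesn’t show code here, but a minimal sketch of the append-only
      idea in Go looks like this. The WAL type, its method names, and the file
      permissions are illustrative assumptions, not Catena’s actual API; the
      entry encoding is whatever serialized batch of updates you choose.

      package sketch

      import "os"

      type WAL struct {
          f *os.File
      }

      func OpenWAL(path string) (*WAL, error) {
          // O_APPEND makes every Write land at the current end of the file,
          // which keeps the log strictly sequential on disk.
          f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
          if err != nil {
              return nil, err
          }
          return &WAL{f: f}, nil
      }

      // Append writes one already-encoded entry to the end of the log.
      func (w *WAL) Append(entry []byte) error {
          _, err := w.f.Write(entry)
          return err
      }
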
  17. You can have sequential reads without amplification, sequential
      writes without amplification, or an immutable write-once design—pick
      any two.
      — Baron Schwartz (http://www.xaprb.com/blog/2015/04/02/state-of-the-storage-engine/)

  18. Compression
      • Time series data compresses really well
      • Fixed, compact structure
      • Imagine a metric that has 0s for every timestamp

  19. Putting everything together...
      • There are fast generic key-value stores
      • There are time series storage engines, but they’re not good for
        this use case
      Can we design something that is both fast and designed for time
      series? Of course.

  20. Catena n. A closely linked series.
      • Write optimized for fast, concurrent writes
        ◦ Sequential writes like LSM
      • Read optimized for indexed reads
        ◦ Uses something similar to a B-tree
      • Implemented in Go
        ◦ Still young. ~2 weeks of writing and coding

  21. Data are split into partitions of disjoint time ranges. Each
      partition is independent — nothing is shared.

  22. Partitions
      • Memory
        ◦ Writable
        ◦ Completely in memory, backed by a write-ahead log
        ◦ Precise locks
      • Disk
        ◦ Read-only
        ◦ Compressed on disk
        ◦ Memory mapped
        ◦ Lock free

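      As a rough illustration of the two partition kinds, here is a sketch in
      Go. The field layout is an assumption, and the memory mapping uses the
      Unix-only syscall.Mmap; Catena’s actual types are more involved.

      package sketch

      import (
          "os"
          "sync"
          "syscall"
      )

      type point struct {
          Timestamp int64
          Value     float64
      }

      // memoryPartition: writable, entirely in memory, backed by a WAL.
      type memoryPartition struct {
          mu      sync.RWMutex                  // stand-in for the “precise locks”
          wal     *os.File                      // append-only log for durability
          sources map[string]map[string][]point // source -> metric -> points
      }

      // diskPartition: read-only, compressed on disk, memory mapped.
      type diskPartition struct {
          f      *os.File
          mapped []byte // the whole partition file, mapped read-only
      }

      func openDiskPartition(path string) (*diskPartition, error) {
          f, err := os.Open(path)
          if err != nil {
              return nil, err
          }
          info, err := f.Stat()
          if err != nil {
              return nil, err
          }
          data, err := syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
              syscall.PROT_READ, syscall.MAP_SHARED)
          if err != nil {
              return nil, err
          }
          return &diskPartition{f: f, mapped: data}, nil
      }
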
  23. Partitions
      • Operational benefits
        ◦ Dropping old partitions is simple
        ◦ Each partition is a separate file
        ◦ Configurable size (1 hour, 1 day, whatever)
        ◦ Configurable retention

  24. Example of files on disk

      $ ls -lh
      ...
      -rw-r--r-- 1 root root 1.7M Apr 8 23:00 19.part
      -rw-r--r-- 1 root root 809K Apr 6 06:00 1.part
      -rw-r--r-- 1 root root 2.8M Apr 9 07:00 20.part
      -rw-r--r-- 1 root root 4.4M Apr 9 11:00 21.part
      -rw-r--r-- 1 root root 2.3M Apr 9 12:00 22.part
      -rw-r--r-- 1 root root 649K Apr 9 14:00 23.part
      -rw-r--r-- 1 root root 3.7M Apr 9 13:59 24.wal
      -rw-r--r-- 1 root root 2.1M Apr 9 15:08 25.wal
      -rw-r--r-- 1 root root 7.0M Apr 6 09:00 2.part
      ...

  25. Writing
      • Writes are appended to the write-ahead log (WAL)
      • Precise locking allows for high throughput
      • Writes to the WAL are serialized
        ◦ Concurrency != parallelism

  26. WAL
      • Compressed entries
      • Format (little endian):
        ◦ Magic constant (4 bytes)
        ◦ Operation type (1 byte)
        ◦ Number of rows (4 bytes)
        ◦ Compressed buffer of rows. Each row:
          • Source and metric name lengths (4, 4 bytes)
          • Source and metric names (variable)
          • Timestamp and value (8 bytes, 8 bytes)

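      Here is a sketch of encoding one such entry in Go. The magic constant
      value and the codec used for the compressed row buffer (gzip below) are
      assumptions; the deck only says the entries are compressed.

      package sketch

      import (
          "bytes"
          "compress/gzip"
          "encoding/binary"
      )

      type Row struct {
          Source    string
          Metric    string
          Timestamp int64
          Value     float64
      }

      const walMagic uint32 = 0xC47E11A0 // placeholder value

      func encodeWALEntry(op byte, rows []Row) ([]byte, error) {
          // Compress the rows into a buffer.
          var rowBuf bytes.Buffer
          zw := gzip.NewWriter(&rowBuf)
          for _, r := range rows {
              // Source and metric name lengths (4 bytes, 4 bytes).
              binary.Write(zw, binary.LittleEndian, uint32(len(r.Source)))
              binary.Write(zw, binary.LittleEndian, uint32(len(r.Metric)))
              // Source and metric names (variable).
              zw.Write([]byte(r.Source))
              zw.Write([]byte(r.Metric))
              // Timestamp and value (8 bytes, 8 bytes).
              binary.Write(zw, binary.LittleEndian, r.Timestamp)
              binary.Write(zw, binary.LittleEndian, r.Value)
          }
          if err := zw.Close(); err != nil {
              return nil, err
          }

          // Header: magic constant (4), operation type (1), number of rows (4).
          var out bytes.Buffer
          binary.Write(&out, binary.LittleEndian, walMagic)
          out.WriteByte(op)
          binary.Write(&out, binary.LittleEndian, uint32(len(rows)))
          out.Write(rowBuf.Bytes())
          return out.Bytes(), nil
      }
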
  27. WAL
      • WAL writes are buffered until written to disk
        ◦ Single write() call at the end
        ◦ WAL mutex is locked and unlocked around the write()
      • Not fsync’d after each append

  28. WAL recovery
      • Create a new memory partition
      • Read as many records from the WAL as possible
      • Truncate after the last good record
      • Start to accept writes and append to the end

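      A sketch of that loop, as a fragment: it assumes "io" and "os" imports
      and a hypothetical readEntry helper that decodes one WAL entry and
      returns an error at the first bad record.

      // recoverWAL reads as many entries as possible, truncates the file just
      // past the last good record, and leaves it positioned for appends.
      // readEntry is hypothetical; it is not part of any real API here.
      func recoverWAL(f *os.File) error {
          lastGood := int64(0)
          for {
              if _, err := readEntry(f); err != nil {
                  break // first unreadable record: stop here
              }
              // Remember the offset just past the last good record.
              lastGood, _ = f.Seek(0, io.SeekCurrent)
          }
          if err := f.Truncate(lastGood); err != nil {
              return err
          }
          _, err := f.Seek(lastGood, io.SeekStart)
          return err
      }
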
  29. Reading: iterators
      • Created using a source and metric pair
      • Great abstraction
        ◦ Seek to timestamps
        ◦ Do not block writers
        ◦ Stream off a disk
        ◦ Transition across partitions

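      The deck doesn’t show the iterator API, so the method set below is an
      assumption sketched from the bullets above, not Catena’s exact interface.

      package sketch

      // Iterator streams points for one (source, metric) pair in time order.
      type Iterator interface {
          // Seek positions the iterator at the first point whose timestamp
          // is greater than or equal to the given timestamp.
          Seek(timestamp int64) error
          // Next advances to the next point, transitioning across partitions
          // as needed; it returns false when the series is exhausted.
          Next() bool
          // Point returns the current timestamp and value.
          Point() (timestamp int64, value float64)
          // Close releases the iterator’s hold on partitions so the
          // compactor can drop or compact them.
          Close() error
      }
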
  30. Concurrency
      • This is what Go is great for, right?
      • Catena only launches one goroutine: the compactor
      • All operations are thread-safe.

  31. Compactor
      • Runs in a single goroutine (for now)
      • Drops old partitions
        ◦ Only when there are no readers using that partition
      • Compacts partitions

  32. Compacting
      • Memory partition is set to read-only mode
      • Iterate through sources
        ◦ Iterate through metrics
      • Split up each points array into smaller extents
      • For each extent, remember its offset, compress the extent, and
        write it to the file
      • Write metadata
        ◦ List of sources, metrics, extents, and offsets
        ◦ Also contains timestamps used for seeking, etc.

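      A sketch of the extent-writing step in Go: split one metric’s points
      into fixed-size extents, compress each one, and record where it starts
      in the file so reads can seek straight to it. The extent size and the
      types here are assumptions for illustration.

      package sketch

      import (
          "bytes"
          "compress/gzip"
          "encoding/binary"
          "io"
      )

      type Point struct {
          Timestamp int64
          Value     float64
      }

      type extentMeta struct {
          StartTimestamp int64 // used for seeking
          Offset         int64 // where the compressed extent starts on disk
          NumPoints      int
      }

      const extentSize = 30 // points per extent (assumed)

      func writeExtents(w io.WriteSeeker, points []Point) ([]extentMeta, error) {
          var metas []extentMeta
          for start := 0; start < len(points); start += extentSize {
              end := start + extentSize
              if end > len(points) {
                  end = len(points)
              }
              extent := points[start:end]

              // Remember where this extent begins before writing it.
              offset, err := w.Seek(0, io.SeekCurrent)
              if err != nil {
                  return nil, err
              }

              // Compress the extent and append it to the file.
              var buf bytes.Buffer
              zw := gzip.NewWriter(&buf)
              for _, p := range extent {
                  binary.Write(zw, binary.LittleEndian, p.Timestamp)
                  binary.Write(zw, binary.LittleEndian, p.Value)
              }
              if err := zw.Close(); err != nil {
                  return nil, err
              }
              if _, err := w.Write(buf.Bytes()); err != nil {
                  return nil, err
              }

              metas = append(metas, extentMeta{
                  StartTimestamp: extent[0].Timestamp,
                  Offset:         offset,
                  NumPoints:      len(extent),
              })
          }
          return metas, nil
      }
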
  33. Compression
      • GZIP (entropy encoding)
      • Can get down to ~3 bytes per point
        ◦ Uncompressed: 8-byte timestamp + 8-byte value

      Timestamp | Value
      -----------------
      0x001234  | 0
      0x011234  | 0
      0x021234  | 0
      0x031234  | 0
      0x041234  | 0

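      That claim is easy to sanity-check. The self-contained snippet below
      gzips a run of (timestamp, 0.0) points, 16 raw bytes each, and prints
      the compressed size per point; the exact ratio depends on the data, so
      treat the output as illustrative.

      package main

      import (
          "bytes"
          "compress/gzip"
          "encoding/binary"
          "fmt"
      )

      func main() {
          const n = 1000
          var buf bytes.Buffer
          zw := gzip.NewWriter(&buf)
          for i := 0; i < n; i++ {
              ts := int64(0x001234 + i*0x010000) // regularly spaced timestamps
              binary.Write(zw, binary.LittleEndian, ts)
              binary.Write(zw, binary.LittleEndian, float64(0)) // constant value
          }
          zw.Close()

          fmt.Printf("raw: 16 bytes/point, compressed: %.2f bytes/point\n",
              float64(buf.Len())/n)
      }
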
  34. Compacting
      • Compacting happens in the background.
      • After compaction, a memory partition gets replaced with its newly
        created disk partition
        ◦ This is an atomic swap in the partition list

  35. Partition list
      • It’s an ordered, lock-free linked list.
      • Newest partition is first.
      • Threads don’t wait.
      • The compactor can safely and quickly swap references to partitions.

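      A sketch of the lock-free prepend in Go, using a compare-and-swap on
      the head pointer. The types are assumptions, and this shows only the
      insert path; the real list also removes partitions and swaps a memory
      partition for its compacted disk partition.

      package sketch

      import (
          "sync/atomic"
          "unsafe"
      )

      type Partition interface {
          MinTimestamp() int64
      }

      type listNode struct {
          partition Partition
          next      unsafe.Pointer // *listNode
      }

      type partitionList struct {
          head unsafe.Pointer // *listNode; newest partition is first
      }

      // Insert prepends a partition without taking any locks. Readers that
      // already loaded the old head keep a consistent view of the list, so
      // they never wait on writers.
      func (l *partitionList) Insert(p Partition) {
          node := &listNode{partition: p}
          for {
              head := atomic.LoadPointer(&l.head)
              node.next = head
              if atomic.CompareAndSwapPointer(&l.head, head, unsafe.Pointer(node)) {
                  return
              }
              // Another writer won the race; retry against the new head.
          }
      }
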
  36. Atomics
      • Used for metadata
        ◦ Min/max timestamps
        ◦ Counters
      • Avoids overhead of locks

      // Lower db.minTimestamp to minTimestampInRows without taking a lock:
      // retry the compare-and-swap until it succeeds, or until another
      // writer has already stored an even smaller minimum.
      min := atomic.LoadInt64(&db.minTimestamp)
      for ; min > minTimestampInRows; min = atomic.LoadInt64(&db.minTimestamp) {
          if atomic.CompareAndSwapInt64(&db.minTimestamp, min, minTimestampInRows) {
              break
          }
      }

  37. Locks
      • They’re everywhere.
      • Mutexes
        ◦ Used for slices
      • RWMutexes
        ◦ Multiple readers or a single writer
        ◦ Used for partitions: shared “holds” and exclusive “holds”
        ◦ Also used for maps in memory partitions to avoid readers
          blocking each other

  38. Lagging writes
      • Need to keep more than one writable partition
        ◦ Can’t write to compacted partitions (for now)

  39. Performance?
      • Inserts > 1,000,000 rows/sec on my laptop
        ◦ Single partition, 4 writers
        ◦ 10M rows total, 100K unique time series
          • 100 timestamps
          • 100 sources, 1000 metrics per source
        ◦ Writing in 1000-row batches with 1 timestamp, 1 source,
          1000 metrics
        ◦ Also written to WAL (final size is ~26 MB)
      • Disk is not the bottleneck. Surprising?

  40. What else can we do?
      • Tags
        ◦ Maybe an inverted index?
      • Another implementation
        ◦ In C++?
      • Spinlocks
        ◦ Avoids context switches from Go’s runtime semaphores

  41. Helpful resources
      • The Right Read Optimization is Actually Write Optimization [video]
      • Building a Time Series Database on MySQL [video]
      • Lock-Free Programming (or, Juggling Razor Blades) [video]
      • LevelDB and Node: What is LevelDB Anyway? [article]