Time series data is the worst and best use case in distributed databases

My talk from dotScale 2015 in Paris: some lessons we've learned building InfluxDB, a distributed time series database.


Paul Dix

June 08, 2015

Transcript

  1. Time series data is the worst and best use case

    in distributed databases Paul Dix CEO @InfluxDB @pauldix paul@influxdb.com
  2. What is time series data?

  3. Stock trades and quotes

  4. Metrics

  5. Analytics

  6. Events

  7. Sensor data

  8. Two kinds of time series data…

  9. Regular time series

    Samples at regular intervals
  10. Irregular time series

    Events whenever they come in
  11. Inducing a regular time series from an irregular one query:

    select count(customer_id) from events where time > now() - 1h group by time(1m), customer_id
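
    The same bucketing can be sketched in a few lines of Go (hypothetical code, not InfluxDB internals): snapping each irregular event timestamp to a fixed 1-minute boundary induces the regular series the query above produces.

      package main

      import (
          "fmt"
          "time"
      )

      type event struct {
          customerID string
          ts         time.Time
      }

      func main() {
          now := time.Now()
          events := []event{
              {"c1", now.Add(-90 * time.Second)},
              {"c1", now.Add(-70 * time.Second)},
              {"c2", now.Add(-30 * time.Second)},
          }

          // counts[window][customerID] = events in that 1-minute bucket,
          // mirroring group by time(1m), customer_id
          counts := map[time.Time]map[string]int{}
          for _, e := range events {
              w := e.ts.Truncate(time.Minute) // snap to a regular 1m boundary
              if counts[w] == nil {
                  counts[w] = map[string]int{}
              }
              counts[w][e.customerID]++
          }
          fmt.Println(counts)
      }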
  12. Data that you ask questions about over time

  13. None
  14. 1. Databases

  15. 2. Distributed Systems

  16. Access properties suck for databases

  17. High write throughput

  18. Example from DevOps

    • 2,000 servers, VMs, containers, or sensor units
    • 200 measurements per server/unit
    • every 10 seconds
    • = 3,456,000,000 distinct points per day
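
    The arithmetic behind that last bullet checks out; a one-line verification in Go:

      package main

      import "fmt"

      func main() {
          const (
              units           = 2000 // servers, VMs, containers, or sensor units
              measurements    = 200  // per server/unit
              intervalSeconds = 10   // one sample every 10 seconds
              secondsPerDay   = 24 * 60 * 60
              samplesPerDay   = secondsPerDay / intervalSeconds // 8,640 intervals/day
          )
          fmt.Println(int64(units) * measurements * samplesPerDay) // 3456000000
      }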
  19. Use LSM Tree, optimized for writes!

  20. Even higher read throughput

  21. Aggregation and downsampling

  22. Queries for dashboards

  23. Queries for monitoring systems

  24. LSM Tree optimized for writes

  25. Use COW B+Tree, it’s optimized for reads!

  26. Write throughput goes to hell

  27. No compression

  28. Large scale deletes

  29. Aggregate, down-sample and phase out raw data

  30. If clearing out point-by-point, # of deletes = # of writes
  31. LSM Tree deletes are wildly expensive

  32. COW B+Tree deletes are expensive if we want to reclaim disk

  33. No perfect storage engine for these properties

  34. Time series data + databases = great sadness

  35. Access properties suck for distributed systems

  36. Range scans of many keys

  37. series: cpu region=uswest, host=serverA

  38. series: cpu region=uswest, host=serverA

    query: select max(value) from cpu where time > now() - 6h group by time(5m)
  39. series: cpu region=uswest, host=serverA

    query: select max(value) from cpu where region = ‘uswest’ AND time > now() - 6h group by time(5m)
    Series from all hosts in uswest merged into one
  40. How to distribute the data?

  41. By measurement? cpu

  42. By measurement? cpu BOTTLENECK

  43. By measurement + tags? cpu region=uswest, host=serverA

  44. By measurement + tags? cpu region=uswest, host=serverA SERIES GROWS INDEFINITELY

  45. By measurement + tags, time? cpu region=uswest, host=serverA, time

  46. By measurement + tags, time? cpu region=uswest, host=serverA, time

    WHICH TIMES/KEYS EXIST?
  47. By measurement + tags, time? cpu region=uswest, host=serverA, time

    NO DATA LOCALITY
  48. High throughput

  49. CAP Theorem

  50. CAP Theorem C: Consistency

  51. CAP Theorem C: Consistency A: Availability

  52. CAP Theorem C: Consistency A: Availability P: In the face

    of Partitions
  53. Pick either C or A

  54. P is happening whether you have perfect network hardware or

    not
  55. Pauses under load look like partitions

  56. High throughput = load

  57. Consistency under high write throughput

  58. Time series queries do range scans of recent data that

    is always moving
  59. Some sensors sample many times per second

  60. Event streams can be even more frequent

  61. Consistent view?

  62. None
  63. Time series data + distributed systems = great sadness

  64. but…

  65. Time series data has great properties for databases

  66. No updates

  67. Large ranges cold for writes

  68. Immutable data structures and files

  69. Like LSM, but more specific

  70. Deletes mostly against ranges of old data

  71. We partition data by ranges of time, e.g. all data for a day or an hour together
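
    A minimal sketch of that partitioning (hypothetical, not InfluxDB's actual code): truncate each point's timestamp to the partition duration, so every point from the same day maps to the same shard group.

      package main

      import (
          "fmt"
          "time"
      )

      // shardKey truncates a timestamp to the partition duration,
      // so all points within one day share a key.
      func shardKey(ts time.Time, partition time.Duration) time.Time {
          return ts.Truncate(partition)
      }

      func main() {
          day := 24 * time.Hour
          t1 := time.Date(2015, 6, 8, 3, 15, 0, 0, time.UTC)
          t2 := time.Date(2015, 6, 8, 22, 40, 0, 0, time.UTC)
          fmt.Println(shardKey(t1, day) == shardKey(t2, day)) // true: same day, same partition
      }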
  72. Drop entire files

  73. Tombstone the one-offs

  74. New storage engine

  75. Great properties for distributed systems

  76. No updates

  77. Large scale deletes on cold areas of keyspace

  78. Perfect for an AP system

  79. Conflict resolution made easy, i.e. no updates = no contention
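
    A toy illustration of why (hypothetical code): with append-only, immutable points, merging two divergent replicas is just a set union keyed by timestamp, since the same key can never carry two different values.

      package main

      import "fmt"

      func merge(a, b map[int64]float64) map[int64]float64 {
          out := make(map[int64]float64, len(a)+len(b))
          for ts, v := range a {
              out[ts] = v
          }
          for ts, v := range b {
              out[ts] = v // same timestamp implies the same immutable value
          }
          return out
      }

      func main() {
          replicaA := map[int64]float64{1: 0.5, 2: 0.7}
          replicaB := map[int64]float64{2: 0.7, 3: 0.9}
          fmt.Println(merge(replicaA, replicaB)) // map[1:0.5 2:0.7 3:0.9]
      }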

  80. Partition key space by ranges of time, i.e. old data vs. new
  81. Old data generally doesn’t change

  82. Consistent view on new data is the union

  83. Deletes against ranges that are cold for writes and queries

  84. Cluster growth to increase storage capacity doesn’t require rebalancing

  85. Data locality, i.e. how we ship the code to where the data lives when scanning large ranges of data
  86. Evenly distribute across cluster, per day

    cpu region=uswest, host=serverA → Shard 1
    cpu region=uswest, host=serverB → Shard 1
    cpu region=useast, host=serverC → Shard 2
    cpu region=useast, host=serverD → Shard 2
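
    One way to get that even spread (a hypothetical sketch; the real assignment may differ): hash the series key to pick a shard within the day's shard group.

      package main

      import (
          "fmt"
          "hash/fnv"
      )

      // shardFor hashes a series key onto one of numShards shards.
      func shardFor(seriesKey string, numShards int) int {
          h := fnv.New32a()
          h.Write([]byte(seriesKey))
          return int(h.Sum32() % uint32(numShards))
      }

      func main() {
          for _, key := range []string{
              "cpu,region=uswest,host=serverA",
              "cpu,region=uswest,host=serverB",
              "cpu,region=useast,host=serverC",
              "cpu,region=useast,host=serverD",
          } {
              fmt.Println(key, "-> shard", shardFor(key, 2))
          }
      }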
  87. Each shard lives on a server; the # of replicas determines how many servers hold a copy

  88. Hits one shard

    query: select mean(value) from cpu where region = ‘uswest’ AND host = ‘serverB’ AND time > now() - 6h group by time(5m)
  89. Decompose into map/reduce job

    query: select mean(value) from cpu where region = ‘uswest’ AND time > now() - 6h group by time(5m)
    Many series match these criteria, many shards to query
  90. func MapMean(itr Iterator) interface{} {
          out := &meanMapOutput{}
          // Streaming mean: fold each value into the running mean
          // without keeping the raw points around.
          for _, k, v := itr.Next(); k != 0; _, k, v = itr.Next() {
              out.Count++
              out.Mean += (v.(float64) - out.Mean) / float64(out.Count)
          }
          if out.Count > 0 {
              return out
          }
          return nil
      }
  91. func ReduceMean(values []interface{}) interface{} {
          out := &meanMapOutput{}
          var countSum int
          // Merge per-shard partial means as a weighted average,
          // weighting each partial by its share of the combined count.
          for _, v := range values {
              if v == nil {
                  continue
              }
              val := v.(*meanMapOutput)
              countSum = out.Count + val.Count
              out.Mean = val.Mean*(float64(val.Count)/float64(countSum)) +
                  out.Mean*(float64(out.Count)/float64(countSum))
              out.Count = countSum
          }
          if out.Count > 0 {
              return out.Mean
          }
          return nil
      }
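
    For context, a runnable sketch of the supporting struct the two functions above assume (a simplified stand-in; the real definitions live in the InfluxDB source), showing the weighted merge on two per-shard partials:

      package main

      import "fmt"

      // Simplified stand-in for the map/reduce output type above.
      type meanMapOutput struct {
          Count int
          Mean  float64
      }

      func main() {
          // Two partial means from two shards.
          partials := []*meanMapOutput{
              {Count: 4, Mean: 10},
              {Count: 6, Mean: 20},
          }

          // Weighted combine, exactly as ReduceMean does it.
          out := &meanMapOutput{}
          for _, val := range partials {
              countSum := out.Count + val.Count
              out.Mean = val.Mean*(float64(val.Count)/float64(countSum)) +
                  out.Mean*(float64(out.Count)/float64(countSum))
              out.Count = countSum
          }
          fmt.Println(out.Mean) // (4*10 + 6*20) / 10 = 16
      }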
  92. We transmit only the summary ticks across the cluster, one per 5-minute interval
  93. there will be more…

  94. Time series data has odd workloads

  95. High write and read throughput

  96. Append/insert only

  97. Deletes against large ranges

  98. Horrible and great for distributed databases

  99. Thank you. Paul Dix paul@influxdb.com @pauldix