Paul Dix
June 08, 2015

# Time series data is the worst and best use case in distributed databases

My talk from dotScale 2015 in Paris: some lessons we've learned building InfluxDB, a distributed time series database.


## Transcript

1. ### Time series data is the worst and best use case in distributed databases

Paul Dix, CEO @InfluxDB, @pauldix, [email protected]

9. ### Regular time series

Samples at regular intervals (t0, t1, t2, …)
10. ### Irregular time series

Events whenever they come in
11. ### Inducing a regular time series from an irregular one

query: `select count(customer_id) from events where time > now() - 1h group by time(1m), customer_id`


18. ### Example from DevOps

- 2,000 servers, VMs, containers, or sensor units
- 200 measurements per server/unit
- every 10 seconds
- = 3,456,000,000 distinct points per day

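
The arithmetic behind that number checks out: 86,400 seconds per day at one sample every 10 seconds gives 8,640 samples per series per day, times 2,000 units times 200 measurements. A quick sketch:

```go
package main

import "fmt"

func main() {
	const (
		servers             = 2000 // servers, VMs, containers, or sensor units
		measurementsPerUnit = 200  // measurements per server/unit
		intervalSeconds     = 10   // one sample every 10 seconds
		secondsPerDay       = 24 * 60 * 60
	)
	samplesPerDay := secondsPerDay / intervalSeconds // 8,640 samples per series per day
	points := servers * measurementsPerUnit * samplesPerDay
	fmt.Println(points) // 3456000000
}
```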

38. ### series: cpu region=uswest, host=serverA

query: `select max(value) from cpu where time > now() - 6h group by time(5m)`
39. ### series: cpu region=uswest, host=serverA

query: `select max(value) from cpu where region = 'uswest' AND time > now() - 6h group by time(5m)`

Series from all hosts in uswest merged into one

46. ### By measurement + tags, time?

cpu region=uswest, host=serverA, time. Which times/keys exist?
47. ### By measurement + tags, time?

cpu region=uswest, host=serverA, time. No data locality.

52. ### CAP Theorem

C: Consistency, A: Availability, P: tolerance in the face of Partitions


58. ### Time series queries do range scans of recent data that is always moving


71. ### We partition data by ranges of time

e.g. all data for a day or hour together


85. ### Data locality

i.e. how we ship the code to where the data lives when scanning large ranges of data
86. ### Evenly distribute across cluster, per day

cpu region=uswest, host=serverA → Shard 1
cpu region=uswest, host=serverB → Shard 1
cpu region=useast, host=serverC → Shard 2
cpu region=useast, host=serverD → Shard 2
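
One common way to spread series across shards within a time window is to hash the series key. A minimal sketch using an FNV hash as a stand-in; the talk does not specify the actual placement strategy, and `shardFor` is a hypothetical name:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor deterministically picks a shard for a series key within one
// time window by hashing the key.
func shardFor(seriesKey string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(seriesKey))
	return int(h.Sum32()) % numShards
}

func main() {
	for _, key := range []string{
		"cpu,region=uswest,host=serverA",
		"cpu,region=uswest,host=serverB",
		"cpu,region=useast,host=serverC",
		"cpu,region=useast,host=serverD",
	} {
		fmt.Printf("%s -> shard %d\n", key, shardFor(key, 2))
	}
}
```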

88. ### Hits one shard

query: `select mean(value) from cpu where region = 'uswest' AND host = 'serverB' AND time > now() - 6h group by time(5m)`
89. ### Decompose into map/reduce job

query: `select mean(value) from cpu where region = 'uswest' AND time > now() - 6h group by time(5m)`

Many series match these criteria, many shards to query
90. ### MapMean

```go
// MapMean computes a shard-local running mean over the iterator's values.
func MapMean(itr Iterator) interface{} {
	out := &meanMapOutput{}
	for _, k, v := itr.Next(); k != 0; _, k, v = itr.Next() {
		out.Count++
		out.Mean += (v.(float64) - out.Mean) / float64(out.Count)
	}
	if out.Count > 0 {
		return out
	}
	return nil
}
```
91. ### ReduceMean

```go
// ReduceMean merges shard-local means into a global mean, weighting
// each partial mean by the number of points behind it.
func ReduceMean(values []interface{}) interface{} {
	out := &meanMapOutput{}
	var countSum int
	for _, v := range values {
		if v == nil {
			continue
		}
		val := v.(*meanMapOutput)
		countSum = out.Count + val.Count
		out.Mean = val.Mean*(float64(val.Count)/float64(countSum)) +
			out.Mean*(float64(out.Count)/float64(countSum))
		out.Count = countSum
	}
	if out.Count > 0 {
		return out.Mean
	}
	return nil
}
```
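
The map/reduce pair above can be exercised end to end with a simplified, self-contained sketch that uses concrete types in place of the `interface{}` plumbing (the `partial`, `mapMean`, and `reduceMean` names are hypothetical):

```go
package main

import "fmt"

// partial is one shard's contribution to a mean: a running mean and
// the number of points behind it.
type partial struct {
	Mean  float64
	Count int
}

// mapMean computes a shard-local mean incrementally, like MapMean.
func mapMean(values []float64) partial {
	var out partial
	for _, v := range values {
		out.Count++
		out.Mean += (v - out.Mean) / float64(out.Count)
	}
	return out
}

// reduceMean merges partial means into a global mean, weighting each
// by its point count, like ReduceMean.
func reduceMean(parts []partial) float64 {
	var out partial
	for _, p := range parts {
		if p.Count == 0 {
			continue
		}
		total := out.Count + p.Count
		out.Mean = p.Mean*(float64(p.Count)/float64(total)) +
			out.Mean*(float64(out.Count)/float64(total))
		out.Count = total
	}
	return out.Mean
}

func main() {
	shardA := mapMean([]float64{1, 2, 3}) // mean 2 over 3 points
	shardB := mapMean([]float64{10})      // mean 10 over 1 point
	// Combined mean of {1, 2, 3, 10} is 16/4 = 4.
	fmt.Println(reduceMean([]partial{shardA, shardB})) // 4
}
```

Because each shard only ships a `(mean, count)` pair, the merge is exact while transmitting a constant amount of data per shard per interval.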
92. ### We only transmit the summary ticks across the cluster

one per 5-minute interval