Time series data is the worst and best use case in distributed databases

Paul Dix
June 08, 2015

My talk from the 2015 dotScale conference in Paris: some lessons we've learned building InfluxDB, a distributed time series database.

Transcript

  1. Time series data is the worst and best use case in distributed databases. Paul Dix, CEO @InfluxDB, @pauldix, paul@influxdb.com
  2. Regular time series (timeline t0 through t7): samples arrive at regular intervals.
  3. Irregular time series (timeline t0 through t7): events arrive whenever they come in.
  4. Inducing a regular time series from an irregular one. Query: select count(customer_id) from events where time > now() - 1h group by time(1m), customer_id
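
     A minimal Go sketch of what that query computes (the event type and countsPerMinute are invented names, not the database's code): the group by time(1m) bucket rolls arbitrarily timed events up into one count per customer per minute, which is a regular series.

        package sketch

        import "time"

        // event is a hypothetical stand-in for one irregularly timed row
        // in the events measurement.
        type event struct {
            customerID string
            t          time.Time
        }

        // countsPerMinute rolls irregular events up into a regular series:
        // one count per customer per 1-minute bucket, mirroring
        // "group by time(1m), customer_id".
        func countsPerMinute(events []event) map[string]map[time.Time]int {
            counts := map[string]map[time.Time]int{}
            for _, e := range events {
                bucket := e.t.Truncate(time.Minute) // snap to the 1m interval
                if counts[e.customerID] == nil {
                    counts[e.customerID] = map[time.Time]int{}
                }
                counts[e.customerID][bucket]++
            }
            return counts
        }
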
  5. Example from DevOps: 2,000 servers, VMs, containers, or sensor units • 200 measurements per server/unit • every 10 seconds • = 3,456,000,000 distinct points per day
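
     That figure is the multiplication written out: a 10-second interval gives 8,640 samples per day (86,400 s ÷ 10 s), and 2,000 units × 200 measurements × 8,640 samples = 3,456,000,000 points per day.
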
  6. series: cpu region=uswest, host=serverA. Query: select max(value) from cpu where region = 'uswest' AND time > now() - 6h group by time(5m). Series from all hosts in uswest are merged into one.
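
     A rough sketch of that merge (the point type and mergeSeries are invented names, not InfluxDB's internals): the already time-ordered points of every series matching the tag filter are combined into one time-ordered stream, which the max(value) ... group by time(5m) aggregation then consumes.

        package sketch

        import (
            "sort"
            "time"
        )

        // point is a hypothetical timestamp/value pair from a single series.
        type point struct {
            t     time.Time
            value float64
        }

        // mergeSeries combines the already time-ordered points of several
        // series into one time-ordered stream. A real engine would use a
        // k-way heap merge; sorting the concatenation keeps the sketch short.
        func mergeSeries(series [][]point) []point {
            var merged []point
            for _, pts := range series {
                merged = append(merged, pts...)
            }
            sort.Slice(merged, func(i, j int) bool {
                return merged[i].t.Before(merged[j].t)
            })
            return merged
        }
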
  7. Data locality, i.e. how we ship the code to where the data lives when scanning large ranges of data.
  8. Evenly distribute across cluster, per day:
     cpu region=uswest, host=serverA → Shard 1
     cpu region=uswest, host=serverB → Shard 1
     cpu region=useast, host=serverC → Shard 2
     cpu region=useast, host=serverD → Shard 2
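
     One way to picture that placement, as a sketch rather than InfluxDB's actual sharding code (shardFor and shardsPerDay are invented names, and the shard numbers it prints will not necessarily match the slide): the UTC day selects the shard group, and a hash of the series key selects a shard inside it. A query that pins down a single series, like the one on the next slide, therefore touches only one shard per day.

        package main

        import (
            "fmt"
            "hash/fnv"
            "time"
        )

        // shardFor assigns a series to a shard: the UTC day selects the
        // shard group, and a hash of the series key selects a shard inside
        // that group.
        func shardFor(seriesKey string, t time.Time, shardsPerDay int) string {
            day := t.UTC().Truncate(24 * time.Hour)
            h := fnv.New32a()
            h.Write([]byte(seriesKey))
            shard := h.Sum32() % uint32(shardsPerDay)
            return fmt.Sprintf("%s/shard-%d", day.Format("2006-01-02"), shard)
        }

        func main() {
            now := time.Now()
            for _, key := range []string{
                "cpu,region=uswest,host=serverA",
                "cpu,region=uswest,host=serverB",
                "cpu,region=useast,host=serverC",
                "cpu,region=useast,host=serverD",
            } {
                fmt.Println(key, "->", shardFor(key, now, 2))
            }
        }
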
  9. Hits one shard. Query: select mean(value) from cpu where region = 'uswest' AND host = 'serverB' AND time > now() - 6h group by time(5m)
  10. Decompose into a map/reduce job. Query: select mean(value) from cpu where region = 'uswest' AND time > now() - 6h group by time(5m). Many series match these criteria, many shards to query.
  11. func MapMean(itr Iterator) interface{} {
          out := &meanMapOutput{}
          // Walk every point the iterator yields for this shard and time
          // interval, updating a running (incremental) mean and a point
          // count; a zero key signals the end of the iterator.
          for _, k, v := itr.Next(); k != 0; _, k, v = itr.Next() {
              out.Count++
              out.Mean += (v.(float64) - out.Mean) / float64(out.Count)
          }
          // Return the partial result only if the shard had any points.
          if out.Count > 0 {
              return out
          }
          return nil
      }
  12. func ReduceMean(values []interface{}) interface{} {
          out := &meanMapOutput{}
          var countSum int
          // Fold the partial results from every shard into one overall
          // mean, weighting each partial mean by its point count.
          for _, v := range values {
              if v == nil {
                  continue
              }
              val := v.(*meanMapOutput)
              countSum = out.Count + val.Count
              out.Mean = val.Mean*(float64(val.Count)/float64(countSum)) +
                  out.Mean*(float64(out.Count)/float64(countSum))
              out.Count = countSum
          }
          if out.Count > 0 {
              return out.Mean
          }
          return nil
      }
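
     Tying slides 11 and 12 together: the map function runs next to each shard's data and the reduce step combines the partial results on the query coordinator. The driver below is a hypothetical sketch (meanAcrossShards is an invented name; Iterator, MapMean, and ReduceMean are the ones shown above), not the actual query engine.

        // meanAcrossShards is a hypothetical driver: it runs MapMean against
        // each shard's iterator and combines the partial results with
        // ReduceMean.
        func meanAcrossShards(shards []Iterator) interface{} {
            partials := make([]interface{}, 0, len(shards))
            for _, itr := range shards {
                // In the real system this call is shipped to the node that
                // owns the shard (data locality); here it runs in-process.
                partials = append(partials, MapMean(itr))
            }
            return ReduceMean(partials)
        }

     Because each map call sends back only a count and a running mean, the data moved across the network stays tiny no matter how many points each shard scans.
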