The Internals of InfluxDB

Paul Dix
April 04, 2014

Talk on the underlying implementation details of InfluxDB, given in San Francisco the week of 04/04/14.

Transcript

  1. The Internals of InfluxDB Paul Dix @pauldix paul@influxdb.com

  2. An open source distributed time series, metrics, and events database. Like, for analytics.

  3. Basic Concepts

  4. Data model • Databases • Time series (or tables, but you can have millions) • Points (or rows, but column oriented)

  5. HTTP API (writes) curl -X POST 'http://localhost:8086/db/mydb/series?u=paul&p=pass' -d '[{"name":"foo", "columns":["val"], "points": [[3]]}]'

  6. HTTP API (queries) curl 'http://localhost:8086/db/mydb/series?u=paul&p=pass&q=.'

  7. Data (with timestamp) [ { "name": "cpu", "columns": ["time", "value", "host"], "points": [ [1395168540, 56.7, "foo.influxdb.com"], [1395168540, 43.9, "bar.influxdb.com"] ] } ]

  8. Data (auto-timestamp) [ { "name": "events", "columns": ["type", "email"], "points": [ ["signup", "paul@influxdb.com"], ["paid", "foo@asdf.com"] ] } ]

  9. SQL-ish select * from events where time > now() - 1h

  10. Aggregates select percentile(90, value) from response_times group by time(10m) where time > now() - 1d

  11. Select from Regex select * from /stats\.cpu\..*/ limit 1

  12. Where against Regex select value from application_logs where value =~ /.*ERROR.*/ and time > "2014-03-01" and time < "2014-03-03"

  13. Continuous queries (fan out) select * from events into events.[user_id]

  14. Continuous queries (summaries) select count(page_id) from events group by time(1h), page_id into events.[page_id]

  15. Continuous queries (regex downsampling) select max(value), context from /stats\.*/ group by time(5m) into max.:series_name

  16. Under the hood

  17. Queries - Parsing • Lex • Flex • Lexer • Yacc • Bison • Parser generator

  18. Queries - Processing • Totally custom • Hand code each aggregate function, conditional, etc.

  19. How data is organized

  20. Shard type Shard struct { Id uint32 StartTime time.Time EndTime time.Time ServerIds []uint32 }

  21. Shard is a contiguous block of time

  22. Data for all series for a given interval exists in a shard* *by default, but can be modified

  23. Every server knows about every shard

  24. Query Pipeline (diagram) select count(value) from foo group by time(5m): 1. the client queries Server A, 2. Server A queries the shards (each on another server or in local LevelDB), 3. each shard computes locally, 4. results stream back to Server A, 5. Server A collates and returns to the client.

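As a rough Go sketch of why step 3 above (and slide 25) matters: when the group-by interval fits inside a shard, each shard can build its buckets locally and the coordinator only merges per-bucket partials. The function and data below are illustrative, not code from the server.

    package main

    import "fmt"

    // mergeCounts combines per-shard partial counts keyed by bucket start time;
    // in the local case this is all the coordinating server has to do.
    func mergeCounts(partials []map[int64]int64) map[int64]int64 {
        total := make(map[int64]int64)
        for _, p := range partials {
            for bucket, n := range p {
                total[bucket] += n
            }
        }
        return total
    }

    func main() {
        shardA := map[int64]int64{1396368000: 12, 1396368300: 7}
        shardB := map[int64]int64{1396368000: 3}
        fmt.Println(mergeCounts([]map[int64]int64{shardA, shardB}))
    }
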
  25. Group by times less than shard size gives locality

  26. Query Pipeline, no locality (diagram) select percentile(90, value) from response_times: 1. the client queries Server A, 2. Server A queries the shards (each on another server or in local LevelDB), 3. shards stream raw points back, 4. Server A collates, computes, and returns to the client.

  27. serverA could buffer all results in memory *Able to limit number of shards to query in parallel

  28. Evaluate the query and only hit the shards we need.

  29. Query Pipeline (diagram) select count(value) from foo where time > now() - 10m: 1. the client queries Server A, 2. Server A queries only the one matching shard (server or local LevelDB), 3. the shard computes locally, 4. the result streams back, 5. Server A collates and returns to the client.

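A small Go sketch of the shard pruning slide 28 describes, reusing the Shard struct from slide 20. shardsForRange is a hypothetical helper name; the overlap test is the idea being illustrated, not the server's actual code.

    package main

    import (
        "fmt"
        "time"
    )

    type Shard struct {
        Id        uint32
        StartTime time.Time
        EndTime   time.Time
        ServerIds []uint32
    }

    // shardsForRange keeps only shards whose [StartTime, EndTime) overlaps the
    // query's [start, end) interval, e.g. "where time > now() - 10m".
    func shardsForRange(shards []Shard, start, end time.Time) []Shard {
        var hit []Shard
        for _, s := range shards {
            if s.StartTime.Before(end) && s.EndTime.After(start) {
                hit = append(hit, s)
            }
        }
        return hit
    }

    func main() {
        now := time.Now()
        shards := []Shard{
            {Id: 1, StartTime: now.Add(-14 * 24 * time.Hour), EndTime: now.Add(-7 * 24 * time.Hour)},
            {Id: 2, StartTime: now.Add(-7 * 24 * time.Hour), EndTime: now.Add(7 * 24 * time.Hour)},
        }
        // "time > now() - 10m" only needs the current shard.
        fmt.Println(shardsForRange(shards, now.Add(-10*time.Minute), now))
    }
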
  30. Scaling out with shards

  31. Shard Config # in the config [sharding] duration = "7d" split = 3 replication-factor = 2 # split-random = "/^big.*/"

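To spell out what those knobs mean, here is the [sharding] section as a hypothetical Go struct with comments; the toml tags assume a TOML decoder, and these are not the server's actual config types.

    package main

    import "fmt"

    type ShardingConfig struct {
        Duration          string `toml:"duration"`           // time covered by one shard, e.g. "7d"
        Split             int    `toml:"split"`              // shards created per duration (write scalability, slide 33)
        ReplicationFactor int    `toml:"replication-factor"` // servers holding a copy of each shard (read scalability, slide 32)
        SplitRandom       string `toml:"split-random"`       // series matching this regex go to a random shard (slide 35)
    }

    func main() {
        cfg := ShardingConfig{Duration: "7d", Split: 3, ReplicationFactor: 2, SplitRandom: "/^big.*/"}
        fmt.Printf("%+v\n", cfg)
    }
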
  32. Shards (read scalability) type Shard struct { Id uint32 StartTime time.Time EndTime time.Time ServerIds []uint32 } Replication Factor!

  33. Splitting durations into multiple shards (write scalability)

  34. Shard Creation (diagram) {"value":23.8, "context":"more info here"}: 1. a write arrives at Server A, which looks up the time bucket and asks "shard exists?"; if yes, it goes through the normal write pipeline and is done. If no: 2. create the shards through Raft, using the rules from the config, then 3. do a normal write.

  35. Mapping Data to Shard 1. Lookup time bucket 2. N = number of shards for that time bucket 3. If series =~ /regex in config/, write to random(N) 4. Else, write to hash(database, series) % N (not a consistent hashing scheme)

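A Go sketch of that mapping, under the assumption that hash(database, series) is an ordinary (non-consistent) hash; the talk does not name the hash function, so FNV-1a stands in here purely for illustration, and pickShard is a hypothetical helper.

    package main

    import (
        "fmt"
        "hash/fnv"
        "math/rand"
        "regexp"
    )

    // pickShard returns which of the N shards in a time bucket receives a write
    // for the given database/series. splitRandom mirrors the "split-random"
    // regex from the config on slide 31.
    func pickShard(database, series string, n int, splitRandom *regexp.Regexp) int {
        if splitRandom != nil && splitRandom.MatchString(series) {
            return rand.Intn(n) // scalable writes, no locality
        }
        h := fnv.New64a()
        h.Write([]byte(database))
        h.Write([]byte(series))
        return int(h.Sum64() % uint64(n)) // all of a series' data for this bucket in one shard
    }

    func main() {
        big := regexp.MustCompile(`^big.*`)
        fmt.Println(pickShard("mydb", "cpu.load", 3, big))     // hashed
        fmt.Println(pickShard("mydb", "big.requests", 3, big)) // random
    }
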
  36. Shard Split Implications • If using hash, all data for a given duration of a series is in a specific shard • Data locality • If using regex • scalable writes • no data locality for queries • Split can’t change on old data without total rebalancing of the shard

  37. Rules for RF and split can change over time

  38. Splits don’t need to change for old shards

  39. Underlying storage

  40. Storage Engine • LevelDB (one LDB per shard) • Log Structured Merge Tree (LSM Tree) • Ordered key/value hash

  41. Key - 24 bytes [id,time,sequence] Tuple with id, time, and sequence number

  42. Key - ID (8 bytes) [id,time,sequence] Uint64 - Identifies database, series, column. Each column has its own id

  43. Key - Time (8 bytes) [id,time,sequence] Uint64 - microsecond epoch, normalized to be positive (for range scans)

  44. Key - Sequence (8 bytes) [id,time,sequence] Uint64 - unique in cluster. 1000. First three digits unique server id

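A Go sketch of packing the 24-byte [id, time, sequence] key from slides 41-44. Big-endian byte order and the exact time normalization are assumptions (they make LevelDB's lexicographic key order match numeric order); encodeKey is illustrative, not the server's implementation.

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // encodeKey packs a column id, a microsecond timestamp (normalized so it is
    // always positive, per slide 43), and a sequence number into 24 bytes.
    func encodeKey(id uint64, timeMicro int64, sequence uint64) []byte {
        key := make([]byte, 24)
        binary.BigEndian.PutUint64(key[0:8], id)
        // shift the signed microsecond epoch into unsigned space so range scans
        // over times before 1970 still sort correctly
        binary.BigEndian.PutUint64(key[8:16], uint64(timeMicro)+uint64(1)<<63)
        binary.BigEndian.PutUint64(key[16:24], sequence)
        return key
    }

    func main() {
        fmt.Printf("%x\n", encodeKey(1, 1396298064000000, 1001))
    }
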
  45. Ordered Keys [1,1396298064000000,1001] [1,1396298064000000,2001] [1,1396298064000000,3001] [1,1396298074000000,4001] … [2,1396298064000000,1001] [2,1396298064000000,2001] [2,1396298064000000,3001] [2,1396298074000000,4001] {"value":23.8, "context":"more info here"} same point, different columns

  46. Ordered Keys [1,1396298064000000,1001] [1,1396298064000000,2001] [1,1396298064000000,3001] [1,1396298074000000,4001] … [2,1396298064000000,1001] [2,1396298064000000,2001] [2,1396298064000000,3001] [2,1396298074000000,4001] {"value":23.8, "context":"more info here"} different point, same timestamp

  47. Values // protobuf bytes message FieldValue { optional string string_value = 1; optional double double_value = 3; optional bool bool_value = 4; optional int64 int64_value = 5; // never stored if this optional bool is_null = 6; }

  48. Storage Implications • Aggregate queries do range scan for entire time range • Where clause queries do range scan for entire time range • Queries with multiple columns do range scan for entire time range for EVERY column • Splitting into many time series is the way to index • No more than 999 servers in a cluster • String values get repeated, but compression may lessen impact

  49. Distributed Parts

  50. Raft • Distributed Consensus Protocol (like Paxos) • Runs on its own port using HTTP/JSON protocol • We use it for meta-data • What servers are in the cluster • What databases exist • What users exist • What shards exist and where • What continuous queries exist

  51. Raft - why not series data? • Write bottleneck (all writes go through the leader) • Raft groups? • How many? • How to add new ones? • Too chatty

  52. Server Joining… • Looks at "seed-servers" in config • Raft connects to server, attempts to join • Gets redirected to leader • Joins cluster (all other servers informed) • Log replay (meta-data) • Connects to other servers in cluster via TCP

  53. TCP Protocol • Each server has a single connection to every other server • Request/response multiplexed onto connection • Protobuf wire protocol • Distributed Queries • Writes

  54. Request message Request { enum Type { WRITE = 1; QUERY = 2; HEARTBEAT = 3; } optional uint32 id = 1; required Type type = 2; required string database = 3; optional Series series = 4; optional uint32 shard_id = 6; optional string query = 7; optional string user_name = 8; optional uint32 request_number = 9; }

  55. Response message Response { enum Type { QUERY = 1; WRITE_OK = 2; END_STREAM = 3; HEARTBEAT = 9; EXPLAIN_QUERY = 10; } required Type type = 1; required uint32 request_id = 2; optional Series series = 3; optional string error_message = 5; }

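Slide 53 says request/response pairs are multiplexed onto one connection per peer, and the Request and Response messages above carry request_number / request_id for exactly that. Here is a minimal Go sketch of the correlation idea, with protobuf framing and real networking omitted; the types and names are illustrative only.

    package main

    import (
        "fmt"
        "sync"
    )

    type response struct {
        requestID uint32
        payload   string
    }

    type conn struct {
        mu      sync.Mutex
        nextID  uint32
        pending map[uint32]chan response
    }

    func newConn() *conn { return &conn{pending: make(map[uint32]chan response)} }

    // send registers a pending request and returns a channel that the reader
    // goroutine completes when a response with the same id arrives.
    func (c *conn) send(payload string) chan response {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.nextID++
        ch := make(chan response, 1)
        c.pending[c.nextID] = ch
        // a real implementation would frame and write the request here
        go c.deliver(response{requestID: c.nextID, payload: "ok: " + payload})
        return ch
    }

    // deliver routes an incoming response to whichever caller is waiting on it.
    func (c *conn) deliver(r response) {
        c.mu.Lock()
        ch := c.pending[r.requestID]
        delete(c.pending, r.requestID)
        c.mu.Unlock()
        if ch != nil {
            ch <- r
        }
    }

    func main() {
        c := newConn()
        fmt.Println(<-c.send("select count(value) from foo"))
    }
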
  56. Writes go over the TCP protocol. How do we ensure replication?

  57. Write Ahead Log (WAL)

  58. Writing Pipeline (diagram) {"value":23.8, "context":"more info here"}: 1. the client writes to Server A, 2. Server A logs to the WAL, 3a. return success to the client, 3b. send to the write buffer for the shard (server or local LevelDB), 4. write, 5. commit to the WAL.

  59. WAL Implications • Servers can go down • Can replay from last known commit • Doesn’t need to buffer in memory • Can buffer against write bursts or LevelDB compactions

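A rough Go sketch of the WAL idea in slides 57-59: append and sync the write before acknowledging it, and track the last committed sequence so a restarted server can replay from there. The file format and type names are assumptions for illustration, not InfluxDB's actual WAL.

    package main

    import (
        "fmt"
        "os"
    )

    type wal struct {
        f             *os.File
        lastCommitted uint64
    }

    // append logs a raw write and flushes it to disk; only after this returns can
    // the server safely report success to the client (step 3a on slide 58).
    func (w *wal) append(seq uint64, entry []byte) error {
        if _, err := fmt.Fprintf(w.f, "%d %s\n", seq, entry); err != nil {
            return err
        }
        return w.f.Sync()
    }

    // commit records that everything up to seq has been applied to the shard, so
    // replay after a crash starts from lastCommitted+1 (step 5 on slide 58).
    func (w *wal) commit(seq uint64) { w.lastCommitted = seq }

    func main() {
        f, err := os.CreateTemp("", "wal")
        if err != nil {
            panic(err)
        }
        defer os.Remove(f.Name())

        w := &wal{f: f}
        if err := w.append(1, []byte(`{"value":23.8}`)); err != nil {
            panic(err)
        }
        w.commit(1)
        fmt.Println("last committed:", w.lastCommitted)
    }
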
  60. Continuous Query Architecture

  61. Continuous Queries • Denormalization, downsampling • For performance • Distributed • Replicated through Raft

  62. Continuous Queries, create (diagram) select * from events into events.[user_id]: 1. the query goes to Server A, 2. Server A sends a create to the Raft leader, 3. the leader replicates the definition to the other servers.

  63. Continuous queries (fan out) select * from events into events.[user_id] select value from /^foo.*/ into bar.[type].:foo

  64. Writing Pipeline (diagram) {"value":23.8, "context":"more info here"}: 1. the client writes to Server A, 2. Server A logs to the WAL, 3a. return success, 3b. send to the write buffer for the shard (server or local LevelDB), 4. write, 5. commit to the WAL.

  65. Writing Pipeline (diagram) {"value":23.8, "context":"more info here"}: 1. the client writes to Server A, 2. Server A logs to the WAL, 3a. return success, 3b. evaluate fanouts, then continue through the normal write pipeline.

  66. Fanouts select value from foo_metric into foo_metric.[host] {"value":23.8, "host":"serverA"} Log to WAL: { "series": "foo_metric", "time": 1396366076000000, "sequence_number": 10001, "value": 23.8, "host": "serverA" } Fanout: { "series": "foo_metric.serverA", "time": 1396366076000000, "sequence_number": 10001, "value": 23.8 } Time, SN the same

  67. Fanout Implications • Servers don’t check with Raft leader for fanout definitions • They need to be up to date with leader • Data duplication for performance • Can lookup source points on time, SN • Failures mid-fanout OK

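A Go sketch of the fanout rewrite on slide 66: the write to foo_metric is duplicated into foo_metric.<host>, keeping the same time and sequence number so the source point can always be looked up (slide 67). The Point type and fanOut helper are illustrative names, not the server's types.

    package main

    import "fmt"

    type Point struct {
        Series         string
        Time           int64
        SequenceNumber uint64
        Columns        map[string]interface{}
    }

    // fanOut builds the derived point for one fanout definition: keep the
    // selected column, move the [host] column into the series name, and reuse
    // time and sequence number so the source point remains addressable.
    func fanOut(p Point, selected, bracketed string) Point {
        return Point{
            Series:         fmt.Sprintf("%s.%v", p.Series, p.Columns[bracketed]),
            Time:           p.Time,
            SequenceNumber: p.SequenceNumber,
            Columns:        map[string]interface{}{selected: p.Columns[selected]},
        }
    }

    func main() {
        src := Point{
            Series:         "foo_metric",
            Time:           1396366076000000,
            SequenceNumber: 10001,
            Columns:        map[string]interface{}{"value": 23.8, "host": "serverA"},
        }
        fmt.Printf("%+v\n", fanOut(src, "value", "host")) // series foo_metric.serverA
    }
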
  68. Continuous queries (summaries) select count(page_id) from events group by time(1h), page_id into events.1h.[page_id]

  69. Continuous queries (regex downsampling) select max(value), context from /^stats\.*/ group by time(5m) into max.5m.:series_name

  70. Continuous Queries (group by time) • Run on Raft leader • Check every second if any to run

  71. Continuous Queries (diagram) Queries to run? Checks since last successful run. 1. Run queries through the write pipeline. 2. Mark the time of this run and replicate it via Raft.

  72. Continuous Queries select count(page_id) from events group by time(1h), page_id into events.1h.[page_id] 1. Run: select count(page_id) from events group by time(1h), page_id into events.1h.[page_id] where time >= 1396368660 AND time < 1396368720 2. Write: { "time": 1396368660, "sequence_number": 1, "series": "events.1h.22", "count": 23 }, { "time": 1396368660, "sequence_number": 2, "series": "events.1h.56", "count": 10 }

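A Go sketch of how the leader could turn the stored query above into time-bounded runs for each completed interval since the last successful run (slides 70-72). The string rewriting and the boundedQueries helper are simplifications for illustration; the real server parses the query rather than appending text.

    package main

    import (
        "fmt"
        "time"
    )

    // boundedQueries returns one bounded query per completed group-by interval
    // between lastRun and now, plus the new last-run mark to replicate via Raft.
    func boundedQueries(query string, interval time.Duration, lastRun, now time.Time) ([]string, time.Time) {
        var out []string
        start := lastRun.Truncate(interval)
        for ; !start.Add(interval).After(now); start = start.Add(interval) {
            out = append(out, fmt.Sprintf("%s where time >= %d AND time < %d",
                query, start.Unix(), start.Add(interval).Unix()))
        }
        return out, start
    }

    func main() {
        q := "select count(page_id) from events group by time(1h), page_id into events.1h.[page_id]"
        lastRun := time.Unix(1396368660, 0)
        now := time.Unix(1396372320, 0) // a bit over an hour later
        qs, mark := boundedQueries(q, time.Hour, lastRun, now)
        for _, s := range qs {
            fmt.Println(s)
        }
        fmt.Println("mark:", mark.Unix())
    }
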
  73. Implications • Fault tolerant across server failures • New leader picks up since last known • Continuous queries can be recalculated (future) • Will overwrite old values

  74. Future Work

  75. Custom Functions select myFunc(value) from some_series

  76. Pubsub select * from some_series where host = "serverA" into subscription() select percentile(90, value) from some_series group by time(1m) into subscription()

  77. Rack aware sharding

  78. Multi-datacenter replication

  79. Thanks! Paul Dix @pauldix @influxdb paul@influxdb.com