The Internals of InfluxDB

Paul Dix
April 04, 2014

Talk on the underlying implementation details of InfluxDB. Talk given in SF the week of 04/04/14.

Transcript

  1. Data model
     • Databases
     • Time series (or tables, but you can have millions)
     • Points (or rows, but column oriented)
  2. Data (with timestamp)

     [
       {
         "name": "cpu",
         "columns": ["time", "value", "host"],
         "points": [
           [1395168540, 56.7, "foo.influxdb.com"],
           [1395168540, 43.9, "bar.influxdb.com"]
         ]
       }
     ]
  3. Where against Regex

     select value from application_logs
     where value =~ /.*ERROR.*/
       and time > "2014-03-01" and time < "2014-03-03"
  4. Queries - Parsing
     • Lex / Flex (lexer)
     • Yacc / Bison (parser generator)
  5. Queries - Processing
     • Totally custom
     • Hand code each aggregate function, conditional, etc.
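     As an illustration of what a hand-coded aggregate can look like, here is a minimal Go sketch of a count() aggregator; the Aggregator interface and names are assumptions for this example, not InfluxDB's actual internal API.

         package main

         import "fmt"

         // Aggregator consumes one value at a time and yields a final result.
         type Aggregator interface {
             Update(value interface{})
             Result() interface{}
         }

         // CountAggregator is the hand-written equivalent of count(): it counts
         // non-nil values.
         type CountAggregator struct {
             n int64
         }

         func (c *CountAggregator) Update(value interface{}) {
             if value != nil {
                 c.n++
             }
         }

         func (c *CountAggregator) Result() interface{} { return c.n }

         func main() {
             agg := &CountAggregator{}
             for _, v := range []interface{}{56.7, 43.9, nil, 12.0} {
                 agg.Update(v)
             }
             fmt.Println("count:", agg.Result()) // prints: count: 3
         }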
  6. Data for all series for a given interval exists in a shard*

     *by default, but can be modified
  7. Query Pipeline

     select count(value) from foo group by time(5m)

     1. Client sends the query to Server A
     2. Server A queries the shards (each a server or local LevelDB shard)
     3. Each shard computes locally
     4. Shards stream their results back
     5. Server A collates and returns to the client
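     A minimal Go sketch of this scatter/gather flow for the count query, with illustrative types: each shard counts its own points per 5-minute bucket and streams the partial counts, and the coordinating server collates them before returning to the client.

         package main

         import (
             "fmt"
             "sync"
         )

         // PartialCount is one shard's count for one 5-minute bucket.
         type PartialCount struct {
             Bucket int64 // start of the interval, unix seconds
             Count  int64
         }

         // queryShard stands in for "compute locally" on a server or LevelDB shard.
         func queryShard(timestamps []int64, out chan<- PartialCount) {
             counts := map[int64]int64{}
             for _, ts := range timestamps {
                 counts[ts-ts%300]++ // group by time(5m)
             }
             for b, c := range counts {
                 out <- PartialCount{Bucket: b, Count: c} // 4. stream result
             }
         }

         func main() {
             shards := [][]int64{
                 {1395168540, 1395168600, 1395168660},
                 {1395168540, 1395169000},
             }

             results := make(chan PartialCount)
             var wg sync.WaitGroup
             for _, s := range shards {
                 wg.Add(1)
                 go func(s []int64) { // 2. query shards in parallel
                     defer wg.Done()
                     queryShard(s, results) // 3. compute locally
                 }(s)
             }
             go func() { wg.Wait(); close(results) }()

             // 5. collate and return
             total := map[int64]int64{}
             for pc := range results {
                 total[pc.Bucket] += pc.Count
             }
             fmt.Println(total)
         }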
  8. Query Pipeline (no locality)

     select percentile(90, value) from response_times

     1. Client sends the query to Server A
     2. Server A queries the shards (each a server or local LevelDB shard)
     3. Shards stream their raw points back (no local compute)
     4. Server A collates, computes, and returns to the client
  9. serverA could buffer all results in memory*

     *Able to limit number of shards to query in parallel
  10. Query Pipeline

     select count(value) from foo where time > now() - 10m

     1. Client sends the query to Server A
     2. Server A queries the single matching shard (a server or local LevelDB shard)
     3. The shard computes locally
     4. The shard streams its result back
     5. Server A collates and returns to the client
  11. Shard Config

     # in the config
     [sharding]
     duration = "7d"
     split = 3
     replication-factor = 2
     # split-random = "/^big.*/"
  12. Shards (read scalability)

     type Shard struct {
         Id        uint32
         StartTime time.Time
         EndTime   time.Time
         ServerIds []uint32 // replication factor
     }
  13. Shard Creation

     {"value":23.8, "context":"more info here"}

     1. Write arrives at Server A
     2. Does a shard exist for the point's time bucket?
        • Yes: normal write pipeline
        • No: create the shards via Raft (from the config rules), then do the normal write
  14. Mapping Data to Shard

     1. Lookup time bucket
     2. N = number of shards for that time bucket
     3. If series =~ /regex in config/: write to random(N)
     4. Else: write to hash(database, series) % N
        • Not a consistent hashing scheme
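     A minimal Go sketch of these rules; the FNV hash is an assumption (the slides only say hash(database, series) % N and that it is not a consistent hashing scheme), and the regex mirrors the split-random example from the config slide.

         package main

         import (
             "fmt"
             "hash/fnv"
             "math/rand"
             "regexp"
         )

         // From the config, e.g. split-random = "/^big.*/".
         var splitRandom = regexp.MustCompile(`^big.*`)

         func hashDB(database, series string) uint32 {
             h := fnv.New32a()
             h.Write([]byte(database))
             h.Write([]byte(series))
             return h.Sum32()
         }

         // pickShard returns the shard index (0..n-1) within one time bucket.
         func pickShard(database, series string, n int) int {
             if splitRandom.MatchString(series) {
                 return rand.Intn(n) // scalable writes, no data locality
             }
             // All data for the series in this bucket lands in one shard.
             return int(hashDB(database, series) % uint32(n))
         }

         func main() {
             fmt.Println(pickShard("mydb", "cpu.load", 3))
             fmt.Println(pickShard("mydb", "big.events", 3))
         }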
  15. Shard Split Implications
     • If using hash, all data for a given duration of a series is in a specific shard
       • Data locality
     • If using regex
       • scalable writes
       • no data locality for queries
     • Split can't change on old data without total rebalancing of the shard
  16. Storage Engine
     • LevelDB (one LDB per shard)
     • Log Structured Merge Tree (LSM Tree)
     • Ordered key/value hash
  17. Key - ID (8 bytes)

     [id, time, sequence]

     Uint64 - identifies database, series, column. Each column has its own id.
  18. Key - Time (8 bytes)

     [id, time, sequence]

     Uint64 - microsecond epoch, normalized to be positive (for range scans)
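     A minimal Go sketch of the resulting 24-byte key, assuming big-endian encoding (so byte-ordered range scans follow time order) and an illustrative offset constant for the time normalization; the slides only state the [id, time, sequence] layout.

         package main

         import (
             "encoding/binary"
             "fmt"
         )

         // timeOffset shifts signed microsecond timestamps into the positive
         // range (illustrative value).
         const timeOffset = uint64(1) << 63

         func makeKey(columnID uint64, timeMicro int64, sequence uint64) []byte {
             key := make([]byte, 24)
             binary.BigEndian.PutUint64(key[0:8], columnID)
             binary.BigEndian.PutUint64(key[8:16], uint64(timeMicro)+timeOffset)
             binary.BigEndian.PutUint64(key[16:24], sequence)
             return key
         }

         func main() {
             fmt.Printf("%x\n", makeKey(42, 1396366076000000, 10001))
         }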
  19. Values

     // protobuf bytes
     message FieldValue {
       optional string string_value = 1;
       optional double double_value = 3;
       optional bool bool_value = 4;
       optional int64 int64_value = 5;
       // never stored if this
       optional bool is_null = 6;
     }
  20. Storage Implications
     • Aggregate queries do a range scan for the entire time range
     • Where clause queries do a range scan for the entire time range
     • Queries with multiple columns do a range scan for the entire time range for EVERY column
     • Splitting into many time series is the way to index
     • No more than 999 servers in a cluster
     • String values get repeated, but compression may lessen the impact
  21. Raft
     • Distributed Consensus Protocol (like Paxos)
     • Runs on its own port using an HTTP/JSON protocol
     • We use it for meta-data:
       • What servers are in the cluster
       • What databases exist
       • What users exist
       • What shards exist and where
       • What continuous queries exist
  22. Raft - why not series data?
     • Write bottleneck (all writes go through the leader)
     • Raft groups?
       • How many?
       • How to add new ones?
     • Too chatty
  23. Server Joining…
     • Looks at "seed-servers" in config
     • Raft connects to server, attempts to join
     • Gets redirected to leader
     • Joins cluster (all other servers informed)
     • Log replay (meta-data)
     • Connects to other servers in cluster via TCP
  24. TCP Protocol
     • Each server has a single connection to every other server
     • Request/response multiplexed onto connection
     • Protobuf wire protocol
     • Distributed queries
     • Writes
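     A minimal Go sketch of multiplexing request/response pairs over the single per-peer connection: each outgoing request gets a request number, and the connection's reader routes responses back by that number. The structs and the omitted wire framing are illustrative stand-ins for the protobuf Request/Response messages on the next two slides.

         package main

         import (
             "fmt"
             "sync"
         )

         type Request struct {
             ID    uint32
             Query string
         }

         type Response struct {
             RequestID uint32
             Series    string
         }

         type Conn struct {
             mu      sync.Mutex
             nextID  uint32
             pending map[uint32]chan Response
         }

         func NewConn() *Conn { return &Conn{pending: map[uint32]chan Response{}} }

         // Send registers a response channel before the request goes on the wire.
         func (c *Conn) Send(query string) (uint32, chan Response) {
             c.mu.Lock()
             defer c.mu.Unlock()
             c.nextID++
             ch := make(chan Response, 1)
             c.pending[c.nextID] = ch
             // ... here the protobuf-encoded Request{ID: c.nextID, Query: query}
             // would be written to the shared TCP connection.
             return c.nextID, ch
         }

         // Dispatch is what the connection's reader does for each decoded Response.
         func (c *Conn) Dispatch(r Response) {
             c.mu.Lock()
             ch := c.pending[r.RequestID]
             delete(c.pending, r.RequestID)
             c.mu.Unlock()
             if ch != nil {
                 ch <- r
             }
         }

         func main() {
             c := NewConn()
             id, ch := c.Send("select * from foo")
             c.Dispatch(Response{RequestID: id, Series: "foo"}) // simulate the remote reply
             fmt.Println((<-ch).Series)
         }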
  25. Request

     message Request {
       enum Type {
         WRITE = 1;
         QUERY = 2;
         HEARTBEAT = 3;
       }
       optional uint32 id = 1;
       required Type type = 2;
       required string database = 3;
       optional Series series = 4;
       optional uint32 shard_id = 6;
       optional string query = 7;
       optional string user_name = 8;
       optional uint32 request_number = 9;
     }
  26. Response

     message Response {
       enum Type {
         QUERY = 1;
         WRITE_OK = 2;
         END_STREAM = 3;
         HEARTBEAT = 9;
         EXPLAIN_QUERY = 10;
       }
       required Type type = 1;
       required uint32 request_id = 2;
       optional Series series = 3;
       optional string error_message = 5;
     }
  27. Writing Pipeline

     {"value":23.8, "context":"more info here"}

     1. Client writes to Server A
     2. Server A logs the write to its WAL
     3a. Return success to the client
     3b. Send to the write buffer for the shard (a server or local LevelDB shard)
     4. Write to the shard
     5. Commit to the WAL
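     A minimal Go sketch of this sequence, with illustrative types: append to the WAL, acknowledge the client, hand the entry to a write buffer, and mark the WAL entry committed once the shard write (local or remote LevelDB) has been applied.

         package main

         import "fmt"

         type Entry struct {
             Seq  uint64
             Data string
         }

         type WAL struct {
             entries   []Entry
             committed uint64
         }

         func (w *WAL) Append(e Entry)    { w.entries = append(w.entries, e) }
         func (w *WAL) Commit(seq uint64) { w.committed = seq }

         func main() {
             wal := &WAL{}
             buffer := make(chan Entry, 1024) // 3b. write buffer in front of the shard

             // 4.-5. the buffered writer applies entries to the shard, then commits the WAL.
             done := make(chan struct{})
             go func() {
                 for e := range buffer {
                     // ... write to the server-or-LevelDB shard here ...
                     wal.Commit(e.Seq)
                 }
                 close(done)
             }()

             // 1.-3. log to the WAL, ack the client, enqueue for the shard.
             e := Entry{Seq: 1, Data: `{"value":23.8, "context":"more info here"}`}
             wal.Append(e)                // 2. log to WAL
             fmt.Println("ack to client") // 3a. return success
             buffer <- e                  // 3b. send to write buffer

             close(buffer)
             <-done
             fmt.Println("committed through:", wal.committed)
         }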
  28. WAL Implications
     • Servers can go down
     • Can replay from last known commit
     • Doesn't need to buffer in memory
     • Can buffer against write bursts or LevelDB compactions
  29. Continuous Queries (create)

     select * from events into events.[user_id]

     1. Client sends the query to Server A
     2. Server A asks the Raft leader to create the continuous query
     3. The leader replicates it to the other servers
  30. Continuous queries (fan out)

     select * from events into events.[user_id]

     select value from /^foo.*/ into bar.[type].:foo
  31. Writing Pipeline

     {"value":23.8, "context":"more info here"}

     1. Client writes to Server A
     2. Server A logs the write to its WAL
     3a. Return success to the client
     3b. Send to the write buffer for the shard (a server or local LevelDB shard)
     4. Write to the shard
     5. Commit to the WAL
  32. Writing Pipeline

     {"value":23.8, "context":"more info here"}

     1. Client writes to Server A
     2. Server A logs the write to its WAL
     3a. Return success to the client
     3b. Evaluate fanouts, then the normal write pipeline
  33. Fanouts

     select value from foo_metric into foo_metric.[host]

     {"value":23.8, "host":"serverA"}

     Log to WAL:
     {
       "series": "foo_metric",
       "time": 1396366076000000,
       "sequence_number": 10001,
       "value": 23.8,
       "host": "serverA"
     }

     Fanout (time and sequence_number stay the same):
     {
       "series": "foo_metric.serverA",
       "time": 1396366076000000,
       "sequence_number": 10001,
       "value": 23.8
     }
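     A minimal Go sketch of evaluating this fanout, with an illustrative Point type: the source point is rewritten into foo_metric.<host>, keeping the same time and sequence_number so the source point can still be looked up.

         package main

         import "fmt"

         type Point struct {
             Series         string
             Time           int64
             SequenceNumber uint64
             Values         map[string]interface{}
         }

         // fanout maps a foo_metric point into "foo_metric.<host>", keeping only
         // the selected value column.
         func fanout(p Point) Point {
             host := p.Values["host"].(string)
             return Point{
                 Series:         p.Series + "." + host,
                 Time:           p.Time,           // same time...
                 SequenceNumber: p.SequenceNumber, // ...and sequence number as the source
                 Values:         map[string]interface{}{"value": p.Values["value"]},
             }
         }

         func main() {
             src := Point{
                 Series:         "foo_metric",
                 Time:           1396366076000000,
                 SequenceNumber: 10001,
                 Values:         map[string]interface{}{"value": 23.8, "host": "serverA"},
             }
             fmt.Printf("%+v\n", fanout(src))
         }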
  34. Fanout Implications
     • Servers don't check with the Raft leader for fanout definitions
     • They need to be up to date with the leader
     • Data duplication for performance
     • Can look up source points on time, SN
     • Failures mid-fanout OK
  35. Continuous Queries (group by time)
     • Run on the Raft leader
     • Check every second if any need to run
  36. Continuous Queries

     Queries to run? (checks since the last successful run)
     1. Run the queries (results go through the write pipeline)
     2. Mark the time of this run (Raft replicated)
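     A minimal Go sketch of this loop on the leader, with illustrative names: tick every second, run any query whose last successful run is old enough, then record the new run time (which the real system replicates through Raft).

         package main

         import (
             "fmt"
             "time"
         )

         type ContinuousQuery struct {
             Query   string
             LastRun time.Time
         }

         // runForInterval stands in for rewriting the query with
         // "where time >= from AND time < to" and feeding the results back into
         // the normal write pipeline.
         func runForInterval(q *ContinuousQuery, from, to time.Time) {
             fmt.Printf("run %q for [%d, %d)\n", q.Query, from.Unix(), to.Unix())
         }

         func main() {
             queries := []*ContinuousQuery{
                 {
                     Query:   "select count(page_id) from events group by time(1h), page_id into events.1h.[page_id]",
                     LastRun: time.Now().Add(-2 * time.Second),
                 },
             }

             ticker := time.NewTicker(time.Second) // check every second
             defer ticker.Stop()
             for i := 0; i < 2; i++ {
                 now := <-ticker.C
                 for _, q := range queries {
                     if now.Sub(q.LastRun) >= time.Second {
                         runForInterval(q, q.LastRun, now)
                         q.LastRun = now // 2. mark time of this run (Raft-replicated in practice)
                     }
                 }
             }
         }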
  37. Continuous Queries

     select count(page_id) from events group by time(1h), page_id into events.1h.[page_id]

     1. Run:
     select count(page_id) from events group by time(1h), page_id into events.1h.[page_id]
     where time >= 1396368660 AND time < 1396368720

     2. Write:
     { "time": 1396368660, "sequence_number": 1, "series": "events.1h.22", "count": 23 },
     { "time": 1396368660, "sequence_number": 2, "series": "events.1h.56", "count": 10 }
  38. Implications
     • Fault tolerant across server failures
     • New leader picks up since last known run
     • Continuous queries can be recalculated (future)
     • Will overwrite old values
  39. Pubsub

     select * from some_series where host = "serverA" into subscription()

     select percentile(90, value) from some_series group by time(1m) into subscription()