Query Pipeline

select count(value) from foo group by time(5m)

1. Client sends the query to Server A.
2. Server A queries the shards covering the time range (each shard lives on another server or in local LevelDB).
3. Each shard computes the aggregate locally.
4. Shards stream their partial results back to Server A.
5. Server A collates the results and returns them to the client.
Query Pipeline (no locality)

select percentile(90, value) from response_times

1. Client sends the query to Server A.
2. Server A queries the shards (each on another server or in local LevelDB).
3. A percentile can't be computed per shard, so each shard streams its raw points back to Server A.
4. Server A collates the points, computes the percentile, and returns the result to the client.
Query Pipeline (single shard)

select count(value) from foo where time > now() - 10m

1. Client sends the query to Server A.
2. The time range maps to a single shard, so Server A queries only that shard.
3. The shard computes the count locally.
4. The shard streams the result back to Server A.
5. Server A collates the result and returns it to the client.
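A minimal Go sketch of the two query paths above, under assumed names (Shard, CountLocal, RawValues are illustrative, not InfluxDB's API): count is computed per shard and the coordinator only sums the partial results, while percentile forces every shard to stream its raw points so the coordinator can compute the answer itself.

package main

import (
	"fmt"
	"sort"
)

// Shard stands in for a remote shard (or a local LevelDB shard).
type Shard interface {
	CountLocal() int64    // aggregate computed on the shard (data locality)
	RawValues() []float64 // raw points streamed back (no locality)
}

type memShard struct{ values []float64 }

func (s memShard) CountLocal() int64    { return int64(len(s.values)) }
func (s memShard) RawValues() []float64 { return s.values }

// distributedCount pushes the aggregate down: each shard returns a partial
// count and the coordinator only collates.
func distributedCount(shards []Shard) int64 {
	var total int64
	for _, s := range shards {
		total += s.CountLocal()
	}
	return total
}

// distributedPercentile cannot be computed per shard, so the coordinator
// gathers all raw points, sorts them, and computes the percentile itself.
func distributedPercentile(shards []Shard, p float64) float64 {
	var all []float64
	for _, s := range shards {
		all = append(all, s.RawValues()...)
	}
	if len(all) == 0 {
		return 0
	}
	sort.Float64s(all)
	idx := int(p / 100 * float64(len(all)-1))
	return all[idx]
}

func main() {
	shards := []Shard{
		memShard{values: []float64{1, 2, 3}},
		memShard{values: []float64{10, 20}},
	}
	fmt.Println("count:", distributedCount(shards))        // 5
	fmt.Println("p90:", distributedPercentile(shards, 90)) // 10
}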
Shard Creation

{"value":23.8, "context":"more info here"}

1. Client writes the point to Server A.
2. Server A looks up the point's time bucket and checks whether a shard already exists for it.
  • Yes: the point goes straight into the normal write pipeline.
  • No: the shards for that bucket are created through Raft, using the rules from the config, and the write then continues through the normal write pipeline.
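A hedged Go sketch of that decision, with hypothetical names (bucketFor, proposeCreateShards, shardDuration): the write resolves its time bucket, and only when no shard covers the bucket does the server go through Raft to create the shards defined by the config rules before continuing with the normal write pipeline.

package main

import (
	"fmt"
	"time"
)

// shardDuration would come from the config rules (illustrative value).
const shardDuration = 7 * 24 * time.Hour

// bucketFor truncates a point's timestamp to its shard time bucket.
func bucketFor(t time.Time) time.Time { return t.Truncate(shardDuration) }

// shards that already exist, keyed by time bucket. In the real system this
// is cluster metadata kept consistent through Raft.
var shards = map[time.Time][]int{}

// proposeCreateShards stands in for a Raft proposal: the leader creates the
// shards for the bucket according to the config rules and replicates that
// decision to every server.
func proposeCreateShards(bucket time.Time, n int) []int {
	ids := make([]int, n)
	for i := range ids {
		ids[i] = len(shards)*n + i + 1
	}
	shards[bucket] = ids
	return ids
}

// write implements the flow: shard exists? if not, create via Raft, then
// hand the point to the normal write pipeline.
func write(t time.Time, point map[string]interface{}) {
	bucket := bucketFor(t)
	ids, ok := shards[bucket]
	if !ok {
		ids = proposeCreateShards(bucket, 2) // 2 shards per bucket, from config
	}
	fmt.Printf("writing %v to one of shards %v (bucket %v)\n",
		point, ids, bucket.Format("2006-01-02"))
}

func main() {
	write(time.Now(), map[string]interface{}{"value": 23.8, "context": "more info here"})
}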
Mapping Data to Shard

1. Look up the time bucket
2. N = number of shards for that time bucket
3. If series =~ /regex in config/
  • Write to random(N)
4. Else
  • Write to hash(database, series) % N
  • Not a consistent hashing scheme
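A small Go sketch of that mapping, with splitRegex standing in for the regex from the config: the hash path keeps a whole series (within a time bucket) in one shard, while the regex path spreads writes randomly over the N shards, and neither is a consistent hashing scheme.

package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"regexp"
)

// splitRegex is the "regex in config" deciding which series get their
// writes spread randomly across shards (hypothetical value).
var splitRegex = regexp.MustCompile(`^log\..*`)

// pickShard returns the index (0..n-1) of the shard a point goes to,
// given the n shards that exist for the point's time bucket.
func pickShard(database, series string, n int) int {
	if splitRegex.MatchString(series) {
		// scalable writes, but no data locality for queries
		return rand.Intn(n)
	}
	// all data for this series (within this time bucket) lands in one shard
	h := fnv.New32a()
	h.Write([]byte(database))
	h.Write([]byte(series))
	return int(h.Sum32()) % n
}

func main() {
	fmt.Println(pickShard("mydb", "foo_metric", 4))   // always the same shard
	fmt.Println(pickShard("mydb", "log.requests", 4)) // a random shard per write
}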
Shard Split Implications

• If using the hash, all data for a series within a given shard duration lives in one specific shard
  • Data locality
• If using the regex
  • Scalable writes
  • No data locality for queries
• The split can't change on old data without totally rebalancing the shard
Ordered Keys

{"value":23.8, "context":"more info here"}

[1,1396298064000000,1001]
[1,1396298064000000,2001]
[1,1396298064000000,3001]
[1,1396298074000000,4001]
…
[2,1396298064000000,1001]
[2,1396298064000000,2001]
[2,1396298064000000,3001]
[2,1396298074000000,4001]

Keys are [column id, timestamp, sequence number]. A point is stored once per column: [1,1396298064000000,1001] and [2,1396298064000000,1001] are the same point, different columns. Points can also share a timestamp: [1,1396298064000000,1001] and [1,1396298064000000,2001] are different points with the same timestamp, distinguished by their sequence numbers.
Storage Implications

• Aggregate queries do a range scan over the entire time range
• Queries with a where clause do a range scan over the entire time range
• Queries touching multiple columns do a range scan over the entire time range for EVERY column
• Splitting data into many time series is the way to index
• No more than 999 servers in a cluster
• String values get repeated, but compression may lessen the impact
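A sketch of how such ordered keys and range scans can be expressed in Go, assuming big-endian encoding of [column id, timestamp, sequence number] so that LevelDB's lexicographic key order matches (column, time, sequence) order; it also shows why a query over several columns means one full time-range scan per column.

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// key encodes [column id, timestamp (µs), sequence number] big-endian, so
// byte-wise (LevelDB) ordering equals (column, time, sequence) ordering.
func key(columnID uint64, timestamp int64, seq uint64) []byte {
	buf := make([]byte, 24)
	binary.BigEndian.PutUint64(buf[0:8], columnID)
	binary.BigEndian.PutUint64(buf[8:16], uint64(timestamp))
	binary.BigEndian.PutUint64(buf[16:24], seq)
	return buf
}

func main() {
	// the same point stored once per column: same time and sequence number,
	// different column ids
	k1 := key(1, 1396298064000000, 1001) // "value" column
	k2 := key(2, 1396298064000000, 1001) // "context" column

	// a range scan for column 1 over [start, end) walks every key of that
	// column in the time range; a second column means a second full scan
	start := key(1, 1396298060000000, 0)
	end := key(1, 1396298070000000, 0)
	fmt.Println(bytes.Compare(start, k1) <= 0 && bytes.Compare(k1, end) < 0) // true
	fmt.Println(bytes.Compare(start, k2) <= 0 && bytes.Compare(k2, end) < 0) // false: other column
}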
Raft

• Distributed consensus protocol (like Paxos)
• Runs on its own port using an HTTP/JSON protocol
• Used for metadata:
  • What servers are in the cluster
  • What databases exist
  • What users exist
  • What shards exist and where
  • What continuous queries exist
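A hedged sketch of the kind of metadata that goes through Raft; the struct and field names are illustrative, not InfluxDB's actual types. Changes to this state are proposed to the leader, committed to the replicated log, and applied on every server, while the data points themselves never pass through Raft.

package main

// ClusterMetadata is the state kept consistent via Raft on every server.
type ClusterMetadata struct {
	Servers           []ServerInfo      // what servers are in the cluster
	Databases         []string          // what databases exist
	Users             map[string][]User // what users exist, per database
	Shards            []ShardInfo       // what shards exist and where
	ContinuousQueries map[string]string // what continuous queries exist, per database
}

type ServerInfo struct {
	ID       uint32
	RaftName string
	Addr     string // TCP protobuf address for writes and distributed queries
}

type User struct {
	Name    string
	IsAdmin bool
}

type ShardInfo struct {
	ID        uint32
	StartTime int64    // time bucket start (µs)
	EndTime   int64    // time bucket end (µs)
	ServerIDs []uint32 // which servers hold a copy of this shard
}

func main() {}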
Server Joining…

• Looks at the "seed-servers" in the config
• Connects to a seed server over Raft and attempts to join
• Gets redirected to the leader
• Joins the cluster (all other servers are informed)
• Replays the Raft log (metadata)
• Connects to the other servers in the cluster via TCP
TCP Protocol

• Each server has a single connection to every other server
• Requests and responses are multiplexed onto that connection
• Protobuf wire protocol
• Used for distributed queries and writes
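A minimal Go sketch of request/response multiplexing over that single per-server connection. The types here are illustrative stand-ins for the protobuf messages; the essential mechanism is a request id that lets responses for distributed queries and writes come back in any order on the shared connection.

package main

import (
	"fmt"
	"sync"
)

// Response is a stand-in for the protobuf response message.
type Response struct {
	RequestID uint32
	Body      string
}

// Conn multiplexes many in-flight requests onto one connection to a peer.
type Conn struct {
	mu      sync.Mutex
	nextID  uint32
	pending map[uint32]chan Response // request id -> waiting caller
}

func NewConn() *Conn { return &Conn{pending: map[uint32]chan Response{}} }

// Send registers a pending request and returns the channel its response
// will be delivered on. The request itself would be protobuf-encoded and
// written to the socket here.
func (c *Conn) Send(body string) (uint32, chan Response) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nextID++
	ch := make(chan Response, 1)
	c.pending[c.nextID] = ch
	return c.nextID, ch
}

// Dispatch routes a response read off the socket to whoever sent the request.
func (c *Conn) Dispatch(r Response) {
	c.mu.Lock()
	ch := c.pending[r.RequestID]
	delete(c.pending, r.RequestID)
	c.mu.Unlock()
	if ch != nil {
		ch <- r
	}
}

func main() {
	c := NewConn()
	id1, ch1 := c.Send("distributed query")
	id2, ch2 := c.Send("write")
	// responses can come back in any order
	c.Dispatch(Response{RequestID: id2, Body: "write ok"})
	c.Dispatch(Response{RequestID: id1, Body: "query result"})
	fmt.Println((<-ch1).Body, (<-ch2).Body)
}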
Writing Pipeline

{"value":23.8, "context":"more info here"}

1. Client writes the point to Server A.
2. Server A logs the write to its WAL.
3a. Server A returns success to the client.
3b. The write goes to the write buffer for its shard (another server or local LevelDB).
4. The buffered write is written to the shard.
5. The WAL entry is committed.
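A toy Go version of the write path under assumed names: append to the WAL, acknowledge the client immediately, hand the point to the shard's write buffer, and commit the WAL position only after the shard write (local LevelDB or a remote server) has succeeded.

package main

import "fmt"

// Point is the parsed write, e.g. {"value": 23.8, "context": "more info here"}.
type Point map[string]interface{}

// wal is a toy write-ahead log: appended entries plus the last committed index.
type wal struct {
	entries   []Point
	committed int
}

func (w *wal) Append(p Point) int {
	w.entries = append(w.entries, p)
	return len(w.entries) - 1
}

func (w *wal) Commit(idx int) { w.committed = idx }

type bufferedWrite struct {
	walIdx int
	point  Point
}

// shardWriter buffers writes destined for a shard (local LevelDB or another server).
type shardWriter struct{ buffer chan bufferedWrite }

// writePath: 2. log to WAL, 3b. send to the write buffer, 3a. return success.
func writePath(w *wal, sw *shardWriter, p Point) string {
	idx := w.Append(p)
	sw.buffer <- bufferedWrite{walIdx: idx, point: p}
	return "success"
}

func main() {
	w := &wal{}
	sw := &shardWriter{buffer: make(chan bufferedWrite, 16)}

	fmt.Println(writePath(w, sw, Point{"value": 23.8, "context": "more info here"}))

	// 4./5. the writer drains the buffer, writes to the shard, and commits the
	// WAL position once the shard write has succeeded.
	bw := <-sw.buffer
	w.Commit(bw.walIdx)
	fmt.Println("committed through WAL index", w.committed)
}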
WAL Implications

• Servers can go down
• Can replay from the last known commit
• Doesn't need to buffer in memory
• Can buffer against write bursts or LevelDB compactions
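A self-contained sketch of the replay side of those implications, with illustrative names: after a restart (or a backed-up buffer), everything past the last committed WAL index is simply re-applied, so nothing has to be held in memory.

package main

import "fmt"

// A minimal WAL whose entries past the committed index can be replayed.
type wal struct {
	entries   []string
	committed int // index of the last entry known to be written to its shard
}

// Replay re-applies every uncommitted entry, e.g. after the server comes
// back up or after a write burst / LevelDB compaction backed things up.
func (w *wal) Replay(apply func(string) error) error {
	for i := w.committed + 1; i < len(w.entries); i++ {
		if err := apply(w.entries[i]); err != nil {
			return err // stop; the committed index stays put and we retry later
		}
		w.committed = i
	}
	return nil
}

func main() {
	w := &wal{entries: []string{"point 0", "point 1", "point 2"}, committed: 0}
	w.Replay(func(e string) error {
		fmt.Println("re-writing", e)
		return nil
	})
	fmt.Println("committed index:", w.committed) // 2
}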
Continuous Queries (create)

select * from events into events.[user_id]

1. Client sends the query to Server A.
2. Server A asks the Raft leader to create the continuous query.
3. The leader replicates it to every other server in the cluster.
Writing Pipeline (with fanouts)

{"value":23.8, "context":"more info here"}

1. Client writes the point to Server A.
2. Server A logs the write to its WAL.
3a. Server A returns success to the client.
3b. Server A evaluates any fanouts for the series; the resulting points go through the normal write pipeline.
Fanouts

select value from foo_metric into foo_metric.[host]

Incoming point: {"value":23.8, "host":"serverA"}

Logged to the WAL:
{
  "series": "foo_metric",
  "time": 1396366076000000,
  "sequence_number": 10001,
  "value": 23.8,
  "host": "serverA"
}

Fanout:
{
  "series": "foo_metric.serverA",
  "time": 1396366076000000,
  "sequence_number": 10001,
  "value": 23.8
}

Time and sequence number stay the same.
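A Go sketch of that fanout transform; the helper names are hypothetical, but the invariant is the one shown above: the fanned-out point keeps the original time and sequence number, drops the fanout column, and lands in a series named after the column's value (foo_metric.serverA).

package main

import (
	"encoding/json"
	"fmt"
)

// Point is a flattened point as it appears in the WAL.
type Point struct {
	Series         string
	Time           int64
	SequenceNumber uint64
	Fields         map[string]interface{}
}

// fanout applies "select value from foo_metric into foo_metric.[host]":
// the target series name is built from the host column, the host column is
// dropped, and time + sequence number are carried over unchanged.
func fanout(p Point, column string) Point {
	out := Point{
		Series:         fmt.Sprintf("%s.%v", p.Series, p.Fields[column]),
		Time:           p.Time,           // same time
		SequenceNumber: p.SequenceNumber, // same sequence number
		Fields:         map[string]interface{}{},
	}
	for k, v := range p.Fields {
		if k != column {
			out.Fields[k] = v
		}
	}
	return out
}

func main() {
	in := Point{
		Series:         "foo_metric",
		Time:           1396366076000000,
		SequenceNumber: 10001,
		Fields:         map[string]interface{}{"value": 23.8, "host": "serverA"},
	}
	out := fanout(in, "host")
	b, _ := json.Marshal(map[string]interface{}{
		"series": out.Series, "time": out.Time,
		"sequence_number": out.SequenceNumber, "value": out.Fields["value"],
	})
	fmt.Println(string(b))
	// {"sequence_number":10001,"series":"foo_metric.serverA","time":1396366076000000,"value":23.8}
}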
Fanout Implications

• Servers don't check with the Raft leader for fanout definitions
• They need to be up to date with the leader
• Data duplication for performance
• Source points can be looked up by time and sequence number
• Failures mid-fanout are OK
Continuous Queries (run)

Stored query:
select count(page_id) from events group by time(1h), page_id into events.1h.[page_id]

1. Run the query, bounded to the window since the last run:
select count(page_id) from events group by time(1h), page_id into events.1h.[page_id] where time >= 1396368660 AND time < 1396368720

2. Write the results:
{"time": 1396368660, "sequence_number": 1, "series": "events.1h.22", "count": 23},
{"time": 1396368660, "sequence_number": 2, "series": "events.1h.56", "count": 10}
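A sketch of how a runner can turn the stored continuous query into the bounded query it executes for one interval; the time arithmetic matches the example above, while the function and parameter names are assumptions.

package main

import (
	"fmt"
	"time"
)

// boundedQuery turns a stored continuous query into the query executed for
// one run, covering the window from the previous run up to now.
func boundedQuery(cq string, lastRun, now time.Time) string {
	return fmt.Sprintf("%s where time >= %d AND time < %d", cq, lastRun.Unix(), now.Unix())
}

func main() {
	cq := "select count(page_id) from events group by time(1h), page_id into events.1h.[page_id]"
	lastRun := time.Unix(1396368660, 0)
	now := time.Unix(1396368720, 0)
	fmt.Println(boundedQuery(cq, lastRun, now))
	// ... where time >= 1396368660 AND time < 1396368720
}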
Implications

• Fault tolerant across server failures
• A new leader picks up from the last known run
• Continuous queries can be recalculated (future)
  • Recalculation will overwrite the old values
Pubsub

select * from some_series where host = "serverA" into subscription()

select percentile(90, value) from some_series group by time(1m) into subscription()