Slide 1

The Internals of InfluxDB
Paul Dix
@pauldix
paul@influxdb.com

Slide 2

An open source distributed time series, metrics, and events database. Like, for analytics.

Slide 3

Basic Concepts

Slide 4

Data model
• Databases
• Time series (or tables, but you can have millions)
• Points (or rows, but column oriented)

Slide 5

HTTP API (writes)

curl -X POST \
  'http://localhost:8086/db/mydb/series?u=paul&p=pass' \
  -d '[{"name":"foo", "columns":["val"], "points": [[3]]}]'
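
For illustration, a minimal sketch of the same write issued from Go, using only the endpoint and payload shown in the curl example above (the credentials and database name are just the example values):

package main

import (
    "bytes"
    "fmt"
    "net/http"
)

func main() {
    // Same payload as the curl example: one point with a single column "val".
    body := []byte(`[{"name":"foo", "columns":["val"], "points": [[3]]}]`)
    resp, err := http.Post(
        "http://localhost:8086/db/mydb/series?u=paul&p=pass",
        "application/json",
        bytes.NewReader(body),
    )
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status)
}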

Slide 6

HTTP API (queries)

curl 'http://localhost:8086/db/mydb/series?u=paul&p=pass&q=.'

Slide 7

Data (with timestamp)

[
  {
    "name": "cpu",
    "columns": ["time", "value", "host"],
    "points": [
      [1395168540, 56.7, "foo.influxdb.com"],
      [1395168540, 43.9, "bar.influxdb.com"]
    ]
  }
]

Slide 8

Data (auto-timestamp)

[
  {
    "name": "events",
    "columns": ["type", "email"],
    "points": [
      ["signup", "[email protected]"],
      ["paid", "[email protected]"]
    ]
  }
]

Slide 9

SQL-ish

select * from events where time > now() - 1h

Slide 10

Aggregates

select percentile(90, value) from response_times
group by time(10m)
where time > now() - 1d

Slide 11

Select from Regex

select * from /stats\.cpu\..*/ limit 1

Slide 12

Where against Regex

select value from application_logs
where value =~ /.*ERROR.*/
and time > "2014-03-01" and time < "2014-03-03"

Slide 13

Continuous queries (fan out)

select * from events into events.[user_id]

Slide 14

Continuous queries (summaries)

select count(page_id) from events
group by time(1h), page_id
into events.[page_id]

Slide 15

Continuous queries (regex downsampling)

select max(value), context from /stats\.*/
group by time(5m)
into max.:series_name

Slide 16

Under the hood

Slide 17

Queries - Parsing
• Lex / Flex - lexer
• Yacc / Bison - parser generator

Slide 18

Queries - Processing
• Totally custom
• Hand-code each aggregate function, conditional, etc.
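
As a sketch of what "hand-code each aggregate function" can mean in practice, here is a hypothetical Aggregator interface with a hand-written count(); the names are illustrative only, not InfluxDB's actual engine types:

package engine

// Aggregator is a hypothetical interface for hand-coded aggregate functions.
type Aggregator interface {
    AggregatePoint(value float64)
    Result() float64
}

// countAggregator implements count() by tallying every point it sees.
type countAggregator struct {
    n float64
}

func (c *countAggregator) AggregatePoint(_ float64) { c.n++ }
func (c *countAggregator) Result() float64          { return c.n }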

Slide 19

How data is organized

Slide 20

Shard

type Shard struct {
    Id        uint32
    StartTime time.Time
    EndTime   time.Time
    ServerIds []uint32
}

Slide 21

Shard is a contiguous block of time

Slide 22

Data for all series for a given interval exists in a shard*

*by default, but can be modified

Slide 23

Every server knows about every shard

Slide 24

Query Pipeline

select count(value) from foo group by time(5m)

1. Client sends the query to Server A
2. Server A queries the shards (each a local LevelDB or another server)
3. Each shard computes its part locally
4. Shards stream their results back to Server A
5. Server A collates and returns the result to the client
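
A sketch of the fan-out step, assuming hypothetical ShardResult and queryShard stand-ins (not InfluxDB's actual code): the query runs against every shard concurrently and the partial results come back to Server A for collation.

package query

import "sync"

// ShardResult is a hypothetical partial result streamed back from one shard.
type ShardResult []float64

// queryShards fans a query out to every shard concurrently and collects the
// partial results; the caller (Server A) collates them and replies to the client.
func queryShards(shards []uint32, queryShard func(id uint32) ShardResult) []ShardResult {
    results := make([]ShardResult, len(shards))
    var wg sync.WaitGroup
    for i, id := range shards {
        wg.Add(1)
        go func(i int, id uint32) {
            defer wg.Done()
            results[i] = queryShard(id) // local LevelDB read or remote TCP request
        }(i, id)
    }
    wg.Wait()
    return results
}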

Slide 25

Group by times less than shard size gives locality

Slide 26

Query Pipeline (no locality)

select percentile(90, value) from response_times

1. Client sends the query to Server A
2. Server A queries the shards (each a local LevelDB or another server)
3. Shards stream their raw points back to Server A
4. Server A collates, computes, and returns the result to the client

Slide 27

Server A could buffer all results in memory*

*able to limit the number of shards to query in parallel

Slide 28

Evaluate the query and only hit the shards we need.

Slide 29

Query Pipeline

select count(value) from foo where time > now() - 10m

1. Client sends the query to Server A
2. Server A queries only the one shard covering that time range (a local LevelDB or another server)
3. The shard computes its result locally
4. The shard streams its result back to Server A
5. Server A collates and returns the result to the client

Slide 30

Scaling out with shards

Slide 31

Shard Config

# in the config
[sharding]
duration = "7d"
split = 3
replication-factor = 2
# split-random = "/^big.*/"

Slide 32

Shards (read scalability)

type Shard struct {
    Id        uint32
    StartTime time.Time
    EndTime   time.Time
    ServerIds []uint32 // replication factor!
}

Slide 33

Splitting durations into multiple shards (write scalability)

Slide 34

Shard Creation

{"value":23.8, "context":"more info here"}

1. Write arrives at Server A, which looks up the time bucket
2. Does a shard exist for that bucket? If not, create the shards through Raft, using the rules from the config
3. Continue with the normal write pipeline

Slide 35

Mapping Data to Shard
1. Look up the time bucket
2. N = number of shards for that time bucket
3. If series =~ /regex in config/
   • Write to random(N)
4. Else
   • Write to hash(database, series) % N
   • Not a consistent hashing scheme
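
A minimal sketch of that mapping, assuming an FNV hash and a precompiled split-random regex (both assumptions for illustration; InfluxDB's actual hash and helpers may differ):

package sharding

import (
    "hash/fnv"
    "math/rand"
    "regexp"
)

// shardIndexFor picks which of the N shards in a time bucket receives a write.
func shardIndexFor(database, series string, n int, splitRandom *regexp.Regexp) int {
    if splitRandom != nil && splitRandom.MatchString(series) {
        // Matching series are spread randomly: scalable writes, no locality.
        return rand.Intn(n)
    }
    // Otherwise hash(database, series) % N keeps a series in one shard per bucket.
    // Note: not a consistent hashing scheme.
    h := fnv.New32a()
    h.Write([]byte(database))
    h.Write([]byte(series))
    return int(h.Sum32()) % n
}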

Slide 36

Shard Split Implications
• If using hash, all data for a given duration of a series is in a specific shard
  • Data locality
• If using regex
  • Scalable writes
  • No data locality for queries
• Split can’t change on old data without total rebalancing of the shard

Slide 37

Rules for RF and split can change over time

Slide 38

Splits don’t need to change for old shards

Slide 39

Underlying storage

Slide 40

Storage Engine
• LevelDB (one LDB per shard)
• Log Structured Merge Tree (LSM Tree)
• Ordered key/value hash

Slide 41

Key - 24 bytes

[id, time, sequence]

Tuple with id, time, and sequence number

Slide 42

Key - ID (8 bytes)

[id, time, sequence]

uint64 - identifies database, series, column
Each column has its own id

Slide 43

Key - Time (8 bytes)

[id, time, sequence]

uint64 - microsecond epoch, normalized to be positive (for range scans)

Slide 44

Key - Sequence (8 bytes)

[id, time, sequence]

uint64 - unique in the cluster
Incremented in steps of 1000; the low three digits are the unique server id
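
A sketch of how a 24-byte key like this can be encoded, assuming big-endian byte order so that LevelDB's lexicographic ordering matches numeric ordering (an assumption for illustration, not InfluxDB's actual encoding code):

package keys

import "encoding/binary"

// encodeKey packs (id, time, sequence) into a 24-byte LevelDB key.
// Big-endian order makes byte-wise comparison match numeric comparison,
// which is what range scans over time rely on.
func encodeKey(id, timeMicros, sequence uint64) []byte {
    key := make([]byte, 24)
    binary.BigEndian.PutUint64(key[0:8], id)
    binary.BigEndian.PutUint64(key[8:16], timeMicros)
    binary.BigEndian.PutUint64(key[16:24], sequence)
    return key
}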

Slide 45

Ordered Keys

{"value":23.8, "context":"more info here"}

[1,1396298064000000,1001]
[1,1396298064000000,2001]
[1,1396298064000000,3001]
[1,1396298074000000,4001]
…
[2,1396298064000000,1001]
[2,1396298064000000,2001]
[2,1396298064000000,3001]
[2,1396298074000000,4001]

Same point, different columns: keys under id 1 and id 2 with the same time and sequence number belong to one point, stored once per column.

Slide 46

Ordered Keys

{"value":23.8, "context":"more info here"}

[1,1396298064000000,1001]
[1,1396298064000000,2001]
[1,1396298064000000,3001]
[1,1396298074000000,4001]
…
[2,1396298064000000,1001]
[2,1396298064000000,2001]
[2,1396298064000000,3001]
[2,1396298074000000,4001]

Different point, same timestamp: the first three keys share a timestamp but have distinct sequence numbers.

Slide 47

Values

// protobuf bytes
message FieldValue {
    optional string string_value = 1;
    optional double double_value = 3;
    optional bool bool_value = 4;
    optional int64 int64_value = 5;
    // never stored if this
    optional bool is_null = 6;
}

Slide 48

Storage Implications
• Aggregate queries do a range scan for the entire time range
• Where clause queries do a range scan for the entire time range
• Queries with multiple columns do a range scan for the entire time range for EVERY column
• Splitting into many time series is the way to index
• No more than 999 servers in a cluster
• String values get repeated, but compression may lessen the impact
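
A sketch of the range scan every one of these query shapes reduces to, using the goleveldb library as a stand-in (an assumption; InfluxDB 0.x used LevelDB through C bindings):

package storage

import (
    "github.com/syndtr/goleveldb/leveldb"
    "github.com/syndtr/goleveldb/leveldb/util"
)

// scanColumn walks every key/value pair for one column id between two 24-byte
// keys. Multi-column queries repeat a scan like this once per column.
func scanColumn(db *leveldb.DB, startKey, endKey []byte) ([][]byte, error) {
    iter := db.NewIterator(&util.Range{Start: startKey, Limit: endKey}, nil)
    defer iter.Release()

    var values [][]byte
    for iter.Next() {
        // iter.Value() holds the protobuf-encoded FieldValue shown earlier.
        values = append(values, append([]byte(nil), iter.Value()...))
    }
    return values, iter.Error()
}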

Slide 49

Distributed Parts

Slide 50

Raft
• Distributed consensus protocol (like Paxos)
• Runs on its own port using an HTTP/JSON protocol
• We use it for meta-data:
  • What servers are in the cluster
  • What databases exist
  • What users exist
  • What shards exist and where
  • What continuous queries exist

Slide 51

Raft - why not series data?
• Write bottleneck (all writes go through the leader)
• Raft groups?
  • How many?
  • How to add new ones?
• Too chatty

Slide 52

Server Joining…
• Looks at “seed-servers” in config
• Raft connects to server, attempts to join
• Gets redirected to leader
• Joins cluster (all other servers informed)
• Log replay (meta-data)
• Connects to other servers in cluster via TCP

Slide 53

TCP Protocol
• Each server has a single connection to every other server
• Request/response multiplexed onto the connection
• Protobuf wire protocol
• Distributed queries
• Writes
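
A sketch of one common way to multiplex protobuf messages over a single TCP connection: length-prefixed frames. The 4-byte prefix is an assumption for illustration; the slides only say requests and responses are multiplexed onto one connection.

package wire

import (
    "encoding/binary"
    "io"
)

// writeFrame sends one length-prefixed protobuf payload over the shared connection.
func writeFrame(w io.Writer, payload []byte) error {
    var length [4]byte
    binary.BigEndian.PutUint32(length[:], uint32(len(payload)))
    if _, err := w.Write(length[:]); err != nil {
        return err
    }
    _, err := w.Write(payload)
    return err
}

// readFrame reads the next length-prefixed payload; the request_id inside the
// decoded message lets responses be matched back to outstanding requests.
func readFrame(r io.Reader) ([]byte, error) {
    var length [4]byte
    if _, err := io.ReadFull(r, length[:]); err != nil {
        return nil, err
    }
    payload := make([]byte, binary.BigEndian.Uint32(length[:]))
    if _, err := io.ReadFull(r, payload); err != nil {
        return nil, err
    }
    return payload, nil
}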

Slide 54

Request

message Request {
    enum Type {
        WRITE = 1;
        QUERY = 2;
        HEARTBEAT = 3;
    }
    optional uint32 id = 1;
    required Type type = 2;
    required string database = 3;
    optional Series series = 4;
    optional uint32 shard_id = 6;
    optional string query = 7;
    optional string user_name = 8;
    optional uint32 request_number = 9;
}

Slide 55

Response

message Response {
    enum Type {
        QUERY = 1;
        WRITE_OK = 2;
        END_STREAM = 3;
        HEARTBEAT = 9;
        EXPLAIN_QUERY = 10;
    }
    required Type type = 1;
    required uint32 request_id = 2;
    optional Series series = 3;
    optional string error_message = 5;
}

Slide 56

Writes go over the TCP protocol. How do we ensure replication?

Slide 57

Write Ahead Log (WAL)

Slide 58

Writing Pipeline

{"value":23.8, "context":"more info here"}

1. Client sends the write to Server A
2. Server A logs it to the WAL
3a. Return success to the client
3b. Send the write to the write buffer
4. Write to the shard (another server or local LevelDB)
5. Commit the entry in the WAL
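
A sketch of that pipeline in Go, with hypothetical WAL and WriteBuffer interfaces (illustrative only, not InfluxDB's types): the write is acknowledged once it is in the WAL, and the WAL entry is committed only after the shard has it.

package write

// Point, WAL, and WriteBuffer are hypothetical, only to make the sketch self-contained.
type Point struct{}

type WAL interface {
    Append(Point) (uint64, error) // returns a log sequence for later commit
    Commit(uint64)
}

type WriteBuffer interface {
    Enqueue(p Point, onWritten func())
}

// handleWrite follows the numbered steps in the diagram above.
func handleWrite(wal WAL, buffer WriteBuffer, p Point) error {
    seq, err := wal.Append(p) // 2. log to WAL
    if err != nil {
        return err // never ack a write that could be lost
    }
    buffer.Enqueue(p, func() { // 3b. send to the write buffer
        wal.Commit(seq) // 5. commit once the shard (server or LevelDB) has written it
    })
    return nil // 3a. return success to the client
}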

Slide 59

WAL Implications
• Servers can go down
• Can replay from last known commit
• Doesn’t need to buffer in memory
• Can buffer against write bursts or LevelDB compactions

Slide 60

Continuous Query Architecture

Slide 61

Continuous Queries
• Denormalization, downsampling
• For performance
• Distributed
• Replicated through Raft

Slide 62

Continuous Queries (create)

select * from events into events.[user_id]

1. Client sends the query to Server A
2. Server A creates the continuous query through the Raft leader
3. The Raft leader replicates it to the other servers

Slide 63

Continuous queries (fan out)

select * from events into events.[user_id]

select value from /^foo.*/ into bar.[type].:foo

Slide 64

Writing Pipeline

{"value":23.8, "context":"more info here"}

1. Client sends the write to Server A
2. Server A logs it to the WAL
3a. Return success to the client
3b. Send the write to the write buffer
4. Write to the shard (another server or local LevelDB)
5. Commit the entry in the WAL

Slide 65

Writing Pipeline

{"value":23.8, "context":"more info here"}

1. Client sends the write to Server A
2. Server A logs it to the WAL
3a. Return success to the client
3b. Evaluate fanouts, then continue with the normal write pipeline

Slide 66

Fanouts

select value from foo_metric into foo_metric.[host]

{"value":23.8, "host":"serverA"}

Logged to the WAL:
{
    "series": "foo_metric",
    "time": 1396366076000000,
    "sequence_number": 10001,
    "value": 23.8,
    "host": "serverA"
}

Fanout (time and sequence number stay the same):
{
    "series": "foo_metric.serverA",
    "time": 1396366076000000,
    "sequence_number": 10001,
    "value": 23.8
}
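
A sketch of that transformation with a hypothetical Point type (illustrative, not InfluxDB's code): the fanned-out point keeps the source point's time and sequence number.

package fanout

// Point is a hypothetical in-memory representation of a written point.
type Point struct {
    Series         string
    Time           int64
    SequenceNumber uint64
    Fields         map[string]interface{}
}

// fanOutPoint builds the point written to foo_metric.<host> from a point
// written to foo_metric, copying only the selected columns and keeping the
// same time and sequence number so the source point can be looked up later.
func fanOutPoint(p Point, fanoutColumn string, selected []string) Point {
    out := Point{
        Series:         p.Series + "." + p.Fields[fanoutColumn].(string), // assumes a string column
        Time:           p.Time,
        SequenceNumber: p.SequenceNumber,
        Fields:         map[string]interface{}{},
    }
    for _, col := range selected {
        out.Fields[col] = p.Fields[col]
    }
    return out
}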

Slide 67

Fanout Implications
• Servers don’t check with the Raft leader for fanout definitions
• They need to be up to date with the leader
• Data duplication for performance
• Can look up source points on time, SN
• Failures mid-fanout OK

Slide 68

Continuous queries (summaries)

select count(page_id) from events
group by time(1h), page_id
into events.1h.[page_id]

Slide 69

Continuous queries (regex downsampling)

select max(value), context from /^stats\.*/
group by time(5m)
into max.5m.:series_name

Slide 70

Continuous Queries (group by time)
• Run on the Raft leader
• Checked every second to see if any need to run

Slide 71

Continuous Queries

1. Check whether any queries are due, based on the time of the last successful run
2. Run the due queries through the write pipeline
3. Mark the time of this run and replicate it through Raft

Slide 72

Continuous Queries

select count(page_id) from events
group by time(1h), page_id
into events.1h.[page_id]

1. Run (with time bounds added):

select count(page_id) from events
group by time(1h), page_id
into events.1h.[page_id]
where time >= 1396368660 AND time < 1396368720

2. Write:

{
    "time": 1396368660,
    "sequence_number": 1,
    "series": "events.1h.22",
    "count": 23
},
{
    "time": 1396368660,
    "sequence_number": 2,
    "series": "events.1h.56",
    "count": 10
}
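
A sketch of the bookkeeping behind "checks since last successful run": compute the fully elapsed group-by windows since the last run, each of which becomes a query with where time >= start AND time < end bounds like the one above. Illustrative only, not InfluxDB's implementation.

package cq

import "time"

// windowsSince returns the [start, end) intervals that have fully elapsed
// since the last successful run, aligned to the group-by interval.
func windowsSince(lastRun, now time.Time, interval time.Duration) [][2]time.Time {
    var windows [][2]time.Time
    start := lastRun.Truncate(interval)
    for end := start.Add(interval); !end.After(now); end = end.Add(interval) {
        windows = append(windows, [2]time.Time{start, end})
        start = end
    }
    return windows
}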

Slide 73

Implications
• Fault tolerant across server failures
• New leader picks up since the last known run
• Continuous queries can be recalculated (future)
  • Will overwrite old values

Slide 74

Future Work

Slide 75

Custom Functions

select myFunc(value) from some_series

Slide 76

Pubsub

select * from some_series
where host = "serverA"
into subscription()

select percentile(90, value) from some_series
group by time(1m)
into subscription()

Slide 77

Rack aware sharding

Slide 78

Multi-datacenter replication

Slide 79

Thanks!

Paul Dix
@pauldix
@influxdb
paul@influxdb.com