The Internals of InfluxDB

Paul Dix
April 04, 2014

Talk on the underlying implementation details of InfluxDB, given in San Francisco the week of 04/04/14.

Transcript

  1. The Internals of InfluxDB Paul Dix @pauldix paul@influxdb.com

  2. An open source distributed time series, metrics, and events database. Like, for analytics.

  3. Basic Concepts

  4. Data model • Databases • Time series (or tables, but you can have millions) • Points (or rows, but column oriented)

  5. HTTP API (writes) curl -X POST 'http://localhost:8086/db/mydb/series?u=paul&p=pass' -d '[{"name":"foo", "columns":["val"], "points": [[3]]}]'

  6. HTTP API (queries) curl 'http://localhost:8086/db/mydb/series?u=paul&p=pass&q=.'

  7. Data (with timestamp) [ { "name": "cpu", "columns": ["time", "value", "host"], "points": [ [1395168540, 56.7, "foo.influxdb.com"], [1395168540, 43.9, "bar.influxdb.com"] ] } ]

  8. Data (auto-timestamp) [ { "name": "events", "columns": ["type", "email"], "points": [ ["signup", "paul@influxdb.com"], ["paid", "foo@asdf.com"] ] } ]

  9. SQL-ish select * from events where time > now() - 1h

  10. Aggregates select percentile(90, value) from response_times group by time(10m) where time > now() - 1d

  11. Select from Regex select * from /stats\.cpu\..*/ limit 1

  12. Where against Regex select value from application_logs where value =~ /.*ERROR.*/ and time > "2014-03-01" and time < "2014-03-03"

  13. Continuous queries (fan out) select * from events into events.[user_id]

  14. Continuous queries (summaries) select count(page_id) from events group by time(1h), page_id into events.[page_id]

  15. Continuous queries (regex downsampling) select max(value), context from /stats\.*/ group by time(5m) into max.:series_name

  16. Under the hood

  17. Queries - Parsing • Lex • Flex • Lexer • Yacc • Bison • Parser generator

  18. Queries - Processing • Totally custom • Hand code each aggregate function, conditional, etc.

  19. How data is organized

  20. Shard type Shard struct { Id uint32 StartTime time.Time EndTime time.Time ServerIds []uint32 }

  21. Shard is a contiguous block of time

  22. Data for all series for a given interval exists in a shard* *by default, but can be modified

  23. Every server knows about every shard

  24. Query Pipeline (diagram) select count(value) from foo group by time(5m): 1. the client queries Server A, 2. Server A queries the shards (each on another server or in local LevelDB), 3. each shard computes locally, 4. results stream back to Server A, 5. Server A collates and returns to the client.

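As a rough Go sketch of why step 3 above (and slide 25) matters: when the group-by interval fits inside a shard, each shard can build its buckets locally and the coordinator only merges per-bucket partials. The function and data below are illustrative, not code from the server.

    package main

    import "fmt"

    // mergeCounts combines per-shard partial counts keyed by bucket start time;
    // in the local case this is all the coordinating server has to do.
    func mergeCounts(partials []map[int64]int64) map[int64]int64 {
        total := make(map[int64]int64)
        for _, p := range partials {
            for bucket, n := range p {
                total[bucket] += n
            }
        }
        return total
    }

    func main() {
        shardA := map[int64]int64{1396368000: 12, 1396368300: 7}
        shardB := map[int64]int64{1396368000: 3}
        fmt.Println(mergeCounts([]map[int64]int64{shardA, shardB}))
    }
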
  25. Group by times less than shard size gives locality

  26. Query Pipeline, no locality (diagram) select percentile(90, value) from response_times: 1. the client queries Server A, 2. Server A queries the shards (each on another server or in local LevelDB), 3. shards stream raw points back, 4. Server A collates, computes, and returns to the client.

  27. serverA could buffer all results in memory *Able to limit number of shards to query in parallel

  28. Evaluate the query and only hit the shards we need.

  29. Query Pipeline (diagram) select count(value) from foo where time > now() - 10m: 1. the client queries Server A, 2. Server A queries only the one matching shard (server or local LevelDB), 3. the shard computes locally, 4. the result streams back, 5. Server A collates and returns to the client.

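A small Go sketch of the shard pruning slide 28 describes, reusing the Shard struct from slide 20. shardsForRange is a hypothetical helper name; the overlap test is the idea being illustrated, not the server's actual code.

    package main

    import (
        "fmt"
        "time"
    )

    type Shard struct {
        Id        uint32
        StartTime time.Time
        EndTime   time.Time
        ServerIds []uint32
    }

    // shardsForRange keeps only shards whose [StartTime, EndTime) overlaps the
    // query's [start, end) interval, e.g. "where time > now() - 10m".
    func shardsForRange(shards []Shard, start, end time.Time) []Shard {
        var hit []Shard
        for _, s := range shards {
            if s.StartTime.Before(end) && s.EndTime.After(start) {
                hit = append(hit, s)
            }
        }
        return hit
    }

    func main() {
        now := time.Now()
        shards := []Shard{
            {Id: 1, StartTime: now.Add(-14 * 24 * time.Hour), EndTime: now.Add(-7 * 24 * time.Hour)},
            {Id: 2, StartTime: now.Add(-7 * 24 * time.Hour), EndTime: now.Add(7 * 24 * time.Hour)},
        }
        // "time > now() - 10m" only needs the current shard.
        fmt.Println(shardsForRange(shards, now.Add(-10*time.Minute), now))
    }
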
  30. Scaling out with shards

  31. Shard Config # in the config [sharding] duration = "7d" split = 3 replication-factor = 2 # split-random = "/^big.*/"

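To spell out what those knobs mean, here is the [sharding] section as a hypothetical Go struct with comments; the toml tags assume a TOML decoder, and these are not the server's actual config types.

    package main

    import "fmt"

    type ShardingConfig struct {
        Duration          string `toml:"duration"`           // time covered by one shard, e.g. "7d"
        Split             int    `toml:"split"`              // shards created per duration (write scalability, slide 33)
        ReplicationFactor int    `toml:"replication-factor"` // servers holding a copy of each shard (read scalability, slide 32)
        SplitRandom       string `toml:"split-random"`       // series matching this regex go to a random shard (slide 35)
    }

    func main() {
        cfg := ShardingConfig{Duration: "7d", Split: 3, ReplicationFactor: 2, SplitRandom: "/^big.*/"}
        fmt.Printf("%+v\n", cfg)
    }
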
  32. Shards (read scalability) type Shard struct { Id uint32 StartTime time.Time EndTime time.Time ServerIds []uint32 } Replication Factor!

  33. Splitting durations into multiple shards (write scalability)

  34. Shard Creation (diagram) {"value":23.8, "context":"more info here"}: 1. a write arrives at Server A, which looks up the time bucket and asks "shard exists?"; if yes, it goes through the normal write pipeline and is done. If no: 2. create the shards through Raft, using the rules from the config, then 3. do a normal write.

  35. Mapping Data to Shard 1. Lookup time bucket 2. N = number of shards for that time bucket 3. If series =~ /regex in config/, write to random(N) 4. Else, write to hash(database, series) % N (not a consistent hashing scheme)

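A Go sketch of that mapping, under the assumption that hash(database, series) is an ordinary (non-consistent) hash; the talk does not name the hash function, so FNV-1a stands in here purely for illustration, and pickShard is a hypothetical helper.

    package main

    import (
        "fmt"
        "hash/fnv"
        "math/rand"
        "regexp"
    )

    // pickShard returns which of the N shards in a time bucket receives a write
    // for the given database/series. splitRandom mirrors the "split-random"
    // regex from the config on slide 31.
    func pickShard(database, series string, n int, splitRandom *regexp.Regexp) int {
        if splitRandom != nil && splitRandom.MatchString(series) {
            return rand.Intn(n) // scalable writes, no locality
        }
        h := fnv.New64a()
        h.Write([]byte(database))
        h.Write([]byte(series))
        return int(h.Sum64() % uint64(n)) // all of a series' data for this bucket in one shard
    }

    func main() {
        big := regexp.MustCompile(`^big.*`)
        fmt.Println(pickShard("mydb", "cpu.load", 3, big))     // hashed
        fmt.Println(pickShard("mydb", "big.requests", 3, big)) // random
    }
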
  36. Shard Split Implications • If using hash, all data for a given duration of a series is in a specific shard • Data locality • If using regex • scalable writes • no data locality for queries • Split can’t change on old data without total rebalancing of the shard

  37. Rules for RF and split can change over time

  38. Splits don’t need to change for old shards

  39. Underlying storage

  40. Storage Engine • LevelDB (one LDB per shard) • Log Structured Merge Tree (LSM Tree) • Ordered key/value hash

  41. Key - 24 bytes [id,time,sequence] Tuple with id, time, and sequence number

  42. Key - ID (8 bytes) [id,time,sequence] Uint64 - Identifies database, series, column. Each column has its own id

  43. Key - Time (8 bytes) [id,time,sequence] Uint64 - microsecond epoch, normalized to be positive (for range scans)

  44. Key - Sequence (8 bytes) [id,time,sequence] Uint64 - unique in cluster. 1000. First three digits unique server id

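A Go sketch of packing the 24-byte [id, time, sequence] key from slides 41-44. Big-endian byte order and the exact time normalization are assumptions (they make LevelDB's lexicographic key order match numeric order); encodeKey is illustrative, not the server's implementation.

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // encodeKey packs a column id, a microsecond timestamp (normalized so it is
    // always positive, per slide 43), and a sequence number into 24 bytes.
    func encodeKey(id uint64, timeMicro int64, sequence uint64) []byte {
        key := make([]byte, 24)
        binary.BigEndian.PutUint64(key[0:8], id)
        // shift the signed microsecond epoch into unsigned space so range scans
        // over times before 1970 still sort correctly
        binary.BigEndian.PutUint64(key[8:16], uint64(timeMicro)+uint64(1)<<63)
        binary.BigEndian.PutUint64(key[16:24], sequence)
        return key
    }

    func main() {
        fmt.Printf("%x\n", encodeKey(1, 1396298064000000, 1001))
    }
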
  45. Ordered Keys [1,1396298064000000,1001] [1,1396298064000000,2001] [1,1396298064000000,3001] [1,1396298074000000,4001] … [2,1396298064000000,1001] [2,1396298064000000,2001] [2,1396298064000000,3001] [2,1396298074000000,4001] {"value":23.8, "context":"more info here"} same point, different columns

  46. Ordered Keys [1,1396298064000000,1001] [1,1396298064000000,2001] [1,1396298064000000,3001] [1,1396298074000000,4001] … [2,1396298064000000,1001] [2,1396298064000000,2001] [2,1396298064000000,3001] [2,1396298074000000,4001] {"value":23.8, "context":"more info here"} different point, same timestamp

  47. Values // protobuf bytes message FieldValue { optional string string_value = 1; optional double double_value = 3; optional bool bool_value = 4; optional int64 int64_value = 5; // never stored if this optional bool is_null = 6; }

  48. Storage Implications • Aggregate queries do range scan for entire time range • Where clause queries do range scan for entire time range • Queries with multiple columns do range scan for entire time range for EVERY column • Splitting into many time series is the way to index • No more than 999 servers in a cluster • String values get repeated, but compression may lessen impact

  49. Distributed Parts

  50. Raft • Distributed Consensus Protocol (like Paxos) • Runs on its own port using HTTP/JSON protocol • We use it for meta-data • What servers are in the cluster • What databases exist • What users exist • What shards exist and where • What continuous queries exist

  51. Raft - why not series data? • Write bottleneck (all writes go through the leader) • Raft groups? • How many? • How to add new ones? • Too chatty

  52. Server Joining… • Looks at "seed-servers" in config • Raft connects to server, attempts to join • Gets redirected to leader • Joins cluster (all other servers informed) • Log replay (meta-data) • Connects to other servers in cluster via TCP

  53. TCP Protocol • Each server has a single connection to every other server • Request/response multiplexed onto connection • Protobuf wire protocol • Distributed Queries • Writes

  54. Request message Request { enum Type { WRITE = 1; QUERY = 2; HEARTBEAT = 3; } optional uint32 id = 1; required Type type = 2; required string database = 3; optional Series series = 4; optional uint32 shard_id = 6; optional string query = 7; optional string user_name = 8; optional uint32 request_number = 9; }

  55. Response message Response { enum Type { QUERY = 1; WRITE_OK = 2; END_STREAM = 3; HEARTBEAT = 9; EXPLAIN_QUERY = 10; } required Type type = 1; required uint32 request_id = 2; optional Series series = 3; optional string error_message = 5; }

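Slide 53 says request/response pairs are multiplexed onto one connection per peer, and the Request and Response messages above carry request_number / request_id for exactly that. Here is a minimal Go sketch of the correlation idea, with protobuf framing and real networking omitted; the types and names are illustrative only.

    package main

    import (
        "fmt"
        "sync"
    )

    type response struct {
        requestID uint32
        payload   string
    }

    type conn struct {
        mu      sync.Mutex
        nextID  uint32
        pending map[uint32]chan response
    }

    func newConn() *conn { return &conn{pending: make(map[uint32]chan response)} }

    // send registers a pending request and returns a channel that the reader
    // goroutine completes when a response with the same id arrives.
    func (c *conn) send(payload string) chan response {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.nextID++
        ch := make(chan response, 1)
        c.pending[c.nextID] = ch
        // a real implementation would frame and write the request here
        go c.deliver(response{requestID: c.nextID, payload: "ok: " + payload})
        return ch
    }

    // deliver routes an incoming response to whichever caller is waiting on it.
    func (c *conn) deliver(r response) {
        c.mu.Lock()
        ch := c.pending[r.requestID]
        delete(c.pending, r.requestID)
        c.mu.Unlock()
        if ch != nil {
            ch <- r
        }
    }

    func main() {
        c := newConn()
        fmt.Println(<-c.send("select count(value) from foo"))
    }
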
  56. Writes go over the TCP protocol. How do we ensure replication?

  57. Write Ahead Log (WAL)

  58. Writing Pipeline (diagram) {"value":23.8, "context":"more info here"}: 1. the client writes to Server A, 2. Server A logs to the WAL, 3a. return success to the client, 3b. send to the write buffer for the shard (server or local LevelDB), 4. write, 5. commit to the WAL.

  59. WAL Implications • Servers can go down • Can replay from last known commit • Doesn’t need to buffer in memory • Can buffer against write bursts or LevelDB compactions

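A rough Go sketch of the WAL idea in slides 57-59: append and sync the write before acknowledging it, and track the last committed sequence so a restarted server can replay from there. The file format and type names are assumptions for illustration, not InfluxDB's actual WAL.

    package main

    import (
        "fmt"
        "os"
    )

    type wal struct {
        f             *os.File
        lastCommitted uint64
    }

    // append logs a raw write and flushes it to disk; only after this returns can
    // the server safely report success to the client (step 3a on slide 58).
    func (w *wal) append(seq uint64, entry []byte) error {
        if _, err := fmt.Fprintf(w.f, "%d %s\n", seq, entry); err != nil {
            return err
        }
        return w.f.Sync()
    }

    // commit records that everything up to seq has been applied to the shard, so
    // replay after a crash starts from lastCommitted+1 (step 5 on slide 58).
    func (w *wal) commit(seq uint64) { w.lastCommitted = seq }

    func main() {
        f, err := os.CreateTemp("", "wal")
        if err != nil {
            panic(err)
        }
        defer os.Remove(f.Name())

        w := &wal{f: f}
        if err := w.append(1, []byte(`{"value":23.8}`)); err != nil {
            panic(err)
        }
        w.commit(1)
        fmt.Println("last committed:", w.lastCommitted)
    }
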
  60. Continuous Query Architecture

  61. Continuous Queries • Denormalization, downsampling • For performance • Distributed • Replicated through Raft

  62. Continuous Queries, create (diagram) select * from events into events.[user_id]: 1. the query goes to Server A, 2. Server A sends a create to the Raft leader, 3. the leader replicates the definition to the other servers.

  63. Continuous queries (fan out) select * from events into events.[user_id] select value from /^foo.*/ into bar.[type].:foo

  64. Writing Pipeline (diagram) {"value":23.8, "context":"more info here"}: 1. the client writes to Server A, 2. Server A logs to the WAL, 3a. return success, 3b. send to the write buffer for the shard (server or local LevelDB), 4. write, 5. commit to the WAL.

  65. Writing Pipeline (diagram) {"value":23.8, "context":"more info here"}: 1. the client writes to Server A, 2. Server A logs to the WAL, 3a. return success, 3b. evaluate fanouts, then continue through the normal write pipeline.

  66. Fanouts select value from foo_metric into foo_metric.[host] {"value":23.8, "host":"serverA"} Log to WAL: { "series": "foo_metric", "time": 1396366076000000, "sequence_number": 10001, "value": 23.8, "host": "serverA" } Fanout: { "series": "foo_metric.serverA", "time": 1396366076000000, "sequence_number": 10001, "value": 23.8 } Time, SN the same

  67. Fanout Implications • Servers don’t check with Raft leader for fanout definitions • They need to be up to date with leader • Data duplication for performance • Can lookup source points on time, SN • Failures mid-fanout OK

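A Go sketch of the fanout rewrite on slide 66: the write to foo_metric is duplicated into foo_metric.<host>, keeping the same time and sequence number so the source point can always be looked up (slide 67). The Point type and fanOut helper are illustrative names, not the server's types.

    package main

    import "fmt"

    type Point struct {
        Series         string
        Time           int64
        SequenceNumber uint64
        Columns        map[string]interface{}
    }

    // fanOut builds the derived point for one fanout definition: keep the
    // selected column, move the [host] column into the series name, and reuse
    // time and sequence number so the source point remains addressable.
    func fanOut(p Point, selected, bracketed string) Point {
        return Point{
            Series:         fmt.Sprintf("%s.%v", p.Series, p.Columns[bracketed]),
            Time:           p.Time,
            SequenceNumber: p.SequenceNumber,
            Columns:        map[string]interface{}{selected: p.Columns[selected]},
        }
    }

    func main() {
        src := Point{
            Series:         "foo_metric",
            Time:           1396366076000000,
            SequenceNumber: 10001,
            Columns:        map[string]interface{}{"value": 23.8, "host": "serverA"},
        }
        fmt.Printf("%+v\n", fanOut(src, "value", "host")) // series foo_metric.serverA
    }
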
  68. Continuous queries (summaries) select count(page_id) from events group by time(1h), page_id into events.1h.[page_id]

  69. Continuous queries (regex downsampling) select max(value), context from /^stats\.*/ group by time(5m) into max.5m.:series_name

  70. Continuous Queries (group by time) • Run on Raft leader • Check every second if any to run

  71. Continuous Queries (diagram) Queries to run? Checks since last successful run. 1. Run queries through the write pipeline. 2. Mark the time of this run and replicate it via Raft.

  72. Continuous Queries select count(page_id) from events group by time(1h), page_id into events.1h.[page_id] 1. Run: select count(page_id) from events group by time(1h), page_id into events.1h.[page_id] where time >= 1396368660 AND time < 1396368720 2. Write: { "time": 1396368660, "sequence_number": 1, "series": "events.1h.22", "count": 23 }, { "time": 1396368660, "sequence_number": 2, "series": "events.1h.56", "count": 10 }

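A Go sketch of how the leader could turn the stored query above into time-bounded runs for each completed interval since the last successful run (slides 70-72). The string rewriting and the boundedQueries helper are simplifications for illustration; the real server parses the query rather than appending text.

    package main

    import (
        "fmt"
        "time"
    )

    // boundedQueries returns one bounded query per completed group-by interval
    // between lastRun and now, plus the new last-run mark to replicate via Raft.
    func boundedQueries(query string, interval time.Duration, lastRun, now time.Time) ([]string, time.Time) {
        var out []string
        start := lastRun.Truncate(interval)
        for ; !start.Add(interval).After(now); start = start.Add(interval) {
            out = append(out, fmt.Sprintf("%s where time >= %d AND time < %d",
                query, start.Unix(), start.Add(interval).Unix()))
        }
        return out, start
    }

    func main() {
        q := "select count(page_id) from events group by time(1h), page_id into events.1h.[page_id]"
        lastRun := time.Unix(1396368660, 0)
        now := time.Unix(1396372320, 0) // a bit over an hour later
        qs, mark := boundedQueries(q, time.Hour, lastRun, now)
        for _, s := range qs {
            fmt.Println(s)
        }
        fmt.Println("mark:", mark.Unix())
    }
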
  73. Implications • Fault tolerant across server failures • New leader picks up since last known • Continuous queries can be recalculated (future) • Will overwrite old values

  74. Future Work

  75. Custom Functions select myFunc(value) from some_series

  76. Pubsub select * from some_series where host = "serverA" into subscription() select percentile(90, value) from some_series group by time(1m) into subscription()

  77. Rack aware sharding

  78. Multi-datacenter replication

  79. Thanks! Paul Dix @pauldix @influxdb paul@influxdb.com