The Internals of InfluxDB

Paul Dix
April 04, 2014

A talk on the underlying implementation details of InfluxDB, given in San Francisco the week of 04/04/14.

Transcript

  1. The Internals of
    InfluxDB
    Paul Dix
    @pauldix
    paul@influxdb.com

  2. An open source distributed time series, metrics, and
    events database.
    Like, for analytics.

  3. Basic Concepts

  4. Data model
    • Databases
    • Time series (or tables, but you can have millions)
    • Points (or rows, but column oriented)

  5. HTTP API (writes)
    curl -X POST \
    'http://localhost:8086/db/mydb/series?u=paul&p=pass' \
    -d '[{"name":"foo", "columns":["val"], "points": [[3]]}]'
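
The same write issued from Go instead of curl; the URL, credentials, and JSON body are taken straight from the example above (error handling kept minimal):

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    func main() {
        url := "http://localhost:8086/db/mydb/series?u=paul&p=pass"
        body := []byte(`[{"name":"foo", "columns":["val"], "points": [[3]]}]`)

        // same request the curl command above makes
        resp, err := http.Post(url, "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println(resp.Status)
    }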

  6. HTTP API (queries)
    curl 'http://localhost:8086/db/mydb/series?u=paul&p=pass&q=.'

  7. Data (with timestamp)
    [
      {
        "name": "cpu",
        "columns": ["time", "value", "host"],
        "points": [
          [1395168540, 56.7, "foo.influxdb.com"],
          [1395168540, 43.9, "bar.influxdb.com"]
        ]
      }
    ]

  8. Data (auto-timestamp)
    [
      {
        "name": "events",
        "columns": ["type", "email"],
        "points": [
          ["signup", "[email protected]"],
          ["paid", "[email protected]"]
        ]
      }
    ]

  9. SQL-ish
    select * from events
    where time > now() - 1h

  10. Aggregates
    select percentile(90, value) from response_times
    group by time(10m)
    where time > now() - 1d

  11. Select from Regex
    select * from /stats\.cpu\..*/
    limit 1

  12. Where against Regex
    select value from application_logs
    where value =~ /.*ERROR.*/ and
    time > "2014-03-01" and time < "2014-03-03"

  13. Continuous queries
    (fan out)
    select * from events
    into events.[user_id]

  14. Continuous queries
    (summaries)
    select count(page_id) from events
    group by time(1h), page_id
    into events.[page_id]

  15. Continuous queries
    (regex downsampling)
    select max(value), context from /stats\.*/
    group by time(5m)
    into max.:series_name

  16. Under the hood

  17. Queries - Parsing
    • Lex
      • Flex
      • Lexer
    • Yacc
      • Bison
      • Parser generator

  18. Queries - Processing
    • Totally custom
    • Hand-code each aggregate function, conditional,
    etc.

  19. How data is organized

  20. Shard
    type Shard struct {
        Id        uint32    // shard id
        StartTime time.Time // start of the shard's contiguous block of time
        EndTime   time.Time // end of that block
        ServerIds []uint32  // servers holding a copy of this shard
    }

  21. Shard is a contiguous
    block of time

  22. Data for all series for a
    given interval exists in a
    shard*
    *by default, but can be modified

  23. Every server knows
    about every shard

  24. Query Pipeline
    select count(value) from foo group by time(5m)
    1. Query arrives at Server A from the client
    2. Server A queries the shards (each a local LevelDB or a remote server)
    3. Each shard computes its part locally
    4. Each shard streams its result back to Server A
    5. Server A collates and returns to the client
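
Step 5 is cheap for an aggregate like count because per-shard results merge directly: each shard returns a count per 5m bucket and Server A just adds them up. A minimal sketch of that collation (the type and function names are illustrative, not InfluxDB's actual code):

    package main

    import "fmt"

    // partialCounts is one shard's result: count per group-by bucket
    // (bucket start in epoch seconds).
    type partialCounts map[int64]int64

    // collate merges the per-shard partials. Because group-by buckets never
    // straddle a shard's time boundaries, each shard can count its own points
    // per bucket, and merging is simple addition.
    func collate(shardResults []partialCounts) partialCounts {
        merged := partialCounts{}
        for _, res := range shardResults {
            for bucket, count := range res {
                merged[bucket] += count
            }
        }
        return merged
    }

    func main() {
        a := partialCounts{1395168300: 10, 1395168600: 4}
        b := partialCounts{1395168600: 6, 1395168900: 2}
        fmt.Println(collate([]partialCounts{a, b}))
        // map[1395168300:10 1395168600:10 1395168900:2]
    }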

  25. Group by times less than
    shard size gives locality

  26. Query Pipeline (no locality)
    select percentile(90, value) from response_times
    1. Query arrives at Server A from the client
    2. Server A queries the shards (each a local LevelDB or a remote server)
    3. The shards stream their raw points back to Server A
    4. Server A collates, computes, and returns to the client
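
There is no partial result for percentile that Server A could merge, which is why the shards stream raw points and the coordinator computes the answer itself. A rough illustration (not InfluxDB's actual implementation; its exact rounding may differ):

    package main

    import (
        "fmt"
        "sort"
    )

    // percentile computes the p-th percentile over the raw values gathered
    // from every shard; it cannot be derived from per-shard percentiles.
    func percentile(p float64, values []float64) float64 {
        sort.Float64s(values)
        idx := int(p/100*float64(len(values))+0.5) - 1 // rough rank; exact rounding may differ
        if idx < 0 {
            idx = 0
        }
        if idx >= len(values) {
            idx = len(values) - 1
        }
        return values[idx]
    }

    func main() {
        raw := []float64{120, 80, 95, 300, 150, 110, 90, 101, 98, 130}
        fmt.Println(percentile(90, raw)) // 150
    }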

  27. Server A could buffer
    all results in memory*
    *the number of shards queried in parallel can be limited

  28. Evaluate the query and
    only hit the shards we
    need.
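
Because every server knows every shard's time block, pruning is just an overlap check between the query's time range and each shard's [StartTime, EndTime). A minimal sketch reusing the Shard struct from slide 20 (the function name is illustrative):

    package main

    import (
        "fmt"
        "time"
    )

    // Shard as defined on the earlier slide.
    type Shard struct {
        Id        uint32
        StartTime time.Time
        EndTime   time.Time
        ServerIds []uint32
    }

    // shardsForTimeRange keeps only the shards whose block overlaps the
    // query's [start, end) range; the rest are never contacted.
    func shardsForTimeRange(shards []Shard, start, end time.Time) []Shard {
        var matched []Shard
        for _, s := range shards {
            if s.StartTime.Before(end) && s.EndTime.After(start) {
                matched = append(matched, s)
            }
        }
        return matched
    }

    func main() {
        now := time.Now()
        week := 7 * 24 * time.Hour
        shards := []Shard{
            {Id: 1, StartTime: now.Add(-2 * week), EndTime: now.Add(-week)},
            {Id: 2, StartTime: now.Add(-week), EndTime: now.Add(week)},
        }
        // "where time > now() - 10m" only touches shard 2
        fmt.Println(shardsForTimeRange(shards, now.Add(-10*time.Minute), now))
    }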

  29. Query Pipeline
    select count(value) from foo where time > now() - 10m
    1. Query arrives at Server A from the client
    2. Server A queries only the shard covering the last 10 minutes (local LevelDB or a remote server)
    3. That shard computes locally
    4. The shard streams its result back
    5. Server A collates and returns to the client

  30. Scaling out with shards

  31. Shard Config
    # in the config
    [sharding]
    duration = "7d"
    split = 3
    replication-factor = 2
    # split-random = "/^big.*/"

  32. Shards (read scalability)
    type Shard struct {
        Id        uint32
        StartTime time.Time
        EndTime   time.Time
        ServerIds []uint32 // replication factor: one copy per listed server
    }

  33. Splitting durations into
    multiple shards (write
    scalability)

  34. Shard Creation
    {"value":23.8, "context":"more info here"}
    1. Write arrives at Server A
    2. Does a shard exist for the point's time bucket?
       Yes: normal write pipeline, done
       No: create the shards (from the config rules) through Raft
    3. Normal write

  35. Mapping Data to Shard
    1. Lookup time bucket
    2. N = number of shards for that time bucket
    3. If series =~ /regex in config/
       1. Write to random(N)
    4. Else
       1. Write to hash(database, series) % N
          (not a consistent hashing scheme)
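
A minimal Go sketch of these mapping rules (the hash function and names are illustrative, the real code differs; as noted above, it is not a consistent hashing scheme):

    package main

    import (
        "fmt"
        "hash/fnv"
        "math/rand"
        "regexp"
    )

    // pickShard maps an incoming point to one of the N shards covering its
    // time bucket, following the rules above.
    func pickShard(database, series string, splitRandom *regexp.Regexp, n int) int {
        // series matching the configured regex are spread randomly across
        // the N shards (write scalability, but no locality for queries)
        if splitRandom != nil && splitRandom.MatchString(series) {
            return rand.Intn(n)
        }
        // otherwise hash(database, series) % N, so all data for this series
        // in this time bucket lands in one specific shard
        h := fnv.New32a()
        h.Write([]byte(database))
        h.Write([]byte(series))
        return int(h.Sum32()) % n
    }

    func main() {
        bigSeries := regexp.MustCompile(`^big.*`)
        fmt.Println(pickShard("mydb", "cpu.load", bigSeries, 3))     // deterministic
        fmt.Println(pickShard("mydb", "big.requests", bigSeries, 3)) // random
    }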

  36. Shard Split Implications
    • If using hash, all data for a given duration of a series is in a
      specific shard
      • Data locality
    • If using regex
      • Scalable writes
      • No data locality for queries
    • Split can't change on old data without a total rebalancing of
      the shard

  37. Rules for RF and split
    can change over time

  38. Splits don’t need to
    change for old shards

  39. Underlying storage

  40. Storage Engine
    • LevelDB (one LDB per shard)
    • Log Structured Merge Tree (LSM Tree)
    • Ordered key/value hash

  41. Key - 24 bytes
    [id,time,sequence]
    Tuple with id, time, and sequence number

  42. Key - ID (8 bytes)
    [id,time,sequence]
    Uint64 - Identifies database, series, column
    Each column has its own id

  43. Key - Time (8 bytes)
    [id,time,sequence]
    Uint64 - microsecond epoch
    normalized to be positive (for range scans)

  44. Key - Sequence (8 bytes)
    [id,time,sequence]
    Uint64 - unique in cluster
    Counter multiplied by 1000, plus the server id
    Last three digits are the unique server id
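
A sketch of how those three uint64s could be packed into the 24-byte key. Big-endian encoding is an assumption here, but it is what makes LevelDB's lexicographic key order match numeric order of id, then time, then sequence, which the range scans rely on:

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // makeKey packs (column id, normalized time, sequence number) into 24
    // bytes; byte-wise comparison then orders keys by id, time, sequence.
    func makeKey(id, timeMicro, sequence uint64) []byte {
        key := make([]byte, 24)
        binary.BigEndian.PutUint64(key[0:8], id)
        binary.BigEndian.PutUint64(key[8:16], timeMicro)
        binary.BigEndian.PutUint64(key[16:24], sequence)
        return key
    }

    func main() {
        fmt.Printf("%x\n", makeKey(1, 1396298064000000, 1001))
    }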

  45. Ordered Keys
    [1,1396298064000000,1001]
    [1,1396298064000000,2001]
    [1,1396298064000000,3001]
    [1,1396298074000000,4001]

    [2,1396298064000000,1001]
    [2,1396298064000000,2001]
    [2,1396298064000000,3001]
    [2,1396298074000000,4001]
    {"value":23.8, "context":"more info here"}
    same point, different columns
    (same time and sequence number, different column id)

  46. Ordered Keys
    [1,1396298064000000,1001]
    [1,1396298064000000,2001]
    [1,1396298064000000,3001]
    [1,1396298074000000,4001]

    [2,1396298064000000,1001]
    [2,1396298064000000,2001]
    [2,1396298064000000,3001]
    [2,1396298074000000,4001]
    {"value":23.8, "context":"more info here"}
    different point, same timestamp
    (same time, different sequence number)

  47. Values
    // protobuf bytes
    message FieldValue {
        optional string string_value = 1;
        optional double double_value = 3;
        optional bool bool_value = 4;
        optional int64 int64_value = 5;
        // never stored if this
        optional bool is_null = 6;
    }

  48. Storage Implications
    • Aggregate queries do range scan for entire time range
    • Where clause queries do range scan for entire time range
    • Queries with multiple columns do range scan for entire
    time range for EVERY column
    • Splitting into many time series is the way to index
    • No more than 999 servers in a cluster
    • String values get repeated, but compression may lessen the
    impact

  49. Distributed Parts

  50. Raft
    • Distributed Consensus Protocol (like Paxos)
    • Runs on its own port using an HTTP/JSON protocol
    • We use it for meta-data:
      • What servers are in the cluster
      • What databases exist
      • What users exist
      • What shards exist and where
      • What continuous queries exist

  51. Raft - why not series data?
    • Write bottleneck (all writes go through the leader)
    • Raft groups?
      • How many?
      • How to add new ones?
    • Too chatty

  52. Server Joining…
    • Looks at “seed-servers” in config
    • Raft connects to server, attempts to join
    • Gets redirected to leader
    • Joins cluster (all other servers informed)
    • Log replay (meta-data)
    • Connects to other servers in cluster via TCP

  53. TCP Protocol
    • Each server has a single connection to every other
    server
    • Request/response multiplexed onto connection
    • Protobuf wire protocol
    • Distributed Queries
    • Writes

  54. Request
    message Request {
        enum Type {
            WRITE = 1;
            QUERY = 2;
            HEARTBEAT = 3;
        }
        optional uint32 id = 1;
        required Type type = 2;
        required string database = 3;
        optional Series series = 4;
        optional uint32 shard_id = 6;
        optional string query = 7;
        optional string user_name = 8;
        optional uint32 request_number = 9;
    }

  55. Response
    message Response {
        enum Type {
            QUERY = 1;
            WRITE_OK = 2;
            END_STREAM = 3;
            HEARTBEAT = 9;
            EXPLAIN_QUERY = 10;
        }
        required Type type = 1;
        required uint32 request_id = 2;
        optional Series series = 3;
        optional string error_message = 5;
    }
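
The multiplexing from slide 53 comes down to tagging each Request with an id and routing each Response back by its request_id. A minimal sketch of that bookkeeping (illustrative Go, not the actual cluster code; protobuf framing and streamed responses ending in END_STREAM are omitted):

    package main

    import (
        "fmt"
        "sync"
    )

    // Response stands in for the protobuf Response message above.
    type Response struct {
        RequestId uint32
        Series    string
    }

    // connection multiplexes many in-flight requests over a single TCP link
    // by remembering which channel is waiting on which request id.
    type connection struct {
        mu      sync.Mutex
        nextId  uint32
        pending map[uint32]chan *Response
    }

    func newConnection() *connection {
        return &connection{pending: make(map[uint32]chan *Response)}
    }

    // sendRequest registers a response channel under a fresh request id; the
    // real code would also write the protobuf Request to the socket here.
    func (c *connection) sendRequest() (uint32, chan *Response) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.nextId++
        ch := make(chan *Response, 1)
        c.pending[c.nextId] = ch
        return c.nextId, ch
    }

    // handleResponse routes a response read off the socket back to whoever
    // issued the matching request.
    func (c *connection) handleResponse(r *Response) {
        c.mu.Lock()
        ch := c.pending[r.RequestId]
        delete(c.pending, r.RequestId)
        c.mu.Unlock()
        if ch != nil {
            ch <- r
        }
    }

    func main() {
        conn := newConnection()
        id, ch := conn.sendRequest()
        conn.handleResponse(&Response{RequestId: id, Series: "cpu"})
        fmt.Println((<-ch).Series)
    }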

  56. Writes go over the
    TCP protocol
    how do we ensure replication?

  57. Write Ahead Log (WAL)

  58. Writing Pipeline
    {"value":23.8, "context":"more info here"}
    1. Write arrives at Server A from the client
    2. Log to the WAL
    3a. Return success to the client
    3b. Send to the write buffer
    4. Write to the shard (local LevelDB or a remote server)
    5. Commit to the WAL
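
A rough sketch of that ordering: log to the WAL, acknowledge the client, and let the write buffer do the shard write and the WAL commit afterwards. The types here are in-memory stand-ins; the real WAL is durable and the shard write may go to another server:

    package main

    import "fmt"

    type point struct {
        value   float64
        context string
    }

    type walEntry struct {
        seq       uint64
        p         point
        committed bool
    }

    // wal is a stand-in for the write ahead log.
    type wal struct {
        entries []walEntry
    }

    func (w *wal) log(p point) uint64 {
        seq := uint64(len(w.entries) + 1)
        w.entries = append(w.entries, walEntry{seq: seq, p: p})
        return seq
    }

    func (w *wal) commit(seq uint64) { w.entries[seq-1].committed = true }

    // write follows the slide: log to the WAL, hand the point to the write
    // buffer, and return success to the client right away.
    func write(w *wal, buffer chan<- walEntry, p point) error {
        seq := w.log(p)                    // 2. log to WAL
        buffer <- walEntry{seq: seq, p: p} // 3b. send to write buffer
        return nil                         // 3a. return success
    }

    func main() {
        w := &wal{}
        buffer := make(chan walEntry, 16)
        write(w, buffer, point{value: 23.8, context: "more info here"})

        entry := <-buffer
        // 4. the write to the shard (LevelDB or a remote server) happens here
        w.commit(entry.seq) // 5. commit to WAL
        fmt.Printf("%+v\n", w.entries)
    }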

  59. WAL Implications
    • Servers can go down
    • Can replay from last known commit
    • Doesn’t need to buffer in memory
    • Can buffer against write bursts or LevelDB
    compactions

  60. Continuous Query
    Architecture

  61. Continuous Queries
    • Denormalization, downsampling
    • For performance
    • Distributed
    • Replicated through Raft

  62. Continuous Queries (create)
    select * from events into events.[user_id]
    1. Query arrives at Server A
    2. Create on the Raft leader
    3. Replicate to the other servers (Server B, ...)

  63. Continuous queries
    (fan out)
    select * from events
    into events.[user_id]

    select value from /^foo.*/
    into bar.[type].:foo

  64. Writing Pipeline
    {"value":23.8, "context":"more info here"}
    1. Write arrives at Server A from the client
    2. Log to the WAL
    3a. Return success to the client
    3b. Send to the write buffer
    4. Write to the shard (local LevelDB or a remote server)
    5. Commit to the WAL

  65. Writing Pipeline
    {"value":23.8, "context":"more info here"}
    1. Write arrives at Server A from the client
    2. Log to the WAL
    3a. Return success to the client
    3b. Evaluate fanouts, then continue with the normal write pipeline

  66. Fanouts
    select value from foo_metric into foo_metric.[host]

    Incoming point: {"value":23.8, "host":"serverA"}

    Logged to the WAL as:
    {
      "series": "foo_metric",
      "time": 1396366076000000,
      "sequence_number": 10001,
      "value": 23.8,
      "host": "serverA"
    }

    Fanout point (time and sequence number stay the same):
    {
      "series": "foo_metric.serverA",
      "time": 1396366076000000,
      "sequence_number": 10001,
      "value": 23.8
    }
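
A minimal sketch of what evaluating that fanout does to a single point: the [host] value is folded into the target series name, and the time and sequence number are copied over unchanged, which is what allows the source point to be looked up later (illustrative code, not the actual implementation):

    package main

    import "fmt"

    // point mirrors the fields shown on the slide.
    type point struct {
        Series         string
        Time           int64
        SequenceNumber uint64
        Columns        map[string]interface{}
    }

    // fanout rewrites a point for
    // "select value from foo_metric into foo_metric.[host]".
    func fanout(src point) point {
        host := src.Columns["host"].(string)
        return point{
            Series:         src.Series + "." + host,
            Time:           src.Time,           // same time
            SequenceNumber: src.SequenceNumber, // same sequence number
            Columns:        map[string]interface{}{"value": src.Columns["value"]},
        }
    }

    func main() {
        src := point{
            Series:         "foo_metric",
            Time:           1396366076000000,
            SequenceNumber: 10001,
            Columns:        map[string]interface{}{"value": 23.8, "host": "serverA"},
        }
        fmt.Printf("%+v\n", fanout(src))
    }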

  67. Fanout Implications
    • Servers don't check with the Raft leader for fanout
    definitions
    • They need to be up to date with the leader
    • Data duplication for performance
    • Can look up source points by time and sequence number
    • Failures mid-fanout are OK

  68. Continuous queries
    (summaries)
    select count(page_id) from events
    group by time(1h), page_id
    into events.1h.[page_id]

  69. Continuous queries
    (regex downsampling)
    select max(value), context from /^stats\.*/
    group by time(5m)
    into max.5m.:series_name

  70. Continuous Queries (group
    by time)
    • Run on the Raft leader
    • Check every second whether any need to run

  71. Continuous Queries
    Any queries to run?
    (checks the intervals since the last successful run)
    1. Run the queries, send results through the write pipeline
    2. Mark the time of this run, replicate it through Raft
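
A rough sketch of the check the leader does every second: work out which group-by windows have fully closed since the last successful run, run the query once per window with explicit time bounds (as the next slide shows), then replicate the new run time through Raft. All names here are illustrative:

    package main

    import (
        "fmt"
        "time"
    )

    // window is one group-by bucket that has closed and needs computing:
    // where time >= Start and time < End.
    type window struct {
        Start, End time.Time
    }

    // windowsSince returns every complete interval between the last
    // successful run and now; the leader runs the continuous query once per
    // returned window and then records the new run time.
    func windowsSince(lastRun, now time.Time, interval time.Duration) []window {
        var runs []window
        start := lastRun.Truncate(interval)
        for !start.Add(interval).After(now) {
            runs = append(runs, window{Start: start, End: start.Add(interval)})
            start = start.Add(interval)
        }
        return runs
    }

    func main() {
        lastRun := time.Unix(1396368660, 0)
        now := lastRun.Add(2 * time.Hour)
        for _, w := range windowsSince(lastRun, now, time.Hour) {
            // each window becomes a "where time >= ... and time < ..." query
            fmt.Printf("run for %d <= time < %d\n", w.Start.Unix(), w.End.Unix())
        }
    }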

  72. Continuous Queries
    select count(page_id) from events
    group by time(1h), page_id
    into events.1h.[page_id]

    1. Run (with explicit time bounds):
    select count(page_id) from events
    group by time(1h), page_id
    into events.1h.[page_id]
    where time >= 1396368660 AND time < 1396368720

    2. Write:
    {
      "time": 1396368660,
      "sequence_number": 1,
      "series": "events.1h.22",
      "count": 23
    },
    {
      "time": 1396368660,
      "sequence_number": 2,
      "series": "events.1h.56",
      "count": 10
    }

  73. Implications
    • Fault tolerant across server failures
    • A new leader picks up from the last known successful run
    • Continuous queries can be recalculated (future)
      • Will overwrite old values

  74. Future Work

  75. Custom Functions
    select myFunc(value) from some_series

  76. Pubsub
    select * from some_series
    where host = "serverA"
    into subscription()

    select percentile(90, value) from some_series
    group by time(1m)
    into subscription()

  77. Rack aware sharding

  78. Multi-datacenter
    replication

  79. Thanks!
    Paul Dix
    @pauldix
    @influxdb
    paul@influxdb.com
