Refactoring InfluxDB: from Go to Go
Paul Dix
February 19, 2015
Talk given at the Golang SF meetup about the rewrite from 0.8 to 0.9 of InfluxDB
Transcript
Refactoring InfluxDB: from Go to Go Paul Dix CEO and
cofounder of InfluxDB @pauldix paul@influxdb.com
About me…
Organizer NYC Machine Learning (5,500+ members)
Series Editor - “Data & Analytics”
Cofounder & CEO of InfluxDB
Onto the talk!
Confession… (not a Go talk)
API Design
Software Design
–InfluxDB design philosophy “Optimize for developer happiness.”
Refactor Rewrite
Before that, what’s InfluxDB?
Open source distributed time series database written in Go with
no external dependencies
Metrics
Time Series
Analytics
Events
Use Cases • Metrics (DevOps) • Sensor Data • Realtime
Analytics
Features Upcoming 0.9.0 release
Data model
• Databases
• Measurements
  • cpu_load, temperature, log_lines, click, etc.
• Tags
  • region=uswest, host=serverA, building=23, service=redis, etc.
• Series - measurement + unique tagset
• Points
  • Fields - bool, int64, float64, string, []byte
  • Timestamp - nano epoch
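The "series = measurement + unique tagset" rule can be sketched in Go. This is only an illustration: the `Point` type and the `SeriesKey` function are hypothetical names, not InfluxDB's internal types, and the key format is an assumption. The point it shows is that the key must be built from sorted tags, because Go map iteration order is not stable.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// Point is a hypothetical illustration of one data point in the model
// described on the slide. It is not InfluxDB's internal representation.
type Point struct {
	Measurement string
	Tags        map[string]string
	Fields      map[string]interface{} // bool, int64, float64, string, or []byte
	Timestamp   time.Time              // nanosecond precision
}

// SeriesKey builds an identifier from the measurement and the sorted tag set,
// so two points with the same measurement and tags map to the same series.
// The "measurement,key=value,..." layout is an assumption for this sketch.
func SeriesKey(measurement string, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(measurement)
	for _, k := range keys {
		b.WriteString(",")
		b.WriteString(k)
		b.WriteString("=")
		b.WriteString(tags[k])
	}
	return b.String()
}

func main() {
	p := Point{
		Measurement: "cpu_load",
		Tags:        map[string]string{"region": "uswest", "host": "serverA"},
		Fields:      map[string]interface{}{"value": 0.64},
		Timestamp:   time.Unix(0, 1257894000000000000).UTC(),
	}
	fmt.Println(SeriesKey(p.Measurement, p.Tags))
}
```

Fields are deliberately left out of the key: they are the values stored in a series, while tags are the indexed metadata that identifies it.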
Writing Data
  curl -XPOST 'http://localhost:8086/write' -d '...'
Writing Data
  {
    "database": "mydb",
    "retentionPolicy": "30d",
    "points": [
      {
        "name": "cpu_load",
        "tags": {
          "host": "server01",
          "region": "us-west"
        },
        "timestamp": "2009-11-10T23:00:00Z",
        "fields": {
          "value": 0.64
        }
      }
    ]
  }
Labels on slide: Measurement ("name"), Tags ("tags"), Fields ("fields")
Querying
  curl -G 'http://localhost:8086/query' --data-urlencode "q=..."
SQL-ish query language
QUERY:
  SELECT value FROM cpu WHERE host = 'serverA'
RESULTS:
  {
    "results": [
      {
        "query": "SELECT value FROM cpu WHERE host='serverA'",
        "series": [
          {
            "name": "cpu",
            "tags": {
              "host": "serverA"
            },
            "columns": ["time", "value"],
            "values": [
              ["2009-11-10T23:00:00Z", 22.1],
              ["2009-11-10T23:00:10Z", 25.2]
            ]
          }
        ]
      }
    ]
  }
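A client-side sketch of the same round trip in Go, under stated assumptions: the struct names (`Response`, `Result`, `Series`) and the helper functions are hypothetical, and only the JSON shape and the `/query` path with its `q` parameter are taken from the slides. The query string is URL-encoded the same way `curl --data-urlencode` would do it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// Series, Result and Response mirror the JSON shape shown on the slide.
// The Go type names are illustrative, not an official client API.
type Series struct {
	Name    string            `json:"name"`
	Tags    map[string]string `json:"tags"`
	Columns []string          `json:"columns"`
	Values  [][]interface{}   `json:"values"`
}

type Result struct {
	Query  string   `json:"query"`
	Series []Series `json:"series"`
}

type Response struct {
	Results []Result `json:"results"`
}

// QueryURL builds the GET URL for a query, escaping the statement.
func QueryURL(base, q string) string {
	v := url.Values{}
	v.Set("q", q)
	return base + "/query?" + v.Encode()
}

// DecodeResponse parses a query response body.
func DecodeResponse(data []byte) (Response, error) {
	var r Response
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	fmt.Println(QueryURL("http://localhost:8086", "SELECT value FROM cpu WHERE host = 'serverA'"))

	sample := []byte(`{"results":[{"query":"SELECT value FROM cpu WHERE host='serverA'",
		"series":[{"name":"cpu","tags":{"host":"serverA"},"columns":["time","value"],
		"values":[["2009-11-10T23:00:00Z",22.1],["2009-11-10T23:00:10Z",25.2]]}]}]}`)
	resp, err := DecodeResponse(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, s := range resp.Results[0].Series {
		fmt.Println(s.Name, s.Tags, s.Columns, s.Values)
	}
}
```

Each row in `values` lines up positionally with `columns`, so a caller looks up the column index once and then reads that position from every row.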
QUERY:
  SELECT value FROM cpu
  WHERE host = 'serverA' OR host = 'serverB'
SERIES IN RESULT:
  {
    "series": [
      {
        "name": "cpu",
        "tags": {
          "host": "serverA"
        },
        "columns": ["time", "value"],
        "values": []
      },
      {
        "name": "cpu",
        "tags": {
          "host": "serverB"
        },
        "columns": ["time", "value"],
        "values": []
      }
    ]
  }
QUERY:
  SELECT percentile(90, value) FROM cpu
  WHERE time > now() - 4h
  GROUP BY time(10m), region
SERIES IN RESULT:
  [
    {
      "name": "cpu",
      "tags": {
        "region": "us-west"
      },
      "columns": ["time", "percentile"],
      "values": []
    },
    {
      "name": "cpu",
      "tags": {
        "region": "us-east"
      },
      "columns": ["time", "percentile"],
      "values": []
    }
  ]
Multiple aggregates
  SELECT mean(value), percentile(90, value), min(value), max(value)
  FROM cpu
  WHERE host='serverA' AND time > now() - 48h
  GROUP BY time(1h)
Return every series in cpu
  SELECT mean(value)
  FROM cpu
  WHERE time > now() - 48h
  GROUP BY time(1h), *
Discovery based on tags
  {
    "results": [
      {
        "query": "SHOW MEASUREMENTS",
        "series": [
          {
            "name": "measurements",
            "columns": ["name"],
            "values": [
              ["cpu"],
              ["memory"],
              ["network"]
            ]
          }
        ]
      }
    ]
  }
  {
    "results": [
      {
        "query": "SHOW SERIES",
        "series": [
          {
            "name": "cpu",
            "columns": ["id", "region", "host"],
            "values": [
              [1, "us-west", "serverA"],
              [2, "us-east", "serverB"]
            ]
          }
        ]
      }
    ]
  }
  {
    "query": "SHOW MEASUREMENTS WHERE service='redis'",
    "series": [
      {
        "name": "measurements",
        "columns": ["measurement"],
        "values": [
          ["key_count"],
          ["connections"]
        ]
      }
    ]
  }
  {
    "query": "SHOW TAG KEYS from cpu",
    "series": [
      {
        "name": "keys",
        "columns": ["key"],
        "values": [
          ["region"],
          ["host"]
        ]
      }
    ]
  }
  {
    "query": "SHOW TAG VALUES WITH KEY = service",
    "series": [
      {
        "name": "series",
        "columns": ["service"],
        "values": [
          ["redis"],
          ["apache"]
        ]
      }
    ]
  }
  {
    "query": "SHOW TAG VALUES FROM cpu WITH KEY = service",
    "series": [
      {
        "name": "series",
        "columns": ["service"],
        "values": [
          ["redis"],
          ["apache"]
        ]
      }
    ]
  }
Much more • Retention policies • Automatic downsampling and aggregation
• Clustering
onto the rewrite…
Since November we’ve been rewriting InfluxDB from scratch.
“… the single worst strategic mistake that any software company can make …”
Joel Spolsky on rewriting from scratch, in "Things You Should Never Do, Part I" http://www.joelonsoftware.com/articles/fog0000000069.html
Why would we rewrite?
–InfluxDB design philosophy “Optimize for developer happiness.”
What does this mean?
for InfluxDB users…
simple setup
flexible API
empowering API
performant by default
helps developers build analytics applications faster
for InfluxDB developers…
fast build time
idiomatic Go code
easy to build and contribute
Not Happy
Split across three different legitimate areas
Feature Requests Moving average, different kinds of derivatives, ways to
fill data, top N for a given period, exact data point for min/max
Performance Bugs High CPU load on query, where clause takes
too long, list series takes too long
Bugs open file handles, out of memory, clustering crashes
Questions (illegitimate?)
Some questions point to bad API design
We identified 3 different areas of improvement
• API Design • Structural Design Problems Limiting Performance •
Underlying Technology Choices Causing Bugs
API Design should be understandable, should push users in the
right direction
Previous API
Many series with metadata in the name, like in Graphite:
  region.us.data_center.1.host.serverA.network_in
  region.us.data_center.1.host.serverA.network_out
  5m.mean.region.us.data_center.1.host.serverA.network_in
  5m.mean.region.us.data_center.1.host.serverA.network_out
Understandable API Design: Retention Policies
• Previously called shard spaces
• Users tell the server which shard space to read/write data into based on a regex to match against the series name
Too hard to use, couldn’t tell where data was going
Users could accidentally hide a bunch of data that was
previously available
Solution: have the user be explicit at write or query
time
Pushing users in the right direction: Tags
Users wanted this:
  SELECT mean(value) FROM cpu
  WHERE host = 'serverA'
Columns • Wasted space • Add Indexes? • User still
has to create them
Everyone was expecting tags… Indexed metadata and series
Structural Design Problems
Tell users to have many series:
  LIST SERIES /.*region\.uswest.*/

  SELECT mean(value)
  FROM merge(/.*region\.uswest.*/)
  WHERE time > now() - 2h
  GROUP BY time(5m)
No way to make that perform well with > 100k
series
Solution: measurements and tags break up the namespace
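One way to picture why tags help, as a sketch only: with metadata packed into series names, finding the series for a region means running a regex over every name, whereas an inverted index keyed by (measurement, tag key, tag value) turns the same question into a map lookup. The `Index` type below is hypothetical and says nothing about how InfluxDB 0.9 actually stores its index.

```go
package main

import "fmt"

// tagPair identifies one tag value within one measurement.
type tagPair struct {
	measurement, key, value string
}

// Index is a hypothetical in-memory inverted index from
// (measurement, tag key, tag value) to the IDs of matching series.
type Index struct {
	series map[tagPair][]uint64
}

// NewIndex returns an empty index.
func NewIndex() *Index {
	return &Index{series: map[tagPair][]uint64{}}
}

// Add registers a series under every one of its tags.
func (idx *Index) Add(id uint64, measurement string, tags map[string]string) {
	for k, v := range tags {
		p := tagPair{measurement, k, v}
		idx.series[p] = append(idx.series[p], id)
	}
}

// Lookup returns the IDs of series in the measurement carrying the given tag,
// in the order they were added. The cost depends on the number of matches,
// not on the total number of series.
func (idx *Index) Lookup(measurement, key, value string) []uint64 {
	ids := idx.series[tagPair{measurement, key, value}]
	return append([]uint64(nil), ids...) // copy so callers cannot mutate the index
}

func main() {
	idx := NewIndex()
	idx.Add(1, "cpu", map[string]string{"region": "uswest", "host": "serverA"})
	idx.Add(2, "cpu", map[string]string{"region": "useast", "host": "serverB"})
	idx.Add(3, "cpu", map[string]string{"region": "uswest", "host": "serverC"})

	fmt.Println(idx.Lookup("cpu", "region", "uswest"))
}
```

The measurement is part of the lookup key, which is the "break up the namespace" part: a query against `cpu` never touches entries for other measurements.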
Structural Design Problems: Query Engine • Pipes raw data over
the network • Need to redesign to get data locality • MapReduce framework
Solution: refactor/rewrite the query engine
Underlying Technology Choices • Protobufs • Performance • Everywhere in
code (network, database)
Solution: switch to raw bytes for storage
Underlying Technology Choices • LevelDB • Too many file handles
• No online backups • Too hard to transfer shard from one server to another
Solution: switch to BoltDB
Underlying Technology Choices • Flex & Bison for Parser •
Very hard to understand and update • CGo code
Solution: switch to pure Go parser
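To give a feel for what a hand-written, pure Go parser involves, here is a toy tokenizer for a tiny subset of the query language. It is a sketch under assumptions, not the InfluxDB parser: the token kinds, the keyword list, and the `Tokenize` function are invented for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// TokenKind classifies a token in this toy lexer.
type TokenKind int

const (
	Ident TokenKind = iota
	Keyword
	String
	Operator
)

// Token is one lexical unit of a query.
type Token struct {
	Kind TokenKind
	Text string
}

// keywords is a deliberately small, illustrative subset.
var keywords = map[string]bool{"SELECT": true, "FROM": true, "WHERE": true, "AND": true, "OR": true}

// Tokenize splits a query into tokens using plain Go, with no generated code.
func Tokenize(input string) ([]Token, error) {
	var tokens []Token
	runes := []rune(input)
	for i := 0; i < len(runes); {
		r := runes[i]
		switch {
		case unicode.IsSpace(r):
			i++
		case r == '\'':
			// Single-quoted string literal; escapes are not handled here.
			end := i + 1
			for end < len(runes) && runes[end] != '\'' {
				end++
			}
			if end == len(runes) {
				return nil, fmt.Errorf("unterminated string starting at offset %d", i)
			}
			tokens = append(tokens, Token{String, string(runes[i+1 : end])})
			i = end + 1
		case unicode.IsLetter(r) || r == '_':
			end := i
			for end < len(runes) && (unicode.IsLetter(runes[end]) || unicode.IsDigit(runes[end]) || runes[end] == '_') {
				end++
			}
			word := string(runes[i:end])
			kind := Ident
			if keywords[strings.ToUpper(word)] {
				kind = Keyword
			}
			tokens = append(tokens, Token{kind, word})
			i = end
		case strings.ContainsRune("=<>(),*", r):
			tokens = append(tokens, Token{Operator, string(r)})
			i++
		default:
			return nil, fmt.Errorf("unexpected character %q at offset %d", r, i)
		}
	}
	return tokens, nil
}

func main() {
	tokens, err := Tokenize("SELECT value FROM cpu WHERE host = 'serverA'")
	if err != nil {
		fmt.Println("tokenize failed:", err)
		return
	}
	for _, t := range tokens {
		fmt.Printf("%d %q\n", t.Kind, t.Text)
	}
}
```

Everything is ordinary Go that contributors can step through with standard tooling, and there is no CGo boundary or generated Flex/Bison code to maintain.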
All of these things together pointed to a rewrite Large
API changes, underlying technology changes, and code in every area getting touched
A refactor would have been a rewrite
It would have dragged breaking changes out over multiple releases
With a rewrite we rip the bandaid off quickly
–InfluxDB design philosophy “Optimize for developer happiness.”
The 0.9.0 release gives us a solid foundation to build
on
Thank you
Paul Dix @pauldix paul@influxdb.com
P.S. we’re hiring Go and front-end developers