Refactoring InfluxDB: from Go to Go
Paul Dix
February 19, 2015
Talk given at the Golang SF meetup about the rewrite from 0.8 to 0.9 of InfluxDB
Transcript
Refactoring InfluxDB: from Go to Go
Paul Dix, CEO and cofounder of InfluxDB
@pauldix paul@influxdb.com
About me…
Organizer NYC Machine Learning (5,500+ members)
Series Editor - “Data & Analytics”
Cofounder & CEO of InfluxDB
Onto the talk!
Confession… (not a Go talk)
API Design
Software Design
–InfluxDB design philosophy “Optimize for developer happiness.”
Refactor Rewrite
Before that, what’s InfluxDB?
Open source distributed time series database written in Go with
no external dependencies
Metrics
Time Series
Analytics
Events
Use Cases
• Metrics (DevOps)
• Sensor Data
• Realtime Analytics
Features Upcoming 0.9.0 release
Data model
• Databases
• Measurements
  • cpu_load, temperature, log_lines, click, etc.
• Tags
  • region=uswest, host=serverA, building=23, service=redis, etc.
• Series - measurement + unique tagset
• Points
  • Fields - bool, int64, float64, string, []byte
  • Timestamp - nano epoch
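As a rough illustration of the data model above, the following Python sketch shows how a series can be identified by a measurement plus its tag set. This is not InfluxDB code: the `Point` class, the `series_key` helper and the key format are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """One data point: a measurement, its tags, its fields, and a timestamp."""
    measurement: str
    tags: dict
    fields: dict
    timestamp_ns: int  # nanoseconds since the Unix epoch


def series_key(point):
    """Hypothetical helper: a series is the measurement plus its unique tag set.

    Tags are sorted so the same tag set always yields the same key,
    regardless of the order in which the tags were supplied.
    """
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(point.tags.items()))
    return f"{point.measurement},{tag_part}" if tag_part else point.measurement


a = Point("cpu_load", {"host": "serverA", "region": "uswest"}, {"value": 0.64}, 1257894000000000000)
b = Point("cpu_load", {"region": "uswest", "host": "serverA"}, {"value": 0.70}, 1257894010000000000)
c = Point("cpu_load", {"host": "serverB", "region": "uswest"}, {"value": 0.12}, 1257894000000000000)

print(series_key(a))                   # cpu_load,host=serverA,region=uswest
print(series_key(a) == series_key(b))  # True: same measurement and tag set
print(series_key(a) == series_key(c))  # False: different host tag
```

Points `a` and `b` land in the same series because only the measurement and tags take part in the key; fields and timestamps do not.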
Writing Data
curl -XPOST 'http://localhost:8086/write' -d '...'

Writing Data
{
  "database": "mydb",
  "retentionPolicy": "30d",
  "points": [
    {
      "name": "cpu_load",
      "tags": {
        "host": "server01",
        "region": "us-west"
      },
      "timestamp": "2009-11-10T23:00:00Z",
      "fields": {
        "value": 0.64
      }
    }
  ]
}
Slide callouts: Measurement ("name"), Tags ("tags"), Fields ("fields")
Querying
curl -G 'http://localhost:8086/query' --data-urlencode "q=..."

SQL-ish query language

QUERY:
SELECT value FROM cpu WHERE host = 'serverA'

RESULTS:
{
  "results": [
    {
      "query": "SELECT value FROM cpu WHERE host='serverA'",
      "series": [
        {
          "name": "cpu",
          "tags": {
            "host": "serverA"
          },
          "columns": ["time", "value"],
          "values": [
            ["2009-11-10T23:00:00Z", 22.1],
            ["2009-11-10T23:00:10Z", 25.2]
          ]
        }
      ]
    }
  ]
}
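The write and query slides above can be tied together in a small client-side sketch. It only builds the request body and URL and decodes a response of the shape shown; it makes no network calls. The function names (`build_write_body`, `build_query_url`, `iter_rows`) are made up for this example and are not part of any InfluxDB client library.

```python
import json
from urllib.parse import urlencode


def build_write_body(database, retention_policy, points):
    """Serialize a write request in the JSON shape shown on the slide."""
    return json.dumps(
        {"database": database, "retentionPolicy": retention_policy, "points": points}
    )


def build_query_url(base_url, query):
    """Build the GET URL for /query with the statement URL-encoded as `q`."""
    return f"{base_url}/query?{urlencode({'q': query})}"


def iter_rows(response_text):
    """Yield (series name, tags, row dict) for every value row in a response."""
    for result in json.loads(response_text)["results"]:
        for series in result.get("series", []):
            for values in series["values"]:
                row = dict(zip(series["columns"], values))
                yield series["name"], series.get("tags", {}), row


body = build_write_body("mydb", "30d", [{
    "name": "cpu_load",
    "tags": {"host": "server01", "region": "us-west"},
    "timestamp": "2009-11-10T23:00:00Z",
    "fields": {"value": 0.64},
}])
url = build_query_url("http://localhost:8086", "SELECT value FROM cpu WHERE host = 'serverA'")

# A response in the shape shown on the results slide.
sample = json.dumps({"results": [{
    "query": "SELECT value FROM cpu WHERE host='serverA'",
    "series": [{
        "name": "cpu",
        "tags": {"host": "serverA"},
        "columns": ["time", "value"],
        "values": [["2009-11-10T23:00:00Z", 22.1], ["2009-11-10T23:00:10Z", 25.2]],
    }],
}]})

for name, tags, row in iter_rows(sample):
    print(name, tags, row)
```

Pairing `columns` with each entry of `values` is what turns the compact response into per-row records.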
QUERY:
SELECT value FROM cpu
WHERE host = 'serverA' OR host = 'serverB'

SERIES IN RESULT:
{
  "series": [
    {
      "name": "cpu",
      "tags": {
        "host": "serverA"
      },
      "columns": ["time", "value"],
      "values": []
    },
    {
      "name": "cpu",
      "tags": {
        "host": "serverB"
      },
      "columns": ["time", "value"],
      "values": []
    }
  ]
}
QUERY:
SELECT percentile(90, value) FROM cpu
WHERE time > now() - 4h
GROUP BY time(10m), region

SERIES IN RESULT:
[
  {
    "name": "cpu",
    "tags": {
      "region": "us-west"
    },
    "columns": ["time", "percentile"],
    "values": []
  },
  {
    "name": "cpu",
    "tags": {
      "region": "us-east"
    },
    "columns": ["time", "percentile"],
    "values": []
  }
]
Multiple aggregates
SELECT mean(value), percentile(90, value), min(value), max(value)
FROM cpu
WHERE host='serverA' AND time > now() - 48h
GROUP BY time(1h)
Return every series in CPU
SELECT mean(value)
FROM cpu
WHERE time > now() - 48h
GROUP BY time(1h), *
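To make the GROUP BY queries above concrete, here is a minimal sketch of what grouping by a time interval and a tag involves: each point is assigned to a bucket by truncating its timestamp, and the aggregate (mean here) is computed per tag value and bucket. It assumes buckets are aligned to the epoch and timestamps are plain seconds; the talk does not say how InfluxDB aligns buckets, and `group_by_time_and_tags` is a hypothetical name.

```python
from collections import defaultdict


def group_by_time_and_tags(points, interval_s, tag_keys):
    """Return {tag values tuple: [(bucket_start, mean), ...]}.

    Each point is a dict with "tags", "time" (seconds) and "value".
    Assumes every point carries all of tag_keys so groups sort cleanly.
    """
    buckets = defaultdict(list)
    for p in points:
        group = tuple(p["tags"].get(k) for k in tag_keys)
        bucket_start = p["time"] - (p["time"] % interval_s)  # truncate to interval
        buckets[(group, bucket_start)].append(p["value"])

    result = defaultdict(list)
    for (group, bucket_start), values in sorted(buckets.items()):
        result[group].append((bucket_start, sum(values) / len(values)))
    return dict(result)


points = [
    {"tags": {"region": "us-west"}, "time": 0, "value": 1.0},
    {"tags": {"region": "us-west"}, "time": 300, "value": 3.0},
    {"tags": {"region": "us-west"}, "time": 600, "value": 5.0},
    {"tags": {"region": "us-east"}, "time": 60, "value": 10.0},
]

# Roughly: SELECT mean(value) ... GROUP BY time(10m), region
for group, rows in group_by_time_and_tags(points, 600, ["region"]).items():
    print(group, rows)
```

Each distinct tag value becomes its own series in the result, which matches the one-series-per-region shape of the response on the slide.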
Discovery based on tags
{
  "results": [
    {
      "query": "SHOW MEASUREMENTS",
      "series": [
        {
          "name": "measurements",
          "columns": ["name"],
          "values": [
            ["cpu"],
            ["memory"],
            ["network"]
          ]
        }
      ]
    }
  ]
}

{
  "results": [
    {
      "query": "SHOW SERIES",
      "series": [
        {
          "name": "cpu",
          "columns": ["id", "region", "host"],
          "values": [
            [1, "us-west", "serverA"],
            [2, "us-east", "serverB"]
          ]
        }
      ]
    }
  ]
}
{
  "query": "SHOW MEASUREMENTS WHERE service='redis'",
  "series": [
    {
      "name": "measurements",
      "columns": ["measurement"],
      "values": [
        ["key_count"],
        ["connections"]
      ]
    }
  ]
}
{
  "query": "SHOW TAG KEYS from cpu",
  "series": [
    {
      "name": "keys",
      "columns": ["key"],
      "values": [
        ["region"],
        ["host"]
      ]
    }
  ]
}

{
  "query": "SHOW TAG VALUES WITH KEY = service",
  "series": [
    {
      "name": "series",
      "columns": ["service"],
      "values": [
        ["redis"],
        ["apache"]
      ]
    }
  ]
}

{
  "query": "SHOW TAG VALUES FROM cpu WITH KEY = service",
  "series": [
    {
      "name": "series",
      "columns": ["service"],
      "values": [
        ["redis"],
        ["apache"]
      ]
    }
  ]
}
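The discovery queries above are cheap to answer when tags are indexed. The sketch below shows one way such an index could look, as a set of in-memory maps; it is an illustration only, not InfluxDB's actual index, and the `TagIndex` class and its method names are hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Toy in-memory index from tags to measurements and back."""

    def __init__(self):
        self._measurements = defaultdict(set)  # (tag key, tag value) -> measurements
        self._tag_keys = defaultdict(set)      # measurement -> tag keys
        self._tag_values = defaultdict(set)    # (measurement, tag key) -> tag values

    def add_series(self, measurement, tags):
        for key, value in tags.items():
            self._measurements[(key, value)].add(measurement)
            self._tag_keys[measurement].add(key)
            self._tag_values[(measurement, key)].add(value)

    def measurements_where(self, key, value):
        """Roughly: SHOW MEASUREMENTS WHERE key='value'."""
        return sorted(self._measurements.get((key, value), set()))

    def tag_keys(self, measurement):
        """Roughly: SHOW TAG KEYS FROM measurement."""
        return sorted(self._tag_keys.get(measurement, set()))

    def tag_values(self, measurement, key):
        """Roughly: SHOW TAG VALUES FROM measurement WITH KEY = key."""
        return sorted(self._tag_values.get((measurement, key), set()))


index = TagIndex()
index.add_series("key_count", {"service": "redis", "host": "serverA"})
index.add_series("connections", {"service": "redis", "host": "serverB"})
index.add_series("cpu", {"region": "us-west", "host": "serverA"})
index.add_series("cpu", {"region": "us-east", "host": "serverB"})

print(index.measurements_where("service", "redis"))  # ['connections', 'key_count']
print(index.tag_keys("cpu"))                         # ['host', 'region']
print(index.tag_values("cpu", "region"))             # ['us-east', 'us-west']
```

Every lookup is a dictionary access, so its cost does not depend on scanning all series names.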
Much more
• Retention policies
• Automatic downsampling and aggregation
• Clustering
onto the rewrite…
Since November we’ve been rewriting InfluxDB from scratch.
“… the single worst strategic mistake that any software company can make …”
– Joel Spolsky on rewriting from scratch
Things You Should Never Do, Part I
http://www.joelonsoftware.com/articles/fog0000000069.html
Why would we rewrite?
–InfluxDB design philosophy “Optimize for developer happiness.”
What does this mean?
for InfluxDB users…
simple setup
flexible API
empowering API
performant by default
helps developers build analytics applications faster
for InfluxDB developers…
fast build time
idiomatic Go code
easy to build and contribute
Not Happy
Split across three different legitimate areas
Feature Requests: moving average, different kinds of derivatives, ways to fill data, top N for a given period, exact data point for min/max
Performance Bugs: high CPU load on query, where clause takes too long, list series takes too long
Bugs: open file handles, out of memory, clustering crashes
Questions (illegitimate?)
Some questions point to bad API design
We identified 3 different areas of improvement
• API Design
• Structural Design Problems Limiting Performance
• Underlying Technology Choices Causing Bugs
API Design: should be understandable, should push users in the right direction
Previous API
Many series with metadata in the name like in Graphite
region.us.data_center.1.host.serverA.network_in
region.us.data_center.1.host.serverA.network_out
5m.mean.region.us.data_center.1.host.serverA.network_in
5m.mean.region.us.data_center.1.host.serverA.network_out
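To show how the dotted names above relate to the newer measurement-plus-tags model, here is a hedged sketch that splits such a name into key/value pairs and a trailing measurement. It assumes the name strictly alternates keys and values and ends with the measurement, which is only a convention inferred from the examples on the slide; `parse_dotted_name` is a hypothetical helper.

```python
def parse_dotted_name(name):
    """Split 'k1.v1.k2.v2.measurement' into (measurement, tags).

    Assumes alternating key/value segments followed by the measurement;
    raises ValueError when the segments cannot be paired up.
    """
    *pairs, measurement = name.split(".")
    if len(pairs) % 2 != 0:
        raise ValueError(f"cannot pair up segments in {name!r}")
    tags = dict(zip(pairs[0::2], pairs[1::2]))
    return measurement, tags


measurement, tags = parse_dotted_name("region.us.data_center.1.host.serverA.network_in")
print(measurement)  # network_in
print(tags)         # {'region': 'us', 'data_center': '1', 'host': 'serverA'}
```

The same function would read the `5m.mean.` prefix of the rollup names as a tag `5m=mean`, which is one small example of how fragile metadata encoded in names can be.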
Understandable API Design: Retention Policies
• Previously called shard spaces
• Users tell the server which shard space to read/write data into based on a regex to match against the series name
Too hard to use, couldn’t tell where data was going
Users could accidentally hide a bunch of data that was
previously available
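The failure mode described above can be sketched in a few lines. This example assumes that the first shard space whose regex matches the series name wins; the talk does not spell out the exact matching rule, so treat both that rule and the names (`pick_shard_space`, the space names) as assumptions made for illustration.

```python
import re


def pick_shard_space(series_name, shard_spaces, default="default"):
    """Return the first shard space whose regex matches (assumed rule)."""
    for space_name, pattern in shard_spaces:
        if re.search(pattern, series_name):
            return space_name
    return default


series = "region.us.data_center.1.host.serverA.network_in"

spaces = [("everything", r".*")]
before = pick_shard_space(series, spaces)

# Later, someone adds a more specific shard space ahead of the catch-all.
spaces.insert(0, ("us_only", r"^region\.us\."))
after = pick_shard_space(series, spaces)

print(before)  # everything
print(after)   # us_only
```

Data written while the series mapped to `everything` still lives there, but reads are now routed to `us_only`, so the older data appears to have vanished.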
Solution: have the user be explicit at write or query
time
Pushing users in the right direction: Tags
Users wanted this:
SELECT mean(value) FROM cpu
WHERE host = 'serverA'
Columns
• Wasted space
• Add Indexes?
• User still has to create them
Everyone was expecting tags… Indexed metadata and series
Structural Design Problems
Tell users to have many series:
LIST SERIES /.*region\.uswest.*/
SELECT mean(value)
FROM merge(/.*region\.uswest.*/)
WHERE time > now() - 2h
GROUP BY time(5m)
No way to make that perform well with > 100k
series
Solution: measurements and tags break up the namespace
Structural Design Problems: Query Engine
• Pipes raw data over the network
• Need to redesign to get data locality
• MapReduce framework
Solution: refactor/rewrite the query engine
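The data locality idea can be illustrated with a tiny map/reduce sketch for `mean`: each server reduces its own raw points to a small partial result, and only those partials cross the network. This is a simplified model, not the InfluxDB query engine, and `map_shard` and `reduce_mean` are hypothetical names.

```python
def map_shard(values):
    """Runs where the data lives: collapse raw points into (sum, count)."""
    return (sum(values), len(values))


def reduce_mean(partials):
    """Runs on the coordinating node: combine partials into the final mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None


# Raw points held on two different servers.
shards = {
    "server1": [1.0, 2.0, 3.0],
    "server2": [10.0],
}

partials = [map_shard(values) for values in shards.values()]
print(partials)               # [(6.0, 3), (10.0, 1)]
print(reduce_mean(partials))  # 4.0
```

Shipping `(sum, count)` rather than a per-shard mean matters: averaging the two shard means (2.0 and 10.0) would give 6.0, not the correct 4.0.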
Underlying Technology Choices
• Protobufs
  • Performance
  • Everywhere in code (network, database)
Solution: switch to raw bytes for storage
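As a sketch of what storing raw bytes can mean in practice, the example below packs a timestamp and a float value into a fixed 16-byte record. The layout (big-endian int64 followed by float64) is an assumption chosen for illustration; the talk does not describe the actual on-disk format.

```python
import struct

# Assumed layout: big-endian int64 timestamp (ns) followed by a float64 value.
_POINT = struct.Struct(">qd")


def encode_point(timestamp_ns, value):
    """Pack one point into a fixed-size 16-byte record."""
    return _POINT.pack(timestamp_ns, value)


def decode_point(raw):
    """Unpack a 16-byte record back into (timestamp_ns, value)."""
    return _POINT.unpack(raw)


raw = encode_point(1257894000000000000, 0.64)
print(len(raw))           # 16
print(decode_point(raw))  # (1257894000000000000, 0.64)
```

A fixed layout like this needs no schema or message framing to decode, which is the kind of overhead a general-purpose serialization format adds on every read and write.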
Underlying Technology Choices
• LevelDB
  • Too many file handles
  • No online backups
  • Too hard to transfer shard from one server to another
Solution: switch to BoltDB
Underlying Technology Choices
• Flex & Bison for Parser
  • Very hard to understand and update
  • CGo code
Solution: switch to pure Go parser
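For a sense of what a hand-written parser looks like compared with generated Flex and Bison code, here is a deliberately tiny recursive-descent sketch. It is written in Python for brevity, handles only `SELECT <field> FROM <measurement> [WHERE <tag> = '<value>']`, and is not the InfluxDB parser; all names are hypothetical.

```python
import re

# One token per match: a quoted string, a bare word, or the '=' operator.
_TOKEN = re.compile(r"\s*(?:(?P<string>'[^']*')|(?P<word>[A-Za-z_][A-Za-z0-9_]*)|(?P<op>=))")


def tokenize(text):
    """Turn the query text into a list of (kind, value) tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


def parse_select(text):
    """Parse the tiny SELECT subset into a dict describing the statement."""
    tokens = tokenize(text)
    pos = 0

    def expect(kind, value=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of query")
        tok_kind, tok_value = tokens[pos]
        if tok_kind != kind or (value is not None and tok_value.upper() != value):
            raise SyntaxError(f"unexpected token {tok_value!r}")
        pos += 1
        return tok_value

    expect("word", "SELECT")
    field = expect("word")
    expect("word", "FROM")
    measurement = expect("word")

    condition = None
    if pos < len(tokens):
        expect("word", "WHERE")
        tag = expect("word")
        expect("op", "=")
        condition = (tag, expect("string")[1:-1])  # strip the quotes

    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return {"field": field, "measurement": measurement, "where": condition}


print(parse_select("SELECT value FROM cpu WHERE host = 'serverA'"))
```

The whole grammar reads top to bottom as ordinary code, which is the main appeal of this style: adding a clause means adding a few lines to one function rather than regenerating a parser.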
All of these things together pointed to a rewrite
Large API changes, underlying technology changes, and code in every area getting touched
A refactor would have been a rewrite
It would have dragged breaking changes out over multiple releases
With a rewrite we rip the bandaid off quickly
–InfluxDB design philosophy “Optimize for developer happiness.”
The 0.9.0 release gives us a solid foundation to build
on
Thank you
Paul Dix
@pauldix paul@influxdb.com
P.S. we’re hiring Go and front-end developers