Refactoring InfluxDB: from Go to Go
Paul Dix
February 19, 2015
Talk given at the Golang SF meetup about the rewrite from 0.8 to 0.9 of InfluxDB
Transcript
Refactoring InfluxDB: from Go to Go
Paul Dix, CEO and cofounder of InfluxDB
@pauldix paul@influxdb.com
About me…
Organizer NYC Machine Learning (5,500+ members)
Series Editor - “Data & Analytics”
Cofounder & CEO of InfluxDB
Onto the talk!
Confession… (not a Go talk)
API Design
Software Design
“Optimize for developer happiness.”
- InfluxDB design philosophy
Refactor Rewrite
Before that, what’s InfluxDB?
Open source distributed time series database written in Go with no external dependencies
Metrics
Time Series
Analytics
Events
Use Cases
• Metrics (DevOps)
• Sensor Data
• Realtime Analytics
Features
Upcoming 0.9.0 release
Data model
• Databases
• Measurements
    • cpu_load, temperature, log_lines, click, etc.
• Tags
    • region=uswest, host=serverA, building=23, service=redis, etc.
• Series - measurement + unique tagset
• Points
    • Fields - bool, int64, float64, string, []byte
    • Timestamp - nano epoch
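The data model above can be sketched as Go types. This is a minimal illustration, not InfluxDB's actual code: the `Point` type, the `SeriesKey` function, and the comma-separated key format are all assumptions made for the example. The one thing taken from the slide is that a series is a measurement plus a unique tagset.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// Point is a hypothetical in-memory form of one data point:
// a measurement name, a tagset, one or more fields, and a timestamp.
type Point struct {
	Measurement string
	Tags        map[string]string
	Fields      map[string]interface{} // bool, int64, float64, string, or []byte
	Timestamp   time.Time              // nanosecond precision
}

// SeriesKey builds an identifier for "measurement + unique tagset".
// Tag keys are sorted so the same tagset always yields the same key.
// The key format itself is an assumption for this sketch.
func SeriesKey(measurement string, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := []string{measurement}
	for _, k := range keys {
		parts = append(parts, k+"="+tags[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	p := Point{
		Measurement: "cpu_load",
		Tags:        map[string]string{"region": "uswest", "host": "serverA"},
		Fields:      map[string]interface{}{"value": 0.64},
		Timestamp:   time.Unix(0, 1257894000000000000).UTC(),
	}
	// Prints: cpu_load,host=serverA,region=uswest
	fmt.Println(SeriesKey(p.Measurement, p.Tags))
}
```

Two points with the same measurement and tags map to the same key regardless of the order in which the tags were supplied.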
Writing Data
    curl -XPOST 'http://localhost:8086/write' -d '...'
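As a rough sketch, the JSON body that the curl command passes with -d could be built in Go like this. The field names follow the 0.9 write payload shown in this talk (database, retentionPolicy, points, name, tags, timestamp, fields). The type names and `ExampleWriteBody` are hypothetical and not part of any InfluxDB client library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WritePoint mirrors one entry of the "points" array in the write payload.
type WritePoint struct {
	Name      string                 `json:"name"` // the measurement
	Tags      map[string]string      `json:"tags"`
	Timestamp string                 `json:"timestamp"`
	Fields    map[string]interface{} `json:"fields"`
}

// WriteRequest mirrors the top-level write payload.
type WriteRequest struct {
	Database        string       `json:"database"`
	RetentionPolicy string       `json:"retentionPolicy"`
	Points          []WritePoint `json:"points"`
}

// ExampleWriteBody marshals a single cpu_load point into a JSON body.
func ExampleWriteBody() ([]byte, error) {
	req := WriteRequest{
		Database:        "mydb",
		RetentionPolicy: "30d",
		Points: []WritePoint{{
			Name:      "cpu_load",
			Tags:      map[string]string{"host": "server01", "region": "us-west"},
			Timestamp: "2009-11-10T23:00:00Z",
			Fields:    map[string]interface{}{"value": 0.64},
		}},
	}
	return json.Marshal(req)
}

func main() {
	body, err := ExampleWriteBody()
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(body))
}
```

The resulting bytes could then be sent to the /write endpoint with net/http in place of curl.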
Writing Data
    {
      "database": "mydb",
      "retentionPolicy": "30d",
      "points": [
        {
          "name": "cpu_load",
          "tags": {
            "host": "server01",
            "region": "us-west"
          },
          "timestamp": "2009-11-10T23:00:00Z",
          "fields": {
            "value": 0.64
          }
        }
      ]
    }
Callouts: "name" is the Measurement, "tags" are the Tags, "fields" are the Fields
Querying
    curl -G 'http://localhost:8086/query' --data-urlencode "q=..."
SQL-ish query language
QUERY:
    SELECT value FROM cpu WHERE host = 'serverA'
RESULTS:
    {
      "results": [
        {
          "query": "SELECT value FROM cpu WHERE host='serverA'",
          "series": [
            {
              "name": "cpu",
              "tags": {
                "host": "serverA"
              },
              "columns": ["time", "value"],
              "values": [
                ["2009-11-10T23:00:00Z", 22.1],
                ["2009-11-10T23:00:10Z", 25.2]
              ]
            }
          ]
        }
      ]
    }
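A minimal Go sketch of the client side of this exchange: building the /query URL with a URL-encoded q parameter, and decoding a response shaped like the one on the slide. The struct and function names are assumptions for illustration. Only the JSON field names and the /query endpoint come from the talk.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// Series, Result and Response mirror the JSON shape shown on the slide.
type Series struct {
	Name    string            `json:"name"`
	Tags    map[string]string `json:"tags"`
	Columns []string          `json:"columns"`
	Values  [][]interface{}   `json:"values"`
}

type Result struct {
	Query  string   `json:"query"`
	Series []Series `json:"series"`
}

type Response struct {
	Results []Result `json:"results"`
}

// QueryURL URL-encodes q, as curl's --data-urlencode does for -G requests.
func QueryURL(base, q string) string {
	v := url.Values{}
	v.Set("q", q)
	return base + "/query?" + v.Encode()
}

// ParseResponse decodes a query response body.
func ParseResponse(body []byte) (Response, error) {
	var r Response
	err := json.Unmarshal(body, &r)
	return r, err
}

// sampleResponse is the response from the slide, used here instead of a live server.
const sampleResponse = `{"results":[{"query":"SELECT value FROM cpu WHERE host='serverA'",
"series":[{"name":"cpu","tags":{"host":"serverA"},"columns":["time","value"],
"values":[["2009-11-10T23:00:00Z",22.1],["2009-11-10T23:00:10Z",25.2]]}]}]}`

func main() {
	fmt.Println(QueryURL("http://localhost:8086", "SELECT value FROM cpu WHERE host = 'serverA'"))

	resp, err := ParseResponse([]byte(sampleResponse))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, s := range resp.Results[0].Series {
		fmt.Println(s.Name, s.Tags, s.Columns, s.Values)
	}
}
```

Because each row mixes a timestamp string and a number, the values are decoded as `[]interface{}`; numbers arrive as float64.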
QUERY:
    SELECT value FROM cpu
    WHERE host = 'serverA' OR host = 'serverB'
SERIES IN RESULT:
    {
      "series": [
        {
          "name": "cpu",
          "tags": {
            "host": "serverA"
          },
          "columns": ["time", "value"],
          "values": []
        },
        {
          "name": "cpu",
          "tags": {
            "host": "serverB"
          },
          "columns": ["time", "value"],
          "values": []
        }
      ]
    }
QUERY:
    SELECT percentile(90, value) FROM cpu
    WHERE time > now() - 4h
    GROUP BY time(10m), region
SERIES IN RESULT:
    [
      {
        "name": "cpu",
        "tags": {
          "region": "us-west"
        },
        "columns": ["time", "percentile"],
        "values": []
      },
      {
        "name": "cpu",
        "tags": {
          "region": "us-east"
        },
        "columns": ["time", "percentile"],
        "values": []
      }
    ]
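To make the semantics of GROUP BY time(10m), region concrete, here is a small Go sketch that buckets samples into 10-minute windows per region and then takes a percentile of each bucket. It is an illustration only, not the InfluxDB query engine: all names are hypothetical, and the nearest-rank percentile method is an assumption, since the talk does not say which method InfluxDB uses.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// Sample is one raw value with its time and region tag.
type Sample struct {
	Time   time.Time
	Region string
	Value  float64
}

// GroupKey identifies one output row: a time bucket (Unix seconds) and a region.
type GroupKey struct {
	Bucket int64
	Region string
}

const tenMinutes = 10 * time.Minute

// GroupByTimeAndRegion places each sample into the bucket its time truncates to.
func GroupByTimeAndRegion(samples []Sample, interval time.Duration) map[GroupKey][]float64 {
	groups := make(map[GroupKey][]float64)
	for _, s := range samples {
		k := GroupKey{Bucket: s.Time.Truncate(interval).Unix(), Region: s.Region}
		groups[k] = append(groups[k], s.Value)
	}
	return groups
}

// Percentile returns the nearest-rank p-th percentile (0 < p <= 100).
func Percentile(p float64, values []float64) (float64, bool) {
	if len(values) == 0 || p <= 0 || p > 100 {
		return 0, false
	}
	sorted := append([]float64(nil), values...) // copy so the input is not reordered
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1], true
}

// sampleTime returns 2009-11-10T23:00:00Z plus the given number of minutes.
func sampleTime(minutes int) time.Time {
	base := time.Date(2009, 11, 10, 23, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(minutes) * time.Minute)
}

func main() {
	samples := []Sample{
		{sampleTime(0), "us-west", 10},
		{sampleTime(4), "us-west", 30},
		{sampleTime(12), "us-west", 50},
		{sampleTime(1), "us-east", 20},
	}
	groups := GroupByTimeAndRegion(samples, tenMinutes)
	west := groups[GroupKey{Bucket: sampleTime(0).Unix(), Region: "us-west"}]
	p90, _ := Percentile(90, west)
	fmt.Println(len(groups), west, p90)
}
```

Each distinct region becomes its own series in the result, and each bucket becomes one row in that series.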
Multiple aggregates
    SELECT mean(value), percentile(90, value), min(value), max(value)
    FROM cpu
    WHERE host='serverA' AND time > now() - 48h
    GROUP BY time(1h)
Return every series in CPU
    SELECT mean(value)
    FROM cpu
    WHERE time > now() - 48h
    GROUP BY time(1h), *
Discovery based on tags
    {
      "results": [
        {
          "query": "SHOW MEASUREMENTS",
          "series": [
            {
              "name": "measurements",
              "columns": ["name"],
              "values": [
                ["cpu"],
                ["memory"],
                ["network"]
              ]
            }
          ]
        }
      ]
    }
    {
      "results": [
        {
          "query": "SHOW SERIES",
          "series": [
            {
              "name": "cpu",
              "columns": ["id", "region", "host"],
              "values": [
                [1, "us-west", "serverA"],
                [2, "us-east", "serverB"]
              ]
            }
          ]
        }
      ]
    }
    {
      "query": "SHOW MEASUREMENTS WHERE service='redis'",
      "series": [
        {
          "name": "measurements",
          "columns": ["measurement"],
          "values": [
            ["key_count"],
            ["connections"]
          ]
        }
      ]
    }
    {
      "query": "SHOW TAG KEYS from cpu",
      "series": [
        {
          "name": "keys",
          "columns": ["key"],
          "values": [
            ["region"],
            ["host"]
          ]
        }
      ]
    }
    {
      "query": "SHOW TAG VALUES WITH KEY = service",
      "series": [
        {
          "name": "series",
          "columns": ["service"],
          "values": [
            ["redis"],
            ["apache"]
          ]
        }
      ]
    }
    {
      "query": "SHOW TAG VALUES FROM cpu WITH KEY = service",
      "series": [
        {
          "name": "series",
          "columns": ["service"],
          "values": [
            ["redis"],
            ["apache"]
          ]
        }
      ]
    }
Much more
• Retention policies
• Automatic downsampling and aggregation
• Clustering
onto the rewrite…
Since November we’ve been rewriting InfluxDB from scratch.
“… the single worst strategic mistake that any software company can make …”
- Joel Spolsky on rewriting from scratch, Things You Should Never Do, Part I
http://www.joelonsoftware.com/articles/fog0000000069.html
Why would we rewrite?
“Optimize for developer happiness.”
- InfluxDB design philosophy
What does this mean?
for InfluxDB users…
simple setup
flexible API
empowering API
performant by default
helps developers build analytics applications faster
for InfluxDB developers…
fast build time
idiomatic Go code
easy to build and contribute
Not Happy
Split across three different legitimate areas
Feature Requests
Moving average, different kinds of derivatives, ways to fill data, top N for a given period, exact data point for min/max
Performance Bugs
High CPU load on query, where clause takes too long, list series takes too long
Bugs
Open file handles, out of memory, clustering crashes
Questions (illegitimate?)
Some questions point to bad API design
We identified 3 different areas of improvement
• API Design
• Structural Design Problems Limiting Performance
• Underlying Technology Choices Causing Bugs
API Design
Should be understandable, should push users in the right direction
Previous API
Many series with metadata in the name like in Graphite
    region.us.data_center.1.host.serverA.network_in
    region.us.data_center.1.host.serverA.network_out
    5m.mean.region.us.data_center.1.host.serverA.network_in
    5m.mean.region.us.data_center.1.host.serverA.network_out
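To show how much metadata was packed into these series names, here is a hedged Go sketch that splits one of them into a measurement and tags. The alternating key.value convention and the `SplitName` function are assumptions read off the example names above, not a rule from InfluxDB 0.8 or a migration tool from the talk. The sketch does not special-case the 5m.mean. downsampling prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// SplitName treats a dotted series name as key.value pairs followed by a
// final segment that names the measurement. This convention is an assumption.
func SplitName(name string) (string, map[string]string, error) {
	if name == "" {
		return "", nil, fmt.Errorf("empty series name")
	}
	parts := strings.Split(name, ".")
	if len(parts)%2 == 0 {
		return "", nil, fmt.Errorf("expected key.value pairs followed by a measurement: %q", name)
	}
	tags := make(map[string]string, len(parts)/2)
	for i := 0; i+1 < len(parts); i += 2 {
		tags[parts[i]] = parts[i+1]
	}
	return parts[len(parts)-1], tags, nil
}

func main() {
	m, tags, err := SplitName("region.us.data_center.1.host.serverA.network_in")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(m, tags["region"], tags["data_center"], tags["host"])
}
```

With names like these, every query that filters on region or host has to pattern-match against the whole series name.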
Understandable API Design: Retention Policies
• Previously called shard spaces
• Users tell the server which shard space to read/write data into based on a regex to match against the series name
Too hard to use, couldn’t tell where data was going
Users could accidentally hide a bunch of data that was previously available
Solution: have the user be explicit at write or query time
Pushing users in the right direction: Tags
Users wanted this:
    SELECT mean(value) FROM cpu
    WHERE host = 'serverA'
Columns
• Wasted space
• Add Indexes?
• User still has to create them
Everyone was expecting tags…
Indexed metadata and series
Structural Design Problems
Tell users to have many series:
    LIST SERIES /.*region\.uswest.*/
    SELECT mean(value)
    FROM merge(/.*region\.uswest.*/)
    WHERE time > now() - 2h
    GROUP BY time(5m)
No way to make that perform well with > 100k series
Solution: measurements and tags break up the namespace
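One way to picture why measurements and tags help: instead of running a regex over every series name, a lookup can go straight from a measurement and a tag value to the matching series IDs. The Go sketch below is a toy inverted index under that assumption. `TagIndex` and its methods are hypothetical and are not taken from the InfluxDB source.

```go
package main

import "fmt"

// TagIndex maps measurement -> "key=value" -> series IDs.
type TagIndex struct {
	byTag map[string]map[string][]int
}

// NewTagIndex returns an empty index.
func NewTagIndex() *TagIndex {
	return &TagIndex{byTag: make(map[string]map[string][]int)}
}

// Add registers a series ID under each of its tags within a measurement.
func (idx *TagIndex) Add(measurement string, id int, tags map[string]string) {
	m, ok := idx.byTag[measurement]
	if !ok {
		m = make(map[string][]int)
		idx.byTag[measurement] = m
	}
	for k, v := range tags {
		m[k+"="+v] = append(m[k+"="+v], id)
	}
}

// Lookup returns the series IDs in a measurement that carry key=value.
// It only touches one measurement's entries, never the whole namespace.
func (idx *TagIndex) Lookup(measurement, key, value string) []int {
	return idx.byTag[measurement][key+"="+value]
}

func main() {
	idx := NewTagIndex()
	idx.Add("cpu", 1, map[string]string{"region": "us-west", "host": "serverA"})
	idx.Add("cpu", 2, map[string]string{"region": "us-east", "host": "serverB"})

	fmt.Println(idx.Lookup("cpu", "host", "serverA"))   // [1]
	fmt.Println(idx.Lookup("cpu", "region", "us-east")) // [2]
}
```

The cost of a lookup here depends on the number of matching series, not on the total number of series names in the database.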
Structural Design Problems: Query Engine
• Pipes raw data over the network
• Need to redesign to get data locality
• MapReduce framework
Solution: refactor/rewrite the query engine
Underlying Technology Choices
• Protobufs
    • Performance
    • Everywhere in code (network, database)
Solution: switch to raw bytes for storage
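As an illustration of what storing raw bytes can look like in Go, the sketch below packs a nanosecond timestamp and a float64 value into a fixed 16-byte record with encoding/binary. The layout, the big-endian byte order, and the function names are assumptions for this example. The talk does not describe the actual 0.9 storage format.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// recordSize is 8 bytes of timestamp followed by 8 bytes of float64 bits.
const recordSize = 16

// EncodePoint writes the timestamp and value as fixed-width big-endian fields.
func EncodePoint(tsNanos int64, value float64) []byte {
	buf := make([]byte, recordSize)
	binary.BigEndian.PutUint64(buf[0:8], uint64(tsNanos))
	binary.BigEndian.PutUint64(buf[8:16], math.Float64bits(value))
	return buf
}

// DecodePoint reverses EncodePoint.
func DecodePoint(buf []byte) (int64, float64, error) {
	if len(buf) != recordSize {
		return 0, 0, errors.New("record must be exactly 16 bytes")
	}
	ts := int64(binary.BigEndian.Uint64(buf[0:8]))
	value := math.Float64frombits(binary.BigEndian.Uint64(buf[8:16]))
	return ts, value, nil
}

func main() {
	buf := EncodePoint(1257894000000000000, 0.64)
	ts, value, err := DecodePoint(buf)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(len(buf), ts, value)
}
```

A fixed layout like this needs no schema or generated code to read back, which is the kind of overhead the slide attributes to using Protobufs everywhere.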
Underlying Technology Choices
• LevelDB
    • Too many file handles
    • No online backups
    • Too hard to transfer shard from one server to another
Solution: switch to BoltDB
Underlying Technology Choices
• Flex & Bison for Parser
    • Very hard to understand and update
    • CGo code
Solution: switch to pure Go parser
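For a sense of what a hand-written, pure Go front end looks like compared with generated Flex and Bison code, here is a very small scanner (the first stage of a parser) for a query like the ones in this talk. It is a sketch only: the token kinds, the keyword list, and `Scan` are hypothetical, numbers and durations such as 10m are not handled, and this is not InfluxDB's parser.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Token is one lexical unit of a query.
type Token struct {
	Kind string // "KEYWORD", "IDENT", "STRING", or "SYMBOL"
	Text string
}

// keywords is a deliberately tiny, assumed keyword set.
var keywords = map[string]bool{"SELECT": true, "FROM": true, "WHERE": true, "AND": true, "OR": true}

// Scan splits input into tokens: words, single-quoted strings, and symbols.
func Scan(input string) ([]Token, error) {
	var tokens []Token
	runes := []rune(input)
	for i := 0; i < len(runes); {
		r := runes[i]
		switch {
		case unicode.IsSpace(r):
			i++
		case unicode.IsLetter(r) || r == '_':
			start := i
			for i < len(runes) && (unicode.IsLetter(runes[i]) || unicode.IsDigit(runes[i]) || runes[i] == '_') {
				i++
			}
			word := string(runes[start:i])
			kind := "IDENT"
			if keywords[strings.ToUpper(word)] {
				kind = "KEYWORD"
			}
			tokens = append(tokens, Token{Kind: kind, Text: word})
		case r == '\'':
			start := i + 1
			i++
			for i < len(runes) && runes[i] != '\'' {
				i++
			}
			if i == len(runes) {
				return nil, fmt.Errorf("unterminated string starting at offset %d", start-1)
			}
			tokens = append(tokens, Token{Kind: "STRING", Text: string(runes[start:i])})
			i++ // skip the closing quote
		default:
			tokens = append(tokens, Token{Kind: "SYMBOL", Text: string(r)})
			i++
		}
	}
	return tokens, nil
}

func main() {
	tokens, err := Scan("SELECT value FROM cpu WHERE host = 'serverA'")
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for _, t := range tokens {
		fmt.Printf("%s(%s) ", t.Kind, t.Text)
	}
	fmt.Println()
}
```

Everything here is ordinary Go that builds with the standard toolchain, with no CGo and no generated C to maintain.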
All of these things together pointed to a rewrite
Large API changes, underlying technology changes, and code in every area getting touched
A refactor would have been a rewrite
It would have dragged breaking changes out over multiple releases
With a rewrite we rip the bandaid off quickly
“Optimize for developer happiness.”
- InfluxDB design philosophy
The 0.9.0 release gives us a solid foundation to build on
Thank you
Paul Dix
@pauldix paul@influxdb.com
P.S. we’re hiring Go and front-end developers