Slide 1

At Scale, Everything is Hard
Paul Dix, @pauldix, paul@influxdata.com

Slide 2

No content

Slide 3

Scale?

Slide 4

Scale != count(servers)

Slide 5

Scaling Throughput

Slide 6

Scaling Total Data Size

Slide 7

Scaling Development Teams

Slide 8

Scaling Code Bases

Slide 9

Scaling Feature Sets

Slide 10

At Scale, Everything is Hard

Slide 11

Time series data is the worst and best use case in distributed databases
(dotScale 2015)

Slide 12

High read & write throughput

Slide 13

Large range scans

Slide 14

Append/insert only

Slide 15

Deletes against large ranges

Slide 16

At Scale, Everything is Hard

Slide 17

InfluxDB 0.9 to InfluxDB 2.0

Slide 18

Monolith to Services

Slide 19

Modern Containerized Data Platform Architecture

Slide 20

Data Platform, not Database?

Slide 21

Flashback to June 2015…

Slide 22

No content

Slide 23

No content

Slide 24

No content

Slide 25

No content

Slide 26

No content

Slide 27

We’ve come a long way…

Slide 28

Time Structured Merge Tree

Slide 29

One-Dot-Oh

Slide 30

Clustering

Slide 31

Infrastructure software has come a long way…

Slide 32

Containerization

Slide 33

Kubernetes

Slide 34

Declarative Infrastructure
Infrastructure as Code

Slide 35

Lessons at Scale

Slide 36

Single Tenant Inefficiencies

Slide 37

Team Scaling: 12 -> 90

Slide 38

Monolith Scaling: LOC 35k -> 280k

Slide 39

At Scale, Monoliths are Hard

Slide 40

Large Test Surface Area

Slide 41

Slower Releases

Slide 42

The more frequently you release code, the less risky each release is.

Slide 43

Two-Dot-Oh

Slide 44

Database designed for containers?

Slide 45

Services-based Database?

Slide 46

Built on top of Kubernetes

Slide 47

Multi-tenant

Slide 48

Workload Isolation

Slide 49

Architecture

Slide 50

No content

Slide 51

No content

Slide 52

At Scale, Everything is Hard

Slide 53

Single Server Monolith

Slide 54

Architecture

Slide 55

API

Slide 56

UI

Slide 57

Storage

Slide 58

Query

Slide 59

Processing, Monitoring & Alerting

Slide 60

Collection & Scraping

Slide 61

Deploy Services Independently

Slide 62

Stateless Services

Slide 63

Stateful Services

Slide 64

Data has Gravity

Slide 65

Auto-Scaling

Slide 66

Singleton

Slide 67

Decouple Query from Storage

Slide 68

InfluxQL & TICKScript -> Flux
https://github.com/influxdata/platform/query

Slide 69

Flux (#fluxlang) is a lightweight language for working with data

Slide 70

Push Down Processing

Slide 71

Push Down Processing (diagram: Flux Processor and two Data Nodes)

Slide 72

Push Down Processing (diagram: Flux Processor and two Storage Nodes)

from(db:"foo")
  |> range(start:-1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_system")
  |> sum()
  |> group()
  |> sort()
  |> limit(n:20)

Slide 73

Push Down Processing (diagram: Flux Processor and two Data Nodes)

from(db:"foo")
  |> range(start:-1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_system")
  |> sum()
  |> group()
  |> sort()
  |> limit(n:20)

Slide 74

Push Down Processing (diagram: Flux Processor and two Data Nodes; only the summary ticks back up to the processor)

Slide 75

Push Down Processing (diagram: Flux Processor and two Data Nodes)

from(db:"foo")
  |> range(start:-1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_system")
  |> sum()
  |> group()
  |> sort()
  |> limit(n:20)
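
The push-down idea: the range, filter, and sum execute on each storage node, and only a small per-node summary travels back to the Flux processor for the final merge, so what crosses the network scales with the number of nodes rather than the number of raw points scanned. Below is a minimal Go sketch of that flow; the types and method names are hypothetical stand-ins for illustration, not the influxdata/platform API.

package main

import "fmt"

// readRequest describes the work pushed down to a storage node: which series
// to read and, implicitly, which aggregate to run locally. (Hypothetical.)
type readRequest struct {
    measurement string // e.g. "cpu"
    field       string // e.g. "usage_system"
}

// storageNode stands in for a data node that owns one shard of the series.
type storageNode struct {
    points map[string][]float64
}

// readSum applies the filter and the sum locally, so raw points never leave
// the node; only a one-number summary crosses the network.
func (s *storageNode) readSum(req readRequest) float64 {
    var sum float64
    for _, v := range s.points[req.measurement+"."+req.field] {
        sum += v
    }
    return sum
}

func main() {
    nodes := []*storageNode{
        {points: map[string][]float64{"cpu.usage_system": {1, 2, 3}}},
        {points: map[string][]float64{"cpu.usage_system": {4, 5}}},
    }
    req := readRequest{measurement: "cpu", field: "usage_system"}

    // The Flux processor's side of the work: merge the small summaries.
    var total float64
    for _, n := range nodes {
        total += n.readSum(req)
    }
    fmt.Println("total:", total) // total: 15
}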

Slide 76

Optimize RPC
Make fast?

Slide 77

At Scale, Marshaling is Slow

Slide 78

Apache Arrow

Slide 79

Zero-Copy, no marshaling overhead!

Slide 80

In-memory columnar
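
The sum benchmarks on the next two slides exercise the Arrow Go sum kernels. As a rough sketch of what an in-memory columnar sum looks like from Go, assuming the github.com/apache/arrow/go/arrow package layout of that era (the builder and math.Float64.Sum calls shown here are an assumption; check the repository for the current API):

package main

import (
    "fmt"

    "github.com/apache/arrow/go/arrow/array"
    "github.com/apache/arrow/go/arrow/math"
    "github.com/apache/arrow/go/arrow/memory"
)

func main() {
    pool := memory.NewGoAllocator()

    // Build an in-memory, columnar Arrow array of 8,192 float64 values.
    b := array.NewFloat64Builder(pool)
    defer b.Release()
    for i := 0; i < 8192; i++ {
        b.Append(float64(i))
    }
    values := b.NewFloat64Array()
    defer values.Release()

    // The sum kernel can dispatch to an AVX2 implementation (generated with
    // c2goasm) instead of a pure-Go loop, which is the gap the benchmarks show.
    fmt.Println(math.Float64.Sum(values))
}

Because the values sit in one contiguous, fixed-width buffer, the kernel reads them directly with no per-value decoding or marshaling.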

Slide 81

Sum 8,192 Values

AVX2 using c2goasm:
BenchmarkFloat64Funcs_Sum_8192-8   2000000    687 ns/op   95375.41 MB/s
BenchmarkInt64Funcs_Sum_8192-8     2000000    719 ns/op   91061.06 MB/s
BenchmarkUint64Funcs_Sum_8192-8    2000000    691 ns/op   94797.29 MB/s

Pure Go:
BenchmarkFloat64Funcs_Sum_8192-8    200000  10285 ns/op    6371.41 MB/s
BenchmarkInt64Funcs_Sum_8192-8      500000   3892 ns/op   16837.37 MB/s
BenchmarkUint64Funcs_Sum_8192-8     500000   3929 ns/op   16680.00 MB/s

Slide 82

Sum 8,192 Values

AVX2 using c2goasm:
BenchmarkFloat64Funcs_Sum_8192-8   2000000    687 ns/op   95375.41 MB/s
BenchmarkInt64Funcs_Sum_8192-8     2000000    719 ns/op   91061.06 MB/s
BenchmarkUint64Funcs_Sum_8192-8    2000000    691 ns/op   94797.29 MB/s

Pure Go:
BenchmarkFloat64Funcs_Sum_8192-8    200000  10285 ns/op    6371.41 MB/s
BenchmarkInt64Funcs_Sum_8192-8      500000   3892 ns/op   16837.37 MB/s
BenchmarkUint64Funcs_Sum_8192-8     500000   3929 ns/op   16680.00 MB/s

Slide 83

At Scale, Data Layout in Memory Matters

Slide 84

At Scale, CPU Instruction Set Capabilities Matter

Slide 85

Follow Arrow Development https://github.com/apache/arrow/tree/master/go/arrow

Slide 86

Follow Flux & Platform Development https://github.com/influxdata/platform

Slide 87

At Scale, Everything is… Interesting

Slide 88

At Scale, Everything is… Interesting

Slide 89

Thank you.
Paul Dix, @pauldix