At Scale, Everything is Hard

Talk presented at dotScale in Paris. It explores what we've learned while developing InfluxDB and scaling the company. I also cover the architecture of our upcoming 2.0 cloud offering, which is multi-tenant and built on top of Kubernetes.

Paul Dix

June 01, 2018

Transcript

  1. At Scale, Everything is Hard
    Paul Dix

    @pauldix

    paul@influxdata.com

  2. Scale != count(servers)

  3. Scaling Throughput

  4. Scaling Total Data Size

  5. Scaling Development Teams

  6. Scaling Code Bases

  7. Scaling Feature Sets

  8. At Scale, Everything is Hard

  9. Time series data is the worst and
    best use case in distributed
    databases
    dotScale 2015

  10. High read & write throughput

  11. Large range scans

  12. Append/insert only

  13. Deletes against large ranges

  14. At Scale, Everything is Hard

  15. InfluxDB 0.9 to InfluxDB 2.0

  16. Monolith to Services

  17. Modern Containerized Data
    Platform Architecture

  18. Data Platform, not Database?

  19. Flashback to June 2015…

  20. We’ve come a long way…

  21. Time Structured Merge Tree

  22. Infrastructure software has
    come a long way…

  23. Containerization

  24. Declarative Infrastructure
    Infrastructure as Code

  25. Lessons at Scale

  26. Single Tenant Inefficiencies

  27. Team Scaling: 12 -> 90

  28. Monolith Scaling:
    LOC 35k -> 280k

  29. At Scale, Monoliths are Hard

  30. Large Test Surface Area

  31. Slower Releases

  32. The more frequently you release
    code, the less risky each release is.

  33. Database designed for
    containers?

  34. Services based Database?

  35. Built on top of Kubernetes

  36. Multi-tenant

  37. Workload Isolation

  38. Architecture

  39. At Scale, Everything is Hard

  40. Single Server Monolith

  41. Architecture

  42. Processing, Monitoring & Alerting

  43. Collection & Scraping

  44. Deploy Services Independently

  45. Stateless Services

  46. Stateful Services

  47. Data has Gravity

  48. Auto-Scaling

  49. Decouple Query from Storage

  50. InfluxQL & TICKscript -> Flux
    https://github.com/influxdata/platform/query

  51. Flux (#fluxlang) is a lightweight
    language for working with data

  52. Push Down Processing

  53. Push Down Processing
    [Diagram: Flux Processor, Data Node, Data Node]

  54. Push Down Processing
    [Diagram: Flux Processor, Storage Node, Storage Node]
    from(db:"foo")
      |> range(start:-1h)
      |> filter(fn: (r) =>
          r._measurement == "cpu" and
          r._field == "usage_system")
      |> sum()
      |> group()
      |> sort()
      |> limit(n:20)

  55. Push Down Processing
    [Diagram: Flux Processor, Data Node, Data Node]
    from(db:"foo")
      |> range(start:-1h)
      |> filter(fn: (r) =>
          r._measurement == "cpu" and
          r._field == "usage_system")
      |> sum()
      |> group()
      |> sort()
      |> limit(n:20)

  56. Push Down Processing
    [Diagram: Flux Processor, Data Node, Data Node]
    Summary Ticks Back Up

  57. Push Down Processing
    [Diagram: Flux Processor, Data Node, Data Node]
    from(db:"foo")
      |> range(start:-1h)
      |> filter(fn: (r) =>
          r._measurement == "cpu" and
          r._field == "usage_system")
      |> sum()
      |> group()
      |> sort()
      |> limit(n:20)
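
The push-down flow in the slides above can be sketched in Go. This is a minimal illustration under stated assumptions, not InfluxDB's or Flux's implementation: `Point`, `DataNode`, `partialSum`, and `pushDownSum` are hypothetical names, and the nodes are in-memory slices instead of networked services. Each node applies the filter and the sum locally, so only one number per node travels back to the processor.

```go
package main

import "fmt"

// Point is a hypothetical, simplified time series point.
type Point struct {
	Measurement string
	Field       string
	Value       float64
}

// DataNode is a hypothetical stand-in for a node holding one shard of the data.
type DataNode struct {
	points []Point
}

// partialSum runs the filter and the sum on the node itself, so a single
// float64 leaves the node instead of every matching point.
func (n DataNode) partialSum(measurement, field string) float64 {
	var sum float64
	for _, p := range n.points {
		if p.Measurement == measurement && p.Field == field {
			sum += p.Value
		}
	}
	return sum
}

// pushDownSum plays the role of the query processor: it sends the
// filter+sum to every node and merges the partial results.
func pushDownSum(nodes []DataNode, measurement, field string) float64 {
	var total float64
	for _, n := range nodes {
		total += n.partialSum(measurement, field)
	}
	return total
}

func main() {
	nodes := []DataNode{
		{points: []Point{{"cpu", "usage_system", 1.5}, {"cpu", "usage_user", 9}}},
		{points: []Point{{"cpu", "usage_system", 2.5}, {"mem", "used", 7}}},
	}
	fmt.Println(pushDownSum(nodes, "cpu", "usage_system"))
}
```

A sum pushes down cleanly because partial sums can simply be added together. Steps such as a global sort or limit would still need a final pass on the processor.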

  58. Optimize RPC
    Make fast?

  59. At Scale, Marshaling is Slow

  60. Apache Arrow

  61. Zero-Copy, no marshaling
    overhead!
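
The difference between marshaling and zero-copy access can be sketched in Go. This is not the Apache Arrow API: `encodeLE`, `decodeLE`, `bytesOf`, and `floatsOf` are hypothetical helpers, and the zero-copy path assumes Go 1.17 or later (for `unsafe.Slice`) and a buffer that is 8-byte aligned and in host byte order. The marshaling path walks and copies every value twice, while the zero-copy path only reinterprets memory that already exists.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"unsafe"
)

// encodeLE marshals values into a fresh little-endian byte buffer (one copy).
func encodeLE(vals []float64) []byte {
	buf := make([]byte, 8*len(vals))
	for i, v := range vals {
		binary.LittleEndian.PutUint64(buf[8*i:], math.Float64bits(v))
	}
	return buf
}

// decodeLE unmarshals the buffer back into a fresh slice (a second copy).
func decodeLE(buf []byte) []float64 {
	out := make([]float64, len(buf)/8)
	for i := range out {
		out[i] = math.Float64frombits(binary.LittleEndian.Uint64(buf[8*i:]))
	}
	return out
}

// bytesOf exposes the memory behind vals as bytes without copying.
func bytesOf(vals []float64) []byte {
	if len(vals) == 0 {
		return nil
	}
	return unsafe.Slice((*byte)(unsafe.Pointer(&vals[0])), 8*len(vals))
}

// floatsOf reinterprets a byte buffer as float64 values without copying.
// Assumption: buf came from bytesOf, or is otherwise 8-byte aligned and
// in host byte order.
func floatsOf(buf []byte) []float64 {
	if len(buf) < 8 {
		return nil
	}
	return unsafe.Slice((*float64)(unsafe.Pointer(&buf[0])), len(buf)/8)
}

func main() {
	vals := []float64{1, 2, 3}

	copied := decodeLE(encodeLE(vals)) // two passes over the data, two allocations
	view := floatsOf(bytesOf(vals))    // no pass over the data, no allocation

	view[0] = 42 // the view aliases vals; the decoded copy does not
	fmt.Println(vals[0], copied[0], view[0])
}
```

The aliasing in `main` is the point of the sketch: a zero-copy reader sees the producer's memory directly, which is cheap but means both sides must agree on the layout in advance.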

  62. In-memory columnar
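
One way to picture an in-memory columnar layout is the Go sketch below. It is an illustration only and not Arrow's actual format: `Row`, `Columns`, `toColumns`, `sumRows`, and `sumColumn` are hypothetical names. The same data is stored once as a slice of records and once as one contiguous slice per field.

```go
package main

import "fmt"

// Row is a hypothetical row-oriented record: all fields of one point together.
type Row struct {
	Timestamp int64
	Host      string
	Usage     float64
}

// Columns holds the same data column by column; each slice is contiguous.
type Columns struct {
	Timestamp []int64
	Host      []string
	Usage     []float64
}

// toColumns transposes rows into the columnar layout.
func toColumns(rows []Row) Columns {
	c := Columns{
		Timestamp: make([]int64, 0, len(rows)),
		Host:      make([]string, 0, len(rows)),
		Usage:     make([]float64, 0, len(rows)),
	}
	for _, r := range rows {
		c.Timestamp = append(c.Timestamp, r.Timestamp)
		c.Host = append(c.Host, r.Host)
		c.Usage = append(c.Usage, r.Usage)
	}
	return c
}

// sumRows has to step over whole records to reach one field in each.
func sumRows(rows []Row) float64 {
	var sum float64
	for i := range rows {
		sum += rows[i].Usage
	}
	return sum
}

// sumColumn scans a dense []float64 and touches nothing else.
func sumColumn(usage []float64) float64 {
	var sum float64
	for _, v := range usage {
		sum += v
	}
	return sum
}

func main() {
	rows := []Row{{1, "a", 0.5}, {2, "b", 1.5}, {3, "a", 2.0}}
	cols := toColumns(rows)
	fmt.Println(sumRows(rows), sumColumn(cols.Usage))
}
```

Both functions return the same result. The columnar version reads only the bytes it needs, and a dense array of one type is the shape that vectorized instructions operate on.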

  63. Sum 8,192 Values
    AVX2 using c2goasm:
    BenchmarkFloat64Funcs_Sum_8192-8    2000000      687 ns/op    95375.41 MB/s
    BenchmarkInt64Funcs_Sum_8192-8      2000000      719 ns/op    91061.06 MB/s
    BenchmarkUint64Funcs_Sum_8192-8     2000000      691 ns/op    94797.29 MB/s
    Pure Go:
    BenchmarkFloat64Funcs_Sum_8192-8     200000    10285 ns/op     6371.41 MB/s
    BenchmarkInt64Funcs_Sum_8192-8       500000     3892 ns/op    16837.37 MB/s
    BenchmarkUint64Funcs_Sum_8192-8      500000     3929 ns/op    16680.00 MB/s
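
The MB/s figures on this slide follow from the ns/op figures, and the short Go sketch below reproduces the arithmetic. It assumes each operation sums 8,192 eight-byte values (65,536 bytes) and that 1 MB means 10^6 bytes; `throughputMBps` and `speedup` are hypothetical helper names. Because the slide's ns/op values are rounded, the computed throughput matches the slide only approximately. On these numbers, the AVX2 version is roughly 15x faster for float64 and between 5x and 6x faster for the integer types.

```go
package main

import "fmt"

// The workload on the slide: 8,192 values of a 64-bit type per iteration.
const (
	valuesPerOp   = 8192
	bytesPerValue = 8
)

// throughputMBps converts a ns/op timing into MB/s, assuming 1 MB = 1e6 bytes.
func throughputMBps(nsPerOp float64) float64 {
	bytesPerOp := float64(valuesPerOp * bytesPerValue)
	// Bytes per nanosecond equals GB/s, so multiply by 1000 for MB/s.
	return bytesPerOp / nsPerOp * 1000
}

// speedup reports how many times faster the SIMD timing is than the pure Go one.
func speedup(pureGoNs, simdNs float64) float64 {
	return pureGoNs / simdNs
}

func main() {
	// ns/op figures copied from the slide.
	timings := []struct {
		name         string
		avx2, pureGo float64
	}{
		{"float64", 687, 10285},
		{"int64", 719, 3892},
		{"uint64", 691, 3929},
	}
	for _, t := range timings {
		fmt.Printf("%-8s AVX2 %.0f MB/s, pure Go %.0f MB/s, speedup %.1fx\n",
			t.name, throughputMBps(t.avx2), throughputMBps(t.pureGo), speedup(t.pureGo, t.avx2))
	}
}
```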

  64. Sum 8,192 Values
    AVX2 using c2goasm:
    BenchmarkFloat64Funcs_Sum_8192-8    2000000      687 ns/op    95375.41 MB/s
    BenchmarkInt64Funcs_Sum_8192-8      2000000      719 ns/op    91061.06 MB/s
    BenchmarkUint64Funcs_Sum_8192-8     2000000      691 ns/op    94797.29 MB/s
    Pure Go:
    BenchmarkFloat64Funcs_Sum_8192-8     200000    10285 ns/op     6371.41 MB/s
    BenchmarkInt64Funcs_Sum_8192-8       500000     3892 ns/op    16837.37 MB/s
    BenchmarkUint64Funcs_Sum_8192-8      500000     3929 ns/op    16680.00 MB/s

  65. At Scale, Data Layout in
    Memory Matters

  66. At Scale, CPU Instruction Set
    Capabilities Matter

  67. Follow Arrow Development
    https://github.com/apache/arrow/tree/master/go/arrow

  68. Follow Flux & Platform
    Development
    https://github.com/influxdata/platform

  69. At Scale, Everything is…
    Interesting

  70. At Scale, Everything is…
    Interesting

  71. Thank you.
    Paul Dix

    @pauldix
