At Scale, Everything is Hard

Talk presented at dotScale in Paris. It's an exploration of what we've learned over the course of developing InfluxDB and scaling the company. I also go into the architecture of our upcoming 2.0 cloud offering, which is multi-tenant and built on top of Kubernetes.


Paul Dix

June 01, 2018

Transcript

  1. At Scale, Everything is Hard Paul Dix @pauldix paul@influxdata.com

  2. None
  3. Scale?

  4. Scale != count(servers)

  5. Scaling Throughput

  6. Scaling Total Data Size

  7. Scaling Development Teams

  8. Scaling Code Bases

  9. Scaling Feature Sets

  10. At Scale, Everything is Hard

  11. Time series data is the worst and best use case in distributed databases (dotScale 2015)

  12. High read & write throughput

  13. Large range scans

  14. Append/insert only

  15. Deletes against large ranges

  16. At Scale, Everything is Hard

  17. InfluxDB 0.9 to InfluxDB 2.0

  18. Monolith to Services

  19. Modern Containerized Data Platform Architecture

  20. Data Platform, not Database?

  21. Flashback to June 2015…

  22. None
  23. None
  24. None
  25. None
  26. None
  27. We’ve come a long way…

  28. Time Structured Merge Tree

  29. One-Dot-Oh

  30. Clustering

  31. Infrastructure software has come a long way…

  32. Containerization

  33. Kubernetes

  34. Declarative Infrastructure / Infrastructure as Code

  35. Lessons at Scale

  36. Single Tenant Inefficiencies

  37. Team Scaling: 12 -> 90

  38. Monolith Scaling: LOC 35k -> 280k

  39. At Scale, Monoliths are Hard

  40. Large Test Surface Area

  41. Slower Releases

  42. The more frequently you release code, the less risky each release is.

  43. Two-Dot-Oh

  44. Database designed for containers?

  45. Services based Database?

  46. Built on top of Kubernetes

  47. Multi-tenant

  48. Workload Isolation

  49. Architecture

  50. None
  51. None
  52. At Scale, Everything is Hard

  53. Single Server Monolith

  54. Architecture

  55. API

  56. UI

  57. Storage

  58. Query

  59. Processing, Monitoring & Alerting

  60. Collection & Scraping

  61. Deploy Services Independently

  62. Stateless Services

  63. Stateful Services

  64. Data has Gravity

  65. Auto-Scaling

  66. Singleton

  67. Decouple Query from Storage

  68. InfluxQL & TICKScript -> Flux https://github.com/influxdata/platform/query

  69. Flux (#fluxlang) is a lightweight language for working with data

  70. Push Down Processing

  71. Push Down Processing Flux Processor Data Node Data Node

  72. Push Down Processing Flux Processor Storage Node Storage Node

    from(db:"foo")
      |> range(start:-1h)
      |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_system")
      |> sum()
      |> group()
      |> sort()
      |> limit(n:20)
  73. Push Down Processing Flux Processor Data Node Data Node

    from(db:"foo")
      |> range(start:-1h)
      |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_system")
      |> sum()
      |> group()
      |> sort()
      |> limit(n:20)
  74. Push Down Processing Flux Processor Data Node Data Node Summary Ticks Back Up

  75. Push Down Processing Flux Processor Data Node Data Node

    from(db:"foo")
      |> range(start:-1h)
      |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_system")
      |> sum()
      |> group()
      |> sort()
      |> limit(n:20)
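
A minimal sketch of the push-down idea in the slides above, using entirely hypothetical Go names (PushDownSpec, StorageNode, fluxProcessor are illustrations, not the actual platform API): the Flux processor ships the filter and aggregate portion of the plan to each data node, each node computes locally, and only small summaries tick back up to be combined.

    // Hypothetical sketch of push-down processing; none of these types exist
    // in the InfluxData platform code base under these names.
    package main

    import "fmt"

    // PushDownSpec describes the work a storage node can run locally:
    // the time range, the filter, and the aggregate from the Flux plan.
    type PushDownSpec struct {
        Measurement string
        Field       string
        RangeHours  int
        Aggregate   string // e.g. "sum"
    }

    // StorageNode stands in for a node that owns one shard of the data.
    type StorageNode struct {
        Name   string
        points []float64 // pretend these already match the spec's filter
    }

    // Execute runs the pushed-down aggregate locally and returns a single
    // summary value instead of streaming every raw point to the query tier.
    func (n *StorageNode) Execute(spec PushDownSpec) float64 {
        var sum float64
        for _, v := range n.points {
            sum += v
        }
        return sum
    }

    // fluxProcessor plays the query-tier role: fan the spec out, then combine
    // the per-node summaries (group, sort, and limit would happen here).
    func fluxProcessor(spec PushDownSpec, nodes []*StorageNode) float64 {
        var total float64
        for _, n := range nodes {
            total += n.Execute(spec)
        }
        return total
    }

    func main() {
        spec := PushDownSpec{Measurement: "cpu", Field: "usage_system", RangeHours: 1, Aggregate: "sum"}
        nodes := []*StorageNode{
            {Name: "data-node-1", points: []float64{1.5, 2.0}},
            {Name: "data-node-2", points: []float64{0.5}},
        }
        fmt.Println(fluxProcessor(spec, nodes)) // 4
    }
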
  76. Optimize RPC Make fast?

  77. At Scale, Marshaling is Slow

  78. Apache Arrow

  79. Zero-Copy, no marshaling overhead!

  80. In-memory columnar

  81. Sum 8,192 Values

    AVX2 using c2goasm:
      BenchmarkFloat64Funcs_Sum_8192-8   2000000    687 ns/op   95375.41 MB/s
      BenchmarkInt64Funcs_Sum_8192-8     2000000    719 ns/op   91061.06 MB/s
      BenchmarkUint64Funcs_Sum_8192-8    2000000    691 ns/op   94797.29 MB/s
    Pure Go:
      BenchmarkFloat64Funcs_Sum_8192-8    200000  10285 ns/op    6371.41 MB/s
      BenchmarkInt64Funcs_Sum_8192-8      500000   3892 ns/op   16837.37 MB/s
      BenchmarkUint64Funcs_Sum_8192-8     500000   3929 ns/op   16680.00 MB/s
  82. Sum 8,192 Values (same benchmark results as the previous slide)
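
A minimal sketch of the kind of code behind those numbers, assuming the Apache Arrow Go module from the go/arrow tree linked a few slides below; the exact import paths and signatures are my assumption and may have changed since this talk.

    package main

    import (
        "fmt"

        // Assumed import paths for the Arrow Go library at the time of this talk.
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/math"
        "github.com/apache/arrow/go/arrow/memory"
    )

    func main() {
        // Build an in-memory columnar array of float64 values.
        b := array.NewFloat64Builder(memory.NewGoAllocator())
        defer b.Release()
        b.AppendValues([]float64{1, 2, 3, 4}, nil)

        values := b.NewFloat64Array()
        defer values.Release()

        // Float64Funcs.Sum is the routine the Float64Funcs_Sum benchmarks above
        // exercise; it can dispatch to an AVX2 kernel generated via c2goasm.
        fmt.Println(math.Float64.Sum(values)) // 10
    }
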
  83. At Scale, Data Layout in Memory Matters

  84. At Scale, CPU Instruction Set Capabilities Matter

  85. Follow Arrow Development https://github.com/apache/arrow/tree/master/go/arrow

  86. Follow Flux & Platform Development https://github.com/influxdata/platform

  87. At Scale, Everything is… Interesting

  88. At Scale, Everything is… Interesting

  89. Thank you. Paul Dix @pauldix