Deep Dive into LINE's Time Series Database 'Flash'

Xuanhuy Do
LINE Observability Infrastructure Team, Senior Software Engineer
https://linedevday.linecorp.com/jp/2019/sessions/B1-1


LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay Deep Dive Into LINE's Time Series Database 'Flash'

    > Xuanhuy Do > LINE Observability Infrastructure Team Senior Software Engineer
  2. Agenda > Observability Introduction > Road to our new TSDB

    Flash > Flash Architecture > Challenges and Lessons Learned
  3. LINE’s Observability Team

  4. In control theory, observability is a measure of how well

    internal states of a system can be inferred from knowledge of its external outputs
  5. Help our engineers to keep their services healthy

  6. Provide Multiple Infrastructures

    > Metrics collecting, alert on issues > Distributed tracing for root-cause analysis > Log collecting, alert on WARN / ERROR
  7. Provide Multiple Infrastructures

    > Metrics collecting, alert on issues > Distributed tracing for root-cause analysis > Log collecting, alert on WARN / ERROR
  9. Metrics Collecting, Alert on Issues

  10. Metrics Are So Important: Trending, Alerting, Anomaly Detection, Root-Cause Analysis, Impact Measurement
  11. Without metrics, you know nothing about your application

  12. Number of Metrics per Minute Over the Years (chart: from ~10M around 2016, to 200M, to 1B by 2019~2020)
  13. So Many Metrics?? Culture Microservice boom Convenient Libraries Business Growth

  14. So Many Metrics?? 1,000 Servers × 10,000 Metrics Each = 10,000,000 Metrics Total
  15. Our Metrics Platform: Agents push metrics → Intake Layer → Metric Storage → Graph UI / Alert Setting
  16. Our Metrics Platform: focus on the Metric Storage layer
  17. Our Metric Storage History: 2011 MySQL → 2012 Sharding MySQL → 2013~2016 Super big sharding MySQL → 2017 OpenTSDB → 2018 Start developing new storage → 2019 Release new storage
  18. MySQL Era: OOM, Latency (Swap), Disk Full, Complex Sharding, No Tag Support
  19. MySQL Era: problems covered by manual operation

  20. Our Metric Storage History: 2011 MySQL → 2012 Sharding MySQL → 2013~2016 Super big sharding MySQL → 2017 OpenTSDB → 2018 Start developing new storage → 2019 Release new storage
  21. OpenTSDB Era: Complex Modules, Latency (Cache Miss), Latency (GC), Some Query Patterns Are Slow
  22. OpenTSDB Era: problems covered by manual operation

  23. We Decided To Build Our Own Storage!

  24. Our Metric Storage History: 2011 MySQL → 2012 Sharding MySQL → 2013~2016 Super big sharding MySQL → 2017 OpenTSDB → 2018 Start developing new storage → 2019 Release new storage
  25. Why Not Use Existing OSS?

    > InfluxDB (EE): does not satisfy expected write throughput, and high cost > ClickHouse: does not satisfy expected read throughput > TimescaleDB: does not satisfy expected read throughput > HeroicDB: does not satisfy expected write throughput > FB Beringei: in-memory module only, no tag-based queries > Netflix Atlas: in-memory module only > M3DB: unstable and undocumented (a year ago)
  26. Why Not Use a Vendor Solution? Usage cost, impossible to customize, migration cost
  27. What We Needed: Polyglot Protocol, Low Cost / Easy Maintenance, Massive Scale-Out for Write/Read
  28. What We Needed

    > Polyglot Protocol: users can push with any client protocol - HTTP - UDP - Prometheus Remote Write/Read
  29. What We Needed

    > Massive Scale-Out for Write/Read. How massive? - Scale out to serve up to tens of billions of metrics - Low latency (P99 < 100ms) for read/write
  30. What We Needed

    > Low Cost / Easy Maintenance - Binary deployment - Limit the number of modules to maintain
  31. What We Achieved: 4 million datapoint writes per second, 1,000 queries per second, Write P99 < 100ms, Read P99 < 100ms
  32. How We Did That!?

  33. What Is a Metrics Storage, or a Time Series Database (TSDB)?

  34. Glossary CPU

  35. Glossary CPU { “Host” = “Host1” } { “Zone” =

    “JP” } Labels
  36. Glossary 12:00 12:01 12:02 CPU 0.5 CPU 1.0 CPU 0.7

    Datapoint
  37. Glossary 12:00 12:01 12:02 Serie

  38. TSDB Storage

    type Serie struct {
        Id         uint64
        Label      map[string]string
        Datapoints []Datapoint
    }

    type Datapoint struct {
        Value     float64
        Timestamp int64
    }

    type Storage interface {
        Put(Series []Serie)
        QueryRange(Label map[string]string, fromTs int64, toTs int64) []Datapoint
    }
  40. How We Achieve This Scale? The same Storage interface, driven at 4,000,000 Series per second (Put) and 1,000 QueryRange calls per second
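The Storage interface on the slide is small enough to exercise with a toy implementation. A minimal sketch, assuming a naive slice-backed store with a linear scan and exact-match label filtering (the `memStorage` type is an illustrative invention, not Flash's actual design, which shards and compresses data):

```go
package main

import "fmt"

// Datapoint and Serie mirror the slide's definitions.
type Datapoint struct {
	Value     float64
	Timestamp int64
}

type Serie struct {
	Id         uint64
	Label      map[string]string
	Datapoints []Datapoint
}

// memStorage is a toy, in-memory implementation of the slide's Storage interface.
type memStorage struct {
	series []Serie
}

func (s *memStorage) Put(series []Serie) {
	s.series = append(s.series, series...)
}

// QueryRange returns the datapoints of every series whose labels contain
// all queried label pairs, filtered to the [fromTs, toTs] window.
func (s *memStorage) QueryRange(label map[string]string, fromTs, toTs int64) []Datapoint {
	var out []Datapoint
	for _, serie := range s.series {
		matched := true
		for k, v := range label {
			if serie.Label[k] != v {
				matched = false
				break
			}
		}
		if !matched {
			continue
		}
		for _, dp := range serie.Datapoints {
			if dp.Timestamp >= fromTs && dp.Timestamp <= toTs {
				out = append(out, dp)
			}
		}
	}
	return out
}

func main() {
	st := &memStorage{}
	st.Put([]Serie{{
		Id:    1234,
		Label: map[string]string{"name": "CPU", "Host": "Host1", "Zone": "JP"},
		Datapoints: []Datapoint{
			{Value: 0.5, Timestamp: 720}, // 12:00
			{Value: 1.0, Timestamp: 721}, // 12:01
			{Value: 0.7, Timestamp: 722}, // 12:02
		},
	}})
	got := st.QueryRange(map[string]string{"Host": "Host1"}, 720, 721)
	fmt.Println(len(got)) // prints 2: only 12:00 and 12:01 fall inside the range
}
```

A scan like this is O(series × datapoints) per query; the point of the talk's dedicated label and datapoint storages is precisely to avoid it at scale.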
  41. Technical Decisions

    > In-memory based: fit the most recent 28 hours of data “entirely” into memory > Microservice: split into multiple services for local optimization > gRPC as the main communication protocol > Go language: ease of deployment and good tooling > Utilize open source: don’t try to reinvent all the wheels; combine with other open source projects to reduce development cost
  43. Microservice-Based Architecture. Idea: split the Label Storage service and the Datapoints Storage service
  44. CPU { “Host” = “Host1” } { “Zone” = “JP”

    } 12:00 0.5
  45. CPU { “Host” = “Host1” } { “Zone” = “JP”

    } Hash Uint64: 1234 Serie ID
  46. 1234 CPU { “Host” = “Host1” } { “Zone” =

    “JP” } 1234 12:00 0.5
  47. 1234 CPU { “Host” = “Host1” } { “Zone” =

    “JP” } 1234 12:00 0.5 Label Storage Datapoints Storage
  48. 1234 CPU { “Host” = “Host1” } { “Zone” =

    “JP” } 1234 12:00 0.5 Optimized For Text Store Optimized For Number Store
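The hashing step in the slides above (a metric name plus its labels mapped to a uint64 Serie ID such as 1234) can be sketched as follows. The FNV-1a hash and the sorted "key=value" encoding are assumptions for illustration; the talk does not name the actual hash function:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// serieID derives a stable uint64 series ID from a metric name and its
// label set. Keys are sorted so that label iteration order cannot change
// the resulting ID.
func serieID(name string, labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	h.Write([]byte(name))
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{'='})
		h.Write([]byte(labels[k]))
		h.Write([]byte{';'})
	}
	return h.Sum64()
}

func main() {
	a := serieID("CPU", map[string]string{"Host": "Host1", "Zone": "JP"})
	b := serieID("CPU", map[string]string{"Zone": "JP", "Host": "Host1"})
	fmt.Println(a == b) // true: label order does not matter
}
```

Once both storages key everything by this ID, the text-heavy label data and the number-heavy datapoint data can be stored and optimized independently, exactly as the slide suggests.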
  49. Simplified Architecture: Metric Input Layer, Datapoints Storage, Label Storage, and Metric Query Layer, communicating over gRPC
  50. Result: we split off services with different requirements, so we can optimize each in a different way
  51. Technical Decisions

    > In-memory based: fit the most recent 28 hours of data “entirely” into memory > Microservice: split into multiple services for local optimization > gRPC as the main communication protocol > Go language: ease of deployment and good tooling > Utilize open source: don’t try to reinvent all the wheels; combine with other open source projects to reduce development cost
  52. We Want Both Fast Write + Fast Read

  53. The Only Way to Achieve That Is to Use DRAM

  54. Datapoints Storage. Idea: split into an in-memory Datapoints Storage and a persistent Datapoints Storage; fit the latest 28 hours of data “entirely” into memory
  55. Metric Storage: Metric Input Layer, Label Storage, Metric Query Layer, In-Memory Datapoint Storage (28h), Persistent Datapoint Storage
  56. In-Memory Datapoint Storage

    > Store 28 hours of data entirely in memory > Use the delta-delta XOR algorithm to compress datapoints > Scale by hash-based series-ID distribution > High availability with RocksDB as the log backend and the Raft consensus protocol for replication
  57. Delta-Delta XOR Algorithm: 12:00 CPU 0.5, 12:01 CPU 0.6, 12:02 CPU 0.7, 12:03 CPU 0.8
  58. Delta-Delta XOR Algorithm: values 0.5 0.6 0.7 0.8; first deltas 0.1 0.1 0.1; delta-of-deltas 0 0
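The three number rows on the slide (values, first-order deltas, second-order deltas) can be sketched in Go. One note of hedging: the slide demonstrates deltas on the sample values, but in the Gorilla-style scheme this kind of storage typically follows, delta-of-delta is applied to timestamps and XOR to the float values; both quantities are computed below, and the function names are illustrative:

```go
package main

import (
	"fmt"
	"math"
)

// deltaOfDelta returns the second-order timestamp differences that an
// encoder would bit-pack: for regular scrape intervals they are zero
// and compress down to almost nothing.
func deltaOfDelta(ts []int64) []int64 {
	out := make([]int64, 0, len(ts))
	for i := 2; i < len(ts); i++ {
		out = append(out, (ts[i]-ts[i-1])-(ts[i-1]-ts[i-2]))
	}
	return out
}

// xorBits returns the XOR of adjacent float64 bit patterns; similar
// values share most of their bits, so each XOR has long runs of zeros.
func xorBits(vals []float64) []uint64 {
	out := make([]uint64, 0, len(vals))
	for i := 1; i < len(vals); i++ {
		out = append(out, math.Float64bits(vals[i])^math.Float64bits(vals[i-1]))
	}
	return out
}

func main() {
	// Timestamps one minute apart: 12:00, 12:01, 12:02, 12:03 (as seconds).
	fmt.Println(deltaOfDelta([]int64{43200, 43260, 43320, 43380})) // [0 0]
	// A repeated value XORs to exactly zero bits.
	fmt.Println(xorBits([]float64{0.5, 0.5, 0.7})[0]) // 0
}
```

A real encoder then writes these residuals with variable-length bit patterns, which is how 28 hours of datapoints fit "entirely" into DRAM.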
  61. In-Memory Datapoint Storage

    > Store 28 hours of data entirely in memory > Use the delta-delta XOR algorithm to compress datapoints > Scale by hash-based series-ID distribution > High availability with RocksDB as the log backend and the Raft consensus protocol for replication
  62. Storage Unit > Store data by “Shard” unit > Each shard = 3 physical machines (replication factor 3) with large DRAM + SSD > Each machine replicates data using “Raft”
  63. Data Distributed by Hashing: the client computes Hash(Serie 1) = 1 and sends Insert(Serie 1) to Shard 1; the shard topology is kept in Central Dogma
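The hash-based distribution in this slide can be sketched as a shard-picking function. The plain modulo below is an assumption for illustration; production systems often prefer consistent or jump hashing so that changing the shard count moves less data:

```go
package main

import "fmt"

// pickShard maps a series ID to one of numShards shard indexes.
// Because the series ID is already a hash, a modulo spreads the
// series roughly evenly across shards.
func pickShard(serieID uint64, numShards int) int {
	return int(serieID % uint64(numShards))
}

func main() {
	// Hash(Serie 1) = 1 in the slide's example; with 2 shards the
	// series lands on shard index 1 (i.e. "Shard 2" if counting from 1).
	fmt.Println(pickShard(1, 2)) // 1
}
```

In the talk's design, the client consults the shard topology (stored in Central Dogma) to resolve the chosen shard index to its 3-machine Raft group.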
  64. Raft Consensus Algorithm: Client → Leader → Followers, replicated via Raft
  65. RocksDB-Backed Write: the leader persists the Raft log to RocksDB before the data is applied in memory
  66. Result: high availability in the memory layer; easy to scale both READ and WRITE
  67. Technical Decisions

    > In-memory based: fit the most recent 28 hours of data “entirely” into memory > Microservice: split into multiple services for local optimization > gRPC as the main communication protocol > Go language: ease of deployment and good tooling > Utilize open source: don’t try to reinvent all the wheels; combine with other open source projects to reduce development cost
  68. Go Programming Language

    > Simplicity: writing code and deployment (single binary) > Very good tooling for profiling (CPU/memory, …): very helpful when building memory-intensive applications > Has a GC, but still allows control over memory layout
  69. Result: able to deliver in a very short time

  70. Technical Decisions

    > In-memory based: fit the most recent 28 hours of data “entirely” into memory > Microservice: split into multiple services for local optimization > gRPC as the main communication protocol > Go language: ease of deployment and good tooling > Utilize open source: don’t try to reinvent all the wheels; combine with other open source projects to reduce development cost
  71. Metric Storage: Metric Input Layer, Label Storage, Metric Query Layer, In-Memory Metric Storage (28h), Persistent Metric Storage
  72. Result: we don’t have to reinvent “ALL” the wheels

  73. Final words: we traveled a long road to build this storage
  74. It took just two months to build the first prototype and beta release
  75. But it took a year to release it to production

  76. It takes time to make a usable thing, but it takes a lot more time to make the best thing
  77. Current Achievement Is Not Good Enough: 4 million datapoint writes per second, 1,000 queries per second, Write P99 < 100ms, Read P99 < 100ms
  78. We Still Have a Lot To Do

    > Better performance for the label index layer > More performance, more reliability, better cost optimization > Multi-region deployment for disaster recovery > A new storage-based ecosystem (Alert, On-Call…)
  79. Thank You