Deep Dive into LINE's Time Series Database 'Flash'

Xuanhuy Do
LINE Observability Infrastructure Team, Senior Software Engineer
https://linedevday.linecorp.com/jp/2019/sessions/B1-1

LINE DevDay 2019

November 20, 2019

Transcript

  1. 1.

    2019 DevDay Deep Dive Into LINE's Time Series Database 'Flash'

    > Xuanhuy Do > LINE Observability Infrastructure Team Senior Software Engineer
  2. 2.

    Agenda > Observability Introduction > Road to Our New TSDB 'Flash' > Flash Architecture > Challenges and Lessons Learned
  3. 4.

    In control theory, observability is a measure of how well

    internal states of a system can be inferred from knowledge of its external outputs
  4. 6.

    Provide Multiple Infrastructures > Metrics collecting, alert on issues > Distributed tracing for root cause analysis > Log collecting, alert on WARN / ERROR
  7. 17.

    Our Metric Storage History > 2011 MySQL > 2012 Sharding MySQL > 2013~2016 Super big sharding MySQL > 2017 OpenTSDB > 2018 Start developing new storage > 2019 Release new storage
  10. 25.

    Why Not Use Existing OSS? > InfluxDB (EE): did not satisfy expected write throughput + high cost > ClickHouse: did not satisfy expected read throughput > TimescaleDB: did not satisfy expected read throughput > HeroicDB: did not satisfy expected write throughput > FB Beringei: only an in-memory module, no tag-based query support > Netflix Atlas: only an in-memory module > M3DB: unstable + no documentation (1 year ago)
  11. 28.

    What We Needed: Polyglot Protocol > Users can use any client protocol: HTTP, UDP, Prometheus Remote Write/Read
  12. 29.

    What We Needed: Massive Scale-Out for Write/Read > How massive? > Scale out to serve tens of billions of metrics > Low latency (p99 < 100ms) for read/write
  13. 30.

    What We Needed: Low Cost / Easy Maintenance > Binary deployment > Limit the number of modules to maintain
  14. 31.

    What We Achieved > 4 million datapoint writes per second > 1,000 queries per second > Write P99 < 100ms > Read P99 < 100ms
  15. 38.

    TSDB Storage

    type Serie struct {
        Id         uint64
        Label      map[string]string
        Datapoints []Datapoint
    }

    type Datapoint struct {
        Value     float64
        Timestamp int64
    }

    type Storage interface {
        Put(series []Serie)
        QueryRange(label map[string]string, fromTs int64, toTs int64) []Datapoint
    }
  17. 40.

    How Do We Achieve This Scale?

    type Serie struct {
        Id         uint64
        Label      map[string]string
        Datapoints []Datapoint
    }

    type Datapoint struct {
        Value     float64
        Timestamp int64
    }

    type Storage interface {
        Put(series []Serie)
        QueryRange(label map[string]string, fromTs int64, toTs int64) []Datapoint
    }

    4,000,000 series per second / 1,000 range queries per second
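The slide only shows the Storage interface, not an implementation. As a rough illustration of its semantics (not Flash's actual code), a toy, non-concurrent in-memory version might look like this; `memStorage` and `matches` are names invented for the sketch:

```go
package main

import "fmt"

type Datapoint struct {
	Value     float64
	Timestamp int64
}

type Serie struct {
	Id         uint64
	Label      map[string]string
	Datapoints []Datapoint
}

// memStorage is a toy sketch of the slide's Storage interface:
// series are appended on Put, and QueryRange filters by label
// match and timestamp range.
type memStorage struct {
	series []Serie
}

func (m *memStorage) Put(series []Serie) {
	m.series = append(m.series, series...)
}

// matches reports whether every wanted label is present with the same value.
func matches(have, want map[string]string) bool {
	for k, v := range want {
		if have[k] != v {
			return false
		}
	}
	return true
}

func (m *memStorage) QueryRange(label map[string]string, fromTs, toTs int64) []Datapoint {
	var out []Datapoint
	for _, s := range m.series {
		if !matches(s.Label, label) {
			continue
		}
		for _, dp := range s.Datapoints {
			if dp.Timestamp >= fromTs && dp.Timestamp <= toTs {
				out = append(out, dp)
			}
		}
	}
	return out
}

func main() {
	st := &memStorage{}
	st.Put([]Serie{{
		Id:         1234,
		Label:      map[string]string{"Host": "Host1"},
		Datapoints: []Datapoint{{0.5, 100}, {0.7, 200}},
	}})
	fmt.Println(st.QueryRange(map[string]string{"Host": "Host1"}, 0, 150)) // [{0.5 100}]
}
```

A linear scan like this obviously cannot reach 4M writes/s and 1,000 range queries/s; the rest of the talk covers how Flash gets there.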
  18. 41.

    Technical Decisions > In-memory base: fit the most recent 28 hours of data "entirely" into memory > Microservices: split into multiple services for local optimization, with gRPC as the main communication protocol > Go language: ease of deployment and good tooling > Utilize open source: don't reinvent the wheel; combine with other open-source projects to reduce development cost
  20. 47.

    Label Storage: 1234 CPU { "Host" = "Host1" } { "Zone" = "JP" } > Datapoints Storage: 1234 12:00 0.5
  21. 48.

    Label Storage (1234 CPU { "Host" = "Host1" } { "Zone" = "JP" }): optimized for text storage > Datapoints Storage (1234 12:00 0.5): optimized for number storage
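The split shown on the slide keeps one copy of the labels per serie and routes the numeric samples elsewhere, joined by the serie id. A minimal sketch of that idea (the row types and `split` helper are invented for illustration, not Flash's API):

```go
package main

import "fmt"

// labelRow lives in the text-optimized label store.
type labelRow struct {
	SerieID uint64
	Labels  map[string]string
}

// datapointRow lives in the number-optimized datapoint store.
type datapointRow struct {
	SerieID   uint64
	Timestamp int64
	Value     float64
}

// split separates one incoming sample into its two storage rows;
// the serie id is the join key between the stores.
func split(id uint64, labels map[string]string, ts int64, v float64) (labelRow, datapointRow) {
	return labelRow{id, labels}, datapointRow{id, ts, v}
}

func main() {
	lr, dr := split(1234, map[string]string{"Host": "Host1", "Zone": "JP"}, 43200, 0.5)
	fmt.Println(lr.SerieID, dr.SerieID, dr.Value) // 1234 1234 0.5
}
```

Labels are written once per serie, while datapoints arrive every scrape, so each store can be optimized for its own access pattern.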
  23. 54.

    Datapoints Storage Idea: split into an In-Memory Datapoints Storage and a Persistent Datapoints Storage > Fit the latest 28 hours of data "entirely" into memory
  24. 55.

    Metric Storage > Metric Input Layer > Label Storage > Metric Query Layer > In-Memory Datapoint Storage (28h) > Persistent Datapoint Storage
  25. 56.

    In-Memory Datapoint Storage > Store 28 hours of data entirely in memory > Use the delta-of-delta + XOR algorithm to compress datapoints > Scale out via hash-based serie id distribution > High availability with RocksDB as the log backend and the Raft consensus protocol for replication
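The delta-of-delta + XOR scheme the slide names is the Gorilla-style encoding: timestamp second differences and XORed value bit patterns are mostly zero for regularly scraped, slowly changing series, so they pack into very few bits. A sketch of just the transforms (without the actual bit packing, which the talk does not detail):

```go
package main

import (
	"fmt"
	"math"
)

// deltaOfDeltas returns the first timestamp, the first delta, and then
// second-order differences. Regularly spaced timestamps yield zeros,
// which a bit-level encoder stores in ~1 bit each.
func deltaOfDeltas(ts []int64) []int64 {
	out := make([]int64, 0, len(ts))
	var prev, prevDelta int64
	for i, t := range ts {
		switch i {
		case 0:
			out = append(out, t) // first timestamp stored verbatim
		case 1:
			prevDelta = t - prev
			out = append(out, prevDelta)
		default:
			d := t - prev
			out = append(out, d-prevDelta)
			prevDelta = d
		}
		prev = t
	}
	return out
}

// xorValues XORs each float64's bit pattern with its predecessor's.
// Repeated or slowly changing values produce mostly-zero words.
func xorValues(vs []float64) []uint64 {
	out := make([]uint64, len(vs))
	var prev uint64
	for i, v := range vs {
		bits := math.Float64bits(v)
		out[i] = bits ^ prev
		prev = bits
	}
	return out
}

func main() {
	ts := []int64{1000, 1060, 1120, 1180} // scraped every 60s
	fmt.Println(deltaOfDeltas(ts))        // [1000 60 0 0]
	fmt.Println(xorValues([]float64{0.5, 0.5, 0.5, 0.75}))
}
```

In the real encoding those zero deltas and zero XOR words are the reason 28 hours of datapoints can fit "entirely" in memory.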
  27. 62.

    Storage Unit > Store data by "Shard" unit > Each shard == 3 physical machines (replication factor 3) with high DRAM + SSD > Each machine replicates data using "Raft"
  28. 63.

    Data Distributed by Hashing > Client computes Hash(Serie 1) = 1 and sends Insert(Serie 1) to Shard 1 (of Shard 1, Shard 2) > The shard topology is kept in Central Dogma
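The routing step on this slide can be sketched in a few lines. The talk does not name the hash function, so FNV-1a from Go's standard library stands in here, and `shardFor` is an invented helper:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks a shard for a series by hashing its identifying key.
// FNV-1a is an illustrative stand-in; Flash's actual hash is not
// specified in the talk. The client would look up numShards (the shard
// topology) from Central Dogma before routing the insert.
func shardFor(seriesKey string, numShards int) int {
	h := fnv.New64a()
	h.Write([]byte(seriesKey))
	return int(h.Sum64() % uint64(numShards))
}

func main() {
	for _, key := range []string{"cpu{Host=Host1}", "cpu{Host=Host2}"} {
		fmt.Printf("%s -> shard %d\n", key, shardFor(key, 2))
	}
}
```

Because the mapping is a pure function of the serie key and the topology, any client that reads the same topology routes a given serie to the same shard without coordination.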
  30. 68.

    Go Programming Language > Simplicity: writing code, deployment (single binary) > Very good tooling for profiling (CPU/memory, ...), very helpful for building memory-intensive applications > Has GC but still allows control over memory layout
  32. 71.

    Metric Storage > Metric Input Layer > Label Storage > Metric Query Layer > In-Memory Metric Storage (28h) > Persistent Metric Storage
  33. 76.

    It takes time to make a usable thing, but it takes a lot of time to make the best thing
  34. 77.

    Current Achievement Is Not Good Enough > 4 million datapoint writes per second > 1,000 queries per second > Write P99 < 100ms > Read P99 < 100ms > (the slide also shows the figures 100 and 10000 as targets)
  35. 78.

    We Still Have a Lot To Do > Better performance for the label index layer > More performance, more reliability, better cost optimization > Multi-region deployment for disaster recovery > New storage-based ecosystem (alerting, on-call, ...)