
Scaling Time-Series Data to Infinity: A Kubernetes-Powered Solution with Envoy

Hiroki Sakamoto

August 31, 2024

Transcript

  1. Scaling Time-Series Data to Infinity: A Kubernetes-Powered Solution with Envoy

    Hiroki Sakamoto Senior Software Engineer - LY Corporation
  2. Observability is getting expensive. As data increases, several issues arise: • Cost • Scalability • Capacity
  3. [Architecture diagram] Prometheus, Metrics Agent, OTel Collector, and user clients send data to the Ingestion API, which writes to the Time-Series DB; Prometheus, Grafana, and user clients read it back through the Query API.
  4. [Architecture diagram] The same pipeline, labeled with the internal product names IMON and Flash.
  5. What is a metric? A metric consists of Metadata and Samples. Metadata: cpu_usage{ pod="app-0", environment="prod", node="node-x" }. Samples: T: 1697530930 V: 80, T: 1697563930 V: 92, T: 1697566930 V: 76, T: 1697569930 V: 64, T: 1697572930 V: 51.
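The metadata/sample split above maps onto a simple data structure; here is a minimal Go sketch of it (the type and field names are illustrative, not the project's actual model):

```go
// A minimal sketch of the metric model shown on slide 5: a metric is
// metadata (a name plus a label set) and a series of timestamp/value samples.
package main

import "fmt"

type Sample struct {
	Timestamp int64   // Unix seconds, e.g. 1697530930
	Value     float64 // e.g. 80
}

type Metric struct {
	Name    string            // e.g. "cpu_usage"
	Labels  map[string]string // e.g. {"pod": "app-0", "environment": "prod"}
	Samples []Sample
}

func main() {
	m := Metric{
		Name:   "cpu_usage",
		Labels: map[string]string{"pod": "app-0", "environment": "prod", "node": "node-x"},
		Samples: []Sample{
			{Timestamp: 1697530930, Value: 80},
			{Timestamp: 1697563930, Value: 92},
		},
	}
	fmt.Printf("%s has %d samples\n", m.Name, len(m.Samples))
}
```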
  6. [Query flow] A client sends a PromQL query to the Query API, which retrieves the target metric IDs from the Metadata Database using the given PromQL.
  7. [Query flow] With those metric IDs and the time range, the Query API retrieves the samples from the Sample Database.
  8. [Query flow] The Query API evaluates the PromQL against the samples and returns the results to the client; a sketch of the whole flow follows below.
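Slides 6 through 8 describe a two-step read path; the sketch below restates it as hypothetical Go interfaces (the names and signatures are assumptions, the deck shows only the flow):

```go
// A rough sketch of the read path on slides 6-8: resolve metric IDs from the
// metadata store by PromQL, fetch samples by ID and time range, then evaluate.
package query

import "context"

type Sample struct {
	Timestamp int64
	Value     float64
}

type MetadataDB interface {
	// LookupIDs returns the IDs of metrics matched by the given PromQL.
	LookupIDs(ctx context.Context, promQL string) ([]uint64, error)
}

type SampleDB interface {
	// FetchSamples returns samples for the given IDs within [start, end].
	FetchSamples(ctx context.Context, ids []uint64, start, end int64) (map[uint64][]Sample, error)
}

// Evaluate runs a query end to end: IDs first, then samples, then evaluation.
func Evaluate(ctx context.Context, meta MetadataDB, samples SampleDB, promQL string, start, end int64) (map[uint64][]Sample, error) {
	ids, err := meta.LookupIDs(ctx, promQL)
	if err != nil {
		return nil, err
	}
	series, err := samples.FetchSamples(ctx, ids, start, end)
	if err != nil {
		return nil, err
	}
	// Real PromQL evaluation would happen here; returning raw series keeps the sketch short.
	return series, nil
}
```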
  9. Current storage layers: an In-Memory Layer for data within 1 day and a Persistent Layer for data after 1 day. Metadata: custom-built in memory, Elasticsearch for persistence. Samples: custom-built in memory, Cassandra for persistence.
  10. Scale: number of metrics: 1 billion. Sample data size with replication: 1 PB. Ingested sample size per day: 2.7 TB. Ingested samples per day: 1.8 trillion.
  11. Cassandra was the bottleneck for us • Cost: expensive due to 1 PB of samples • Scalability: it takes 6 hours to scale out a single node, and repair never completes • Capacity: not allowed to obtain more nodes
  12. Why not use Object Storage? • Cost-effective • Storage concerns are NOT an issue • Sufficient capacity and scalability • Real-world examples exist (Cortex, Mimir, Thanos)
  13. New storage layers: the In-Memory Layer keeps data within 1 day (custom-built), Persistent Layer 1 keeps data from 1 day to 2 weeks (Cassandra), and the new Persistent Layer 2 keeps data from 2 weeks onward (S3-compatible Object Storage).
  14. Requirements for the S3-compatible Object Storage layer. Input: a time range (e.g. 2024-08-04 12:00 - 17:00) and metric IDs (e.g. 1, 9, 200, 320). Output: the corresponding samples (T: 1697530930 V: 80, T: 1697563930 V: 92, T: 1697566930 V: 76, T: 1697569930 V: 64, T: 1697572930 V: 51); a small interface sketch follows below.
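That input/output contract can be written down as a small interface; a sketch with assumed names, since the deck only states the requirement:

```go
// The requirement on slide 14 as a hypothetical Go interface: given metric IDs
// and a time range, return their samples from the object storage layer.
package objectlayer

import "context"

type Sample struct {
	Timestamp int64
	Value     float64
}

// SampleReader is an assumed name; the deck only states the input/output contract.
type SampleReader interface {
	// Read returns, for each requested metric ID, the samples whose
	// timestamps fall within [startUnix, endUnix].
	Read(ctx context.Context, ids []uint64, startUnix, endUnix int64) (map[uint64][]Sample, error)
}
```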
  15. Data sharding is important • With 1B metrics, it is inevitable to merge multiple samples into larger objects according to some rule • For concurrency: efficient write processing and efficient read processing
  16. Data sharding strategy • 1 bucket: a 1-week time window • 1 directory (shard): a 4-hour time window, per tenant, with shard factor = metric ID % numShards (see the sketch below)
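A minimal sketch of how such a rule could be applied when building object keys, assuming one bucket per 1-week window and directories named by tenant, shard, and 4-hour window (the exact naming is not shown in the deck):

```go
// Sharding rule from slide 16: one bucket per 1-week window, one directory
// (shard) per tenant and 4-hour window, shard = metricID % numShards.
package sharding

import (
	"fmt"
	"time"
)

const (
	bucketWindow = 7 * 24 * time.Hour // 1 bucket = 1 week of data
	shardWindow  = 4 * time.Hour      // 1 directory (shard) = 4 hours of data
)

// ObjectKey returns an illustrative bucket and directory name for a sample.
func ObjectKey(tenant string, metricID uint64, ts time.Time, numShards uint64) (bucket, dir string) {
	week := ts.Truncate(bucketWindow)   // round down to the 1-week window
	window := ts.Truncate(shardWindow)  // round down to the 4-hour window
	shard := metricID % numShards

	bucket = fmt.Sprintf("metrics-%s", week.UTC().Format("2006-01-02"))
	dir = fmt.Sprintf("%s/shard-%d_%d_%d",
		tenant, shard, window.Unix(), window.Add(shardWindow).Unix())
	return bucket, dir
}
```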
  17. Inside a bucket holding 1 week of data, each file shard-1_from-timestamp_to-timestamp holds 4 hours of data. Samples of the same shard are packed back to back at byte offsets: 0x001 | samples of ID 1, 0x014 | samples of ID 10, 0x032 | samples of ID 20, 0x036 | samples of ID 32.
  18. The same layout, plus an Index that maps each metric ID to its byte offset in the file: ID 1 → 0x001, ID 10 → 0x014, ID 20 → 0x032, ID 32 → 0x036. A lookup sketch follows below.
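The ID-to-offset index on slide 18 is essentially a sorted lookup table; a small sketch with assumed field names:

```go
// A sketch of the per-shard index: a sorted list of (metric ID, byte offset)
// entries pointing into the 4-hour shard file.
package shardfile

import "sort"

type IndexEntry struct {
	MetricID uint64
	Offset   int64 // byte offset of this metric's samples in the shard file
}

// Index is kept sorted by MetricID so lookups can binary-search.
type Index []IndexEntry

// OffsetOf returns the byte offset for id, or false if the ID is not in this shard.
func (ix Index) OffsetOf(id uint64) (int64, bool) {
	i := sort.Search(len(ix), func(i int) bool { return ix[i].MetricID >= id })
	if i < len(ix) && ix[i].MetricID == id {
		return ix[i].Offset, true
	}
	return 0, false
}
```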
  19. How samples are written to Cassandra, step 1: the In-Memory DB runs on Data Nodes 1 through 150, and Batch Nodes 1 through 16 retrieve 4 hours of data from it.
  20. Step 2: the Batch Nodes compress the data and save it to Cassandra, one row per metric: ID=1: compressed samples in 4h, ID=2: compressed samples in 4h, ID=3: compressed samples in 4h, and so on.
  21. How to write samples to Object Storage: the Batch Nodes still retrieve 4 hours of data from the In-Memory DB, but how do they get it into the S3-compatible Object Storage?
  22. A new component, the Shard Aggregator (instances 1 through 32), sits between the Batch Nodes and the S3-compatible Object Storage: it compresses and aggregates the 4 hours of data before uploading.
  23. New process - Shard Aggregator • Aggregates samples according to the sharding strategy • Scales out when the number of shards increases • Persists samples as soon as they are received, for resiliency (WAL); a sketch follows below
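One way to get the persist-on-receive (WAL-like) behaviour is to write each received batch to the local LevelDB, which the deck later shows as the Shard Aggregator's store, before acknowledging; a sketch under that assumption:

```go
// Persist-on-receive sketch for the Shard Aggregator, using goleveldb.
// The real service speaks gRPC and amortizes fsyncs (see slide 32); the key
// layout here is an assumption.
package aggregator

import (
	"encoding/binary"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

type WAL struct {
	db *leveldb.DB
}

func OpenWAL(dir string) (*WAL, error) {
	db, err := leveldb.OpenFile(dir, nil)
	if err != nil {
		return nil, err
	}
	return &WAL{db: db}, nil
}

// Append durably records a received batch before the request is acknowledged,
// so a killed Pod can replay unexported batches on restart.
func (w *WAL) Append(shard uint32, seq uint64, payload []byte) error {
	key := make([]byte, 12)
	binary.BigEndian.PutUint32(key[0:4], shard)
	binary.BigEndian.PutUint64(key[4:12], seq)
	return w.db.Put(key, payload, &opt.WriteOptions{Sync: true})
}
```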
  24. We started using Kubernetes for the new services • Infrastructure abstraction • Self-healing • Unified observability • Unified deployment flow
  25. Write path, step 1: each Batch Node (1 through 16) sets the shard factor in a gRPC header when sending samples toward the Shard Aggregators (1 through 32); a sketch follows below.
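Setting the shard factor as a gRPC header can be done through outgoing metadata; the header name below is an assumption, the deck does not show it:

```go
// Attaching the shard factor to the outgoing gRPC metadata so the L7 proxy in
// front of the Shard Aggregators can route on it (slides 25-26).
package writer

import (
	"context"
	"strconv"

	"google.golang.org/grpc/metadata"
)

// withShardHeader returns a context whose outgoing gRPC metadata carries the
// shard factor, computed as metricID % numShards (slide 16).
// "x-shard-factor" is a hypothetical header name.
func withShardHeader(ctx context.Context, metricID, numShards uint64) context.Context {
	shard := metricID % numShards
	return metadata.AppendToOutgoingContext(ctx, "x-shard-factor", strconv.FormatUint(shard, 10))
}
```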
  26. Step 2: the proxy in front of the Shard Aggregators routes each request to the corresponding Pod using that header.
  27. Step 3: each Shard Aggregator persists the received samples in its local LevelDB (an LSM-Tree).
  28. Step 4: the Shard Aggregators export the aggregated samples from LevelDB to the Object Storage.
  29. Choose the correct key-value store: LSM-Tree (LevelDB, RocksDB) versus B+Tree (etcd-io/bbolt). Write performance and read performance each vary case by case.
  30. The same comparison, revisited: for the write-heavy Shard Aggregator path shown above, the LSM-Tree side (LevelDB) is used.
  31. Optimizations on the LSM-Tree, since data is only read once, when uploading • Disabled compaction • Disabled the page cache as much as possible (fadvise); a sketch follows below
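The fadvise part could look like the following on Linux, dropping a file's pages from the page cache once its data has been read for upload; how this is hooked into LevelDB in the real service is not shown in the deck:

```go
// Telling the kernel that a file's cached pages are no longer needed, so the
// page cache stays small for data that is read only once.
package aggregator

import (
	"os"

	"golang.org/x/sys/unix"
)

// dropPageCache advises the kernel to drop cached pages for the whole file.
func dropPageCache(f *os.File) error {
	// Length 0 means "to the end of the file" for posix_fadvise.
	return unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_DONTNEED)
}
```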
  32. Optimizations on the LSM-Tree: fsync once for multiple requests for better performance. The unsynced data lives in the kernel page cache, so even if a Pod is killed, the dirty pages remain in the kernel and are still flushed; a group-commit sketch follows below.
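A sketch of the group-commit idea, syncing once per batch of writes (the batching rule, every N writes, is an assumption; the deck only says fsync is amortized across requests):

```go
// Group commit: writes land in the kernel page cache immediately, but Sync is
// called only once per batch of requests.
package aggregator

import (
	"os"
	"sync"
)

type groupCommitFile struct {
	mu      sync.Mutex
	f       *os.File
	pending int
	every   int // fsync once per this many writes
}

func (g *groupCommitFile) Write(p []byte) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, err := g.f.Write(p); err != nil {
		return err
	}
	g.pending++
	if g.pending < g.every {
		// Data sits in the kernel page cache; it survives a killed Pod
		// (process), though not a node crash.
		return nil
	}
	g.pending = 0
	return g.f.Sync()
}
```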
  33. Write performance • With 32 Shard Aggregator Pods: it takes 40 minutes to aggregate and write 450 GB every 4 hours, each Pod consumes only 3 GB of memory, and there has been no outage so far
  34. New process - Storage Gateway • Communicates directly with Object Storage • Returns samples stored in Object Storage • Caches data, to reduce RPS to Object Storage and return results faster
  35. Request for samples, step 1: the Query API asks the Storage Gateway, which downloads the index and identifies the byte locations in the sample file.
  36. Step 2: the Storage Gateway downloads only those samples with a byte-range request and returns them to the Query API; a sketch of the range request follows below.
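The byte-range download can be expressed as a plain HTTP Range request; a real client would go through an S3 SDK with signed requests, so this sketch only shows the Range mechanics:

```go
// Fetching only the byte range that the index pointed at, instead of the
// whole shard object.
package gateway

import (
	"context"
	"fmt"
	"io"
	"net/http"
)

// fetchRange downloads bytes [offset, offset+length) of an object URL.
func fetchRange(ctx context.Context, url string, offset, length int64) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```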
  37. The key-value store comparison again, now for the read path: LSM-Tree (LevelDB, RocksDB) versus B+Tree (etcd-io/bbolt); this time the B+Tree side is chosen, as the next slide explains.
  38. Distributed cache with bbolt & Envoy • etcd-io/bbolt: an on-disk B+Tree key-value store with better read performance, and the page cache works well with it • Envoy: an L7 LB that routes requests to fixed Pods, supports active health checks, and supports Maglev hashing, which is optimized for even distribution. A cache sketch follows below.
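A sketch of a read-through cache on bbolt, as the Storage Gateway might use it for downloaded indices and samples (bucket and key handling are assumptions; the deck only says bbolt caches this data):

```go
// On-disk read-through cache backed by bbolt's B+Tree.
package gateway

import (
	bolt "go.etcd.io/bbolt"
)

var cacheBucket = []byte("chunks")

type DiskCache struct {
	db *bolt.DB
}

func OpenDiskCache(path string) (*DiskCache, error) {
	db, err := bolt.Open(path, 0o600, nil)
	if err != nil {
		return nil, err
	}
	if err := db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucketIfNotExists(cacheBucket)
		return err
	}); err != nil {
		return nil, err
	}
	return &DiskCache{db: db}, nil
}

// GetOrFill returns the cached value for key, calling fill on a miss and
// storing the result so repeated shard requests hit the local B+Tree.
func (c *DiskCache) GetOrFill(key []byte, fill func() ([]byte, error)) ([]byte, error) {
	var cached []byte
	_ = c.db.View(func(tx *bolt.Tx) error {
		if v := tx.Bucket(cacheBucket).Get(key); v != nil {
			cached = append([]byte(nil), v...) // copy: v is only valid inside the tx
		}
		return nil
	})
	if cached != nil {
		return cached, nil
	}
	val, err := fill()
	if err != nil {
		return nil, err
	}
	err = c.db.Update(func(tx *bolt.Tx) error {
		return tx.Bucket(cacheBucket).Put(key, val)
	})
	return val, err
}
```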
  39. Read path, step 1: the Query API splits a query into multiple small ones, one per 4-hour shard; a sketch follows below.
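Splitting a query's time range into 4-hour windows might look like this; alignment to window boundaries is an assumption consistent with the 4-hour shard layout on slide 16:

```go
// Cutting a query's [start, end) range into pieces that each fit one 4-hour
// shard, before fanning the sub-queries out to the Storage Gateways.
package query

import "time"

const shardWindow = 4 * time.Hour

type subQuery struct {
	Start, End time.Time
}

func splitByShardWindow(start, end time.Time) []subQuery {
	var out []subQuery
	cur := start
	for cur.Before(end) {
		next := cur.Truncate(shardWindow).Add(shardWindow)
		if next.After(end) {
			next = end
		}
		out = append(out, subQuery{Start: cur, End: next})
		cur = next
	}
	return out
}
```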
  40. Step 2: each shard request is routed to a fixed Storage Gateway Pod (1 through 32) by Maglev hashing.
  41. Step 3: that Pod downloads the index and the samples for its shard.
  42. Step 4: the downloaded indices and samples are cached in bbolt on each Storage Gateway.
  43. Step 5: each Storage Gateway returns its result and the Query API merges all the results.
  44. Pinpointing the bottleneck with traces and profiles (Grafana Tempo, Pyroscope): the read path is Download Index → Decode Index → Identify byte location → Download Sample → Return, and profiling and tracing showed some of these steps consuming too much time.
  45. The index turned out to be too big to download and decode. (Cry icons created by Vectors Market - Flaticon: https://www.flaticon.com/free-icons/cry)
  46. The fix: reduce the size of the index that has to be dealt with in the Download Index and Decode Index steps, leaving the rest of the path (Identify byte location → Download Sample → Return) unchanged.
  47. Read performance • With 64 Storage Gateway Pods: comparable performance to Cassandra, 2 ms at p99 for 4 hours of data and 6-9 s at p99 for 1 month of data, with 1.9 TB cached
  48. Bring Your Own Buckets! In addition to the default storage, the Shard Aggregator and Storage Gateway can also write to and read from User A's and User B's own storage.
  49. Petabyte scale is NOT an issue anymore. Thanks to everyone in the community: distributed write (LevelDB, Nginx), distributed read (bbolt, Envoy), and observability.
  50. What can we do for the community? 2021: introduced Loki in our org. 2022: contributed to Loki. 2023 - 2024: the success of this project leveraged knowledge gained from Loki. Future: contribute to the community; we are always seeking opportunities to contribute.