Slide 12
Client Query API
Metadata Database
Sample Database
PromQL
Slide 13
Client Query API
Metadata Database
Sample Database
PromQL
Metric IDs
Retrieve target metric IDs with the given PromQL
Slide 14
Client Query API
Metadata Database
Sample Database
PromQL
Retrieve target metric IDs with the given PromQL
Retrieve samples with the IDs & time range
Metric IDs
Samples
Slide 15
Client Query API
Metadata Database
Sample Database
PromQL
Metric IDs
Samples
Retrieve target metric IDs with the given PromQL
Retrieve samples with the IDs & time range
Evaluate PromQL with the samples and return results
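The three slides above walk the read path end to end. Below is a minimal Go sketch of that flow; the interface and function names (MetadataDB, SampleDB, query) are illustrative assumptions, not the actual API:

```go
package querypath

// Hypothetical shapes for the read path above: the Query API resolves
// a PromQL expression to metric IDs via the Metadata Database, pulls
// samples for those IDs and the time range from the Sample Database,
// then evaluates the expression over them.

// Sample is one (metric, timestamp, value) point.
type Sample struct {
	MetricID  uint64
	Timestamp int64 // milliseconds since epoch
	Value     float64
}

// MetadataDB retrieves target metric IDs with the given PromQL.
type MetadataDB interface {
	LookupIDs(promql string) ([]uint64, error)
}

// SampleDB retrieves samples with the IDs & time range.
type SampleDB interface {
	Samples(ids []uint64, fromMs, toMs int64) ([]Sample, error)
}

// query wires the three steps together; PromQL evaluation itself is
// elided in this sketch.
func query(meta MetadataDB, store SampleDB, promql string, fromMs, toMs int64) ([]Sample, error) {
	ids, err := meta.LookupIDs(promql) // step 1: PromQL -> metric IDs
	if err != nil {
		return nil, err
	}
	// step 2: metric IDs + time range -> samples
	// step 3 (evaluate PromQL with the samples) would run here
	return store.Samples(ids, fromMs, toMs)
}
```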
Slide 16
In-Memory Layer for data within 1d
Persistent Layer for data after 1d
Metadata: Custom-built Elasticsearch
Sample: Custom-built Cassandra
Slide 17
Number of Metrics: 1 billion
Sample Data Size with Replication: 1 PB
Ingested Sample Size / day: 2.7 TB
Ingested Samples / day: 1.8 trillion
Slide 18
Cassandra was the bottleneck for us
● Cost
○ Expensive due to 1PB of samples
● Scalability
○ Takes 6h to scale out even a single node
○ Repairs never complete
● Capacity
○ Not allowed to obtain more nodes
Slide 19
New storage is required for samples
Slide 20
Why not use Object Storage?
● Cost-effective
● Storage concerns are NOT an issue
● Sufficient Capacity and Scalability
● Real-world examples (Cortex, Mimir, Thanos)
Slide 21
Object Storage vs Cassandra on k8s
Compared on Maintainability, Scalability, Storage cost, and Performance
Slide 22
In-Memory Layer for data within 1d
Persistent Layer 1 for data 1d ~ 2w: Custom-built Cassandra
Persistent Layer 2 for data 2w ~: S3-compatible Object Storage (New!)
Slide 23
How to construct a DB on Object Storage
1. Data Structure
2. Distributed Write
3. Distributed Read
Data Sharding is important
● 1B metrics
○ Merging multiple metrics' samples by some rule is inevitable
● For concurrency
○ Efficient write processing
○ Efficient read processing
Slide 28
1 Week of Data per Bucket
shard-1_from-timestamp_to-timestamp (4h of data)
-------------------------------------------
0x001 | Samples of ID:1
-------------------------------------------
0x014 | Samples of ID:10
-------------------------------------------
0x032 | Samples of ID:20
-------------------------------------------
0x036 | Samples of ID:32
-------------------------------------------
Samples of the same shard
Slide 29
1 Week of Data per Bucket
shard-1_from-timestamp_to-timestamp (4h of data)
-------------------------------------------
0x001 | Samples of ID:1
-------------------------------------------
0x014 | Samples of ID:10
-------------------------------------------
0x032 | Samples of ID:20
-------------------------------------------
0x036 | Samples of ID:32
-------------------------------------------
Samples of the same shard
Index
-------------------------------------------
ID = 1 | 0x001
-------------------------------------------
ID = 10 | 0x014
-------------------------------------------
ID = 20 | 0x032
-------------------------------------------
ID = 32 | 0x036
-------------------------------------------
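Slides 28-29 show the shard object layout and its index. A Go sketch of how those pieces can be addressed follows; the object naming mirrors the slide's shard-1_from-timestamp_to-timestamp pattern, while the modulo sharding rule and all identifiers are illustrative assumptions:

```go
package shardlayout

import "fmt"

// objectKey builds the object name from the slide:
// shard-<n>_<from>_<to>, one object holding 4h of one shard's samples.
func objectKey(shard int, fromUnix, toUnix int64) string {
	return fmt.Sprintf("shard-%d_%d_%d", shard, fromUnix, toUnix)
}

// shardFor assigns a metric ID to one of n shards. The real sharding
// rule is not given in the talk; modulo stands in for illustration.
func shardFor(metricID uint64, n int) int {
	return int(metricID % uint64(n))
}

// Index mirrors the "ID = 1 | 0x001" rows above: metric ID -> byte
// offset of that metric's sample block inside the object.
type Index map[uint64]int64

// Lookup returns the byte offset for a metric ID so a reader can fetch
// just that block instead of downloading the whole object.
func (ix Index) Lookup(id uint64) (offset int64, ok bool) {
	offset, ok = ix[id]
	return offset, ok
}
```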
Slide 30
2. Distributed Write
Slide 31
How to write samples to Cassandra
In-Memory DB (Data Node 1 … Data Node 150)
Batch Node 1 … Batch Node 16
Retrieve 4h of data
Slide 32
How to write samples to Cassandra
In-Memory DB (Data Node 1 … Data Node 150)
Batch Node 1 … Batch Node 16
Retrieve 4h of data → Compress & Save → Cassandra
Inserted Rows
—
ID=1 : compressed samples in 4h
ID=2 : compressed samples in 4h
ID=3 : compressed samples in 4h
…
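A hedged sketch of the "Compress & Save" step above: one metric's encoded 4h block becomes a single compressed row value. The slides do not name the codec, so gzip from the Go standard library stands in here:

```go
package batchwrite

import (
	"bytes"
	"compress/gzip"
)

// compressBlock compresses one metric's encoded 4h block before it is
// saved as a single row value (ID -> compressed samples in 4h).
// The slides do not name the codec; gzip stands in here.
func compressBlock(encodedSamples []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(encodedSamples); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```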
Slide 33
How to write samples to Object Storage
In-Memory DB (Data Node 1 … Data Node 150)
Batch Node 1 … Batch Node 16
Retrieve 4h of data → S3-Compatible Object Storage
How?
Slide 34
How to write samples to Object Storage
In-Memory DB (Data Node 1 … Data Node 150)
Batch Node 1 … Batch Node 16
Retrieve 4h of data
Shard Aggregator 1 … Shard Aggregator 32
Compress & Aggregate → S3-Compatible Object Storage
Slide 35
New process - Shard Aggregator
● Aggregates samples according to the sharding strategy (see the sketch below)
● Allows scale-out when the number of shards increases
● Persists samples as soon as they are received, for resiliency (WAL)
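A minimal sketch of the aggregation step named in the first bullet: incoming per-metric blocks are grouped by shard so each shard's 4h of data can be uploaded as one object. Block and the modulo shardFor rule are illustrative assumptions:

```go
package shardagg

// Block is one metric's encoded samples for a 4h window; illustrative.
type Block struct {
	MetricID uint64
	Data     []byte
}

// shardFor is the same illustrative modulo rule as in the earlier
// layout sketch; the real rule is not given in the talk.
func shardFor(metricID uint64, n int) int {
	return int(metricID % uint64(n))
}

// aggregate groups incoming blocks by shard so that each shard's 4h of
// data can be compressed and uploaded as one shard-<n>_<from>_<to>
// object.
func aggregate(blocks []Block, shards int) map[int][]Block {
	out := make(map[int][]Block, shards)
	for _, b := range blocks {
		s := shardFor(b.MetricID, shards)
		out[s] = append(out[s], b)
	}
	return out
}
```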
Slide 36
Started using k8s for new services
● Infrastructure abstraction
● Self-Healing
● Unified Observability
● Unified deployment flow
Slide 38
Batch Node 1 … Batch Node 16
Shard Aggregator 1 … Shard Aggregator 32
Set shard factor in gRPC header
Slide 39
Batch Node 1 … Batch Node 16
Shard Aggregator 1 … Shard Aggregator 32
Set shard factor in gRPC header
Route to the corresponding Pod using the header
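A small sketch of setting the shard factor on the outgoing request with grpc-go's metadata package. The header name "x-shard-factor" is an assumption; the slides only say the shard factor is carried in a gRPC header:

```go
package shardroute

import (
	"context"
	"strconv"

	"google.golang.org/grpc/metadata"
)

// withShardHeader attaches the shard factor to the outgoing gRPC
// metadata so the L7 proxy in front of the Shard Aggregators can route
// the request to the Pod owning that shard.
func withShardHeader(ctx context.Context, shard int) context.Context {
	return metadata.AppendToOutgoingContext(ctx, "x-shard-factor", strconv.Itoa(shard))
}
```

Carrying the shard in a header keeps the routing decision out of the client: any header-aware L7 proxy can hash or match on it and pin a shard to one Pod.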
Slide 40
Batch Node 1 … Batch Node 16
Shard Aggregator 1 … Shard Aggregator 32, each with LevelDB (LSM-Tree)
Set shard factor in gRPC header
Route to the corresponding Pod using the header
Persist samples in the local DB
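A sketch of that WAL-style persistence with goleveldb, a common Go LevelDB binding (the talk does not say which client is used); the key layout is illustrative:

```go
package wal

import (
	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

// persist writes a received block to the local LevelDB before it is
// acknowledged, so an aggregator Pod can replay unexported data after
// a crash (the WAL role from slide 35).
func persist(db *leveldb.DB, key, block []byte) error {
	// Sync forces an fsync per write; slide 45 batches this instead,
	// fsyncing once per several requests.
	return db.Put(key, block, &opt.WriteOptions{Sync: true})
}
```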
Slide 41
Batch Node 1 … Batch Node 16
Shard Aggregator 1 … Shard Aggregator 32, each with LevelDB
Set shard factor in gRPC header
Route to the corresponding Pod using the header
Export aggregated samples
Slide 42
Choose the correct Key-Value Store
LSM-Tree (LevelDB, RocksDB) vs B+Tree (etcd.io/bbolt)
Write Performance: varies by case
Read Performance: varies by case
Slide 43
Choose the correct Key-Value Store
LSM-Tree (LevelDB, RocksDB) vs B+Tree (etcd.io/bbolt)
Write Performance: varies by case
Read Performance: varies by case
Slide 44
Optimizations on LSM-Tree
Since data is read only once when uploading:
● Disabled compaction
● Avoided the page cache where possible (fadvise)
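The fadvise trick can look like this in Go with golang.org/x/sys/unix (Linux-only). This is a sketch of the general technique, not the project's actual code; how compaction was disabled is engine-specific and not shown:

```go
package lsmopt

import (
	"os"

	"golang.org/x/sys/unix"
)

// dropPageCache hints to the kernel that the file's cached pages will
// not be reused, so one-shot batch data does not evict hotter pages.
// Linux-only (posix_fadvise).
func dropPageCache(f *os.File) error {
	// offset 0, length 0 means "the whole file".
	return unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_DONTNEED)
}
```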
Slide 45
Optimizations on LSM-Tree
Fsync once per multiple requests for better performance
Even if a Pod is killed, dirty page cache remains (the page cache lives in kernel space)
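A sketch of that batched-fsync idea: sync once per several writes instead of once per write. The batchWriter type and flushEvery threshold are assumptions for illustration:

```go
package groupsync

import "os"

// batchWriter fsyncs once per flushEvery writes instead of once per
// write, trading a small durability window for throughput. As noted
// above, dirty page cache survives a Pod kill because it lives in
// kernel space; only a node-level failure can lose the unsynced tail.
type batchWriter struct {
	f          *os.File
	pending    int
	flushEvery int
}

func (w *batchWriter) Write(p []byte) error {
	if _, err := w.f.Write(p); err != nil {
		return err
	}
	w.pending++
	if w.pending >= w.flushEvery {
		w.pending = 0
		return w.f.Sync() // one fsync covers the whole batch
	}
	return nil
}
```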
Slide 46
Write Performance
● With 32 Shard Aggregator Pods
○ Takes 40 min to aggregate & write 450GB every 4 hours
○ Consumes only 3GB of memory per Pod
○ No outages so far
Slide 47
3. Distributed Read
Slide 48
Query API
How?
Slide 49
Query API
Storage Gateway
Slide 50
New process - Storage Gateway
● Communicates directly with Object Storage
● Returns samples stored in Object Storage
● Caches data
○ Reduces RPS to Object Storage
○ Returns results faster
Slide 51
Request for Samples
Query API
Storage Gateway
Slide 52
Request for Samples
Query API
Storage Gateway
Download Index
Identify byte locations in the sample file
Slide 53
Query API
Storage Gateway
Request for Samples
Download Index
Identify byte locations in the sample file
Download samples with a Byte-Range request
Return Samples
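The byte-range download can be sketched with a plain HTTP Range request, which S3-compatible object storage answers with 206 Partial Content; auth and presigning are omitted in this sketch:

```go
package rangeget

import (
	"fmt"
	"io"
	"net/http"
)

// fetchRange downloads only the identified byte span of the sample
// file instead of the whole object.
func fetchRange(url string, start, end int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// Range is inclusive on both ends: bytes=start-end.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```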
Slide 54
Query API
Storage Gateway
What about Cache?
Slide 55
Choose the correct Key-Value Store
LSM-Tree (LevelDB, RocksDB) vs B+Tree (etcd.io/bbolt)
Write Performance: varies by case
Read Performance: varies by case
Slide 56
Distributed Cache with bbolt & Envoy
● etcd-io/bbolt (see the cache sketch below)
○ On-disk B+Tree Key-Value store
○ Better read performance
○ Page cache works well
● Envoy
○ L7 LB to route requests to fixed Pods
○ Active health checks supported
○ Maglev supported, optimized for even distribution
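A minimal sketch of the bbolt side of this cache using go.etcd.io/bbolt; the bucket name and key scheme are assumptions, and the Envoy/Maglev routing is proxy configuration rather than Go code, so it is omitted:

```go
package diskcache

import bolt "go.etcd.io/bbolt"

var bucket = []byte("chunks") // bucket name is an assumption

// put caches a downloaded index or sample block on local disk.
func put(db *bolt.DB, key, val []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(bucket)
		if err != nil {
			return err
		}
		return b.Put(key, val)
	})
}

// get returns a copy of a cached value, or nil on a miss. Values
// returned by bbolt are only valid inside the transaction, hence the
// copy.
func get(db *bolt.DB, key []byte) ([]byte, error) {
	var out []byte
	err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucket)
		if b == nil {
			return nil
		}
		if v := b.Get(key); v != nil {
			out = append([]byte(nil), v...)
		}
		return nil
	})
	return out, err
}
```

Because Maglev pins each shard key to a fixed Pod, every Pod's local bbolt file only ever caches its own shards, so the cache stays consistent without any cross-Pod invalidation.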
Slide 57
Query API
Storage Gateway 1 … Storage Gateway 32
Split a query into multiple small ones by 4h shard
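A sketch of that query-splitting step: break the requested time range into the 4h windows the shard objects are keyed by, so each window can be fanned out to its gateway Pod. Alignment to wall-clock 4h boundaries is an assumption:

```go
package querysplit

import "time"

// splitRange breaks a query's time range into the 4h windows the
// shard objects are keyed by, clamping the first and last windows to
// the requested range.
func splitRange(from, to time.Time) [][2]time.Time {
	const window = 4 * time.Hour
	var out [][2]time.Time
	for cur := from.Truncate(window); cur.Before(to); cur = cur.Add(window) {
		start, end := cur, cur.Add(window)
		if start.Before(from) {
			start = from
		}
		if end.After(to) {
			end = to
		}
		out = append(out, [2]time.Time{start, end})
	}
	return out
}
```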
Slide 58
Query API
Storage Gateway 1 … Storage Gateway 32
Split a query into multiple small ones by 4h shard
Route each shard request to a fixed Pod by Maglev
Slide 59
Query API
Storage Gateway 1 … Storage Gateway 32
Split a query into multiple small ones by 4h shard
Route each shard request to a fixed Pod by Maglev
Download Index & Samples
Slide 60
Query API
Storage Gateway 1 … Storage Gateway 32, each with bbolt
Split a query into multiple small ones by 4h shard
Route each shard request to a fixed Pod by Maglev
Cache downloaded indices & samples
Slide 61
Query API
Storage Gateway 1 … Storage Gateway 32, each with bbolt
Return each result
Slide 62
Query API
Storage Gateway 1 … Storage Gateway 32, each with bbolt
Return each result
Merge all results
Slide 63
But, still slow…
Slide 64
Pinpoint the bottleneck with traces & profiles (Grafana Tempo, Pyroscope)
Download Index → Decode Index → Identify byte location → Download Sample → Return
Profiling and tracing showed the index steps consumed too much time
Slide 65
Download Index → Decode Index → Identify byte location → Download Sample → Return
The index is too big to download and decode
Cry icons created by Vectors Market - Flaticon: https://www.flaticon.com/free-icons/cry
Slide 66
Index of Index
Slide 67
Download Index → Decode Index → Identify byte location → Download Sample → Return
Reduce the index size that must be handled
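One way to read the "Index of Index" idea, as a sketch: a small top-level index maps metric-ID ranges to byte ranges inside the big per-object index, so only the relevant slice needs a Byte-Range download and decode. All names here are illustrative:

```go
package metaindex

// indexRange is one entry of the small top-level index: which metric
// IDs a slice of the big per-object index covers, and where that
// slice sits inside the index file.
type indexRange struct {
	MinID, MaxID uint64 // metric IDs covered by this index slice
	Start, End   int64  // byte range of the slice within the index file
}

// locate returns the byte range of the index slice that can contain
// id, so only that slice is downloaded and decoded.
func locate(top []indexRange, id uint64) (start, end int64, ok bool) {
	for _, r := range top {
		if id >= r.MinID && id <= r.MaxID {
			return r.Start, r.End, true
		}
	}
	return 0, 0, false
}
```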
Slide 68
Read Performance
● With 64 Storage Gateway Pods
○ Performance comparable to Cassandra
■ 2ms at p99 for 4h of data
■ 6s ~ 9s at p99 for 1 month of data
○ 1.9TB cached
Slide 69
Obtain Unlimited Capacity
Slide 70
Storage Gateway
Shard Aggregator
Default Storage
Bring Your Own Buckets!
User A's Storage
User B's Storage
Slide 71
Petabyte scale is NOT an issue anymore
Thanks to Everyone in the Community
Distributed Write: LevelDB, Nginx
Distributed Read: bbolt, Envoy
Observability
Slide 72
What can we do for the community?
2021: Introduced Loki in our org
2022: Contributed to Loki
2023 - 2024: Success of this project, leveraging knowledge of Loki
Future: Contribute to the Community
Always seeking opportunities to contribute