Slide 12
Client Query API
Metadata Database
Sample Database
PromQL
Slide 13
Client Query API
Metadata Database
Sample Database
PromQL
Metric IDs
Retrieve target metric IDs with the given PromQL
Slide 14
Client Query API
Metadata Database
Sample Database
PromQL
Retrieve target metric IDs with the given PromQL
Retrieve samples with the IDs & time range
Metric IDs
Samples
Slide 15
Client Query API
Metadata Database
Sample Database
PromQL
Metric IDs
Samples
Retrieve target metric IDs with the given PromQL
Retrieve samples with the IDs & time range
Evaluate PromQL with the samples and return results
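The three slides above walk the read path end to end. Below is a minimal Go sketch of that flow; the interface and function names (MetadataDB, SampleDB, query) are illustrative assumptions, not the actual API:

```go
package querypath

// Hypothetical shapes for the read path above: the Query API resolves
// a PromQL expression to metric IDs via the Metadata Database, pulls
// samples for those IDs and the time range from the Sample Database,
// then evaluates the expression over them.

// Sample is one (metric, timestamp, value) point.
type Sample struct {
	MetricID  uint64
	Timestamp int64 // milliseconds since epoch
	Value     float64
}

// MetadataDB retrieves target metric IDs with the given PromQL.
type MetadataDB interface {
	LookupIDs(promql string) ([]uint64, error)
}

// SampleDB retrieves samples with the IDs & time range.
type SampleDB interface {
	Samples(ids []uint64, fromMs, toMs int64) ([]Sample, error)
}

// query wires the three steps together; PromQL evaluation itself is
// elided in this sketch.
func query(meta MetadataDB, store SampleDB, promql string, fromMs, toMs int64) ([]Sample, error) {
	ids, err := meta.LookupIDs(promql) // step 1: PromQL -> metric IDs
	if err != nil {
		return nil, err
	}
	// step 2: metric IDs + time range -> samples
	// step 3 (evaluate PromQL with the samples) would run here
	return store.Samples(ids, fromMs, toMs)
}
```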
Slide 16
In-Memory Layer for data within 1d
Persistent Layer for data after 1d
Metadata: Custom-built Elasticsearch
Sample: Custom-built Cassandra
Slide 17
Number of Metrics: 1 billion
Sample Data Size with Replication: 1 PB
Ingested Sample Size / day: 2.7 TB
Ingested Samples / day: 1.8 trillion
Slide 18
Cassandra was the bottleneck for us
● Cost
○ Expensive due to 1PB of samples
● Scalability
○ Takes 6h to scale out even a single node
○ Repairs never complete
● Capacity
○ Not allowed to obtain more nodes
Slide 19
New storage is required for samples
Slide 20
Why not use Object Storage?
● Cost-effective
● Storage concerns are NOT an issue
● Sufficient Capacity and Scalability
● Real-world examples (Cortex, Mimir, Thanos)
Slide 21
Object Storage vs Cassandra on k8s
Compared on Maintainability, Scalability, Storage cost, and Performance
Slide 22
In-Memory Layer for data within 1d
Persistent Layer 1 for data 1d ~ 2w: Custom-built Cassandra
Persistent Layer 2 for data 2w ~: S3-compatible Object Storage (New!)
Slide 23
How to construct a DB on Object Storage
1. Data Structure
2. Distributed Write
3. Distributed Read
Data Sharding is important
● 1B metrics
○ Merging multiple metrics' samples by some rule is inevitable
● For concurrency
○ Efficient write processing
○ Efficient read processing
Slide 28
1 Week of Data per Bucket
shard-1_from-timestamp_to-timestamp (4h of data)
-------------------------------------------
0x001 | Samples of ID:1
-------------------------------------------
0x014 | Samples of ID:10
-------------------------------------------
0x032 | Samples of ID:20
-------------------------------------------
0x036 | Samples of ID:32
-------------------------------------------
Samples of the same shard
Slide 29
1 Week of Data per Bucket
shard-1_from-timestamp_to-timestamp (4h of data)
-------------------------------------------
0x001 | Samples of ID:1
-------------------------------------------
0x014 | Samples of ID:10
-------------------------------------------
0x032 | Samples of ID:20
-------------------------------------------
0x036 | Samples of ID:32
-------------------------------------------
Samples of the same shard
Index
-------------------------------------------
ID = 1 | 0x001
-------------------------------------------
ID = 10 | 0x014
-------------------------------------------
ID = 20 | 0x032
-------------------------------------------
ID = 32 | 0x036
-------------------------------------------
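Slides 28-29 show the shard object layout and its index. A Go sketch of how those pieces can be addressed follows; the object naming mirrors the slide's shard-1_from-timestamp_to-timestamp pattern, while the modulo sharding rule and all identifiers are illustrative assumptions:

```go
package shardlayout

import "fmt"

// objectKey builds the object name from the slide:
// shard-<n>_<from>_<to>, one object holding 4h of one shard's samples.
func objectKey(shard int, fromUnix, toUnix int64) string {
	return fmt.Sprintf("shard-%d_%d_%d", shard, fromUnix, toUnix)
}

// shardFor assigns a metric ID to one of n shards. The real sharding
// rule is not given in the talk; modulo stands in for illustration.
func shardFor(metricID uint64, n int) int {
	return int(metricID % uint64(n))
}

// Index mirrors the "ID = 1 | 0x001" rows above: metric ID -> byte
// offset of that metric's sample block inside the object.
type Index map[uint64]int64

// Lookup returns the byte offset for a metric ID so a reader can fetch
// just that block instead of downloading the whole object.
func (ix Index) Lookup(id uint64) (offset int64, ok bool) {
	offset, ok = ix[id]
	return offset, ok
}
```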
Slide 30
2. Distributed Write
Slide 31
How to write samples to Cassandra
In-Memory DB (Data Node 1 … Data Node 150)
Batch Node 1 … Batch Node 16
Retrieve 4h of data
Slide 32
How to write samples to Cassandra
In-Memory DB (Data Node 1 … Data Node 150)
Batch Node 1 … Batch Node 16
Retrieve 4h of data → Compress & Save → Cassandra
Inserted Rows
—
ID=1 : compressed samples in 4h
ID=2 : compressed samples in 4h
ID=3 : compressed samples in 4h
…
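A hedged sketch of the "Compress & Save" step above: one metric's encoded 4h block becomes a single compressed row value. The slides do not name the codec, so gzip from the Go standard library stands in here:

```go
package batchwrite

import (
	"bytes"
	"compress/gzip"
)

// compressBlock compresses one metric's encoded 4h block before it is
// saved as a single row value (ID -> compressed samples in 4h).
// The slides do not name the codec; gzip stands in here.
func compressBlock(encodedSamples []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(encodedSamples); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```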
Slide 33
How to write samples to Object Storage
In-Memory DB (Data Node 1 … Data Node 150)
Batch Node 1 … Batch Node 16
Retrieve 4h of data → S3-Compatible Object Storage
How?
Slide 34
How to write samples to Object Storage
In-Memory DB (Data Node 1 … Data Node 150)
Batch Node 1 … Batch Node 16
Retrieve 4h of data
Shard Aggregator 1 … Shard Aggregator 32
Compress & Aggregate → S3-Compatible Object Storage
Slide 35
New process - Shard Aggregator
● Aggregates samples according to the sharding strategy (see the sketch below)
● Allows scale-out when the number of shards increases
● Persists samples as soon as they are received, for resiliency (WAL)
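A minimal sketch of the aggregation step named in the first bullet: incoming per-metric blocks are grouped by shard so each shard's 4h of data can be uploaded as one object. Block and the modulo shardFor rule are illustrative assumptions:

```go
package shardagg

// Block is one metric's encoded samples for a 4h window; illustrative.
type Block struct {
	MetricID uint64
	Data     []byte
}

// shardFor is the same illustrative modulo rule as in the earlier
// layout sketch; the real rule is not given in the talk.
func shardFor(metricID uint64, n int) int {
	return int(metricID % uint64(n))
}

// aggregate groups incoming blocks by shard so that each shard's 4h of
// data can be compressed and uploaded as one shard-<n>_<from>_<to>
// object.
func aggregate(blocks []Block, shards int) map[int][]Block {
	out := make(map[int][]Block, shards)
	for _, b := range blocks {
		s := shardFor(b.MetricID, shards)
		out[s] = append(out[s], b)
	}
	return out
}
```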
Slide 36
Started using k8s for new services
● Infrastructure abstraction
● Self-Healing
● Unified Observability
● Unified deployment flow
Slide 38
Batch Node 1 … Batch Node 16
Shard Aggregator 1 … Shard Aggregator 32
Set shard factor in gRPC header
Slide 39
Batch Node 1 … Batch Node 16
Shard Aggregator 1 … Shard Aggregator 32
Set shard factor in gRPC header
Route to the corresponding Pod using the header
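A small sketch of setting the shard factor on the outgoing request with grpc-go's metadata package. The header name "x-shard-factor" is an assumption; the slides only say the shard factor is carried in a gRPC header:

```go
package shardroute

import (
	"context"
	"strconv"

	"google.golang.org/grpc/metadata"
)

// withShardHeader attaches the shard factor to the outgoing gRPC
// metadata so the L7 proxy in front of the Shard Aggregators can route
// the request to the Pod owning that shard.
func withShardHeader(ctx context.Context, shard int) context.Context {
	return metadata.AppendToOutgoingContext(ctx, "x-shard-factor", strconv.Itoa(shard))
}
```

Carrying the shard in a header keeps the routing decision out of the client: any header-aware L7 proxy can hash or match on it and pin a shard to one Pod.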
Slide 40
Batch Node 1 … Batch Node 16
Shard Aggregator 1 … Shard Aggregator 32, each with LevelDB (LSM-Tree)
Set shard factor in gRPC header
Route to the corresponding Pod using the header
Persist samples in the local DB
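A sketch of that WAL-style persistence with goleveldb, a common Go LevelDB binding (the talk does not say which client is used); the key layout is illustrative:

```go
package wal

import (
	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

// persist writes a received block to the local LevelDB before it is
// acknowledged, so an aggregator Pod can replay unexported data after
// a crash (the WAL role from slide 35).
func persist(db *leveldb.DB, key, block []byte) error {
	// Sync forces an fsync per write; slide 45 batches this instead,
	// fsyncing once per several requests.
	return db.Put(key, block, &opt.WriteOptions{Sync: true})
}
```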
Slide 41
Batch Node 1 … Batch Node 16
Shard Aggregator 1 … Shard Aggregator 32, each with LevelDB
Set shard factor in gRPC header
Route to the corresponding Pod using the header
Export aggregated samples
Slide 42
Choose the correct Key-Value Store
LSM-Tree (LevelDB, RocksDB) vs B+Tree (etcd.io/bbolt)
Write Performance: varies by case
Read Performance: varies by case
Slide 43
Choose the correct Key-Value Store
LSM-Tree (LevelDB, RocksDB) vs B+Tree (etcd.io/bbolt)
Write Performance: varies by case
Read Performance: varies by case
Slide 44
Optimizations on LSM-Tree
Since data is read only once when uploading:
● Disabled compaction
● Avoided the page cache where possible (fadvise)
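The fadvise trick can look like this in Go with golang.org/x/sys/unix (Linux-only). This is a sketch of the general technique, not the project's actual code; how compaction was disabled is engine-specific and not shown:

```go
package lsmopt

import (
	"os"

	"golang.org/x/sys/unix"
)

// dropPageCache hints to the kernel that the file's cached pages will
// not be reused, so one-shot batch data does not evict hotter pages.
// Linux-only (posix_fadvise).
func dropPageCache(f *os.File) error {
	// offset 0, length 0 means "the whole file".
	return unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_DONTNEED)
}
```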
Slide 45
Optimizations on LSM-Tree
Fsync once per multiple requests for better performance
Even if a Pod is killed, dirty page cache remains (the page cache lives in kernel space)
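A sketch of that batched-fsync idea: sync once per several writes instead of once per write. The batchWriter type and flushEvery threshold are assumptions for illustration:

```go
package groupsync

import "os"

// batchWriter fsyncs once per flushEvery writes instead of once per
// write, trading a small durability window for throughput. As noted
// above, dirty page cache survives a Pod kill because it lives in
// kernel space; only a node-level failure can lose the unsynced tail.
type batchWriter struct {
	f          *os.File
	pending    int
	flushEvery int
}

func (w *batchWriter) Write(p []byte) error {
	if _, err := w.f.Write(p); err != nil {
		return err
	}
	w.pending++
	if w.pending >= w.flushEvery {
		w.pending = 0
		return w.f.Sync() // one fsync covers the whole batch
	}
	return nil
}
```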
Slide 46
Write Performance
● With 32 Shard Aggregator Pods
○ Takes 40 min to aggregate & write 450GB every 4 hours
○ Consumes only 3GB of memory per Pod
○ No outages so far
Slide 47
3. Distributed Read
Slide 48
Query API
How?
Slide 49
Query API
Storage Gateway
Slide 50
New process - Storage Gateway
● Communicates directly with Object Storage
● Returns samples stored in Object Storage
● Caches data
○ Reduces RPS to Object Storage
○ Returns results faster
Slide 51
Request for Samples
Query API
Storage Gateway
Slide 52
Request for Samples
Query API
Storage Gateway
Download Index
Identify byte locations in the sample file
Slide 53
Query API
Storage Gateway
Request for Samples
Download Index
Identify byte locations in the sample file
Download samples with a Byte-Range request
Return Samples
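The byte-range download can be sketched with a plain HTTP Range request, which S3-compatible object storage answers with 206 Partial Content; auth and presigning are omitted in this sketch:

```go
package rangeget

import (
	"fmt"
	"io"
	"net/http"
)

// fetchRange downloads only the identified byte span of the sample
// file instead of the whole object.
func fetchRange(url string, start, end int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// Range is inclusive on both ends: bytes=start-end.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```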
Slide 54
Query API
Storage Gateway
What about Cache?
Slide 55
Choose the correct Key-Value Store
LSM-Tree (LevelDB, RocksDB) vs B+Tree (etcd.io/bbolt)
Write Performance: varies by case
Read Performance: varies by case
Slide 56
Distributed Cache with bbolt & Envoy
● etcd-io/bbolt (see the cache sketch below)
○ On-disk B+Tree Key-Value store
○ Better read performance
○ Page cache works well
● Envoy
○ L7 LB to route requests to fixed Pods
○ Active health checks supported
○ Maglev supported, optimized for even distribution
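A minimal sketch of the bbolt side of this cache using go.etcd.io/bbolt; the bucket name and key scheme are assumptions, and the Envoy/Maglev routing is proxy configuration rather than Go code, so it is omitted:

```go
package diskcache

import bolt "go.etcd.io/bbolt"

var bucket = []byte("chunks") // bucket name is an assumption

// put caches a downloaded index or sample block on local disk.
func put(db *bolt.DB, key, val []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(bucket)
		if err != nil {
			return err
		}
		return b.Put(key, val)
	})
}

// get returns a copy of a cached value, or nil on a miss. Values
// returned by bbolt are only valid inside the transaction, hence the
// copy.
func get(db *bolt.DB, key []byte) ([]byte, error) {
	var out []byte
	err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucket)
		if b == nil {
			return nil
		}
		if v := b.Get(key); v != nil {
			out = append([]byte(nil), v...)
		}
		return nil
	})
	return out, err
}
```

Because Maglev pins each shard key to a fixed Pod, every Pod's local bbolt file only ever caches its own shards, so the cache stays consistent without any cross-Pod invalidation.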
Slide 57
Query API
Storage Gateway 1 … Storage Gateway 32
Split a query into multiple small ones by 4h shard
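A sketch of that query-splitting step: break the requested time range into the 4h windows the shard objects are keyed by, so each window can be fanned out to its gateway Pod. Alignment to wall-clock 4h boundaries is an assumption:

```go
package querysplit

import "time"

// splitRange breaks a query's time range into the 4h windows the
// shard objects are keyed by, clamping the first and last windows to
// the requested range.
func splitRange(from, to time.Time) [][2]time.Time {
	const window = 4 * time.Hour
	var out [][2]time.Time
	for cur := from.Truncate(window); cur.Before(to); cur = cur.Add(window) {
		start, end := cur, cur.Add(window)
		if start.Before(from) {
			start = from
		}
		if end.After(to) {
			end = to
		}
		out = append(out, [2]time.Time{start, end})
	}
	return out
}
```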
Slide 58
Query API
Storage Gateway 1 … Storage Gateway 32
Split a query into multiple small ones by 4h shard
Route each shard request to a fixed Pod by Maglev
Slide 59
Query API
Storage Gateway 1 … Storage Gateway 32
Split a query into multiple small ones by 4h shard
Route each shard request to a fixed Pod by Maglev
Download Index & Samples
Slide 60
Query API
Storage Gateway 1 … Storage Gateway 32, each with bbolt
Split a query into multiple small ones by 4h shard
Route each shard request to a fixed Pod by Maglev
Cache downloaded indices & samples
Slide 61
Query API
Storage Gateway 1 … Storage Gateway 32, each with bbolt
Return each result
Slide 62
Query API
Storage Gateway 1 … Storage Gateway 32, each with bbolt
Return each result
Merge all results
Slide 63
But, still slow…
Slide 64
Pinpoint the bottleneck with traces & profiles (Grafana Tempo, Pyroscope)
Download Index → Decode Index → Identify byte location → Download Sample → Return
Profiling and tracing showed the index steps consumed too much time
Slide 65
Download Index → Decode Index → Identify byte location → Download Sample → Return
The index is too big to download and decode
Cry icons created by Vectors Market - Flaticon: https://www.flaticon.com/free-icons/cry
Slide 66
Index of Index
Slide 67
Download Index → Decode Index → Identify byte location → Download Sample → Return
Reduce the index size that must be handled
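One way to read the "Index of Index" idea, as a sketch: a small top-level index maps metric-ID ranges to byte ranges inside the big per-object index, so only the relevant slice needs a Byte-Range download and decode. All names here are illustrative:

```go
package metaindex

// indexRange is one entry of the small top-level index: which metric
// IDs a slice of the big per-object index covers, and where that
// slice sits inside the index file.
type indexRange struct {
	MinID, MaxID uint64 // metric IDs covered by this index slice
	Start, End   int64  // byte range of the slice within the index file
}

// locate returns the byte range of the index slice that can contain
// id, so only that slice is downloaded and decoded.
func locate(top []indexRange, id uint64) (start, end int64, ok bool) {
	for _, r := range top {
		if id >= r.MinID && id <= r.MaxID {
			return r.Start, r.End, true
		}
	}
	return 0, 0, false
}
```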
Slide 68
Read Performance
● With 64 Storage Gateway Pods
○ Performance comparable to Cassandra
■ 2ms at p99 for 4h of data
■ 6s ~ 9s at p99 for 1 month of data
○ 1.9TB cached
Slide 69
Obtain Unlimited Capacity
Slide 70
Storage Gateway
Shard Aggregator
Default Storage
Bring Your Own Buckets!
User A's Storage
User B's Storage
Slide 71
Petabyte scale is NOT an issue anymore
Thanks to Everyone in the Community
Distributed Write: LevelDB, Nginx
Distributed Read: bbolt, Envoy
Observability
Slide 72
What can we do for the community?
2021: Introduced Loki in our org
2022: Contributed to Loki
2023 - 2024: Success of this project, leveraging knowledge of Loki
Future: Contribute to the Community
Always seeking opportunities to contribute