Slide 1

Building an Observability Backend with StarTree Cloud
Neha Pawar, Founding Engineer, StarTree
Real-Time Analytics Summit 2024

Slide 2

The analytics quadrant
[Quadrant diagram with axes REAL-TIME vs. BATCH and INTERNAL vs. EXTERNAL; observability (logs, metrics & traces) sits in the real-time, internal quadrant]

Slide 3

Observability data
Data collected and analyzed to gain insights into the internal state of systems and applications.
What insights?
● Health & reliability
● Failures & investigations
● Usage & efficiency
● Performance bottlenecks
Data types: metrics, logs, traces

Slide 4

The observability stack, top to bottom:
● Visualization
● Query
● Storage
● Collection
● Agents

Slide 5

The all-or-nothing stack: lack of flexibility
Collection & agents:
● Heavy investment made in agents, adding to cost
● Vendor-specific formats
● High egress cost
● Data locked in
Visualization:
● Each stack came with its own UI
● Could not be used for other data
● Could not be customized
Storage & query:
● High velocity + high volume + high freshness = high cost
● Low usage = low ROI
● Data locked in

Slide 6

The disaggregated stack
[Diagram: services with agents → Collection (streaming infra) → Storage & Query (data store) → Visualization (viz. tools)]

Slide 7

The disaggregated stack - Agents
[Same diagram, Agents layer highlighted]

Slide 8

The disaggregated stack - Agents
● Agents are commoditized
● Standards such as OTEL (OpenTelemetry)
[Same diagram, with agents emitting data in the OTEL format]

Slide 9

The disaggregated stack - Collection
[Same diagram, Collection layer highlighted]

Slide 10

The disaggregated stack - Collection
● Rise of stream processing systems like Kafka and Redpanda
● Often already in place for other streaming use cases
[Same diagram, Collection layer highlighted]

Slide 11

The disaggregated stack - Visualization
[Same diagram, Visualization layer highlighted]

Slide 12

The disaggregated stack - Visualization
● Pluggable viz. tools, e.g. via Grafana connectors
● Query language support: PromQL, LogQL
● Customizable query editors
[Same diagram, Visualization layer highlighted]

Slide 13

The disaggregated stack - Storage & Query
● Storage & query is the hardest problem
● Impacts cost, performance & flexibility
[Same diagram, Storage & Query layer highlighted]

Slide 14

Storage & Query
The data store must handle:
● High volume
● High retention
● High variety
● High velocity
● High query complexity
● High freshness

Slide 15

Metrics data
Each data point: timestamp, metric name, metric value, dimensions

Slide 16

Metrics data
Common challenges with metrics data (dimensions arrive as a map of labels):
● Ingest:
○ Ingest as-is: difficult to query
○ Materialize all keys upfront:
■ Dynamic nature makes it challenging
■ Sparse nature makes it costly
● Query:
○ Complex extraction logic
○ High query fanout

SELECT jsonExtractScalar(labels, '$.org_id'),
       jsonExtractScalar(labels, '$.table'),
       SUM(value)
FROM startree_metrics_analytics
WHERE name IN ('pinot_broker_queryExecution_Count')
  AND timestamp > 1714608005000 AND timestamp < 1714611605000
GROUP BY jsonExtractScalar(labels, '$.org_id'),
         jsonExtractScalar(labels, '$.table')
LIMIT 10

Slide 17

Logs data
Each log record: timestamp, attributes (e.g. level, className, threadName), log message

Slide 18

Logs data
Common challenges with logs data:
● High data volume, high storage cost
● Long retention
● Text search queries

SELECT message
FROM logs
WHERE timestamp > 1714608005000
  AND timestamp < 1714611605000
  AND level = 'WARN'
  AND REGEXP_LIKE(message, 'ip 10.50.* time_ms=100.*')
LIMIT 100;

Slide 19

Trace data
● Common payload: trace ID, timestamp, attributes (JSON)
● Spans array: each span has a span ID and its own attributes (JSON)

Slide 20

Trace data
Common challenges with trace data:
● Complex ingest
● Storing semi-structured data
● Complex JSON extraction queries

Slide 21

Storage & Query (recap): the data store must handle high volume, high retention, high variety, high velocity, high query complexity and high freshness.

Slide 22

Pinot as a backend for observability
● Scalable, fresh and reliable ingestion
● Pluggable decoders: Prometheus, JSON, OTEL
● Pluggable encoders: CLP
● Specialized data types: MAP
● Specialized indexes: text, JSON, sparse, inverted
● Viz. tool connectors: Grafana
[Diagram: Pinot as the Storage & Query layer in the disaggregated stack]

Slide 23

Real-time streaming systems <-> Pinot integration
● Scalable, fresh, reliable ingestion from real-time streaming systems
● Pluggable decoders for various formats
● Partition-aware ingestion
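As an illustration, a minimal sketch of the Kafka side of such an integration in a Pinot realtime table config; the topic, broker and flush-threshold values are hypothetical:

"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "otel-metrics",
  "stream.kafka.broker.list": "kafka-broker:9092",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
  "realtime.segment.flush.threshold.time": "6h"
}

With the low-level consumer, each consuming segment maps to one Kafka partition, so upstream partitioning (e.g. by metric name) carries through into Pinot's segments.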

Slide 24

Handling observability data in Pinot
Per field, the data types & encoding and indexing applied:
● Timestamps: dictionary encoding; timestamp index
● Metric name: var-length dictionary; inverted + sorted index
● Metric value: raw encoding; range index
● Dimensions: MAP type; inverted index
● Log lines: CLP encoding; text index
● Traces: JSON; JSON index
Data layout across all fields: partitioning

Slide 25

The power of indexing
Range index:
● Fast range filtering on numeric columns using a range-of-values to docIds mapping
● E.g. finding requests with latency between 100ms and 500ms
Timestamp index:
● Materialize and index different granularities of a timestamp
● Overrides query predicates
● E.g. find metrics in a time range
Sorted index:
● Sort on a column that appears frequently in queries
● Increases data locality, reduces time to scan
● E.g. sort on metric name
Inverted index:
● Fast filtering using a value to docIds mapping
● E.g. filter on attributes like level, class name
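For illustration, a hedged sketch of how these indexes might be declared in a Pinot table config; the column names (metricName, level, className, latencyMs, timestamp) are assumptions:

"tableIndexConfig": {
  "sortedColumn": ["metricName"],
  "invertedIndexColumns": ["level", "className"],
  "rangeIndexColumns": ["latencyMs"]
},
"fieldConfigList": [{
  "name": "timestamp",
  "encodingType": "DICTIONARY",
  "indexTypes": ["TIMESTAMP"],
  "timestampConfig": { "granularities": ["HOUR", "DAY"] }
}]

With the timestamp index, filters and group-bys on datetrunc('HOUR', timestamp) can be rewritten by Pinot to use the pre-materialized granularity columns.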

Slide 26

Handling observability data in Pinot - data layout: partitioning

Slide 27

Partitioning by space and time
Partition column = metricName, num partitions = 10
● Partitions are mapped to servers, e.g. Server 1: partitions 1-3, Server 2: partitions 4-6, Server 3: partitions 7-10
● For a query filtering on metricName (12 total segments in the example):
○ Broker-level pruning routes the query only to the server holding the relevant partition, narrowing 12 candidate segments to 4 (segments 9-12)
○ Server-level pruning then skips segments of other partitions, narrowing to 2 (segments 9 and 11)
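A hedged sketch of the matching table config, using the partition column and count from the slide; the Murmur hash function is an assumption:

"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "metricName": { "functionName": "Murmur", "numPartitions": 10 }
    }
  }
},
"routing": {
  "segmentPrunerTypes": ["partition"]
}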

Slide 28

Handling observability data in Pinot - data types & encoding: MAP

Slide 29

MAP type - Ease of use
Flexible MAP type for metric data:
● Ease of ingestion
● Smart storage:
○ Store dense keys as dedicated columns
○ Store sparse keys in EAV (entity-attribute-value) format
● Simplified queries:

SELECT labels['org_id'], labels['table'], SUM(value)
FROM startree_metrics_analytics
…
GROUP BY labels['org_id'], labels['table']

EAV layout (docId: EAVs):
0: key1|int|100, key18|string|foo
1: key2|long|2635822439, key999|float|0.5, key45|int|8
…

"fieldConfigList": [{
  "name": "tags",
  "encodingType": "RAW",
  "indexes": {
    "forward": {
      "mapIndexConfig": { "denseKeys": ["foo", "bar"] }
    }
  }
}]

Slide 30

MAP type - Smart storage

"fieldConfigList": [{
  "name": "tags",
  "encodingType": "RAW",
  "indexes": {
    "forward": {
      "mapIndexConfig": {
        "dynamicallyCreateDenseKeys": true,
        "maxKeys": 300
      }
    }
  }
}]

Slide 31

MAP type - Query performance
All columns stored as sparse:
● Low storage footprint
● Good performance for scans on sparse keys
● Poor performance on dense-key scans
All columns stored as dense:
● Very high storage footprint
● Good performance across all scans

Slide 32

Handling observability data in Pinot - encoding & indexing for logs and traces: CLP, text, JSON

Slide 33

Log Compression with CLP
CLP is a compressor designed to encode unstructured log messages in a way that makes them more compressible while retaining the ability to search them.
CLP encoding for log lines in Pinot:
● Tokenize the phrase
● Extract variable values
● Construct the log type
Example log line:
  2024-05-01T00:07:45.000 INFO [BrokerRequestHandler] Broker pinot-broker_7001 took 20 ms to execute requestId 72 on table foo_OFFLINE
Encodes to:
● Dictionary variables: [pinot-broker_7001, foo_OFFLINE]
● Non-dictionary variables: [20, 72]
● Log type: Broker \x11 took \x12 ms to execute requestId \x12 on table \x11
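Not shown on the slide: the encoding itself can happen at ingest time. A hedged sketch using Pinot's CLP log message decoder (the decoder class and property come from the pinot-clp-log input format plugin; applying it to a field named message is an assumption):

"stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.clplog.CLPLogMessageDecoder",
"stream.kafka.decoder.prop.fieldsForClpEncoding": "message"

This stores each message as three columns: message_logtype, message_dictionaryVars and message_encodedVars.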

Slide 34

Log Compression with CLP
Querying CLP-encoded columns in Pinot: the search phrase is processed the same way the log string was compressed:
● Tokenize the phrase
● Extract variable values
● Construct the log type
Example query:
  SELECT message FROM logs WHERE REGEXP_LIKE(message, '.* on table foo_OFFLINE')
Encodes to:
● Dictionary variables: [foo_OFFLINE]
● Non-dictionary variables: []
● Log type: on table \x11
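To return the original text, Pinot's CLPDECODE scalar function reassembles the message from the three encoded columns. A hedged sketch, reusing the column naming convention above:

SELECT CLPDECODE(message_logtype, message_dictionaryVars, message_encodedVars) AS message
FROM logs
WHERE REGEXP_LIKE(message_logtype, '.* on table .*')
  AND message_dictionaryVars = 'foo_OFFLINE'
LIMIT 100

The equality predicate on message_dictionaryVars works because Pinot predicates on multi-value columns match when any value matches.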

Slide 35

Log Compression with CLP
Compression ratio results:
● 300MB CSV file with Helix logs → 9MB Pinot segment
● 400k records → 300 log types
● Compression keeps increasing with data size, as log types typically keep repeating
Reference: https://www.uber.com/blog/reducing-logging-cost-by-two-orders-of-magnitude-using-clp/

Slide 36

Text index on CLP-encoded logs
Used to accelerate regexp_like queries: match on the log type column first, then verify the match against the final log line.
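A hedged sketch of what that might look like: a Pinot text index declared on the (assumed) message_logtype column, queried with TEXT_MATCH, which takes Lucene query syntax:

"fieldConfigList": [{
  "name": "message_logtype",
  "encodingType": "RAW",
  "indexTypes": ["TEXT"]
}]

SELECT CLPDECODE(message_logtype, message_dictionaryVars, message_encodedVars) AS message
FROM logs
WHERE TEXT_MATCH(message_logtype, '"on table"')
LIMIT 100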

Slide 37

Json index for trace data
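The slide carries no detail, so here is a hedged sketch of a JSON index on a spans column and a JSON_MATCH filter over it; the table, column and attribute names are hypothetical:

"tableIndexConfig": {
  "jsonIndexColumns": ["spans"]
}

SELECT traceId
FROM traces
WHERE JSON_MATCH(spans, '"$[*].attributes.status" = ''ERROR''')
LIMIT 10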

Slide 38

Grafana - Pinot integration
● Query editor with Pinot query support
● PromQL support
● LogQL support

Slide 39

StarTree as a backend for observability
Pinot (storage & query):
● Scalable, fresh and reliable ingestion
● Pluggable decoders: Prometheus, JSON, OTEL
● Pluggable encoders: CLP
● Specialized data types: MAP
● Specialized indexes: text, JSON, sparse, inverted
● Viz. tool connectors: Grafana
Tiered storage:
● Low cost to serve
● Long retention
● Large volumes
BYOC:
● No data transfer
● Secure
[Diagram: the disaggregated stack with StarTree Cloud as the Storage & Query layer]

Slide 40

Cloud Native Storage
[Diagram: brokers and servers (Server 1-4) in front of an object store holding the Pinot segments]

Slide 41

Pin indexes locally
[Diagram: same setup; segment metadata & indexes are pinned locally on the servers while full segments stay in the object store]

Slide 42

Column / block level reads
● No lazy loading
● Selective columnar fetch
● Block fetch
Example: SELECT sum(impressions) FROM table WHERE region = 'foo'
● Columnar fetch: pull only region.dict, region.inv_idx and impressions.fwd from the segment
● Block fetch: pull only the needed blocks of region.inv_idx and impressions.fwd
[Diagram: Pinot server reading selected columns/blocks from a segment file (columns.psf) in the object store; unneeded columns such as browser, country, cost and timestamp are never fetched]

Slide 43

Querying logs on Cloud Native Storage
Setup:
● 100GB logs, 500 segments, 200 million records
● Query touches all 500 segments, returns 10k rows
● Pinned locally: FST + inverted index headers (1GB total)
Results:
● Baseline (columnar fetch): > 60s latency
● With block reads enabled: < 1s latency

Slide 44

Bring Your Own Cloud deployment
● No VPC peering
● Principle of least privilege
● Data security / governance
● Fully managed
● Cost effective

Slide 45

Thanks!