Slide 1

Streaming Aggregation of Cloud-Scale Telemetry
Shay Lin, Software Engineer, Confluent

Slide 2

Contents
● Telemetry serving: the platform evolution
● Streaming aggregation architectures
● Deep dive into the chosen solution

Slide 3

A tale of four clusters
● Stores 81233469 bytes in partition 1, node-4 at 2023-04-14 06:55 PST
● CPU usage 89% at timestamp 1680613879
● 4 fetch requests from client id dev-app-1 in the last minute, at 2023-01-23 10:09T000Z
● 1K incoming messages in the last second to broker node 4

Slide 4

Terminology
● Topic
● Partition
● Broker
● Cluster

Slide 5

Telemetry Serving: The Evolution

Slide 6

Kafka Telemetry Serving: Usage Patterns
Data Store

Slide 7

Kafka Telemetry Serving: Usage Patterns
Data Store
Stores 81233469 bytes in node-4 at 2023-04-14 06:55 PST

Slide 8

Kafka Telemetry Serving: Usage Patterns
Data Store
Stores 81233469 bytes in node-4 at 2023-04-14 06:55 PST
Facts: number of bytes stored for Cluster 1 in the last hour

Slide 9

Kafka Telemetry Serving: Usage Patterns
Data Store
Stores 81233469 bytes in node-4 at 2023-04-14 06:55 PST
Facts: number of bytes stored for Cluster 1 in the last hour
CPU was at 89% at timestamp 1680613879

Slide 10

Kafka Telemetry Serving: Usage Patterns
Data Store
Stores 81233469 bytes in node-4 at 2023-04-14 06:55 PST
Facts: number of bytes stored for Cluster 1 in the last hour
CPU was at 89% at timestamp 1680613879
Trends: CPU usage peaks on Friday nights (PST) each week

Slide 11

Kafka Telemetry Serving: Usage Patterns
Data Store
Stores 81233469 bytes in node-4 at 2023-04-14 06:55 PST
Facts: number of bytes stored for Cluster 1 in the last hour
CPU was at 89% at timestamp 1680613879
Trends: CPU usage peaks on Friday nights (PST) each week
268 fetch requests in the last minute from client id dev-app-01

Slide 12

Kafka Telemetry Serving: Usage Patterns
Data Store
Stores 81233469 bytes in node-4 at 2023-04-14 06:55 PST
Facts: number of bytes stored for Cluster 1 in the last hour
CPU was at 89% at timestamp 1680613879
Trends: CPU usage peaks on Friday nights (PST) each week
268 produce requests in the last minute from client id dev-app
Attribution: dev-app issues the most produce requests among all clients

Slide 13

Kafka Telemetry Serving: Usage Patterns
Data Store
Stores 81233469 bytes in node-4 at 2023-04-14 06:55 PST
Facts: number of bytes stored for Cluster 1 in the last hour
CPU was at 89% at timestamp 1680613879
Trends: CPU usage peaks on Friday nights (PST) each week
268 produce requests in the last minute from client id dev-app
Attribution: dev-app issues the most produce requests among all clients
Diagnosis: find the point-in-time number of requests on Friday nights and identify a fan-in problem!


Slide 15

Time-Series-Optimized OLAP: Apache Druid
[Diagram: Druid segments feeding a query engine that serves # fetch requests, % CPU, # stored bytes, …]

Slide 16

A tale of four clusters: when it gets analytical
● Storage bytes across all partitions for a Kafka topic
● % CPU at a point in time
● # of produce requests from a client ID
● Ingress from a cluster in the last hour

Slide 17

Highly concurrent ingestion and queries
[Diagram: many Druid segments feeding the query engine, each serving # fetch requests, % CPU, # network connections, …]

Slide 18

Scalability Concerns of the Pull Model
Example: hourly storage metric of a cluster = N topics × P partitions × R replication factor × 60 minutes. With N = 100, P = 10, R = 3, that is 100 × 10 × 3 × 60 = 180K data points for one metric.
● Highly concurrent ingestion and query
● Rising compute and serving cost
● Inconsistent queries used by data consumers


Slide 20

Streamline Metrics Consumption with the Push Model
Publish in-demand metric aggregations

Slide 21

Architecture Options
[Chart: options plotted by data size, from narrow to broad use cases]

Slide 22

Architecture Options
Offline: custom rollup tasks in Apache Druid, or Apache Pinot's star-tree index.

Slide 23

Architecture Options
Offline: custom rollup tasks in Apache Druid, or Apache Pinot's star-tree index.
Real-time: aggregate raw telemetry as it arrives via stream processing (Flink, KStreams).

Slide 24

Architecture Options
Offline: custom rollup tasks in Apache Druid, or Apache Pinot's star-tree index.
Real-time: aggregate raw telemetry as it arrives via stream processing (Flink, KStreams).
Hybrid: aggregate through stream processing, and feed the results back to the OLAP store.

Slide 25

Architecture Options
Offline: custom rollup tasks in Apache Druid, or Apache Pinot's star-tree index.
Real-time: aggregate raw telemetry as it arrives via stream processing (Flink, KStreams).
Hybrid (chosen): aggregate through stream processing, and feed the results back to the OLAP store.
Key decision factors:
● High-compression use cases
● Metric accuracy and consistency
● Cost efficiency

Slide 26

Chosen Hybrid Solution Architecture: KStreams + Druid

Slide 27

Kafka Streams: 10,000 ft View
● It's an Apache Kafka client library
● A processor topology defines the computational logic to be performed as messages come through Kafka
● A Java (or Scala) microservice that enjoys the benefits of fault tolerance and parallelism, backed by Kafka topics
● Kafka Streams provides:
○ Streams DSL: joins, windowed aggregation
○ Processor API: custom data operations, state store management
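To make the Streams DSL bullet concrete, here is a minimal sketch of a windowed aggregation. The topic names (raw-telemetry, metric-aggregates) and the per-minute sum are illustrative assumptions, not the production topology from the talk.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;

public class TelemetryRollup {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Key: a metric series id, e.g. "cluster-1/node-4/stored-bytes"; value: a reading.
        builder.stream("raw-telemetry", Consumed.with(Serdes.String(), Serdes.Long()))
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .aggregate(() -> 0L,                         // initializer
                          (key, value, agg) -> agg + value, // sum readings per window
                          Materialized.with(Serdes.String(), Serdes.Long()))
               .toStream()
               .map((windowedKey, sum) -> KeyValue.pair(windowedKey.key(), sum))
               .to("metric-aggregates", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "telemetry-rollup");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```

The fault tolerance and parallelism mentioned above come for free here: the window state lives in a local store backed by a changelog topic, and instances share work by input partition.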

Slide 28

Metric store built on KStreams
[Diagram: raw telemetry from customer clusters flows into the KStreams metric store; aggregates flow out to consumers]

Slide 29

A tale of four clusters: unified metric interface
● Storage bytes across all partitions for a Kafka topic
● % CPU at a point in time
● # of produce requests from a client ID
● Ingress from a cluster in the last hour

Slide 30

Topology 1: Global Task Manager
Distributes aggregation tasks by metric and entity (sketch below):
● Custom segment signal producer in the Druid segments
● The task manager dynamically allocates tasks based on upstream segments and additional trigger conditions
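A hedged sketch of the task manager's output side: a plain producer that keys each aggregation task by metric and entity, so every series is owned by exactly one partition (and therefore one worker). The topic name, task payload, and trigger wiring are hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AggregationTaskManager {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical trigger: a segment-ready signal arrived for (metric, entity).
            String metric = "stored-bytes";
            String entity = "cluster-1";
            // Keying by metric/entity keeps all tasks for one series on one partition.
            String key = metric + "/" + entity;
            String task = "{\"metric\":\"" + metric + "\",\"entity\":\"" + entity
                        + "\",\"window\":\"2023-04-14T06:00/PT1H\"}";
            producer.send(new ProducerRecord<>("aggregation-tasks", key, task));
        }
    }
}
```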

Slide 31

Topology 2: Metric Processing Workers
Leverages the Druid query engine for metric rollups (sketch below):
● Statelessly process incoming aggregation tasks
● Flatten results into a single metric output
● Data retention is the same as the Druid segments
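A sketch of a stateless worker under these assumptions: tasks arrive on a Kafka topic, the rollup itself is delegated to Druid's SQL endpoint (/druid/v2/sql), and the flattened result is emitted as a single output record. Topic names, the datasource, and the query are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class MetricWorkerTopology {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Delegate the rollup to the Druid query engine and return the flattened result.
    static String rollup(String task) {
        String sql = "{\"query\": \"SELECT SUM(stored_bytes) FROM telemetry"
                   + " WHERE cluster = 'cluster-1'"
                   + " AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR\"}";
        HttpRequest req = HttpRequest
                .newBuilder(URI.create("http://druid-broker:8082/druid/v2/sql"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(sql))
                .build();
        try {
            return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("aggregation-tasks", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(MetricWorkerTopology::rollup) // stateless: no local store needed
               .to("metric-aggregates", Produced.with(Serdes.String(), Serdes.String()));
        return builder;
    }
}
```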

Slide 32

Topology 3: Additional Processing
Assumption: consumers expect OpenTelemetry (OTel) metrics (sketch below):
● Processing to support OTel semantics, e.g. emit deltas for counter metrics
● Consumers include Druid segments and direct data consumers
● Out-of-order data handling with the state store, up to the retention period*
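A minimal Processor API sketch of the counter-to-delta step, assuming the inputs are cumulative counters. The store name is hypothetical, and the out-of-order handling the slide mentions is omitted for brevity.

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

public class DeltaProcessor implements Processor<String, Long, String, Long> {
    private ProcessorContext<String, Long> context;
    private KeyValueStore<String, Long> lastSeen; // must be attached to this processor in the topology

    @Override
    public void init(ProcessorContext<String, Long> context) {
        this.context = context;
        this.lastSeen = context.getStateStore("counter-snapshots");
    }

    @Override
    public void process(Record<String, Long> record) {
        Long previous = lastSeen.get(record.key());
        // First observation: emit the value itself; afterwards, emit the OTel-style delta.
        long delta = (previous == null) ? record.value() : record.value() - previous;
        lastSeen.put(record.key(), record.value());
        context.forward(record.withValue(delta));
    }
}
```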

Slide 33

Reference Architecture: Kafka Streams

Slide 34

Horizontal Scaling Story in KStreams (KIP-878)
Problem: as the business expands, you might want to increase the parallelism of the stream processing, so you increase the number of partitions of the input topics. However, internal topics (changelog, repartition) will not grow automatically, and today a KStreams application will crash upon detecting a partition-count mismatch between internal topics and input topics.
KIP-878 supports auto-scaling of internal topics. This works well if your application can be:
● Statically partitioned or stateless. Stateless is straightforward. In KStreams, your state store (e.g. RocksDB) is backed by internal topics and thus bound to a partition; upon autoscaling, pre-existing state will not move. Choose a partitioning strategy that works for your use case, such that tasks without pre-existing state can be driven to the newly created partitions while existing keys remain sticky (see the sketch below).
● Over-provisioned upfront for stateful processing, while the KIP is in progress.
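One way to picture "statically partitioned" is the sketch below: keys embed the partition count that was live when they were created, so hashing over that count keeps existing keys (and their state) sticky across an expansion, while keys created afterwards can reach the new partitions. The key format is a hypothetical illustration, not code from the talk.

```java
import org.apache.kafka.streams.processor.StreamPartitioner;

public class StickyEpochPartitioner implements StreamPartitioner<String, byte[]> {
    @Override
    public Integer partition(String topic, String key, byte[] value, int numPartitions) {
        // Assumed key format: "<partitionCountAtCreation>|<seriesId>", e.g. "8|cluster-42".
        int sep = key.indexOf('|');
        int createdWithPartitions = Integer.parseInt(key.substring(0, sep));
        // Hash over the creation-time count: old keys never move, while new keys
        // (created with the expanded count) spread over the added partitions too.
        return Math.floorMod(key.substring(sep + 1).hashCode(), createdWithPartitions);
    }
}
```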

Slide 35

A tale of four clusters
● Stores 81233469 bytes in partition 1, node-4 at 2023-04-14 06:55 PST
● CPU usage 89% at timestamp 1680613879
● 4 fetch requests from client id dev-app-1 in the last minute, at 2023-01-23 10:09T000Z
● 1K incoming messages in the last second to broker node 4

Slide 36

A tale of four clusters: when it gets analytical
● Storage bytes across all partitions for a Kafka topic
● % CPU at a point in time
● # of produce requests from a client ID
● Ingress from a cluster in the last hour

Slide 37

A tale of four clusters: happy consumers!
● Storage bytes across all partitions for a Kafka topic
● % CPU at a point in time
● # of produce requests from a client ID
● Ingress from a cluster in the last hour

Slide 38

Twitter: @QiuxuanL
https://www.linkedin.com/in/qiuxuanlin/

Slide 39

Bonus: KStreams Real-time raw telemetry processing

Slide 40

Pseudo Topology
[Diagram: Kafka Streams pseudo topology]

Slide 41

Design Diagram: Time and Space Cardinality Reduction

Slide 42

As metrics use cases increase… Closing thoughts for KStreams metrics aggregation
● A DSL, KSQL or similar, to define versioned metrics: each metric aggregate is computed by one topology, so when a metric definition changes, we need a strategy to handle and propagate the change.
● Topic partitioning of the raw telemetry impacts aggregation efficiency: repartitioning may take up most of the processing time, as well as increase storage and network costs.
● Performance tuning for RocksDB, or the state store implementation of your choice, will become critical: SerDes, data retention, read/write patterns (see the sketch below).
● Query plans and smart rollups could become essential: pre-aggregates should be shared across space and/or time aggregations of metrics for efficiency.
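On the RocksDB tuning bullet: Kafka Streams exposes a rocksdb.config.setter hook for exactly this kind of tuning. A sketch below; the specific values are illustrative, not recommendations from the talk.

```java
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class TunedRocksDB implements RocksDBConfigSetter {
    @Override
    public void setConfig(String storeName, Options options, Map<String, Object> configs) {
        BlockBasedTableConfig table = (BlockBasedTableConfig) options.tableFormatConfig();
        table.setBlockSize(16 * 1024L);     // larger blocks for scan-heavy rollups
        options.setTableFormatConfig(table);
        options.setMaxWriteBufferNumber(4); // absorb bursty telemetry writes
    }

    @Override
    public void close(String storeName, Options options) {
        // Nothing allocated here to release.
    }
}
// Registered via: props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, TunedRocksDB.class);
```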