
Streaming Aggregation of Cloud Scale Telemetry (Shay Lin, Confluent) | RTA Summit 2023


How does one serve telemetry metrics at cloud scale? At Confluent, raw telemetry flows in at 5 million metrics per second. Not only is storage expensive (often with extended retention to meet compliance requirements), but the computational cost of aggregation can also skyrocket in a pull model, where metrics consumers such as data science and billing query metrics from OLAP data stores on demand, creating inconsistencies over time. This session showcases how we switched to a push model for telemetry analytics and tackled these challenges with Kafka Streams and Apache Druid.

You will walk away with an understanding of:
– Architecture choices for real-time aggregation
– Time semantics, and handling out-of-order events
– The partitioning and autoscaling story of the streaming platform

StarTree

May 23, 2023



Transcript

  1. A tale of four clusters
     • Stores 81233469 bytes in partition 1, node-4 at 2023-04-14 06:55 PST
     • CPU usage 89% at timestamp 1680613879
     • 4 fetch requests from client id dev-app-1 in the last minute, at 2023-01-23 10:09T000Z
     • 1K incoming messages in the last second to broker node 4
  2.–8. Kafka Telemetry Serving Usage Patterns: Data Store
     Stores 81233469 bytes in node-4 at 2023-04-14 06:55 PST
     Facts: number of bytes stored for Cluster 1 in the last hour; CPU was at 89% at timestamp 1680613879
     Trends: CPU usage always peaks on Friday nights (PST) during a week; 268 produce requests in the last minute from client id dev-app
     Attribution: dev-app issues the most produce requests among all clients
     Diagnose: find the point-in-time number of requests on Friday nights and identify a fan-in problem!
  9. A tale of four clusters: when it gets analytical
     • Storage bytes across all partitions for a Kafka topic
     • % CPU at a point in time
     • # of produce requests from a client ID
     • Ingress from a cluster in the last hour
  10. Highly concurrent ingestion and queries
     [Diagram: the Druid query engine fans out across many segments, each holding per-metric data points: # fetch requests, % CPU, # network connections, …]
  11.–12. Scalability Concerns of the Pull Model (Druid)
     Example: hourly storage metric of a cluster = N topics × P partitions × R replication factor × 60 minutes. With N = 100, P = 10, R = 3, that is 100 × 10 × 3 × 60 = 180K data points for a single metric.
     • Highly concurrent ingestion and query load
     • Rising compute and serving cost
     • Inconsistent queries used by data consumers
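As a sanity check on that cardinality math, a tiny back-of-envelope helper (method and parameter names are illustrative, not from the talk):

```java
// Back-of-envelope sketch: data points a pull-model consumer must scan to
// compute ONE hourly, per-partition metric for a cluster.
static long hourlyDataPoints(int topics, int partitionsPerTopic, int replicationFactor, int samplesPerHour) {
    return (long) topics * partitionsPerTopic * replicationFactor * samplesPerHour;
}
// hourlyDataPoints(100, 10, 3, 60) == 180_000, the 180K figure on the slide.
```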
  13.–16. Architecture Options (chart axes: data size vs. narrow-to-broad use cases)
     • Offline: custom rollup tasks in Apache Druid, or Apache Pinot's Star-Tree Index
     • Real-time: aggregate raw telemetry as it comes in via stream processing (Flink, KStreams)
     • Hybrid (chosen): aggregate through stream processing, and feed the results back to the OLAP store
     Key decision factors:
     • High-compression use cases
     • Metric accuracy and consistency
     • Cost efficiency
  17. Kafka Streams: 10,000 ft View
     • It's an Apache Kafka client library
     • A processor topology defines the computational logic to be performed as messages come through Kafka
     • A Java (or Scala) microservice that enjoys the benefits: fault tolerance and parallelism, backed by Kafka topics
     • Kafka Streams provides:
       ◦ Streams DSL: joins, windowed aggregations
       ◦ Processor API: custom data operations, state store management
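To make the 10,000 ft view concrete, here is a minimal Streams DSL sketch of the kind of windowed rollup the deck describes. The topic names, serdes, and one-minute window are assumptions for illustration, not the deck's actual topology:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

// Counts raw telemetry points per metric key in 1-minute tumbling windows
// and emits the rollups to an output topic.
public class MetricRollupApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metric-rollup-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Long> raw =
                builder.stream("raw-telemetry", Consumed.with(Serdes.String(), Serdes.Long()));

        raw.groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
           .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
           .count(Materialized.as("per-minute-counts")) // fault-tolerant state store, backed by a changelog topic
           .toStream()
           .map((windowedKey, count) -> KeyValue.pair(
                   windowedKey.key() + "@" + windowedKey.window().start(), count))
           .to("telemetry-rollups", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```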
  18. Raw telemetry metric store built on KStreams
     [Diagram: raw telemetry from many customers flows into the KStreams-based metric store, which serves aggregates to consumers]
  19. A tale of four clusters: unified metric interface
     • Storage bytes across all partitions for a Kafka topic
     • % CPU at a point in time
     • # of produce requests from a client ID
     • Ingress from a cluster in the last hour
  20. Topology 1: Global Task Manager
     Distributes aggregation tasks by metric and entity:
     • A custom segment signal producer in the Druid segments
     • The task manager dynamically allocates tasks based on upstream segments and additional trigger conditions
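The deck does not show the task manager's code; purely as a hypothetical sketch of the shape of such a topology, with invented record types and topic names:

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

// Illustrative record types; the real signal/task schemas are internal to Confluent.
record SegmentSignal(String entity, List<String> metrics) {}
record AggregationTask(String metric, String entity) {}

public class TaskManagerTopology {
    // Fan out one aggregation task per (metric, entity) whenever a segment
    // signal arrives; downstream workers pick up tasks by key.
    static void define(StreamsBuilder builder) {
        KStream<String, SegmentSignal> signals =
                builder.stream("druid-segment-signals"); // assumes serdes set via config
        signals.flatMap((segmentId, signal) -> signal.metrics().stream()
                        .map(m -> KeyValue.pair(m + "|" + signal.entity(),
                                                new AggregationTask(m, signal.entity())))
                        .collect(Collectors.toList()))
               .to("aggregation-tasks");
    }
}
```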
  21. Topology 2: Metric Processing Workers
     Leverages the Druid query engine for metric rollups:
     • Statelessly process incoming aggregation tasks
     • Flatten results into single-metric outputs
     • Data retention is the same as the Druid segments
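Likewise, a hypothetical sketch of a stateless worker, reusing the invented AggregationTask type from the previous sketch and stubbing out the Druid call:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

// Illustrative output type; the real schema is internal.
record MetricPoint(String metric, String entity, double value, long timestampMs) {}

public class MetricWorkerTopology {
    // Stateless: no state store, so workers scale out with partition count,
    // and retention is whatever the Druid segments already provide.
    static void define(StreamsBuilder builder) {
        KStream<String, AggregationTask> tasks = builder.stream("aggregation-tasks");
        tasks.mapValues(MetricWorkerTopology::rollUpViaDruid) // one flattened metric per task
             .to("aggregated-metrics");
    }

    // Stub: the real workers delegate the rollup to the Druid query engine.
    static MetricPoint rollUpViaDruid(AggregationTask task) {
        double value = 0.0; // placeholder for the Druid query result
        return new MetricPoint(task.metric(), task.entity(), value, System.currentTimeMillis());
    }
}
```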
  22. Topology 3: Additional Processing
     Assumption: consumers expect OpenTelemetry (OTel) metrics:
     • Processing to support OTel semantics, e.g. emitting deltas for counter metrics
     • Consumers include Druid segments and direct data consumers
     • Out-of-order data handling with the state store, up to the retention period*
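The delta-emission step can be sketched with the Processor API and a key-value state store. This minimal version assumes in-order arrival per series (the topology described above additionally handles out-of-order points up to the retention period, which is elided here), and the store name is illustrative:

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Converts cumulative OTel counters to deltas: remembers the last cumulative
// value per series in a state store and forwards the difference.
public class CounterDeltaProcessor implements Processor<String, Double, String, Double> {
    private ProcessorContext<String, Double> context;
    private KeyValueStore<String, Double> lastValues;

    @Override
    public void init(ProcessorContext<String, Double> context) {
        this.context = context;
        this.lastValues = context.getStateStore("counter-last-values"); // illustrative store name
    }

    @Override
    public void process(Record<String, Double> record) {
        Double previous = lastValues.get(record.key());
        lastValues.put(record.key(), record.value());
        if (previous == null || record.value() < previous) {
            context.forward(record); // first observation, or the counter reset
        } else {
            context.forward(record.withValue(record.value() - previous));
        }
    }
}
```

The store would be registered with StreamsBuilder#addStateStore and the processor attached via KStream#process with the store name.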
  23. Horizontal Scaling Story in KStreams (KIP-878)
     Problem: as the business expands, you may want to increase the parallelism of the stream processing by increasing the number of partitions of the input topics. However, internal topics (changelog, repartition) will not grow automatically, and today a KStreams application will crash upon detecting a partition-count mismatch between internal topics and input topics.
     KIP-878 proposes support for autoscaling of internal topics. This works well if your application can be:
     • Statically partitioned or stateless. Stateless is straightforward. In KStreams, your state store (e.g. RocksDB) is backed by internal topics and thus bound to a partition; upon autoscaling, pre-existing state will not move. Choose a partitioning strategy that works for your use case, such that tasks without pre-existing state can be driven to the newly created partitions while existing keys remain sticky (see the sketch below).
     • Upfront over-provisioned for stateful processing, while the KIP is in progress.
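One way to realize the "statically partitioned" requirement is a custom StreamPartitioner that keeps pre-expansion keys in their original partition range. The creation epoch embedded in the key is a purely hypothetical convention for illustration:

```java
import org.apache.kafka.streams.processor.StreamPartitioner;

// Keys minted before the partition expansion keep hashing into the ORIGINAL
// partition range, so their local state stays put; keys minted afterwards
// may use the full range.
public class StickyExpandPartitioner implements StreamPartitioner<String, byte[]> {
    private final int originalPartitions;
    private final long expansionEpochMs;

    public StickyExpandPartitioner(int originalPartitions, long expansionEpochMs) {
        this.originalPartitions = originalPartitions;
        this.expansionEpochMs = expansionEpochMs;
    }

    @Override
    public Integer partition(String topic, String key, byte[] value, int numPartitions) {
        long createdAtMs = keyCreationTime(key); // hypothetical: epoch embedded in the key
        int range = createdAtMs < expansionEpochMs ? originalPartitions : numPartitions;
        return Math.floorMod(key.hashCode(), range);
    }

    private long keyCreationTime(String key) {
        return Long.parseLong(key.substring(key.lastIndexOf('#') + 1));
    }
}
```

Such a partitioner can be plugged in via Produced.streamPartitioner(...) or Repartitioned.streamPartitioner(...).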
  24. A tale of four clusters
     • Stores 81233469 bytes in partition 1, node-4 at 2023-04-14 06:55 PST
     • CPU usage 89% at timestamp 1680613879
     • 4 fetch requests from client id dev-app-1 in the last minute, at 2023-01-23 10:09T000Z
     • 1K incoming messages in the last second to broker node 4
  25. A tale of four clusters: when it gets analytical
     • Storage bytes across all partitions for a Kafka topic
     • % CPU at a point in time
     • # of produce requests from a client ID
     • Ingress from a cluster in the last hour
  26. A tale of four clusters: happy consumers!
     • Storage bytes across all partitions for a Kafka topic
     • % CPU at a point in time
     • # of produce requests from a client ID
     • Ingress from a cluster in the last hour
  27. As metrics use cases increase: closing thoughts for KStreams metrics aggregation
     • A DSL, KSQL or similar, to define versioned metrics: each metric aggregate is computed by one topology, so when a metric definition changes, we need a strategy to handle and propagate the change.
     • Topic partitioning of the raw telemetry impacts aggregation efficiency: repartitioning may take up most of the processing time, and it increases storage and network costs.
     • Performance tuning for RocksDB, or the state store implementation of your choice, becomes critical: SerDes, data retention, read/write patterns.
     • Query plans and smart rollups could become essential: pre-aggregates should be shared across space and/or time aggregations of metrics for efficiency.
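As a starting point for the RocksDB tuning mentioned above, Kafka Streams exposes a RocksDBConfigSetter hook. The values below are illustrative knobs to experiment with, not recommendations from the talk:

```java
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;

// Applied to every RocksDB-backed state store in the application.
public class TelemetryRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(String storeName, Options options, Map<String, Object> configs) {
        BlockBasedTableConfig table = (BlockBasedTableConfig) options.tableFormatConfig();
        table.setBlockCache(new LRUCache(64 * 1024 * 1024L)); // larger block cache for read-heavy rollups
        options.setTableFormatConfig(table);
        options.setWriteBufferSize(32 * 1024 * 1024L); // bigger memtable for write bursts
        options.setMaxWriteBufferNumber(3);
    }

    @Override
    public void close(String storeName, Options options) { /* nothing to release here */ }
}
// Enabled with:
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, TelemetryRocksDBConfig.class);
```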