How does one serve telemetry metrics at cloud scale? At Confluent, raw telemetry flows in at 5 million metrics per second. Storage is expensive, especially with the extended retention needed to meet compliance requirements, and in a pull model the computational cost of aggregation can skyrocket as well. In that pull model, metrics consumers such as data science and billing queried the OLAP data stores on demand, which led to inconsistent results over time. This session will showcase how we switched to a push model for telemetry analytics and tackled these challenges with Kafka Streams and Apache Druid.
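For context, here is a minimal sketch of what a push-model pre-aggregation step can look like in Kafka Streams: raw metric samples are rolled up per series into one-minute windows and written to a topic that Druid can ingest. The topic names, serdes, and the sum-per-minute rollup are illustrative assumptions, not Confluent's actual pipeline.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class MetricRollupTopology {
    // Hypothetical topic names; the real pipeline and schema are not shown in the abstract.
    static final String RAW_TOPIC = "telemetry.raw";
    static final String ROLLUP_TOPIC = "telemetry.rollups.1m";

    public static void build(StreamsBuilder builder) {
        builder.stream(RAW_TOPIC, Consumed.with(Serdes.String(), Serdes.Double()))
               // key = metric name plus tags, value = raw sample; one rollup per series per minute
               .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
               // window sizing only; late/out-of-order data is a separate concern (see below)
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .reduce(Double::sum, Materialized.with(Serdes.String(), Serdes.Double()))
               .toStream()
               // flatten the windowed key so downstream consumers (e.g. Druid ingestion)
               // see "metricKey@windowStart" instead of a Windowed<String>
               .map((windowedKey, sum) -> KeyValue.pair(
                       windowedKey.key() + "@" + windowedKey.window().start(), sum))
               .to(ROLLUP_TOPIC, Produced.with(Serdes.String(), Serdes.Double()));
    }
}
```

The key design point is that aggregation happens once, continuously, inside the streams application, so every downstream consumer reads the same pre-computed rollups instead of re-aggregating raw data on demand.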
You will walk away with an understanding of:
– Architecture choices for real-time aggregation
– Time semantics and handling of out-of-order events (see the sketch after this list)
– The partitioning and autoscaling story of the streaming platform
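As a hedged illustration of the time-semantics point above, the sketch below shows one common Kafka Streams approach: extract event time from the metric payload and give each window a grace period so out-of-order samples still land in the correct rollup. The MetricPoint type, the five-minute grace, and the final-result-only suppression are assumptions for illustration, not the session's actual configuration.

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Hypothetical payload carrying the time the sample was taken at the source.
record MetricPoint(String name, double value, long sourceTimestampMs) {}

// Event-time semantics: window placement follows the timestamp inside the payload,
// not the time the record reached the broker.
class MetricTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof MetricPoint point) {
            return point.sourceTimestampMs();
        }
        return partitionTime; // fall back to stream time for malformed records
    }
}

class WindowingPolicy {
    // A grace period keeps each one-minute window open for late, out-of-order
    // samples before the rollup is finalized (five minutes is an illustrative value).
    static final TimeWindows ROLLUP_WINDOW =
            TimeWindows.ofSizeAndGrace(Duration.ofMinutes(1), Duration.ofMinutes(5));

    // Emitting only the final result per window, rather than every intermediate
    // update, keeps the push to the downstream rollup topic small and predictable.
    static final Suppressed<Windowed> FINAL_ONLY =
            Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded());
}
```

The extractor is registered via the default.timestamp.extractor streams config, and the window plus suppression objects plug into the windowedBy/aggregate chain shown in the earlier sketch.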