[Chart: Messages produced per second (average), per year, 2015 to 2023; y-axis from 200.000 to 1.200.000]
• What is Observability and why do we need this?
• Three Pillars of observability
• Exposing Kafka Client Side Metrics
• How to interpret them? Consumer lag
• Kafka client-side metrics demo
• OpenTelemetry
• OpenTelemetry Collector
• Demo (Spring + Kafka + distributed tracing/logging/metrics)
• Wrap Up
• Questions
Tracing: recording the flow through the application(s) and the interactions between services.
• Helps: understand why something happened
Logging: recording individual events and data.
• Helps: understand what happened
• Hard to tell: the context
Metrics: recording aggregated time series data.
• Helps: understand the context, identify trends, alert
• Hard to tell: why something is not working as expected
[Figure: anatomy of a trace across Service A, Service B and Service C. Labels: 0912845005f1a035; Span Id 808740d73506eee2; time axis showing the sequence of operations, the duration of a single span and the duration of the total trace. Context is propagated over the network (Kafka header).]
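To make "context is propagated over the network" concrete, here is a minimal sketch of carrying a trace context in a message header. It assumes the W3C `traceparent` text format (`version-traceId-spanId-flags`), which the slide does not name, and it models Kafka record headers as a plain `Map<String, byte[]>` so that it runs without the Kafka client library. The class and method names and the trace id are made up for illustration; only the span id is taken from the slide.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class TraceContextHeaderSketch {

    // Header name assumed here: the W3C Trace Context header.
    static final String HEADER = "traceparent";

    /** The part of a trace that has to travel with the message. */
    static final class TraceContext {
        final String traceId;  // 32 hex chars, shared by every span in the trace
        final String spanId;   // 16 hex chars, the span that produced the message
        final boolean sampled;

        TraceContext(String traceId, String spanId, boolean sampled) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.sampled = sampled;
        }
    }

    /** Producer side: serialize the context into a header value. */
    static void inject(TraceContext ctx, Map<String, byte[]> headers) {
        String value = "00-" + ctx.traceId + "-" + ctx.spanId + "-" + (ctx.sampled ? "01" : "00");
        headers.put(HEADER, value.getBytes(StandardCharsets.UTF_8));
    }

    /** Consumer side: read the context back, or return null if the header is missing or malformed. */
    static TraceContext extract(Map<String, byte[]> headers) {
        byte[] raw = headers.get(HEADER);
        if (raw == null) {
            return null;
        }
        String[] parts = new String(raw, StandardCharsets.UTF_8).split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            return null;
        }
        return new TraceContext(parts[1], parts[2], "01".equals(parts[3]));
    }

    public static void main(String[] args) {
        // Stand-in for the headers of a single Kafka record.
        Map<String, byte[]> recordHeaders = new HashMap<>();

        // Hypothetical trace id; span id as shown on the slide.
        inject(new TraceContext("4bf92f3577b34da6a3ce929d0e0e4736", "808740d73506eee2", true), recordHeaders);

        TraceContext received = extract(recordHeaders);
        System.out.println("continuing trace " + received.traceId + " with parent span " + received.spanId);
    }
}
```

In practice an instrumentation library or agent does this injection and extraction for you; the sketch only shows why a consumer in another service can attach its spans to the producer's trace.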
Agent (runs inside the JVM process): java -javaagent:./jmx_prometheus_javaagent-0.18.0.jar=1234:kafka_clients.yml -jar your-kafka-application.jar
Exposes an HTTP endpoint (port 1234 in this example) serving metrics of the local JVM
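The agent reads its values from the JMX MBeans registered in the JVM. As a rough sketch of what that means, the snippet below queries the platform MBean server directly. The `kafka.consumer:type=consumer-fetch-manager-metrics` pattern is where Kafka consumers register their fetch metrics (such as `records-lag-max`); it matches nothing unless a consumer runs in the same JVM, so the built-in `java.lang:type=Memory` bean is read as well. The class name `JmxMetricsDump` is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxMetricsDump {

    /** Reads every readable attribute of every MBean whose name matches the pattern. */
    static Map<String, Object> read(String pattern) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new TreeMap<>();
        try {
            for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
                for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
                    if (!attr.isReadable()) {
                        continue;
                    }
                    try {
                        values.put(name + "/" + attr.getName(), server.getAttribute(name, attr.getName()));
                    } catch (Exception e) {
                        // Some attributes are unsupported on a given JVM; skip them.
                    }
                }
            }
        } catch (JMException e) {
            throw new IllegalArgumentException("Cannot query MBeans for pattern " + pattern, e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Empty unless a Kafka consumer is running in this JVM.
        read("kafka.consumer:type=consumer-fetch-manager-metrics,*")
                .forEach((k, v) -> System.out.println(k + " = " + v));

        // Built-in JVM bean, always present.
        read("java.lang:type=Memory")
                .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```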
Spring Boot
• De facto standard for building Java applications
• Very opinionated
• ‘Fat’ JAR (executable + all dependencies)
• Embedded webserver
• Production-ready features
• Micrometer: default instrumentation / observability library
Micrometer (micrometer.io)
• Application observability façade
• Think SLF4J, but for observability
• Metrics & Traces
• API to instrument your application
• Used by Spring projects
• Support for 19 popular monitoring systems
• Out-of-the-box metrics and traces for Kafka (using Spring Kafka)
[Figure: demo setup. Kafka topic ‘stock-quotes’ with 3 partitions (offsets 5 to 8 shown, old to new). Components: Producer (Spring Kafka, send), Kafka broker, Consumer (Spring Kafka, poll), REST API, Slow Service (HTTP API, REST call), Prometheus (HTTP call: scrape metrics), Grafana, and Kafka Lag Exporter (calculates consumer lag based on broker data; scraped by Prometheus over HTTP).]
don’t expect a big difference.
The consumer
• Reports lag only for the partition(s) it is actively consuming
• High lag? It doesn't switch partitions that often
• Is only aware of the last offset as of its most recent metadata pull
• Consumes up to that offset and thinks the lag is gone, because it has read up to that message
The producer
• Is still producing
• The actual offset grew in the meantime
• The consumer is not aware of that yet
The more instances you start, the more partition metrics get reported.
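The gap between the consumer's own view and the broker's view can be shown with a little arithmetic. The sketch below computes lag per partition as the end offset minus the committed offset, once with the broker's current end offsets and once with the stale end offset a consumer remembers for its single assigned partition. All offsets and names are made up for illustration; treating a partition without a committed offset as "nothing consumed yet" is also an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

public class ConsumerLagSketch {

    /** Lag per partition = end offset minus committed offset, never below zero. */
    static Map<Integer, Long> lagPerPartition(Map<Integer, Long> endOffsets, Map<Integer, Long> committed) {
        Map<Integer, Long> lag = new TreeMap<>();
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            // Assumption: no committed offset yet means nothing has been consumed.
            long consumed = committed.getOrDefault(e.getKey(), 0L);
            lag.put(e.getKey(), Math.max(0L, e.getValue() - consumed));
        }
        return lag;
    }

    public static void main(String[] args) {
        // Three partitions; the consumer group's committed offsets.
        Map<Integer, Long> committed = Map.of(0, 100L, 1, 80L, 2, 150L);

        // Broker-side view: current end offsets of all partitions.
        Map<Integer, Long> brokerEndOffsets = Map.of(0, 120L, 1, 80L, 2, 200L);

        // Consumer-side view: only partition 0 is assigned to this instance, and the
        // end offset dates from its last fetch, before the producer wrote 20 more records.
        Map<Integer, Long> consumerKnownEndOffsets = Map.of(0, 100L);

        System.out.println("broker view:   " + lagPerPartition(brokerEndOffsets, committed));
        // prints "broker view:   {0=20, 1=0, 2=50}"
        System.out.println("consumer view: " + lagPerPartition(consumerKnownEndOffsets, committed));
        // prints "consumer view: {0=0}"
    }
}
```

The consumer reports zero lag for the only partition it knows about, while the broker-based calculation still sees 70 unread records across the topic.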
something critical
• Monitoring its trend
• Important: know there is lag
• Lag keeps on increasing? The consumer has a problem!
• Alert on increasing trend of consumer lag
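One way to turn "alert on an increasing trend" into a rule is to fit a straight line through the most recent lag samples and alert when its slope exceeds a threshold. The sketch below does this with an ordinary least-squares slope. The sample values, the threshold and the names are made up for illustration, and in a Prometheus and Grafana setup the same idea would normally be written as an alert rule rather than in application code.

```java
public class LagTrendSketch {

    /** Least-squares slope of samples taken at a fixed interval (lag units per sample). */
    static double slope(long[] samples) {
        int n = samples.length;
        if (n < 2) {
            return 0.0;
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (long s : samples) {
            meanY += s;
        }
        meanY /= n;

        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (samples[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    /** Alert only when lag grows faster than minSlope per sample, not merely because it is high. */
    static boolean shouldAlert(long[] samples, double minSlope) {
        return slope(samples) > minSlope;
    }

    public static void main(String[] args) {
        long[] highButStable = {500, 510, 495, 505, 500};
        long[] growing = {100, 150, 210, 260, 330};

        System.out.println("high but stable: alert=" + shouldAlert(highButStable, 1.0));
        // prints "high but stable: alert=false"
        System.out.println("growing:         alert=" + shouldAlert(growing, 1.0));
        // prints "growing:         alert=true"
    }
}
```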
[Figure: OpenTelemetry architecture. Labels: functionality to collect, process and export telemetry data; encoding, transport, delivery; Prometheus, Kafka, Jaeger; the application sends telemetry data using an exporter library or using OTLP (gRPC or http/protobuf); telemetry data reaches observability and storage in a backend-specific format; instrumentation by library or automatically (agent); defines the interface for instrumenting code with traces, metrics and logs.]
[Figure: demo setup. Components: Kafka Streams application, Kafka consumer application, shaky downstream service, telemetry data (logs, metrics & traces), and Prometheus (scrapes metrics; time series database for metrics).]
• Aim for vendor-neutral solutions
  • Helps you migrate to a different observability backend
  • Minimal changes to your applications
• Micrometer
  • Overlap with OpenTelemetry
  • JVM only
  • No instrumentation for Kafka Streams (Traces) yet: https://github.com/micrometer-metrics/micrometer/issues/3713
  • Can send telemetry data using OpenTelemetry
    • ✅ Metrics
    • ✅ Traces (Spring Boot 3)
    • ❌ Logs
Logs not stable
• Start small
  • Java Agent will give you a kickstart
  • Minimum dev effort using auto instrumentation
• Sample traces: you most likely don’t want to store 100% of all traces
• Limitations
  • Stateful stream processing: KAFKA-7718
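To illustrate what sampling traces can look like, here is a minimal sketch of a ratio-based sampler that decides from the trace id alone, so every service that sees the same trace id makes the same keep-or-drop decision. This is a simplified stand-in written for illustration, not the OpenTelemetry SDK's implementation (the SDK ships its own ratio-based samplers); the class name, the trace ids and the 10% ratio are assumptions.

```java
import java.util.Random;

public class TraceSamplerSketch {

    /** Keeps roughly `ratio` of all traces; the decision depends only on the trace id. */
    static boolean shouldSample(String traceIdHex, double ratio) {
        if (ratio <= 0.0) {
            return false;
        }
        if (ratio >= 1.0) {
            return true;
        }
        // Interpret the last 15 hex characters (60 bits) of the trace id as a number in [0, 2^60).
        String tail = traceIdHex.substring(Math.max(0, traceIdHex.length() - 15));
        long value = Long.parseLong(tail, 16);
        long threshold = (long) (ratio * (1L << 60));
        return value < threshold;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        int kept = 0;
        int total = 100_000;
        for (int i = 0; i < total; i++) {
            // Random 32-hex-character trace id, as a tracing library would generate.
            String traceId = String.format("%016x%016x", random.nextLong(), random.nextLong());
            if (shouldSample(traceId, 0.10)) {
                kept++;
            }
        }
        System.out.println("kept " + kept + " of " + total + " traces");
    }
}
```

Because the decision is a pure function of the trace id, a trace is either stored completely or not at all, which avoids traces with missing spans.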