Kafka Summit London 2023 - Hope is not a course of action!

Hope is not a course of action! - A practical deep dive into Observability of Streaming Applications
Tim van Baarsen & Kosta Chuturkov, ING

About the Speakers
• The Netherlands - Amsterdam
• Team Dora
• Romania - Bucharest

ING (https://www.ing.jobs/)
• 60,000+ employees
• Serves 37+ million customers
• Corporate clients and financial institutions in over 40 countries

Kafka @ ING
Frontrunners in Kafka since 2014. Running in production:
• 8 years
• 6000+ topics
• Serving 1000+ development teams
• Self-service topic management

Kafka @ ING
Traffic is growing by 10% monthly.
[Chart: Messages produced per second (average), 2015 to 2023; y-axis from 0 to 1,200,000]

What are we going to cover today?
• What is Observability and why do we need this?
• Three Pillars of observability
• Exposing Kafka Client Side Metrics
• How to interpret them? Consumer lag
• Kafka client-side metrics demo
• OpenTelemetry
• OpenTelemetry Collector
• Demo (Spring + Kafka + distributed tracing/logging/metrics)
• Wrap Up
• Questions

Introduction to application Observability
Observability is the ability to measure the internal state of a system only by its external outputs (logs, metrics, and traces).

Why do we need this?

Why do we need this?
• Helps to investigate root causes of incidents
• Improve our software
• Prevent outages
• Better user experience for our customers

Three Pillars of Observability
Logging (events), Tracing, Metrics (aggregatable)
Tracing: recording flow through the application(s) and the interactions between services.
• Helps: understand why something happened
Logging: recording individual events and data.
• Helps: understand what happened
• Hard to tell: context
Metrics: recording time series data that can be aggregated.
• Helps: understand the context, identify trends, alert
• Hard to tell: why something is not working as expected

Logs
Log line:
2023-04-06T09:11:47.341Z INFO [spring-kafka-producer,210d9dc16597a0a8f1a9746c4bbd8277,6b350c56f69a6ee6] 7 --- [nio-8080-exec-1] c.e.rest.StockQuoteRestController : Produce stock quote via Rest API
The same event as an OpenTelemetry log record:
{
  "scope": { "name": "com.example.rest.StockQuoteRestController" },
  "logRecords": [
    {
      "timeUnixNano": "1680772307341000000",
      "severityNumber": 9,
      "severityText": "INFO",
      "body": { "stringValue": "Produce stock quote via Rest API" },
      "flags": 1,
      "traceId": "210d9dc16597a0a8f1a9746c4bbd8277",
      "spanId": "6b350c56f69a6ee6"
    }
  ]
}

Tracing
[Diagram: one trace spanning Service A, Service B and Service C, plotted over time]
• Trace Id: dfb987a57cbb9fb6e13a00d68413efa4
• (Root) Span Id: 06eeb7d5c3e8327e
• Span Id: 0912845005f1a035
• Span Id: 808740d73506eee2
• Time axis shows: sequence of operations, duration of a single span, duration of the total trace
• Context is propagated over the network (Kafka header)
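To make "context is propagated over the network" concrete, here is a minimal sketch of how a trace context could be encoded into and decoded from a Kafka header value. It assumes the W3C Trace Context `traceparent` format, which OpenTelemetry uses by default. The `TraceParent` class and its methods are hypothetical helpers for illustration, not part of any tracing library; in a real application the instrumentation does this for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: encodes/decodes a W3C "traceparent" value as Kafka header bytes. */
final class TraceParent {
    // Layout: version "00" - 32 hex chars trace id - 16 hex chars span id - 2 hex chars flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String spanId;
    final boolean sampled;

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Decodes the bytes of a "traceparent" header (Kafka header values are byte arrays). */
    static TraceParent parse(byte[] headerValue) {
        String text = new String(headerValue, StandardCharsets.UTF_8);
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a traceparent value: " + text);
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return new TraceParent(m.group(1), m.group(2), sampled);
    }

    /** Encodes this context so a producer could attach it to an outgoing record. */
    byte[] toHeaderValue() {
        String text = "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** A downstream span keeps the trace id and gets its own span id. */
    TraceParent childOf(String newSpanId) {
        return new TraceParent(traceId, newSpanId, sampled);
    }
}
```

The trace id stays the same across services while each hop contributes a new span id, which is what lets the backend stitch the spans of Service A, B and C into one trace.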

Metrics: Interceptors
• Kafka-clients API is pluggable
• Confluent Monitoring Interceptor
• Interceptors (interceptor.classes)
• Consumer
• Producer
• Send metrics to Kafka topic (_confluent-monitoring)
• Confluent Control Center

// Register the consumer interceptor on the consumer configuration
consumerProperties.put(
    ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG,
    "io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor");
// Register the producer interceptor on the producer configuration
producerProperties.put(
    ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
    "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor");

Metrics: Interceptors
Consumer lag metrics in Confluent Control Center

Metrics: Metric Reporters
• Metric Reporters (metric.reporters)
• Default: JMX
• View metrics in JConsole
• Prefer metrics in a time series database!
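As a rough illustration of what the default JMX reporter makes available, the sketch below reads one attribute from every matching MBean in the local JVM, which is essentially what JConsole shows interactively. `JmxMetricReader` is a hypothetical helper name; the JMX calls themselves are standard JDK APIs.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical helper: reads an attribute from MBeans registered in this JVM. */
final class JmxMetricReader {

    /** Returns MBean name -> attribute value for every MBean matching the pattern. */
    static Map<String, Object> read(String objectNamePattern, String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new TreeMap<>();
        try {
            // A null query means "no extra filter": every MBean matching the name pattern
            for (ObjectName name : server.queryNames(new ObjectName(objectNamePattern), null)) {
                values.put(name.getCanonicalName(), server.getAttribute(name, attribute));
            }
        } catch (JMException e) {
            throw new IllegalStateException("Could not read attribute " + attribute, e);
        }
        return values;
    }
}
```

Inside a JVM that runs a Kafka consumer, a call such as `JmxMetricReader.read("kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*", "records-lag-max")` should return one entry per consumer client; in a JVM without a consumer it simply returns an empty map. This only gives point-in-time values, which is why shipping the metrics to a time series database is preferable.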

Metrics: Java Agent
[Diagram: JVM process running your Kafka application with the JMX Exporter agent; Prometheus reaches it over HTTP]
java -javaagent:./jmx_prometheus_javaagent-0.18.0.jar=1234:kafka_clients.yml -jar your-kafka-application.jar
Exposes an HTTP endpoint serving metrics of the local JVM.

Metrics: Spring Boot & Micrometer
What is Spring Boot?
• De facto standard for building Java applications
• Very opinionated
• ‘Fat’ JAR (executable + all dependencies)
• Embedded webserver
• Production-ready features
• Micrometer is the default instrumentation / observability library

Metrics: Spring Boot & Micrometer
What is Micrometer? (Micrometer.io)
• Vendor-neutral application observability façade
• Think SLF4J, but for observability
• Metrics & Traces
• API to instrument your application
• Used by Spring projects
• Support for 19 popular monitoring systems
• Out of the box metrics, traces for Kafka (using Spring Kafka)

Metrics: Spring Boot & Micrometer
[Diagram: Kafka Consumer Application (Spring Boot Application, Spring Kafka, Kafka client), Prometheus and Grafana; labels: /actuator/prometheus (Scrape Metrics), /actuator/metrics, Query]
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
The actuator helps to manage & monitor your application:
• Metrics
• Health check
• Application info (version, git commit, etc.)

Metrics: Spring Boot & Micrometer
[Diagram: Kafka Consumer Application (Spring Boot Application, Spring Kafka, Kafka client) pushes metrics to Elasticsearch, viewed in Kibana; Prometheus and Grafana shown alongside]
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-elastic</artifactId>
</dependency>

Metrics: Spring Boot & Micrometer
[Diagram: Kafka Consumer Application (Spring Boot Application, Spring Kafka, Kafka client) pushes metrics over the OTLP protocol to an OpenTelemetry metrics backend; Prometheus, Grafana, Elasticsearch and Kibana shown alongside]
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-otlp</artifactId>
</dependency>

Metrics: Client side metrics Demo
[Diagram: demo setup]
• Kafka broker with topic ‘stock-quotes’ (3 partitions)
• Producer (Spring Kafka): exposes a REST API and sends to the topic
• Consumer (Spring Kafka): polls the topic and makes an HTTP REST call to a slow service
• Kafka Lag Exporter: calculates consumer lag based on broker data
• Prometheus: scrapes metrics over HTTP, including from Kafka Lag Exporter
• Grafana

Consumer lag metrics difference, why?
Problem: as a developer I don’t expect a big difference.
The consumer
• Reports lag only for the partition(s) it is actively consuming
• High lag? It doesn't switch partitions that often
• Only knows the latest offset as of its most recent metadata fetch
• Consumes up to that offset and thinks the lag is gone, because it has read up to that message
The producer
• Is still producing
• The actual offset grew in the meantime
• The consumer is not aware of that yet
The more instances you start, the more partition metrics get reported.
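The difference between the two views can be sketched with a deliberately simplified model. It assumes lag is "end offset minus consumer position" per partition; the class and method names are hypothetical and the numbers in the usage note are made up for illustration, not taken from the demo.

```java
import java.util.Map;

/** Hypothetical model of client-side versus broker-side consumer lag. */
final class ConsumerLagViews {

    /** Lag for one partition; never negative. */
    static long lag(long logEndOffset, long position) {
        return Math.max(0L, logEndOffset - position);
    }

    /**
     * Client-side view: only the partitions this instance is consuming,
     * measured against the end offsets it learned at its last fetch.
     */
    static long clientSideLag(Map<Integer, Long> endOffsetsAtLastFetch,
                              Map<Integer, Long> positions) {
        long total = 0;
        for (Map.Entry<Integer, Long> entry : positions.entrySet()) {
            Long knownEnd = endOffsetsAtLastFetch.get(entry.getKey());
            if (knownEnd != null) {
                total += lag(knownEnd, entry.getValue());
            }
        }
        return total;
    }

    /**
     * Broker-side view: every partition of the topic, measured against
     * the current end offsets and the committed offsets of the group.
     */
    static long brokerSideLag(Map<Integer, Long> currentEndOffsets,
                              Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> entry : currentEndOffsets.entrySet()) {
            total += lag(entry.getValue(), committedOffsets.getOrDefault(entry.getKey(), 0L));
        }
        return total;
    }
}
```

For example, with three partitions whose current end offset is 100 and committed offsets of 90, 50 and 40, the broker-side lag is 120. An instance that is only working on partition 0 and last saw end offset 90 reports a client-side lag of 0 at the same moment.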

Lessons Learned
Takeaways
• Metric precision should not be critical
• Monitor the trend instead
• Important: know there is lag
• Lag keeps on increasing?
• The consumer has a problem!
• Alert on an increasing trend of consumer lag
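One possible way to alert on the trend rather than on a single value is to fit a line through the most recent lag samples and look at its slope. The sketch below assumes equally spaced samples and uses a plain least-squares fit; `LagTrend` and the threshold parameter are hypothetical, and in practice this logic would more likely live in an alerting rule of the monitoring backend.

```java
/** Hypothetical helper: detects a rising consumer lag trend from recent samples. */
final class LagTrend {

    /** Least-squares slope of equally spaced samples, in lag units per sample interval. */
    static double slope(long[] samples) {
        int n = samples.length;
        if (n < 2) {
            return 0.0; // not enough data to speak of a trend
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (long sample : samples) {
            meanY += sample;
        }
        meanY /= n;

        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (samples[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    /** Alert only when lag grows faster than the given slope, not on noisy fluctuation. */
    static boolean shouldAlert(long[] samples, double minSlope) {
        return slope(samples) > minSlope;
    }
}
```

Lag that fluctuates around a stable level yields a slope near zero and stays quiet, while lag that keeps climbing between samples crosses the threshold regardless of how precise each individual measurement is.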

What is OpenTelemetry?
OpenCensus + OpenTracing = OpenTelemetry
What is Software Telemetry?
• Collection of data on the use, performance and behaviour of applications and their components

OpenTelemetry Components
• API (per programming language): defines the interface for instrumenting code with traces, metrics and logs
• SDKs (per programming language)
• Instrumentation: Library, Auto (Agent)
• Collector: functionality to collect, process and export telemetry data
• Protocol (OTLP): encoding, transport, delivery
[Diagram: the application sends telemetry data either through an exporter library in a backend-specific format, or over OTLP (gRPC or http/protobuf), to observability and storage backends such as Prometheus, Kafka and Jaeger]

OpenTelemetry Collector Components
• Optional component
• No change in the application needed when switching Logging/Metrics/Tracing backends
[Diagram: Collector made up of Receivers, Processors and Exporters]

Receivers
• Prometheus
• Kafka

Processors (how to handle received data)
• Memory Limiter
• Batch Processor
• Filter Processor

Exporters
• Tempo
• Jaeger
• Kafka

OpenTelemetry Demo: Components
[Diagram: demo components]
• Client: makes a REST call
• Producer (Spring Kafka): sends to the Kafka broker
• Kafka broker topics: ‘stock-quotes’, ‘stock-quotes-exchange-nyse’, ‘stock-quotes-exchange-nasdaq’, ‘stock-quotes-exchange-ams’
• Kafka Streams App: polls and sends
• Consumer (Spring Kafka): polls and makes an HTTP REST call to the Shaky Downstream Service
• Kafka Plain Consumer: polls

OpenTelemetry Demo
[Diagram: Kafka Producer Application, Kafka Streams Application, Kafka Consumer Application, Plain Kafka Consumer Application, Shaky Downstream Service and Kafka; telemetry data (logs, metrics & traces) flows to the Collector]

Prometheus: time series database for metrics
[Diagram: Kafka Streams Application, Kafka Consumer Application and Shaky Downstream Service emit telemetry data (logs, metrics & traces); Prometheus scrapes the metrics]

Grafana Loki: centralized logging
[Diagram: Kafka Streams Application, Kafka Consumer Application and Shaky Downstream Service emit telemetry data (logs, metrics & traces); logs go to Grafana Loki and are viewed in Grafana]

Jaeger and Grafana Tempo: traces
[Diagram: Kafka Streams Application, Kafka Consumer Application and Shaky Downstream Service emit telemetry data (logs, metrics & traces); traces go to Jaeger and Grafana Tempo]

Wrap up
• Many different ways to observe your applications
• Aim for vendor-neutral solutions
  • Helps you migrate to a different observability backend
  • Minimal changes to your applications
• Micrometer
  • Overlap with OpenTelemetry
  • JVM only
  • No instrumentation for Kafka Streams (Traces) yet: https://github.com/micrometer-metrics/micrometer/issues/3713
  • Can send telemetry data using OpenTelemetry
    • ✅ Metrics
    • ✅ Traces (Spring Boot 3)
    • ❌ Logs

Wrap up
• Consumer (lag) metrics
  • Client side vs broker side metrics
  • Monitor the trend!
• Select and ship only the metrics you need

Wrap up
• OpenTelemetry
  • Language agnostic
  • Specification for Logs not stable
• Start small
  • Java Agent will give you a kickstart
  • Minimum dev effort using auto instrumentation
• Sample traces. You most likely don’t want to store 100% of all traces
• Limitations
  • Stateful stream processing: KAFKA-7718
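To illustrate what sampling traces can look like, here is a minimal sketch of ratio-based sampling keyed on the trace id. It is loosely modeled on the idea behind trace-id-ratio samplers and is not OpenTelemetry's actual implementation; `TraceIdRatioSampler` is a hypothetical class and the ratios in the usage note are arbitrary.

```java
/** Hypothetical sampler: keeps roughly the given fraction of traces, decided per trace id. */
final class TraceIdRatioSampler {
    private final double ratio;

    TraceIdRatioSampler(double ratio) {
        if (!(ratio >= 0.0 && ratio <= 1.0)) {
            throw new IllegalArgumentException("ratio must be between 0.0 and 1.0");
        }
        this.ratio = ratio;
    }

    /** Expects a 32 character lowercase hex trace id. */
    boolean shouldSample(String traceId) {
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        // Use the lower 64 bits of the trace id as the random part and drop
        // the lowest bit so the value fits in a non-negative long.
        String lowerHalf = traceId.substring(traceId.length() - 16);
        long randomPart = Long.parseUnsignedLong(lowerHalf, 16) >>> 1;
        return randomPart < (long) (ratio * Long.MAX_VALUE);
    }
}
```

Because the decision depends only on the trace id, every service that sees the same trace reaches the same verdict, so a trace is either stored completely or not at all. With `new TraceIdRatioSampler(0.1)` roughly one in ten traces would be kept, assuming trace ids are uniformly random.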

Questions? 🤔 ❔
Demo codebase: https://github.com/j-tim/kafka-summit-london-2023
