
Kafka Summit London 2023 - Hope is not a course...

Tim
June 22, 2023

Kafka Summit London 2023 - Hope is not a course of action!

Our slides for the Kafka Summit London 2023 talk "Hope is not a course of action!"
Speakers: Tim van Baarsen & Kosta Chuturkov


Transcript

  1. Hope is not a course of action! - A practical deep dive into Observability of Streaming Applications

    ING
    Tim van Baarsen & Kosta Chuturkov
  2. ING (https://www.ing.jobs/)

    • 60,000+ employees
    • Serve 37+ million customers
    • Corporate clients and financial institutions in over 40 countries
  3. Kafka @ ING

    Frontrunners in Kafka since 2014. Running in production:
    • 8 years
    • 6000+ topics
    • Serving 1000+ development teams
    • Self-service topic management
  4. Kafka @ ING

    Traffic is growing by 10% per month.
    [Chart: messages produced per second (average), per year from 2015 to 2023; y-axis from 0 to 1,200,000]
  5. What are we going to cover today?

    • What is Observability and why do we need this?
    • Three Pillars of Observability
    • Exposing Kafka client-side metrics
    • How to interpret them? Consumer lag
    • Kafka client-side metrics demo
    • OpenTelemetry
    • OpenTelemetry Collector
    • Demo (Spring + Kafka + distributed tracing/logging/metrics)
    • Wrap up
    • Questions
  6. Introduction to application Observability

    Observability is the ability to measure the internal state of a system using only its external outputs (logs, metrics, and traces).
  7. Why do we need this?

    • Helps to investigate root causes of incidents
    • Improve our software
    • Prevent outages
    • Better user experience for our customers
  8. Three Pillars of Observability

    Logging (events), Metrics (aggregatable), Tracing

    Tracing: recording the flow through the application(s) and the interactions between services.
    Helps:
    • understand why something happened

    Logging: recording individual events and data.
    Helps:
    • understand what happened
    Hard to tell: the context

    Metrics: recording time series data that can be aggregated.
    Helps:
    • understand the context
    • identify trends
    • alert
    Hard to tell: why something is not working as expected
  9. Logs

    2023-04-06T09:11:47.341Z INFO [spring-kafka-producer,210d9dc16597a0a8f1a9746c4bbd8277,6b350c56f69a6ee6] 7 --- [nio-8080-exec-1] c.e.rest.StockQuoteRestController : Produce stock quote via Rest API

    {
      "scope": { "name": "com.example.rest.StockQuoteRestController" },
      "logRecords": [
        {
          "timeUnixNano": "1680772307341000000",
          "severityNumber": 9,
          "severityText": "INFO",
          "body": { "stringValue": "Produce stock quote via Rest API" },
          "flags": 1,
          "traceId": "210d9dc16597a0a8f1a9746c4bbd8277",
          "spanId": "6b350c56f69a6ee6"
        }
      ]
    }
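The plain log line and the JSON log record on this slide carry the same timestamp in two encodings: an ISO-8601 instant and nanoseconds since the Unix epoch (`timeUnixNano`). A minimal sketch of the conversion, using only the JDK; the class and method names are hypothetical and not part of any OpenTelemetry API:

```java
import java.time.Instant;

// Hypothetical helper for illustration; not an OpenTelemetry class.
class OtlpTimestamps {

    /** Converts a "timeUnixNano" string (nanoseconds since the Unix epoch) to an Instant. */
    static Instant fromUnixNano(String timeUnixNano) {
        long nanos = Long.parseLong(timeUnixNano);
        // Split into whole seconds and the remaining nanoseconds.
        return Instant.ofEpochSecond(nanos / 1_000_000_000L, nanos % 1_000_000_000L);
    }

    public static void main(String[] args) {
        // Value taken from the log record on the slide.
        System.out.println(fromUnixNano("1680772307341000000")); // prints 2023-04-06T09:11:47.341Z
    }
}
```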
  10. Tracing

    [Diagram: one trace across Service A, Service B and Service C, plotted over time]
    • Trace Id: dfb987a57cbb9fb6e13a00d68413efa4
    • (Root) Span Id: 06eeb7d5c3e8327e
    • Span Id: 0912845005f1a035
    • Span Id: 808740d73506eee2
    • Shows the sequence of operations, the duration of a single span and the duration of the total trace
    • Context is propagated over the network (Kafka header)
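To make "context is propagated over the network (Kafka header)" concrete, here is a minimal sketch of encoding and decoding a trace context as a header value. It assumes the W3C Trace Context `traceparent` layout (`version-traceId-spanId-flags`), which the slides do not name explicitly; the `TraceParent` class is hypothetical and the validation is simplified. Kafka header values are byte arrays, so the sketch converts to and from UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical value class for illustration; real applications would rely on
// an instrumentation library to write and read this header.
class TraceParent {
    final String traceId; // 32 lowercase hex characters
    final String spanId;  // 16 lowercase hex characters
    final boolean sampled;

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Encodes the context as "00-<traceId>-<spanId>-<flags>" in UTF-8 bytes. */
    byte[] toHeaderValue() {
        String value = "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
        return value.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes a header value; throws IllegalArgumentException if it is malformed. */
    static TraceParent parse(byte[] headerValue) {
        String[] parts = new String(headerValue, StandardCharsets.UTF_8).split("-");
        if (parts.length != 4
                || !parts[0].equals("00")
                || !parts[1].matches("[0-9a-f]{32}")
                || !parts[2].matches("[0-9a-f]{16}")
                || !parts[3].matches("[0-9a-f]{2}")) {
            throw new IllegalArgumentException("Malformed traceparent value");
        }
        // The lowest bit of the flags byte marks the trace as sampled.
        boolean sampled = (Integer.parseInt(parts[3], 16) & 1) == 1;
        return new TraceParent(parts[1], parts[2], sampled);
    }

    public static void main(String[] args) {
        // Trace id and span id taken from the slide.
        TraceParent sent = new TraceParent(
                "dfb987a57cbb9fb6e13a00d68413efa4", "0912845005f1a035", true);
        TraceParent received = TraceParent.parse(sent.toHeaderValue());
        System.out.println(received.traceId + " / " + received.spanId);
    }
}
```

The consuming service creates its own span with a new span id but keeps the trace id, which is what ties the spans of Service A, B and C into one trace.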
  11. Metrics: Interceptors

    • Kafka-clients API is pluggable
    • Confluent Monitoring Interceptor
    • Interceptors (interceptor.classes)
    • Consumer
    • Producer
    • Send metrics to Kafka topic (_confluent-monitoring)
    • Confluent Control Center

    // Both constants resolve to the "interceptor.classes" key; use the one that matches the client type.
    consumerProperties.put(
        ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG,
        "io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor");
    producerProperties.put(
        ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
        "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor");
  12. Metrics: Metric Reporters

    • Metric Reporters (metric.reporters)
    • Default: JMX
    • View metrics in JConsole
    • Prefer metrics in a time series database!
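Because the default reporter publishes to JMX, the same MBeans that JConsole shows can also be listed from code running inside the application's JVM. A minimal sketch using only the JDK's `javax.management` API; the helper class is hypothetical, and the assumption is that the Kafka consumer registers its MBeans under the `kafka.consumer` JMX domain (the result is simply empty when no Kafka client runs in the JVM).

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper for illustration.
class KafkaJmxMetrics {

    /** Returns the sorted names of all MBeans in this JVM that match the given pattern. */
    static Set<String> mbeanNames(String pattern) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Set<String> names = new TreeSet<>();
        try {
            for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
                names.add(name.getCanonicalName());
            }
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid MBean name pattern: " + pattern, e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumption: consumer metrics live under the "kafka.consumer" domain.
        System.out.println(mbeanNames("kafka.consumer:*"));
        // JVM MBeans are always present, which makes them handy for a smoke test.
        System.out.println(mbeanNames("java.lang:type=Runtime"));
    }
}
```

This only reads the local JVM, which is also why the slides recommend shipping the metrics to a time series database instead of inspecting them by hand.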
  13. Metrics: Java Agent

    [Diagram: the JMX Exporter agent runs inside the JVM process of your Kafka application; Prometheus reaches it over HTTP]

    java -javaagent:./jmx_prometheus_javaagent-0.18.0.jar=1234:kafka_clients.yml -jar your-kafka-application.jar

    Exposes an HTTP endpoint serving metrics of the local JVM
  14. Metrics: Spring Boot & Micrometer

    What is Spring Boot?
    • De facto standard for building Java applications
    • Very opinionated
    • ‘Fat’ JAR (executable + all dependencies)
    • Embedded web server
    • Production-ready features
    • Micrometer is the default instrumentation / observability library
  15. Metrics: Spring Boot & Micrometer

    What is Micrometer? (Micrometer.io)
    • Vendor-neutral application observability façade
    • Think SLF4J, but for observability
    • Metrics & Traces
    • API to instrument your application
    • Used by Spring projects
    • Support for 19 popular monitoring systems
    • Out-of-the-box metrics and traces for Kafka (using Spring Kafka)
  16. Metrics: Spring Boot & Micrometer

    [Diagram: Kafka Consumer Application (Spring Boot application with Spring Kafka and the Kafka client), Prometheus and Grafana; endpoints /actuator/prometheus and /actuator/metrics; labels: Scrape Metrics, Query]

    <dependency>
      <groupId>io.micrometer</groupId>
      <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>

    Spring Boot Actuator helps to manage & monitor your application:
    • Metrics
    • Health check
    • Application info (version, git commit, etc.)
  17. Metrics: Spring Boot & Micrometer

    [Diagram: Kafka Consumer Application (Spring Boot application with Spring Kafka and the Kafka client), Prometheus, Grafana, Elasticsearch and Kibana; label: Push metrics]

    <dependency>
      <groupId>io.micrometer</groupId>
      <artifactId>micrometer-registry-elastic</artifactId>
    </dependency>
  18. Metrics: Spring Boot & Micrometer

    [Diagram: Kafka Consumer Application (Spring Boot application with Spring Kafka and the Kafka client), Prometheus, Grafana, Elasticsearch, Kibana and an OpenTelemetry metrics backend; label: Push metrics (OTLP protocol)]

    <dependency>
      <groupId>io.micrometer</groupId>
      <artifactId>micrometer-registry-otlp</artifactId>
    </dependency>
  19. Metrics: Client side metrics Demo

    [Diagram of the demo setup; partition offsets 0 to 8, from old to new]
    • Producer (Spring Kafka): exposes a Rest API, sends to the Kafka broker
    • Kafka topic: ‘stock-quotes’, 3 partitions
    • Consumer (Spring Kafka): polls the Kafka broker, makes a Rest call to a Slow Service (HTTP API)
    • Kafka Lag exporter: calculates consumer lag based on broker data
    • Prometheus: scrapes metrics via HTTP calls
    • Grafana
  20. Consumer lag metrics difference, why?

    Problem: as a developer I don’t expect a big difference.

    The consumer
    • Reports lag only for the partition(s) it is actively consuming
    • High lag? Doesn't switch partitions that often
    • Only knows the latest offset as of its most recent metadata pull
    • Consumes to that offset and thinks the lag is gone, as it read up to that message

    The producer
    • Still producing
    • Actual offset grew in the meantime
    • Consumer is not aware of that yet

    The more instances you start, the more partition metrics will get reported
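The gap between the two views can be illustrated with a small calculation. The sketch below computes lag as log end offset minus consumer position, once with the end offsets the consumer last saw and once with the current broker-side end offsets. All names and numbers are made up for illustration; this is not the Kafka client's or the lag exporter's implementation.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the Kafka client API.
class ConsumerLag {

    /**
     * Lag per partition = log end offset - consumer position (never negative).
     * Partitions without a known end offset are skipped, mirroring a client
     * that only reports the partitions it is actively consuming.
     */
    static Map<Integer, Long> lagPerPartition(Map<Integer, Long> endOffsets,
                                              Map<Integer, Long> positions) {
        Map<Integer, Long> lag = new TreeMap<>();
        for (Map.Entry<Integer, Long> entry : positions.entrySet()) {
            Long end = endOffsets.get(entry.getKey());
            if (end != null) {
                lag.put(entry.getKey(), Math.max(0L, end - entry.getValue()));
            }
        }
        return lag;
    }

    /** Sum of the per-partition lag. */
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> positions) {
        long total = 0;
        for (long value : lagPerPartition(endOffsets, positions).values()) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        // Consumer positions for the 3 partitions of the topic (made-up numbers).
        Map<Integer, Long> positions = Map.of(0, 100L, 1, 80L, 2, 50L);
        // Client view: stale end offsets, and only for the partitions it recently fetched.
        Map<Integer, Long> clientEndOffsets = Map.of(0, 100L, 1, 90L);
        // Broker view: the producer kept producing in the meantime.
        Map<Integer, Long> brokerEndOffsets = Map.of(0, 130L, 1, 120L, 2, 75L);

        System.out.println("client-side lag: " + totalLag(clientEndOffsets, positions)); // 10
        System.out.println("broker-side lag: " + totalLag(brokerEndOffsets, positions)); // 95
    }
}
```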
  21. Lessons Learned

    Takeaways
    • Metric precision should not be something critical
    • Monitor its trend
    • Important: know there is lag
    • Lag keeps on increasing?
    • The consumer has a problem!
    • Alert on an increasing trend of consumer lag
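One way to alert on the trend rather than on individual values is to fit a line through the most recent lag samples and alert when its slope exceeds a threshold. The following is a minimal sketch of that idea under the assumption that samples are taken at a fixed interval; the class, method names and threshold are hypothetical. In a Prometheus and Grafana setup such a check would more typically be expressed as an alerting rule than in application code.

```java
import java.util.List;

// Hypothetical helper for illustration.
class LagTrend {

    /** Least-squares slope of equally spaced lag samples (lag change per sample). */
    static double slope(List<Long> samples) {
        int n = samples.size();
        if (n < 2) {
            return 0.0; // not enough data to speak of a trend
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (long sample : samples) {
            meanY += sample;
        }
        meanY /= n;

        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (samples.get(i) - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    /** True when lag grows faster than minSlope per sample over the window. */
    static boolean shouldAlert(List<Long> samples, double minSlope) {
        return slope(samples) > minSlope;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlert(List.of(10L, 20L, 30L, 40L), 5.0));  // steadily growing lag
        System.out.println(shouldAlert(List.of(100L, 5L, 100L, 5L), 5.0)); // spiky, but not growing
    }
}
```

A slope over a window is deliberately insensitive to a single imprecise sample, which fits the point that metric precision should not be critical.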
  22. What is OpenTelemetry?

    OpenCensus + OpenTracing = OpenTelemetry

    What is Software Telemetry?
    • Collection of data on the use, performance and behaviour of applications and their components
  23. OpenTelemetry Components

    • API: defines the interface for instrumenting code with traces, metrics and logs
    • SDKs: per programming language; library or auto (agent) instrumentation
    • Collector: functionality to collect, process and export telemetry data
    • Protocol (OTLP): encoding, transport, delivery

    [Diagram: the application sends telemetry data either using an exporter library or using OTLP (gRPC or http/protobuf); telemetry data reaches the observability and storage backends (Prometheus, Kafka, Jaeger) in a backend-specific format]
  24. OpenTelemetry Collector Components

    • Optional component
    • No change in the application needed when switching logging/metrics/tracing backends

    [Diagram: Collector with Receivers, Processors and Exporters]
  25. Processors (how to handle received data)

    [Diagram: Collector with Receivers, Processors and Exporters]
    Processors:
    • Memory Limiter
    • Batch Processor
    • Filter Processor
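As a purely conceptual model of what the filter and batch processors do between receivers and exporters, the sketch below filters a list of records and groups the remainder into fixed-size batches. It is not the Collector's API or configuration format (the real Collector is configured declaratively and, for example, also flushes batches on a timeout), and all names here are hypothetical. The memory limiter is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical conceptual model; not OpenTelemetry Collector code.
class MiniPipeline {

    /** Filter step: keep only the records that match the predicate. */
    static List<String> filter(List<String> records, Predicate<String> keep) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (keep.test(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    /** Batch step: group records into batches of at most maxBatchSize. */
    static List<List<String>> batch(List<String> records, int maxBatchSize) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be at least 1");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < records.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, records.size());
            batches.add(new ArrayList<>(records.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // "Received" records, here just made-up metric names.
        List<String> received = List.of("kafka.lag", "jvm.gc", "kafka.rate", "kafka.bytes");
        List<String> kept = filter(received, name -> name.startsWith("kafka."));
        // Each batch would be handed to an exporter in one call.
        System.out.println(batch(kept, 2));
    }
}
```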
  26. OpenTelemetry Demo: Components

    [Diagram of the demo components]
    • Client: REST call
    • Producer (Spring Kafka): sends to the Kafka broker
    • Consumer (Spring Kafka): polls, makes an HTTP Rest call to a Shaky Downstream Service
    • Kafka Streams App: polls and sends
    • Plain Kafka Consumer: polls
    • Topics: ‘stock-quotes’, ‘stock-quotes-exchange-nyse’, ‘stock-quotes-exchange-nasdaq’, ‘stock-quotes-exchange-ams’
  27. OpenTelemetry Demo

    [Diagram: Kafka Producer Application, Kafka Consumer Application, Kafka Streams Application, Plain Kafka Consumer Application, Shaky Downstream Service, Kafka and the Collector; telemetry data (logs, metrics & traces)]
  28. OpenTelemetry Demo

    [Diagram: the same applications and the Collector, plus Prometheus (time series database for metrics); labels: Telemetry data (logs, metrics & traces), Scrape Metrics]
  29. OpenTelemetry Demo

    [Diagram: the same applications and the Collector, plus Grafana Loki (centralized logging) and Grafana; labels: Telemetry data (logs, metrics & traces), Logs]
  30. OpenTelemetry Demo

    [Diagram: the same applications and the Collector, plus Jaeger and Grafana Tempo; labels: Telemetry data (logs, metrics & traces), Traces]
  31. Wrap up

    • Many different ways to observe your applications
    • Aim for vendor-neutral solutions
      • Helps you migrate to a different observability backend
      • Minimal changes to your applications
    • Micrometer
      • Overlap with OpenTelemetry
      • JVM only
      • No instrumentation for Kafka Streams (Traces) yet: https://github.com/micrometer-metrics/micrometer/issues/3713
      • Can send telemetry data using OpenTelemetry
        • ✅ Metrics
        • ✅ Traces (Spring Boot 3)
        • ❌ Logs
  32. Wrap up

    • Consumer (lag) metrics
    • Client-side vs broker-side metrics
    • Monitor the trend!
    • Select and ship only the metrics you need
  33. Wrap up

    • OpenTelemetry
      • Language agnostic
      • Specification for Logs not stable
    • Start small
      • Java Agent will give you a kickstart
      • Minimum dev effort using auto instrumentation
    • Sample traces. You most likely don’t want to store 100% of all traces
    • Limitations
      • Stateful stream processing: KAFKA-7718
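To illustrate what sampling traces can look like, here is a minimal sketch of a ratio-based decision derived from the trace id, so that every service seeing the same trace id reaches the same decision. It follows the general idea of trace-id ratio sampling but is not OpenTelemetry's sampler implementation; the `RatioSampler` class is hypothetical and assumes 32-character hex trace ids whose lower 64 bits are effectively random.

```java
// Hypothetical sampler for illustration; not an OpenTelemetry class.
class RatioSampler {
    private final double ratio;

    /** @param ratio fraction of traces to keep, between 0.0 and 1.0 */
    RatioSampler(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be between 0.0 and 1.0");
        }
        this.ratio = ratio;
    }

    /** Deterministic decision based on the lower 64 bits of the trace id. */
    boolean shouldSample(String traceId) {
        if (traceId.length() != 32) {
            throw new IllegalArgumentException("trace id must be 32 hex characters");
        }
        if (ratio <= 0.0) {
            return false;
        }
        if (ratio >= 1.0) {
            return true;
        }
        long lowerBits = Long.parseUnsignedLong(traceId.substring(16), 16);
        // Drop the lowest bit so the value falls in [0, Long.MAX_VALUE].
        long randomPart = lowerBits >>> 1;
        return randomPart < (long) (ratio * Long.MAX_VALUE);
    }

    public static void main(String[] args) {
        String traceId = "dfb987a57cbb9fb6e13a00d68413efa4"; // trace id from the Tracing slide
        System.out.println(new RatioSampler(0.5).shouldSample(traceId));
        System.out.println(new RatioSampler(0.9).shouldSample(traceId));
    }
}
```

Deciding from the trace id instead of a fresh random number keeps traces complete: either all spans of a trace are kept or none are.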