


Kafka Streams Pushed Hard: Lessons from Stream Processing at Scale - Apache Kafka® Meetup 2026

Kafka Streams is a powerful choice for stateful stream processing: a library, not a cluster, that runs alongside your application and handles exactly-once semantics, fault tolerance, and partition-local state out of the box. At KOR, we build financial-grade trade reporting infrastructure on top of it. That means regulatory deadlines, SLA requirements, and zero tolerance for data loss. We pushed Kafka Streams hard.

This talk is about what we found.

We will walk through four walls we hit in production. Not theoretical limitations, but real engineering decisions with real consequences. First, what happens when your access patterns outgrow key-value stores, and why plugging in a custom document database is a much larger commitment than the API suggests. Second, why Interactive Queries can lead you to accidentally build a distributed database inside your application. Third, why making external calls from within a stream processor will eventually crash your application, and the request-response offloading pattern with a parking lot that solves it. Fourth, the partition ceiling, the hard limit on Kafka Streams parallelism, what your options are when you hit it, and why we studied Flink seriously but chose not to run it.
Each section covers a constraint, the wall it created, and the exit we found or considered. Some exits are clean. Some are costly. All of them are honest.

If you are running Kafka Streams in production, or planning to, this talk will save you some expensive discoveries.


Andreas Evers

May 07, 2026


Transcript

  1. Kafka Streams Pushed Hard: four walls, progressively discovered.
     01 State Stores: key-value is not enough.
     02 Interactive Queries: you're building a distributed DB.
     03 External Databases: your stream threads time out.
     04 Partition Ceiling: horizontal scale has a hard limit.
     Each solution reveals the next constraint.
  2. Introduction: why trade reporting exists.
     2008: the global financial crisis. OTC derivatives: trillions in invisible exposure.
     2009 — G20 Pittsburgh Summit: all OTC derivative trades must be reported to trade repositories. Regulators can now see where risk is building up.
     Regulators: CFTC [US] · SEC [US] · ESMA [EU] · FCA [UK] · ACER [EU] · MAS [SG] · OSC [CA] · ASIC [AU].
  3. Introduction: what KOR builds.
     Market participants (banks · funds · clearing houses) → KOR (translation · enrichment · validation · submission · reconciliation) → trade repositories (KOR · DTCC · REGIS-TR · UnaVista).
     KOR also builds and hosts trade repositories — operating on both sides of the reporting chain.
  4. Introduction: who we are technically.
     100% remote across time zones. XP-first: pair programming · TDD · continuous feedback. AI-amplified: XP practices make AI outcomes better.
     Stack: Java · Spring Boot · Kafka Streams · Databricks · Spark · Terraform IaC · GitOps · advanced CD · Angular frontend.
     Confluent engineers told us: "The most advanced usage of Kafka Streams we have ever seen." Some joined KOR because of it.
  5. §1 — State Stores: When Key-Value Is Not Enough.
     RocksDB works great when: data volume fits within RocksDB assumptions; the access pattern is predictable and key-based; lookups are point lookups by partition key; state is write-heavy and append-style.
     Problems start when: data grows significantly; you need document-style queries; index-based or range access is required; graph or multi-field lookups emerge.
  6. §1 — Custom State Stores: The Hard Parts.
     We tried NitriteDB, an embedded document database, as a custom StateStore. It broke down in pre-production.
     Restore on rebalance: Kafka Streams restores state at the byte level. RocksDB handles this natively; a document DB routes every record through its full API, including index updates, so restore becomes slow and painful.
     Consistency during the restore window: while restoring, your store is in a partial state. RocksDB handles this transparently because restore is fast; with a custom store, the window is long enough to be operationally significant.
     Flush and checkpoint contract: the StateStore flush cycle is tightly coupled to Kafka Streams' commit semantics. Your store must honour exactly-once guarantees through this cycle — or you silently break fault tolerance.
     Yours forever: every Kafka Streams upgrade is a compatibility question. The surface area of the contract is much larger than the API suggests.
     We went in thinking we were replacing a storage backend. We were re-implementing a fault-tolerance protocol.
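The restore asymmetry on slide 6 can be sketched with a stdlib-only model (all names here are hypothetical; the real contract is Kafka Streams replaying raw changelog key/value bytes into the store): a byte-oriented store accepts the bytes as-is, while a document store must deserialize and re-index every record on the restore path.

```java
import java.util.*;

/** Stdlib model of changelog restore (hypothetical names).
 *  Kafka Streams replays raw key/value bytes; what the store does
 *  with them decides how slow restore-on-rebalance is. */
public class RestoreCost {

    /** RocksDB-style: restore is a blind byte-level bulk write. */
    static Map<String, byte[]> restoreBytes(Map<String, byte[]> changelog) {
        return new HashMap<>(changelog);            // no parsing, no indexing
    }

    /** Document-DB-style: every record is parsed and re-indexed. */
    static Map<String, Set<String>> restoreDocuments(Map<String, byte[]> changelog) {
        Map<String, Set<String>> index = new HashMap<>();
        for (var e : changelog.entrySet()) {
            String doc = new String(e.getValue());  // deserialize each record
            for (String field : doc.split(","))     // update secondary indexes
                index.computeIfAbsent(field, f -> new TreeSet<>()).add(e.getKey());
        }
        return index;                               // full API per record = slow restore
    }
}
```

The byte path touches each record once; the document path pays deserialization plus index maintenance per record, which is exactly where the restore window stretches out.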
  7. §2 — Interactive Queries: Promise vs Reality.
     The promise: query state directly from outside the topology; no separate read store or CDC pipeline; stream state IS your database; partition-local key lookups are fast and clean.
     The reality: state is sharded — one instance per partition; cross-partition queries need fan-out + merge; service discovery and result aggregation are required; pagination across shards has no global cursor.
     You just built a distributed database — on top of Kafka — in your application code.
     IQv2 (KIP-796) improves the API — but cross-partition aggregation and pagination remain yours to architect.
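The fan-out + merge burden can be made concrete with a small stdlib-only model (all names hypothetical; real instances are discovered via Kafka Streams metadata and queried over RPC): each instance owns one partition's shard, and a "global" query must ask every shard, merge, and re-sort — with nothing resembling a global cursor.

```java
import java.util.*;
import java.util.concurrent.*;

/** Hypothetical model of cross-partition Interactive Queries:
 *  each "instance" holds one partition's state; a global range
 *  query must fan out to all instances and merge the results. */
public class IqFanOut {

    /** One app instance = one partition-local key-value shard. */
    record Instance(Map<String, Long> localStore) {
        List<Map.Entry<String, Long>> localRange(String from, String to) {
            return localStore.entrySet().stream()
                    .filter(e -> e.getKey().compareTo(from) >= 0
                              && e.getKey().compareTo(to) <= 0)
                    .sorted(Map.Entry.comparingByKey())
                    .toList();
        }
    }

    /** The part Kafka Streams does NOT do for you: discovery, fan-out, merge. */
    static List<Map.Entry<String, Long>> globalRange(
            List<Instance> instances, String from, String to)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(instances.size());
        try {
            List<Future<List<Map.Entry<String, Long>>>> futures = instances.stream()
                    .map(i -> pool.submit(() -> i.localRange(from, to)))
                    .toList();
            List<Map.Entry<String, Long>> merged = new ArrayList<>();
            for (var f : futures) merged.addAll(f.get());   // gather shard results
            merged.sort(Map.Entry.comparingByKey());        // re-sort globally
            return merged;                                  // pagination is still on you
        } finally {
            pool.shutdown();
        }
    }
}
```

Everything in `globalRange` — discovery, concurrency, merging, ordering — is application code you now own, which is the slide's point.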
  8. §3 — External Databases: The Stream Thread Timeout Problem.
     "Just use Redis." Reasonable idea. Until your app crashes.
     Stream thread → external call (Redis) → call is slow → poll() timeout (max.poll.interval.ms) → rebalance / crash.
     Heartbeats continue on a separate thread (since Kafka 0.10.1) — it's max.poll.interval.ms that catches you.
     Crashing or thrashing: the application enters a rebalance loop, restarting repeatedly and never making forward progress, and operational alerts follow.
     In financial services, under load, with a slow downstream — this is not a theoretical risk. We have seen it.
     The answer: never make the external call inline → the request-response offloading pattern.
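The knob in question can be sketched as consumer configuration (the values shown are the Kafka defaults at the time of writing, used here for illustration, not as a recommendation): `max.poll.interval.ms` bounds the time between `poll()` calls, so every inline external call burns against that budget.

```java
import java.util.Properties;

/** Illustrative consumer settings around the poll-interval trap.
 *  Values are examples; the point is the relationship between them. */
public class PollIntervalConfig {
    static Properties consumerProps() {
        Properties p = new Properties();
        // Max time allowed BETWEEN poll() calls. Inline external calls,
        // retries, and timeouts all count against this budget per batch.
        p.put("max.poll.interval.ms", "300000");   // default: 5 minutes
        // Heartbeats run on a separate thread (since Kafka 0.10.1),
        // so session.timeout.ms will NOT save a stuck processing loop.
        p.put("session.timeout.ms", "45000");
        // Fewer records per poll() shrinks the worst-case processing
        // time per batch -- a common mitigation, not a fix.
        p.put("max.poll.records", "100");
        return p;
    }
}
```

Raising `max.poll.interval.ms` or lowering `max.poll.records` only widens the margin; a slow-enough downstream still triggers the rebalance loop, which is why the talk's answer is offloading rather than tuning.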
  9. §3 — The Request-Response Offloading Pattern (in production at KOR).
     KS Topology A — Processor A initiates the offload: ① park first in the parking lot (local state, keyed), ② emit a request to the request topic (park + emit is conceptually atomic, ideally EOS).
     A plain Kafka consumer (or Parallel Consumer) ③ consumes the request, calls the external system (slow, error-prone), and ④ emits to the response topic.
     KS Topology B — Processor B ⑤ picks up the response and ⑥ emits a release event to the event topic; ⑦ Topology A unparks and continues.
     Stream threads never block · the consumer scales independently.
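The park/unpark flow can be modeled stdlib-only (all names hypothetical; in the real pattern the parking lot is a keyed Kafka Streams state store and the hand-offs are Kafka topics, not in-memory collections): park the in-flight record first, emit a request, and resume only when the release event arrives.

```java
import java.util.*;

/** Stdlib-only model of the request-response offloading pattern:
 *  park first, emit a request, continue only on the release event.
 *  In production the lot is a keyed state store and the queues are
 *  Kafka topics; here they are plain collections. */
public class ParkingLot {
    private final Map<String, String> parked = new HashMap<>(); // keyed local state
    private final Deque<String> requestTopic = new ArrayDeque<>();
    private final List<String> downstream = new ArrayList<>();

    /** Processor A: park the record, then emit the request (conceptually atomic). */
    public void offload(String key, String record) {
        parked.put(key, record);        // (1) park first
        requestTopic.add(key);          // (2) emit request
    }

    /** The plain consumer takes requests off the stream thread entirely. */
    public String nextRequest() {
        return requestTopic.poll();     // (3) consumed independently
    }

    /** Processor B side: release event arrives -> unpark and continue. */
    public void release(String key, String response) {
        String record = parked.remove(key);              // (7) unpark
        if (record != null) downstream.add(record + "|" + response);
    }

    public List<String> emitted() { return downstream; }
}
```

Note what the model preserves from the slide: the stream side never waits on the external system; between park and release, the record simply sits in keyed state.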
  10. §4 — The Partition Ceiling.
      Kafka Streams parallelism is partition-bound — always. Hard ceiling: more consumers than partitions means idle threads.
      Confluent told us directly: 2–18 partitions covers most real workloads; 48 partitions is almost never seen in practice; 6+ usually signals a design flaw, the exception being very specific, justified requirements.
      If you're hitting the ceiling: (A) custom processor + fork-join, (B) the Parallel Consumer library, (C) Apache Flink.
      At KOR, MiFID SLA requirements forced us to confront this limit directly.
  11. §4 — Three Options Beyond the Ceiling.
      A · Custom processor + fork-join: vertical scalability — push one instance harder; state stays local, no movement on scale; no new infrastructure needed. Still partition-bound horizontally, and a custom implementation per service (could contribute upstream).
      B · Parallel Consumer: scale beyond partition count for I/O work; low cost — a library addition only; complements option A. No state management, and idempotency is required — a real constraint in finserv.
      C · Apache Flink: the partition ceiling disappears entirely; task parallelism is truly independent; Uber and Netflix run it at petabyte scale. But it means a separate cluster and dedicated expertise; state is redistributed via checkpoints on rescale; the org must be able to support it.
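Option A can be sketched with stdlib concurrency (a hypothetical helper, deliberately simplified: real code must also preserve per-key ordering and honour Kafka Streams commit semantics): fan a polled batch out over a local thread pool inside one stream task, then join before the batch is committed.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;

/** Stdlib sketch of "custom processor + fork-join": one stream task
 *  fans a batch out over local threads and joins before the commit.
 *  Simplified -- real code must preserve per-key ordering and honour
 *  Kafka Streams commit semantics. */
public class BatchForkJoin {
    static <I, O> List<O> processBatch(List<I> batch, Function<I, O> work,
                                       int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I record : batch)
                futures.add(pool.submit(() -> work.apply(record))); // fork
            List<O> out = new ArrayList<>();
            for (Future<O> f : futures)
                out.add(f.get());                                   // join, in order
            return out; // only now is the batch safe to commit
        } finally {
            pool.shutdown();
        }
    }
}
```

This buys vertical headroom inside a single partition's task, which matches the slide's trade-off: no new infrastructure, but the horizontal partition bound remains.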
  12. Know the Walls. Know the Exits.
      State stores: the contract is bigger than the API suggests. We didn't get to production — that's a valid outcome.
      Interactive Queries: sharp for partition-local key lookups; a trap for global queries and pagination.
      External I/O: never inline. Park, emit, offload — and watch your repartitioning logic.
      Partition ceiling: vertical headroom first (fork-join, Parallel Consumer); Flink when you have the org for it.
      Kafka Streams is excellent at what it is. Every limit we hit was a boundary, not a bug.