Transactional Change Stream Processing With Debezium and Apache Flink

Apache Flink is commonly used for processing Debezium change data events: for running continuous queries enabling real-time analytics as the data in your OLTP store changes, for filtering and transforming change data feeds, or for creating denormalized data views sourced from the change data events of multiple tables.

While powerful, this processing happens message by message, resulting in the emission of partial results to downstream consumers while the change events originating from a single transaction in the source database are processed. Oftentimes, that’s not desired: instead, results should only be emitted once all the events from a transaction have been received.

In this talk, we’ll explore how this problem can be solved by leveraging Debezium’s transaction metadata. It describes how many events of which type belong to a given transaction in a database like Postgres or MySQL. We’ll show how to take advantage of this information for implementing an innovative watermarking approach which, together with a custom output buffer, ensures that event consumers will only ever receive transactionally consistent data.

Gunnar Morling

May 21, 2026

Transcript

  1. Transactional Change Stream Processing With Debezium and Apache Flink

    Gunnar Morling (@gunnarmorling)
    Jan Hrdina, https://flic.kr/p/2jGjd6F (CC BY-SA 2.0)
  2. #Debezium #ApacheFlink · @gunnarmorling

    Gunnar Morling
    • Technologist at Confluent
    • Former project lead of Debezium
    • Hardwood, kcctl 🧸, JfrUnit, ModiTect, MapStruct
    • One Billion Row Challenge 1⃣🐝🏎
    • Java Champion
  3. Debezium – Open-Source CDC

    Turning Committed Database Changes Into an Event Stream
    • Reads DB transaction log (WAL, binlog, etc.) to capture every insert/update/delete
    • Created at Red Hat in 2016 – Battle-tested at scale
  4. Apache Flink

    Stateful Computations Over Data Streams
    • Stateful by design: per-key state, exactly-once guarantees, fault tolerance via checkpoints
    • Event-time semantics with watermarks; first-class support for windows, joins, aggregations
    • Two main APIs
      ◦ Flink SQL
      ◦ DataStream API
    • Top-level Apache project since 2014
  5. Why Debezium and Flink Belong Together

    Common Pattern and Use Cases
    • Filtering / transforming / routing CDC streams
    • Real-time analytics over OLTP data
    • Denormalization for search engines, caches, document stores
    • Streaming Data Contracts
  6. Problem #1

    Incomplete Results & Write Amplification

        BEGIN;
        INSERT INTO purchase_order (id, order_date) VALUES (...);          -- po1
        INSERT INTO order_lines (id, product_id, quantity) VALUES (...);   -- ol1
        INSERT INTO order_lines (id, product_id, quantity) VALUES (...);   -- ol2
        INSERT INTO order_lines (id, product_id, quantity) VALUES (...);   -- ol3
        COMMIT;

        STEP          EMITTED JOIN RESULT
        po1 arrives   order with 0 lines
        ol1 arrives   (retract) → order with 1 line
        ol2 arrives   (retract) → order with 2 lines
        ol3 arrives   (retract) → order with 3 lines

    • Bad UX: Elasticsearch returns a partial document mid-transaction
    • Write amplification: 7 writes (4 writes, 3 retractions) for one transaction
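
The write count on this slide can be reproduced with a small simulation. This is a minimal sketch in plain Java, not Flink code: the class `EagerJoinSimulation` and its methods are hypothetical names, and it assumes an eager join that retracts the previous result before emitting each updated one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation (not Flink code): an eager join re-emits the order
// document on every input event, retracting the previous version first.
class EagerJoinSimulation {

    private static String describe(int lines) {
        return "order with " + lines + (lines == 1 ? " line" : " lines");
    }

    // Returns the writes a downstream sink would see for one purchase order
    // followed by lineCount order lines arriving one at a time.
    static List<String> emissions(int lineCount) {
        List<String> log = new ArrayList<>();
        log.add("+ " + describe(0));                 // po1 arrives
        for (int lines = 1; lines <= lineCount; lines++) {
            log.add("- " + describe(lines - 1));     // retract previous result
            log.add("+ " + describe(lines));         // emit updated result
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> log = emissions(3);
        log.forEach(System.out::println);
        System.out.println(log.size() + " writes for one transaction");
    }
}
```

For the transaction above (one order, three lines) the log contains four emissions and three retractions, and the count grows linearly with the number of events in the transaction.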
  8. Problem #2

    Incorrect Results

    Three transactions in commit order in the source DB

    What Flink may process:
    1. Materialize join from TX 1 → (po1, ol1)
    2. Process ol2 from TX 3 → (po1, ol2) ⚠ never existed in source DB
    3. Process po2 from TX 2 → (po2, ol2)

    The final state is correct. Intermediate states may not be.
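
The phantom state can be reproduced by replaying the same change events in two different orders. The sketch below is plain Java with hypothetical names (`CrossStreamOrdering`, `Change`). It assumes, based on the labels on this slide, that TX 1 creates po1 and ol1, TX 2 updates the order to po2, and TX 3 updates the line to ol2.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replay of the three transactions (not Flink code).
class CrossStreamOrdering {
    // One change event: the table it belongs to and the new row version.
    record Change(String table, String version) {}

    // Order in which the source database committed the changes.
    static final List<Change> COMMIT_ORDER = List.of(
        new Change("orders", "po1"), new Change("order_lines", "ol1"), // TX 1
        new Change("orders", "po2"),                                    // TX 2
        new Change("order_lines", "ol2"));                              // TX 3

    // One order a join may see when the two topics advance independently.
    static final List<Change> POSSIBLE_FLINK_ORDER = List.of(
        new Change("orders", "po1"), new Change("order_lines", "ol1"), // TX 1
        new Change("order_lines", "ol2"),                               // TX 3 arrives early
        new Change("orders", "po2"));                                   // TX 2 arrives late

    // Applies the changes one by one and records the join result after each step.
    static List<String> replay(List<Change> changes) {
        String order = null;
        String line = null;
        List<String> results = new ArrayList<>();
        for (Change c : changes) {
            if (c.table().equals("orders")) {
                order = c.version();
            } else {
                line = c.version();
            }
            if (order != null && line != null) {
                results.add("(" + order + ", " + line + ")");
            }
        }
        return results;
    }

    // Join results that the eager pipeline emits but the source never contained.
    static List<String> phantomStates() {
        List<String> phantoms = new ArrayList<>(replay(POSSIBLE_FLINK_ORDER));
        phantoms.removeAll(replay(COMMIT_ORDER));
        return phantoms;
    }

    public static void main(String[] args) {
        System.out.println("commit order: " + replay(COMMIT_ORDER));
        System.out.println("flink order:  " + replay(POSSIBLE_FLINK_ORDER));
        System.out.println("phantoms:     " + phantomStates());
    }
}
```

Both replays end in the same final state; only the intermediate result differs, which is why the problem is easy to miss in tests that check final output only.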
  9. Root Cause

    Flink is Oblivious to Source Transaction Boundaries
    • Flink computes immediately on every input event
    • Sources for different topics advance at independent paces
    • No signal in the pipeline says: "events from transaction X are now complete"

    WHAT WE NEED
    (a) Only emit at transaction boundaries
    (b) Process transactions in source-commit order
  10. "Can't We Just Mini-Batch?"

    Flink's Mini-Batching Doesn't Fix This
    • Mini-batch is time- or size-based: buffers for N ms or N records
    • Doesn't align with source transaction boundaries
    • Doesn't solve the cross-stream ordering problem at all
    • Reduces write amplification a bit, but doesn't give correctness

    Useful for throughput, not for transactional consistency
  11. The Missing Ingredient

    Debezium Transaction Metadata
    • Every change event carries Tx id data
    • Dedicated TX metadata topic
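
As a rough illustration of the two pieces of metadata named on this slide, the sketch below models them as plain Java records. The record and method names are hypothetical. The field meanings follow Debezium's documented `transaction` block (`id`, `total_order`, `data_collection_order`) and the END event on the transaction topic (`id`, `event_count`, per-collection counts), flattened into a `Map` here for brevity. Treat the shapes as an approximation, not the connector's exact schema.

```java
import java.util.Map;

// Hypothetical, simplified model of Debezium transaction metadata.
class TransactionMetadataSketch {

    // The "transaction" block attached to each change event.
    record TransactionBlock(String id, long totalOrder, long dataCollectionOrder) {}

    // The END event published to the dedicated transaction metadata topic.
    record TransactionEnd(String id, long eventCount, Map<String, Long> eventCountPerCollection) {}

    // True if the change event is the last one of the given transaction overall.
    static boolean isLastEventOf(TransactionBlock block, TransactionEnd end) {
        return block.id().equals(end.id()) && block.totalOrder() == end.eventCount();
    }

    // Number of events the transaction produced for one table (0 if it did not touch it).
    static long expectedEvents(TransactionEnd end, String dataCollection) {
        return end.eventCountPerCollection().getOrDefault(dataCollection, 0L);
    }
}
```

The per-collection counts are what later allows a stage reading a single table to decide whether it has seen its full share of a transaction.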
  13. Postgres: Why LSN, Not txId?

    txId Not Monotonic by Commit Time
    • In Postgres, a txId is assigned when the transaction first writes, not at COMMIT
    • Two transactions can interleave so the higher txId commits first

    Commit LSN is monotonically increasing; that's our notion of "time" for watermarks.
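
A two-transaction example makes the interleaving concrete. The sketch below is plain Java with hypothetical names and made-up txId and LSN values. It checks whether txIds still ascend once transactions are sorted by commit LSN.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration: txId order versus commit order.
class CommitOrderExample {
    record Tx(long txId, long commitLsn) {}

    // Returns true only if sorting by commit LSN also yields ascending txIds.
    static boolean txIdsMonotonicInCommitOrder(List<Tx> txs) {
        List<Tx> byCommit = new ArrayList<>(txs);
        byCommit.sort(Comparator.comparingLong(Tx::commitLsn));
        for (int i = 1; i < byCommit.size(); i++) {
            if (byCommit.get(i).txId() < byCommit.get(i - 1).txId()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // txId 100 starts writing first but commits after txId 101 (values are invented).
        List<Tx> interleaved = List.of(new Tx(100, 5100), new Tx(101, 5000));
        System.out.println(txIdsMonotonicInCommitOrder(interleaved));
    }
}
```

A watermark keyed on txId could therefore declare "everything ≤ 100 seen" while transaction 100 is still uncommitted; the commit LSN does not have this problem.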
  14. The Idea

    Repurpose Flink's Watermarks to Signal Transaction Completeness

    (a) Commit LSN as the watermark
    • Watermark = "everything ≤ X has been observed on this stream"
    • Standard X = event-time
    • Our X = commit LSN, monotonically increasing per the transaction log
    • Downstream operators wait for the watermark before emitting → transactional consistency

    (b) Flink DataStream v2 API
    • Legacy DataStream API hardcodes event-time watermarks
    • DataStream v2 (Flink 2.x) introduces declarative custom watermark types
    • We declare a LongWatermarkDeclaration carrying commit LSNs
    • Operators can react to our custom watermarks just like event-time ones
  15. Solution Architecture: A Five-Stage Pipeline

    1. CommitLsnFixer: patch each event with the correct commit LSN
    2. WatermarkInjector: emit custom watermarks at transaction boundaries
    3. Join operator: join buffer; wait for watermarks from both inputs; flush in order
    4. Aggregation operator: per-key aggregation state; emit only on watermark
    5. Kafka sink: receives transactionally-consistent records only
  16. Stage 1: CommitLsnFixer

    Workaround for Debezium DBZ-1555
    • Postgres connector reports the previous transaction's commit LSN
    • Workaround: broadcast-join change events with the transaction-metadata topic on txId

        // In processRecordFromBroadcastInput():
        // On END event for txId, flush all buffered events with the corrected LSN
        if (record.status() == Status.END) {
            long correctCommitLsn = record.commitLsn();
            ctx.applyToAllPartitions((collector, context) -> {
                commitLsnState.put(txId, correctCommitLsn);
                for (DataChangeEvent event : buffered(txId)) {
                    collector.collect(event.withCommitLsn(correctCommitLsn));
                }
            });
        }
  17. Stage 2: WatermarkInjector

    Custom Watermarks
    • Flink v2 DataStream API supports custom watermark types
    • Replace event-time watermarks with commit-LSN watermarks
    • The semantics: "I have seen everything from transactions ≤ LSN on this stream."

        public static final LongWatermarkDeclaration WATERMARK_DECLARATION =
            WatermarkDeclarations.newBuilder("TX_WATERMARK")
                .typeLong()
                .combineFunctionMin()
                .combineWaitForAllChannels(true)
                .defaultHandlingStrategyIgnore()
                .build();
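
To illustrate what a "min" combine function that waits for all channels means for an operator with several inputs, here is a minimal plain-Java sketch. It is not Flink's implementation: `MinWatermarkCombiner` is a hypothetical class that only models the semantics as described on this slide, namely that nothing is forwarded until every channel has reported, and the forwarded value is the smallest commit LSN across channels.

```java
import java.util.OptionalLong;

// Hypothetical model of min-combining watermarks across input channels.
class MinWatermarkCombiner {
    private final long[] latest;      // last watermark seen per channel
    private final boolean[] seen;     // whether a channel has reported at all
    private long lastEmitted = Long.MIN_VALUE;

    MinWatermarkCombiner(int channels) {
        latest = new long[channels];
        seen = new boolean[channels];
    }

    // Records a watermark from one channel. Returns the combined watermark
    // if it advanced, or an empty result if downstream must keep waiting.
    OptionalLong onWatermark(int channel, long value) {
        latest[channel] = seen[channel] ? Math.max(latest[channel], value) : value;
        seen[channel] = true;

        long min = Long.MAX_VALUE;
        for (int i = 0; i < latest.length; i++) {
            if (!seen[i]) {
                return OptionalLong.empty();   // wait for all channels
            }
            min = Math.min(min, latest[i]);
        }
        if (min <= lastEmitted) {
            return OptionalLong.empty();       // combined watermark did not advance
        }
        lastEmitted = min;
        return OptionalLong.of(min);
    }
}
```

With two inputs at commit LSN 100 and 80, the combined watermark is 80: the operator may only act on transactions that both inputs have fully delivered.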
  18. Stage 2 (cont.)

    When to Emit Watermarks?
    • First idea: leverage in-partition ordering
    • New commit LSN → Flush buffered events
    • But: Idle stream problem
  19. Stage 2 (cont.)

    The Idle-Tables Problem
    • Solution: broadcast the TX topic to every WatermarkInjector
    • On each transaction END:
      ◦ If count == event_count for this data collection → emit watermark = commit LSN
      ◦ A transaction with zero events for our table still counts as 'complete' → watermark advances
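
The counting rule above can be sketched without any Flink dependency. The class below is a hypothetical, simplified stand-in for the injector logic, not the code from the talk. It assumes END events arrive in commit order, allows change events to arrive before or after their END event, and represents the per-table counts as a `Map`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-table watermark injection based on event counts.
class WatermarkInjectorSketch {
    record TxEnd(String txId, long commitLsn, Map<String, Integer> eventCounts) {}

    private final String dataCollection;                         // the table this injector reads
    private final Map<String, Integer> seen = new HashMap<>();   // txId -> events seen so far
    private final Deque<TxEnd> pending = new ArrayDeque<>();     // ended transactions, commit order

    WatermarkInjectorSketch(String dataCollection) {
        this.dataCollection = dataCollection;
    }

    // A change event for this table arrived; returns watermarks that became emittable.
    List<Long> onChangeEvent(String txId) {
        seen.merge(txId, 1, Integer::sum);
        return drain();
    }

    // A transaction END arrived on the broadcast transaction topic.
    List<Long> onTransactionEnd(TxEnd end) {
        pending.addLast(end);
        return drain();
    }

    // Emits commit LSNs for the oldest pending transactions whose events are all here.
    private List<Long> drain() {
        List<Long> watermarks = new ArrayList<>();
        while (!pending.isEmpty()) {
            TxEnd head = pending.peekFirst();
            int expected = head.eventCounts().getOrDefault(dataCollection, 0);
            if (seen.getOrDefault(head.txId(), 0) < expected) {
                break;   // still waiting; later transactions must not overtake this one
            }
            pending.removeFirst();
            seen.remove(head.txId());
            watermarks.add(head.commitLsn());
        }
        return watermarks;
    }
}
```

A transaction that did not touch the table has an expected count of zero, so its commit LSN is emitted immediately and an idle table no longer holds back the combined watermark downstream.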
  20. Stage 3: Custom Join Operator

    Buffer Until Both Inputs Have Crossed the Same Watermark
    • Per key, two keyed state stores (orders, order_lines)
    • Tracks the minimum watermark across both inputs
    • On watermark advance:
      ◦ For each order touched in the transaction → join with latest line per key
      ◦ For each order line touched → join with latest order
      ◦ Emit the joined result, then forward the watermark
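
One way to picture the buffering is the plain-Java sketch below. It is a simplified, hypothetical variant rather than the operator from the talk: it keeps all state in one object instead of Flink keyed state, buffers incoming events by commit LSN, and only applies them once the caller passes in the combined (minimum) watermark of both inputs. Payloads are plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of a watermark-driven join buffer (not Flink code).
class TxAwareJoinSketch {
    // lineId is ignored for order events.
    record Event(boolean isOrder, long orderId, long lineId, String payload) {}

    private final TreeMap<Long, List<Event>> buffer = new TreeMap<>();        // commitLsn -> events
    private final Map<Long, String> orders = new HashMap<>();                 // orderId -> latest order
    private final Map<Long, TreeMap<Long, String>> lines = new HashMap<>();   // orderId -> lineId -> latest line

    // Buffers an event from either input; nothing is emitted yet.
    void onEvent(long commitLsn, Event event) {
        buffer.computeIfAbsent(commitLsn, k -> new ArrayList<>()).add(event);
    }

    // Applies all buffered transactions up to the watermark in commit order
    // and returns one joined result per order touched by each transaction.
    List<String> onWatermark(long watermark) {
        List<String> out = new ArrayList<>();
        NavigableMap<Long, List<Event>> ready = buffer.headMap(watermark, true);
        for (List<Event> tx : ready.values()) {
            Set<Long> touched = new TreeSet<>();
            for (Event e : tx) {
                if (e.isOrder()) {
                    orders.put(e.orderId(), e.payload());
                } else {
                    lines.computeIfAbsent(e.orderId(), k -> new TreeMap<>()).put(e.lineId(), e.payload());
                }
                touched.add(e.orderId());
            }
            for (long id : touched) {
                if (orders.containsKey(id)) {
                    out.add(orders.get(id) + " " + lines.getOrDefault(id, new TreeMap<>()).values());
                }
            }
        }
        ready.clear();   // drop the flushed transactions from the buffer
        return out;
    }
}
```

Replaying the Problem #2 scenario, with ol2 (LSN 300) arriving before po2 (LSN 200), yields `po1 [ol1]`, then `po2 [ol1]`, then `po2 [ol2]`; the state combining po1 with ol2 is never produced because the early event stays buffered until the watermark reaches it.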
  21. Stage 4: Aggregation Operator

    Per-Purchase-Order State, Emitted on Watermark

        public class TxAwareAggregationFunction {

            public void processRecord(DataChangeEventPair record, ...) {
                // accumulate join result into the order's state
                OrderWithLines order = (state.value() == null)
                    ? OrderWithLines.from(record)
                    : state.value().update(record);
                state.update(order);
            }

            public void onWatermark(Watermark wm, ...) {
                // read the accumulated order for the current key
                OrderWithLines order = state.value();
                // emit ONLY if this order was modified at this commitLsn
                if (order != null && order.commitLsn() == ((LongWatermark) wm).getValue()) {
                    collector.collect(order);
                }
            }
        }
  22. Stage 5: Kafka Sink

    One write per transaction-modified order
    • No partial documents downstream
    • Write amplification eliminated: N writes/transaction, not N×K
    • Elasticsearch sees coherent documents on every refresh

        {
          "id": 10001,
          "purchaser": 1001,
          "shippingAddress": "123 Main St",
          "lines": [
            {"productId": 101, "quantity": 2, "price": 19.99},
            {"productId": 102, "quantity": 1, "price": 49.99}
          ]
        }
  23. Putting Everything Together

    DataStreamJob.java

        // Each source: data → CommitLsnFixer → WatermarkInjector → keyBy
        KeyedPartitionStream<Long, DataChangeEvent> orders =
            env.fromSource(ordersSource)
                .connectAndProcess(transactionsStream, new CommitLsnFixer())
                .connectAndProcess(transactionsStream, new WatermarkInjector("inventory.orders"))
                .keyBy(DataChangeEvent::id);

        KeyedPartitionStream<Long, DataChangeEvent> orderLines = /* ...same shape... */;

        // Join + aggregate + sink
        connectAndProcess(orders, orderLines, new TxAwareJoinProcessFunction(...))
            .keyBy(DataChangeEventPair::id)
            .process(new TxAwareAggregationFunction())
            .toSink(kafkaSink);
  24. Results: Before vs. After

    BEFORE
    Same TX, three line inserts:
        order=10001 lines=[]
        order=10001 lines=[L1]           ← retract previous
        order=10001 lines=[L1, L2]       ← retract previous
        order=10001 lines=[L1, L2, L3]   ← retract previous
    Cross-TX (TX1, TX2, TX3 case): ~6 emissions including the impossible (po1, ol2) state

    AFTER
    Same TX:
        order=10001 lines=[L1, L2, L3]   ← single, complete emission
    Cross-TX (TX1, TX2, TX3 case): 3 emissions, one per transaction, in commit order, all valid

    Every emission downstream now corresponds to a real, committed state of the source database.
  25. Limitations

    • Parallelism = 1 only: higher parallelism needs per-partition transaction metadata from Debezium, so each task knows its share of the transaction.
    • Single-output-record consistency, not multi-record: a transaction touching two orders still emits two messages. Both are correct in isolation, but a consumer can read between them. True end-to-end multi-record transactionality would need 2PC all the way to the destination.
  26. What's Needed for Production… …and How the Community Can Help

    DEBEZIUM
    • Per-partition transaction metadata: unblocks parallelism > 1
    • Fix DBZ-1555: eliminates the CommitLsnFixer stage entirely
    • Transaction metadata for snapshots

    FLINK
    • Native transaction-aware operators: built-in joins / aggregations that respect commit boundaries
    • First-class Flink SQL support: one-keyword opt-in, not a 1500-line PoC
  27. Long-Term Vision

    Flink SQL Support → Transactional CDC processing becomes a one-keyword opt-in.

        SELECT TRANSACTIONALLY
            po.id,
            po.order_date,
            ARRAY_AGG(ROW(ol.id, ol.product_id, ol.quantity))
        FROM purchase_orders po
        LEFT JOIN order_lines ol ON ol.order_id = po.id
        GROUP BY po.id, po.order_date;
  28. Ad-Hoc Alternative

    Process Table Functions (h/t Martijn Visser 👋)

        @FunctionHint(output = @DataTypeHint("ROW<id BIGINT, ..., lines ARRAY<ROW<...>>>"))
        public class TxAwareJoinAndAggregatePTF extends ProcessTableFunction<Row>
                implements ChangelogFunction {

            public void eval(
                    Context ctx,
                    @StateHint MapView<String, String> expectedCounts,
                    @StateHint MapView<String, String> actualCounts,
                    @StateHint MapView<String, String> bufferedOrders,
                    @StateHint MapView<String, String> bufferedLines,
                    @StateHint MapView<Long, String> orderState,
                    @StateHint ListView<String> pendingTxIds,
                    @StateHint MapView<String, Long> txIdToCommitLsn,
                    @ArgumentHint(value={SET_SEMANTIC_TABLE,OPTIONAL_PARTITION_BY}, name="orders") Row order,
                    @ArgumentHint(value={SET_SEMANTIC_TABLE,OPTIONAL_PARTITION_BY}, name="orderLines") Row orderLine,
                    @ArgumentHint(value={SET_SEMANTIC_TABLE,OPTIONAL_PARTITION_BY}, name="transactions") Row transaction
            ) throws Exception {
                if (transaction != null) processTransaction(...);
                else if (order != null) processOrder(...);
                else if (orderLine != null) processOrderLine(...);
                tryEmit(...);
            }
        }
  29. Takeaways

    1. Flink unaware of source transactions → leaking partial AND incorrect results downstream
    2. Transaction metadata provided by Debezium
    3. Custom watermarking and buffered operators ensure correct processing
  30. Shameless Plug ;)

    My New Project: Hardwood
    “Hardwood: Building a Parquet Parser From Scratch (With a Little Help From AI)”
    🗓 Wednesday, 1:15 PM
    📍 Meetup Hub - Expo Hall