Transactional Change Stream Processing With Debezium and Apache Flink

Gunnar Morling (@gunnarmorling)
Photo: Jan Hrdina, https://flic.kr/p/2jGjd6F (CC BY-SA 2.0)

#Debezium #ApacheFlink · @gunnarmorling
Atomicity, Consistency, Isolation, Durability

Agenda

Gunnar Morling
• Technologist at Confluent
• Former project lead of Debezium
• Hardwood, kcctl 🧸, JfrUnit, ModiTect, MapStruct
• One Billion Row Challenge 1⃣🐝🏎
• Java Champion

The observer pattern for your database

Debezium – Open-Source CDC
Turning Committed Database Changes Into an Event Stream
• Reads DB transaction log (WAL, binlog, etc.) to capture every insert/update/delete
• Created at Red Hat in 2016 – Battle-tested at scale

Apache Flink
Stateful Computations Over Data Streams
• Stateful by design: per-key state, exactly-once guarantees, fault tolerance via checkpoints
• Event-time semantics with watermarks; first-class support for windows, joins, aggregations
• Two main APIs
  ◦ Flink SQL
  ◦ DataStream API
• Top-level Apache project since 2014

Why Debezium and Flink Belong Together
Common Patterns and Use Cases
• Filtering / transforming / routing CDC streams
• Real-time analytics over OLTP data
• Denormalization for search engines, caches, document stores
• Streaming Data Contracts

Use Case: Data Contracts

Deployment Options
Kafka Connect vs. Flink CDC

The Use Case
Denormalization for Elasticsearch

Problem #1
Incomplete Results & Write Amplification
    BEGIN;
    INSERT INTO purchase_order AS po (id, order_date) VALUES (...);
    INSERT INTO order_lines AS ol (id, product_id, quantity) VALUES (...);
    INSERT INTO order_lines AS ol (id, product_id, quantity) VALUES (...);
    INSERT INTO order_lines AS ol (id, product_id, quantity) VALUES (...);
    COMMIT;
    STEP          EMITTED JOIN RESULT
    po1 arrives   order with 0 lines
    ol1 arrives   (retract) → order with 1 line
    ol2 arrives   (retract) → order with 2 lines
    ol3 arrives   (retract) → order with 3 lines
• Bad UX: Elasticsearch returns a partial document mid-transaction
• Write amplification: 7 writes (4 writes, 3 retractions) for one transaction
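
The write count above can be reproduced with a small simulation. This is a plain-Java sketch, not Flink code; the class and method names are hypothetical. It assumes an eager join that emits a result as soon as the order arrives and then, for every order line, retracts the previous result and emits an updated one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of what an eager (non-transactional) join emits. */
final class WriteAmplificationDemo {

    /** Messages emitted for one order followed by lineCount order lines. */
    static List<String> eagerEmissions(int lineCount) {
        List<String> out = new ArrayList<>();
        out.add("+ order lines=0");                      // po1 arrives
        for (int lines = 1; lines <= lineCount; lines++) {
            out.add("- order lines=" + (lines - 1));     // retract previous result
            out.add("+ order lines=" + lines);           // emit updated result
        }
        return out;
    }

    /** A transaction-aware pipeline emits the complete result once. */
    static List<String> transactionalEmissions(int lineCount) {
        return List.of("+ order lines=" + lineCount);
    }

    public static void main(String[] args) {
        eagerEmissions(3).forEach(System.out::println);
        System.out.println("eager: " + eagerEmissions(3).size()
                + ", transactional: " + transactionalEmissions(3).size());
    }
}
```

For three order lines the eager variant produces 1 + 2 × 3 = 7 messages, matching the 4 writes and 3 retractions on the slide.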

Problem #2
Incorrect Results
Three transactions in commit order in the source DB
What Flink may process:
1. Materialize join from TX 1 → (po1, ol1)
2. Process ol2 from TX 3 → (po1, ol2) ⚠ never existed in source DB
3. Process po2 from TX 2 → (po2, ol2)
The final state is correct. Intermediate states may not be.

Root Cause
Flink Is Oblivious to Source Transaction Boundaries
• Flink computes immediately on every input event
• Sources for different topics advance at independent paces
• No signal in the pipeline says: "events from transaction X are now complete"
WHAT WE NEED
(a) Only emit at transaction boundaries
(b) Process transactions in source-commit order

"Can't We Just Mini-Batch?"
Flink's Mini-Batching Doesn't Fix This
• Mini-batch is time- or size-based: buffers for N ms or N records
• Doesn't align with source transaction boundaries
• Doesn't solve the cross-stream ordering problem at all
• Reduces write amplification a bit, but doesn't give correctness
Useful for throughput, not for transactional consistency
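
To illustrate why size-based batching does not line up with transactions, here is a plain-Java sketch (not Flink's mini-batch implementation; all names are hypothetical). Each event is represented only by its transaction ID, and events of one transaction are assumed to be contiguous in the input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: size-based batches vs. transaction boundaries. */
final class MiniBatchDemo {

    /** Splits events (each represented by its txId) into batches of at most size elements. */
    static List<List<Integer>> batchBySize(List<Integer> txIds, int size) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int i = 0; i < txIds.size(); i += size) {
            int end = Math.min(i + size, txIds.size());
            batches.add(new ArrayList<>(txIds.subList(i, end)));
        }
        return batches;
    }

    /** True if some transaction has events on both sides of a batch boundary. */
    static boolean splitsATransaction(List<List<Integer>> batches) {
        for (int i = 0; i + 1 < batches.size(); i++) {
            List<Integer> current = batches.get(i);
            int lastTx = current.get(current.size() - 1);
            int nextTx = batches.get(i + 1).get(0);
            if (lastTx == nextTx) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // TX 1 has four events (one order, three lines), TX 2 has two.
        List<Integer> events = List.of(1, 1, 1, 1, 2, 2);
        System.out.println(splitsATransaction(batchBySize(events, 3)));
        System.out.println(splitsATransaction(batchBySize(events, 4)));
    }
}
```

With a batch size of 3 the first transaction is cut in two; a batch size of 4 happens to fit, but only by coincidence.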

Transaction Metadata
Photo: Tim Green, https://flic.kr/p/21MBf1L (CC BY 2.0)

The Missing Ingredient
Debezium Transaction Metadata
• Every change event carries Tx id data
• Dedicated TX metadata topic

Postgres: Why LSN, Not txId?
txId Not Monotonic by Commit Time
• In Postgres, txId is assigned while the transaction is running (at its first write), not at COMMIT
• Two transactions can interleave so the higher txId commits first
Commit LSN is monotonically increasing; that's our notion of "time" for watermarks.
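
A minimal sketch of the ordering problem, in plain Java with made-up values: the `Tx` record and the numbers are hypothetical and only model "the lower txId commits later". Sorting by txId gives the wrong replay order, sorting by commit LSN gives commit order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical demo: txId order vs. commit-LSN order. */
final class CommitOrderDemo {

    /** Transaction summary: its txId and the LSN of its COMMIT record. */
    record Tx(long txId, long commitLsn) {}

    static final Comparator<Tx> BY_TX_ID = Comparator.comparingLong(Tx::txId);
    static final Comparator<Tx> BY_COMMIT_LSN = Comparator.comparingLong(Tx::commitLsn);

    /** Returns the txIds of the given transactions, sorted by the given order. */
    static List<Long> txIdsOrderedBy(List<Tx> txs, Comparator<Tx> order) {
        List<Tx> sorted = new ArrayList<>(txs);
        sorted.sort(order);
        List<Long> ids = new ArrayList<>();
        for (Tx tx : sorted) {
            ids.add(tx.txId());
        }
        return ids;
    }

    public static void main(String[] args) {
        // Tx 100 got its ID first but committed last; tx 101 committed first.
        List<Tx> txs = List.of(new Tx(100, 900), new Tx(101, 500));
        System.out.println(txIdsOrderedBy(txs, BY_TX_ID));
        System.out.println(txIdsOrderedBy(txs, BY_COMMIT_LSN));
    }
}
```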

Proof of Concept
Photo: Ky0n Cheng, https://flic.kr/p/XhtG8Z (Public domain)

The Idea
Repurpose Flink's Watermarks to Signal Transaction Completeness
(a) Commit LSN as the watermark
• Watermark = "everything ≤ X has been observed on this stream"
• Standard X = event-time
• Our X = commit LSN, monotonically increasing per the transaction log
• Downstream operators wait for the watermark before emitting → transactional consistency
(b) Flink DataStream v2 API
• Legacy DataStream API hardcodes event-time watermarks
• DataStream v2 (Flink 2.x) introduces declarative custom watermark types
• We declare a LongWatermarkDeclaration carrying commit LSNs
• Operators can react to our custom watermarks just like event-time ones

Solution Architecture: A Five-Stage Pipeline
1. CommitLsnFixer: Patch each event with the correct commit LSN
2. WatermarkInjector: Emit custom watermarks at transaction boundaries
3. Join operator: Join buffer; wait for watermarks from both inputs; flush in order
4. Aggregation operator: Per-key aggregation state; emit only on watermark
5. Kafka sink: Receives transactionally-consistent records only

Stage 1: CommitLsnFixer
Workaround for Debezium DBZ-1555
• Postgres connector reports the previous transaction's commit LSN
• Workaround: broadcast-join change events with the transaction-metadata topic on txId
    // In processRecordFromBroadcastInput():
    // On END event for txId, flush all buffered events with the corrected LSN
    if (record.status() == Status.END) {
        long correctCommitLsn = record.commitLsn();
        ctx.applyToAllPartitions((collector, context) -> {
            commitLsnState.put(txId, correctCommitLsn);
            for (DataChangeEvent event : buffered(txId)) {
                collector.collect(event.withCommitLsn(correctCommitLsn));
            }
        });
    }

Stage 2: WatermarkInjector
Custom Watermarks
• Flink v2 DataStream API supports custom watermark types
• Replace event-time watermarks with commit-LSN watermarks
• The semantics: "I have seen everything from transactions ≤ LSN on this stream."
    public static final LongWatermarkDeclaration WATERMARK_DECLARATION =
        WatermarkDeclarations.newBuilder("TX_WATERMARK")
            .typeLong()
            .combineFunctionMin()
            .combineWaitForAllChannels(true)
            .defaultHandlingStrategyIgnore()
            .build();
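
The declaration combines watermarks with a min function and waits for all channels. The following plain-Java sketch shows one reading of that behavior; it is not Flink's implementation, and the class `MinWatermarkCombiner` is hypothetical. The assumption: the combined watermark is the minimum of the latest value seen per input channel, and nothing is produced until every channel has reported at least once.

```java
import java.util.OptionalLong;

/** Hypothetical model of min-combining watermarks across input channels. */
final class MinWatermarkCombiner {
    private final long[] latest;   // latest watermark per channel
    private final boolean[] seen;  // whether a channel has reported yet

    MinWatermarkCombiner(int channelCount) {
        latest = new long[channelCount];
        seen = new boolean[channelCount];
    }

    /** Records a watermark from one channel; returns the combined value, if available. */
    OptionalLong onWatermark(int channel, long value) {
        // A channel's watermark never moves backwards.
        latest[channel] = seen[channel] ? Math.max(latest[channel], value) : value;
        seen[channel] = true;

        long min = Long.MAX_VALUE;
        for (int i = 0; i < latest.length; i++) {
            if (!seen[i]) {
                return OptionalLong.empty(); // still waiting for this channel
            }
            min = Math.min(min, latest[i]);
        }
        return OptionalLong.of(min);
    }

    public static void main(String[] args) {
        MinWatermarkCombiner combiner = new MinWatermarkCombiner(2);
        System.out.println(combiner.onWatermark(0, 500)); // channel 1 not seen yet
        System.out.println(combiner.onWatermark(1, 300)); // min(500, 300)
        System.out.println(combiner.onWatermark(1, 900)); // min(500, 900)
    }
}
```

Taking the minimum means the combined watermark only promises what every input has confirmed, which is what "everything ≤ LSN has been seen" requires.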

Stage 2 (cont.)
When to Emit Watermarks?
• First idea: leverage in-partition ordering
• New commit LSN → Flush buffered events
• But: Idle stream problem

Stage 2 (cont.)
The Idle-Tables Problem
• Solution: broadcast the TX topic to every WatermarkInjector
• On each transaction END:
  ◦ If count == event_count for this data collection → emit watermark = commit LSN
  ◦ A transaction with zero events for our table still counts as 'complete' → watermark advances
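
The END-event rule can be sketched in plain Java as follows. This is not the PoC's WatermarkInjector; the class, its methods, and the map-based representation of the per-data-collection event counts are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical sketch of the "count == event_count" rule for one table. */
final class WatermarkInjectorSketch {
    private final String dataCollection;                  // e.g. "inventory.orders"
    private final Map<String, Integer> seenPerTx = new HashMap<>();

    WatermarkInjectorSketch(String dataCollection) {
        this.dataCollection = dataCollection;
    }

    /** Called for every change event of our table. */
    void onChangeEvent(String txId) {
        seenPerTx.merge(txId, 1, Integer::sum);
    }

    /**
     * Called for a transaction END event. eventCounts maps each data collection
     * touched by the transaction to its event count. Returns the watermark to
     * emit (the commit LSN), or empty if events for our table are still missing.
     */
    OptionalLong onTransactionEnd(String txId, long commitLsn, Map<String, Integer> eventCounts) {
        int expected = eventCounts.getOrDefault(dataCollection, 0); // zero events: still complete
        int seen = seenPerTx.getOrDefault(txId, 0);
        if (seen != expected) {
            return OptionalLong.empty();
        }
        seenPerTx.remove(txId);
        return OptionalLong.of(commitLsn);
    }
}
```

The sketch only evaluates the rule at the moment END arrives. A complete operator would also have to re-check pending END events when late change events show up, which is left out here.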

Stage 3: Custom Join Operator
Buffer Until Both Inputs Have Crossed the Same Watermark
• Per key, two keyed state stores (orders, order_lines)
• Tracks the minimum watermark across both inputs
• On watermark advance:
  ◦ For each order touched in the transaction → join with latest line per key
  ◦ For each order line touched → join with latest order
  ◦ Emit the joined result, then forward the watermark
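
The buffering part of this stage can be sketched in plain Java. The sketch covers only "hold events until both inputs have crossed the watermark, then release them in commit order"; the join itself and Flink's keyed state are omitted, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical two-input buffer that releases events once both inputs passed their LSN. */
final class TxAwareBufferSketch {

    /** A buffered change event, tagged with the commit LSN of its transaction. */
    record Event(String payload, long commitLsn) {}

    private final List<Event> buffer = new ArrayList<>();
    private final long[] inputWatermark = {Long.MIN_VALUE, Long.MIN_VALUE};
    private long released = Long.MIN_VALUE; // highest LSN already released

    void onEvent(Event event) {
        buffer.add(event);
    }

    /** Updates one input's watermark; returns the events now safe to process, in commit order. */
    List<Event> onWatermark(int input, long lsn) {
        inputWatermark[input] = Math.max(inputWatermark[input], lsn);
        long min = Math.min(inputWatermark[0], inputWatermark[1]);

        List<Event> ready = new ArrayList<>();
        if (min <= released) {
            return ready; // the slower input has not advanced
        }
        released = min;
        buffer.removeIf(event -> {
            if (event.commitLsn() <= min) {
                ready.add(event);
                return true;
            }
            return false;
        });
        ready.sort(Comparator.comparingLong(Event::commitLsn));
        return ready;
    }
}
```

In this model an order line from a later transaction stays buffered until the orders input has confirmed that it passed the same commit LSN, which is what rules out results like (po1, ol2).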

Stage 3 (cont.): Join Operator State in Action

Stage 4: Aggregation Operator
Per-Purchase-Order State, Emitted on Watermark
    public class TxAwareAggregationFunction {
        public void processRecord(DataChangeEventPair record, ...) {
            // accumulate join result into the order's state
            OrderWithLines order = (state.value() == null)
                ? OrderWithLines.from(record)
                : state.value().update(record);
            state.update(order);
        }
        public void onWatermark(Watermark wm, ...) {
            // emit ONLY if this order was modified at this commitLsn
            if (order.commitLsn() == ((LongWatermark) wm).getValue()) {
                collector.collect(order);
            }
        }
    }

Stage 5: Kafka Sink
One write per transaction-modified order
    {
      "id": 10001,
      "purchaser": 1001,
      "shippingAddress": "123 Main St",
      "lines": [
        {"productId": 101, "quantity": 2, "price": 19.99},
        {"productId": 102, "quantity": 1, "price": 49.99}
      ]
    }
✅ No partial documents downstream
✅ Write amplification eliminated: N writes/transaction, not N×K
✅ Elasticsearch sees coherent documents on every refresh

Putting Everything Together
DataStreamJob.java
    // Each source: data → CommitLsnFixer → WatermarkInjector → keyBy
    KeyedPartitionStream<Long, DataChangeEvent> orders = env.fromSource(ordersSource)
        .connectAndProcess(transactionsStream, new CommitLsnFixer())
        .connectAndProcess(transactionsStream, new WatermarkInjector("inventory.orders"))
        .keyBy(DataChangeEvent::id);
    KeyedPartitionStream<Long, DataChangeEvent> orderLines = /* ...same shape... */;
    // Join + aggregate + sink
    connectAndProcess(orders, orderLines, new TxAwareJoinProcessFunction(...))
        .keyBy(DataChangeEventPair::id)
        .process(new TxAwareAggregationFunction())
        .toSink(kafkaSink);

Discussion
Photo: Ky0n Cheng, https://flic.kr/p/XhtG8Z (Public domain)

Results: Before vs. After
BEFORE
Same TX, three line inserts:
    order=10001 lines=[]
    order=10001 lines=[L1]           ← retract previous
    order=10001 lines=[L1, L2]       ← retract previous
    order=10001 lines=[L1, L2, L3]   ← retract previous
Cross-TX (TX1, TX2, TX3 case): ~6 emissions including the impossible (po1, ol2) state
AFTER
Same TX:
    order=10001 lines=[L1, L2, L3]   ← single, complete emission
Cross-TX (TX1, TX2, TX3 case): 3 emissions, one per transaction, in commit order, all valid
Every emission downstream now corresponds to a real, committed state of the source database.

Limitations
Parallelism = 1 only
Higher parallelism needs per-partition transaction metadata from Debezium, so each task knows its share of the transaction.
Single-output-record consistency, not multi-record
A transaction touching two orders still emits two messages. Both are correct in isolation, but a consumer can read between them. True end-to-end multi-record transactionality would need 2PC all the way to the destination.

What's Needed for Production…
…and How the Community Can Help
DEBEZIUM
• Per-partition transaction metadata: unblocks parallelism > 1
• Fix DBZ-1555: eliminates the CommitLsnFixer stage entirely
• Transaction metadata for snapshots
FLINK
• Native transaction-aware operators: built-in joins / aggregations that respect commit boundaries
• First-class Flink SQL support: one-keyword opt-in, not a 1500-line PoC

Long-Term Vision
Flink SQL Support → Transactional CDC processing becomes a one-keyword opt-in.
    SELECT TRANSACTIONALLY
        po.id,
        po.order_date,
        ARRAY_AGG(ROW(ol.id, ol.product_id, ol.quantity))
    FROM purchase_orders po
    LEFT JOIN order_lines ol
        ON ol.order_id = po.id
    GROUP BY po.id, po.order_date;

Ad-Hoc Alternative
Process Table Functions (h/t Martijn Visser 👋)
    @FunctionHint(output = @DataTypeHint("ROW<id BIGINT, ..., lines ARRAY<ROW<...>>>"))
    public class TxAwareJoinAndAggregatePTF extends ProcessTableFunction<Row> implements ChangelogFunction {
        public void eval(
            Context ctx,
            @StateHint MapView<String, String> expectedCounts,
            @StateHint MapView<String, String> actualCounts,
            @StateHint MapView<String, String> bufferedOrders,
            @StateHint MapView<String, String> bufferedLines,
            @StateHint MapView<Long, String> orderState,
            @StateHint ListView<String> pendingTxIds,
            @StateHint MapView<String, Long> txIdToCommitLsn,
            @ArgumentHint(value={SET_SEMANTIC_TABLE,OPTIONAL_PARTITION_BY}, name="orders") Row order,
            @ArgumentHint(value={SET_SEMANTIC_TABLE,OPTIONAL_PARTITION_BY}, name="orderLines") Row orderLine,
            @ArgumentHint(value={SET_SEMANTIC_TABLE,OPTIONAL_PARTITION_BY}, name="transactions") Row transaction
        ) throws Exception {
            if (transaction != null) processTransaction(...);
            else if (order != null) processOrder(...);
            else if (orderLine != null) processOrderLine(...);
            tryEmit(...);
        }
    }

Takeaways
1. Flink unaware of source transactions → leaking partial AND incorrect results downstream
2. Transaction metadata provided by Debezium
3. Custom watermarking and buffered operators ensure correct processing

Shameless Plug ;)
My New Project: Hardwood
"Hardwood: Building a Parquet Parser From Scratch (With a Little Help From AI)"
🗓 Wednesday, 1:15 PM
📍 Meetup Hub - Expo Hall

Get In Touch
📧 gmorling@confluent.io
@gunnarmorling
@gunnarmorling.dev
morling.dev
