Sharon Xie
July 25, 2025

Incremental Change Processing with Apache Flink and Iceberg

Apache Iceberg is a robust foundation for large-scale data lakehouses, yet its current support for change data capture (CDC) is limited, making updates and deletes challenging for incremental processing. While the unreleased Iceberg V3 will introduce native CDC support, many production environments still run on V2 and require pragmatic workarounds.

In this talk, we’ll explore how to implement incremental change processing over Iceberg V2 using Apache Flink, by writing change data streams to append tables and reading those append tables back as change streams. We’ll also walk through the trade-offs between append and upsert modes, and how to choose the right one for your workload.

Finally, we’ll preview what Iceberg V3 brings to the table with native CDC support, and how it shifts the design landscape for real-time pipelines. If you're building data pipelines on Iceberg, this session will provide you with pragmatic strategies to overcome existing limitations and scale efficiently while keeping the infrastructure simple.


Transcript

  1. Incremental Change Processing with Apache Flink and Iceberg
     Sharon Xie, Founding Engineer & Head of Product @Decodable
     2025-07-23
  2. Processing Change Stream w/ Flink (cont)
     Flink has 4 record types: insert, update_before, update_after, delete
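For reference, these four change types correspond to Flink's `RowKind` enum in its public API. A minimal sketch of how a changelog stream encodes them (the column values are made up for illustration):

```java
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;

public class RowKindDemo {
    public static void main(String[] args) {
        // An update in a changelog stream is encoded as a retraction pair:
        // the old value (UPDATE_BEFORE) followed by the new value (UPDATE_AFTER).
        Row before = Row.ofKind(RowKind.UPDATE_BEFORE, 1L, "old balance");
        Row after  = Row.ofKind(RowKind.UPDATE_AFTER, 1L, "new balance");

        // Inserts and deletes each carry a single row.
        Row insert = Row.ofKind(RowKind.INSERT, 2L, "first value");
        Row delete = Row.ofKind(RowKind.DELETE, 2L, "first value");

        System.out.println(before + " " + after + " " + insert + " " + delete);
    }
}
```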
  3. What is needed?
     • Flink can read the changes to a table as a stream
     • Flink can write change streams to a table
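In Flink SQL terms, the two requirements on this slide look roughly like the sketch below. The table names (`orders`, `orders_sink`, `orders_cdc`) are hypothetical; the `streaming` and `monitor-interval` read options come from the Iceberg Flink connector:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IcebergStreamIo {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Requirement 1: read an Iceberg table's changes as a stream.
        // The connector polls for newly committed snapshots incrementally.
        tEnv.executeSql(
            "SELECT * FROM orders " +
            "/*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */")
            .print();

        // Requirement 2: write a change stream (e.g. a CDC source) to a table.
        tEnv.executeSql("INSERT INTO orders_sink SELECT * FROM orders_cdc");
    }
}
```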
  4. ⚠ CDC Streaming Writes
     Iceberg supports upsert writes, but Flink produces a high volume of retractions (deletes and inserts)
     • Equality deletes
       ◦ Optimized for write performance
       ◦ But slow at query time
     • Lots of delete files
       ◦ Small-files problem
       ◦ Slow at query time
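Upsert mode is enabled per table through standard Iceberg table properties. A hedged sketch (the table and the Hadoop catalog settings are hypothetical); with `write.upsert.enabled`, every update becomes an equality delete plus an insert, which is exactly where the delete-file pileup above comes from:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class UpsertTableSetup {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // 'format-version' = '2' and 'write.upsert.enabled' = 'true' are
        // real Iceberg options; the rest of the setup is illustrative.
        tEnv.executeSql(
            "CREATE TABLE accounts (" +
            "  id BIGINT," +
            "  balance DECIMAL(10, 2)," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'iceberg'," +
            "  'catalog-name' = 'demo'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'file:///tmp/warehouse'," +
            "  'format-version' = '2'," +
            "  'write.upsert.enabled' = 'true'" +
            ")");
    }
}
```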
  5. The Trick
     • Store change events in append tables
     • Merge/materialize changes from append tables on read
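One way to realize the trick, sketched below under assumptions: land every change record in a plain append table whose change type and event time are ordinary columns (the `op` and `change_ts` column names and the `account_cdc_source` table are hypothetical). The table stays append-only, so no delete files are ever written, yet it carries enough information to reconstruct state on read:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class AppendChangeLog {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Plain append table: the change type ('I'/'U'/'D') and event time
        // are ordinary data columns.
        tEnv.executeSql(
            "CREATE TABLE account_changes (" +
            "  id BIGINT," +
            "  balance DECIMAL(10, 2)," +
            "  op STRING," +
            "  change_ts TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'iceberg'," +
            "  'catalog-name' = 'demo'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'file:///tmp/warehouse'," +
            "  'format-version' = '2'" +
            ")");

        // Land every change event as an append; 'account_cdc_source' stands
        // in for whatever CDC source the pipeline actually ingests.
        tEnv.executeSql(
            "INSERT INTO account_changes " +
            "SELECT id, balance, op, change_ts FROM account_cdc_source");
    }
}
```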
  6. What about the query performance?
     • Records to merge grow over time
     • Can we do “compaction” similar to Kafka compacted topics?
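A periodic batch merge can play the role that log compaction plays for Kafka topics. A hedged sketch, reusing the hypothetical `account_changes` table from above: keep only the latest change per key, drop keys whose latest change is a delete, and overwrite a materialized table so readers merge against a bounded history:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BatchCompaction {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Rank changes per key by event time, keep the newest (rn = 1),
        // filter out tombstones, and overwrite the materialized table.
        tEnv.executeSql(
            "INSERT OVERWRITE accounts_materialized " +
            "SELECT id, balance FROM (" +
            "  SELECT *, ROW_NUMBER() OVER (" +
            "    PARTITION BY id ORDER BY change_ts DESC) AS rn" +
            "  FROM account_changes" +
            ") WHERE rn = 1 AND op <> 'D'");
    }
}
```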
  7. Row Lineage
     • Tracks changes to individual rows as they are updated, deleted, or inserted
       ◦ Foundation for incremental change processing
     • Not tracked for rows updated via equality deletes
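For context, the Iceberg V3 spec tracks lineage through per-row metadata fields, `_row_id` and `_last_updated_sequence_number`. How engines will expose them is still evolving, so the query below is speculative; the threshold value is an arbitrary example:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RowLineagePeek {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Speculative: assumes the engine surfaces the V3 lineage fields
        // as queryable metadata columns.
        tEnv.executeSql(
            "SELECT _row_id, _last_updated_sequence_number, id, balance " +
            "FROM accounts " +
            "WHERE _last_updated_sequence_number > 41").print();
    }
}
```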
  8. What does this mean for Flink?
     • Writer side
       ◦ Flink uses equality deletes
       ◦ Can’t track row lineage information
     • Reader side
       ◦ Needs to derive a stream of changes from row lineage information
       ◦ No development yet
     • TL;DR: the framework is there, but the solution is not ready
  9. Key Takeaways
     1. Incremental change processing with Iceberg V2 requires workarounds:
        a. Write change streams to append-only tables
        b. Read append tables as change streams
        c. Schedule batch merges to maintain performance
     2. Iceberg V3’s row lineage can make change processing easier
        a. But still needs more development in processing engines
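Putting takeaway 1b together: reading the append table back as a change stream can be approximated with Flink's streaming Top-N pattern, which emits an updating (changelog) stream in which each newer version of a key retracts the previous one. A sketch with the same hypothetical tables and columns as above:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ReadAppendAsChanges {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Streaming-read the append table, then keep the latest row per key.
        // Flink executes the ROW_NUMBER() top-1 pattern as an updating
        // stream: newer versions of a key retract the older ones.
        tEnv.executeSql(
            "CREATE TEMPORARY VIEW account_state AS " +
            "SELECT id, balance, op FROM (" +
            "  SELECT *, ROW_NUMBER() OVER (" +
            "    PARTITION BY id ORDER BY change_ts DESC) AS rn" +
            "  FROM account_changes " +
            "  /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */" +
            ") WHERE rn = 1");

        // Downstream consumers now see a change stream derived from appends;
        // tombstoned keys are filtered out.
        tEnv.executeSql(
            "SELECT * FROM account_state WHERE op <> 'D'").print();
    }
}
```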