Unified CDC Ingestion and Processing with Apache Flink and Iceberg

This talk was given at Current25 London.

Apache Iceberg is a robust foundation for large-scale data lakehouses, yet its incremental processing model lacks native support for CDC, making updates and deletes challenging. While many teams turn to Kafka and Flink for CDC processing, this comes with high infrastructure costs and operational complexity. We needed a cost-effective solution with minute-level latency that supports dozens of terabytes of CDC data processing per day. Since we were already using Flink for Iceberg ingestion, we set out to extend it for CDC processing as well.

In this session, we’ll share how we tackled this challenge by writing change data streams as append tables and reading append tables as change streams. This approach makes Iceberg tables function like Kafka topics, with two added benefits:

• Iceberg tables remain directly queryable, making troubleshooting and application integration more approachable and streamlined.
• Similar to Kafka consumers, multiple engines can independently process Iceberg tables. However, unlike Kafka clusters, there is no need to scale infrastructure.

We will also explore optimization opportunities with Iceberg and Flink, including when to materialize tables and how to choose between append and upsert modes to enhance integration. If you’re working on data processing over Iceberg, this session will provide practical, battle-tested strategies to overcome limitations and scale efficiently while keeping the infrastructure simple.

Sharon Xie
May 22, 2025

Transcript

1. CDC Use Case
CDC lets you process a database's changes in real time and feed other systems or services:
• Database replication
• Real-time ETL / ELT
• Real-time analytics
• Powering a search index / cache
• And many more
2. Key Component: Kafka Topics
• Durable storage for CDC streaming
• Decoupled producers / consumers
  ◦ Scalable and independent data ingestion and processing
• Can be compacted (a topic-creation sketch follows below)
  ◦ Efficient when setting up new processing or reprocessing data
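For concreteness, a compacted CDC topic is one with cleanup.policy=compact, so Kafka retains only the newest record per key. A minimal sketch using Kafka's Java AdminClient; the topic name, partition count, and replication factor are illustrative, not from the talk:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateCompactedCdcTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Compaction keeps the latest record per key, so a new consumer
            // can rebuild current state without replaying every change.
            NewTopic topic = new NewTopic("cdc.orders", 12, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```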
3. Key Component: Flink
• Supports CDC processing
  ◦ Incremental processing for CDC records
  ◦ Native change semantics w/ the dynamic table concept
• CDC connectors (example below)
  ◦ Read & write CDC records
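To make the connector side concrete, a Flink SQL source table backed by the Flink CDC MySQL connector could be declared as below. The connection details and schema are placeholders, not the deployment described in the talk; option names follow the flink-cdc documentation:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcSourceExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // A dynamic table whose changelog is fed by MySQL binlog events.
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  id BIGINT," +
                "  status STRING," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'localhost'," +
                "  'port' = '3306'," +
                "  'username' = 'flink'," +
                "  'password' = 'secret'," +
                "  'database-name' = 'shop'," +
                "  'table-name' = 'orders'" +
                ")");

        // Any query over this table observes inserts, updates, and deletes
        // as a stream of changelog records. (Blocks: a streaming query.)
        tEnv.executeSql("SELECT * FROM orders").print();
    }
}
```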
4. CDC w/ Flink: An Example (cont)
Flink has four record types: insert, update_before, update_after, and delete (illustrated below).
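These four record types map directly to Flink's RowKind enum. A minimal, self-contained sketch with made-up rows:

```java
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;

public class ChangelogRecords {
    public static void main(String[] args) {
        // An insert and a delete each carry a single row.
        Row insert = Row.ofKind(RowKind.INSERT, 43L, "pending");
        Row delete = Row.ofKind(RowKind.DELETE, 41L, "cancelled");

        // An update to one key is emitted as a retract/emit pair.
        Row before = Row.ofKind(RowKind.UPDATE_BEFORE, 42L, "pending");
        Row after  = Row.ofKind(RowKind.UPDATE_AFTER,  42L, "shipped");

        // Row.toString() uses Flink's changelog shorthand: +I, -U, +U, -D.
        for (Row r : new Row[] {insert, before, after, delete}) {
            System.out.println(r);
        }
    }
}
```

The update_before / update_after pairing is what lets downstream operators retract the old value of a key before applying the new one.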
5. Benefits of the Architecture
⏳ Low latency: event-driven + stream processing
💪 Highly scalable: Kafka and Flink are distributed
6. Kafka Storage Cost
CDC topics need infinite retention for:
• Backfill / reprocessing
• Setting up new processing
Compacted topics can't use tiered storage.
7. Kafka Operational Cost
Hard to scale:
• Needs partition redistribution
• Downtime for existing producers / consumers
Hard to change partition count:
• Must manually generate the reassignment plan
Hard to recover:
• Requires standing up another cluster
8. Can We Replace Kafka?
We need a storage system that:
• Is cheap and scales well
• Supports CDC streaming reads & writes
9. ⚠ CDC Streaming Writes
Iceberg supports upsert tables. But Flink produces a high volume of retractions (deletes and inserts):
• Lots of delete files (see the table sketch below)
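For context, an upsert-mode Iceberg table in Flink SQL looks roughly like this; the catalog and schema are hypothetical, while the two table properties follow the Iceberg Flink documentation:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IcebergUpsertTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assumes an Iceberg catalog named `ice` has already been registered
        // (CREATE CATALOG ice WITH ('type'='iceberg', ...)).
        tEnv.executeSql(
                "CREATE TABLE ice.db.orders (" +
                "  id BIGINT," +
                "  status STRING," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'format-version' = '2'," +      // v2 enables row-level deletes
                "  'write.upsert.enabled' = 'true'" +
                ")");
        // The catch: each upsert writes an equality delete plus an insert,
        // so high-churn CDC streams accumulate many small delete files.
    }
}
```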
10. The Trick
• Make an Iceberg table behave like a Kafka topic
  ◦ Store change logs in append tables
  ◦ Merge changes from append tables on read (sketch below)
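A minimal sketch of the pattern under a hypothetical schema (the op and change_ts columns, table names, and catalog are illustrative, not the speaker's exact design): the changelog lands as plain append-only rows, and readers reconstruct current state by keeping the newest row per key and dropping deletes:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ChangelogAsAppendTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Append-only Iceberg table: one row per change event, no primary
        // key, no delete files -- it behaves like a durable Kafka topic.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS ice.db.orders_changelog (" +
                "  id BIGINT," +
                "  status STRING," +
                "  op STRING," +              // change kind: I / U / D
                "  change_ts TIMESTAMP(3)" +  // ordering within a key
                ")");

        // Merge-on-read: latest change per key wins; deletes drop out.
        tEnv.executeSql(
                "CREATE TEMPORARY VIEW orders_current AS " +
                "SELECT id, status FROM (" +
                "  SELECT *, ROW_NUMBER() OVER (" +
                "    PARTITION BY id ORDER BY change_ts DESC) AS rn" +
                "  FROM ice.db.orders_changelog" +
                ") WHERE rn = 1 AND op <> 'D'");
    }
}
```

Because the table is append-only, writers never produce delete files, and multiple engines can read the same changelog independently, much like Kafka consumer groups.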
11. What about the query performance?
• Records to merge grow over time
• Can we do "compaction" similar to Kafka topics? (one approach sketched below)
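One way to approximate Kafka-style log compaction, continuing the hypothetical schema above: a scheduled batch job materializes the merged state into a snapshot table, so readers only merge the changelog written after the last run:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ChangelogCompactionJob {
    public static void main(String[] args) {
        // Batch mode: runs on a schedule, not as a long-lived stream.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Rewrite the snapshot table with the latest surviving row per key,
        // analogous to Kafka retaining only the newest record per key.
        tEnv.executeSql(
                "INSERT OVERWRITE ice.db.orders_snapshot " +
                "SELECT id, status FROM (" +
                "  SELECT *, ROW_NUMBER() OVER (" +
                "    PARTITION BY id ORDER BY change_ts DESC) AS rn" +
                "  FROM ice.db.orders_changelog" +
                ") WHERE rn = 1 AND op <> 'D'");
    }
}
```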
12. The Result?
✅ Cheap, and scales well
✅ CDC streaming reads & writes
✅ Directly queryable by many query engines
⚠ Latency at the minute level
13. Key Takeaways
1. Iceberg “supports” CDC ingestion and processing
2. Store change streams in append tables
3. Run batch jobs to optimize query performance