Unified CDC Ingestion and Processing with Apache Flink and Iceberg

This talk was given at Current25 London.

Apache Iceberg is a robust foundation for large-scale data lakehouses, yet its incremental processing model lacks native support for CDC, making updates and deletes challenging. While many teams turn to Kafka and Flink for CDC processing, this comes with high infrastructure costs and operational complexity. We needed a cost-effective solution with minute-level latency that supports dozens of terabytes of CDC data processing per day. Since we were already using Flink for Iceberg ingestion, we set out to extend it for CDC processing as well.

In this session, we’ll share how we tackled this challenge by writing change data streams as append tables and reading append tables as change streams. This approach makes Iceberg tables function like Kafka topics, with two added benefits:

• Iceberg tables remain directly queryable, making troubleshooting and application integration more approachable and streamlined.
• Similar to Kafka consumers, multiple engines can independently process Iceberg tables. However, unlike Kafka clusters, there is no need to scale infrastructure.

We will also explore optimization opportunities with Iceberg and Flink, including when to materialize tables and how to choose between append and upsert modes to enhance integration. If you’re working on data processing over Iceberg, this session will provide practical, battle-tested strategies to overcome limitations and scale efficiently while keeping the infrastructure simple.

Sharon Xie
May 22, 2025

Transcript

1. CDC Use Case
CDC lets you process a database's changes in real time and feed other systems or services:
• Database replication
• Real-time ETL / ELT
• Real-time analytics
• Powering a search index / cache
• And many more
2. Key Component: Kafka Topics
• Durable storage for CDC streaming
• Decoupled producers / consumers
  ◦ Scalable and independent data ingestion and processing
• Can be compacted (a topic-creation sketch follows below)
  ◦ Efficient when setting up new processing or reprocessing data
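For concreteness, a compacted CDC topic is one with cleanup.policy=compact, so Kafka retains only the newest record per key. A minimal sketch using Kafka's Java AdminClient; the topic name, partition count, and replication factor are illustrative, not from the talk:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateCompactedCdcTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Compaction keeps the latest record per key, so a new consumer
            // can rebuild current state without replaying every change.
            NewTopic topic = new NewTopic("cdc.orders", 12, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```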
3. Key Component: Flink
• Supports CDC processing
  ◦ Incremental processing for CDC records
  ◦ Native change semantics w/ the dynamic table concept
• CDC connectors (example below)
  ◦ Read & write CDC records
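To make the connector side concrete, a Flink SQL source table backed by the Flink CDC MySQL connector could be declared as below. The connection details and schema are placeholders, not the deployment described in the talk; option names follow the flink-cdc documentation:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcSourceExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // A dynamic table whose changelog is fed by MySQL binlog events.
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  id BIGINT," +
                "  status STRING," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'localhost'," +
                "  'port' = '3306'," +
                "  'username' = 'flink'," +
                "  'password' = 'secret'," +
                "  'database-name' = 'shop'," +
                "  'table-name' = 'orders'" +
                ")");

        // Any query over this table observes inserts, updates, and deletes
        // as a stream of changelog records. (Blocks: a streaming query.)
        tEnv.executeSql("SELECT * FROM orders").print();
    }
}
```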
4. CDC w/ Flink: An Example (cont)
Flink has four record types: insert, update_before, update_after, and delete (illustrated below).
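These four record types map directly to Flink's RowKind enum. A minimal, self-contained sketch with made-up rows:

```java
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;

public class ChangelogRecords {
    public static void main(String[] args) {
        // An insert and a delete each carry a single row.
        Row insert = Row.ofKind(RowKind.INSERT, 43L, "pending");
        Row delete = Row.ofKind(RowKind.DELETE, 41L, "cancelled");

        // An update to one key is emitted as a retract/emit pair.
        Row before = Row.ofKind(RowKind.UPDATE_BEFORE, 42L, "pending");
        Row after  = Row.ofKind(RowKind.UPDATE_AFTER,  42L, "shipped");

        // Row.toString() uses Flink's changelog shorthand: +I, -U, +U, -D.
        for (Row r : new Row[] {insert, before, after, delete}) {
            System.out.println(r);
        }
    }
}
```

The update_before / update_after pairing is what lets downstream operators retract the old value of a key before applying the new one.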
5. Benefits of the Architecture
⏳ Low latency: event-driven + stream processing
💪 Highly scalable: Kafka and Flink are distributed
6. Kafka Storage Cost
CDC topics need infinite retention for:
• Backfill / reprocessing
• Setting up new processing
Compacted topics can't use tiered storage.
7. Kafka Operational Cost
Hard to scale:
• Needs partition redistribution
• Downtime for existing producers / consumers
Hard to change partition count:
• Must manually generate the reassignment plan
Hard to recover:
• Requires standing up another cluster
8. Can We Replace Kafka?
We need a storage system that:
• Is cheap and scales well
• Supports CDC streaming reads & writes
9. ⚠ CDC Streaming Writes
Iceberg supports upsert tables. But Flink produces a high volume of retractions (deletes and inserts):
• Lots of delete files (see the table sketch below)
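For context, an upsert-mode Iceberg table in Flink SQL looks roughly like this; the catalog and schema are hypothetical, while the two table properties follow the Iceberg Flink documentation:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IcebergUpsertTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assumes an Iceberg catalog named `ice` has already been registered
        // (CREATE CATALOG ice WITH ('type'='iceberg', ...)).
        tEnv.executeSql(
                "CREATE TABLE ice.db.orders (" +
                "  id BIGINT," +
                "  status STRING," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'format-version' = '2'," +      // v2 enables row-level deletes
                "  'write.upsert.enabled' = 'true'" +
                ")");
        // The catch: each upsert writes an equality delete plus an insert,
        // so high-churn CDC streams accumulate many small delete files.
    }
}
```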
10. The Trick
• Make an Iceberg table behave like a Kafka topic
  ◦ Store change logs in append tables
  ◦ Merge changes from append tables on read (sketch below)
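A minimal sketch of the pattern under a hypothetical schema (the op and change_ts columns, table names, and catalog are illustrative, not the speaker's exact design): the changelog lands as plain append-only rows, and readers reconstruct current state by keeping the newest row per key and dropping deletes:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ChangelogAsAppendTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Append-only Iceberg table: one row per change event, no primary
        // key, no delete files -- it behaves like a durable Kafka topic.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS ice.db.orders_changelog (" +
                "  id BIGINT," +
                "  status STRING," +
                "  op STRING," +              // change kind: I / U / D
                "  change_ts TIMESTAMP(3)" +  // ordering within a key
                ")");

        // Merge-on-read: latest change per key wins; deletes drop out.
        tEnv.executeSql(
                "CREATE TEMPORARY VIEW orders_current AS " +
                "SELECT id, status FROM (" +
                "  SELECT *, ROW_NUMBER() OVER (" +
                "    PARTITION BY id ORDER BY change_ts DESC) AS rn" +
                "  FROM ice.db.orders_changelog" +
                ") WHERE rn = 1 AND op <> 'D'");
    }
}
```

Because the table is append-only, writers never produce delete files, and multiple engines can read the same changelog independently, much like Kafka consumer groups.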
11. What about the query performance?
• Records to merge grow over time
• Can we do "compaction" similar to Kafka topics? (one approach sketched below)
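One way to approximate Kafka-style log compaction, continuing the hypothetical schema above: a scheduled batch job materializes the merged state into a snapshot table, so readers only merge the changelog written after the last run:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ChangelogCompactionJob {
    public static void main(String[] args) {
        // Batch mode: runs on a schedule, not as a long-lived stream.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Rewrite the snapshot table with the latest surviving row per key,
        // analogous to Kafka retaining only the newest record per key.
        tEnv.executeSql(
                "INSERT OVERWRITE ice.db.orders_snapshot " +
                "SELECT id, status FROM (" +
                "  SELECT *, ROW_NUMBER() OVER (" +
                "    PARTITION BY id ORDER BY change_ts DESC) AS rn" +
                "  FROM ice.db.orders_changelog" +
                ") WHERE rn = 1 AND op <> 'D'");
    }
}
```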
12. The Result?
✅ Cheap, and scales well
✅ CDC streaming reads & writes
✅ Directly queryable by many query engines
⚠ Latency at the minute level
13. Key Takeaways
1. Iceberg “supports” CDC ingestion and processing
2. Store change streams in append tables
3. Run batch jobs to optimize query performance