Unified CDC Ingestion and Processing with Apache Flink and Iceberg
Mike Araujo & Sharon Xie
Slide 2
Your Speakers
Mike Araujo
Principal Engineer
Sharon Xie
Head of Product
Slide 3
Today we are going to learn how to unify CDC ingestion and processing with Apache Flink and Apache Iceberg.
Slide 4
Change Data Capture
The observer for your database.
Slide 5
CDC Use Case
CDC lets you process database changes in real time and feed them to other systems or services:
● Database replication
● Real-time ETL / ELT
● Real-time Analytics
● Powering a search index / cache
● And many more
Slide 6
A Typical Real-time CDC Architecture
Slide 7
The Downside?
Hard to debug 🧐
Storage and processing are expensive 💰
Slide 8
Key Component: Kafka Topics
● Durable storage for CDC streaming
● Decoupled producers / consumers
○ Scalable and independent data ingestion and processing
● Can be compacted
○ Efficient when setting up new processing or reprocessing data (see the sketch below)
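As a concrete sketch of setting up such a topic (broker address, topic name, and partition/replica counts are placeholders, not from the deck):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCdcTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps the latest record per key, which is
            // what makes bootstrapping new consumers from the topic efficient.
            NewTopic topic = new NewTopic("orders.cdc", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```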
Slide 9
Key Component: Flink
● Supports CDC processing
○ Incremental processing for CDC records
○ Native change semantics w/ dynamic table concept
● CDC connectors
○ Read & write CDC records (see the sketch below)
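A minimal sketch of wiring this up with Flink's Kafka connector and the debezium-json format (table, topic, and column names are assumptions):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcSourceTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // 'debezium-json' interprets each Debezium envelope as changelog rows,
        // so downstream SQL sees this as a dynamic table.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  order_id BIGINT," +
            "  amount   DECIMAL(10, 2)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders.cdc'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'debezium-json'" +
            ")");
    }
}
```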
Slide 10
How Does Flink Process CDC Records?
Slide 11
CDC Processing w/ Flink
Slide 12
CDC w/ Flink: An Example (cont)
Slide 13
CDC w/ Flink: An Example (cont)
Flink has 4 record types: insert, update_before, update_after, delete
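These map to Flink's RowKind tags (+I, -U, +U, -D). A tiny self-contained demo that surfaces them, using a streaming GROUP BY whose counts get retracted and re-emitted:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ChangelogDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // A streaming GROUP BY emits a changelog: the second 'a' retracts the
        // old count (-U) and emits the new one (+U).
        Table counts = tEnv.sqlQuery(
            "SELECT word, COUNT(*) AS cnt " +
            "FROM (VALUES ('a'), ('a'), ('b')) AS t(word) " +
            "GROUP BY word");

        tEnv.toChangelogStream(counts).print(); // rows are tagged +I/-U/+U/-D
        env.execute("changelog-demo");
    }
}
```

Running it prints rows tagged +I, -U, and +U as the count for 'a' moves from 1 to 2.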
Slide 14
CDC w/ Flink: An Example (cont)
Slide 15
Benefit of the Architecture
⏳ Low Latency: Event-Driven + Stream Processing
💪 Highly Scalable: Kafka and Flink are distributed
Slide 16
The Downside?
Hard to debug 🧐
● Can’t directly query Kafka topics
Kafka is expensive 💰
Slide 17
Kafka Storage Cost
CDC topics need infinite retention
● Backfill / Reprocess
● Set up new processing
Compacted topics can’t use tiered storage
Slide 18
Kafka Operational Cost
Hard to scale
- Needs partition redistribution
- Downtime for existing producers / consumers
Hard to change partition count
- Must manually generate the reassignment plan
Hard to recover
- Requires standing up another cluster
Slide 19
Can we replace Kafka?
We need a storage system that:
● Is cheap and scales well
● Supports CDC streaming reads & writes
Slide 20
Can Iceberg Replace Kafka?
Slide 21
Can out-of-the-box (OOB) Iceberg do it?
✅ Cheap and scales well
Slide 22
⚠ CDC Streaming Writes
Iceberg has upsert tables, but Flink produces a high volume of retractions (deletes and inserts)
● Lots of delete files
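For reference, a sketch of such an upsert table in Flink SQL (catalog, database, and column names are assumptions); with upsert writes, every upserted key becomes an equality delete plus an insert, which is where the pile of delete files comes from:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CreateUpsertTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assumes an Iceberg catalog 'ice' with database 'db' is registered.
        // format-version 2 + upsert mode: each upserted key writes an
        // equality delete plus an insert, hence the growing delete files.
        tEnv.executeSql(
            "CREATE TABLE ice.db.orders_upsert (" +
            "  order_id BIGINT," +
            "  amount   DECIMAL(10, 2)," +
            "  PRIMARY KEY (order_id) NOT ENFORCED" +
            ") WITH (" +
            "  'format-version' = '2'," +
            "  'write.upsert.enabled' = 'true'" +
            ")");
    }
}
```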
Slide 23
❌ CDC Streaming Reads
Slide 24
Conclusion
Out-of-the-box Iceberg doesn’t work as a Kafka replacement
Slide 25
The Trick
● Make an Iceberg table behave like a Kafka topic
○ Store change logs in append tables
○ Merge changes in append tables on read (see the sketch below)
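One way this could look (a sketch, assuming the changes arrive as Debezium JSON; topic and field names are placeholders): read the raw envelope with the plain json format rather than debezium-json, so each change event surfaces as an ordinary INSERT row and the stream stays append-only:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ChangesAsAppendStream {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Plain 'json' (not 'debezium-json') keeps the Debezium envelope as-is:
        // every change event is an ordinary INSERT row, i.e. an append stream.
        tEnv.executeSql(
            "CREATE TABLE orders_changes (" +
            "  `before` ROW<order_id BIGINT, amount DECIMAL(10, 2)>," +
            "  `after`  ROW<order_id BIGINT, amount DECIMAL(10, 2)>," +
            "  op       STRING," +    // c / u / d / r, per the Debezium envelope
            "  ts_ms    BIGINT" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders.cdc'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json'" +
            ")");
    }
}
```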
Slide 26
How Does It Work?
Slide 27
Change Stream -> Append Stream
Slide 28
Append Stream -> Append Table for Storing
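A hypothetical shape for that append table, partitioned by an hour column so old partitions can later be merged and dropped (this schema is an assumption, not the deck's exact DDL):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CreateAppendTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assumes an Iceberg catalog 'ice' with database 'db' is registered.
        // No primary key, no upserts, no delete files: each row is one change
        // event, and merging happens at read time instead.
        tEnv.executeSql(
            "CREATE TABLE ice.db.orders_log (" +
            "  order_id   BIGINT," +
            "  amount     DECIMAL(10, 2)," +
            "  op         STRING," +
            "  ts_ms      BIGINT," +
            "  event_hour STRING" +
            ") PARTITIONED BY (event_hour)");
    }
}
```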
Slide 29
What’s in the Append Table?
Slide 30
Querying the Iceberg Table
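A common merge-on-read formulation, sketched against the hypothetical orders_log schema above: the latest change per key wins, and keys whose latest change is a delete are filtered out:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MergeOnReadQuery {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Assumes the Iceberg catalog 'ice' and the orders_log sketch above.
        // Latest change per key wins; keys whose last change was a delete
        // ('d') disappear from the result.
        tEnv.executeSql(
            "SELECT order_id, amount " +
            "FROM (" +
            "  SELECT *, ROW_NUMBER() OVER (" +
            "    PARTITION BY order_id ORDER BY ts_ms DESC) AS rn " +
            "  FROM ice.db.orders_log" +
            ") WHERE rn = 1 AND op <> 'd'").print();
    }
}
```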
Slide 31
What about query performance?
● Records to merge grow over time
● Can we do “compaction” similar to Kafka topics?
Slide 32
Streaming Job Writing to the Current Partition
Slide 33
Batch Job Merges Keys of Old Partition
Slide 34
Delete the Merged Partition
Slide 35
Batch Job SQL
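The deck's actual SQL isn't reproduced in this extraction; one plausible shape, again against the hypothetical orders_log schema, folds the surviving latest row per key from a closed hour into the next hour's partition, after which the old raw partition can be dropped:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MergeOldPartition {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Fold the closed hour's latest row per surviving key into the next
        // partition, so readers only merge partitions that still exist.
        // Hour values are placeholders.
        tEnv.executeSql(
            "INSERT INTO ice.db.orders_log " +
            "PARTITION (event_hour = '2024-01-01-01') " +
            "SELECT order_id, amount, op, ts_ms " +
            "FROM (" +
            "  SELECT *, ROW_NUMBER() OVER (" +
            "    PARTITION BY order_id ORDER BY ts_ms DESC) AS rn " +
            "  FROM ice.db.orders_log WHERE event_hour = '2024-01-01-00'" +
            ") WHERE rn = 1 AND op <> 'd'");
    }
}
```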
Slide 36
Use the Iceberg API to Delete the Old Partition
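A sketch with the Iceberg Java API (catalog type, warehouse path, and names are assumptions): DeleteFiles drops every data file whose rows all match the filter, so removing a fully merged partition is a metadata-only commit:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class DropMergedPartition {
    public static void main(String[] args) {
        // Placeholder catalog and warehouse path.
        HadoopCatalog catalog =
                new HadoopCatalog(new Configuration(), "file:///tmp/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("db", "orders_log"));

        // Deletes every data file whose rows all match the filter, i.e. the
        // whole merged partition, in a single commit.
        table.newDelete()
             .deleteFromRowFilter(Expressions.equal("event_hour", "2024-01-01-00"))
             .commit();
    }
}
```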
Slide 37
Putting it all together
Slide 38
The result?
- ✅ Cheap and scales well
- ✅ CDC Streaming Reads & Writes
- ✅ Directly queryable by many query engines
- ⚠ Minute-level latency
Slide 39
🎁 Wrapping Up
Slide 40
Key Takeaways
1. Iceberg “supports” CDC ingestion and processing
2. Store change streams in append tables
3. Run batch jobs to optimize query performance