Unified CDC Ingestion and Processing with Apache Flink and Iceberg
Mike Araujo & Sharon Xie
Slide 2
Your Speakers
Mike Araujo
Principal Engineer
Sharon Xie
Head of Product
Slide 3
Today we are going to learn how to unify CDC ingestion and processing with Apache Flink and Apache Iceberg.
Slide 4
Change Data Capture
The observer for your database.
Slide 5
CDC Use Case
CDC lets you process database changes in real time and feed them to other systems or services:
● Database replication
● Real-time ETL / ELT
● Real-time Analytics
● Powering a search index / cache
● And many more
Slide 6
A Typical Real-time CDC Architecture
Slide 7
The Downside?
Hard to debug 🧐
Storage and processing are expensive 💰
Slide 8
Key Component: Kafka Topics
● Durable storage for CDC streaming
● Decoupled producers / consumers
○ Scalable and independent data ingestion and processing
● Can be compacted
○ Efficient when setting up new processing or reprocessing data (see the sketch below)
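As a concrete sketch of setting up such a topic (broker address, topic name, and partition/replica counts are placeholders, not from the deck):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCdcTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps the latest record per key, which is
            // what makes bootstrapping new consumers from the topic efficient.
            NewTopic topic = new NewTopic("orders.cdc", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```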
Slide 9
Key Component: Flink
● Supports CDC processing
○ Incremental processing for CDC records
○ Native change semantics w/ dynamic table concept
● CDC connectors
○ Read & write CDC records (see the sketch below)
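A minimal sketch of wiring this up with Flink's Kafka connector and the debezium-json format (table, topic, and column names are assumptions):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcSourceTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // 'debezium-json' interprets each Debezium envelope as changelog rows,
        // so downstream SQL sees this as a dynamic table.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  order_id BIGINT," +
            "  amount   DECIMAL(10, 2)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders.cdc'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'debezium-json'" +
            ")");
    }
}
```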
Slide 10
How Does Flink Process CDC Records?
Slide 11
CDC Processing w/ Flink
Slide 12
CDC w/ Flink: An Example (cont)
Slide 13
CDC w/ Flink: An Example (cont)
Flink has 4 record types: insert, update_before, update_after, delete
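These map to Flink's RowKind tags (+I, -U, +U, -D). A tiny self-contained demo that surfaces them, using a streaming GROUP BY whose counts get retracted and re-emitted:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ChangelogDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // A streaming GROUP BY emits a changelog: the second 'a' retracts the
        // old count (-U) and emits the new one (+U).
        Table counts = tEnv.sqlQuery(
            "SELECT word, COUNT(*) AS cnt " +
            "FROM (VALUES ('a'), ('a'), ('b')) AS t(word) " +
            "GROUP BY word");

        tEnv.toChangelogStream(counts).print(); // rows are tagged +I/-U/+U/-D
        env.execute("changelog-demo");
    }
}
```

Running it prints rows tagged +I, -U, and +U as the count for 'a' moves from 1 to 2.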
Slide 14
CDC w/ Flink: An Example (cont)
Slide 15
Benefit of the Architecture
⏳ Low Latency: Event-Driven + Stream Processing
💪 Highly Scalable: Kafka and Flink are distributed
Slide 16
The Downside?
Hard to debug 🧐
● Can’t directly query Kafka topics
Kafka is expensive 💰
Slide 17
Kafka Storage Cost
CDC topics need infinite retention
● Backfill / Reprocess
● Set up new processing
Compacted topics can’t use tiered storage
Slide 18
Kafka Operational Cost
Hard to scale
- Needs partition redistribution
- Downtime for existing producers / consumers
Hard to change partition count
- Must manually generate the reassignment plan
Hard to recover
- Requires standing up another cluster
Slide 19
Can we replace Kafka?
We need a storage system that:
● Is cheap and scales well
● Supports CDC streaming reads & writes
Slide 20
Can Iceberg Replace Kafka?
Slide 21
Can out-of-the-box (OOB) Iceberg do it?
✅ Cheap and scales well
Slide 22
⚠ CDC Streaming Writes
Iceberg has upsert tables, but Flink produces a high volume of retractions (deletes and inserts)
● Lots of delete files
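For reference, a sketch of such an upsert table in Flink SQL (catalog, database, and column names are assumptions); with upsert writes, every upserted key becomes an equality delete plus an insert, which is where the pile of delete files comes from:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CreateUpsertTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assumes an Iceberg catalog 'ice' with database 'db' is registered.
        // format-version 2 + upsert mode: each upserted key writes an
        // equality delete plus an insert, hence the growing delete files.
        tEnv.executeSql(
            "CREATE TABLE ice.db.orders_upsert (" +
            "  order_id BIGINT," +
            "  amount   DECIMAL(10, 2)," +
            "  PRIMARY KEY (order_id) NOT ENFORCED" +
            ") WITH (" +
            "  'format-version' = '2'," +
            "  'write.upsert.enabled' = 'true'" +
            ")");
    }
}
```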
Slide 23
❌ CDC Streaming Reads
Slide 24
Conclusion
Out-of-the-box Iceberg doesn’t work as a Kafka replacement
Slide 25
The Trick
● Make an Iceberg table behave like a Kafka topic
○ Store change logs in append tables
○ Merge changes in append tables on read (see the sketch below)
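One way this could look (a sketch, assuming the changes arrive as Debezium JSON; topic and field names are placeholders): read the raw envelope with the plain json format rather than debezium-json, so each change event surfaces as an ordinary INSERT row and the stream stays append-only:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ChangesAsAppendStream {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Plain 'json' (not 'debezium-json') keeps the Debezium envelope as-is:
        // every change event is an ordinary INSERT row, i.e. an append stream.
        tEnv.executeSql(
            "CREATE TABLE orders_changes (" +
            "  `before` ROW<order_id BIGINT, amount DECIMAL(10, 2)>," +
            "  `after`  ROW<order_id BIGINT, amount DECIMAL(10, 2)>," +
            "  op       STRING," +    // c / u / d / r, per the Debezium envelope
            "  ts_ms    BIGINT" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders.cdc'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json'" +
            ")");
    }
}
```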
Slide 26
How Does It Work?
Slide 27
Change Stream -> Append Stream
Slide 28
Append Stream -> Append Table for Storing
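A hypothetical shape for that append table, partitioned by an hour column so old partitions can later be merged and dropped (this schema is an assumption, not the deck's exact DDL):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CreateAppendTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assumes an Iceberg catalog 'ice' with database 'db' is registered.
        // No primary key, no upserts, no delete files: each row is one change
        // event, and merging happens at read time instead.
        tEnv.executeSql(
            "CREATE TABLE ice.db.orders_log (" +
            "  order_id   BIGINT," +
            "  amount     DECIMAL(10, 2)," +
            "  op         STRING," +
            "  ts_ms      BIGINT," +
            "  event_hour STRING" +
            ") PARTITIONED BY (event_hour)");
    }
}
```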
Slide 29
What’s in the Append Table?
Slide 30
Querying the Iceberg Table
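A common merge-on-read formulation, sketched against the hypothetical orders_log schema above: the latest change per key wins, and keys whose latest change is a delete are filtered out:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MergeOnReadQuery {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Assumes the Iceberg catalog 'ice' and the orders_log sketch above.
        // Latest change per key wins; keys whose last change was a delete
        // ('d') disappear from the result.
        tEnv.executeSql(
            "SELECT order_id, amount " +
            "FROM (" +
            "  SELECT *, ROW_NUMBER() OVER (" +
            "    PARTITION BY order_id ORDER BY ts_ms DESC) AS rn " +
            "  FROM ice.db.orders_log" +
            ") WHERE rn = 1 AND op <> 'd'").print();
    }
}
```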
Slide 31
What about query performance?
● Records to merge grow over time
● Can we do “compaction” similar to Kafka topics?
Slide 32
Streaming Job Writing to the Current Partition
Slide 33
Batch Job Merges Keys of Old Partition
Slide 34
Delete the Merged Partition
Slide 35
Batch Job SQL
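The deck's actual SQL isn't reproduced in this extraction; one plausible shape, again against the hypothetical orders_log schema, folds the surviving latest row per key from a closed hour into the next hour's partition, after which the old raw partition can be dropped:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MergeOldPartition {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Fold the closed hour's latest row per surviving key into the next
        // partition, so readers only merge partitions that still exist.
        // Hour values are placeholders.
        tEnv.executeSql(
            "INSERT INTO ice.db.orders_log " +
            "PARTITION (event_hour = '2024-01-01-01') " +
            "SELECT order_id, amount, op, ts_ms " +
            "FROM (" +
            "  SELECT *, ROW_NUMBER() OVER (" +
            "    PARTITION BY order_id ORDER BY ts_ms DESC) AS rn " +
            "  FROM ice.db.orders_log WHERE event_hour = '2024-01-01-00'" +
            ") WHERE rn = 1 AND op <> 'd'");
    }
}
```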
Slide 36
Use the Iceberg API to Delete the Old Partition
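A sketch with the Iceberg Java API (catalog type, warehouse path, and names are assumptions): DeleteFiles drops every data file whose rows all match the filter, so removing a fully merged partition is a metadata-only commit:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class DropMergedPartition {
    public static void main(String[] args) {
        // Placeholder catalog and warehouse path.
        HadoopCatalog catalog =
                new HadoopCatalog(new Configuration(), "file:///tmp/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("db", "orders_log"));

        // Deletes every data file whose rows all match the filter, i.e. the
        // whole merged partition, in a single commit.
        table.newDelete()
             .deleteFromRowFilter(Expressions.equal("event_hour", "2024-01-01-00"))
             .commit();
    }
}
```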
Slide 37
Putting it all together
Slide 38
The result?
- ✅ Cheap and scales well
- ✅ CDC Streaming Reads & Writes
- ✅ Directly queryable by many query engines
- ⚠ Minute-level latency
Slide 39
🎁 Wrapping Up
Slide 40
Key Takeaways
1. Iceberg “supports” CDC ingestion and processing
2. Store change streams in append tables
3. Run batch jobs to optimize query performance