
Streaming from Apache Iceberg - QCon NY 2023


Apache Flink is a very popular stream processing engine featuring sophisticated state management, event-time semantics, and exactly-once state consistency. For low-latency processing, Flink jobs typically consume data from streaming sources like Apache Kafka. Apache Iceberg is a widely adopted data lake technology supporting features like snapshot isolation, transactional commits, and fast scan planning. While Iceberg was originally designed for batch processing, it can also be used as a streaming source in Flink. This not only lowers processing delays from hours or days to just minutes, but also significantly reduces infrastructure cost and operational burden.

In this talk, we will explain the design of the Flink Iceberg source that we contributed to the Apache Iceberg open source project. We will compare the Kafka and Iceberg sources for streaming reads and present performance evaluation results for the Iceberg streaming read. We will discuss how the Iceberg streaming source can power many common stream processing use cases (like ETL and feature engineering). It enables users to build low-latency streaming pipelines chained by Iceberg tables that are cost effective and easy to operate.

Steven Wu

June 13, 2023


Transcript

  1. Apache Iceberg is an open table format for huge analytic datasets. https://iceberg.apache.org/
  2. What is a table format? A file format (like Parquet) organizes records in a file; a table format (like Iceberg) organizes files in a table.
  3. Iceberg offers numerous features • Serializable isolation • Fast scan planning with advanced filtering • Schema and partition layout evolution • Time travel • Branching and tagging
  4. Where does Iceberg fit in the ecosystem? A compute engine (Flink) sits on top of the table format / metadata layer (Iceberg, Delta Lake, Hudi), which sits on top of the storage / data layer (cloud storage or HDFS, with Parquet or ORC files).
  5. Flink is a popular stream processing engine • Highly scalable • Exactly-once processing semantics • Event-time semantics and watermark support • Layered APIs (DataStream, Table API/SQL)
  6. Traditional data pipelines are largely chained by batch jobs reading from the data lake: Device → (seconds) API Edge → Message Queue → (minutes) Streaming Ingestion → Data Lake (raw data) → (hours/days) ETL batch jobs → Data Lake (cleaned and enriched data) → (hours/days) Feature Engineering batch jobs → Data Lake (feature store) → (hours/days) Offline Model Training batch jobs → Model Store.
  7. Overall latency is hours to days: seconds at the edge and minutes for streaming ingestion, then hours/days for each batch stage (ETL, feature engineering, offline model training).
  8. Switch everything to Flink streaming from Kafka: Device → (seconds) API Edge → Message Queue → (seconds) Streaming Ingestion → raw data → (seconds) ETL → cleaned and enriched data → (seconds) Feature Engineering → feature store → (seconds) Online Model Training → Model Store.
  9. Operation is not easy • Upgrading a stateful system is painful • Capacity planning • Bursty workloads • Isolation
  10. Tiered storage is widely adopted because it is expensive to store long-term data in Kafka: the present and recent past stay in Kafka, while the distant past moves to Iceberg.
  11. Cross-AZ network cost can be 10x more than broker cost (compute and storage combined) when producers, brokers, and consumers are spread across three availability zones.
  12. Rack-aware partition assignment can avoid cross-AZ traffic from broker to consumer (a config sketch follows).
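    A minimal sketch of the consumer side of Kafka's rack-aware (follower) fetching, per KIP-392; the bootstrap server and AZ name are placeholders, and the brokers must also be configured as noted in the comment:

        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerConfig;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.common.serialization.ByteArrayDeserializer;

        // Brokers must set broker.rack=<their AZ> and
        // replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
        // so that a consumer declaring its rack is served by an in-AZ replica.
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");  // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "az-1");  // this consumer's AZ
        KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);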
  13. The Kafka source doesn't support filtering or projection on the broker side, so each consuming job (Job-1, Job-2, Job-3) must apply its own filter or projection after reading the full stream.
  14. Teams set up routing jobs just to filter or project data for the downstream jobs (Job-1, Job-2, Job-3).
  15. The Kafka source statically assigns partitions to workers during startup (Partition-1 → Worker-1, Partition-2 → Worker-2, Partition-3 → Worker-3), which leads to a few limitations.
  16. Other workers can't pick up the slack from an outlier: the outlier worker stays bound to its statically assigned partition.
  17. Source parallelism is limited by the number of partitions: with three partitions, Worker-4 sits idle.
  18. Autoscaling may not produce a balanced partition assignment across workers.
  19. The Flink Iceberg sink commits files (f1, f2, f3) after every successful checkpoint of the upstream streaming job; checkpoint intervals of 1-10 minutes are pretty common (see the sketch below).
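    As a rough sketch of that commit cadence (the table path is a placeholder, and rowDataStream stands in for the upstream DataStream<RowData>), the Iceberg Flink sink ties commits to Flink checkpoints:

        import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
        import org.apache.iceberg.flink.TableLoader;
        import org.apache.iceberg.flink.sink.FlinkSink;

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Data files are committed on each successful checkpoint, so this
        // interval sets the commit cadence (60s here; 1-10 min is common)
        env.enableCheckpointing(60_000L);

        FlinkSink.forRowData(rowDataStream)  // rowDataStream: DataStream<RowData>
            .tableLoader(TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/tbl"))
            .append();
        env.execute("iceberg-sink-job");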
  20. Can Flink stream data from Iceberg as files are committed by the upstream job? Upstream Streaming Job → (sink) → Iceberg → Streaming Source Job.
  21. Iceberg supports scanning incremental changes between snapshots: the upstream job's sink commits files (f1, f2, f3) into successive snapshots Sn, Sn+1, and the streaming source job reads the newly appended files via TableScan appendsBetween(long fromSnapshotId, long toSnapshotId). This cycle continues forever (a scan sketch follows).
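    A minimal sketch of that incremental append scan; the table path is a placeholder, and fromSnapshotId/toSnapshotId are assumed to come from tracking the table's snapshot history:

        import org.apache.iceberg.FileScanTask;
        import org.apache.iceberg.Table;
        import org.apache.iceberg.hadoop.HadoopTables;
        import org.apache.iceberg.io.CloseableIterable;

        Table table = new HadoopTables().load("hdfs://nn:8020/warehouse/db/tbl");
        // Plan only the data files appended between the two snapshots;
        // the streaming source repeats this on every monitor interval
        try (CloseableIterable<FileScanTask> tasks =
                table.newScan().appendsBetween(fromSnapshotId, toSnapshotId).planFiles()) {
          for (FileScanTask task : tasks) {
            System.out.println(task.file().path());  // each task covers a file or file slice
          }
        }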
  22. The Kafka streaming job and the Iceberg streaming job look almost identical; only the source builder differs (also available in Flink SQL):

      KafkaSource<RowData> source = KafkaSource.builder()
          .setBootstrapServers("…")
          .setTopics("…")
          .setStartingOffsets(OffsetsInitializer.latest())
          .setDeserializer(…)
          .build();

      IcebergSource<RowData> source = IcebergSource.forRowData()
          .tableLoader(tableLoader)
          .streamingStartingStrategy(INCREMENTAL_FROM_LATEST_SNAPSHOT)
          .monitorInterval(Duration.ofSeconds(30L))
          .build();
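    For context, wiring either built source into a DataStream job looks roughly like this (the watermark strategy and source name are placeholders, and env is the StreamExecutionEnvironment):

        import org.apache.flink.api.common.eventtime.WatermarkStrategy;
        import org.apache.flink.api.common.typeinfo.TypeInformation;
        import org.apache.flink.streaming.api.datastream.DataStream;
        import org.apache.flink.table.data.RowData;

        DataStream<RowData> stream = env.fromSource(
            source,                            // the IcebergSource (or KafkaSource) built above
            WatermarkStrategy.noWatermarks(),
            "iceberg-source",
            TypeInformation.of(RowData.class));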
  23. Build low-latency data pipelines chained by Flink jobs streaming from Iceberg: Device → (seconds) API Edge → Message Queue → (minutes) Streaming Ingestion → Data Lake (raw data) → (minutes) ETL → Data Lake (cleaned and enriched data) → (minutes) Feature Engineering → Data Lake (feature store) → (minutes) Nearline Model Training → Model Store.
  24. Where does stream processing fit in the spectrum of data processing applications? From more real time to more lag time: Transactional Processing → Event-driven Applications → Streaming Analytics → Data Pipelines → Continuous Processing → Batch Processing. (Stephan Ewen, Xiaowei Jiang & Robert Metzger, "From Stream Processing to Unified Data Processing System," Flink Forward, April 1-2, 2019, San Francisco.)
  25. The Flink Iceberg streaming source, with its minutes-level latency, fits well for data pipelines and continuous processing on that spectrum (same reference as above).
  26. What about incremental batch processing? • Schedule batch runs every few minutes • Each run processes the new files added since the last run • The line becomes blurry as scheduling intervals are shortened
  27. Limitations of incremental batch processing • Tearing down and starting batch runs may be too expensive when scheduling intervals are small • Operational burden can be too high • Intermediate results for stateful processing are lost after each batch run and recomputed in the next run
  28. A unit of work is defined as a split (see the sketch after this item) • In the Kafka source, a split is a partition • In the Iceberg source, a split is a file, a slice of a large file, or a group of small files • A split can be unbounded (Kafka) or bounded (Iceberg)
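    To make the abstraction concrete, here is an illustrative bounded split implementing Flink's FLIP-27 SourceSplit interface. The class and fields are hypothetical; the real class in iceberg-flink (IcebergSourceSplit) wraps an Iceberg scan task instead:

        import org.apache.flink.api.connector.source.SourceSplit;

        // Illustrative only: identifies the slice of a data file a reader should scan
        public class FileSliceSplit implements SourceSplit {
          private final String filePath;
          private final long start;   // byte offset where the slice begins
          private final long length;  // slice length; a large file yields several slices

          public FileSliceSplit(String filePath, long start, long length) {
            this.filePath = filePath;
            this.start = start;
            this.length = length;
          }

          @Override
          public String splitId() {
            return filePath + ":" + start + ":" + length;
          }
        }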
  29. The Iceberg source dynamically assigns splits to readers with a pull-based model (sketched below): (1) the enumerator on the JobManager discovers splits from Iceberg, (2) discovered splits are queued as pending, (3) a reader on a TaskManager requests a split upon startup or when done with its current split, one split at a time, and (4) the enumerator assigns it a split.
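    An illustrative skeleton of the pull model on the enumerator side, using the hypothetical FileSliceSplit from above; the field names are invented, but the callback signature is Flink's FLIP-27 SplitEnumerator API:

        // Called when a reader starts up or finishes its current split
        @Override
        public void handleSplitRequest(int subtaskId, @Nullable String requesterHostname) {
          FileSliceSplit next = pendingSplits.poll();  // pendingSplits: Queue<FileSliceSplit>
          if (next != null) {
            context.assignSplit(next, subtaskId);      // context: SplitEnumeratorContext<FileSliceSplit>
          } else {
            awaitingReaders.add(subtaskId);            // serve after the next split discovery
          }
        }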
  30. FLIP-27 unifies batch and streaming sources: the only difference is whether split discovery is one-time or periodic.
  31. Unify the live and backfill sources on Iceberg: the live job and the backfill job read from the same Iceberg table and write to their sinks.
  32. Most cloud blob storages don't charge network cost within a region, so consumers in any AZ can read Iceberg data on cloud storage without cross-AZ charges.
  33. Support advanced data pruning • File pruning (predicate pushdown) • Column projection

      IcebergSource<RowData> source = IcebergSource.forRowData()
          .tableLoader(tableLoader)
          .streamingStartingStrategy(INCREMENTAL_FROM_LATEST_SNAPSHOT)
          .monitorInterval(Duration.ofSeconds(30L))
          .filters(Expressions.equal("dt", "2020-03-20"))
          .project(schema.select("dt", "id", "name"))
          .build();
  34. Dynamic pull-based split assignment allows other workers to pick up the slack from an outlier: they simply request and process more files from the enumerator's pending splits.
  35. Having a lot more file segments than Kafka partitions is more operationally friendly • Can support higher parallelism • Is more autoscaling friendly
  36. The FLIP-27 Flink Iceberg source has been merged into the Apache Iceberg project (https://github.com/apache/iceberg/projects/23). CDC (updates/deletes) read is not supported yet.
  37. Watermarks control the tradeoff between latency and completeness (see the sketch after this item) • Records can come out of order • A watermark asserts that all data before it has arrived
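    In Flink's DataStream API, this tradeoff is typically expressed with a bounded-out-of-orderness watermark. A minimal sketch; the 5-minute bound is a placeholder, and Event is a hypothetical POJO with an eventTimeMillis field:

        import java.time.Duration;
        import org.apache.flink.api.common.eventtime.WatermarkStrategy;

        // The watermark trails the max observed timestamp by 5 minutes:
        // a larger bound favors completeness, a smaller one favors latency
        WatermarkStrategy<Event> strategy =
            WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                .withTimestampAssigner((event, ts) -> event.eventTimeMillis);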
  38. Why is watermark alignment needed? Consider a stateful join of impression and click streams with a 6-hour join window: everything works well in steady state with live traffic.
  39. During replay, the two sources can proceed at different paces due to different data volumes (impressions carry 4x the click volume).
  40. The Flink watermark is calculated as the minimum of all inputs: with the impression source (4x volume) replaying at now - 18h while the click source is already at now, the keyed co-process operator's watermark sits at now - 18h.
  41. Slow watermark advancement can lead to excessive buffering: the job buffers 24 hours of click data, versus 6 hours during steady state.
  42. Watermark alignment ensures both sources progress at similar paces and avoids excessive buffering: the click source is throttled to now - 17h while impressions replay at now - 18h, so the job buffers about 7 hours of click data, close to steady state (an API sketch follows).
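    Flink (1.15+) exposes this as watermark alignment on the WatermarkStrategy: sources sharing a group throttle any reader whose watermark runs further ahead than the allowed drift. The group name and durations are placeholders, and Event is the hypothetical POJO from the earlier sketch:

        import java.time.Duration;
        import org.apache.flink.api.common.eventtime.WatermarkStrategy;

        WatermarkStrategy<Event> aligned =
            WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                .withTimestampAssigner((event, ts) -> event.eventTimeMillis)
                // group name, max allowed watermark drift, update interval
                .withWatermarkAlignment("join-sources", Duration.ofMinutes(30), Duration.ofSeconds(1));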
  43. Kafka readers extract and track watermarks from a timestamp field in the records consumed from Kafka: Reader-0 (P0) at 10:45, Reader-1 (P1) at 10:23, Reader-2 (P2) at 10:10.
  44. Kafka readers periodically send their local watermarks (10:45, 10:23, 10:10) to the enumerator for aggregation.
  45. The enumerator calculates the global watermark as the minimum of the local watermarks: min(10:45, 10:23, 10:10) = 10:10.
  46. The enumerator broadcasts the global watermark (10:10) back to all readers.
  47. Readers check the difference between their local watermark and the global watermark to decide if throttling is needed: with maxAllowedWatermarkDrift = 30 mins, Reader-0 at 10:46 is throttled against the 10:10 global watermark (sketched below).
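    The reader-side decision is essentially this check; all names here are hypothetical:

        // Pause fetching when the local watermark runs too far ahead of the
        // globally aggregated watermark; resume once a newer global value arrives
        long maxDriftMillis = 30L * 60L * 1000L;  // 30 minutes
        if (localWatermarkMillis - globalWatermarkMillis > maxDriftMillis) {
          pauseFetching();  // hypothetical: stop polling this reader's partition
        }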
  48. Kafka source vs. Iceberg source:
      • Kafka: splits are unbounded / Iceberg: splits are bounded
      • Kafka: static split assignment / Iceberg: dynamic pull-based split assignment
      • Kafka: records are ordered within a partition / Iceberg: records may not be sorted by the timestamp field within a data file
      • Kafka: only readers can extract the watermark info from records / Iceberg: the enumerator can extract min-max values from column-level statistics in metadata files
  49. The enumerator assigns files to readers ordered by their minimum timestamp values: F1 (9:00-9:03), F2 (9:04-9:10), F3 (9:05-9:09), F4 (9:13-9:19), F5 (9:16-9:21), F6 (9:17-9:25), F7 (9:21-9:26), F8 (9:23-9:25), F9 (9:25-9:32).
  50. Readers extract their local watermark from the min value of the timestamp column stats of the assigned file (a sketch follows): Reader-0 gets F1 (9:00-9:03) → 9:00, Reader-1 gets F2 (9:04-9:10) → 9:04, Reader-2 gets F3 (9:05-9:09) → 9:05; global watermark = 9:00.
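    A minimal sketch of reading the min timestamp from Iceberg's file-level column statistics, without opening the data file; tsFieldId, the field id of the timestamp column, is an assumption:

        import java.nio.ByteBuffer;
        import org.apache.iceberg.DataFile;
        import org.apache.iceberg.types.Conversions;
        import org.apache.iceberg.types.Types;

        // lowerBounds() maps field id -> serialized min value, stored in metadata
        long minTimestampMicros(DataFile file, int tsFieldId) {
          ByteBuffer lower = file.lowerBounds().get(tsFieldId);
          return Conversions.fromByteBuffer(Types.TimestampType.withZone(), lower);
        }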
  51. Readers check the difference between their local watermarks (9:00, 9:04, 9:05) and the global watermark (9:00) to decide if throttling is needed; max allowed drift = 10 mins.
  52. Reader-2 finished F3 and requested a new file; local watermarks were 9:00, 9:04, 9:05 and the global watermark 9:00 (max allowed drift = 10 mins).
  53. The enumerator assigned F4 (9:13-9:19) to Reader-2, which advanced its local watermark to 9:13; the global watermark stayed at 9:00.
  54. Reader-2 paused reading because its local watermark was too far ahead: 9:13 - 9:00 > 10 mins max allowed drift.
  55. Reader-0 advanced its watermark to 9:16 after receiving the new file F5 (9:16-9:21); now both Reader-0 (9:16 - 9:00 > 10 mins) and Reader-2 (9:13 - 9:00 > 10 mins) were throttled.
  56. After the propagation delay, the global watermark advanced to 9:04 and Reader-2 resumed reading, since 9:13 - 9:04 <= 10 mins.
  57. Max out-of-orderliness = max allowed watermark drift + max timestamp range in data files. Here the max allowed drift is 10 minutes and the widest file spans 8 minutes (F6, 9:17-9:25), so the max out-of-orderliness is 10 + 8 = 18 minutes.
  58. Keeping the max out-of-orderliness small avoids excessive buffering in Flink state: the throttled replay buffers about 7 hours of click data, close to the 6 hours of steady state.
  59. What are we evaluating • Processing delay • How the upstream commit interval affects bursty consumption • CPU utilization comparison between the Kafka and Iceberg sources
  60. Measure latency as processing time minus event timestamp, from Kafka through to the Iceberg source job, compared against a direct Kafka source job.
  61. Latency is mostly decided by the commit and poll intervals: the Kafka source job path is sub-second, while the Iceberg source job path adds the upstream commit interval plus the source poll interval.
  62. The latency histogram is within the expected range for a 10s commit and 5s poll interval: max latency stays under 40s and the median fluctuates around 10s.
  63. Transactional commits in upstream ingestion lead to bursty, stop-and-go consumption in the Iceberg source job, as expected, while the Kafka source job's CPU usage stays smooth (300s commit and 30s poll interval; 30-minute graphing window; y-axis: CPU usage in cores).
  64. CPU usage becomes smoother as we shorten the upstream commit interval and the Iceberg source poll interval: from 300s commit / 30s poll, to 60s commit / 10s poll, to 10s commit / 5s poll (30-minute graphing window).
  65. How does the Iceberg source compare to the Kafka source in CPU usage? The two jobs are identical except for the streaming source: Kafka vs. Iceberg.
  66. Here is the CPU usage comparison between the Kafka and Iceberg source jobs after applying a smoothing function: roughly 36% vs. 60% of a core (60-minute graphing window; 60s commit and 10s poll intervals).
  67. Recap: build low-latency data pipelines chained by Flink jobs streaming from Iceberg, with each hop after ingestion taking minutes instead of hours or days (Device → API Edge → Message Queue → Streaming Ingestion → raw data → ETL → cleaned and enriched data → Feature Engineering → feature store → Nearline Model Training → Model Store).
  68. Q&A