Slide 1

Streaming from Apache Iceberg: Building Low-Latency and Cost-Effective Data Pipelines
Steven Wu @ Apple
THIS IS NOT A CONTRIBUTION

Slide 2

Agenda
• Introduction to Iceberg and Flink
• Motivation
• Streaming from Iceberg
• Watermark alignment
• Evaluation results

Slide 4

Apache Iceberg is an open table format for huge analytic datasets https://iceberg.apache.org/

Slide 5

What is a table format?
A file format (like Parquet) organizes records in a file.
A table format (like Iceberg) organizes files in a table.

Slide 6

Iceberg offers numerous features
• Serializable isolation
• Fast scan planning with advanced filtering
• Schema and partition layout evolution
• Time travel
• Branching and tagging

Slide 7

Where does Iceberg fit in the ecosystem?
• Compute engine: Flink
• Table format (metadata): Iceberg, Delta Lake, Hudi
• File format: Parquet, ORC
• Storage (data): cloud storage, HDFS

Slide 8

Apache Flink is a distributed framework for stateful computations over data streams https://flink.apache.org/

Slide 9

Flink is a popular stream processing engine
• Highly scalable
• Exactly-once processing semantics
• Event time semantics and watermark support
• Layered APIs (DataStream, Table API/SQL)

Slide 10

Agenda
• Introduction to Iceberg and Flink
• Motivation
• Streaming from Iceberg
• Watermark alignment
• Evaluation results

Slide 11

Traditional data pipelines are largely chained by batch jobs reading from the data lake:
Device → API Edge → Message Queue (seconds) → Streaming Ingestion (minutes) → Data Lake (raw data) → ETL batch jobs (hours/days) → Data Lake (cleaned and enriched data) → Feature engineering batch jobs (hours/days) → Data Lake (feature store) → Offline model training batch jobs (hours/days) → Model Store

Slide 12

Overall latency is hours to days: seconds at the edge, minutes for streaming ingestion, and hours/days for each downstream batch stage of the same pipeline.

Slide 13

Flink streaming from Kafka is very popular: Kafka → Flink streaming job → sink.

Slide 14

Switch everything to Flink streaming from Kafka:
Device → API Edge → Message Queue → Streaming Ingestion → Raw data → ETL → Cleaned and enriched data → Feature Engineering → Feature store → Online Model Training → Model Store, with seconds of latency at every hop

Slide 15

Kafka can achieve sub-second read latency

Slide 16

But there are tradeoffs...

Slide 17

Operation is not easy
• Upgrading a stateful system is painful
• Capacity planning
• Bursty workload
• Isolation

Slide 18

Tiered storage is widely adopted because it is expensive to store long-term data in Kafka: Kafka serves the present and recent past, while Iceberg holds the distant past.

Slide 19

Cross-AZ network cost can be 10x more than the broker cost (compute and storage combined). [Diagram: producers, brokers, and consumers spread across AZ-1, AZ-2, and AZ-3, with traffic crossing AZ boundaries.]

Slide 20

Rack-aware partition assignment can avoid cross-AZ traffic from broker to consumer. [Diagram: each consumer reads from the broker in its own AZ.]

Slide 21

The Kafka source doesn't support filtering or projection on the broker side. [Diagram: Job-1 and Job-2 each apply their own filter, and Job-3 a projection, after reading the full stream.]

Slide 22

Set up routing jobs just to filter or project data. [Diagram: a routing job filters and projects the stream before fanning out to Job-1, Job-2, and Job-3.]

Slide 23

The Kafka source statically assigns partitions during startup (Worker-1 ← Partition-1, Worker-2 ← Partition-2, Worker-3 ← Partition-3), which leads to a few limitations.

Slide 24

Other workers can't pick up the slack from an outlier. [Diagram: one worker falls behind on its statically assigned partition while the others cannot help.]

Slide 25

Source parallelism is limited by the number of partitions. [Diagram: with three partitions assigned to Worker-1 through Worker-3, Worker-4 sits idle.]

Slide 26

May not get a balanced partition assignment during autoscaling. [Diagram: six partitions spread unevenly across three workers.]

Slide 27

May not get a balanced partition assignment during autoscaling. [Diagram: after scaling to four workers, the six partitions are still spread unevenly.]

Slide 28

Alternative streaming source?

Slide 29

Agenda
• Introduction to Iceberg and Flink
• Motivation
• Streaming from Iceberg
• Watermark alignment
• Evaluation results

Slide 30

The Flink Iceberg sink commits files (f1, f2, f3) after every successful checkpoint of the upstream streaming job. Checkpoint intervals of 1-10 minutes are pretty common.
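
Because commits are tied to checkpoints, the commit cadence is simply the checkpoint interval. A minimal sketch of configuring it, assuming a DataStream job with an Iceberg sink already wired up; the 60-second value is illustrative:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        // The Iceberg sink commits data files on every successful checkpoint,
        // so this interval directly sets the table's commit latency.
        env.enableCheckpointing(60_000L); // 60s, within the common 1-10 minute range
    }
}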

Slide 31

Can Flink stream data from Iceberg as files are committed by the upstream job? (Upstream streaming job → Iceberg → streaming source job → sink)

Slide 32

Iceberg supports scanning the incremental changes between snapshots. The upstream streaming job keeps committing new snapshots (Sn with files f1, f2, f3, then Sn+1, and so on), and this cycle continues forever:

TableScan appendsBetween(long fromSnapshotId, long toSnapshotId);
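
A minimal sketch of issuing such an incremental scan against the core table API, assuming `table` is an already-loaded Iceberg Table and the two snapshot ids come from polling the table's snapshot history:

import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

class IncrementalScanExample {
    // Plan only the files appended between two snapshots
    // (fromSnapshotId exclusive, toSnapshotId inclusive).
    static CloseableIterable<CombinedScanTask> planIncrement(
            Table table, long fromSnapshotId, long toSnapshotId) {
        return table.newScan()
            .appendsBetween(fromSnapshotId, toSnapshotId)
            .planTasks(); // files grouped into combined tasks (splits)
    }
}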

Slide 33

Kafka streaming job:

KafkaSource source = KafkaSource.builder()
    .setBootstrapServers("…")
    .setTopics("…")
    .setStartingOffsets(OffsetsInitializer.latest())
    .setDeserializer(…)
    .build()

Iceberg streaming job:

IcebergSource source = IcebergSource.forRowData()
    .tableLoader(tableLoader)
    .streamingStartingStrategy(INCREMENTAL_FROM_LATEST_SNAPSHOT)
    .monitorInterval(Duration.ofSeconds(30L))
    .build()

Also available in Flink SQL.
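
Since the slide notes the source is also available in Flink SQL, here is a hedged sketch of the SQL form submitted through a Java TableEnvironment. The catalog and table names (hive_catalog, db.events) are assumptions; the hint options are the Iceberg connector's streaming-read options:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlStreamingReadExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        // Allow per-query options via SQL hints.
        tEnv.getConfig().getConfiguration()
            .setString("table.dynamic-table-options.enabled", "true");
        // Mirrors the DataStream settings above: incremental streaming read,
        // polling for new snapshots every 30 seconds.
        tEnv.executeSql(
            "SELECT * FROM hive_catalog.db.events "
                + "/*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */");
    }
}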

Slide 34

Many streaming use cases are good with minute-level latency

Slide 35

Build low-latency data pipelines chained by Flink jobs streaming from Iceberg:
Device → API Edge → Message Queue (seconds) → Streaming Ingestion → Data Lake (raw data) → ETL (minutes) → Data Lake (cleaned and enriched data) → Feature Engineering (minutes) → Data Lake (feature store) → Nearline Model Training (minutes) → Model Store

Slide 36

Where does stream processing fit in the spectrum of data processing applications?
From more real time to more lag time: Transactional Processing, Event-driven Applications, Streaming Analytics, Data Pipelines, Continuous Processing, Batch Processing.
(Stephan Ewen, Xiaowei Jiang & Robert Metzger. From Stream Processing to Unified Data Processing System. Flink Forward, April 1-2, 2019, San Francisco.)

Slide 37

The Flink Iceberg streaming source, with its minute-level latency, fits well for data pipelines and continuous processing on that same spectrum.
(Stephan Ewen, Xiaowei Jiang & Robert Metzger. From Stream Processing to Unified Data Processing System. Flink Forward, April 1-2, 2019, San Francisco.)

Slide 38

What about incremental batch processing?
• Schedule batch runs every few minutes
• Each run processes the new files added since the last run
• The line becomes blurry as scheduling intervals are shortened

Slide 39

Limitations of incremental batch processing
• May be more expensive to tear down and restart the batch runs when scheduling intervals are small
• Operational burden can be too high
• Intermediate results for stateful processing are lost after each batch run and recomputed in the next run

Slide 40

The FLIP-27 source interface separates work discovery from reading: an enumerator runs on the JobManager, and readers (Reader-1 … Reader-k) run on the TaskManagers (TaskManager-1 … TaskManager-n).

Slide 41

A unit of work is defined as a split
• In the Kafka source, a split is a partition
• In the Iceberg source, a split is a file, a slice of a large file, or a group of small files
• A split can be unbounded (Kafka) or bounded (Iceberg)

Slide 42

The Iceberg source dynamically assigns splits to readers with a pull-based model; a reader requests one split at a time (see the sketch below):
1. Split discovery: the enumerator scans the Iceberg table
2. Discovered splits are added to the enumerator's pending-splits queue
3. A reader requests a split upon start or when it is done with its current split
4. The enumerator assigns a split to the requesting reader
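
To make the pull model concrete, here is a minimal illustrative sketch of a FLIP-27 enumerator that hands out one pending split per reader request. The FileSplit type and the discovery wiring are simplified placeholders, not the actual Iceberg source classes:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.api.connector.source.SplitEnumerator;
import org.apache.flink.api.connector.source.SplitEnumeratorContext;

// Hypothetical split: one data file (or a slice / group of files).
class FileSplit implements SourceSplit {
    private final String path;
    FileSplit(String path) { this.path = path; }
    @Override public String splitId() { return path; }
}

class PullBasedEnumerator implements SplitEnumerator<FileSplit, List<FileSplit>> {
    private final SplitEnumeratorContext<FileSplit> context;
    private final Queue<FileSplit> pendingSplits = new ArrayDeque<>();

    PullBasedEnumerator(SplitEnumeratorContext<FileSplit> context,
                        List<FileSplit> discovered) {
        this.context = context;
        pendingSplits.addAll(discovered);
    }

    @Override
    public void handleSplitRequest(int subtaskId, String requesterHostname) {
        // A reader asks for work when it starts or finishes its current split.
        FileSplit next = pendingSplits.poll();
        if (next != null) {
            context.assignSplit(next, subtaskId);
        }
        // In streaming mode, periodic discovery keeps refilling the queue,
        // so the enumerator never signals "no more splits".
    }

    @Override public void start() {}
    @Override public void addReader(int subtaskId) {}
    @Override public void addSplitsBack(List<FileSplit> splits, int subtaskId) {
        pendingSplits.addAll(splits); // returned after a reader failure
    }
    @Override public List<FileSplit> snapshotState(long checkpointId) {
        return new ArrayList<>(pendingSplits);
    }
    @Override public void close() throws IOException {}
}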

Slide 43

FLIP-27 unifies batch and streaming sources: the only difference is whether split discovery is one-time (batch) or periodic (streaming).

Slide 44

Benefits of the Iceberg streaming source?

Slide 45

Leverage managed cloud storage
• Offload operational burden
• Scalable
• Cost effective

Slide 46

Simplify the architecture with unified storage: Iceberg holds the present, the recent past, and the distant past.

Slide 47

Unify the live and backfill sources on Iceberg: the live job and the backfill job read from the same table and write to their own sinks.

Slide 48

Most cloud blob storages don't charge network cost within a region. [Diagram: consumers in AZ-1, AZ-2, and AZ-3 all read Iceberg data on cloud storage without cross-AZ charges.]

Slide 49

Support advanced data pruning
• File pruning (predicate pushdown)
• Column projection

IcebergSource source = IcebergSource.forRowData()
    .tableLoader(tableLoader)
    .streamingStartingStrategy(INCREMENTAL_FROM_LATEST_SNAPSHOT)
    .monitorInterval(Duration.ofSeconds(30L))
    .filters(Expressions.equal("dt", "2020-03-20"))
    .project(schema.select("dt", "id", "name"))
    .build()

Slide 50

Dynamic pull-based split assignment allows other workers to pick up the slack from an outlier and process more files.

Slide 51

It is more operationally friendly because there are many more file segments than Kafka partitions
• Can support higher parallelism
• Is more autoscaling friendly

Slide 52

The FLIP-27 Flink Iceberg source is merged in the Apache Iceberg project (https://github.com/apache/iceberg/projects/23). CDC (updates/deletes) reads are not supported yet.

Slide 53

Agenda
• Introduction to Iceberg and Flink
• Motivation
• Streaming from Iceberg
• Watermark alignment
• Evaluation results

Slide 54

Watermarks control the tradeoff between latency and completeness (see the sketch below)
• Records can come out of order
• A watermark asserts that all data before it has arrived
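
As a concrete illustration of that tradeoff, a bounded-out-of-orderness strategy in Flink's DataStream API: a larger bound favors completeness, a smaller one favors latency. The Event type and the 5-minute bound are assumptions for the sketch:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class WatermarkExample {
    // Hypothetical event carrying an event-time timestamp in millis.
    public static class Event {
        public long eventTimeMillis;
    }

    public static WatermarkStrategy<Event> strategy() {
        // Watermark = max seen timestamp - 5 minutes: waits up to 5 minutes
        // for out-of-order records (completeness) at the cost of latency.
        return WatermarkStrategy
            .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
            .withTimestampAssigner((event, recordTs) -> event.eventTimeMillis);
    }
}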

Slide 55

Why is watermark alignment needed? Consider a stateful join of impression and click streams with a 6-hour join window. Everything works well in steady state with live traffic.

Slide 56

During replay, the two sources can proceed at different paces due to different data volumes (impressions have 4x the volume of clicks).

Slide 57

The Flink watermark is calculated as the minimum of all inputs. [Diagram: the impression source (4x volume) has only reached now - 18h while the click source is at now; after keyBy, the co-process operator's watermark is now - 18h.]

Slide 58

Slow watermark advancement can lead to excessive buffering: the job buffers 24 hours of click data, vs. 6 hours during steady state.

Slide 59

Watermark alignment ensures both sources progress at similar paces and avoids excessive buffering: with the click source throttled to now - 17h while impressions are at now - 18h, the job buffers only 7 hours of click data, close to steady state.

Slide 60

How Flink watermark alignment works (FLIP-182 and FLIP-217). [Diagram: an enumerator coordinating Reader-0, Reader-1, and Reader-2, which consume partitions P0, P1, and P2.]
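
FLIP-182 exposes alignment directly on the watermark strategy. A sketch using the same hypothetical Event type; the group name, drift bound, and update interval are illustrative values:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class AlignmentExample {
    public static class Event {
        public long eventTimeMillis;
    }

    public static WatermarkStrategy<Event> alignedStrategy() {
        return WatermarkStrategy
            .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
            .withTimestampAssigner((event, recordTs) -> event.eventTimeMillis)
            // Sources sharing the "clicks-impressions" group are kept within
            // 30 minutes of each other; drift is re-checked every second.
            .withWatermarkAlignment("clicks-impressions",
                Duration.ofMinutes(30), Duration.ofSeconds(1));
    }
}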

Slide 61

Kafka readers extract and track watermarks from the timestamp field in records consumed from Kafka (Reader-0: 10:45, Reader-1: 10:23, Reader-2: 10:10).

Slide 62

Kafka readers periodically send their local watermarks (10:45, 10:23, 10:10) to the enumerator for aggregation.

Slide 63

The enumerator calculates the global watermark as the minimum: min(10:45, 10:23, 10:10) = 10:10.

Slide 64

The enumerator broadcasts the global watermark (10:10) to all readers.

Slide 65

Readers check the difference between their local watermark and the global watermark to decide whether throttling is needed. With maxAllowedWatermarkDrift = 30 mins and a global watermark of 10:10, Reader-0 (local watermark 10:46) is throttled because 10:46 - 10:10 > 30 mins.
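
The reader-side decision reduces to a simple comparison. An illustrative helper, not the actual Flink internals:

import java.time.Duration;

public class ThrottleCheck {
    // Pause the reader when its local watermark has drifted too far
    // ahead of the global (minimum) watermark.
    static boolean shouldPause(long localWatermarkMillis,
                               long globalWatermarkMillis,
                               Duration maxAllowedDrift) {
        return localWatermarkMillis - globalWatermarkMillis
            > maxAllowedDrift.toMillis();
    }
}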

Slide 66

Kafka vs. Iceberg
• Kafka: splits are unbounded / Iceberg: splits are bounded
• Kafka: static split assignment / Iceberg: dynamic pull-based split assignment
• Kafka: records are ordered within a partition / Iceberg: records may not be sorted by the timestamp field within a data file
• Kafka: only readers can extract watermark info from records / Iceberg: the enumerator can extract min-max values from column-level statistics in metadata files

Slide 67

The enumerator assigns splits ordered by time.

Slide 68

Assign files to readers ordered by minimum timestamp value: F1 (9:00-9:03) → Reader-0, F2 (9:04-9:10) → Reader-1, F3 (9:05-9:09) → Reader-2, while F4 (9:13-9:19) through F9 (9:25-9:32) stay pending in the enumerator. A sketch of the ordering step follows.
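
A sketch of that ordering step, assuming a hypothetical FileSplit that exposes the min timestamp from the file's column statistics:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Queue;

public class OrderedAssignment {
    // Hypothetical split exposing the min event timestamp from column stats.
    record FileSplit(String file, long minTimestampMillis) {}

    // Order discovered files by min timestamp so earlier data is assigned first.
    static Queue<FileSplit> orderPending(List<FileSplit> discovered) {
        List<FileSplit> sorted = new ArrayList<>(discovered);
        sorted.sort(Comparator.comparingLong(FileSplit::minTimestampMillis));
        return new ArrayDeque<>(sorted);
    }
}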

Slide 69

Readers extract their local watermarks from the min values of the timestamp column stats: Reader-0 = 9:00 (F1), Reader-1 = 9:04 (F2), Reader-2 = 9:05 (F3). Global watermark = 9:00.

Slide 70

Readers check the difference between their local watermarks and the global watermark (9:00) to decide whether throttling is needed; with a max allowed drift of 10 mins, no reader is throttled yet.

Slide 71

Reader-2 finished F3 and requested a new file from the enumerator.

Slide 72

The enumerator assigned F4 (9:13-9:19) to Reader-2, which advanced its local watermark to 9:13; the global watermark is still 9:00.

Slide 73

Reader-2 paused reading because its local watermark was too far ahead of the global watermark: 9:13 - 9:00 > 10 mins (max allowed drift).

Slide 74

Reader-0 advanced its watermark to 9:16 after receiving the new file F5 (9:16-9:21). Now both Reader-0 (9:16 - 9:00 > 10 mins) and Reader-2 (9:13 - 9:00 > 10 mins) are paused.

Slide 75

After the propagation delay, the global watermark advanced to 9:04 and Reader-2 resumed reading: 9:13 - 9:04 <= 10 mins (max allowed drift).

Slide 76

Max out-of-orderliness = max allowed watermark drift + max timestamp range in data files. Example: with a max allowed drift of 10 mins and a max timestamp range of 8 mins (F6 spans 9:17-9:25), the max out-of-orderliness is 18 minutes.

Slide 77

Keeping the max out-of-orderliness small avoids excessive buffering in Flink state: the throttled click source stays close to the impression source, so the job buffers about 7 hours of click data, close to steady state.

Slide 78

Upstream contribution is in progress (credit: Peter Vary).

Slide 79

Agenda
• Introduction to Iceberg and Flink
• Motivation
• Streaming from Iceberg
• Watermark alignment
• Evaluation results

Slide 80

Test pipeline setup: a Kafka source job writes to an Iceberg table, and an Iceberg source job streams from it.

Slide 81

Traffic volume
• Throughput: ~3.9K msgs/sec
• Message size: ~1 KB

Slide 82

Container resource dimensions
• JobManager: 1 CPU, 4 GB memory
• TaskManager: 1 CPU, 4 GB memory

Slide 83

What are we evaluating
• Processing delay
• How the upstream commit interval affects bursty consumption
• CPU utilization comparison between the Kafka and Iceberg sources

Slide 84

Measure the latency from Kafka to the Iceberg source: latency = processing time - event timestamp.

Slide 85

Latency is mostly decided by the commit interval and the poll interval; the Kafka hop itself is sub-second.

Slide 86

The latency histogram is within the expected range for a 10s commit and 5s poll interval: the median fluctuates around 10s, and the max stays under 40s. [Plots: latency (ms) over time.]

Slide 87

Transactional commits in upstream ingestion lead to bursty stop-and-go consumption, as expected (300s commit and 30s poll interval). [Plots over a 30-minute window: CPU usage (cores) for the Kafka source job vs. the Iceberg source job.]

Slide 88

CPU usage becomes smoother as we shorten the upstream commit interval and the Iceberg source poll interval. [Plots over a 30-minute window: 300s commit/30s poll, 60s commit/10s poll, and 10s commit/5s poll.]

Slide 89

How does the Iceberg source compare to the Kafka source in CPU usage? The two jobs are identical except for the streaming source: Kafka vs. Iceberg.

Slide 90

CPU usage comparison between the Kafka and Iceberg sources after applying a smoothing function: ~36% for the Kafka source job vs. ~60% for the Iceberg source job (60-minute graphing window, 60s commit and 10s poll intervals).

Slide 91

Closing

Slide 92

Build low-latency data pipelines chained by Flink jobs streaming from Iceberg:
Device → API Edge → Message Queue (seconds) → Streaming Ingestion → Data Lake (raw data) → ETL (minutes) → Data Lake (cleaned and enriched data) → Feature Engineering (minutes) → Data Lake (feature store) → Nearline Model Training (minutes) → Model Store

Slide 93

Q&A
