
Streaming from Apache Iceberg - QCon NY 2023


Apache Flink is a very popular stream processing engine featuring sophisticated state management, event-time semantics, and exactly-once state consistency. For low-latency processing, Flink jobs typically consume data from streaming sources like Apache Kafka. Apache Iceberg is a widely adopted data lake technology supporting features like snapshot isolation, transactional commits, and fast scan planning. While Iceberg was originally designed for batch processing, it can also be used as a streaming source in Flink. This not only lowers processing delays from hours or days to just minutes, but also significantly reduces infrastructure cost and operational burden.

In this talk, we will explain the design of the Flink Iceberg source that we contributed to the Apache Iceberg open source project. We will compare the Kafka and Iceberg sources for streaming reads and present performance evaluation results for the Iceberg streaming read. We will discuss how the Iceberg streaming source can power many common stream processing use cases (like ETL and feature engineering). It enables users to build low-latency streaming pipelines chained by Iceberg tables that are cost effective and easy to operate.

Steven Wu

June 13, 2023


Transcript

  1. Apache Iceberg is an open table format for huge analytic datasets. https://iceberg.apache.org/
  2. What is a table format? A file format (like Parquet) organizes records in a file; a table format (like Iceberg) organizes files in a table.
  3. Iceberg offers numerous features • Serializable isolation • Fast scan planning with advanced filtering • Schema and partition layout evolution • Time travel • Branching and tagging
  4. Where does Iceberg fit in the ecosystem? A compute engine (Flink) sits on top of the table format / metadata layer (Iceberg, Delta Lake, Hudi), which sits on top of the storage / data layer (cloud storage or HDFS, with Parquet or ORC files).
  5. Flink is a popular stream processing engine • Highly scalable • Exactly-once processing semantics • Event-time semantics and watermark support • Layered APIs (DataStream, Table API/SQL)
  6. Traditional data pipelines are largely chained by batch jobs reading from the data lake: Device → (seconds) API Edge → Message Queue → (minutes) Streaming Ingestion → Data Lake (raw data) → (hours/days) ETL batch jobs → Data Lake (cleaned and enriched data) → (hours/days) Feature Engineering batch jobs → Data Lake (feature store) → (hours/days) Offline Model Training batch jobs → Model Store.
  7. Overall latency is hours to days: seconds at the edge and minutes for streaming ingestion, then hours/days for each batch stage (ETL, feature engineering, offline model training).
  8. Switch everything to Flink streaming from Kafka: Device → (seconds) API Edge → Message Queue → (seconds) Streaming Ingestion → raw data → (seconds) ETL → cleaned and enriched data → (seconds) Feature Engineering → feature store → (seconds) Online Model Training → Model Store.
  9. Operation is not easy • Upgrading a stateful system is painful • Capacity planning • Bursty workloads • Isolation
  10. Tiered storage is widely adopted because it is expensive to store long-term data in Kafka: the present and recent past stay in Kafka, while the distant past moves to Iceberg.
  11. Cross-AZ network cost can be 10x more than broker cost (compute and storage combined) when producers, brokers, and consumers are spread across three availability zones.
  12. Rack-aware partition assignment can avoid cross-AZ traffic from broker to consumer (a config sketch follows).
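    A minimal sketch of the consumer side of Kafka's rack-aware (follower) fetching, per KIP-392; the bootstrap server and AZ name are placeholders, and the brokers must also be configured as noted in the comment:

        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerConfig;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.common.serialization.ByteArrayDeserializer;

        // Brokers must set broker.rack=<their AZ> and
        // replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
        // so that a consumer declaring its rack is served by an in-AZ replica.
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");  // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "az-1");  // this consumer's AZ
        KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);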
  13. The Kafka source doesn't support filtering or projection on the broker side, so each consuming job (Job-1, Job-2, Job-3) must apply its own filter or projection after reading the full stream.
  14. Teams set up routing jobs just to filter or project data for the downstream jobs (Job-1, Job-2, Job-3).
  15. The Kafka source statically assigns partitions to workers during startup (Partition-1 → Worker-1, Partition-2 → Worker-2, Partition-3 → Worker-3), which leads to a few limitations.
  16. Other workers can't pick up the slack from an outlier: the outlier worker stays bound to its statically assigned partition.
  17. Source parallelism is limited by the number of partitions: with three partitions, Worker-4 sits idle.
  18. Autoscaling may not produce a balanced partition assignment across workers.
  19. The Flink Iceberg sink commits files (f1, f2, f3) after every successful checkpoint of the upstream streaming job; checkpoint intervals of 1-10 minutes are pretty common (see the sketch below).
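    As a rough sketch of that commit cadence (the table path is a placeholder, and rowDataStream stands in for the upstream DataStream<RowData>), the Iceberg Flink sink ties commits to Flink checkpoints:

        import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
        import org.apache.iceberg.flink.TableLoader;
        import org.apache.iceberg.flink.sink.FlinkSink;

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Data files are committed on each successful checkpoint, so this
        // interval sets the commit cadence (60s here; 1-10 min is common)
        env.enableCheckpointing(60_000L);

        FlinkSink.forRowData(rowDataStream)  // rowDataStream: DataStream<RowData>
            .tableLoader(TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/tbl"))
            .append();
        env.execute("iceberg-sink-job");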
  20. Can Flink stream data from Iceberg as files are committed by the upstream job? Upstream Streaming Job → (sink) → Iceberg → Streaming Source Job.
  21. Iceberg supports scanning incremental changes between snapshots: the upstream job's sink commits files (f1, f2, f3) into successive snapshots Sn, Sn+1, and the streaming source job reads the newly appended files via TableScan appendsBetween(long fromSnapshotId, long toSnapshotId). This cycle continues forever (a scan sketch follows).
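    A minimal sketch of that incremental append scan; the table path is a placeholder, and fromSnapshotId/toSnapshotId are assumed to come from tracking the table's snapshot history:

        import org.apache.iceberg.FileScanTask;
        import org.apache.iceberg.Table;
        import org.apache.iceberg.hadoop.HadoopTables;
        import org.apache.iceberg.io.CloseableIterable;

        Table table = new HadoopTables().load("hdfs://nn:8020/warehouse/db/tbl");
        // Plan only the data files appended between the two snapshots;
        // the streaming source repeats this on every monitor interval
        try (CloseableIterable<FileScanTask> tasks =
                table.newScan().appendsBetween(fromSnapshotId, toSnapshotId).planFiles()) {
          for (FileScanTask task : tasks) {
            System.out.println(task.file().path());  // each task covers a file or file slice
          }
        }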
  22. The Kafka streaming job and the Iceberg streaming job look almost identical; only the source builder differs (also available in Flink SQL):

      KafkaSource<RowData> source = KafkaSource.builder()
          .setBootstrapServers("…")
          .setTopics("…")
          .setStartingOffsets(OffsetsInitializer.latest())
          .setDeserializer(…)
          .build();

      IcebergSource<RowData> source = IcebergSource.forRowData()
          .tableLoader(tableLoader)
          .streamingStartingStrategy(INCREMENTAL_FROM_LATEST_SNAPSHOT)
          .monitorInterval(Duration.ofSeconds(30L))
          .build();
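    For context, wiring either built source into a DataStream job looks roughly like this (the watermark strategy and source name are placeholders, and env is the StreamExecutionEnvironment):

        import org.apache.flink.api.common.eventtime.WatermarkStrategy;
        import org.apache.flink.api.common.typeinfo.TypeInformation;
        import org.apache.flink.streaming.api.datastream.DataStream;
        import org.apache.flink.table.data.RowData;

        DataStream<RowData> stream = env.fromSource(
            source,                            // the IcebergSource (or KafkaSource) built above
            WatermarkStrategy.noWatermarks(),
            "iceberg-source",
            TypeInformation.of(RowData.class));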
  23. Build low-latency data pipelines chained by Flink jobs streaming from Iceberg: Device → (seconds) API Edge → Message Queue → (minutes) Streaming Ingestion → Data Lake (raw data) → (minutes) ETL → Data Lake (cleaned and enriched data) → (minutes) Feature Engineering → Data Lake (feature store) → (minutes) Nearline Model Training → Model Store.
  24. Where does stream processing fit in the spectrum of data processing applications? From more real time to more lag time: Transactional Processing → Event-driven Applications → Streaming Analytics → Data Pipelines → Continuous Processing → Batch Processing. (Stephan Ewen, Xiaowei Jiang & Robert Metzger, "From Stream Processing to Unified Data Processing System," Flink Forward, April 1-2, 2019, San Francisco.)
  25. The Flink Iceberg streaming source, with its minutes-level latency, fits well for data pipelines and continuous processing on that spectrum (same reference as above).
  26. What about incremental batch processing? • Schedule batch runs every few minutes • Each run processes the new files added since the last run • The line becomes blurry as scheduling intervals are shortened
  27. Limitations of incremental batch processing • Tearing down and starting batch runs may be too expensive when scheduling intervals are small • Operational burden can be too high • Intermediate results for stateful processing are lost after each batch run and recomputed in the next run
  28. A unit of work is defined as a split (see the sketch after this item) • In the Kafka source, a split is a partition • In the Iceberg source, a split is a file, a slice of a large file, or a group of small files • A split can be unbounded (Kafka) or bounded (Iceberg)
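    To make the abstraction concrete, here is an illustrative bounded split implementing Flink's FLIP-27 SourceSplit interface. The class and fields are hypothetical; the real class in iceberg-flink (IcebergSourceSplit) wraps an Iceberg scan task instead:

        import org.apache.flink.api.connector.source.SourceSplit;

        // Illustrative only: identifies the slice of a data file a reader should scan
        public class FileSliceSplit implements SourceSplit {
          private final String filePath;
          private final long start;   // byte offset where the slice begins
          private final long length;  // slice length; a large file yields several slices

          public FileSliceSplit(String filePath, long start, long length) {
            this.filePath = filePath;
            this.start = start;
            this.length = length;
          }

          @Override
          public String splitId() {
            return filePath + ":" + start + ":" + length;
          }
        }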
  29. The Iceberg source dynamically assigns splits to readers with a pull-based model (sketched below): (1) the enumerator on the JobManager discovers splits from Iceberg, (2) discovered splits are queued as pending, (3) a reader on a TaskManager requests a split upon startup or when done with its current split, one split at a time, and (4) the enumerator assigns it a split.
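    An illustrative skeleton of the pull model on the enumerator side, using the hypothetical FileSliceSplit from above; the field names are invented, but the callback signature is Flink's FLIP-27 SplitEnumerator API:

        // Called when a reader starts up or finishes its current split
        @Override
        public void handleSplitRequest(int subtaskId, @Nullable String requesterHostname) {
          FileSliceSplit next = pendingSplits.poll();  // pendingSplits: Queue<FileSliceSplit>
          if (next != null) {
            context.assignSplit(next, subtaskId);      // context: SplitEnumeratorContext<FileSliceSplit>
          } else {
            awaitingReaders.add(subtaskId);            // serve after the next split discovery
          }
        }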
  30. FLIP-27 unifies batch and streaming sources: the only difference is whether split discovery is one-time or periodic.
  31. Unify the live and backfill sources on Iceberg: the live job and the backfill job read from the same Iceberg table and write to their sinks.
  32. Most cloud blob storages don't charge network cost within a region, so consumers in any AZ can read Iceberg data on cloud storage without cross-AZ charges.
  33. Support advanced data pruning • File pruning (predicate pushdown) • Column projection

      IcebergSource<RowData> source = IcebergSource.forRowData()
          .tableLoader(tableLoader)
          .streamingStartingStrategy(INCREMENTAL_FROM_LATEST_SNAPSHOT)
          .monitorInterval(Duration.ofSeconds(30L))
          .filters(Expressions.equal("dt", "2020-03-20"))
          .project(schema.select("dt", "id", "name"))
          .build();
  34. Dynamic pull-based split assignment allows other workers to pick up the slack from an outlier: they simply request and process more files from the enumerator's pending splits.
  35. Having a lot more file segments than Kafka partitions is more operationally friendly • Can support higher parallelism • Is more autoscaling friendly
  36. The FLIP-27 Flink Iceberg source has been merged into the Apache Iceberg project (https://github.com/apache/iceberg/projects/23). CDC (updates/deletes) read is not supported yet.
  37. Watermarks control the tradeoff between latency and completeness (see the sketch after this item) • Records can come out of order • A watermark asserts that all data before it has arrived
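    In Flink's DataStream API, this tradeoff is typically expressed with a bounded-out-of-orderness watermark. A minimal sketch; the 5-minute bound is a placeholder, and Event is a hypothetical POJO with an eventTimeMillis field:

        import java.time.Duration;
        import org.apache.flink.api.common.eventtime.WatermarkStrategy;

        // The watermark trails the max observed timestamp by 5 minutes:
        // a larger bound favors completeness, a smaller one favors latency
        WatermarkStrategy<Event> strategy =
            WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                .withTimestampAssigner((event, ts) -> event.eventTimeMillis);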
  38. Why is watermark alignment needed? Consider a stateful join of impression and click streams with a 6-hour join window: everything works well in steady state with live traffic.
  39. During replay, the two sources can proceed at different paces due to different data volumes (impressions carry 4x the click volume).
  40. The Flink watermark is calculated as the minimum of all inputs: with the impression source (4x volume) replaying at now - 18h while the click source is already at now, the keyed co-process operator's watermark sits at now - 18h.
  41. Slow watermark advancement can lead to excessive buffering: the job buffers 24 hours of click data, versus 6 hours during steady state.
  42. Watermark alignment ensures both sources progress at similar paces and avoids excessive buffering: the click source is throttled to now - 17h while impressions replay at now - 18h, so the job buffers about 7 hours of click data, close to steady state (an API sketch follows).
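    Flink (1.15+) exposes this as watermark alignment on the WatermarkStrategy: sources sharing a group throttle any reader whose watermark runs further ahead than the allowed drift. The group name and durations are placeholders, and Event is the hypothetical POJO from the earlier sketch:

        import java.time.Duration;
        import org.apache.flink.api.common.eventtime.WatermarkStrategy;

        WatermarkStrategy<Event> aligned =
            WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                .withTimestampAssigner((event, ts) -> event.eventTimeMillis)
                // group name, max allowed watermark drift, update interval
                .withWatermarkAlignment("join-sources", Duration.ofMinutes(30), Duration.ofSeconds(1));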
  43. Kafka readers extract and track watermarks from a timestamp field in the records consumed from Kafka: Reader-0 (P0) at 10:45, Reader-1 (P1) at 10:23, Reader-2 (P2) at 10:10.
  44. Kafka readers periodically send their local watermarks (10:45, 10:23, 10:10) to the enumerator for aggregation.
  45. The enumerator calculates the global watermark as the minimum of the local watermarks: min(10:45, 10:23, 10:10) = 10:10.
  46. The enumerator broadcasts the global watermark (10:10) back to all readers.
  47. Readers check the difference between their local watermark and the global watermark to decide if throttling is needed: with maxAllowedWatermarkDrift = 30 mins, Reader-0 at 10:46 is throttled against the 10:10 global watermark (sketched below).
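    The reader-side decision is essentially this check; all names here are hypothetical:

        // Pause fetching when the local watermark runs too far ahead of the
        // globally aggregated watermark; resume once a newer global value arrives
        long maxDriftMillis = 30L * 60L * 1000L;  // 30 minutes
        if (localWatermarkMillis - globalWatermarkMillis > maxDriftMillis) {
          pauseFetching();  // hypothetical: stop polling this reader's partition
        }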
  48. Kafka source vs. Iceberg source:
      • Kafka: splits are unbounded / Iceberg: splits are bounded
      • Kafka: static split assignment / Iceberg: dynamic pull-based split assignment
      • Kafka: records are ordered within a partition / Iceberg: records may not be sorted by the timestamp field within a data file
      • Kafka: only readers can extract the watermark info from records / Iceberg: the enumerator can extract min-max values from column-level statistics in metadata files
  49. The enumerator assigns files to readers ordered by their minimum timestamp values: F1 (9:00-9:03), F2 (9:04-9:10), F3 (9:05-9:09), F4 (9:13-9:19), F5 (9:16-9:21), F6 (9:17-9:25), F7 (9:21-9:26), F8 (9:23-9:25), F9 (9:25-9:32).
  50. Readers extract their local watermark from the min value of the timestamp column stats of the assigned file (a sketch follows): Reader-0 gets F1 (9:00-9:03) → 9:00, Reader-1 gets F2 (9:04-9:10) → 9:04, Reader-2 gets F3 (9:05-9:09) → 9:05; global watermark = 9:00.
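    A minimal sketch of reading the min timestamp from Iceberg's file-level column statistics, without opening the data file; tsFieldId, the field id of the timestamp column, is an assumption:

        import java.nio.ByteBuffer;
        import org.apache.iceberg.DataFile;
        import org.apache.iceberg.types.Conversions;
        import org.apache.iceberg.types.Types;

        // lowerBounds() maps field id -> serialized min value, stored in metadata
        long minTimestampMicros(DataFile file, int tsFieldId) {
          ByteBuffer lower = file.lowerBounds().get(tsFieldId);
          return Conversions.fromByteBuffer(Types.TimestampType.withZone(), lower);
        }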
  51. Readers check the difference between their local watermarks (9:00, 9:04, 9:05) and the global watermark (9:00) to decide if throttling is needed; max allowed drift = 10 mins.
  52. Reader-2 finished F3 and requested a new file; local watermarks were 9:00, 9:04, 9:05 and the global watermark 9:00 (max allowed drift = 10 mins).
  53. The enumerator assigned F4 (9:13-9:19) to Reader-2, which advanced its local watermark to 9:13; the global watermark stayed at 9:00.
  54. Reader-2 paused reading because its local watermark was too far ahead: 9:13 - 9:00 > 10 mins max allowed drift.
  55. Reader-0 advanced its watermark to 9:16 after receiving the new file F5 (9:16-9:21); now both Reader-0 (9:16 - 9:00 > 10 mins) and Reader-2 (9:13 - 9:00 > 10 mins) were throttled.
  56. After the propagation delay, the global watermark advanced to 9:04 and Reader-2 resumed reading, since 9:13 - 9:04 <= 10 mins.
  57. Max out-of-orderliness = max allowed watermark drift + max timestamp range in data files. Here the max allowed drift is 10 minutes and the widest file spans 8 minutes (F6, 9:17-9:25), so the max out-of-orderliness is 10 + 8 = 18 minutes.
  58. Keeping the max out-of-orderliness small avoids excessive buffering in Flink state: the throttled replay buffers about 7 hours of click data, close to the 6 hours of steady state.
  59. What are we evaluating • Processing delay • How the upstream commit interval affects bursty consumption • CPU utilization comparison between the Kafka and Iceberg sources
  60. Measure latency as processing time minus event timestamp, from Kafka through to the Iceberg source job, compared against a direct Kafka source job.
  61. Latency is mostly decided by the commit and poll intervals: the Kafka source job path is sub-second, while the Iceberg source job path adds the upstream commit interval plus the source poll interval.
  62. The latency histogram is within the expected range for a 10s commit and 5s poll interval: max latency stays under 40s and the median fluctuates around 10s.
  63. Transactional commits in upstream ingestion lead to bursty, stop-and-go consumption in the Iceberg source job, as expected, while the Kafka source job's CPU usage stays smooth (300s commit and 30s poll interval; 30-minute graphing window; y-axis: CPU usage in cores).
  64. CPU usage becomes smoother as we shorten the upstream commit interval and the Iceberg source poll interval: from 300s commit / 30s poll, to 60s commit / 10s poll, to 10s commit / 5s poll (30-minute graphing window).
  65. How does the Iceberg source compare to the Kafka source in CPU usage? The two jobs are identical except for the streaming source: Kafka vs. Iceberg.
  66. Here is the CPU usage comparison between the Kafka and Iceberg source jobs after applying a smoothing function: roughly 36% vs. 60% of a core (60-minute graphing window; 60s commit and 10s poll intervals).
  67. Recap: build low-latency data pipelines chained by Flink jobs streaming from Iceberg, with each hop after ingestion taking minutes instead of hours or days (Device → API Edge → Message Queue → Streaming Ingestion → raw data → ETL → cleaned and enriched data → Feature Engineering → feature store → Nearline Model Training → Model Store).
  68. Q&A