Apache Flink is a popular stream processing engine featuring sophisticated state management, event-time semantics, and exactly-once state consistency. For low-latency processing, Flink jobs typically consume data from streaming sources like Apache Kafka. Apache Iceberg is a widely adopted data lake technology supporting features such as snapshot isolation, transactional commits, and fast scan planning. While Iceberg was originally designed for batch workloads, it can also be used as a streaming source in Flink. Using Iceberg as a streaming source not only lowers processing delays from hours or days to just minutes, but also significantly reduces infrastructure cost and operational burden.
In this talk, we will explain the design of the Flink Iceberg source that we contributed to the Apache Iceberg open source project. We will compare the Kafka and Iceberg sources for streaming reads and present performance evaluation results for Iceberg streaming reads. We will also discuss how the Iceberg streaming source can power many common stream processing use cases, such as ETL and feature engineering. It enables users to build low-latency streaming pipelines chained by Iceberg tables that are cost-effective and easy to operate.