Jark Wu
October 31, 2025

[FF2025] Apache Fluss (Incubating) - Making Your Lakehouse Truly Real-Time

Modern data architectures demand seamless integration between real-time streaming and analytical systems. Especially in the era of Gen AI, real-time lakehouses are no longer optional—they’re essential. High-quality, real-time data is critical to ensuring the accuracy, responsiveness, and reliability of AI-driven applications. However, traditional batch-oriented lakehouses struggle to meet these demands, while legacy streaming tools like Kafka lack native integration with modern lakehouse architectures, leading to inefficiencies in cost, latency, and scalability.

In this session, we’ll introduce Fluss, a Lakehouse-native streaming storage system designed for analytics workloads. We’ll discuss how Fluss unifies data streaming and the data Lakehouse by serving real-time streaming data on top of the Lakehouse (Iceberg). This not only brings powerful analytics capabilities to data streams but also delivers low-latency data to Iceberg, transforming it into a Real-Time Lakehouse. Finally, we’ll explore real-world use cases where Fluss enables Real-Time Lakehouses, highlighting its benefits and its potential to power the next generation of AI applications.

Transcript

  1. Who Am I?
    Jark Wu 👋
    • PMC member of Apache Flink
    • PPMC member of Apache Fluss (incubating)
    • Original creator of Flink CDC and Fluss projects
    • Flink SQL & Fluss Team Leader @ Alibaba
    • 10 years on distributed systems
  2. What If We Could Rebuild Kafka From Scratch?
    Lakehouse Native
    A Lakehouse-Native “Kafka” that easily makes your Lakehouse truly Real-Time
  3. Why Does the Real-Time Lakehouse Matter?
    Diagram labels: Bronze Tables, Silver Tables, Gold Tables
    • Business Demand for Speed
    • Immediate Decision Making
    • AI/ML Needs Fresh Data
    • Agents Require Real-Time Context
    You can’t build the next TikTok recommender system on a traditional Lakehouse, which lacks real-time streaming data for AI.
  4. Apache Fluss (Incubating): Extend Your Lakehouse with Real-Time Streaming
    Diagram labels: Lakehouse Storage (Paimon, Iceberg, Lance); Query Engines; Lakehouse Analytics; Union Read; Lookup Join; Changelog Read; Batch Query; Databases; Logs; Streaming Writes; Real-Time Updates; Real-Time Lakehouse (fresh data in real time); Data Lakes
  5. Why Rebuild Kafka, Not Enhance Kafka?
    • No Update
    • Data Model Mismatch
    • No Schema
    Kafka was designed for events, not for analytics. This leads to manual per-topic configuration.
  6. Data Model of Fluss vs. Iceberg vs. Kafka
    Feature | Fluss | Iceberg | Kafka
    Database | ✅ | ✅ | ❌
    Table | ✅ | ✅ | ⚠ Topic
    Partitions (dt = 20251015) | ✅ | ✅ | ❌
    Update Row | ✅ | ✅ | ❌
    Delete Row | ✅ | ✅ | ❌
    Data Types | ✅ 20+ Types | ✅ 20+ Types | ❌ No Schema
    File Format | ✅ Column Format (Apache Arrow) | ✅ Column Format (Apache Parquet) | ⚠ Row Format (Avro/JSON)
    Fluss is Lakehouse-native: its data model aligns with that of Lakehouse systems. Kafka's data model, by contrast, leaves a big gap with Lakehouse systems.
  7. A Glimpse at Enabling Stream to Iceberg
    Fluss: one SQL line
        ALTER TABLE customers SET ('table.datalake.enabled' = 'true')
    Confluent/Warpstream Tableflow: ugly YAML file and type mapping
    Imagine having 100s of topics and 50+ fields in each table 😖
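With hundreds of tables, the one-line statement on slide 7 could be generated rather than typed by hand. The following is a minimal Python sketch under stated assumptions: the table names are hypothetical, the helper `enable_datalake_statements` is invented for illustration, and only the statement text and the `table.datalake.enabled` option come from the slide. How the statements are submitted (for example, through a SQL client) is not shown.

```python
def enable_datalake_statements(table_names):
    """Build one ALTER TABLE statement per table, using the option shown on slide 7."""
    statements = []
    for name in table_names:
        # This sketch only accepts plain identifiers (letters, digits, underscores);
        # names that would need quoting are rejected rather than escaped.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unsupported table name: {name!r}")
        statements.append(
            f"ALTER TABLE {name} SET ('table.datalake.enabled' = 'true')"
        )
    return statements


# Hypothetical table names, for illustration only.
for statement in enable_datalake_statements(["customers", "orders"]):
    print(statement)
```

The point of the sketch is that the per-table input is just a name: no field list or type mapping has to be supplied alongside it.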
  8. Trend: The Convergence of Stream and Lakehouse
    Timeline events (in slide order): Alibaba starts project Fluss; Propose “Streaming Lakehouse”; Confluent Tableflow; Redpanda Iceberg Topic; StreamNative Ursa; AutoMQ Table Topic; Aiven Iceberg Topic
    Timeline dates (in slide order): 2023.07, 2024.03, 2024.09, 2024.10, 2024.12, 2025.08
  9. Why Is Tableflow Not the Answer?
    Tableflow: Stream into Lakehouse. Fluss: Streaming Lakehouse.
    Aspect | Tableflow | Fluss
    Architecture | Stream into Lakehouse: data synchronization tool | Streaming Lakehouse: data shared between both sides
    Value | Zero-ETL | Stream and Batch Unified
    Data Cost | Two Datasets | One Dataset
    Use Cost | Configure per topic, for each field | A single config
  10. How Does Fluss Build Lakehouse-Native Streaming Storage?
    Diagram labels: Topics; Streams As Tables; Continuously Updating
    • From Topics -> Tables
    • First-class schema support with schema enforcement
    • Primary Key constraint & Update support
    • Data format: from Avro to a columnar stream (10x if only 10% of columns are read)
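The “10x if 10% columns read” figure on slide 10 can be reproduced with simple arithmetic. The sketch below assumes, as a simplification not stated on the slide, that reads are I/O-bound and that every column takes roughly the same number of bytes: a row format has to read every column, while a columnar format reads only the projected ones. The function name is hypothetical.

```python
def columnar_read_reduction(total_columns: int, columns_read: int) -> float:
    """Factor by which a columnar layout cuts bytes read versus a row layout.

    Simplifying assumptions: equal-sized columns, and a row format that must
    read every column even when the query projects only a few of them.
    """
    if not 0 < columns_read <= total_columns:
        raise ValueError("columns_read must be between 1 and total_columns")
    return total_columns / columns_read


# Reading 5 of 50 columns (10%) touches one tenth of the data.
print(columnar_read_reduction(50, 5))  # 10.0
```

Real gains depend on column widths, compression, and how much of the workload is actually bound by I/O, so the factor is best read as an upper-bound style estimate.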
  11. Fluss Lake Tiering Service
    Diagram labels: Fluss Table A (partition=20250528 with bucket1, bucket2; partition=20250529); Fluss Table B; Lake Table A (partition=20250528 with bucket1, bucket2; partition=20250529); Lake Table B; AWS S3; Metadata; Catalogs; DDL; Commit offsets
    Lake Tiering Service (Stateless Flink Jobs):
    • Auto create lake table
    • Auto mapping schema
    • Arrow -> Parquet convert
    • Freshness in minutes
  12. Streaming Lakehouse: Lakehouse as Historical Layer
    ✓ Efficient backfill for stream processing
    ✓ Projection/Filter pushdown on Parquet
    ✓ High compression of Parquet
    ✓ High throughput of S3
    Diagram labels: Lakehouse Analytics; Query Engines; Union Read; Real-Time Data Layer (Short-Term, Second Latency); Historical Data Layer (Long-Term, Minute Latency); Shared Data, Shared Metadata; Fluss Storage; Lakehouse Storage (Paimon, Iceberg)
  13. Streaming Lakehouse: Fluss as Real-Time Layer
    ✓ Unlocking real-time data to Lakehouse
    ✓ Union delta log (minutes) on Fluss
    ✓ Exchange using Arrow-native format
    ✓ Efficient process/integrate for query engines
    Diagram labels: Lakehouse Analytics; Query Engines; Union Read; Real-Time Data Layer (Short-Term, Second Latency); Historical Data Layer (Long-Term, Minute Latency); Shared Data, Shared Metadata; Fluss Storage; Lakehouse Storage (Paimon, Iceberg)
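  14. Union Read: Query Both Historical & Real-Time Data
    Lake snapshot (Snapshot 06): (Jark, 30), (Judy, 20)
    Fluss log along the log_offset / time axis: +(Jark,30), Snapshot 06, +(Timo,20), -(Judy,20)
    Union Read (Sort Merge): snapshot rows (Jark, 30), (Judy, 20) merged with log entries +(Timo,20), -(Judy,20)
    Result: (Jark, 30), (Timo, 20)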
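The Union Read on slide 14 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Fluss's implementation: the `Change` record, the numeric offsets, and the `union_read` function are hypothetical, the first column is treated as the primary key, and the snapshot is assumed to be sorted by that key. Log entries written after the snapshot offset are reduced to the latest change per key, sorted, and then merged with the snapshot rows in one pass.

```python
from typing import List, NamedTuple, Tuple


class Change(NamedTuple):
    offset: int  # position in the log (hypothetical numbering)
    op: str      # "+" for insert/upsert, "-" for delete
    key: str     # primary key (the name column in the slide's example)
    value: int


def union_read(snapshot: List[Tuple[str, int]],
               snapshot_offset: int,
               log: List[Change]) -> List[Tuple[str, int]]:
    """Merge a key-sorted lake snapshot with log changes newer than the snapshot."""
    # Keep only the latest change per key among entries after the snapshot.
    latest = {}
    for change in sorted(log, key=lambda c: c.offset):
        if change.offset > snapshot_offset:
            latest[change.key] = change
    delta = sorted(latest.values(), key=lambda c: c.key)

    # Two-pointer sort merge over the snapshot rows and the sorted delta.
    result = []
    i = j = 0
    while i < len(snapshot) or j < len(delta):
        if j == len(delta) or (i < len(snapshot) and snapshot[i][0] < delta[j].key):
            result.append(snapshot[i])  # row untouched by the log
            i += 1
        else:
            change = delta[j]
            if i < len(snapshot) and snapshot[i][0] == change.key:
                i += 1  # the log entry supersedes the snapshot row
            if change.op == "+":
                result.append((change.key, change.value))
            j += 1
    return result


# The slide's example: the snapshot holds Jark and Judy; the log then adds Timo
# and deletes Judy.
snapshot = [("Jark", 30), ("Judy", 20)]
log = [
    Change(0, "+", "Jark", 30),
    Change(1, "+", "Judy", 20),
    Change(2, "+", "Timo", 20),
    Change(3, "-", "Judy", 20),
]
print(union_read(snapshot, snapshot_offset=1, log=log))
# [('Jark', 30), ('Timo', 20)]
```

Entries at or before the snapshot offset are skipped because the snapshot already reflects them, which is why +(Jark,30) does not produce a duplicate row.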
  15. Streaming + Lakehouse = Real-Time Lakehouse
    Diagram labels: Cloud Storage; Lakehouse; Streaming; Compute Engine; AI & BI; Catalog; Real-Time Insights
  16. Real-Time Multimodal Lakehouse for AI
    Diagram labels: AI; Analytics; Streaming Writes; Real-Time Updates; Real-Time Multimodal Lakehouse; Fluss (Real-Time Data); Lance Lakehouse (Historical Data); Tiering Service; Python Ecosystem; Multimodal data (Text, Images, Videos, Audio)
  17. The Current Fluss
    • Open Source: 1500 GitHub Stars, 81 Contributors, Incubating in ASF
    • In Alibaba: 3 PB data size, 40 GiB/s throughput
    • On Alibaba Cloud: Managed Service for Apache Fluss (Private Preview)
  18. Future Plan
    • Real-Time Analytics: Optimize real-time analytics with Flink, from computation to storage.
    • Real-Time Lakehouse: Real-time data layer on the Lakehouse, and more query engines (Spark, Trino).
    • Real-Time AI: Build real-time feature engineering, real-time AI context, real-time multimodal.
  19. The End Goal of Fluss
    “Bring better analytics to data streams and better data freshness to data Lakehouses.”
    And if you like the project … don’t forget to give it some ❤ via ⭐ on GitHub. https://github.com/alibaba/fluss