[DSS2025] Fluss: Reinventing Kafka for the Real-Time Lakehouse

The session addresses the limitations of Kafka in creating real-time lakehouses necessary for modern AI applications. Fluss is introduced as a novel system built from scratch to integrate seamlessly with Lakehouse architectures.

Video: https://www.youtube.com/watch?si=8JrS6jhVSJoykY1t&v=OzE0mVD0GPs&feature=youtu.be

Jark Wu

May 28, 2025

Transcript

  1. Data Streaming Summit Virtual 2025. Fluss: Reinventing Kafka for the Real-Time Lakehouse

    Jark Wu, Head of Fluss and Flink SQL at Alibaba Cloud
  2. Jark Wu, Head of Fluss and Flink SQL at Alibaba Cloud

    • Apache Flink PMC member and Committer
    • Original Creator of Flink SQL, Flink CDC, Fluss
    • Flink SQL & Fluss Team Leader @ Alibaba
    • 10 years on distributed systems
  3. What If We Could Rebuild Kafka From Scratch?

    Cloud Native; Save Money; Massive Market 💰💰💰; RedPanda Cloud Topics; Confluent Freight Cluster; Lakehouse Native
  4. Why is Lakehouse Native a Problem for Kafka?

    • No Update
    • Data Model Mismatch
    • No Schema
    Kafka was designed for events, not for analytics. This leads to manual per-topic configuration.
  5. Why Is Tableflow Not the Answer?

    • Lambda Architecture
    • Two copies of data are stored
    • Stream and Table are separated
    • For bronze layers, not silver and gold layers
    Examples shown: AutoMQ Table Topic, Confluent Tableflow, Redpanda Iceberg Topic
  6. Lakehouse Needs Real-Time Insights

    Bronze Tables; Silver Tables; Gold Tables
    • Business Demand for Speed
    • Immediate Decision Making
    • AI/ML Needs Fresh Data
    • Agents Require Real-Time Context
  7. Fluss: a New Lakehouse-Native Streaming Storage

    Sub-Second Latency; Updates & Changelog; Lookup Queries; Unified Stream/Batch; Projection Pushdown
    10x streaming read; Efficient processing of historical data; Real-time read/write; Stream-table duality; Easy to inspect
    Diagram labels: Databases; Logs; Streaming Writes; Real-Time Updates; Fluss Cluster (three Servers); Streaming Reads; Batch Reads; Union Reads; Lookup Join; Lakehouse Analytics; Tiering Service; Remote Storage (S3 / OSS / HDFS); Lakehouse Storage (Paimon / Iceberg*)
  8. How to Build a Lakehouse-Native Streaming Storage?

    Topics; Streams As Tables; Continuously Updating
    ① From Topics -> Tables
    ② First-class schema support with schema enforcement
    ③ Primary Key constraint & Update support
    ④ Data format from Avro -> The Columnar Stream (10x if 10% columns read)
  9. Fluss Lake Tiering Service

    Diagram labels: Fluss Table A (partition=20250528 with bucket1, bucket2; partition=20250529); Fluss Table B; Lake Table A (partition=20250528 with bucket1, bucket2; partition=20250529); Lake Table B; AWS S3
    Lake Tiering Service: Stateless Flink Jobs; Metadata; Catalogs; DDL; Commit offsets
    • Auto create lake table
    • Auto mapping schema
    • Arrow -> Parquet convert
    • Freshness in minutes
  10. Lakehouse as Historical Data Layer of Fluss

    Diagram labels: Lakehouse Analytics; Query Engines; Real-Time Data Layer (Short-Term, Second Latency): Fluss Storage; Historical Data Layer (Long-Term, Minute Latency): Lakehouse Storage (Paimon, Iceberg*); Shared Data, Shared Metadata
    Hybrid Source: ① Backfill ② Streaming Read
    ✓ Efficient backfill for stream processing
    ✓ Projection/Filter pushdown on Parquet
    ✓ High compression of Parquet
    ✓ High throughput of S3
  11. Fluss as Real-Time Data Layer of Lakehouse

    Diagram labels: Lakehouse Analytics; Query Engines; Real-Time Data Layer (Short-Term, Second Latency): Fluss Storage; Historical Data Layer (Long-Term, Minute Latency): Lakehouse Storage (Paimon, Iceberg*); Shared Data, Shared Metadata
    Union Reads
    ✓ Unlocking real-time data to Lakehouse
    ✓ Union delta log (minutes) on Fluss
    ✓ Exchange using Arrow-native format
    ✓ Efficient process/integrate for query engines
  12. Union Read: Query both Historical & Real-time Data

    Snapshot 06: Jark, 30; Judy, 20
    Log around the snapshot offset: +(Jark,30); +(Timo,20); -(Judy,20)
    Union Reads (Sort Merge): Jark, 30; Timo, 20
  13. Streaming + Lakehouse = Real-Time Lakehouse

    Diagram labels: Cloud Storage; Lakehouse; Streaming; Compute Engine; AI & BI; Catalog; Real-Time Insights
  14. The Current Fluss Open Source Community

    Alibaba; 1100 GitHub Stars; 54 Contributors; Donating to ASF; 1 PB data size; 10 GiB/s throughput
  15. Future Plan

    • More Table Formats: Iceberg, DeltaLake, Hudi
    • Real-Time Layer on Lakehouse: Shared Metadata, Deletion Vector
    • More Query Engines: Spark, StarRocks, DuckDB
  16. The End Goal of Fluss

    “Bring better analytics to data streams and better data freshness to data Lakehouses.”
    And if you like the project … don’t forget to give it some ❤ via ⭐ on GitHub. https://github.com/alibaba/fluss
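
The primary-key tables of slide 8 and the union read of slide 12 can be illustrated with a small toy model. The sketch below is not Fluss code and uses no Fluss API: the names `PrimaryKeyTable` and `union_read` are hypothetical, the changelog is a plain Python list whose index stands in for the log offset, and a dictionary keyed by primary key replaces the sort-merge named on the slide. It only shows the idea that a snapshot taken at some offset, plus the changelog entries after that offset, reproduces the current table.

```python
from dataclasses import dataclass, field


@dataclass
class PrimaryKeyTable:
    """Toy primary-key table (hypothetical, not the Fluss API).

    Every upsert or delete appends to a changelog; the list index of an
    entry plays the role of its log offset.
    """
    rows: dict = field(default_factory=dict)        # primary key -> value
    changelog: list = field(default_factory=list)   # (op, key, value)

    def upsert(self, key, value):
        if key in self.rows:
            # Retract the previous value before emitting the new one.
            self.changelog.append(("-", key, self.rows[key]))
        self.rows[key] = value
        self.changelog.append(("+", key, value))

    def delete(self, key):
        if key in self.rows:
            self.changelog.append(("-", key, self.rows.pop(key)))

    def snapshot(self):
        """Return a copy of the rows and the offset the snapshot covers up to."""
        return dict(self.rows), len(self.changelog)


def union_read(snapshot_rows, snapshot_offset, changelog):
    """Merge a historical snapshot with the changelog entries after its offset."""
    result = dict(snapshot_rows)
    for op, key, value in changelog[snapshot_offset:]:
        if op == "+":
            result[key] = value
        else:
            result.pop(key, None)
    return result


table = PrimaryKeyTable()
table.upsert("Jark", 30)
table.upsert("Judy", 20)
# Assumption: this snapshot stands in for the data tiered to the lakehouse.
snap_rows, snap_offset = table.snapshot()
table.upsert("Timo", 20)
table.delete("Judy")

merged = union_read(snap_rows, snap_offset, table.changelog)
print(sorted(merged.items()))  # [('Jark', 30), ('Timo', 20)]
```

The values mirror the example on slide 12: the snapshot holds Jark and Judy, the later log entries add Timo and remove Judy, and the merged view contains Jark and Timo. Only the entries after the snapshot offset are replayed, which is why recording that offset alongside the snapshot matters.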