[DSS2025] Fluss: Reinventing Kafka for the Real-Time Lakehouse

Data Streaming Summit Virtual 2025
Fluss: Reinventing Kafka for the Real-Time Lakehouse
Jark Wu, Head of Fluss and Flink SQL at Alibaba Cloud

Jark Wu, Head of Fluss and Flink SQL at Alibaba Cloud
• Apache Flink PMC member and Committer
• Original Creator of Flink SQL, Flink CDC, Fluss
• Flink SQL & Fluss Team Leader @ Alibaba
• 10 years on distributed systems

What If We Could Rebuild Kafka From Scratch?
• Cloud Native
• Lakehouse Native
• Save Money 💰💰💰
• Massive Market 💰💰💰
• RedPanda Cloud Topics
• Confluent Freight Cluster

Why is Lakehouse Native a Problem for Kafka?
• No Update
• Data Model Mismatch
• No Schema
Kafka was designed for events, not designed for analytics. This leads to manual per-topic configuration.

Why Is Tableflow Not the Answer?
• Lambda Architecture
• Two copies of data are stored
• Stream and Table are separated
• For bronze layers, not silver and gold layers
AutoMQ Table Topic, Confluent Tableflow, Redpanda Iceberg Topic

Lakehouse Needs Real-Time Insights
Bronze Tables, Silver Tables, Gold Tables
Business Demand for Speed
• Immediate Decision Making
• AI/ML Needs Fresh Data
• Agents Require Real-Time Context

Fluss: a New Lakehouse-Native Streaming Storage
• Sub-Second Latency
• Updates & Changelog
• Lookup Queries
• Unified Stream/Batch
• Projection Pushdown
• 10x streaming read
• Efficient processing historical data
• Real-time read/write
• Stream-table duality
• Easy to inspect
[Architecture diagram labels: Databases, Logs; Streaming Writes, Real-Time Updates; Fluss Cluster (Server, Server, Server); Streaming Reads, Batch Reads, Union Reads, Lookup Join, Lakehouse Analytics; Remote Storage (S3 / OSS / HDFS); Tiering Service; Lakehouse Storage (Paimon / Iceberg*)]

How to Build a Lakehouse-Native Streaming Storage?
[Diagram labels: Topics; Streams As Tables; Continuously Updating]
① From Topics -> Tables
② First-class schema support with schema enforcement
③ Primary Key constraint & Update support
④ Data format from Avro -> The Columnar Stream (10x if 10% columns read)
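The "10x if 10% columns read" figure on this slide follows from simple arithmetic: a columnar reader fetches only the projected columns, while a row-oriented reader fetches every column of every row. A minimal sketch, assuming equal-width columns and ignoring compression and metadata overhead; the function name and the numbers are illustrative and not taken from Fluss:

```python
def bytes_read(num_rows, num_cols, bytes_per_value, cols_projected, columnar):
    """Estimate bytes scanned for a projection (illustrative model only)."""
    if columnar:
        # Columnar layout: only the projected columns are read.
        return num_rows * cols_projected * bytes_per_value
    # Row-oriented layout (e.g. Avro records): all columns of each row are read.
    return num_rows * num_cols * bytes_per_value


# Hypothetical table: 1M rows, 100 columns of 8 bytes, query projects 10 columns.
row_bytes = bytes_read(1_000_000, 100, 8, cols_projected=10, columnar=False)
col_bytes = bytes_read(1_000_000, 100, 8, cols_projected=10, columnar=True)
speedup = row_bytes / col_bytes
print(speedup)  # 10.0
```

Under this model the saving shrinks as more columns are projected and disappears when all columns are read.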

Fluss Lake Tiering Service
• Auto create lake table
• Auto mapping schema
• Arrow -> Parquet convert
• Freshness in minutes
[Diagram labels: Fluss Table A (partition=20250528: bucket1, bucket2; partition=20250529), Fluss Table B; Lake Tiering Service (Stateless Flink Jobs); Lake Table A (partition=20250528: bucket1, bucket2; partition=20250529), Lake Table B; AWS S3; Metadata Catalogs; DDL; Commit offsets]
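The tiering loop on this slide can be pictured as: read a batch from the log, append it to the lake table, then commit the offset so a stateless job can resume after a restart. The sketch below is a toy model under those assumptions; `LakeTable` and `tier_once` are hypothetical names, plain Python lists stand in for Arrow batches and Parquet files, and nothing here is the actual Fluss API:

```python
from dataclasses import dataclass, field


@dataclass
class LakeTable:
    """Hypothetical stand-in for a Paimon/Iceberg table."""
    rows: list = field(default_factory=list)
    committed_offset: int = 0  # next log offset that has not been tiered yet


def tier_once(log, lake, batch_size):
    """Copy one batch of log records into the lake table and commit the offset."""
    start = lake.committed_offset
    batch = log[start:start + batch_size]
    if not batch:
        return 0  # nothing new to tier
    # A real service would convert Arrow batches to Parquet files at this step.
    lake.rows.extend(batch)
    # Assumption: the offset is recorded with the data, so no state lives in the job.
    lake.committed_offset = start + len(batch)
    return len(batch)


log = [("k1", 1), ("k2", 2), ("k3", 3)]
lake = LakeTable()
first = tier_once(log, lake, batch_size=2)   # tiers offsets 0 and 1
second = tier_once(log, lake, batch_size=2)  # tiers offset 2
third = tier_once(log, lake, batch_size=2)   # log exhausted
```

Keeping the resume position in the lake table rather than in the job is what lets the job itself be stateless in this model.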

Lakehouse as Historical Data Layer of Fluss
• Real-Time Data Layer (Short-Term, Second Latency): Fluss Storage
• Historical Data Layer (Long-Term, Minute Latency): Lakehouse Storage (Paimon, Iceberg*)
• Shared Data, Shared Metadata
Hybrid Source: ① Backfill ② Streaming Read
✓ Efficient backfill for stream processing
✓ Projection/Filter pushdown on Parquet
✓ High compression of Parquet
✓ High throughput of S3
[Diagram labels: Lakehouse Analytics, Query Engines]
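The hybrid source on this slide reads in two phases: backfill from the lake table first, then continue with a streaming read of the log. A minimal sketch, assuming the lake table records the log offset up to which it has been tiered; `hybrid_read` and the `(offset, record)` tuples are illustrative and not the Fluss connector API:

```python
def hybrid_read(lake_rows, lake_end_offset, retained_log):
    """Yield historical rows first, then log records not yet in the lake.

    lake_end_offset is the first log offset that has NOT been tiered.
    retained_log is a list of (offset, record) pairs still held by the log.
    """
    # Phase 1: backfill from the historical data layer.
    yield from lake_rows
    # Phase 2: streaming read, skipping records the lake already contains.
    for offset, record in retained_log:
        if offset >= lake_end_offset:
            yield record


lake_rows = ["a", "b", "c"]                      # offsets 0..2 already tiered
retained_log = [(2, "c"), (3, "d"), (4, "e")]    # short-term retention only
result = list(hybrid_read(lake_rows, 3, retained_log))
print(result)  # ['a', 'b', 'c', 'd', 'e']
```

The offset comparison is what prevents record "c", which exists in both layers, from being emitted twice.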

Fluss as Real-Time Data Layer of Lakehouse
• Real-Time Data Layer (Short-Term, Second Latency): Fluss Storage
• Historical Data Layer (Long-Term, Minute Latency): Lakehouse Storage (Paimon, Iceberg*)
• Shared Data, Shared Metadata
Union Reads
✓ Unlocking real-time data to Lakehouse
✓ Union delta log (minutes) on Fluss
✓ Exchange using Arrow-native format
✓ Efficient process/integrate for query engines
[Diagram labels: Lakehouse Analytics, Query Engines]

Union Read: Query both Historical & Real-time Data
[Diagram: a snapshot holding (Jark, 30) and (Judy, 20); a changelog with +(Jark,30), +(Timo,20), -(Judy,20), marked "Snapshot 06" at an offset; Union Reads with Sort Merge yield (Jark, 30) and (Timo, 20)]
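The example on this slide can be reproduced with a small merge: start from the lake snapshot and apply only the changelog entries at or after the snapshot's offset. The sketch below uses a dictionary keyed by primary key in place of the sort merge named on the slide, and the offsets 0 to 2 are assumptions chosen for illustration; `union_read` is a hypothetical name, not a Fluss API:

```python
def union_read(snapshot, changelog, snapshot_offset):
    """Merge a lake snapshot with changelog entries at or after snapshot_offset."""
    result = dict(snapshot)  # primary key -> value
    for offset, op, key, value in changelog:
        if offset < snapshot_offset:
            continue  # already reflected in the snapshot
        if op == "+":
            result[key] = value      # insert or update
        elif op == "-":
            result.pop(key, None)    # delete
    return sorted(result.items())


snapshot = {"Jark": 30, "Judy": 20}
changelog = [
    (0, "+", "Jark", 30),  # assumed to be covered by the snapshot
    (1, "+", "Timo", 20),
    (2, "-", "Judy", 20),
]
rows = union_read(snapshot, changelog, snapshot_offset=1)
print(rows)  # [('Jark', 30), ('Timo', 20)]
```

A sort merge reaches the same result by scanning both inputs in primary-key order, which avoids holding the whole snapshot in memory.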

Streaming + Lakehouse = Real-Time Lakehouse
[Diagram labels: Cloud Storage, Lakehouse, Streaming, Compute Engine, AI & BI, Catalog, Real-Time Insights]

The Current Fluss Open Source Community
• Alibaba
• 1100 GitHub Stars
• 54 Contributors
• Donating to ASF
• 1 PB data size
• 10 GiB/s throughput

Future Plan
• More Table Formats: Iceberg, DeltaLake, Hudi
• More Query Engines: Spark, StarRocks, DuckDB
• Real-Time Layer on Lakehouse: Shared Metadata, Deletion Vector

The End Goal of Fluss
“Bring better analytics to data streams and better data freshness to data Lakehouses.”
And if you like the project … Don’t forget to give it some ❤ via ⭐ on GitHub.
https://github.com/alibaba/fluss

Thank you!
