[FF2025] Apache Fluss (Incubating) - Making Your Lakehouse Truly Real-Time

Barcelona 2025, 13-16 October 2025
Apache Fluss (Incubating): Making Your Lakehouse Truly Real-Time
Jark Wu

Who Am I?
• PMC member of Apache Flink
• PPMC member of Apache Fluss (incubating)
• Original creator of the Flink CDC and Fluss projects
• Flink SQL & Fluss Team Leader @ Alibaba
• 10 years working on distributed systems
Jark Wu 👋

What If We Could Rebuild Kafka From Scratch?

What If We Could Rebuild Kafka From Scratch?
Lakehouse Native: a Lakehouse-native “Kafka” that easily makes your Lakehouse truly real-time

Why Does a Real-Time Lakehouse Matter?
[Diagram labels: Bronze Tables, Silver Tables, Gold Tables]
• Business demand for speed
• Immediate decision making
• AI/ML needs fresh data
• Agents require real-time context
You can’t build the next TikTok recommender system on a traditional Lakehouse, which lacks real-time streaming data for AI.

Apache Fluss (Incubating): Extend Your Lakehouse with Real-Time Streaming
[Diagram labels]
• Sources: Databases, Logs
• Fluss: Streaming Writes, Real-Time Updates
• Lakehouse Storage: Paimon, Iceberg, Lance
• Query Engines / Lakehouse Analytics: Union Read, Lookup Join, Changelog Read, Batch Query
• Real-Time Lakehouse: fresh data in real time on Data Lakes

Why Rebuild Kafka Rather Than Enhance It?
• No Update
• Data Model Mismatch
• No Schema
Kafka was designed for events, not for analytics. This leads to manual per-topic configuration.

Data Model of Fluss vs. Iceberg vs. Kafka

Feature | Fluss | Iceberg | Kafka
Database | ✅ | ✅ | ❌
Table | ✅ | ✅ | ⚠ Topic
Partitions (dt = 20251015) | ✅ | ✅ | ❌
Update Row | ✅ | ✅ | ❌
Delete Row | ✅ | ✅ | ❌
Data Types | ✅ 20+ Types | ✅ 20+ Types | ❌ No Schema
File Format | ✅ Column Format (Apache Arrow) | ✅ Column Format (Apache Parquet) | ⚠ Row Format (Avro/JSON)

Fluss is Lakehouse-native: its data model aligns with that of Lakehouse systems. Kafka, by contrast, has a big gap to Lakehouse systems.

A Glimpse at Enabling Stream to Iceberg
Fluss: one SQL line
    ALTER TABLE customers SET ('table.datalake.enabled' = 'true')
Confluent/WarpStream Tableflow: ugly YAML file and type mapping
Imagine having 100s of topics and 50+ fields in each table 😖

Trend: The Convergence of Stream and Lakehouse
• 2023.07: Alibaba starts project Fluss, proposes “Streaming Lakehouse”
• 2024.03: Confluent Tableflow
• 2024.09: Redpanda Iceberg Topic
• 2024.10: StreamNative Ursa
• 2024.12: AutoMQ Table Topic
• 2025.08: Aiven Iceberg Topic

Why Is Tableflow Not the Answer?
Tableflow: Stream into Lakehouse. Fluss: Streaming Lakehouse.

Aspect | Tableflow | Fluss
Architecture | Stream into Lakehouse: a data synchronization tool | Streaming Lakehouse: data shared between both sides
Value | Zero-ETL | Unified stream and batch
Data Cost | Two datasets | One dataset
Usage Cost | Per-topic configuration for each field | A single config

How Does Fluss Build Lakehouse-Native Streaming Storage?
[Diagram labels: Topics; Streams As Tables; Continuously Updating]
• From Topics -> Tables
• First-class schema support with schema enforcement
• Primary Key constraint & Update support
• Data format: from Avro -> a columnar stream (10x if only 10% of columns are read)
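The "10x if 10% columns read" figure can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration under simplifying assumptions (every column has the same fixed width, no compression, and scan cost is proportional to bytes read). The function names and the row and column counts are hypothetical and not part of Fluss.

```python
def bytes_scanned_row_format(num_rows, num_columns, bytes_per_value):
    # A row format (e.g. Avro) interleaves all fields of a record,
    # so a reader has to scan every column even if it needs only a few.
    return num_rows * num_columns * bytes_per_value


def bytes_scanned_columnar(num_rows, columns_read, bytes_per_value):
    # A columnar format stores each column contiguously,
    # so a projected read touches only the requested columns.
    return num_rows * columns_read * bytes_per_value


def projection_speedup(num_columns, columns_read, num_rows=1_000_000, bytes_per_value=8):
    # Ratio of bytes scanned: row format vs. columnar with projection.
    row = bytes_scanned_row_format(num_rows, num_columns, bytes_per_value)
    col = bytes_scanned_columnar(num_rows, columns_read, bytes_per_value)
    return row / col


# Reading 5 of 50 columns (10%) scans 10x fewer bytes under these assumptions.
print(projection_speedup(num_columns=50, columns_read=5))  # 10.0
```

Real gains depend on column widths, encoding, and compression, so the ratio is a rough upper-level intuition rather than a benchmark.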

Fluss Lake Tiering Service
[Diagram: Fluss Table A (partition=20250528 with bucket1 and bucket2; partition=20250529) and Fluss Table B are tiered to Lake Table A (partition=20250528 with bucket1 and bucket2; partition=20250529) and Lake Table B on AWS S3]
• Auto create lake table
• Auto mapping schema
• Arrow -> Parquet convert
• Freshness in minutes
Lake Tiering Service: Stateless Flink Jobs
Metadata Catalogs: DDL, Commit offsets
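One way to picture a stateless tiering job that commits offsets is the toy model below. It is a sketch under stated assumptions, not the actual Fluss implementation: `LakeTable` and `tier_once` are hypothetical names, logs are plain Python lists, and the Arrow -> Parquet conversion is left out. The point it illustrates is that all progress lives in the lake table's metadata, so a restarted job can resume without local state.

```python
from dataclasses import dataclass, field


@dataclass
class LakeTable:
    """Toy stand-in for a lake table (hypothetical, not a Paimon/Iceberg API)."""
    schema: tuple
    rows: dict = field(default_factory=dict)               # (partition, bucket) -> rows
    committed_offsets: dict = field(default_factory=dict)  # (partition, bucket) -> next offset


def tier_once(fluss_schema, fluss_logs, lake_tables, table_name):
    """Run one tiering round; the job itself keeps no state between rounds."""
    # Auto-create the lake table with the mapped schema if it does not exist yet.
    lake = lake_tables.setdefault(table_name, LakeTable(schema=tuple(fluss_schema)))
    for key, log in fluss_logs.items():
        # Resume from the offset recorded in the lake metadata.
        start = lake.committed_offsets.get(key, 0)
        new_records = log[start:]
        if not new_records:
            continue
        lake.rows.setdefault(key, []).extend(new_records)
        # Record data and offset together so a retry does not duplicate rows.
        lake.committed_offsets[key] = start + len(new_records)
    return lake


logs = {("20250528", "bucket1"): [("Jark", 30), ("Judy", 20)]}
lakes = {}
tier_once(("name", "age"), logs, lakes, "table_a")
logs[("20250528", "bucket1")].append(("Timo", 20))
lake = tier_once(("name", "age"), logs, lakes, "table_a")
print(lake.committed_offsets)
```

In this model the second round only copies the one record appended after the first commit.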

Streaming Lakehouse: Lakehouse as Historical Layer
[Diagram: Lakehouse Analytics / Query Engines perform a Union Read over two layers with Shared Data, Shared Metadata]
• Real-Time Data Layer (Short-Term, Second Latency): Fluss Storage
• Historical Data Layer (Long-Term, Minute Latency): Lakehouse Storage (Paimon, Iceberg)
What the Lakehouse layer provides:
✓ Efficient backfill for stream processing
✓ Projection/Filter pushdown on Parquet
✓ High compression of Parquet
✓ High throughput of S3

Streaming Lakehouse: Fluss as Real-Time Layer
[Diagram: Lakehouse Analytics / Query Engines perform a Union Read over two layers with Shared Data, Shared Metadata]
• Real-Time Data Layer (Short-Term, Second Latency): Fluss Storage
• Historical Data Layer (Long-Term, Minute Latency): Lakehouse Storage (Paimon, Iceberg)
What the Fluss layer provides:
✓ Unlocking real-time data to Lakehouse
✓ Union delta log (minutes) on Fluss
✓ Exchange using Arrow-native format
✓ Efficient processing and integration for query engines

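Union Read: Query Both Historical & Real-Time Data
[Diagram]
• Fluss log (ordered by log_offset / time): +(Jark,30) before Snapshot 06; +(Timo,20), -(Judy,20) after it
• Lake Snapshot 06: (Jark, 30), (Judy, 20)
• Sort Merge: snapshot rows (Jark, 30), (Judy, 20) merged with log changes +(Timo,20), -(Judy,20)
• Union Read result: (Jark, 30), (Timo, 20)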
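The Union Read example on this slide can be reproduced with a small sketch. It is a simplification under stated assumptions: `union_read` is a hypothetical helper, rows are keyed by name as the primary key, the offset numbers are invented for illustration, and a dictionary merge is used in place of the sort merge the slide shows. The +(Judy,20) log entry before the snapshot is also assumed, since the slide only shows Judy inside the snapshot.

```python
def union_read(snapshot_rows, snapshot_offset, changelog):
    """Combine a lake snapshot with the log changes written after it.

    snapshot_rows:   dict mapping primary key -> row, as of snapshot_offset
    snapshot_offset: first log offset NOT yet covered by the snapshot
    changelog:       list of (offset, op, key, row), op is "+" or "-"
    """
    result = dict(snapshot_rows)
    for offset, op, key, row in sorted(changelog, key=lambda entry: entry[0]):
        if offset < snapshot_offset:
            continue  # already reflected in the snapshot
        if op == "+":
            result[key] = row          # insert or update
        elif op == "-":
            result.pop(key, None)      # delete
    return sorted(result.values())


snapshot_06 = {"Jark": ("Jark", 30), "Judy": ("Judy", 20)}
log = [
    (0, "+", "Jark", ("Jark", 30)),
    (1, "+", "Judy", ("Judy", 20)),   # assumed entry, not shown on the slide
    (2, "+", "Timo", ("Timo", 20)),
    (3, "-", "Judy", ("Judy", 20)),
]
print(union_read(snapshot_06, snapshot_offset=2, changelog=log))
# [('Jark', 30), ('Timo', 20)]
```

The historical rows come from the snapshot and only the tail of the log is replayed, which is why the real-time layer needs to retain just the recent delta.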

Streaming + Lakehouse = Real-Time Lakehouse
[Diagram labels: Cloud Storage, Lakehouse, Streaming, Compute Engine, AI & BI, Catalog, Real-Time Insights]

Real-Time Multimodal Lakehouse for AI
[Diagram labels]
• Inputs: Streaming Writes, Real-Time Updates
• Multimodal data: Text, Images, Videos, Audio
• Real-Time Multimodal Lakehouse: Fluss (Real-Time Data), Lakehouse Tiering Service, Lance (Historical Data)
• Consumers: AI, Analytics, Python Ecosystem

Real-Time Multimodal Lakehouse for AI
https://lancedb.com/blog/fluss-integration/

The Current State of Fluss
• Open Source: 1500 GitHub stars, 81 contributors, incubating in the ASF
• In Alibaba: 3 PB data size, 40 GiB/s throughput
• On Alibaba Cloud: Managed Service for Apache Fluss (Private Preview)

Future Plan
• Real-Time Analytics: optimize real-time analytics with Flink, from computation to storage.
• Real-Time Lakehouse: a real-time data layer on the Lakehouse, plus more query engines (Spark, Trino).
• Real-Time AI: build real-time feature engineering, real-time AI context, and real-time multimodal.

The End Goal of Fluss
“Bring better analytics to data streams and better data freshness to data Lakehouses.”
And if you like the project … don’t forget to give it some ❤ via ⭐ on GitHub.
https://github.com/alibaba/fluss

Barcelona 2025 Thank You Jark Wu
