

Jark Wu
December 09, 2025

[CMU-DB-2025FALL] Apache Fluss - A Streaming Storage for Real-Time Lakehouse

CMU Database Group - Future Data Systems Seminar Series (Fall 2025)
Speaker: Jark Wu (/jarkwu)
December 8, 2025
https://db.cs.cmu.edu/events/future-data-apache-fluss-a-streaming-storage-for-real-time-lakehouse/

YouTube: https://www.youtube.com/watch?v=mcFHZFb1CAo

Modern data lakehouses promise unified batch and streaming processing, yet their storage layer remains inherently batch-oriented—optimized for large, immutable files. This mismatch forces streaming workloads to rely on external systems (e.g., Kafka), while analytical queries operate on stale snapshots, breaking end-to-end freshness.

In this talk, I’ll present Apache Fluss (incubating), a lakehouse-native streaming storage system designed to bridge this gap. Fluss rethinks streaming storage from the ground up for analytical workloads. Its core abstraction is a columnar stream built on Apache Arrow, enabling sub-second ingestion and high-throughput analytical scans. Furthermore, Fluss introduces the "Streaming Lakehouse" architecture, in which Fluss serves as the real-time data layer on top of the lakehouse. It allows query engines to seamlessly unify fresh streaming data in Fluss with historical data in the lakehouse (Iceberg) to achieve truly real-time data analytics.


Transcript

  1. Apache Fluss: A Streaming Storage for Real-Time Lakehouse. Jark Wu, PPMC member of Apache Fluss (Incubating), PMC member of Apache Flink.
  2. Who Am I? Jark Wu 👋 • PMC member of Apache Flink • PPMC member of Apache Fluss (incubating) • Original creator of the Flink CDC and Fluss projects • Leading the Fluss, Flink SQL, and Flink CDC teams @ Alibaba • 10 years on distributed systems • in/jarkwu/
  3. What is Apache Flink? • Apache Flink is a stateful stream processing framework • Batch processing is for passive data and active queries • Stream processing is for active data and passive queries
  4. Flink SQL: an “Incremental Materialized View” compute engine • Flink SQL is the computation engine that maintains the materialized view • You bring your own storage for the materialized view • Streaming logs and changelogs are the keys of that storage: +I (insert), -U (update_before), +U (update_after)
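The changelog kinds above can be made concrete with a small sketch. Assuming a toy dict-based view and tuple-shaped events (illustrative only, not Flink's actual API), an incremental materialized view is just a fold over +I/-U/+U events:

```python
# Minimal sketch: maintaining a materialized view (amount per user) from a
# changelog stream of +I / -U / +U events. Event shapes are illustrative.

def apply_changelog(view, events):
    """Fold changelog events into a dict-based materialized view."""
    for kind, key, amount in events:
        if kind == "+I":                      # insert: add the new row
            view[key] = view.get(key, 0) + amount
        elif kind == "-U":                    # update_before: retract the old row
            view[key] -= amount
        elif kind == "+U":                    # update_after: apply the new row
            view[key] = view.get(key, 0) + amount
    return view

events = [
    ("+I", "alice", 10),
    ("-U", "alice", 10),   # alice's amount changes 10 -> 25
    ("+U", "alice", 25),
    ("+I", "bob", 7),
]
print(apply_changelog({}, events))   # {'alice': 25, 'bob': 7}
```

The -U/+U pair is what lets the view be updated without reprocessing history: the old contribution is retracted before the new one is applied.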
  5. The fact: no suitable storage exists • Not scalable • Not high-throughput • No updates and changelogs • Can’t store long-term data (7 days vs. months/years) • No streaming reads or row-level changelogs • Not real-time (10 minutes vs. 10 ms)
  6. Fluss: a streaming table storage for Flink. Flink (stream processing, incremental materialized views) runs on top of Fluss (streaming storage). Key features: • Streaming read/write at low latency • Updates and changelog streams • Lakehouse as historical storage
  7. What is Apache Fluss? Plotted against analytical vs. operational systems, batch vs. stream, and row-oriented vs. column-oriented, Fluss is columnar streaming storage for analytics.
  8. Fluss Overview: databases, click streams, IoT, logs, and AI feed streaming writes and real-time updates into a Fluss cluster (servers), backed by remote storage (S3/OSS/GCP/Azure Blob*/OBS/HDFS). A tiering service moves data to lakehouse storage (Paimon/Iceberg/Lance*). Query engines use streaming reads, batch reads, lookup joins, union reads, and lakehouse analytics. (* coming soon)
  9. The logical models in Fluss: a Log Table (append-only; append/consume), e.g. (100, MacBook, 2025-12-07), (100, iPhone, 2025-12-08); and a Primary Key Table (mutable; put into KV, which emits an append-only Changelog Stream to consume), e.g. putting (100, John, EU) then (100, John, US) emits +I(100, John, EU), -U(100, John, EU), +U(100, John, US). Streaming log and changelog are the foundational models of Fluss.
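The primary-key example above can be sketched as follows; the `put` function and tuple-shaped events are illustrative stand-ins, not Fluss's API:

```python
# Illustrative sketch: a primary-key table that derives a changelog stream
# from puts. The first put on a key emits +I; a later put emits -U with the
# old value followed by +U with the new value, as on the slide.

def put(kv, changelog, key, value):
    if key in kv:
        changelog.append(("-U", key, kv[key]))  # retract the old value
        changelog.append(("+U", key, value))    # emit the new value
    else:
        changelog.append(("+I", key, value))    # first insert
    kv[key] = value

kv, changelog = {}, []
put(kv, changelog, 100, ("John", "EU"))
put(kv, changelog, 100, ("John", "US"))
print(changelog)
# [('+I', 100, ('John', 'EU')), ('-U', 100, ('John', 'EU')), ('+U', 100, ('John', 'US'))]
```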
  10. Fluss Architecture. Fluss components: a CoordinatorServer (primary) for metadata and coordination (backed by ZooKeeper), and TabletServers (real-time storage) that replicate among each other; log tiering and KV snapshotting go to object storage (S3). External components: the Lake Tiering Service tiers data to lakehouse storage (Iceberg/Paimon); the Fluss client (Flink/Spark*/Trino* connectors) reads real-time data from the TabletServers and reads cold/historical data through the Iceberg/Paimon SDK.
  11. Fluss table sharding scheme. Logically, a table is split into partitions by the partition column (e.g., p=2025-12-08) and into buckets by the bucketing column (the PK). Physically, each bucket holds a LogTablet (the log is split into segments, each with a .log file and an .index file) and, for primary-key tables, a KvTablet; each KV shard corresponds to a RocksDB instance.
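The two-level routing can be sketched like this; `route`, the CRC-based hash, and `NUM_BUCKETS` are assumptions for illustration, not Fluss's actual bucketing function:

```python
import zlib

NUM_BUCKETS = 3

def route(row, partition_col, pk_col):
    """Return (partition, bucket) for a row: partition by column value,
    bucket by a deterministic hash of the primary key."""
    partition = f"p={row[partition_col]}"
    key = str(row[pk_col]).encode()
    bucket = zlib.crc32(key) % NUM_BUCKETS   # stable hash -> same PK, same bucket
    return partition, bucket

row = {"order_id": 100, "item": "MacBook", "dt": "2025-12-07"}
print(route(row, "dt", "order_id"))
```

The important property is that all writes for one primary key land in one bucket, so its KV shard (one RocksDB instance) sees every update for that key.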
  12. Write path and durability of a Log Table: ① the client appends log records to the leader Log Tablet A1 on a TabletServer; ② the leader replicates to the follower Log Tablets on other TabletServers; ③ ack to the client. Log tiering moves tiered log segments to object storage (S3).
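A toy model of the three steps, with in-memory lists standing in for the replicated log stores (the class and method names are invented for illustration):

```python
# Toy model of the log-table write path: the client appends to the leader,
# the leader replicates to its followers, and the ack carries the log offset
# only once all replicas hold the record.

class LogTablet:
    def __init__(self, followers=()):
        self.log = []
        self.followers = list(followers)

    def append(self, record):
        self.log.append(record)          # ① leader appends locally
        for f in self.followers:
            f.log.append(record)         # ② replicate to followers
        return len(self.log) - 1         # ③ ack with the record's offset

f1, f2 = LogTablet(), LogTablet()
leader = LogTablet(followers=[f1, f2])
offset = leader.append({"user": "alice", "event": "click"})
print(offset, leader.log == f1.log == f2.log)   # 0 True
```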
  13. Read path of a Log Table: clients fetch the log from the leader Log Tablet; tiered log segments are read directly from object storage (S3).
  14. Write path and durability of a Primary-Key Table: ① the client puts a KV record to the leader KV Tablet B1 (RocksDB); ② the changelog is written to the Log Tablet as a WAL; ③ the log is replicated to the followers; ④ ack. KV snapshots (RocksDB SST files) and log segments are tiered to object storage (S3).
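The WAL-first ordering of this write path can be sketched as follows; `PkTablet`, its fields, and the recovery helper are hypothetical illustrations, with a dict standing in for RocksDB:

```python
# Toy model of the primary-key write path: a put derives the changelog,
# writes it to the log as a WAL (where it would be replicated), and only
# then applies the value to the KV store. Because the WAL is durable, the
# KV state can be rebuilt by replay.

class PkTablet:
    def __init__(self):
        self.kv = {}      # stands in for RocksDB
        self.wal = []     # stands in for the replicated Log Tablet

    def put(self, key, value):
        if key in self.kv:                         # ② write changelog (WAL) first
            self.wal.append(("-U", key, self.kv[key]))
            self.wal.append(("+U", key, value))
        else:
            self.wal.append(("+I", key, value))
        self.kv[key] = value                       # apply after WAL, then ④ ack

    def recover(self):
        """Rebuild KV state by replaying the WAL (as from snapshot + log)."""
        kv = {}
        for kind, key, value in self.wal:
            if kind in ("+I", "+U"):
                kv[key] = value
        return kv

t = PkTablet()
t.put(100, "EU")
t.put(100, "US")
print(t.kv, t.recover())   # {100: 'US'} {100: 'US'}
```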
  15. Read path of a Primary-Key Table: ① for historical reads, download the RocksDB snapshot files from object storage (S3); ② fetch the changelog from the leader Log Tablet.
  16. Log Tablet is a columnar stream: a log segment file (1 GB, e.g. 00001030.log) on disk is a sequence of Arrow batches with Arrow metadata headers, in the Arrow IPC streaming format, so the log file supports zero-copy appends.
  17. Log Tablet is a columnar stream (cont.): Arrow batches are transferred from the log segment file to the network socket with zero copy.
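The Arrow IPC specifics are beyond this sketch, but the framing idea behind both slides, i.e. batches that can be appended and forwarded byte-for-byte without decoding rows, can be illustrated with simple length-prefixed frames:

```python
# Sketch of batch framing in a log segment (NOT the real Arrow IPC format):
# each batch is written as a 4-byte length prefix plus an opaque payload, so
# a server can append and forward whole batches without parsing rows, which
# is what makes zero-copy sends to the network socket possible.
import io
import struct

def append_batch(segment, payload):
    segment.write(struct.pack("<I", len(payload)))  # 4-byte length prefix
    segment.write(payload)

def iter_batches(segment):
    segment.seek(0)
    while header := segment.read(4):
        (length,) = struct.unpack("<I", header)
        yield segment.read(length)                  # forwarded as-is, no row decode

seg = io.BytesIO()
append_batch(seg, b"arrow-batch-1")
append_batch(seg, b"arrow-batch-2")
print(list(iter_batches(seg)))   # [b'arrow-batch-1', b'arrow-batch-2']
```

In the real system the payloads are Arrow record batches, so readers also get columnar layout for free once the bytes arrive.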
  18. Fluss Streaming Lakehouse: the Fluss cluster is the real-time data layer (short-term, sub-second latency); lakehouse storage (Iceberg/Paimon/Lance*) is the historical data layer (long-term, minute latency). The tiering service moves data between them, and query engines* see shared data through a unified view with union reads and lakehouse analytics. (* coming soon)
  19. API abstraction: a unified table view over the Fluss table (log segments and RocksDB on tablet-server disks; tiered segments and KV snapshots on S3) and the lakehouse table (data files and metadata files on object storage). INSERT INTO … and SELECT FROM … work against the unified view.
  20. Storage unification: writes/updates go through the Fluss client into the real-time layer, and lakehouse tiering moves data into the historical layer (the Iceberg lakehouse table). The Fluss client serves real-time reads and union reads; the Iceberg client serves historical and analytical reads (e.g., Spark/Snowflake).
  21. Lakehouse Tiering Service: a stateless, auto-scaling service (Enumerator, TieringWriter, Committer) that converts Arrow to Parquet. ① Fetch a task and its start log offsets from the Fluss cluster coordinator; ② read Arrow batches from the TabletServers; ③ write data files (possibly with delete files) to object storage (S3); ④ write a snapshot (with offsets); ⑤ commit the snapshot and end log offsets to the lakehouse catalog (Iceberg REST Catalog).
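The five steps can be sketched as one tiering round; every class and method here is a made-up stand-in for the real RPCs to Fluss, the object-store writes, and the catalog commit:

```python
# Sketch of one round of the tiering service (all names hypothetical).

def tier_once(fluss, object_store, catalog):
    task = fluss.fetch_task()                      # ① task + start log offsets
    batches = fluss.read_batches(task["start"])    # ② read Arrow batches
    files = object_store.write_parquet(batches)    # ③ Arrow -> Parquet data files
    snapshot = object_store.write_snapshot(        # ④ snapshot records the offsets
        files, task["start"], end=task["start"] + len(batches))
    catalog.commit(snapshot)                       # ⑤ commit snapshot + end offsets
    return snapshot

class FakeFluss:
    def __init__(self, log):
        self.log = log
    def fetch_task(self):
        return {"start": 0}
    def read_batches(self, start):
        return self.log[start:]

class FakeStore:
    def write_parquet(self, batches):
        return [f"file-{i}.parquet" for i, _ in enumerate(batches)]
    def write_snapshot(self, files, start, end):
        return {"files": files, "start": start, "end": end}

class FakeCatalog:
    def __init__(self):
        self.snapshots = []
    def commit(self, snapshot):
        self.snapshots.append(snapshot)

catalog = FakeCatalog()
snap = tier_once(FakeFluss(["b0", "b1"]), FakeStore(), catalog)
print(snap)
```

Recording the start/end offsets in the committed snapshot is what later lets union reads resume the Fluss log at exactly the right position.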
  22. Lifecycle management of the different tiers: data moves from real-time (now) past the tier-point (minutes) and the cold-time (hours) into long-term storage (months/years). Local segments and tiered segments overlap, and the unified table view spans the Fluss table and the lakehouse table.
  23. Union Read: the road to the Real-Time Lakehouse, over Fluss storage plus lakehouse storage (Iceberg/Paimon). Union Read for Streaming Query: use case is backfilling for stream processing; the idea is the data lakehouse as the historical data of data streams; the benefit is long-term storage. Union Read for Batch Query: use case is real-time insights for OLAP; the idea is data streams as the real-time data of the data lakehouse; the benefit is better freshness.
  24. Union Read for Streaming Query: ① get the latest Iceberg snapshot (e.g. snapshot 1002 {offset: xx}) and the log offsets up to that snapshot; ② read the Iceberg snapshot for data backfilling (historical data); ③ read Fluss from the log (changelog) offsets recorded in the snapshot (real-time data); ④ the result is exactly-once data. ✓ Projection/filter pushdown on Iceberg/Parquet ✓ Direct reads from DFS/S3: higher throughput, no interruption to Fluss servers ✓ Long-term storage unlocks long-period use cases
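The exactly-once hand-off between snapshot and log can be sketched as follows (toy data structures, not the real reader API):

```python
# Sketch of the streaming union read: backfill from the lakehouse snapshot,
# then continue from the Fluss log at exactly the offset the snapshot
# recorded, so every record is seen exactly once.

def union_read(snapshot_rows, snapshot_offset, fluss_log):
    yield from snapshot_rows                 # ② backfill historical data
    yield from fluss_log[snapshot_offset:]   # ③ tail fresh data from the offset

log = ["r0", "r1", "r2", "r3"]
snapshot = {"rows": ["r0", "r1"], "offset": 2}   # ① snapshot covers log[:2]
print(list(union_read(snapshot["rows"], snapshot["offset"], log)))
# ['r0', 'r1', 'r2', 'r3']
```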
  25. Union Read for Batch Query: for Log Tables it works the same as the streaming union read ✅. The challenges are in Primary-Key Tables: streaming queries (incremental MVs) process changelogs, but batch query engines expect no changelogs, so we have to merge the changelogs with the historical table. Two approaches: (1) Merge-On-Read (with Paimon); (2) Deletion Vectors (work in progress).
  26. Quick introduction to Apache Paimon: Paimon is an open table format like Iceberg, but optimized for streaming updates. Parquet data files are organized in an LSM-tree, sorted by primary key for efficient Merge-On-Read.
  27. Merge-On-Read of batch union read: the historical data (snapshot 1002 {offset: xx}) is already sorted by primary key, e.g. (Jark, 30), (Judy, 20). The real-time changelogs after that offset, e.g. +(Timo,20), -(Judy,20), are sorted in memory by PK and then sort-merged with the snapshot, producing the union read result: (Jark, 30), (Timo, 20).
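A minimal sketch of this sort-merge, assuming tuple-shaped changelog events and a dict for the merged state (not Paimon's actual reader):

```python
# Sketch of merge-on-read: the snapshot is already sorted by primary key;
# the changelog after the snapshot offset is sorted by PK and merged over
# it, with +I/+U adding rows and -U/-D retracting them.

def merge_on_read(sorted_snapshot, changelog):
    merged = dict(sorted_snapshot)                       # pk -> value, snapshot first
    for kind, pk, value in sorted(changelog, key=lambda e: e[1]):
        if kind in ("+I", "+U"):
            merged[pk] = value
        else:                                            # -U / -D retractions
            merged.pop(pk, None)
    return sorted(merged.items())

snapshot = [("Jark", 30), ("Judy", 20)]                  # sorted by PK
changelog = [("+I", "Timo", 20), ("-D", "Judy", 20)]
print(merge_on_read(snapshot, changelog))                # [('Jark', 30), ('Timo', 20)]
```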
  28. Deletion Vector of batch union read (WIP): a PK index maps changelog events, e.g. INSERT (Timo, 20) and DELETE (Judy, 20), into deletion vectors: a LogDV and a LakeDV (RocksDB-based) plus the IcebergDV. Positions of overwritten or deleted rows in the historical data (snapshot 1002 {offset: xx}), e.g. (Jark, 30), (Judy, 20), (Alex, 40), are marked deleted (x), so the union read returns the surviving historical rows plus the new rows from the log, e.g. (Timo, 20), (Jark, 30), (Alex, 40).
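The deletion-vector idea can be sketched with a plain set of deleted positions standing in for the RocksDB-backed bitmaps:

```python
# Sketch of the deletion-vector approach: instead of sort-merging at read
# time, positions of overwritten/deleted rows in the historical files are
# marked in a bitmap (here a plain set); the scan skips marked positions
# and appends the new rows from the log.

def scan_with_dv(historical_rows, deleted_positions, log_inserts):
    live = [row for pos, row in enumerate(historical_rows)
            if pos not in deleted_positions]
    return live + log_inserts

historical = [("Jark", 30), ("Judy", 20), ("Alex", 40)]
deleted = {0, 1}                     # Jark re-inserted via the log, Judy deleted
inserts = [("Jark", 30), ("Timo", 20)]
print(scan_with_dv(historical, deleted, inserts))
# [('Alex', 40), ('Jark', 30), ('Timo', 20)]
```

Compared with merge-on-read, the merge cost is paid once when the vectors are maintained rather than on every batch scan.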
  29. Schema evolution: ① the client issues ADD COLUMN to the Fluss cluster coordinator; ② the coordinator adds the column to the lakehouse table through the lakehouse metastore (Iceberg REST Catalog); ③ the new schema metadata is persisted (ZooKeeper); ④ ack to the client; ⑤ the TabletServers receive the new schema.
  30. Future plans: • Query engines: Spark, Trino, DuckDB, StarRocks, … • Faster deletion vector support • Unified metadata: union read support through the Iceberg REST catalog