

Jark Wu
December 09, 2025

[CMU-DB-2025FALL] Apache Fluss - A Streaming Storage for Real-Time Lakehouse

CMU Database Group - Future Data Systems Seminar Series (Fall 2025)
Speaker: Jark Wu (/jarkwu)
December 8, 2025
https://db.cs.cmu.edu/events/future-data-apache-fluss-a-streaming-storage-for-real-time-lakehouse/

YouTube: https://www.youtube.com/watch?v=mcFHZFb1CAo

Modern data lakehouses promise unified batch and streaming processing, yet their storage layer remains inherently batch-oriented—optimized for large, immutable files. This mismatch forces streaming workloads to rely on external systems (e.g., Kafka), while analytical queries operate on stale snapshots, breaking end-to-end freshness.

In this talk, I’ll present Apache Fluss (incubating), a lakehouse-native streaming storage system designed to bridge this gap. Fluss rethinks streaming storage from the ground up for analytical workloads. Its core abstraction is a columnar stream built on Apache Arrow, enabling sub-second ingestion and high-throughput analytical scans. Furthermore, Fluss introduces the "Streaming Lakehouse" architecture, in which Fluss serves as the real-time data layer on top of the lakehouse. It allows query engines to seamlessly unify fresh streaming data in Fluss with historical data in the lakehouse (Iceberg) to achieve truly real-time data analytics.


Transcript

  1. Apache Fluss: A Streaming Storage for Real-Time Lakehouse. Jark Wu, PPMC member of Apache Fluss (Incubating), PMC member of Apache Flink.
  2. Who Am I? Jark Wu 👋 • PMC member of Apache Flink • PPMC member of Apache Fluss (incubating) • Original creator of the Flink CDC and Fluss projects • Leading the Fluss, Flink SQL, and Flink CDC teams @ Alibaba • 10 years on distributed systems • in/jarkwu/
  3. What is Apache Flink? • Apache Flink is a stateful stream processing framework • Batch processing is for passive data and active queries • Stream processing is for active data and passive queries
  4. Flink SQL: an “Incremental Materialized View” compute engine • Flink SQL is the computation engine that maintains the materialized view • You bring your own storage for the materialized view • Streaming logs and changelogs are the keys of that storage: +I (insert), -U (update_before), +U (update_after)
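The changelog kinds above can be made concrete with a small sketch. Assuming a toy dict-based view and tuple-shaped events (illustrative only, not Flink's actual API), an incremental materialized view is just a fold over +I/-U/+U events:

```python
# Minimal sketch: maintaining a materialized view (amount per user) from a
# changelog stream of +I / -U / +U events. Event shapes are illustrative.

def apply_changelog(view, events):
    """Fold changelog events into a dict-based materialized view."""
    for kind, key, amount in events:
        if kind == "+I":                      # insert: add the new row
            view[key] = view.get(key, 0) + amount
        elif kind == "-U":                    # update_before: retract the old row
            view[key] -= amount
        elif kind == "+U":                    # update_after: apply the new row
            view[key] = view.get(key, 0) + amount
    return view

events = [
    ("+I", "alice", 10),
    ("-U", "alice", 10),   # alice's amount changes 10 -> 25
    ("+U", "alice", 25),
    ("+I", "bob", 7),
]
print(apply_changelog({}, events))   # {'alice': 25, 'bob': 7}
```

The -U/+U pair is what lets the view be updated without reprocessing history: the old contribution is retracted before the new one is applied.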
  5. The fact: no suitable storage exists • Not scalable • Not high-throughput • No updates and changelogs • Can’t store long-term data (7 days vs. months/years) • No streaming reads or row-level changelogs • Not real-time (10 minutes vs. 10 ms)
  6. Fluss: a streaming table storage for Flink. Flink (stream processing, incremental materialized views) runs on top of Fluss (streaming storage). Key features: • Streaming read/write at low latency • Updates and changelog streams • Lakehouse as historical storage
  7. What is Apache Fluss? Plotted against analytical vs. operational systems, batch vs. stream, and row-oriented vs. column-oriented, Fluss is columnar streaming storage for analytics.
  8. Fluss Overview: databases, click streams, IoT, logs, and AI feed streaming writes and real-time updates into a Fluss cluster (servers), backed by remote storage (S3/OSS/GCP/Azure Blob*/OBS/HDFS). A tiering service moves data to lakehouse storage (Paimon/Iceberg/Lance*). Query engines use streaming reads, batch reads, lookup joins, union reads, and lakehouse analytics. (* coming soon)
  9. The logical models in Fluss: a Log Table (append-only; append/consume), e.g. (100, MacBook, 2025-12-07), (100, iPhone, 2025-12-08); and a Primary Key Table (mutable; put into KV, which emits an append-only Changelog Stream to consume), e.g. putting (100, John, EU) then (100, John, US) emits +I(100, John, EU), -U(100, John, EU), +U(100, John, US). Streaming log and changelog are the foundational models of Fluss.
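The primary-key example above can be sketched as follows; the `put` function and tuple-shaped events are illustrative stand-ins, not Fluss's API:

```python
# Illustrative sketch: a primary-key table that derives a changelog stream
# from puts. The first put on a key emits +I; a later put emits -U with the
# old value followed by +U with the new value, as on the slide.

def put(kv, changelog, key, value):
    if key in kv:
        changelog.append(("-U", key, kv[key]))  # retract the old value
        changelog.append(("+U", key, value))    # emit the new value
    else:
        changelog.append(("+I", key, value))    # first insert
    kv[key] = value

kv, changelog = {}, []
put(kv, changelog, 100, ("John", "EU"))
put(kv, changelog, 100, ("John", "US"))
print(changelog)
# [('+I', 100, ('John', 'EU')), ('-U', 100, ('John', 'EU')), ('+U', 100, ('John', 'US'))]
```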
  10. Fluss Architecture. Fluss components: a CoordinatorServer (primary) for metadata and coordination (backed by ZooKeeper), and TabletServers (real-time storage) that replicate among each other; log tiering and KV snapshotting go to object storage (S3). External components: the Lake Tiering Service tiers data to lakehouse storage (Iceberg/Paimon); the Fluss client (Flink/Spark*/Trino* connectors) reads real-time data from the TabletServers and reads cold/historical data through the Iceberg/Paimon SDK.
  11. Fluss table sharding scheme. Logically, a table is split into partitions by the partition column (e.g., p=2025-12-08) and into buckets by the bucketing column (the PK). Physically, each bucket holds a LogTablet (the log is split into segments, each with a .log file and an .index file) and, for primary-key tables, a KvTablet; each KV shard corresponds to a RocksDB instance.
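The two-level routing can be sketched like this; `route`, the CRC-based hash, and `NUM_BUCKETS` are assumptions for illustration, not Fluss's actual bucketing function:

```python
import zlib

NUM_BUCKETS = 3

def route(row, partition_col, pk_col):
    """Return (partition, bucket) for a row: partition by column value,
    bucket by a deterministic hash of the primary key."""
    partition = f"p={row[partition_col]}"
    key = str(row[pk_col]).encode()
    bucket = zlib.crc32(key) % NUM_BUCKETS   # stable hash -> same PK, same bucket
    return partition, bucket

row = {"order_id": 100, "item": "MacBook", "dt": "2025-12-07"}
print(route(row, "dt", "order_id"))
```

The important property is that all writes for one primary key land in one bucket, so its KV shard (one RocksDB instance) sees every update for that key.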
  12. Write path and durability of a Log Table: ① the client appends log records to the leader Log Tablet A1 on a TabletServer; ② the leader replicates to the follower Log Tablets on other TabletServers; ③ ack to the client. Log tiering moves tiered log segments to object storage (S3).
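A toy model of the three steps, with in-memory lists standing in for the replicated log stores (the class and method names are invented for illustration):

```python
# Toy model of the log-table write path: the client appends to the leader,
# the leader replicates to its followers, and the ack carries the log offset
# only once all replicas hold the record.

class LogTablet:
    def __init__(self, followers=()):
        self.log = []
        self.followers = list(followers)

    def append(self, record):
        self.log.append(record)          # ① leader appends locally
        for f in self.followers:
            f.log.append(record)         # ② replicate to followers
        return len(self.log) - 1         # ③ ack with the record's offset

f1, f2 = LogTablet(), LogTablet()
leader = LogTablet(followers=[f1, f2])
offset = leader.append({"user": "alice", "event": "click"})
print(offset, leader.log == f1.log == f2.log)   # 0 True
```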
  13. Read path of a Log Table: clients fetch the log from the leader Log Tablet; tiered log segments are read directly from object storage (S3).
  14. Write path and durability of a Primary-Key Table: ① the client puts a KV record to the leader KV Tablet B1 (RocksDB); ② the changelog is written to the Log Tablet as a WAL; ③ the log is replicated to the followers; ④ ack. KV snapshots (RocksDB SST files) and log segments are tiered to object storage (S3).
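The WAL-first ordering of this write path can be sketched as follows; `PkTablet`, its fields, and the recovery helper are hypothetical illustrations, with a dict standing in for RocksDB:

```python
# Toy model of the primary-key write path: a put derives the changelog,
# writes it to the log as a WAL (where it would be replicated), and only
# then applies the value to the KV store. Because the WAL is durable, the
# KV state can be rebuilt by replay.

class PkTablet:
    def __init__(self):
        self.kv = {}      # stands in for RocksDB
        self.wal = []     # stands in for the replicated Log Tablet

    def put(self, key, value):
        if key in self.kv:                         # ② write changelog (WAL) first
            self.wal.append(("-U", key, self.kv[key]))
            self.wal.append(("+U", key, value))
        else:
            self.wal.append(("+I", key, value))
        self.kv[key] = value                       # apply after WAL, then ④ ack

    def recover(self):
        """Rebuild KV state by replaying the WAL (as from snapshot + log)."""
        kv = {}
        for kind, key, value in self.wal:
            if kind in ("+I", "+U"):
                kv[key] = value
        return kv

t = PkTablet()
t.put(100, "EU")
t.put(100, "US")
print(t.kv, t.recover())   # {100: 'US'} {100: 'US'}
```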
  15. Read path of a Primary-Key Table: ① for historical reads, download the RocksDB snapshot files from object storage (S3); ② fetch the changelog from the leader Log Tablet.
  16. Log Tablet is a columnar stream: a log segment file (1 GB, e.g. 00001030.log) on disk is a sequence of Arrow batches with Arrow metadata headers, in the Arrow IPC streaming format, so the log file supports zero-copy appends.
  17. Log Tablet is a columnar stream (cont.): Arrow batches are transferred from the log segment file to the network socket with zero copy.
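The Arrow IPC specifics are beyond this sketch, but the framing idea behind both slides, i.e. batches that can be appended and forwarded byte-for-byte without decoding rows, can be illustrated with simple length-prefixed frames:

```python
# Sketch of batch framing in a log segment (NOT the real Arrow IPC format):
# each batch is written as a 4-byte length prefix plus an opaque payload, so
# a server can append and forward whole batches without parsing rows, which
# is what makes zero-copy sends to the network socket possible.
import io
import struct

def append_batch(segment, payload):
    segment.write(struct.pack("<I", len(payload)))  # 4-byte length prefix
    segment.write(payload)

def iter_batches(segment):
    segment.seek(0)
    while header := segment.read(4):
        (length,) = struct.unpack("<I", header)
        yield segment.read(length)                  # forwarded as-is, no row decode

seg = io.BytesIO()
append_batch(seg, b"arrow-batch-1")
append_batch(seg, b"arrow-batch-2")
print(list(iter_batches(seg)))   # [b'arrow-batch-1', b'arrow-batch-2']
```

In the real system the payloads are Arrow record batches, so readers also get columnar layout for free once the bytes arrive.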
  18. Fluss Streaming Lakehouse: the Fluss cluster is the real-time data layer (short-term, sub-second latency); lakehouse storage (Iceberg/Paimon/Lance*) is the historical data layer (long-term, minute latency). The tiering service moves data between them, and query engines* see shared data through a unified view with union reads and lakehouse analytics. (* coming soon)
  19. API abstraction: a unified table view over the Fluss table (log segments and RocksDB on tablet-server disks; tiered segments and KV snapshots on S3) and the lakehouse table (data files and metadata files on object storage). INSERT INTO … and SELECT FROM … work against the unified view.
  20. Storage unification: writes/updates go through the Fluss client into the real-time layer, and lakehouse tiering moves data into the historical layer (the Iceberg lakehouse table). The Fluss client serves real-time reads and union reads; the Iceberg client serves historical and analytical reads (e.g., Spark/Snowflake).
  21. Lakehouse Tiering Service: a stateless, auto-scaling service (Enumerator, TieringWriter, Committer) that converts Arrow to Parquet. ① Fetch a task and its start log offsets from the Fluss cluster coordinator; ② read Arrow batches from the TabletServers; ③ write data files (possibly with delete files) to object storage (S3); ④ write a snapshot (with offsets); ⑤ commit the snapshot and end log offsets to the lakehouse catalog (Iceberg REST Catalog).
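The five steps can be sketched as one tiering round; every class and method here is a made-up stand-in for the real RPCs to Fluss, the object-store writes, and the catalog commit:

```python
# Sketch of one round of the tiering service (all names hypothetical).

def tier_once(fluss, object_store, catalog):
    task = fluss.fetch_task()                      # ① task + start log offsets
    batches = fluss.read_batches(task["start"])    # ② read Arrow batches
    files = object_store.write_parquet(batches)    # ③ Arrow -> Parquet data files
    snapshot = object_store.write_snapshot(        # ④ snapshot records the offsets
        files, task["start"], end=task["start"] + len(batches))
    catalog.commit(snapshot)                       # ⑤ commit snapshot + end offsets
    return snapshot

class FakeFluss:
    def __init__(self, log):
        self.log = log
    def fetch_task(self):
        return {"start": 0}
    def read_batches(self, start):
        return self.log[start:]

class FakeStore:
    def write_parquet(self, batches):
        return [f"file-{i}.parquet" for i, _ in enumerate(batches)]
    def write_snapshot(self, files, start, end):
        return {"files": files, "start": start, "end": end}

class FakeCatalog:
    def __init__(self):
        self.snapshots = []
    def commit(self, snapshot):
        self.snapshots.append(snapshot)

catalog = FakeCatalog()
snap = tier_once(FakeFluss(["b0", "b1"]), FakeStore(), catalog)
print(snap)
```

Recording the start/end offsets in the committed snapshot is what later lets union reads resume the Fluss log at exactly the right position.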
  22. Lifecycle management of the different tiers: data moves from real-time (now) past the tier-point (minutes) and the cold-time (hours) into long-term storage (months/years). Local segments and tiered segments overlap, and the unified table view spans the Fluss table and the lakehouse table.
  23. Union Read: the road to the Real-Time Lakehouse, over Fluss storage plus lakehouse storage (Iceberg/Paimon). Union Read for Streaming Query: use case is backfilling for stream processing; the idea is the data lakehouse as the historical data of data streams; the benefit is long-term storage. Union Read for Batch Query: use case is real-time insights for OLAP; the idea is data streams as the real-time data of the data lakehouse; the benefit is better freshness.
  24. Union Read for Streaming Query: ① get the latest Iceberg snapshot (e.g. snapshot 1002 {offset: xx}) and the log offsets up to that snapshot; ② read the Iceberg snapshot for data backfilling (historical data); ③ read Fluss from the log (changelog) offsets recorded in the snapshot (real-time data); ④ the result is exactly-once data. ✓ Projection/filter pushdown on Iceberg/Parquet ✓ Direct reads from DFS/S3: higher throughput, no interruption to Fluss servers ✓ Long-term storage unlocks long-period use cases
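The exactly-once hand-off between snapshot and log can be sketched as follows (toy data structures, not the real reader API):

```python
# Sketch of the streaming union read: backfill from the lakehouse snapshot,
# then continue from the Fluss log at exactly the offset the snapshot
# recorded, so every record is seen exactly once.

def union_read(snapshot_rows, snapshot_offset, fluss_log):
    yield from snapshot_rows                 # ② backfill historical data
    yield from fluss_log[snapshot_offset:]   # ③ tail fresh data from the offset

log = ["r0", "r1", "r2", "r3"]
snapshot = {"rows": ["r0", "r1"], "offset": 2}   # ① snapshot covers log[:2]
print(list(union_read(snapshot["rows"], snapshot["offset"], log)))
# ['r0', 'r1', 'r2', 'r3']
```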
  25. Union Read for Batch Query: for Log Tables it works the same as the streaming union read ✅. The challenges are in Primary-Key Tables: streaming queries (incremental MVs) process changelogs, but batch query engines expect no changelogs, so we have to merge the changelogs with the historical table. Two approaches: (1) Merge-On-Read (with Paimon); (2) Deletion Vectors (work in progress).
  26. Quick introduction to Apache Paimon: Paimon is an open table format like Iceberg, but optimized for streaming updates. Parquet data files are organized in an LSM-tree, sorted by primary key for efficient Merge-On-Read.
  27. Merge-On-Read of batch union read: the historical data (snapshot 1002 {offset: xx}) is already sorted by primary key, e.g. (Jark, 30), (Judy, 20). The real-time changelogs after that offset, e.g. +(Timo,20), -(Judy,20), are sorted in memory by PK and then sort-merged with the snapshot, producing the union read result: (Jark, 30), (Timo, 20).
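A minimal sketch of this sort-merge, assuming tuple-shaped changelog events and a dict for the merged state (not Paimon's actual reader):

```python
# Sketch of merge-on-read: the snapshot is already sorted by primary key;
# the changelog after the snapshot offset is sorted by PK and merged over
# it, with +I/+U adding rows and -U/-D retracting them.

def merge_on_read(sorted_snapshot, changelog):
    merged = dict(sorted_snapshot)                       # pk -> value, snapshot first
    for kind, pk, value in sorted(changelog, key=lambda e: e[1]):
        if kind in ("+I", "+U"):
            merged[pk] = value
        else:                                            # -U / -D retractions
            merged.pop(pk, None)
    return sorted(merged.items())

snapshot = [("Jark", 30), ("Judy", 20)]                  # sorted by PK
changelog = [("+I", "Timo", 20), ("-D", "Judy", 20)]
print(merge_on_read(snapshot, changelog))                # [('Jark', 30), ('Timo', 20)]
```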
  28. Deletion Vector of batch union read (WIP): a PK index maps changelog events, e.g. INSERT (Timo, 20) and DELETE (Judy, 20), into deletion vectors: a LogDV and a LakeDV (RocksDB-based) plus the IcebergDV. Positions of overwritten or deleted rows in the historical data (snapshot 1002 {offset: xx}), e.g. (Jark, 30), (Judy, 20), (Alex, 40), are marked deleted (x), so the union read returns the surviving historical rows plus the new rows from the log, e.g. (Timo, 20), (Jark, 30), (Alex, 40).
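The deletion-vector idea can be sketched with a plain set of deleted positions standing in for the RocksDB-backed bitmaps:

```python
# Sketch of the deletion-vector approach: instead of sort-merging at read
# time, positions of overwritten/deleted rows in the historical files are
# marked in a bitmap (here a plain set); the scan skips marked positions
# and appends the new rows from the log.

def scan_with_dv(historical_rows, deleted_positions, log_inserts):
    live = [row for pos, row in enumerate(historical_rows)
            if pos not in deleted_positions]
    return live + log_inserts

historical = [("Jark", 30), ("Judy", 20), ("Alex", 40)]
deleted = {0, 1}                     # Jark re-inserted via the log, Judy deleted
inserts = [("Jark", 30), ("Timo", 20)]
print(scan_with_dv(historical, deleted, inserts))
# [('Alex', 40), ('Jark', 30), ('Timo', 20)]
```

Compared with merge-on-read, the merge cost is paid once when the vectors are maintained rather than on every batch scan.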
  29. Schema evolution: ① the client issues ADD COLUMN to the Fluss cluster coordinator; ② the coordinator adds the column to the lakehouse table through the lakehouse metastore (Iceberg REST Catalog); ③ the new schema metadata is persisted (ZooKeeper); ④ ack to the client; ⑤ the TabletServers receive the new schema.
  30. Future plans: • Query engines: Spark, Trino, DuckDB, StarRocks, … • Faster deletion vector support • Unified metadata: union read support through the Iceberg REST catalog