[FFA2025] Fluss: Redefining Streaming Storage for Real-time Data Analytics and AI

https://asia.flink-forward.org/singapore-2025

As analytical and AI/ML workloads grow in complexity, traditional streaming storage systems such as Kafka struggle to efficiently store and serve structured data and multimodal AI data. In this talk, we present Fluss, an open-source streaming table storage system designed specifically for modern analytical and AI use cases.

We’ll walk through the architecture and core innovations of Fluss, highlighting how it seamlessly integrates with Flink to optimize streaming analytics, how it bridges the gap between data streams and the data Lakehouse, how it powers a real-time ML feature store, and how it efficiently ingests multimodal data for an AI data lake.

Video: https://www.youtube.com/live/pzT6vCCmwq8?si=8Cpx7Rr25yGsdzFd&t=8412

Jark Wu

July 04, 2025

Transcript

  1. Fluss: Redefining Streaming Storage for Real-time Data Analytics and AI

    Jark Wu, Head of Fluss and Flink SQL team, Alibaba Cloud
  2. Data Infra in a Company

    Diagram labels: Bronze, Silver, Gold; Kafka, Redis, ClickHouse, Iceberg; ETL; Dimension Join; Data Copy (repeated between components).
    Workloads: OLAP & Ad-hoc, Stream Analytics, KV Serving, Batch Analytics.
    Problems: Data Copied in Each Layer; Components Complex; Data Silos and Inconsistency; Too Expensive Cost.
  3. The Good and The Bad of Apache Kafka

    The Good (Good for Operational Workloads): Ecosystem, Message Queue, High Throughput, Events & Logs.
    The Bad (Bad for Analytical Workloads): No Queryable, No Long-term, No Update, No Schema.
  4. Fluss: Streaming Storage for Analytics and AI

    Features: Streaming Read & Write; Real-Time Updates & Lookups; Column & Partition Pruning; Unified Stream & Lake.
    Architecture diagram labels: Fluss Cluster (Server, Server, Server); Remote Storage (S3 / OSS / HDFS); Lakehouse Storage (Paimon / Iceberg* / Lance*); Tiering Service.
    Sources shown: Databases, Logs, Images*, Videos*.
    Operations shown: Streaming Writes, Real-Time Updates, Streaming Reads, Batch Reads, Union Reads, Lookup Join, Lakehouse Analytics.
  5. Streaming Lakehouse for Unified Data Infra

    Shared Data, Zero Copy.
    Real-Time Data Layer (Short-Term, Second Latency): Fluss Storage.
    Historical Data Layer (Long-Term, Minute Latency): Lakehouse Storage (Paimon, Iceberg*).
    Workloads: Stream Analytics, Dimension Join, OLAP & Ad-hoc, Batch Analytics.
  6. Fluss Production in Alibaba

    Total Data Size: 3 PB
    Ingest Throughput: 40 GB/s
    Highest Lookup QPS on a Single Table: 500 K
    Rows of Largest Table: 500 B
  7. Use Case 1: Log Collection and Real-Time Analysis

    The Challenge
    Storage Cost
    • Data keeps growing
    • Retain longer than 3 days
    Network Cost
    • 1 write, 10 reads
    • Expensive cross-AZ traffic
    The Solution
    Shared Data
    • Reduce 30% data in Fluss
    • Keep long term in Lakehouse
    Column Pruning
    • Columnar log format (Arrow)
    • Partition Pruning and more
    Chart (Use Kafka vs. Use Fluss): Total Cost 30%, Read Traffic 70%.
  8. Use Case 2: Delta Join for Large Scale Streaming Join

    Streaming Join (page, order -> conversion): Large State.
    Delta Join (page, order -> conversion): ① Streaming Read ② Lookup By Index.
    Alibaba Donated Delta Join to Flink v2.1: FLIP-486
    • Eliminates 100TB State, Checkpoint from 90s to 1s
    • Fast Flink Job updates without state bootstrap
    • 85% lower Flink CPU & Mem resource usage
    • Inspect Join data in real-time via Fluss Query API
  9. Future Roadmap

    Python Ecosystem: Connect on top of Apache Arrow
    Streaming Lakehouse: More formats, more engines
    Multi-modal AI: Streaming into Lance
  10. The Future: Real-Time Pipeline for Multi-Modal AI Data Foundation for AI

    Slide labels: Multi-modal Data; Open Data; Streaming Data; Videos, Images, Text, Audio; Multi-modal Data; Lance Ecosystem; Python Ecosystem; Applications; Multi-Modal Agent; AI DataLake; Feature Engineering; RAG; Hybrid Search; Lakehouse Storage.
  11. The Open-Source Journey of Fluss

    Half a Year of Open Source
    Fluss Community: 1200 Stars, 56 Contributors, 3 Releases
    Timeline: Open Sourced at Flink Forward Asia 2024; Today at Flink Forward Asia 2025
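
Slide 8 contrasts a streaming join that holds large state with a delta join built from two steps: a streaming read of changes and a lookup by index into the other table. The following is a minimal Python sketch of that idea only, not the Flink or Fluss implementation. The `IndexedTable` class, the `delta_join` function, and the `user_id` / `url` / `amount` fields are all hypothetical names chosen for illustration; the sketch also assumes each change is already stored in its table before the join sees it, so the join step itself keeps no state.

```python
from collections import defaultdict


class IndexedTable:
    """Toy stand-in for a stored table that can be looked up by a join key."""

    def __init__(self, key):
        self.key = key
        self.rows_by_key = defaultdict(list)

    def append(self, row):
        # Simulates an upstream writer persisting the row in storage.
        self.rows_by_key[row[self.key]].append(row)

    def lookup(self, value):
        # Simulates step 2 on the slide: lookup by index.
        return list(self.rows_by_key.get(value, []))


def delta_join(changes, pages, orders, key):
    """Join each incoming change against the other table by index lookup.

    `changes` is an ordered iterable of (side, row) pairs, standing in for
    step 1 on the slide: a streaming read of both inputs.
    """
    joined = []
    for side, row in changes:
        if side == "page":
            pages.append(row)
            matches = orders.lookup(row[key])
            joined.extend({**row, **other} for other in matches)
        else:
            orders.append(row)
            matches = pages.lookup(row[key])
            joined.extend({**other, **row} for other in matches)
    return joined


# Hypothetical sample data: page views and orders joined on user_id.
pages = IndexedTable("user_id")
orders = IndexedTable("user_id")
changes = [
    ("page", {"user_id": 1, "url": "/a"}),
    ("order", {"user_id": 1, "amount": 30}),
    ("order", {"user_id": 2, "amount": 5}),
    ("page", {"user_id": 1, "url": "/b"}),
]
conversions = delta_join(changes, pages, orders, "user_id")
for row in conversions:
    print(row)
```

In this sketch the only data the join needs lives in the two tables, which is the property the slide credits for removing join state; a real system would also have to handle updates, deletes, and concurrent writers, which are left out here.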