

[COC_ASIA_2025] When Flink Meets Fluss: The Future of Streaming Warehouse

Kafka and Flink have been widely used together in stream processing scenarios, becoming the de facto standard paradigm for building streaming warehouses and real-time analytics. However, this paradigm still faces many challenges that are hard to resolve. This session will explore the problems it runs into in streaming analytics.

We will first discuss the limitations and pain points of Kafka when used with Flink. Then we will introduce Fluss, a next-generation streaming storage designed for streaming analytics. We’ll walk through its architecture and core innovations, highlighting how it seamlessly integrates with Flink to power the next generation of streaming warehouses. You’ll discover the game-changing capabilities unlocked by combining Flink and Fluss, such as streaming column pruning, delta joins, union reads, and merge engines.

Finally, we’ll explore real-world use cases of Flink + Fluss, showcasing how this powerful combination delivers true benefits like reduced infrastructure costs, improved performance, and enhanced stability for large-scale streaming and batch workloads.
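To make the streaming column pruning mentioned above concrete, here is a minimal sketch of why a column-based layout reads less data when a query needs only a few columns. It is a toy model under stated assumptions, not Fluss or Kafka code: the fixed 8-byte integer encoding, the 10-column table, and every function name are hypothetical and chosen only for illustration.

```python
import struct

NUM_ROWS = 1_000   # hypothetical table size
NUM_COLS = 10      # hypothetical column count, all 8-byte integers


def make_rows(num_rows, num_cols):
    """Build a toy table as a list of rows."""
    return [[r * num_cols + c for c in range(num_cols)] for r in range(num_rows)]


def encode_row_based(rows):
    """Row layout: one byte string per record, holding every column."""
    return [struct.pack(f"<{len(row)}q", *row) for row in rows]


def encode_column_based(rows):
    """Column layout: one contiguous byte string per column."""
    return [struct.pack(f"<{len(col)}q", *col) for col in zip(*rows)]


def read_row_based(records, wanted, num_cols):
    """Project columns from row records; every full record must be read."""
    bytes_read, out = 0, []
    for rec in records:
        bytes_read += len(rec)
        values = struct.unpack(f"<{num_cols}q", rec)
        out.append(tuple(values[i] for i in wanted))
    return out, bytes_read


def read_column_based(chunks, wanted, num_rows):
    """Project columns from column chunks; only the wanted chunks are read."""
    bytes_read, cols = 0, []
    for i in wanted:
        bytes_read += len(chunks[i])
        cols.append(struct.unpack(f"<{num_rows}q", chunks[i]))
    return list(zip(*cols)), bytes_read


rows = make_rows(NUM_ROWS, NUM_COLS)
wanted = [3]  # the query needs 1 of 10 columns (10%)
row_result, row_bytes = read_row_based(encode_row_based(rows), wanted, NUM_COLS)
col_result, col_bytes = read_column_based(encode_column_based(rows), wanted, NUM_ROWS)
print(row_bytes, col_bytes, row_bytes / col_bytes)  # prints "80000 8000 10.0"
```

Both reads return the same projected values, but the row-based read touches ten times as many bytes. Real systems add compression, batching, and network overhead, so this ratio describes only the idealized case of equally sized columns.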


Jark Wu

July 26, 2025



Transcript

  1. When Apache Flink Meets Apache Fluss (Incubating): The Future of Streaming Warehouse
    Jark Wu, PMC member of Apache Flink, PPMC member of Apache Fluss (incubating)
  2. Data Infra History
    Batch: 2008, 1st batch storage; 2010, de facto standard batch engine; 2017, 2nd batch storage
    Stream: 2011, 1st streaming storage; 2014, de facto standard streaming engine; 2024, 2nd streaming storage
  3. Apache Fluss (incubating): Streaming Storage for Real-Time Analytics
    Databases, Logs, Images*, Videos*
    Fluss Cluster (Server, Server, Server): Streaming Writes, Real-Time Updates, Streaming Reads, Batch Reads, Lookup Join
    Remote Storage (S3 / OSS / HDFS)
    Tiering Service; Lakehouse Storage (Paimon / Iceberg* / Lance*); Lakehouse Analytics; Union Reads
    Streaming Read & Write; Real-Time Updates & Lookups; Column & Partition Pruning; Unified Stream & Lake
  4. History of Apache Fluss (incubating)
    2023/07: Initiated by Flink Team at Alibaba
    2023/12: First Introduction at FFA 2023
    2024/11: Open Sourced at FFA 2024
    2025/06: Joins ASF Incubator
  5. 1. Integration Overview of Kafka vs. Fluss
    Fluss: 360° Integration with Flink
    Diagram labels: Flink; Lookup Join; Dimension Table; Source Table; Sink Table; Catalog; Real-Time Write; Real-Time Update; Partial Update; CDC YAML; Streaming Read; Batch Read; Union Read; Full-Increment Read; Binlog Read; Metadata Read; Metadata Update
    Kafka: Limited Integration with Flink
    Diagram labels: Flink; Source Table; Sink Table; Catalog; Lookup; Dimension Table; Real-Time Write; Streaming Read; Metadata
  6. 2. No Schema In Kafka vs. First-Class Schema in Fluss
    Kafka: manually mapping schema
    Fluss: auto-discover metadata via Catalog
  7. 3. Lookup Joins in Kafka vs. Fluss
    Diagram labels: Lookup Join, Copy
    With Kafka: • Relies on an external KV store for lookup joins • Costly and complex pipeline
    With Fluss: • Millions of lookup QPS • Updated in real time • Prefix lookup ( 1 : N )
  8. 4. Kafka is Not Queryable vs. Instant Query on Fluss
    Kafka: dump Kafka to an OLAP system for query & debug
    Fluss: query Fluss tables directly in real time
  9. 5. Kafka Topic is not Shareable (intermediate-layer data cannot be reused)
    Kafka: a 100G Kafka topic feeds Flink Job 1 (Aggregation, Sink), Flink Job 2 (Join, Sink), and Flink Job 3 (TopN, Sink); each job runs its own Changelog Normalize (diagram labels: 100G, 100G, 100G).
    Kafka doesn't support updates or a changelog feed, so a Kafka topic is hard for downstream jobs to reuse; each job has to deduplicate the Kafka topic in Flink state.
    Fluss: a 100G Fluss table feeds the same three jobs via CDC. Fluss tables are shareable in the data warehouse.
  10. 6. No I/O Pruning in Kafka vs. Column Pruning In Fluss
    Kafka: Row-Based Read (Col1, Col2, Col3, Col4 to Flink); based on Avro/JSON/CSV format
    Fluss: Column-Based Read (Col1, Col2, Col3, Col4 to Flink); based on IPC Streaming Format
    Fluss can achieve 10x the read performance of Kafka if only 10% of the columns are needed.
  11. 7. Kafka is Not Unified with Lakehouse (not a unified stream-batch architecture)
    Kafka: Bronze, Silver, and Gold layers, each with Kafka and Iceberg, connected by ETL and Data Copy. Duplicated data copies in each layer.
    Fluss: Real-Time Data Layer (short-term, second latency) in Fluss Storage; Historical Data Layer (long-term, minute latency) in Lakehouse Storage (Paimon, Iceberg*); Shared Data.
    Data is shared between Fluss and Lakehouse. Reduce 50%+ streaming storage cost.
  12. Apache Fluss (Incubating) Production in Alibaba
    Total Data Size: 3 PB
    Ingest Throughput: 40 GB/s
    Highest Lookup QPS on a Single Table: 500 K
    Rows of Largest Table: 500 B
  13. Use Case 1 : Log Collection and Real-Time Analysis
    The Challenge
    Storage Cost: • Data keeps growing • Retain longer than 3 days
    Network Cost: • 1 write, 10 reads • Expensive cross-AZ traffic
    The Solution
    Shared Data: • Reduce 30% data in Fluss • Keep long term in Lakehouse
    Column Pruning: • Columnar log format (Arrow) • Partition Pruning and more
    Chart (Use Kafka vs. Use Fluss): Total Cost 30%, Read Traffic 70%
  14. Use Case 2 : Delta Join for Large Scale Streaming Join
    Streaming Join: page, order, conversion; Large State
    Delta Join: page, order, conversion; ① Streaming Read
    Alibaba donated Delta Join to Flink v2.1: FLIP-486
    Eliminates 100TB of state; checkpoint from 90s to 1s
    Fast Flink job updates without state bootstrap
    85% lower Flink CPU & memory resource usage
    Inspect join data in real time via Fluss Query API
  15. Future Roadmap
    Storage for Flink: De facto standard storage of Flink
    Streaming Lakehouse: More formats, more engines
    Multi-modal AI: Streaming into Lance
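
Slide 9 notes that every Flink job reading an upsert-style Kafka topic has to deduplicate it in its own state before it can see a proper changelog. The sketch below is a toy model of that normalization step, written to show why the state is unavoidable when the storage only delivers "latest value per key" records. It is not Flink's actual operator; the record shapes, the op codes as plain strings, and the function name `normalize` are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Upsert = Tuple[str, Optional[dict]]  # (key, new row); a row of None means "delete"
Change = Tuple[str, str, dict]       # (op, key, row) with op in +I, -U, +U, -D


def normalize(upserts: Iterable[Upsert], state: Dict[str, dict]) -> List[Change]:
    """Turn an upsert stream into a full changelog.

    `state` holds the last row seen per key. It is needed because an upsert
    record carries only the new value, while downstream operators also need
    the old value to retract it.
    """
    changes: List[Change] = []
    for key, row in upserts:
        previous = state.get(key)
        if row is None:
            # Delete: emit the retraction only if the key was known.
            if previous is not None:
                changes.append(("-D", key, previous))
                del state[key]
        elif previous is None:
            # First time this key is seen: plain insert.
            changes.append(("+I", key, row))
            state[key] = row
        else:
            # Update: retract the old row, then emit the new one.
            changes.append(("-U", key, previous))
            changes.append(("+U", key, row))
            state[key] = row
    return changes


upserts = [("a", {"v": 1}), ("a", {"v": 2}), ("b", {"v": 5}), ("a", None)]
job_state: Dict[str, dict] = {}
changes = normalize(upserts, job_state)
for change in changes:
    print(change)
```

In this model each consuming job would carry its own copy of `job_state`, which grows with the number of distinct keys in the topic. If the storage layer itself serves the changelog (the CDC arrows on the Fluss side of the slide), the consuming jobs no longer need that per-job state.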