Jark Wu
December 01, 2024

[FFA2024]Fluss: Next-Gen Streaming Storage for Real-Time Analytics

https://asia.flink-forward.org/jakarta-2024/agenda#fluss--next-gen-streaming-storage-for-real-time-analytics

This talk introduces Fluss, our newly developed next-generation streaming storage designed specifically for streaming analytics.

We will focus on how Fluss addresses the pain points of traditional streaming storage when it is used for streaming analytics. We will also introduce Fluss's core features, benefits, lakehouse integration, and our future plans.

Transcript

  1. Fluss: Next-Gen Streaming Storage for Real-Time Analytics
    Jark Wu, Head of Flink SQL and Data Channels, Alibaba Cloud
  2. Big Data is Moving from Offline to Real-Time
    Tech: Warehouse: T + 1; Lakehouse: T + 1h; Streaming Lakehouse: T + 1m; T + 1s
    Use Case: Recommendation, Dashboard, Anomaly Detect, Anti Fraud, Dynamic Pricing, Customer360, Ads Monitoring
    Industry: E-Commerce, Videos, Manufacture, Games, Telecom, Logistics, IOT, Finance
  3. Typical Architecture of Real-Time Analytics + Not a Good Solution
    [Diagram labels: Kafka Topic (x6), Real-Time OLAP]
  4. Kafka Problem (1): No Update Support
    Dedup; Duplications; State: Materialize all the input data!; Dedup do the Updates; Dedup is expensive; Kafka data is not reused; Hard to fix data
    [Diagram labels: Kafka Topic, Changelog, Join/Agg]
  5. Kafka Problem (2): Hard to Debug
    Dump to OLAP systems: Fragile; Expensive; Data Consistency
    Query Engine on Kafka: Full Scan; No Data Skipping; Too Slow
    [Diagram labels: Kafka (x3) with dump arrows; Kafka (x3)]
  6. Kafka Problem (3): Process Historical Data
    Keep Only Days Data; Disrupt Live Traffic; SLOW
    [Diagram labels: Flink (x4), Kafka Broker (x3)]
  7. Kafka Falls Short in Real-Time Streaming Analytics
    Hard to Debug; No Update Support; Historical Data; Networking Cost
  8. Fluss: Columnar Stream
    File format based IPC Streaming Format; Sub-second Latency R/W; 10x perf improved if column pruned
    [Diagram: Row-Based vs. Column-Based layout of Col1 to Col4]
    [Chart: Read Throughput (k/s), Fluss vs. Kafka, for 20cols, 10cols, 5cols, 2cols, 1col; annotated "10x"]
  9. Fluss: Updates and Changelog
    Large-Scale Real-Time Updates; Streaming Log + LSM; Partial-Update Merge Wide Table; No Dedup for Reading Changelog; Generate Changelogs; Recover & Materialize
    [Diagram labels: Fluss Server, KV Tablet, Log Tablet]
  10. Fluss: Queryable
    Support LIMIT, COUNT Queries; Support Primary-Key Lookup; Real-Time Lookup Joins
    [Diagram labels: Fluss (x3), Lookup Join, Queries, Fluss Dim Table]
  11. Fluss: Unification of Stream/Lakehouse
    Union Reads; Shared-Data, Single Meta
    [Diagram labels: Fluss Cluster; Lakehouse Storage (Paimon / Iceberg*); Compaction Service; Lakehouse Analytics; Real-Time Data Layer (Short-Term, Second Latency); Historical Data Layer (Long-Term, Minute Latency)]
  12. Fluss: Streaming Storage for Real-Time Analytics
    Features: Real-Time Read/Write; Streaming Updates; Projection Pushdown; Changelog Subscribe; Lookup Queries; Unified Stream/Lake
    [Diagram labels: Databases, Logs; Streaming Writes, Real-Time Updates; Fluss Cluster (Server x3); Remote Storage (S3 / OSS / HDFS); Lakehouse Storage (Paimon / Iceberg*); Compaction Service; Streaming Reads, Batch Reads, Lookup Join, Union Reads, Lakehouse Analytics]
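
Slide 9 lists a KV tablet, a log tablet, partial-update merge, and changelog generation without showing how they fit together. The following is a minimal Python sketch of that idea, not Fluss's implementation or API: the class `ToyPrimaryKeyTable`, its methods, and the sample columns are all hypothetical, and the `+I` / `-U` / `+U` change kinds are borrowed from Flink's changelog notation as an assumption. A dict stands in for the KV tablet (latest row per primary key) and a list stands in for the log tablet.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Change = Tuple[str, Row]  # (kind, row); kind is "+I", "-U", or "+U" (Flink-style, assumed)


@dataclass
class ToyPrimaryKeyTable:
    """Hypothetical toy model: a dict plays the KV tablet, a list plays the log tablet."""

    key_column: str
    kv: Dict[Any, Row] = field(default_factory=dict)   # latest row per primary key
    log: List[Change] = field(default_factory=list)    # append-only changelog

    def upsert(self, partial_row: Row) -> None:
        """Merge the given columns into the stored row and append changelog entries."""
        key = partial_row[self.key_column]
        old = self.kv.get(key)
        if old is None:
            # First write for this key: store it and emit an insert.
            new = dict(partial_row)
            self.kv[key] = new
            self.log.append(("+I", dict(new)))
            return
        # Partial update: columns absent from partial_row keep their old values.
        new = {**old, **partial_row}
        if new == old:
            return  # nothing changed, so no changelog entry is produced
        self.kv[key] = new
        self.log.append(("-U", dict(old)))  # row image before the update
        self.log.append(("+U", dict(new)))  # row image after the update

    def lookup(self, key: Any) -> Optional[Row]:
        """Primary-key lookup against the latest state."""
        return self.kv.get(key)


# Two writers fill different columns of the same wide row.
table = ToyPrimaryKeyTable(key_column="order_id")
table.upsert({"order_id": 1, "buyer": "alice"})
table.upsert({"order_id": 1, "amount": 42})
```

In this sketch a downstream reader can consume `table.log` directly and see complete before/after row images, which is the sense in which the slide says no deduplication is needed when reading the changelog. The same latest-state dict also serves the primary-key lookups mentioned on slide 10.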