Apache Fluss: A Real-Time Data Foundation for Intelligent Systems

Open Data Circle

May 26, 2026

Transcript

  1. Apache Fluss: A Real-Time Data Foundation for Intelligent Systems

    Yuqing Guan - Open Data Circle
    May 26, 2026
  2. Agenda

    1. Why intelligent systems need a new data foundation
      • Fresh state, evolving context, history, and vectors
    2. Fluss as table-level stream-batch unification
      • From Kafka x Iceberg integration to Fluss-native Streamhouse
    3. Fluss in the AI era
      • Real-time context, externalized state, and Lance support
    4. Demo: IoT streaming with Fluss and Lance
      • Raw events, latest-state table, Lance tiering, and MinIO
    5. Conclusion & Discussion
      • Key takeaways
  3. The Question for Today

    What do data systems really need to unify in the AI era?
    • Not just writing streaming data into a lakehouse
    • Not just adding another vector database
    • Not just letting a model query some data

    The real goal is this: continuously changing data, state, context, and semantic representations should become one evolving data object.

    Because intelligent systems are not only reading history. They are acting on current state, recent changes, and semantic context.

    If these live in separate systems with separate freshness, schema, and update semantics, we get drift: the online decision sees one world, the batch analytics sees another, and the vector index may represent a third.

    For Data Agents or intelligent systems, that is exactly where trust breaks.
  4. From reports to decisions

    Traditional BI asks
    • What happened yesterday?
    • How did a metric change over time?
    • What should a human investigate?

    Intelligent systems ask
    • What is the entity state now?
    • What changed recently?
    • What context should guide the next action?

    The workload shifts from historical analysis to continuously changing operational context.
  5. The new foundation has four jobs

    1. Fresh state: The system knows what is true now.
    2. Evolving context: Recent events and entity changes stay available.
    3. Historical reconstruction: Decisions can be explained and replayed.
    4. Semantic retrieval: Vectors and similar cases can be connected back to the data path.
  6. About Apache Fluss

    https://fluss.apache.org/

    Apache Fluss is an open-source, lakehouse-native streaming storage. It collapses the message broker, online KV store, stream-processing state backend, and lakehouse cold store into a single coherent foundation, making the Lakehouse truly real-time.
  7. Kafka x Iceberg: Streaming Into the Lakehouse

    Kafka x Iceberg is a natural combination:
    • Kafka handles real-time events and decoupling
    • Flink or connectors continuously write into tables
    • Iceberg provides ACID, schema evolution, time travel, and historical analytics

    This is a real improvement over older batch-oriented pipelines. But the unification happens at the pipeline layer.
  8. The Hidden Fracture: A Modern Lambda Architecture

    After the system runs for a long time, hot and cold paths often split again:
    • Low-latency dashboards read from Kafka or serving stores
    • Historical analytics read from Iceberg
    • Backfills may bypass Kafka and write directly into Iceberg
    • Real-time metrics and offline metrics get separate implementations

    The same business metric quietly gains one streaming-shaped implementation and one batch-shaped implementation.
  9. Fluss x Iceberg: The Lakehouse Extends Into Real Time

    Fluss starts from a different question: What if streaming storage itself behaved like a table?

    In a Fluss-native architecture:
    • Real-time data stays hot, mutable, and queryable in Fluss
    • Historical data is tiered into lakehouse formats such as Iceberg, Paimon, or Lance
    • Hot and cold are different freshness layers of the same logical table
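The idea of hot and cold as freshness layers of one logical table can be sketched with a toy in-memory model. This is not the Fluss API: the `LogicalTable` class and its `upsert`, `tier`, `get`, and `scan` methods are hypothetical names chosen for illustration, and real tiering is asynchronous and far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalTable:
    """Toy model (hypothetical): one primary-key table with two tiers."""
    hot: dict = field(default_factory=dict)   # recent, mutable rows
    cold: dict = field(default_factory=dict)  # rows already tiered to the lake

    def upsert(self, key, row):
        # New writes always land in the hot tier.
        self.hot[key] = row

    def tier(self):
        # Move everything currently hot into the cold tier.
        self.cold.update(self.hot)
        self.hot.clear()

    def get(self, key):
        # Point lookup: the hot tier is fresher, so it wins.
        if key in self.hot:
            return self.hot[key]
        return self.cold.get(key)

    def scan(self):
        # Union read: cold rows overlaid with fresher hot rows.
        return {**self.cold, **self.hot}


table = LogicalTable()
table.upsert("device-1", {"temp": 20.5})
table.upsert("device-2", {"temp": 18.0})
table.tier()                               # both rows are now cold
table.upsert("device-1", {"temp": 21.0})   # a fresher value arrives
snapshot = table.scan()
```

A reader of `snapshot` sees one table: `device-1` comes from the hot tier and `device-2` from the cold tier, without the caller choosing a path.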
  10. What Table-Level Unification Means

    Fluss does not merely wrap a log and call it a table. It makes table semantics part of real-time storage:
    • Updates are part of the model
    • Schema is part of the model
    • Partitioning and bucketing are part of the model
    • Changelogs, primary keys, and materialized state are part of the model
    • After tiering, the same semantics continue into the lakehouse layer

    So unification is not maintained by discipline. It is supported by the structure of the system.
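How a changelog, a primary key, and materialized state relate can be shown with a minimal sketch. It assumes Flink-style change kinds (`+I` insert, `-U` update-before, `+U` update-after, `-D` delete); the `materialize` function and the sample records are hypothetical and do not reflect how Fluss stores or applies changes internally.

```python
def materialize(changelog):
    """Fold change records into the latest state per primary key."""
    state = {}
    for kind, key, row in changelog:
        if kind in ("+I", "+U"):
            state[key] = row          # insert, or new image of an update
        elif kind == "-D":
            state.pop(key, None)      # delete removes the key
        elif kind == "-U":
            pass  # old image of an update; superseded by the following +U
        else:
            raise ValueError(f"unknown change kind: {kind}")
    return state


# Hypothetical change records keyed by device id.
changelog = [
    ("+I", "device-1", {"status": "online"}),
    ("+I", "device-2", {"status": "online"}),
    ("-U", "device-1", {"status": "online"}),
    ("+U", "device-1", {"status": "offline"}),
    ("-D", "device-2", {"status": "online"}),
]
latest = materialize(changelog)
```

Replaying the same changelog always yields the same state, which is what makes the history reconstructable from the log alone.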
  11. Bronze, Silver, and Gold Are No Longer Batch Steps

    In a traditional lakehouse:
    • Bronze lands raw data
    • Silver cleans and enriches it
    • Gold aggregates it
    • Each layer often has batch boundaries and recomputation points

    In a Fluss-native Streamhouse:
    • Bronze, Silver, and Gold can all be continuously updated tables
    • Upstream changes propagate as changelogs
    • Late data, corrections, and schema changes can flow naturally across layers

    In short: data does not arrive at Silver or Gold. It flows through them.
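One way to picture a correction flowing from Silver into Gold without a batch recomputation is incremental aggregation with retractions. The sketch below is an assumption-laden toy: the `apply_to_gold` function, the `site` and `kwh` fields, and the Flink-style change kinds are all hypothetical, and in practice a stream processor such as Flink would maintain the aggregate.

```python
from collections import defaultdict


def apply_to_gold(gold, change):
    """Apply one Silver change record to a per-site running total in Gold."""
    kind, row = change
    sign = 1 if kind in ("+I", "+U") else -1   # -U and -D retract old values
    gold[row["site"]] += sign * row["kwh"]


gold = defaultdict(float)
silver_changes = [
    ("+I", {"id": 1, "site": "A", "kwh": 5.0}),
    ("+I", {"id": 2, "site": "A", "kwh": 3.0}),
    ("+I", {"id": 3, "site": "B", "kwh": 4.0}),
    # A late correction to reading 2: retract the old value, add the new one.
    ("-U", {"id": 2, "site": "A", "kwh": 3.0}),
    ("+U", {"id": 2, "site": "A", "kwh": 2.5}),
]
for change in silver_changes:
    apply_to_gold(gold, change)
```

Only the affected total for site `A` is adjusted; nothing upstream of the corrected row is reprocessed.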
  12. Three Investments That Move Fluss Toward Intelligent Systems

    Over the last year, Fluss has evolved in three major directions:
    • Stateless compute and externalized state
      • Compute becomes lighter
      • State becomes durable, stable, and queryable
      • Recovery, scaling, and logic upgrades become easier
    • Complex data types and zero-copy schema evolution
      • Data models can naturally become richer
      • Schema changes do not require large rewrites
    • Vectors and Lance
      • Embeddings no longer have to live in a separate silo
      • Structured data, streaming signals, and vector representations can live closer to the same foundation
  13. Why Lance Matters in the AI Era

    AI applications often need three access patterns at the same time:
    • Row access: retrieve the latest state by key, such as user profile or device state
    • Column access: scan features and history for analytics or training
    • Vector access: semantic similarity search for RAG, recommendations, and content discovery

    Lance matters because it brings vector and multi-modal data into the lakehouse context.

    The Fluss + Lance direction is powerful: real-time events enter Fluss, while semantic representations and historical context can be tiered into Lance for vector search and multi-modal analysis.
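The three access patterns can be illustrated over one small in-memory dataset. This is a plain-Python sketch, not the Lance or Fluss API: the `rows` data, the two-dimensional embeddings, and the helper names are made up for illustration, and the vector search is brute force rather than an index.

```python
import math

# Hypothetical device rows with a tiny 2-d embedding each.
rows = {
    "device-1": {"temp": 21.0, "embedding": [1.0, 0.0]},
    "device-2": {"temp": 18.0, "embedding": [0.0, 1.0]},
    "device-3": {"temp": 25.0, "embedding": [0.8, 0.6]},
}


def get_row(key):
    # Row access: latest state by key.
    return rows[key]


def scan_column(name):
    # Column access: read one field across all rows.
    return [row[name] for row in rows.values()]


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(query, k=1):
    # Vector access: brute-force similarity search over embeddings.
    ranked = sorted(
        rows,
        key=lambda key: cosine(query, rows[key]["embedding"]),
        reverse=True,
    )
    return ranked[:k]


top_two = nearest([0.9, 0.5], k=2)
```

All three reads hit the same records, which is the point of keeping row, column, and vector access close to one foundation rather than in three separately updated stores.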
  14. The Data Loop for Data Agents

    In this loop, the key question is not only "how fast can we write?"

    The key questions are:
    • Is the data fresh enough for the decision?
    • Is the state explainable?
    • Can the history be reconstructed?
    • Are embeddings consistent with the main data semantics?
  15. Takeaway

    Fluss is a streaming-first table store that makes real-time data durable, queryable, and lakehouse-ready.

    1. Table-level stream-batch unification
    2. One logical table across hot and cold freshness layers
    3. Durable state outside compute
    4. Real-time context for intelligent systems
    5. Lance support for vector-aware lakehouse workloads
    6. Less drift between online decisions, historical analytics, and semantic retrieval