Realtime Data Platform: from Lakehouse to Streamhouse

Transcript

  1. Scope of the Session: What we will cover today

    Open research on Realtime Data Platform: explore the latest technologies and architectures like Streamhouse.
    Not representative of in-house projects at LY Corporation, which are private and not publicly available.
  2. About Me

    肖 志彦 / Xiao Zhiyan / シヤウ ズウイエン
    From Chongqing, China (AKA “8D Magic City”)
    The Chinese University of Hong Kong, Math & Information Engineering
    Software Engineer @ LY Corporation (Data Engineering > Data Pipeline Engineering > Ingestion Processing)
    Improving big data systems with Rust
    Open Data Driven community (Streamhouse, AI-native, Rust, and more)
  3. Why Realtime Data Platform: A large shift happening around us

    Software is eating the world. Every business became a software company.
    AI is eating software.
    Realtime is eating everything else: live dashboards, realtime alerts, on-the-fly personalization, instant feedback from AI agents.
    Most data platforms → do not fully support realtime scenarios
    Realtime Data Platform → Streamhouse
  4. Data Warehouse and Data Lake: The early days of Data Platforms

    Diagram labels: Structured Data, Semi-structured Data, Unstructured Data, ETL, Data Warehouse, BI / Reports, Data Lake (HDFS / S3), Query Engine (Spark), ML
    Comparison (Data Warehouse vs. Data Lake):
      Data Type: structured vs. structured, semi-structured, unstructured
      Scalability: difficult and expensive vs. easy at a low cost
      Transaction: supported vs. not supported
      Query Performance: high vs. low
    Can we combine the advantages of both systems into one?
  5. Lakehouse: Data Lake + Data Warehouse

    Diagram labels: Structured Data, Semi-structured Data, Unstructured Data, ETL, Lakehouse Table Format (Iceberg), BI / Reports, Data Lake (HDFS / S3), Query Engine (Spark), ML
    Key Point: Index data files with stats to speed up queries
    Table Formats: Apache Iceberg, Delta Lake, Apache Hudi
    Pros: Transactions, high query performance
    Cons: High latency (minutes+), low throughput for streaming writes
    How to support low-latency streaming processing?
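
The key point on the Lakehouse slide, indexing data files with stats to speed up queries, can be illustrated with a minimal sketch. It assumes each data file records the minimum and maximum of one column; the `DataFile` type, the `prune_files` helper, and the file names are hypothetical and are not the Iceberg API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical manifest entry: a file path plus min/max stats for one column."""
    path: str
    min_ts: int
    max_ts: int


def prune_files(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps the query range [lo, hi]."""
    return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]


files = [
    DataFile("part-0.parquet", 0, 99),
    DataFile("part-1.parquet", 100, 199),
    DataFile("part-2.parquet", 200, 299),
]

# A query for 150 <= ts <= 220 can skip part-0 without opening it.
selected = prune_files(files, 150, 220)
print([f.path for f in selected])
```

The query engine consults only the small stats index to decide which files to read, which is why queries get faster while every commit still has to rewrite metadata.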
  6. Lambda Architecture: Batch Pipeline + Streaming Pipeline

    Diagram labels: Data Sources, Batch Pipeline (Spark), Lakehouse Storage (Iceberg), Streaming Pipeline (Flink), Streaming Storage (Kafka), Query Engine, Data Users
    Key Point: Dual-layer design (batch + streaming)
    Pros: Low-latency streaming processing (sub-second)
    Cons: Complexity, high cost, delay in historical data (minutes+)
    Is it feasible to unify these two data pipelines?
  7. Kappa Architecture: Unify Batch and Streaming Pipelines (2 → 1)

    Diagram labels: Kafka Sources, Streaming Pipeline (Flink), Streaming Storage (Kafka), Query Engine, Data Users
    Key Point: All data as streams, historical data by replaying
    Pros: Simplicity, low latency (sub-second)
    Cons: Expensive replaying, inefficient queries
    Can we improve Kappa Architecture with Lakehouse Storage?
  8. Unified Near-Realtime Pipeline: Kafka → Paimon

    Diagram labels: Data Sources, Unified Pipeline (Flink), Query Engine, Data Users
    Storage options by latency:
      Lakehouse Storage (Iceberg, minutes+)
      Streaming Storage (Kafka, sub-second)
      Streaming-first Lakehouse Storage (Paimon, sub-minute)
    How does Paimon support both Streaming and Lakehouse features?
  9. A Little More on Paimon: Streaming-first Lakehouse Table Format

    Index data files with stats, similar to Iceberg
    Write to multiple buckets in parallel
    Read changelog of PK tables
    Streaming read of append tables
    Source: https://paimon.apache.org/docs/master/concepts/basic-concepts/
    Can we further improve Unified Pipeline to achieve sub-second latency?
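
The "write to multiple buckets in parallel" point on the Paimon slide can be pictured as hashing each row's primary key to a bucket, so that independent writers each own a disjoint set of keys. The sketch below assumes a CRC32 hash and an `id` key column purely for illustration; it is not Paimon's actual hash function or API.

```python
import zlib
from collections import defaultdict


def bucket_for(primary_key: str, num_buckets: int) -> int:
    """Map a primary key to a bucket with a stable hash (CRC32 is an assumption here)."""
    return zlib.crc32(primary_key.encode("utf-8")) % num_buckets


def partition_rows(rows, num_buckets):
    """Group rows by bucket so each bucket could be written by a separate writer."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row["id"], num_buckets)].append(row)
    return dict(buckets)


rows = [{"id": f"user-{i}", "clicks": i} for i in range(8)]
by_bucket = partition_rows(rows, num_buckets=4)
for bucket, bucket_rows in sorted(by_bucket.items()):
    print(bucket, [r["id"] for r in bucket_rows])
```

Because a given key always lands in the same bucket, updates to one key never race across writers, which is what makes parallel streaming writes on primary-key tables practical.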
  10. Unified Realtime Pipeline: Paimon + Fluss

    Diagram labels: Data Sources, Unified Pipeline, Lakehouse Storage (Paimon, sub-minute), Streaming Storage (Fluss, sub-second), Query Engine, Data Users
    Comparison (Kafka vs. Fluss):
      Data Format: binary (row-based) vs. Apache Arrow (column-based)
      Schema: external registry vs. internal schema
      Projection Pushdown: not supported vs. supported
      Query Interface: Consumer / REST vs. Arrow Flight (high performance)
    Let’s summarize what we have discussed so far.
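
To see why projection pushdown pairs naturally with a column-based format, consider this minimal sketch in plain Python. It models a columnar batch as a dictionary of column lists; `to_columns` and `scan_columns` are hypothetical helpers, not the Fluss or Apache Arrow API.

```python
def to_columns(rows):
    """Pivot row-oriented records into a column-oriented layout (column name -> values)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def scan_columns(columns, projection):
    """Projection pushdown: return only the requested columns, leaving the rest untouched."""
    return {name: columns[name] for name in projection}


rows = [
    {"user": "a", "amount": 10, "region": "jp"},
    {"user": "b", "amount": 20, "region": "hk"},
    {"user": "c", "amount": 30, "region": "jp"},
]

columns = to_columns(rows)
# Only the "amount" column is read; "user" and "region" are never touched.
print(scan_columns(columns, ["amount"]))
```

With a row-based format, a reader has to decode every full record before it can discard the fields it does not need, so the storage layer cannot skip that work on the reader's behalf.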
  11. Potential Improvements

    We have seen the latest Streamhouse architecture with Paimon and Fluss.
    Are there any other potential improvements in the design?
    Let’s imagine how we might orchestrate data pipelines.
  12. Dynamic Data Flow: Static Data Pipelines → Lively Composable Data Flows

    Layers: Data Sources → Ingest → ODS → Clean → DWD → Aggregate → DWS → Customize → ADS
    Components: Data Flow (YAML), Reusable Data Actions (SQL), Flow & Action Registry, Flow Materializer, Flow Trigger
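
One way to picture the Dynamic Data Flow slide is a flow definition that names reusable SQL actions, plus a materializer that resolves them against a registry. The sketch below is an assumption about how such pieces could fit together: the `flow` dictionary stands in for a parsed YAML file, and `ACTION_REGISTRY`, `materialize`, and all table names are hypothetical.

```python
# Hypothetical registry of reusable data actions, stored as SQL templates.
ACTION_REGISTRY = {
    "ingest": "INSERT INTO {target} SELECT * FROM {source}",
    "clean": "INSERT INTO {target} SELECT * FROM {source} WHERE id IS NOT NULL",
    "aggregate": "INSERT INTO {target} SELECT region, COUNT(*) AS cnt FROM {source} GROUP BY region",
}

# Stand-in for a data flow that would be written in YAML and parsed into a dict.
flow = {
    "name": "orders",
    "source": "raw_orders",
    "steps": [
        {"action": "ingest", "target": "ods_orders"},
        {"action": "clean", "target": "dwd_orders"},
        {"action": "aggregate", "target": "dws_orders_by_region"},
    ],
}


def materialize(flow, registry):
    """Turn a flow into concrete SQL, chaining each step's target into the next step's source."""
    statements = []
    source = flow["source"]
    for step in flow["steps"]:
        template = registry[step["action"]]  # raises KeyError for an unregistered action
        statements.append(template.format(source=source, target=step["target"]))
        source = step["target"]
    return statements


for statement in materialize(flow, ACTION_REGISTRY):
    print(statement)
```

Keeping actions in a registry means a new flow is only a new list of steps, which is the composability the slide contrasts with static pipelines.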
  13. AI-native Control Plane: Automated → Autonomous

    Diagram labels: User Prompt, Agentic AI, Orchestrator, Flow Planner, Action Composer, Flow & Action Registry, Data Flow, Data Action, Flow Materializer, Flow Trigger
  14. The Road Ahead

    Streamhouse is still a new architecture for the Realtime Data Platform.
    Let’s see how it can bring real value to our business.
    Explore, build, and share Streamhouse implementations together.
  15. Related Links

    2011, Lambda Architecture: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
    2014, Kappa Architecture: https://www.oreilly.com/radar/questioning-the-lambda-architecture/
    2020, Lakehouse: https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
    2023, Streamhouse: https://www.ververica.com/blog/streamhouse-unveiled
    2024, AI Native: https://www.splunk.com/en_us/blog/learn/ai-native.html
  16. FIN