Streamhouse Architecture with Flink and Paimon

This talk introduces data teams to Apache Paimon in combination with Apache Flink. Paimon was built with a strong focus on streaming workloads and serves as a table format in a lakehouse. It takes the stream processing approach in lakehouse architecture to the next level compared to other table formats, which are oriented more towards batch data. After this talk, data teams will know how to use Paimon and Flink to build a cost-efficient and fast data layer for different data processing scenarios.

Alexey Novakov

November 22, 2024

Transcript

1. Streamhouse Architecture with Flink and Paimon. Alexey Novakov, Product Solution Architect. BigData Europe 2024. Original Creators of Apache Flink®
2. In this talk: 01 Stream Processing with Apache Flink, 02 Intro to Apache Paimon, 03 Intro to Flink CDC, 04 Streaming Lakehouse Architecture a.k.a. Streamhouse, 05 Example Use Case, 06 Takeaways
3. Ververica, The Original Creators of Apache Flink®. [Timeline: Flink donated to Apache (2014), Apache Flink 1.0 (2016), milestones in 2017, 2019, 2023, and today (2024); years of experience in streaming for the enterprise.]
4. Apache Flink. Flink is an open source framework and distributed engine for stateful stream processing. Key properties: flexible APIs, fault tolerance, high performance, stateful processing, all on the Flink Runtime for stateful computations over data streams. Learn more: flink.apache.org
5. Use Cases. Flink provides a robust foundation for a wide range of use cases: Stateful Stream Processing (streams, state, time), Streaming Lakehouse (streams, SQL, Apache Paimon, CDC), and Analytics & Flink SQL (SQL, PyFlink, tables), all on the Flink Runtime for stateful computations over data streams. Learn more: flink.apache.org
6. Stateful Stream Processing. [Diagram: streaming and static sources feed a dataflow of transformations and computations that hold state and timers and write to sinks.]
7. DataStream API: Source, Transformation, Windowed Transformation, Sink.
val lines = env.fromSource(KafkaSource.builder[String]()....)
val events = lines.map(parse) // def parse(s: String): String = ???
val stats = events
  .keyBy("sensor")
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .aggregate(MyAggregationFunction())
stats.sinkTo(MyAlertSink(path))
Logical Dataflow Graph: Source -> Transform -> Window -> Sink
8. Real-Time Analytics with Flink SQL: lookup joins and continuous queries.
INSERT INTO ...
SELECT ... FROM ...
LEFT JOIN ... ON ...
GROUP BY ...
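To make this query shape concrete, here is a hedged Flink SQL sketch (not from the slides): a continuous aggregation that enriches a stream with a lookup join. All table and column names (orders, currency_rates, revenue_per_currency, proc_time) are invented for illustration, and orders is assumed to declare proc_time AS PROCTIME().

-- Hypothetical continuous query: enrich an orders stream via a lookup join
-- against a currency_rates dimension table, then aggregate per currency.
INSERT INTO revenue_per_currency
SELECT
  o.currency,
  SUM(o.amount * r.rate) AS revenue_eur
FROM orders AS o
LEFT JOIN currency_rates FOR SYSTEM_TIME AS OF o.proc_time AS r
  ON o.currency = r.currency
GROUP BY o.currency;

The result is an updating table: Flink emits changelog updates per currency as new orders arrive, which fits naturally with a sink that accepts upserts, such as a Paimon table.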
9. Apache Paimon Overview. An open table format based on an LSM (Log-Structured Merge) tree that brings large-scale real-time update capabilities to the data lakehouse.
10. Paimon Connector for Flink: (1) data catalog, (2) Streamhouse table.
CREATE CATALOG `vvc-streamhouse` WITH (
  'type' = 'paimon',
  'warehouse' = 's3a://vvc-flink-aws/streamhouse'
);

CREATE TABLE airlines (
  `carrier` STRING NOT NULL PRIMARY KEY NOT ENFORCED,
  eventTime TIMESTAMP(3)
) WITH (
  'changelog-producer' = 'full-compaction',
  'snapshot.time-retained' = '60s',
  'snapshot.num-retained.min' = '1',
  'snapshot.num-retained.max' = '5'
);
11. Paimon Connector for Flink (2): (3) Flink SQL job to copy data.
USE CATALOG `vvc-streamhouse`;
SET 'pipeline.name' = 'Copy Airlines';
SET 'execution.checkpointing.interval' = '10 s';
INSERT INTO airlines SELECT * FROM `system`.airlines.airlines_csv;
The Paimon table files land under s3://vvc-flink-aws/streamhouse/default.db/airlines/
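For illustration (not shown in the deck), the copied Paimon table can be read back either as a batch snapshot or as a continuous stream of changes; the scan.mode hint below is a sketch based on Paimon's read options and should be checked against your Paimon version.

-- Batch-style read of the latest snapshot
SELECT * FROM airlines;

-- Continuous read of new changes only (assumes dynamic table options are enabled)
SELECT * FROM airlines /*+ OPTIONS('scan.mode' = 'latest') */;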
12. Prominent Paimon features (and many others):
- Merge Engines: Deduplication, Partial Update, First Row, Automatic Aggregation
- Sequence Field: out-of-order event handling; the row with the largest sequence.field value is merged last
- Append-only table: append-only table as a message queue; bucket number; bucket key field
- Table snapshot tagging: automatic/manual tag creation; time-based; query historical data
- Lookup join: lookup with retry; async lookup; differs from Flink's lookup join
- Table management procedures: compaction; tags management; rollback to snapshot
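To make the merge engine and sequence field features concrete, here is a hedged DDL sketch. The table and column names are invented, and it assumes the Paimon catalog created earlier; the WITH options ('merge-engine', 'sequence.field', 'changelog-producer') are Paimon table options, but verify the exact values against your Paimon version.

-- Hypothetical Paimon table using the partial-update merge engine:
-- rows with the same primary key are merged field by field, and
-- 'sequence.field' decides which version wins for out-of-order events.
CREATE TABLE customer_profile (
  customer_id STRING NOT NULL PRIMARY KEY NOT ENFORCED,
  email       STRING,
  phone       STRING,
  updated_at  TIMESTAMP(3)
) WITH (
  'merge-engine' = 'partial-update',
  'sequence.field' = 'updated_at',
  'changelog-producer' = 'lookup'
);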
13. Paimon has been the missing piece for organizing a Lakehouse with end-to-end streaming workflows: Source -> …. -> Final consumer
14. Streaming Data Movement. Flink CDC enables non-stop ingestion use cases: data backup, disaster recovery, data distribution, and data warehouse/lake ingestion.
15. Flink CDC APIs:
- DataStream API: map, filter, keyBy, aggregate, join, flatMap
- SQL API: WHERE, GROUP BY, JOIN, Top-N, INSERT, SELECT
- YAML API: Source, Route, Sink, Pipeline, Table ID, Transform
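As an illustration of the SQL API (not taken from the slides), a CDC source table can be declared with the mysql-cdc connector. Host, credentials, schema, and the orders_mirror sink table are placeholders; the connector options themselves are standard Flink CDC MySQL options.

-- Hypothetical MySQL CDC source table (Flink CDC SQL connector)
CREATE TABLE orders_cdc (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10, 2),
  order_time  TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql.example.com',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders'
);

-- Continuously replicate the change stream into a (pre-created) Paimon table
INSERT INTO orders_mirror SELECT * FROM orders_cdc;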
16. Main Features:
- Lock-free reading of data transactions (change logs)
- Change data and change schema capture
- Delivery guarantees: exactly-once
- Automatic task allocation (from snapshot to incremental mode)
- Full database sync scenario (DB 1 -> DB 2: data and schema sync)
17. Flink CDC 2.0: Incremental Snapshot Algorithm. [Diagram: a no-lock algorithm in which parallel tasks (Task1-Task3) scan snapshot chunks (chunk1-chunk3) over JDBC connections and then switch automatically to a changelog dump task (Task4) reading the binlog.]
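Building on the hypothetical orders_cdc source above, the incremental snapshot phase can be tuned through connector options; the option names come from the Flink CDC MySQL connector, while the table, column, and values below are illustrative only.

-- Illustrative tuning of the incremental snapshot scan (values are examples)
CREATE TABLE orders_cdc_tuned (
  order_id BIGINT,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql.example.com',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders',
  -- size of the chunks scanned in parallel during the snapshot phase
  'scan.incremental.snapshot.chunk.size' = '8096',
  -- column used to split the table into chunks
  'scan.incremental.snapshot.chunk.key-column' = 'order_id'
);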
18. Flink CDC opens up streaming use cases for high-value transactional data from RDBMSs.
19. Combine all pieces: Streamhouse. Streamhouse is a data platform architecture that provides fast and cost-efficient data processing:
- an engine to process stream and batch data (Apache Flink)
- a unified batch & stream data format to organize SQL tables on top of tiered storage (Apache Paimon)
- a CDC/ETL framework to process data from an RDBMS or other sources as a data stream (Flink CDC)
20. Streamhouse and its Place. [Diagram: Streamhouse (Flink CDC, Paimon, Flink) sits between the Lakehouse (Iceberg, Spark: low cost, high latency, low business value) and the Real-Time System (Flink, Kafka: high cost, low latency, high business value).] Benefits: infrastructure cost saving, Flink SQL, direct RDBMS ingestion, streaming upsert support, fast table reads and writes.
21. Use case: Airline Delays (Paimon SDK). Let's calculate the total number of minutes each airline delayed its flights.
22. Flink SQL: DDL
CREATE TABLE `flights` (
  `id` VARCHAR(64) NOT NULL,
  `carrier` VARCHAR(2),
  `eventTime` TIMESTAMP(3),
  `plannedDep` TIMESTAMP(3),
  `actualDep` TIMESTAMP(3),
  `plannedArr` TIMESTAMP(3),
  `actualArr` TIMESTAMP(3),
  CONSTRAINT `PK_id` PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
  'changelog-producer' = 'input',
  'sink.parallelism' = '4',
  'snapshot.time-retained' = '60s',
  'snapshot.num-retained.min' = '1',
  'snapshot.num-retained.max' = '5',
  'full-compaction.delta-commits' = '2'
);
… similar code for all other tables here.
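The aggregation on the next slide reads a flights_enriched table that is not defined anywhere in the deck. The sketch below is an assumption of how its delay columns could be derived from the flights DDL above; the view name, the TIMESTAMPDIFF-based logic, and the choice of a temporary view (rather than a table in the `alexey-streamhouse` catalog) are all illustrative.

-- Hypothetical derivation of per-flight delay minutes from planned vs. actual times
CREATE TEMPORARY VIEW flights_enriched AS
SELECT
  f.id,
  f.carrier,
  TIMESTAMPDIFF(MINUTE, f.plannedDep, f.actualDep) AS departureDiff,
  TIMESTAMPDIFF(MINUTE, f.plannedArr, f.actualArr) AS arrivalDiff,
  f.eventTime
FROM flights AS f;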
23. Aggregated Data
BEGIN STATEMENT SET;
INSERT INTO `alexey-streamhouse`.`default`.flight_overall_delays
SELECT
  carrier,
  SUM(IF(departureDiff < 0, 0, departureDiff)) AS departureDelay,
  SUM(IF(arrivalDiff < 0, 0, arrivalDiff)) AS arrivalDelay,
  NOW()
FROM `alexey-streamhouse`.`default`.flights_enriched
GROUP BY carrier;
END;
24. Takeaways
• Streamhouse brings fast and cost-effective data processing at scale
• Streaming and batch are unified within one processing engine and storage format
• Flink CDC integrates your high-value data
• Streamhouse is faster than a Lakehouse, but slower than a real-time system
• Apache Paimon supports all the data warehousing tasks you require