Streamhouse Architecture with Flink and Paimon

This talk introduces data teams to Apache Paimon in combination with Apache Flink. Paimon was built with a strong focus on streaming workloads and serves as a table format in a lakehouse. It takes the stream processing approach in lakehouse architecture to the next level compared to other table formats, which are oriented more towards batch data. After this talk, data teams will know how to use Paimon and Flink to build a cost-efficient and fast data layer for different data processing scenarios.

Alexey Novakov

November 22, 2024

Transcript

1. Streamhouse Architecture with Flink and Paimon. Alexey Novakov, Product Solution Architect. BigData Europe 2024. Original Creators of Apache Flink®
2. In this talk: 01 Stream Processing with Apache Flink, 02 Intro to Apache Paimon, 03 Intro to Flink CDC, 04 Streaming Lakehouse Architecture a.k.a. Streamhouse, 05 Example Use Case, 06 Takeaways
3. Ververica, The Original Creators of Apache Flink®. [Timeline: Flink donated to Apache (2014), Apache Flink 1.0 (2016), milestones in 2017, 2019, 2023, and today (2024); years of experience in streaming for the enterprise.]
4. Apache Flink. Flink is an open source framework and distributed engine for stateful stream processing. Key properties: flexible APIs, fault tolerance, high performance, stateful processing, all on the Flink Runtime for stateful computations over data streams. Learn more: flink.apache.org
5. Use Cases. Flink provides a robust foundation for a wide range of use cases: Stateful Stream Processing (streams, state, time), Streaming Lakehouse (streams, SQL, Apache Paimon, CDC), and Analytics & Flink SQL (SQL, PyFlink, tables), all on the Flink Runtime for stateful computations over data streams. Learn more: flink.apache.org
6. Stateful Stream Processing. [Diagram: streaming and static sources feed a dataflow of transformations and computations that hold state and timers and write to sinks.]
7. DataStream API: Source, Transformation, Windowed Transformation, Sink.
val lines = env.fromSource(KafkaSource.builder[String]()....)
val events = lines.map(parse) // def parse(s: String): String = ???
val stats = events
  .keyBy("sensor")
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .aggregate(MyAggregationFunction())
stats.sinkTo(MyAlertSink(path))
Logical Dataflow Graph: Source -> Transform -> Window -> Sink
8. Real-Time Analytics with Flink SQL: lookup joins and continuous queries.
INSERT INTO ...
SELECT ... FROM ...
LEFT JOIN ... ON ...
GROUP BY ...
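To make this query shape concrete, here is a hedged Flink SQL sketch (not from the slides): a continuous aggregation that enriches a stream with a lookup join. All table and column names (orders, currency_rates, revenue_per_currency, proc_time) are invented for illustration, and orders is assumed to declare proc_time AS PROCTIME().

-- Hypothetical continuous query: enrich an orders stream via a lookup join
-- against a currency_rates dimension table, then aggregate per currency.
INSERT INTO revenue_per_currency
SELECT
  o.currency,
  SUM(o.amount * r.rate) AS revenue_eur
FROM orders AS o
LEFT JOIN currency_rates FOR SYSTEM_TIME AS OF o.proc_time AS r
  ON o.currency = r.currency
GROUP BY o.currency;

The result is an updating table: Flink emits changelog updates per currency as new orders arrive, which fits naturally with a sink that accepts upserts, such as a Paimon table.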
9. Apache Paimon Overview. An open table format based on an LSM (Log-Structured Merge) tree that brings large-scale real-time update capabilities to the data lakehouse.
10. Paimon Connector for Flink: (1) data catalog, (2) Streamhouse table.
CREATE CATALOG `vvc-streamhouse` WITH (
  'type' = 'paimon',
  'warehouse' = 's3a://vvc-flink-aws/streamhouse'
);

CREATE TABLE airlines (
  `carrier` STRING NOT NULL PRIMARY KEY NOT ENFORCED,
  eventTime TIMESTAMP(3)
) WITH (
  'changelog-producer' = 'full-compaction',
  'snapshot.time-retained' = '60s',
  'snapshot.num-retained.min' = '1',
  'snapshot.num-retained.max' = '5'
);
11. Paimon Connector for Flink (2): (3) Flink SQL job to copy data.
USE CATALOG `vvc-streamhouse`;
SET 'pipeline.name' = 'Copy Airlines';
SET 'execution.checkpointing.interval' = '10 s';
INSERT INTO airlines SELECT * FROM `system`.airlines.airlines_csv;
The Paimon table files land under s3://vvc-flink-aws/streamhouse/default.db/airlines/
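For illustration (not shown in the deck), the copied Paimon table can be read back either as a batch snapshot or as a continuous stream of changes; the scan.mode hint below is a sketch based on Paimon's read options and should be checked against your Paimon version.

-- Batch-style read of the latest snapshot
SELECT * FROM airlines;

-- Continuous read of new changes only (assumes dynamic table options are enabled)
SELECT * FROM airlines /*+ OPTIONS('scan.mode' = 'latest') */;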
12. Prominent Paimon features (and many others):
- Merge Engines: Deduplication, Partial Update, First Row, Automatic Aggregation
- Sequence Field: out-of-order event handling; the row with the largest sequence.field value is merged last
- Append-only table: append-only table as a message queue; bucket number; bucket key field
- Table snapshot tagging: automatic/manual tag creation; time-based; query historical data
- Lookup join: lookup with retry; async lookup; differs from Flink's lookup join
- Table management procedures: compaction; tags management; rollback to snapshot
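To make the merge engine and sequence field features concrete, here is a hedged DDL sketch. The table and column names are invented, and it assumes the Paimon catalog created earlier; the WITH options ('merge-engine', 'sequence.field', 'changelog-producer') are Paimon table options, but verify the exact values against your Paimon version.

-- Hypothetical Paimon table using the partial-update merge engine:
-- rows with the same primary key are merged field by field, and
-- 'sequence.field' decides which version wins for out-of-order events.
CREATE TABLE customer_profile (
  customer_id STRING NOT NULL PRIMARY KEY NOT ENFORCED,
  email       STRING,
  phone       STRING,
  updated_at  TIMESTAMP(3)
) WITH (
  'merge-engine' = 'partial-update',
  'sequence.field' = 'updated_at',
  'changelog-producer' = 'lookup'
);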
13. Paimon has been the missing piece for organizing a Lakehouse with end-to-end streaming workflows: Source -> …. -> Final consumer
14. Streaming Data Movement. Flink CDC enables non-stop ingestion use cases: data backup, disaster recovery, data distribution, and data warehouse/lake ingestion.
15. Flink CDC APIs:
- DataStream API: map, filter, keyBy, aggregate, join, flatMap
- SQL API: WHERE, GROUP BY, JOIN, Top-N, INSERT, SELECT
- YAML API: Source, Route, Sink, Pipeline, Table ID, Transform
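As an illustration of the SQL API (not taken from the slides), a CDC source table can be declared with the mysql-cdc connector. Host, credentials, schema, and the orders_mirror sink table are placeholders; the connector options themselves are standard Flink CDC MySQL options.

-- Hypothetical MySQL CDC source table (Flink CDC SQL connector)
CREATE TABLE orders_cdc (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10, 2),
  order_time  TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql.example.com',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders'
);

-- Continuously replicate the change stream into a (pre-created) Paimon table
INSERT INTO orders_mirror SELECT * FROM orders_cdc;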
16. Main Features:
- Lock-free reading of data transactions (change logs)
- Change data and change schema capture
- Delivery guarantees: exactly-once
- Automatic task allocation (from snapshot to incremental mode)
- Full database sync scenario (DB 1 -> DB 2: data and schema sync)
17. Flink CDC 2.0: Incremental Snapshot Algorithm. [Diagram: a no-lock algorithm in which parallel tasks (Task1-Task3) scan snapshot chunks (chunk1-chunk3) over JDBC connections and then switch automatically to a changelog dump task (Task4) reading the binlog.]
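Building on the hypothetical orders_cdc source above, the incremental snapshot phase can be tuned through connector options; the option names come from the Flink CDC MySQL connector, while the table, column, and values below are illustrative only.

-- Illustrative tuning of the incremental snapshot scan (values are examples)
CREATE TABLE orders_cdc_tuned (
  order_id BIGINT,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql.example.com',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders',
  -- size of the chunks scanned in parallel during the snapshot phase
  'scan.incremental.snapshot.chunk.size' = '8096',
  -- column used to split the table into chunks
  'scan.incremental.snapshot.chunk.key-column' = 'order_id'
);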
18. Flink CDC opens up streaming use cases for high-value transactional data from RDBMSs.
19. Combine all pieces: Streamhouse. Streamhouse is a data platform architecture that provides fast and cost-efficient data processing:
- an engine to process stream and batch data (Apache Flink)
- a unified batch & stream data format to organize SQL tables on top of tiered storage (Apache Paimon)
- a CDC/ETL framework to process data from an RDBMS or other sources as a data stream (Flink CDC)
20. Streamhouse and its Place. [Diagram: Streamhouse (Flink CDC, Paimon, Flink) sits between the Lakehouse (Iceberg, Spark: low cost, high latency, low business value) and the Real-Time System (Flink, Kafka: high cost, low latency, high business value).] Benefits: infrastructure cost saving, Flink SQL, direct RDBMS ingestion, streaming upsert support, fast table reads and writes.
21. Use case: Airline Delays (Paimon SDK). Let's calculate the total number of minutes each airline delayed its flights.
22. Flink SQL: DDL
CREATE TABLE `flights` (
  `id` VARCHAR(64) NOT NULL,
  `carrier` VARCHAR(2),
  `eventTime` TIMESTAMP(3),
  `plannedDep` TIMESTAMP(3),
  `actualDep` TIMESTAMP(3),
  `plannedArr` TIMESTAMP(3),
  `actualArr` TIMESTAMP(3),
  CONSTRAINT `PK_id` PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
  'changelog-producer' = 'input',
  'sink.parallelism' = '4',
  'snapshot.time-retained' = '60s',
  'snapshot.num-retained.min' = '1',
  'snapshot.num-retained.max' = '5',
  'full-compaction.delta-commits' = '2'
);
… similar code for all other tables here.
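The aggregation on the next slide reads a flights_enriched table that is not defined anywhere in the deck. The sketch below is an assumption of how its delay columns could be derived from the flights DDL above; the view name, the TIMESTAMPDIFF-based logic, and the choice of a temporary view (rather than a table in the `alexey-streamhouse` catalog) are all illustrative.

-- Hypothetical derivation of per-flight delay minutes from planned vs. actual times
CREATE TEMPORARY VIEW flights_enriched AS
SELECT
  f.id,
  f.carrier,
  TIMESTAMPDIFF(MINUTE, f.plannedDep, f.actualDep) AS departureDiff,
  TIMESTAMPDIFF(MINUTE, f.plannedArr, f.actualArr) AS arrivalDiff,
  f.eventTime
FROM flights AS f;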
23. Aggregated Data
BEGIN STATEMENT SET;
INSERT INTO `alexey-streamhouse`.`default`.flight_overall_delays
SELECT
  carrier,
  SUM(IF(departureDiff < 0, 0, departureDiff)) AS departureDelay,
  SUM(IF(arrivalDiff < 0, 0, arrivalDiff)) AS arrivalDelay,
  NOW()
FROM `alexey-streamhouse`.`default`.flights_enriched
GROUP BY carrier;
END;
24. Takeaways
• Streamhouse brings fast and cost-effective data processing at scale
• Streaming and batch are unified within one processing engine and storage format
• Flink CDC integrates your high-value data
• Streamhouse is faster than a Lakehouse, but slower than a real-time system
• Apache Paimon supports all the data warehousing tasks you require