Slide 1

Flink SQL: The Challenges of Implementing a Streaming SQL Engine
Jark Wu, Flink SQL Team Leader @ Alibaba, Apache Flink PMC Member

Slide 2

About me: Jark Wu
● Long-term Flinker since 2016
● Core contributor of Flink SQL
● Apache Flink PMC member and Committer
● Flink SQL Team Leader @ Alibaba
● Creator of the flink-cdc-connectors project (4.6k GitHub stars)

Slide 3

What is Flink SQL?
[Diagram: Flink SQL provides unified streaming and batch processing, connecting messaging systems, databases, lakehouses (Iceberg, Paimon, Hudi), OLAP systems (e.g., ClickHouse), and non-SQL systems, powering data integration, analytics, ETL, and event-driven applications.]

Slide 4

Flink SQL in the API Stack
[Diagram: the API stack, from top to bottom]
● Table / SQL API
● Optimizer / Planner
● DataStream API
● Low-level Stream Operator API
● Dataflow Runtime

Slide 5

Flink SQL vs. DataStream: the same windowed average in both APIs.

Flink SQL:

    CREATE TABLE sensors (
        room BIGINT,
        temperature DOUBLE,
        rowtime TIMESTAMP(3),
        WATERMARK FOR rowtime AS rowtime - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        ...
    );

    SELECT room, TUMBLE_END(rowtime, INTERVAL '1' MINUTE), AVG(temperature)
    FROM sensors
    GROUP BY room, TUMBLE(rowtime, INTERVAL '1' MINUTE);

DataStream:

    KafkaSource<Sensor> source = KafkaSource.<Sensor>builder()
        .setValueOnlyDeserializer(new SensorSchema())
        ...
        .build();

    WatermarkStrategy<Sensor> watermark = WatermarkStrategy
        .<Sensor>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((sensor, timestamp) -> sensor.rowtime);

    env.fromSource(source, watermark, "Kafka Sensor Source")
        .keyBy(Sensor::getRoom)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .aggregate(new MyAverageAggregate());

    class MyAverageAggregate implements AggregateFunction { ... }

Slide 6

Watermark, Window and Challenges
● Flink uses Watermarks to track completeness in stream processing.
● A Watermark asserts that all of the data for a window has arrived.
● However, a Watermark is a tradeoff among Correctness, Latency and Cost [1]: a slow watermark delays results, a fast watermark sees incomplete data, and dropping old state trades Correctness for Cost.
● Challenge: a big gap between the actual watermark and the ideal watermark means late data is lost. (See the sketch below.)

[1] Tyler Akidau et al. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing.
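
To make the tradeoff concrete: in Flink SQL the knob is the watermark bound in the DDL. A minimal sketch, reusing the sensors schema from the earlier slide (connector options elided as before):

    -- Loose bound: waits 60 seconds for stragglers, so fewer events are
    -- dropped (Correctness), but window results are delayed by up to 60
    -- seconds (Latency). A tight bound such as INTERVAL '1' SECOND inverts
    -- the tradeoff: fast results, but more late data is lost.
    CREATE TABLE sensors (
        room BIGINT,
        temperature DOUBLE,
        rowtime TIMESTAMP(3),
        WATERMARK FOR rowtime AS rowtime - INTERVAL '60' SECOND
    ) WITH (
        'connector' = 'kafka',
        ...
    );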

Slide 7

Handle Late Data (1): Late-Arrival & Early-Fire
● Retain window state after the window fires on the watermark => adds a little Cost.
● Fire for every late record => Correctness without dropping data.
● Fire before the window closes => low-Latency results.
● A better tradeoff among Cost, Correctness and Latency, with no need to fine-tune the watermark.
● Challenge: emits updates instead of an append-only result.

Configuration to enable Early-Fire and Late-Arrival in Flink SQL:

    -- how long state is retained after the window closes, to process late data
    SET table.exec.emit.allow-lateness = 1h;
    -- fire early before the window closes, and fire late after it closes
    SET table.exec.emit.early-fire.enabled = true;
    SET table.exec.emit.early-fire.delay = 1m;
    SET table.exec.emit.late-fire.enabled = true;
    SET table.exec.emit.late-fire.delay = 1m;

Slide 8

Handle Late Data (2): Non-window Aggregation
● Another alternative: non-window aggregation (more built-in optimizations).
● No watermark and no dropped data; can group by arbitrary columns.
● Correctness: late data is handled correctly as long as it arrives before the state TTL expires.
● Latency: can trigger output for every record; minibatch settings balance latency against throughput (see the sketch below).
● Cost: state is automatically cleaned up after the configured TTL.
● Challenge: emits updates instead of an append-only result.

Window Aggregation:

    SELECT room, TUMBLE_END(rowtime, INTERVAL '1' MINUTE), AVG(temperature)
    FROM sensors
    GROUP BY room, TUMBLE(rowtime, INTERVAL '1' MINUTE);

Non-window Aggregation:

    SELECT room, DATE_FORMAT(rowtime, 'yyyy-MM-dd HH:mm'), AVG(temperature)
    FROM sensors
    GROUP BY room, DATE_FORMAT(rowtime, 'yyyy-MM-dd HH:mm');
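
The TTL and minibatch knobs mentioned above are ordinary table configuration options; a minimal sketch of typical settings (the values are illustrative, not recommendations):

    -- Cost: drop per-key aggregation state that has been idle for 1 hour.
    SET 'table.exec.state.ttl' = '1 h';

    -- Latency vs. throughput: buffer inputs and fire at most once every
    -- 5 seconds or 5000 records, instead of emitting an update per record.
    SET 'table.exec.mini-batch.enabled' = 'true';
    SET 'table.exec.mini-batch.allow-latency' = '5 s';
    SET 'table.exec.mini-batch.size' = '5000';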

Slide 9

What is the Challenge of Updates?
● Sinks: receive duplicate events for the same key.
● Sinks need to support idempotent updates (e.g., Postgres, data lakes); see the sketch below.
● Sources: how to connect external changelogs natively?
  ○ flink-cdc-connectors project: mysql-cdc, pg-cdc, mongo-cdc, oracle-cdc
  ○ Thanks, Debezium!
● Operators: how to propagate changelog events correctly?
● The answer: Stream-Table Duality & the Changelog Mechanism.
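
A minimal sketch of both ends of such a pipeline, with hypothetical connection options: a MySQL CDC source from the flink-cdc-connectors project, and a JDBC sink that becomes an idempotent upsert sink once a primary key is declared:

    -- Changelog source: reads the MySQL binlog as +I/-U/+U/-D events.
    CREATE TABLE orders_cdc (
        id      BIGINT,
        product STRING,
        user_id BIGINT,
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'mysql.example.com',  -- hypothetical endpoint
        'username' = 'flink',
        'password' = 'secret',
        'database-name' = 'shop',
        'table-name' = 'orders'
    );

    -- Upsert sink: the declared primary key makes the JDBC connector write
    -- idempotent upserts (INSERT .. ON DUPLICATE KEY UPDATE on MySQL,
    -- INSERT .. ON CONFLICT on Postgres) instead of blind appends.
    CREATE TABLE orders_sink (
        id      BIGINT,
        product STRING,
        user_id BIGINT,
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://pg.example.com:5432/analytics',  -- hypothetical
        'table-name' = 'orders'
    );

    INSERT INTO orders_sink SELECT * FROM orders_cdc;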

Slide 10

Handle Updates (1): Stream-Table Duality
● Changes on a Table produce a Changelog Stream.
● A Changelog Stream can be materialized back into a Table.
● Flink calls this equivalent conversion a Dynamic Table, which is similar to a Materialized View.
● Sources, Operators, and Sinks all work on the Changelog Stream.

Short Name | Long Name     | Description
+I         | Insert        | Insertion operation.
-U         | Update Before | Update operation with the previous content.
+U         | Update After  | Update operation with the new content.
-D         | Delete        | Deletion operation.

These row kinds map easily from external changelogs (MySQL binlog, PG transaction log) and to external DML operations (INSERT .. ON DUPLICATE KEY, DELETE FROM ..).
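
As a concrete illustration of the -U/+U pair (hypothetical table and rows): a single UPDATE on the source database becomes two events in the changelog stream:

    -- On the source MySQL database:
    UPDATE users SET username = 'u2' WHERE id = 1;

    -- In the resulting changelog stream:
    --   -U (id=1, username='u1')   update_before: the previous content
    --   +U (id=1, username='u2')   update_after:  the new content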

Slide 11

Handle Updates (2): Changelog Mechanism
The Problem: the result is incorrect if stream processing runs without a changelog mechanism.

The Source emits the words: Hello, World, Hello.

Word Count:

    SELECT word, COUNT(*) AS cnt FROM words GROUP BY word

The word_count table evolves to (Hello, 1 -> 2), (World, 1), and without a changelog the emitted stream is: (Hello, 1), (World, 1), (Hello, 2).

Count Frequency:

    SELECT cnt, COUNT(*) AS freq FROM word_count GROUP BY cnt

The result is cnt=1 with freq=2 and cnt=2 with freq=1. Incorrect! The freq of cnt=1 should be "1", because (Hello, 1) was superseded but never retracted.

Slide 12

Handle Updates (3): Changelog Mechanism
The Solution: stream processing with the changelog mechanism.

The Source emits: Hello +insert, World +insert, Hello +insert.

Word Count (SELECT word, COUNT(*) AS cnt FROM words GROUP BY word) emits:

    (Hello, 1)  +insert
    (World, 1)  +insert
    (Hello, 1)  -update_before
    (Hello, 2)  +update_after

Count Frequency (SELECT cnt, COUNT(*) AS freq FROM word_count GROUP BY cnt) now sees the retraction, so the frequency of cnt=1 correctly drops from 2 back to 1, giving the final result cnt=1 with freq=1 and cnt=2 with freq=1.
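
End to end, the example is just two cascaded aggregations; a minimal runnable sketch, assuming a words table with a single word column:

    -- The outer aggregation consumes the changelog (+I/-U/+U) of the inner
    -- one, so the frequency of cnt=1 is corrected when Hello advances to 2.
    SELECT cnt, COUNT(*) AS freq
    FROM (
        SELECT word, COUNT(*) AS cnt
        FROM words
        GROUP BY word
    ) AS word_count
    GROUP BY cnt;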

Slide 13

Handle Updates (4): Changelog Mechanism
● The Changelog Mechanism is how Flink generates and processes changelogs.
● It makes results correct and lets streaming SQL support arbitrary queries (even a 10-way left join).
● It is a building block of Flink SQL, and a big difference from other systems.
● It is handled by Flink and is transparent to users.
● Challenge: event ordering and deterministic computation are critical for changelogs.

Slide 14

What is the Challenge of Event Ordering?

Example: Order(id, product, user_id) joins User(id, username) into OrderSink(id, product, username). Order events are shuffled to the join tasks by order.user_id, User events by user.id, and the join output is shuffled to the Sink Task by order.id.

The Order changelog is:

    +I (id=1, product='apple', user_id=11)
    -U (id=1, product='apple', user_id=11)
    +U (id=1, product='apple', user_id=12)

Because user_id changes from 11 to 12, the -U and +U travel through different join tasks, so the sink may receive:

    +U (id=1, user_id=12, username='u2')
    +I (id=1, user_id=11, username='u1')
    -U (id=1, user_id=11, username='u1')

Incorrect! There is no row for order id=1 in the result table, because the out-of-order update_before arrives last and deletes the row.
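
A minimal sketch of the query shape that triggers this, with hypothetical table definitions: the sink's primary key (id) differs from the join key (user_id), which is exactly what lets the retraction travel through a different join task than the matching update:

    CREATE TABLE order_sink (
        id       BIGINT,
        product  STRING,
        username STRING,
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        ...
    );

    INSERT INTO order_sink
    SELECT o.id, o.product, u.username
    FROM Orders AS o
    JOIN Users AS u
      ON o.user_id = u.id;  -- shuffled by user_id, not by the sink key id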

Slide 15

Handle Event Ordering (1): SinkUpsertMaterializer
● SinkUpsertMaterializer: a stateful operator added before the sink [FLINK-20374].
● SinkUpsertMaterializer reorders events on the sink primary key with the help of state.

Topology: Sources -> Join (shuffled by join_on_key) -> SinkUpsertMaterializer -> Sink (shuffled by sink_primary_key).
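
Whether the planner inserts this operator is controlled by a table option (a minimal sketch; AUTO is the default):

    -- AUTO (default): the planner adds SinkUpsertMaterializer only when it
    -- detects that upstream changelog events may be disordered on the sink's
    -- primary key. FORCE always adds it; NONE never does (use NONE only if
    -- ordering is guaranteed, since the operator costs extra state).
    SET 'table.exec.sink.upsert-materialize' = 'AUTO';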

Slide 16

Handle Event Ordering (2): SinkUpsertMaterializer
● Tracks event ordering in state, keeping a list of rows per sink key (conceptually a ValueState<List<RowData>>).
● Reorders events by working like Deduplication with Last Row:
  ○ For ADD events (+I and +U): add the event to the list state and forward-emit the ADD event.
  ○ For RETRACT events (-D and -U): remove the event from the list state, then
    ■ re-emit the new last ADD row if the removed event was the last row,
    ■ forward-emit the RETRACT event if the list state is now empty,
    ■ no-op otherwise.

Example, step 1 (Insert), using the out-of-order events from the previous slide: +U (id=1, user_id=12, username='u2') arrives => added to the empty list and emitted; the sink inserts row (1, 12, 'u2').

Slide 17

Handle Event Ordering (2): SinkUpsertMaterializer (cont.)

Example, step 2 (Upsert): +I (id=1, user_id=11, username='u1') arrives => added to the list after (12, 'u2') and emitted; the sink upserts the row to (1, 11, 'u1').

Slide 18

Handle Event Ordering (2): SinkUpsertMaterializer (cont.)

Example, step 3 (Upsert): -U (id=1, user_id=11, username='u1') arrives => removed from the list; it was the last row, so the new last row is re-emitted as +U (id=1, user_id=12, username='u2'); the sink upserts back to (1, 12, 'u2'), the correct final result.

Slide 19

What is the Challenge of Non-determinism?

Example: a Changelog Source is lookup-joined against a dimension table dim(id, value), and the result feeds SUM(dim.value). The value for dim id=11 changes over time: 1 -> 3 -> 1.

    Input changelog:                          Lookup result:
    +I (id=1, product='a1', dim_id=11)   ->   +I (id=1, dim_value=1)
    -U (id=1, product='a1', dim_id=11)   ->   -U (id=1, dim_value=3)
    +U (id=1, product='a2', dim_id=11)   ->   +U (id=1, dim_value=1)

SUM(dim.value) = +1 - 3 + 1 = -1. Incorrect! A sum of positive values should never be negative. The lookup is non-deterministic on retractions: the retraction row is not equal to the previously emitted row.
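
For reference, the query shape here is an ordinary lookup join; a minimal sketch with hypothetical names, assuming orders has a processing-time attribute proc_time. The external dim table is re-queried for every input row, including retractions, which is where the non-determinism enters:

    SELECT SUM(d.dim_value)
    FROM orders AS o
    JOIN dim FOR SYSTEM_TIME AS OF o.proc_time AS d
      ON o.dim_id = d.id;
    -- The -U row triggers a fresh lookup; if dim.dim_value changed in the
    -- meantime, the retracted amount differs from the amount originally added.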

Slide 20

Handle Non-determinism: MaterializedLookupJoin
● MaterializedLookupJoin uses state to materialize the previously joined value, preventing non-determinism on retractions. [FLINK-27849]

With the lookup state holding the previously joined value, the same input now produces:

    +I (id=1, dim_value=1)
    -U (id=1, dim_value=1)   <- the retraction uses the materialized previous value
    +U (id=1, dim_value=1)

SUM(dim.value) = +1 - 1 + 1 = 1. Deterministic computation on retractions.

Slide 21

Summary
● The 4 challenges of implementing a streaming SQL engine: Late Data, Changelog, Event Ordering, and Non-determinism.
● It is all about improving and balancing the Correctness / Latency / Cost trade-off triangle:
  ○ Allowing late data improves Correctness with a little state Cost.
  ○ Window early fire improves Latency with more computation Cost.
  ○ Aggregation optimizations reduce the above Costs.
  ○ The Changelog Mechanism correctly supports updates in sources, sinks, and operators.
  ○ SinkUpsertMaterializer fixes event-ordering Correctness with state Cost.
  ○ MaterializedLookupJoin fixes non-deterministic-update Correctness with state Cost.
● Future plans aim to reduce the state Cost and to balance between Cost and Latency.

Slide 22

Thank you!
Jark Wu
@jarkwu