
Flink SQL: the Challenges of Implementing a Streaming SQL Engine

Jark Wu
November 14, 2023

This is a talk from Flink Forward Seattle 2023.

Flink SQL is a powerful tool for stream processing that allows users to write SQL queries over streaming data. However, building a streaming SQL engine is not an easy task. In this session, we will explore the challenges that arise when building a modern streaming SQL engine like Flink SQL.
We will discuss the following challenges and how Flink SQL resolves them:

- Late Data: Handling late-arriving data and guaranteeing result correctness.
- Change Data Ingestion and Processing: How to ingest change data from databases in real time and apply complex operations to the change events.
- Event Ordering: Shuffling may disrupt the order of data updates and produce incorrect results.
- Nondeterminism: Nondeterministic functions and external system lookups may behave differently on change data and produce incorrect results.

By the end of this session, you will better understand the challenges involved in building a streaming SQL engine and how to overcome them.

Transcript

  1. Flink Forward 2023 ©
    Flink SQL:
    the Challenges of
    Implementing a
    Streaming SQL Engine
    Jark Wu
    Flink SQL Team Leader @ Alibaba
    Apache Flink PMC Member


  2. Flink Forward 2023 ©
    About me: Jark Wu
    ● Long-term Flinker since 2016
    ● Core Contributor of Flink SQL
    ● Apache Flink PMC member and Committer
    ● Flink SQL Team Leader @ Alibaba
    ● Creator of the flink-cdc-connectors project (4.6k GitHub stars)


  3. Flink Forward 2023 ©
    What is Flink SQL?

    (Architecture diagram: Flink SQL provides streaming / batch processing
    for use cases including data integration, ETL, analytics, and
    event-driven applications, connecting messaging systems, databases,
    OLAP systems (e.g., ClickHouse), lakehouses (Iceberg, Paimon, Hudi),
    and non-SQL workloads as sources and sinks.)


  4. Flink Forward 2023 ©
    Flink SQL in the API Stack

    (API stack, top to bottom: Table / SQL API -> Optimizer / Planner ->
    DataStream API -> Low-level Stream Operator API -> Dataflow Runtime)


  5. Flink Forward 2023 ©
    Flink SQL:

    CREATE TABLE sensors (
        room BIGINT,
        temperature DOUBLE,
        rowtime TIMESTAMP(3),
        WATERMARK FOR rowtime AS rowtime - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        ...
    );

    SELECT
        room,
        TUMBLE_END(rowtime, INTERVAL '1' MINUTE),
        AVG(temperature)
    FROM sensors
    GROUP BY room, TUMBLE(rowtime, INTERVAL '1' MINUTE);

    DataStream:

    KafkaSource<Sensor> source = KafkaSource.<Sensor>builder()
        .setValueOnlyDeserializer(new SensorSchema())
        ...
        .build();

    WatermarkStrategy<Sensor> watermark = WatermarkStrategy
        .<Sensor>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((sensor, timestamp) -> sensor.rowtime);

    env.fromSource(source, watermark, "Kafka Sensor Source")
        .keyBy(Sensor::getRoom)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .aggregate(new MyAverageAggregate());

    class MyAverageAggregate implements AggregateFunction {
        ...
    }


  6. Flink Forward 2023 ©
    Watermark, Window and Challenges
    ● Flink uses Watermarks to track completeness in stream processing
    ● A watermark asserts that all of the data in a window has arrived
    ● However, the watermark is a tradeoff among Correctness, Latency, and Cost [1]
    ● Challenge: a big gap between the actual watermark and the ideal watermark => late data is lost

    [1] Tyler Akidau et al. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing.

    (Trade-off triangle: Correctness, Latency, Cost. A slow watermark delays results; a fast watermark risks incomplete data and dropping old state.)


  7. Flink Forward 2023 ©
    Handle Late Data (1): Late-Arrival & Early-Fire
    ● Retain window state after the window fires (past the window watermark) => adds a little Cost
    ● Fire for every late record => Correctness without dropping data
    ● Fire before the window closes => low-Latency results
    ● A better tradeoff among Cost, Correctness, and Latency; no need to fine-tune the watermark
    ● Challenge: emits updates instead of an append-only result
    -- how long state should be retained after window close to process late data
    SET table.exec.emit.allow-lateness = 1h;
    -- enables window to early fire before window close and late fire after window close
    SET table.exec.emit.early-fire.enabled = true;
    SET table.exec.emit.early-fire.delay = 1m;
    SET table.exec.emit.late-fire.enabled = true;
    SET table.exec.emit.late-fire.delay = 1m;
    Configuration to enable Early-Fire and Late-Arrival in Flink SQL


  8. Flink Forward 2023 ©
    Handle Late Data (2): non-window aggregation
    ● Another alternative: non-window aggregation (more built-in optimizations)
    ● No watermark and no dropped data; can group by arbitrary columns
    ● Correctness: late data is handled correctly until the state TTL expires
    ● Latency: can emit for every record; mini-batch configuration balances throughput (see the sketch after the queries)
    ● Cost: state is automatically cleaned up after the configured TTL
    ● Challenge: emits updates instead of an append-only result

    Window Aggregation:

    SELECT
        room,
        TUMBLE_END(rowtime, INTERVAL '1' MINUTE),
        AVG(temperature)
    FROM sensors
    GROUP BY room, TUMBLE(rowtime, INTERVAL '1' MINUTE);

    Non-window Aggregation:

    SELECT
        room,
        DATE_FORMAT(rowtime, 'yyyy-MM-dd HH:mm'),
        AVG(temperature)
    FROM sensors
    GROUP BY room, DATE_FORMAT(rowtime, 'yyyy-MM-dd HH:mm');
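    For illustration, a minimal configuration sketch of the state-TTL and mini-batch options referenced above; the concrete values are assumptions to tune per workload:

    SET 'table.exec.state.ttl' = '1 d';                 -- clean up idle state after 1 day
    SET 'table.exec.mini-batch.enabled' = 'true';       -- buffer records to reduce state access
    SET 'table.exec.mini-batch.allow-latency' = '5 s';  -- flush buffered records at least every 5 s
    SET 'table.exec.mini-batch.size' = '5000';          -- or once 5000 records are buffered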


  9. Flink Forward 2023 ©
    What is the Challenge of Updates?
    ● Sinks: receive duplicate events for the same key
    ● Need to support idempotent updates (e.g., Postgres, data lakes)
    ● Sources: how to connect external changelogs natively (see the sketch below)
    ○ flink-cdc-connectors project: mysql-cdc, pg-cdc, mongo-cdc, oracle-cdc
    ○ Thanks to Debezium
    ● Operators: how to propagate changelog events correctly
    ● Stream-Table Duality & Changelog Mechanism
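    For illustration, a minimal sketch of ingesting a MySQL changelog as a dynamic table with the mysql-cdc connector; the table schema and connection options are placeholders:

    CREATE TABLE orders (
        id BIGINT,
        product STRING,
        user_id BIGINT,
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'localhost',   -- placeholder
        'port' = '3306',            -- placeholder
        'username' = 'flink',       -- placeholder
        'password' = 'secret',      -- placeholder
        'database-name' = 'shop',   -- placeholder
        'table-name' = 'orders'     -- placeholder
    );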


  10. Flink Forward 2023 ©
    Handle Updates (1): Stream-Table Duality
    ● Changes on a Table produce a Changelog Stream
    ● A Changelog Stream can be materialized into a Table
    ● Flink calls this equivalent conversion a Dynamic Table, which is similar to a Materialized View
    ● Sources, Operators, and Sinks all work on Changelog Streams

    Short Name | Long Name     | Description
    +I         | Insert        | Insertion operation.
    -U         | Update Before | Update operation with the previous content.
    +U         | Update After  | Update operation with the new content.
    -D         | Delete        | Deletion operation.

    Easy to map from external changelogs (MySQL binlog, PG transaction log)
    and to external DML operations (INSERT .. ON DUPLICATE KEY, DELETE FROM ..), as sketched below.
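    As a sketch of the sink-side mapping (connector options are placeholders): declaring a primary key on the sink table lets Flink translate +I/+U events into idempotent upserts and -D events into deletes:

    CREATE TABLE word_count_sink (
        word STRING,
        cnt BIGINT,
        PRIMARY KEY (word) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://localhost:5432/db',  -- placeholder
        'table-name' = 'word_count'                     -- placeholder
    );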


  11. Flink Forward 2023 ©
    Handle Updates (2): Changelog Mechanism

    The Problem: results are incorrect if stream processing runs without a changelog mechanism.

    Source words: Hello, World, Hello

    Word Count:
    SELECT word, COUNT(*) AS cnt
    FROM words
    GROUP BY word;

    word_count table: Hello 1 -> 2, World 1
    Emitted records: (Hello, 1), (World, 1), (Hello, 2)

    Count Frequency:
    SELECT cnt, COUNT(*) AS freq
    FROM word_count
    GROUP BY cnt;

    Result: cnt=1 has freq=2 (should be 1), cnt=2 has freq=1.
    Incorrect! The stale record (Hello, 1) was never retracted, so it is still counted under cnt=1.


  12. Flink Forward 2023 ©
    Handle Updates (3): Changelog Mechanism

    The Solution: stream processing with the changelog mechanism.

    Source changelog: +I Hello, +I World, +I Hello

    Word Count:
    SELECT word, COUNT(*) AS cnt
    FROM words
    GROUP BY word;

    word_count table: Hello 1 -> 2, World 1
    Emitted changelog:
    +I (Hello, 1)
    +I (World, 1)
    -U (Hello, 1)   -- update_before retracts the old value
    +U (Hello, 2)   -- update_after emits the new value

    Count Frequency:
    SELECT cnt, COUNT(*) AS freq
    FROM word_count
    GROUP BY cnt;

    Result: cnt=1 has freq 2 -> 1 once the retraction arrives, cnt=2 has freq=1. Correct.


  13. Flink Forward 2023 ©
    Handle Updates (4): Changelog Mechanism
    ● The Changelog Mechanism is the way Flink generates and processes changelogs.
    ● It makes results correct and lets streaming SQL support arbitrary queries (e.g., 10 cascading left joins), as in the pipeline sketch below.
    ● It is a building block of Flink SQL and a big difference from other systems.
    ● It is handled by Flink and is transparent to users.
    ● Challenge: event ordering and deterministic computation are critical for changelogs
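    As a sketch, the word-count example from the previous slides written as a single pipeline (assuming a words source table exists); Flink propagates the -U/+U events between the two aggregations automatically:

    CREATE VIEW word_count AS
    SELECT word, COUNT(*) AS cnt
    FROM words
    GROUP BY word;

    -- consumes word_count's changelog, so retractions keep freq correct
    SELECT cnt, COUNT(*) AS freq
    FROM word_count
    GROUP BY cnt;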


  14. Flink Forward 2023 ©
    What is the Challenge of Event Ordering?
    Tables: Order (id, product, user_id) joins User (id, username) into OrderSink (id, product, username).
    The two join inputs are shuffled by the join key (order.user_id and user.id), and the join output is shuffled by the sink key (order.id), spreading events for the same order across Join Task 1 and Join Task 2.

    Join input for order id=1:
    +I (id=1, product='apple', user_id=11)
    -U (id=1, product='apple', user_id=11)
    +U (id=1, product='apple', user_id=12)

    Because user_id changes from 11 to 12, the -U and +U are processed by different join tasks, and the Sink Task may receive their outputs out of order:
    +I (id=1, user_id=11, username='u1')
    +U (id=1, user_id=12, username='u2')
    -U (id=1, user_id=11, username='u1')

    Incorrect! There is no order_id=1 in the result table, because the update_before arrived out of order and deleted the row last.
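    For reference, a sketch of the query behind this diagram (table names are assumptions): the join shuffles by user_id while the sink's primary key is the order id, so nothing pins the event order in between:

    INSERT INTO OrderSink
    SELECT o.id, o.product, u.username
    FROM Orders AS o
    JOIN Users AS u ON o.user_id = u.id;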


  15. Flink Forward 2023 ©
    Handle Event Ordering (1): SinkUpsertMaterializer
    ● SinkUpsertMaterializer: a stateful operator added before the sink [FLINK-20374]
    ● SinkUpsertMaterializer reorders events on the sink primary key with the help of state

    (Topology: Source1 and Source2 are shuffled by the join key into the Join; the join output is shuffled by the sink primary key into SinkUpsertMaterializer -> Sink.)
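    The planner adds SinkUpsertMaterializer automatically when it detects possible disorder; a sketch of the option that controls this behavior:

    SET 'table.exec.sink.upsert-materialize' = 'AUTO';  -- NONE | AUTO (default) | FORCE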


  16. Flink Forward 2023 ©
    Handle Event Ordering (2): SinkUpsertMaterializer
    ● Tracks event ordering in state using the structure State<List<Row>>, keyed by the sink primary key
    ● Reorders events by working like Deduplication with Last Row:
    ○ For ADD events (+I and +U), add the event to the list state and forward-emit the ADD event
    ○ For RETRACT events (-D and -U), remove the event from the list state, then:
    ■ re-emit the new last ADD row if the removed event was the last row
    ■ forward-emit the RETRACT event if the list state becomes empty
    ■ otherwise, no-op

    Example (Insert):
    Input:  +U (id=1, user_id=12, username='u2')
    State:  <key=1, List{ (id=1, user_id=12, username='u2') }>
    Output: +I (id=1, user_id=12, username='u2')
    Sink table: id=1, user_id=12, username=u2
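    As an aside, the "Deduplication with Last Row" pattern it mimics can be written directly in Flink SQL; a sketch assuming a proctime attribute and the table name order_with_user:

    SELECT id, user_id, username
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY id          -- the sink primary key
                   ORDER BY proctime DESC   -- keep the latest row per key
               ) AS row_num
        FROM order_with_user
    )
    WHERE row_num = 1;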


  17. Flink Forward 2023 ©
    Handle Event Ordering (2): SinkUpsertMaterializer
    (rules repeated from the previous slide)

    Example (out-of-order upsert): the old row arrives after the newer +U has already been processed.
    Input:  +I (id=1, user_id=11, username='u1')
    State:  <key=1, List{ (id=1, user_id=12, username='u2'),
                          (id=1, user_id=11, username='u1') }>
    Output: +U (id=1, user_id=11, username='u1')
    Sink table: id=1, user_id 12 -> 11, username u2 -> u1


  18. Flink Forward 2023 ©
    Handle Event Ordering (2): SinkUpsertMaterializer
    (rules repeated from slide 16)

    Example (the out-of-order retraction arrives):
    Input:  -U (id=1, user_id=11, username='u1')
    State:  <key=1, List{ (id=1, user_id=12, username='u2') }>   -- the u1 row is removed from the list
    Output: +U (id=1, user_id=12, username='u2')                 -- re-emit the new last row
    Sink table: id=1, user_id 11 -> 12, username u1 -> u2 (the correct final result)


  19. Flink Forward 2023 ©
    What is the Challenge of Non-determinism?

    Pipeline: Changelog Source -> Lookup Join (on dim.id, fetching dim.value) -> SUM(dim.value)

    Changelog input:
    +I (id=1, product='a1', dim_id=11)
    -U (id=1, product='a1', dim_id=11)
    +U (id=1, product='a2', dim_id=11)

    The external dim table's value for dim_id=11 changes between lookups (1, then 3, then 1), so the lookup join emits:
    +I (id=1, dim_value=1)
    -U (id=1, dim_value=3)   -- the retraction is not equal to the previous row!
    +U (id=1, dim_value=1)

    SUM(dim.value): +1 - 3 + 1 = -1
    Incorrect! We should never see a negative sum result. The computation is non-deterministic on retractions.
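    For reference, a sketch of the lookup-join query behind this diagram (table and column names are assumptions); each record queries the external dim table at processing time:

    SELECT o.id, SUM(d.dim_value) AS dim_value
    FROM orders AS o
    JOIN dim FOR SYSTEM_TIME AS OF o.proctime AS d
        ON o.dim_id = d.id
    GROUP BY o.id;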


  20. Flink Forward 2023 ©
    Handle Non-determinism: MaterializedLookupJoin

    ● MaterializedLookupJoin uses state to materialize the previously joined lookup value, preventing non-determinism on retractions. [FLINK-27849]

    Pipeline: Changelog Source -> Materialized Lookup Join (with lookup state for the previously joined value) -> SUM(dim.value)

    Changelog input:
    +I (id=1, product='a1', dim_id=11)
    -U (id=1, product='a1', dim_id=11)
    +U (id=1, product='a2', dim_id=11)

    The lookup join now emits the materialized value on retraction, even though the dim table has changed (1, then 3, then 1):
    +I (id=1, dim_value=1)
    -U (id=1, dim_value=1)   -- reuses the materialized value from the +I
    +U (id=1, dim_value=1)

    SUM(dim.value): +1 - 1 + 1 = 1
    Deterministic computation on retractions.
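    A sketch of the planner option that enables this behavior (introduced with FLINK-27849): when set to TRY_RESOLVE, the planner detects non-deterministic update problems and materializes the lookup join where needed:

    SET 'table.optimizer.non-deterministic-update.strategy' = 'TRY_RESOLVE';  -- default: IGNORE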


  21. Flink Forward 2023 ©
    Summary
    ● The 4 challenges of implementing a streaming SQL engine:
    Late Data, Changelog, Event Ordering, Non-determinism.
    ● It’s all about how to improve and balance the trade-off triangle:
    ● Allowing late data improves Correctness with a little state Cost.
    ● Window early-fire improves Latency with more computation Cost.
    ● Aggregation optimizations reduce the above Costs.
    ● The Changelog Mechanism correctly supports updates in sources, sinks, and operators.
    ● SinkUpsertMaterializer fixes event-ordering Correctness with state Cost.
    ● MaterializedLookupJoin fixes non-deterministic-update Correctness with state Cost.
    ● Some future plans: reduce the state Cost and balance between Cost and Latency…
    (Trade-off triangle: Correctness, Latency, Cost)


  22. Flink Forward 2023 ©
    Thank you
    Jark Wu
    @jarkwu
