Flink SQL: the Challenges of Implementing a Streaming SQL Engine

Jark Wu
November 14, 2023

This is a talk given at Flink Forward Seattle 2023.

Flink SQL is a powerful tool for stream processing that allows users to write SQL queries over streaming data. However, building a streaming SQL engine is not an easy task. In this session, we will explore the challenges that arise when building a modern streaming SQL engine like Flink SQL.
We will discuss the following challenges and how Flink SQL resolves them:

- Late Data: handling late-arriving data while guaranteeing result correctness.
- Change Data Ingestion and Processing: ingesting change data from databases in real time and applying complex operations to the change events.
- Event Ordering: shuffles may disrupt the order of data updates and produce incorrect results.
- Nondeterminism: nondeterministic functions and external system lookups may produce different results on change data, yielding incorrect results.

By the end of this session, you will better understand the challenges involved in building a streaming SQL engine and how to overcome them.


Transcript

  1. Flink SQL: the Challenges of Implementing a Streaming SQL Engine
     Jark Wu, Flink SQL Team Leader @ Alibaba, Apache Flink PMC Member

  2. About me: Jark Wu
     • Long-term Flinker since 2016
     • Core Contributor of Flink SQL
     • Apache Flink PMC Member and Committer
     • Flink SQL Team Leader @ Alibaba
     • Creator of the flink-cdc-connectors project (4.6k GitHub stars)

  3. What is Flink SQL?
     [Diagram: Flink SQL provides unified streaming / batch processing across messaging systems, databases, non-SQL systems, OLAP stores such as ClickHouse, and lakehouses (Iceberg, Paimon, Hudi), serving data integration, analytics, ETL, and event-driven use cases.]

  4. Flink SQL in the API Stack
     [Diagram of the API stack, top to bottom: Table / SQL API with Optimizer / Planner, DataStream API, low-level Stream Operator API, Dataflow Runtime.]

  5. Flink SQL vs. DataStream
     Flink SQL:
       CREATE TABLE sensors (
         room BIGINT,
         temperature DOUBLE,
         rowtime TIMESTAMP(3),
         WATERMARK FOR rowtime AS rowtime - INTERVAL '5' SECOND
       ) WITH (
         'connector' = 'kafka',
         ...
       );

       SELECT room, TUMBLE_END(rowtime, INTERVAL '1' MINUTE), AVG(temperature)
       FROM sensors
       GROUP BY room, TUMBLE(rowtime, INTERVAL '1' MINUTE);

     DataStream:
       KafkaSource<Sensor> source = KafkaSource.<Sensor>builder()
           .setValueOnlyDeserializer(new SensorSchema())
           ...
           .build();

       WatermarkStrategy<Sensor> watermark = WatermarkStrategy
           .<Sensor>forBoundedOutOfOrderness(Duration.ofSeconds(5))
           .withTimestampAssigner((sensor, timestamp) -> sensor.rowtime);

       env.fromSource(source, watermark, "Kafka Sensor Source")
           .keyBy(Sensor::getRoom)
           .window(TumblingEventTimeWindows.of(Time.minutes(1)))
           .aggregate(new MyAverageAggregate());

       class MyAverageAggregate implements AggregateFunction { ... }

  6. Watermark, Window and Challenges
     • Flink uses watermarks to track completeness in stream processing
     • A watermark asserts that all of the data belonging to a window has arrived
     • However, the watermark is a tradeoff among Correctness, Latency and Cost [1]
     • Challenge: a big gap between the actual watermark and the ideal watermark => late data is lost
     [Diagram: the trade-off triangle of Correctness, Latency and Cost — a slow watermark delays results, a fast watermark sees incomplete data, and dropping old state reduces Cost.]
     [1] Tyler Akidau et al. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

  7. Handle Late Data (1): Late-Arrival & Early-Fire
     • Retain window state after the window fires (window watermark) => adds a little Cost
     • Fire for every late record => Correctness without dropping data
     • Fire before the window closes => low-Latency results
     • A better tradeoff among Cost, Correctness and Latency, with no need to fine-tune the watermark
     • Challenge: emits updates instead of an append-only result

     Configuration to enable Early-Fire and Late-Arrival in Flink SQL:
       -- how long state should be retained after the window closes, to process late data
       SET table.exec.emit.allow-lateness = 1h;
       -- enable the window to fire early before the window closes and fire late after it closes
       SET table.exec.emit.early-fire.enabled = true;
       SET table.exec.emit.early-fire.delay = 1m;
       SET table.exec.emit.late-fire.enabled = true;
       SET table.exec.emit.late-fire.delay = 1m;

  8. Handle Late Data (2): Non-window Aggregation
     • Another alternative: non-window aggregation (more built-in optimizations)
     • No watermark and no dropped data; can group by arbitrary columns
     • Correctness: late data is handled correctly before the state TTL expires
     • Latency: can trigger for every record; minibatch config balances throughput
     • Cost: state is automatically cleaned up after the configured TTL
     • Challenge: emits updates instead of an append-only result

     Window aggregation:
       SELECT room, TUMBLE_END(rowtime, INTERVAL '1' MINUTE), AVG(temperature)
       FROM sensors
       GROUP BY room, TUMBLE(rowtime, INTERVAL '1' MINUTE);

     Non-window aggregation:
       SELECT room, DATE_FORMAT(rowtime, 'yyyy-MM-dd HH:mm'), AVG(temperature)
       FROM sensors
       GROUP BY room, DATE_FORMAT(rowtime, 'yyyy-MM-dd HH:mm');

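     The minibatch and state-TTL knobs referenced above are ordinary session options. A minimal sketch follows; the option keys are Flink's, but the values are illustrative and not from the talk:

       -- clean up idle per-key aggregation state after the configured TTL (illustrative value)
       SET 'table.exec.state.ttl' = '1 d';
       -- buffer input records and fire the aggregation in small batches,
       -- trading a little latency for higher throughput (illustrative values)
       SET 'table.exec.mini-batch.enabled' = 'true';
       SET 'table.exec.mini-batch.allow-latency' = '5 s';
       SET 'table.exec.mini-batch.size' = '5000';
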
  9. What is the Challenge of Updates?
     • Sinks: receive duplicate events for the same key
       ◦ Need to support idempotent updates (e.g., Postgres, data lakes)
     • Sources: how to connect external changelogs natively
       ◦ flink-cdc-connectors project: mysql-cdc, pg-cdc, mongo-cdc, oracle-cdc
       ◦ Thanks, Debezium
     • Operators: how to propagate changelog events correctly
     • Stream-Table Duality & Changelog Mechanism

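     As a sketch of what native changelog ingestion looks like, a mysql-cdc source table can be declared as below; the schema and all connection parameters are hypothetical:

       -- hypothetical schema and connection settings, for illustration only
       CREATE TABLE orders (
         id BIGINT,
         product STRING,
         user_id BIGINT,
         PRIMARY KEY (id) NOT ENFORCED
       ) WITH (
         'connector' = 'mysql-cdc',   -- from the flink-cdc-connectors project
         'hostname' = 'localhost',
         'port' = '3306',
         'username' = 'flink',
         'password' = '******',
         'database-name' = 'shop',
         'table-name' = 'orders'
       );
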
 10. Handle Updates (1): Stream-Table Duality
     • Changes on a Table produce a Changelog Stream
     • A Changelog Stream can be materialized into a Table
     • Flink calls this equivalent conversion a Dynamic Table, which is similar to a Materialized View
     • Sources, Operators and Sinks all work on the Changelog Stream

     Short Name | Long Name     | Description
     -----------+---------------+---------------------------------------------
     +I         | Insert        | Insertion operation.
     -U         | Update Before | Update operation with the previous content.
     +U         | Update After  | Update operation with the new content.
     -D         | Delete        | Deletion operation.

     These row kinds map easily from external changelogs (MySQL binlog, PG transaction log) and to external DML operations (INSERT .. ON DUPLICATE KEY, DELETE FROM ..).

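     For intuition, an upsert sink on MySQL would translate the row kinds into DML roughly as follows (illustrative statements, not actual Flink output):

       -- +I and +U on the key become an idempotent upsert
       INSERT INTO word_count (word, cnt) VALUES ('Hello', 2)
         ON DUPLICATE KEY UPDATE cnt = VALUES(cnt);
       -- -D on the key becomes a delete
       DELETE FROM word_count WHERE word = 'Hello';
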
 11. Handle Updates (2): Changelog Mechanism
     The Problem: results are incorrect if stream processing runs without a changelog mechanism.
     Source words: Hello, World, Hello
     Word Count:      SELECT word, COUNT(*) AS cnt FROM words GROUP BY word
     Count Frequency: SELECT cnt, COUNT(*) AS freq FROM word_count GROUP BY cnt
     Without changelog semantics, Word Count emits append-only records (Hello, 1), (World, 1), (Hello, 2), so Count Frequency computes freq=2 for cnt=1 — it should be "1", because Hello's count was updated from 1 to 2.

 12. Handle Updates (3): Changelog Mechanism
     The Solution: stream processing with the changelog mechanism.
     With changelog semantics, Word Count emits:
       (Hello, 1)  +insert
       (World, 1)  +insert
       (Hello, 1)  -update_before
       (Hello, 2)  +update_after
     The -update_before event retracts (Hello, 1) from the downstream aggregation, so Count Frequency correctly updates the freq of cnt=1 from 2 to 1.

 13. Handle Updates (4): Changelog Mechanism
     • The Changelog Mechanism is how Flink generates and processes changelogs.
     • It makes results correct and lets streaming SQL support arbitrary queries (e.g., a 10-way left join).
     • It is a building block of Flink SQL and a big difference from other systems.
     • It is handled by Flink and is transparent to users.
     • Challenge: event ordering and deterministic computation are critical for changelogs.

 14. What is the Challenge of Event Ordering?
     [Diagram: an Order table (id, product, user_id) is joined with a User table (id, username). Order events are shuffled by order.user_id and user events by user.id onto Join Task 1 / Join Task 2; the join results are shuffled by order.id to the Sink Task writing OrderSink (id, product, username).]
     The order changelog is:
       +I (id=1, product='apple', user_id=11)
       -U (id=1, product='apple', user_id=11)
       +U (id=1, product='apple', user_id=12)
     Because user_id=11 and user_id=12 are joined on different tasks, the sink may receive:
       +U (id=1, user_id=12, username='u2')
       +I (id=1, user_id=11, username='u1')
       -U (id=1, user_id=11, username='u1')
     Incorrect! There is no order id=1 in the result table, because the update_before arrives out of order and deletes the row last.

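     A query producing this topology would look roughly like the following (a hypothetical reconstruction; table and column names follow the diagram, and OrderSink is assumed to declare PRIMARY KEY (id)):

       -- hypothetical reconstruction of the query behind the diagram
       INSERT INTO OrderSink
       SELECT o.id, o.product, u.username
       FROM Orders AS o
       JOIN Users AS u ON o.user_id = u.id;
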
 15. Handle Event Ordering (1): SinkUpsertMaterializer
     • SinkUpsertMaterializer: a stateful operator added before the sink [FLINK-20374]
     • SinkUpsertMaterializer reorders events on the sink primary key with the help of state
     [Diagram: sources are shuffled by the join key into the Join; the join output is shuffled by the sink primary key into SinkUpsertMaterializer -> Sink.]

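     The operator is controlled by a session option; a minimal sketch (with 'auto', the planner inserts the materializer only when it detects that the changelog may be disordered on the sink's primary key):

       SET 'table.exec.sink.upsert-materialize' = 'auto';   -- none | auto | force
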
 16. Handle Event Ordering (2): SinkUpsertMaterializer
     • Tracks event ordering in state using the structure State<sink_primary_key, List<input_row>>
     • Reorders events by working like deduplication with the last row:
       ◦ For ADD events (+I and +U): add the event to the list state and forward-emit the ADD event
       ◦ For RETRACT events (-D and -U): remove the event from the list state, then
         ▪ re-emit the new last ADD row if the removed event was the last row
         ▪ forward-emit the RETRACT event if the list state is now empty
         ▪ no-op otherwise
     [Step 1 — Insert: +U (id=1, user_id=12, username='u2') arrives; state becomes <id=1, {(id=1, user_id=12, username='u2')}>; the operator emits +I (id=1, user_id=12, username='u2'); the sink row is (1, 12, u2).]

 17. Handle Event Ordering (2): SinkUpsertMaterializer (cont.)
     [Step 2 — Upsert, same rules as above: the out-of-order +I (id=1, user_id=11, username='u1') arrives; state becomes <id=1, {(id=1, user_id=12, username='u2'), (id=1, user_id=11, username='u1')}>; the operator emits +U (id=1, user_id=11, username='u1'); the sink row changes from (1, 12, u2) to (1, 11, u1).]

 18. Handle Event Ordering (2): SinkUpsertMaterializer (cont.)
     [Step 3 — Retract, same rules as above: -U (id=1, user_id=11, username='u1') arrives; the removed event was the last row in the list, so the operator re-emits the new last row +U (id=1, user_id=12, username='u2'); state becomes <id=1, {(id=1, user_id=12, username='u2')}>; the sink row changes from (1, 11, u1) back to (1, 12, u2), the correct final result.]

 19. What is the Challenge of Non-determinism?
     [Diagram: a changelog source feeds a Lookup Join against a dimension table (dim.id, dim.value), followed by SUM(dim.value). The dimension value for id=11 changes over time: 1, then 3, then 1.]
     Changelog source:       +I (id=1, product='a1', dim_id=11), -U (id=1, product='a1', dim_id=11), +U (id=1, product='a2', dim_id=11)
     After the lookup join:  +I (id=1, dim_value=1), -U (id=1, dim_value=3), +U (id=1, dim_value=1)
     SUM(dim.value): +1 -3 +1 = -1. Incorrect! We should never see a negative sum of positive values.
     The cause is non-determinism on retractions: the retraction does not equal the previous row, because the lookup returned a different dim.value.

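     A lookup join of this shape is written with FOR SYSTEM_TIME AS OF; a hypothetical reconstruction, assuming the source declares a processing-time attribute proctime:

       -- hypothetical reconstruction of the lookup join behind the diagram
       SELECT SUM(d.`value`)
       FROM orders AS o
       JOIN dim FOR SYSTEM_TIME AS OF o.proctime AS d
         ON o.dim_id = d.id;
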
 20. Handle Non-determinism: MaterializedLookupJoin
     • MaterializedLookupJoin uses state to materialize the previously joined lookup value, preventing non-determinism on retractions [FLINK-27849]
     [Example, continued: with the lookup state holding the previously joined value, the join emits +I (id=1, dim_value=1), -U (id=1, dim_value=1), +U (id=1, dim_value=1), so SUM = +1 -1 +1 = 1 — deterministic computation on retractions.]

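     In released Flink versions, the related planner behavior is exposed as a session option; a minimal sketch (TRY_RESOLVE asks the planner to detect non-deterministic update problems and resolve them where it can, e.g., by materializing state):

       SET 'table.optimizer.non-deterministic-update.strategy' = 'TRY_RESOLVE';   -- IGNORE | TRY_RESOLVE
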
 21. Summary
     • The 4 challenges of implementing a streaming SQL engine: Late Data, Changelog, Event Ordering, Non-determinism.
     • It is all about how to improve and balance the trade-off triangle of Correctness, Latency and Cost:
       ◦ Allowing late data improves Correctness with a little state Cost.
       ◦ Window early-fire improves Latency with more computation Cost.
       ◦ Aggregation optimizations reduce the above Costs.
       ◦ The Changelog Mechanism correctly supports updates in sources, sinks and operators.
       ◦ SinkUpsertMaterializer fixes event-ordering Correctness with state Cost.
       ◦ MaterializedLookupJoin fixes non-deterministic-update Correctness with state Cost.
     • There are future plans to reduce the state Cost and balance between Cost and Latency…