
Flink SQL: the Challenges of Implementing a Streaming SQL Engine

Jark Wu
November 14, 2023

This is a talk from Flink Forward Seattle 2023.

Flink SQL is a powerful tool for stream processing that allows users to write SQL queries over streaming data. However, building a streaming SQL engine is not an easy task. In this session, we will explore the challenges that arise when building a modern streaming SQL engine like Flink SQL.
We will discuss the following challenges and how Flink SQL resolves them:

- Late Data: Handling late-arriving data and guaranteeing result correctness.
- Change Data Ingestion and Processing: How to ingest change data from databases in real time and apply complex operations to the change events.
- Event Ordering: Shuffling may disrupt the order of data updates and produce incorrect results.
- Nondeterminism: Nondeterministic functions and external system lookups may behave differently on change data and produce incorrect results.

By the end of this session, you will better understand the challenges involved in building a streaming SQL engine and how to overcome them.

Transcript

  1. Flink Forward 2023 ©
    Flink SQL:
    the Challenges of
    Implementing a
    Streaming SQL Engine
    Jark Wu
    Flink SQL Team Leader @ Alibaba
    Apache Flink PMC Member


  2. Flink Forward 2023 ©
    About me: Jark Wu
    ● Long-term Flinker since 2016
    ● Core Contributor of Flink SQL
    ● Apache Flink PMC member and Committer
    ● Flink SQL Team Leader @ Alibaba
    ● Creator of the flink-cdc-connectors project (4.6k GitHub stars)


  3. Flink Forward 2023 ©
    What is Flink SQL?

    (Architecture diagram: Flink SQL provides streaming / batch processing
    for use cases including data integration, ETL, analytics, and
    event-driven applications, connecting messaging systems, databases,
    OLAP systems (e.g., ClickHouse), lakehouses (Iceberg, Paimon, Hudi),
    and non-SQL workloads as sources and sinks.)


  4. Flink Forward 2023 ©
    Flink SQL in the API Stack

    (API stack, top to bottom: Table / SQL API -> Optimizer / Planner ->
    DataStream API -> Low-level Stream Operator API -> Dataflow Runtime)


  5. Flink Forward 2023 ©
    Flink SQL:

    CREATE TABLE sensors (
        room BIGINT,
        temperature DOUBLE,
        rowtime TIMESTAMP(3),
        WATERMARK FOR rowtime AS rowtime - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        ...
    );

    SELECT
        room,
        TUMBLE_END(rowtime, INTERVAL '1' MINUTE),
        AVG(temperature)
    FROM sensors
    GROUP BY room, TUMBLE(rowtime, INTERVAL '1' MINUTE);

    DataStream:

    KafkaSource<Sensor> source = KafkaSource.<Sensor>builder()
        .setValueOnlyDeserializer(new SensorSchema())
        ...
        .build();

    WatermarkStrategy<Sensor> watermark = WatermarkStrategy
        .<Sensor>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((sensor, timestamp) -> sensor.rowtime);

    env.fromSource(source, watermark, "Kafka Sensor Source")
        .keyBy(Sensor::getRoom)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .aggregate(new MyAverageAggregate());

    class MyAverageAggregate implements AggregateFunction {
        ...
    }


  6. Flink Forward 2023 ©
    Watermark, Window and Challenges
    ● Flink uses Watermarks to track completeness in stream processing
    ● A watermark asserts that all of the data in a window has arrived
    ● However, the watermark is a tradeoff among Correctness, Latency, and Cost [1]
    ● Challenge: a big gap between the actual watermark and the ideal watermark => late data is lost

    [1] Tyler Akidau et al. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing.

    (Trade-off triangle: Correctness, Latency, Cost. A slow watermark delays results; a fast watermark risks incomplete data and dropping old state.)


  7. Flink Forward 2023 ©
    Handle Late Data (1): Late-Arrival & Early-Fire
    ● Retain window state after the window fires (past the window watermark) => adds a little Cost
    ● Fire for every late record => Correctness without dropping data
    ● Fire before the window closes => low-Latency results
    ● A better tradeoff among Cost, Correctness, and Latency; no need to fine-tune the watermark
    ● Challenge: emits updates instead of an append-only result
    -- how long state should be retained after window close to process late data
    SET table.exec.emit.allow-lateness = 1h;
    -- enables window to early fire before window close and late fire after window close
    SET table.exec.emit.early-fire.enabled = true;
    SET table.exec.emit.early-fire.delay = 1m;
    SET table.exec.emit.late-fire.enabled = true;
    SET table.exec.emit.late-fire.delay = 1m;
    Configuration to enable Early-Fire and Late-Arrival in Flink SQL


  8. Flink Forward 2023 ©
    Handle Late Data (2): non-window aggregation
    ● Another alternative: non-window aggregation (more built-in optimizations)
    ● No watermark and no dropped data; can group by arbitrary columns
    ● Correctness: late data is handled correctly until the state TTL expires
    ● Latency: can emit for every record; mini-batch configuration balances throughput (see the sketch after the queries)
    ● Cost: state is automatically cleaned up after the configured TTL
    ● Challenge: emits updates instead of an append-only result

    Window Aggregation:

    SELECT
        room,
        TUMBLE_END(rowtime, INTERVAL '1' MINUTE),
        AVG(temperature)
    FROM sensors
    GROUP BY room, TUMBLE(rowtime, INTERVAL '1' MINUTE);

    Non-window Aggregation:

    SELECT
        room,
        DATE_FORMAT(rowtime, 'yyyy-MM-dd HH:mm'),
        AVG(temperature)
    FROM sensors
    GROUP BY room, DATE_FORMAT(rowtime, 'yyyy-MM-dd HH:mm');
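    For illustration, a minimal configuration sketch of the state-TTL and mini-batch options referenced above; the concrete values are assumptions to tune per workload:

    SET 'table.exec.state.ttl' = '1 d';                 -- clean up idle state after 1 day
    SET 'table.exec.mini-batch.enabled' = 'true';       -- buffer records to reduce state access
    SET 'table.exec.mini-batch.allow-latency' = '5 s';  -- flush buffered records at least every 5 s
    SET 'table.exec.mini-batch.size' = '5000';          -- or once 5000 records are buffered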


  9. Flink Forward 2023 ©
    What is the Challenge of Updates?
    ● Sinks: receive duplicate events for the same key
    ● Need to support idempotent updates (e.g., Postgres, data lakes)
    ● Sources: how to connect external changelogs natively (see the sketch below)
    ○ flink-cdc-connectors project: mysql-cdc, pg-cdc, mongo-cdc, oracle-cdc
    ○ Thanks to Debezium
    ● Operators: how to propagate changelog events correctly
    ● Stream-Table Duality & Changelog Mechanism
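    For illustration, a minimal sketch of ingesting a MySQL changelog as a dynamic table with the mysql-cdc connector; the table schema and connection options are placeholders:

    CREATE TABLE orders (
        id BIGINT,
        product STRING,
        user_id BIGINT,
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'localhost',   -- placeholder
        'port' = '3306',            -- placeholder
        'username' = 'flink',       -- placeholder
        'password' = 'secret',      -- placeholder
        'database-name' = 'shop',   -- placeholder
        'table-name' = 'orders'     -- placeholder
    );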


  10. Flink Forward 2023 ©
    Handle Updates (1): Stream-Table Duality
    ● Changes on a Table produce a Changelog Stream
    ● A Changelog Stream can be materialized into a Table
    ● Flink calls this equivalent conversion a Dynamic Table, which is similar to a Materialized View
    ● Sources, Operators, and Sinks all work on Changelog Streams

    Short Name | Long Name     | Description
    +I         | Insert        | Insertion operation.
    -U         | Update Before | Update operation with the previous content.
    +U         | Update After  | Update operation with the new content.
    -D         | Delete        | Deletion operation.

    Easy to map from external changelogs (MySQL binlog, PG transaction log)
    and to external DML operations (INSERT .. ON DUPLICATE KEY, DELETE FROM ..), as sketched below.
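    As a sketch of the sink-side mapping (connector options are placeholders): declaring a primary key on the sink table lets Flink translate +I/+U events into idempotent upserts and -D events into deletes:

    CREATE TABLE word_count_sink (
        word STRING,
        cnt BIGINT,
        PRIMARY KEY (word) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://localhost:5432/db',  -- placeholder
        'table-name' = 'word_count'                     -- placeholder
    );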


  11. Flink Forward 2023 ©
    Handle Updates (2): Changelog Mechanism

    The Problem: results are incorrect if stream processing runs without a changelog mechanism.

    Source words: Hello, World, Hello

    Word Count:
    SELECT word, COUNT(*) AS cnt
    FROM words
    GROUP BY word;

    word_count table: Hello 1 -> 2, World 1
    Emitted records: (Hello, 1), (World, 1), (Hello, 2)

    Count Frequency:
    SELECT cnt, COUNT(*) AS freq
    FROM word_count
    GROUP BY cnt;

    Result: cnt=1 has freq=2 (should be 1), cnt=2 has freq=1.
    Incorrect! The stale record (Hello, 1) was never retracted, so it is still counted under cnt=1.


  12. Flink Forward 2023 ©
    Handle Updates (3): Changelog Mechanism

    The Solution: stream processing with the changelog mechanism.

    Source changelog: +I Hello, +I World, +I Hello

    Word Count:
    SELECT word, COUNT(*) AS cnt
    FROM words
    GROUP BY word;

    word_count table: Hello 1 -> 2, World 1
    Emitted changelog:
    +I (Hello, 1)
    +I (World, 1)
    -U (Hello, 1)   -- update_before retracts the old value
    +U (Hello, 2)   -- update_after emits the new value

    Count Frequency:
    SELECT cnt, COUNT(*) AS freq
    FROM word_count
    GROUP BY cnt;

    Result: cnt=1 has freq 2 -> 1 once the retraction arrives, cnt=2 has freq=1. Correct.


  13. Flink Forward 2023 ©
    Handle Updates (4): Changelog Mechanism
    ● The Changelog Mechanism is the way Flink generates and processes changelogs.
    ● It makes results correct and lets streaming SQL support arbitrary queries (e.g., 10 cascading left joins), as in the pipeline sketch below.
    ● It is a building block of Flink SQL and a big difference from other systems.
    ● It is handled by Flink and is transparent to users.
    ● Challenge: event ordering and deterministic computation are critical for changelogs
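    As a sketch, the word-count example from the previous slides written as a single pipeline (assuming a words source table exists); Flink propagates the -U/+U events between the two aggregations automatically:

    CREATE VIEW word_count AS
    SELECT word, COUNT(*) AS cnt
    FROM words
    GROUP BY word;

    -- consumes word_count's changelog, so retractions keep freq correct
    SELECT cnt, COUNT(*) AS freq
    FROM word_count
    GROUP BY cnt;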


  14. Flink Forward 2023 ©
    What is the Challenge of Event Ordering?
    Tables: Order (id, product, user_id) joins User (id, username) into OrderSink (id, product, username).
    The two join inputs are shuffled by the join key (order.user_id and user.id), and the join output is shuffled by the sink key (order.id), spreading events for the same order across Join Task 1 and Join Task 2.

    Join input for order id=1:
    +I (id=1, product='apple', user_id=11)
    -U (id=1, product='apple', user_id=11)
    +U (id=1, product='apple', user_id=12)

    Because user_id changes from 11 to 12, the -U and +U are processed by different join tasks, and the Sink Task may receive their outputs out of order:
    +I (id=1, user_id=11, username='u1')
    +U (id=1, user_id=12, username='u2')
    -U (id=1, user_id=11, username='u1')

    Incorrect! There is no order_id=1 in the result table, because the update_before arrived out of order and deleted the row last.
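    For reference, a sketch of the query behind this diagram (table names are assumptions): the join shuffles by user_id while the sink's primary key is the order id, so nothing pins the event order in between:

    INSERT INTO OrderSink
    SELECT o.id, o.product, u.username
    FROM Orders AS o
    JOIN Users AS u ON o.user_id = u.id;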


  15. Flink Forward 2023 ©
    Handle Event Ordering (1): SinkUpsertMaterializer
    ● SinkUpsertMaterializer: a stateful operator added before the sink [FLINK-20374]
    ● SinkUpsertMaterializer reorders events on the sink primary key with the help of state

    (Topology: Source1 and Source2 are shuffled by the join key into the Join; the join output is shuffled by the sink primary key into SinkUpsertMaterializer -> Sink.)
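    The planner adds SinkUpsertMaterializer automatically when it detects possible disorder; a sketch of the option that controls this behavior:

    SET 'table.exec.sink.upsert-materialize' = 'AUTO';  -- NONE | AUTO (default) | FORCE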


  16. Flink Forward 2023 ©
    Handle Event Ordering (2): SinkUpsertMaterializer
    ● Tracks event ordering in state using the structure State<List<Row>>, keyed by the sink primary key
    ● Reorders events by working like Deduplication with Last Row:
    ○ For ADD events (+I and +U), add the event to the list state and forward-emit the ADD event
    ○ For RETRACT events (-D and -U), remove the event from the list state, then:
    ■ re-emit the new last ADD row if the removed event was the last row
    ■ forward-emit the RETRACT event if the list state becomes empty
    ■ otherwise, no-op

    Example (Insert):
    Input:  +U (id=1, user_id=12, username='u2')
    State:  <key=1, List{ (id=1, user_id=12, username='u2') }>
    Output: +I (id=1, user_id=12, username='u2')
    Sink table: id=1, user_id=12, username=u2
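    As an aside, the "Deduplication with Last Row" pattern it mimics can be written directly in Flink SQL; a sketch assuming a proctime attribute and the table name order_with_user:

    SELECT id, user_id, username
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY id          -- the sink primary key
                   ORDER BY proctime DESC   -- keep the latest row per key
               ) AS row_num
        FROM order_with_user
    )
    WHERE row_num = 1;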


  17. Flink Forward 2023 ©
    Handle Event Ordering (2): SinkUpsertMaterializer
    (rules repeated from the previous slide)

    Example (out-of-order upsert): the old row arrives after the newer +U has already been processed.
    Input:  +I (id=1, user_id=11, username='u1')
    State:  <key=1, List{ (id=1, user_id=12, username='u2'),
                          (id=1, user_id=11, username='u1') }>
    Output: +U (id=1, user_id=11, username='u1')
    Sink table: id=1, user_id 12 -> 11, username u2 -> u1


  18. Flink Forward 2023 ©
    Handle Event Ordering (2): SinkUpsertMaterializer
    (rules repeated from slide 16)

    Example (the out-of-order retraction arrives):
    Input:  -U (id=1, user_id=11, username='u1')
    State:  <key=1, List{ (id=1, user_id=12, username='u2') }>   -- the u1 row is removed from the list
    Output: +U (id=1, user_id=12, username='u2')                 -- re-emit the new last row
    Sink table: id=1, user_id 11 -> 12, username u1 -> u2 (the correct final result)


  19. Flink Forward 2023 ©
    What is the Challenge of Non-determinism?

    Pipeline: Changelog Source -> Lookup Join (on dim.id, fetching dim.value) -> SUM(dim.value)

    Changelog input:
    +I (id=1, product='a1', dim_id=11)
    -U (id=1, product='a1', dim_id=11)
    +U (id=1, product='a2', dim_id=11)

    The external dim table's value for dim_id=11 changes between lookups (1, then 3, then 1), so the lookup join emits:
    +I (id=1, dim_value=1)
    -U (id=1, dim_value=3)   -- the retraction is not equal to the previous row!
    +U (id=1, dim_value=1)

    SUM(dim.value): +1 - 3 + 1 = -1
    Incorrect! We should never see a negative sum result. The computation is non-deterministic on retractions.
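    For reference, a sketch of the lookup-join query behind this diagram (table and column names are assumptions); each record queries the external dim table at processing time:

    SELECT o.id, SUM(d.dim_value) AS dim_value
    FROM orders AS o
    JOIN dim FOR SYSTEM_TIME AS OF o.proctime AS d
        ON o.dim_id = d.id
    GROUP BY o.id;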


  20. Flink Forward 2023 ©
    Handle Non-determinism: MaterializedLookupJoin

    ● MaterializedLookupJoin uses state to materialize the previously joined lookup value, preventing non-determinism on retractions. [FLINK-27849]

    Pipeline: Changelog Source -> Materialized Lookup Join (with lookup state for the previously joined value) -> SUM(dim.value)

    Changelog input:
    +I (id=1, product='a1', dim_id=11)
    -U (id=1, product='a1', dim_id=11)
    +U (id=1, product='a2', dim_id=11)

    The lookup join now emits the materialized value on retraction, even though the dim table has changed (1, then 3, then 1):
    +I (id=1, dim_value=1)
    -U (id=1, dim_value=1)   -- reuses the materialized value from the +I
    +U (id=1, dim_value=1)

    SUM(dim.value): +1 - 1 + 1 = 1
    Deterministic computation on retractions.
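    A sketch of the planner option that enables this behavior (introduced with FLINK-27849): when set to TRY_RESOLVE, the planner detects non-deterministic update problems and materializes the lookup join where needed:

    SET 'table.optimizer.non-deterministic-update.strategy' = 'TRY_RESOLVE';  -- default: IGNORE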


  21. Flink Forward 2023 ©
    Summary
    ● The 4 challenges of implementing a streaming SQL engine:
    Late Data, Changelog, Event Ordering, Non-determinism.
    ● It’s all about how to improve and balance the trade-off triangle:
    ● Allowing late data improves Correctness with a little state Cost.
    ● Window early-fire improves Latency with more computation Cost.
    ● Aggregation optimizations reduce the above Costs.
    ● The Changelog Mechanism correctly supports updates in sources, sinks, and operators.
    ● SinkUpsertMaterializer fixes event-ordering Correctness with state Cost.
    ● MaterializedLookupJoin fixes non-deterministic-update Correctness with state Cost.
    ● Some future plans: reduce the state Cost and balance between Cost and Latency…
    (Trade-off triangle: Correctness, Latency, Cost)


  22. Flink Forward 2023 ©
    Thank you
    Jark Wu
    @jarkwu
