Real-time Change Stream Processing with Apache Flink

Log-based change data capture (CDC) is a key component of the modern data streaming stack, used for data replication, feeding search indexes, low-latency data warehouse updates, and more.

Merely taking data from A to B often isn't enough, though. Change event streams, such as those created by Debezium, may need to be filtered or routed based on event contents, multiple streams may need to be joined, continuous queries kept up to date, and so on. Enter Apache Flink: it lets you do stateful stream processing on change event feeds. Join us for this session and learn about:

* Implementing streaming queries on CDC events with the Flink DataStream API and Flink SQL
* Aggregating and enriching change data events
* Different deployment options: Kafka Connect vs. Flink CDC

In a demo we'll put all these open-source components into action, showing how to set up a data streaming pipeline from your operational database to a live dashboard within minutes.
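
As a rough illustration of what "filtered or routed based on event contents" can mean, here is a minimal Python sketch. It is not taken from the talk and does not use Flink; the event dictionaries are simplified stand-ins for Debezium change events, and the `route` helper is a hypothetical name.

```python
from collections import defaultdict


def route(events, wanted_ops=("c", "u", "d")):
    """Group change events by source table, dropping unwanted operation types."""
    routed = defaultdict(list)
    for event in events:
        if event["op"] not in wanted_ops:
            continue  # e.g. skip snapshot reads ("r")
        routed[event["source"]["table"]].append(event)
    return dict(routed)


# Simplified, made-up events for illustration only.
events = [
    {"op": "r", "source": {"table": "customers"}, "after": {"id": 1}},
    {"op": "c", "source": {"table": "orders"}, "after": {"id": 10}},
    {"op": "u", "source": {"table": "customers"}, "after": {"id": 1}},
]
routed = route(events)
```

In a real pipeline, the same decision would typically be expressed as a filter plus a keyed or per-table sink in the stream processor rather than as an in-memory dictionary.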

Gunnar Morling

September 11, 2023

Transcript

  1. Real-time Change Stream Processing with Apache Flink
    Gunnar Morling
    Software Engineer, Decodable
    @gunnarmorling
    Image © Marja van Bochove https://flic.kr/p/5Q6yUY (CC BY 2.0)

  2. The world is real-time. So should be your data.

  3. #Debezium + #ApacheFlink | @gunnarmorling
    Today’s Mission
    Learn About…

  4. Gunnar Morling
    ● Software engineer at Decodable
    ● Former project lead of Debezium
    ● kcctl 🧸, JfrUnit, ModiTect, MapStruct
    ● Spec Lead for Bean Validation 2.0
    ● Java Champion

  5. © Kai Schreiber https://flic.kr/p/uecg (CC BY-SA 2.0)

  6. Debezium
    Log-Based Change Data Capture

  7. Debezium in a Nutshell
    Open-Source Change Data Capture
    ● A CDC Platform
    ○ Based on transaction logs
    ○ Snapshotting, filtering, etc.
    ○ Outbox support
    ○ Web-based UI
    ● Fully open-source, very active community
    ● Large production deployments

  8. Change Data Capture
    Liberation for Your Data

  9. Change Data Capture
    Liberation for Your Data

  10. Debezium
    Supported Databases
    ● Core
    ○ MySQL
    ○ Postgres
    ○ SQL Server
    ○ MongoDB
    ○ Db2
    ○ Oracle
    ● Community-led:
    ○ Vitess, Cassandra, Spanner
    ● External: ScyllaDB, Yugabyte

  11. Debezium: Data Change Events
    ● Old and new row state
    ● Metadata on table, TX id, etc.
    ● Operation type, timestamp

  12. Debezium: Data Change Events
    ● Old and new row state
    ● Metadata on table, TX id, etc.
    ● Operation type, timestamp

  13. Debezium: Data Change Events
    ● Old and new row state
    ● Metadata on table, TX id, etc.
    ● Operation type, timestamp
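
The three bullets above map onto the Debezium event envelope: `before` and `after` hold the old and new row state, `source` carries metadata, and `op` and `ts_ms` give the operation type and timestamp. The following is a minimal Python sketch of reading such an event. The sample values are made up, the exact contents of `source` vary by connector, and `describe` is a hypothetical helper.

```python
import json

# Made-up sample event; real events carry more fields (and often a schema part).
raw = """{
  "before": {"id": 1, "email": "old@example.com"},
  "after":  {"id": 1, "email": "new@example.com"},
  "source": {"db": "inventory", "table": "customers", "txId": 571},
  "op": "u",
  "ts_ms": 1694419200000
}"""

# Operation codes used in the Debezium envelope.
OPS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}


def describe(event):
    """Return (operation, table, changed columns) for one change event."""
    before = event.get("before") or {}  # null for inserts and snapshot reads
    after = event.get("after") or {}    # null for deletes
    changed = sorted(k for k in after if before.get(k) != after.get(k))
    return OPS[event["op"]], event["source"]["table"], changed


summary = describe(json.loads(raw))
```

Having both row states in one event is what lets downstream consumers compute diffs like this without querying the source database.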

  14. Debezium
    Becoming the De-Facto CDC Standard
    https://debezium.io/blog/2021/09/22/deep-dive-into-a-debezium-community-connector-scylla-cdc-source-connector/

  15. Apache Flink
    Colin Howley https://flic.kr/p/698F5j (CC BY-ND 2.0)

  16. Apache Flink
    Stateful Computations over Data Streams
    https://flink.apache.org/

  17. Apache Flink
    Common Use Cases
    ● Real-time reporting/dashboards
    ● Low-latency alerting, notifications
    ● Materialized view maintenance, caches
    ● Real-time cross-database sync, lookup joins, windowed joins, aggregations
    ● Machine learning: model serving, feature engineering
    ● Change data capture, data integration
    https://flink.apache.org/poweredby.html

  18. Apache Flink
    APIs for Application Development
    Image source: “Change Data Capture with Flink SQL and Debezium” by Marta Paes at DataEngBytes
    (https://noti.st/morsapaes/liQzgs/change-data-capture-with-flink-sql-and-debezium)

  19. Apache Flink
    Stream Processing of Change Data Events

  20. Apache Flink
    Stream Processing of Change Data Events

  21. Apache Flink
    Stream Processing of Change Data Events

  22. Apache Flink
    Stream Processing of Change Data Events

  23. Apache Flink
    Stream Processing of Change Data Events

  24. Apache Flink
    Stream Processing of Change Data Events

  25. Debezium and Apache Flink
    Integration Options

  26. Use Cases
    https://flic.kr/p/PFDvkY Public Domain, Angelo Brathot

  27. pg_logical_emit_message()
    Exporting Auditing Metadata
    ● Pure CDC events lack metadata like business user, device id, etc.
    ● Solution: emit the metadata at TX begin, then enrich events, e.g. using an SMT

  28. Audit Logs
    Enriching Change Data Events with Metadata
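
The enrichment idea can be sketched in a few lines of Python: remember the audit metadata emitted at the start of each transaction and attach it to every change event of that transaction. This is only an illustration under stated assumptions. The record shape (`kind`, `tx_id`, `content`) is hypothetical and much simpler than real Debezium events, and the slides for this part are images whose content is not in the transcript.

```python
def enrich(stream):
    """Attach each transaction's audit metadata to its change events."""
    metadata_by_tx = {}  # tx id -> metadata emitted at the start of that TX
    for record in stream:
        if record["kind"] == "message":
            # Stand-in for a message written with pg_logical_emit_message().
            metadata_by_tx[record["tx_id"]] = record["content"]
        else:
            # Regular change event: add the metadata, or None if there is none.
            yield {**record, "audit": metadata_by_tx.get(record["tx_id"])}


# Hypothetical, simplified records for illustration only.
stream = [
    {"kind": "message", "tx_id": 42, "content": {"user": "alice", "device": "web"}},
    {"kind": "change", "tx_id": 42, "op": "u", "table": "orders"},
    {"kind": "change", "tx_id": 43, "op": "c", "table": "orders"},
]
enriched = list(enrich(stream))
```

The dictionary here grows without bound; a stream processor would keep this as keyed state and expire it once a transaction is complete.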

  29. Data Contracts
    © Marcin Wichary https://flic.kr/p/6d9P7t (CC BY 2.0)

  30. Data Contracts
    Encapsulating Your Schema
    Chris Riccomini
    (https://cnr.sh/essays/kafka-change-data-capture-breaks-database-encapsulation)
    🤔

  31. Data Contracts
    Encapsulating Your Schema
    Image source: “Data Contracts — From Zero To Hero” by Mehdio
    (https://towardsdatascience.com/data-contracts-from-zero-to-hero-343717ac4d5e)

  32. Data Contracts
    Encapsulating Your Schema
    Image source: “An Engineer's Guide to Data Contracts - Pt. 1” by Chad Sanderson and Adrian Kreuziger
    (https://dataproducts.substack.com/p/an-engineers-guide-to-data-contracts)

  33. Data Contracts
    Encapsulating Your Schema
    ● Consciously design your exposed
    ○ Set of columns
    ○ Their names and types
    ○ Data structure (e.g. DDD aggregates)
    ● Changes to the same
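
One way to picture a consciously designed contract is an explicit mapping from the internal table row to a public record, so that only chosen columns, names, and types leave the database. The Python sketch below is an assumption-laden illustration, not the approach shown in the talk: `CustomerV1`, `to_contract`, and all column names are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CustomerV1:
    """Public contract: only these fields, with these names and types."""
    customer_id: int
    full_name: str


def to_contract(row: dict) -> CustomerV1:
    """Map an internal row to the public shape; unlisted columns never leak."""
    return CustomerV1(
        customer_id=int(row["id"]),
        full_name=f'{row["first_name"]} {row["last_name"]}',
    )


# Hypothetical internal row, including a column that must stay private.
internal_row = {
    "id": 7,
    "first_name": "Ada",
    "last_name": "Lovelace",
    "password_hash": "...",
}
public = asdict(to_contract(internal_row))
```

Because the mapping is explicit, renaming or splitting an internal column only requires updating `to_contract`, while consumers keep seeing the same contract.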

  34. Demo
    © Luke Jones https://flic.kr/p/sEq4MA (CC BY-SA 2.0)

  35. Driving a Dashboard
    Propagating Joined Data to Elasticsearch/Kibana
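
To make "joined data" concrete, here is a minimal Python sketch of a stateful join over two change streams: it keeps the latest row per key and emits an updated joined document whenever either side changes. It is an illustration only, not the demo from the talk. The table and column names are hypothetical, deletes are ignored, and the emitted dictionaries stand in for documents that would be upserted into a search index.

```python
def join_orders_with_customers(events):
    """Emit a joined document each time an order or its customer changes."""
    customers = {}  # customer id -> latest customer row
    orders = {}     # order id -> latest order row
    for event in events:
        table, row = event["table"], event["after"]
        if table == "customers":
            customers[row["id"]] = row
            # A customer change affects all of that customer's known orders.
            affected = [o for o in orders.values() if o["customer_id"] == row["id"]]
        else:  # this sketch treats every other table as "orders"
            orders[row["id"]] = row
            affected = [row]
        for order in affected:
            customer = customers.get(order["customer_id"])
            if customer is not None:  # inner join: wait until both sides exist
                yield {
                    "order_id": order["id"],
                    "total": order["total"],
                    "customer": customer["name"],
                }


# Made-up events for illustration only.
events = [
    {"table": "customers", "after": {"id": 1, "name": "Ada"}},
    {"table": "orders", "after": {"id": 100, "customer_id": 1, "total": 25}},
    {"table": "customers", "after": {"id": 1, "name": "Ada L."}},
]
docs = list(join_orders_with_customers(events))
```

The last event shows why the join must be stateful: a change on the customer side re-emits the already-seen order with the new name.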

  36. Demo
    © Luke Jones https://flic.kr/p/sEq4MA (CC BY-SA 2.0)

  37. Nested Data Structures
    UDFs to the Rescue

  38. Nested Data Structures
    UDFs to the Rescue

  39. Nested Data Structures
    UDFs to the Rescue

  40. Nested Data Structures
    UDFs to the Rescue

  41. Nested Data Structures
    UDFs to the Rescue
    https://www.youtube.com/@decodable
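
The transcript does not include the UDF shown on these slides, so the following is only a guess at the general shape of the idea: an aggregate that collects child rows into an array so a parent row can be emitted as one nested document. The Python class below mimics the accumulate/retract pattern of a streaming aggregate function. `ArrayAggr` and the sample rows are hypothetical names and data.

```python
class ArrayAggr:
    """Collects rows into a list and supports retraction of earlier rows."""

    def __init__(self):
        self.items = []

    def accumulate(self, item):
        """Add a newly inserted (or updated) row."""
        self.items.append(item)

    def retract(self, item):
        """Remove a row that was updated or deleted upstream."""
        self.items.remove(item)

    def value(self):
        """Return the current array of rows."""
        return list(self.items)


agg = ArrayAggr()
agg.accumulate({"sku": "A", "qty": 1})
agg.accumulate({"sku": "B", "qty": 2})

# An update to line B arrives as a retraction of the old row plus the new row.
agg.retract({"sku": "B", "qty": 2})
agg.accumulate({"sku": "B", "qty": 3})

order = {"order_id": 100, "lines": agg.value()}
```

Supporting retraction matters for change streams, because updates and deletes must be able to undo the effect of rows that were aggregated earlier.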

  42. Transactional Aggregation
    Correlating Events From Same Transaction

  43. Transactional Aggregation
    Correlating Events From Same Transaction
    https://www.slideshare.net/FlinkForward/squirreling-away-640-billion-how-stripe-leverages-flink-for-change-data-capture
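
Correlating events from the same transaction can be sketched as buffering change events per transaction id until an end marker announces how many events to expect. The Python example below is modeled loosely on the idea of transaction boundary markers with event counts. The record shape (`kind`, `tx_id`, `event_count`) is a simplified assumption, and the function name is hypothetical.

```python
def aggregate_by_transaction(records):
    """Yield (tx_id, events) once all events of a transaction have arrived."""
    buffers = {}   # tx id -> change events seen so far
    expected = {}  # tx id -> event count announced by the end marker
    for record in records:
        tx = record["tx_id"]
        if record["kind"] == "end":
            expected[tx] = record["event_count"]
        elif record["kind"] == "change":
            buffers.setdefault(tx, []).append(record)
        # Markers and events may arrive interleaved, so check after every record.
        if tx in expected and len(buffers.get(tx, [])) == expected[tx]:
            yield tx, buffers.pop(tx, [])
            del expected[tx]


# Made-up records; the end marker deliberately arrives before the last event.
records = [
    {"kind": "begin", "tx_id": 1},
    {"kind": "change", "tx_id": 1, "table": "orders"},
    {"kind": "end", "tx_id": 1, "event_count": 2},
    {"kind": "change", "tx_id": 1, "table": "order_lines"},
]
result = list(aggregate_by_transaction(records))
```

Counting against the announced total, instead of flushing on the end marker, keeps the result correct when markers and change events travel on different topics and arrive out of order.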

  44. Wrap-Up

  45. Takeaways
    ● The fresher data is, the more valuable it is
    ● Debezium and Apache Flink: Powerhouse of change stream processing
    ● Data streaming stacks can be non-trivial to set up and operate
    🤩

  46. Learn More
    ● Debezium: @debezium | https://debezium.io/
    ● Apache Flink: @ApacheFlink | https://flink.apache.org/
    ● Getting started with Flink: github.com/decodableco/examples → flink-learn

  47. Q & A
    [email protected]
    @gunnarmorling
    📧
    Thank You!

  48. What’s New in Debezium?
    ● Incremental snapshotting
    ● Postgres logical decoding messages
    ● Multi-DB support (SQL Server)
    ● Debezium Server sinks
    ● MongoDB change streams support
    ● Debezium UI
    ● Debezium 2.0