Real-time Change Stream Processing with Apache Flink

Log-based change data capture (CDC) is a key component of the modern data streaming stack, used for data replication, feeding search indexes, low-latency data warehouse updates, and more.

Merely taking data from A to B often isn't enough, though. Change event streams, such as those created by Debezium, may need to be filtered or routed based on event contents, multiple streams may need to be joined, continuous queries kept up to date, and so on. Enter Apache Flink: it lets you do stateful stream processing on change event feeds. Join us for this session and learn about:

* Implementing streaming queries on CDC events with the Flink DataStream API and Flink SQL
* Aggregating and enriching change data events
* Different deployment options: Kafka Connect vs. Flink CDC

In a demo we'll put all these open-source components into action, showing how to set up a data streaming pipeline from your operational database to a live dashboard within minutes.

Gunnar Morling

September 11, 2023

Transcript

  1. Real-time Change Stream Processing with Apache Flink

    Gunnar Morling, Software Engineer, Decodable, @gunnarmorling
    Image © Marja van Bochove https://flic.kr/p/5Q6yUY (CC BY 2.0)
  2. #Debezium + #ApacheFlink | @gunnarmorling

    Gunnar Morling
    • Software engineer at Decodable
    • Former project lead of Debezium
    • kcctl 🧸, JfrUnit, ModiTect, MapStruct
    • Spec Lead for Bean Validation 2.0
    • Java Champion
  3. Debezium in a Nutshell: Open-Source Change Data Capture

    • A CDC platform
      ◦ Based on transaction logs
      ◦ Snapshotting, filtering, etc.
      ◦ Outbox support
      ◦ Web-based UI
    • Fully open-source, very active community
    • Large production deployments
  4. Debezium Supported Databases

    • Core
      ◦ MySQL
      ◦ Postgres
      ◦ SQL Server
      ◦ MongoDB
      ◦ Db2
      ◦ Oracle
    • Community-led:
      ◦ Vitess, Cassandra, Spanner
    • External: ScyllaDB, Yugabyte
  5. Debezium: Data Change Events

    • Old and new row state
    • Metadata on table, TX id, etc.
    • Operation type, timestamp
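To make the event structure on this slide concrete, here is a minimal Python sketch of how a consumer might apply such events to an in-memory view. The field names `before`, `after`, `source`, `op`, and `ts_ms` follow Debezium's event envelope; the `orders` table, its columns, and the `apply_change_event` helper are hypothetical and only serve as illustration.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change_event(view: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory materialized view."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, update: the new row state is in "after".
        after = event["after"]
        view[after[key]] = after
    elif op == "d":
        # Delete: only the old row state ("before") is available.
        view.pop(event["before"][key], None)
    # Other operation types (e.g. truncate) are ignored in this sketch.


# Hypothetical events for an "orders" table; all values are made up.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "NEW"},
     "source": {"table": "orders", "txId": 501}, "ts_ms": 1694420000000},
    {"op": "u", "before": {"id": 1, "status": "NEW"}, "after": {"id": 1, "status": "SHIPPED"},
     "source": {"table": "orders", "txId": 502}, "ts_ms": 1694420001000},
    {"op": "c", "before": None, "after": {"id": 2, "status": "NEW"},
     "source": {"table": "orders", "txId": 503}, "ts_ms": 1694420002000},
    {"op": "d", "before": {"id": 2, "status": "NEW"}, "after": None,
     "source": {"table": "orders", "txId": 504}, "ts_ms": 1694420003000},
]

view: Dict[Any, Row] = {}
for change_event in events:
    apply_change_event(view, change_event)
```

After the loop, `view` holds only order 1 in its latest state, which is the same idea a stream processor applies when it keeps a materialized view in sync with a changelog.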
  8. Becoming the De-Facto CDC Standard

    Debezium
    https://debezium.io/blog/2021/09/22/deep-dive-into-a-debezium-community-connector-scylla-cdc-source-connector/
  9. Apache Flink Common Use Cases

    • Real-time reporting/dashboards
    • Low-latency alerting, notifications
    • Materialized view maintenance, caches
    • Real-time cross-database sync, lookup joins, windowed joins, aggregations
    • Machine learning: model serving, feature engineering
    • Change data capture, data integration
    https://flink.apache.org/poweredby.html
  10. Apache Flink APIs for Application Development

    Image source: “Change Data Capture with Flink SQL and Debezium” by Marta Paes at DataEngBytes (https://noti.st/morsapaes/liQzgs/change-data-capture-with-flink-sql-and-debezium)
  11. pg_logical_emit_message(): Exporting Auditing Metadata

    • Pure CDC events lack metadata like business user, device id, etc.
    • Solution: emit the metadata at TX begin via pg_logical_emit_message(), then enrich the change events with it, e.g. using an SMT
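The enrichment step can be sketched in a few lines of Python. This is a simplified illustration, not the approach shown in the talk: it assumes message events carry `op: "m"` and a `message` block with `prefix` and a base64-encoded `content`, as Debezium's Postgres connector emits for logical decoding messages, and that all events of a transaction share `source.txId`. The `audit` prefix, the payload fields, and the helper names are hypothetical.

```python
import base64
import json
from typing import Any, Dict, Iterable, Iterator

Event = Dict[str, Any]


def enrich_with_audit_metadata(events: Iterable[Event], prefix: str = "audit") -> Iterator[Event]:
    """Attach metadata emitted at TX begin to all change events of that transaction."""
    # A real job would expire this state once a transaction is complete.
    metadata_by_tx: Dict[Any, Dict[str, Any]] = {}
    for event in events:
        tx_id = event["source"]["txId"]
        if event["op"] == "m":
            message = event["message"]
            if message["prefix"] == prefix:
                metadata_by_tx[tx_id] = json.loads(base64.b64decode(message["content"]))
            continue  # message events themselves are not passed on
        yield {**event, "audit": metadata_by_tx.get(tx_id)}


def encode(payload: Dict[str, Any]) -> str:
    """Base64-encode a JSON payload the way it would appear in a message event."""
    return base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")


# Hypothetical stream: one audited transaction (77) and one without metadata (78).
stream = [
    {"op": "m", "source": {"txId": 77},
     "message": {"prefix": "audit", "content": encode({"user": "alice", "device": "web-42"})}},
    {"op": "u", "source": {"txId": 77}, "before": {"id": 1, "qty": 1}, "after": {"id": 1, "qty": 2}},
    {"op": "c", "source": {"txId": 78}, "before": None, "after": {"id": 2, "qty": 5}},
]

enriched = list(enrich_with_audit_metadata(stream))
```

Because the message is written at the start of the transaction, it arrives before the change events it describes, so a simple keyed lookup by transaction id is enough here.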
  12. Data Contracts: Encapsulating Your Schema

    Chris Riccomini (https://cnr.sh/essays/kafka-change-data-capture-breaks-database-encapsulation) 🤔
  13. Data Contracts: Encapsulating Your Schema

    Image source: “Data Contracts — From Zero To Hero” by Mehdio (https://towardsdatascience.com/data-contracts-from-zero-to-hero-343717ac4d5e)
  14. Data Contracts: Encapsulating Your Schema

    Image source: “An Engineer's Guide to Data Contracts - Pt. 1” by Chad Sanderson and Adrian Kreuziger (https://dataproducts.substack.com/p/an-engineers-guide-to-data-contracts)
  15. Data Contracts: Encapsulating Your Schema

    • Consciously design your exposed
      ◦ Set of columns
      ◦ Their names and types
      ◦ Data structure (e.g. DDD aggregates)
    • Changes to the same
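One way to read this slide is as an explicit mapping between the internal table row and the shape that downstream consumers see. The following Python sketch illustrates that idea under stated assumptions: the `CustomerV1` contract, the internal column names, and the `to_contract` function are all hypothetical and not taken from the talk.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class CustomerV1:
    """Hypothetical public contract: a deliberate set of columns, names, and types."""
    customer_id: int
    full_name: str
    email: str


def to_contract(row: Dict[str, Any]) -> CustomerV1:
    """Map an internal table row to the exposed contract, dropping everything else."""
    return CustomerV1(
        customer_id=int(row["id"]),
        full_name=f'{row["first_name"]} {row["last_name"]}',
        email=row["email"].lower(),
    )


# Internal row as it might appear in the "after" part of a change event (made-up data).
internal_row = {
    "id": 7,
    "first_name": "Ada",
    "last_name": "Lovelace",
    "email": "Ada@Example.com",
    "password_hash": "not-exposed",
    "internal_flags": 3,
}

public_record = asdict(to_contract(internal_row))
```

With such a mapping in place, internal columns can be renamed or added without changing what consumers receive, as long as `to_contract` is kept in step.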
  16. Nested Data Structures: UDFs to the Rescue

    https://www.youtube.com/@decodable
  17. Transactional Aggregation: Correlating Events From Same Transaction

    https://www.slideshare.net/FlinkForward/squirreling-away-640-billion-how-stripe-leverages-flink-for-change-data-capture
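A rough Python sketch of correlating events from the same transaction follows. It is a simplification and not the implementation referenced on the slide: it assumes change events carry a `transaction.id` field and that transaction boundary markers carry `status`, `id`, and `event_count`, as Debezium emits when transaction metadata is enabled. The function name and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Event = Dict[str, Any]


def group_by_transaction(events: Iterable[Event]) -> Iterator[Tuple[str, List[Event]]]:
    """Yield (tx_id, change events) once all events of a transaction have arrived."""
    buffered: Dict[str, List[Event]] = defaultdict(list)
    expected: Dict[str, int] = {}
    for event in events:
        if "status" in event:
            # Transaction boundary marker (BEGIN or END).
            tx_id = event["id"]
            if event["status"] == "END":
                expected[tx_id] = event["event_count"]
        else:
            tx_id = event["transaction"]["id"]
            buffered[tx_id].append(event)
        # Markers and change events may interleave, so check after every event.
        if tx_id in expected and len(buffered[tx_id]) == expected[tx_id]:
            yield tx_id, buffered.pop(tx_id)
            del expected[tx_id]


# Hypothetical stream: an order and two order lines written in one transaction.
# The END marker arrives before the last two change events.
stream = [
    {"status": "BEGIN", "id": "tx-1"},
    {"op": "c", "after": {"id": 1, "total": 30}, "transaction": {"id": "tx-1"}},
    {"status": "END", "id": "tx-1", "event_count": 3},
    {"op": "c", "after": {"order_id": 1, "item": "A"}, "transaction": {"id": "tx-1"}},
    {"op": "c", "after": {"order_id": 1, "item": "B"}, "transaction": {"id": "tx-1"}},
]

groups = list(group_by_transaction(stream))
```

The buffer is keyed by transaction id and flushed only when the count from the END marker is reached, so downstream consumers see the order together with its lines rather than intermediate states.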
  18. Takeaways 🤩

    • The fresher data is, the more valuable it is
    • Debezium and Apache Flink: powerhouse of change stream processing
    • Data streaming stacks can be non-trivial to set up and operate
  19. Learn More

    • Debezium: @debezium | https://debezium.io/
    • Apache Flink: @ApacheFlink | https://flink.apache.org/
    • Getting started with Flink: github.com/decodableco/examples → flink-learn
  20. What’s New in Debezium?

    • Incremental snapshotting
    • Postgres logical decoding messages
    • Multi-DB support (SQL Server)
    • Debezium Server sinks
    • MongoDB change streams support
    • Debezium UI
    • Debezium 2.0