Data Contracts In Practice With Debezium and Apache Flink

Log-based change data capture (CDC) is an invaluable part of the data engineering toolbox: it enables a variety of use cases such as real-time analytics, full-text search, or cache invalidation by publishing data change events from your database. But when publishing change event streams across context or team boundaries, aren’t you tying external consumers to your application’s data model, thus limiting your ability to evolve it?

Enter data contracts—consciously designed abstractions between your internal data model and the outside world. Come and join us for this session to learn about:

- Challenges you may encounter when exposing table level change event streams and how data contracts can mitigate them
- Implementation strategies for data contracts such as the outbox pattern and stream processing
- Evolving your data model and the corresponding data contracts, without breaking any existing consumers
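
As a rough illustration of the stream processing strategy, the sketch below maps a Debezium-style change event (with its `op`, `before`, and `after` fields) to a consumer-facing event. It is plain Python rather than Flink, and the table columns, the public field names, and the event type names are all invented for this example; it also assumes `before` and `after` carry full row images.

```python
from typing import Any, Optional

# Hypothetical mapping from internal column names to stable public names.
# Columns not listed here (e.g. "ssn") are never exposed.
RENAMES = {"id": "customerId", "cust_nm": "name", "email_addr": "email"}

# Hypothetical public event types per Debezium operation code
# ("c" = create, "r" = snapshot read, "u" = update, "d" = delete).
EVENT_TYPES = {
    "c": "CustomerCreated",
    "r": "CustomerCreated",
    "u": "CustomerUpdated",
    "d": "CustomerDeleted",
}


def to_contract_event(change_event: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Map a raw table-level change event to the public contract shape."""
    op = change_event["op"]
    # Deletes only carry the old row state; all other operations carry the new one.
    row = change_event["before"] if op == "d" else change_event["after"]

    # Row-level filter: internal test accounts never cross the boundary.
    if row.get("is_test_account"):
        return None

    # Column-level projection and renaming.
    payload = {public: row[internal] for internal, public in RENAMES.items() if internal in row}
    return {"type": EVENT_TYPES[op], "payload": payload}


raw = {
    "op": "u",
    "before": None,
    "after": {
        "id": 42,
        "cust_nm": "Ada",
        "email_addr": "ada@example.com",
        "ssn": "000-00-0000",
        "is_test_account": False,
    },
}
print(to_contract_event(raw))
```

Because consumers only ever see the public names, renaming `cust_nm` in the table means changing one entry in `RENAMES` rather than coordinating with every consumer.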

We’ll also touch on some advanced topics at the intersection of CDC and stream processing, such as hydrating partial change events, using the popular change stream processing duo of Debezium and Apache Flink.
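
The abstract does not say what kind of partial events are meant. One case that fits the description is Postgres TOAST columns: when such a column is unchanged, Debezium's Postgres connector emits a placeholder (by default `__debezium_unavailable_value`) instead of the value. The sketch below, in plain Python, shows the general idea of hydration under that assumption: remember the last complete row per key and fill placeholders from it. The `Hydrator` class is hypothetical; in Flink, the dictionary would be replaced by keyed state.

```python
from typing import Any

# Default placeholder emitted by Debezium's Postgres connector for
# unchanged TOAST columns (configurable via unavailable.value.placeholder).
UNAVAILABLE = "__debezium_unavailable_value"


class Hydrator:
    """Fills placeholder values from the last known row state per key."""

    def __init__(self) -> None:
        # Stand-in for keyed state: primary key -> last hydrated row.
        self._last_seen: dict[Any, dict[str, Any]] = {}

    def hydrate(self, key: Any, after: dict[str, Any]) -> dict[str, Any]:
        previous = self._last_seen.get(key, {})
        full = {
            # If no earlier value is known, the placeholder is left in place.
            col: previous.get(col, val) if val == UNAVAILABLE else val
            for col, val in after.items()
        }
        self._last_seen[key] = full
        return full


hydrator = Hydrator()
hydrator.hydrate(1, {"id": 1, "title": "Draft", "body": "a very long text"})
updated = hydrator.hydrate(1, {"id": 1, "title": "Final", "body": UNAVAILABLE})
print(updated)
```

This only works if the full value has been seen at least once, for example through an initial snapshot; otherwise the placeholder passes through unchanged.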

Gunnar Morling

September 17, 2024

Transcript

  1. Data Contracts In Practice With Debezium and Apache Flink

    Gunnar Morling, @gunnarmorling. Image © Benjamin White https://flic.kr/p/2iGM2x1 (CC BY 2.0 DEED)
  2. Data Contracts With Debezium + Apache Flink | @gunnarmorling

    Does Change Data Capture Break Encapsulation? 🤔
  3. Gunnar Morling

    - Software engineer at Decodable
    - Former project lead of Debezium
    - kcctl 🧸, JfrUnit, ModiTect, MapStruct
    - Java Champion
    - 1⃣ 🐝 🏎
  4. Change Data Capture

    Liberation for Your Data https://www.decodable.co/blog/seven-ways-to-put-cdc-to-work
  5. (Potential) Concerns? Your Table Model Becomes Your API

    - Names and types directly exposed
    - Particularly problematic for legacy schemas

    Image © massmatt https://flic.kr/p/25eF9D3 (CC BY 2.0)
  6. (Potential) Concerns? Fine-grained Events

    - 1:1 relationship between tables and event streams
    - May be too fine-grained

    Image © Michele Dorsey Walfred https://flic.kr/p/MDCCP4 (CC BY 2.0 DEED)
  7. (Potential) Concerns? Schema Changes Might Break Things

    - Renaming columns
    - Changing types
    - Removing columns
    - Changing cardinality of associations

    Image © Insights Unspoken https://flic.kr/p/zNTwN9 (CC BY 2.0 DEED)
  8. (Potential) Concerns? Accidental Data Leaks

    - Exposing all columns…
    - …and rows

    Image © Leonid Mamchenkov https://flic.kr/p/qzBLy (CC BY 2.0 DEED)
  9. Houston, we have a problem!

    Image © Jeff Hitchcock https://flic.kr/p/2hN4RG7 (CC BY 2.0 DEED)
  10. Data Contracts

    A data contract is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers.
  11. Data Contracts

    Towards Data Products

    - Documentation of intent
    - Owned and evolved by the publisher

    Image © Jerome Vial https://flic.kr/p/71KpZy (CC BY-SA 2.0 DEED)
  12. Variation on Postgres

    pg_logical_emit_message()

    - Directly writing arbitrary messages to the WAL
    - No need for an outbox table
  13. Apache Flink

    Stateful Computations over Data Streams https://flink.apache.org/
  14. Apache Flink

    APIs for Application Development

    Image source: “Change Data Capture with Flink SQL and Debezium” by Marta Paes at DataEngBytes (https://noti.st/morsapaes/liQzgs/change-data-capture-with-flink-sql-and-debezium)
  15. There’s More…

    SQL All the Things!

    - Filters
    - Derived fields
    - Consumer-specific contracts
  16. Nested Data Structures

    UDFs to the Rescue https://www.youtube.com/@decodable
  17. Take Aways 🤩

    - Debezium: Real-time change event streams for your data
    - Apache Flink: Data contracts for…
      - …encapsulating internal models ✅
      - …consciously designed events ✅
      - …ensuring compatibility ✅
      - …protecting sensitive data ✅
  18. Houston, we may have a problem. If we do, we know how to solve it!

    Image © NASA Hubble Space Telescope https://flic.kr/p/22tV2DJ (CC BY 2.0 DEED)
  19. Learn More

    - Blog post: https://www.decodable.co/blog/change-data-capture-breaks-encapsulation-does-it-though
    - Example source code: github.com/decodableco/examples → cdc-data-contracts
  20. Decodable Talks at Current ’24

    - Timing is Everything: Understanding Event-Time Processing in Flink SQL 🗣 Sharon Xie 📆 Tuesday 4pm 🗺 Ballroom F
    - Data Contracts In Practice With Debezium and Apache Flink 🗣 Gunnar Morling 📆 Tuesday 3pm 🗺 Meeting Room 18C
    - So You Want to Write a User-Defined Function (UDF) for Flink? 🗣 Hans-Peter Grahsl 📆 Wednesday 1:30pm 🗺 Ballroom F
    - The Joy of JARs (and Other Flink SQL Troubleshooting Tales) 🗣 Robin Moffatt 📆 Wednesday 3pm 🗺 Ballroom F
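
Slide 12 mentions `pg_logical_emit_message()` as a way to write messages to the WAL without an outbox table. A minimal sketch of what the application side could look like is shown below. The Postgres function and its `(transactional, prefix, content)` arguments are real; the `emit_outbox_message` helper, the `"outbox"` prefix, the event fields, and the assumption of a DB-API cursor with `%s` placeholders (as in psycopg) are illustrative only and not taken from the talk.

```python
import json
from typing import Any, Protocol


class Cursor(Protocol):
    """Minimal shape of a DB-API cursor, as provided by drivers such as psycopg."""

    def execute(self, sql: str, params: tuple) -> Any: ...


# transactional = true: the message becomes part of the surrounding
# transaction and is only decoded if that transaction commits.
EMIT_SQL = "SELECT pg_logical_emit_message(true, %s, %s::text)"


def emit_outbox_message(cursor: Cursor, prefix: str, event: dict[str, Any]) -> None:
    """Write a consciously designed event straight to the WAL as JSON."""
    cursor.execute(EMIT_SQL, (prefix, json.dumps(event)))


class RecordingCursor:
    """Stand-in cursor so the sketch runs without a database."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, tuple]] = []

    def execute(self, sql: str, params: tuple) -> None:
        self.calls.append((sql, params))


cursor = RecordingCursor()
emit_outbox_message(cursor, "outbox", {"type": "OrderCreated", "orderId": 1001})
print(cursor.calls[0])
```

In a real application the call would run inside the same transaction as the business data change, so the message and the change commit or roll back together.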