Open-source Change Data Capture With Debezium

Change Data Capture (CDC) is a big enabler for your data: by reacting to changes in your database in "real-time", CDC comes in handy for implementing a wide range of use cases, such as low-latency data updates from OLTP data stores to OLAP systems, caches, or search indexes; data exchange between microservices; building audit logs; and many more.

In this talk you'll learn about Debezium, a distributed open-source log-based CDC platform for a variety of databases, such as Postgres, MySQL, Cassandra, MongoDB, and Vitess. We'll not only explore what makes Debezium and CDC so interesting from a user's perspective, but also dive into some of the technical challenges we encountered while implementing Debezium, such as preventing unbounded growth of WAL files in Postgres, keeping track of the schema of captured tables as DDL statements come in, and strategies for snapshotting your initial data set before capturing data changes from transaction logs.

This talk is part of the "Vaccination Database (Booster) Tech Talk" Seminar Series at Carnegie Mellon University (https://db.cs.cmu.edu/events/vaccination-2022-open-source-change-data-capture-with-debezium-gunnar-morling/)

Gunnar Morling

March 15, 2022

Transcript

  1. #Debezium @gunnarmorling Gunnar Morling

    • Open source software engineer at Red Hat
      ◦ Debezium
      ◦ Quarkus
    • Spec Lead for Bean Validation 2.0
    • kcctl, ModiTect, MapStruct
    • Java Champion
    • @gunnarmorling
  2. Debezium: Log-based Change Data Capture

    • Taps into TX log to capture INSERT/UPDATE/DELETE events
    • Propagated to consumers via Apache Kafka and Kafka Connect
  3. Debezium in a Nutshell

    • A CDC Platform
      ▪ Based on transaction logs
      ▪ Snapshotting, filtering, etc.
      ▪ Outbox support
      ▪ Web-based UI
    • Fully open-source, very active community
    • Large production deployments
  4. Debezium: Connectors

    • Stable
      ▪ MySQL
      ▪ Postgres
      ▪ MongoDB
      ▪ SQL Server
      ▪ Db2
      ▪ Oracle
    • Incubating
      ▪ Vitess
      ▪ Cassandra
  5. Data Change Events

    • Old and new row state
    • Metadata on table, TX id, etc.
    • Operation type, timestamp
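
As a rough illustration of the event structure listed on this slide, the following Python sketch applies Debezium-style change events to an in-memory cache. The envelope fields `before`, `after`, `source`, `op`, and `ts_ms` follow the Debezium event format; the `customers` rows, the `id` key column, and the `apply_event` helper are made up for this example.

```python
from typing import Any, Dict


def apply_event(cache: Dict[int, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Apply one change event to a dict keyed by the (assumed) primary key `id`."""
    op = event["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if op in ("c", "u", "r"):
        row = event["after"]  # new row state
        cache[row["id"]] = row
    elif op == "d":
        row = event["before"]  # old row state; "after" is None for deletes
        cache.pop(row["id"], None)
    else:
        raise ValueError(f"unknown operation type: {op!r}")


# Hypothetical events for a "customers" table, in transaction log order.
source = {"table": "customers", "txId": 556}
events = [
    {"before": None, "after": {"id": 1, "name": "Ann"}, "source": source, "op": "c", "ts_ms": 1647302400000},
    {"before": {"id": 1, "name": "Ann"}, "after": {"id": 1, "name": "Anne"}, "source": source, "op": "u", "ts_ms": 1647302401000},
    {"before": None, "after": {"id": 2, "name": "Bob"}, "source": source, "op": "c", "ts_ms": 1647302402000},
    {"before": {"id": 1, "name": "Anne"}, "after": None, "source": source, "op": "d", "ts_ms": 1647302403000},
]

cache: Dict[int, Dict[str, Any]] = {}
for event in events:
    apply_event(cache, event)
```

After replaying the four events, the cache holds only the row for customer 2, since customer 1 was created, updated, and then deleted.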
  8. Challenge: Microservices Data Exchange

    • Services need to update their database,
    • send messages to other services,
    • and that consistently!
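
One way to meet this challenge is the outbox pattern referenced elsewhere in the deck: the service writes its business data and the outgoing message in a single local transaction, and CDC picks the message up from the transaction log. The sketch below uses SQLite from the Python standard library purely as a stand-in database; the `orders` and `outbox` tables, their columns, and the `place_order` helper are assumptions for illustration, not Debezium's schema.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "aggregate_id INTEGER NOT NULL, event_type TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.commit()


def place_order(conn: sqlite3.Connection, order_id: int, item: str, fail_midway: bool = False) -> None:
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the business row and the outbox row are written together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        if fail_midway:
            raise RuntimeError("simulated failure between the two writes")
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderCreated", json.dumps({"id": order_id, "item": item})),
        )


def counts(conn: sqlite3.Connection) -> tuple:
    orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    outbox = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return orders, outbox


conn = sqlite3.connect(":memory:")
create_schema(conn)
place_order(conn, 1, "book")
try:
    place_order(conn, 2, "pen", fail_midway=True)
except RuntimeError:
    pass  # the failed transaction leaves neither an order nor an outbox row
result = counts(conn)
```

Because both inserts share one transaction, the failed second order leaves no trace in either table, so a log-based reader of the `outbox` table never sees a message without its matching business change.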
  9. pg_logical_emit_message(): Exporting auditing metadata

    • Pure CDC events lack metadata like business user, device id, etc.
    • Solution: emit at TX begin, enrich events e.g. using SMT
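
The enrichment step can be sketched without a database. The Python below assumes the stream has already been decoded into plain dicts: a `message` record stands for what an application would write with `pg_logical_emit_message()` at the start of a transaction, and each later `change` record of the same transaction gets that metadata attached. The record layout (`kind`, `tx_id`, `content`, `audit`) is invented for this example; in a real pipeline the logic would sit in an SMT or a stream processor.

```python
from typing import Any, Dict, Iterable, Iterator


def enrich(stream: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Attach per-transaction audit metadata to the change events of that transaction."""
    metadata_by_tx: Dict[int, Dict[str, Any]] = {}
    for record in stream:
        kind = record["kind"]
        if kind == "message":
            # Metadata emitted at transaction begin; remember it for this TX.
            metadata_by_tx[record["tx_id"]] = record["content"]
        elif kind == "change":
            enriched = dict(record)
            enriched["audit"] = metadata_by_tx.get(record["tx_id"])
            yield enriched
        elif kind == "commit":
            # The transaction is done, so its metadata is no longer needed.
            metadata_by_tx.pop(record["tx_id"], None)


stream = [
    {"kind": "message", "tx_id": 7, "content": {"user": "alice", "device": "d-42"}},
    {"kind": "change", "tx_id": 7, "op": "u", "table": "orders"},
    {"kind": "change", "tx_id": 7, "op": "c", "table": "order_lines"},
    {"kind": "commit", "tx_id": 7},
    {"kind": "change", "tx_id": 8, "op": "d", "table": "orders"},
]
enriched_events = list(enrich(stream))
```

Both changes of transaction 7 carry the user and device, while the change of transaction 8, which emitted no message, ends up with `audit` set to `None`.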
  10. Keeping Track of Table Schemas: How to Interpret Incoming Events?

    • Messages typically not self-descriptive
    • Challenge: incoming events may adhere to earlier schema version
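
A minimal sketch of the idea, assuming log positions are plain increasing integers and a schema is just a tuple of column names: the connector keeps a history of schema versions keyed by the position at which each became valid, and decodes every event with the version that was in effect at the event's own position. The `SchemaHistory` class and its methods are hypothetical, not Debezium's API.

```python
import bisect
from typing import Any, Dict, List, Tuple


class SchemaHistory:
    """Remembers which column list was valid from which log position onward."""

    def __init__(self) -> None:
        self._positions: List[int] = []
        self._schemas: List[Tuple[str, ...]] = []

    def record(self, position: int, columns: Tuple[str, ...]) -> None:
        # DDL changes arrive in log order, so positions must strictly increase.
        if self._positions and position <= self._positions[-1]:
            raise ValueError("schema changes must be recorded in log order")
        self._positions.append(position)
        self._schemas.append(columns)

    def schema_at(self, position: int) -> Tuple[str, ...]:
        # Find the latest schema version recorded at or before the position.
        idx = bisect.bisect_right(self._positions, position) - 1
        if idx < 0:
            raise LookupError("no schema known at or before this position")
        return self._schemas[idx]

    def decode(self, position: int, values: Tuple[Any, ...]) -> Dict[str, Any]:
        return dict(zip(self.schema_at(position), values))


history = SchemaHistory()
history.record(0, ("id", "name"))             # initial table definition
history.record(100, ("id", "name", "email"))  # ALTER TABLE ... ADD COLUMN at position 100

old_event = history.decode(42, (1, "Ann"))
new_event = history.decode(100, (2, "Bob", "bob@example.com"))
```

An event read at position 42 is interpreted with the two-column schema even if the connector processes it after the `ALTER TABLE` has already happened, which is the situation the slide describes.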
  11. Keeping Track of Table Schemas: The Ask

    • Efficient is good, but make it simple to consume
    • Provide it when needed, as e.g. in Postgres (pgoutput)
      ◦ At the beginning of session
      ◦ After a table change
    • Facility to query past schema versions
  12. Preventing Unbounded WAL Growth (II): High-traffic/Low-traffic Logical Databases

    • Problem:
      ◦ WAL global
      ◦ Logical replication slots per database
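
The toy model below illustrates why this combination is a problem, and how periodic heartbeat writes in the low-traffic database can help. It is a deliberately simplified sketch: LSNs are plain integers, the high-traffic database adds 100 units of WAL per tick, and a slot only advances when its own database produces a change. The function name and all numbers are assumptions; the heartbeat mitigation is not stated on the slide.

```python
from typing import Optional


def retained_wal(ticks: int, heartbeat_every: Optional[int] = None) -> int:
    """Return how much WAL the slot of a low-traffic database holds back after `ticks`."""
    wal_lsn = 0   # end of the WAL, shared by all databases of the instance
    slot_lsn = 0  # position confirmed by the slot on the low-traffic database
    for tick in range(1, ticks + 1):
        wal_lsn += 100  # writes in the high-traffic database grow the shared WAL
        if heartbeat_every and tick % heartbeat_every == 0:
            wal_lsn += 1        # small heartbeat write in the low-traffic database
            slot_lsn = wal_lsn  # the connector sees it and acknowledges up to here
    return wal_lsn - slot_lsn


without_heartbeat = retained_wal(10)
with_heartbeat = retained_wal(9, heartbeat_every=2)
```

Without heartbeats the slot never advances and all 1000 units stay retained; with a heartbeat every second tick, only the WAL written since the last heartbeat is held back.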
  13. Snapshotting: General Idea

    • Need initial backfill of sink systems, but don’t have all TX logs
    • Solution: scan data once before streaming
  14. Snapshotting: Limitations of Classic Approach

    • Can’t update filter list
    • Long-running snapshots can’t be paused/resumed
    • Can’t stream changes until snapshot completed
    • Can’t re-snapshot selected tables
  15. Snapshotting: Incremental Snapshotting

    • “DBLog: A Watermark Based Change-Data-Capture Framework”, by Andreas Andreakis and Ioannis Papapanagiotou https://arxiv.org/pdf/2010.12597v1.pdf
    • Key idea: interleave snapshot events and events from TX log
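
The interleaving can be sketched as follows, loosely following the watermark approach from the DBLog paper. The connector writes a low watermark, reads one chunk of rows from the table, and writes a high watermark; while replaying the log between the two watermarks, it drops any chunk row whose key also shows up in the log, since the log event is newer. The tuple-based event representation and the `interleave_chunk` function are assumptions for illustration, not Debezium's implementation.

```python
from typing import Any, Dict, List, Tuple

LOW = ("low_watermark",)
HIGH = ("high_watermark",)


def interleave_chunk(log_events: List[Tuple], chunk_rows: Dict[Any, Any]) -> List[Tuple]:
    """Merge one snapshot chunk into the log stream.

    log_events: events in log order, as (op, key, row) tuples plus LOW and HIGH markers.
    chunk_rows: key -> row, as read from the table between the two watermarks.
    """
    output: List[Tuple] = []
    buffer = dict(chunk_rows)
    in_window = False
    for event in log_events:
        if event == LOW:
            in_window = True
        elif event == HIGH:
            # Rows still buffered were not touched inside the window: emit them as reads.
            for key, row in buffer.items():
                output.append(("r", key, row))
            buffer.clear()
            in_window = False
        else:
            _op, key, _row = event
            if in_window:
                # The log carries a newer state for this key; drop the snapshot row.
                buffer.pop(key, None)
            output.append(event)
    return output


log = [("u", 1, "a1"), LOW, ("u", 2, "b2"), ("d", 3, None), HIGH, ("c", 5, "e")]
chunk = {2: "b-old", 3: "c", 4: "d"}
merged = interleave_chunk(log, chunk)
```

In this run, keys 2 and 3 produce only their log events (an update and a delete without a prior read), while key 4 is emitted as a read event right where the high watermark was seen.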
  16. Incremental Snapshotting: Semantics

    • No guarantee for snapshot (read) events for all records
    • May receive update or delete without prior insert/read
    • May receive read and update/delete
    • What is guaranteed: complete data set after snapshot
  17. Incremental Snapshotting: Comparison

    • Can’t update filter list ✅
    • Long-running snapshots can’t be paused/resumed ✅
    • Can’t stream changes until snapshot completed ✅
    • Can’t re-snapshot selected tables ✅
  18. Takeaways

    • CDC: SELECT, INSERT, UPDATE, DELETE… STREAM?
    • Debezium: open-source CDC for a variety of databases
    • Outlook: incrementally updated materialized views?
  19. Resources

    • Debezium https://debezium.io/
    • Incremental snapshotting https://debezium.io/blog/2021/10/07/incremental-snapshots/
    • Outbox implementation https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
    • Demo repo https://github.com/debezium/debezium-examples
  20. Image Credits (In Order of Appearance)

    • Unsplash https://unsplash.com/license
      ◦ © Pablo García Saldaña https://unsplash.com/photos/lPQIndZz8Mo
      ◦ © David Clode https://unsplash.com/photos/T49WTav4LgU
      ◦ © Aaron Burden https://unsplash.com/photos/GFpxQ2ZyNc0
      ◦ © Nathan Dumlao https://unsplash.com/photos/wQDysNUCKfw
      ◦ © mari lezhava https://unsplash.com/photos/q65bNe9fW-w
      ◦ © Michał Parzuchowski https://unsplash.com/photos/Bt0PM7cNJFQ
      ◦ © Charles Forerunner https://unsplash.com/photos/3fPXt37X6UQ
    • Flickr
      ◦ Attribution 2.0 Generic https://creativecommons.org/licenses/by/2.0/ © Thomas Kamann https://flic.kr/p/coa2c
      ◦ CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/ © Wall Boat https://flic.kr/p/Y6zkmX
      ◦ Attribution-ShareAlike 2.0 Generic https://creativecommons.org/licenses/by-sa/2.0/ © Andrew Hart https://flic.kr/p/dmjkSk
      ◦ Attribution 2.0 Generic (CC BY 2.0) https://creativecommons.org/licenses/by/2.0/ © Ryan https://flic.kr/p/8gwtzo