Open-source Change Data Capture With Debezium

Change Data Capture (CDC) is a major enabler for your data: by reacting to changes in your database in near real time, it helps implement a wide range of use cases, such as low-latency data updates from OLTP data stores to OLAP systems, caches, or search indexes; data exchange between microservices; and audit logs.

In this talk you'll learn about Debezium, a distributed open-source log-based CDC platform for a variety of databases, such as Postgres, MySQL, Cassandra, MongoDB, and Vitess. We'll not only explore what makes Debezium and CDC so interesting from a user's perspective, but we'll also dive into some of the technical challenges we encountered while implementing Debezium, such as preventing an indefinite growth of WAL files in Postgres, keeping track of the schema of captured tables as DDL statements come in, and strategies for snapshotting your initial data set before capturing data changes from transaction logs.

This talk is part of the "Vaccination Database (Booster) Tech Talk" Seminar Series at Carnegie Mellon University (https://db.cs.cmu.edu/events/vaccination-2022-open-source-change-data-capture-with-debezium-gunnar-morling/)

Gunnar Morling

March 15, 2022

Transcript

  1. Open-Source Change Data Capture With Debezium • Gunnar Morling, Software Engineer, Red Hat • @gunnarmorling

  2. #Debezium @gunnarmorling Today’s Objectives Learn About…

  3. #Debezium @gunnarmorling Gunnar Morling • Open source software engineer at Red Hat ◦ Debezium ◦ Quarkus • Spec Lead for Bean Validation 2.0 • kcctl, ModiTect, MapStruct • Java Champion • @gunnarmorling

  4. #Debezium @gunnarmorling Debezium: Log-based Change Data Capture • Taps into TX log to capture INSERT/UPDATE/DELETE events • Propagated to consumers via Apache Kafka and Kafka Connect

  5. #Debezium @gunnarmorling Change Data Capture: A Giant Enabler for Your Data

  6. #Debezium @gunnarmorling Debezium in a Nutshell • A CDC Platform ▪ Based on transaction logs ▪ Snapshotting, filtering, etc. ▪ Outbox support ▪ Web-based UI • Fully open-source, very active community • Large production deployments

  7. #Debezium @gunnarmorling Debezium: Connectors • Stable ▪ MySQL ▪ Postgres ▪ MongoDB ▪ SQL Server ▪ Db2 ▪ Oracle • Incubating ▪ Vitess ▪ Cassandra

  8. #Debezium @gunnarmorling Debezium: Connectors Becoming the De-Facto CDC Standard https://debezium.io/blog/2021/09/22/deep-dive-into-a-debezium-community-connector-scylla-cdc-source-connector/

  9. #Debezium @gunnarmorling Debezium Architecture

  10. #Debezium @gunnarmorling Debezium: Deployment Alternatives Embedded Engine and Debezium Server

  11. #Debezium @gunnarmorling Data Change Events • Old and new row state • Metadata on table, TX id, etc. • Operation type, timestamp

  12. #Debezium @gunnarmorling Data Change Events • Old and new row state • Metadata on table, TX id, etc. • Operation type, timestamp

  13. #Debezium @gunnarmorling Data Change Events • Old and new row state • Metadata on table, TX id, etc. • Operation type, timestamp

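The event structure listed on these slides (old and new row state, source metadata, operation type, timestamp) can be illustrated with a small consumer-side sketch. The sample event below is hypothetical; its field names (`before`, `after`, `source`, `op`, `ts_ms`) follow the general shape of a Debezium change event payload, while the table, columns, and values are made up for illustration.

```python
import json

# Hypothetical sample event, shaped like the payload of a change event.
SAMPLE_EVENT = json.dumps({
    "before": {"id": 1004, "email": "old@example.com"},
    "after": {"id": 1004, "email": "new@example.com"},
    "source": {"connector": "postgresql", "table": "customers", "txId": 556},
    "op": "u",
    "ts_ms": 1647302400000,
})

# Operation codes: create, update, delete, and snapshot read.
OPERATIONS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}


def describe(raw_event: str) -> str:
    """Summarize one change event: operation, table, TX id, changed columns."""
    event = json.loads(raw_event)
    before, after = event.get("before"), event.get("after")
    # Columns whose value differs between old and new row state.
    changed = sorted(
        column
        for column in (after or {})
        if before is not None and before.get(column) != after[column]
    )
    source = event["source"]
    return (
        f"{OPERATIONS[event['op']]} on {source['table']} "
        f"(tx {source.get('txId')}) at {event['ts_ms']}, changed: {changed}"
    )


print(describe(SAMPLE_EVENT))
```

For inserts and deletes one of the two row states is absent, so the list of changed columns is empty in this sketch.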
  14. Outbox Pattern

  15. #Debezium @gunnarmorling Challenge: Microservices Data Exchange • Services need to update their database, • send messages to other services, • and do both consistently!

  16. #Debezium @gunnarmorling Outbox Pattern • “Dual writes” are prone to inconsistencies!

  17. #Debezium @gunnarmorling Outbox Pattern

  18. #Debezium @gunnarmorling Outbox Pattern

  19. #Debezium @gunnarmorling Outbox Pattern

  20. #Debezium @gunnarmorling Outbox Pattern
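The core of the outbox pattern, as opposed to a dual write, is that the business row and the outgoing message are written in one local transaction, and a CDC connector later picks the message up from the outbox table. The following is a minimal sketch of the write side only, using SQLite so it runs standalone. The table layout, the column names, and the `fail_before_outbox` switch are assumptions made for illustration, not the schema used in the talk.

```python
import json
import sqlite3
import uuid


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
    # Illustrative outbox layout: one row per message to be published.
    conn.execute(
        "CREATE TABLE outbox (id TEXT PRIMARY KEY, aggregatetype TEXT, "
        "aggregateid TEXT, type TEXT, payload TEXT)"
    )
    conn.commit()


def place_order(conn, customer, total, fail_before_outbox=False):
    """Insert the order and its outbox message in a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT INTO orders (customer, total) VALUES (?, ?)", (customer, total)
        )
        order_id = cur.lastrowid
        if fail_before_outbox:
            # Hypothetical switch simulating a crash between the two writes.
            raise RuntimeError("simulated crash between the two writes")
        payload = json.dumps({"orderId": order_id, "customer": customer, "total": total})
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), "order", str(order_id), "OrderCreated", payload),
        )
    return order_id


def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
create_schema(conn)
place_order(conn, "alice", 59.90)
try:
    place_order(conn, "bob", 12.50, fail_before_outbox=True)
except RuntimeError:
    pass
print(count(conn, "orders"), count(conn, "outbox"))
```

Because the failed call rolls back both inserts, the two tables cannot drift apart: either the order and its message both exist, or neither does.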

  21. #Debezium @gunnarmorling Variation on Postgres: pg_logical_emit_message() • Directly writing arbitrary messages to the WAL • No need for an outbox table

  22. #Debezium @gunnarmorling pg_logical_emit_message(): Exporting auditing metadata • Pure CDC events lack metadata like business user, device id, etc. • Solution: emit at TX begin, enrich events e.g. using an SMT (single message transform)

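The enrichment idea on this slide can be sketched as a small stream transformation: an audit message emitted at the start of a transaction is remembered by TX id, and every later change event from the same transaction gets that metadata attached. This is a plain Python sketch of the logic, not an actual Kafka Connect SMT; the event shape (`op`, `tx`, `content`) and the use of `"m"` to mark a message event are assumptions for illustration.

```python
def enrich(events):
    """Attach audit metadata to change events from the same transaction.

    `events` must be in log order, with the audit message (op "m")
    preceding the change events of its transaction.
    """
    audit_by_tx = {}
    for event in events:
        if event["op"] == "m":
            # Message written at TX begin: remember it, do not forward it.
            audit_by_tx[event["tx"]] = event["content"]
        else:
            # Change event: copy it and add the metadata, if any was seen.
            yield {**event, "audit": audit_by_tx.get(event["tx"])}


EVENTS = [
    {"op": "m", "tx": 7, "content": {"user": "alice", "device": "web"}},
    {"op": "u", "tx": 7, "table": "orders", "key": 1},
    {"op": "c", "tx": 8, "table": "orders", "key": 2},
]

for enriched in enrich(EVENTS):
    print(enriched)
```

A production version would also need to evict entries once a transaction ends; the sketch keeps them forever for brevity.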
  23. #Debezium @gunnarmorling Directly Emitting WAL Events: The Ask • Provide a facility for producing raw WAL events

  24. Challenges

  25. #Debezium @gunnarmorling Keeping Track of Table Schemas: How to Interpret Incoming Events? • Messages typically not self-descriptive • Challenge: incoming events may adhere to earlier schema version

  26. #Debezium @gunnarmorling MySQL DDL Parser • Solution: Parse DDL Events • Based on ANTLR parser generator

  27. #Debezium @gunnarmorling Recovering schema after restarts • Persisting schema change history in a Kafka topic

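Taken together, the last three slides describe replaying persisted DDL statements up to the restart position, so that older log events are interpreted against the schema that was valid when they were written. Below is a minimal sketch of that replay. The in-memory `HISTORY` list stands in for the history topic, and the regex-based `apply_ddl` is a deliberately tiny stand-in for a real DDL parser; both are assumptions for illustration, and the regexes only handle the three statement forms shown.

```python
import re

# Toy patterns; a real connector uses a full grammar-based DDL parser.
CREATE = re.compile(r"CREATE TABLE (\w+) \((.*)\)", re.IGNORECASE)
ADD = re.compile(r"ALTER TABLE (\w+) ADD COLUMN (\w+)", re.IGNORECASE)
DROP = re.compile(r"ALTER TABLE (\w+) DROP COLUMN (\w+)", re.IGNORECASE)


def apply_ddl(schemas, ddl):
    """Update the in-memory table model (table name -> column names)."""
    if m := CREATE.fullmatch(ddl):
        columns = [part.strip().split()[0] for part in m.group(2).split(",")]
        schemas[m.group(1)] = columns
    elif m := ADD.match(ddl):
        schemas[m.group(1)].append(m.group(2))
    elif m := DROP.match(ddl):
        schemas[m.group(1)].remove(m.group(2))


def recover_schemas(history, restart_position):
    """Replay DDL recorded at or before the log position we restart from."""
    schemas = {}
    for position, ddl in history:  # history is ordered by log position
        if position > restart_position:
            break
        apply_ddl(schemas, ddl)
    return schemas


# Hypothetical persisted history: (log position, DDL statement).
HISTORY = [
    (10, "CREATE TABLE customers (id INT, name VARCHAR(255))"),
    (42, "ALTER TABLE customers ADD COLUMN email VARCHAR(255)"),
    (97, "ALTER TABLE customers DROP COLUMN name"),
]

print(recover_schemas(HISTORY, restart_position=50))
```

Restarting at position 50 yields the schema with the `email` column but before `name` was dropped, which is the version needed to decode events between positions 42 and 97.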
  28. #Debezium @gunnarmorling Keeping Track of Table Schemas: The Ask • Efficient is good, but make it simple to consume • Provide it when needed, as e.g. in Postgres (pgoutput) ◦ At the beginning of session ◦ After a table change • Facility to query past schema versions

  29. #Debezium @gunnarmorling On the Subject of Parsing… LogMiner Events

  30. #Debezium @gunnarmorling Preventing Unbounded WAL Growth (I) Challenging API designs

  31. #Debezium @gunnarmorling Preventing Unbounded WAL Growth (I): Can’t Commit Offsets Without Events

  32. #Debezium @gunnarmorling Preventing Unbounded WAL Growth (II): High-traffic/Low-traffic Logical Databases • Problem: ◦ WAL global ◦ Logical replication slots per database

  33. #Debezium @gunnarmorling Preventing Unbounded WAL Growth: The Ask • Make sure interfaces work correctly also in corner cases

  34. Snapshotting

  35. #Debezium @gunnarmorling Snapshotting: General Idea • Need initial backfill of sink systems, but don’t have all TX logs • Solution: scan data once before streaming

  36. #Debezium @gunnarmorling Snapshotting The Ask Allow for consistent, lock-less snapshots

  37. #Debezium @gunnarmorling Snapshotting: Limitations of Classic Approach • Can’t update filter list • Long-running snapshots can’t be paused/resumed • Can’t stream changes until snapshot completed • Can’t re-snapshot selected tables

  38. #Debezium @gunnarmorling Snapshotting: Incremental Snapshotting • “DBLog: A Watermark Based Change-Data-Capture Framework”, by Andreas Andreakis and Ioannis Papapanagiotou, https://arxiv.org/pdf/2010.12597v1.pdf • Key idea: interleave snapshot events and events from TX log

  39. #Debezium @gunnarmorling Snapshotting Incremental Snapshotting

  40. #Debezium @gunnarmorling Incremental Snapshotting Windowing via Watermarks

  41. #Debezium @gunnarmorling Incremental Snapshotting Buffer Processing

  42. #Debezium @gunnarmorling Incremental Snapshotting Buffer Processing
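The windowing and buffer-processing slides can be summarized as follows: a chunk of rows is read between a low and a high watermark written to the log; while the window is open, any log event for a key in the chunk buffer supersedes the buffered row; when the high watermark arrives, the surviving buffered rows are emitted as read events. Here is a minimal sketch of that de-duplication step, loosely following the DBLog idea. The in-memory log, the tuple event shape `(op, key, row)`, and the marker constants are assumptions for illustration.

```python
LOW, HIGH = "low-watermark", "high-watermark"


def interleave(log, chunk):
    """Merge one snapshot chunk into the stream of log events.

    `log` yields (op, key, row) change events plus the LOW/HIGH markers;
    `chunk` maps key -> row as read by the chunk query inside the window.
    """
    buffer = dict(chunk)
    window_open = False
    for entry in log:
        if entry == LOW:
            window_open = True
        elif entry == HIGH:
            window_open = False
            # Rows not touched during the window are safe to emit as reads.
            for key, row in buffer.items():
                yield ("r", key, row)
            buffer.clear()
        else:
            op, key, row = entry
            if window_open:
                # The log event is newer than the snapshot row: drop the row.
                buffer.pop(key, None)
            yield entry


CHUNK = {1: "a", 2: "b", 3: "c"}
LOG = [
    ("u", 9, "z"),
    LOW,
    ("u", 2, "b2"),
    ("d", 3, None),
    HIGH,
    ("c", 4, "d"),
]

for event in interleave(LOG, CHUNK):
    print(event)
```

In this run only key 1 is emitted as a read event; keys 2 and 3 appear solely through their update and delete events, so a consumer can see an update or delete without a prior read while still ending up with the complete data set.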

  43. #Debezium @gunnarmorling Incremental Snapshotting: Semantics • No guarantee for snapshot (read) events for all records • May receive update or delete without prior insert/read • May receive read and update/delete • What is guaranteed: complete data set after snapshot

  44. #Debezium @gunnarmorling Incremental Snapshotting: Comparison • Can’t update filter list ✅ • Long-running snapshots can’t be paused/resumed ✅ • Can’t stream changes until snapshot completed ✅ • Can’t re-snapshot selected tables ✅

  45. Wrap-Up

  46. #Debezium @gunnarmorling Takeaways • CDC: SELECT, INSERT, UPDATE, DELETE… STREAM? • Debezium: open-source CDC for a variety of databases • Outlook: incrementally updated materialized views?

  47. #Debezium @gunnarmorling Resources • Debezium https://debezium.io/ • Incremental snapshotting https://debezium.io/blog/2021/10/07/incremental-snapshots/ • Outbox implementation https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/ • Demo repo https://github.com/debezium/debezium-examples

  48. #Debezium @gunnarmorling Q & A gunnar@hibernate.org @gunnarmorling 📧 Thank You!

  49. #Debezium @gunnarmorling Image Credits In Order of Appearance • Unsplash https://unsplash.com/license © Pablo García Saldaña https://unsplash.com/photos/lPQIndZz8Mo © David Clode https://unsplash.com/photos/T49WTav4LgU © Aaron Burden https://unsplash.com/photos/GFpxQ2ZyNc0 © Nathan Dumlao https://unsplash.com/photos/wQDysNUCKfw © mari lezhava https://unsplash.com/photos/q65bNe9fW-w © Michał Parzuchowski https://unsplash.com/photos/Bt0PM7cNJFQ © Charles Forerunner https://unsplash.com/photos/3fPXt37X6UQ • Flickr Attribution 2.0 Generic https://creativecommons.org/licenses/by/2.0/ © Thomas Kamann https://flic.kr/p/coa2c • CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/ © Wall Boat https://flic.kr/p/Y6zkmX • Attribution-ShareAlike 2.0 Generic https://creativecommons.org/licenses/by-sa/2.0/ © Andrew Hart https://flic.kr/p/dmjkSk • Attribution 2.0 Generic (CC BY 2.0) https://creativecommons.org/licenses/by/2.0/ © Ryan https://flic.kr/p/8gwtzo