Open-source Change Data Capture With Debezium

Change Data Capture (CDC) is a big enabler for your data: by reacting to changes in your database in real time, CDC helps implement a wide range of use cases, such as low-latency data updates from OLTP data stores to OLAP systems, caches, or search indexes; data exchange between microservices; building audit logs; and many more.

In this talk you'll learn about Debezium, a distributed open-source log-based CDC platform for a variety of databases, such as Postgres, MySQL, Cassandra, MongoDB, and Vitess. We'll explore what makes Debezium and CDC so interesting from a user's perspective, and we'll also dive into some of the technical challenges we encountered while implementing Debezium, such as preventing unbounded growth of WAL files in Postgres, keeping track of the schema of captured tables as DDL statements come in, and strategies for snapshotting your initial data set before capturing data changes from transaction logs.

This talk is part of the "Vaccination Database (Booster) Tech Talk" Seminar Series at Carnegie Mellon University (https://db.cs.cmu.edu/events/vaccination-2022-open-source-change-data-capture-with-debezium-gunnar-morling/)

Gunnar Morling

March 15, 2022

Transcript

  1. Open-Source Change Data Capture With Debezium
    Gunnar Morling
    Software Engineer, Red Hat
    @gunnarmorling

  2. #Debezium @gunnarmorling
    Today’s Objectives
    Learn About…

  3. Gunnar Morling
    ● Open source software engineer at Red Hat
      ○ Debezium
      ○ Quarkus
    ● Spec Lead for Bean Validation 2.0
    ● kcctl, ModiTect, MapStruct
    ● Java Champion
    ● @gunnarmorling

  4. Debezium: Log-based Change Data Capture
    ● Taps into TX log to capture INSERT/UPDATE/DELETE events
    ● Propagated to consumers via Apache Kafka and Kafka Connect
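
To make the Kafka Connect deployment more concrete, here is a minimal sketch of the JSON payload one might POST to the Kafka Connect REST API (`/connectors`) to register a Debezium Postgres connector. The connector name, host, credentials, and table list are placeholders and not taken from the talk. `database.server.name` is the logical name used by Debezium 1.x; newer releases use `topic.prefix` instead.

```python
import json


def postgres_connector_payload(name, host, dbname, tables):
    """Build a request body for POST /connectors on Kafka Connect."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",            # logical decoding output plug-in
            "database.hostname": host,
            "database.port": "5432",
            "database.user": "postgres",          # placeholder credentials
            "database.password": "postgres",      # placeholder credentials
            "database.dbname": dbname,
            "database.server.name": "dbserver1",  # placeholder logical name
            "table.include.list": ",".join(tables),
        },
    }


payload = postgres_connector_payload(
    "inventory-connector", "postgres", "inventory",
    ["public.customers", "public.orders"],
)
print(json.dumps(payload, indent=2))
```

The sketch only builds the payload; sending it to a running Kafka Connect cluster is left out so that the example stays self-contained.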

  5. Change Data Capture
    A Giant Enabler for Your Data

  6. Debezium in a Nutshell
    ● A CDC Platform
      ■ Based on transaction logs
      ■ Snapshotting, filtering, etc.
      ■ Outbox support
      ■ Web-based UI
    ● Fully open-source, very active community
    ● Large production deployments

  7. Debezium: Connectors
    ● Stable
      ■ MySQL
      ■ Postgres
      ■ MongoDB
      ■ SQL Server
      ■ Db2
      ■ Oracle
    ● Incubating
      ■ Vitess
      ■ Cassandra

  8. Debezium: Connectors
    Becoming the De-Facto CDC Standard
    https://debezium.io/blog/2021/09/22/deep-dive-into-a-debezium-community-connector-scylla-cdc-source-connector/

  9. Debezium
    Architecture

  10. Debezium: Deployment Alternatives
    Embedded Engine and Debezium Server

  11. Data Change Events
    ● Old and new row state
    ● Metadata on table, TX id, etc.
    ● Operation type, timestamp

  12. Data Change Events
    ● Old and new row state
    ● Metadata on table, TX id, etc.
    ● Operation type, timestamp

  13. Data Change Events
    ● Old and new row state
    ● Metadata on table, TX id, etc.
    ● Operation type, timestamp
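
As an illustration of the event structure listed above, the following sketch applies Debezium-style change events to an in-memory replica. The envelope fields `before`, `after`, `source`, `op`, and `ts_ms` follow Debezium's event format; the table, rows, and values are made up for the example, and the function name is hypothetical.

```python
def apply_change_event(table, event, key="id"):
    """Apply one change event to `table`, a dict mapping primary key -> row."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, update: upsert the new row state
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # delete: remove the row identified by the old row state
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unexpected op: {op!r}")
    return table


events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "email": "a@example.com"},
     "source": {"table": "customers", "txId": 556}, "ts_ms": 1647302400000},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "b@example.com"},
     "source": {"table": "customers", "txId": 557}, "ts_ms": 1647302401000},
    {"op": "c", "before": None,
     "after": {"id": 2, "email": "c@example.com"},
     "source": {"table": "customers", "txId": 558}, "ts_ms": 1647302402000},
    {"op": "d", "before": {"id": 2, "email": "c@example.com"}, "after": None,
     "source": {"table": "customers", "txId": 559}, "ts_ms": 1647302403000},
]

replica = {}
for event in events:
    apply_change_event(replica, event)
print(replica)
```

Because every event is applied as an upsert or a delete by key, a consumer written this way ends up with the current row state even if it sees an event twice.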

  14. Outbox Pattern

  15. Challenge: Microservices Data Exchange
    ● Services need to update their database,
    ● send messages to other services,
    ● and do both consistently!

  16. Outbox Pattern
    “Dual writes” are prone to inconsistencies!
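
A minimal sketch of the outbox idea, using SQLite in place of a production database: the service writes its business data and the outgoing message in one local transaction, so there is never an order without a message or a message without an order. Table and column names here are illustrative assumptions; in a real setup a CDC connector would capture the inserts into the outbox table and forward them to consumers.

```python
import json
import sqlite3


def place_order(conn, order_id, customer, total):
    """Write the order and its outgoing message in ONE local transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders (id, customer, total) VALUES (?, ?, ?)",
            (order_id, customer, total),
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_type, aggregate_id, type, payload) "
            "VALUES (?, ?, ?, ?)",
            ("Order", str(order_id), "OrderCreated",
             json.dumps({"id": order_id, "customer": customer, "total": total})),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "aggregate_type TEXT, aggregate_id TEXT, type TEXT, payload TEXT)"
)

place_order(conn, 1, "alice", 42.0)
try:
    place_order(conn, 1, "bob", 10.0)  # duplicate key: nothing is written
except sqlite3.IntegrityError:
    pass

orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
messages = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(orders, messages)
```

The failed second call leaves neither an order nor a message behind, which is the consistency that separate writes to a database and a message broker cannot guarantee.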

  17. Outbox Pattern

  18. Outbox Pattern

  19. Outbox Pattern

  20. Outbox Pattern

  21. Variation on Postgres
    pg_logical_emit_message()
    ● Directly writing arbitrary messages to the WAL
    ● No need for an outbox table

  22. pg_logical_emit_message()
    Exporting auditing metadata
    ● Pure CDC events lack metadata like business user, device id, etc.
    ● Solution: emit metadata at TX begin, then enrich change events, e.g. using an SMT (single message transform)
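
The enrichment step described above can be sketched without a database. In Postgres, the metadata would be written with `pg_logical_emit_message(transactional, prefix, content)` at the start of the transaction; the sketch below only models the resulting event stream. The event shape (`kind`, `tx_id`, `content`) and the function name are assumptions made for illustration, not Debezium's actual format.

```python
def enrich_with_tx_metadata(events):
    """Attach metadata emitted at TX begin to later change events of the same TX."""
    metadata_by_tx = {}
    for event in events:
        if event["kind"] == "message":
            # Logical decoding message carrying audit metadata for its transaction
            metadata_by_tx[event["tx_id"]] = event["content"]
        else:
            enriched = dict(event)
            enriched["audit"] = metadata_by_tx.get(event["tx_id"])
            yield enriched


stream = [
    {"kind": "message", "tx_id": 7, "content": {"user": "alice", "device": "web"}},
    {"kind": "change", "tx_id": 7, "op": "u", "table": "orders"},
    {"kind": "change", "tx_id": 8, "op": "c", "table": "orders"},
]

enriched = list(enrich_with_tx_metadata(stream))
for event in enriched:
    print(event)
```

This sketch keeps metadata for every transaction it has seen; a longer-running version would need to evict entries once a transaction has completed.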

  23. Directly Emitting WAL Events
    The Ask
    Provide a facility for producing raw WAL events

  24. Challenges

  25. Keeping Track of Table Schemas
    How to Interpret Incoming Events?
    ● Messages typically not self-descriptive
    ● Challenge: incoming events may adhere to earlier schema version

  26. MySQL DDL Parser
    Solution: Parse DDL Events
    ● Based on Antlr parser generator

  27. Recovering schema after restarts
    ● Persisting schema change history in a Kafka topic
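
A rough sketch of why a persisted schema history helps: after a restart, the schema in effect at a given log position can be rebuilt by replaying the recorded changes up to that position. Debezium records DDL statements and parses them; this sketch swaps in pre-parsed `add_column` / `drop_column` actions to stay short, and all names and positions are hypothetical.

```python
def schema_at(history, position):
    """Rebuild table schemas by replaying schema changes up to `position`."""
    schemas = {}
    for change in history:  # assumed to be ordered by log position
        if change["position"] > position:
            break
        columns = schemas.setdefault(change["table"], [])
        if change["action"] == "add_column":
            columns.append(change["column"])
        elif change["action"] == "drop_column":
            columns.remove(change["column"])
    return schemas


history = [
    {"position": 10, "table": "customers", "action": "add_column", "column": "id"},
    {"position": 10, "table": "customers", "action": "add_column", "column": "email"},
    {"position": 50, "table": "customers", "action": "add_column", "column": "phone"},
    {"position": 90, "table": "customers", "action": "drop_column", "column": "email"},
]

# Schema to use when interpreting an event read at log position 60
print(schema_at(history, 60))
```

An event written at position 60 must be decoded with the three-column schema, even if the connector only processes it after the `email` column has been dropped.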

  28. Keeping Track of Table Schemas
    The Ask
    ● Efficient is good, but make it simple to consume
    ● Provide it when needed, as e.g. in Postgres (pgoutput)
      ○ At the beginning of a session
      ○ After a table change
    ● Facility to query past schema versions

  29. On the Subject of Parsing…
    LogMiner Events

  30. Preventing Unbounded WAL Growth (I)
    Challenging API designs

  31. Preventing Unbounded WAL Growth (I)
    Can’t Commit Offsets Without Events

  32. Preventing Unbounded WAL Growth (II)
    High-traffic/Low-traffic Logical Databases
    ● Problem:
      ○ WAL is global
      ○ Logical replication slots are per database
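
The problem above can be illustrated with a toy simulation. It assumes a simplified model in which every write, to any database, adds one record to the shared WAL, and a slot can only confirm a position when its own database produces an event. The `heartbeat_every` parameter models one commonly used mitigation, periodic artificial writes to the captured database; that mitigation is not spelled out on the slide and the whole model is an assumption for illustration.

```python
def retained_wal(writes, slot_db, heartbeat_every=None):
    """Return how many WAL records a slot on `slot_db` still holds back.

    `writes` lists the database name of each write, in order.
    """
    wal_position = 0
    confirmed = 0
    for i, db in enumerate(writes, start=1):
        wal_position += 1
        if db == slot_db:
            # An event for the captured database lets the slot advance.
            confirmed = wal_position
        elif heartbeat_every and i % heartbeat_every == 0:
            # Artificial write to slot_db: one more WAL record, then confirm.
            wal_position += 1
            confirmed = wal_position
    return wal_position - confirmed


writes = ["quiet"] + ["busy"] * 1000

print(retained_wal(writes, "quiet"))
print(retained_wal(writes, "quiet", heartbeat_every=100))
```

Without the artificial writes, the slot on the low-traffic database pins every WAL record produced by the busy one.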

  33. Preventing Unbounded WAL Growth
    The Ask
    Make sure interfaces also work correctly in corner cases

  34. Snapshotting

  35. Snapshotting
    General Idea
    ● Need initial backfill of sink systems, but don’t have all TX logs
    ● Solution: scan data once before streaming

  36. Snapshotting
    The Ask
    Allow for consistent, lock-less snapshots

  37. Snapshotting
    Limitations of Classic Approach
    ● Can’t update filter list
    ● Long-running snapshots can’t be paused/resumed
    ● Can’t stream changes until snapshot completed
    ● Can’t re-snapshot selected tables

  38. Snapshotting
    Incremental Snapshotting
    ● “DBLog: A Watermark Based Change-Data-Capture Framework”, by Andreas Andreakis and Ioannis Papapanagiotou
      https://arxiv.org/pdf/2010.12597v1.pdf
    ● Key idea: interleave snapshot events and events from TX log

  39. Snapshotting
    Incremental Snapshotting

  40. Incremental Snapshotting
    Windowing via Watermarks

  41. Incremental Snapshotting
    Buffer Processing

  42. Incremental Snapshotting
    Buffer Processing
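
The watermark windowing and buffer processing named above can be sketched roughly as follows, based on the idea in the DBLog paper: a chunk of rows is read between a low and a high watermark, and any row whose key also shows up in the log inside that window is dropped from the buffer, since the log already carries its newer state. The event tuples and the function name are simplifications chosen for this example, not Debezium's implementation.

```python
def process_window(log_events, chunk_rows):
    """Interleave one snapshot chunk with log events.

    `log_events` is an ordered list of ("low",), ("high",) watermark markers
    and (op, key) change entries. `chunk_rows` maps key -> row as read by the
    chunk query between the two watermarks.
    """
    buffer = dict(chunk_rows)
    output = []
    window_open = False
    for event in log_events:
        if event[0] == "low":
            window_open = True
        elif event[0] == "high":
            window_open = False
            # Emit the surviving chunk rows as snapshot "read" events.
            output.extend(("read", key) for key in buffer)
            buffer.clear()
        else:
            op, key = event
            if window_open:
                buffer.pop(key, None)  # the log has a newer state for this key
            output.append((op, key))
    return output


chunk = {1: {"id": 1}, 2: {"id": 2}, 3: {"id": 3}}
log = [("u", 7), ("low",), ("u", 2), ("d", 3), ("high",), ("u", 1)]

result = process_window(log, chunk)
print(result)
```

In this run, keys 2 and 3 never get a read event because the log already reported them inside the window, while key 1 gets a read event followed by a later update.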

  43. Incremental Snapshotting
    Semantics
    ● No guarantee for snapshot (read) events for all records
    ● May receive update or delete without prior insert/read
    ● May receive read and update/delete
    ● What is guaranteed: complete data set after snapshot

  44. Incremental Snapshotting
    Comparison
    ● Can’t update filter list ✅
    ● Long-running snapshots can’t be paused/resumed ✅
    ● Can’t stream changes until snapshot completed ✅
    ● Can’t re-snapshot selected tables ✅

  45. Wrap-Up

  46. Takeaways
    ● CDC: SELECT, INSERT, UPDATE, DELETE… STREAM?
    ● Debezium: open-source CDC for a variety of databases
    ● Outlook: incrementally updated materialized views?

  47. Resources
    ● Debezium
      https://debezium.io/
    ● Incremental snapshotting
      https://debezium.io/blog/2021/10/07/incremental-snapshots/
    ● Outbox implementation
      https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
    ● Demo repo
      https://github.com/debezium/debezium-examples

  48. Q & A
    Thank You!
    @gunnarmorling

  49. Image Credits
    In Order of Appearance
    Unsplash https://unsplash.com/license
    © Pablo García Saldaña https://unsplash.com/photos/lPQIndZz8Mo
    © David Clode https://unsplash.com/photos/T49WTav4LgU
    © Aaron Burden https://unsplash.com/photos/GFpxQ2ZyNc0
    © Nathan Dumlao https://unsplash.com/photos/wQDysNUCKfw
    © mari lezhava https://unsplash.com/photos/q65bNe9fW-w
    © Michał Parzuchowski https://unsplash.com/photos/Bt0PM7cNJFQ
    © Charles Forerunner https://unsplash.com/photos/3fPXt37X6UQ
    Flickr
    Attribution 2.0 Generic https://creativecommons.org/licenses/by/2.0/
    © Thomas Kamann https://flic.kr/p/coa2c
    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    © Wall Boat https://flic.kr/p/Y6zkmX
    Attribution-ShareAlike 2.0 Generic https://creativecommons.org/licenses/by-sa/2.0/
    © Andrew Hart https://flic.kr/p/dmjkSk
    Attribution 2.0 Generic (CC BY 2.0) https://creativecommons.org/licenses/by/2.0/
    © Ryan https://flic.kr/p/8gwtzo