Debezium Snapshots Revisited! (Current '23)

Initial snapshots are a core feature of Debezium: when setting up a new CDC connector, existing tables can be scanned in order to export their full state to consumers, before starting to capture changes from the transaction log. While this works great in general, a few questions came up again and again in the Debezium community over time:

* How to re-snapshot just a single table?
* How to pause and resume long-running snapshots?
* How to run snapshots in parallel to reading changes from the log?

All this, and more, becomes possible with the notion of incremental snapshots. In this session you'll learn how this innovative scheme of interleaving snapshot queries and log-based change events works under the hood and how it solves common tasks when running CDC pipelines. We'll also discuss advanced topics like parallelizing snapshots and customizing snapshot contents.

Gunnar Morling

September 28, 2023

Transcript

  1. Image © Nicolas Buffler https://flic.kr/p/jpWcWD (CC BY 2.0)
    Debezium Snapshots Revisited!
    Gunnar Morling
    Senior Staff Software Engineer, Decodable
    @gunnarmorling

  2. #DebeziumSnapshotting @gunnarmorling
    Agenda

  3. #DebeziumSnapshotting @gunnarmorling
    Gunnar Morling
    ● Software engineer at Decodable
    ● Former project lead of Debezium
    ● kcctl 🧸, JfrUnit, ModiTect, MapStruct
    ● Spec Lead for Bean Validation 2.0
    ● Java Champion

  4. #DebeziumSnapshotting @gunnarmorling
    Recap – Debezium
    Log-Based Change Data Capture

  5. #DebeziumSnapshotting @gunnarmorling
    Snapshotting
    Why Is It Needed?
    ● Need to backfill data, but don’t have all TX logs
    ● Solution: scan data once before streaming
    ● Emit READ event for each record
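
As a rough illustration of the last bullet, the sketch below builds one simplified read event per scanned row. The field layout (`before`, `after`, `source`, `op`, `ts_ms`, with `op` set to `"r"`) loosely follows the Debezium change event envelope; the helper name `make_read_event`, the table name, and the sample rows are assumptions made up for this example, and real events carry more metadata.

```python
import time


def make_read_event(table, row):
    """Build a simplified, Debezium-style snapshot event for one row (hypothetical helper)."""
    return {
        "before": None,                      # a scanned row has no prior state
        "after": dict(row),                  # current state of the row
        "source": {"table": table, "snapshot": "true"},
        "op": "r",                           # "r" marks a read (snapshot) event
        "ts_ms": int(time.time() * 1000),
    }


# Hypothetical rows returned by the table scan.
rows = [{"id": 1, "color": "blue"}, {"id": 2, "color": "red"}]
events = [make_read_event("inventory.products", row) for row in rows]
```

Consumers can tell backfilled rows from live changes by checking the `op` field.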

  6. #DebeziumSnapshotting @gunnarmorling
    Snapshotting
    Classic Approach – General Idea
    ● Capture current position in transaction log
    ● Scan all relevant tables
    ● Start streaming
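
The three steps above can be sketched as a small in-memory simulation. This is not Debezium code: the transaction log is modeled as a plain Python list, the tables as a dict, and `classic_snapshot` is a hypothetical name. It only shows the ordering that the classic approach implies, namely that a change arriving during the scan is delivered after the scan has finished.

```python
def classic_snapshot(tables, tx_log):
    """Simulate the classic flow: remember the log position, scan, then stream."""
    start_position = len(tx_log)                  # 1. capture current log position
    for table_name, rows in tables.items():       # 2. scan all relevant tables
        for row in rows:
            yield {"op": "r", "table": table_name, "after": row}
    position = start_position                     # 3. start streaming from that position
    while position < len(tx_log):
        yield tx_log[position]
        position += 1


tables = {"customers": [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]}
# One entry that predates the snapshot and is therefore not streamed again.
tx_log = [{"op": "c", "table": "customers", "after": {"id": 1, "name": "Ann"}}]

emitted = []
for event in classic_snapshot(tables, tx_log):
    emitted.append(event)
    if len(emitted) == 1:
        # A write arrives while the scan is still running.
        tx_log.append({"op": "u", "table": "customers", "after": {"id": 2, "name": "Bobby"}})
```

In this run the update for customer 2 reaches the consumer only after both read events.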

  7. #DebeziumSnapshotting @gunnarmorling
    Snapshotting
    Key Configuration Options
    ● snapshot.mode (initial, never, schema_only_recovery)
    ● snapshot.select.statement.overrides
    ● snapshot.max.threads
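
To make these options more concrete, here is a minimal sketch of a Kafka Connect registration payload, written as a Python dict and serialized to JSON. The connector name, the table `inventory.orders`, the `WHERE` clause, and the thread count are assumptions for illustration only, and which options are supported depends on the connector and the Debezium version in use.

```python
import json

# Hypothetical connector registration; required connection settings are omitted.
connector_config = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        # Take an initial snapshot before streaming from the log.
        "snapshot.mode": "initial",
        # Tables whose snapshot query is customized ...
        "snapshot.select.statement.overrides": "inventory.orders",
        # ... and the SELECT used for each of them.
        "snapshot.select.statement.overrides.inventory.orders":
            "SELECT * FROM inventory.orders WHERE deleted = 0",
        # Number of threads used for the initial snapshot.
        "snapshot.max.threads": "4",
    },
}

payload = json.dumps(connector_config, indent=2)
```

The resulting `payload` string is what would be sent to the Kafka Connect REST API when registering the connector.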

  8. #DebeziumSnapshotting @gunnarmorling
    Snapshotting
    Limitations of Classic Approach
    ● Can’t update filter list

  9. #DebeziumSnapshotting @gunnarmorling
    Snapshotting
    Limitations of Classic Approach
    ● Can’t update filter list
    ● Can’t pause & resume long-running snapshots

  10. #DebeziumSnapshotting @gunnarmorling
    Snapshotting
    Limitations of Classic Approach
    ● Can’t update filter list
    ● Can’t pause & resume long-running snapshots
    ● Can’t stream changes until snapshot completed

  11. #DebeziumSnapshotting @gunnarmorling
    Snapshotting
    Limitations of Classic Approach
    ● Can’t update filter list
    ● Can’t pause & resume long-running snapshots
    ● Can’t stream changes until snapshot completed
    ● Can’t re-snapshot selected tables

  12. Incremental
    Snapshots
    © Karen Blaha https://flic.kr/p/aeuPys (CC BY-SA 2.0)

  13. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    The Paper
    ● “DBLog: A Watermark Based Change-Data-Capture Framework”, by Andreas Andreakis and Ioannis Papapanagiotou
    https://arxiv.org/pdf/2010.12597v1.pdf
    ● Key idea: interleave snapshot events and events from TX log

  14. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    General Idea

  15. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Windowing via Watermarks

  16. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Buffer Processing

  17. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Buffer Processing
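
The windowing and buffer processing slides carry diagrams only, so the following is a sketch of the watermark-based deduplication as described in the DBLog paper, not Debezium's actual implementation. It assumes one chunk of rows has been read between a low and a high watermark, and that the log events observed inside that window are available as a list. The function name `process_window` and the event shape are hypothetical.

```python
def process_window(chunk_rows, log_events_in_window, key="id"):
    """Deduplicate one snapshot chunk against the log events seen between
    the low and the high watermark, and return events in emission order."""
    # Buffer the chunk that was read from the table, indexed by primary key.
    buffer = {row[key]: row for row in chunk_rows}
    emitted = []
    for event in log_events_in_window:
        row = event["after"] if event["after"] is not None else event["before"]
        buffer.pop(row[key], None)   # the log event supersedes the buffered read
        emitted.append(event)        # log events pass through unchanged
    # High watermark reached: flush the remaining buffer as read events.
    emitted.extend({"op": "r", "before": None, "after": row} for row in buffer.values())
    return emitted


chunk = [{"id": 1, "color": "blue"}, {"id": 2, "color": "red"}, {"id": 3, "color": "green"}]
window_events = [
    {"op": "u", "before": {"id": 2, "color": "red"}, "after": {"id": 2, "color": "pink"}},
    {"op": "d", "before": {"id": 3, "color": "green"}, "after": None},
]
result = process_window(chunk, window_events)
```

Here only the row with id 1 is emitted as a read event, because rows 2 and 3 were changed inside the window and their buffered copies may be stale.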

  18. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Semantics
    ● No guarantee for snapshot (read) events for all records
    ● May receive update or delete without prior insert/read
    ● May receive read and update/delete
    ● What is guaranteed: complete data set after snapshot
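
One practical consequence of these semantics is that a consumer should not rely on seeing a read or insert before an update or delete. A minimal sketch of a consumer that tolerates this, assuming the simplified event shape used here (`op`, `before`, `after`) and a hypothetical `apply_event` helper, treats every non-delete event as an upsert:

```python
def apply_event(state, event, key="id"):
    """Apply one change event to an in-memory materialized view."""
    if event["op"] == "d":
        # Tolerate deletes for keys that were never seen before.
        state.pop(event["before"][key], None)
    else:
        # Create, read and update events are all applied as upserts.
        row = event["after"]
        state[row[key]] = row
    return state


events = [
    {"op": "u", "before": {"id": 2, "color": "red"}, "after": {"id": 2, "color": "green"}},  # no prior read
    {"op": "r", "before": None, "after": {"id": 1, "color": "blue"}},
    {"op": "r", "before": None, "after": {"id": 3, "color": "grey"}},
    {"op": "d", "before": {"id": 3, "color": "grey"}, "after": None},   # read followed by delete
    {"op": "d", "before": {"id": 9, "color": "black"}, "after": None},  # delete for an unseen key
]

state = {}
for event in events:
    apply_event(state, event)
```

After all events are applied, `state` holds the complete current data set, which is the guarantee named in the last bullet.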

  19. Demo
    © Wall Boat https://flic.kr/p/Y6zkmX (Public Domain)

  20. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Connector Offsets

  21. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    MySQL Read-Only Snapshots
    ● Write access to the DB may not be desirable

  22. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Signalling Channels
    ● Database table
    ● Kafka topic
    ● JMX
    ● Custom
    id 924e3ff8-2245-43ca-ba77-2af9af02fa07
    type log, {execute|pause|resume|stop}-snapshot
    value { "data-collections": ["schema1.table1", "schema2.table2"],
    "type":"incremental",
    "additional-condition":"color=blue" }
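
For the database table channel, sending a signal amounts to inserting a row with the three fields shown above. The sketch below uses an in-memory SQLite database as a stand-in for the source database. The table name `debezium_signal` and the column names `id`, `type`, and `data` are assumptions for this example (the slide labels the payload field `value`), and `build_signal` is a hypothetical helper.

```python
import json
import sqlite3
import uuid


def build_signal(signal_type, tables, additional_condition=None):
    """Return an (id, type, data) tuple for an incremental snapshot signal."""
    data = {"data-collections": list(tables), "type": "incremental"}
    if additional_condition is not None:
        data["additional-condition"] = additional_condition
    return (str(uuid.uuid4()), signal_type, json.dumps(data))


# SQLite stands in for the captured database; the table layout is assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debezium_signal (id TEXT PRIMARY KEY, type TEXT, data TEXT)")

signal = build_signal("execute-snapshot", ["schema1.table1", "schema2.table2"], "color=blue")
conn.execute("INSERT INTO debezium_signal (id, type, data) VALUES (?, ?, ?)", signal)
conn.commit()

stored = conn.execute("SELECT type, data FROM debezium_signal").fetchone()
```

Pausing, resuming, or stopping a snapshot would follow the same pattern with `pause-snapshot`, `resume-snapshot`, or `stop-snapshot` as the type.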

  23. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Notifications

  24. #Debezium + #ApacheFlink | @gunnarmorling
    Comparison

  25. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Benefits
    ● Can update filter list ✅

  26. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Benefits
    ● Can update filter list ✅
    ● Long-running snapshots can be paused/resumed ✅

  27. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Benefits
    ● Can update filter list ✅
    ● Long-running snapshots can be paused/resumed ✅
    ● Can stream changes before snapshot completed ✅

  28. #DebeziumSnapshotting @gunnarmorling
    Incremental Snapshotting
    Benefits
    ● Can update filter list ✅
    ● Long-running snapshots can be paused/resumed ✅
    ● Can stream changes before snapshot completed ✅
    ● Can re-snapshot selected tables ✅

  29. #DebeziumSnapshotting @gunnarmorling
    Resources
    ● Incremental Snapshots in Debezium
    https://debezium.io/blog/2021/10/07/incremental-snapshots/
    ● Read-only Incremental Snapshots for MySQL
    https://debezium.io/blog/2022/04/07/read-only-incremental-snapshots/
    ● Flink CDC
    https://ververica.github.io/flink-cdc-connectors/

  30. #DebeziumSnapshotting @gunnarmorling
    Upcoming
    ● Debezium & Kafka Connect – Ask the Experts
    With Chris Cranford (Red Hat) and Chris Egerton (Aiven)
    Sep 27, 2:30 PM
    ● Change Stream Processing with Debezium and Apache Flink
    With Robert Metzger (Decodable)
    Sep 27, 5:30 PM, Dremio Office
    https://www.meetup.com/sf-big-analytics/events/294068331/

  31. #DebeziumSnapshotting @gunnarmorling
    Q & A
    [email protected]
    @gunnarmorling
    📧
    Thank You!
