Debezium Snapshots Revisited! (Current '23)

Initial snapshots are a core feature of Debezium: when setting up a new CDC connector, existing tables can be scanned in order to export their full state to consumers, before starting to capture changes from the transaction log. While this works great in general, a few questions came up again and again in the Debezium community over time:

* How to re-snapshot just a single table?
* How to pause and resume long-running snapshots?
* How to run snapshots in parallel to reading changes from the log?

All this, and more, becomes possible with the notion of incremental snapshots. In this session you'll learn how this innovative scheme of interleaving snapshot queries and log-based change events works under the hood and how it solves common tasks when running CDC pipelines. We'll also discuss advanced topics like parallelizing snapshots and customizing snapshot contents.
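
The interleaving idea can be illustrated with a small simulation. This is only a minimal sketch, loosely following the watermark scheme of the DBLog paper referenced in the slides, and not Debezium's actual implementation. The function name, the event dictionaries, and the `low_watermark`/`high_watermark` markers are assumptions made for illustration.

```python
def interleave_chunk(log_events, chunk_rows):
    """Merge one snapshot chunk into a stream of log events.

    log_events: list of dicts with an "op" field. Regular changes also carry
                "key" and "value"; the hypothetical markers "low_watermark"
                and "high_watermark" delimit the window in which the chunk
                was read from the table.
    chunk_rows: dict mapping primary key -> row value, as returned by the
                snapshot query for this chunk.
    """
    output = []
    buffer = {}
    window_open = False
    for event in log_events:
        if event["op"] == "low_watermark":
            # The chunk was read after this point in the log.
            window_open = True
            buffer = dict(chunk_rows)
        elif event["op"] == "high_watermark":
            # Rows not touched by any log event in the window are still
            # current, so they are emitted as read events.
            for key, value in buffer.items():
                output.append({"op": "r", "key": key, "value": value})
            buffer = {}
            window_open = False
        else:
            if window_open:
                # The log event is newer than the chunk row: drop the row.
                buffer.pop(event["key"], None)
            output.append(event)
    return output
```

In this sketch a consumer may see an update or delete for a key without a preceding read event, since the stale chunk row is discarded, yet the complete data set is available once all chunks have been processed.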

Gunnar Morling

September 28, 2023

Transcript

  1. Debezium Snapshots Revisited!
    * Gunnar Morling, Senior Staff Software Engineer, Decodable, @gunnarmorling
    * Image © Nicolas Buffler https://flic.kr/p/jpWcWD (CC BY 2.0)
  2. #DebeziumSnapshotting @gunnarmorling Gunnar Morling
    * Software engineer at Decodable
    * Former project lead of Debezium
    * kcctl 🧸, JfrUnit, ModiTect, MapStruct
    * Spec Lead for Bean Validation 2.0
    * Java Champion
  3. Snapshotting: Why Is It Needed?
    * Need to backfill data, but don’t have all TX logs
    * Solution: scan data once before streaming
    * Emit READ event for each record
  4. Snapshotting: Classic Approach – General Idea
    * Capture current position in transaction log
    * Scan all relevant tables
    * Start streaming
  5. Snapshotting: Key Configuration Options
    * snapshot.mode (initial, never, schema_only_recovery)
    * snapshot.select.statement.overrides
    * snapshot.max.threads
  6. Snapshotting: Limitations of Classic Approach
    * Can’t update filter list
    * Can’t pause & resume long-running snapshots
    * Can’t stream changes until snapshot completed
  7. Snapshotting: Limitations of Classic Approach
    * Can’t update filter list
    * Can’t pause & resume long-running snapshots
    * Can’t stream changes until snapshot completed
    * Can’t re-snapshot selected tables
  8. Incremental Snapshotting: The Paper
    * “DBLog: A Watermark Based Change-Data-Capture Framework”, by Andreas Andreakis and Ioannis Papapanagiotou, https://arxiv.org/pdf/2010.12597v1.pdf
    * Key idea: interleave snapshot events and events from TX log
  9. Incremental Snapshotting: Semantics
    * No guarantee for snapshot (read) events for all records
    * May receive update or delete without prior insert/read
    * May receive read and update/delete
    * What is guaranteed: complete data set after snapshot
  10. Incremental Snapshotting: Signalling Channels
    * Database table
    * Kafka topic
    * JMX
    * Custom
    * Example signal:
      id: 924e3ff8-2245-43ca-ba77-2af9af02fa07
      type: log, {execute|pause|resume|stop}-snapshot
      value: { "data-collections": ["schema1.table1", "schema2.table2"], "type":"incremental", "additional-condition":"color=blue" }
  11. Incremental Snapshotting: Benefits
    * Can update filter list ✅
    * Long-running snapshots can be paused/resumed ✅
    * Can stream changes before snapshot completed ✅
  12. Incremental Snapshotting: Benefits
    * Can update filter list ✅
    * Long-running snapshots can be paused/resumed ✅
    * Can stream changes before snapshot completed ✅
    * Can re-snapshot selected tables ✅
  13. Resources
    * Incremental Snapshots in Debezium https://debezium.io/blog/2021/10/07/incremental-snapshots/
    * Read-only Incremental Snapshots for MySQL https://debezium.io/blog/2022/04/07/read-only-incremental-snapshots/
    * Flink CDC https://ververica.github.io/flink-cdc-connectors/
  14. Upcoming
    * Debezium & Kafka Connect – Ask the Experts, with Chris Cranford (Red Hat) and Chris Egerton (Aiven), Sep 27, 2:30 PM
    * Change Stream Processing with Debezium and Apache Flink, with Robert Metzger (Decodable), Sep 27, 5:30 PM, Dremio Office https://www.meetup.com/sf-big-analytics/events/294068331/
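
The signal format shown on the "Signalling Channels" slide can be assembled programmatically. The following is a minimal sketch, assuming a signal made of three fields named `id`, `type`, and `data` (the slide labels the last one "value"). The helper `build_snapshot_signal` is hypothetical and not part of Debezium, and actually delivering the signal (inserting a row into the signalling table, or producing a message to the Kafka topic) is left out.

```python
import json
import uuid

# Snapshot actions listed on the slide: {execute|pause|resume|stop}-snapshot
ALLOWED_ACTIONS = ("execute", "pause", "resume", "stop")


def build_snapshot_signal(action, data_collections=None, additional_condition=None):
    """Build a signal record for an incremental snapshot (hypothetical helper).

    action: one of "execute", "pause", "resume", "stop".
    data_collections: optional list of table names, e.g. ["schema1.table1"].
    additional_condition: optional filter expression, e.g. "color=blue".
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")

    payload = {"type": "incremental"}
    if data_collections:
        payload["data-collections"] = list(data_collections)
    if additional_condition is not None:
        payload["additional-condition"] = additional_condition

    return {
        "id": str(uuid.uuid4()),      # arbitrary unique identifier
        "type": f"{action}-snapshot",
        "data": json.dumps(payload),  # payload serialized as a JSON string
    }
```

For example, `build_snapshot_signal("execute", ["schema1.table1", "schema2.table2"], "color=blue")` yields a record equivalent to the one on the slide, apart from the randomly generated id.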