Datafold - BDE

OS data diff Verifying data across heterogeneous databases Gerard Toonstra
15 Sep 2022

Open source data diff “Open Source Data Diff is a
utility / library that helps you ﬁnd different rows across tables in different databases”

Use cases 1. Here are some common use cases: a.
Copying data into your data warehouse through ETL tooling b. Migrating your database or warehouse to a different vendor c. Debugging complex data pipelines across multiple storage systems d. Detecting hard deletes e. Maintaining data integrity SLO’s

Data warehousing & ETL Veriﬁcation

Migrating a data warehouse (now) (future) ELT ELT Veriﬁcation

Debugging complex ETL Verification Verification Verification

Detecting hard deletes Find hard deletes ELT: “SELECT .. WHERE
updated_at > 2022-10-14” “DELETE FROM x WHERE id = 500;”

Maintain data integrity SLO’s 0.01% different records found

Supported databases

Design goals I • High performance and algorithmic efﬁciency ◦
Takes advantage of parallelization ◦ Avoids full table scans ◦ Optimized for few differences in rows ◦ 25 million rows in ~10 seconds, 1 billion rows in ~2 minutes • Supports many data storage vendors ◦ Engine must support efﬁcient hashing

Design goals II • Usable as CLI and library •
Entirely available as open source, e.g. no special “Enterprise” version or other frills • Conﬁgurable options

The algorithm “Divide and Conquer”

High Level Overview replication Postgres Lots of rows …. ….
… … … … … … … Snowflake Maybe all the same rows (or maybe not) …. …. … … … … … …

Divide the work into segments Postgres id=1..100k id=100k..200k id=200k..300k id=300k..400k
id=400k..500k id=500k..600k … … … id=900k..1M Snowflake id=1..100k id=100k..200k id=200k..300k id=300k..400k id=400k..500k id=500k..600k … … … id=900k..1M

Per segment SELECT count(*), sum(cast(conv(substring(md5(concat(id, updated_at)), 18), 16, 10) as
unsigned)) FROM `some_table` WHERE (id >= 1) AND (id < 100000) • Concatenate columns of interest: here id and updated_at • Calculate MD5 over each concatenation of each row • Preserve the ﬁrst 18 characters of the calculated hash • Convert it from a 16-hexadecimal to 10-base decimal system • Cast the value for each row to an unsigned integer • Sum all integers of the entire segment

Compare segments Postgres checksum=0102 checksum=ffff checksum=abab checksum=f0f0 …. …. ….
…. …. checksum=9494 Snowflake checksum=0102 checksum=aaab checksum=abab checksum=f0f0 …. …. …. …. …. checksum=9494 mismatch!

Divide no. 2: Split this up again Postgres id=100k..110k id=110k..120k
id=120k..130k id=130k..140k id=140k..150k id=150k..160k … … … id=190..200k Snowflake id=100k..110k id=110k..120k id=120k..130k id=130k..140k id=140k..150k id=150k..160k … … … id=190..200k When the segment is small enough, compare individual rows

Output individual row • The ID and the updated_at timestamp
for every row that is different. ◦ (122001, 1653672821) • With ‘--stats’ option, also the % of rows that are different

Cool features • You can run in interactive mode •
You can see the execution plan for each query • Full control over the bisection parameters • Compare rows against any set of columns you want with the “--column” parameter • Extra statistics with “--stats” • Ignore recent rows: “--min-age” and “--max-age”

Where to get it • Open source data diff, get
it here: ◦ https://github.com/datafold/data-diff • We are in the #tools-data-diff Locally Optimistic Slack: ◦ https://locallyoptimistic.com/community/ • Open an issue: ◦ https://github.com/datafold/data-diff/issues

Recap • Why we built open source datadiff, the main
use cases • Design goals of the tool • How the algorithm works • Where to download, get involved and ask further questions

Thank you for your attention! Gerard Toonstra: [email protected] We are
hiring! https://www.datafold.com/careers Interested in our product? [email protected]

Datafold - BDE

Datafold - BDE

Marketing OGZ PRO

More Decks by Marketing OGZ

Featured

Transcript

OS data diff Verifying data across heterogeneous databases Gerard Toonstra

Open source data diff “Open Source Data Diff is a

Use cases 1. Here are some common use cases: a.

Data warehousing & ETL Veriﬁcation

Migrating a data warehouse (now) (future) ELT ELT Veriﬁcation

Debugging complex ETL Verification Verification Verification

Detecting hard deletes Find hard deletes ELT: “SELECT .. WHERE

Maintain data integrity SLO’s 0.01% different records found

Supported databases

Design goals I • High performance and algorithmic efﬁciency ◦

Design goals II • Usable as CLI and library •

The algorithm “Divide and Conquer”

High Level Overview replication Postgres Lots of rows …. ….

Divide the work into segments Postgres id=1..100k id=100k..200k id=200k..300k id=300k..400k

Per segment SELECT count(*), sum(cast(conv(substring(md5(concat(id, updated_at)), 18), 16, 10) as

Compare segments Postgres checksum=0102 checksum=ffff checksum=abab checksum=f0f0 …. …. ….

Divide no. 2: Split this up again Postgres id=100k..110k id=110k..120k

Output individual row • The ID and the updated_at timestamp

Cool features • You can run in interactive mode •

Where to get it • Open source data diff, get

Recap • Why we built open source datadiff, the main

Thank you for your attention! Gerard Toonstra: [email protected] We are