Slide 1

OS data diff: Verifying data across heterogeneous databases
Gerard Toonstra, 15 Sep 2022

Slide 2

Open source data diff
“Open source data diff is a utility / library that helps you find differing rows across tables in different databases”

Slide 3

Use cases
Here are some common use cases:
a. Copying data into your data warehouse through ETL tooling
b. Migrating your database or warehouse to a different vendor
c. Debugging complex data pipelines across multiple storage systems
d. Detecting hard deletes
e. Maintaining data integrity SLOs

Slide 4

Data warehousing & ETL Verification

Slide 5

Migrating a data warehouse: verification between the current (now) ELT pipeline and the future ELT pipeline

Slide 6

Debugging complex ETL: verification at multiple stages of the pipeline

Slide 7

Detecting hard deletes
Incremental ELT only picks up changed rows: “SELECT .. WHERE updated_at > 2022-10-14”
A hard delete on the source leaves no updated row behind, so it goes unnoticed: “DELETE FROM x WHERE id = 500;”
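Because an incremental filter on updated_at never sees deleted rows, catching a hard delete requires comparing what exists on both sides. Below is a minimal sketch of that idea, assuming DB-API cursors, an id primary key, and hypothetical helper names; data-diff itself avoids pulling full key lists by comparing per-segment checksums instead.

```python
# Sketch: find ids that still exist in the warehouse but were hard-deleted
# from the source. fetch_ids() and find_hard_deletes() are illustrative
# helpers, not part of data-diff.

def fetch_ids(cursor, table):
    cursor.execute(f"SELECT id FROM {table}")
    return {row[0] for row in cursor.fetchall()}

def find_hard_deletes(source_cursor, warehouse_cursor, table):
    source_ids = fetch_ids(source_cursor, table)
    warehouse_ids = fetch_ids(warehouse_cursor, table)
    # Deleted upstream, still present downstream.
    return warehouse_ids - source_ids
```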

Slide 8

Maintaining data integrity SLOs
e.g. 0.01% different records found

Slide 9

Supported databases

Slide 10

Design goals I
● High performance and algorithmic efficiency
○ Takes advantage of parallelization (see the sketch below)
○ Avoids full table scans
○ Optimized for tables with few differing rows
○ 25 million rows in ~10 seconds, 1 billion rows in ~2 minutes
● Supports many data storage vendors
○ Engine must support efficient hashing
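As a small illustration of the parallelization point, segments can be checksummed concurrently. checksum_fn is an assumed callable that runs one aggregate query per key range (see the per-segment SQL later in the deck); this is a sketch of the idea, not the tool's internals.

```python
from concurrent.futures import ThreadPoolExecutor

def checksum_segments(checksum_fn, boundaries, max_workers=8):
    """Run checksum_fn(lo, hi) for every [lo, hi) key range in parallel."""
    segments = list(zip(boundaries, boundaries[1:]))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each call issues one aggregate query and returns a single checksum,
        # so only tiny results travel over the wire.
        return list(pool.map(lambda seg: checksum_fn(*seg), segments))
```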

Slide 11

Design goals II
● Usable as CLI and library (see the sketch below)
● Entirely available as open source, e.g. no special “Enterprise” version or other frills
● Configurable options
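Since the tool is meant to work as a library as well as a CLI, usage from Python looks roughly like the sketch below. The connect_to_table and diff_tables names follow the project's README at the time and should be treated as assumptions; check the repository for the current API.

```python
from data_diff import connect_to_table, diff_tables

# Connection strings, table name and key column are placeholders.
source = connect_to_table("postgresql:///", "some_table", "id")
target = connect_to_table("snowflake://<account>/<db>/<schema>", "some_table", "id")

# Yields ('+', row) or ('-', row) for rows present on only one side
# or differing between the two tables.
for sign, row in diff_tables(source, target):
    print(sign, row)
```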

Slide 12

The algorithm “Divide and Conquer”

Slide 13

High-level overview
Postgres (lots of rows) is replicated into Snowflake (maybe all the same rows, or maybe not)

Slide 14

Divide the work into segments
Both Postgres and Snowflake are split into the same key ranges: id=1..100k, 100k..200k, 200k..300k, …, 900k..1M
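A minimal sketch of deriving such equal-width segments from the key range; the real tool computes boundaries from the min/max key and its bisection factor, so details may differ.

```python
def segment_boundaries(min_key, max_key, n_segments):
    """Boundary keys for n_segments half-open ranges covering [min_key, max_key]."""
    span = max_key - min_key + 1
    width = -(-span // n_segments)  # ceiling division
    bounds = [min_key + i * width for i in range(n_segments)]
    bounds.append(max_key + 1)
    return bounds

# Ten segments over ids 1..1,000,000: [1, 100001, 200001, ..., 1000001]
print(segment_boundaries(1, 1_000_000, 10))
```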

Slide 15

Per segment
SELECT count(*), sum(cast(conv(substring(md5(concat(id, updated_at)), 18), 16, 10) as unsigned)) FROM `some_table` WHERE (id >= 1) AND (id < 100000)
● Concatenate the columns of interest: here id and updated_at
● Calculate an MD5 hash over the concatenation for each row
● Keep the tail of the hash, from character 18 onward, so it fits in a 64-bit unsigned integer
● Convert it from base 16 (hex) to base 10 (decimal)
● Cast the value for each row to an unsigned integer
● Sum the integers over the entire segment
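For intuition, the arithmetic in the query above can be mirrored in Python over rows fetched from one segment. This is only a sketch of the checksum idea: in practice the database computes it server-side, and the exact string formatting of updated_at may differ from the database's CONCAT.

```python
import hashlib

def row_hash(id_value, updated_at):
    """md5(concat(id, updated_at)), keep hex chars 18..32, read as an unsigned int."""
    digest = hashlib.md5(f"{id_value}{updated_at}".encode()).hexdigest()
    return int(digest[17:], 16)  # last 15 hex chars, fits in 64 bits

def segment_checksum(rows):
    """rows: iterable of (id, updated_at) tuples belonging to one key segment."""
    return sum(row_hash(i, u) for i, u in rows)

# The sum is order-independent, so equal row sets give equal checksums.
assert segment_checksum([(1, "2022-09-15"), (2, "2022-09-14")]) == \
       segment_checksum([(2, "2022-09-14"), (1, "2022-09-15")])
```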

Slide 16

Compare segments
Postgres checksums: 0102, ffff, abab, f0f0, …, 9494
Snowflake checksums: 0102, aaab, abab, f0f0, …, 9494
Mismatch on the second segment: ffff vs aaab!

Slide 17

Divide no. 2: split the mismatching segment up again
Postgres: id=100k..110k, 110k..120k, 120k..130k, …, 190k..200k
Snowflake: id=100k..110k, 110k..120k, 120k..130k, …, 190k..200k
When a segment is small enough, compare individual rows (see the sketch below)
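Putting the steps together, the divide-and-conquer loop can be sketched as below. checksum and fetch_rows are assumed helpers standing in for the per-segment query and a plain row fetch, and the default values here are illustrative, not the tool's defaults.

```python
def diff_segment(src, dst, lo, hi, checksum, fetch_rows,
                 bisection_factor=10, bisection_threshold=10_000):
    """Yield ('+'/'-', row) for rows differing between src and dst in key range [lo, hi)."""
    if checksum(src, lo, hi) == checksum(dst, lo, hi):
        return  # identical segment: nothing to do
    if hi - lo <= bisection_threshold:
        # Small enough: download the rows and compare them individually.
        src_rows = set(fetch_rows(src, lo, hi))
        dst_rows = set(fetch_rows(dst, lo, hi))
        yield from (("-", row) for row in src_rows - dst_rows)
        yield from (("+", row) for row in dst_rows - src_rows)
        return
    # Otherwise split the mismatching segment and recurse into each part.
    step = max(1, (hi - lo) // bisection_factor)
    for start in range(lo, hi, step):
        yield from diff_segment(src, dst, start, min(start + step, hi),
                                checksum, fetch_rows,
                                bisection_factor, bisection_threshold)
```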

Slide 18

Output: individual rows
● The id and the updated_at timestamp for every row that is different
○ (122001, 1653672821)
● With the ‘--stats’ option, also the % of rows that are different

Slide 19

Cool features
● You can run in interactive mode
● You can see the execution plan for each query
● Full control over the bisection parameters
● Compare rows against any set of columns you want with the “--column” parameter
● Extra statistics with “--stats”
● Ignore recent rows: “--min-age” and “--max-age”

Slide 20

Where to get it
● Open source data diff, get it here:
○ https://github.com/datafold/data-diff
● We are in the #tools-data-diff channel of the Locally Optimistic Slack:
○ https://locallyoptimistic.com/community/
● Open an issue:
○ https://github.com/datafold/data-diff/issues

Slide 21

Recap
● Why we built open source data diff, and the main use cases
● Design goals of the tool
● How the algorithm works
● Where to download it, get involved, and ask further questions

Slide 22

Thank you for your attention!
Gerard Toonstra: [email protected]
We are hiring! https://www.datafold.com/careers
Interested in our product? [email protected]