Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Datafold - BDE

Marketing OGZ
September 22, 2022
95

Datafold - BDE

Marketing OGZ

September 22, 2022
Tweet

Transcript

  1. Open source data diff “Open Source Data Diff is a

    utility / library that helps you find different rows across tables in different databases”
  2. Use cases 1. Here are some common use cases: a.

    Copying data into your data warehouse through ETL tooling b. Migrating your database or warehouse to a different vendor c. Debugging complex data pipelines across multiple storage systems d. Detecting hard deletes e. Maintaining data integrity SLO’s
  3. Detecting hard deletes Find hard deletes ELT: “SELECT .. WHERE

    updated_at > 2022-10-14” “DELETE FROM x WHERE id = 500;”
  4. Design goals I • High performance and algorithmic efficiency ◦

    Takes advantage of parallelization ◦ Avoids full table scans ◦ Optimized for few differences in rows ◦ 25 million rows in ~10 seconds, 1 billion rows in ~2 minutes • Supports many data storage vendors ◦ Engine must support efficient hashing
  5. Design goals II • Usable as CLI and library •

    Entirely available as open source, e.g. no special “Enterprise” version or other frills • Configurable options
  6. High Level Overview replication Postgres Lots of rows …. ….

    … … … … … … … Snowflake Maybe all the same rows (or maybe not) …. …. … … … … … …
  7. Divide the work into segments Postgres id=1..100k id=100k..200k id=200k..300k id=300k..400k

    id=400k..500k id=500k..600k … … … id=900k..1M Snowflake id=1..100k id=100k..200k id=200k..300k id=300k..400k id=400k..500k id=500k..600k … … … id=900k..1M
  8. Per segment SELECT count(*), sum(cast(conv(substring(md5(concat(id, updated_at)), 18), 16, 10) as

    unsigned)) FROM `some_table` WHERE (id >= 1) AND (id < 100000) • Concatenate columns of interest: here id and updated_at • Calculate MD5 over each concatenation of each row • Preserve the first 18 characters of the calculated hash • Convert it from a 16-hexadecimal to 10-base decimal system • Cast the value for each row to an unsigned integer • Sum all integers of the entire segment
  9. Compare segments Postgres checksum=0102 checksum=ffff checksum=abab checksum=f0f0 …. …. ….

    …. …. checksum=9494 Snowflake checksum=0102 checksum=aaab checksum=abab checksum=f0f0 …. …. …. …. …. checksum=9494 mismatch!
  10. Divide no. 2: Split this up again Postgres id=100k..110k id=110k..120k

    id=120k..130k id=130k..140k id=140k..150k id=150k..160k … … … id=190..200k Snowflake id=100k..110k id=110k..120k id=120k..130k id=130k..140k id=140k..150k id=150k..160k … … … id=190..200k When the segment is small enough, compare individual rows
  11. Output individual row • The ID and the updated_at timestamp

    for every row that is different. ◦ (122001, 1653672821) • With ‘--stats’ option, also the % of rows that are different
  12. Cool features • You can run in interactive mode •

    You can see the execution plan for each query • Full control over the bisection parameters • Compare rows against any set of columns you want with the “--column” parameter • Extra statistics with “--stats” • Ignore recent rows: “--min-age” and “--max-age”
  13. Where to get it • Open source data diff, get

    it here: ◦ https://github.com/datafold/data-diff • We are in the #tools-data-diff Locally Optimistic Slack: ◦ https://locallyoptimistic.com/community/ • Open an issue: ◦ https://github.com/datafold/data-diff/issues
  14. Recap • Why we built open source datadiff, the main

    use cases • Design goals of the tool • How the algorithm works • Where to download, get involved and ask further questions
  15. Thank you for your attention! Gerard Toonstra: [email protected] We are

    hiring! https://www.datafold.com/careers Interested in our product? [email protected]