Performing Data Quality at Scale

Marketing OGZ
September 20, 2022

Transcript

  1. Table of contents

    01 Um, what’s data quality?
    02 A look back in time
    03 Data quality today
    04 Data quality at scale with Soda Core
    2022-09-15
  2. Um, what’s data quality? 6 key dimensions

    • Completeness: Do I have all the essential information?
    • Accuracy: Does my data match the real world?
    • Consistency: Is my data the same across all instances?
    • Timeliness: Is my data fresh enough?
    • Validity: Does my data align with the requirements?
    • Uniqueness: Do I have any duplicates?
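
Some of the dimensions above can be turned into simple ratios. The following is a minimal sketch, not taken from the talk: the sample rows, column names, and thresholds are all hypothetical. Accuracy and consistency are left out because they need an external reference (the real world, or a second copy of the data) to compare against.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows; the column names are illustrative only.
ROWS = [
    {"id": 1, "email": "a@example.com", "age": 34, "updated_at": datetime(2022, 9, 15, 8, 0)},
    {"id": 2, "email": None, "age": 27, "updated_at": datetime(2022, 9, 15, 9, 0)},
    {"id": 2, "email": "c@example.com", "age": -5, "updated_at": datetime(2022, 9, 10, 9, 0)},
]


def completeness(rows, column):
    """Share of rows where `column` is not missing."""
    return sum(r[column] is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among all values of `column`."""
    values = [r[column] for r in rows]
    return len(set(values)) / len(values)


def validity(rows, column, is_valid):
    """Share of rows whose value satisfies the `is_valid` rule."""
    return sum(bool(is_valid(r[column])) for r in rows) / len(rows)


def is_timely(rows, column, now, max_age):
    """True when the newest value of `column` is no older than `max_age`."""
    newest = max(r[column] for r in rows)
    return now - newest <= max_age


now = datetime(2022, 9, 15, 12, 0)  # fixed "current time" so the sketch is reproducible
report = {
    "completeness(email)": completeness(ROWS, "email"),
    "uniqueness(id)": uniqueness(ROWS, "id"),
    "validity(age)": validity(ROWS, "age", lambda age: 0 <= age <= 120),
    "timely(updated_at)": is_timely(ROWS, "updated_at", now, timedelta(days=1)),
}
print(report)
```

Each function returns a number or a boolean that can be compared against a threshold, which is the same "metric plus threshold" shape that the tools discussed later in the deck build on.
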
  3. Um, what’s data quality? The end goal

    Ensuring that we can trust the data that we’re consuming.
  4. A look back in time: Limited and costly options

    A difficult-to-scale, brute-force approach was the only available option:
    • On data lakes: write dedicated tasks/jobs that cleanse the data and run data quality checks
    • On data warehouses: write and maintain multiple SQL queries that run on the input tables to perform the necessary checks
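
To make the brute-force approach concrete, here is a minimal sketch of hand-written SQL checks, using an in-memory SQLite database as a stand-in for a warehouse. The `orders` table, its columns, and the three checks are hypothetical; the point is only that every new table or rule means another query to write and maintain.

```python
import sqlite3

# In-memory SQLite database standing in for a data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, NULL, 40.0), (2, 11, -3.0);
    """
)

# One hand-written query per rule; each returns the number of offending rows or groups.
CHECKS = {
    "missing_customer_id": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    "duplicate_order_id": (
        "SELECT COUNT(*) FROM ("
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
        ")"
    ),
    "negative_amount": "SELECT COUNT(*) FROM orders WHERE amount < 0",
}


def run_checks(connection, checks):
    """Run every query and return {check_name: offending_count}."""
    return {name: connection.execute(sql).fetchone()[0] for name, sql in checks.items()}


results = run_checks(conn, CHECKS)
failed = [name for name, count in results.items() if count > 0]
print(results)
print("failed checks:", failed)
```
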
  5. Data quality today: Data quality is top of mind

    • Data is a first-class citizen everywhere
    • Rise of SaaS data observability platforms
    • Open-source projects that drastically shorten the data quality path
  6. Data quality today: Why open-source?

    • Smaller investment to reach data quality standards
    • Flexibility and extensibility
    • Integrated within version control systems (clearer versioning)
    • No vendor lock-in
    • Managed/owned by the engineering team
  7. Data quality today: Great Expectations

    • Python library
    • Expectations: Python functions
    • Data profiling capabilities
    • Built-in data documentation
  8. Data quality today: Soda Core

    • Python library
    • Checks defined in YAML
    • Soda Checks Language (SodaCL)
    • Convenient CLI commands
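
As an illustration of checks defined in YAML, the sketch below embeds a small SodaCL document in a Python string and passes it to Soda Core's programmatic `Scan` interface. The `orders` dataset, its columns, the data source name, and the configuration path are hypothetical, and the snippet assumes the `soda-core` package plus a connector for your data source are installed; treat it as a sketch under those assumptions rather than a reference setup.

```python
# SodaCL checks as a YAML string. The dataset `orders` and its columns are hypothetical.
SODACL_CHECKS = """
checks for orders:
  - row_count > 0
  - missing_count(customer_id) = 0
  - duplicate_count(order_id) = 0
  - freshness(created_at) < 1d
"""


def run_scan(data_source_name, configuration_path, checks=SODACL_CHECKS):
    """Run a Soda Core scan and raise an AssertionError if any check fails."""
    # Imported lazily so this module still loads where soda-core is not installed.
    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name(data_source_name)                      # name defined in the configuration file
    scan.add_configuration_yaml_file(file_path=configuration_path)  # connection details
    scan.add_sodacl_yaml_str(checks)                                 # the checks to evaluate
    scan.execute()
    scan.assert_no_checks_fail()
    return scan
```

With the same checks saved to a file, the CLI form would be along the lines of `soda scan -d my_datasource -c configuration.yml checks.yml`, where `my_datasource` and both file names are placeholders.
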
  9. Data quality today: Why Soda?

    • The flexibility and scalability of SodaCL
    • YAML and SQL > Python
    • Superior metrics capabilities
  10. Data quality at scale with Soda Core: How (and when) are we validating?

    • Perform the data validation as soon as possible within our DAGs
    • Leverage the integrations with Airflow, Prefect, and Dagster
    • Run the data quality checks as dedicated tasks
    • Use the Python library or the soda scan CLI command
    • Push the generated metrics into the data warehouse (or use Soda Cloud)
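
One way to run the checks as a dedicated task is to wrap the `soda scan` CLI call in a function that raises on failure, so the orchestrator marks the task as failed and downstream tasks do not consume bad data. This is a minimal, orchestrator-agnostic sketch: the function names, data source name, and file names are hypothetical, and it assumes that `soda scan` exits with a non-zero code when the scan does not pass cleanly. The `runner` parameter exists only so the function can be exercised without Soda installed.

```python
import subprocess


def soda_scan_command(data_source, configuration_file, checks_file):
    """Build the argument list for `soda scan -d <data source> -c <config> <checks>`."""
    return ["soda", "scan", "-d", data_source, "-c", configuration_file, checks_file]


def validate(data_source, configuration_file, checks_file, runner=subprocess.run):
    """Run the scan and raise if it did not exit cleanly.

    Raising here is what lets Airflow, Prefect, or Dagster fail the task
    and stop downstream tasks from running.
    """
    command = soda_scan_command(data_source, configuration_file, checks_file)
    completed = runner(command, check=False)
    if completed.returncode != 0:
        raise RuntimeError(
            f"Data quality scan failed with exit code {completed.returncode}: {' '.join(command)}"
        )
    return completed.returncode


# Example wiring inside a task body (placeholders, not run here):
# validate("my_datasource", "configuration.yml", "checks.yml")
```

Placing this task directly after ingestion, before any transformation, follows the "as soon as possible" advice on the slide.
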
  11. Takeaways

    • Data quality is a must for data-driven decision making
    • Ensuring data quality at scale has never been easier
    • Data quality with Soda Core consists of answering 3 questions:
      ◦ What?
      ◦ Where?
      ◦ How?