Performing Data Quality at Scale

Marketing OGZ
September 20, 2022

Transcript

  1. Performing Data Quality at Scale. Mahdi Karabiben, 2022-09-15
  2. whoami: a data engineer who’s passionate about open-source projects. Zendesk, FactSet, Crédit Agricole, Numberly
  3. Table of contents
    01 Um, what’s data quality?
    02 A look back in time
    03 Data quality today
    04 Data quality at scale with Soda Core
  4. Section 01: Um, what’s data quality?
  5. 6 key dimensions
    • Completeness: Do I have all the essential information?
    • Accuracy: Does my data match the real world?
    • Consistency: Is my data the same across all instances?
    • Timeliness: Is my data fresh enough?
    • Validity: Does my data align with the requirements?
    • Uniqueness: Do I have any duplicates?
  6. The end goal: ensuring that we can trust the data that we’re consuming.
  7. Section 02: A look back in time
  8. Limited and costly options. A difficult-to-scale, brute-force approach was the only available option:
    • On data lakes: write dedicated tasks/jobs that cleanse the data and run data quality checks
    • On data warehouses: write and maintain multiple SQL queries that run on the input tables to perform the necessary checks
  9. Section 03: Data quality today
  10. Data quality is top of mind
    • Data is a first-class citizen everywhere
    • Rise of SaaS data observability platforms
    • Open-source projects that drastically shorten the data quality path
  11. Why open-source?
    • Smaller investment to reach data quality standards
    • Flexibility and extensibility
    • Integrated within version control systems (clearer versioning)
    • No vendor lock-in
    • Managed/owned by the engineering team
  12. Great Expectations
    • Python library
    • Expectations: Python functions
    • Data profiling capabilities
    • Built-in data documentation
  13. Soda Core
    • Python library
    • Checks defined in YAML
    • Soda Checks Language (SodaCL)
    • Convenient CLI commands
  14. Why Soda?
    • The flexibility and scalability of SodaCL
    • YAML and SQL > Python
    • Superior metrics capabilities
  15. Section 04: Data quality at scale with Soda Core
  16. What are we validating?
  17. Where are we validating?
  18. How (and when) are we validating?
  19. How (and when) are we validating?
  20. How (and when) are we validating?
    • Perform the data validation as soon as possible within our DAGs
    • Leverage the integrations with Airflow, Prefect, and Dagster
    • Run the data quality checks as dedicated tasks
    • Use the Python library or the soda scan CLI command
    • Push the generated metrics into the data warehouse (or use Soda Cloud)
  21. Takeaways
    • Data quality is a must for data-driven decision making
    • Ensuring data quality at scale has never been easier
    • Data quality with Soda Core consists of answering 3 questions:
      ◦ What?
      ◦ Where?
      ◦ How?
  22. Thanks! Do you have any questions? Mahdi Karabiben (@MahdiKarabiben)
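
Sketch for slides 5 and 8: the brute-force approach of hand-written SQL checks, applied to four of the six dimensions. This is only an illustration, not code from the talk. The orders table, its columns, its rows, and the fixed reference date are all hypothetical, and an in-memory SQLite database stands in for a real warehouse. Accuracy and consistency are left out because they need an external reference or a second copy of the data to compare against.

```python
import sqlite3

# Hypothetical input table; names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER, customer_id INTEGER, amount REAL, created_at TEXT
    );
    INSERT INTO orders VALUES
        (1, 10,   25.0, '2022-09-14'),
        (2, NULL, 40.0, '2022-09-15'),
        (2, 11,   -5.0, '2022-09-15');
""")

# One hand-written SQL query per dimension. Each returns a single number.
CHECKS = {
    # Completeness: rows missing a customer_id.
    "completeness": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    # Uniqueness: order_id values that appear more than once.
    "uniqueness": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1
        )
    """,
    # Validity: amounts that break the (assumed) rule amount >= 0.
    "validity": "SELECT COUNT(*) FROM orders WHERE amount < 0",
    # Timeliness: whole days between an assumed reference date and the newest row.
    "timeliness": """
        SELECT CAST(julianday('2022-09-15') - julianday(MAX(created_at)) AS INTEGER)
        FROM orders
    """,
}


def run_checks(connection, checks):
    """Run every query and return {dimension: metric value}."""
    return {name: connection.execute(sql).fetchone()[0] for name, sql in checks.items()}


results = run_checks(conn, CHECKS)
for name, value in results.items():
    print(f"{name}: {value}")
```

Every new table or rule means another query to write and maintain, which is the scaling problem slide 8 points at.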
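
Sketch for slides 13 and 20: SodaCL checks defined in YAML and run as a dedicated task, either through the soda scan CLI command or through the Python library. This is a minimal sketch under stated assumptions, not the setup shown in the talk. The dataset name orders, its columns, the file names, and the helper function names are hypothetical. The library calls follow Soda Core's documented programmatic scan interface as I understand it, so check them against the version you have installed.

```python
import subprocess
from pathlib import Path

# SodaCL checks as a YAML string. The dataset "orders" and its columns are
# hypothetical names used only for illustration.
SODACL_CHECKS = """\
checks for orders:
  - row_count > 0
  - missing_count(customer_id) = 0
  - duplicate_count(order_id) = 0
  - freshness(created_at) < 1d
"""


def build_scan_command(data_source, config_path, checks_path):
    """Build the argv for: soda scan -d <data source> -c <config> <checks>."""
    return ["soda", "scan", "-d", data_source, "-c", str(config_path), str(checks_path)]


def run_scan_with_cli(data_source, config_path, checks_path):
    """Write the checks file, then run the scan as a subprocess.

    Returns the CLI exit code. Treating any nonzero code as a failed task
    is an assumption of this sketch.
    """
    Path(checks_path).write_text(SODACL_CHECKS)
    command = build_scan_command(data_source, config_path, checks_path)
    return subprocess.run(command, check=False).returncode


def run_scan_with_library(data_source, config_path):
    """Run the same checks through the Python library (needs soda-core installed)."""
    from soda.scan import Scan  # imported lazily so the sketch loads without it

    scan = Scan()
    scan.set_data_source_name(data_source)
    scan.add_configuration_yaml_file(file_path=str(config_path))
    scan.add_sodacl_yaml_str(SODACL_CHECKS)
    scan.execute()
    scan.assert_no_checks_fail()  # raises if any check failed
```

Either function could be wrapped in an orchestrator task (Airflow, Prefect, or Dagster) placed right after the step that loads the data, so that a failed check stops downstream tasks early.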