Slide 1

Slide 1 text

Performing Data Quality at Scale
Mahdi Karabiben
2022-09-15

Slide 2

Slide 2 text

whoami
A data engineer who’s passionate about open-source projects
Zendesk, FactSet, Crédit Agricole, Numberly

Slide 3

Slide 3 text

Table of contents
01 Um, what’s data quality?
02 A look back in time
03 Data quality today
04 Data quality at scale with Soda Core

Slide 4

Slide 4 text

01 Um, what’s data quality?

Slide 5

Slide 5 text

6 key dimensions
● Completeness: Do I have all the essential information?
● Accuracy: Does my data match the real world?
● Consistency: Is my data the same across all instances?
● Timeliness: Is my data fresh enough?
● Validity: Does my data align with the requirements?
● Uniqueness: Do I have any duplicates?
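Three of these dimensions (completeness, uniqueness, timeliness) are easy to express as small measurements. The following is a minimal sketch in plain Python, not part of the talk: the sample rows, the column names, and the helper functions are all hypothetical and only illustrate what each question asks of the data.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical sample rows; the table and column names are made up for illustration.
ROWS = [
    {"order_id": 1, "customer_id": "a",
     "updated_at": datetime(2022, 9, 15, 8, 0, tzinfo=timezone.utc)},
    {"order_id": 2, "customer_id": None,
     "updated_at": datetime(2022, 9, 15, 9, 0, tzinfo=timezone.utc)},
    {"order_id": 2, "customer_id": "c",
     "updated_at": datetime(2022, 9, 15, 10, 0, tzinfo=timezone.utc)},
]


def completeness(rows, column):
    """Completeness: fraction of rows where `column` is present and not None."""
    if not rows:
        return 1.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def duplicate_count(rows, column):
    """Uniqueness: number of distinct values of `column` that occur more than once."""
    counts = Counter(row[column] for row in rows)
    return sum(1 for occurrences in counts.values() if occurrences > 1)


def is_fresh(rows, column, now, max_age):
    """Timeliness: True if the newest timestamp in `column` is at most `max_age` old."""
    newest = max(row[column] for row in rows)
    return now - newest <= max_age
```

Accuracy, consistency, and validity usually need something to compare against (a reference source, another copy of the data, or a business rule), so they are left out of this sketch.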

Slide 6

Slide 6 text

The end goal: Ensuring that we can trust the data that we’re consuming.

Slide 7

Slide 7 text

02 A look back in time

Slide 8

Slide 8 text

Limited and costly options
A difficult-to-scale, brute-force approach was the only available option:
● On data lakes: Write dedicated tasks/jobs that cleanse the data and run data quality checks
● On data warehouses: Write and maintain multiple SQL queries that run on the input tables to perform the necessary checks
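To make the warehouse-side approach concrete, here is a minimal sketch of hand-maintained SQL checks. It uses an in-memory SQLite database as a stand-in for a warehouse; the `orders` table, its columns, and the check names are assumptions made for illustration. Each query counts violating rows, so a check passes when it returns 0.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "a"), (2, None), (2, "c")],
)

# One hand-written query per check and per table: this is the part that
# becomes hard to maintain as the number of tables grows.
CHECKS = {
    "missing_customer_id": (
        "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
    ),
    "duplicate_order_id": (
        "SELECT COUNT(*) FROM ("
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
        ") AS d"
    ),
}


def run_checks(connection, checks):
    """Run every query and return {check_name: number_of_violations}."""
    return {
        name: connection.execute(sql).fetchone()[0]
        for name, sql in checks.items()
    }


results = run_checks(conn, CHECKS)
failed = [name for name, violations in results.items() if violations > 0]
```

Every new table or rule means another query to write, review, and keep in sync with schema changes, which is why this approach scales poorly.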

Slide 9

Slide 9 text

03 Data quality today

Slide 10

Slide 10 text

Data quality is top of mind
● Data is a first-class citizen everywhere
● Rise of SaaS data observability platforms
● Open-source projects that drastically shorten the data quality path

Slide 11

Slide 11 text

Why open-source?
● Smaller investment to reach data quality standards
● Flexibility and extensibility
● Integrated within version control systems (clearer versioning)
● No vendor lock-in
● Managed/owned by the engineering team

Slide 12

Slide 12 text

Great Expectations
● Python library
● Expectations: Python functions
● Data profiling capabilities
● Built-in data documentation

Slide 13

Slide 13 text

Soda Core
● Python library
● Checks defined in YAML
● Soda Checks Language (SodaCL)
● Convenient CLI commands
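As an illustration of checks defined in YAML, the sketch below embeds a small SodaCL document and runs it through Soda Core's Python `Scan` API. It assumes soda-core 3.x and a matching connector package are installed; the `orders` dataset, its columns, the data source name, and the configuration path are hypothetical placeholders rather than anything from the talk.

```python
# A small SodaCL document. `orders`, `customer_id`, and `order_id` are
# hypothetical names chosen for illustration.
SODACL_CHECKS = """
checks for orders:
  - row_count > 0
  - missing_count(customer_id) = 0
  - duplicate_count(order_id) = 0
"""


def run_scan(data_source_name: str, configuration_path: str) -> int:
    """Run the checks above with Soda Core and return the scan's exit code.

    Assumes soda-core 3.x; `configuration_path` points at a Soda
    configuration YAML file that defines `data_source_name`.
    """
    # Imported lazily so this module still loads where soda-core is absent.
    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name(data_source_name)
    scan.add_configuration_yaml_file(file_path=configuration_path)
    scan.add_sodacl_yaml_str(SODACL_CHECKS)
    return scan.execute()
```

A call such as `run_scan("my_warehouse", "configuration.yml")` would then execute all three checks against the configured data source.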

Slide 14

Slide 14 text

Why Soda?
● The flexibility and scalability of SodaCL
● YAML and SQL > Python
● Superior metrics capabilities

Slide 15

Slide 15 text

04 Data quality at scale with Soda Core

Slide 16

Slide 16 text

What are we validating?

Slide 17

Slide 17 text

Where are we validating?

Slide 18

Slide 18 text

How (and when) are we validating?

Slide 19

Slide 19 text

How (and when) are we validating?

Slide 20

Slide 20 text

How (and when) are we validating?
● Perform the data validation as soon as possible within our DAGs
● Leverage the integrations with Airflow, Prefect, and Dagster
● Run the data quality checks as dedicated tasks
● Use the Python library or the soda scan CLI command
● Push the generated metrics into the data warehouse (or use Soda Cloud)
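The "dedicated task" idea above can be sketched as a thin wrapper around the `soda scan` CLI command. This is a minimal sketch that assumes the `soda` executable is on the PATH; the function names, data source name, and file paths are hypothetical, and the orchestrator wiring (Airflow, Prefect, or Dagster) is left out.

```python
import subprocess


def build_soda_scan_command(data_source, configuration_path, checks_path):
    """Build the argument list for `soda scan` (-d data source, -c configuration)."""
    return ["soda", "scan", "-d", data_source, "-c", configuration_path, checks_path]


def run_soda_scan(data_source, configuration_path, checks_path):
    """Run the scan as its own step and return the CLI's exit code.

    A non-zero exit code means the scan did not pass cleanly, so the
    calling task can fail early and stop downstream tasks from running.
    """
    command = build_soda_scan_command(data_source, configuration_path, checks_path)
    completed = subprocess.run(command, check=False)
    return completed.returncode
```

Placing a task like `run_soda_scan("my_warehouse", "configuration.yml", "checks/orders.yml")` right after the ingestion step keeps bad data from flowing into later stages of the DAG.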

Slide 21

Slide 21 text

Takeaways
● Data quality is a must for data-driven decision making
● Ensuring data quality at scale has never been easier
● Data quality with Soda Core consists of answering 3 questions:
○ What?
○ Where?
○ How?

Slide 22

Slide 22 text

Thanks!
Do you have any questions?
Mahdi Karabiben
@MahdiKarabiben