Slide 1

Shuhsi Lin 2024/09/22 Ensuring Data Integrity with Validation and Pipeline Testing PyConTW 2024 at Kaohsiung Calling from the harbor: Dive into Python

Slide 2

About Me Shuhsi Lin, sciwork member. Working in smart manufacturing & AI, with data and people. Interested in ● Agile / Engineering Culture / Developer Experience ● Team Coaching ● Data Engineering

Slide 3

Agenda 01 Start with the basics: the importance of data quality in the data-driven landscape (Data Quality / Data Integrity / Data Contract) 02 Types of data testing (test strategies) 03 Test frameworks: Great Expectations & dbt (dbt test and others) 04 Data pipeline debt & recap: the Analytics Development Lifecycle (ADLC)

Slide 4

Scenario SmartPizza (like Pizzaxxx, but smarter) ● Daily work: get orders with specific recipes to make pizza ● Data tables: Order, Recipes, Customer, Inventory, … ● ETL/ELT pipelines: data source → data store/target → data applications

Slide 5

Data Quality

Slide 6

6 Dimensions of Data Quality ● Accuracy: the degree to which data correctly reflects the real-world object/event ● Completeness: expected comprehensiveness; are all datasets and data items recorded? ● Consistency: data across all systems reflects the same information and is in sync across the data stores ● Timeliness: information is available when it is expected and needed ● Uniqueness: there is only one instance of the information appearing in a database ● Validity: information conforms to a specific format and follows business rules https://kamal-ahmed.github.io/DQ-Dimensions.github.io/ https://hub.getdbt.com/infinitelambda/dq_tools/latest/

Slide 7

Data Quality in AI The impact of poor-quality data and analytics: distrust in data, and data analytics challenges in organizations. Source: SnapLogic, “The State of Data Management – The Impact of Data Distrust”

Slide 8

Data Integrity vs Data Quality https://www.montecarlodata.com/blog-data-integrity-vs-data-quality/ 1. Entity Integrity 2. Referential Integrity 3. Domain Integrity 4. User-Defined Integrity

Slide 9

Entity Integrity ● Table: customers ● Primary Key: customer_id ● Violation: the same customer_id appears more than once in the customers table ● Impact: confusion in tracking orders and loyalty points; inaccurate customer analytics. Referential Integrity ● Tables: orders and customers ● Foreign Key: orders.customer_id references customers.customer_id ● Violation: an order references a customer_id that doesn't exist in the customers table ● Impact: confusion in tracking orders and loyalty points; inaccurate customer analytics.
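
As a sketch of how these two checks might look in code (not shown on the slide): a pandas version, assuming the customers and orders tables from the scenario are already loaded as DataFrames.

```python
import pandas as pd

def entity_integrity_violations(customers: pd.DataFrame) -> pd.DataFrame:
    # Entity integrity: the primary key (customer_id) must be unique.
    return customers[customers["customer_id"].duplicated(keep=False)]

def referential_integrity_violations(orders: pd.DataFrame,
                                     customers: pd.DataFrame) -> pd.DataFrame:
    # Referential integrity: every orders.customer_id must exist in customers.
    return orders[~orders["customer_id"].isin(customers["customer_id"])]
```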

Slide 10

Domain Integrity ● Column: pizza_type ● Valid values: 'Margherita', 'Pepperoni', 'Veggie Supreme' ● Violation: an order includes a pizza_type that is not on the menu ● Impact: issues with order fulfillment; inaccurate inventory forecasting. User-Defined Integrity ● Rule: all ingredients listed in the recipes table must exist in the inventory table with sufficient stock ● Violation: an ingredient of a recipe doesn't exist in the inventory table (out of stock!) ● Impact: inability to prepare certain pizzas due to missing ingredients; customers may be disappointed if their preferred menu items are unavailable; staff may spend extra time handling stock shortages.
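
The other two integrity types can be sketched the same way (again illustrative only; table and column names such as ingredient and stock are assumed from the scenario):

```python
import pandas as pd

VALID_PIZZA_TYPES = {"Margherita", "Pepperoni", "Veggie Supreme"}

def domain_integrity_violations(orders: pd.DataFrame) -> pd.DataFrame:
    # Domain integrity: pizza_type must be one of the values on the menu.
    return orders[~orders["pizza_type"].isin(VALID_PIZZA_TYPES)]

def user_defined_rule_violations(recipes: pd.DataFrame,
                                 inventory: pd.DataFrame) -> pd.DataFrame:
    # User-defined rule: every recipe ingredient must exist in inventory
    # with sufficient stock.
    merged = recipes.merge(inventory, on="ingredient", how="left")
    return merged[merged["stock"].isna() | (merged["stock"] <= 0)]
```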

Slide 11

Data Contract

Slide 12

What is a Data Contract? A data contract is a formal agreement, typically between a service (or producer) and a client (or consumer), that outlines the specifics of data exchange. Its core purpose is to ensure compatibility and clarity in the management and usage of data. Key elements: producer, consumer, agreement, data exchange. Four principles: 1. interface 2. expectations 3. governed 4. explicit. References: “The next generation of Data Platforms is the Data Mesh” (The PayPal Technology Blog); Driving Data Quality with Data Contracts (2023) by Andrew Jones

Slide 13

Four Principles of Data Contracts 1. Interface ● Stable and supported interface to the data ● Versioned, with a migration path for breaking schema changes 2. Expectations ● An agreement between the generators and consumers ● Documented, including the structure, schema and semantics ● SLOs around the reliability of the data ● Well-defined ownership and responsibilities 3. Governed ● Handling personally identifiable data in line with company policies and regulations ● Controlling access to data ● Automated, driven by metadata provided in the data contracts 4. Explicit ● Accessible data explicitly provided for consumption ● Data products that meet business requirements ● Generators know the value and are incentivised to produce the data Driving Data Quality with Data Contracts (2023) by Andrew Jones

Slide 14

An overview of data contracts in the Open Data Contract Standard https://github.com/bitol-io/open-data-contract-standard Contributors to the data contract

Slide 15

Open Data Contract Standard https://github.com/bitol-io/open-data-contract-standard/blob/main/docs/README.md Different sections define the agreement on data (the data contract)

Slide 16

Schema Template for Data Contract: defining columns https://github.com/bitol-io/open-data-contract-standard/blob/main/docs/README.md#dataset-and-schema
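
For a feel of the format, here is a much-simplified, illustrative column definition in the spirit of ODCS v2. Key names vary between spec versions, so treat the README linked above as authoritative, not this sketch.

```yaml
# Illustrative sketch only: approximate ODCS v2-style keys, not spec-exact.
dataset:
  - table: orders
    description: Customer pizza orders
    columns:
      - column: order_id
        logicalType: string
        isNullable: false
        isPrimaryKey: true
      - column: customer_id
        logicalType: string
        isNullable: false
```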

Slide 17

Data Quality Rules & Parameter quality rule (content) https://github.com/bitol-io/open-data-contract-standard/blob/main/docs/README.md#data-quality

Slide 18

(Column-level) Data Quality Rules & Parameters https://github.com/bitol-io/open-data-contract-standard/blob/main/docs/README.md#data-quality

Slide 19

Service-level agreement (SLA): e.g. latency, retention https://github.com/bitol-io/open-data-contract-standard/blob/main/docs/README.md#service-level-agreement
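
Continuing the same illustrative contract: a column-level quality rule plus SLA properties such as latency and retention. The rule code and exact keys are hypothetical approximations of the spec; see the linked README sections for the real shapes.

```yaml
# Illustrative sketch only: approximate keys, hypothetical rule code.
dataset:
  - table: orders
    columns:
      - column: customer_id
        quality:
          - code: nullCheck              # hypothetical rule name
            description: customer_id must never be null
            dimension: completeness
slaProperties:
  - property: latency                    # how fresh the data must be
    value: 4
    unit: h
  - property: retention                  # how long the data is kept
    value: 3
    unit: y
```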

Slide 20

Data (Pipeline) Testing

Slide 21

Do you test your code?

Slide 22

Do you test your data pipeline? Do you test your data?

Slide 23

Two Types of Testing. Assertion about Code (DEV/TEST env, CI): data frozen, code changed; validating the code that processes data before it is deployed to prod. Assumption about Data (Prod env): code frozen, data changed; validating the data as it is loaded into production. In both cases the pipeline is code + data in, output data out. ref: Webinar on testing frameworks in dbt (2023)

Slide 24

What do we test?

Slide 25

Two Types of Testing. Assertion about Code (data frozen, code changed). Code: lint -> unit test -> integration testing ● Unexpected change from pipeline code refactoring ● SQL column definition ● Column rename. Assumption about Data (code frozen, data changed). Data: validate data content ● Many NULLs ● Data not updating ● Value too small/large ● … https://www.getdbt.com/blog/building-a-data-quality-framework-with-dbt-and-dbt-cloud

Slide 26

Where do we test?

Slide 27

Testing in the Data Lifecycle Data pipeline: Source → Ingest/collect → Store → Process → Output. Test code along the pipeline (bug-specific tests, large logic tests) and test data at each stage (data tests). https://www.telm.ai/blog/how-to-solve-data-quality-issues-at-every-lifecycle-stage/

Slide 28

How do we test?

Slide 29

Validating Non-Empty Orders Validate by assertion Scenario: Ensure that each order contains at least one pizza item.
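
The slide's code is not reproduced in this transcript; a minimal plain-assertion sketch of the idea, with the order structure assumed:

```python
def validate_order(order: dict) -> None:
    # Assumed shape: {"order_id": ..., "items": [<pizza line items>]}
    assert order.get("items"), f"Order {order.get('order_id')!r} has no pizza items"

# Passes silently; an empty items list would raise AssertionError.
validate_order({"order_id": "A-1001", "items": [{"pizza_type": "Margherita", "qty": 2}]})
```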

Slide 30

Validating Non-Empty Orders Data Validation module/pipeline Scenario: Ensure that each order contains at least one pizza item.
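
One way to turn the same check into a reusable validation module is a schema model, e.g. with pydantic (listed later in the talk under data-content tools); a sketch assuming pydantic v2:

```python
from pydantic import BaseModel, Field, ValidationError

class PizzaItem(BaseModel):
    pizza_type: str
    quantity: int = Field(gt=0)

class Order(BaseModel):
    order_id: str
    items: list[PizzaItem] = Field(min_length=1)  # at least one pizza item

try:
    Order(order_id="A-1002", items=[])
except ValidationError as err:
    print(err)  # reports that items has too few elements
```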

Slide 31

Validating Non-Empty Orders Scenario: Ensure that each order contains at least one pizza item.

Slide 32

Data Validation Framework? Frameworks Unveiled: Tools for Ensuring Data Quality

Slide 33

What is Great Expectations? ● A Python-based open source tool ● For validating, documenting, and profiling data ● Maintains data quality and improves communication between teams. Key features ● Expectations are like assertions in traditional Python unit tests ● Automated data profiling automates pipeline tests ● Data Contexts and Data Sources let you configure connections to your data sources ● Checkpoints provide tooling for running data validation ● Data Docs: clean, human-readable documentation

Slide 34

Strata 2018: Pipeline Testing with Great Expectations (data updated 2024/09/22). More about GX: https://medium.com/@expectgreatdata/down-with-pipeline-debt-introducing-great-expectations-862ddc46782a https://github.com/great-expectations/great_expectations

Slide 35

Community and Integrations https://greatexpectations.io/integrations https://greatexpectations.io/community

Slide 36

How GX Works https://docs.greatexpectations.io/docs/core/introduction/gx_overview

Slide 37

GX workflow: set up the GX environment, get a Data Context, connect to data, define expectations, and validate. Example: validate data in a DataFrame. https://docs.greatexpectations.io/docs/core/introduction/try_gx?procedure=sample_code
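
A sketch of that workflow adapted to the pizza scenario, following the sample-code pattern in the linked GX docs (GX 1.x fluent API; method names may shift between versions):

```python
import pandas as pd
import great_expectations as gx

# Hypothetical orders data for the SmartPizza scenario
df = pd.DataFrame({"order_id": ["A-1", "A-2"], "item_count": [2, 1]})

context = gx.get_context()                                    # Data Context

data_source = context.data_sources.add_pandas("pandas")       # Connect
data_asset = data_source.add_dataframe_asset(name="orders")
batch_def = data_asset.add_batch_definition_whole_dataframe("orders batch")
batch = batch_def.get_batch(batch_parameters={"dataframe": df})

expectation = gx.expectations.ExpectColumnValuesToBeBetween(  # Define expectation
    column="item_count", min_value=1                          # every order has >= 1 item
)

result = batch.validate(expectation)                          # Validate
print(result.success)
```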

Slide 38

https://greatexpectations.io/expectations/

Slide 39

What is dbt? dbt (data build tool): transform data using the same practices that software engineers use to build applications. Develop: write modular data transformations in .sql or .py files; dbt handles the chore of dependency management. Test and Document: test every model prior to production, and share dynamically generated documentation with all data stakeholders. Version Control and CI/CD: deploy safely using dev environments; Git-enabled version control enables collaboration and a return to previous states. https://www.getdbt.com/product/what-is-dbt https://github.com/dbt-labs/dbt-core

Slide 40

Test data content ● Built-in: singular data tests, generic data tests ● Packages: dbt_utils, dbt_expectations, elementary, … Test data schema ● dbt (model) contract

Slide 41

dbt test: generic data tests and dbt-utils https://docs.getdbt.com/docs/build/data-tests https://github.com/dbt-labs/dbt-utils Scenario: Validate that the order_id in the orders table increments sequentially without gaps. Scenario: Ensure recipe_id is unique and not null.
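
A sketch of how these two scenarios can be expressed in a dbt schema.yml, using dbt_utils.sequential_values plus the built-in unique and not_null generic tests; model and column names are assumed from the scenario (the key is tests: on dbt versions before 1.8):

```yaml
# models/schema.yml (sketch; model names assumed)
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        data_tests:
          - dbt_utils.sequential_values:
              interval: 1        # order_id increments without gaps
  - name: recipes
    columns:
      - name: recipe_id
        data_tests:
          - unique
          - not_null
```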

Slide 42

dbt test: custom (singular) tests in SQL https://docs.getdbt.com/docs/build/data-tests Scenario: Verify that every ingredient listed in the recipes table exists in the inventory table.
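
A singular dbt test is just a SQL file under tests/ that selects the failing rows; the test passes when it returns zero rows. A sketch for this scenario, with table and column names assumed:

```sql
-- tests/assert_all_recipe_ingredients_in_inventory.sql
-- Fails if any ingredient used in recipes is missing from inventory.
select r.ingredient_id
from {{ ref('recipes') }} as r
left join {{ ref('inventory') }} as i
  on r.ingredient_id = i.ingredient_id
where i.ingredient_id is null
```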

Slide 43

dbt-expectations Scenario: validating total sales between the orders and sales tables https://github.com/calogica/dbt-expectations

Slide 44

dbt-expectations Scenario: validating total sales between the orders and sales tables (continued) https://github.com/calogica/dbt-expectations
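
dbt-expectations ports many Great Expectations-style checks to dbt. The slide's exact test is not reproduced in this transcript; as one hedged sketch of the idea, dbt_expectations.expect_column_sum_to_be_between can bound a sales total (the column name and bounds below are hypothetical), and the package also offers cross-table comparisons such as expect_table_row_count_to_equal_other_table:

```yaml
# models/schema.yml (sketch; column name and bounds are hypothetical)
version: 2

models:
  - name: sales
    columns:
      - name: total_amount
        data_tests:
          - dbt_expectations.expect_column_sum_to_be_between:
              min_value: 0
              max_value: 1000000
```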

Slide 45

dbt contract Scenario: prevent invalid recipes data from entering your datastore. Contract: enforce a contract on the data model. Column: define data types. Constraints: ● nullness ● uniqueness ● primary keys ● foreign keys https://docs.getdbt.com/docs/collaborate/govern/model-contracts
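
A sketch of an enforced model contract (dbt 1.5+), with column names assumed from the recipes scenario; constraint support varies by warehouse:

```yaml
# models/schema.yml (sketch)
version: 2

models:
  - name: recipes
    config:
      contract:
        enforced: true           # build fails if the model output drifts
    columns:
      - name: recipe_id
        data_type: int
        constraints:
          - type: not_null
          - type: primary_key
      - name: ingredient_id
        data_type: int
        constraints:
          - type: not_null
          - type: foreign_key
            expression: inventory (ingredient_id)   # syntax is warehouse-specific
```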

Slide 46

Two Types of Testing (revisited). Assertion about Code (DEV/TEST env, CI; data frozen, code changed): validating the code that processes data before it is deployed to prod ● ETL code: pytest, … ● model code: dbt unit testing, Recce. Assumption about Data (Prod env; code frozen, data changed): validating the data as it is loaded into production ● data content: pydantic, great expectations, dbt test, dbt_utils/expectations/elementary… ● data schemas: dbt data contracts. ref: Webinar on testing frameworks in dbt (2023)

Slide 47

Hub for dbt tests https://www.elementary-data.com/dbt-test-hub https://docs.elementary-data.com/oss/oss-introduction

Slide 48

How to choose a framework 1. Data source integration 2. Available assertions 3. Custom assertions (extensibility) 4. Validation execution / integration with the existing technology stack. Do less for easier adoption. https://medium.com/@brunouy/a-guide-to-open-source-data-quality-tools-in-late-2023-f9dbadbc7948

Slide 49

Data Pipeline debt

Slide 50

Technical debt https://martinfowler.com/bliki/TechnicalDebtQuadrant.html https://asana.com/zh-tw/resources/technical-debt https://www.sonarsource.com/learn/technical-debt/

Slide 51

Pipeline debt: technical debt in data pipelines ● Undocumented/unmanaged ● Untested ● Unstable. “Down with pipeline debt: introducing Great Expectations” https://greatexpectations.io/blog/down-with-pipeline-debt-introducing-great-expectations

Slide 52

Shift-Left Testing Move testing earlier in the software development life cycle (SDLC); likewise, shift-left data reliability. https://www.linkedin.com/pulse/shift-left-testing-revolutionizing-quality-assurance-software/ https://greatexpectations.io/blog/maximizing-productivity-of-analytics-teams-pt3 https://www.acceldata.io/blog/shift-left-data-reliability

Slide 53

Lifecycle of a data pipeline with tests (chart: cost vs. complexity and team capacity, with tests vs. without tests). Automation is the key. https://www.karllhughes.com/posts/testing-matters https://greatexpectations.io/blog/maximizing-productivity-of-analytics-teams-pt3

Slide 54

CloudDQ https://github.com/GoogleCloudPlatform/cloud-data-quality https://cloud.google.com/dataplex/docs/data-quality-tasks-overview

Slide 55

Auto data quality overview: a reference architecture https://github.com/GoogleCloudPlatform/cloud-data-quality https://cloud.google.com/dataplex/docs/auto-data-quality-overview

Slide 56

The Analytics Development Lifecycle (ADLC): an integrated, iterative process across Source → Ingest/collect → Store → Process → Output https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle

Slide 57

Recap and Tips ● Data contracts may help with data quality/data integrity ● Data pipeline as code + infrastructure as code ● Test data, test the data pipeline (code), and automate it ● Try a test framework (GX, dbt test, …) ● Start with priority/severity ● Integrate with your development/deploy flow (CI/CD) ● Integrate with data observability (monitoring, alerting) and operations ● Natively/easily integrate into the data orchestrator ● Data reliability engineering (data governance in a larger scope) ● More data testing frameworks: Soda Core, Pandera, Fugue

Slide 58

Q&A Thanks!