

Ensuring Data Integrity with Validation and Pipeline Testing at PyConTW 2024

suci

September 21, 2024


Transcript

  1. Shuhsi Lin, 2024/09/22. Ensuring Data Integrity with Validation and
    Pipeline Testing. PyConTW 2024 at Kaohsiung. Calling from the harbor: Dive into Python
  2. About Me. Shuhsi Lin, sciwork member, working in smart manufacturing & AI,
    with data and people. Interested in • Agile/engineering culture and
    developer experience • Team coaching • Data engineering. Find me on sciwork.
  3. Agenda. 01 Start with the basics: the importance of data quality in the
    data-driven landscape (data quality / data integrity / data contracts).
    02 Types of data testing; test strategies. 03 Test frameworks: Great
    Expectations & dbt (dbt tests and others). 04 Data pipeline debt & recap:
    the Analytics Development Lifecycle (ADLC).
  4. Scenario: SmartPizza (like Pizzaxxx, but smarter). • Daily work: receive
    orders with specific recipes and make pizza. • Data tables: Order, Recipes,
    Customer, Inventory, … ETL/ELT pipelines move data from data sources to
    data stores/targets and on to data applications.
  5. 6 Dimensions of Data Quality. • Accuracy: the degree to which data
    correctly reflects the real-world object/event. • Completeness: expected
    comprehensiveness; are all datasets and data items recorded. • Consistency:
    data across all systems reflects the same information and is in sync across
    the data stores. • Timeliness: information is available when it is expected
    and needed. • Uniqueness: only one instance of the information appears in a
    database. • Validity: information conforms to a specific format and follows
    business rules. https://kamal-ahmed.github.io/DQ-Dimensions.github.io/ https://hub.getdbt.com/infinitelambda/dq_tools/latest/
  6. Data Quality in AI. The impact of poor-quality data and analytics:
    distrust in data, and data analytics challenges in organizations.
    SnapLogic, "The State of Data Management – The Impact of Data Distrust".
  7. Entity Integrity / Referential Integrity.
    Entity integrity • Table: customers • Primary key: customer_id •
    Violation: the same customer_id exists more than once in the customers table.
    Referential integrity • Tables: orders and customers • Foreign key:
    orders.customer_id references customers.customer_id • Violation: an order
    references a customer_id that doesn't exist in the customers table.
    Consequences of either violation: • Confusion in tracking orders and
    loyalty points. • Inaccurate customer analytics.
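The referential-integrity violation above can be sketched in plain Python; the table contents are hypothetical sample data, not part of the talk:

```python
# Hypothetical data: every orders.customer_id must exist in
# customers.customer_id (the foreign-key rule from the slide).
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # dangling foreign key
]

# Collect the known primary keys, then flag orders pointing elsewhere.
known_ids = {c["customer_id"] for c in customers}
orphan_orders = [o for o in orders if o["customer_id"] not in known_ids]
```

Here `orphan_orders` ends up holding the order that references the missing customer 3, which is exactly what a referential-integrity test should surface.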
  8. Domain Integrity / User-Defined Integrity.
    Domain integrity • Column: pizza_type • Valid values: 'Margherita',
    'Pepperoni', 'Veggie Supreme' • Violation: an order includes a pizza_type
    that is not on the menu. • Consequences: issues with order fulfillment;
    inaccurate inventory forecasting.
    User-defined integrity • Rule: all ingredients listed in the recipes table
    must exist in the inventory table with sufficient stock. • Violation: an
    ingredient of a recipe doesn't exist in the inventory table (out of
    stock!). • Consequences: inability to prepare certain pizzas due to missing
    ingredients; customers may be disappointed if their preferred menu items
    are unavailable; staff may spend extra time handling stock shortages.
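A minimal sketch of the domain-integrity check, using the valid values from the slide; the order rows are hypothetical:

```python
# Domain integrity: pizza_type must be one of the menu's valid values.
VALID_PIZZA_TYPES = {"Margherita", "Pepperoni", "Veggie Supreme"}

orders = [
    {"order_id": 1, "pizza_type": "Margherita"},
    {"order_id": 2, "pizza_type": "Hawaiian"},  # not on the menu
]

# Any order whose pizza_type falls outside the domain is a violation.
invalid_orders = [o for o in orders
                  if o["pizza_type"] not in VALID_PIZZA_TYPES]
```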
  9. What is a Data Contract? A data contract is a formal agreement, typically
    between a service (or producer) and a client (or consumer), that outlines
    the specifics of data exchange. Its core purpose is to ensure compatibility
    and clarity in the management and usage of data. Key elements: producer,
    consumer, agreement, data exchange. Four principles: 1. interface,
    2. expectations, 3. governed, 4. explicit. References: "The next generation
    of Data Platforms is the Data Mesh" (from The PayPal Technology blog);
    Driving Data Quality with Data Contracts (2023) by Andrew Jones.
  10. Four Principles of Data Contracts: 1. Interface, 2. Expectations,
    3. Governed, 4. Explicit. Together these cover: • A stable and supported
    interface to the data • Versioned, with a migration path for breaking
    schema changes • An agreement between the generators and consumers •
    Documented, including the structure, schema and semantics • SLOs around the
    reliability of the data • Well-defined ownership and responsibilities •
    Handling personally identifiable data in line with company policies and
    regulations • Controlling access to data • Automated, driven by metadata
    provided in the data contracts • Accessible data explicitly provided for
    consumption • Data products that meet business requirements • Generators
    know the value and are incentivised to produce the data.
    Driving Data Quality with Data Contracts (2023) by Andrew Jones
  11. An overview of data contracts in the Open Data Contract Standard.
    https://github.com/bitol-io/open-data-contract-standard Contributors to the data contract
  12. Two Types of Testing (ref: webinar on testing frameworks in dbt, 2023).
    Assertion about code (DEV/TEST env, CI): data frozen, code changed; code
    and data in, output data out. Validates the code that processes data before
    it is deployed to prod.
    Assumption about data (prod env): code frozen, data changed; code and data
    in, output data out. Validates the data as it is loaded into production.
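The "assertion about code" side can be sketched as a pytest-style unit test run in CI over frozen sample data; the transform function, item names, and prices here are hypothetical:

```python
# "Data frozen, code changed": the sample input never changes, so any
# test failure points at the transform code, not the data.
def add_order_total(order, prices):
    """Return a copy of the order with a 'total' computed from prices."""
    enriched = dict(order)
    enriched["total"] = sum(prices[item] for item in order["items"])
    return enriched

def test_add_order_total():
    # Frozen fixture data; only the transform code evolves between runs.
    prices = {"margherita": 10, "pepperoni": 12}
    result = add_order_total({"items": ["margherita", "pepperoni"]}, prices)
    assert result["total"] == 22

test_add_order_total()  # pytest would discover and run this in CI
```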
  13. Two Types of Testing: Assertion about Code / Assumption about Data.
    Code (data frozen, code changed): lint -> unit test -> integration testing.
    Catches • unexpected changes from pipeline code refactoring • SQL column
    definitions • column renames.
    Data (code frozen, data changed): validate data content. Catches • many
    NULLs • data not updating • values too small/large • …
    https://www.getdbt.com/blog/building-a-data-quality-framework-with-dbt-and-dbt-cloud
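The data-content failure modes listed above (many NULLs, data not updating, out-of-range values) can be checked with a few lines of plain Python; the row layout, column names, and thresholds are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical batch of production rows to validate.
rows = [
    {"amount": 12.5, "updated_at": datetime(2024, 9, 20)},
    {"amount": None, "updated_at": datetime(2024, 9, 21)},
]
now = datetime(2024, 9, 22)

# Many NULLs: fraction of rows with a missing amount.
null_ratio = sum(r["amount"] is None for r in rows) / len(rows)

# Data not updating: newest row older than an assumed 7-day freshness SLO.
latest_update = max(r["updated_at"] for r in rows)
is_stale = now - latest_update > timedelta(days=7)

# Value too small/large: amounts outside an assumed (0, 1000) range.
out_of_range = [r for r in rows if r["amount"] is not None
                and not (0 < r["amount"] < 1000)]
```

In a real pipeline each of these would raise or alert instead of just computing a value; frameworks like Great Expectations or dbt tests package the same checks declaratively.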
  14. Testing in a Data Lifecycle: ingest/collect -> store -> process ->
    output. Test code and test data along the source-to-output data pipeline:
    bug-specific tests, large logic tests, and data tests at each stage.
    https://www.telm.ai/blog/how-to-solve-data-quality-issues-at-every-lifecycle-stage/
  15. What is Great Expectations? • A Python-based open source tool • For
    validating, documenting, and profiling data • Maintains data quality and
    improves communication between teams. Key features • Expectations are like
    assertions in traditional Python unit tests. • Automated data profiling
    automates pipeline tests. • Data Contexts and Data Sources let you
    configure connections to your data sources. • Checkpoints provide tooling
    for data validation. • Data Docs generate clean, human-readable
    documentation.
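To make "expectations are like assertions" concrete, here is a conceptual sketch of what one expectation computes. This is not the real Great Expectations API (which differs across GX versions); only the function name mimics GX's naming style:

```python
# Conceptual sketch of an expectation: check a condition over a column
# and return a structured result instead of raising, so many
# expectations can run in one validation pass.
def expect_column_values_to_not_be_null(rows, column):
    unexpected = [r for r in rows if r.get(column) is None]
    return {
        "success": len(unexpected) == 0,
        "unexpected_count": len(unexpected),
    }

result = expect_column_values_to_not_be_null(
    [{"customer_id": 1}, {"customer_id": None}], "customer_id"
)
```

The structured result is what lets GX render validation outcomes into Data Docs rather than simply failing fast like a unit-test assertion.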
  16. Strata 2018: Pipeline Testing with Great Expectations. Data updated
    2024/09/22. More about GX:
    https://medium.com/@expectgreatdata/down-with-pipeline-debt-introducing-great-expectations-862ddc46782a https://github.com/great-expectations/great_expectations
  17. What is dbt? dbt (data build tool): transform data using the same
    practices that software engineers use to build applications. • Develop:
    write modular data transformations in .sql or .py files; dbt handles the
    chore of dependency management. • Version control and CI/CD: deploy safely
    using dev environments; Git-enabled version control enables collaboration
    and a return to previous states. • Test and document: test every model
    prior to production, and share dynamically generated documentation with all
    data stakeholders.
    https://www.getdbt.com/product/what-is-dbt https://github.com/dbt-labs/dbt-core
  18. Test data content • Built-in ◦ Singular data tests ◦ Generic data tests
    • Packages ◦ dbt_utils ◦ dbt_expectations ◦ dbt_elementary ◦ …
    Test data schema • dbt (model) contract
  19. dbt test: generic data tests and dbt-utils. Scenario: ensure recipe_id
    is unique and not null. Scenario: validate that order_id in the orders
    table increments sequentially without gaps.
    https://docs.getdbt.com/docs/build/data-tests https://github.com/dbt-labs/dbt-utils
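The two scenarios can be expressed as a plain-Python sketch (the id values are made up); in dbt the first would be `unique` and `not_null` generic tests, the second a dbt-utils check:

```python
# Hypothetical sample columns pulled from the two tables.
recipe_ids = [1, 2, 3, 4]
order_ids = [100, 101, 103]  # gap: 102 is missing

# Scenario 1: recipe_id is unique and not null.
recipe_id_ok = (None not in recipe_ids
                and len(set(recipe_ids)) == len(recipe_ids))

# Scenario 2: order_id increments sequentially without gaps.
# Flag every id that does not follow its predecessor by exactly 1.
gap_positions = [later for earlier, later in zip(order_ids, order_ids[1:])
                 if later != earlier + 1]
```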
  20. dbt test: custom (singular) test in SQL. Scenario: verify that every
    ingredient listed in the recipes table exists in the inventory table.
    https://docs.getdbt.com/docs/build/data-tests
  21. dbt contract. Scenario: prevent invalid recipes data from entering your
    datastore. • Contract: enforce a contract on a data model • Column: define
    the data type • Constraint: specify constraints ◦ nullness ◦ uniqueness ◦
    primary keys ◦ foreign keys
    https://github.com/calogica/dbt-expectations
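A minimal plain-Python sketch of what a schema contract checks; the column names and types here are hypothetical. dbt enforces contracts in the warehouse at build time, and tools like pydantic offer much richer validation:

```python
# A "contract": each column's expected Python type.
RECIPE_CONTRACT = {"recipe_id": int, "name": str, "pizza_type": str}

def contract_violations(row, contract):
    """Return the columns whose values don't match the contracted type."""
    return [col for col, expected_type in contract.items()
            if not isinstance(row.get(col), expected_type)]

# recipe_id arrives as a string, so it violates the int contract.
bad_columns = contract_violations(
    {"recipe_id": "7", "name": "Classic", "pizza_type": "Margherita"},
    RECIPE_CONTRACT,
)
```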
  22. Two Types of Testing, with tools (ref: webinar on testing frameworks in
    dbt, 2023).
    Assertion about code (DEV/TEST env, CI; data frozen, code changed):
    validating the code that processes data before it is deployed to prod. ◦
    ETL code: pytest, … ◦ model code: dbt unit testing, Recce.
    Assumption about data (prod env; code frozen, data changed): validating
    the data as it is loaded into production. ◦ data content: pydantic, Great
    Expectations, dbt test, dbt_utils/dbt_expectations/elementary, … ◦ data
    schemas: dbt data contracts.
  23. How to choose a framework: do less for easier adoption. 1. Data source
    integration 2. Available assertions 3. Custom assertions (extensibility)
    4. Validation execution / integration with the existing technology stack.
    https://medium.com/@brunouy/a-guide-to-open-source-data-quality-tools-in-late-2023-f9dbadbc7948
  24. Pipeline debt: technical debt in data pipelines. It is • undocumented/
    unmanaged • untested • unstable. "Down with pipeline debt / introducing
    Great Expectations."
    https://greatexpectations.io/blog/down-with-pipeline-debt-introducing-great-expectations
  25. Lifecycle of a data pipeline with tests. [Chart: cost, team capacity,
    and complexity over time, with tests vs. without tests.] Automation is the
    key.
    https://www.karllhughes.com/posts/testing-matters https://greatexpectations.io/blog/maximizing-productivity-of-analytics-teams-pt3
  26. Recap and Tips • Data contracts may help data quality/data integrity •
    Data pipeline as code + infrastructure as code • Test data, test data
    pipelines (code), and automate it • Try a test framework (GX, dbt test, …)
    • Start with priority/severity • Integrate with your development/deployment
    flow (CI/CD) • Integrate with data observability (monitoring, alerting) and
    operations • Natively/easily integrate into the data orchestrator • Data
    Reliability Engineering (data governance in a larger scope) • More data
    testing frameworks ◦ Soda Core ◦ Pandera ◦ Fugue