in the data-driven landscape
Agenda
01 Data Quality / Data Integrity / Data Contracts
02 Types of data testing & test strategies
03 Test Frameworks: Great Expectations & dbt (dbt test and others)
04 Data Pipeline debt & Recap: Analytics Development Lifecycle (ADLC)
orders with specific recipes to make pizza
• Data tables
  ◦ Order
  ◦ Recipes
  ◦ Customer
  ◦ Inventory
  ◦ …
• Data flow: data source → ETL/ELT pipelines → data store/target → data applications
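To make the running example concrete, here is a minimal sketch of two of these tables as pandas DataFrames. The table and column names (orders, inventory, pizza_type, stock) are illustrative assumptions, not a real schema:

```python
import pandas as pd

# Hypothetical orders table: one row per ordered pizza.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [101, 102, 101],
    "pizza_type": ["Margherita", "Pepperoni", "Hawaiian"],  # "Hawaiian" is not on the menu
    "quantity": [1, 2, 1],
})

# Hypothetical inventory table: current stock per ingredient.
inventory = pd.DataFrame({
    "ingredient": ["dough", "tomato sauce", "mozzarella", "pepperoni"],
    "stock": [50, 40, 0, 12],  # mozzarella is out of stock
})
```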
Commonly cited data quality dimensions:
• Accuracy: data correctly reflects the real-world object or event.
• Completeness: the expected comprehensiveness; are all datasets and data items recorded?
• Consistency: data across all systems reflects the same information and stays in sync across the data stores.
• Timeliness: information is available when it is expected and needed.
• Uniqueness: there is only one instance of the information appearing in a database.
• Validity: information conforms to a specific format and follows business rules.
https://kamal-ahmed.github.io/DQ-Dimensions.github.io/
https://hub.getdbt.com/infinitelambda/dq_tools/latest/
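As a minimal illustration of how two of these dimensions turn into concrete checks, here is a sketch in Python against the hypothetical orders table from the pizza example (column names and menu values are assumptions):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2],                                  # duplicate order_id
    "pizza_type": ["Margherita", "Pepperoni", "Hawaiian"],  # unknown menu item
})

# Uniqueness: only one instance of each order_id should appear.
duplicates = orders[orders.duplicated("order_id", keep=False)]
print("Uniqueness violations:\n", duplicates)

# Validity: pizza_type must follow the business rule (a known menu item).
menu = {"Margherita", "Pepperoni", "Veggie Supreme"}
invalid = orders[~orders["pizza_type"].isin(menu)]
print("Validity violations:\n", invalid)
```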
The impact of data distrust: distrust in data is among the data analytics challenges in organizations (SnapLogic, "The State of Data Management – The Impact of Data Distrust").
An ingredient listed in a recipe doesn't exist in the inventory table (out of stock!).
• Column: pizza_type
• Valid values: 'Margherita', 'Pepperoni', 'Veggie Supreme'
• Failure mode: an order includes a pizza_type that is not on the menu.
Consequences:
• Issues with order fulfillment.
• Inaccurate inventory forecasting.
• Inability to prepare certain pizzas due to missing ingredients.
• Customers may be disappointed if their preferred menu items are unavailable.
• Staff may spend extra time handling stock shortages.
Rule: all ingredients listed in the recipes table must exist in the inventory table with sufficient stock (see the sketch below).
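A sketch of how this rule could be checked in Python with pandas; the recipes and inventory columns (ingredient, required_qty, stock) are illustrative assumptions:

```python
import pandas as pd

# Hypothetical recipes table: ingredients required per pizza type.
recipes = pd.DataFrame({
    "pizza_type": ["Margherita", "Margherita", "Pepperoni"],
    "ingredient": ["dough", "mozzarella", "pepperoni"],
    "required_qty": [1, 2, 3],
})

# Hypothetical inventory table: current stock per ingredient.
inventory = pd.DataFrame({
    "ingredient": ["dough", "pepperoni"],   # "mozzarella" is missing entirely
    "stock": [50, 2],                       # pepperoni stock is below the required quantity
})

# Rule: every ingredient listed in recipes must exist in inventory with sufficient stock.
merged = recipes.merge(inventory, on="ingredient", how="left")
missing = merged[merged["stock"].isna()]
understocked = merged[merged["stock"] < merged["required_qty"]]

print("Missing ingredients:\n", missing)
print("Insufficient stock:\n", understocked)
```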
A data contract is an agreement, typically between a service (or producer) and a client (or consumer), that outlines the specifics of data exchange. Its core purpose is to ensure compatibility and clarity in the management and usage of data.
Key terms: producer, consumer, agreement, data exchange.
(See also "The next generation of Data Platforms is the Data Mesh", from The PayPal Technology Blog.)
A data contract is: 1. an interface, 2. expectations, 3. governed, 4. explicit.
Source: Driving Data Quality with Data Contracts (2023) by Andrew Jones
The four properties (1. interface, 2. expectations, 3. governed, 4. explicit) in detail:
• A stable and supported interface to the data
• Versioned, with a migration path for breaking schema changes
• An agreement between the generators and consumers
• Documented, including the structure, schema and semantics
• SLOs around the reliability of the data
• Well-defined ownership and responsibilities
• Handling personally identifiable data in line with company policies and regulations
• Controlling access to data
• Automated, driven by metadata provided in the data contracts
• Accessible data, explicitly provided for consumption
• Data products that meet business requirements
• Generators know the value and are incentivised to produce the data
Source: Driving Data Quality with Data Contracts (2023) by Andrew Jones
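There is no single standard file format implied here; as one way to picture it, the sketch below captures a contract as machine-readable metadata in Python. The field names (owner, freshness_slo_hours, contains_pii, ...) are illustrative assumptions, not taken from the book or any specific tool:

```python
from dataclasses import dataclass, field

@dataclass
class ColumnSpec:
    name: str
    dtype: str
    nullable: bool = True

@dataclass
class DataContract:
    name: str                       # the dataset the contract covers
    version: str                    # versioned interface, with a migration path for breaking changes
    owner: str                      # well-defined ownership and responsibilities
    columns: list[ColumnSpec] = field(default_factory=list)  # documented structure and schema
    freshness_slo_hours: int = 24   # an SLO around the reliability of the data
    contains_pii: bool = False      # drives governance and access-control automation

# Hypothetical contract for the pizza shop's orders table.
orders_contract = DataContract(
    name="orders",
    version="1.2.0",
    owner="team-pizza-analytics",
    columns=[
        ColumnSpec("order_id", "int", nullable=False),
        ColumnSpec("pizza_type", "string", nullable=False),
    ],
    freshness_slo_hours=6,
)
```

Because the contract is just metadata, tooling can read it to generate schema checks, access policies, and documentation automatically, which is what "automated, driven by metadata" refers to.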
…testing frameworks in dbt (2023)
Two types of testing:
• Testing the code (dev/test env, CI): data is frozen, the code changes. Validates the code that processes data before it is deployed to prod.
• Testing the data (prod env): the code is frozen, the data changes. Validates the data as it's loaded into production.
• Data freezed, code changed — test the code: Lint -> Unit test -> Integration testing. Catches:
  ◦ unexpected changes from pipeline code refactoring
  ◦ SQL column definition changes
  ◦ column renames
• Code freezed, data changed — validate the data content. Catches:
  ◦ many NULLs
  ◦ data not updating
  ◦ values too small/large
  ◦ …
https://www.getdbt.com/blog/building-a-data-quality-framework-with-dbt-and-dbt-cloud
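A compact sketch of the two kinds of test in Python; the transformation and check functions (add_order_total, check_order_content) are hypothetical:

```python
import pandas as pd

# --- Testing the code: data freezed, code changed (run in dev/CI, e.g. with pytest) ---
def add_order_total(orders: pd.DataFrame, unit_price: float = 10.0) -> pd.DataFrame:
    """Hypothetical transformation under test."""
    out = orders.copy()
    out["total"] = out["quantity"] * unit_price
    return out

def test_add_order_total():
    fixture = pd.DataFrame({"order_id": [1], "quantity": [2]})  # frozen test data
    result = add_order_total(fixture)
    assert result.loc[0, "total"] == 20.0  # refactoring must not change this behaviour

# --- Testing the data: code freezed, data changed (run against production loads) ---
def check_order_content(orders: pd.DataFrame) -> list[str]:
    problems = []
    if orders["order_id"].isna().any():
        problems.append("NULL order_id values")
    if (orders["quantity"] <= 0).any():
        problems.append("quantity too small")
    return problems
```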
[Diagram: a data pipeline with "test code" (bug-specific tests, large logic tests) alongside "test data" (data tests at each stage of the pipeline).]
https://www.telm.ai/blog/how-to-solve-data-quality-issues-at-every-lifecycle-stage/
Great Expectations (GX) is a data quality tool:
• It is for validating data, documenting data, and profiling data.
• It maintains the quality of data and improves communication between teams.
Key features:
• Expectations are like assertions in traditional Python unit tests.
• Automated data profiling automates pipeline tests.
• Data Contexts and Data Sources allow you to configure connections to your data sources.
• Checkpoints are the tooling for running data validation.
• Data Docs provide clean, human-readable documentation.
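A minimal sketch of expectations in code. Note that the GX API has changed across versions; this uses the older pandas-dataset interface (ge.from_pandas), which newer releases replace with Data Contexts, Data Sources, and Checkpoints:

```python
import great_expectations as ge
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, None],
    "pizza_type": ["Margherita", "Pepperoni", "Hawaiian"],
})

# Wrap the DataFrame so expectation methods become available on it.
ge_orders = ge.from_pandas(orders)

# Expectations read like assertions in a unit test, but about data.
ge_orders.expect_column_values_to_not_be_null("order_id")
ge_orders.expect_column_values_to_be_in_set(
    "pizza_type", ["Margherita", "Pepperoni", "Veggie Supreme"]
)

# Run all registered expectations and inspect the overall result.
results = ge_orders.validate()
print(results.success)
```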
More about GX:
https://medium.com/@expectgreatdata/down-with-pipeline-debt-introducing-great-expectations-862ddc46782a
https://github.com/great-expectations/great_expectations
What is dbt? dbt brings to analytics code the practices that software engineers use to build applications:
• Develop: write modular data transformations in .sql or .py files; dbt handles the chore of dependency management.
• Test and document: test every model prior to production, and share dynamically generated documentation with all data stakeholders.
• Version control and CI/CD: deploy safely using dev environments; Git-enabled version control enables collaboration and a return to previous states.
https://www.getdbt.com/product/what-is-dbt
https://github.com/dbt-labs/dbt-core
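dbt models are most often written in SQL, but on some adapters (Snowflake, Databricks, BigQuery; dbt >= 1.3) a model can also be a .py file. A minimal, hypothetical sketch; the upstream model name stg_orders and the columns are assumptions, and the exact DataFrame API depends on the adapter:

```python
# models/order_totals.py -- a dbt Python model (supported adapters only)
def model(dbt, session):
    # Tell dbt how to materialize the result in the warehouse.
    dbt.config(materialized="table")

    # Resolve the dependency on an upstream model, like {{ ref('stg_orders') }} in SQL.
    orders = dbt.ref("stg_orders")

    # The returned object is a Snowpark/PySpark/pandas-style DataFrame depending on the adapter;
    # a pandas-like interface is assumed here purely for illustration.
    orders["total"] = orders["quantity"] * orders["unit_price"]
    return orders
```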
…datastore
https://github.com/calogica/dbt-expectations
dbt model contracts:
• Contract: enforce a contract on a data model
• Column: define the data type
• Constraint: a specific constraint
  ◦ nullness
  ◦ uniqueness
  ◦ primary keys
  ◦ foreign keys
…testing frameworks in dbt (2023)
Two types of testing, and frameworks for each:
• Testing the code (dev/test env, CI; data freezed, code changed) — validating the code that processes data before it is deployed to prod:
  ◦ ETL code: pytest, ...
  ◦ model code: dbt unit testing, Recce
• Testing the data (prod env; code freezed, data changed) — validating the data as it's loaded into production:
  ◦ data content: pydantic, Great Expectations, dbt test, dbt_utils / dbt-expectations / elementary, ...
  ◦ data schemas: dbt data contracts
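For row-level data content checks, pydantic is one lightweight option. A minimal sketch; the OrderRecord schema and its fields are illustrative assumptions:

```python
from pydantic import BaseModel, Field, ValidationError

class OrderRecord(BaseModel):
    order_id: int
    pizza_type: str
    quantity: int = Field(gt=0)  # business rule: at least one pizza per order line

rows = [
    {"order_id": 1, "pizza_type": "Margherita", "quantity": 2},
    {"order_id": 2, "pizza_type": "Pepperoni", "quantity": 0},  # violates the rule
]

for row in rows:
    try:
        OrderRecord(**row)
    except ValidationError as err:
        print(f"Bad record {row['order_id']}: {err}")
```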
How to choose a framework:
…
2. Available assertions
3. Custom assertions (extensibility)
4. Validation execution / integration with the existing technology stack
https://medium.com/@brunouy/a-guide-to-open-source-data-quality-tools-in-late-2023-f9dbadbc7948
…a data pipeline with tests. Automation is the key.
https://www.karllhughes.com/posts/testing-matters
https://greatexpectations.io/blog/maximizing-productivity-of-analytics-teams-pt3
Recap and tips
• Pipeline as code + infrastructure as code
• Test the data, test the data pipeline (code), and automate it
• Try a test framework (GX, dbt test, ...)
• Start with priority/severity
• Integrate with your development/deploy flow (CI/CD)
• Integrate with data observability (monitoring, alerting) and operations
• Prefer tools that integrate natively/easily into the data orchestrator
• Data reliability engineering (data governance in a larger scope)
• More data testing frameworks (a Pandera sketch follows below):
  ◦ Soda Core
  ◦ Pandera
  ◦ Fugue
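Of the frameworks above, Pandera is an easy one to try from plain Python. A minimal sketch validating the hypothetical orders table; column names and checks are illustrative assumptions:

```python
import pandas as pd
import pandera as pa

# Declarative schema: column types plus value-level checks.
orders_schema = pa.DataFrameSchema({
    "order_id": pa.Column(int, unique=True, nullable=False),
    "pizza_type": pa.Column(str, pa.Check.isin(["Margherita", "Pepperoni", "Veggie Supreme"])),
    "quantity": pa.Column(int, pa.Check.gt(0)),
})

orders = pd.DataFrame({
    "order_id": [1, 2],
    "pizza_type": ["Margherita", "Pepperoni"],
    "quantity": [1, 2],
})

# Raises an exception describing every failed check (lazy=True collects them all).
validated = orders_schema.validate(orders, lazy=True)
print(validated.head())
```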