landscape
Data Quality
02 Data Contracts: Enhancing data quality with data contracts
03 Test with dbt: One of the approaches
04 To Data Governance & Recap: Accountability of data team culture
Data Contracts, sciwork 2023
Shuhsi Lin
Working in a manufacturing company, with data and people
Interested in
• Agile/Engineering culture
• Team coaching
• Data Engineering
• Data visualization
Talk people in Career Conversation at scisprint
Related to today's topic
• Accuracy: correctly reflects the real-world object/event
• Completeness: expected comprehensiveness; are all datasets and data items recorded?
• Consistency: data across all systems reflects the same information and is in sync across the data stores
• Timeliness: information is available when it is expected and needed
• Uniqueness: there is only one instance of the information appearing in a database
• Validity: information conforms to a specific format and follows business rules
https://kamal-ahmed.github.io/DQ-Dimensions.github.io/
https://hub.getdbt.com/infinitelambda/dq_tools/latest/
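Several of these dimensions can be measured directly on a dataset. The sketch below is a minimal illustration in plain Python, assuming hypothetical records with `id`, `email` and `updated_at` fields and a simplified email pattern. Accuracy and consistency are left out because they need an external reference (the real world, or a second system) to compare against.

```python
import re
from datetime import datetime, timedelta

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "email": "a@example.com", "updated_at": datetime(2023, 12, 1, 9, 0)},
    {"id": 2, "email": None, "updated_at": datetime(2023, 12, 1, 9, 5)},
    {"id": 2, "email": "not-an-email", "updated_at": datetime(2023, 11, 1, 9, 0)},
]

# Simplified format rule, not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows where the field is recorded (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, field):
    """Share of distinct values among all values of the field."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)


def validity(rows, field, pattern):
    """Share of recorded values that match the expected format."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(bool(pattern.match(v)) for v in values) / len(values)


def timeliness(rows, field, now, max_age):
    """Share of rows updated within the allowed age."""
    return sum(now - r[field] <= max_age for r in rows) / len(rows)


now = datetime(2023, 12, 1, 10, 0)
report = {
    "completeness(email)": completeness(records, "email"),
    "uniqueness(id)": uniqueness(records, "id"),
    "validity(email)": validity(records, "email", EMAIL_PATTERN),
    "timeliness(updated_at)": timeliness(records, "updated_at", now, timedelta(days=1)),
}

for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Each score is a ratio between 0 and 1, so a threshold per dimension can be agreed between the people who produce the data and the people who use it.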
analytics
The impact of data distrust!
Distrust in data: data analytics challenges in organizations
Snaplogic, “The State of Data Management – The Impact of Data Distrust”
Data-centric AI, by Andrew Ng (2022)
AI systems = code + data
Model-centric vs. data-centric
• Model-centric (improving model/code): data fixed, improve model/code
• Data-centric (improving data quality): model fixed, improve data
“If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.” Andrew Ng
• https://www.zenml.io/blog/its-the-data-silly-how-data-centric-ai-is-driving-mlops
of data and its consumers. It sets the expectations around that data, defines how it should be governed, and facilitates the explicit generation of quality data that meets the business requirements.
Driving Data Quality with Data Contracts (2023) by Andrew Jones
1. interface 2. expectations 3. governed 4. explicit
Generators and consumers: agreed
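One way to picture the four elements is to write a contract down as a small data structure. The sketch below is only an illustration: the class names, field names and the `orders` example are assumptions, not a format defined in the book or by any tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    """One column of the agreed interface (hypothetical structure)."""
    name: str
    dtype: str
    description: str = ""
    pii: bool = False  # governance metadata


@dataclass(frozen=True)
class DataContract:
    """A contract agreed between the generators and the consumers of a dataset."""
    name: str
    version: str
    owner: str          # the accountable generator team
    columns: tuple      # 1. interface: the schema consumers can rely on
    expectations: dict = field(default_factory=dict)  # 2. expectations, e.g. SLOs


def schema_of(contract):
    """Return the interface as a simple column name -> type mapping."""
    return {column.name: column.dtype for column in contract.columns}


# Hypothetical example contract.
orders_contract = DataContract(
    name="orders",
    version="1.0.0",
    owner="sales-data-team",
    columns=(
        Column("order_id", "int", "Unique order identifier"),
        Column("customer_email", "str", "Contact email", pii=True),
        Column("amount", "float", "Order total"),
    ),
    expectations={"freshness_hours": 24, "completeness": 0.99},
)

print(schema_of(orders_contract))
```

Keeping the schema, the owner and the expectations in one explicit object is what lets later steps (validation, documentation, access rules) be derived from it instead of being maintained by hand.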
agreement, typically between a service (or producer) and a client (or consumer), that outlines the specifics of data exchange. Its core purpose is to ensure compatibility and clarity in the management and usage of data.
Producer and consumer: agreement on data exchange
The next generation of Data Platforms is the Data Mesh (from The PayPal Technology Blog)
Governed 4. Explicit
• Stable and supported data interface
• Versioned, with a migration path for breaking schema changes
• An agreement between the generators and consumers
• Documented, including the structure, schema and semantics
• Data reliability SLOs
• Clearly assigned data stewardship/ownership and accountabilities
• Managing personally identifiable data in line with organizational policies and regulations
• Regulating data access
• Automated, based on contract-specified metadata
• Data specifically designated and made accessible for use
• Data products aligned with business requirements
• Generators understand and are motivated to create valuable data
Driving Data Quality with Data Contracts (2023) by Andrew Jones
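The "governed" points (handling personally identifiable data, regulating access, automation from contract metadata) can be sketched as a small policy function. Everything here is an assumption for illustration: the metadata keys, the role names and the masking rule are hypothetical, not part of any specific contract format.

```python
# Hypothetical contract metadata; keys and role names are assumptions.
contract = {
    "name": "customers",
    "owner": "crm-team",
    "columns": {
        "customer_id": {"type": "int", "pii": False},
        "email": {"type": "str", "pii": True},
        "country": {"type": "str", "pii": False},
    },
    "pii_readers": {"support"},  # roles allowed to see PII columns
}


def apply_access_policy(row, contract, role):
    """Return a copy of the row with PII masked unless the role may read it."""
    can_read_pii = role in contract["pii_readers"]
    masked = {}
    for column, value in row.items():
        is_pii = contract["columns"][column]["pii"]
        masked[column] = value if (can_read_pii or not is_pii) else "***"
    return masked


row = {"customer_id": 7, "email": "a@example.com", "country": "TW"}
analyst_view = apply_access_policy(row, contract, "analyst")
support_view = apply_access_policy(row, contract, "support")

print(analyst_view)
print(support_view)
```

Because the rule reads only the contract metadata, marking a new column as PII changes what every role sees without touching the policy code.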
specified in the contract to implement validation checks during the CI/CD process.
• Verify the results in the development environment.
• Align the output with the schema and data types defined in the contract.
• In cases of backward incompatible changes, evaluate the contract's designated tier.
• Based on the contract's tier, decide and execute the appropriate course of action.
How to Mitigate AI Biases Using Data Contracts | Data Governance to Improve Data Quality at Data Science Dojo 2023
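The last two steps can be sketched as a CI check that compares the current schema with the proposed one and picks an action from the contract's tier. The tier numbers and the actions attached to them are hypothetical; the talk does not specify them.

```python
def breaking_changes(old_schema, new_schema):
    """List backward-incompatible differences between two column -> type maps."""
    problems = []
    for column, dtype in old_schema.items():
        if column not in new_schema:
            problems.append(f"removed column: {column}")
        elif new_schema[column] != dtype:
            problems.append(f"type change: {column} {dtype} -> {new_schema[column]}")
    # Newly added columns are treated as backward compatible in this sketch.
    return problems


# Hypothetical tiers and actions, assumed for illustration only.
TIER_ACTIONS = {
    1: "block the deployment",
    2: "require consumer sign-off",
    3: "notify consumers",
}


def decide_action(old_schema, new_schema, tier):
    """Return the action for a proposed schema change and the reasons for it."""
    problems = breaking_changes(old_schema, new_schema)
    if not problems:
        return "deploy", problems
    return TIER_ACTIONS[tier], problems


v1 = {"order_id": "int", "amount": "float", "status": "str"}
v2 = {"order_id": "int", "amount": "str", "created_at": "timestamp"}

action, problems = decide_action(v1, v2, tier=1)
print(action)
for problem in problems:
    print(" -", problem)
```

A compatible change, such as only adding a column, returns `"deploy"` regardless of tier, so the tier matters only when consumers could actually break.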
What is dbt?
transform their data using the same practices that software engineers use to build applications.
• Develop: Write modular data transformations in .sql or .py files – dbt handles the chore of dependency management.
• Test and Document: Test every model prior to production, and share dynamically generated documentation with all data stakeholders.
• Version Control and CI/CD: Deploy safely using dev environments. Git-enabled version control enables collaboration and a return to previous states.
https://www.getdbt.com/product/what-is-dbt
https://github.com/dbt-labs/dbt-core
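To make "dependency management" concrete: dbt models refer to each other with `{{ ref('model_name') }}`, and the run order follows from those references. The toy sketch below imitates that idea with a regular expression and Python's `graphlib` (Python 3.9+). It is not how dbt is implemented, and the model names and SQL are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model files: name -> SQL text.
models = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "orders_enriched": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
    "daily_revenue": (
        "select order_date, sum(amount) "
        "from {{ ref('orders_enriched') }} group by 1"
    ),
}

# Matches {{ ref('name') }} or {{ ref("name") }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def dependencies(sql):
    """Return the set of model names referenced in a SQL string."""
    return set(REF_PATTERN.findall(sql))


# Map each model to the models it depends on, then sort so that
# every model comes after its dependencies.
graph = {name: dependencies(sql) for name, sql in models.items()}
run_order = list(TopologicalSorter(graph).static_order())

print(run_order)
```

The two staging models have no dependency on each other, so their relative order may vary; only the constraint that upstream models run first is guaranteed.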
Two Types of Testing
the data, monitor the output (Prod env)
Freeze the data, change the code, monitor the output (DEV/TEST env, CI)
ref: Webinar on: Testing frameworks in dbt. (2023)
Two Types of Testing: Four Test Cases
Code
1. model code: dbt unit testing
2. macro code: dbt test
Data
1. data content: soda
2. data schemas: dbt data contracts
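The difference between the two types of testing can be shown without any framework. In the sketch below, a unit test runs the code against a fixed fixture (the data is frozen, so a failure points at the code), while a data test runs fixed checks against each incoming batch (the code stays fixed while the data changes). The function names, fields and rules are hypothetical, and the sketch does not use dbt or soda.

```python
def order_totals(rows):
    """Transformation under test: total amount per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


def unit_test_order_totals():
    """Freeze the data, change the code: fixed fixture, known expected output."""
    fixture = [
        {"customer": "a", "amount": 10},
        {"customer": "a", "amount": 5},
        {"customer": "b", "amount": 7},
    ]
    return order_totals(fixture) == {"a": 15, "b": 7}


def data_test(rows):
    """Code stays fixed, data changes: check every incoming batch."""
    failures = []
    for i, row in enumerate(rows):
        if row.get("customer") is None:
            failures.append(f"row {i}: customer is null")
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            failures.append(f"row {i}: amount must be a non-negative number")
    return failures


good_batch = [{"customer": "a", "amount": 10}, {"customer": "b", "amount": 0}]
bad_batch = [{"customer": None, "amount": 3}, {"customer": "c", "amount": -1}]

print("unit test passed:", unit_test_order_totals())
print("good batch failures:", data_test(good_batch))
print("bad batch failures:", data_test(bad_batch))
```

The unit test belongs in the DEV/TEST and CI stage, where it runs on every code change; the data test belongs in production, where it runs on every new batch.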
Data Contracts - Accountable Data Quality | Data Quality Camp at 2023 Data Council
Responsible for data with contracts
• Define input/output
• CI/CD
• Documentation
• Versioning
• Enforcements
• Alerting
Know how data is being used
AU 2023)
2. Data Contracts - Accountable Data Quality | Data Quality Camp (2023)
3. Shift-left governance for your dbt centered stack: Data contracts and more! - Coalesce 2023
4. https://github.com/AltimateAI/awesome-data-contracts/tree/main
5. Testing: Our assertions vs. reality at Coalesce 2022
6. Data Contracts in action powered by Python open source ecosystem - Alyona Galyeva at PDAMS 2023
7. data-contract-template (from PayPal)
8. Driving Data Quality with Data Contracts - Andrew Jones 2023
9. Sneak Peek: Data Contracts for dbt – Where Expectations Meet Automation at Datahub
10. How to Mitigate AI Biases Using Data Contracts | Data Governance to Improve Data Quality at Data Science Dojo 2023
Reference: dbt
a. unit test: https://github.com/EqualExperts/dbt-unit-testing
b. macro test: https://docs.getdbt.com/docs/build/tests
c. soda: https://github.com/sodadata/soda-core
d. data contracts: https://docs.getdbt.com/docs/collaborate/govern/model-contracts
e. freshness: https://docs.getdbt.com/reference/resource-properties/freshness
f. owners: https://docs.getdbt.com/docs/collaborate/govern/model-access