Data Contracts: Empowering Data Quality Enforcement (sciwork2023)

suci
December 09, 2023

Data Contracts: Empowering Data Quality Enforcement at sciwork2023
https://conf.sciwork.dev/program
https://pretalx.sciwork.dev/sw23/talk/BH7HYE/

Transcript

  1. Agenda
    01 Data Quality: the importance of data quality in the data-driven landscape
    02 Data Contracts: enhancing data quality with data contracts
    03 Test with dbt: one of the approaches
    04 To Data Governance & Recap: accountability of data team culture
  2. About Shuhsi
    Shuhsi Lin, sciwork member. Find me on the sciwork Discord.
    Working in a manufacturing company, with data and people.
    Interested in: • Agile/Engineering culture • Team coaching • Data Engineering • Data visualization
    Related to today's topic: talking with people in Career Conversation at scisprint
  3. 6 Dimensions of Data Quality
    • Accuracy: the degree to which data correctly reflects the real-world object or event
    • Completeness: expected comprehensiveness; are all datasets and data items recorded?
    • Consistency: data across all systems reflects the same information and is in sync across the data stores
    • Timeliness: information is available when it is expected and needed
    • Uniqueness: there is only one instance of the information in a database
    • Validity: information conforms to a specific format and follows business rules
    https://kamal-ahmed.github.io/DQ-Dimensions.github.io/
    https://hub.getdbt.com/infinitelambda/dq_tools/latest/
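Several of the dimensions above can be expressed as simple ratios over a dataset. The following is a minimal sketch in plain Python, not a method from the talk: the sample records, field names, the email pattern, and the one-day freshness threshold are all assumptions chosen for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical sample records; field names and values are illustrative only.
NOW = datetime(2023, 12, 9, 12, 0, tzinfo=timezone.utc)
records = [
    {"id": 1, "email": "a@example.com", "updated_at": NOW - timedelta(hours=1)},
    {"id": 2, "email": None, "updated_at": NOW - timedelta(hours=2)},
    {"id": 2, "email": "not-an-email", "updated_at": NOW - timedelta(days=3)},
]

# Deliberately loose email pattern (an assumption, not a full validator).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows where the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, field):
    """Share of distinct values among all values of the field."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)


def validity(rows, field, pattern):
    """Share of non-null values that match the expected format."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(bool(pattern.match(v)) for v in values) / len(values)


def timeliness(rows, field, now, max_age):
    """Share of rows updated within the allowed age."""
    return sum(now - r[field] <= max_age for r in rows) / len(rows)


report = {
    "completeness(email)": completeness(records, "email"),
    "uniqueness(id)": uniqueness(records, "id"),
    "validity(email)": validity(records, "email", EMAIL_RE),
    "timeliness(updated_at)": timeliness(
        records, "updated_at", NOW, timedelta(days=1)
    ),
}

for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Accuracy and consistency are left out on purpose: they need a reference source or a second system to compare against, which a single in-memory dataset cannot provide.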
  4. Data Quality in AI
    Impact of poor-quality data and analytics: the impact of data distrust!
    • Distrust in data
    • Data analytics challenges in organizations
    Source: SnapLogic, “The State of Data Management – The Impact of Data Distrust”
  5. Model-centric vs Data-centric AI
    AI systems = code + data
    • Model-centric (improving model/code): data fixed, improve model/code
    • Data-centric (improving data quality): model fixed, improve data
    “If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.” Andrew Ng
    • A Chat with Andrew on MLOps: From Model-centric to Data-centric AI, by Andrew Ng. 2022
    • https://www.zenml.io/blog/its-the-data-silly-how-data-centric-ai-is-driving-mlops
  6. What Is a Data Contract?
    An agreed interface between the generators of data and its consumers. It sets the expectations around that data, defines how it should be governed, and facilitates the explicit generation of quality data that meets the business requirements.
    Key terms: agreed, generators, consumers; 1. interface 2. expectations 3. governed 4. explicit
    Driving Data Quality with Data Contracts (2023) by Andrew Jones
  7. What Is a Data Contract?
    A data contract is a formal agreement, typically between a service (or producer) and a client (or consumer), that outlines the specifics of data exchange. Its core purpose is to ensure compatibility and clarity in the management and usage of data.
    Key terms: producer, consumer, agreement, data exchange
    The next generation of Data Platforms is the Data Mesh (from The PayPal Technology Blog)
  8. Four Principles of Data Contracts
    1. Interface 2. Expectations 3. Governed 4. Explicit
    • Stable and supported data interface
    • Versioned, with a migration path for breaking schema changes
    • An agreement between the generators and consumers
    • Documented, including the structure, schema and semantics
    • Data reliability SLOs
    • Clearly assigned data stewardship/ownership and accountabilities
    • Managing personally identifiable data in line with organizational policies and regulations
    • Regulating data access
    • Automated, based on contract-specified metadata
    • Data specifically designated and made accessible for use
    • Data products aligned with business requirements
    • Generators understand and are motivated to create valuable data
    Driving Data Quality with Data Contracts (2023) by Andrew Jones
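To make the principles above concrete, a contract can be captured as a small, versioned data structure that carries schema, ownership, an SLO, and PII flags. This is only a sketch: the class names, fields, and the `orders` example are hypothetical and do not come from the book or from any contract specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One documented column of the interface (hypothetical structure)."""
    name: str
    dtype: str
    description: str
    pii: bool = False  # governance: marks personally identifiable data


@dataclass(frozen=True)
class DataContract:
    """A versioned, owned, documented data interface (illustrative only)."""
    name: str
    version: str              # versioned interface
    owner: str                # clearly assigned ownership
    freshness_slo_hours: int  # a data reliability SLO
    columns: tuple

    def schema(self):
        """Column name -> declared type, usable by automated checks."""
        return {c.name: c.dtype for c in self.columns}

    def pii_columns(self):
        """Columns that need policy-driven handling."""
        return [c.name for c in self.columns if c.pii]


# Hypothetical example contract.
orders_v1 = DataContract(
    name="orders",
    version="1.0.0",
    owner="sales-data-team",
    freshness_slo_hours=24,
    columns=(
        Column("order_id", "integer", "Unique order identifier"),
        Column("customer_email", "string", "Contact email", pii=True),
        Column("amount", "float", "Order total"),
    ),
)

print(orders_v1.schema())
print(orders_v1.pii_columns())
```

Keeping the contract as frozen, machine-readable metadata is what allows governance and validation to be automated from it rather than maintained by hand.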
  9. Strategies for Implementing Data Contract Enforcement
    • Use the values specified in the contract to implement validation checks during the CI/CD process.
    • Verify the results in the development environment.
    • Align the output with the schema and data types defined in the contract.
    • In cases of backward-incompatible changes, evaluate the contract's designated tier.
    • Based on the contract's tier, decide and execute the appropriate course of action.
    How to Mitigate AI Biases Using Data Contracts | Data Governance to Improve Data Quality at Data Science Dojo 2023
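The enforcement steps above could be sketched as a CI check along the following lines. Everything here is an assumption made for illustration: the schema representation (column name to type name), the tier numbers, and the action names are hypothetical, since the slide does not define what each tier means.

```python
# Map contract type names to Python types (illustrative subset).
PY_TYPES = {"integer": int, "string": str, "float": float}

# Hypothetical tier policy: lower tier number = stricter handling.
TIER_ACTIONS = {1: "block", 2: "require_approval", 3: "warn"}


def validate_rows(rows, schema):
    """Check output rows against the schema and types in the contract."""
    errors = []
    for i, row in enumerate(rows):
        for name, dtype in schema.items():
            if name not in row:
                errors.append(f"row {i}: missing {name}")
            elif not isinstance(row[name], PY_TYPES[dtype]):
                errors.append(f"row {i}: {name} expected {dtype}")
    return errors


def breaking_changes(old, new):
    """List backward-incompatible differences between two schemas."""
    problems = []
    for name, dtype in old.items():
        if name not in new:
            problems.append(f"removed column: {name}")
        elif new[name] != dtype:
            problems.append(f"type change: {name} {dtype} -> {new[name]}")
    return problems


def decide(old, new, tier):
    """Pick an action for a proposed schema change based on the tier."""
    problems = breaking_changes(old, new)
    if not problems:
        return "pass", problems
    return TIER_ACTIONS[tier], problems


current = {"order_id": "integer", "amount": "float"}
proposed = {"order_id": "string"}  # changes a type and drops a column

print(validate_rows([{"order_id": 1, "amount": 9.5}, {"order_id": "x"}], current))
print(decide(current, proposed, tier=1))
print(decide(current, {**current, "note": "string"}, tier=1))
```

Adding a column is treated as compatible here, while removing a column or changing a type is treated as breaking; a real policy may draw that line differently.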
  10. Minimum Viable Data Contract
    [Diagram labels] Data Generator, Data Consumer, Data contract Service, Define, Provision, Operate, Write Interface
  11. What is dbt?
    dbt (data build tool) enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
    • Develop: Write modular data transformations in .sql or .py files; dbt handles the chore of dependency management.
    • Test and Document: Test every model prior to production, and share dynamically generated documentation with all data stakeholders.
    • Version Control and CI/CD: Deploy safely using dev environments. Git-enabled version control enables collaboration and a return to previous states.
    https://www.getdbt.com/product/what-is-dbt
    https://github.com/dbt-labs/dbt-core
  12. Two Types of Testing
    • Assumption about data (Prod env): freeze the code, change the data, monitor the output
    • Assertion about code (DEV/TEST env, CI): freeze the data, change the code, monitor the output
    ref: Webinar on: Testing frameworks in dbt. (2023)
  13. Two Types of Testing
    • Assumption about data (Prod env): freeze the code, change the data, monitor the output
    • Assertion about code (DEV/TEST env, CI): freeze the data, change the code, monitor the output
    Four test cases
    • Code: 1. model code: dbt unit testing 2. macro code: dbt test
    • Data: 1. data content: soda 2. data schemas: dbt data contracts
    ref: Webinar on: Testing frameworks in dbt. (2023)
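The difference between the two types of testing can be illustrated outside dbt with a small plain-Python sketch. The transformation, field names, and sample rows below are hypothetical; in a dbt project the same roles would be played by unit tests and data tests rather than by these functions.

```python
def add_revenue(rows):
    """Transformation under test: revenue = quantity * unit_price."""
    return [{**r, "revenue": r["quantity"] * r["unit_price"]} for r in rows]


def unit_test_add_revenue():
    """Assertion about code: the data is frozen, the code may change."""
    fixed_input = [{"quantity": 2, "unit_price": 5.0}]
    expected = [{"quantity": 2, "unit_price": 5.0, "revenue": 10.0}]
    return add_revenue(fixed_input) == expected


def data_test_non_negative(rows):
    """Assumption about data: the code is frozen, the data may change.

    Returns the rows that violate the assumption (empty list = pass).
    """
    return [r for r in rows if r["quantity"] < 0 or r["unit_price"] < 0]


# Runs in DEV/TEST or CI, with a fixed fixture.
print("unit test passed:", unit_test_add_revenue())

# Runs against whatever data arrives in production (sample batch here).
incoming = [
    {"quantity": 3, "unit_price": 4.0},
    {"quantity": -1, "unit_price": 4.0},
]
print("violating rows:", data_test_non_negative(incoming))
```

The unit test fails only when someone changes `add_revenue`; the data test fails only when upstream data stops matching the assumption, which is why the two belong in different environments.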
  14. Maturity Curve of Data Contracts
    Stages: Awareness (know how data is being used), Ownership (responsible for data with contracts), (Data) API, Rules
    • Define input/output • CI/CD • Documentation • Versioning • Enforcements • Alerting
    Data Contracts - Accountable Data Quality | Data Quality Camp at 2023 Data Council
  15. Data Contracts Adoption Over Time
    • Starting: awareness of data quality, saving cost, simplicity (with existing tools), incrementality
    • Iterating: data modeling, adding constraints, abstractions, alerting, violation reporting
    • Maturing: data governance, privacy, policy enforcement, security
    Data Contracts - Accountable Data Quality | Data Quality Camp at 2023 Data Council
  16. Recap
    • Data quality is important (for AI)
    • Data contracts improve data quality through enforcement
    • dbt can help, but other approaches are possible
    • Adoption goes together with data team culture
  17. Reference
    1. "Data Contracts: Consensus as Code" - Ryan Collingwood (PyCon AU 2023)
    2. Data Contracts - Accountable Data Quality | Data Quality Camp (2023)
    3. Shift-left governance for your dbt centered stack: Data contracts and more! - Coalesce 2023
    4. https://github.com/AltimateAI/awesome-data-contracts/tree/main
    5. Testing: Our assertions vs. reality at Coalesce 2022
    6. Data Contracts in action powered by Python open source ecosystem - Alyona Galyeva at PDAMS 2023
    7. data-contract-template (from PayPal)
    8. Driving Data Quality with Data Contracts - Andrew Jones 2023
    9. Sneak Peek: Data Contracts for dbt – Where Expectations Meet Automation at Datahub
    10. How to Mitigate AI Biases Using Data Contracts | Data Governance to Improve Data Quality at Data Science Dojo 2023
    dbt
    a. unit test: https://github.com/EqualExperts/dbt-unit-testing
    b. macro test: https://docs.getdbt.com/docs/build/tests
    c. soda: https://github.com/sodadata/soda-core
    d. data contracts: https://docs.getdbt.com/docs/collaborate/govern/model-contracts
    e. freshness: https://docs.getdbt.com/reference/resource-properties/freshness
    f. owners: https://docs.getdbt.com/docs/collaborate/govern/model-access