University_of_Amsterdam-Data_Governance_AIThe_Positive_Feedback_Loop.pdf

Data Governance & AI: A positive feedback loop
Data Expo, Utrecht, September 11, 2025
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
Thanks to Prof. George Fletcher, Dr. Juan Sequeda, Dr. Kathleen Gregory, Dr. Laura Koesten, the INDElab team 1

Research Topics at INDElab
Design systems to support people in working with data from diverse sources
Address problems related to the preparation, management, and integration of data
• Automated Knowledge Graph Construction (e.g. building KGs from multiple modalities; architectures for integrating KGs and LLMs)
• Context Aware Data Systems (e.g. rule learning & digital twins; human-data interaction; human-AI workflows)
• Data Management for Machine Learning (e.g. data quality assessment; data handling impact on ML models; data search) 3

“Data Governance is a discipline which provides the necessary policies, processes, standards, roles and responsibilities needed to ensure that data is managed as an asset.”
– https://www.lightsondata.com/what-is-data-governance/ 4

Motivation 5

Motivation: Regulation 6

https://www.vpn.nl/faq/avg https://www.cyberpilot.io/cyberpilot-blog/data-protection-principles-the-7-principles-of-gdpr-explained/ 7

EU Data Governance Act https://digital-strategy.ec.europa.eu/en/policies/data-governance-act 8

EU Data Act
• Ease user access to data generated by them
• New data sharing contracts for SMEs
• Cloud switching
• Public sector agencies can access data from businesses (in emergencies) 9

EU AI Act
• Data and data governance
• Transparency for Users
• Human oversight
• Accuracy, Robustness and Cybersecurity
• Traceability and Auditability
Lilian Edwards. (2022). The EU AI Act proposal. Ada Lovelace Institute. Available at: https://www.adalovelaceinstitute.org/resource/eu-ai-act-explainer/
https://www.lawfareblog.com/artificial-intelligence-act-what-european-approach-ai 10

Motivation: New Opportunities 11

Finding digital truth—that is, identifying and combining data that accurately represent reality—is becoming more difficult and more important. More difficult because data and their sources are multiplying. And more important because firms need to get their data house in order to benefit from AI, which they must to stay competitive.
-- The Economist, February 2020 12

AI in Production is a team sport 13
[Diagram labels]
Graph analytics; Self service analytics; AI / ML models
Data Consumers: People with business questions; Data Analysts; Data Scientist
Sources: Data warehouse, data lakes and app-specific DBs; Cloud services and APIs; Files and shared files; Analytics platforms
Data Producers: Data Engineers; Data Stewards

New architectures
Source: The Future of Work With AI - Microsoft March 2023 Event https://www.youtube.com/watch?v=Bf-dbS9CcRU&ab_channel=Microsoft 14

Why is Data Governance increasingly important?
• The amount of data
• More people have access to data
• More ways to collect data
• More kinds of data
• Uses have expanded
• New regulations
• Ethical concerns
Eryurek, Evren, et al. Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness. First edition, O’Reilly Media, Inc, 2021. 15

Data Lifecycle Perspective 16

Data Lifecycle
Eryurek, Evren, et al. Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness. First edition, O’Reilly Media, Inc, 2021. 17

Governance of a data life cycle
Eryurek, Evren, et al. Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness. First edition, O’Reilly Media, Inc, 2021. 18

Data 19

What kind of data do you have? 20

21 https://www.theverge.com/2024/5/14/24156610/google-follows-openai-with-its-own-multimodal-demo-meet-project-astra

Prompting 22 https://www.promptfoo.dev/

Models as Data 23

Synthetic Data https://docs.sdv.dev/sdv 24

Implications for Data Governance
Premise | Consequence
Improving ability to use expertise | Expertise is a critical resource
Improving ability to use more and different signals | Signal capture becomes imperative
Multiple content sources buttress each other | Understand and use the entire data estate
Machine learning SOTA is accessible | Problem formulation is fundamental 25

26 Examples

Modern Data Stack
• Cloud-first
• Built around cloud data warehouse/lake
• Focus on solving one problem
• Offered as SaaS or open-core
• Low-entry barrier
• Actively supported by communities
https://atlan.com/modern-data-stack-101/ 27

AI throughout Data Governance Tech
• Data Catalog
  • AI: recommendations, prioritising curation
• Semantic Layers
  • AI: automatically building knowledge graphs and vocabularies
• Data workspaces
  • AI: synthetic data generation, making sure data is properly used
• Monitoring and reporting
  • AI: governance advice, understanding an estate 28

Data Catalog
• A tool to manage metadata about data assets 29

The Data Catalog as starting point
• Data catalog as a place not only to find data but also to understand data demands and employ AI
• Including:
  • which datasets are used
  • how data is used (fields)
  • who uses a dataset
  • who are the people to talk to in order to figure out the data 30
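The usage signals listed on this slide (which datasets are used, which fields, by whom, and who to ask) can be derived from access records. The following is a minimal sketch, not a description of any particular catalog product: the `ACCESS_LOG` records, the field names, and the helpers `summarise_usage` and `people_to_ask` are all hypothetical, and a real catalog would read such records from query logs or lineage metadata.

```python
from collections import Counter, defaultdict

# Hypothetical access-log records (illustrative only).
ACCESS_LOG = [
    {"user": "ana", "dataset": "sales", "fields": ["region", "revenue"]},
    {"user": "ben", "dataset": "sales", "fields": ["revenue"]},
    {"user": "ana", "dataset": "customers", "fields": ["country"]},
    {"user": "ana", "dataset": "sales", "fields": ["revenue", "date"]},
]


def summarise_usage(log):
    """Aggregate, per dataset, the access count, field usage, and users."""
    summary = defaultdict(
        lambda: {"accesses": 0, "fields": Counter(), "users": Counter()}
    )
    for record in log:
        entry = summary[record["dataset"]]
        entry["accesses"] += 1                   # which datasets are used
        entry["fields"].update(record["fields"])  # how data is used (fields)
        entry["users"][record["user"]] += 1       # who uses a dataset
    return dict(summary)


def people_to_ask(summary, dataset, top_n=1):
    """Return the most frequent users of a dataset as candidate contacts."""
    return [user for user, _ in summary[dataset]["users"].most_common(top_n)]


usage = summarise_usage(ACCESS_LOG)
print(usage["sales"]["accesses"])
print(usage["sales"]["fields"].most_common(1))
print(people_to_ask(usage, "sales"))
```

Using frequency of access as a proxy for expertise is an assumption of this sketch; ownership or authorship metadata may be a better signal where it exists.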

Case Study Spotify: Low-Intent Discovery https://engineering.atspotify.com/2020/02/how-we-improved-data-discovery-for-data-scientists-at-spotify/ 31

Case Study Spotify: High-Intent Discovery https://engineering.atspotify.com/2020/02/how-we-improved-data-discovery-for-data-scientists-at-spotify/ 32

Case Study Spotify: Usage https://engineering.atspotify.com/2020/02/how-we-improved-data-discovery-for-data-scientists-at-spotify/ 33

Case Study Spotify: Expertise https://engineering.atspotify.com/2020/02/how-we-improved-data-discovery-for-data-scientists-at-spotify/ 34

AI to prioritise metadata 35

Article
Dataset Reuse: Toward Translating Principles to Practice
Laura Koesten,1,* Pavlos Vougiouklis,2 Elena Simperl,1 and Paul Groth3,4,*
1King’s College London, London WC2B 4BG, UK
2Huawei Technologies, Edinburgh EH9 3BF, UK
3University of Amsterdam, Amsterdam 1090 GH, the Netherlands
4Lead Contact
*Correspondence: [email protected] (L.K.), [email protected] (P.G.)
https://doi.org/10.1016/j.patter.2020.100136
Patterns 1, 100136, November 13, 2020. © 2020 The Author(s). Open Access.
SUMMARY
The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes a dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub’s engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset’s reusability. This demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse.
INTRODUCTION
There has been a gradual shift in the last years from viewing datasets as byproducts of (digital) work to critical assets, whose value increases the more they are used.1,2 However, our understanding of how this value emerges, and of the factors that demonstrably affect the reusability of a dataset is still limited.
Using a dataset beyond the context where it originated remains challenging for a variety of socio-technical reasons, which have been discussed in the literature;3,4 the bottom line is that simply making data available, even when complying with existing guidance and best practices, does not mean it can be easily used by others.5 At the same time, making data reusable to a diverse audience, in terms of domain, skill sets, and purposes, is an important way to realize its potential value (and recover some of the, sometimes considerable, resources invested in policy and infrastructure support). This is one of the reasons why scientific journals and research-funding organizations are increasingly calling for further data sharing6 or why industry bodies, such as the International Data Spaces Association (IDSA) (https://www.internationaldataspaces.org/) are investing in reference architectures to smooth data flows from one business to another. There is plenty of advice on how to make data easier to reuse, including technical standards, legal frameworks, and guidelines. Much work places focus on machine readability
THE BIGGER PICTURE
The web provides access to millions of datasets. These data can have additional impact when it is used beyond the context for which it was originally created. We have little empirical insight into what makes a dataset more reusable than others, and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub’s engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset’s reusability.
This work demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse.
Proof-of-Concept: Data science output has been formulated, implemented, and tested for one domain/problem
Lots of good advice for metadata
• Maybe a bit too much…
• 140 policies on fairsharing.org as of April 5, 2021
• We reviewed 40 papers
• Cataloged 39 different features of datasets that enable data reuse 36

Getting some data
• Used GitHub as a case study
• ~1.4 million datasets (e.g. CSV, Excel) from ~65K repos
• Use engagement metrics as proxies for data reuse
• Map literature features to both dataset and repository features
• Train a predictive model to see which features are good predictors 37

Dataset Features
• Missing values
• Size
• Columns + Rows
• Readme features
• Issue features
• Age
• Description
• Parsable 38
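Several of the file-level features named on this slide (missing values, size, columns and rows, parsability) can be computed directly from a data file. The sketch below is an illustration only and is not the feature extraction used in the study: the function name `dataset_features`, the feature keys, and the rule that a file counts as parsable only when every row matches the header width are assumptions. It uses Python's standard `csv` module so that it runs without extra dependencies.

```python
import csv
import io


def dataset_features(raw_text):
    """Compute simple file-level features for CSV text."""
    features = {
        "size_bytes": len(raw_text.encode("utf-8")),
        "parsable": False,
        "rows": 0,
        "columns": 0,
        "missing_ratio": None,
    }
    try:
        rows = list(csv.reader(io.StringIO(raw_text), strict=True))
    except csv.Error:
        return features
    if not rows:
        return features

    header, body = rows[0], rows[1:]
    # Assumption: a ragged file (row width differs from header) is not parsable.
    if any(len(row) != len(header) for row in body):
        return features

    cells = [cell for row in body for cell in row]
    missing = sum(1 for cell in cells if cell.strip() == "")
    features.update(
        parsable=True,
        rows=len(body),
        columns=len(header),
        missing_ratio=(missing / len(cells)) if cells else 0.0,
    )
    return features


sample = "a,b\n1,\n3,4\n"
print(dataset_features(sample))
```

Repository-level features such as README, issue, and age features would need repository metadata and are left out of this sketch.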

Where to start?
• Some ideas from this study if you’re publishing data with GitHub
  • provide an informative short textual summary of the dataset
  • provide a comprehensive README file in a structured form and links to further information
  • datasets should not exceed standard processable file sizes
  • datasets should be possible to open with a standard configuration of a common library (such as Pandas)
Trained a Recurrent Neural Network. There might be better models, but it is useful for handling text. Not the greatest predictor (good at classifying datasets that are not reused) but still useful for helping us tease out features 39
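The four publishing suggestions above can be turned into a simple pre-release checklist. This is a rough sketch under stated assumptions, not a rule set from the study: the `DatasetRelease` record, the 50-word limit for a "short" summary, the 100 MB size threshold, and the heading-plus-link test for a "structured" README are all hypothetical choices, since the slides give no numeric thresholds. The `opens_with_defaults` flag is supplied by the caller, for example after trying to load the file with a library's default settings.

```python
from dataclasses import dataclass

# Assumed size threshold; the slides do not state a number.
MAX_BYTES = 100 * 1024 * 1024


@dataclass
class DatasetRelease:
    description: str           # short textual summary of the dataset
    readme: str                # README contents
    size_bytes: int            # size of the data file
    opens_with_defaults: bool  # did a common library open it with defaults?


def publishing_checklist(release, max_bytes=MAX_BYTES):
    """Check a release against the four suggestions; returns name -> bool."""
    readme_lines = release.readme.splitlines()
    has_headings = any(line.startswith("#") for line in readme_lines)
    has_links = "http://" in release.readme or "https://" in release.readme
    return {
        "short_summary": 0 < len(release.description.split()) <= 50,
        "structured_readme_with_links": has_headings and has_links,
        "processable_size": release.size_bytes <= max_bytes,
        "opens_with_standard_config": release.opens_with_defaults,
    }


release = DatasetRelease(
    description="Daily air quality readings for Utrecht.",
    readme="# Air quality\nSee https://example.org/docs",
    size_bytes=2048,
    opens_with_defaults=True,
)
print(publishing_checklist(release))
```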

Model Context Protocol for data estate analysis 40

MCP & Data Governance 41 https://curatedanalytics.ai/governance-and-data-management-using-model-context-protocol-mcp/

AI as a Governance Advisor 42
Daly, E. M., Rooney, S., Tirupathi, S., Garces-Erice, L., Vejsbjerg, I., Bagehorn, F., Salwala, D., Giblin, C., Wolf-Bauwens, M. L., Giurgiu, I., Hind, M., & Urbanetz, P. (2025). Usage governance advisor: From intent to AI governance (arXiv:2412.01957). arXiv. https://doi.org/10.48550/arXiv.2412.01957

Building governance into the data lifecycle enables AI
• Build standards into your existing process and implement them as engineering solutions.
• Engineering enables AI
• AI improves governance
• Better governance means better data for AI systems
• People and processes are just as important as tools and infrastructure
https://www.microsoft.com/insidetrack/blog/driving-effective-data-governance-for-improved-quality-and-analytics/ 43

Conclusion
• Data Governance is even more important as:
  • data landscapes expand; and
  • AI requires high quality governance.
• Governance needs to be throughout the data lifecycle
• AI can help governance across the data lifecycle
• Better Governance 🔁 Better AI
Paul Groth | [email protected] | @pgroth | pgroth.com | indelab.org 44
