Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

We talk about data science notions of “80–90%” of time being spent on data discovery, assessment, and transformation:

● What was the original intent for the data collection: the assumptions made, “Why did we collect this?”, which protocol, what error margins, what biases, etc. (in addition to the standard who?, what?, when?, where? metadata for the data collection)
● In the broader sense, this area of concern regarding metadata is about developing enough understanding of the data to know how and when to trust it enough to use it
● You create a new data product through the process of “data wrangling”, and that product should carry its lineage along
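
The lineage-carrying idea in the last bullet can be sketched in a few lines of Python. This is a minimal illustration, not a real tool; the `DataProduct` and `wrangle` names are invented:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DataProduct:
    """A dataset paired with the lineage records that produced it."""
    records: list
    lineage: list = field(default_factory=list)

def wrangle(source: DataProduct, step: str, transform) -> DataProduct:
    """Apply a transformation and append a lineage entry describing it."""
    return DataProduct(
        records=[transform(r) for r in source.records],
        lineage=source.lineage + [{
            "step": step,
            "at": datetime.now(timezone.utc).isoformat(),
        }],
    )

# The original collection intent travels with every derived product.
raw = DataProduct(records=[" 42 ", " 7 "],
                  lineage=[{"step": "collected: survey, protocol A"}])
clean = wrangle(raw, "strip whitespace", str.strip)
typed = wrangle(clean, "cast to int", int)
# typed.lineage now records the collection intent plus both wrangling steps
```

The point of the sketch is that wrangling returns a new product rather than mutating the old one, so lineage can only grow, never silently disappear.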

Slide 8

The sciences have much better-developed practices than pure business analytics:

● Not all data is intended or useful for further calculation (ML, etc.). E.g., in social science, as well as some other areas, there is quantitative vs. qualitative data, such as surveys
● Often the planning begins with documenting the data-gathering process, e.g., survey methodology
● Our understanding of where the “pipeline” begins needs to go further upstream
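
Documenting the gathering process up front can be as simple as a structured record kept alongside the data. The dataclass and its fields below are a hypothetical sketch, not any established metadata standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class SurveyMethodology:
    """Hypothetical record documenting how survey data was gathered."""
    instrument: str       # e.g., questionnaire name and version
    sampling_frame: str   # who could have been sampled
    mode: str             # "online", "phone", "in-person"
    response_rate: float  # fraction of contacts who responded
    quantitative: bool    # numeric scales vs. free-text answers

method = SurveyMethodology(
    instrument="wellbeing-survey-v2",
    sampling_frame="registered students, fall term",
    mode="online",
    response_rate=0.34,
    quantitative=False,
)

# Serialize the methodology with the data itself, so the "pipeline"
# effectively starts at the planning stage, not at ingestion.
record = asdict(method)
```

Writing this record before collection begins is exactly the upstream shift the last bullet argues for.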

Slide 9

● Research data stakeholders have experience with the adoption of standards and its downstream effects: benefits, problems, legacy concerns
● Ex: the FAIR metadata and data principles (Findability, Accessibility, Interoperability, Reusability) provide a useful high-level classification of metadata use cases and are more general than individual standards

Slide 10

● The current “transaction cost” for metadata gathering and maintenance is too high
  ○ Computers should reduce this burden on users!
● E.g., during data collection
  ○ Photo metadata suddenly became effortless in the 2010s
● E.g., as we traverse tools/systems/orgs
● E.g., as we integrate multiple tools/systems/orgs

How can HCI and AI help? Innovations are needed here!

● Identify the personas and incentives involved, and have starting points for how to handle each
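
One way to picture lowering the transaction cost at collection time, in the spirit of automatic photo metadata, is a wrapper that stamps who/when/where on every reading with no user effort. This is a rough sketch with invented names, not a real library:

```python
import functools
import os
import platform
from datetime import datetime, timezone

def auto_metadata(collect):
    """Wrap a data-collection function so basic who/when/where metadata
    is captured automatically, the way smartphone cameras stamp photos."""
    @functools.wraps(collect)
    def wrapper(*args, **kwargs):
        data = collect(*args, **kwargs)
        return {
            "data": data,
            "metadata": {
                "collected_by": os.getenv("USER", "unknown"),
                "collected_at": datetime.now(timezone.utc).isoformat(),
                "host": platform.node(),
                "tool": collect.__name__,
            },
        }
    return wrapper

@auto_metadata
def read_sensor():
    # Stand-in for a real instrument read.
    return [20.1, 20.4, 19.9]

sample = read_sensor()
# sample["metadata"] records who/when/where without any manual entry
```

The user's workflow is unchanged; the metadata rides along for free, which is the whole argument of the slide.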

Slide 11

● We often hear that it is important to make metadata creation easier for the data creator
● By requiring only “minimum metadata”, i.e., making it easier for the creator, we limit the downstream usefulness of the data
● There is always a tension and a balance between creators and users: easy for one is hard for the other, and vice versa

Slide 12

● The legacy of “enterprise data warehouse” traditions from the 1990s led us toward centralized control of fixed schemas: a “closed world”. E.g., at a bank there is a data architect who decides what can or cannot go into a data store. Since “Big Data”, we have moved to the other end of the spectrum: unstructured data, with not much accountability for lineage, usage, etc.
● Borrowing from “Agile”: make fast iterations in the lifecycle; in other words, lower the transaction costs
● Can we extend the notions of an end-to-end lifecycle to earlier (i.e., ideation) stages, which are more human in nature, and develop feedback loops / processes / iteration for those?

Slide 13

● Traditional Data Quality (DQ) tools were used for monitoring enterprise data pipelines
● Data engineering is placing more emphasis on CI/CD tools that bring integrity checks back into more agile dev environments
  ○ e.g., dbt => integrity checks coming into the tools
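
The dbt-style integrity checks mentioned above (such as `not_null` and `unique` tests) can be approximated, as a rough sketch rather than dbt's actual API, with plain assertions that any CI job can run:

```python
# Toy rows standing in for a materialized table.
rows = [
    {"order_id": 1, "customer": "ada"},
    {"order_id": 2, "customer": "grace"},
    {"order_id": 3, "customer": "ada"},
]

def check_not_null(rows, column):
    """Every row must have a value in the given column."""
    return all(r.get(column) is not None for r in rows)

def check_unique(rows, column):
    """No two rows may share a value in the given column."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

assert check_not_null(rows, "customer")
assert check_unique(rows, "order_id")
# A CI job failing on these assertions blocks the pipeline, just as a
# failing dbt test would.
```

Running such checks on every commit is what moves DQ from after-the-fact monitoring into the agile dev loop.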

Slide 14

● We can leverage the flexibility of graph-based approaches
● The “enterprise knowledge graph” (examples: Google, PayPal) provides ground truth, or ground context, against which we can reconcile our queries and other usage of many other data stores. For example, having persistent identifiers (or “unique identifiers”) with other metadata attached is a start.
● This also allows multiple teams to work concurrently (less centralized control)
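
A minimal sketch of the persistent-identifier idea, with an invented `pid:` scheme and toy records rather than any real enterprise graph:

```python
# Toy "knowledge graph" keyed by persistent identifiers (PIDs), with
# metadata attached to each node and lineage edges between them.
graph = {
    "pid:dataset/001": {
        "type": "dataset",
        "title": "customer survey results",
        "derived_from": ["pid:source/raw-07"],
    },
    "pid:source/raw-07": {
        "type": "source",
        "title": "raw survey exports",
        "derived_from": [],
    },
}

def ancestors(graph, pid):
    """Walk derived_from edges to reconcile a record with its sources."""
    seen, stack = set(), [pid]
    while stack:
        node = stack.pop()
        for parent in graph[node]["derived_from"]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

lineage = ancestors(graph, "pid:dataset/001")
# → {"pid:source/raw-07"}
```

Because every team mints PIDs into the same shared graph rather than a single fixed schema, teams can add nodes concurrently and still reconcile against common ground context.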

Slide 15

● Avoiding problems long-term requires a lot of conversations
● “Create the tooling to create the community”
● Then iterate across the ecosystem
● Plan for longevity and maintainability

Slide 16
