Metadata Day 2020: Metadata Use Cases Session

We talk about the data science notion that “80-90%” of time is spent on data discovery, assessment, and transformation:
• What was the original intent for data collection, and what assumptions were made? “Why did we collect this?” Which protocol, what error margins, what biases, etc. (in addition to the standard who? what? when? where? metadata for the data collection)
• In the broader sense, this area of metadata concerns is about understanding the data well enough to know how and when to trust it enough to use it
• You create a new data product through the process of “data wrangling”, and that product should carry its lineage along
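The last point, carrying lineage along through data wrangling, could be sketched roughly as follows. This is a minimal illustration and not a method from the session: the `DataProduct` class, the `wrangle` helper, and the survey data are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical container: a dataset plus the metadata needed to trust it."""
    name: str
    rows: List[dict]
    intent: str  # answers "Why did we collect this?"
    lineage: List[str] = field(default_factory=list)


def wrangle(source: DataProduct, name: str, step: str,
            transform: Callable[[List[dict]], List[dict]]) -> DataProduct:
    """Derive a new product; the inherited lineage gains one entry per step."""
    return DataProduct(
        name=name,
        rows=transform(source.rows),
        intent=source.intent,
        lineage=source.lineage + [f"{source.name} -> {name}: {step}"],
    )


# Made-up example data, for illustration only.
survey = DataProduct(
    name="raw_survey",
    rows=[{"age": 34, "score": 4}, {"age": None, "score": 5}],
    intent="customer satisfaction survey (hypothetical)",
)
cleaned = wrangle(
    survey, "clean_survey", "drop rows with missing age",
    lambda rows: [r for r in rows if r["age"] is not None],
)
print(cleaned.lineage)
```

Because each derived product copies and extends its parent's lineage, a downstream user can see how the data was produced without access to the original pipeline code.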

The sciences have much better-developed practices than pure business analytics
• Not all data is intended or useful for further calculation (ML, etc.). E.g., in social science, as well as some other areas, there is quantitative vs. qualitative data such as surveys
• Often the planning begins with documenting the data gathering process, e.g. survey methodology
• Our understanding of where the “pipeline” begins needs to go further upstream

• Research data stakeholders have experience with adoption of standards and downstream effects: benefits, problems, legacy concerns
• Ex: the FAIR metadata and data principles (Findability, Accessibility, Interoperability, Reusability) provide a useful high-level classification of metadata use cases and are more general than standards

• Current “transaction cost” for metadata gathering and maintenance is too high
◦ Computers should reduce this burden on users!
• E.g. during data collection
◦ Photo metadata suddenly became effortless in the 2010s
• E.g. as we traverse tools/systems/orgs
• E.g. as we integrate multiple tools/systems/orgs
How can HCI and AI help? Innovations needed here!
• Identify the personas and incentives involved, and have starting points for how to handle each

• We often hear that it is important to make metadata creation easier for the data creator
• We limit the downstream usefulness of the data by requiring only “minimum metadata”, i.e. making it easier for the creator
• There is always a tension and a balance between creators and users: easy for one is hard for the other, and vice versa

• The legacy of “enterprise data warehouse” traditions from the 1990s led us toward centralized control of fixed schemas: a “closed world”. E.g., at a bank there’s a data architect who decides what can or cannot go into a data store. Since “Big Data” we’ve moved to the other end of the spectrum, with unstructured data and not much accountability for lineage, usage, etc.
• Borrowing from “Agile”: make fast iterations in the lifecycle, in other words lower the transaction costs
• Can we extend the notion of an end-to-end lifecycle to earlier (i.e. ideation) stages, which are more human in nature, and develop feedback loops / process / iteration for those?

• Traditional Data Quality (DQ) tools were used for monitoring enterprise data pipelines
• Data engineering is putting more emphasis on CI/CD tools that bring integrity checks back into more agile dev environments
◦ e.g., dbt => integrity coming into the tools
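To make “integrity checks in CI/CD” concrete, here is a small Python sketch of two checks in the spirit of the generic `not_null` and `unique` tests that dbt offers. It is not dbt's API: the function names, the `account_id` column, and the sample rows are assumptions chosen for illustration.

```python
from typing import List


def check_not_null(rows: List[dict], column: str) -> List[str]:
    """Report every row where the column is missing or None."""
    return [
        f"{column}: null value in row {i}"
        for i, row in enumerate(rows)
        if row.get(column) is None
    ]


def check_unique(rows: List[dict], column: str) -> List[str]:
    """Report every row whose column value was already seen earlier."""
    seen = set()
    failures = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            failures.append(f"{column}: duplicate value {value!r} in row {i}")
        seen.add(value)
    return failures


def run_checks(rows: List[dict]) -> List[str]:
    """Run all checks for a hypothetical 'accounts' table."""
    return check_not_null(rows, "account_id") + check_unique(rows, "account_id")


# Made-up sample data containing one duplicate and one null.
accounts = [{"account_id": "a1"}, {"account_id": "a1"}, {"account_id": None}]
failures = run_checks(accounts)
for message in failures:
    print(message)
```

In a CI job, a non-empty `failures` list would typically fail the build, so that broken data is caught at the same point as broken code.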

• We can leverage the flexibility of graph-based approaches
• The “enterprise knowledge graph” (examples: Google, PayPal) provides ground truth or ground context, against which we can reconcile our queries and other usage of many other data stores. For example, having persistent identifiers (or “unique identifiers”) with other metadata attached is a start.
• This also allows multiple teams to work concurrently (less centralized control)
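As a rough illustration of persistent identifiers with metadata attached, the sketch below keeps a tiny in-memory graph and reconciles store-specific names against it. The `MetadataGraph` class, the `pid:` identifiers, and the store and table names are all hypothetical; a real enterprise knowledge graph would sit on a dedicated graph store.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class MetadataGraph:
    """Hypothetical in-memory graph keyed by persistent identifiers."""

    def __init__(self) -> None:
        self.nodes: Dict[str, dict] = {}
        self.edges: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def add_entity(self, pid: str, **attrs) -> None:
        """Attach (or extend) metadata for a persistent identifier."""
        self.nodes.setdefault(pid, {}).update(attrs)

    def link(self, source: str, relation: str, target: str) -> None:
        """Add a typed edge between two known identifiers."""
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be registered first")
        self.edges[source].add((relation, target))

    def reconcile(self, store: str, local_name: str) -> List[str]:
        """Find the identifiers that a store-specific name refers to."""
        return sorted(
            pid for pid, attrs in self.nodes.items()
            if attrs.get("aliases", {}).get(store) == local_name
        )


graph = MetadataGraph()
# Two teams register different local names for the same concept.
graph.add_entity("pid:customer", label="Customer",
                 aliases={"warehouse": "dim_customer", "crm": "contacts"})
graph.add_entity("pid:order", label="Order",
                 aliases={"warehouse": "fact_orders"})
graph.link("pid:order", "placed_by", "pid:customer")

print(graph.reconcile("crm", "contacts"))
print(graph.reconcile("warehouse", "dim_customer"))
```

Both lookups resolve to the same identifier, which is the sense in which the graph gives a shared ground context while each team keeps its own naming.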

• Avoiding problems long-term requires a lot of conversations
• “Create the tooling, to create the community”
• Then iterate across the ecosystem
• Plan for longevity and maintainability


Shirshanka Das
