Metadata Day 2020: Metadata Use Cases Session

https://metadataday2020.splashthat.com/

The industry is tackling growing challenges in enabling productive data science in data-driven enterprises while maintaining data governance and compliance. Several popular open-source and commercial projects have emerged over the last few years, employing graph-based practices for managing and leveraging metadata. This event brings together these projects along with a larger community of metadata experts to help chart our common ground and work ahead.

This session dives into the metadata use cases that participants' organizations have and how their practices have evolved.

See the video here: https://youtu.be/LS2LxEsj-94?t=3566

Shirshanka Das

January 14, 2021

Transcript

  1. We talk about the data science notion that “80-90%” of time is spent on data discovery, assessment, and transformation:
    • What was the original intent for data collection, and what assumptions were made? “Why did we collect this?” Which protocol, what error margins, what biases, etc. (in addition to the standard who, what, when, where metadata for the data collection)
    • In the broader sense, this area of metadata concerns is about understanding the data well enough to know how and when to trust it enough to use it
    • “Data wrangling” creates a new data product, and that product should carry its lineage along
  2. The sciences have much better developed practices than pure business analytics
    • Not all data is intended or useful for further calculation (ML, etc.). E.g., in social science, as well as some other areas, there is quantitative vs. qualitative data such as surveys
    • Often the planning begins with documenting the data gathering process, e.g. survey methodology
    • Our understanding of where the “pipeline” begins needs to go further upstream
  3. • Research data stakeholders have experience with adoption of standards and downstream effects: benefits, problems, legacy concerns
    • Ex: FAIR metadata and data principles (Findability, Accessibility, Interoperability, Reuse) provide a useful high-level classification of metadata use cases and are more general than standards
  4. • Current “transaction cost” for metadata gathering and maintenance is too high.
      ◦ Computers should reduce this burden on users!
    • E.g. during data collection
      ◦ Photo metadata suddenly became effortless in the 2010s
    • E.g. as we traverse tools/systems/orgs
    • E.g. as we integrate multiple tools/systems/orgs
    How can HCI and AI help? Innovations needed here!
    • Identify the personas and incentives involved, and have starting points for how to handle each
  5. • We often hear that it is important to make metadata creation easier for the data creator.
    • We limit the downstream usefulness of the data by requiring only “minimum metadata”, i.e. making it easier for the creator.
    • There is always a tension and a balance between creators and users: easy for one is hard for the other, and vice versa.
  6. • The legacy of “enterprise data warehouse” traditions from the 1990s led us toward centralized control of fixed schemas: a “closed world”. E.g., at a bank there’s a data architect who decides what can or cannot go into a data store. Since “Big Data” we’ve moved to the other end of the spectrum, with unstructured data and not much accountability for lineage, usage, etc.
    • Borrowing from “Agile”: make fast iterations in the lifecycle, in other words lower the transaction costs
    • Can we extend the notion of an end-to-end lifecycle to earlier (i.e. ideation) stages, which are more human in nature, and develop feedback loops / process / iteration for those?
  7. • Traditional Data Quality (DQ) tools were used for monitoring enterprise data pipelines
    • Data engineering is placing more emphasis on CI/CD tools that bring integrity checks back into more agile dev environments
      ◦ e.g., dbt: integrity checks are coming into the tools themselves
  8. • We can leverage the flexibility of graph-based approaches
    • The “enterprise knowledge graph” (examples: Google, PayPal) provides ground truth or ground context, against which we can reconcile our queries and other usage of many other data stores. For example, having persistent identifiers (or “unique identifiers”) with other metadata attached is a start.
    • This also allows for multiple teams to be working concurrently (less centralized control)
  9. • Avoiding problems long-term requires a lot of conversations
    • “Create the tooling, to create the community”
    • Then iterate across the ecosystem
    • Plan for longevity and maintainability
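
Slides 7 and 8 can be read together as a small sketch: metadata records keyed by persistent identifiers, lineage stored as graph edges, and integrity checks that could run in a CI step. This is only an illustration under stated assumptions. The `DatasetRecord` fields, the `check_integrity` and `upstream` helpers, and the identifier format are all hypothetical and do not come from the talk or from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class DatasetRecord:
    """Metadata attached to one persistent identifier (hypothetical fields)."""
    pid: str                    # persistent identifier, e.g. "ds:raw"
    owner: Optional[str]        # who is accountable for the data
    derived_from: List[str] = field(default_factory=list)  # upstream pids


def check_integrity(records: List[DatasetRecord]) -> List[str]:
    """Return readable problems; an empty list means every check passed."""
    problems = []
    seen: Set[str] = set()
    for rec in records:
        # Identifiers must be unique to serve as a shared point of reference.
        if rec.pid in seen:
            problems.append(f"duplicate identifier: {rec.pid}")
        seen.add(rec.pid)
        # Every dataset needs an accountable owner.
        if not rec.owner:
            problems.append(f"missing owner: {rec.pid}")
    for rec in records:
        # Every lineage edge must point at a known identifier.
        for parent in rec.derived_from:
            if parent not in seen:
                problems.append(f"unresolved lineage: {rec.pid} -> {parent}")
    return problems


def upstream(records: List[DatasetRecord], pid: str) -> Set[str]:
    """Walk lineage edges and collect every ancestor identifier of pid."""
    by_pid = {r.pid: r for r in records}
    found: Set[str] = set()
    stack = [pid]
    while stack:
        current = by_pid.get(stack.pop())
        if current is None:
            continue  # identifier referenced but not registered
        for parent in current.derived_from:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found
```

A CI job could call `check_integrity` on the registered records and fail the build when the returned list is non-empty, while `upstream` answers the "where did this come from?" question for any derived data product.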