Metadata Day 2020: Metadata In Production Session

https://metadataday2020.splashthat.com/

The industry is tackling growing challenges in enabling productive data science in data-driven enterprises while maintaining data governance and compliance. Several popular open-source and commercial projects have emerged over the last few years, employing graph-based practices for managing and leveraging metadata. This event brings together these projects along with a larger community of metadata experts to help chart our common ground and work ahead.

This session dives into problems and solutions experienced by practitioners building and deploying metadata platforms at small and large companies: LinkedIn DataHub, Amundsen, Netflix Metacat, Uber Databook, etc.

See the video here: https://www.youtube.com/watch?v=LS2LxEsj-94

Shirshanka Das

January 14, 2021

Transcript

  1. None
  2. None
  3. None
  4. None
  5. 1 2 3 4

  6. • Search and discovery
     • Security
     • Compliance
     • Metrics exploration
     • Lineage
     • Ownership
     • Data Quality
     • Data pipelines observability
     • Identifying critical and important assets
     • Automatic data management
     • Schema management, schema registries
     • Operational metrics
     • Data movement
     • AI ecosystem (beyond datasets)
     • Business domain context
     • Building block to build other tools
     • Configuration for Data Lifecycle Management
     • API services - central to other Data Platforms in the ecosystem
  7. • Segmentation of Data Space
       ◦ Ingestion
       ◦ Transformation
       ◦ Orchestration
       ◦ Visualization
       ◦ ML
       ◦ many others
     • Multiple tools exist in each of these spaces due to innovation
     • Liability and an Opportunity
     • Would like to reduce the number of systems
       ◦ Would like to be agile: bring in new tools in the ecosystem
       ◦ Data stacks in different companies look very different
     • Building Future-Proof Capabilities on Data (through metadata) is the Opportunity
       ◦ e.g. Observability, Governance, Data Management, …
     • How
       ◦ Following slides! (Architectures + Standards)
  8. • Architecture is in service of use cases
       ◦ Productivity-focused catalog - architecture focuses on enabling a central team
       ◦ Compliance-focused catalog - less on viz experience and more focused on triggering of deletes, etc.
     • Extensibility
       ◦ First use case (e.g. search and discovery) was a pull-based monolith
       ◦ When GDPR showed up as a use case, it showed the importance of having others push to you
       ◦ Extensible beyond datasets to pipelines, dashboards, ML features, models and more
       ◦ Support 20+ data storage systems across the organization
       ◦ Contract - have a common vocabulary that is standardized and can be reused
     • Evolution
       ◦ Trigger-based operators on metadata, e.g. compliance
       ◦ Evolve from a monolith to a service API architecture, similar to a data architecture
       ◦ Flexible meta-model to accommodate APIs, graph node labels, Kafka topics
       ◦ Push and pull
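The push model and trigger-based operators listed on this slide can be illustrated with a minimal sketch. It is not taken from any of the projects discussed: `MetadataChange`, `MetadataStore`, the aspect names, and the `contains_personal_data` flag are all hypothetical names chosen for illustration. Source systems push changes to the store, and handlers registered for an aspect (here, a compliance handler that queues a delete) fire on every push.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class MetadataChange:
    """Hypothetical change event pushed by a source system."""
    entity_id: str  # illustrative identifier format, e.g. "dataset:hive/users"
    aspect: str     # e.g. "schema", "ownership", "compliance"
    value: dict


class MetadataStore:
    """Toy in-memory store: accepts pushes and fires registered triggers."""

    def __init__(self) -> None:
        self._aspects: Dict[str, Dict[str, dict]] = defaultdict(dict)
        self._triggers: Dict[str, List[Callable[[MetadataChange], None]]] = defaultdict(list)

    def on_change(self, aspect: str, handler: Callable[[MetadataChange], None]) -> None:
        """Register a trigger that runs whenever the given aspect is pushed."""
        self._triggers[aspect].append(handler)

    def push(self, change: MetadataChange) -> None:
        """Push path: the source system calls this; nothing is crawled."""
        self._aspects[change.entity_id][change.aspect] = change.value
        for handler in self._triggers[change.aspect]:
            handler(change)

    def get(self, entity_id: str, aspect: str) -> Optional[dict]:
        return self._aspects.get(entity_id, {}).get(aspect)


# Assumed compliance trigger: queue a delete when personal data is flagged.
deletion_requests: List[str] = []


def schedule_delete_if_personal_data(change: MetadataChange) -> None:
    if change.value.get("contains_personal_data"):
        deletion_requests.append(change.entity_id)


store = MetadataStore()
store.on_change("compliance", schedule_delete_if_personal_data)
store.push(MetadataChange("dataset:hive/users", "compliance",
                          {"contains_personal_data": True}))
store.push(MetadataChange("dataset:hive/clicks", "schema",
                          {"fields": ["ts", "url"]}))
```

In this sketch only the compliance push reaches the trigger; the schema push is stored but queues nothing. A pull-based design would instead have the store crawl each source on a schedule, which makes it harder to react to a change as it happens.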
  9. • Fragmentation of metadata systems
       ◦ Not “necessarily” a bad thing if it’s a product of innovation and not reinvention
       ◦ We can narrow down systems if we look at use cases, but we haven’t yet defined architectural principles for it
       ◦ Each tool has its own metadata system - how do you make it pluggable?
       ◦ Commoditization of different pieces
       ◦ Spark set us back: it showed up as a monolith instead of contributing to pieces
     • Aligning on standards
       ◦ Big open source world - not just for your company, but how do you influence the industry? e.g. Hive became the catalog
       ◦ Metacat was a core building block vs. others who were looking at search and discovery, so it was hard to partner with others
       ◦ OpenLineage - agree across different projects
       ◦ Can you reuse models that have been defined in ‘open metadata’?
       ◦ Making the spec extensible
  10. • Three major product areas
        ◦ productivity
        ◦ operations
        ◦ governance
          ▪ compliance for data protection regulations (e.g. GDPR)
          ▪ compliance for a domain (e.g. health/finance)
          ▪ Access Control within the company
      • Resourcing
        ◦ each metadata use case represents a separate, complex data product
        ◦ frontend, design, and researcher resources on data platforms are very limited
        ◦ complexities of design
          ▪ users span a wide range of data literacy levels
          ▪ “enterprise / data design” specialization different from “vanilla app design”
          ▪ coherence across products, unification of web app stacks
      • Impact
        ◦ Measuring the impact of the UI of a tool is very challenging
          ▪ Hard to get alignment on what a UI should enable
          ▪ Hard to measure time saved, changes in productivity
          ▪ Hard to attribute data insights back to a UI
          ▪ Users struggle to articulate exactly what a complex interface enables
      • Business metadata hard to tackle
        ◦ Definition of a “customer ID”
  11. None
  12. UX/Product/Frontend for metadata systems
      • Chris:
        ◦ frontend resources on data platforms are very limited
        ◦ design resources are also limited
        ◦ some have invested in a ‘data design’ specialization, different from vanilla ‘app frontend design’
        ◦ a scaffold that has 90% helps backend engineers contribute rather than start from scratch
        ◦ designing for a wide range of data literacy levels
        ◦ measuring the impact of the UI of the tool is extremely difficult
        ◦ eng and leadership tend to be backend based - hard to appreciate the impact of UI
        ◦ improvement in productivity and time saved are really hard to measure
      • Mark:
        ◦ 3 use cases: productivity, operations, governance (compliance for regulation, compliance for a domain e.g. health, brand protection)
        ◦ ‘business metadata’ - definition of customer for an organization
        ◦ different departments in the same org have different definitions
  13. Architecture - how important is it?
      • Shirshanka:
        ◦ First use case was search and discovery
        ◦ When GDPR showed up as a use case, it showed the importance of having others push to you
        ◦ Trigger-based operators on metadata, e.g. compliance
        ◦ Evolved from a monolith -> service API architecture. Schemas, streams.. Similar to a data architecture
        ◦ Challenges on the graph side
      • Mark:
        ◦ architecture is in service of use cases
        ◦ productivity-focused catalog - architecture focuses on enabling a central team
        ◦ compliance focus - less on viz experience and more focused on triggering of deletes, etc.
      • Julien:
        ◦ fragmentation of ecosystem - product of innovation
        ◦ big open source world - not just for your company, but how do you influence the industry? e.g. Hive became the catalog
        ◦ How do I invert the dependency?
      • Sunheng:
        ◦ more than 10 data storage systems - they constantly get deprecated and new ones come in
        ◦ contract - what metadata can be pushed into the system
        ◦ ingest metadata into the system
      • Deepak:
        ◦ Wanted to have an API that is omnipresent in the ecosystem
        ◦ Flexible metamodel to accommodate APIs, graph node labels, Kafka
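The "contract" and "flexible metamodel" points above could look roughly like the following sketch. The entity types, field names, and the `register_entity_type` and `validate` helpers are assumptions made for illustration, not the API of any project mentioned in the session. The contract states which fields a pushed record must carry, and new entity types can be registered without changing the validation code.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical contract: entity type -> fields every pushed record must carry.
CONTRACT: Dict[str, Set[str]] = {
    "dataset": {"name", "platform", "owner"},
    "kafka_topic": {"name", "cluster", "owner"},
    "api": {"name", "service", "owner"},
}


def register_entity_type(entity_type: str, required_fields: Iterable[str]) -> None:
    """Extend the meta-model with a new kind of entity (e.g. an ML model)."""
    CONTRACT[entity_type] = set(required_fields)


def validate(record: dict) -> List[str]:
    """Return a list of contract violations; an empty list means accepted."""
    entity_type = record.get("type")
    if entity_type not in CONTRACT:
        return [f"unknown entity type: {entity_type!r}"]
    missing = CONTRACT[entity_type] - set(record)
    return [f"missing field: {name}" for name in sorted(missing)]


# A storage system pushing a complete record passes the contract.
ok = validate({"type": "kafka_topic", "name": "clicks",
               "cluster": "c1", "owner": "team-a"})

# An incomplete record is rejected with the list of missing fields.
bad = validate({"type": "dataset", "name": "users"})

# When a new storage system or asset type appears, only the contract changes.
register_entity_type("ml_model", ["name", "owner"])
new_type = validate({"type": "ml_model", "name": "ranker", "owner": "team-b"})
```

Keeping the contract as data rather than code is one way to cope with storage systems being deprecated and replaced: retiring a system means dropping its entry, not rewriting the ingestion path.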
  14. Fragmentation of metadata - why? Will standards help?
      • Charles:
        ◦ Not a bad thing because it’s a product of innovation
        ◦ Metacat was a core building block vs. others who were looking at search and discovery, so it was hard to partner with others
        ◦ we can narrow down systems if we look at use cases, but we haven’t yet defined architectural principles for it
        ◦ each tool has its own metadata system - how do you make it pluggable?
        ◦ e.g. databases have their own. How do you get Snowflake to adopt a metadata system other than by having standards?
        ◦ commoditization of different pieces
        ◦ Spark set us back: it showed up as a monolith instead of contributing to pieces
        ◦ diminishing returns - few systems you can push standards into
      • Julien:
        ◦ OpenLineage - agree across different projects
        ◦ unique identifiers - connecting them together
        ◦ How to make the spec extensible?
        ◦ Okay to have some redundant metadata
      • Shirshanka:
        ◦ data design - decisions about denormalization
        ◦ can you reuse models that have been defined in ‘open metadata’?
        ◦ Are they attached to engines?
        ◦ Models without the implementation
        ◦ one generic schema across everything
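The points about unique identifiers, an extensible spec, and tolerating redundant metadata can be sketched as follows. This is an illustrative toy and not the OpenLineage specification: the record layout, the `core` and `extensions` keys, and the tool names are all assumptions. Records from different tools are joined on a shared `(namespace, name)` identifier, and each producer keeps its own fields under its own extension key so that tools never overwrite each other.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

DatasetKey = Tuple[str, str]


def merge_records(records: Iterable[dict]) -> Dict[DatasetKey, dict]:
    """Join records emitted by different tools on a shared identifier."""
    merged: Dict[DatasetKey, dict] = defaultdict(
        lambda: {"core": {}, "extensions": {}}
    )
    for rec in records:
        key = (rec["namespace"], rec["name"])  # the agreed unique identifier
        entry = merged[key]
        # Core fields follow the shared vocabulary; later records win on conflict.
        entry["core"].update(rec.get("core", {}))
        # Extension fields stay under the producer's own key.
        for producer, payload in rec.get("extensions", {}).items():
            entry["extensions"].setdefault(producer, {}).update(payload)
    return dict(merged)


# Two hypothetical tools describe the same dataset, with some redundancy.
from_scheduler = {
    "namespace": "warehouse", "name": "sales.orders",
    "core": {"owner": "team-a"},
    "extensions": {"scheduler": {"last_run": "2020-12-01"}},
}
from_catalog = {
    "namespace": "warehouse", "name": "sales.orders",
    "core": {"owner": "team-a", "description": "Order facts"},
    "extensions": {"catalog": {"tags": ["finance"]}},
}

result = merge_records([from_scheduler, from_catalog])
orders = result[("warehouse", "sales.orders")]
```

Here the redundant `owner` field appears in both records and does no harm, because the shared identifier is what ties the two views of the dataset together.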