Developing a Data Catalog to Promote Data Usage on the Data Platform

Naoto Udagawa (LINE / Data Platform Department / Senior Product Manager)

https://tech-verse.me/ja/sessions/37
https://tech-verse.me/en/sessions/37
https://tech-verse.me/ko/sessions/37

Tech-Verse2022

November 17, 2022

Transcript

  1. › Introduction of our data catalog › Motivation and benefit

    for in-house data catalog › How we develop the data catalog Overview
  2. › People who are interested in › Operating data catalog

    › Developing data catalog Target audience
  3. Agenda › LINE’s data platform › Problem to introduce data

    catalog › Solution › Summary › Future work
  4. Agenda › LINE’s data platform › Problem to introduce data

    catalog › Solution › Summary › Future work
  5. Democratize data for businesses +200 Services including internal services Data

    platform Governance Data science Machine learning Planning LINE’s self-serve data platform
  6. Scale of data use 400 PB HDFS capacity 40,000 Hive

    Tables 150,000 Jobs/Day LINE’s data platform
  7. Our users Always data driven Data scientist Backend engineer Job

    Tasks Tools Planner Data analysis OASIS (BI tool) Service monitoring ETL Kibana, OASIS (BI tool) KPI analysis Tableau , OASIS (BI tool) LINE’s data platform
  8. Need to know before data activities When is it updated?

    What's the difference between uid and cid? Which table is important? Similar table name… Which is appropriate…? What are the data dependencies? Who is the data owner? What is the effect on others? Find Understand Get Permission Plan Run search query
  9. LINE’s data catalog › Mission › Data democratization for businesses

    › Values › One-stop solution for data activities › Core features › Search data › Access control › Metadata management › Exploratory data analysis
  10. Agenda › LINE’s data platform › Problem to introduce data

    catalog › Solution › Summary › Future work
  11. Problem to introduce data catalog 1. Aggregating metadata 2. Displaying

    metadata Data source BI tools Query engine Other services ・・・ DB 1 2 Users Data catalog’s data source
  12. Problem to introduce data catalog 1. Aggregating metadata 2. Displaying

    metadata Data source BI tools Query engine Other services ・・・ DB Users 1 2 Data catalog’s data source
  13. Aggregating metadata issues 1. Various data sources 2. Overloading to

    data platform 3. Noise in metadata 1 2 3 Data catalog’s data source
  14. Difficult to aggregate from various sources › Need to provide

    a variety of metadata for users’ better understanding of data › Had to aggregate data from 10+ services including internal tools on data platform Data source BI tools Hive DWH DB Query engine Metadata Aggregating metadata issue Data catalog’s data source
  15. Overloading to our data platform › A typical data catalog

    runs a query on every table to get metadata › A table can have a tremendous number of partitions in our data platform Hive DWH DB Metadata Aggregating metadata issue Data catalog’s data source
  16. Get up-to-date data without overloading › A typical data catalog

    runs a query on every table to get metadata › Need to provide as up-to-date metadata as possible › The same data is used by users in various departments at the same time Data Users yyy dept. Users xxx dept. Queries Queries Aggregating metadata issue
  17. Noise in metadata › 300 to 500 DBs/tables change within a

    few minutes › It’s important not to aggregate metadata that doesn’t lead to users’ action Hive DWH DB Metadata Data catalog Aggregating metadata issue
  18. Noise in metadata › Metadata generated by user error ›

    e.g. a table is created and then deleted immediately because it was created by mistake Aggregating metadata issue Data catalog Create/Alter/Drop tables
  19. Noise in metadata › Some metadata does not lead to

    users’ action Changing only timestamp is of no value Aggregating metadata issue
  20. Summarize issues 1 2 3 1. Various data sources 2.

    Overloading to data platform 3. Noise in metadata Aggregating metadata issue Data catalog’s data source
  21. What are the solutions? 1. Various data source 2. Overloading

    to data platform 3. Noise in metadata Aggregating metadata issue Solutions
  22. Solutions for aggregating metadata Connect and aggregate effectively Various data

    sources Push-based ingestion Overloading to data platform Noise in metadata Aggregating metadata issue Aggregated differences by 30min Filtering meaningless data for users
  23. Solutions for aggregating metadata Connect and aggregate effectively Various data

    sources Push-based Integration Overloading to data platform Noise in metadata Aggregating metadata issue Aggregated differences by 30min Filtering meaningless data for users
  24. Connect and aggregate effectively Data source BI tools Hive DWH

    DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow Issue: Various data sources
  25. Connect and aggregate effectively Data source BI tools Hive DWH

    DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow Issue: Various data sources
  26. Connect and aggregate effectively Data source BI tools Hive DWH

    DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow Issue: Various data sources
  27. Connect and aggregate effectively Data source BI tools Hive DWH

    DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow Issue: Various data sources
  28. Connect and aggregate effectively Data source BI tools Hive DWH

    DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow Issue: Various data sources
  29. Deep dive the aggregation Data source BI tools Hive DWH

    DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow Issue: Various data sources
  30. Aggregate from BI tools Data source BI tools Hive DWH

    DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow › Feature introduction › Displaying data › How to aggregate data Issue: Various data sources
  31. Report catalog › Data sources › Tableau, OASIS (BI tool)

    › Data governance › Monitor data usage › Prioritize data for data management › Data enablement › Reference for how to use data › Analyze which reports are affected by a data change Issue: Various data sources
  32. Displaying data › Report info › Name › URL ›

    Description › PV › Timestamp › Author name › Users Issue: Various data sources
  33. Aggregate from BI tools › Aggregate reports from BI tools

    with a batch job, parse the SQL in each report, and link it to the tables it references Data source Get query in reports Data Catalog Extract tables from reports Issue: Various data sources
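As a rough illustration of the "parse SQL in the report and connect it with a table" step, here is a minimal Python sketch. It assumes report queries are already available as plain SQL strings; the function names `extract_tables` and `link_reports_to_tables` and the regex approach are hypothetical and are not taken from the talk. A real batch job would likely need a proper SQL parser to cope with CTEs, subqueries, and quoting.

```python
import re
from typing import Dict, List, Set

# Hypothetical, simplified matcher: an identifier (optionally db.table) after FROM or JOIN.
TABLE_REF = re.compile(
    r"\b(?:from|join)\s+([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)?)", re.IGNORECASE
)


def extract_tables(sql: str) -> Set[str]:
    """Return the lower-cased table names that follow FROM/JOIN in one query."""
    return {m.group(1).lower() for m in TABLE_REF.finditer(sql)}


def link_reports_to_tables(reports: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each table name to the set of reports whose queries reference it."""
    table_to_reports: Dict[str, Set[str]] = {}
    for report_name, queries in reports.items():
        for sql in queries:
            for table in extract_tables(sql):
                table_to_reports.setdefault(table, set()).add(report_name)
    return table_to_reports
```

With a mapping like this, a table page could list the reports built on it, and a planned table change could be checked against the reports it would affect.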
  34. Aggregate from Apache Atlas Data source BI tools Hive DWH

    DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow › Feature introduction › Displaying data › How to aggregate data Issue: Various data sources
  35. Data lineage › Data source › Atlas (Hive, Spark) ›

    Data governance › Audit data generation process › Data enablement › Understand data dependencies › Impact analysis for data changes › Make data debugging easier Issue: Various data sources
  36. Displaying data › Table description, timestamps › Table relationships ›

    PII › Data owner › Organization that uses the table › Reports generated from the data › User lists › Links: Wiki, GitHub, Airflow, … Issue: Various data sources
  37. Apache Atlas › OSS data catalog › Stores metadata generated

    by data modification by Apache Hive/Spark › Reasons for selection › Collaboration with Ranger › Cloudera support › More stable than other approaches Issue: Various data sources
  38. Data lineage for Hive Atlas hook Hive Kafka Atlas Data

    catalog Metadata entity Atlas hook › The Atlas hook in Hive saves metadata to Atlas via Kafka Create/Alter /Drop table Issue: Various data sources REST API
  39. Data lineage for Spark Issue: Various data sources › Atlas

    doesn’t support Spark, so we introduced the Atlas Spark connector Atlas hook Spark Kafka Atlas Data catalog Metadata entity Atlas Spark hook Atlas hook Hive Warehouse Metadata entity Atlas hook Create/Alter /Drop table
  40. Aggregate other metadata with lineage Atlas hook Hive/Spark Kafka Atlas

    Data catalog Metadata entity Atlas hook Other data sources Issue: Various data sources › Aggregate from other data sources to display data lineage
  41. Connect and aggregate effectively Issue: Various data sources Data source

    BI tools Hive DWH DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow
  42. Solutions for aggregating metadata Connect and aggregate effectively Various data

    sources Push-based Integration Overloading to data platform Noise in metadata Aggregating metadata issue Aggregated differences by 30min Filtering meaningless data for users
  43. Apache Atlas › OSS data catalog › Stores metadata generated

    by data modification by Apache Hive/Spark › Features › Lineage › Notification Issue: Overloading to data platform
  44. Atlas notification › Atlas sends notifications about metadata changes to

    a Kafka topic named ATLAS_ENTITIES Issue: Overloading to data platform Atlas entities Atlas hook Atlas hook Hive/Spark Kafka Atlas Notification hook consumer Notification
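To make the push-based flow more concrete, the sketch below shows how a consumer might pull the operation and table name out of one notification record. The JSON field names (`message`, `operationType`, `entity`, `typeName`, `attributes.qualifiedName`) are an assumed, simplified message shape for illustration and should be checked against the Atlas version in use; the Kafka consumer loop itself is omitted, and `parse_notification` is a hypothetical helper.

```python
import json
from typing import Optional, Tuple


def parse_notification(raw: bytes) -> Optional[Tuple[str, str]]:
    """Return (operation, qualified_name) for table notifications, else None.

    The field names used here are an assumed, simplified message shape.
    """
    payload = json.loads(raw)
    message = payload.get("message", {})
    entity = message.get("entity", {})
    # Only table-level entities are of interest in this sketch.
    if entity.get("typeName") != "hive_table":
        return None
    name = entity.get("attributes", {}).get("qualifiedName")
    if not name:
        return None
    return (message.get("operationType"), name)
```

Because the catalog only reacts to records that arrive on the topic, it does not need to poll the data platform for changes.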
  45. Atlas notification › Reasons for selection › Need to provide

    as up-to-date metadata as possible › Avoid sending unnecessary requests to our data platform › We will make more use of Atlas for metadata change detection › Other candidates › Hive Metastore Event Listener › Need to set up Hive Metastore Event Listener › Need to filter events because they are almost raw data and cover too many objects › Atlas REST API › Selected the streaming approach to meet our use case and reduce load on Atlas › We use Atlas notification for updating our search feature’s index Issue: Overloading to data platform
  46. Solutions for aggregating metadata Connect and Aggregate effectively Various data

    sources Push-based ingestion Overloading to data platform Noise in metadata Aggregating metadata issue Aggregated differences by 30min Filtering meaningless data for users
  47. Aggregated differences by 30min › Introduced 30min window to Kafka

    and aggregated consumed data Issue: Noise in metadata Atlas hook Hive/Spark Kafka Atlas Notification hook consumer Notification
  48. Aggregated differences by 30min › Treated multiple changes on a

    table within 30 minutes as 1 change › Not only aggregated, but also filtered out duplicated changes and certain other changes Issue: Noise in metadata Create/Update/Delete a Table Kafka Atlas Notification Hook Consumer Aggregate and filter out Notification
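The idea of treating multiple changes to a table within 30 minutes as one change can be sketched as follows. This assumes tumbling (fixed, non-overlapping) windows and events given as `(timestamp_seconds, table, operation)` tuples; both are assumptions for illustration, since the slides do not show the window type or record layout.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 30 * 60  # the 30-minute window mentioned on the slide

Event = Tuple[int, str, str]  # (timestamp in seconds, table name, operation)


def aggregate_by_window(events: Iterable[Event]) -> Dict[Tuple[int, str], List[str]]:
    """Group events per table into tumbling 30-minute windows.

    Each (window_start, table) key stands for one logical change;
    consecutive duplicate operations inside a window are collapsed.
    """
    windows: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for ts, table, op in sorted(events):
        window_start = ts - ts % WINDOW_SECONDS
        ops = windows[(window_start, table)]
        if not ops or ops[-1] != op:
            ops.append(op)
    return dict(windows)
```

Downstream logic then handles one entry per table and window instead of every individual notification.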
  49. Added logic to ignore › If a table is created

    and deleted in a very short time, it’s considered as a user error User behavior / Handling in our logic: Create a table → Delete within … minutes / Ignore; Create a table → … minutes elapsed → Delete / Create → Delete Issue: Noise in metadata
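The ignore rule above can be sketched like this. The threshold value is hypothetical because the number on the slide did not survive extraction, and the event layout `(timestamp_seconds, table, operation)` and the function name are assumptions for illustration. Events for a given table are expected in time order.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical threshold; the actual value on the slide is not recoverable.
SHORT_LIVED_SECONDS = 5 * 60

Event = Tuple[int, str, str]  # (timestamp in seconds, table name, operation)


def drop_short_lived_tables(
    events: List[Event], threshold: int = SHORT_LIVED_SECONDS
) -> List[Event]:
    """Drop every event of a table that is created and dropped within `threshold` seconds."""
    pending: Dict[str, List[int]] = {}  # table -> indices of events seen since its CREATE
    ignored: Set[int] = set()
    for i, (ts, table, op) in enumerate(events):
        if op == "CREATE":
            pending[table] = [i]
        elif table in pending:
            pending[table].append(i)
            if op == "DROP":
                indices = pending.pop(table)
                created_ts = events[indices[0]][0]
                if ts - created_ts <= threshold:
                    ignored.update(indices)  # treat the whole lifecycle as a user error
    return [e for i, e in enumerate(events) if i not in ignored]
```

A table that lives longer than the threshold keeps both its create and delete events, matching the second row of the rule.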
  50. Inspect Atlas notification record and filtered › Atlas notification record

    contains the following information about the changed table Issue: Noise in metadata
  51. Code introduction in the aggregation › L41~43: Filtered out, L45:

    aggregate, L46: set up 30min window Issue: Noise in metadata
  52. Filtering meaningless data › Inspect the notification and try to

    ignore even more meaningless data Issue: Noise in metadata Atlas hook Hive/Spark Kafka Atlas Notification hook consumer Notification
  53. Inspect Atlas notification record more › The notification didn’t even

    include the specific DDL changes › We needed to find another solution Issue: Noise in metadata
  54. Compared Hive DWH and our database Kafka consumer Atlas Notification

    hook consumer › Compared the up-to-date data in Hive DWH and our data catalog’s database › Identified the differences caused by each data change Compare the DDL Notification Hive DWH Our data catalog’s database Issue: Noise in metadata
  55. Ignore the unnecessary change patterns › Analyzed meaningless DDL changes

    and identified their patterns › Ignored 95% of DDL changes that don’t lead to user actions Issue: Noise in metadata Change pattern / % (values lost in extraction): “transient_lastDdlTime” changed; hdfs location changed; column changed; else
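Comparing the latest Hive metadata with the catalog's stored copy and skipping noise-only changes might look like the sketch below. Table metadata is modeled as a flat dictionary, which is a simplification, and treating `transient_lastDdlTime` as the only ignorable key is an assumption for this example; the slides name it as a change pattern but do not list the exact ignore rules.

```python
from typing import Any, Dict, Set

# Keys whose change alone is treated as noise (assumption for this sketch).
IGNORED_KEYS = {"transient_lastDdlTime"}


def changed_keys(current: Dict[str, Any], stored: Dict[str, Any]) -> Set[str]:
    """Return metadata keys whose values differ between Hive and the catalog's copy."""
    return {k for k in current.keys() | stored.keys() if current.get(k) != stored.get(k)}


def is_meaningful_change(current: Dict[str, Any], stored: Dict[str, Any]) -> bool:
    """A change is meaningful only if something other than the ignored keys differs."""
    return bool(changed_keys(current, stored) - IGNORED_KEYS)
```

Only changes for which `is_meaningful_change` returns true would be written to the catalog and surfaced to users.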
  56. Summarize solutions Connect and aggregate effectively Various data sources Push-based

    Integration Overloading to data platform Aggregated differences by 30min Filtering meaningless data for users Noise in metadata Aggregating metadata issue
  57. What are the results? Connect and aggregate effectively Various data

    sources Push-based Integration Overloading to data platform Noise in metadata Solutions for aggregating metadata Results Aggregated differences by 30min Filtering meaningless data for users
  58. Results › Aggregated › the up-to-date metadata › from various

    data sources › without overloading to our data platform Solutions for aggregating metadata Data catalog’s data source
  59. Results › Aggregated › while excluding data that is meaningless for users’

    actions Atlas hook Hive/Spark Kafka Atlas Notification hook consumer Removed 95% noise Removed 90% noise Solutions for aggregating metadata Notification
  60. Use case Find Understand Plan Run Notify search query ›

    An analytics engineer changes a table’s DDL for ETL and notifies stakeholders announce Solutions for aggregating metadata
  61. Get familiar with and find data Plan Run Notify search query

    announce Ranked by Search Solutions for aggregating metadata Find Understand
  62. Understand and check the target table Plan Run Notify search

    query announce Table overview Data sources Solutions for aggregating metadata Find Understand
  63. Confirm the dependencies Plan Run Notify search query announce Data

    lineage & report catalog Data sources Solutions for aggregating metadata Find Understand
  64. Notify users of the data change Plan Run Notify search

    query announce Tools Notify Solutions for aggregating metadata Find Understand
  65. User feedback › Analytics engineer › Easier to find the

    root cause for data debugging › Able to analyze the effect of a data change immediately › Data governance › More accurate and efficient monitoring of data usage › Planner › Easier to imagine how to use data Solutions for aggregating metadata
  66. Resolved aggregating metadata issues 1. Aggregating metadata 2. Displaying metadata

    Data source BI tools Query engine Other services ・・・ DB Users 1 2 Data catalog’s data source
  67. Problem to introduce data catalog 1. Aggregating metadata 2. Displaying

    metadata Data Source BI tools Query engine Other services ・・・ DB 1 2 Users Data catalog’s data source
  68. Displaying metadata issues › Display data lineage issues with Atlas

    1. Atlas API performance issue 2. Usability issue 1 Users 2 Data catalog Data source
  69. Displaying metadata issues 1 Users 2 Data catalog Data source

    › Display data lineage issues with Atlas 1. Atlas API performance issue 2. Usability issue
  70. Atlas API performance issue › A call to the Atlas API took about

    30 minutes 1. Lineage with many nodes 2. Nodes connected to the table increased each time a job was executed FYI. Node and depth Introduction Node Node Node Node Node Depth1 Depth2 Atlas hook Hive/Spark Atlas 1 2 Displaying metadata issue
  71. Many nodes issue › A table’s upstream consisted of 250 tables

    › Atlas hook registered lots of unnecessary tables to Atlas › e.g. tmp_profile_vdo.table Depth / Number of nodes (values lost in extraction) e.g. Complexity of a lineage Atlas hook Hive/Spark Atlas temporary table, test data e.g. tmp_profile_vdo Atlas API performance issue
  72. Nodes are increased each time a job runs › We

    also aggregate Hive/Spark process as a node › Nodes related to the temporary table generated for each hourly/daily job Process Node Generated for each hourly/daily job Atlas API performance issue
  73. Displaying metadata issues › Display data lineage issues with Atlas

    1. Atlas API performance issue 2. Usability issue 1 Users 2 Data catalog Data source
  74. › Not flexible to change the lineage graph › Display

    the entire lineage when first accessed › Tables and columns are displayed mixed together Displaying metadata issue Usability issue
  75. › Not flexible to change the lineage graph › Display

    the entire lineage when first accessed › Tables and columns are displayed mixed together Displaying metadata issue Usability issue
  76. Displaying metadata issues › Display data lineage issues with Atlas

    1. Atlas API performance issue 1. Table with many nodes 2. Nodes are increased each time a job runs 2. Usability issue 1 Users 2 Data catalog Data source
  77. What are the solutions? Displaying metadata issue Solutions › Display

    data lineage issues with Atlas 1. Atlas API performance issue 1. Table with many nodes 2. Nodes are increased each time a job runs 2. Usability issue
  78. Solutions for displaying metadata Reduce the number of nodes registered

    to Atlas Table with many nodes Deleted unnecessary processes Usability issue Atlas hook Hive/Spark Kafka Atlas Provide interactive UX that satisfies various use cases Displaying metadata issue Nodes increased each time a job was executed
  79. Solutions for displaying metadata Reduce the number of nodes registered

    to Atlas Table with many nodes Atlas hook Hive/Spark Kafka Atlas Displaying metadata issue Deleted unnecessary processes Nodes increased each time a job was executed Usability issue Provide interactive UX that satisfies various use cases
  80. Reduce the number of nodes registered › Atlas supports the

    config “atlas.hook.hive.hive_table.ignore.pattern” › Filtered out specific DBs/tables with regex patterns such as the following Atlas hook Hive/Spark Kafka Pattern / Type / Note: (regex matching temp|tmp names, garbled in extraction) / Table / Temporary; (regex garbled in extraction) / DB / Employee’s DB; … / … / … Filter out Issue: Table with many nodes
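The effect of such ignore patterns can be previewed offline before they go into the hook configuration. The sketch below assumes patterns are matched against a `db.table` style name; the two regexes and the sample names are hypothetical stand-ins for the categories on the slide (temporary tables, per-employee databases), not the patterns LINE actually uses.

```python
import re

# Hypothetical patterns in the spirit of the slide; NOT the production patterns.
IGNORE_PATTERNS = [
    re.compile(r"^[^.]+\.(?:tmp|temp)_.*$", re.IGNORECASE),  # db.tmp_xxx / db.temp_xxx
    re.compile(r"^user_[a-z0-9]+\..*$"),                      # a per-employee database
]


def should_ignore(qualified_name: str) -> bool:
    """Return True if a db.table name matches any ignore pattern."""
    return any(p.match(qualified_name) for p in IGNORE_PATTERNS)
```

Running a list of existing table names through `should_ignore` gives a quick estimate of how many lineage nodes a pattern would remove, and helps catch patterns that are too broad.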
  81. Solutions for displaying metadata Reduce the number of nodes registered

    to Atlas Table with many nodes Atlas hook Hive/Spark Kafka Atlas Displaying metadata issue Deleted unnecessary processes Usability issue Provide interactive UX that satisfies various use cases Nodes increased each time a job was executed
  82. Delete unnecessary processes › Couldn’t filter out process nodes ›

    Need to delete some unnecessary process entities directly Atlas Batch Job Rule: Have no inputs or outputs; Own the same inputs and outputs; … Delete Issue: Nodes increased each time a job was executed
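The two deletion rules that survive on the slide can be expressed as a small selection step for the batch job. Process entities are modeled here as plain dictionaries with `guid`, `inputs`, and `outputs` keys, which is a simplification chosen for illustration; reading "have no inputs or outputs" as "inputs or outputs are empty" is also an interpretation. The actual deletion call against Atlas is left out.

```python
from typing import Dict, List


def is_unnecessary(process: Dict[str, object]) -> bool:
    """Apply the two rules from the slide to a simplified process entity."""
    inputs = set(process.get("inputs", []))
    outputs = set(process.get("outputs", []))
    missing_io = not inputs or not outputs  # rule 1: no inputs or no outputs
    same_io = inputs == outputs             # rule 2: identical inputs and outputs
    return missing_io or same_io


def select_for_deletion(processes: List[Dict[str, object]]) -> List[str]:
    """Return the GUIDs of process entities the batch job would delete."""
    return [str(p["guid"]) for p in processes if is_unnecessary(p)]
```

Keeping the selection separate from the deletion makes it possible to log or review the candidate list before anything is removed.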
  83. Solutions for displaying metadata Reduce the number of nodes registered

    to Atlas Table with many nodes Atlas hook Hive/Spark Kafka Atlas Displaying metadata issue Deleted unnecessary processes Usability issue Provide interactive UX that satisfies various use cases Nodes increased each time a job was executed
  84. Provide interactive UX › Return 3 levels of depth by default ›

    Provide interactive UX › Dig into column-level lineage › Expand the displayed lineage graph by clicking a node › Drag&Drop › Zoom in/out Issue: Usability issue
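Returning only a few levels by default and expanding on click amounts to a depth-limited traversal of the lineage graph. The sketch below uses a breadth-first search over an adjacency-list dictionary; the graph representation and the function name are assumptions for illustration, not the catalog's actual data model.

```python
from collections import deque
from typing import Dict, List


def lineage_within_depth(
    graph: Dict[str, List[str]], start: str, max_depth: int = 3
) -> Dict[str, int]:
    """Breadth-first traversal returning each reachable node and its depth, up to max_depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(node, []):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths
```

Clicking a node in the UI would then correspond to calling the same function again with that node as `start` and a small `max_depth`, and merging the result into the displayed graph.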
  85. Solutions for displaying metadata Reduce the number of nodes registered

    to Atlas Table with many nodes Atlas hook Hive/Spark Kafka Atlas Displaying metadata issue Deleted unnecessary processes Usability issue Provide interactive UX that satisfies various use cases Nodes increased each time a job was executed
  86. What are the Results? Solutions for displaying metadata Results Reduce

    the number of nodes registered to Atlas Table with many nodes Deleted unnecessary processes Usability issue Provide interactive UX that satisfies various use cases Nodes increased each time a job was executed
  87. Results › The Atlas API typically responds within 6 seconds, 15

    seconds at most › The number of tables registered in Atlas has been reduced by 90% › 300,000 nodes deleted daily with the batch job › Supports displaying end-to-end lineage and deep-diving into column-level impact analysis Solutions for displaying metadata
  88. Confirm column-level dependencies Plan Run Notify search query announce

    Column level lineage Data sources Solutions for displaying metadata Find Understand
  89. User feedback › Analytics engineer, backend engineer › When data

    ingestion is delayed, it’s immediately clear which services are affected › Data governance › Able to identify users involved in the table’s generation end to end › Easier to understand tables generated from a column containing important data from a governance perspective Solutions for displaying metadata
  90. Contributed to OSS › Atlas to skip external temporary table

    created in hive › https://issues.apache.org/jira/browse/ATLAS-4492 Solutions for displaying metadata
  91. Resolved displaying metadata issue 1. Aggregating metadata 2. Displaying metadata

    Data Source BI tools Query engine Other services ・・・ DB 1 2 Users Data catalog’s data source
  92. Agenda › LINE’s data platform › Problem to introduce data

    catalog 1. Aggregating metadata issues 2. Displaying metadata issues › Summary › Future work
  93. Summary › Need to provide various metadata as users become

    more diverse › Not only to aggregate metadata, but also inspect and filter out data for users › Needed some adjustments to introduce Atlas When introducing OSS, it’s important to understand the user needs and your own data platform
  94. Future work › For a wider range of uses ›

    Generate statistical data from queries and use for cost optimization on data platform › Metadata everywhere › Allow users to view metadata on various BI tools › More personalized experience › Reduce the time for searching data, increase efficiency and productivity
  95. End