Developing a Data Catalog to Promote Data Usage on the Data Platform

Overview
› Introduction to our data catalog
› Motivation for and benefits of an in-house data catalog
› How we developed the data catalog

Target audience
› People who are interested in
  › Operating a data catalog
  › Developing a data catalog

Agenda
› LINE’s data platform
› Problems in introducing a data catalog
› Solution
› Summary
› Future work

LINE’s self-serve data platform
› Democratize data for businesses
› 200+ services, including internal services
› The data platform serves governance, data science, machine learning, and planning

LINE’s data platform: scale of data use
› 400 PB HDFS capacity
› 40,000 Hive tables
› 150,000 jobs/day

LINE’s data platform: our users (always data driven)
› Data scientist. Tasks: data analysis. Tools: OASIS (BI tool)
› Backend engineer. Tasks: service monitoring, ETL. Tools: Kibana, OASIS (BI tool)
› Planner. Tasks: KPI analysis. Tools: Tableau, OASIS (BI tool)

Tasks before data activity
› Find → Understand → Get permission → Plan → Run (search, query)

Need to know before data activities
› When is it updated?
› What's the difference between uid and cid?
› Which table is important?
› Similar table names: which one is appropriate?
› What are the data dependencies?
› Who is the data owner?
› What is the effect on others?

LINE’s data catalog
› Mission
  › Data democratization for businesses
› Values
  › One-stop solution for data activities
› Core features
  › Search data
  › Access control
  › Metadata management
  › Exploratory data analysis


Problems in introducing a data catalog
1. Aggregating metadata
2. Displaying metadata
[Diagram: data catalog’s data sources (BI tools, query engine, DB, other services) → (1) data catalog → (2) users]


Aggregating metadata issues
1. Various data sources
2. Overloading the data platform
3. Noise in metadata

Difficult to aggregate from various sources (aggregating metadata issue)
› We need to provide a variety of metadata so that users can understand the data better
› We had to aggregate data from 10+ services, including internal tools on the data platform
[Diagram: data sources (BI tools, Hive DWH, DB, query engine) → metadata → data catalog]

Overloading our data platform
› A typical data catalog runs a query on every table to get metadata
› A table in our data platform can have a tremendous number of partitions
[Diagram: Hive DWH, DB → metadata → data catalog]

Get up-to-date data without overloading
› A typical data catalog runs a query on every table to get metadata
› We need to provide metadata that is as up to date as possible
› The same data is used by users in various departments at the same time
[Diagram: users in the xxx and yyy departments send queries to the same data]

Noise in metadata
› 300 to 500 DBs/tables change within a few minutes
› It is important not to aggregate metadata that doesn’t lead to users’ action
[Diagram: Hive DWH, DB → metadata → data catalog]

Noise in metadata
› Metadata generated by user error
  › e.g. a table is created by mistake and deleted immediately
[Diagram: create/alter/drop tables → data catalog]

Noise in metadata
› Some metadata does not lead to users’ action
  › A change to only the timestamp is of no value

Summary of the aggregating metadata issues
1. Various data sources
2. Overloading the data platform
3. Noise in metadata

What are the solutions?

Solutions for aggregating metadata
› Various data sources: connect and aggregate effectively
› Overloading the data platform: push-based ingestion
› Noise in metadata: aggregate differences in 30-minute windows, and filter out data that is meaningless to users


Connect and aggregate effectively (Issue: Various data sources)
[Diagram: data sources (BI tools, Hive DWH, DB, query engine) → data ingestion (streaming, batch) → in-house data catalog (access control, metadata management, query editor, data lineage / change data capture). Also connected: platforms (Hadoop, on-premises), data governance tools, internal employee info API, internal workflow]

Deep dive into the aggregation
[Diagram: the same architecture of data sources → data ingestion → in-house data catalog]

Aggregate from BI tools
› Feature introduction
› Displaying data
› How to aggregate data

Report catalog
› Data sources
  › Tableau, OASIS (BI tool)
› Data governance
  › Monitor data usage
  › Prioritize data for data management
› Data enablement
  › Reference for how to use data
  › Analyze which reports are affected by a data change

Displaying data
› Report info
  › Name
  › URL
  › Description
  › PV
  › Timestamp
  › Author name
  › Users

Aggregate from BI tools
› A batch job aggregates reports from the BI tools, parses the SQL in each report, and connects the report with the tables it uses
[Diagram: data source → get the queries in reports → extract tables from reports → data catalog]
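
The "parse SQL and connect it with a table" step could look roughly like the sketch below. It is an illustration only, not the actual batch job: `extract_tables` and `link_reports_to_tables` are hypothetical names, the input shape is assumed, and the regex is a naive stand-in for a real SQL parser (it only sees names that directly follow FROM or JOIN).

```python
from __future__ import annotations

import re

# Naive pattern (assumption): a table reference is the identifier that
# directly follows FROM or JOIN, optionally qualified as db.table.
_TABLE_REF = re.compile(
    r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)?)",
    re.IGNORECASE,
)


def extract_tables(sql: str) -> set[str]:
    """Return the lower-cased table names referenced after FROM/JOIN."""
    return {match.group(1).lower() for match in _TABLE_REF.finditer(sql)}


def link_reports_to_tables(reports: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each table to the names of the reports whose queries read it.

    `reports` maps a report name to the SQL queries found in that report.
    """
    table_to_reports: dict[str, set[str]] = {}
    for report_name, queries in reports.items():
        for sql in queries:
            for table in extract_tables(sql):
                table_to_reports.setdefault(table, set()).add(report_name)
    return table_to_reports
```

With a mapping like this, a table page can list the reports generated from it, and a data change can be traced to the affected reports.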

Aggregate from Apache Atlas
› Feature introduction
› Displaying data
› How to aggregate data

Data lineage
› Data source
  › Atlas (Hive, Spark)
› Data governance
  › Audit the data generation process
› Data enablement
  › Understand data dependencies
  › Analyze the impact of a data change
  › Make data debugging easier

Displaying data
› Table description, timestamps
› Table relationships
› PII
› Data owner
› Organization that uses the table
› Reports generated from the data
› User lists
› Links: Wiki, GitHub, Airflow, …

Apache Atlas
› OSS data catalog
› Stores the metadata generated when Apache Hive/Spark modifies data
› Reasons for selection
  › Collaboration with Ranger
  › Cloudera support
  › More stable than other approaches

Data lineage for Hive
› The Atlas hook in Hive saves metadata to Atlas via Kafka
[Diagram: create/alter/drop table → Hive (Atlas hook) → Kafka → Atlas (metadata entity) → REST API → data catalog]

Data lineage for Spark
› Atlas doesn’t support Spark, so we introduced the Atlas Spark connector
[Diagram: create/alter/drop table → Spark (Atlas Spark hook) and Hive Warehouse (Atlas hook) → Kafka → Atlas (metadata entity) → data catalog]

Aggregate other metadata with lineage
› Aggregate from other data sources to display alongside the data lineage
[Diagram: Hive/Spark (Atlas hook) → Kafka → Atlas (metadata entity) → data catalog ← other data sources]



Apache Atlas (Issue: Overloading the data platform)
› OSS data catalog
› Stores the metadata generated when Apache Hive/Spark modifies data
› Features
  › Lineage
  › Notification

Atlas notification
› Atlas sends notifications about metadata changes to a Kafka topic named ATLAS_ENTITIES
[Diagram: Hive/Spark (Atlas hook) → Kafka → Atlas (Atlas entities) → notification → Kafka → notification hook consumer]

Atlas notification
› Reasons for selection
  › We need to provide metadata that is as up to date as possible
  › We want to avoid sending unnecessary requests to our data platform
  › We will make more use of Atlas for metadata change detection
› Other candidates
  › Hive Metastore Event Listener
    › Needs a Hive Metastore Event Listener to be set up
    › Needs filtering, because the data is almost raw and contains too many objects
  › Atlas REST API
    › We selected the streaming approach to meet our use case and reduce the load on Atlas
› We use Atlas notification to update the index of our search feature


Aggregated differences by 30min (Issue: Noise in metadata)
› Introduced a 30-minute window on the Kafka consumer and aggregated the consumed data
[Diagram: Hive/Spark (Atlas hook) → Kafka → Atlas → notification → notification hook consumer]

Aggregated differences by 30min
› Treated multiple changes to a table within 30 minutes as one change
› Did not only aggregate, but also filtered out duplicated changes and some other changes
[Diagram: create/update/delete a table → Kafka → Atlas → notification → hook consumer (aggregate and filter out)]
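
As a rough illustration of "multiple changes within 30 minutes count as one change", the sketch below groups change events into fixed 30-minute windows in plain Python. It is not the consumer code from the talk: `ChangeEvent`, its fields, and `aggregate_by_window` are hypothetical, and a real consumer would apply the window on the Kafka stream rather than on an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 30 * 60  # 30-minute window, as on the slide


@dataclass(frozen=True)
class ChangeEvent:
    table: str      # qualified table name, e.g. "db.table" (assumed shape)
    operation: str  # e.g. "CREATE", "UPDATE", "DELETE" (assumed values)
    timestamp: int  # event time in epoch seconds


def aggregate_by_window(events):
    """Collapse all changes to one table inside one window into one record."""
    buckets = defaultdict(list)
    for event in events:
        # Align the event to the start of its fixed (tumbling) window.
        window_start = event.timestamp - event.timestamp % WINDOW_SECONDS
        buckets[(window_start, event.table)].append(event)

    aggregated = []
    for (window_start, table), grouped in sorted(buckets.items()):
        grouped.sort(key=lambda e: e.timestamp)
        operations = []
        for event in grouped:
            # Drop consecutive duplicates such as UPDATE, UPDATE, UPDATE.
            if not operations or operations[-1] != event.operation:
                operations.append(event.operation)
        aggregated.append(
            {"window_start": window_start, "table": table, "operations": operations}
        )
    return aggregated
```

Three updates to the same table in one window come out as a single record, so the catalog handles one change instead of three.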

Added logic to ignore
› If a table is created and deleted within a very short time, it is considered a user error
› User behavior and handling in our logic
  › Create a table → delete within … minutes: ignore
  › Create a table → … minutes elapsed → delete: create → delete
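
The ignore rule can be sketched as follows. This is a minimal illustration under assumptions: the event tuples, the function name, and the `SHORT_LIVED_SECONDS` threshold are all hypothetical (the threshold in the original slide is not recoverable).

```python
SHORT_LIVED_SECONDS = 30 * 60  # hypothetical threshold, not taken from the talk


def drop_short_lived_tables(events):
    """Remove create..delete sequences that look like user errors.

    `events` is a list of (table, operation, epoch_seconds) tuples sorted by
    time. If a table is deleted within the threshold after its creation,
    every event for that table from the create to the delete is dropped.
    """
    created_at = {}   # table -> (index of CREATE event, creation time)
    ignored = set()   # indexes of events to drop
    for index, (table, operation, timestamp) in enumerate(events):
        if operation == "CREATE":
            created_at[table] = (index, timestamp)
        elif operation == "DELETE" and table in created_at:
            create_index, create_time = created_at.pop(table)
            if timestamp - create_time <= SHORT_LIVED_SECONDS:
                ignored.update(
                    i for i in range(create_index, index + 1)
                    if events[i][0] == table
                )
    return [event for i, event in enumerate(events) if i not in ignored]
```

A table that lives longer than the threshold keeps both its create and its delete event, matching the second row of the rule above.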

Inspected the Atlas notification record and filtered
› An Atlas notification record contains information about the changed table
[Screenshot: an Atlas notification record]

Code introduction in the aggregation
› L41~43: filter out, L45: aggregate, L46: set up the 30min window
[Screenshot: the aggregation code]

Filtering meaningless data
› Inspect the notification and try to ignore even more meaningless data
[Diagram: Hive/Spark (Atlas hook) → Kafka → Atlas → notification → notification hook consumer]

Inspected the Atlas notification record further
› The notification did not include the specific DDL changes
› We needed to find another solution

Compared Hive DWH and our database
› Compared the up-to-date data in Hive DWH with the data in our data catalog’s database
› Worked out what each data change actually changed
[Diagram: notification → Kafka consumer (Atlas notification hook consumer) → compare the DDL between Hive DWH and our data catalog’s database]

Ignore the unnecessary change patterns
› Analyzed the DDL changes to find the meaningless ones
› Ignored the 95% of DDL changes that do not lead to user actions
› Change patterns measured (by %): “transient_lastDdlTime” changed, HDFS location changed, column changed, else
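
Comparing the two snapshots and classifying the difference might look like the sketch below. The snapshot keys (`columns`, `location`) and the function names are assumptions for illustration. Only the timestamp-only case is treated as ignorable here, because the slides state that changing only the timestamp is of no value; which other patterns were ignored is not recoverable from the slide.

```python
# Properties whose change alone is treated as noise (from the slides:
# a change to only the timestamp is of no value).
IGNORED_KEYS = {"transient_lastDdlTime"}


def classify_change(old: dict, new: dict) -> str:
    """Classify the difference between the catalog's stored table metadata
    (`old`) and the up-to-date metadata read from Hive (`new`)."""
    changed = {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}
    if not changed:
        return "no_change"
    if changed <= IGNORED_KEYS:
        return "timestamp_only"
    if "columns" in changed:
        return "column_changed"
    if "location" in changed:
        return "location_changed"
    return "other"


def is_meaningful(old: dict, new: dict) -> bool:
    """True when the change should be propagated to the catalog."""
    return classify_change(old, new) not in {"no_change", "timestamp_only"}
```

Counting the labels returned by `classify_change` over a day of notifications would give a breakdown like the change-pattern table on the slide.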

Summary of the solutions for aggregating metadata
› Various data sources: connect and aggregate effectively
› Overloading the data platform: push-based integration
› Noise in metadata: aggregate differences in 30-minute windows, and filter out data that is meaningless to users

What are the results?

Results
› Aggregated
  › up-to-date metadata
  › from various data sources
  › without overloading our data platform

Results
› Aggregated
  › while excluding data that is meaningless for users’ actions
[Diagram: Hive/Spark (Atlas hook) → Kafka → Atlas → notification → notification hook consumer, labeled “Removed 95% noise” and “Removed 90% noise”]

Use case
› An analytics engineer changes a table’s DDL for ETL and notifies the stakeholders
› Find → Understand → Plan → Run (search, query) → Notify (announce)

Familiarize yourself with the data and find it (Find)
[Screenshot: search results, ranked by search]

Understand and check the target table (Understand)
[Screenshot: table overview and its data sources]

Confirm the dependencies (Understand)
[Screenshot: data lineage and report catalog, with their data sources]

Notify users of the data change (Notify)
[Screenshot: notification tools]

User feedback
› Analytics engineer
  › Easier to find the root cause when debugging data
  › Able to analyze the effect of a data change immediately
› Data governance
  › More accurate and efficient monitoring of data usage
› Planner
  › Easier to imagine how to use the data

Resolved the aggregating metadata issues
1. Aggregating metadata (resolved)
2. Displaying metadata


Displaying metadata issues
› Issues in displaying data lineage with Atlas
  1. Atlas API performance issue
  2. Usability issue


Atlas API performance issue
› A call to the Atlas API took about 30 minutes
  1. Lineage with many nodes
  2. The nodes connected to a table increased each time a job was executed
[Diagram: FYI, introduction to nodes and depth, showing nodes at depth 1 and depth 2]

Many nodes issue
› One table had 250 upstream tables
› The Atlas hook registered lots of unnecessary tables to Atlas
  › Temporary tables and test data, e.g. tmp_profile_vdo
[Chart: complexity of a lineage, number of nodes by depth]

Nodes increase each time a job runs
› We also aggregate each Hive/Spark process as a node
› Nodes related to temporary tables are generated for each hourly/daily job
[Diagram: a process node generated for each hourly/daily job]


Usability issue
› The lineage graph cannot be changed flexibly
› The entire lineage is displayed on first access
› Tables and columns are displayed mixed together


What are the solutions?
› Issues in displaying data lineage with Atlas
  1. Atlas API performance issue
     1. Table with many nodes
     2. Nodes increase each time a job runs
  2. Usability issue

Solutions for displaying metadata
› Table with many nodes: reduce the number of nodes registered to Atlas
› Nodes increased each time a job was executed: delete unnecessary processes
› Usability issue: provide an interactive UX that satisfies various use cases


Reduce the number of nodes registered (Issue: Table with many nodes)
› Atlas supports the config “atlas.hook.hive.hive_table.ignore.pattern”
› Filtered out specific DBs/tables in the Atlas hook with regex patterns
› Pattern, type, and note
  › Regex matching names that contain temp or tmp. Type: table. Note: temporary
  › Regex matching letters followed by digits. Type: DB. Note: employee’s DB
  › …
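
To show how regex-based ignore rules of this kind behave, the sketch below applies two patterns to qualified table names in Python. It only illustrates the idea: the Atlas hook itself is configured through the property named on the slide, and both patterns here are hypothetical stand-ins because the originals were not recoverable.

```python
import re

# Hypothetical ignore patterns, matched against "db.table" names.
IGNORE_PATTERNS = [
    re.compile(r".*\.(tmp|temp)_.*"),  # tables whose name starts with tmp_ or temp_
    re.compile(r"sandbox_.*\..*"),     # every table in a per-user sandbox DB
]


def should_register(qualified_name: str) -> bool:
    """Return False when the table matches any ignore pattern."""
    return not any(p.fullmatch(qualified_name) for p in IGNORE_PATTERNS)


def filter_tables(qualified_names):
    """Keep only the tables that should be registered."""
    return [name for name in qualified_names if should_register(name)]
```

Trying patterns against a sample of real table names before deploying them helps avoid hiding tables that users actually need in the lineage.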


Delete unnecessary processes (Issue: Nodes increased each time a job was executed)
› Process nodes could not be filtered out
› We need to delete some unnecessary process entities directly, with a batch job against Atlas
› Rules for deletion
  › Have no inputs or outputs
  › Own the same inputs and outputs
  › …
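
The two deletion rules can be expressed as a small predicate, sketched below. The process dictionaries (`guid`, `inputs`, `outputs`) are an assumed simplified shape, not the Atlas entity model, and "no inputs or outputs" is read here as "missing either side", which is an interpretation. The actual deletion call against Atlas is left out.

```python
def is_unnecessary(process: dict) -> bool:
    """Apply the two rules from the slide to one process entity."""
    inputs = set(process.get("inputs", []))
    outputs = set(process.get("outputs", []))
    # Rule 1: the process has no inputs or no outputs.
    if not inputs or not outputs:
        return True
    # Rule 2: the process reads and writes exactly the same tables.
    return inputs == outputs


def select_for_deletion(processes):
    """Return the guids of the process entities a batch job would delete."""
    return [p["guid"] for p in processes if is_unnecessary(p)]
```

A batch job could run `select_for_deletion` over the process entities fetched from Atlas and delete the returned guids.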

Provide interactive UX (Issue: Usability issue)
› Return 3 depths of lineage by default
› Provide interactive UX
  › Dig into column-level lineage
  › Expand the displayed lineage graph by clicking a node
  › Drag and drop
  › Zoom in/out
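
Returning only a few depths and expanding on demand amounts to a depth-limited graph traversal. The sketch below shows one way to do it with a breadth-first search. The adjacency-dict input and the function name are assumptions; the talk does not describe how the catalog implements this.

```python
from collections import deque


def lineage_within_depth(edges, start, max_depth=3):
    """Return {node: depth} for nodes reachable from `start` within `max_depth`.

    `edges` maps a node to the nodes directly downstream of it.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand past the depth limit
        for neighbor in edges.get(node, ()):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths
```

Clicking a node in the UI would correspond to calling the function again with that node as `start` and a small `max_depth`, then merging the result into the displayed graph.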

What are the results?

Results
› The Atlas API now usually responds within 6 seconds, and within 15 seconds at the slowest
› The number of tables registered in Atlas has been reduced by 90%
› The batch job deletes 300,000 nodes daily
› The catalog can display end-to-end lineage and supports deep dives into column-level impact analysis

Confirm column-level dependencies (Understand)
[Screenshot: column-level lineage and its data sources]

User feedback
› Analytics engineer, backend engineer
  › When data ingestion is delayed, it is immediately clear which services are affected
› Data governance
  › Able to identify, end to end, the users involved in generating a table
  › Easier to understand, from a governance perspective, which tables are generated from a column that contains important data

Contributed to OSS
› Atlas: skip external temporary tables created in Hive
› https://issues.apache.org/jira/browse/ATLAS-4492

Reference
› How we introduced Data Lineage to LINE’s large-scale Data Platform (blog post in Japanese)
› https://engineering.linecorp.com/ja/blog/data-lineage-on-line-big-data-platform/

Resolved the displaying metadata issues
1. Aggregating metadata (resolved)
2. Displaying metadata (resolved)


Summary
› We need to provide a wider variety of metadata as users become more diverse
› It is necessary not only to aggregate metadata, but also to inspect it and filter out data for users
› Introducing Atlas required some adjustments
When introducing OSS, it’s important to understand the user needs and your own data platform

Future work
› For a wider range of uses
  › Generate statistical data from queries and use it for cost optimization on the data platform
› Metadata everywhere
  › Allow users to view metadata in various BI tools
› More personalized experience
  › Reduce the time spent searching for data, and increase efficiency and productivity

Thank you


More Decks by Tech-Verse2022
