LINE’s data platform
Tasks / Tools
› Planner
› Data analysis: OASIS (BI tool)
› Service monitoring, ETL: Kibana, OASIS (BI tool)
› KPI analysis: Tableau, OASIS (BI tool)
Questions at each step: Find, Understand, Get Permission, Plan, Run search query
› What's the difference between uid and cid?
› Which table is important?
› Similar table names… Which is appropriate…?
› What are the data dependencies?
› Who is the data owner?
› What is the effect on others?
Aggregating metadata issue
› … a variety of metadata to help users understand data better
› Had to aggregate data from 10+ services, including internal tools on the data platform
Data catalog’s data sources (metadata): BI tools, Hive DWH, DB, query engine
Aggregating metadata issue
› … runs a query on every table to get metadata
› Tables on our data platform have a tremendous number of partitions
Data catalog’s data sources (metadata): Hive DWH, DB
Aggregating metadata issue
› … runs a query on every table to get metadata
› Need to provide metadata that is as up to date as possible
› The same data is used by users in various departments at the same time
Diagram: users in xxx dept. and users in yyy dept. both send queries to the same data
Aggregating metadata issue
› Issues: various data sources, overloading to data platform, noise in metadata
› Approaches: push-based ingestion, aggregated differences by 30 min, filtering data that is meaningless for users
Issue: Various data sources
› Data sources: BI tools, Hive DWH, DB, query engine
› Data ingestion: streaming, batch
› Data catalog: in-house data catalog, access control, metadata management, query editor, data lineage / change data capture
› Platforms: Hadoop, on-premises
› Data governance tools: internal employee info API, internal workflow
Issue: Various data sources
› Feature introduction
› Displaying data
› How to aggregate data
Issue: Various data sources
› Data governance
  › Monitor data usage
  › Prioritize data for data management
› Data enablement
  › Reference for how to use data
  › Analyze which reports are affected by a data change
Issue: Various data sources
› … by batch job, parse the SQL in each report, and connect it with a table
Diagram: data source (get queries in reports), data catalog (extract tables from reports)
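The report-to-table linking described above can be sketched roughly as follows. This is a minimal illustration, not the actual batch job: the regex only looks at identifiers that follow FROM or JOIN, and the names `extract_tables` and `link_reports_to_tables` are hypothetical. A production job would more likely use a real SQL parser, since this pattern is fooled by constructs such as `EXTRACT(day FROM ts)` or CTE names.

```python
import re
from typing import Dict, Set

# Hypothetical, simplified matcher: identifiers that follow FROM or JOIN.
_TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def extract_tables(sql: str) -> Set[str]:
    """Return lower-cased table names referenced after FROM/JOIN."""
    # Drop "--" line comments so commented-out tables are not counted.
    no_comments = re.sub(r"--[^\n]*", "", sql)
    return {m.group(1).lower() for m in _TABLE_REF.finditer(no_comments)}


def link_reports_to_tables(reports: Dict[str, str]) -> Dict[str, Set[str]]:
    """Invert {report name: SQL} into {table name: set of report names}."""
    usage: Dict[str, Set[str]] = {}
    for report, sql in reports.items():
        for table in extract_tables(sql):
            usage.setdefault(table, set()).add(report)
    return usage
```

With a mapping like this, the catalog can list the reports generated from a table, which is what the impact analysis for a data change needs.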
Issue: Various data sources
› Data governance
  › Audit the data generation process
› Data enablement
  › Understand data dependencies
  › Impact analysis for data changes
  › Make data debugging easier
Issue: Various data sources
› … PII
› Data owner
› Organization that uses the table
› Reports generated from the data
› User lists
› Links: Wiki, GitHub, Airflow, …
Issue: Various data sources
› … by data modification by Apache Hive/Spark
› Reasons for selection
  › Collaboration with Ranger
  › Cloudera support
  › More stable than other approaches
Issue: Overloading to data platform
› … as up-to-date metadata as possible
› Avoid sending unnecessary requests to our data platform
› We will make more use of Atlas for metadata change detection
› Other candidates
  › Hive Metastore Event Listener
    › Need to set up the Hive Metastore Event Listener
    › Need to filter events, because the data is almost raw and covers too many objects
  › Atlas REST API
    › Chose the streaming approach to meet our use case and reduce load on Atlas
› We use Atlas notifications to update our search feature’s index
Issue: Noise in metadata
› … table within 30 minutes as 1 change
› We not only aggregate changes, but also filter out duplicates and certain other changes
Diagram: Create/Update/Delete a table, Kafka, Atlas, Notification Hook Consumer, aggregate and filter out, notification
Issue: Noise in metadata
› … and deleted in a very short time, it is considered a user error
User behavior | Handling in our logic
Create a table → Delete within … minutes | Ignore
Create a table → … minutes elapsed → Delete | Create → Delete
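The aggregation and the create-then-delete rule above could look roughly like the sketch below. It is an illustration under assumptions, not the actual consumer: the `Change` type, the `collapse` rules for mixed operations, and the choice to start a window at a table's first buffered event are all hypothetical; only the 30-minute window and the "created and deleted within the window is ignored" rule come from the slides.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class Change(NamedTuple):
    ts: int     # event time in seconds
    table: str  # qualified table name
    op: str     # "CREATE", "UPDATE" or "DELETE"


def collapse(ops: List[str]) -> Optional[str]:
    """Reduce the operations seen in one window to a single change."""
    if ops[0] == "CREATE" and ops[-1] == "DELETE":
        return None  # created and deleted within the window: ignore
    if ops[0] == "CREATE":
        return "CREATE"
    if ops[-1] == "DELETE":
        return "DELETE"
    return "UPDATE"


def aggregate(events: Iterable[Change], window: int = 30 * 60) -> List[Change]:
    """Emit at most one change per table per window."""
    out: List[Change] = []
    open_windows: Dict[str, List[Change]] = {}

    def flush(table: str) -> None:
        batch = open_windows.pop(table)
        op = collapse([c.op for c in batch])
        if op is not None:
            out.append(Change(batch[0].ts, table, op))

    for ev in sorted(events, key=lambda c: c.ts):
        batch = open_windows.get(ev.table)
        # Close the table's window once it is older than `window` seconds.
        if batch and ev.ts - batch[0].ts >= window:
            flush(ev.table)
        open_windows.setdefault(ev.table, []).append(ev)

    for table in list(open_windows):
        flush(table)
    return out
```

A table that is created and deleted after the window has elapsed still produces two changes, a create followed by a delete, which matches the second row of the table above.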
Issue: Noise in metadata
› … hook consumer
› Compared the up-to-date data in Hive DWH with our data catalog’s database
› Identified the differences caused by each data change
Diagram: on notification, compare the DDL between Hive DWH and our data catalog’s database
Issue: Noise in metadata
› … and found out
› Ignored the 95% of DDL changes that don’t lead to user actions
Change pattern | %
“transient_lastDdlTime” changed | …
HDFS location changed | …
Column changed | …
Else | …
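One way to picture the DDL comparison is a key-by-key diff between the catalog's stored metadata and the current Hive metadata, followed by a classification into the change patterns listed above. The sketch below is illustrative only: the dictionary keys (`columns`, `location`) and the decision to ignore only `transient_lastDdlTime`-only changes are assumptions, since the slides do not say exactly which patterns were dropped.

```python
from typing import Any, Dict, Set

# Assumption: changes limited to these keys are treated as noise.
# Only transient_lastDdlTime is named in the slides.
IGNORED_KEYS: Set[str] = {"transient_lastDdlTime"}


def changed_keys(old: Dict[str, Any], new: Dict[str, Any]) -> Set[str]:
    """Keys whose values differ between two metadata snapshots."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


def classify(old: Dict[str, Any], new: Dict[str, Any]) -> str:
    """Map a metadata diff to a change pattern."""
    diff = changed_keys(old, new)
    if not diff:
        return "no_change"
    if diff <= IGNORED_KEYS:
        return "ignore"  # nothing a user would act on
    if "columns" in diff:
        return "column_changed"
    if "location" in diff:
        return "location_changed"
    return "other"
```

Only changes that classify as something other than `ignore` or `no_change` would then be surfaced to users.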
Solutions for aggregating metadata
› … root cause for data debugging
› Can analyze the effect of a data change immediately
› Data governance
  › More accurate and efficient monitoring of data usage
› Planner
  › Easier to picture how the data can be used
Displaying metadata issue
› … call the API from Atlas
  1. Lineage with many nodes
  2. Nodes connected to the table increased each time a job was executed
FYI: node and depth (diagram: nodes at Depth 1 and Depth 2)
Diagram: Hive/Spark (Atlas hook) → Atlas, with issues 1 and 2 marked
Atlas API performance issue
› Atlas hook registered lots of unnecessary tables to Atlas
  › e.g. tmp_profile_vdo.table
Diagram: Hive/Spark (Atlas hook) → Atlas, including temporary tables and test data, e.g. tmp_profile_vdo
e.g. Complexity of a lineage: Depth | Number of nodes
Atlas API performance issue
› … also aggregate Hive/Spark process as a node
› Nodes related to temporary tables are generated for each hourly/daily job
Diagram: process node, generated for each hourly/daily job
1. Atlas API performance issue
  1. Table with many nodes
  2. Nodes increase each time a job runs
2. Usability issue
Diagram: data source, data catalog, users, with issues 1 and 2 marked
Displaying metadata issue
› Table with many nodes: … the number of nodes registered to Atlas
› Nodes increased each time a job was executed: deleted unnecessary processes
› Usability issue: provide interactive UX that satisfies various use cases
Diagram: Hive/Spark (Atlas hook) → Kafka → Atlas
Issue: Table with many nodes
› … set the config “atlas.hook.hive.hive_table.ignore.pattern”
› Filtered out specific DBs/tables with the following regex patterns
Pattern | Type | Note
^… temp|tmp … | Table | Temporary
^[a-zA-Z]{…}\d{…} … | DB | Employee’s DB
・・・ | ・・・ | ・・・
Diagram: Hive/Spark (Atlas hook, filter out) → Kafka
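The effect of an ignore pattern can be sketched as a regex check on the qualified name before anything is registered. The patterns below are hypothetical stand-ins, not the ones from the slide (those are only partially legible), and this is not how the Atlas hook is implemented internally; it only illustrates the filtering idea.

```python
import re
from typing import Iterable, List

# Hypothetical ignore patterns, for illustration only.
IGNORE_PATTERNS = [
    # A DB or table name that starts with tmp_ / temp_ (case-insensitive).
    re.compile(r"(?:^|\.)(?:tmp|temp)_", re.IGNORECASE),
    # An assumed naming scheme for personal databases.
    re.compile(r"^sandbox_[a-z0-9]+\."),
]


def is_ignored(qualified_name: str) -> bool:
    """True if a 'db.table' name matches any ignore pattern."""
    return any(p.search(qualified_name) for p in IGNORE_PATTERNS)


def tables_to_register(names: Iterable[str]) -> List[str]:
    """Keep only the tables that should reach the lineage store."""
    return [n for n in names if not is_ignored(n)]
```

Filtering at the hook, before messages reach Kafka, keeps temporary and test tables out of the lineage graph entirely instead of cleaning them up afterwards.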
Issue: Nodes increased each time a job was executed
› … Need to delete some unnecessary process entities directly
Rule
› Have no inputs or outputs
› Own the same inputs and outputs
› ・・・
Diagram: batch job → delete → Atlas
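The two legible rules can be sketched as a selection step that a batch job might run before issuing deletes. This is one possible reading, stated as an assumption: "have no inputs or outputs" is taken to mean either side is empty, and "own the same inputs and outputs" is taken to mean duplicate processes from repeated job runs, of which only the newest is kept. The `Process` type is hypothetical and no Atlas API is called here.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Set, Tuple


class Process(NamedTuple):
    guid: str
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]
    created: int  # creation time; larger is newer


def select_for_deletion(processes: List[Process]) -> Set[str]:
    """Return the guids of process entities considered unnecessary."""
    to_delete: Set[str] = set()
    newest: Dict[Tuple[FrozenSet[str], FrozenSet[str]], Process] = {}
    for p in processes:
        # Rule 1 (assumed reading): no inputs or no outputs.
        if not p.inputs or not p.outputs:
            to_delete.add(p.guid)
            continue
        # Rule 2 (assumed reading): same inputs and outputs as another
        # process; keep only the newest one.
        key = (p.inputs, p.outputs)
        kept = newest.get(key)
        if kept is None:
            newest[key] = p
        elif p.created > kept.created:
            to_delete.add(kept.guid)
            newest[key] = p
        else:
            to_delete.add(p.guid)
    return to_delete
```

Keeping one process per distinct input/output pair preserves the lineage edge between the tables while stopping the node count from growing with every hourly or daily run.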
Solutions for displaying metadata
› … seconds at the latest
› The number of tables registered in Atlas has been reduced by 90%
› 300,000 nodes are deleted daily by the batch job
› Supports displaying end-to-end lineage and deep-dive, column-level impact analysis
Solutions for displaying metadata
› … ingestion is delayed, it is immediately clear which services are affected
› Data governance
  › Can identify the users involved in a table’s generation end to end
  › Easier to see which tables are generated from a column containing data that is important from a governance perspective
When introducing OSS, it’s important to understand the user needs and your own data platform
› … more diverse
› Not only aggregate metadata, but also inspect and filter it for users
› Made some adjustments to introduce Atlas
› Generate statistical data from queries and use it for cost optimization on the data platform
› Metadata everywhere
  › Allow users to view metadata in various BI tools
› More personalized experience
  › Reduce the time spent searching for data; increase efficiency and productivity