Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

› Introduction of our data catalog › Motivation and benefit for in-house data catalog › How we develop the data catalog Overview

Slide 3

Slide 3 text

› People who are interested in › Operating data catalog › Developing data catalog Target audience

Slide 4

Slide 4 text

Agenda › LINE’s data platform › Problem to introduce data catalog › Solution › Summary › Future work

Slide 6

Slide 6 text

Democratize data for businesses +200 Services including internal services Data platform Governance Data science Machine learning Planning LINE’s self-serve data platform

Slide 7

Slide 7 text

Scale of data use 400 PB HDFS capacity 40,000 Hive Tables 150,000 Jobs/Day LINE’s data platform

Slide 8

Slide 8 text

Our users Always data driven Data scientist Backend engineer Job Tasks Tools Planner Data analysis OASIS (BI tool) Service monitoring ETL Kibana, OASIS (BI tool) KPI analysis Tableau , OASIS (BI tool) LINE’s data platform

Slide 9

Slide 9 text

Tasks before data activity Find Understand Get Permission Plan Run search query

Slide 10

Slide 10 text

Need to know before data activities When is it updated? What's the difference between uid and cid? Which table is important? Similar table names… Which is appropriate? What are the data dependencies? Who is the data owner? What is the effect on others? Find Understand Get Permission Plan Run search query

Slide 11

Slide 11 text

LINE’s data catalog › Mission › Data democratization for businesses › Values › One-stop solution for data activities › Core features › Search data › Access control › Metadata management › Exploratory data analysis

Slide 12

Slide 12 text

Agenda › LINE’s data platform › Problem to introduce data catalog › Solution › Summary › Future work

Slide 13

Slide 13 text

Problem to introduce data catalog 1. Aggregating metadata 2. Displaying metadata Data source BI tools Query engine Other services ・・・ DB 1 2 Users Data catalog’s data source

Slide 14

Slide 14 text

Problem to introduce data catalog 1. Aggregating metadata 2. Displaying metadata Data source BI tools Query engine Other services ・・・ DB Users 1 2 Data catalog’s data source

Slide 15

Slide 15 text

Aggregating metadata issues 1. Various data sources 2. Overloading to data platform 3. Noise in metadata 1 2 3 Data catalog’s data source

Slide 16

Slide 16 text

Difficult to aggregate from various sources › Need to provide a variety of metadata for users’ better understanding of data › Had to aggregate data from 10+ services including internal tools on data platform Data source BI tools Hive DWH DB Query engine Metadata Aggregating metadata issue Data catalog’s data source

Slide 17

Slide 17 text

Overloading to our data platform › A typical data catalog runs a query on every table to get metadata › A table can have a tremendous number of partitions in our data platform Hive DWH DB Metadata Aggregating metadata issue Data catalog’s data source

Slide 18

Slide 18 text

Get up-to-date data without overloading › A typical data catalog runs a query on every table to get metadata › Need to provide metadata that is as up to date as possible › The same data is used by users in various departments at the same time Data Users yyy dept. Users xxx dept. Queries Queries Aggregating metadata issue

Slide 19

Slide 19 text

Noise in metadata › 300 to 500 DBs/tables change within a few minutes › It’s important not to aggregate metadata that doesn’t lead to users’ action Hive DWH DB Metadata Data catalog Aggregating metadata issue

Slide 20

Slide 20 text

Noise in metadata › Metadata generated by user error › e.g. a table is created and then deleted immediately because it was created by mistake Aggregating metadata issue Data catalog Create/Alter/Drop tables

Slide 21

Slide 21 text

Noise in metadata › Some metadata does not lead to users’ action Changing only timestamp is of no value Aggregating metadata issue

Slide 22

Slide 22 text

Summarize issues 1 2 3 1. Various data sources 2. Overloading to data platform 3. Noise in metadata Aggregating metadata issue Data catalog’s data source

Slide 23

Slide 23 text

What are the solutions? 1. Various data source 2. Overloading to data platform 3. Noise in metadata Aggregating metadata issue Solutions

Slide 24

Slide 24 text

Solutions for aggregating metadata Connect and aggregate effectively Various data sources Push-based ingestion Overloading to data platform Noise in metadata Aggregating metadata issue Aggregated differences by 30min Filtering meaningless data for users

Slide 25

Slide 25 text

Solutions for aggregating metadata Connect and aggregate effectively Various data sources Push-based Integration Overloading to data platform Noise in metadata Aggregating metadata issue Aggregated differences by 30min Filtering meaningless data for users

Slide 26

Slide 26 text

Connect and aggregate effectively Data source BI tools Hive DWH DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow Issue: Various data sources

Slide 31

Slide 31 text

Deep dive the aggregation Data source BI tools Hive DWH DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow Issue: Various data sources

Slide 32

Slide 32 text

Aggregate from BI tools Data source BI tools Hive DWH DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow › Feature introduction › Displaying data › How to aggregate data Issue: Various data sources

Slide 33

Slide 33 text

Report catalog › Data sources › Tableau, OASIS (BI tool) › Data governance › Monitor data usage situation › Prioritize data for data management › Data enablement › Reference for the way to use data › Analyze affected reports by data change Issue: Various data sources

Slide 34

Slide 34 text

Displaying data › Report info › Name › URL › Description › PV › Timestamp › Author name › Users Issue: Various data sources

Slide 35

Slide 35 text

Aggregate from BI tools › Aggregated reports from BI tools with a batch job, parsed the SQL in each report, and linked the report to the tables it queries Data source Get query in reports Data Catalog Extract tables from reports Issue: Various data sources
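
The batch step on this slide (get the queries in each report, extract the tables, link them) could be sketched roughly as follows. This is only an illustration, not the code used at LINE: the function names and the input shape (report name mapped to a list of SQL strings) are assumptions, and the regular expression is a deliberate simplification that a real SQL parser would replace.

```python
import re
from typing import Dict, List, Set

# Assumption: table references appear right after FROM or JOIN.
# CTEs, subqueries and quoted identifiers are not handled here.
_TABLE_REF = re.compile(
    r"\b(?:from|join)\s+([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)?)",
    re.IGNORECASE,
)


def extract_tables(sql: str) -> Set[str]:
    """Return the lower-cased table names referenced in one query."""
    return {match.group(1).lower() for match in _TABLE_REF.finditer(sql)}


def link_reports_to_tables(reports: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each table to the names of the reports that query it."""
    table_to_reports: Dict[str, Set[str]] = {}
    for report_name, queries in reports.items():
        for sql in queries:
            for table in extract_tables(sql):
                table_to_reports.setdefault(table, set()).add(report_name)
    return table_to_reports
```

With a mapping like this, a table page can list the reports built on it, and a change to the table can be traced to the affected reports.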

Slide 36

Slide 36 text

Aggregate from Apache Atlas Data source BI tools Hive DWH DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow › Feature introduction › Displaying data › How to aggregate data Issue: Various data sources

Slide 37

Slide 37 text

Data lineage › Data source › Atlas (Hive, Spark) › Data governance › Audit data generation process › Data enablement › Understand data dependencies › Impact analysis by data change › Make data debugging easier Issue: Various data sources

Slide 38

Slide 38 text

Displaying data › Table description, timestamps › Table relationships › PII › Data owner › Organization that uses the table › Reports generated from the data › User lists › Links: Wiki, GitHub, Airflow, … Issue: Various data sources

Slide 39

Slide 39 text

Apache Atlas › OSS data catalog › Stores metadata generated by data modification by Apache Hive/Spark › Reasons for selection › Collaboration with Ranger › Cloudera support › More stable than other approaches Issue: Various data sources

Slide 40

Slide 40 text

Data lineage for Hive Atlas hook Hive Kafka Atlas Data catalog Metadata entity Atlas hook › The Atlas hook in Hive saves metadata to Atlas via Kafka Create/Alter /Drop table Issue: Various data sources REST API

Slide 41

Slide 41 text

Data lineage for Spark Issue: Various data sources › Atlas doesn’t support Spark, so we introduced the Atlas Spark connector Atlas hook Spark Kafka Atlas Data catalog Metadata entity Atlas Spark hook Atlas hook Hive Warehouse Metadata entity Atlas hook Create/Alter /Drop table

Slide 42

Slide 42 text

Aggregate other metadata with lineage Atlas hook Hive/Spark Kafka Atlas Data catalog Metadata entity Atlas hook Other data sources Issue: Various data sources › Aggregate from other data sources to display data lineage

Slide 43

Slide 43 text

Connect and aggregate effectively Issue: Various data sources Data source BI tools Hive DWH DB Query engine Data ingestion Streaming Batch Data catalog In-house data catalog Access control Metadata management Query editor Data lineage/ Change data capture Platforms ・Hadoop ・On-premises Data governance tools Internal employee info API Internal workflow

Slide 44

Slide 44 text

Solutions for aggregating metadata Connect and aggregate effectively Various data sources Push-based Integration Overloading to data platform Noise in metadata Aggregating metadata issue Aggregated differences by 30min Filtering meaningless data for users

Slide 45

Slide 45 text

Apache Atlas › OSS data catalog › Stores metadata generated by data modification by Apache Hive/Spark › Features › Lineage › Notification Issue: Overloading to data platform

Slide 46

Slide 46 text

Atlas notification › Atlas sends notifications about metadata changes to Kafka topic named ATLAS_ENTITIES Issue: Overloading to data platform Atlas entities Atlas hook Atlas hook Hive/Spark Kafka Atlas Notification hook consumer Notification

Slide 47

Slide 47 text

Atlas notification › Reasons for selection › Need to provide metadata that is as up to date as possible › Avoid sending unnecessary requests to our data platform › We will make more use of Atlas for metadata change detection › Other candidates › Hive Metastore Event Listener › Need to set up Hive Metastore Event Listener › Need to filter the events because they are almost raw data and cover too many objects › Atlas REST API › Selected the streaming approach to fit our use case and reduce load on Atlas › We use Atlas notification for updating our search feature’s index Issue: Overloading to data platform

Slide 48

Slide 48 text

Solutions for aggregating metadata Connect and Aggregate effectively Various data sources Push-based ingestion Overloading to data platform Noise in metadata Aggregating metadata issue Aggregated differences by 30min Filtering meaningless data for users

Slide 49

Slide 49 text

Aggregated differences by 30min › Introduced 30min window to Kafka and aggregated consumed data Issue: Noise in metadata Atlas hook Hive/Spark Kafka Atlas Notification hook consumer Notification

Slide 50

Slide 50 text

Aggregated differences by 30min › Treated multiple changes on a table within 30 minutes as 1 change › Not only aggregated, but also filtered out duplicated changes and some other changes Issue: Noise in metadata Create/Update/Delete a Table Kafka Atlas Notification Hook Consumer Aggregate and filter out Notification
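
As a rough illustration of treating multiple changes within 30 minutes as one, the sketch below groups change events into fixed 30-minute windows per table and drops consecutive duplicates. It is a plain-Python stand-in under stated assumptions, not the Kafka-based implementation from the talk: the `ChangeEvent` shape, the operation names, and the use of fixed (tumbling) windows are all assumptions.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

WINDOW_SECONDS = 30 * 60  # the 30-minute window mentioned on the slide


class ChangeEvent(NamedTuple):
    table: str
    timestamp: int  # epoch seconds (assumed representation)
    operation: str  # e.g. "CREATE", "UPDATE", "DELETE" (assumed names)


class AggregatedChange(NamedTuple):
    table: str
    window_start: int
    operations: Tuple[str, ...]


def aggregate_by_window(events: Iterable[ChangeEvent]) -> List[AggregatedChange]:
    """Collapse all events for one table in one window into a single change."""
    grouped: Dict[Tuple[int, str], List[str]] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        window_start = event.timestamp - event.timestamp % WINDOW_SECONDS
        operations = grouped.setdefault((window_start, event.table), [])
        # Drop consecutive duplicates such as repeated UPDATEs.
        if not operations or operations[-1] != event.operation:
            operations.append(event.operation)
    return [
        AggregatedChange(table, start, tuple(ops))
        for (start, table), ops in grouped.items()
    ]
```

Each `AggregatedChange` then stands for at most one update to the catalog per table per window, however many raw notifications arrived.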

Slide 51

Slide 51 text

Added logic to ignore › If a table is created and deleted in a very short time, it’s considered as a user error Issue: Noise in metadata
User behavior | Handling in our logic
Create a table → Delete within … minutes | Ignore
Create a table → (… minutes elapsed) → Delete | Create → Delete
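
The rule in the table above might look like the following in code. This is a minimal sketch: the function name is hypothetical, and the threshold is a required parameter because the actual number of minutes is not legible in the slide text.

```python
from typing import List, Optional


def classify_lifecycle(
    created_at: int,
    deleted_at: Optional[int],
    threshold_seconds: int,
) -> List[str]:
    """Return the operations worth recording for one table.

    Timestamps are epoch seconds. A table deleted within
    threshold_seconds of its creation is treated as a user error
    and produces no operations at all.
    """
    if deleted_at is None:
        return ["CREATE"]
    if deleted_at - created_at <= threshold_seconds:
        return []  # created and deleted almost immediately: ignore
    return ["CREATE", "DELETE"]
```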

Slide 52

Slide 52 text

Inspected the Atlas notification record and filtered › The Atlas notification record contains the following information about the changed table Issue: Noise in metadata

Slide 53

Slide 53 text

Code introduction in the aggregation › L41~43: Filtered out, L45: aggregate, L46: set up 30min window Issue: Noise in metadata

Slide 54

Slide 54 text

Filtering meaningless data › Inspect the notification and try to ignore even more meaningless data Issue: Noise in metadata Atlas hook Hive/Spark Kafka Atlas Notification hook consumer Notification

Slide 55

Slide 55 text

Inspect Atlas notification record more › The notification didn’t even include the specific DDL changes › We needed to find another solution Issue: Noise in metadata

Slide 56

Slide 56 text

Compared Hive DWH and our database Kafka consumer Atlas Notification hook consumer › Compared the up-to-date data in Hive DWH and our data catalog’s database › Figured out the difference by data change Compare the DDL Notification Hive DWH Our data catalog’s database Issue: Noise in metadata

Slide 57

Slide 57 text

Ignore the unnecessary change patterns › Analyzed the DDL changes to find the meaningless patterns › Ignored the 95% of DDL changes that don’t lead to user actions Issue: Noise in metadata
Change pattern | %
“transient_lastDdlTime” changed | …
hdfs location changed | …
column changed | …
else | …
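
Comparing the table definition in Hive with the copy in the catalog’s database, and then ignoring the uninteresting patterns, could be sketched like this. The dictionary layout (`columns`, `location`, `parameters`) is an assumption for illustration. Only the timestamp-only change is treated as ignorable here, since that is the one case the slides explicitly call valueless; which other patterns were ignored in practice is not stated.

```python
from typing import Any, Dict, Set

# Assumption: only a timestamp-only change is ignorable.
IGNORABLE_PATTERNS = {"transient_lastDdlTime changed"}
_TIMESTAMP_KEY = "transient_lastDdlTime"


def _rest(table: Dict[str, Any]) -> Dict[str, Any]:
    """Everything except columns, location and the DDL timestamp."""
    rest = {
        key: value
        for key, value in table.items()
        if key not in ("columns", "location", "parameters")
    }
    rest["parameters"] = {
        key: value
        for key, value in table.get("parameters", {}).items()
        if key != _TIMESTAMP_KEY
    }
    return rest


def change_patterns(old: Dict[str, Any], new: Dict[str, Any]) -> Set[str]:
    """Name the kinds of difference between two table definitions."""
    patterns: Set[str] = set()
    old_ts = old.get("parameters", {}).get(_TIMESTAMP_KEY)
    new_ts = new.get("parameters", {}).get(_TIMESTAMP_KEY)
    if old_ts != new_ts:
        patterns.add("transient_lastDdlTime changed")
    if old.get("location") != new.get("location"):
        patterns.add("hdfs location changed")
    if old.get("columns") != new.get("columns"):
        patterns.add("column changed")
    if _rest(old) != _rest(new):
        patterns.add("else")
    return patterns


def is_meaningful(old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """True when at least one non-ignorable pattern is present."""
    return bool(change_patterns(old, new) - IGNORABLE_PATTERNS)
```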

Slide 58

Slide 58 text

Summarize solutions Connect and aggregate effectively Various data sources Push-based Integration Overloading to data platform Aggregated differences by 30min Filtering meaningless data for users Noise in metadata Aggregating metadata issue

Slide 59

Slide 59 text

What are the results? Connect and aggregate effectively Various data sources Push-based Integration Overloading to data platform Noise in metadata Solutions for aggregating metadata Results Aggregated differences by 30min Filtering meaningless data for users

Slide 60

Slide 60 text

Results › Aggregated › the up-to-date metadata › from various data sources › without overloading to our data platform Solutions for aggregating metadata Data catalog’s data source

Slide 61

Slide 61 text

Results › Aggregated › while excluding data that is meaningless for users’ actions Atlas hook Hive/Spark Kafka Atlas Notification hook consumer Removed 95% noise Removed 90% noise Solutions for aggregating metadata Notification

Slide 62

Slide 62 text

Use case Find Understand Plan Run Notify search query › An analytics engineer changes a table’s DDL for ETL and notifies stakeholders announce Solutions for aggregating metadata

Slide 63

Slide 63 text

Familiarize and find a data Plan Run Notify search query announce Ranked by Search Solutions for aggregating metadata Find Understand

Slide 64

Slide 64 text

Understand and check the target table Plan Run Notify search query announce Table overview Data sources Solutions for aggregating metadata Find Understand

Slide 65

Slide 65 text

Confirm the dependencies Plan Run Notify search query announce Data lineage & report catalog Data sources Solutions for aggregating metadata Find Understand

Slide 66

Slide 66 text

Notify users of the data change Plan Run Notify search query announce Tools Notify Solutions for aggregating metadata Find Understand

Slide 67

Slide 67 text

User feedback › Analytics engineer › Easier to find the root cause for data debugging › Able to analyze the effect of a data change immediately › Data governance › More accurate and efficient monitoring of data usage › Planner › Easier to imagine how to use the data Solutions for aggregating metadata

Slide 68

Slide 68 text

Resolved aggregating metadata issues 1. Aggregating metadata 2. Displaying metadata Data source BI tools Query engine Other services ・・・ DB Users 1 2 Data catalog’s data source

Slide 69

Slide 69 text

Problem to introduce data catalog 1. Aggregating metadata 2. Displaying metadata Data Source BI tools Query engine Other services ・・・ DB 1 2 Users Data catalog’s data source

Slide 70

Slide 70 text

Displaying metadata issues › Display data lineage issues with Atlas 1. Atlas API performance issue 2. Usability issue 1 Users 2 Data catalog Data source

Slide 71

Slide 71 text

Displaying metadata issues 1 Users 2 Data catalog Data source › Display data lineage issues with Atlas 1. Atlas API performance issue 2. Usability issue

Slide 72

Slide 72 text

Atlas API performance issue › A call to the Atlas API took about 30 minutes 1. Lineage with many nodes 2. Nodes connected to the table increased each time a job was executed FYI. Node and depth Introduction Node Node Node Node Node Depth1 Depth2 Atlas hook Hive/Spark Atlas 1 2 Displaying metadata issue

Slide 73

Slide 73 text

Many nodes issue › A table has an upstream of 250 tables › Atlas hook registered lots of unnecessary tables to Atlas › e.g. tmp_profile_vdo.table e.g. Complexity of a lineage (table columns: Depth, Number of nodes) Atlas hook Hive/Spark Atlas temporary table, test data e.g. tmp_profile_vdo Atlas API performance issue

Slide 74

Slide 74 text

Nodes are increased each time a job runs › We also aggregate Hive/Spark process as a node › Nodes related to the temporary table generated for each hourly/daily job Process Node Generated for each hourly/daily job Atlas API performance issue

Slide 75

Slide 75 text

Displaying metadata issues › Display data lineage issues with Atlas 1. Atlas API performance issue 2. Usability issue 1 Users 2 Data catalog Data source

Slide 76

Slide 76 text

› Not flexible to change the lineage graph › Display the entire lineage when first accessed › Table and columns are displayed mixed Displaying metadata issue Usability issue

Slide 78

Slide 78 text

Displaying metadata issues › Display data lineage issues with Atlas 1. Atlas API performance issue 1. Table with many nodes 2. Nodes are increased each time a job runs 2. Usability issue 1 Users 2 Data catalog Data source

Slide 79

Slide 79 text

What are the solutions? Displaying metadata issue Solutions › Display data lineage issues with Atlas 1. Atlas API performance issue 1. Table with many nodes 2. Nodes are increased each time a job runs 2. Usability issue

Slide 80

Slide 80 text

Solutions for displaying metadata Reduce the number of nodes registered to Atlas Table with many nodes Deleted unnecessary processes Usability issue Atlas hook Hive/Spark Kafka Atlas Provide interactive UX that satisfy various use cases Displaying metadata issue Nodes increased each time a job was executed

Slide 81

Slide 81 text

Solutions for displaying metadata Reduce the number of nodes registered to Atlas Table with many nodes Atlas hook Hive/Spark Kafka Atlas Displaying metadata issue Deleted unnecessary processes Nodes increased each time a job was executed Usability issue Provide interactive UX that satisfy various use cases

Slide 82

Slide 82 text

Reduce the number of nodes registered › Atlas supports the config “atlas.hook.hive.hive_table.ignore.pattern” › Filtered out specific DBs/tables with regex patterns such as the following Atlas hook Hive/Spark Kafka Filter out Issue: Table with many nodes
Pattern | Type | Note
… (regex containing temp / tmp) | Table | Temporary
… | DB | Employee’s DB
・・・ | ・・・ | ・・・
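
To show the effect of such ignore patterns, here is a small Python sketch that decides which table names would be kept. The patterns below are made up for illustration and are not the ones from the slide. The real filtering happens inside the Atlas Hive hook, so the matching semantics shown here (a full match against a `db.table` name) are an assumption.

```python
import re
from typing import Iterable, List

# Hypothetical patterns, for illustration only.
IGNORE_PATTERNS = [
    re.compile(r".*\.(tmp|temp)_.*", re.IGNORECASE),  # temporary tables
    re.compile(r"sandbox_.*\..*"),  # every table in a personal scratch DB
]


def should_ignore(qualified_name: str) -> bool:
    """True when a db.table name matches any ignore pattern."""
    return any(p.fullmatch(qualified_name) for p in IGNORE_PATTERNS)


def tables_to_register(names: Iterable[str]) -> List[str]:
    """Keep only the names that no ignore pattern matches."""
    return [name for name in names if not should_ignore(name)]
```

Filtering at the hook means the ignored tables never reach Atlas, so they add no nodes to any lineage graph.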

Slide 83

Slide 83 text

Solutions for displaying metadata Reduce the number of nodes registered to Atlas Table with many nodes Atlas hook Hive/Spark Kafka Atlas Displaying metadata issue Deleted unnecessary processes Usability issue Provide interactive UX that satisfy various use cases Nodes increased each time a job was executed

Slide 84

Slide 84 text

Delete unnecessary processes › Couldn’t filter out process nodes › Need to delete some unnecessary process entities directly Atlas Batch Job Delete Issue: Nodes increased each time a job was executed
Rule
Have no inputs or outputs
Own the same inputs and outputs
・・・
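
The two legible rules could be expressed as a predicate that a batch job applies before deleting process entities. This is a sketch under assumptions: the dictionary shape (`guid`, `inputs`, `outputs`) is hypothetical, "no inputs or outputs" is read here as "neither inputs nor outputs", and the actual deletion call against Atlas is left out.

```python
from typing import Any, Dict, Iterable, List


def is_unnecessary_process(process: Dict[str, Any]) -> bool:
    """Apply the two rules from the slide to one process entity."""
    inputs = set(process.get("inputs", []))
    outputs = set(process.get("outputs", []))
    if not inputs and not outputs:
        return True  # rule 1: connects nothing
    return inputs == outputs  # rule 2: reads and writes the same tables


def select_processes_to_delete(processes: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the guids of the process entities a batch job would delete."""
    return [p["guid"] for p in processes if is_unnecessary_process(p)]
```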

Slide 85

Slide 85 text

Solutions for displaying metadata Reduce the number of nodes registered to Atlas Table with many nodes Atlas hook Hive/Spark Kafka Atlas Displaying metadata issue Deleted unnecessary processes Usability issue Provide interactive UX that satisfy various use cases Nodes increased each time a job was executed

Slide 86

Slide 86 text

Provide interactive UX › Return 3 levels of depth by default › Provide interactive UX › Dig into column-level lineage › Expand the displayed lineage graph by clicking a node › Drag&Drop › Zoom in/out Issue: Usability issue
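
Returning only a few levels of lineage and expanding on demand amounts to a depth-limited breadth-first traversal. The sketch below illustrates the idea on an in-memory graph. The adjacency-dictionary input and the function name are assumptions, and it does not show how the catalog actually queries Atlas.

```python
from collections import deque
from typing import Dict, List


def lineage_within_depth(
    edges: Dict[str, List[str]],
    start: str,
    max_depth: int = 3,
) -> Dict[str, int]:
    """Return each node reachable from start within max_depth hops,
    mapped to its distance from start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in edges.get(node, []):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths
```

Clicking a node to expand the graph then corresponds to calling the same function again with that node as `start` and a small `max_depth`.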

Slide 87

Slide 87 text

Solutions for displaying metadata Reduce the number of nodes registered to Atlas Table with many nodes Atlas hook Hive/Spark Kafka Atlas Displaying metadata issue Deleted unnecessary processes Usability issue Provide interactive UX that satisfy various use cases Nodes increased each time a job was executed

Slide 88

Slide 88 text

What are the Results? Solutions for displaying metadata Results Reduce the number of nodes registered to Atlas Table with many nodes Deleted unnecessary processes Usability issue Provide interactive UX that satisfy various use cases Nodes increased each time a job was executed

Slide 89

Slide 89 text

Results › The Atlas API typically responds within 6 seconds, and within 15 seconds at worst › The number of tables registered in Atlas has been reduced by 90% › 300,000 nodes are deleted daily by the batch job › Supports displaying end-to-end lineage and deep-diving into column-level impact analysis Solutions for displaying metadata

Slide 90

Slide 90 text

Confirm column-level dependencies Plan Run Notify search query announce Column level lineage Data sources Solutions for displaying metadata Find Understand

Slide 91

Slide 91 text

User feedback › Analytics engineer, backend engineer › When data ingestion is delayed, it’s immediately clear which services are affected › Data governance › Able to identify the users who are involved in the table’s generation end to end › Easier to understand tables generated from a column containing important data from a governance perspective Solutions for displaying metadata

Slide 92

Slide 92 text

Contributed to OSS › Atlas to skip external temporary table created in hive › https://issues.apache.org/jira/browse/ATLAS-4492 Solutions for displaying metadata

Slide 93

Slide 93 text

Reference › How we introduced Data Lineage to LINE’s large-scale Data Platform (LINEの大規模なData PlatformにData Lineageを導入した話) › https://engineering.linecorp.com/ja/blog/data-lineage-on-line-big-data-platform/ Solutions for displaying metadata

Slide 94

Slide 94 text

Resolved displaying metadata issue 1. Aggregating metadata 2. Displaying metadata Data Source BI tools Query engine Other services ・・・ DB 1 2 Users Data catalog’s data source

Slide 95

Slide 95 text

Agenda › LINE’s data platform › Problem to introduce data catalog 1. Aggregating metadata issues 2. Displaying metadata issues › Summary › Future work

Slide 96

Slide 96 text

Summary › Need to provide various metadata as users become more diverse › Not only aggregate metadata, but also inspect it and filter out data that is meaningless for users › Needed some adjustments to introduce Atlas When introducing OSS, it’s important to understand the user needs and your own data platform

Slide 97

Slide 97 text

Future work › For a wider range of uses › Generate statistical data from queries and use for cost optimization on data platform › Metadata everywhere › Allow users to view metadata on various BI tools › More personalized experience › Reduce the time for searching data, increase efficiency and productivity

Slide 98

Slide 98 text

End

Slide 99

Slide 99 text

Thank you