Developing a Data Catalog to Promote Data Usage on the Data Platform

Naoto Udagawa (LINE / Data Platform Department / Senior Product Manager)

https://tech-verse.me/ja/sessions/37
https://tech-verse.me/en/sessions/37
https://tech-verse.me/ko/sessions/37

Tech-Verse2022

November 17, 2022

Transcript

  1. Overview
    › Introduction of our data catalog
    › Motivation for and benefits of an in-house data catalog
    › How we developed the data catalog

  2. Target audience
    People who are interested in
    › Operating a data catalog
    › Developing a data catalog

  3. Agenda
    › LINE’s data platform
    › Problems in introducing a data catalog
    › Solution
    › Summary
    › Future work

  4. Agenda
    › LINE’s data platform
    › Problems in introducing a data catalog
    › Solution
    › Summary
    › Future work

  5. Democratize data for businesses
    200+ services, including internal services
    Data platform
    Governance
    Data science
    Machine learning
    Planning
    LINE’s self-serve data platform

  6. Scale of data use
    400 PB HDFS capacity
    40,000 Hive tables
    150,000 jobs/day
    LINE’s data platform

  7. Our users
    Always data driven
    Job               Tasks                    Tools
    Data scientist    Data analysis            OASIS (BI tool)
    Backend engineer  Service monitoring, ETL  Kibana, OASIS (BI tool)
    Planner           KPI analysis             Tableau, OASIS (BI tool)
    LINE’s data platform

  8. Tasks before data activity
    Find Understand Get Permission Plan Run
    search query

  9. Need to know before data activities
    When is it updated?
    What's the difference between uid and cid?
    Which table is important?
    Similar table names… Which is appropriate?
    What are the data dependencies?
    Who is the data owner?
    What is the effect on others?
    Find Understand Get Permission Plan Run
    search query

  10. LINE’s data catalog
    › Mission
    › Data democratization for businesses
    › Values
    › One-stop solution for data activities
    › Core features
    › Search data
    › Access control
    › Metadata management
    › Exploratory data analysis

  11. Agenda
    › LINE’s data platform
    › Problems in introducing a data catalog
    › Solution
    › Summary
    › Future work

  12. Problems in introducing a data catalog
    1. Aggregating metadata
    2. Displaying metadata
    Data source
    BI tools
    Query engine
    Other services
    ・・・
    DB
    1 2
    Users
    Data catalog’s
    data source

  13. Problems in introducing a data catalog
    1. Aggregating metadata
    2. Displaying metadata
    Data source
    BI tools
    Query engine
    Other services
    ・・・
    DB
    Users
    1 2
    Data catalog’s
    data source

  14. Aggregating metadata issues
    1. Various data sources
    2. Overloading to data platform
    3. Noise in metadata
    1
    2 3
    Data catalog’s
    data source

  15. Difficult to aggregate from various sources
    › Need to provide a variety of metadata for users’ better understanding of data
    › Had to aggregate data from 10+ services, including internal tools on the data platform
    Data source
    BI tools
    Hive DWH
    DB
    Query engine
    Metadata
    Aggregating metadata issue
    Data catalog’s
    data source

  16. Overloading to our data platform
    › A typical data catalog runs a query on every table to get metadata
    › A table can have a tremendous number of partitions on our data platform
    Hive DWH
    DB
    Metadata
    Aggregating metadata issue
    Data catalog’s
    data source

  17. Get up-to-date data without overloading
    › A typical data catalog runs a query on every table to get metadata
    › Need to provide as up-to-date metadata as possible
    › The same data is used by users in various departments at the same time
    Data Users
    yyy dept.
    Users
    xxx dept.
    Queries Queries
    Aggregating metadata issue

  18. Noise in metadata
    › 300 to 500 DBs/tables are changed within a few minutes
    › It’s important not to aggregate metadata that doesn’t lead to users’ action
    Hive DWH
    DB
    Metadata
    Data catalog
    Aggregating metadata issue

  19. Noise in metadata
    › Metadata generated by user error
    › e.g. a table is created by mistake and deleted immediately
    Aggregating metadata issue
    Data catalog
    Create/Alter/Drop tables

  20. Noise in metadata
    › Some metadata does not lead to users’ action
    Changing only the timestamp is of no value
    Aggregating metadata issue

  21. Summarize issues
    1
    2 3
    1. Various data sources
    2. Overloading to data platform
    3. Noise in metadata
    Aggregating metadata issue
    Data catalog’s
    data source

  22. What are the solutions?
    1. Various data source
    2. Overloading to data platform
    3. Noise in metadata
    Aggregating metadata issue
    Solutions

  23. Solutions for aggregating metadata
    Connect and aggregate effectively
    Various data sources
    Push-based ingestion
    Overloading to data platform
    Noise in metadata
    Aggregating metadata issue
    Aggregated differences by 30min
    Filtering meaningless data for users

  24. Solutions for aggregating metadata
    Connect and aggregate effectively
    Various data sources
    Push-based Integration
    Overloading to data platform
    Noise in metadata
    Aggregating metadata issue
    Aggregated differences by 30min
    Filtering meaningless data for users

  25. Connect and aggregate effectively
    Data source
    BI tools
    Hive DWH
    DB
    Query engine
    Data ingestion
    Streaming
    Batch
    Data catalog
    In-house data catalog
    Access
    control
    Metadata
    management
    Query
    editor
    Data lineage/
    Change data capture
    Platforms
    ・Hadoop
    ・On-premises
    Data governance tools
    Internal employee info API
    Internal workflow
    Issue: Various data sources

  26. Connect and aggregate effectively
    Data source
    BI tools
    Hive DWH
    DB
    Query engine
    Data ingestion
    Streaming
    Batch
    Data catalog
    In-house data catalog
    Access
    control
    Metadata
    management
    Query
    editor
    Data lineage/
    Change data capture
    Platforms
    ・Hadoop
    ・On-premises
    Data governance tools
    Internal employee info API
    Internal workflow
    Issue: Various data sources

  27. Connect and aggregate effectively
    Data source
    BI tools
    Hive DWH
    DB
    Query engine
    Data ingestion
    Streaming
    Batch
    Data catalog
    In-house data catalog
    Access
    control
    Metadata
    management
    Query
    editor
    Data lineage/
    Change data capture
    Platforms
    ・Hadoop
    ・On-premises
    Data governance tools
    Internal employee info API
    Internal workflow
    Issue: Various data sources

  28. Connect and aggregate effectively
    Data source
    BI tools
    Hive DWH
    DB
    Query engine
    Data ingestion
    Streaming
    Batch
    Data catalog
    In-house data catalog
    Access
    control
    Metadata
    management
    Query
    editor
    Data lineage/
    Change data capture
    Platforms
    ・Hadoop
    ・On-premises
    Data governance tools
    Internal employee info API
    Internal workflow
    Issue: Various data sources

  29. Connect and aggregate effectively
    Data source
    BI tools
    Hive DWH
    DB
    Query engine
    Data ingestion
    Streaming
    Batch
    Data catalog
    In-house data catalog
    Access
    control
    Metadata
    management
    Query
    editor
    Data lineage/
    Change data capture
    Platforms
    ・Hadoop
    ・On-premises
    Data governance tools
    Internal employee info API
    Internal workflow
    Issue: Various data sources

  30. Deep dive the aggregation
    Data source
    BI tools
    Hive DWH
    DB
    Query engine
    Data ingestion
    Streaming
    Batch
    Data catalog
    In-house data catalog
    Access
    control
    Metadata
    management
    Query
    editor
    Data lineage/
    Change data capture
    Platforms
    ・Hadoop
    ・On-premises
    Data governance tools
    Internal employee info API
    Internal workflow
    Issue: Various data sources

  31. Aggregate from BI tools
    Data source
    BI tools
    Hive DWH
    DB
    Query engine
    Data ingestion
    Streaming
    Batch
    Data catalog
    In-house data catalog
    Access
    control
    Metadata
    management
    Query
    editor
    Data lineage/
    Change data capture
    Platforms
    ・Hadoop
    ・On-premises
    Data governance tools
    Internal employee info API
    Internal workflow
    › Feature introduction
    › Displaying data
    › How to aggregate data
    Issue: Various data sources

  32. Report catalog
    › Data sources
    › Tableau, OASIS (BI tool)
    › Data governance
    › Monitor data usage
    › Prioritize data for data management
    › Data enablement
    › Reference for how to use data
    › Analyze which reports are affected by a data change
    Issue: Various data sources

  33. Displaying data
    › Report info
    › Name
    › URL
    › Description
    › PV
    › Timestamp
    › Author name
    › Users
    Issue: Various data sources

  34. Aggregate from BI tools
    › A batch job aggregates reports from the BI tools, parses the SQL in each report, and
    connects the report with the tables it references
    Data source Get query
    in reports
    Data Catalog
    Extract tables
    from reports
    Issue: Various data sources
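
The flow on this slide (get the query in each report, extract the tables, connect them) could be sketched as follows. This is a minimal illustration and not the catalog's actual implementation: the regex, the function names `extract_tables` and `link_reports_to_tables`, and the sample report names are all assumptions, and a production system would more likely use a real SQL parser.

```python
from __future__ import annotations

import re

# Hypothetical, simplified pattern: an identifier (optionally db.table)
# that directly follows FROM or JOIN. Subqueries and CTEs are not handled.
_TABLE_REF = re.compile(
    r"\b(?:from|join)\s+([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)?)",
    re.IGNORECASE,
)


def extract_tables(sql: str) -> set[str]:
    """Return the lower-cased table names referenced after FROM/JOIN."""
    return {name.lower() for name in _TABLE_REF.findall(sql)}


def link_reports_to_tables(reports: dict[str, str]) -> dict[str, set[str]]:
    """Invert {report name: SQL} into {table name: report names}."""
    table_to_reports: dict[str, set[str]] = {}
    for report_name, sql in reports.items():
        for table in extract_tables(sql):
            table_to_reports.setdefault(table, set()).add(report_name)
    return table_to_reports


if __name__ == "__main__":
    # Sample reports are made up for illustration.
    sample = {
        "daily_kpi": "SELECT * FROM sales.orders o JOIN sales.users u ON o.uid = u.uid",
        "churn": "select uid from sales.users",
    }
    for table, names in sorted(link_reports_to_tables(sample).items()):
        print(table, sorted(names))
```

The inverted mapping is what makes "which reports are affected if this table changes" a single lookup.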

  35. Aggregate from Apache Atlas
    Data source
    BI tools
    Hive DWH
    DB
    Query engine
    Data ingestion
    Streaming
    Batch
    Data catalog
    In-house data catalog
    Access
    control
    Metadata
    management
    Query
    editor
    Data lineage/
    Change data capture
    Platforms
    ・Hadoop
    ・On-premises
    Data governance tools
    Internal employee info API
    Internal workflow
    › Feature introduction
    › Displaying data
    › How to aggregate data
    Issue: Various data sources

  36. Data lineage
    › Data source
    › Atlas (Hive, Spark)
    › Data governance
    › Audit data generation process
    › Data enablement
    › Understand data dependencies
    › Impact analysis of data changes
    › Make data debugging easier
    Issue: Various data sources

  37. Displaying data
    › Table description, timestamps
    › Table relationships
    › PII
    › Data owner
    › Organization that uses the table
    › Reports generated from the data
    › User lists
    › Links: Wiki, GitHub, Airflow, …
    Issue: Various data sources

  38. Apache Atlas
    › OSS data catalog
    › Stores metadata generated by data
    modification by Apache Hive/Spark
    › Reasons for selection
    › Collaboration with Ranger
    › Cloudera support
    › More stable than other approaches
    Issue: Various data sources

  39. Data lineage for Hive
    Atlas hook
    Hive
    Kafka Atlas Data catalog
    Metadata entity
    Atlas hook
    › The Atlas hook in Hive saves metadata to Atlas via Kafka
    Create/Alter
    /Drop
    table
    Issue: Various data sources
    REST API
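
As a rough sketch of the last hop in this diagram, a catalog backend could build a request against Atlas's v2 lineage REST endpoint (`/api/atlas/v2/lineage/{guid}` with `depth` and `direction` query parameters). The slide does not say which endpoint or parameters the data catalog actually uses, so the helper name `lineage_url`, the example host, and the defaults below are assumptions.

```python
from urllib.parse import urlencode

# Direction values accepted by the Atlas lineage endpoint.
_DIRECTIONS = {"INPUT", "OUTPUT", "BOTH"}


def lineage_url(base_url: str, guid: str, depth: int = 3, direction: str = "BOTH") -> str:
    """Build the URL for a lineage lookup of one entity (hypothetical helper)."""
    if direction not in _DIRECTIONS:
        raise ValueError(f"direction must be one of {sorted(_DIRECTIONS)}")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    query = urlencode({"depth": depth, "direction": direction})
    return f"{base_url.rstrip('/')}/api/atlas/v2/lineage/{guid}?{query}"


if __name__ == "__main__":
    # The host and GUID are placeholders, not real values.
    print(lineage_url("http://atlas.example.com:21000", "abc-123"))
```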

  40. Data lineage for Spark
    Issue: Various data sources
    › Atlas doesn’t support Spark, so we introduced the Atlas Spark connector
    Atlas hook
    Spark
    Kafka Atlas Data catalog
    Metadata entity
    Atlas Spark hook
    Atlas hook
    Hive Warehouse Metadata entity
    Atlas hook
    Create/Alter
    /Drop
    table

  41. Aggregate other metadata with lineage
    Atlas hook
    Hive/Spark
    Kafka Atlas Data catalog
    Metadata entity
    Atlas hook
    Other
    data sources
    Issue: Various data sources
    › Aggregate from other data sources to display data lineage

  42. Connect and aggregate effectively
    Issue: Various data sources
    Data source
    BI tools
    Hive DWH
    DB
    Query engine
    Data ingestion
    Streaming
    Batch
    Data catalog
    In-house data catalog
    Access
    control
    Metadata
    management
    Query
    editor
    Data lineage/
    Change data capture
    Platforms
    ・Hadoop
    ・On-premises
    Data governance tools
    Internal employee info API
    Internal workflow

  43. Solutions for aggregating metadata
    Connect and aggregate effectively
    Various data sources
    Push-based Integration
    Overloading to data platform
    Noise in metadata
    Aggregating metadata issue
    Aggregated differences by 30min
    Filtering meaningless data for users

  44. Apache Atlas
    › OSS data catalog
    › Stores metadata generated by data
    modification by Apache Hive/Spark
    › Features
    › Lineage
    › Notification
    Issue: Overloading to data platform

  45. Atlas notification
    › Atlas sends notifications about metadata changes to a Kafka topic named
    ATLAS_ENTITIES
    Issue: Overloading to data platform
    Atlas entities
    Atlas hook
    Atlas hook
    Hive/Spark
    Kafka
    Atlas
    Notification
    hook
    consumer
    Notification
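
A consumer of the ATLAS_ENTITIES topic has to turn each record into something the catalog can act on. The sketch below only shows the parsing step on a JSON string, so it runs without a Kafka client. The field names (`message`, `entity`, `typeName`, `attributes.qualifiedName`, `operationType`, `eventTime`) follow our understanding of Atlas's V2 entity notification format and should be treated as assumptions. `TableChange` and `parse_notification` are hypothetical names.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableChange:
    qualified_name: str   # e.g. "db.table@cluster"
    operation: str        # e.g. "ENTITY_CREATE", "ENTITY_UPDATE", "ENTITY_DELETE"
    event_time_ms: int


def parse_notification(raw: str) -> Optional[TableChange]:
    """Extract a table-level change from one notification record.

    Returns None for records that are not about a Hive table or that
    lack a qualified name.
    """
    message = json.loads(raw).get("message", {})
    entity = message.get("entity", {})
    if entity.get("typeName") != "hive_table":
        return None
    qualified_name = entity.get("attributes", {}).get("qualifiedName")
    if qualified_name is None:
        return None
    return TableChange(
        qualified_name=qualified_name,
        operation=message.get("operationType", "UNKNOWN"),
        event_time_ms=int(message.get("eventTime", 0)),
    )
```

In a real consumer, each record value read from the topic would be passed to `parse_notification`, and `None` results would be skipped.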

  46. Atlas notification
    › Reasons for selection
    › Need to provide as up-to-date metadata as possible
    › Avoid sending unnecessary requests to our data platform
    › We will make more use of Atlas for metadata change detection
    › Other candidates
    › Hive Metastore Event Listener
    › Need to set up Hive Metastore Event Listener
    › Events are almost raw data and cover too many objects, so they need filtering
    › Atlas REST API
    › Selected the streaming approach to meet our use case and reduce load on Atlas
    › We use Atlas notification for updating our search feature’s index
    Issue: Overloading to data platform

  47. Solutions for aggregating metadata
    Connect and Aggregate effectively
    Various data sources
    Push-based ingestion
    Overloading to data platform
    Noise in metadata
    Aggregating metadata issue
    Aggregated differences by 30min
    Filtering meaningless data for users

  48. Aggregated differences by 30min
    › Introduced a 30-minute window for the Kafka stream and aggregated the consumed data
    Issue: Noise in metadata
    Atlas hook
    Hive/Spark
    Kafka
    Atlas
    Notification
    hook
    consumer
    Notification

  49. Aggregated differences by 30min
    › Treated multiple changes to a table within 30 minutes as 1 change
    › Not only aggregated, but also filtered out duplicate and certain other changes
    Issue: Noise in metadata
    Create/Update/Delete a Table
    Kafka
    Atlas
    Notification
    Hook
    Consumer
    Aggregate and filter out
    Notification
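
The idea of treating multiple changes to a table within 30 minutes as one change can be illustrated with a fixed (tumbling) window keyed by table. This is only a sketch under assumptions: the talk does not show which windowing API or window type is used, and the event tuple layout and the name `aggregate_by_window` are hypothetical.

```python
from __future__ import annotations

WINDOW_MS = 30 * 60 * 1000  # 30-minute window, as stated on the slide


def aggregate_by_window(
    events: list[tuple[str, str, int]],
    window_ms: int = WINDOW_MS,
) -> dict[tuple[int, str], list[str]]:
    """Group (table, operation, event_time_ms) events per table and window.

    Each key is (window start in ms, table). Consecutive duplicate
    operations inside a window are dropped, so ten UPDATE events on the
    same table within 30 minutes collapse into a single UPDATE.
    """
    windows: dict[tuple[int, str], list[str]] = {}
    for table, operation, event_time_ms in sorted(events, key=lambda e: e[2]):
        window_start = event_time_ms - event_time_ms % window_ms
        operations = windows.setdefault((window_start, table), [])
        if not operations or operations[-1] != operation:
            operations.append(operation)
    return windows
```

Each resulting key then yields at most one downstream update for the catalog, instead of one per raw notification.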

  50. Added logic to ignore
    › If a table is created and then deleted within a very short time, it’s treated as a
    user error
    User behavior                                  Handling in our logic
    Create a table → Delete within … minutes       Ignore
    Create a table → … minutes elapsed → Delete    Create → Delete
    Issue: Noise in metadata
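
The ignore rule can be expressed as a small function that collapses one table's ordered operations within a window into a single change, or into nothing when the table was created and deleted in the same window. The operation names and the function name `collapse_changes` are assumptions for illustration. The real logic is not shown in the transcript.

```python
from __future__ import annotations

from typing import Optional


def collapse_changes(operations: list[str]) -> Optional[str]:
    """Collapse one table's ordered operations in a window to one change.

    Returns None when the change should be ignored.
    """
    if not operations:
        return None
    created = operations[0] == "CREATE"
    deleted = operations[-1] == "DELETE"
    if created and deleted:
        # Created and deleted inside the same window: treated as a user
        # error, so the catalog never hears about the table.
        return None
    if deleted:
        return "DELETE"
    if created:
        return "CREATE"
    # Anything else (including drop-and-recreate) is reported as an update.
    return "UPDATE"
```

A table that is deleted only after the window has elapsed shows up as a CREATE in one window and a DELETE in a later one, matching the second row of the table on the slide.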

  51. Inspect Atlas notification record and filtered
    › Atlas notification record contains the following information about the
    changed table
    Issue: Noise in metadata

  52. Code introduction in the aggregation
    › L41~43: Filtered out, L45: aggregate, L46: set up 30min window
    Issue: Noise in metadata

  53. Filtering meaningless data
    › Inspect the notification and try to ignore even more meaningless data
    Issue: Noise in metadata
    Atlas hook
    Hive/Spark
    Kafka
    Atlas
    Notification
    hook
    consumer
    Notification

  54. Inspect Atlas notification record more
    › The notification didn’t even include the specific DDL changes
    › We needed to find another solution
    Issue: Noise in metadata

  55. Compared Hive DWH and our database
    Kafka consumer
    Atlas
    Notification
    hook
    consumer
    › Compared the up-to-date data in Hive DWH and our data catalog’s database
    › Identified the differences caused by each data change
    Compare
    the DDL
    Notification
    Hive DWH
    Our data
    catalog’s
    database
    Issue: Noise in metadata
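
Because the notification itself does not say what changed, the comparison step can be pictured as a diff between the table definition stored in the catalog's database and the current one in Hive. The sketch below assumes both are available as flat dictionaries and ignores the Hive table parameter `transient_lastDdlTime`, since a timestamp-only change is of no value to users. The dictionary layout and the name `meaningful_diff` are hypothetical.

```python
from __future__ import annotations

# Keys whose changes alone should not trigger a catalog update.
# Only the timestamp is listed here. Other ignorable patterns are unknown.
IGNORED_KEYS = frozenset({"transient_lastDdlTime"})


def meaningful_diff(
    stored: dict[str, object],
    current: dict[str, object],
    ignored_keys: frozenset[str] = IGNORED_KEYS,
) -> dict[str, tuple[object, object]]:
    """Return {key: (stored value, current value)} for keys that differ."""
    keys = (stored.keys() | current.keys()) - ignored_keys
    return {
        key: (stored.get(key), current.get(key))
        for key in sorted(keys)
        if stored.get(key) != current.get(key)
    }
```

An empty result means the notification can be dropped without touching the catalog.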

  56. Ignore the unnecessary change patterns
    › Analyzed the DDL changes and identified the meaningless change patterns
    › Ignored the 95% of DDL changes that don’t lead to user actions
    Issue: Noise in metadata
    Change pattern                     %
    “transient_lastDdlTime” changed
    hdfs location changed
    column changed
    else

  57. Summarize solutions
    Connect and aggregate effectively
    Various data sources
    Push-based Integration
    Overloading to data platform
    Aggregated differences by 30min
    Filtering meaningless data for users
    Noise in metadata
    Aggregating metadata issue

  58. What are the results?
    Connect and aggregate effectively
    Various data sources
    Push-based Integration
    Overloading to data platform
    Noise in metadata
    Solutions for aggregating metadata
    Results
    Aggregated differences by 30min
    Filtering meaningless data for users

  59. Results
    › Aggregated
    › the up-to-date metadata
    › from various data sources
    › without overloading our data platform
    Solutions for aggregating metadata
    Data catalog’s
    data source

  60. Results
    › Aggregated
    › while excluding data that is meaningless for users’ actions
    Atlas hook
    Hive/Spark
    Kafka
    Atlas
    Notification
    hook
    consumer
    Removed 95%
    noise
    Removed 90%
    noise
    Solutions for aggregating metadata
    Notification

  61. Use case
    Find Understand Plan Run Notify
    search query
    › An analytics engineer changes a table’s DDL for ETL and notifies stakeholders
    announce
    Solutions for aggregating metadata

  62. Get familiar with and find data
    Plan Run Notify
    search query announce
    Ranked by
    Search
    Solutions for aggregating metadata
    Find Understand

  63. Understand and check the target table
    Plan Run Notify
    search query announce
    Table overview Data sources
    Solutions for aggregating metadata
    Find Understand

  64. Confirm the dependencies
    Plan Run Notify
    search query announce
    Data lineage & report catalog Data sources
    Solutions for aggregating metadata
    Find Understand

  65. Notify users of the data change
    Plan Run Notify
    search query announce
    Tools
    Notify
    Solutions for aggregating metadata
    Find Understand

  66. User feedback
    › Analytics engineer
    › Easier to find the root cause for data debugging
    › Can immediately analyze the effect of a data change
    › Data governance
    › More accurate and efficient monitoring of data usage
    › Planner
    › Easier to imagine how to use the data
    Solutions for aggregating metadata

  67. Resolved aggregating metadata issues
    1. Aggregating metadata
    2. Displaying metadata
    Data source
    BI tools
    Query engine
    Other services
    ・・・
    DB
    Users
    1 2
    Data catalog’s
    data source

  68. Problems in introducing a data catalog
    1. Aggregating metadata
    2. Displaying metadata
    Data Source
    BI tools
    Query engine
    Other services
    ・・・
    DB
    1 2
    Users
    Data catalog’s
    data source

  69. Displaying metadata issues
    › Display data lineage issues with Atlas
    1. Atlas API performance issue
    2. Usability issue
    1
    Users
    2
    Data catalog
    Data source

  70. Displaying metadata issues
    1
    Users
    2
    Data catalog
    Data source
    › Display data lineage issues with Atlas
    1. Atlas API performance issue
    2. Usability issue

  71. Atlas API performance issue
    › A call to the Atlas API took about 30 minutes
    1. Lineage with many nodes
    2. Nodes connected to the table increased each time a job was executed
    FYI. Node and depth Introduction
    Node
    Node
    Node
    Node
    Node
    Depth1 Depth2
    Atlas hook
    Hive/Spark
    Atlas
    1
    2
    Displaying metadata issue

  72. Many nodes issue
    › One table had 250 upstream tables
    › Atlas hook registered lots of unnecessary tables to Atlas
    › e.g. tmp_profile_vdo.table
    e.g. Complexity of a lineage
    Depth / Number of nodes
    Atlas hook
    Hive/Spark
    Atlas
    temporary table, test data
    e.g. tmp_profile_vdo
    Atlas API performance issue

  73. Nodes are increased each time a job runs
    › We also aggregate Hive/Spark process as a node
    › Nodes related to temporary tables are generated for each hourly/daily job
    Process Node
    Generated for each hourly/daily job
    Atlas API performance issue

  74. Displaying metadata issues
    › Display data lineage issues with Atlas
    1. Atlas API performance issue
    2. Usability issue
    1
    Users
    2
    Data catalog
    Data source

  75. Usability issue
    › Not flexible to change the lineage graph
    › Displays the entire lineage on first access
    › Tables and columns are displayed mixed together
    Displaying metadata issue

  76. Usability issue
    › Not flexible to change the lineage graph
    › Displays the entire lineage on first access
    › Tables and columns are displayed mixed together
    Displaying metadata issue

  77. Displaying metadata issues
    › Display data lineage issues with Atlas
    1. Atlas API performance issue
    1. Table with many nodes
    2. Nodes are increased each time a job runs
    2. Usability issue
    1
    Users
    2
    Data catalog
    Data source

  78. What are the solutions?
    Displaying metadata issue
    Solutions
    › Display data lineage issues with Atlas
    1. Atlas API performance issue
    1. Table with many nodes
    2. Nodes are increased each time a job runs
    2. Usability issue

  79. Solutions for displaying metadata
    Reduce the number of nodes registered to Atlas
    Table with many nodes
    Deleted unnecessary processes
    Usability issue
    Atlas hook
    Hive/Spark
    Kafka
    Atlas
    Provide interactive UX that satisfies various use cases
    Displaying metadata issue
    Nodes increased each time a job was executed

  80. Solutions for displaying metadata
    Reduce the number of nodes registered to Atlas
    Table with many nodes
    Atlas hook
    Hive/Spark
    Kafka
    Atlas
    Displaying metadata issue
    Deleted unnecessary processes
    Nodes increased each time a job was executed
    Usability issue
    Provide interactive UX that satisfies various use cases

  81. Reduce the number of nodes registered
    › Atlas supports the config “atlas.hook.hive.hive_table.ignore.pattern”
    › Filtered out specific DBs/tables with the following regex patterns
    Atlas hook
    Hive/Spark
    Kafka
    Pattern           Type    Note
    …temp|tmp…        Table   Temporary
    …\d{…}…           DB      Employee’s DB
    ・・・               ・・・     ・・・
    Filter out
    Issue: Table with many nodes
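
The effect of such ignore patterns can be illustrated in plain Python. The two regexes below are hypothetical stand-ins (one for temporary tables such as `tmp_profile_vdo`, one for per-employee databases) and are not the patterns actually configured. How Atlas itself applies `atlas.hook.hive.hive_table.ignore.pattern` may differ from Python's `re` semantics, and the name `should_register` is made up for this sketch.

```python
import re

# Hypothetical ignore patterns, matched against "db.table" names.
IGNORE_PATTERNS = [
    re.compile(r"^[^.]+\.(tmp|temp)_.*$", re.IGNORECASE),  # temporary tables
    re.compile(r"^user_\d+\..*$"),                         # per-employee DBs
]


def should_register(qualified_table: str) -> bool:
    """Return False when the table matches any ignore pattern."""
    return not any(p.match(qualified_table) for p in IGNORE_PATTERNS)


if __name__ == "__main__":
    for name in ("analytics.tmp_profile_vdo", "analytics.user_profile", "user_12345.scratch"):
        print(name, should_register(name))
```

Filtering at the hook keeps these tables out of Atlas entirely, which is why it shrinks the lineage graph rather than just hiding nodes in the UI.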

  82. Solutions for displaying metadata
    Reduce the number of nodes registered to Atlas
    Table with many nodes
    Atlas hook
    Hive/Spark
    Kafka
    Atlas
    Displaying metadata issue
    Deleted unnecessary processes
    Usability issue
    Provide interactive UX that satisfies various use cases
    Nodes increased each time a job was executed

  83. Delete unnecessary processes
    › Couldn’t filter out process nodes
    › Need to delete some unnecessary process entities directly
    Atlas
    Batch Job
    Rule
    Have no inputs or outputs
    Own the same inputs and outputs
    ・・・
    Delete
    Issue: Nodes increased each time a job was executed
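
A batch job that selects process entities for deletion might apply rules like the ones on this slide. The sketch below is one possible reading and not the actual job: a process with neither inputs nor outputs is dropped, and when several processes share the same inputs and outputs (for example, one per hourly run), only the first is kept. `ProcessEntity` and `select_for_deletion` are hypothetical names, and the actual deletion call against Atlas is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessEntity:
    guid: str
    inputs: frozenset[str]
    outputs: frozenset[str]


def select_for_deletion(processes: list[ProcessEntity]) -> list[str]:
    """Return the GUIDs of process entities considered unnecessary."""
    seen: set[tuple[frozenset[str], frozenset[str]]] = set()
    to_delete: list[str] = []
    for process in processes:
        key = (process.inputs, process.outputs)
        if not process.inputs and not process.outputs:
            # Assumed rule 1: connects nothing, so it adds no lineage.
            to_delete.append(process.guid)
        elif key in seen:
            # Assumed rule 2: duplicates an already-kept process.
            to_delete.append(process.guid)
        else:
            seen.add(key)
    return to_delete
```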

  84. Solutions for displaying metadata
    Reduce the number of nodes registered to Atlas
    Table with many nodes
    Atlas hook
    Hive/Spark
    Kafka
    Atlas
    Displaying metadata issue
    Deleted unnecessary processes
    Usability issue
    Provide interactive UX that satisfies various use cases
    Nodes increased each time a job was executed

  85. Provide interactive UX
    › Respond with 3 depths by default
    › Provide interactive UX
    › Dig column-level lineage
    › Expand the displaying lineage graph
    by clicking a node
    › Drag&Drop
    › Zoom in/out
    Issue: Usability issue
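
Returning only a few depths first and expanding on click amounts to a depth-limited breadth-first traversal of the lineage graph. The following sketch assumes the lineage is available as an adjacency mapping. The graph shape and the name `lineage_within_depth` are illustrative, not taken from the catalog's code.

```python
from __future__ import annotations

from collections import deque


def lineage_within_depth(
    graph: dict[str, list[str]],
    start: str,
    max_depth: int = 3,
) -> dict[str, int]:
    """Return {node: depth} for every node within max_depth of start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(node, []):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths
```

Expanding the graph by clicking a node then corresponds to calling the same function again with the clicked node as `start` and merging the result into what is already displayed.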

  86. Solutions for displaying metadata
    Reduce the number of nodes registered to Atlas
    Table with many nodes
    Atlas hook
    Hive/Spark
    Kafka
    Atlas
    Displaying metadata issue
    Deleted unnecessary processes
    Usability issue
    Provide interactive UX that satisfies various use cases
    Nodes increased each time a job was executed

  87. What are the Results?
    Solutions for displaying metadata
    Results
    Reduce the number of nodes registered to Atlas
    Table with many nodes
    Deleted unnecessary processes
    Usability issue
    Provide interactive UX that satisfies various use cases
    Nodes increased each time a job was executed

  88. Results
    › The Atlas API now responds within 6 seconds in most cases, and within 15 seconds at worst
    › The number of tables registered in Atlas has been reduced by 90%
    › 300,000 nodes deleted daily with the batch Job
    › Supports displaying end-to-end lineage and drilling down into column-level impact analysis
    Solutions for displaying metadata

  89. Confirm column-level dependencies
    Plan Run Notify
    search query announce
    Column level lineage Data sources
    Solutions for displaying metadata
    Find Understand

  90. User feedback
    › Analytics engineer, backend engineer
    › When data ingestion is delayed, it’s immediately clear which services are
    affected
    › Data governance
    › Can identify the users involved in a table’s generation end to end
    › Easier to understand tables generated from a column containing important data
    from a governance perspective
    Solutions for displaying metadata

  91. Contributed to OSS
    › Made Atlas skip external temporary tables created in Hive
    › https://issues.apache.org/jira/browse/ATLAS-4492
    Solutions for displaying metadata

  92. Reference
    › How we introduced Data Lineage to LINE’s large-scale Data Platform (blog post in Japanese)
    › https://engineering.linecorp.com/ja/blog/data-lineage-on-line-big-data-platform/
    Solutions for displaying metadata

  93. Resolved displaying metadata issue
    1. Aggregating metadata
    2. Displaying metadata
    Data Source
    BI tools
    Query engine
    Other services
    ・・・
    DB
    1 2
    Users
    Data catalog’s
    data source

  94. Agenda
    › LINE’s data platform
    › Problems in introducing a data catalog
    1. Aggregating metadata issues
    2. Displaying metadata issues
    › Summary
    › Future work

  95. Summary
    › Need to provide various metadata as users become more diverse
    › Need not only to aggregate metadata, but also to inspect it and filter it for users
    › Introducing Atlas required some adjustments
    When introducing OSS, it’s important to understand the user needs and your own
    data platform

  96. Future work
    › For a wider range of uses
    › Generate statistical data from queries and use for cost optimization on data
    platform
    › Metadata everywhere
    › Allow users to view metadata on various BI tools
    › More personalized experience
    › Reduce the time for searching data, increase efficiency and productivity
