

Vinitus Web: A Job Management System for Data Quality and Lineage to Tackle Data Platform Complexity / Ingestion Management by "Vinitus Web": Data Quality and Lineage for the complex data platform users

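This talk shares insights into building a batch ingestion system that supports data democratization and governance after the merger of LINE and Yahoo Japan, along with lessons learned from a complex data platform.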


Transcript

  1. Ingestion Management by Vinitus Web - Data Quality and Lineage for the complex data platform users

    LY Corporation. Hoshino Kuntaro, Macquang Huy.
  2. About Us

    Hoshino Kuntaro: Tokyo, Japan. Product Manager @ LY Corporation. Joined as a new grad in 2020. Data Engineering > Data Pipeline Engineering > Batch Department. Silver @ League of Legends.

    Macquang Huy: Hanoi, Vietnam. Data Engineer @ LY Corporation. Joined in 2022. LINE Technology Vietnam > Data Platform Dev Team. Master @ League of Legends.
  3. Overview: LY Data Platform and Vinitus Web

    Diagram: Service A, Service B, and Service C are each ingested into the Data Platform, which serves Data Users.
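  4. Overview: LY Data Platform and Vinitus Web

    Diagram: Service A, Service B, and Service C are ingested into the Data Platform through Vinitus Web on cron schedules (0 1 * * *, 0 2 * * *, 0 * * * *), and the Data Platform serves Data Users. With Vinitus Web, a Job Owner can:
    • Create scheduled job
    • See execution history
    • Restart failed jobs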
  5. Agenda

    1. Data Users’ challenges in our Data Platform
    2. How Vinitus Web Helps Data Users
    3. How Vinitus Web Helps Job Owners
    4. Recap & Next
  6. Agenda

    1. Data Users’ challenges in our Data Platform
       • The increased effort due to diverse data origins
    2. How Vinitus Web Helps Data Users
    3. How Vinitus Web Helps Job Owners
    4. Recap & Next
  7. Data Users’ challenges in our Data Platform

    Diagram (LY Data Platform and Vinitus Web): data from Service A, Service B, and Service C lands in the Data Platform, and the services have separate user agreements (User Agreement for Service A, User Agreement for Service B). Data Users ask: "Where does this data come from? Can we use it?"
  8. Data Users’ challenges in our Data Platform

    Diagram (LY Data Platform and Vinitus Web): data from Service A, Service B, and Service C lands in the Data Platform. Data Users ask: "How is the data ingested?"
    • Source databases: MySQL, MongoDB
    • Column handling: CAST(`col` AS CHAR), name → `name`, spec.color → `color`
    • Is /dt=20250630 in JST or UTC?
  9. Data Users’ challenges in our Data Platform

    Data Users have to work out which one applies:
    • Which User Agreement?
    • MySQL or MongoDB?
    • Does /dt=20250630 store 2025-06-30 00:00 ~ 2025-06-30 00:00 JST, or 2025-06-30 00:00 ~ 2025-06-30 00:00 UTC?
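The JST-or-UTC question on this slide can be made concrete with a small sketch. It assumes, purely for illustration, that a daily partition covers one full calendar day in its time zone (the slides do not state the exact boundaries); `partition_window` is a hypothetical helper, not part of Vinitus Web.

```python
from datetime import datetime, timedelta, timezone

# Japan Standard Time is a fixed UTC+9 offset (no daylight saving time).
JST = timezone(timedelta(hours=9), "JST")
UTC = timezone.utc


def partition_window(dt: str, tz: timezone) -> tuple[datetime, datetime]:
    """Return the half-open [start, end) window of a daily partition, in UTC.

    Assumption: the partition dt=YYYYMMDD covers one calendar day in tz.
    """
    start = datetime.strptime(dt, "%Y%m%d").replace(tzinfo=tz)
    end = start + timedelta(days=1)
    return start.astimezone(UTC), end.astimezone(UTC)


jst_window = partition_window("20250630", JST)
utc_window = partition_window("20250630", UTC)
print("JST partition:", jst_window[0].isoformat(), "to", jst_window[1].isoformat())
print("UTC partition:", utc_window[0].isoformat(), "to", utc_window[1].isoformat())
```

Under that assumption the two readings of the same partition name differ by nine hours, which is why the time zone has to be recorded next to the data.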
  10. Agenda

    1. Data Users’ challenges in our Data Platform
    2. How Vinitus Web Helps Data Users
    3. How Vinitus Web Helps Job Owners
    4. Recap & Next
  11. How Vinitus Web Helps Data Users

    1. Enable users to quickly find the origin of the data
       • By providing detailed Data Lineage information
       • This lets them find the applicable rules and quickly identify the necessary checks and approvals.
    2. Ensure consistency in data structure
       • By providing functionality that absorbs differences in database types during ingestion
       • This makes the data easier to use without having to worry about differences in its origin.
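  12. How Vinitus Web Helps Data Users: ingestion job information for better Data Lineage

    Diagram: ingestion jobs from Service A, Service B, and Service C into the Data Platform run on cron schedules (0 1 * * *, 0 2 * * *, 0 * * * *). Their job information is exposed as Data Lineage information, which answers the Data Users' question "How is it processed?":
    • It is from Service A’s DB
    • Job Owner is …
    • Columns are CAST-ed like…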
  13. Data Consistency (Data Quality)

    1. Source Column and Destination Column are 1-to-1
    2. Default settings are chosen based on the source column so that the destination has the appropriate format
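A minimal sketch of the 1-to-1 rule, using the `name` and `spec.color` examples from the earlier slide. The flattening rule (keep the last segment of a dotted path) and both function names are assumptions made for illustration; the slides do not describe how Vinitus Web derives destination names.

```python
def destination_name(source_column: str) -> str:
    """Flatten a dotted source path such as spec.color to its last segment.

    Hypothetical default rule, chosen only to match the slide's example.
    """
    return source_column.rsplit(".", 1)[-1]


def build_mapping(source_columns: list[str]) -> dict[str, str]:
    """Map each source column to exactly one destination column.

    Raises ValueError when two source columns would share a destination,
    because that would break the 1-to-1 property.
    """
    mapping: dict[str, str] = {}
    owner_of: dict[str, str] = {}
    for src in source_columns:
        dst = destination_name(src)
        if dst in owner_of:
            raise ValueError(f"{src!r} and {owner_of[dst]!r} both map to {dst!r}")
        owner_of[dst] = src
        mapping[src] = dst
    return mapping


mapping = build_mapping(["name", "spec.color"])
print(mapping)
```

Rejecting collisions at job creation time, rather than silently overwriting a column, is one way to keep the lineage from destination back to source unambiguous.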
  14. Agenda

    1. Data Users’ challenges in our Data Platform
    2. How Vinitus Web Helps Data Users
    3. How Vinitus Web Helps Job Owners
    4. Recap & Next
  15. How Vinitus Web helps Job Owners

    Provide a fully managed ETL workflow engine:
    • Job creation time: save 90% (from 2 hours → 10 mins) with Job-as-Config
    • Manual operations: remove 3+ (Schema evolution, Failed tasks, Backfill)
    • Custom metrics: provide 16+ with Task-level monitoring
  16. Job-as-Config: Making Job Creation Easy

    Key Point: Vinitus provides a fully managed ETL workflow engine.
    • Job Owner workload: users create an ingestion job through the UI.
    • Vinitus takes care of the rest: Generate, Callback, DAG parsing (Vinitus, Job-as-Config, Airflow).
    • Non-expert users can create ingestion jobs (in minutes+).
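To illustrate the Job-as-Config idea, here is a rough sketch of a job described purely as data, which a generator could later turn into a workflow. Every field name and the validation rules are hypothetical; the slides do not show the real configuration format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IngestionJobConfig:
    """Hypothetical job definition: everything a job owner enters in the UI."""

    job_name: str
    owner: str
    source_type: str        # e.g. "mysql" or "mongodb"
    source_table: str
    destination_table: str
    schedule: str           # five-field cron expression, e.g. "0 1 * * *"

    def validate(self) -> None:
        """Reject obviously malformed configs before any workflow is generated."""
        if len(self.schedule.split()) != 5:
            raise ValueError(f"schedule must have five cron fields: {self.schedule!r}")
        if self.source_type not in {"mysql", "mongodb"}:
            raise ValueError(f"unsupported source type: {self.source_type!r}")


config = IngestionJobConfig(
    job_name="service_a_users_daily",
    owner="job-owner@example.com",
    source_type="mysql",
    source_table="service_a.users",
    destination_table="datalake.service_a_users",
    schedule="0 1 * * *",
)
config.validate()
rendered = json.dumps(asdict(config), indent=2)
print(rendered)
```

Because the job is plain data, the same record can drive both workflow generation and the lineage view (owner, source, destination) shown to Data Users.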
  17. Daily Operations (1) Schema Evolution

    Data schemas change over time! Modifying data types, deleting columns, adding columns, changing partitions.

    We provide a pre-processing task → fully support all scenarios (Vinitus, Airflow, Hadoop):
    1. CREATE TEMP TABLE
    2. INSERT data FROM Original-Table to Temp-Table
    3. Backup original data to HDFS: viewfs://iu/gov/{DB}/.IUBackup/{TABLE}/{TIME}/...
    4. DROP ORIGINAL TABLE
    5. CREATE ORIGINAL TABLE (new schema)
    6. Move data FROM Temp-Table to Original Table

    Remove backup data (schedules shown: @Bi-Monthly, @6_Hours):
    • Data is moved to Trash directory
    • Data is deleted completely
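The six pre-processing steps above can be sketched as an ordered plan. This is an illustration only: the temp-table suffix, the SQL-like statement text, and the function name are assumptions, the backup step is a description rather than a real command, and the backup path stops at `{TIME}/` because the slide elides the rest.

```python
def schema_evolution_plan(
    db: str,
    table: str,
    new_schema_ddl: str,
    select_list: str,
    backup_time: str,
) -> list[tuple[str, str]]:
    """Return (step_name, action) pairs in the order given on the slide.

    select_list projects temp-table columns onto the new schema, since
    the old and new column sets may differ.
    """
    original = f"{db}.{table}"
    temp = f"{db}.{table}_tmp"  # hypothetical naming convention
    backup_dir = f"viewfs://iu/gov/{db}/.IUBackup/{table}/{backup_time}/"
    return [
        ("create_temp_table", f"CREATE TABLE {temp} LIKE {original}"),
        ("copy_to_temp", f"INSERT INTO {temp} SELECT * FROM {original}"),
        ("backup_to_hdfs", f"copy data files of {original} to {backup_dir}"),
        ("drop_original", f"DROP TABLE {original}"),
        ("create_original", f"CREATE TABLE {original} ({new_schema_ddl})"),
        ("move_back", f"INSERT INTO {original} SELECT {select_list} FROM {temp}"),
    ]


plan = schema_evolution_plan(
    db="service_a",
    table="users",
    new_schema_ddl="id BIGINT, name STRING, color STRING",
    select_list="id, name, NULL AS color",  # newly added column starts empty
    backup_time="20250630T010000",
)
for step_name, action in plan:
    print(f"{step_name}: {action}")
```

Taking the backup before the DROP means the original data can still be restored if a later step fails.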
  18. Daily Operations (2) Handling Failed Tasks

    Tasks failed! Error classification: based on the log, users decide further actions.
    • JDBC-error, Validate-error: "Fixed by myself"
    • K8s-error, Task-error: "Raised an inquiry"
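A log-based classifier of this kind might look like the sketch below. The four category names come from the slide; the regular expressions, the sample log lines, and the pairing of categories with follow-up actions are assumptions for illustration, not the actual Vinitus rules.

```python
import re

# (category, pattern, suggested action). Patterns are hypothetical examples.
RULES = [
    ("JDBC-error", re.compile(r"jdbc|SQLException", re.IGNORECASE), "fix by job owner"),
    ("Validate-error", re.compile(r"validat", re.IGNORECASE), "fix by job owner"),
    ("K8s-error", re.compile(r"kubernetes|\bpod\b|OOMKilled", re.IGNORECASE), "raise an inquiry"),
    ("Task-error", re.compile(r"task failed|traceback", re.IGNORECASE), "raise an inquiry"),
]


def classify(log_text: str) -> tuple[str, str]:
    """Return (category, action) for the first rule whose pattern matches."""
    for category, pattern, action in RULES:
        if pattern.search(log_text):
            return category, action
    return "Unclassified", "raise an inquiry"


# Made-up log lines, not real Vinitus output.
samples = [
    "java.sql.SQLException: Access denied for user 'ingest'",
    "Validation failed: row count mismatch",
    "Pod evicted: OOMKilled",
    "something unexpected happened",
]
results = [classify(line) for line in samples]
for line, (category, action) in zip(samples, results):
    print(f"{category:15} {action:18} {line}")
```

Rule order matters here: the first matching pattern wins, so more specific categories should be listed before generic ones.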
  19. Daily Operations (3) Backfill Support

    Want data for long past periods! Backfill feature: a heavy operator for filling data for past periods.

    How?
    • Each backfill request is sent to Airflow (HTTP requests)
    • Each backfill request is handled in a dedicated pod
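As a sketch of how a long backfill could be split into independent requests, the snippet below builds one payload per daily partition. The one-request-per-day granularity and the payload keys are assumptions; the slides only say that each request is sent to Airflow over HTTP and handled in its own pod, and nothing is actually sent here.

```python
from datetime import date, timedelta


def backfill_requests(job_name: str, start: date, end: date) -> list[dict[str, str]]:
    """Build one hypothetical request payload per day in [start, end], inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    day_count = (end - start).days + 1
    return [
        {
            "job_name": job_name,
            "partition": (start + timedelta(days=offset)).strftime("%Y%m%d"),
        }
        for offset in range(day_count)
    ]


requests_to_send = backfill_requests(
    "service_a_users_daily", date(2025, 6, 28), date(2025, 6, 30)
)
for payload in requests_to_send:
    print(payload)
```

Keeping each request small and independent fits the dedicated-pod model: a failure in one period can be retried without redoing the others.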
  20. Task level monitoring: Custom Metrics

    Diagram (Vinitus, Airflow, statsd): Execute & Orchestrate, Task Run, Export logs, Expose metrics (/metrics), Checking logs, Alert.
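For readers unfamiliar with statsd, the sketch below formats metrics in the plain StatsD line format `name:value|type`, which is typically sent as a UDP datagram. The metric names are invented for illustration; the slides do not list the actual custom metrics.

```python
def statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format one metric in the plain StatsD line format.

    metric_type is "c" (counter), "g" (gauge), or "ms" (timer).
    """
    if metric_type not in {"c", "g", "ms"}:
        raise ValueError(f"unsupported metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


# Hypothetical task-level metrics for one task run.
lines = [
    statsd_line("vinitus.task.failed", 1, "c"),
    statsd_line("vinitus.task.duration", 1250, "ms"),
    statsd_line("vinitus.task.rows_ingested", 48213, "g"),
]
print("\n".join(lines))
```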
  21. Task level monitoring: What You See

    • General information
    • Failed jobs information
    • Identify long running jobs
    • Error classification
  22. Agenda

    1. Data Users’ challenges in our Data Platform
    2. How Vinitus Web Helps Data Users
    3. How Vinitus Web Helps Job Owners
    4. Recap & Next
  23. Recap & Next

    With Vinitus Web:
    • Reduce the cost during data use by providing Data Lineage and Data Consistency
    • Reduce the burden on the job owner when maintaining data in an accessible state

    What’s next?
    • Support to understand the Data Lineage information
    • Labels for data, AI
    • Reduce job owner’s burden
    • Data as Product, Data SLO