Unification of Divided Data into Single Big Data Platform with 2000+ Nodes

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Target Audience
    - Data Engineers who face technical issues in managing large-scale distributed systems
    - Data Analysts/Planners who want to know how to work with such a large-scale data platform
    - Data Managers/Evangelists who tackle complex data management/governance tasks with many stakeholders
  2. Agenda
    - Introduction
    - Data Platform in LINE
    - Problems
      - What’s wrong with “divided data”?
      - How difficult is it to “unify”?
    - Our approach
      - Technical approach
      - Data Management approach
    - Result & Future
  3. Tasuku OKUDA (@okdtsk_eng)
    - Engineering Manager, Data Engineering Team
    - LINE New Grad (2013~)
    - Career
      - LINE GAME DBA (MySQL, MongoDB)
      - ETL Engineer for LINE app
      - Ingestion Pipeline dev (Spark, Flink)
      - Hadoop administrator
      - Hadoop migration project leader
  4. Data Flow (diagram labels)
    - Layers: Ingestion, Storage, Computing, BI Tools, Governance
    - Components: Service-side System, External System, Kafka, Flink, dump, ES, HDFS, YARN, Hive (Tez), Spark, Trino (Presto), k8s, Kibana, Tableau, Jupyter, Yanagishima, OASIS, Datahub, LINE Analytics, Github, CentralDogma, Ranger, IU Web, Prometheus, Grafana
  5. Divided Data Platform: Datachain, Datalake
    “Twemoji” ©Twitter, Inc and other contributors (Licensed under CC-BY 4.0) https://twemoji.twitter.com/
  6. Divided Data Problems
    - Each of the two clusters has its own HDFS, YARN (Spark/Hive), Presto, and metastore
    - Catalog: Cannot JOIN
    - Computing: No Resource Sharing
    - Storage: Separated Permission
  7. Catalog problem: Cannot JOIN
    - Hive/Spark cannot recognize another cluster’s table information.
    - lineshop.sticker_sent and sticker_stats.ranking are registered in different metastores (Datachain and Datalake), so this query cannot run on Datachain Hive/Spark:

          SELECT s_sent.user_id, SUM(s_sent.cnt)
          FROM lineshop.sticker_sent AS s_sent
          JOIN sticker_stats.ranking AS s_rank
            ON (s_sent.sticker_id = s_rank.sticker_id)
          WHERE s_sent.dt = '20211110'
            AND s_rank.dt = '20211110'
            AND s_rank.rank = 1
          GROUP BY s_sent.user_id
  8. Computing problem: No Resource Sharing
    - Diagram: Datachain IDC, Datalake IDC
    - One cluster can be too busy even while the other has spare resources.
  9. Storage problem: Separated Permission
    - Diagram: Datachain (HDFS, Ranger) and Datalake (HDFS, Ranger), linked by “Data Copy” and the question “Same?”
    - Data duplication and separate permission control make it costly to follow the company-wide governance policy.
  10. Divided Data Platform: Datachain, Datalake
  11. Tentative Idea: Datachain, Datalake, New Cluster (IU)
  12. How to copy Data?
    - Diagram: distcp jobs running between Datachain, Datalake, and IU
    - Each stakeholder runs copy jobs on its own schedule, so the timing cannot be controlled
    - distcp requires a lot of resources (vCore/Mem)
    - Risk of network bursts, especially inter-IDC
  13. Who can migrate data first?
    - Complex job/table dependencies
    - Inter-organization data flow
    - Not all users understand the data platform deeply
  14. Where is “Active data”?
    - Diagram: Kafka, Datachain, Datalake, IU
    - A double-write approach requires:
      - Data consistency checks for all data
      - High system management cost for multiple streaming systems
      - Complex guidance for platform users: “Which cluster is the main one for this data?”
  15. Summary of “Problems”
    - “Divided Data”: Cannot JOIN, No Resource Sharing, Separated Permission
    - “Unification difficulty”: How to copy Data? Who can migrate data first? Where is “Active data”?
    - We need more “Approaches”
  16. Technical & Data Management
    - Technical approach: “Make complex things simple”
    - Data Management approach: “Keep changing things small”
  17. Technical approach
    - Diagram: two clusters, each with HDFS, YARN (Spark/Hive), Presto, and metastore
    - Catalog: Syncing
    - Computing: Relocation
    - Storage: Federation
  18. Storage - Federation
    - Benefits of federation
      - No need for data copy or double-write
      - Keeps the same directory structure and permissions as-is
      - Disk resources are shared across all HDFS clusters
    - IU federation is a “logical” one
      - Physically, all HDFS clusters share the same machines
      - Gives larger-scale disk capacity
    - IU HDFS uses viewfs
      - Users do not need to be aware of the federated HDFS clusters
      - More convenient access to the data itself

          /dfs/data/
          ├── datachain/
          ├── datalake/
          ├── iu01/
          ├── iu02/
          └── iu03/
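
As a rough illustration of why a viewfs-style mount table hides the federated clusters from users, the Python sketch below resolves a client-side path to a backing namespace by longest-prefix match. Only the directory names under /dfs/data/ come from the slide; the namespace URIs (hdfs://datachain-ns and so on) and the resolve function are hypothetical. In Hadoop itself this mapping lives in client configuration rather than in application code.

```python
from pathlib import PurePosixPath

# Hypothetical mount table: client-visible prefix -> backing HDFS namespace.
# The prefixes mirror the directory tree on the slide; the URIs are invented.
MOUNT_TABLE = {
    "/dfs/data/datachain": "hdfs://datachain-ns",
    "/dfs/data/datalake": "hdfs://datalake-ns",
    "/dfs/data/iu01": "hdfs://iu01-ns",
    "/dfs/data/iu02": "hdfs://iu02-ns",
    "/dfs/data/iu03": "hdfs://iu03-ns",
}


def resolve(view_path: str) -> str:
    """Map a client-visible path to a URI in the namespace that stores it."""
    path = PurePosixPath(view_path)
    # Try the longest mount point first so nested mounts would win.
    for mount in sorted(MOUNT_TABLE, key=len, reverse=True):
        try:
            rest = path.relative_to(mount)
        except ValueError:
            continue  # path is not under this mount point
        target = MOUNT_TABLE[mount]
        return target if str(rest) == "." else f"{target}/{rest}"
    raise LookupError(f"no mount point covers {view_path}")


if __name__ == "__main__":
    print(resolve("/dfs/data/datachain/lineshop/sticker_sent"))
```

Because the client only ever sees the /dfs/data/... path, a job written against one cluster's directory layout keeps working without knowing which namespace serves it.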
  19. Computing - Relocation: Datachain IDC, Datalake IDC, IU IDC
    “Font Awesome” ©Fonticons, Inc and other contributors (Licensed under CC-BY 4.0) https://fontawesome.com/
  20. Computing - Relocation
    - Build the migration schedule based on resource transition
      - Decommission/recommission, IDC relocation, and OS re-setup require a lot of time
      - Prepare several relocation phases to minimize the risk of resource shortage
    - IDC-level relocation
      - To achieve more efficient capacity planning
      - Co-work with a “moving service company”!
    - Hybrid hiveserver2: IU YARN with the old metastore
      - YARN resources are decreasing, but some users still require the old metastore
      - Apply a hiveserver2 patch so that it works on IU YARN with the old metastore
    - Diagram: Hybrid Hive gets table info from the non-IU metastore and submits jobs to IU YARN
  21. Catalog - Syncing
    - Sources: Datachain metastore and Datalake metastore, each with a MetastoreEventListener
    - DDL events: CREATE TABLE, ADD PARTITION, ALTER TABLE, …
    - Kafka
    - SyncWorker: Filter and Conversion, with a Central Dogma rule file
    - Target: IU metastore
  22. Catalog - Syncing
    - Users can access any table from the IU endpoint alone
      - To promote IU components
    - Kafka producer: local disk writing + fluentd tail plugin
      - Writing to local disk is more durable than writing over the network
      - If Kafka or SyncWorker has trouble, we can replay the missing DDL events from the local file
    - Each table’s events are assigned to a corresponding Kafka partition
      - Easy to detect metastore-syncer trouble
    - SyncWorker has filtering/converting features
      - Not all tables require migrating to IU
        - Outdated tables
        - A user decides on a double-writing strategy
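
The two SyncWorker ideas above, per-table partition assignment and rule-based filtering/converting, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DdlEvent type, the rule format, the partition count, and the database names in RULES are all hypothetical, and zlib.crc32 stands in for whatever key hashing the real Kafka producer uses.

```python
import zlib
from dataclasses import dataclass
from typing import Optional

NUM_PARTITIONS = 8  # hypothetical partition count


@dataclass(frozen=True)
class DdlEvent:
    db: str
    table: str
    kind: str  # e.g. "CREATE_TABLE", "ADD_PARTITION", "ALTER_TABLE"


def partition_for(event: DdlEvent, num_partitions: int = NUM_PARTITIONS) -> int:
    """Send every event of one table to the same partition to keep its order."""
    key = f"{event.db}.{event.table}".encode("utf-8")
    return zlib.crc32(key) % num_partitions


# Hypothetical rule file content: which databases to skip or rename on sync.
RULES = {
    "skip_dbs": {"outdated_db"},
    "rename_dbs": {"lineshop": "iu_lineshop"},
}


def apply_rules(event: DdlEvent, rules: dict = RULES) -> Optional[DdlEvent]:
    """Return the converted event, or None if the event is filtered out."""
    if event.db in rules["skip_dbs"]:
        return None  # table is not migrated to IU
    new_db = rules["rename_dbs"].get(event.db, event.db)
    return DdlEvent(new_db, event.table, event.kind)


if __name__ == "__main__":
    create = DdlEvent("lineshop", "sticker_sent", "CREATE_TABLE")
    add_part = DdlEvent("lineshop", "sticker_sent", "ADD_PARTITION")
    print(partition_for(create) == partition_for(add_part))
    print(apply_rules(create))
```

Keying by table means a CREATE TABLE and its later ADD PARTITION events stay in order on one partition, so a stalled partition points directly at the tables whose sync is stuck.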
  23. Endpoint Switching
    - Diagram labels: Flink, dump; Datachain HDFS, Datalake HDFS, IU HDFS; three metastores connected by metastore-syncer
    - IU endpoints: YARN, Hive, Spark, Trino
    - Datachain Endpoint, Datalake Endpoint: Switch!
  24. Permission Preservation
    - HDFS Client: CLI/YARN/Hive/Spark/Presto/…
    - IU HDFS, Datachain HDFS, Datalake HDFS: each with its own Ranger and LDAP
  25. Summary of “Approach”
    - Technical
      - Storage: Federation
      - Computing: Relocation
      - Catalog: Syncing
    - Data Management
      - Endpoint Switching
      - Permission Preservation
      - Tiered Stakeholders
  26. Things we learned
    - Understanding the internal architecture makes complex things simple
    - Keeping the current policy as-is helps platform users reduce the cost of system changes
    - Making the situation simple is a great first step toward improving a large-scale system/service