
Unification of Divided Data into Single Big Data Platform with 2000+ Nodes

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Target Audience
    Data Engineer
    who faces technical issues in managing big-scale distributed systems
    Data Analyst/Planner
    who wants to know how to deal with such a big-scale data platform
    Data Manager/Evangelist
    who tackles complex Data Management/Governance tasks with many stakeholders

  2. Agenda
    - Introduction
      - Data Platform in LINE
    - Problems
      - What's wrong with "divided data"?
      - How difficult is it to "unify"?
    - Our approach
      - Technical approach
      - Data Management approach
    - Result & Future

  3. Tasuku OKUDA (@okdtsk_eng)
    - Engineering Manager, Data Engineering Team
    - LINE New Grad (2013~)
    - Career
      - LINE GAME DBA (MySQL, MongoDB)
      - ETL Engineer for LINE app
      - Ingestion Pipeline dev (Spark, Flink)
      - Hadoop administrator
      - Hadoop migration project leader

  4. Data Platform
    in LINE

  5. What’s IU?

  6. IU Motivation
    Single Environment
    Single Endpoint
    Single Standard

  7. IU Motivation
    Single Environment
    Single Endpoint
    Single Standard
    SIMPLE

  8. IU Motivation
    Single Environment
    Single Endpoint
    Single Standard
    SIMPLE
    Data-Driven

  9. Data Flow
    - Sources: External System, Service-side System
    - Ingestion: Kafka, Flink (on k8s), dump
    - Storage: HDFS, ES
    - Computing: YARN with Hive (Tez), Spark, Trino (Presto)
    - BI Tools: Kibana, Tableau, Jupyter, Yanagishima, OASIS, Datahub, LINE Analytics (on k8s)
    - Governance: Github, CentralDogma, Ranger, IU Web, Prometheus, Grafana

  10. Status: Infrastructure
    - HDFS capacity: 400 PB
    - YARN vCores: 90,000
    - Machines: 5,000

  11. Status: Data usage
    - Hive tables: 40,000
    - Incoming records/sec: 17,500,000
    - Jobs/day: 150,000

  12. Problems
    What’s wrong with “divided data”?

  13. Divided Data Platform
    Datachain Datalake
    “Twemoji” ©Twitter, Inc and other contributors (Licensed under CC-BY 4.0) https://twemoji.twitter.com/

  14. A lot of Connection Points…
    Datachain
    Datalake

  15. A lot of Connection Points…
    Datachain
    Datalake
    COMPLEX!

  16. Divided Data Problems
    Each cluster has its own HDFS, YARN (Spark/Hive), Presto, and metastore:
    - Catalog (metastore): Cannot JOIN
    - Computing (YARN/Presto): No Resource Sharing
    - Storage (HDFS): Separated Permission
  17. Catalog problem
    Cannot JOIN
    lineshop.sticker_sent is registered in the Datachain metastore, while
    sticker_stats.ranking is registered in the Datalake metastore.
    SELECT
      s_sent.user_id,
      SUM(s_sent.cnt)
    FROM lineshop.sticker_sent AS s_sent
    JOIN sticker_stats.ranking AS s_rank
      ON (s_sent.sticker_id = s_rank.sticker_id)
    WHERE s_sent.dt = '20211110' AND
      s_rank.dt = '20211110' AND
      s_rank.rank = 1
    GROUP BY s_sent.user_id
    Hive/Spark on Datachain cannot recognize another cluster's
    table information, so this cross-cluster JOIN fails.
  18. Computing problem
    No Resource Sharing
    Datachain
    IDC
    Datalake
    IDC
    One cluster can be too busy even while
    the other has spare resources.

  19. Storage problem
    Separated Permission
    Datachain
    HDFS
    Ranger
    Datalake
    HDFS
    Ranger
    Same?
    Data Copy
    Data duplication and separated permission control
    impose high data management costs
    when following the company-wide governance policy.

  20. Divided Data Platform
    Datachain Datalake
    “Twemoji” ©Twitter, Inc and other contributors (Licensed under CC-BY 4.0) https://twemoji.twitter.com/

  21. Tentative Idea
    Datachain Datalake
    New Cluster
    IU
    “Twemoji” ©Twitter, Inc and other contributors (Licensed under CC-BY 4.0) https://twemoji.twitter.com/

  22. Problems
    How difficult is it to "unify"?

  23. How to copy Data?
    Datachain
    Datalake
    IU
    distcp
    distcp
    distcp
    distcp
    distcp
    distcp
    - Copy job timing is uncontrollable when left to each stakeholder
    - A lot of resources (vCores/memory) are required for distcp
    - Risk of network bursts, especially inter-IDC

  24. Who can migrate data first?
    - Complex job/table dependencies
    - Inter-organization data flow
    - Not all users understand the data platform deeply

  25. Where is “Active data”?
    Datachain
    Datalake
    IU
    Kafka
    A double-write approach requires
    - Data consistency checks for all data
    - High system management cost for multiple streaming systems
    - Complex guidance for platform users: "which cluster is the primary one for a given dataset?"
    “Twemoji” ©Twitter, Inc and other contributors (Licensed under CC-BY 4.0) https://twemoji.twitter.com/

  26. Summary of “Problems”
    "Divided Data": Cannot JOIN / No Resource Sharing / Separated Permission
    "Unification difficulty": How to copy Data? / Who can migrate data first? / Where is "Active data"?
    We need more "Approaches"

  27. Technical & Data Management
    Technical approach: "Make complex things simple"
    Data Management approach: "Keep changing things small"

  28. Technical approach
    - Storage: Federation
    - Computing: Relocation
    - Catalog: Syncing

  29. Technical approach
    Each cluster runs its own HDFS, YARN (Spark/Hive), Presto, and metastore.
    - Catalog (metastore): Syncing
    - Computing (YARN/Presto): Relocation
    - Storage (HDFS): Federation

  30. Technical approach
    Storage: Federation

  31. Storage - Federation
    IU
    HDFS
    Datachain
    HDFS
    Datalake
    HDFS
    HDFS Client
    CLI/YARN/Hive/Spark/Presto/…

  32. Storage - Federation
    - Benefits of federation
      - No need for data copies or double writes
      - Keep the same directory structure and permissions as-is
      - Disk resources are shared across all HDFS clusters
    - IU federation is a "logical" one
      - Physically, all HDFS clusters share the same machines
      - Gain larger-scale disk capacity
    - IU HDFS uses viewfs (a client-side sketch follows the directory tree below)
      - Users don't need to care which federated HDFS cluster holds the data
      - More convenient access to the data itself
    /dfs/data/
    ├── datachain/
    ├── datalake/
    ├── iu01/
    ├── iu02/
    └── iu03/
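
    As a rough illustration of how viewfs hides the federated clusters, here is a
    minimal Java sketch of a client-side mount table. The cluster name "iu" and the
    NameNode addresses are hypothetical placeholders, not the production values;
    in practice the mount entries live in core-site.xml, so user code needs no change.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ViewFsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Clients see one logical namespace instead of per-cluster HDFS URIs.
            conf.set("fs.defaultFS", "viewfs://iu");
            // Mount each physical HDFS namespace under the shared directory tree.
            conf.set("fs.viewfs.mounttable.iu.link./dfs/data/datachain",
                     "hdfs://datachain-nn:8020/dfs/data");
            conf.set("fs.viewfs.mounttable.iu.link./dfs/data/datalake",
                     "hdfs://datalake-nn:8020/dfs/data");
            conf.set("fs.viewfs.mounttable.iu.link./dfs/data/iu01",
                     "hdfs://iu01-nn:8020/dfs/data");

            // One client call now spans all federated clusters transparently.
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("/dfs/data"))) {
                System.out.println(status.getPath());
            }
        }
    }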

  33. Technical approach
    Computing: Relocation

  34. Computing - Relocation
    Datachain IDC
    Datalake IDC
    IU IDC
    “Font Awesome” ©Fonticons, Inc and other contributors (Licensed under CC-BY 4.0) https://fontawesome.com/

  35. Computing - Relocation
    - Build the migration schedule based on resource transition
      - Decommission/recommission, IDC relocation, and OS re-setup require a lot of time
      - Prepare several relocation phases to minimize the risk of resource shortage
    - IDC-level relocation
      - To achieve more efficient capacity planning
      - Co-work with a "moving service company"!
    - Hybrid hiveserver2: IU YARN with the old metastore
      - YARN resources keep shrinking during relocation,
      - but some users still require the old metastore
      - Apply a hiveserver2 patch so it works on IU YARN with the old metastore
        (a configuration sketch follows this list)
    Hybrid Hive: gets table info from the non-IU metastore, submits jobs to IU YARN
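
    The hybrid setup can be pictured with a minimal HiveConf sketch: table metadata is
    resolved against the old metastore while jobs run on IU YARN. Host names and ports
    are hypothetical, and the real change was a hiveserver2-side patch, not this
    client-side configuration.

    import org.apache.hadoop.hive.conf.HiveConf;

    public class HybridHiveConfSketch {
        public static HiveConf hybridConf() {
            HiveConf conf = new HiveConf();
            // Table metadata still comes from the old (non-IU) metastore...
            conf.set("hive.metastore.uris", "thrift://non-iu-metastore:9083");
            // ...while compiled jobs are submitted to the IU YARN cluster.
            conf.set("yarn.resourcemanager.address", "iu-rm:8032");
            return conf;
        }
    }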

  36. Technical approach
    Catalog: Syncing

  37. Catalog - Syncing
    Datachain/Datalake metastore (MetastoreEventListener) → Kafka → SyncWorker → IU metastore
    DDL events:
    - CREATE TABLE
    - ADD PARTITION
    - ALTER TABLE
    SyncWorker applies Filter and Conversion steps, driven by a rule file in CentralDogma
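
    A minimal sketch of the listener side, using Hive's MetaStoreEventListener API and
    the local-spool-plus-fluentd handoff described on the next slide. The class name,
    spool path, and JSON layout are hypothetical illustrations.

    import java.io.FileWriter;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
    import org.apache.hadoop.hive.metastore.api.MetaException;
    import org.apache.hadoop.hive.metastore.events.AddPartitionEvent;
    import org.apache.hadoop.hive.metastore.events.AlterTableEvent;
    import org.apache.hadoop.hive.metastore.events.CreateTableEvent;

    public class DdlEventSpoolListener extends MetaStoreEventListener {

        private static final String SPOOL_FILE = "/var/log/metastore/ddl_events.log";

        public DdlEventSpoolListener(Configuration config) {
            super(config);
        }

        @Override
        public void onCreateTable(CreateTableEvent event) throws MetaException {
            spool("CREATE_TABLE", event.getTable().getDbName(), event.getTable().getTableName());
        }

        @Override
        public void onAddPartition(AddPartitionEvent event) throws MetaException {
            spool("ADD_PARTITION", event.getTable().getDbName(), event.getTable().getTableName());
        }

        @Override
        public void onAlterTable(AlterTableEvent event) throws MetaException {
            spool("ALTER_TABLE", event.getNewTable().getDbName(), event.getNewTable().getTableName());
        }

        // Append one JSON line per DDL event; fluentd tails this file into Kafka.
        // The local file survives Kafka/SyncWorker outages, so missed events can be replayed.
        private synchronized void spool(String type, String db, String table) throws MetaException {
            String line = String.format(
                "{\"event\":\"%s\",\"db\":\"%s\",\"table\":\"%s\",\"ts\":%d}%n",
                type, db, table, System.currentTimeMillis());
            try (FileWriter writer = new FileWriter(SPOOL_FILE, true)) {
                writer.write(line);
            } catch (IOException e) {
                throw new MetaException("Failed to spool DDL event: " + e.getMessage());
            }
        }
    }

    Such a listener would be registered on each old metastore via the
    hive.metastore.event.listeners property.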

  38. Catalog - Syncing
    - Users can access any table from the IU endpoint alone
      - To promote IU components
    - Kafka producer: local disk writing + fluentd tail plugin
      - Local disk is more durable than writing over the network
      - If Kafka/SyncWorker has trouble, we can replay missing DDL events from the local file
    - Each table's events are assigned to a corresponding Kafka partition
      - Easy to detect metastore-syncer trouble
    - SyncWorker has filtering/converting features (a filter sketch follows this list)
      - Not all tables need to be migrated to IU
        - Outdated tables
        - A user decides their own double-writing strategy
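
    As noted above, a rough sketch of the SyncWorker filter step, assuming the
    CentralDogma rule file reduces to a list of exclude patterns; the rule format
    and class name are hypothetical.

    import java.util.List;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class SyncRuleFilter {
        private final List<Pattern> excludePatterns;

        public SyncRuleFilter(List<String> excludeRules) {
            // e.g. patterns like "lineshop\\..*_tmp" loaded from the rule file in CentralDogma
            this.excludePatterns = excludeRules.stream()
                    .map(Pattern::compile)
                    .collect(Collectors.toList());
        }

        // Returns false for tables that should not be synced to the IU metastore,
        // e.g. outdated tables or tables whose owners run their own double-write strategy.
        public boolean shouldSync(String db, String table) {
            String qualified = db + "." + table;
            return excludePatterns.stream().noneMatch(p -> p.matcher(qualified).matches());
        }
    }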

  39. Data Management approach: "Keep changing things small"
    Technical approach: "Make complex things simple"

  40. Data Management approach
    - Endpoint Switching
    - Permission Preservation
    - Tiered Stakeholders

  41. Data Management approach
    Endpoint Switching

  42. Endpoint Switching
    Ingestion (Flink, dump) keeps writing to Datachain/Datalake HDFS, and metastore-syncer
    keeps each old metastore in sync with the IU metastore. Users then switch from the
    Datachain and Datalake endpoints to the IU endpoints (YARN, Hive, Spark, Trino)
    backed by IU HDFS.
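
    From a platform user's point of view, the switch is mostly a new connection string;
    a minimal Hive JDBC sketch (host names are hypothetical, and hive-jdbc must be on
    the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EndpointSwitchSketch {
        public static void main(String[] args) throws Exception {
            // Before: jdbc:hive2://datachain-hs2:10000/lineshop (old endpoint).
            // After: only the host changes; databases, tables, and queries stay the same.
            try (Connection conn = DriverManager.getConnection("jdbc:hive2://iu-hs2:10000/lineshop");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT COUNT(*) FROM sticker_sent WHERE dt = '20211110'")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }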

  43. Data Management approach
    Permission Preservation

  44. Permission Preservation
    IU
    HDFS
    Datachain
    HDFS
    Datalake
    HDFS
    HDFS Client
    CLI/YARN/Hive/Spark/Presto/…
    Each HDFS cluster keeps its own Ranger, each backed by LDAP

  45. Data Management approach
    Tiered Stakeholders

  46. Tiered Stakeholders
    Tier 1: Data Platform
    Tier 2: ML, DS team
    Tier 3: Service-side

  47. Tiered Stakeholders
    (diagram: metastore-syncer between the tiers)

  48. Summary of “Approach”
    Technical: Storage Federation / Computing Relocation / Catalog Syncing
    Data Management: Endpoint Switching / Permission Preservation / Tiered Stakeholders

  49. Result & Future

  50. Things we learned
    - Understanding the internal architecture makes complex things simple
    - Keeping the current policy as-is helps platform users reduce the cost of system changes
    - Making the situation simple is a great first step to improving a big-scale system/service

  51. Future Challenges
    - Data Catalog
    - Data Democracy
    - ML infrastructure
    - Data Lineage
    - Delta table
    - Capacity Planning

  52. We are hiring!
