
Unification of Divided Data into Single Big Data Platform with 2000+ Nodes

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Target Audience
    Data Engineer
    who faces technical issues in managing big-scale distributed systems
    Data Analyst/Planner
    who wants to know how to deal with such a big-scale data platform
    Data Manager/Evangelist
    who tackles complex Data Management/Governance tasks with many stakeholders

  2. Agenda
    - Introduction
      - Data Platform in LINE
    - Problems
      - What's wrong with "divided data"?
      - How difficult is it to "unify"?
    - Our approach
      - Technical approach
      - Data Management approach
    - Result & Future

  3. Tasuku OKUDA (@okdtsk_eng)
    - Engineering Manager, Data Engineering Team
    - LINE New Grad (2013~)
    - Career
      - LINE GAME DBA (MySQL, MongoDB)
      - ETL Engineer for LINE app
      - Ingestion Pipeline dev (Spark, Flink)
      - Hadoop administrator
      - Hadoop migration project leader

  4. Data Platform
    in LINE

  5. What’s IU?

  6. IU Motivation
    Single Environment
    Single Endpoint
    Single Standard

  7. IU Motivation
    Single Environment
    Single Endpoint
    Single Standard
    SIMPLE

  8. IU Motivation
    Single Environment
    Single Endpoint
    Single Standard
    SIMPLE
    Data-Driven

  9. Data Flow
    - Sources: External System, Service-side System
    - Ingestion: Kafka, Flink (on k8s), dump
    - Storage: HDFS, ES
    - Computing: YARN with Hive (Tez), Spark, Trino (Presto)
    - BI Tools: Kibana, Tableau, Jupyter, Yanagishima, OASIS, Datahub, LINE Analytics (on k8s)
    - Governance: Github, CentralDogma, Ranger, IU Web, Prometheus, Grafana

  10. Status: Infrastructure
    - HDFS capacity: 400 PB
    - YARN vCores: 90,000
    - Machines: 5,000

  11. Status: Data usage
    - Hive tables: 40,000
    - Incoming records/sec: 17,500,000
    - Jobs/day: 150,000

  12. Problems
    What’s wrong with “divided data”?

  13. Divided Data Platform
    Datachain Datalake
    “Twemoji” ©Twitter, Inc and other contributors (Licensed under CC-BY 4.0) https://twemoji.twitter.com/

  14. A lot of Connection Points…
    Datachain
    Datalake

  15. A lot of Connection Points…
    Datachain
    Datalake
    COMPLEX!

  16. Divided Data Problems
    Each cluster has its own HDFS, YARN (Spark/Hive), Presto, and metastore:
    - Catalog (metastore): Cannot JOIN
    - Computing (YARN/Presto): No Resource Sharing
    - Storage (HDFS): Separated Permission
  17. Catalog problem
    Cannot JOIN
    lineshop.sticker_sent is registered in the Datachain metastore, while
    sticker_stats.ranking is registered in the Datalake metastore.
    SELECT
      s_sent.user_id,
      SUM(s_sent.cnt)
    FROM lineshop.sticker_sent AS s_sent
    JOIN sticker_stats.ranking AS s_rank
      ON (s_sent.sticker_id = s_rank.sticker_id)
    WHERE s_sent.dt = '20211110' AND
      s_rank.dt = '20211110' AND
      s_rank.rank = 1
    GROUP BY s_sent.user_id
    Hive/Spark on Datachain cannot recognize another cluster's
    table information, so this cross-cluster JOIN fails.
  18. Computing problem
    No Resource Sharing
    Datachain
    IDC
    Datalake
    IDC
    One cluster can be too busy even while
    the other has spare resources.

  19. Storage problem
    Separated Permission
    Datachain
    HDFS
    Ranger
    Datalake
    HDFS
    Ranger
    Same?
    Data Copy
    Data duplication and separated permission control
    impose high data management costs
    when following the company-wide governance policy.

  20. Divided Data Platform
    Datachain Datalake
    “Twemoji” ©Twitter, Inc and other contributors (Licensed under CC-BY 4.0) https://twemoji.twitter.com/

  21. Tentative Idea
    Datachain Datalake
    New Cluster
    IU
    “Twemoji” ©Twitter, Inc and other contributors (Licensed under CC-BY 4.0) https://twemoji.twitter.com/

  22. Problems
    How difficult is it to "unify"?

  23. How to copy Data?
    Datachain
    Datalake
    IU
    distcp
    distcp
    distcp
    distcp
    distcp
    distcp
    - Copy job timing is uncontrollable when left to each stakeholder
    - A lot of resources (vCores/memory) are required for distcp
    - Risk of network bursts, especially inter-IDC

  24. Who can migrate data first?
    - Complex job/table dependencies
    - Inter-organization data flow
    - Not all users understand the data platform deeply

  25. Where is “Active data”?
    Datachain
    Datalake
    IU
    Kafka
    A double-write approach requires
    - Data consistency checks for all data
    - High system management cost for multiple streaming systems
    - Complex guidance for platform users: "which cluster is the primary one for a given dataset?"
    “Twemoji” ©Twitter, Inc and other contributors (Licensed under CC-BY 4.0) https://twemoji.twitter.com/

  26. Summary of “Problems”
    "Divided Data": Cannot JOIN / No Resource Sharing / Separated Permission
    "Unification difficulty": How to copy Data? / Who can migrate data first? / Where is "Active data"?
    We need more "Approaches"

  27. Technical & Data Management
    Technical approach: "Make complex things simple"
    Data Management approach: "Keep changing things small"

  28. Technical approach
    - Storage: Federation
    - Computing: Relocation
    - Catalog: Syncing

  29. Technical approach
    Each cluster runs its own HDFS, YARN (Spark/Hive), Presto, and metastore.
    - Catalog (metastore): Syncing
    - Computing (YARN/Presto): Relocation
    - Storage (HDFS): Federation

  30. Technical approach
    Storage: Federation

  31. Storage - Federation
    IU
    HDFS
    Datachain
    HDFS
    Datalake
    HDFS
    HDFS Client
    CLI/YARN/Hive/Spark/Presto/…

  32. Storage - Federation
    - Benefits of federation
      - No need for data copies or double writes
      - Keep the same directory structure and permissions as-is
      - Disk resources are shared across all HDFS clusters
    - IU federation is a "logical" one
      - Physically, all HDFS clusters share the same machines
      - Gain larger-scale disk capacity
    - IU HDFS uses viewfs (a client-side sketch follows the directory tree below)
      - Users don't need to care which federated HDFS cluster holds the data
      - More convenient access to the data itself
    /dfs/data/
    ├── datachain/
    ├── datalake/
    ├── iu01/
    ├── iu02/
    └── iu03/
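
    As a rough illustration of how viewfs hides the federated clusters, here is a
    minimal Java sketch of a client-side mount table. The cluster name "iu" and the
    NameNode addresses are hypothetical placeholders, not the production values;
    in practice the mount entries live in core-site.xml, so user code needs no change.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ViewFsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Clients see one logical namespace instead of per-cluster HDFS URIs.
            conf.set("fs.defaultFS", "viewfs://iu");
            // Mount each physical HDFS namespace under the shared directory tree.
            conf.set("fs.viewfs.mounttable.iu.link./dfs/data/datachain",
                     "hdfs://datachain-nn:8020/dfs/data");
            conf.set("fs.viewfs.mounttable.iu.link./dfs/data/datalake",
                     "hdfs://datalake-nn:8020/dfs/data");
            conf.set("fs.viewfs.mounttable.iu.link./dfs/data/iu01",
                     "hdfs://iu01-nn:8020/dfs/data");

            // One client call now spans all federated clusters transparently.
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("/dfs/data"))) {
                System.out.println(status.getPath());
            }
        }
    }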

  33. Technical approach
    Computing: Relocation

  34. Computing - Relocation
    Datachain IDC
    Datalake IDC
    IU IDC
    “Font Awesome” ©Fonticons, Inc and other contributors (Licensed under CC-BY 4.0) https://fontawesome.com/

  35. Computing - Relocation
    - Build the migration schedule based on resource transition
      - Decommission/recommission, IDC relocation, and OS re-setup require a lot of time
      - Prepare several relocation phases to minimize the risk of resource shortage
    - IDC-level relocation
      - To achieve more efficient capacity planning
      - Co-work with a "moving service company"!
    - Hybrid hiveserver2: IU YARN with the old metastore
      - YARN resources keep shrinking during relocation,
      - but some users still require the old metastore
      - Apply a hiveserver2 patch so it works on IU YARN with the old metastore
        (a configuration sketch follows this list)
    Hybrid Hive: gets table info from the non-IU metastore, submits jobs to IU YARN
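
    The hybrid setup can be pictured with a minimal HiveConf sketch: table metadata is
    resolved against the old metastore while jobs run on IU YARN. Host names and ports
    are hypothetical, and the real change was a hiveserver2-side patch, not this
    client-side configuration.

    import org.apache.hadoop.hive.conf.HiveConf;

    public class HybridHiveConfSketch {
        public static HiveConf hybridConf() {
            HiveConf conf = new HiveConf();
            // Table metadata still comes from the old (non-IU) metastore...
            conf.set("hive.metastore.uris", "thrift://non-iu-metastore:9083");
            // ...while compiled jobs are submitted to the IU YARN cluster.
            conf.set("yarn.resourcemanager.address", "iu-rm:8032");
            return conf;
        }
    }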

  36. Technical approach
    Catalog: Syncing

  37. Catalog - Syncing
    Datachain/Datalake metastore (MetastoreEventListener) → Kafka → SyncWorker → IU metastore
    DDL events:
    - CREATE TABLE
    - ADD PARTITION
    - ALTER TABLE
    SyncWorker applies Filter and Conversion steps, driven by a rule file in CentralDogma
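
    A minimal sketch of the listener side, using Hive's MetaStoreEventListener API and
    the local-spool-plus-fluentd handoff described on the next slide. The class name,
    spool path, and JSON layout are hypothetical illustrations.

    import java.io.FileWriter;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
    import org.apache.hadoop.hive.metastore.api.MetaException;
    import org.apache.hadoop.hive.metastore.events.AddPartitionEvent;
    import org.apache.hadoop.hive.metastore.events.AlterTableEvent;
    import org.apache.hadoop.hive.metastore.events.CreateTableEvent;

    public class DdlEventSpoolListener extends MetaStoreEventListener {

        private static final String SPOOL_FILE = "/var/log/metastore/ddl_events.log";

        public DdlEventSpoolListener(Configuration config) {
            super(config);
        }

        @Override
        public void onCreateTable(CreateTableEvent event) throws MetaException {
            spool("CREATE_TABLE", event.getTable().getDbName(), event.getTable().getTableName());
        }

        @Override
        public void onAddPartition(AddPartitionEvent event) throws MetaException {
            spool("ADD_PARTITION", event.getTable().getDbName(), event.getTable().getTableName());
        }

        @Override
        public void onAlterTable(AlterTableEvent event) throws MetaException {
            spool("ALTER_TABLE", event.getNewTable().getDbName(), event.getNewTable().getTableName());
        }

        // Append one JSON line per DDL event; fluentd tails this file into Kafka.
        // The local file survives Kafka/SyncWorker outages, so missed events can be replayed.
        private synchronized void spool(String type, String db, String table) throws MetaException {
            String line = String.format(
                "{\"event\":\"%s\",\"db\":\"%s\",\"table\":\"%s\",\"ts\":%d}%n",
                type, db, table, System.currentTimeMillis());
            try (FileWriter writer = new FileWriter(SPOOL_FILE, true)) {
                writer.write(line);
            } catch (IOException e) {
                throw new MetaException("Failed to spool DDL event: " + e.getMessage());
            }
        }
    }

    Such a listener would be registered on each old metastore via the
    hive.metastore.event.listeners property.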

  38. Catalog - Syncing
    - Users can access any table from the IU endpoint alone
      - To promote IU components
    - Kafka producer: local disk writing + fluentd tail plugin
      - Local disk is more durable than writing over the network
      - If Kafka/SyncWorker has trouble, we can replay missing DDL events from the local file
    - Each table's events are assigned to a corresponding Kafka partition
      - Easy to detect metastore-syncer trouble
    - SyncWorker has filtering/converting features (a filter sketch follows this list)
      - Not all tables need to be migrated to IU
        - Outdated tables
        - A user decides their own double-writing strategy
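
    As noted above, a rough sketch of the SyncWorker filter step, assuming the
    CentralDogma rule file reduces to a list of exclude patterns; the rule format
    and class name are hypothetical.

    import java.util.List;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class SyncRuleFilter {
        private final List<Pattern> excludePatterns;

        public SyncRuleFilter(List<String> excludeRules) {
            // e.g. patterns like "lineshop\\..*_tmp" loaded from the rule file in CentralDogma
            this.excludePatterns = excludeRules.stream()
                    .map(Pattern::compile)
                    .collect(Collectors.toList());
        }

        // Returns false for tables that should not be synced to the IU metastore,
        // e.g. outdated tables or tables whose owners run their own double-write strategy.
        public boolean shouldSync(String db, String table) {
            String qualified = db + "." + table;
            return excludePatterns.stream().noneMatch(p -> p.matcher(qualified).matches());
        }
    }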

  39. Data Management approach: "Keep changing things small"
    Technical approach: "Make complex things simple"

  40. Data Management approach
    - Endpoint Switching
    - Permission Preservation
    - Tiered Stakeholders

  41. Data Management approach
    Endpoint Switching

  42. Endpoint Switching
    Ingestion (Flink, dump) keeps writing to Datachain/Datalake HDFS, and metastore-syncer
    keeps each old metastore in sync with the IU metastore. Users then switch from the
    Datachain and Datalake endpoints to the IU endpoints (YARN, Hive, Spark, Trino)
    backed by IU HDFS.
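
    From a platform user's point of view, the switch is mostly a new connection string;
    a minimal Hive JDBC sketch (host names are hypothetical, and hive-jdbc must be on
    the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EndpointSwitchSketch {
        public static void main(String[] args) throws Exception {
            // Before: jdbc:hive2://datachain-hs2:10000/lineshop (old endpoint).
            // After: only the host changes; databases, tables, and queries stay the same.
            try (Connection conn = DriverManager.getConnection("jdbc:hive2://iu-hs2:10000/lineshop");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT COUNT(*) FROM sticker_sent WHERE dt = '20211110'")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }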

  43. Data Management approach
    Permission Preservation

  44. Permission Preservation
    IU
    HDFS
    Datachain
    HDFS
    Datalake
    HDFS
    HDFS Client
    CLI/YARN/Hive/Spark/Presto/…
    Each HDFS cluster keeps its own Ranger, each backed by LDAP

  45. Data Management approach
    Tiered Stakeholders

  46. Tiered Stakeholders
    Tier 1: Data Platform
    Tier 2: ML, DS team
    Tier 3: Service-side

  47. Tiered Stakeholders
    (diagram: metastore-syncer between the tiers)

  48. Summary of “Approach”
    Technical: Storage Federation / Computing Relocation / Catalog Syncing
    Data Management: Endpoint Switching / Permission Preservation / Tiered Stakeholders

  49. Result & Future

  50. Things we learned
    - Understanding the internal architecture makes complex things simple
    - Keeping the current policy as-is helps platform users reduce the cost of system changes
    - Making the situation simple is a great first step to improving a big-scale system/service

  51. Future Challenges
    - Data Catalog
    - Data Democracy
    - ML infrastructure
    - Data Lineage
    - Delta table
    - Capacity Planning

  52. We are hiring!
