Introducing a Unified, Managed Workflow Service for LINE Data Platform

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Speaker
    - Yang Xu
      - Software engineer
      - Data platform, data engineering center
    - Work
      - ETL pipeline, batch ingestion
      - Internal efficiency tools
    - Vision
      - Everyone can enjoy data easily with services we provide
  2. Agenda
    - ETL batch at LINE
      - Status
      - Problem
    - Our solution
      - Goal
      - Feature
      - Future work
  3. ETL batch at LINE
    - Volume
      - Jobs/day: 100,000
      - Users: 200+
      - Tables: 40,000
    - Developer
      - ETL team
      - Official Account team
      - Data Scientist team
      - …
    - Toolchain
      - Language: Python, Scala, …
      - Scheduler: Azkaban, Airflow, OASIS
  4. Problem
    - User: costs too much to
      - Learn to set up environment
      - Design necessary functions
      - Write code
    - Data Platform: suffers from
      - Careless configuration
      - Unoptimized query
      - Wild operation
  5. Goal
    - Cut off burden of data engineering
      - Encoding & compression
      - Small file compaction
      - …
    - Relieve operation cost
      - Retry
      - Recover
    - Centrally controlled
      - Yet easy to use
  6. Unified platform
    - Extensibility
      - Easy to define and integrate functionalities
      - In order to fit needs from different teams
    - Scalability
      - Convenient to scale up and out
      - Preparing for increasing workload and various conditions
    - High Availability
      - Stable and reliable without downtime
      - For example, when we perform update and restart
  7. Airflow on Kubernetes
    - Extensibility → Airflow
      - Easy to define and integrate functionalities
      - In order to fit needs from different teams
    - Scalability → Kubernetes & Airflow
      - Convenient to scale up and out
      - Preparing for increasing workload and various conditions
    - High Availability → Kubernetes
      - Stable and reliable without downtime
      - For example, when we perform update and restart
  8. Standardized functionality
    - SLA
      - Landing Time
      - Execution Time
    - Data Quality
      - Verification
      - Anomaly Detection
    - Notification
      - Slack
      - Email
    - Operator
      - SparkSQL
      - DistCp
    - Dependency
      - In/Cross Job
    - Engine
      - Spark 3
  9. Codeless
    - Motivation
      - Make user focus on business logic without worrying about coding
      - Protect cluster from unexpected user behaviors
    - Responsibility
      - User provides reliable query or code parts
      - We handle all other parts to generate executable jobs
    - Solution: Job as Config
      - Complete config from template
      - Transform to real job at runtime
      - Independent from backend scheduler
  10. Dynamic DAG
    - One DAG + Config files
    - Runtime
      - Parse config
      - Create/Update DAG
  11. Architecture (diagram; labels only)
    - Actors: User, Developer
    - Components: GitHub, Conf, Library, Operator, …, Airflow (Pods), Kubernetes, Spark (Driver, Executors), YARN, Slack, Email, …
    - Arrow labels: Contribute, Release, Operate, Send message, Notify
  12. Result
    - Better control over operations
      - Proper, efficient configuration
    - More productive working style
      - Unified format understandable to everyone
      - Back to business logic itself
    - Easier management of ETL code
      - One centralized place
  13. Future work
    - Data lineage
      - Integrate with data catalog tool
    - Data connectivity
      - More data storage systems, cross-data-source support
    - Data quality
      - Profiling, Anomaly detection, Analyzer, Cleaning…
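
Slides 9 and 10 describe "Job as Config" and "Dynamic DAG" only in outline: complete a user config from a template, then parse the configs at runtime to create or update jobs. The sketch below is one possible reading of those steps, not the team's actual implementation. Every name in it (`JOB_TEMPLATE`, `complete_config`, `JobSpec`, `to_job`, `build_registry`) and every config key (`name`, `query`, `retries`, `upstream`) is hypothetical, and a plain dataclass stands in for the backend scheduler's job object so the example stays scheduler-independent.

```python
import json
from copy import deepcopy
from dataclasses import dataclass, field

# Hypothetical platform-side template: defaults that every job inherits.
JOB_TEMPLATE = {
    "operator": "SparkSQL",
    "engine": "Spark 3",
    "retries": 3,
    "notification": {"slack": None, "email": None},
}


def complete_config(user_config, template=JOB_TEMPLATE):
    """Return the template with the user's values merged on top (nested dicts merged key by key)."""
    merged = deepcopy(template)
    for key, value in user_config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = complete_config(value, merged[key])
        else:
            merged[key] = value
    return merged


@dataclass
class JobSpec:
    """Scheduler-independent description of one runnable job (stand-in for a real DAG)."""
    job_id: str
    operator: str
    engine: str
    query: str
    retries: int
    upstream: list = field(default_factory=list)


def to_job(config):
    """Transform a completed config into a job; the user must supply at least name and query."""
    missing = [key for key in ("name", "query") if key not in config]
    if missing:
        raise ValueError(f"config is missing required keys: {missing}")
    return JobSpec(
        job_id=config["name"],
        operator=config["operator"],
        engine=config["engine"],
        query=config["query"],
        retries=config["retries"],
        upstream=list(config.get("upstream", [])),
    )


def build_registry(raw_configs):
    """Parse each JSON config and create the job, or overwrite it if the name already exists."""
    registry = {}
    for raw in raw_configs:
        job = to_job(complete_config(json.loads(raw)))
        registry[job.job_id] = job
    return registry


# Two user-written configs: only business logic plus the overrides they care about.
configs = [
    '{"name": "daily_sales", "query": "SELECT 1"}',
    '{"name": "daily_users", "query": "SELECT 2", "retries": 5, "upstream": ["daily_sales"]}',
]
registry = build_registry(configs)
```

In this reading the user never writes scheduler code; a single loader turns config files into jobs on every parse. With Airflow as the backend, each `JobSpec` would presumably be converted into a DAG object that the scheduler can discover, but the slides do not show that step, so it is left out here.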