Introducing a Unified, Managed Workflow Service for LINE Data Platform

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Speaker
    - Yang Xu
      - Software engineer
      - Data platform, data engineering center
    - Work
      - ETL pipeline, batch ingestion
      - Internal efficiency tools
    - Vision
      - Everyone can enjoy data easily with services we provide
  2. Agenda
    - ETL batch at LINE
      - Status
      - Problem
    - Our solution
      - Goal
      - Feature
    - Future work
  3. ETL batch at LINE
    - Volume
      - Jobs/day: 100,000
      - Users: 200+
      - Tables: 40,000
    - Developer
      - ETL team
      - Official Account team
      - Data Scientist team
      - …
    - Toolchain
      - Language: Python, Scala, …
      - Scheduler: Azkaban, Airflow, OASIS
  4. Problem
    - User: too costly to
      - Learn to set up the environment
      - Design necessary functions
      - Write code
    - Data Platform: suffers from
      - Careless configuration
      - Unoptimized queries
      - Uncontrolled operations
  5. Goal
    - Cut the burden of data engineering
      - Encoding & compression
      - Small file compaction
      - …
    - Relieve operation cost
      - Retry
      - Recover
    - Centrally controlled
      - Yet easy to use
  6. Unified platform
    - Extensibility
      - Easy to define and integrate functionalities
      - In order to fit needs from different teams
    - Scalability
      - Convenient to scale up and out
      - Preparing for increasing workload and various conditions
    - High Availability
      - Stable and reliable without downtime
      - For example, when we perform update and restart
  7. Airflow on Kubernetes
    - Extensibility → Airflow
      - Easy to define and integrate functionalities
      - In order to fit needs from different teams
    - Scalability → Kubernetes & Airflow
      - Convenient to scale up and out
      - Preparing for increasing workload and various conditions
    - High Availability → Kubernetes
      - Stable and reliable without downtime
      - For example, when we perform update and restart
  8. Standardized functionality
    - SLA: Landing Time, Execution Time
    - Data Quality: Verification, Anomaly Detection
    - Notification: Slack, Email
    - Operator: SparkSQL, DistCp
    - Dependency: In/Cross Job
    - Engine: Spark 3
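The two SLA types on this slide can be illustrated with a small check. This is a minimal sketch, not the platform's implementation: it assumes "landing time" means a deadline by which the output must be available and "execution time" means a cap on run duration, and the function name `check_sla` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def check_sla(started_at, finished_at, landing_deadline, max_execution):
    """Return the names of the SLAs a finished run violated (assumed semantics)."""
    violations = []
    # Landing time: the data must have landed by the deadline.
    if finished_at > landing_deadline:
        violations.append("landing_time")
    # Execution time: the run itself must not take longer than allowed.
    if finished_at - started_at > max_execution:
        violations.append("execution_time")
    return violations


day = datetime(2021, 11, 10)
late_run = check_sla(
    started_at=day.replace(hour=1),
    finished_at=day.replace(hour=3, minute=30),
    landing_deadline=day.replace(hour=3),
    max_execution=timedelta(hours=2),
)
ok_run = check_sla(
    started_at=day.replace(hour=1),
    finished_at=day.replace(hour=2, minute=30),
    landing_deadline=day.replace(hour=3),
    max_execution=timedelta(hours=2),
)
```

A non-empty result is what a notification step (Slack or email on the slide) would act on.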
  9. Codeless
    - Responsibility
      - Users provide reliable queries or code parts
      - We handle all other parts to generate executable jobs
    - Solution: Job as Config
      - Complete config from template
      - Transform to real job at runtime
      - Independent from backend scheduler
    - Motivation
      - Let users focus on business logic without worrying about coding
      - Protect cluster from unexpected user behaviors
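The "Job as Config" idea (complete a config from a template, then transform it into a job at runtime) might look roughly like the sketch below. Everything here is an assumption for illustration: the template keys, the required fields, and the helper names `complete_config` and `to_job` are hypothetical and do not come from the slides.

```python
import copy

# Hypothetical platform-owned template: defaults the user never has to write.
TEMPLATE = {
    "engine": "spark3",
    "operator": "SparkSQL",
    "retries": 3,
    "notification": {"slack": None, "email": None},
}

# Hypothetical minimum a user must supply: a job name and the business query.
REQUIRED = ("name", "query")


def complete_config(user_conf, template=TEMPLATE):
    """Fill a user's partial config with template defaults (user values win)."""
    missing = [key for key in REQUIRED if key not in user_conf]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    conf = copy.deepcopy(template)  # never mutate the shared template
    for key, value in user_conf.items():
        if isinstance(value, dict) and isinstance(conf.get(key), dict):
            conf[key].update(value)  # merge one level of nested settings
        else:
            conf[key] = value
    return conf


def to_job(conf):
    """Turn a completed config into a scheduler-neutral job description."""
    return {
        "job_id": conf["name"],
        "task": {
            "type": conf["operator"],
            "engine": conf["engine"],
            "sql": conf["query"],
        },
        "retries": conf["retries"],
        # Keep only the notification channels that are actually set.
        "notify": {k: v for k, v in conf["notification"].items() if v},
    }


user_conf = {
    "name": "daily_active_users",
    "query": "SELECT 1",  # placeholder for the user's business query
    "notification": {"slack": "#etl-alerts"},
}
job = to_job(complete_config(user_conf))
```

Because `to_job` returns plain data rather than scheduler objects, the same completed config could in principle be handed to a different backend, which is one way to read "independent from backend scheduler".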
  10. Dynamic DAG
    - One DAG + Config files
    - Runtime
      - Parse config
      - Create/Update DAG
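The runtime loop on this slide (parse each config, then create or update the matching DAG) can be sketched without Airflow itself. The sketch assumes JSON config files with `name`, `schedule`, and `tasks` keys and keeps DAG entries in a plain dict; both are assumptions made so the example is self-contained. In Airflow, the corresponding step would build DAG objects inside a DAG file so the scheduler can discover them.

```python
import hashlib
import json


def parse_config(text):
    """Parse one config file's text; the 'name' key identifies the DAG."""
    conf = json.loads(text)
    return conf["name"], conf


def sync_dags(config_texts, registry):
    """Create or update one DAG entry per config and report what changed."""
    changes = {}
    for text in config_texts:
        name, conf = parse_config(text)
        # A content hash tells us whether an existing DAG needs an update.
        digest = hashlib.sha256(
            json.dumps(conf, sort_keys=True).encode()
        ).hexdigest()
        current = registry.get(name)
        if current is None:
            changes[name] = "created"
        elif current["digest"] != digest:
            changes[name] = "updated"
        else:
            continue  # unchanged config: leave the DAG as is
        registry[name] = {
            "digest": digest,
            "schedule": conf.get("schedule"),
            "tasks": list(conf["tasks"]),
        }
    return changes


registry = {}
configs = [
    '{"name": "ingest_logs", "schedule": "@daily", "tasks": ["extract", "load"]}',
    '{"name": "build_report", "schedule": "@hourly", "tasks": ["aggregate"]}',
]
first = sync_dags(configs, registry)

# Editing one config file only touches the DAG generated from it.
configs[1] = (
    '{"name": "build_report", "schedule": "@hourly", '
    '"tasks": ["aggregate", "verify"]}'
)
second = sync_dags(configs, registry)
```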
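  11. Architecture (diagram; labels only)
    - Actors: User, Developer
    - Flows: Contribute, Release, Operate, Notify, Send message
    - Code: Conf, Library, Operator, … (Github)
    - Scheduler: Airflow, Pod, Pod, Pod (Kubernetes)
    - Compute: Spark Driver, Executor, Executor (YARN)
    - Notification: Slack, Email, …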
  12. Result
    - Better control on operations
      - Proper, efficient configuration
    - More productive working style
      - Unified format understandable to everyone
      - Back to business logic itself
    - Easier management of ETL code
      - One centralized place
  13. Future work
    - Data lineage
      - Integrate with data catalog tool
    - Data connectivity
      - More data storages, cross data source support
    - Data quality
      - Profiling, Anomaly detection, Analyzer, Cleaning…