Introducing a Unified, Managed Workflow Service for LINE Data Platform

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. None
  2. Speaker

    - Yang Xu: Software engineer, Data platform, data engineering center
    - Work: ETL pipeline, batch ingestion; internal efficiency tools
    - Vision: Everyone can enjoy data easily with services we provide
  3. Agenda

    - ETL batch at LINE: Status, Problem
    - Our solution: Goal, Feature
    - Future work
  4. ETL batch at LINE

    - Volume: Jobs/day 100,000; Users 200+; Tables 40,000
    - Developer: ETL team, Official Account team, Data Scientist team, …
    - Toolchain: Language (Python, Scala, …); Scheduler (Azkaban, Airflow, OASIS)
  5. Problem

    - User: it costs too much to learn to set up the environment, design necessary functions, and write code
    - Data Platform: suffers from careless configuration, unoptimized queries, and wild operations
  6. Bad practice

  7. Bad practice

  8. Goal

    - Cut the burden of data engineering: Encoding & compression, Small file compaction, …
    - Relieve operation cost: Retry, Recover
    - Centrally controlled: Yet easy to use
  9. Solution Unified Standardized Codeless

  10. Unified platform

    - Extensibility: Easy to define and integrate functionalities, in order to fit needs from different teams
    - Scalability: Convenient to scale up and out, preparing for increasing workload and various conditions
    - High Availability: Stable and reliable without downtime, for example, when we perform update and restart
  11. Airflow on Kubernetes

    - Extensibility → Airflow: Easy to define and integrate functionalities, in order to fit needs from different teams
    - Scalability → Kubernetes & Airflow: Convenient to scale up and out, preparing for increasing workload and various conditions
    - High Availability → Kubernetes: Stable and reliable without downtime, for example, when we perform update and restart
  12. Standardized functionality

    - SLA: Landing Time, Execution Time
    - Data Quality: Verification, Anomaly Detection
    - Notification: Slack, Email
    - Operator: SparkSQL, DistCp
    - Dependency: In/Cross Job
    - Engine: Spark 3
  13. Codeless

    - Responsibility: Users provide reliable queries or code parts; we handle all other parts to generate executable jobs
    - Solution: Job as Config. Complete config from template; transform to real job at runtime; independent from backend scheduler
    - Motivation: Make users focus on business logic without worrying about coding; protect cluster from unexpected user behaviors
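The "complete config from template" step on the Codeless slide could look roughly like the sketch below. It is only an illustration: the template keys (`engine`, `retries`, `output`), their default values, and the function name `complete_config` are hypothetical and do not come from the talk.

```python
import copy

# Hypothetical platform-side template: defaults the platform would control.
TEMPLATE = {
    "engine": "spark3",
    "retries": 3,
    "output": {"format": "orc", "compression": "zlib"},
}


def complete_config(user_config, template=TEMPLATE):
    """Fill a user-supplied config with template defaults (user values win)."""
    # Assumption: the user must at least provide the query (business logic).
    if "query" not in user_config:
        raise ValueError("user config must provide a query")

    def merge(base, override):
        # Recursively merge nested dicts without mutating either input.
        result = copy.deepcopy(base)
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key] = merge(result[key], value)
            else:
                result[key] = copy.deepcopy(value)
        return result

    return merge(template, user_config)
```

Under these assumptions, a user would supply only the query and any overrides, for example `complete_config({"query": "SELECT 1", "output": {"compression": "snappy"}})`, and the remaining fields would come from the template.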
  14. Job

    - Stands for a complete pipeline
    - Holds global configurations
  15. Task

    - Various types
    - Parameterization
    - Data validation

  16. Dynamic DAG

    - One DAG + Config files
    - Runtime: Parse config, Create/Update DAG
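The runtime flow on the Dynamic DAG slide (parse config, then create or update the DAG) can be sketched without any scheduler, using only the Python standard library. The JSON layout (`job`, `tasks`, `name`, `depends_on`) and the function names are assumptions made for illustration; the talk does not show the actual config format, and in a real deployment the registry would hold scheduler DAG objects rather than plain task lists.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_dag(config_text):
    """Parse one job config (JSON) and return (job name, ordered task names)."""
    config = json.loads(config_text)
    # Map each task to the set of tasks it depends on.
    graph = {t["name"]: set(t.get("depends_on", [])) for t in config["tasks"]}
    unknown = {d for deps in graph.values() for d in deps} - set(graph)
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError if the tasks form a cycle.
    return config["job"], list(TopologicalSorter(graph).static_order())


def refresh(registry, config_texts):
    """Create or update one registry entry per config, keyed by job name."""
    for text in config_texts:
        name, order = build_dag(text)
        registry[name] = order
    return registry
```

Calling `refresh` again with a changed config replaces the entry for that job, which mirrors the "Create/Update" step: one generic loader, many config files.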
  17. Example

  18. Architecture (diagram)

    - Actors: User, Developer
    - Labeled flows: Contribute, Release, Operate, Send message, Notify
    - Components: Github (Conf, Library, Operator, …); Airflow (Pod, Pod, Pod); Kubernetes; Spark (Driver, Executor, Executor); YARN; Slack, Email, …
  19. Result

    - Better control on operations: Proper, efficient configuration
    - More productive working style: Unified format understandable to everyone; back to business logic itself
    - Easier management of ETL code: One centralized place
  20. Future work

    - Data lineage: Integrate with data catalog tool
    - Data connectivity: More data storages, cross data source support
    - Data quality: Profiling, Anomaly detection, Analyzer, Cleaning…
  21. We are hiring! - Data Engineer - Data Platform Engineer

  22. Thank you