Introducing a Unified, Managed Workflow Service for LINE Data Platform

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Speaker
    Yang Xu
    - Software engineer
    - Data platform, data engineering center
    Work
    - ETL pipeline, batch ingestion
    - Internal efficiency tools
    Vision
    - Everyone can enjoy data easily with services we provide

  2. Agenda
    - ETL batch at LINE
    - Status
    - Problem
    - Our solution
    - Goal
    - Feature
    - Future work

  3. ETL batch at LINE
    Volume
    Jobs/day
    - 100,000
    Users
    - 200+
    Tables
    - 40,000
    Developer
    - ETL team
    - Official Account team
    - Data Scientist team
    - …
    Toolchain
    Language
    - Python
    - Scala
    - …
    Scheduler
    - Azkaban
    - Airflow
    - OASIS

  4. Problem
    User
    - It costs too much to
      - Learn to set up the environment
      - Design necessary functions
      - Write code
    Data Platform
    - Suffers from
      - Careless configuration
      - Unoptimized queries
      - Uncontrolled operations

  5. Bad practice

  6. Bad practice

  7. Goal
    Cut the burden of data engineering
    - Encoding & compression
    - Small file compaction
    - …
    Reduce operation cost
    - Retry
    - Recover
    Centrally controlled
    - Yet easy to use

  8. Solution
    - Unified
    - Standardized
    - Codeless

  9. Unified platform
    Extensibility
    - Easy to define and integrate functionalities
    - In order to fit needs from different teams
    Scalability
    - Convenient to scale up and out
    - Preparing for increasing workload and various conditions
    High Availability
    - Stable and reliable without downtime
    - For example, when we perform update and restart

  10. Airflow on Kubernetes
    Extensibility → Airflow
    - Easy to define and integrate functionalities
    - In order to fit needs from different teams
    Scalability → Kubernetes & Airflow
    - Convenient to scale up and out
    - Preparing for increasing workload and various conditions
    High Availability → Kubernetes
    - Stable and reliable without downtime
    - For example, when we perform update and restart

  11. Standardized functionality
    SLA
    • Landing Time
    • Execution Time
    Data Quality
    • Verification
    • Anomaly Detection
    Notification
    • Slack
    • Email
    Operator
    • SparkSQL
    • DistCp
    Dependency
    • In/Cross Job
    Engine
    • Spark 3
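
The two SLA dimensions named on this slide, landing time and execution time, can be illustrated with a small check. This is only a sketch: `SlaPolicy`, `check_sla`, and the field names are hypothetical and are not taken from the talk, which does not show how the platform evaluates SLAs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class SlaPolicy:
    # Hypothetical fields: the slide only names "Landing Time" and "Execution Time".
    landing_deadline: datetime   # the output data must be available by this moment
    max_execution: timedelta     # a single run may not take longer than this


def check_sla(policy: SlaPolicy, started: datetime, finished: datetime) -> List[str]:
    """Return human-readable SLA violations for one run (empty list if none)."""
    violations = []
    if finished > policy.landing_deadline:
        violations.append(
            f"landing time missed by {finished - policy.landing_deadline}"
        )
    duration = finished - started
    if duration > policy.max_execution:
        violations.append(
            f"execution time exceeded by {duration - policy.max_execution}"
        )
    return violations


policy = SlaPolicy(
    landing_deadline=datetime(2021, 11, 10, 6, 0),
    max_execution=timedelta(minutes=30),
)
# A run that finishes after the deadline and also runs too long.
late_run = check_sla(policy, datetime(2021, 11, 10, 5, 40), datetime(2021, 11, 10, 6, 15))
# A run that satisfies both limits.
on_time_run = check_sla(policy, datetime(2021, 11, 10, 5, 0), datetime(2021, 11, 10, 5, 20))
```

A non-empty result could then be handed to whichever notification channel the job is configured with, such as Slack or email.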

  12. Codeless
    Motivation
    - Let users focus on business logic without worrying about coding
    - Protect the cluster from unexpected user behavior
    Responsibility
    - Users provide reliable queries or code parts
    - We handle all other parts to generate executable jobs
    Solution: Job as Config
    - Complete config from template
    - Transform to real job at runtime
    - Independent from backend scheduler
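
The "Job as Config" idea can be sketched as two steps: fill a user-supplied config from a platform-owned template, then turn the result into a scheduler-neutral job description. Everything here is an assumption made for illustration: the template keys, `complete_config`, and `to_scheduler_neutral_job` are hypothetical, and the talk does not show the actual config format.

```python
import copy
from typing import Any, Dict

# Hypothetical platform-owned template: defaults the user does not have to write.
JOB_TEMPLATE: Dict[str, Any] = {
    "engine": "spark3",
    "retries": 3,
    "notify": {"slack": None, "email": None},
    "tasks": [],
}

REQUIRED_USER_KEYS = ("name", "tasks")


def complete_config(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Fill a user config from the template; user values override defaults."""
    missing = [k for k in REQUIRED_USER_KEYS if k not in user_config]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    job = copy.deepcopy(JOB_TEMPLATE)  # never mutate the shared template
    for key, value in user_config.items():
        if isinstance(value, dict) and isinstance(job.get(key), dict):
            job[key].update(value)  # merge one level of nested settings
        else:
            job[key] = value
    return job


def to_scheduler_neutral_job(job: Dict[str, Any]) -> Dict[str, Any]:
    """Describe the job as plain data that any scheduler backend could consume."""
    return {
        "job_id": job["name"],
        "steps": [
            {
                "step_id": job["name"] + "." + task["id"],
                "type": task["type"],
                "retries": job["retries"],
                "payload": task.get("query"),
            }
            for task in job["tasks"]
        ],
    }


user_config = {
    "name": "daily_sales",
    "notify": {"slack": "#etl-alerts"},
    "tasks": [{"id": "aggregate", "type": "sparksql", "query": "SELECT 1"}],
}
job = complete_config(user_config)
spec = to_scheduler_neutral_job(job)
```

Keeping the output as plain data is one way to stay independent of the backend scheduler: only a thin adapter would need to know about Airflow or any other engine.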

  13. Job
    - Stands for a complete pipeline
    - Holds global configurations

  14. Task
    - Various types
    - Parameterization
    - Data validation
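
Parameterization and data validation at the task level might look like the following sketch. It assumes a `$name` placeholder syntax (Python's `string.Template`) and a validation step expressed as named predicates over the task output. Both are illustrative choices, and `render_query` and `validate_rows` are hypothetical names, not the platform's API.

```python
from string import Template
from typing import Callable, Dict, List


def render_query(query_template: str, params: Dict[str, str]) -> str:
    """Substitute $name placeholders; raises KeyError if a parameter is missing."""
    return Template(query_template).substitute(params)


def validate_rows(
    rows: List[dict],
    checks: Dict[str, Callable[[List[dict]], bool]],
) -> List[str]:
    """Run each named check against the task output; return the names that failed."""
    return [name for name, check in checks.items() if not check(rows)]


query = render_query(
    "SELECT user_id, amount FROM sales WHERE dt = '$dt'",
    {"dt": "2021-11-10"},
)

# Stand-in for the task's output; a real task would read this from the engine.
rows = [{"user_id": 1, "amount": 120}, {"user_id": 2, "amount": None}]
failed = validate_rows(
    rows,
    {
        "non_empty": lambda rs: len(rs) > 0,
        "amount_not_null": lambda rs: all(r["amount"] is not None for r in rs),
    },
)
```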

  15. Dynamic DAG
    - One DAG + Config files
    - Runtime
      - Parse config
      - Create/Update DAG
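
The runtime steps above (parse config, then create or update the DAG) can be sketched with the standard library alone. The JSON layout, `build_dag`, and `load_dags` are assumptions made for illustration. In an Airflow deployment the registry would typically hold DAG objects rather than plain dictionaries, but the talk does not show that code.

```python
import json
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List


def build_dag(config: dict) -> Dict[str, List[str]]:
    """Map each task id to the ids it depends on, rejecting unknown references."""
    task_ids = {task["id"] for task in config["tasks"]}
    graph: Dict[str, List[str]] = {}
    for task in config["tasks"]:
        deps = task.get("depends_on", [])
        unknown = [d for d in deps if d not in task_ids]
        if unknown:
            raise ValueError(f"{task['id']} depends on unknown tasks: {unknown}")
        graph[task["id"]] = list(deps)
    return graph


def load_dags(config_texts: Dict[str, str]) -> Dict[str, Dict[str, List[str]]]:
    """Parse every config and create (or overwrite) the DAG registered under its name."""
    registry: Dict[str, Dict[str, List[str]]] = {}
    for text in config_texts.values():
        config = json.loads(text)
        registry[config["name"]] = build_dag(config)
    return registry


# Hypothetical config file contents, keyed by file name.
configs = {
    "daily_sales.json": json.dumps({
        "name": "daily_sales",
        "tasks": [
            {"id": "load"},
            {"id": "aggregate", "depends_on": ["load"]},
            {"id": "validate", "depends_on": ["aggregate"]},
        ],
    }),
}
registry = load_dags(configs)
# TopologicalSorter takes a mapping of node -> predecessors and yields a valid run order.
order = list(TopologicalSorter(registry["daily_sales"]).static_order())
```

Because the DAGs are rebuilt from the config files on each load, changing a config changes the pipeline without anyone editing scheduler code.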

  16. Architecture
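    [Architecture diagram; only the labels survive]
    - People: User, Developer
    - Actions: Contribute, Release, Notify / Send message, Operate Spark
    - Code: Github, Conf, Library, Operator, …
    - Runtime: Airflow, Kubernetes (Pod, Pod, Pod), YARN (Driver, Executor, Executor)
    - Notification targets: Slack, Email, …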

  17. Result
    Better control on operations
    - Proper, efficient configuration
    More productive working style
    - Unified format understandable to everyone
    - Back to business logic itself
    Easier management of ETL code
    - One centralized place

  18. Future work
    Data lineage
    - Integrate with data catalog tool
    Data connectivity
    - More data storages, cross data source support
    Data quality
    - Profiling, Anomaly detection, Analyzer, Cleaning…

  19. We are hiring!
    - Data Engineer
    - Data Platform Engineer
