Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Speaker
Yang Xu
- Software engineer
- Data platform, data engineering center
Work
- ETL pipeline, batch ingestion
- Internal efficiency tools
Vision
- Everyone can enjoy data easily with services we provide

Slide 3

Slide 3 text

Agenda
- ETL batch at LINE
- Status
- Problem
- Our solution
- Goal
- Feature
- Future work

Slide 4

Slide 4 text

ETL batch at LINE
Volume
- Jobs/day: 100,000
- Users: 200+
- Tables: 40,000
Developer
- ETL team
- Official Account team
- Data Scientist team
- …
Toolchain
- Language
  - Python
  - Scala
  - …
- Scheduler
  - Azkaban
  - Airflow
  - OASIS

Slide 5

Slide 5 text

Problem
User
- It costs too much to
  - Learn to set up the environment
  - Design the necessary functions
  - Write code
Data Platform
- Suffers from
  - Careless configuration
  - Unoptimized queries
  - Wild operations

Slide 6

Slide 6 text

Bad practice

Slide 7

Slide 7 text

Bad practice

Slide 8

Slide 8 text

Goal
Cut off burden of data engineering
- Encoding & compression
- Small file compaction
- …
Relieve operation cost
- Retry
- Recover
Central controlled
- Yet easy to use

Slide 9

Slide 9 text

Solution
- Unified
- Standardized
- Codeless

Slide 10

Slide 10 text

Unified platform
Extensibility
- Easy to define and integrate functionalities
- In order to fit needs from different teams
Scalability
- Convenient to scale up and out
- Preparing for increasing workload and various conditions
High Availability
- Stable and reliable without downtime
- For example, when we perform update and restart

Slide 11

Slide 11 text

Airflow on Kubernetes
Extensibility -> Airflow
- Easy to define and integrate functionalities
- In order to fit needs from different teams
Scalability -> Kubernetes & Airflow
- Convenient to scale up and out
- Preparing for increasing workload and various conditions
High Availability -> Kubernetes
- Stable and reliable without downtime
- For example, when we perform update and restart

Slide 12

Slide 12 text

Standardized functionality
SLA
- Landing Time
- Execution Time
Data Quality
- Verification
- Anomaly Detection
Notification
- Slack
- Email
Operator
- SparkSQL
- DistCp
Dependency
- In/Cross Job
Engine
- Spark 3
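The slide names two SLA dimensions, landing time and execution time. A minimal sketch of what such a check could look like follows; the class and field names (`Sla`, `landing_deadline`, `max_execution`) are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sla:
    # Hypothetical fields, chosen for illustration only.
    landing_deadline: datetime  # output data must be available by this time
    max_execution: timedelta    # a single run must not take longer than this


def check_sla(sla: Sla, started: datetime, finished: datetime) -> list[str]:
    """Return the violated SLA dimensions (an empty list if none)."""
    violations = []
    if finished > sla.landing_deadline:
        violations.append("landing_time")
    if finished - started > sla.max_execution:
        violations.append("execution_time")
    return violations
```

A non-empty result would be the natural trigger for the Slack or email notification listed on the same slide.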

Slide 13

Slide 13 text

Codeless
Motivation
- Make users focus on business logic without worrying about coding
- Protect the cluster from unexpected user behaviors
Responsibility
- Users provide reliable queries or code parts
- We handle all other parts to generate executable jobs
Solution: Job as Config
- Complete config from template
- Transform to real job at runtime
- Independent from backend scheduler
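The "Job as Config" steps can be sketched roughly as below. This is an illustration under assumptions, not the platform's implementation: the template keys (`engine`, `retries`, `notify`) and the function names are hypothetical.

```python
import copy

# Hypothetical template holding platform-controlled defaults.
TEMPLATE = {
    "engine": "spark3",
    "retries": 2,
    "notify": ["slack"],
}


def complete_config(user_config: dict) -> dict:
    """Complete a user config from the template; user values win over defaults."""
    if "query" not in user_config:
        raise ValueError("user config must provide a query")
    config = copy.deepcopy(TEMPLATE)
    config.update(user_config)
    return config


def to_job(config: dict) -> dict:
    """Turn a completed config into a scheduler-neutral job description.

    A scheduler-specific adapter (not shown) would map this dict onto
    whatever the backend scheduler expects.
    """
    return {
        "operator": "SparkSQL",
        "engine": config["engine"],
        "query": config["query"],
        "retries": config["retries"],
        "notify": config["notify"],
    }
```

With this split the user writes only the query, for example `to_job(complete_config({"query": "SELECT 1"}))`, while defaults such as retries stay under central control.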

Slide 14

Slide 14 text

Job
- Stands for a complete pipeline
- Holds global configurations

Slide 15

Slide 15 text

Task
- Various types
- Parameterization
- Data validation
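Parameterization and data validation at the task level might look like the following sketch. The `$name` placeholder syntax and the row-count rule are assumptions made for illustration; the talk does not specify either.

```python
from string import Template


def render_query(query_template: str, params: dict) -> str:
    """Substitute $name placeholders; raises KeyError if a parameter is missing."""
    return Template(query_template).substitute(params)


def validate_row_count(row_count: int, minimum: int = 1) -> bool:
    """A deliberately simple validation rule: the output must have enough rows."""
    return row_count >= minimum
```

For example, `render_query("SELECT * FROM logs WHERE dt = '$dt'", {"dt": "2021-01-01"})` fills in the partition date before the query is handed to the engine.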

Slide 16

Slide 16 text

Dynamic DAG
- One DAG + Config files
- Runtime
  - Parse config
  - Create/Update DAG
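The parse-then-create/update loop can be sketched without any scheduler dependency. Everything here is an assumption for illustration: JSON as the config format, the `job`, `tasks`, and `depends_on` keys, and the in-memory `dags` registry standing in for the scheduler.

```python
import json
from graphlib import TopologicalSorter

# In-memory stand-in for the scheduler's DAG registry (hypothetical).
dags: dict[str, dict] = {}


def parse_config(text: str) -> dict:
    """Parse one job config and derive a valid task execution order."""
    config = json.loads(text)
    # Map each task name to the set of tasks it depends on.
    graph = {t["name"]: set(t.get("depends_on", [])) for t in config["tasks"]}
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    order = list(TopologicalSorter(graph).static_order())
    return {"job": config["job"], "order": order}


def sync_dag(text: str) -> str:
    """Create the DAG if the job is new, otherwise replace the existing one."""
    dag = parse_config(text)
    action = "updated" if dag["job"] in dags else "created"
    dags[dag["job"]] = dag
    return action
```

In a real Airflow deployment the registry would be replaced by DAG objects built from the same parsed configs; this sketch only shows the control flow.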

Slide 17

Slide 17 text

Example

Slide 18

Slide 18 text

Architecture (diagram; only its labels survive)
- Actors: User, Developer
- Actions: Contribute, Release, Operate, Notify, Send message
- Components: Github, Conf, Library, Operator, …, Airflow, Kubernetes (Pod, Pod, Pod), YARN, Spark (Driver, Executor, Executor), Slack, Email, …

Slide 19

Slide 19 text

Result
Better control on operations
- Proper, efficient configuration
More productive working style
- Unified format understandable to everyone
- Back to business logic itself
Easier management of ETL code
- One centralized place

Slide 20

Slide 20 text

Future work
Data lineage
- Integrate with data catalog tool
Data connectivity
- More data storages, cross data source support
Data quality
- Profiling, Anomaly detection, Analyzer, Cleaning…

Slide 21

Slide 21 text

We are hiring!
- Data Engineer
- Data Platform Engineer

Slide 22

Slide 22 text

Thank you