Speaker
Yang Xu
- Software engineer
- Data platform, data engineering center
Work
- ETL pipeline, batch ingestion
- Internal efficiency tools
Vision
- Everyone can enjoy data easily with services we provide
Slide 3
Agenda
- ETL batch at LINE
  - Status
  - Problem
- Our solution
  - Goal
  - Feature
- Future work
Slide 4
ETL batch at LINE
Volume
Jobs/day
- 100,000
Users
- 200+
Tables
- 40,000
Developer
- ETL team
- Official Account team
- Data Scientist team
- …
Toolchain
Language
- Python
- Scala
- …
Scheduler
- Azkaban
- Airflow
- OASIS
Slide 5
Problem
User
- It costs too much to
  - Learn to set up the environment
  - Design the necessary functions
  - Write code
Data Platform
- Suffers from
  - Careless configuration
  - Unoptimized queries
  - Uncontrolled operations
Slide 6
Bad practice
Slide 7
Bad practice
Slide 8
Goal
Reduce the burden of data engineering
- Encoding & compression
- Small file compaction
- …
Reduce operation cost
- Retry
- Recover
Centrally controlled
- Yet easy to use
Slide 9
Solution
- Unified
- Standardized
- Codeless
Slide 10
Unified platform
Extensibility
- Easy to define and integrate functionality
- Fits the needs of different teams
Scalability
- Convenient to scale up and out
- Prepares for increasing workload and varying conditions
High Availability
- Stable and reliable without downtime
- For example, when we perform updates and restarts
Slide 11
Airflow on Kubernetes
Extensibility → Airflow
- Easy to define and integrate functionality
- Fits the needs of different teams
Scalability → Kubernetes & Airflow
- Convenient to scale up and out
- Prepares for increasing workload and varying conditions
High Availability → Kubernetes
- Stable and reliable without downtime
- For example, when we perform updates and restarts
Codeless
Motivation
- Let users focus on business logic without worrying about coding
- Protect the cluster from unexpected user behavior
Responsibility
- Users provide reliable queries or code fragments
- We handle everything else needed to generate executable jobs
Solution: Job as Config
- Complete the config from a template
- Transform it into a real job at runtime
- Independent of the backend scheduler
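To make the "Job as Config" idea concrete, here is a minimal sketch of completing a user config from a template. It is not the platform's actual implementation: the template keys, their default values (`retries`, `orc`, `zlib`), the `daily_sales` job, and the `complete_config` helper are all hypothetical names chosen for illustration.

```python
import copy

# Hypothetical platform-owned template: defaults that users should not need to set.
JOB_TEMPLATE = {
    "retries": 3,
    "output": {"format": "orc", "compression": "zlib"},
    "tasks": [],
}


def complete_config(user_config, template=JOB_TEMPLATE):
    """Fill in every setting the user omitted, recursing into nested dicts."""
    merged = copy.deepcopy(template)  # never mutate the shared template
    for key, value in user_config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key.
            merged[key] = complete_config(value, merged[key])
        else:
            # Scalars and lists from the user replace the template value.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical user config: only the business logic plus one override.
user_config = {
    "name": "daily_sales",
    "output": {"compression": "snappy"},
    "tasks": [{"id": "aggregate", "type": "sql", "query": "SELECT 1"}],
}

job = complete_config(user_config)
print(job["retries"], job["output"])
```

Because the completed config is plain data, it can be handed to whichever scheduler backend is in use, which is one way to read "independent of the backend scheduler".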
Slide 14
Job
- Represents a complete pipeline
- Holds global configuration
Slide 15
Task
- Various types
- Parameterization
- Data validation
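As an illustration of parameterization and data validation at the task level, the sketch below renders a query from parameters and checks result rows. The helper names (`render_query`, `validate_rows`), the `${dt}` placeholder syntax, and the sample table and columns are assumptions for this example, not the platform's actual task API.

```python
from string import Template


def render_query(query, params):
    """Substitute ${name} placeholders; raises KeyError if a parameter is missing."""
    return Template(query).substitute(params)


def validate_rows(rows, required_columns, min_rows=1):
    """Return a list of problems found; an empty list means validation passed."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        for col in required_columns:
            if row.get(col) is None:
                problems.append(f"row {i}: column '{col}' is missing or null")
    return problems


# Hypothetical parameterized query and sample output rows.
query = render_query("SELECT * FROM sales WHERE dt = '${dt}'", {"dt": "2020-01-01"})
rows = [
    {"dt": "2020-01-01", "amount": 10},
    {"dt": "2020-01-01", "amount": None},
]
problems = validate_rows(rows, ["dt", "amount"])
print(query)
print(problems)
```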
Slide 16
Dynamic DAG
- One DAG + Config files
- Runtime
  - Parse config
  - Create/Update DAG
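The runtime steps above (parse config, then create or update the DAG) can be sketched in plain Python. This example deliberately avoids the Airflow API so that it stays self-contained; in Airflow, a comparable pattern is a single DAG file that builds DAG objects in a loop over config files. The config schema (`name`, `tasks`, `id`, `depends_on`) and the `build_dag` and `sync_dags` helpers are hypothetical.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_dag(config_text):
    """Parse one job config and return (job name, task ids in runnable order)."""
    config = json.loads(config_text)
    # Map each task id to the set of task ids it depends on.
    dependencies = {
        task["id"]: set(task.get("depends_on", []))
        for task in config["tasks"]
    }
    order = list(TopologicalSorter(dependencies).static_order())
    return config["name"], order


def sync_dags(registry, config_texts):
    """Create a DAG for each new config and overwrite DAGs whose config changed."""
    for text in config_texts:
        name, order = build_dag(text)
        registry[name] = order
    return registry


# Hypothetical config file content: a linear extract -> transform -> load job.
config_v1 = json.dumps({
    "name": "daily_sales",
    "tasks": [
        {"id": "extract"},
        {"id": "transform", "depends_on": ["extract"]},
        {"id": "load", "depends_on": ["transform"]},
    ],
})

registry = sync_dags({}, [config_v1])
print(registry["daily_sales"])
```

Re-running `sync_dags` with an edited config for the same job name replaces the previous entry, which mirrors the "Create/Update DAG" step.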
Slide 17
Example
Slide 18
Architecture
[Architecture diagram; only its labels survive. Actors: User, Developer. Components: GitHub, Conf, Library, Operator, …, Airflow, Pods, Kubernetes, Spark (Driver, Executors), YARN, Slack, Email, …. Action labels: Contribute, Release, Operate, Notify (send message).]
Slide 19
Result
Better control over operations
- Proper, efficient configuration
More productive working style
- Unified format understandable to everyone
- Back to business logic itself
Easier management of ETL code
- One centralized place
Slide 20
Future work
Data lineage
- Integrate with a data catalog tool
Data connectivity
- More data stores, cross-data-source support
Data quality
- Profiling, anomaly detection, analysis, cleaning, …
Slide 21
We are hiring!
- Data Engineer
- Data Platform Engineer