Slide 1

Building Data Pipelines with Apache Airflow
Takumi Sakamoto
Tokyo Airflow Meetup #1, 2017.05.11

Slide 2

About me
Takumi Sakamoto (@takus), Product Manager at KAIZEN Platform
I’m here: http://www.mindtheproduct.com/2011/10/what-exactly-is-a-product-manager/

Slide 3

Why I’m here
I used Airflow as a data engineer before my recent job change
https://goo.gl/wn6PCO

Slide 4

Joined Kaizen Platform this month. What does “Kaizen” mean?

Slide 5

Kaizen Platform
A UX optimization platform for websites
https://kaizenplatform.com/en/

Slide 6

Continuous Improvement Flow
Data processing is important for identifying UX issues & for reporting after tests.
• All clients have to do is embed a JS tag; data is collected and issues are identified automatically.
• Over 4,600 optimizers on Kaizen Platform come up with optimized design variations.
• Clients select multiple variations out of all the submissions by optimizers.
• Conduct A/B tests and keep replacing the design with the better-performing variation.
[Flow diagram: Embed JS tag → Collect data / Identify issues (Customer Success decides what to optimize) → Request design variations → Collect design variations (list of variations) → Execute tests (JS displays multiple variations)]

Slide 7

Data & its key numbers
User activity logs (PV, click, …) on our customer websites
100M+ records per day from 230+ enterprises in various industries:
Finance, Media, EC, Travel, Education, Infrastructure, Job Hunting, Real Estate, Used Car, Matchmaking/Wedding, Finance, Lead Generation, EC/Media (Gulliver International)

Slide 8

Data pipelines
Transform data from one representation to another through a series of steps
https://databricks.com/blog/2014/11/14/application-spotlight-bedrock.html

Slide 9

Why do data pipelines matter?
• Analytics & batch processing are mission critical
  • serve decision makers
  • power machine learning models that can feed into production
• Data pipelines become more complex every day
  • add new data transfers for new business logic
  • support new data sources

Slide 10

Example of data pipelines
Figure 1.1: An Example Workflow: Reporting and Predicting Online Gaming Revenue
https://medium.com/@dustinstansbury/beyond-cron-an-introduction-to-workflow-management-systems-19987afcdb5e

Slide 11

Using cron to manage data pipelines

5 0 * * * app extract_ad.sh
5 0 * * * app extract_appstore.sh
30 0 * * * app extract_cv_rate.sh
30 0 * * * app transform.sh
0 2 * * * app combine.sh
0 3 * * * app import_into_db.sh

Slide 12

Using cron becomes a headache
• It cannot handle dependencies between tasks, so you are often forced to set fixed execution times with ad-hoc guard times.
• It’s very difficult to add new jobs to a complex crontab. When should a new heavy task be scheduled? Some independent tasks share a common resource (e.g. a database), so it’s best not to overlap them.
• Hard to debug and maintain. The crontab is just a text file.
• Rich logging has to be handled externally.
• Lack of stats.
https://danidelvalle.me/2016/09/12/im-sorry-cron-ive-met-airbnbs-airflow/

Slide 13

Workflow management system (WMS)
• Manages scheduling and running of tasks in data pipelines
• Ensures jobs are ordered correctly based on dependencies
• Manages allocation of scarce resources
• Provides mechanisms for tracking the state of tasks and recovering from failure

Slide 14

Apache Airflow
• A workflow management system
• Define workflows as code
• A lot of useful features
• Built-in shiny web UI & rich CLI

Slide 15

Workflow as code
More maintainable, versionable, testable, and collaborative than configuration

# DAG (workflow)
dag = DAG('tutorial', default_args=default_args)

# Tasks
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

t3 = BashOperator(
    task_id='templated',
    bash_command="""
    {% for i in range(5) %}
        echo "{{ ds }}"
        echo "{{ macros.ds_add(ds, 7) }}"
        echo "{{ params.my_param }}"
    {% endfor %}
    """,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

# Task dependencies
t2.set_upstream(t1)
t3.set_upstream(t1)

Slide 16

Workflow as code
Dynamic workflow for dynamic infrastructure

# Create ETL tasks for ELB access logs:
# aggregate PVs after converting JSON to Parquet format
for elb in c.describe_load_balancers():
    task = HiveOperator(
        task_id='to_parquet_{}'.format(elb.LoadBalancerName),
        hql=etl_query,
        params={
            'name': elb.LoadBalancerName,
            's3_bucket': elb.AccessLog.S3BucketName,
            's3_path': elb.AccessLog.S3BucketPrefix,
        },
        dag=dag)
    task.set_downstream(aggregation_task)

NOTICE: this example doesn’t consider deleted ELBs

Slide 17

Useful feature
Resource management by “pool” to avoid putting too much load on shared resources
[Diagram: a MySqlOperator hitting MySQL; several HiveOperators hitting YARN (the resource manager); an Airflow pool limits task concurrency and is scaled dynamically via an API]
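As a rough sketch (not from the deck) of how this looks in DAG code, assuming a pool named hive_pool has already been created under Admin → Pools or with the airflow pool CLI:

from datetime import datetime
from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

dag = DAG('hive_etl', start_date=datetime(2017, 5, 1), schedule_interval='@daily')

# Heavy Hive queries share the "hive_pool" pool, so only as many of them run
# concurrently as the pool has slots; the rest stay queued.
for table in ['pv', 'click', 'conversion']:
    HiveOperator(
        task_id='to_parquet_{}'.format(table),
        hql='INSERT OVERWRITE TABLE {0}_parquet SELECT * FROM {0}'.format(table),
        pool='hive_pool',  # illustrative pool name
        dag=dag)

The pool size then becomes the single knob for how hard Airflow is allowed to hit the shared resource.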

Slide 18

Useful feature
Task callbacks for success / failure / SLA miss
https://www.slideshare.net/r39132/airflow-agari-63072756
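A minimal sketch of how these callbacks are attached; the notification functions, task, and SLA value below are illustrative, not from the deck:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def notify_success(context):
    # receives the task context when a task instance succeeds
    print('success: {}'.format(context['task_instance_key_str']))

def notify_failure(context):
    print('failure: {}'.format(context['task_instance_key_str']))

def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # called by the scheduler when tasks in this DAG miss their SLA
    print('SLA missed: {}'.format(task_list))

dag = DAG(
    'etl',
    start_date=datetime(2017, 5, 1),
    schedule_interval='@daily',
    sla_miss_callback=notify_sla_miss)

BashOperator(
    task_id='import_into_db',
    bash_command='echo "load finished"',
    sla=timedelta(hours=2),               # alert if not finished within 2 hours
    on_success_callback=notify_success,
    on_failure_callback=notify_failure,
    dag=dag)

In practice the callbacks would post to Slack, PagerDuty, Datadog, etc. instead of printing.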

Slide 19

WebUI: Workflow status
Which workflows (DAGs) or tasks failed?
https://speakerdeck.com/artwr/apache-airflow-at-airbnb-introduction-and-lessons-learned

Slide 20

WebUI: Graph view
Visualize task dependencies

Slide 21

WebUI: Gantt chart
Which task is the blocker?
https://speakerdeck.com/artwr/apache-airflow-at-airbnb-introduction-and-lessons-learned

Slide 22

WebUI: Task detail
See task metadata, rendered templates, execution logs, etc. for debugging

Slide 23

Rich CLI
Useful for re-running tasks after fixing bugs in the ETL process

# Clear task execution histories from 2017-05-01
airflow clear etl \
  --task_regex insight_ \
  --downstream \
  --start_date 2017-05-01

# Backfill the cleared tasks
airflow backfill etl \
  --start_date 2017-05-01

Slide 24

Tips: Jupyter Notebook
Useful for developing workflows interactively: use Airflow’s BigQueryHook from the notebook
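For example, a notebook cell along these lines; the connection id and query are assumptions, not from the deck:

# In a Jupyter notebook: reuse Airflow's hooks to prototype queries
# before turning them into DAG tasks.
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

bq = BigQueryHook(bigquery_conn_id='bigquery_default')

# get_pandas_df() returns a pandas DataFrame, handy for inspecting results interactively
df = bq.get_pandas_df(bql='SELECT dt, COUNT(*) AS pv FROM [project:dataset.pageviews] GROUP BY dt')
df.head()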

Slide 25

Tips: Data quality check
Anomaly detection with Apache Airflow and Datadog
[Diagram: a BigQuery Operator task with on_success / on_failure callbacks that send metrics to Datadog for anomaly detection]
https://www.datadoghq.com/blog/introducing-anomaly-detection-datadog/

Slide 26

Tips: Data quality check
Anomaly detection with Apache Airflow and Datadog

from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.contrib.hooks.datadog_hook import DatadogHook

bq = BigQueryHook()
dd = DatadogHook()

def dd_callback(context):
    # returns a dataframe of (metric_name, timestamp, numeric_value)
    df = bq.get_pandas_df(bql=get_validate_bql(context['dag'], context['task']))
    for i, c in df.iterrows():
        dd.send_metric(
            c[0],
            datapoint=(c[1], c[2]),
            tags=[
                'dag:{}'.format(context['dag']),
                'task:{}'.format(context['task']),
            ])
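One way such a callback might be wired to a task; the operator, query, and table names below are illustrative and assume a dag object is already defined:

from airflow.contrib.operators.bigquery_operator import BigQueryOperator

# After every run, dd_callback queries validation metrics for this dag/task
# and pushes them to Datadog, where an anomaly-detection monitor watches them.
insight_daily = BigQueryOperator(
    task_id='insight_daily_summary',
    bql='SELECT dt, COUNT(*) AS pv FROM [project:dataset.pageviews] GROUP BY dt',
    destination_dataset_table='project:dataset.insight_daily_summary',
    write_disposition='WRITE_TRUNCATE',
    on_success_callback=dd_callback,
    on_failure_callback=dd_callback,
    dag=dag)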

Slide 27

What didn’t work well?
• Not mature enough in v1.6.1; hit some bugs
  • may be fixed in v1.8.1?
• History tables become huge with high-frequency DAGs
  • https://github.com/teamclairvoyant/airflow-maintenance-dags
• Sensor tasks fill up all available slots
  • need to limit concurrency by pool or priority
• Timezone
  • define a macro to convert UTC to JST (see the sketch below)
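A minimal sketch of such a macro (the macro name and DAG are mine, not from the deck): register a function via user_defined_macros and call it from templates.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def utc_to_jst(dt):
    # Airflow 1.x keeps execution dates in UTC; JST is a fixed +9h offset
    return dt + timedelta(hours=9)

dag = DAG(
    'etl_jst',
    start_date=datetime(2017, 5, 1),
    schedule_interval='@daily',
    user_defined_macros={'utc_to_jst': utc_to_jst})

BashOperator(
    task_id='print_jst_date',
    # {{ execution_date }} is a datetime, so the macro can shift it directly
    bash_command='echo "JST execution date: {{ utc_to_jst(execution_date) }}"',
    dag=dag)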

Slide 28

Questions?