Building Data Pipelines with Apache Airflow

Takumi Sakamoto

May 11, 2017

Transcript

  1. 2.

    About me Takumi Sakamoto @takus Product Manager at KAIZEN Platform

    I’m here http://www.mindtheproduct.com/2011/10/what-exactly-is-a-product-manager/
  2. 3.

    Why I’m here I used Airflow as a data engineer

    before my recent job change https://goo.gl/wn6PCO
  3. 6.

    Continuous Improvement Flow Data processing is important for identifying UX

    issues & reporting after tests. All that clients have to do is embed a JS tag; data is collected and issues are identified automatically. Over 4,600 optimizers on the Kaizen platform come up with optimized design variations, clients select multiple variations out of all the submissions, and A/B tests keep replacing the design with the better-performing variation. [Flow diagram: decide what to optimize -> embed JS tag -> collect data / identify issues -> request design variations -> execute A/B tests -> repeat]
  4. 7.

    Data & its key numbers User activity logs (PV, click

    …) on our customer websites: 100M+ records per day from 230+ enterprises across various industries (Finance, Media, EC, Travel, Education, Infrastructure, Job Hunt, Real Estate, Used Car, Matchmaking/Wedding, Lead Generation)
  5. 8.

    Data pipelines Transform data from one representation to another through

    a series of steps https://databricks.com/blog/2014/11/14/application-spotlight-bedrock.html
  6. 9.

    Why do data pipelines matter?

    • Analytics & batch processing are mission critical
      • serve decision makers
      • power machine learning models that can feed into production
    • Data pipelines become more complex every day
      • add new data transfers for new business logic
      • support new data sources
  7. 11.

    Using cron to manage data pipelines

    5  0 * * * app extract_ad.sh
    5  0 * * * app extract_appstore.sh
    30 0 * * * app extract_cv_rate.sh
    30 0 * * * app transform.sh
    0  2 * * * app combine.sh
    0  3 * * * app import_into_db.sh
  8. 12.

    Using cron becomes a headache

    • It cannot handle dependencies between tasks, so it often forces you to set fixed execution times with ad-hoc guard intervals.
    • It's very difficult to add new jobs to a complex crontab. When should a new heavy task be scheduled? Some independent tasks share a common resource (e.g. a database), so it's best not to overlap them.
    • Hard to debug and maintain: the crontab is just a text file.
    • Rich logging has to be handled externally.
    • Lack of stats.
    https://danidelvalle.me/2016/09/12/im-sorry-cron-ive-met-airbnbs-airflow/
  9. 13.

    Workflow management system (WMS)

    • Manages scheduling and running of tasks in data pipelines
    • Ensures jobs are ordered correctly based on dependencies
    • Manages allocation of scarce resources
    • Provides a mechanism for tracking the state of tasks and recovering from failure
  10. 14.

    Apache Airflow

    • A workflow management system
    • Workflows are defined as code
    • A lot of useful features
    • Built-in shiny Web UI & rich CLI
  11. 15.

    Workflow as code More maintainable, versionable, testable, and collaborative than

    configuration

    dag = DAG('tutorial', default_args=default_args)

    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)

    t2 = BashOperator(
        task_id='sleep',
        bash_command='sleep 5',
        retries=3,
        dag=dag)

    t3 = BashOperator(
        task_id='templated',
        bash_command="""
        {% for i in range(5) %}
          echo "{{ ds }}"
          echo "{{ macros.ds_add(ds, 7) }}"
          echo "{{ params.my_param }}"
        {% endfor %}
        """,
        params={'my_param': 'Parameter I passed in'},
        dag=dag)

    # Task dependencies: t1 runs first, then t2 and t3
    t2.set_upstream(t1)
    t3.set_upstream(t1)
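    A quick way to try such a DAG (assuming the file is in the dags folder) is to run a single task for one execution date from the CLI, without recording state in the database:

    airflow test tutorial print_date 2017-05-01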
  12. 16.

    Workflow as code Dynamic workflow for dynamic infrastructure

    # Create ETL tasks for ELB access logs:
    # aggregate PVs after converting JSON to Parquet format
    for elb in c.describe_load_balancers():
        task = HiveOperator(
            task_id='to_parquet_{}'.format(elb.LoadBalancerName),
            hql=etl_query,
            params={
                'name': elb.LoadBalancerName,
                's3_bucket': elb.AccessLog.S3BucketName,
                's3_path': elb.AccessLog.S3BucketPrefix,
            },
            dag=dag)
        # the aggregation runs after every per-ELB conversion task
        aggregation_task.set_upstream(task)

    NOTICE: This example doesn't consider deleted ELBs
  13. 17.

    Useful feature Resource management by "pool" for avoiding too much

    load on shared resources. [Diagram: many HiveOperator tasks are funneled through an Airflow pool that limits task concurrency toward YARN (the resource manager); a MySQL operator is throttled the same way against MySQL; pools can be resized dynamically via the API]
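    A minimal sketch of assigning a task to a pool (the pool name 'hive_pool' and the task below are hypothetical; the pool itself must first be created with a slot count via the Web UI under Admin > Pools or the "airflow pool" CLI):

    from airflow.operators.hive_operator import HiveOperator

    # No matter how many of these task instances are scheduled,
    # at most <slots of 'hive_pool'> run concurrently.
    aggregate_pv = HiveOperator(
        task_id='aggregate_pv',
        hql='INSERT OVERWRITE TABLE pv_summary SELECT ...',
        pool='hive_pool',        # throttle access to the shared Hive/YARN cluster
        priority_weight=10,      # higher weight wins when pool slots are scarce
        dag=dag)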
  14. 18.

    Useful feature Task callbacks for success / failure / SLA

    miss https://www.slideshare.net/r39132/airflow-agari-63072756
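    A minimal sketch of wiring up callbacks (the notify_* functions and the task are hypothetical; Airflow invokes each callback with a context dict containing the task instance, execution date, and more):

    from datetime import timedelta
    from airflow.operators.bash_operator import BashOperator

    def notify_failure(context):
        # e.g. page someone or post to chat; here we just print
        print('Task {} failed on {}'.format(
            context['task_instance'].task_id, context['ds']))

    def notify_success(context):
        print('Task {} succeeded'.format(context['task_instance'].task_id))

    t = BashOperator(
        task_id='load_warehouse',
        bash_command='echo load',
        on_success_callback=notify_success,
        on_failure_callback=notify_failure,
        sla=timedelta(hours=1),  # SLA misses trigger the DAG-level sla_miss_callback
        dag=dag)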
  15. 23.

    Rich CLI Useful for re-running some tasks after fixing bugs

    in the ETL process

    # Clear task execution histories from 2017-05-01
    airflow clear etl \
        --task_regex insight_ \
        --downstream \
        --start_date 2017-05-01

    # Backfill the cleared tasks
    airflow backfill etl \
        --start_date 2017-05-01
  16. 25.

    Tips: Data quality check Anomaly detection with Apache Airflow and

    Datadog. [Diagram: a BigQuery operator's on_success / on_failure callbacks send validation metrics to Datadog for anomaly detection] https://www.datadoghq.com/blog/introducing-anomaly-detection-datadog/
  17. 26.

    Tips: Data quality check Anomaly detection with Apache Airflow and

    Datadog

    from airflow.contrib.hooks.bigquery_hook import BigQueryHook
    from airflow.contrib.hooks.datadog_hook import DatadogHook

    bq = BigQueryHook()
    dd = DatadogHook()

    def dd_callback(context):
        # get_validate_bql returns a validation query for this DAG/task;
        # the query yields rows of (metric_name, timestamp, numeric_value)
        df = bq.get_pandas_df(bql=get_validate_bql(context['dag'], context['task']))
        for i, c in df.iterrows():
            dd.send_metric(
                c[0],
                datapoint=(c[1], c[2]),
                tags=[
                    'dag:{}'.format(context['dag']),
                    'task:{}'.format(context['task']),
                ])
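    To complete the picture from the previous slide, the callback would be attached to the BigQuery task roughly like this (the task and its query are hypothetical):

    from airflow.contrib.operators.bigquery_operator import BigQueryOperator

    validate_pv = BigQueryOperator(
        task_id='validate_pv',
        bql='SELECT ...',                  # the ETL output being validated
        on_success_callback=dd_callback,   # push validation metrics to Datadog
        on_failure_callback=dd_callback,
        dag=dag)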
  18. 27.

    What didn't work well?

    • Not mature enough in v1.6.1; we hit some bugs
      • may be fixed in v1.8.1?
    • History tables grow huge with high-frequency DAGs
      • https://github.com/teamclairvoyant/airflow-maintenance-dags
    • Sensor tasks fill up all available slots
      • need to limit concurrency by pool or priority
    • Timezone
      • define a macro to convert UTC to JST (see the sketch below)
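    A minimal sketch of such a macro, registered through the DAG's user_defined_macros (the DAG id and the macro name as_jst are hypothetical; Airflow 1.x execution dates are naive UTC datetimes):

    from datetime import datetime, timedelta
    from airflow import DAG

    def as_jst(dt):
        # shift a naive UTC datetime to JST (UTC+9)
        return dt + timedelta(hours=9)

    dag = DAG(
        'jst_example',
        start_date=datetime(2017, 5, 1),
        user_defined_macros={'as_jst': as_jst})

    # Any templated field can then render JST timestamps, e.g.:
    #   bash_command='echo {{ as_jst(execution_date) }}'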