Building Data Pipelines with Apache Airflow

Takumi Sakamoto

May 11, 2017


  1. Building Data Pipelines with Apache Airflow
     Takumi Sakamoto, Tokyo Airflow Meetup #1, 2017.05.11
  2. About me: Takumi Sakamoto (@takus), Product Manager at Kaizen Platform
  3. Why I'm here: I used Airflow as a data engineer before my recent job change
  4. I joined Kaizen Platform this month. What does "Kaizen" mean?

  5. Kaizen Platform: a UX optimization platform for websites

  6. Continuous Improvement Flow
     Data processing is important for identifying UX issues & for reporting after tests.
     • Embed JS tag: all clients have to do is embed a tag
     • Collect data / identify issues: data is collected and issues are identified automatically
     • Collect design variations: over 4,600 optimizers on the Kaizen platform come up with
       optimized design variations; clients select multiple variations out of all the submissions
     • Execute tests: conduct A/B tests and keep replacing the design with the better-performing
       variation, using JS to display multiple variations
  7. Data & its key numbers
     User activity logs (PV, click, ...) on our customers' websites:
     100M+ records per day from 230+ enterprises across various industries
     (Finance, Media, EC, Travel, Education, Infrastructure, Job Hunting,
     Real Estate, Used Cars, Matchmaking/Wedding, Lead Generation)
  8. Data pipelines: transform data from one representation to another through a series of steps
  9. Why do data pipelines matter?
     • Analytics & batch processing are mission critical
       • they serve decision makers
       • they power machine learning models that can feed into production
     • Data pipelines become more complex every day
       • add new data transfers for new business logic
       • support new data sources
  10. Example of data pipelines
      Figure 1.1: An Example Workflow: Reporting and Predicting Online Gaming Revenue
  11. Using cron to manage data pipelines

      5 0 * * * app
      5 0 * * * app
      30 0 * * * app
      30 0 * * * app
      0 2 * * * app
      0 3 * * * app
  12. Using cron becomes a headache
      • It cannot handle dependencies between tasks, so it often forces you to set up
        fixed execution times with ad-hoc guard intervals.
      • It's very difficult to add new jobs to a complex crontab. When should a new heavy
        task be scheduled? Some independent tasks share a common resource (e.g. a database),
        so it's best not to overlap them.
      • Hard to debug and maintain: the crontab is just a text file.
      • Rich logging has to be handled externally.
      • Lack of statistics.
  13. Workflow management system (WMS)
      • Manages scheduling and running of tasks in data pipelines
      • Ensures jobs are ordered correctly based on dependencies
      • Manages allocation of scarce resources
      • Provides mechanisms for tracking the state of tasks and recovering from failure
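      Ordering jobs by their dependencies boils down to a topological sort of the task
      graph. A minimal, self-contained sketch of that idea (plain Python, not Airflow's
      actual scheduler code; the task names are illustrative):

      ```python
      from collections import deque

      def topological_order(tasks, deps):
          """Return a valid execution order for tasks.

          `deps` maps each task to the set of tasks it depends on.
          Conceptually this is what a WMS does when deciding which
          task may run next.
          """
          indegree = {t: len(deps.get(t, ())) for t in tasks}
          dependents = {t: [] for t in tasks}
          for t, upstream in deps.items():
              for u in upstream:
                  dependents[u].append(t)
          ready = deque(t for t in tasks if indegree[t] == 0)
          order = []
          while ready:
              t = ready.popleft()
              order.append(t)
              for d in dependents[t]:
                  indegree[d] -= 1
                  if indegree[d] == 0:
                      ready.append(d)
          if len(order) != len(tasks):
              raise ValueError("cycle detected: not a DAG")
          return order

      # extract -> transform -> {report, predict}
      order = topological_order(
          ["extract", "transform", "report", "predict"],
          {"transform": {"extract"},
           "report": {"transform"},
           "predict": {"transform"}},
      )
      ```

      Tasks with no remaining upstream dependencies are "ready" and can run,
      possibly in parallel.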
  14. Apache Airflow
      • A workflow management system
      • define workflows as code
      • a lot of useful features
      • built-in shiny Web UI & rich CLI
  15. Workflow as code
      More maintainable, versionable, testable, and collaborative than configuration.

      dag = DAG('tutorial', default_args=default_args)

      t1 = BashOperator(
          task_id='print_date',
          bash_command='date',
          dag=dag)

      t2 = BashOperator(
          task_id='sleep',
          bash_command='sleep 5',
          retries=3,
          dag=dag)

      t3 = BashOperator(
          task_id='templated',
          bash_command="""
          {% for i in range(5) %}
              echo "{{ ds }}"
              echo "{{ macros.ds_add(ds, 7) }}"
              echo "{{ params.my_param }}"
          {% endfor %}
          """,
          params={'my_param': 'Parameter I passed in'},
          dag=dag)

      t2.set_upstream(t1)
      t3.set_upstream(t1)

      The Python code defines the DAG (workflow) and the task dependencies.
  16. Workflow as code
      Dynamic workflows for dynamic infrastructure:

      # Create ETL tasks for ELB access logs:
      # aggregate PVs after converting JSON to Parquet format
      for elb in c.describe_load_balancers():
          task = HiveOperator(
              task_id='to_parquet_{}'.format(elb.LoadBalancerName),
              hql=etl_query,
              params={
                  'name': elb.LoadBalancerName,
                  's3_bucket': elb.AccessLog.S3BucketName,
                  's3_path': elb.AccessLog.S3BucketPrefix,
              },
              dag=dag)
          aggregation_task.set_upstream(task)

      NOTICE: this example doesn't consider deleted ELBs.
  17. Useful feature: resource management by "pool"
      A pool limits task concurrency, avoiding too much load on a shared resource:
      • a MySqlOperator task hitting MySQL
      • many HiveOperator tasks hitting YARN (the resource manager)
      Pool sizes can be adjusted dynamically, e.g. when scaling Airflow.
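      Conceptually, a pool behaves like a counting semaphore over task slots. A toy
      illustration of that idea in plain Python (not Airflow's implementation; the
      pool name and slot count are made up):

      ```python
      import threading

      class Pool:
          """Toy model of an Airflow pool: at most `slots` tasks run at once."""
          def __init__(self, slots):
              self._sem = threading.Semaphore(slots)
              self._lock = threading.Lock()
              self._running = 0
              self.max_running = 0

          def run(self, task):
              with self._sem:                  # blocks while the pool is full
                  with self._lock:
                      self._running += 1
                      self.max_running = max(self.max_running, self._running)
                  try:
                      task()
                  finally:
                      with self._lock:
                          self._running -= 1

      # like a pool named "mysql_pool" configured with 2 slots
      mysql_pool = Pool(slots=2)
      threads = [
          threading.Thread(target=mysql_pool.run, args=(lambda: None,))
          for _ in range(8)
      ]
      for t in threads:
          t.start()
      for t in threads:
          t.join()
      ```

      In Airflow itself you would assign operators to a named pool so the scheduler
      enforces the limit; the semaphore above only illustrates the bounded-concurrency
      behaviour.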
  18. Useful feature: task callbacks for success / failure / SLA misses
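      A callback is just a Python callable that receives the task context. A minimal
      sketch (the flat context keys here are simplified for illustration; the real
      Airflow context carries richer objects such as the task instance):

      ```python
      def notify_failure(context):
          """Intended to be wired up via an operator's on_failure_callback."""
          message = "Task {task} in DAG {dag} failed".format(
              task=context.get('task_id', 'unknown'),
              dag=context.get('dag_id', 'unknown'),
          )
          # Here you might post to Slack, PagerDuty, Datadog, etc.
          return message

      # Attached to an operator, hypothetically:
      # BashOperator(task_id='etl', bash_command='...',
      #              on_failure_callback=notify_failure, dag=dag)

      print(notify_failure({'task_id': 'etl', 'dag_id': 'tutorial'}))
      ```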

  19. WebUI: Workflow status. Which workflows (DAGs) or tasks failed?

  20. WebUI: Graph view Visualize task dependencies

  21. WebUI: Gantt. Which task is the blocker?

  22. WebUI: Task detail. See task metadata, rendered templates, execution logs, etc.
      for debugging
  23. Rich CLI
      Useful for re-running tasks after fixing bugs in the ETL process:

      # Clear task execution histories from 2017-05-01
      airflow clear etl \
          --task_regex insight_ \
          --downstream \
          --start_date 2017-05-01

      # Backfill the cleared tasks
      airflow backfill etl \
          --start_date 2017-05-01
  24. Tips: Jupyter Notebook
      Useful for developing workflows interactively; use BigQueryHook on Airflow
  25. Tips: Data quality check
      Anomaly detection with Apache Airflow and Datadog: a BigQuery operator
      with on_success / on_failure callbacks
  26. Tips: Data quality check
      Anomaly detection with Apache Airflow and Datadog:

      from airflow.contrib.hooks.bigquery_hook import BigQueryHook
      from airflow.contrib.hooks.datadog_hook import DatadogHook

      bq = BigQueryHook()
      dd = DatadogHook()

      def dd_callback(context):
          # returns a dataframe of (metric_name, timestamp, numeric_value)
          df = bq.get_pandas_df(bql=get_validate_bql(context['dag'], context['task']))
          for i, c in df.iterrows():
              dd.send_metric(
                  c[0],
                  datapoint=(c[1], c[2]),
                  tags=[
                      "dag:{}".format(context['dag']),
                      "task:{}".format(context['task']),
                  ],
              )
  27. What didn't work well?
      • Not mature enough in v1.6.1; we hit some bugs
        • may be fixed in v1.8.1?
      • History tables become huge with high-frequency DAGs
      • Sensor tasks fill up all available slots
        • need to limit concurrency via pools or priorities
      • Timezone handling
        • we define a macro to convert UTC to JST
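      Such a macro can be a plain Python function registered on the DAG via
      user_defined_macros. A sketch, assuming the macro operates on Airflow's `ts`
      timestamp string and that the macro name is our own choice:

      ```python
      from datetime import datetime, timedelta

      def ts_to_jst(ts):
          """Convert an Airflow `ts` timestamp string (UTC) to JST (UTC+9)."""
          utc = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S')
          return (utc + timedelta(hours=9)).strftime('%Y-%m-%dT%H:%M:%S')

      # Registered on a DAG, hypothetically:
      # dag = DAG('etl', user_defined_macros={'ts_to_jst': ts_to_jst})
      # Templates can then use: {{ ts_to_jst(ts) }}

      print(ts_to_jst('2017-05-10T16:00:00'))  # -> 2017-05-11T01:00:00
      ```

      JST has no daylight saving time, so a fixed +9 hour offset is sufficient here.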
  28. Questions?