Building Data Pipelines with Apache Airflow

Takumi Sakamoto

May 11, 2017

Transcript

  1. 2.

    About me Takumi Sakamoto @takus Product Manager at KAIZEN Platform

    I’m here http://www.mindtheproduct.com/2011/10/what-exactly-is-a-product-manager/
  2. 3.

    Why I’m here I used Airflow as a data engineer

    before my recent job change https://goo.gl/wn6PCO
  3. 6.

    Continuous Improvement Flow Data processing is important for identifying UX

    issues & reporting after tests. All that clients have to do is embed a JS tag; data is collected and issues are identified automatically. Over 4,600 optimizers on the Kaizen platform come up with optimized design variations, clients select multiple variations out of all the submissions, and A/B tests keep replacing the design with the better-performing variation. [Flow diagram: decide what to optimize -> embed JS tag -> collect data / identify issues -> request design variations -> execute A/B tests -> repeat]
  4. 7.

    Data & its key numbers User activity logs (PV, click

    …) on our customer websites: 100M+ records per day from 230+ enterprises across various industries (Finance, Media, EC, Travel, Education, Infrastructure, Job Hunt, Real Estate, Used Car, Matchmaking/Wedding, Lead Generation)
  5. 8.

    Data pipelines Transform data from one representation to another through

    a series of steps https://databricks.com/blog/2014/11/14/application-spotlight-bedrock.html
  6. 9.

    Why do data pipelines matter?

    • Analytics & batch processing are mission critical
      • serve decision makers
      • power machine learning models that can feed into production
    • Data pipelines become more complex every day
      • add new data transfers for new business logic
      • support new data sources
  7. 11.

    Using cron to manage data pipelines

    5  0 * * * app extract_ad.sh
    5  0 * * * app extract_appstore.sh
    30 0 * * * app extract_cv_rate.sh
    30 0 * * * app transform.sh
    0  2 * * * app combine.sh
    0  3 * * * app import_into_db.sh
  8. 12.

    Using cron becomes a headache

    • It cannot handle dependencies between tasks, so it often forces you to set fixed execution times with ad-hoc guard intervals.
    • It's very difficult to add new jobs to a complex crontab. When should a new heavy task be scheduled? Some independent tasks share a common resource (e.g. a database), so it's best not to overlap them.
    • Hard to debug and maintain: the crontab is just a text file.
    • Rich logging has to be handled externally.
    • Lack of stats.
    https://danidelvalle.me/2016/09/12/im-sorry-cron-ive-met-airbnbs-airflow/
  9. 13.

    Workflow management system (WMS)

    • Manages scheduling and running of tasks in data pipelines
    • Ensures jobs are ordered correctly based on dependencies
    • Manages allocation of scarce resources
    • Provides a mechanism for tracking the state of tasks and recovering from failure
  10. 14.

    Apache Airflow

    • A workflow management system
    • Workflows are defined as code
    • A lot of useful features
    • Built-in shiny Web UI & rich CLI
  11. 15.

    Workflow as code More maintainable, versionable, testable, and collaborative than

    configuration

    dag = DAG('tutorial', default_args=default_args)

    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)

    t2 = BashOperator(
        task_id='sleep',
        bash_command='sleep 5',
        retries=3,
        dag=dag)

    t3 = BashOperator(
        task_id='templated',
        bash_command="""
        {% for i in range(5) %}
          echo "{{ ds }}"
          echo "{{ macros.ds_add(ds, 7) }}"
          echo "{{ params.my_param }}"
        {% endfor %}
        """,
        params={'my_param': 'Parameter I passed in'},
        dag=dag)

    # Task dependencies: t1 runs first, then t2 and t3
    t2.set_upstream(t1)
    t3.set_upstream(t1)
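    A quick way to try such a DAG (assuming the file is in the dags folder) is to run a single task for one execution date from the CLI, without recording state in the database:

    airflow test tutorial print_date 2017-05-01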
  12. 16.

    Workflow as code Dynamic workflow for dynamic infrastructure

    # Create ETL tasks for ELB access logs:
    # aggregate PVs after converting JSON to Parquet format
    for elb in c.describe_load_balancers():
        task = HiveOperator(
            task_id='to_parquet_{}'.format(elb.LoadBalancerName),
            hql=etl_query,
            params={
                'name': elb.LoadBalancerName,
                's3_bucket': elb.AccessLog.S3BucketName,
                's3_path': elb.AccessLog.S3BucketPrefix,
            },
            dag=dag)
        # the aggregation runs after every per-ELB conversion task
        aggregation_task.set_upstream(task)

    NOTICE: This example doesn't consider deleted ELBs
  13. 17.

    Useful feature Resource management by "pool" for avoiding too much

    load on shared resources. [Diagram: many HiveOperator tasks are funneled through an Airflow pool that limits task concurrency toward YARN (the resource manager); a MySQL operator is throttled the same way against MySQL; pools can be resized dynamically via the API]
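    A minimal sketch of assigning a task to a pool (the pool name 'hive_pool' and the task below are hypothetical; the pool itself must first be created with a slot count via the Web UI under Admin > Pools or the "airflow pool" CLI):

    from airflow.operators.hive_operator import HiveOperator

    # No matter how many of these task instances are scheduled,
    # at most <slots of 'hive_pool'> run concurrently.
    aggregate_pv = HiveOperator(
        task_id='aggregate_pv',
        hql='INSERT OVERWRITE TABLE pv_summary SELECT ...',
        pool='hive_pool',        # throttle access to the shared Hive/YARN cluster
        priority_weight=10,      # higher weight wins when pool slots are scarce
        dag=dag)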
  14. 18.

    Useful feature Task callbacks for success / failure / SLA

    miss https://www.slideshare.net/r39132/airflow-agari-63072756
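    A minimal sketch of wiring up callbacks (the notify_* functions and the task are hypothetical; Airflow invokes each callback with a context dict containing the task instance, execution date, and more):

    from datetime import timedelta
    from airflow.operators.bash_operator import BashOperator

    def notify_failure(context):
        # e.g. page someone or post to chat; here we just print
        print('Task {} failed on {}'.format(
            context['task_instance'].task_id, context['ds']))

    def notify_success(context):
        print('Task {} succeeded'.format(context['task_instance'].task_id))

    t = BashOperator(
        task_id='load_warehouse',
        bash_command='echo load',
        on_success_callback=notify_success,
        on_failure_callback=notify_failure,
        sla=timedelta(hours=1),  # SLA misses trigger the DAG-level sla_miss_callback
        dag=dag)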
  15. 23.

    Rich CLI Useful for re-running some tasks after fixing bugs

    in the ETL process

    # Clear task execution histories from 2017-05-01
    airflow clear etl \
        --task_regex insight_ \
        --downstream \
        --start_date 2017-05-01

    # Backfill the cleared tasks
    airflow backfill etl \
        --start_date 2017-05-01
  16. 25.

    Tips: Data quality check Anomaly detection with Apache Airflow and

    Datadog. [Diagram: a BigQuery operator's on_success / on_failure callbacks send validation metrics to Datadog for anomaly detection] https://www.datadoghq.com/blog/introducing-anomaly-detection-datadog/
  17. 26.

    Tips: Data quality check Anomaly detection with Apache Airflow and

    Datadog

    from airflow.contrib.hooks.bigquery_hook import BigQueryHook
    from airflow.contrib.hooks.datadog_hook import DatadogHook

    bq = BigQueryHook()
    dd = DatadogHook()

    def dd_callback(context):
        # get_validate_bql returns a validation query for this DAG/task;
        # the query yields rows of (metric_name, timestamp, numeric_value)
        df = bq.get_pandas_df(bql=get_validate_bql(context['dag'], context['task']))
        for i, c in df.iterrows():
            dd.send_metric(
                c[0],
                datapoint=(c[1], c[2]),
                tags=[
                    'dag:{}'.format(context['dag']),
                    'task:{}'.format(context['task']),
                ])
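    To complete the picture from the previous slide, the callback would be attached to the BigQuery task roughly like this (the task and its query are hypothetical):

    from airflow.contrib.operators.bigquery_operator import BigQueryOperator

    validate_pv = BigQueryOperator(
        task_id='validate_pv',
        bql='SELECT ...',                  # the ETL output being validated
        on_success_callback=dd_callback,   # push validation metrics to Datadog
        on_failure_callback=dd_callback,
        dag=dag)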
  18. 27.

    What didn't work well?

    • Not mature enough in v1.6.1; we hit some bugs
      • may be fixed in v1.8.1?
    • History tables grow huge with high-frequency DAGs
      • https://github.com/teamclairvoyant/airflow-maintenance-dags
    • Sensor tasks fill up all available slots
      • need to limit concurrency by pool or priority
    • Timezone
      • define a macro to convert UTC to JST (see the sketch below)
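    A minimal sketch of such a macro, registered through the DAG's user_defined_macros (the DAG id and the macro name as_jst are hypothetical; Airflow 1.x execution dates are naive UTC datetimes):

    from datetime import datetime, timedelta
    from airflow import DAG

    def as_jst(dt):
        # shift a naive UTC datetime to JST (UTC+9)
        return dt + timedelta(hours=9)

    dag = DAG(
        'jst_example',
        start_date=datetime(2017, 5, 1),
        user_defined_macros={'as_jst': as_jst})

    # Any templated field can then render JST timestamps, e.g.:
    #   bash_command='echo {{ as_jst(execution_date) }}'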